summaryrefslogtreecommitdiffstats
path: root/mm/percpu.c
AgeCommit message (Collapse)Author
2009-09-01percpu: don't assume existence of cpu0Tejun Heo
percpu incorrectly assumed that cpu0 was always there which led to the following warning and eventual oops on sparc machines w/o cpu0. WARNING: at mm/percpu.c:651 pcpu_map+0xdc/0x100() Modules linked in: Call Trace: [000000000045eb70] warn_slowpath_common+0x50/0xa0 [000000000045ebdc] warn_slowpath_null+0x1c/0x40 [00000000004d493c] pcpu_map+0xdc/0x100 [00000000004d59a4] pcpu_alloc+0x3e4/0x4e0 [00000000004d5af8] __alloc_percpu+0x18/0x40 [00000000005b112c] __percpu_counter_init+0x4c/0xc0 ... Unable to handle kernel NULL pointer dereference ... I7: <sysfs_new_dirent+0x30/0x120> Disabling lock debugging due to kernel taint Caller[000000000053c1b0]: sysfs_new_dirent+0x30/0x120 Caller[000000000053c7a4]: create_dir+0x24/0xc0 Caller[000000000053c870]: sysfs_create_dir+0x30/0x80 Caller[00000000005990e8]: kobject_add_internal+0xc8/0x200 ... Kernel panic - not syncing: Attempted to kill the idle task! This patch fixes the problem by backporting parts from devel branch to make percpu core not depend on the existence of cpu0. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Meelis Roos <mroos@linux.ee> Cc: David Miller <davem@davemloft.net>
2009-08-14percpu: use the right flag for get_vm_area()Amerigo Wang
get_vm_area() only accepts VM_* flags, not GFP_*. And according to the doc of get_vm_area(), here should be VM_ALLOC. Signed-off-by: WANG Cong <amwang@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu>
2009-08-14percpu, sparc64: fix sparse possible cpu map handlingTejun Heo
percpu code has been assuming num_possible_cpus() == nr_cpu_ids which is incorrect if cpu_possible_map contains holes. This causes percpu code to access beyond allocated memories and vmalloc areas. On a sparc64 machine with cpus 0 and 2 (u60), this triggers the following warning or fails boot. WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240() Modules linked in: Call Trace: [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240 [00000000004b1840] map_vm_area+0x20/0x60 [00000000004b1950] __vmalloc_area_node+0xd0/0x160 [0000000000593434] deflate_init+0x14/0xe0 [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0 [00000000005844f0] crypto_alloc_base+0x50/0xa0 [000000000058b898] alg_test_comp+0x18/0x80 [000000000058dad4] alg_test+0x54/0x180 [000000000058af00] cryptomgr_test+0x40/0x60 [0000000000473098] kthread+0x58/0x80 [000000000042b590] kernel_thread+0x30/0x60 [0000000000472fd0] kthreadd+0xf0/0x160 ---[ end trace 429b268a213317ba ]--- This patch fixes generic percpu functions and sparc64 setup_per_cpu_areas() so that they handle sparse cpu_possible_map properly. Please note that on x86, cpu_possible_map() doesn't contain holes and thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22x86: implement percpu_alloc kernel parameterTejun Heo
According to Andi, it isn't clear whether lpage allocator is worth the trouble as there are many processors where PMD TLB is far scarcer than PTE TLB. The advantage or disadvantage probably depends on the actual size of percpu area and specific processor. As performance degradation due to TLB pressure tends to be highly workload specific and subtle, it is difficult to decide which way to go without more data. This patch implements percpu_alloc kernel parameter to allow selecting which first chunk allocator to use to ease debugging and testing. While at it, make sure all the failure paths report why something failed to help determining why certain allocator isn't working. Also, kill the "Great future plan" comment which had already been realized quite some time ago. [ Impact: allow explicit percpu first chunk allocator selection ] Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jan Beulich <JBeulich@novell.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22percpu: fix too lazy vunmap cache flushingTejun Heo
In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as the page is going to be returned to the page allocator. Only TLB flushing can be put off such that vmalloc code can handle it lazily. Fix it. [ Impact: fix subtle virtual cache flush bug ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Ingo Molnar <mingo@elte.hu>
2009-04-08percpu: remove rbtree and use page->index insteadChristoph Lameter
Impact: use page->index for addr to chunk mapping instead of dedicated rbtree The rbtree is used to determine the chunk from the virtual address. However, we can already determine the page struct from a virtual address and there are several unused fields in page struct used by vmalloc. Use the index field to store a pointer to the chunk. Then there is no need anymore for an rbtree. tj: * s/(set|get)_chunk/pcpu_\1_page_chunk/ * Drop inline from the above two functions and moved them upwards so that they are with other simple helpers. * Initial pages might not (actually most of the time don't) live in the vmalloc area. With the previous patch to manually reverse-map both first chunks, this is no longer an issue. Removed pcpu_set_chunk() call on initial pages. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: rusty@rustcorp.com.au Cc: Paul Mundt <lethal@linux-sh.org> Cc: rmk@arm.linux.org.uk Cc: starvik@axis.com Cc: ralf@linux-mips.org Cc: davem@davemloft.net Cc: cooloney@kernel.org Cc: kyle@mcmartin.ca Cc: matthew@wil.cx Cc: grundler@parisc-linux.org Cc: takata@linux-m32r.org Cc: benh@kernel.crashing.org Cc: rth@twiddle.net Cc: ink@jurassic.park.msu.ru Cc: heiko.carstens@de.ibm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> LKML-Reference: <49D43D58.4050102@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08percpu: don't put the first chunk in reverse-map rbtreeTejun Heo
Impact: both first chunks don't use rbtree, no functional change There can be two first chunks - reserved and dynamic with the former one being optional. Dynamic first chunk was linked on reverse-mapping rbtree while the reserved one was mapped manually using the start address and reserved offset limit. This patch makes both first chunks to be looked up manually without using the rbtree. This is to help getting rid of the rbtree. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: rusty@rustcorp.com.au Cc: Paul Mundt <lethal@linux-sh.org> Cc: rmk@arm.linux.org.uk Cc: starvik@axis.com Cc: ralf@linux-mips.org Cc: davem@davemloft.net Cc: cooloney@kernel.org Cc: kyle@mcmartin.ca Cc: matthew@wil.cx Cc: grundler@parisc-linux.org Cc: takata@linux-m32r.org Cc: benh@kernel.crashing.org Cc: rth@twiddle.net Cc: ink@jurassic.park.msu.ru Cc: heiko.carstens@de.ibm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux.com> LKML-Reference: <49D43CEA.3040609@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10percpu: generalize embedding first chunk setup helperTejun Heo
Impact: code reorganization Separate out embedding first chunk setup helper from x86 embedding first chunk allocator and put it in mm/percpu.c. This will be used by the default percpu first chunk allocator and possibly by other archs. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk()Tejun Heo
Impact: cleanup, more flexibility for first chunk init Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto. This restriction stemmed from implementation detail and made things a bit less intuitive. This patch allows @dyn_size to be specified regardless of @unit_size and swaps the positions of @dyn_size and @unit_size so that the parameter order makes more sense (static, reserved and dyn sizes followed by enclosing unit_size). While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10percpu: make x86 addr <-> pcpu ptr conversion macros genericTejun Heo
Impact: generic addr <-> pcpu ptr conversion macros There's nothing arch specific about x86 __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr(). With proper __per_cpu_load and __per_cpu_start defined, they'll do the right thing regardless of actual layout. Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c and allow archs to override it as necessary. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07percpu: finer grained locking to break deadlock and allow atomic freeTejun Heo
Impact: fix deadlock and allow atomic free Percpu allocation always uses GFP_KERNEL and whole alloc/free paths were protected by single mutex. All percpu allocations have been from GFP_KERNEL-safe context and the original allocator had this assumption too. However, by protecting both alloc and free paths with the same mutex, the new allocator creates free -> alloc -> GFP_KERNEL dependency which the original allocator didn't have. This can lead to deadlock if free is called from FS or IO paths. Also, in general, allocators are expected to allow free to be called from atomic context. This patch implements finer grained locking to break the deadlock and allow atomic free. For details, please read the "Synchronization rules" comment. While at it, also add CONTEXT: to function comments to describe which context they expect to be called from and what they do to it. This problem was reported by Thomas Gleixner and Peter Zijlstra. http://thread.gmane.org/gmane.linux.kernel/802384 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Peter Zijlstra <peterz@infradead.org>
2009-03-07percpu: move fully free chunk reclamation into a workTejun Heo
Impact: code reorganization for later changes Do fully free chunk reclamation using a work. This change is to prepare for locking changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07percpu: move chunk area map extension out of area allocationTejun Heo
Impact: code reorganization for later changes Separate out chunk area map extension into a separate function - pcpu_extend_area_map() - and call it directly from pcpu_alloc() such that pcpu_alloc_area() is guaranteed to have enough area map slots on invocation. With this change, pcpu_alloc_area() does only area allocation and the only failure mode is when the chunk doens't have enough room, so there's no need to distinguish it from memory allocation failures. Make it return -1 on such cases instead of hacky -ENOSPC. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07percpu: replace pcpu_realloc() with pcpu_mem_alloc() and pcpu_mem_free()Tejun Heo
Impact: code reorganization for later changes With static map handling moved to pcpu_split_block(), pcpu_realloc() only clutters the code and it's also unsuitable for scheduled locking changes. Implement and use pcpu_mem_alloc/free() instead. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06percpu, module: implement reserved allocation and use it for module percpu ↵Tejun Heo
variables Impact: add reserved allocation functionality and use it for module percpu variables This patch implements reserved allocation from the first chunk. When setting up the first chunk, arch can ask to set aside certain number of bytes right after the core static area which is available only through a separate reserved allocator. This will be used primarily for module static percpu variables on architectures with limited relocation range to ensure that the module perpcu symbols are inside the relocatable range. If reserved area is requested, the first chunk becomes reserved and isn't available for regular allocation. If the first chunk also includes piggy-back dynamic allocation area, a separate chunk mapping the same region is created to serve dynamic allocation. The first one is called static first chunk and the second dynamic first chunk. Although they share the page map, their different area map initializations guarantee they serve disjoint areas according to their purposes. If arch doesn't setup reserved area, reserved allocation is handled like any other allocation. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06percpu: add an indirection ptr for chunk page map accessTejun Heo
Impact: allow sharing page map, no functional difference yet Make chunk->page access indirect by adding a pointer and renaming the actual array to page_ar. This will be used by future changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06percpu: use negative for auto for pcpu_setup_first_chunk() argumentsTejun Heo
Impact: argument semantic cleanup In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant auto-sizing. It's okay for @unit_size as 0 doesn't make sense but 0 dynamic reserve size is valid. Alos, if arch @dyn_size is calculated from other parameters, it might end up passing in 0 @dyn_size and malfunction when the size is automatically adjusted. This patch makes both @unit_size and @dyn_size ssize_t and use -1 for auto sizing. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06percpu: improve first chunk initial area map handlingTejun Heo
Impact: no functional change When the first chunk is created, its initial area map is not allocated because kmalloc isn't online yet. The map is allocated and initialized on the first allocation request on the chunk. This works fine but the scattering of initialization logic between the init function and allocation path is a bit confusing. This patch makes the first chunk initialize and use minimal statically allocated map from pcpu_setpu_first_chunk(). The map resizing path still needs to handle this specially but it's more straight-forward and gives more latitude to the init path. This will ease future changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06percpu: cosmetic renames in pcpu_setup_first_chunk()Tejun Heo
Impact: cosmetic, preparation for future changes Make the following renames in pcpur_setup_first_chunk() in preparation for future changes. * s/free_size/dyn_size/ * s/static_vm/first_vm/ * s/static_chunk/schunk/ Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-01percpu: kill compile warning in pcpu_populate_chunk()Tejun Heo
Impact: remove compile warning Mark local variable map_end in pcpu_populate_chunk() with uninitialized_var(). The variable is always used in tandem with map_start and guaranteed to be initialized before use but gcc doesn't understand that. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ingo Molnar <mingo@elte.hu>
2009-02-24percpu: add __read_mostly to variables which are mostly read onlyTejun Heo
Most global variables in percpu allocator are initialized during boot and read only from that point on. Add __read_mostly as per Rusty's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
2009-02-24percpu: give more latitude to arch specific first chunk initializationTejun Heo
Impact: more latitude for first percpu chunk allocation The first percpu chunk serves the kernel static percpu area and may or may not contain extra room for further dynamic allocation. Initialization of the first chunk needs to be done before normal memory allocation service is up, so it has its own init path - pcpu_setup_static(). It seems archs need more latitude while initializing the first chunk for example to take advantage of large page mapping. This patch makes the following changes to allow this. * Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space to reserve in the first chunk for further dynamic allocation. * Rename pcpu_setup_static() to pcpu_setup_first_chunk(). * Make pcpu_setup_first_chunk() much more flexible by fetching page pointer by callback and adding optional @unit_size, @free_size and @base_addr arguments which allow archs to selectively part of chunk initialization to their likings. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24percpu: remove unit_size power-of-2 restrictionTejun Heo
Impact: allow unit_size to be arbitrary multiple of PAGE_SIZE In dynamic percpu allocator, there is no reason the unit size should be power of two. Remove the restriction. As non-power-of-two unit size means that empty chunks fall into the same slot index as lightly occupied chunks which is bad for reclaming. Reserve an extra slot for empty chunks. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24vmalloc: add @align to vm_area_register_early()Tejun Heo
Impact: allow larger alignment for early vmalloc area allocation Some early vmalloc users might want larger alignment, for example, for custom large page mapping. Add @align to vm_area_register_early(). While at it, drop docbook comment on non-existent @size. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
2009-02-24percpu: fix pcpu_chunk_struct_sizeTejun Heo
Impact: fix short allocation leading to memory corruption While dropping rvalue wrapping macros around global parameters, pcpu_chunk_struct_size was set incorrectly resulting in shorter page pointer array. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-21percpu: clean up size usageTejun Heo
Andrew was concerned about the unit of variables named or have suffix size. Every usage in percpu allocator is in bytes but make it super clear by adding comments. While at it, make pcpu_depopulate_chunk() take int @off and @size like everyone else. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
2009-02-20percpu: implement new dynamic percpu allocatorTejun Heo
Impact: new scalable dynamic percpu allocator which allows dynamic percpu areas to be accessed the same way as static ones Implement scalable dynamic percpu allocator which can be used for both static and dynamic percpu areas. This will allow static and dynamic areas to share faster direct access methods. This feature is optional and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by arch. Please read comment on top of mm/percpu.c for details. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>