summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2009-09-08shmfs: use 'check_acl' instead of 'permission'Linus Torvalds
shmfs wants purely standard POSIX ACL semantics, so we can use the new generic VFS layer POSIX ACL checking rather than cooking our own 'permission()' function. Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-08kmemleak: fix sparse warning for static declarationsLuis R. Rodriguez
This fixes these sparse warnings: mm/kmemleak.c:1179:6: warning: symbol 'start_scan_thread' was not declared. Should it be static? mm/kmemleak.c:1194:6: warning: symbol 'stop_scan_thread' was not declared. Should it be static? Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-09-08kmemleak: fix sparse warning over overshadowed flagsLuis R. Rodriguez
A secondary irq_save is not required as a locking before it was already disabling irqs. This fixes this sparse warning: mm/kmemleak.c:512:31: warning: symbol 'flags' shadows an earlier one mm/kmemleak.c:448:23: originally declared here Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-09-08kmemleak: move common painting code togetherLuis R. Rodriguez
When painting grey or black we do the same thing, bring this together into a helper and identify coloring grey or black explicitly with defines. This makes this a little easier to read. Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-09-08kmemleak: add clear command supportLuis R. Rodriguez
In an ideal world your kmemleak output will be small, when its not (usually during initial bootup) you can use the clear command to ingore previously reported and unreferenced kmemleak objects. We do this by painting all currently reported unreferenced objects grey. We paint them grey instead of black to allow future scans on the same objects as such objects could still potentially reference newly allocated objects in the future. To test a critical section on demand with a clean /sys/kernel/debug/kmemleak you can do: echo clear > /sys/kernel/debug/kmemleak test your kernel or modules echo scan > /sys/kernel/debug/kmemleak Then as usual to get your report with: cat /sys/kernel/debug/kmemleak Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-09-08kmemleak: use bool for true/false questionsLuis R. Rodriguez
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-09-08kmemleak: Do no create the clean-up thread during kmemleak_disable()Catalin Marinas
The kmemleak_disable() function could be called from various contexts including IRQ. It creates a clean-up thread but the kthread_create() function has restrictions on which contexts it can be called from, mainly because of the kthread_create_lock. The patch changes the kmemleak clean-up thread to a workqueue. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Eric Paris <eparis@redhat.com>
2009-09-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: don't assume existence of cpu0
2009-09-05Merge branch 'slab/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU
2009-09-05page-allocator: always change pageblock ownership when anti-fragmentation is ↵Mel Gorman
disabled On low-memory systems, anti-fragmentation gets disabled as fragmentation cannot be avoided on a sufficiently large boundary to be worthwhile. Once disabled, there is a period of time when all the pageblocks are marked MOVABLE and the expectation is that they get marked UNMOVABLE at each call to __rmqueue_fallback(). However, when MAX_ORDER is large the pageblocks do not change ownership because the normal criteria are not met. This has the effect of prematurely breaking up too many large contiguous blocks. This is most serious on NOMMU systems which depend on high-order allocations to boot. This patch causes pageblocks to change ownership on every fallback when anti-fragmentation is disabled. This prevents the large blocks being prematurely broken up. This is a fix to commit 49255c619fbd482d704289b5eb2795f8e3b7ff2e [page allocator: move check for disabled anti-fragmentation out of fastpath] and the problem affects 2.6.31-rc8. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-05nommu: fix error handling in do_mmap_pgoff()David Howells
Fix the error handling in do_mmap_pgoff(). If do_mmap_shared_file() or do_mmap_private() fail, we jump to the error_put_region label at which point we cann __put_nommu_region() on the region - but we haven't yet added the region to the tree, and so __put_nommu_region() may BUG because the region tree is empty or it may corrupt the region tree. To get around this, we can afford to add the region to the region tree before calling do_mmap_shared_file() or do_mmap_private() as we keep nommu_region_sem write-locked, so no-one can race with us by seeing a transient region. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-04kmemleak: Scan all thread stacksCatalin Marinas
This patch changes the for_each_process() loop with the do_each_thread()/while_each_thread() pair. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-09-04kmemleak: Don't scan uninitialized memory when kmemcheck is enabledPekka Enberg
Ingo Molnar reported the following kmemcheck warning when running both kmemleak and kmemcheck enabled: PM: Adding info for No Bus:vcsa7 WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f6f6e1a4) d873f9f600000000c42ae4c1005c87f70000000070665f666978656400000000 i i i i u u u u i i i i i i i i i i i i i i i i i i i i i u u u ^ Pid: 3091, comm: kmemleak Not tainted (2.6.31-rc7-tip #1303) P4DC6 EIP: 0060:[<c110301f>] EFLAGS: 00010006 CPU: 0 EIP is at scan_block+0x3f/0xe0 EAX: f40bd700 EBX: f40bd780 ECX: f16b46c0 EDX: 00000001 ESI: f6f6e1a4 EDI: 00000000 EBP: f10f3f4c ESP: c2605fcc DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: e89a4844 CR3: 30ff1000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff4ff0 DR7: 00000400 [<c110313c>] scan_object+0x7c/0xf0 [<c1103389>] kmemleak_scan+0x1d9/0x400 [<c1103a3c>] kmemleak_scan_thread+0x4c/0xb0 [<c10819d4>] kthread+0x74/0x80 [<c10257db>] kernel_thread_helper+0x7/0x3c [<ffffffff>] 0xffffffff kmemleak: 515 new suspected memory leaks (see /sys/kernel/debug/kmemleak) kmemleak: 42 new suspected memory leaks (see /sys/kernel/debug/kmemleak) The problem here is that kmemleak will scan partially initialized objects that makes kmemcheck complain. Fix that up by skipping uninitialized memory regions when kmemcheck is enabled. Reported-by: Ingo Molnar <mingo@elte.hu> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-09-03slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCUEric Dumazet
kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close() and *before* sysfs_slab_remove() or risk rcu_free_slab() being called after kmem_cache is deleted (kfreed). rmmod nf_conntrack can crash the machine because it has to kmem_cache_destroy() a SLAB_DESTROY_BY_RCU enabled cache. Cc: <stable@kernel.org> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-09-03slub: release kobject if sysfs_create_group failed in sysfs_slab_addXiaotian Feng
When CONFIG_SLUB_DEBUG is enabled, sysfs_slab_add should unlink and put the kobject if sysfs_create_group failed. Otherwise, sysfs_slab_add returns error then free kmem_cache s, thus memory of s->kobj is leaked. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-09-01percpu: don't assume existence of cpu0Tejun Heo
percpu incorrectly assumed that cpu0 was always there which led to the following warning and eventual oops on sparc machines w/o cpu0. WARNING: at mm/percpu.c:651 pcpu_map+0xdc/0x100() Modules linked in: Call Trace: [000000000045eb70] warn_slowpath_common+0x50/0xa0 [000000000045ebdc] warn_slowpath_null+0x1c/0x40 [00000000004d493c] pcpu_map+0xdc/0x100 [00000000004d59a4] pcpu_alloc+0x3e4/0x4e0 [00000000004d5af8] __alloc_percpu+0x18/0x40 [00000000005b112c] __percpu_counter_init+0x4c/0xc0 ... Unable to handle kernel NULL pointer dereference ... I7: <sysfs_new_dirent+0x30/0x120> Disabling lock debugging due to kernel taint Caller[000000000053c1b0]: sysfs_new_dirent+0x30/0x120 Caller[000000000053c7a4]: create_dir+0x24/0xc0 Caller[000000000053c870]: sysfs_create_dir+0x30/0x80 Caller[00000000005990e8]: kobject_add_internal+0xc8/0x200 ... Kernel panic - not syncing: Attempted to kill the idle task! This patch fixes the problem by backporting parts from devel branch to make percpu core not depend on the existence of cpu0. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Meelis Roos <mroos@linux.ee> Cc: David Miller <davem@davemloft.net>
2009-08-31mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition setH. Peter Anvin
CONFIG_PAGEFLAGS_EXTENDED disables a trick to conserve pageflags. This trick is indended to be enabled when the pressure on page flags is very high. The previous condition was: - depends on 64BIT || SPARSEMEM_VMEMMAP || !NUMA || !SPARSEMEM ... however, the sparsemem code already has a way to crowd out the node number from the pageflags, which means that !NUMA actually doesn't contribute to hard pageflags exhaustion. This is required for the new PG_uncached flag to not cause pageflags exhaustion on x86_32 + PAE + SPARSEMEM + !NUMA. Signed-off-by: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <4A9828F4.4040905@zytor.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Suresh Siddha <suresh.siddha@intel.com>
2009-08-30SLUB: fix ARCH_KMALLOC_MINALIGN cases 64 and 256Aaro Koskinen
If the minalign is 64 bytes, then the 96 byte cache should not be created because it would conflict with the 128 byte cache. If the minalign is 256 bytes, patching the size_index table should not result in a buffer overrun. The calculation "(i - 1) / 8" used to access size_index[] is moved to a separate function as suggested by Christoph Lameter. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-08-27kmemleak: Printing of the objects hex dumpSergey Senozhatsky
Introducing printing of the objects hex dump to the seq file. The number of lines to be printed is limited to HEX_MAX_LINES to prevent seq file spamming. The actual number of printed bytes is less than or equal to (HEX_MAX_LINES * HEX_ROW_SIZE). (slight adjustments by Catalin Marinas) Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-08-27kmemleak: Do not report alloc_bootmem blocks as leaksCatalin Marinas
This patch sets the min_count for alloc_bootmem objects to 0 so that they are never reported as leaks. This is because many of these blocks are only referred via the physical address which is not looked up by kmemleak. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
2009-08-27kmemleak: Save the stack trace for early allocationsCatalin Marinas
Before slab is initialised, kmemleak save the allocations in an early log buffer. They are later recorded as normal memory allocations. This patch adds the stack trace saving to the early log buffer, otherwise the information shown for such objects only refers to the kmemleak_init() function. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-08-27kmemleak: Mark the early log buffer as __initdataCatalin Marinas
This buffer isn't needed after kmemleak was initialised so it can be freed together with the .init.data section. This patch also marks functions conditionally accessing the early log variables with __ref. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-08-27kmemleak: Dump object information on requestCatalin Marinas
By writing dump=<addr> to the kmemleak file, kmemleak will look up an object with that address and dump the information it has about it to syslog. This is useful in debugging memory leaks. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-08-27kmemleak: Allow rescheduling during an object scanningCatalin Marinas
If the object size is bigger than a predefined value (4K in this case), release the object lock during scanning and call cond_resched(). Re-acquire the lock after rescheduling and test whether the object is still valid. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-08-26mm: fix for infinite churning of mlocked pagesMinchan Kim
An mlocked page might lose the isolatation race. This causes the page to clear PG_mlocked while it remains in a VM_LOCKED vma. This means it can be put onto the [in]active list. We can rescue it by using try_to_unmap() in shrink_page_list(). But now, As Wu Fengguang pointed out, vmscan has a bug. If the page has PG_referenced, it can't reach try_to_unmap() in shrink_page_list() but is put into the active list. If the page is referenced repeatedly, it can remain on the [in]active list without being moving to the unevictable list. This patch fixes it. Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <<kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-19SLUB: Fix some coding style issuesAmerigo Wang
Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-08-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: use the right flag for get_vm_area() percpu, sparc64: fix sparse possible cpu map handling init: set nr_cpu_ids before setup_per_cpu_areas()
2009-08-18mm: build_zonelists(): move clear node_load[] to __build_all_zonelists()Bo Liu
If node_load[] is cleared everytime build_zonelists() is called,node_load[] will have no help to find the next node that should appear in the given node's fallback list. Because of the bug, zonelist's node_order is not calculated as expected. This bug affects on big machine, which has asynmetric node distance. [synmetric NUMA's node distance] 0 1 2 0 10 12 12 1 12 10 12 2 12 12 10 [asynmetric NUMA's node distance] 0 1 2 0 10 12 20 1 12 10 14 2 20 14 10 This (my bug) is very old but no one has reported this for a long time. Maybe because the number of asynmetric NUMA is very small and they use cpuset for customizing node memory allocation fallback. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: Bo Liu <bo-liu@hotmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-18nommu: check fd read permission in validate_mmap_request()Graff Yang
According to the POSIX (1003.1-2008), the file descriptor shall have been opened with read permission, regardless of the protection options specified to mmap(). The ltp test cases mmap06/07 need this. Signed-off-by: Graff Yang <graff.yang@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-18mm: revert "oom: move oom_adj value"KOSAKI Motohiro
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve("foo-bar-cmd"); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-18SLUB: Drop write permission to /proc/slabinfoWANG Cong
SLUB does not support writes to /proc/slabinfo so there should not be write permission to do that either. Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-08-17Security/SELinux: seperate lsm specific mmap_min_addrEric Paris
Currently SELinux enforcement of controls on the ability to map low memory is determined by the mmap_min_addr tunable. This patch causes SELinux to ignore the tunable and instead use a seperate Kconfig option specific to how much space the LSM should protect. The tunable will now only control the need for CAP_SYS_RAWIO and SELinux permissions will always protect the amount of low memory designated by CONFIG_LSM_MMAP_MIN_ADDR. This allows users who need to disable the mmap_min_addr controls (usual reason being they run WINE as a non-root user) to do so and still have SELinux controls preventing confined domains (like a web server) from being able to map some area of low memory. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-08-14percpu: kill lpage first chunk allocatorTejun Heo
With x86 converted to embedding allocator, lpage doesn't have any user left. Kill it along with cpa handling code. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Beulich <JBeulich@novell.com>
2009-08-14percpu: update embedding first chunk allocator to handle sparse unitsTejun Heo
Now that percpu core can handle very sparse units, given that vmalloc space is large enough, embedding first chunk allocator can use any memory to build the first chunk. This patch teaches pcpu_embed_first_chunk() about distances between cpus and to use alloc/free callbacks to allocate node specific areas for each group and use them for the first chunk. This brings the benefits of embedding allocator to NUMA configurations - no extra TLB pressure with the flexibility of unified dynamic allocator and no need to restructure arch code to build memory layout suitable for percpu. With units put into atom_size aligned groups according to cpu distances, using large page for dynamic chunks is also easily possible with falling back to reuglar pages if large allocation fails. Embedding allocator users are converted to specify NULL cpu_distance_fn, so this patch doesn't cause any visible behavior difference. Following patches will convert them. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: use group information to allocate vmap areas sparselyTejun Heo
ai->groups[] contains which units need to be put consecutively and at what offset from the chunk base address. Compile this information into pcpu_group_offsets[] and pcpu_group_sizes[] in pcpu_setup_first_chunk() and use them to allocate sparse vm areas using pcpu_get_vm_areas(). This will be used to allow directly using sparse NUMA memories as percpu areas. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
2009-08-14vmalloc: implement pcpu_get_vm_areas()Tejun Heo
To directly use spread NUMA memories for percpu units, percpu allocator will be updated to allow sparsely mapping units in a chunk. As the distances between units can be very large, this makes allocating single vmap area for each chunk undesirable. This patch implements pcpu_get_vm_areas() and pcpu_free_vm_areas() which allocates and frees sparse congruent vmap areas. pcpu_get_vm_areas() take @offsets and @sizes array which define distances and sizes of vmap areas. It scans down from the top of vmalloc area looking for the top-most address which can accomodate all the areas. The top-down scan is to avoid interacting with regular vmallocs which can push up these congruent areas up little by little ending up wasting address space and page table. To speed up top-down scan, the highest possible address hint is maintained. Although the scan is linear from the hint, given the usual large holes between memory addresses between NUMA nodes, the scanning is highly likely to finish after finding the first hole for the last unit which is scanned first. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
2009-08-14vmalloc: separate out insert_vmalloc_vm()Tejun Heo
Separate out insert_vmalloc_vm() from __get_vm_area_node(). insert_vmalloc_vm() initializes vm_struct from vmap_area and inserts it into vmlist. insert_vmalloc_vm() only initializes fields which can be determined from @vm, @flags and @caller The rest should be initialized by the caller. For __get_vm_area_node(), all other fields just need to be cleared and this is done by using kzalloc instead of kmalloc. This will be used to implement pcpu_get_vm_areas(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
2009-08-14percpu: add chunk->base_addrTejun Heo
The only thing percpu allocator wants to know about a vmalloc area is the base address. Instead of requiring chunk->vm, add chunk->base_addr which contains the necessary value. This simplifies the code a bit and makes the dummy first_vm unnecessary. This change will ease allowing a chunk to be mapped by multiple vms. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: add pcpu_unit_offsets[]Tejun Heo
Currently units are mapped sequentially into address space. This patch adds pcpu_unit_offsets[] which allows units to be mapped to arbitrary offsets from the chunk base address. This is necessary to allow sparse embedding which might would need to allocate address ranges and memory areas which aren't aligned to unit size but allocation atom size (page or large page size). This also simplifies things a bit by removing the need to calculate offset from unit number. With this change, there's no need for the arch code to know pcpu_unit_size. Update pcpu_setup_first_chunk() and first chunk allocators to return regular 0 or -errno return code instead of unit size or -errno. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David S. Miller <davem@davemloft.net>
2009-08-14percpu: introduce pcpu_alloc_info and pcpu_group_infoTejun Heo
Till now, non-linear cpu->unit map was expressed using an integer array which maps each cpu to a unit and used only by lpage allocator. Although how many units have been placed in a single contiguos area (group) is known while building unit_map, the information is lost when the result is recorded into the unit_map array. For lpage allocator, as all allocations are done by lpages and whether two adjacent lpages are in the same group or not is irrelevant, this didn't cause any problem. Non-linear cpu->unit mapping will be used for sparse embedding and this grouping information is necessary for that. This patch introduces pcpu_alloc_info which contains all the information necessary for initializing percpu allocator. pcpu_alloc_info contains array of pcpu_group_info which describes how units are grouped and mapped to cpus. pcpu_group_info also has base_offset field to specify its offset from the chunk's base address. pcpu_build_alloc_info() initializes this field as if all groups are allocated back-to-back as is currently done but this will be used to sparsely place groups. pcpu_alloc_info is a rather complex data structure which contains a flexible array which in turn points to nested cpu_map arrays. * pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to help dealing with pcpu_alloc_info. * pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info, generalized and renamed to pcpu_build_alloc_info(). @cpu_distance_fn may be NULL indicating that all cpus are of LOCAL_DISTANCE. * pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info, generalized and renamed to pcpu_dump_alloc_info(). It now also prints which group each alloc unit belongs to. * pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the separate parameters. All first chunk allocators are updated to use pcpu_build_alloc_info() to build alloc_info and call pcpu_setup_first_chunk() with it. This has the side effect of packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4. * x86 setup_pcpu_lpage() is updated to deal with alloc_info. * sparc64 setup_per_cpu_areas() is updated to build alloc_info. Although the changes made by this patch are pretty pervasive, it doesn't cause any behavior difference other than packing of sparse cpus. It mostly changes how information is passed among initialization functions and makes room for more flexibility. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net>
2009-08-14percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upwardTejun Heo
Unit map handling will be generalized and extended and used for embedding sparse first chunk and other purposes. Relocate two unit_map related functions upward in preparation. This patch just moves the code without any actual change. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: add @align to pcpu_fc_alloc_fn_tTejun Heo
pcpu_fc_alloc_fn_t is about to see more interesting usage, add @align parameter. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()Tejun Heo
Now that all actual first chunk allocation and copying happen in the first chunk allocators and helpers, there's no reason for pcpu_setup_first_chunk() to try to determine @dyn_size automatically. The only left user is page first chunk allocator. Make it determine dyn_size like other allocators and make @dyn_size mandatory for pcpu_setup_first_chunk(). Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: drop @static_size from first chunk allocatorsTejun Heo
First chunk allocators assume percpu areas have been linked using one of PERCPU_*() macros and depend on __per_cpu_load symbol defined by those macros, so there isn't much point in passing in static area size explicitly when it can be easily calculated from __per_cpu_start and __per_cpu_end. Drop @static_size from all percpu first chunk allocators and helpers. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: generalize first chunk allocator selectionTejun Heo
Now that all first chunk allocators are in mm/percpu.c, it makes sense to make generalize percpu_alloc kernel parameter. Define PCPU_FC_* and set pcpu_chosen_fc using early_param() in mm/percpu.c. Arch code can use the set value to determine which first chunk allocator to use. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: build first chunk allocators selectivelyTejun Heo
There's no need to build unused first chunk allocators in. Define CONFIG_NEED_PER_CPU_*_FIRST_CHUNK and let archs enable them selectively. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14percpu: rename 4k first chunk allocator to pageTejun Heo
Page size isn't always 4k depending on arch and configuration. Rename 4k first chunk allocator to page. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Howells <dhowells@redhat.com>
2009-08-14percpu: improve boot messagesTejun Heo
Improve percpu boot messages such that they're uniform and contain more information. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
2009-08-14percpu: fix pcpu_reclaim() lockingTejun Heo
pcpu_reclaim() calls pcpu_depopulate_chunk() which makes use of pages array and bitmap returned by pcpu_get_pages_and_bitmap() and thus should be called under pcpu_alloc_mutex. pcpu_reclaim() released the mutex before calling depopulate leading to double free and other strange problems caused by the unexpected concurrent usages of pages array and bitmap. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
2009-08-14Merge branch 'percpu-for-linus' into percpu-for-nextTejun Heo
Conflicts: arch/sparc/kernel/smp_64.c arch/x86/kernel/cpu/perf_counter.c arch/x86/kernel/setup_percpu.c drivers/cpufreq/cpufreq_ondemand.c mm/percpu.c Conflicts in core and arch percpu codes are mostly from commit ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all the first chunk allocators into mm/percpu.c, the changes are moved from arch code to mm/percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>