summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2009-06-29kmemleak: Do not report new leaked objects if the scanning was stoppedCatalin Marinas
If the scanning was stopped with a signal, it is possible that some objects are left with a white colour (potential leaks) and reported. Add a check to avoid reporting such objects. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-29SLAB: Fix lockdep annotationsPekka Enberg
Commit 8429db5... ("slab: setup cpu caches later on when interrupts are enabled") broke mm/slab.c lockdep annotations: [ 11.554715] ============================================= [ 11.555249] [ INFO: possible recursive locking detected ] [ 11.555560] 2.6.31-rc1 #896 [ 11.555861] --------------------------------------------- [ 11.556127] udevd/1899 is trying to acquire lock: [ 11.556436] (&nc->lock){-.-...}, at: [<ffffffff810c337f>] kmem_cache_free+0xcd/0x25b [ 11.557101] [ 11.557102] but task is already holding lock: [ 11.557706] (&nc->lock){-.-...}, at: [<ffffffff810c3cd0>] kfree+0x137/0x292 [ 11.558109] [ 11.558109] other info that might help us debug this: [ 11.558720] 2 locks held by udevd/1899: [ 11.558983] #0: (&nc->lock){-.-...}, at: [<ffffffff810c3cd0>] kfree+0x137/0x292 [ 11.559734] #1: (&parent->list_lock){-.-...}, at: [<ffffffff810c36c7>] __drain_alien_cache+0x3b/0xbd [ 11.560442] [ 11.560443] stack backtrace: [ 11.561009] Pid: 1899, comm: udevd Not tainted 2.6.31-rc1 #896 [ 11.561276] Call Trace: [ 11.561632] [<ffffffff81065ed6>] __lock_acquire+0x15ec/0x168f [ 11.561901] [<ffffffff81065f60>] ? __lock_acquire+0x1676/0x168f [ 11.562171] [<ffffffff81063c52>] ? trace_hardirqs_on_caller+0x113/0x13e [ 11.562490] [<ffffffff8150c337>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 11.562807] [<ffffffff8106603a>] lock_acquire+0xc1/0xe5 [ 11.563073] [<ffffffff810c337f>] ? kmem_cache_free+0xcd/0x25b [ 11.563385] [<ffffffff8150c8fc>] _spin_lock+0x31/0x66 [ 11.563696] [<ffffffff810c337f>] ? kmem_cache_free+0xcd/0x25b [ 11.563964] [<ffffffff810c337f>] kmem_cache_free+0xcd/0x25b [ 11.564235] [<ffffffff8109bf8c>] ? __free_pages+0x1b/0x24 [ 11.564551] [<ffffffff810c3564>] slab_destroy+0x57/0x5c [ 11.564860] [<ffffffff810c3641>] free_block+0xd8/0x123 [ 11.565126] [<ffffffff810c372e>] __drain_alien_cache+0xa2/0xbd [ 11.565441] [<ffffffff810c3ce5>] kfree+0x14c/0x292 [ 11.565752] [<ffffffff8144a007>] skb_release_data+0xc6/0xcb [ 11.566020] [<ffffffff81449cf0>] __kfree_skb+0x19/0x86 [ 11.566286] [<ffffffff81449d88>] consume_skb+0x2b/0x2d [ 11.566631] [<ffffffff8144cbe0>] skb_free_datagram+0x14/0x3a [ 11.566901] [<ffffffff81462eef>] netlink_recvmsg+0x164/0x258 [ 11.567170] [<ffffffff81443461>] sock_recvmsg+0xe5/0xfe [ 11.567486] [<ffffffff810ab063>] ? might_fault+0xaf/0xb1 [ 11.567802] [<ffffffff81053a78>] ? autoremove_wake_function+0x0/0x38 [ 11.568073] [<ffffffff810d84ca>] ? core_sys_select+0x3d/0x2b4 [ 11.568378] [<ffffffff81065f60>] ? __lock_acquire+0x1676/0x168f [ 11.568693] [<ffffffff81442dc1>] ? sockfd_lookup_light+0x1b/0x54 [ 11.568961] [<ffffffff81444416>] sys_recvfrom+0xa3/0xf8 [ 11.569228] [<ffffffff81063c8a>] ? trace_hardirqs_on+0xd/0xf [ 11.569546] [<ffffffff8100af2b>] system_call_fastpath+0x16/0x1b# Fix that up. Closes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=13654 Tested-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-28Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, delay: tsc based udelay should have rdtsc_barrier x86, setup: correct include file in <asm/boot.h> x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h> x86, mce: percpu mcheck_timer should be pinned x86: Add sysctl to allow panic on IOCK NMI error x86: Fix uv bau sending buffer initialization x86, mce: Fix mce resume on 32bit x86: Move init_gbpages() to setup_arch() x86: ensure percpu lpage doesn't consume too much vmalloc space x86: implement percpu_alloc kernel parameter x86: fix pageattr handling for lpage percpu allocator and re-enable it x86: reorganize cpa_process_alias() x86: prepare setup_pcpu_lpage() for pageattr fix x86: rename remap percpu first chunk allocator to lpage x86: fix duplicate free in setup_pcpu_remap() failure path percpu: fix too lazy vunmap cache flushing x86: Set cpu_llc_id on AMD CPUs
2009-06-26kmemleak: Slightly change the policy on newly allocated objectsCatalin Marinas
Newly allocated objects are more likely to be reported as false positives. Kmemleak ignores the reporting of objects younger than 5 seconds. However, this age was calculated after the memory scanning completed which usually takes longer than 5 seconds. This patch make the minimum object age calculation in relation to the start of the memory scanning. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26kmemleak: Do not trigger a scan when reading the debug/kmemleak fileCatalin Marinas
Since there is a kernel thread for automatically scanning the memory, it makes sense for the debug/kmemleak file to only show its findings. This patch also adds support for "echo scan > debug/kmemleak" to trigger an intermediate memory scan and eliminates the kmemleak_mutex (scan_mutex covers all the cases now). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26kmemleak: Simplify the reports logged by the scanning threadCatalin Marinas
Because of false positives, the memory scanning thread may print too much information. This patch changes the scanning thread to only print the number of newly suspected leaks. Further information can be read from the /sys/kernel/debug/kmemleak file. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26kmemleak: Enable task stacks scanning by defaultCatalin Marinas
This is to reduce the number of false positives reported. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-26fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]bPaul E. McKenney
Jesper noted that kmem_cache_destroy() invokes synchronize_rcu() rather than rcu_barrier() in the SLAB_DESTROY_BY_RCU case, which could result in RCU callbacks accessing a kmem_cache after it had been destroyed. Cc: <stable@kernel.org> Acked-by: Matt Mackall <mpm@selenic.com> Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-26nommu: provide follow_pfn().Paul Mundt
With the introduction of follow_pfn() as an exported symbol, modules have begun making use of it. Unfortunately this was not reflected on nommu at the time, so the in-tree users have subsequently all blown up with link errors there. This provides a simple follow_pfn() that just returns addr >> PAGE_SHIFT, which will do the right thing on nommu. There is no need to do range checking within the vma, as the find_vma() case will already take care of this. Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-06-25clarify get_user_pages() prototypePeter Zijlstra
Currently the 4th parameter of get_user_pages() is called len, but its in pages, not bytes. Rename the thing to nr_pages to avoid future confusion. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-25kmemleak: Allow the early log buffer to be configurable.Catalin Marinas
(feature suggested by Sergey Senozhatsky) Kmemleak needs to track all the memory allocations but some of these happen before kmemleak is initialised. These are stored in an internal buffer which may be exceeded in some kernel configurations. This patch adds a configuration option with a default value of 400 and also removes the stack dump when the early log buffer is exceeded. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by>
2009-06-24Merge branches 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/{vfs-2.6,audit-current} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: another race fix in jfs_check_acl() Get "no acls for this inode" right, fix shmem breakage inline functions left without protection of ifdef (acl) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: audit: inode watches depend on CONFIG_AUDIT not CONFIG_AUDIT_SYSCALL
2009-06-24Get "no acls for this inode" right, fix shmem breakageAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24SLUB: Don't pass __GFP_FAIL for the initial allocationPekka Enberg
SLUB uses higher order allocations by default but falls back to small orders under memory pressure. Make sure the GFP mask used in the initial allocation doesn't include __GFP_NOFAIL. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-24Don't warn about order-1 allocations with __GFP_NOFAILLinus Torvalds
Traditionally, we never failed small orders (even regardless of any __GFP_NOFAIL flags), and slab will allocate order-1 allocations even for small allocations that could fit in a single page (in order to avoid excessive fragmentation). Maybe we should remove this warning entirely, but before making that judgement, at least limit it to bigger allocations. Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-24switch shmem to inode->i_aclAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24percpu: clean up percpu variable definitionsTejun Heo
Percpu variable definition is about to be updated such that all percpu symbols including the static ones must be unique. Update percpu variable definitions accordingly. * as,cfq: rename ioc_count uniquely * cpufreq: rename cpu_dbs_info uniquely * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and rename it * ipv4,6: rename cookie_scratch uniquely * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to pmc_irq_entry and nmi_entry to pmc_nmi_entry * perf_counter: rename disable_count to perf_disable_count * ftrace: rename test_event_disable to ftrace_test_event_disable * kmemleak: rename test_pointer to kmemleak_test_pointer * mce: rename next_interval to mce_next_interval [ Impact: percpu usage cleanups, no duplicate static percpu var names ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: linux-mm <linux-mm@kvack.org> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andi Kleen <andi@firstfloor.org>
2009-06-24percpu: cleanup percpu array definitionsTejun Heo
Currently, the following three different ways to define percpu arrays are in use. 1. DEFINE_PER_CPU(elem_type[array_len], array_name); 2. DEFINE_PER_CPU(elem_type, array_name[array_len]); 3. DEFINE_PER_CPU(elem_type, array_name)[array_len]; Unify to #1 which correctly separates the roles of the two parameters and thus allows more flexibility in the way percpu variables are defined. [ Impact: cleanup ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: linux-mm@kvack.org Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David S. Miller <davem@davemloft.net>
2009-06-24percpu: use dynamic percpu allocator as the default percpu allocatorTejun Heo
This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use dynamic percpu allocator. The first chunk is allocated using embedding helper and 8k is reserved for modules. This ensures that the new allocator behaves almost identically to the original allocator as long as static percpu variables are concerned, so it shouldn't introduce much breakage. s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing range limit the addressing model imposes. Unfortunately, this breaks if the address is specified using a variable, so for now, the two archs aren't converted. The following architectures are affected by this change. * sh * arm * cris * mips * sparc(32) * blackfin * avr32 * parisc (broken, under investigation) * m32r * powerpc(32) As this change makes the dynamic allocator the default one, CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert - CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted archs. These archs implement their own setup_per_cpu_areas() and the conversion is not trivial. * powerpc(64) * sparc(64) * ia64 * alpha * s390 Boot and batch alloc/free tests on x86_32 with debug code (x86_32 doesn't use default first chunk initialization). Compile tested on sparc(32), powerpc(32), arm and alpha. Kyle McMartin reported that this change breaks parisc. The problem is still under investigation and he is okay with pushing this patch forward and fixing parisc later. [ Impact: use dynamic allocator for most archs w/o custom percpu setup ] Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Mikael Starvik <starvik@axis.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Bryan Wu <cooloney@kernel.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu>
2009-06-23mm: fix handling of pagesets for downed cpusDimitri Sivanich
After downing/upping a cpu, an attempt to set /proc/sys/vm/percpu_pagelist_fraction results in an oops in percpu_pagelist_fraction_sysctl_handler(). If a processor is downed then we need to set the pageset pointer back to the boot pageset. Updates of the high water marks should not access pagesets of unpopulated zones (those pointer go to the boot pagesets which would be no longer functional if their size would be increased beyond zero). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23mm: pass mm to grab_swap_tokenHugh Dickins
If a kthread happens to use get_user_pages() on an mm (as KSM does), there's a chance that it will end up trying to read in a swap page, then oops in grab_swap_token() because the kthread has no mm: GUP passes down the right mm, so grab_swap_token() ought to be using it. We have not identified a stronger case than KSM's daemon (not yet in mainline), but the issue must have come up before, since RHEL has included a fix for this for years (though a different fix, they just back out of grab_swap_token if current->mm is unset: which is what we first proposed, but using the right mm here seems more correct). Reported-by: Izik Eidus <ieidus@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6Linus Torvalds
* 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Do not force the slab debugging Kconfig options kmemleak: use pr_fmt
2009-06-23mm: don't rely on flags coincidenceHugh Dickins
Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE, and it's tempting to devise a set of Grand Unified Paging flags; but not today. So until then, let's rely upon the compiler to spot the coincidence, "rather than have that subtle dependency and a comment for it" - as you remarked in another context yesterday. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23hugetlb: fault flags instead of write_accessHugh Dickins
handle_mm_fault() is now passing fault flags rather than write_access down to hugetlb_fault(), so better recognize that in hugetlb_fault(), and in hugetlb_no_page(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23mm: fix incorrect page removal from LRUKAMEZAWA Hiroyuki
The isolated page is "cursor_page" not "page". This could cause LRU list corruption under memory pressure, caught by CONFIG_DEBUG_LIST. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-23kmemleak: use pr_fmtJoe Perches
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-22x86: implement percpu_alloc kernel parameterTejun Heo
According to Andi, it isn't clear whether lpage allocator is worth the trouble as there are many processors where PMD TLB is far scarcer than PTE TLB. The advantage or disadvantage probably depends on the actual size of percpu area and specific processor. As performance degradation due to TLB pressure tends to be highly workload specific and subtle, it is difficult to decide which way to go without more data. This patch implements percpu_alloc kernel parameter to allow selecting which first chunk allocator to use to ease debugging and testing. While at it, make sure all the failure paths report why something failed to help determining why certain allocator isn't working. Also, kill the "Great future plan" comment which had already been realized quite some time ago. [ Impact: allow explicit percpu first chunk allocator selection ] Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jan Beulich <JBeulich@novell.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu>
2009-06-22percpu: fix too lazy vunmap cache flushingTejun Heo
In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as the page is going to be returned to the page allocator. Only TLB flushing can be put off such that vmalloc code can handle it lazily. Fix it. [ Impact: fix subtle virtual cache flush bug ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Ingo Molnar <mingo@elte.hu>
2009-06-21Move FAULT_FLAG_xyz into handle_mm_fault() callersLinus Torvalds
This allows the callers to now pass down the full set of FAULT_FLAG_xyz flags to handle_mm_fault(). All callers have been (mechanically) converted to the new calling convention, there's almost certainly room for architectures to clean up their code and then add FAULT_FLAG_RETRY when that support is added. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-21Remove internal use of 'write_access' in mm/memory.cLinus Torvalds
The fault handling routines really want more fine-grained flags than a single "was it a write fault" boolean - the callers will want to set flags like "you can return a retry error" etc. And that's actually how the VM works internally, but right now the top-level fault handling functions in mm/memory.c all pass just the 'write_access' boolean around. This switches them over to pass around the FAULT_FLAG_xyzzy 'flags' variable instead. The 'write_access' calling convention still exists for the exported 'handle_mm_fault()' function, but that is next. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-20mm: page_alloc: clear PG_locked before checking flags on freeJohannes Weiner
da456f1 "page allocator: do not disable interrupts in free_page_mlock()" moved the PG_mlocked clearing after the flag sanity checking which makes mlocked pages always trigger 'bad page'. Fix this by clearing the bit up front. Reported--and-debugged-by: Peter Chubb <peter.chubb@nicta.com.au> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Maxim Levitsky <maximlevitsky@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-19bootmem.c: avoid c90 declaration warningJoe Perches
[akpm@linux-foundation.org: cleanup] Signed-off-by: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18mm: Extend gfp masking to the page allocatorBenjamin Herrenschmidt
The page allocator also needs the masking of gfp flags during boot, so this moves it out of slab/slub and uses it with the page allocator as well. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18memcg: fix lru rotation in isolate_pagesKAMEZAWA Hiroyuki
Try to fix memcg's lru rotation sanity: make memcg use the same logic as the global LRU does. Now, at __isolate_lru_page() retruns -EBUSY, the page is rotated to the tail of LRU in global LRU's isolate LRU pages. But in memcg, it's not handled. This makes memcg do the same behavior as global LRU and rotate LRU in the page is busy. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18memcg: fix behavior under memory.limit equals to memsw.limitKAMEZAWA Hiroyuki
A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the user just want to limit the total size of applications, in other words, not very interested in memory usage itself. In this case, swap-out will be done only by global-LRU. But, under current implementation, memory.limit_in_bytes is checked at first and try_to_free_page() may do swap-out. But, that swap-out is useless for memsw.limit_in_bytes and the thread may hit limit again. This patch tries to fix the current behavior at memory.limit == memsw.limit case. And documentation is updated to explain the behavior of this special case. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18memcg: fix swap accountingKAMEZAWA Hiroyuki
This patch fixes mis-accounting of swap usage in memcg. In the current implementation, memcg's swap account is uncharged only when swap is completely freed. But there are several cases where swap cannot be freed cleanly. For handling that, this patch changes that memcg uncharges swap account when swap has no references other than cache. By this, memcg's swap entry accounting can be fully synchronous with the application's behavior. This patch also changes memcg's hooks for swap-out. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18memcg: remove some redundant checksLi Zefan
We don't need to check do_swap_account in the case that the function which checks do_swap_account will never get called if do_swap_account == 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18memcg: add file-based RSS accountingBalbir Singh
Add file RSS tracking per memory cgroup We currently don't track file RSS, the RSS we report is actually anon RSS. All the file mapped pages, come in through the page cache and get accounted there. This patch adds support for accounting file RSS pages. It should 1. Help improve the metrics reported by the memory resource controller 2. Will form the basis for a future shared memory accounting heuristic that has been proposed by Kamezawa. Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat to "anon_rss". We however, add "mapped_file" data and hope to educate the end user through documentation. [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.cn> Cc: Paul Menage <menage@google.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18cgroups: make messages more readableRandy Dunlap
Fix some cgroup messages to read better. Update MAINTAINERS to include mm/*cgroup* files. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6Linus Torvalds
* 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: Fix some typos in comments kmemleak: Rename kmemleak_panic to kmemleak_stop kmemleak: Only use GFP_KERNEL|GFP_ATOMIC for the internal allocations
2009-06-17kmemleak: Fix some typos in commentsCatalin Marinas
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-17kmemleak: Rename kmemleak_panic to kmemleak_stopCatalin Marinas
This is to avoid the confusion created by the "panic" word. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-17kmemleak: Only use GFP_KERNEL|GFP_ATOMIC for the internal allocationsCatalin Marinas
Kmemleak allocates memory for pointer tracking and it tries to avoid using GFP_ATOMIC if the caller doesn't require it. However other gfp flags may be passed by the caller which aren't required by kmemleak. This patch filters the gfp flags so that only GFP_KERNEL | GFP_ATOMIC are used. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-17Merge branch 'slub/earlyboot' into for-linusPekka Enberg
Conflicts: mm/slub.c
2009-06-17Merge branches 'slab/documentation', 'slab/fixes', 'slob/cleanups' and ↵Pekka Enberg
'slub/fixes' into for-linus
2009-06-16Merge branch 'akpm'Linus Torvalds
* akpm: (182 commits) fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset fbdev: *bfin*: fix __dev{init,exit} markings fbdev: *bfin*: drop unnecessary calls to memset fbdev: bfin-t350mcqb-fb: drop unused local variables fbdev: blackfin has __raw I/O accessors, so use them in fb.h fbdev: s1d13xxxfb: add accelerated bitblt functions tcx: use standard fields for framebuffer physical address and length fbdev: add support for handoff from firmware to hw framebuffers intelfb: fix a bug when changing video timing fbdev: use framebuffer_release() for freeing fb_info structures radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb? s3c-fb: CPUFREQ frequency scaling support s3c-fb: fix resource releasing on error during probing carminefb: fix possible access beyond end of carmine_modedb[] acornfb: remove fb_mmap function mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF mb862xxfb: restrict compliation of platform driver to PPC Samsung SoC Framebuffer driver: add Alpha Channel support atmel-lcdc: fix pixclock upper bound detection offb: use framebuffer_alloc() to allocate fb_info struct ... Manually fix up conflicts due to kmemcheck in mm/slab.c
2009-06-16mm: fix lumpy reclaim lru handling at isolate_lru_pagesKAMEZAWA Hiroyuki
At lumpy reclaim, a page failed to be taken by __isolate_lru_page() can be pushed back to "src" list by list_move(). But the page may not be from "src" list. This pushes the page back to wrong LRU. And list_move() itself is unnecessary because the page is not on top of LRU. Then, leave it as it is if __isolate_lru_page() fails. Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: count the number of times zone_reclaim() scans and failsMel Gorman
On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but it is possible that the heuristic will fail and the CPU gets tied up scanning uselessly. Detecting the situation requires some guesswork and experimentation so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during high CPU utilisation this counter is increasing rapidly, then the resolution to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0. [akpm@linux-foundation.org: name things consistently] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: do not unconditionally treat zones that fail zone_reclaim() as fullMel Gorman
On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. The problem is that zone_reclaim() failing at all means the zone gets marked full. This can cause situations where a zone is usable, but is being skipped because it has been considered full. Take a situation where a large tmpfs mount is occuping a large percentage of memory overall. The pages do not get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full and the zonelist cache considers them not worth trying in the future. This patch makes zone_reclaim() return more fine-grained information about what occured when zone_reclaim() failued. The zone only gets marked full if it really is unreclaimable. If it's a case that the scan did not occur or if enough pages were not reclaimed with the limited reclaim_mode, then the zone is simply skipped. There is a side-effect to this patch. Currently, if zone_reclaim() successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go ahead. With this patch applied, zone watermarks are rechecked after zone_reclaim() does some work. This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26 ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the zonelist_cache was introduced. It was not intended that zone_reclaim() aggressively consider the zone to be full when it failed as full direct reclaim can still be an option. Due to the age of the bug, it should be considered a -stable candidate. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: properly account for the number of page cache pages zone_reclaim() ↵Mel Gorman
can reclaim A bug was brought to my attention against a distro kernel but it affects mainline and I believe problems like this have been reported in various guises on the mailing lists although I don't have specific examples at the moment. The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists are uselessly scanned frequencly making the CPU spin at near 100%. This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with expectations which were noticed during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim(). Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall as it uselessly scanned the LRU constantly. Currently, zone_reclaim is assuming zone_reclaim_mode is 1 and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided. Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable. Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporarily resolution and a bug reported because the scan-avoidance heuristic is still broken. This patch: On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manfiest as high CPU usage as the LRU list is scanned uselessly. Historically, once enabled it was depending on NR_FILE_PAGES which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed such as swapcache and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure. This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in the reclaim_mode it will either consider NR_FILE_PAGES as potential candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED are not. [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages] [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>