summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2011-12-08writeback: set max_pause to lowest value on zero bdi_dirtyWu Fengguang
Some trace shows lots of bdi_dirty=0 lines where it's actually some small value if w/o the accounting errors in the per-cpu bdi stats. In this case the max pause time should really be set to the smallest (non-zero) value to avoid IO queue underrun and improve throughput. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-08writeback: permit through good bdi even when global dirty exceededWu Fengguang
On a system with 1 local mount and 1 NFS mount, if the NFS server becomes not responding when dd to the NFS mount, the NFS dirty pages may exceed the global dirty limit and _every_ task involving writing will be blocked. The whole system appears unresponsive. The workaround is to permit through the bdi's that only has a small number of dirty pages. The number chosen (bdi_stat_error pages) is not enough to enable the local disk to run in optimal throughput, however is enough to make the system responsive on a broken NFS mount. The user can then kill the dirtiers on the NFS mount and increase the global dirty limit to bring up the local disk's throughput. It risks allowing dirty pages to grow much larger than the global dirty limit when there are 1000+ mounts, however that's very unlikely to happen, especially in low memory profiles. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-08writeback: comment on the bdi dirty thresholdWu Fengguang
We do "floating proportions" to let active devices to grow its target share of dirty pages and stalled/inactive devices to decrease its target share over time. It works well except in the case of "an inactive disk suddenly goes busy", where the initial target share may be too small. To mitigate this, bdi_position_ratio() has the below line to raise a small bdi_thresh when it's safe to do so, so that the disk be feed with enough dirty pages for efficient IO and in turn fast rampup of bdi_thresh: bdi_thresh = max(bdi_thresh, (limit - dirty) / 8); balance_dirty_pages() normally does negative feedback control which adjusts ratelimit to balance the bdi dirty pages around the target. In some extreme cases when that is not enough, it will have to block the tasks completely until the bdi dirty pages drop below bdi_thresh. Acked-by: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-06mm, x86: Remove debug_pagealloc_enabledStanislaw Gruszka
When (no)bootmem finish operation, it pass pages to buddy allocator. Since debug_pagealloc_enabled is not set, we will do not protect pages, what is not what we want with CONFIG_DEBUG_PAGEALLOC=y. To fix remove debug_pagealloc_enabled. That variable was introduced by commit 12d6f21e "x86: do not PSE on CONFIG_DEBUG_PAGEALLOC=y" to get more CPA (change page attribude) code testing. But currently we have CONFIG_CPA_DEBUG, which test CPA. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1322582711-14571-1-git-send-email-sgruszka@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-05Merge branch 'vmalloc' of git://git.linaro.org/people/nico/linux into ↵Russell King
devel-stable
2011-12-05slab, lockdep: Fix silly bugPeter Zijlstra
Commit 30765b92 ("slab, lockdep: Annotate the locks before using them") moves the init_lock_keys() call from after g_cpucache_up = FULL, to before it. And overlooks the fact that init_node_lock_keys() tests for it and ignores everything !FULL. Introduce a LATE stage and change the lockdep test to be <LATE. Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: stable@kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-02kmemleak: Add support for memory hotplugLaura Abbott
Ensure that memory hotplug can co-exist with kmemleak by taking the hotplug lock before scanning the memory banks. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2011-12-02kmemleak: Handle percpu memory allocationCatalin Marinas
This patch adds kmemleak callbacks from the percpu allocator, reducing a number of false positives caused by kmemleak not scanning such memory blocks. The percpu chunks are never reported as leaks because of current kmemleak limitations with the __percpu pointer not pointing directly to the actual chunks. Reported-by: Huajun Li <huajun.li.lee@gmail.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2011-12-02kmemleak: Report previously found leaks even after an errorCatalin Marinas
If an error fatal to kmemleak (like memory allocation failure) happens, kmemleak disables itself but it also removes the access to any previously found memory leaks. This patch allows read-only access to the kmemleak debugfs interface but disables any other action. Reported-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2011-12-02kmemleak: When the early log buffer is exceeded, report the actual numberCatalin Marinas
Just telling that the early log buffer has been exceeded doesn't mean much. This patch moves the error printing to the kmemleak_init() function and displays the actual calls to the kmemleak API during early logging. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2011-12-02kmemleak: Show where early_log issues come fromCatalin Marinas
Based on initial patch by Steven Rostedt. Early kmemleak warnings did not show where the actual kmemleak API had been called from but rather just a backtrace to the kmemleak_init() function. By having all early kmemleak logs record the stack_trace, we can have kmemleak_init() write exactly where the problem occurred. This patch adds the setting of the kmemleak_warning variable every time a kmemleak warning is issued. The kmemleak_init() function checks this variable during early log replaying and prints the log trace if there was any warning. Reported-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-02fs: Make write(2) interruptible by a fatal signalJan Kara
Currently write(2) to a file is not interruptible by any signal. Sometimes this is desirable, e.g. when you want to quickly kill a process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make task in balance_dirty_pages() killable"), it's necessary to abort the current write accordingly to avoid it quickly dirtying lots more pages at unthrottled rate. This patch makes write interruptible by SIGKILL. We do not allow write to be interruptible by any other signal because that has larger potential of screwing some badly written applications. Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com> Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-29Merge branch 'slab/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slub: avoid potential NULL dereference or corruption slub: use irqsafe_cpu_cmpxchg for put_cpu_partial slub: move discard_slab out of node lock slub: use correct parameter to add a page to partial list tail
2011-11-28Merge branch 'for-3.2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-3.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: explain why per_cpu_ptr_to_phys() is more complicated than necessary percpu: fix chunk range calculation percpu: rename pcpu_mem_alloc to pcpu_mem_zalloc
2011-11-28Merge branch 'master' into x86/memblockTejun Heo
Conflicts & resolutions: * arch/x86/xen/setup.c dc91c728fd "xen: allow extra memory to be in multiple regions" 24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..." conflicted on xen_add_extra_mem() updates. The resolution is trivial as the latter just want to replace memblock_x86_reserve_range() with memblock_reserve(). * drivers/pci/intel-iommu.c 166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/" 5dfe8660a3d "bootmem: Replace work_with_active_regions() with..." conflicted as the former moved the file under drivers/iommu/. Resolved by applying the chnages from the latter on the moved file. * mm/Kconfig 6661672053a "memblock: add NO_BOOTMEM config symbol" c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option" conflicted trivially. Both added config options. Just letting both add their own options resolves the conflict. * mm/memblock.c d1f0ece6cdc "mm/memblock.c: small function definition fixes" ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()" confliected. The former updates function removed by the latter. Resolution is trivial. Signed-off-by: Tejun Heo <tj@kernel.org>
2011-11-27slub: add missed accountingShaohua Li
With per-cpu partial list, slab is added to partial list first and then moved to node list. The __slab_free() code path for add/remove_partial is almost deprecated(except for slub debug). But we forget to account add/remove_partial when move per-cpu partial pages to node list, so the statistics for such events are always 0. Add corresponding accounting. This is against the patch "slub: use correct parameter to add a page to partial list tail" Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-27Merge branch 'slab/urgent' into slab/nextPekka Enberg
2011-11-24slub: avoid potential NULL dereference or corruptionEric Dumazet
show_slab_objects() can trigger NULL dereferences or memory corruption. Another cpu can change its c->page to NULL or c->node to NUMA_NO_NODE while we use them. Use ACCESS_ONCE(c->page) and ACCESS_ONCE(c->node) to make sure this cannot happen. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-24slub: use irqsafe_cpu_cmpxchg for put_cpu_partialChristoph Lameter
The cmpxchg must be irq safe. The fallback for this_cpu_cmpxchg only disables preemption which results in per cpu partial page operation potentially failing on non x86 platforms. This patch fixes the following problem reported by Christian Kujau: I seem to hit it with heavy disk & cpu IO is in progress on this PowerBook G4. Full dmesg & .config: http://nerdbynature.de/bits/3.2.0-rc1/oops/ I've enabled some debug options and now it really points to slub.c:2166 http://nerdbynature.de/bits/3.2.0-rc1/oops/oops4m.jpg With debug options enabled I'm currently in the xmon debugger, not sure what to make of it yet, I'll try to get something useful out of it :) Reported-by: Christian Kujau <lists@nerdbynature.de> Tested-by: Christian Kujau <lists@nerdbynature.de> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-23Merge branch 'pm-freezer' of ↵Rafael J. Wysocki
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into pm-freezer * 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits) freezer: fix wait_event_freezable/__thaw_task races freezer: kill unused set_freezable_with_signal() dmatest: don't use set_freezable_with_signal() usb_storage: don't use set_freezable_with_signal() freezer: remove unused @sig_only from freeze_task() freezer: use lock_task_sighand() in fake_signal_wake_up() freezer: restructure __refrigerator() freezer: fix set_freezable[_with_signal]() race freezer: remove should_send_signal() and update frozen() freezer: remove now unused TIF_FREEZE freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE cgroup_freezer: prepare for removal of TIF_FREEZE freezer: clean up freeze_processes() failure path freezer: kill PF_FREEZING freezer: test freezable conditions while holding freezer_lock freezer: make freezing indicate freeze condition in effect freezer: use dedicated lock instead of task_lock() + memory barrier freezer: don't distinguish nosig tasks on thaw freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks freezer: rename thaw_process() to __thaw_task() and simplify the implementation ...
2011-11-23percpu: explain why per_cpu_ptr_to_phys() is more complicated than necessaryDave Young
Add comments about current per_cpu_ptr_to_phys implementation to explain why the logic is more complicated than necessary. -tj: relocated comment into kerneldoc comment Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-11-22Merge branch 'writeback-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: remove vm_dirties and task->dirties writeback: hard throttle 1000+ dd on a slow USB stick mm: Make task in balance_dirty_pages() killable
2011-11-22percpu: fix chunk range calculationTejun Heo
Percpu allocator recorded the cpus which map to the first and last units in pcpu_first/last_unit_cpu respectively and used them to determine the address range of a chunk - e.g. it assumed that the first unit has the lowest address in a chunk while the last unit has the highest address. This simply isn't true. Groups in a chunk can have arbitrary positive or negative offsets from the previous one and there is no guarantee that the first unit occupies the lowest offset while the last one the highest. Fix it by actually comparing unit offsets to determine cpus occupying the lowest and highest offsets. Also, rename pcu_first/last_unit_cpu to pcpu_low/high_unit_cpu to avoid confusion. The chunk address range is used to flush cache on vmalloc area map/unmap and decide whether a given address is in the first chunk by per_cpu_ptr_to_phys() and the bug was discovered by invalid per_cpu_ptr_to_phys() translation for crash_note. Kudos to Dave Young for tracking down the problem. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: WANG Cong <xiyou.wangcong@gmail.com> Reported-by: Dave Young <dyoung@redhat.com> Tested-by: Dave Young <dyoung@redhat.com> LKML-Reference: <4EC21F67.10905@redhat.com> Cc: stable @kernel.org
2011-11-22percpu: rename pcpu_mem_alloc to pcpu_mem_zallocBob Liu
Currently pcpu_mem_alloc() is implemented always return zeroed memory. So rename it to make user like pcpu_get_pages_and_bitmap() know don't reinit it. Signed-off-by: Bob Liu <lliubbo@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-11-21freezer: rename thaw_process() to __thaw_task() and simplify the implementationTejun Heo
thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>
2011-11-21freezer: implement and use kthread_freezable_should_stop()Tejun Heo
Writeback and thinkpad_acpi have been using thaw_process() to prevent deadlock between the freezer and kthread_stop(); unfortunately, this is inherently racy - nothing prevents freezing from happening between thaw_process() and kthread_stop(). This patch implements kthread_freezable_should_stop() which enters refrigerator if necessary but is guaranteed to return if kthread_stop() is invoked. Both thaw_process() users are converted to use the new function. Note that this deadlock condition exists for many of freezable kthreads. They need to be converted to use the new should_stop or freezable workqueue. Tested with synthetic test case. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com>
2011-11-18mm: add vm_area_add_early()Nicolas Pitre
The existing vm_area_register_early() allows for early vmalloc space allocation. However upcoming cleanups in the ARM architecture require that some fixed locations in the vmalloc area be reserved also very early. The name "vm_area_register_early" would have been a good name for the reservation part without the allocation. Since it is already in use with different semantics, let's create vm_area_add_early() instead. Both vm_area_register_early() and vm_area_add_early() can be used together meaning that the former is now implemented using the later where it is ensured that no conflicting areas are added, but no attempt is made to make the allocation scheme in vm_area_register_early() more sophisticated. After all, you must know what you're doing when using those functions. Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org
2011-11-18Merge branch 'stable/for-linus-fixes-3.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen-gntalloc: signedness bug in add_grefs() xen-gntalloc: integer overflow in gntalloc_ioctl_alloc() xen-gntdev: integer overflow in gntdev_alloc_map() xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs. xen/balloon: Avoid OOM when requesting highmem xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI xen: map foreign pages for shared rings by updating the PTEs directly
2011-11-18Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-block: block: add missed trace_block_plug paride: fix potential information leak in pg_read() bio: change some signed vars to unsigned block: avoid unnecessary plug list flush cciss: auto engage SCSI mid layer at driver load time loop: cleanup set_status interface include/linux/bio.h: use a static inline function for bio_integrity_clone() loop: prevent information leak after failed read block: Always check length of all iov entries in blk_rq_map_user_iov() The Windows driver .inf disables ASPM on all cciss devices. Do the same. backing-dev: ensure wakeup_timer is deleted block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"
2011-11-17writeback: remove vm_dirties and task->dirtiesWu Fengguang
They are not used any more. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-17writeback: hard throttle 1000+ dd on a slow USB stickWu Fengguang
The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms on every 1 4KB-page, which means it cannot throttle a task under 4KB/200ms=20KB/s. So when there are more than 512 dd writing to a 10MB/s USB stick, its bdi dirty pages could grow out of control. Even if we can increase MAX_PAUSE, the minimal (task_ratelimit = 1) means a limit of 4KB/s. They can eventually be safeguarded by the global limit check (nr_dirty < dirty_thresh). However if someone is also writing to an HDD at the same time, it'll get poor HDD write performance. We at least want to maintain good write performance for other devices when one device is attacked by some "massive parallel" workload, or suffers from slow write bandwidth, or somehow get stalled due to some error condition (eg. NFS server not responding). For a stalled device, we need to completely block its dirtiers, too, before its bdi dirty pages grow all the way up to the global limit and leave no space for the other functional devices. So change the loop exit condition to /* * Always enforce global dirty limit; also enforce bdi dirty limit * if the normal max_pause sleeps cannot keep things under control. */ if (nr_dirty < dirty_thresh && (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1)) break; which can be further simplified to if (task_ratelimit) break; Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-17mm: cleanup the comment for head/tail pages of compound pages in mm/page_alloc.cWang Sheng-Hui
Only tail pages point at the head page using their ->first_page fields. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-11-16slab: add taint flag outputting to debug paths.Dave Jones
When we get corruption reports, it's useful to see if the kernel was tainted, to rule out problems we can't do anything about. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-16slub: add taint flag outputting to debug pathsDave Jones
When we get corruption reports, it's useful to see if the kernel was tainted, to rule out problems we can't do anything about. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-16xen: map foreign pages for shared rings by updating the PTEs directlyDavid Vrabel
When mapping a foreign page with xenbus_map_ring_valloc() with the GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and pass a pointer to the PTE (in init_mm). After the page is mapped, the usual fault mechanism can be used to update additional MMs. This allows the vmalloc_sync_all() to be removed from alloc_vm_area(). Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> [v1: Squashed fix by Michal for no-mmu case] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Michal Simek <monstr@monstr.eu>
2011-11-16mm: Make task in balance_dirty_pages() killableJan Kara
There is no reason why task in balance_dirty_pages() shouldn't be killable and it helps in recovering from some error conditions (like when filesystem goes in error state and cannot accept writeback anymore but we still want to kill processes using it to be able to unmount it). There will be follow up patches to further abort the generic_perform_write() and other filesystem write loops, to avoid large write + SIGKILL combination exceeding the dirty limit and possibly strange OOM. Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-15hugetlb: release pages in the error path of hugetlb_cow()Hillf Danton
If we fail to prepare an anon_vma, the {new, old}_page should be released, or they will leak. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-15oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MINMichal Hocko
Commit c9f01245 ("oom: remove oom_disable_count") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-15slub: move discard_slab out of node lockShaohua Li
Lockdep reports there is potential deadlock for slub node list_lock. discard_slab() is called with the lock hold in unfreeze_partials(), which could trigger a slab allocation, which could hold the lock again. discard_slab() doesn't need hold the lock actually, if the slab is already removed from partial list. Acked-by: Christoph Lameter <cl@linux.com> Reported-and-tested-by: Yong Zhang <yong.zhang0@gmail.com> Reported-and-tested-by: Julie Sullivan <kernelmail.jms@gmail.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-15slub: use correct parameter to add a page to partial list tailShaohua Li
unfreeze_partials() needs add the page to partial list tail, since such page hasn't too many free objects. We now explictly use DEACTIVATE_TO_TAIL for this, while DEACTIVATE_TO_TAIL != 1. This will cause performance regression (eg, more lock contention in node->list_lock) without below fix. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-11backing-dev: ensure wakeup_timer is deletedRabin Vincent
bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links from all super_blocks and then del_timer_sync() the writeback timer. However, this can race with __mark_inode_dirty(), leading to bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi we're unregistering, after we've called del_timer_sync(). This can end up with the bdi being freed with an active timer inside it, as in the case of the following dump after the removal of an SD card. Fix this by redoing the del_timer_sync() in bdi_destory(). ------------[ cut here ]------------ WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8() ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180 Modules linked in: Backtrace: [<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c) r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000 [<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c) [<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40) r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29 r3:00000009 [<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8) r3:c02b1b29 r2:c02bc662 [<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc) r6:c7964000 r5:00000001 r4:c7964000 [<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8) [<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78) [<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84) r5:c79641f0 r4:c796420c [<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80) r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c [<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c) r5:c79643b4 r4:c79641f0 [<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74) r4:c7964000 [<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8) r5:00000000 r4:c794c400 [<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38) r5:c794c400 r4:c0322824 [<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170) r5:c78d5e00 r4:c74083c0 [<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c) r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0 r3:00000000 [<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c) r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0 [<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70) r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4 [<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70) r5:c794dc00 r4:c794dc00 [<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194) r5:c794dc00 r4:c7942300 [<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310) r6:c7942300 r5:00000000 r4:00000000 r3:00000000 [<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30) ---[ end trace e5c83c92ada51c76 ]--- Cc: stable@kernel.org Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-10slab: introduce slab_max_order kernel parameterDavid Rientjes
Introduce new slab_max_order kernel parameter which is the equivalent of slub_max_order. For immediate purposes, allows users to override the heuristic that sets the max order to 1 by default if they have more than 32MB of RAM. This may result in page allocation failures if there is substantial fragmentation. Another usecase would be to increase the max order for better performance. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-10slab: rename slab_break_gfp_order to slab_max_orderDavid Rientjes
slab_break_gfp_order is more appropriately named slab_max_order since it enforces the maximum order size of slabs as long as a single object will still fit. Also rename BREAK_GFP_ORDER_{LO,HI} accordingly. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-07writeback: fix uninitialized task_ratelimitWu Fengguang
In balance_dirty_pages() task_ratelimit may be not initialized (initialization skiped by goto pause), and then used when calling tracing hook. Fix it by moving the task_ratelimit assignment before goto pause. Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-06Merge branch 'writeback-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Add a 'reason' to wb_writeback_work writeback: send work item to queue_io, move_expired_inodes writeback: trace event balance_dirty_pages writeback: trace event bdi_dirty_ratelimit writeback: fix ppc compile warnings on do_div(long long, unsigned long) writeback: per-bdi background threshold writeback: dirty position control - bdi reserve area writeback: control dirty pause time writeback: limit max dirty pause time writeback: IO-less balance_dirty_pages() writeback: per task dirty rate limit writeback: stabilize bdi->dirty_ratelimit writeback: dirty rate control writeback: add bg_threshold parameter to __bdi_update_bandwidth() writeback: dirty position control writeback: account per-bdi accumulated dirtied pages
2011-11-04Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
2011-11-02Merge branch 'akpm' (Andrew's incoming - part two)Linus Torvalds
Says Andrew: "60 patches. That's good enough for -rc1 I guess. I have quite a lot of detritus to be rechecked, work through maintainers, etc. - most of the remains of MM - rtc - various misc - cgroups - memcg - cpusets - procfs - ipc - rapidio - sysctl - pps - w1 - drivers/misc - aio" * akpm: (60 commits) memcg: replace ss->id_lock with a rwlock aio: allocate kiocbs in batches drivers/misc/vmw_balloon.c: fix typo in code comment drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop w1: disable irqs in critical section drivers/w1/w1_int.c: multiple masters used same init_name drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal drivers/power/ds2780_battery.c: add a nolock function to w1 interface drivers/power/ds2780_battery.c: create central point for calling w1 interface w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it pps gpio client: add missing dependency pps: new client driver using GPIO pps: default echo function include/linux/dma-mapping.h: add dma_zalloc_coherent() sysctl: make CONFIG_SYSCTL_SYSCALL default to n sysctl: add support for poll() RapidIO: documentation update drivers/net/rionet.c: fix ethernet address macros for LE platforms RapidIO: fix potential null deref in rio_setup_device() RapidIO: add mport driver for Tsi721 bridge ...
2011-11-02mm/page_cgroup.c: quiet sparse noiseH Hartley Sweeten
warning: symbol 'swap_cgroup_ctrl' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02memcg: Fix race condition in memcg_check_events() with this_cpu usageSteven Rostedt
Various code in memcontrol.c () calls this_cpu_read() on the calculations to be done from two different percpu variables, or does an open-coded read-modify-write on a single percpu variable. Disable preemption throughout these operations so that the writes go to the correct palces. [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>