summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2014-05-23mm/memory-failure.c: fix memory leak by race between poison and unpoisonNaoya Horiguchi
When a memory error happens on an in-use page or (free and in-use) hugepage, the victim page is isolated with its refcount set to one. When you try to unpoison it later, unpoison_memory() calls put_page() for it twice in order to bring the page back to free page pool (buddy or free hugepage list). However, if another memory error occurs on the page which we are unpoisoning, memory_failure() returns without releasing the refcount which was incremented in the same call at first, which results in memory leak and unconsistent num_poisoned_pages statistics. This patch fixes it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> [2.6.32+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-23memcg: fix swapcache charge from kernel thread contextMichal Hocko
Commit 284f39afeaa4 ("mm: memcg: push !mm handling out to page cache charge function") explicitly checks for page cache charges without any mm context (from kernel thread context[1]). This seemed to be the only possible case where memory could be charged without mm context so commit 03583f1a631c ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from get_mem_cgroup_from_mm(). This however caused another NULL ptr dereference during early boot when loopback kernel thread splices to tmpfs as reported by Stephan Kulow: BUG: unable to handle kernel NULL pointer dereference at 0000000000000360 IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60 Oops: 0000 [#1] SMP Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa] CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: __mem_cgroup_try_charge_swapin+0x40/0xe0 mem_cgroup_charge_file+0x8b/0xd0 shmem_getpage_gfp+0x66b/0x7b0 shmem_file_splice_read+0x18f/0x430 splice_direct_to_actor+0xa2/0x1c0 do_lo_receive+0x5a/0x60 [loop] loop_thread+0x298/0x720 [loop] kthread+0xc6/0xe0 ret_from_fork+0x7c/0xb0 Also Branimir Maksimovic reported the following oops which is tiggered for the swapcache charge path from the accounting code for kernel threads: CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P OE 3.15.0-rc5-core2-custom #159 Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013 task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000 RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60 Call Trace: __mem_cgroup_try_charge_swapin+0x45/0xf0 mem_cgroup_charge_file+0x9c/0xe0 shmem_getpage_gfp+0x62c/0x770 shmem_write_begin+0x38/0x40 generic_perform_write+0xc5/0x1c0 __generic_file_aio_write+0x1d1/0x3f0 generic_file_aio_write+0x4f/0xc0 do_sync_write+0x5a/0x90 do_acct_process+0x4b1/0x550 acct_process+0x6d/0xa0 do_exit+0x827/0xa70 kthread+0xc3/0xf0 This patch fixes the issue by reintroducing mm check into get_mem_cgroup_from_mm. We could do the same trick in __mem_cgroup_try_charge_swapin as we do for the regular page cache path but it is not worth troubles. The check is not that expensive and it is better to have get_mem_cgroup_from_mm more robust. [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2 Fixes: 03583f1a631c ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()") Reported-and-tested-by: Stephan Kulow <coolo@suse.com> Reported-by: Branimir Maksimovic <branimir.maksimovic@gmail.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-23mm: madvise: fix MADV_WILLNEED on shmem swapoutsJohannes Weiner
MADV_WILLNEED currently does not read swapped out shmem pages back in. Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") made find_get_page() filter exceptional radix tree entries but failed to convert all find_get_page() callers that WANT exceptional entries over to find_get_entry(). One of them is shmem swap readahead in madvise, which now skips over any swap-out records. Convert it to find_get_entry(). Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-23mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECTJens Axboe
In some testing I ran today (some fio jobs that spread over two nodes), we end up spending 40% of the time in filemap_check_errors(). That smells fishy. Looking further, this is basically what happens: blkdev_aio_read() generic_file_aio_read() filemap_write_and_wait_range() if (!mapping->nr_pages) filemap_check_errors() and filemap_check_errors() always attempts two test_and_clear_bit() on the mapping flags, thus dirtying it for every single invocation. The patch below tests each of these bits before clearing them, avoiding this issue. In my test case (4-socket box), performance went from 1.7M IOPS to 4.0M IOPS. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-23hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free ↵Chen Yucong
hugepage For handling a free hugepage in memory failure, the race will happen if another thread hwpoisoned this hugepage concurrently. So we need to check PageHWPoison instead of !PageHWPoison. If hwpoison_filter(p) returns true or a race happens, then we need to unlock_page(hpage). Signed-off-by: Chen Yucong <slaoub@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-20Merge tag 'metag-for-v3.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag Pull Metag architecture and related fixes from James Hogan: "Mostly fixes for metag and parisc relating to upgrowing stacks. - Fix missing compiler barriers in metag memory barriers. - Fix BUG_ON on metag when RLIMIT_STACK hard limit is increased beyond safe value. - Make maximum stack size configurable. This reduces the default user stack size back to 80MB (especially on parisc after their removal of _STK_LIM_MAX override). This only affects metag and parisc. - Remove metag _STK_LIM_MAX override to match other arches and follow parisc, now that it is safe to do so (due to the BUG_ON fix mentioned above). - Finally now that both metag and parisc _STK_LIM_MAX overrides have been removed, it makes sense to remove _STK_LIM_MAX altogether" * tag 'metag-for-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: asm-generic: remove _STK_LIM_MAX metag: Remove _STK_LIM_MAX override parisc,metag: Do not hardcode maximum userspace stack size metag: Reduce maximum stack size to 256MB metag: fix memory barriers
2014-05-15parisc,metag: Do not hardcode maximum userspace stack sizeHelge Deller
This patch affects only architectures where the stack grows upwards (currently parisc and metag only). On those do not hardcode the maximum initial stack size to 1GB for 32-bit processes, but make it configurable via a config option. The main problem with the hardcoded stack size is, that we have two memory regions which grow upwards: stack and heap. To keep most of the memory available for heap in a flexmap memory layout, it makes no sense to hard allocate up to 1GB of the memory for stack which can't be used as heap then. This patch makes the stack size for 32-bit processes configurable and uses 80MB as default value which has been in use during the last few years on parisc and which hasn't showed any problems yet. Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: linux-parisc@vger.kernel.org Cc: linux-metag@vger.kernel.org Cc: John David Anglin <dave.anglin@bell.net>
2014-05-13Merge branch 'for-3.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull a percpu fix from Tejun Heo: "Fix for a percpu allocator bug where it could try to kfree() a memory region allocated using vmalloc(). The bug has been there for years now and is unlikely to have ever triggered given the size of struct pcpu_chunk. It's still theoretically possible and the fix is simple and safe enough, so the patch is marked with -stable" * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()
2014-05-11mm, thp: close race between mremap() and split_huge_page()Kirill A. Shutemov
It's critical for split_huge_page() (and migration) to catch and freeze all PMDs on rmap walk. It gets tricky if there's concurrent fork() or mremap() since usually we copy/move page table entries on dup_mm() or move_page_tables() without rmap lock taken. To get it work we rely on rmap walk order to not miss any entry. We expect to see destination VMA after source one to work correctly. But after switching rmap implementation to interval tree it's not always possible to preserve expected walk order. It works fine for dup_mm() since new VMA has the same vma_start_pgoff() / vma_last_pgoff() and explicitly insert dst VMA after src one with vma_interval_tree_insert_after(). But on move_vma() destination VMA can be merged into adjacent one and as result shifted left in interval tree. Fortunately, we can detect the situation and prevent race with rmap walk by moving page table entries under rmap lock. See commit 38a76013ad80. Problem is that we miss the lock when we move transhuge PMD. Most likely this bug caused the crash[1]. [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473 Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Dave Jones <davej@redhat.com> Cc: David Miller <davem@davemloft.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-11mm: postpone the disabling of kmemleak early loggingCatalin Marinas
Commit 8910ae896c8c ("kmemleak: change some global variables to int"), in addition to the atomic -> int conversion, moved the disabling of kmemleak_early_log to the beginning of the kmemleak_init() function, before the full kmemleak tracing is actually enabled. In this small window, kmem_cache_create() is called by kmemleak which triggers additional memory allocation that are not traced. This patch restores the original logic with kmemleak_early_log disabling when kmemleak is fully functional. Fixes: 8910ae896c8c (kmemleak: change some global variables to int) Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: agp: info leak in agpioc_info_wrap() fs/affs/super.c: bugfix / double free fanotify: fix -EOVERFLOW with large files on 64-bit slub: use sysfs'es release mechanism for kmem_cache revert "mm: vmscan: do not swap anon pages just because free+file is low" autofs: fix lockref lookup mm: filemap: update find_get_pages_tag() to deal with shadow entries mm/compaction: make isolate_freepages start at pageblock boundary MAINTAINERS: zswap/zbud: change maintainer email address mm/page-writeback.c: fix divide by zero in pos_ratio_polynom hugetlb: ensure hugepage access is denied if hugepages are not supported slub: fix memcg_propagate_slab_attrs drivers/rtc/rtc-pcf8523.c: fix month definition
2014-05-06slub: use sysfs'es release mechanism for kmem_cacheChristoph Lameter
debugobjects warning during netfilter exit: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0() ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 Modules linked in: CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984 Workqueue: netns cleanup_net Call Trace: dump_stack+0x52/0x87 warn_slowpath_common+0x8c/0xc0 warn_slowpath_fmt+0x46/0x50 debug_print_object+0x8d/0xb0 __debug_check_no_obj_freed+0xa5/0x220 debug_check_no_obj_freed+0x15/0x20 kmem_cache_free+0x197/0x340 kmem_cache_destroy+0x86/0xe0 nf_conntrack_cleanup_net_list+0x131/0x170 nf_conntrack_pernet_exit+0x5d/0x70 ops_exit_list+0x5e/0x70 cleanup_net+0xfb/0x1c0 process_one_work+0x338/0x550 worker_thread+0x215/0x350 kthread+0xe7/0xf0 ret_from_fork+0x7c/0xb0 Also during dcookie cleanup: WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0() ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 Modules linked in: CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408 Call Trace: dump_stack (lib/dump_stack.c:52) warn_slowpath_common (kernel/panic.c:430) warn_slowpath_fmt (kernel/panic.c:445) debug_print_object (lib/debugobjects.c:262) __debug_check_no_obj_freed (lib/debugobjects.c:697) debug_check_no_obj_freed (lib/debugobjects.c:726) kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717) kmem_cache_destroy (mm/slab_common.c:363) dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343) event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153) __fput (fs/file_table.c:217) ____fput (fs/file_table.c:253) task_work_run (kernel/task_work.c:125 (discriminator 1)) do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751) int_signal (arch/x86/kernel/entry_64.S:807) Sysfs has a release mechanism. Use that to release the kmem_cache structure if CONFIG_SYSFS is enabled. Only slub is changed - slab currently only supports /proc/slabinfo and not /sys/kernel/slab/*. We talked about adding that and someone was working on it. [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build] [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more] Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Greg KH <greg@kroah.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06revert "mm: vmscan: do not swap anon pages just because free+file is low"Johannes Weiner
This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages just because free+file is low") because it introduced a regression in mostly-anonymous workloads, where reclaim would become ineffective and trap every allocating task in direct reclaim. The problem is that there is a runaway feedback loop in the scan balance between file and anon, where the balance tips heavily towards a tiny thrashing file LRU and anonymous pages are no longer being looked at. The commit in question removed the safe guard that would detect such situations and respond with forced anonymous reclaim. This commit was part of a series to fix premature swapping in loads with relatively little cache, and while it made a small difference, the cure is obviously worse than the disease. Revert it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06mm: filemap: update find_get_pages_tag() to deal with shadow entriesJohannes Weiner
Dave Jones reports the following crash when find_get_pages_tag() runs into an exceptional entry: kernel BUG at mm/filemap.c:1347! RIP: find_get_pages_tag+0x1cb/0x220 Call Trace: find_get_pages_tag+0x36/0x220 pagevec_lookup_tag+0x21/0x30 filemap_fdatawait_range+0xbe/0x1e0 filemap_fdatawait+0x27/0x30 sync_inodes_sb+0x204/0x2a0 sync_inodes_one_sb+0x19/0x20 iterate_supers+0xb2/0x110 sys_sync+0x44/0xb0 ia32_do_call+0x13/0x13 1343 /* 1344 * This function is never used on a shmem/tmpfs 1345 * mapping, so a swap entry won't be found here. 1346 */ 1347 BUG(); After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") this comment and BUG() are out of date because exceptional entries can now appear in all mappings - as shadows of recently evicted pages. However, as Hugh Dickins notes, "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably any other PAGECACHE_TAG_*) to appear on an exceptional entry. I expect it comes down to an occasional race in RCU lookup of the radix_tree: lacking absolute synchronization, we might sometimes catch an exceptional entry, with the tag which really belongs with the unexceptional entry which was there an instant before." And indeed, not only is the tree walk lockless, the tags are also read in chunks, one radix tree node at a time. There is plenty of time for page reclaim to swoop in and replace a page that was already looked up as tagged with a shadow entry. Remove the BUG() and update the comment. While reviewing all other lookup sites for whether they properly deal with shadow entries of evicted pages, update all the comments and fix memcg file charge moving to not miss shmem/tmpfs swapcache pages. Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06mm/compaction: make isolate_freepages start at pageblock boundaryVlastimil Babka
The compaction freepage scanner implementation in isolate_freepages() starts by taking the current cc->free_pfn value as the first pfn. In a for loop, it scans from this first pfn to the end of the pageblock, and then subtracts pageblock_nr_pages from the first pfn to obtain the first pfn for the next for loop iteration. This means that when cc->free_pfn starts at offset X rather than being aligned on pageblock boundary, the scanner will start at offset X in all scanned pageblock, ignoring potentially many free pages. Currently this can happen when a) zone's end pfn is not pageblock aligned, or b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE enabled and a hole spanning the beginning of a pageblock This patch fixes the problem by aligning the initial pfn in isolate_freepages() to pageblock boundary. This also permits replacing the end-of-pageblock alignment within the for loop with a simple pageblock_nr_pages increment. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06mm/page-writeback.c: fix divide by zero in pos_ratio_polynomRik van Riel
It is possible for "limit - setpoint + 1" to equal zero, after getting truncated to a 32 bit variable, and resulting in a divide by zero error. Using the fully 64 bit divide functions avoids this problem. It also will cause pos_ratio_polynom() to return the correct value when (setpoint - limit) exceeds 2^32. Also uninline pos_ratio_polynom, at Andrew's request. Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06hugetlb: ensure hugepage access is denied if hugepages are not supportedNishanth Aravamudan
Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is properly not correctly setting itself up in this state?: Unable to handle kernel paging request for data at address 0x00000031 Faulting instruction address: 0xc000000000245710 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries .... In KVM guests on Power, in a guest not backed by hugepages, we see the following: AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 64 kB HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot-time, but this is only checked in hugetlb_init(). Extract the check to a helper function, and use it in a few relevant places. This does make hugetlbfs not supported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime. [akpm@linux-foundation.org: use pr_info(), per Mel] [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06slub: fix memcg_propagate_slab_attrsVladimir Davydov
After creating a cache for a memcg we should initialize its sysfs attrs with the values from its parent. That's what memcg_propagate_slab_attrs is for. Currently it's broken - we clearly muddled root-vs-memcg caches there. Let's fix it up. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl bugfix from hch" The dcache fixes are for a subtle LRU list corruption bug reported by Miklos Szeredi, where people inside IBM saw list corruptions with the LTP/host01 test. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nick kvfree() from apparmor posix_acl: handle NULL ACL in posix_acl_equiv_mode dcache: don't need rcu in shrink_dentry_list() more graceful recovery in umount_collect() don't remove from shrink list in select_collect() dentry_kill(): don't try to remove from shrink list expand the call of dentry_lru_del() in dentry_kill() new helper: dentry_free() fold try_prune_one_dentry() fold d_kill() and d_free() fix races between __d_instantiate() and checks of dentry flags
2014-05-06nick kvfree() from apparmorAl Viro
too many places open-code it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-05slab: Fix off by one in object max number tests.David Miller
If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and likewise if freelist_idx_t is a short, then it should be 65535 not 65536. This was leading to all kinds of random crashes on sparc64 where PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the same cpu already holding a lock it shouldn't hold, or the lock belonging to a completely unrelated process. Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-05slab: fix the type of the index on freelist index accessorJoonsoo Kim
Commit a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") changes the size of freelist index and also changes prototype of accessor function to freelist index. And there was a mistake. The mistake is that although it changes the size of freelist index correctly, it changes the size of the index of freelist index incorrectly. With patch, freelist index can be 1 byte or 2 bytes, that means that num of object on on a slab can be more than 255. So we need more than 1 byte for the index to find the index of free object on freelist. But, above patch makes this index type 1 byte, so slab which have more than 255 objects cannot work properly and in consequence of it, the system cannot boot. This issue was reported by Steven King on m68knommu which would use 2 bytes freelist index: https://lkml.org/lkml/2014/4/16/433 To fix is easy. To change the type of the index of freelist index on accessor functions is enough to fix this bug. Although 2 bytes is enough, I use 4 bytes since it have no bad effect and make things more easier. This fix was suggested and tested by Steven in his original report. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-acked-by: Steven King <sfking@fdwdc.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: James Hogan <james.hogan@imgtec.com> Tested-by: David Miller <davem@davemloft.net> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-28mm: don't pointlessly use BUG_ON() for sanity checkLinus Torvalds
BUG_ON() is a big hammer, and should be used _only_ if there is some major corruption that you cannot possibly recover from, making it imperative that the current process (and possibly the whole machine) be terminated with extreme prejudice. The trivial sanity check in the vmacache code is *not* such a fatal error. Recovering from it is absolutely trivial, and using BUG_ON() just makes it harder to debug for no actual advantage. To make matters worse, the placement of the BUG_ON() (only if the range check matched) actually makes it harder to hit the sanity check to begin with, so _if_ there is a bug (and we just got a report from Srivatsa Bhat that this can indeed trigger), it is harder to debug not just because the machine is possibly dead, but because we don't have better coverage. BUG_ON() must *die*. Maybe we should add a checkpatch warning for it, because it is simply just about the worst thing you can ever do if you hit some "this cannot happen" situation. Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-25mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing partsLinus Torvalds
The mmu-gather operation 'tlb_flush_mmu()' has done two things: the actual tlb flush operation, and the batched freeing of the pages that the TLB entries pointed at. This splits the operation into separate phases, so that the forced batched flushing done by zap_pte_range() can now do the actual TLB flush while still holding the page table lock, but delay the batched freeing of all the pages to after the lock has been dropped. This in turn allows us to avoid a race condition between set_page_dirty() (as called by zap_pte_range() when it finds a dirty shared memory pte) and page_mkclean(): because we now flush all the dirty page data from the TLB's while holding the pte lock, page_mkclean() will be held up walking the (recently cleaned) page tables until after the TLB entries have been flushed from all CPU's. Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-22mm: make fixup_user_fault() check the vma access rights tooLinus Torvalds
fixup_user_fault() is used by the futex code when the direct user access fails, and the futex code wants it to either map in the page in a usable form or return an error. It relied on handle_mm_fault() to map the page, and correctly checked the error return from that, but while that does map the page, it doesn't actually guarantee that the page will be mapped with sufficient permissions to be then accessed. So do the appropriate tests of the vma access rights by hand. [ Side note: arguably handle_mm_fault() could just do that itself, but we have traditionally done it in the caller, because some callers - notably get_user_pages() - have been able to access pages even when they are mapped with PROT_NONE. Maybe we should re-visit that design decision, but in the meantime this is the minimal patch. ] Found by Dave Jones running his trinity tool. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18thp: close race between split and zap huge pagesKirill A. Shutemov
Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have the same root cause. Let's look to them one by one. The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my testing I see that page_mapcount() is higher than mapcount here. I think it happens due to race between zap_huge_pmd() and page_check_address_pmd(). page_check_address_pmd() misses PMD which is under zap: CPU0 CPU1 zap_huge_pmd() pmdp_get_and_clear() __split_huge_page() anon_vma_interval_tree_foreach() __split_huge_page_splitting() page_check_address_pmd() mm_find_pmd() /* * We check if PMD present without taking ptl: no * serialization against zap_huge_pmd(). We miss this PMD, * it's not accounted to 'mapcount' in __split_huge_page(). */ pmd_present(pmd) == 0 BUG_ON(mapcount != page_mapcount(page)) // CRASH!!! page_remove_rmap(page) atomic_add_negative(-1, &page->_mapcount) The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!". It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd(). This happens in similar way: CPU0 CPU1 zap_huge_pmd() pmdp_get_and_clear() page_remove_rmap(page) atomic_add_negative(-1, &page->_mapcount) __split_huge_page() anon_vma_interval_tree_foreach() __split_huge_page_splitting() page_check_address_pmd() mm_find_pmd() pmd_present(pmd) == 0 /* The same comment as above */ /* * No crash this time since we already decremented page->_mapcount in * zap_huge_pmd(). */ BUG_ON(mapcount != page_mapcount(page)) /* * We split the compound page here into small pages without * serialization against zap_huge_pmd() */ __split_huge_page_refcount() VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!! So my understanding the problem is pmd_present() check in mm_find_pmd() without taking page table lock. The bug was introduced by me commit with commit 117b0791ac42. Sorry for that. :( Let's open code mm_find_pmd() in page_check_address_pmd() and do the check under page table lock. Note that __page_check_address() does the same for PTE entires if sync != 0. I've stress tested split and zap code paths for 36+ hours by now and don't see crashes with the patch applied. Before it took <20 min to trigger the first bug and few hours for second one (if we ignore first). [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com> [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Dave Jones <davej@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [3.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18mm: fix new kernel-doc warning in filemap.cRandy Dunlap
Fix new kernel-doc warning in mm/filemap.c: Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()Mizuma, Masayoshi
soft lockup in freeing gigantic hugepage fixed in commit 55f67141a892 "mm: hugetlb: fix softlockup when a large number of hugepages are freed." can happen in return_unused_surplus_pages(), so let's fix it. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-18vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()Christoph Lameter
Seems to be called with preemption enabled. Therefore it must use mod_zone_page_state instead. Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Grygorii Strashko <grygorii.strashko@ti.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Tejun Heo <tj@kernel.org> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-14percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()Jianyu Zhan
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) + BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long) It hardly could be ever bigger than PAGE_SIZE even for large-scale machine, but for consistency with its couterpart pcpu_mem_zalloc(), use pcpu_mem_free() instead. Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()") addressed this problem, but missed this one. tj: commit message updated Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online) Cc: stable@vger.kernel.org
2014-04-13mm: Initialize error in shmem_file_aio_read()Geert Uytterhoeven
Some versions of gcc even warn about it: mm/shmem.c: In function ‘shmem_file_aio_read’: mm/shmem.c:1414: warning: ‘error’ may be used uninitialized in this function If the loop is aborted during the first iteration by one of the two first break statements, error will be uninitialized. Introduced by commit 6e58e79db8a1 ("introduce copy_page_to_iter, kill loop over iovec in generic_file_aio_read()"). Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-13Merge branch 'slab/next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull slab changes from Pekka Enberg: "The biggest change is byte-sized freelist indices which reduces slab freelist memory usage: https://lkml.org/lkml/2013/12/2/64" * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm: slab/slub: use page->list consistently instead of page->lru mm/slab.c: cleanup outdated comments and unify variables naming slab: fix wrongly used macro slub: fix high order page allocation problem with __GFP_NOFAIL slab: Make allocations with GFP_ZERO slightly more efficient slab: make more slab management structure off the slab slab: introduce byte sized index for the freelist of a slab slab: restrict the number of objects in a slab slab: introduce helper functions to get/set free object slab: factor out calculate nr objects in cache_estimate
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-12Merge git://git.infradead.org/users/eparis/auditLinus Torvalds
Pull audit updates from Eric Paris. * git://git.infradead.org/users/eparis/audit: (28 commits) AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range audit: do not cast audit_rule_data pointers pointlesly AUDIT: Allow login in non-init namespaces audit: define audit_is_compat in kernel internal header kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c sched: declare pid_alive as inline audit: use uapi/linux/audit.h for AUDIT_ARCH declarations syscall_get_arch: remove useless function arguments audit: remove stray newline from audit_log_execve_info() audit_panic() call audit: remove stray newlines from audit_log_lost messages audit: include subject in login records audit: remove superfluous new- prefix in AUDIT_LOGIN messages audit: allow user processes to log from another PID namespace audit: anchor all pid references in the initial pid namespace audit: convert PPIDs to the inital PID namespace. pid: get pid_t ppid of task in init_pid_ns audit: rename the misleading audit_get_context() to audit_take_context() audit: Add generic compat syscall support audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL ...
2014-04-12missing bits of "splice: fix racy pipe->buffers uses"Al Viro
that commit has fixed only the parts of that mess in fs/splice.c itself; there had been more in several other ->splice_read() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-11mm: slab/slub: use page->list consistently instead of page->lruDave Hansen
'struct page' has two list_head fields: 'lru' and 'list'. Conveniently, they are unioned together. This means that code can use them interchangably, which gets horribly confusing like with this nugget from slab.c: > list_del(&page->lru); > if (page->active == cachep->num) > list_add(&page->list, &n->slabs_full); This patch makes the slab and slub code use page->lru universally instead of mixing ->list and ->lru. So, the new rule is: page->lru is what the you use if you want to keep your page on a list. Don't like the fact that it's not called ->list? Too bad. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2014-04-08mm: vmscan: do not swap anon pages just because free+file is lowJohannes Weiner
Page reclaim force-scans / swaps anonymous pages when file cache drops below the high watermark of a zone in order to prevent what little cache remains from thrashing. However, on bigger machines the high watermark value can be quite large and when the workload is dominated by a static anonymous/shmem set, the file set might just be a small window of used-once cache. In such situations, the VM starts swapping heavily when instead it should be recycling the no longer used cache. This is a longer-standing problem, but it's more likely to trigger after commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") because file pages can no longer accumulate in a single zone and are dispersed into smaller fractions among the available zones. To resolve this, do not force scan anon when file pages are low but instead rely on the scan/rotation ratios to make the right prediction. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: <stable@kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge second patch-bomb from Andrew Morton: - the rest of MM - zram updates - zswap updates - exit - procfs - exec - wait - crash dump - lib/idr - rapidio - adfs, affs, bfs, ufs - cris - Kconfig things - initramfs - small amount of IPC material - percpu enhancements - early ioremap support - various other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits) MAINTAINERS: update Intel C600 SAS driver maintainers fs/ufs: remove unused ufs_super_block_third pointer fs/ufs: remove unused ufs_super_block_second pointer fs/ufs: remove unused ufs_super_block_first pointer fs/ufs/super.c: add __init to init_inodecache() doc/kernel-parameters.txt: add early_ioremap_debug arm64: add early_ioremap support arm64: initialize pgprot info earlier in boot x86: use generic early_ioremap mm: create generic early_ioremap() support x86/mm: sparse warning fix for early_memremap lglock: map to spinlock when !CONFIG_SMP percpu: add preemption checks to __this_cpu ops vmstat: use raw_cpu_ops to avoid false positives on preemption checks slub: use raw_cpu_inc for incrementing statistics net: replace __this_cpu_inc in route.c with raw_cpu_inc modules: use raw_cpu_write for initialization of per cpu refcount. mm: use raw_cpu ops for determining current NUMA node percpu: add raw_cpu_ops slub: fix leak of 'name' in sysfs_slab_add ...
2014-04-07mm: create generic early_ioremap() supportMark Salter
This patch creates a generic implementation of early_ioremap() support based on the existing x86 implementation. early_ioremp() is useful for early boot code which needs to temporarily map I/O or memory regions before normal mapping functions such as ioremap() are available. Some architectures have optional MMU. In the no-MMU case, the remap functions simply return the passed in physical address and the unmap functions do nothing. Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07slub: use raw_cpu_inc for incrementing statisticsChristoph Lameter
Statistics are not critical to the operation of the allocation but should also not cause too much overhead. When __this_cpu_inc is altered to check if preemption is disabled this triggers. Use raw_cpu_inc to avoid the checks. Using this_cpu_ops may cause interrupt disable/enable sequences on various arches which may significantly impact allocator performance. [akpm@linux-foundation.org: add comment] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07slub: fix leak of 'name' in sysfs_slab_addDave Jones
The failure paths of sysfs_slab_add don't release the allocation of 'name' made by create_unique_id() a few lines above the context of the diff below. Create a common exit path to make it more obvious what needs freeing. [vdavydov@parallels.com: free the name only if !unmergeable] Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07slub: rework sysfs layout for memcg cachesVladimir Davydov
Currently, we try to arrange sysfs entries for memcg caches in the same manner as for global caches. Apart from turning /sys/kernel/slab into a mess when there are a lot of kmem-active memcgs created, it actually does not work properly - we won't create more than one link to a memcg cache in case its parent is merged with another cache. For instance, if A is a root cache merged with another root cache B, we will have the following sysfs setup: X A -> X B -> X where X is some unique id (see create_unique_id()). Now if memcgs M and N start to allocate from cache A (or B, which is the same), we will get: X X:M X:N A -> X B -> X A:M -> X:M A:N -> X:N Since B is an alias for A, we won't get entries B:M and B:N, which is confusing. It is more logical to have entries for memcg caches under the corresponding root cache's sysfs directory. This would allow us to keep sysfs layout clean, and avoid such inconsistencies like one described above. This patch does the trick. It creates a "cgroup" kset in each root cache kobject to keep its children caches there. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07slub: adjust memcg caches when creating cache aliasVladimir Davydov
Otherwise, kzalloc() called from a memcg won't clear the whole object. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07memcg, slab: do not destroy children caches if parent has aliasesVladimir Davydov
Currently we destroy children caches at the very beginning of kmem_cache_destroy(). This is wrong, because the root cache will not necessarily be destroyed in the end - if it has aliases (refcount > 0), kmem_cache_destroy() will simply decrement its refcount and return. In this case, at best we will get a bunch of warnings in dmesg, like this one: kmem_cache_destroy kmalloc-32:0: Slab cache still has objects CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117 Call Trace: dump_stack+0x49/0x5b kmem_cache_destroy+0xdf/0xf0 kmem_cache_destroy_memcg_children+0x97/0xc0 kmem_cache_destroy+0xf/0xf0 xfs_mru_cache_uninit+0x21/0x30 [xfs] exit_xfs_fs+0x2e/0xc44 [xfs] SyS_delete_module+0x198/0x1f0 system_call_fastpath+0x16/0x1b At worst - if kmem_cache_destroy() will race with an allocation from a memcg cache - the kernel will panic. This patch fixes this by moving children caches destruction after the check if the cache has aliases. Plus, it forbids destroying a root cache if it still has children caches, because each children cache keeps a reference to its parent. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07memcg, slab: unregister cache from memcg before starting to destroy itVladimir Davydov
Currently, memcg_unregister_cache(), which deletes the cache being destroyed from the memcg_slab_caches list, is called after __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to destroy the cache. As a result, one can access a partially destroyed cache while traversing a memcg_slab_caches list, which can have deadly consequences (for instance, cache_show() called for each cache on a memcg_slab_caches list from mem_cgroup_slabinfo_read() will dereference pointers to already freed data). To fix this, let's move memcg_unregister_cache() before the cache destruction process beginning, issuing memcg_register_cache() on failure. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07memcg, slab: separate memcg vs root cache creation pathsVladimir Davydov
Memcg-awareness turned kmem_cache_create() into a dirty interweaving of memcg-only and except-for-memcg calls. To clean this up, let's move the code responsible for memcg cache creation to a separate function. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07memcg, slab: cleanup memcg cache creationVladimir Davydov
This patch cleans up the memcg cache creation path as follows: - Move memcg cache name creation to a separate function to be called from kmem_cache_create_memcg(). This allows us to get rid of the mutex protecting the temporary buffer used for the name formatting, because the whole cache creation path is protected by the slab_mutex. - Get rid of memcg_create_kmem_cache(). This function serves as a proxy to kmem_cache_create_memcg(). After separating the cache name creation path, it would be reduced to a function call, so let's inline it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07memcg, slab: never try to merge memcg cachesVladimir Davydov
When a kmem cache is created (kmem_cache_create_memcg()), we first try to find a compatible cache that already exists and can handle requests from the new cache, i.e. has the same object size, alignment, ctor, etc. If there is such a cache, we do not create any new caches, instead we simply increment the refcount of the cache found and return it. Currently we do this procedure not only when creating root caches, but also for memcg caches. However, there is no point in that, because, as every memcg cache has exactly the same parameters as its parent and cache merging cannot be turned off in runtime (only on boot by passing "slub_nomerge"), the root caches of any two potentially mergeable memcg caches should be merged already, i.e. it must be the same root cache, and therefore we couldn't even get to the memcg cache creation, because it already exists. The only exception is boot caches - they are explicitly forbidden to be merged by setting their refcount to -1. There are currently only two of them - kmem_cache and kmem_cache_node, which are used in slab internals (I do not count kmalloc caches as their refcount is set to 1 immediately after creation). Since they are prevented from merging preliminary I guess we should avoid to merge their children too. So let's remove the useless code responsible for merging memcg caches. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Glauber Costa <glommer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm/zswap.c: remove unnecessary parenthesesSeongJae Park
Fix following trivial checkpatch error: ERROR: return is not a function, parentheses are not required Signed-off-by: SeongJae Park <sj38.park@gmail.com> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm/zswap: support multiple swap devicesMinchan Kim
Cai Liu reporeted that now zbud pool pages counting has a problem when multiple swap is used because it just counts only one swap intead of all of swap so zswap cannot control writeback properly. The result is unnecessary writeback or no writeback when we should really writeback. IOW, it made zswap crazy. Another problem in zswap is: For example, let's assume we use two swap A and B with different priority and A already has charged 19% long time ago and let's assume that A swap is full now so VM start to use B so that B has charged 1% recently. It menas zswap charged (19% + 1%) is full by default. Then, if VM want to swap out more pages into B, zbud_reclaim_page would be evict one of pages in B's pool and it would be repeated continuously. It's totally LRU reverse problem and swap thrashing in B would happen. This patch makes zswap consider mutliple swap by creating *a* zbud pool which will be shared by multiple swap so all of zswap pages in multiple swap keep order by LRU so it can prevent above two problems. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Cai Liu <cai.liu@samsung.com> Suggested-by: Weijie Yang <weijie.yang.kh@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>