summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2012-07-19mm: frontswap: split out function to clear a page outSasha Levin
Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-18bounce: allow use of bounce pool via config optionChris Metcalf
The tilegx USB OHCI support needs the bounce pool since we're not using the IOMMU to handle 32-bit addresses. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2012-07-17Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge Andrew's remaining patches for 3.5: "Nine fixes" * Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (9 commits) mm: fix lost kswapd wakeup in kswapd_stop() m32r: make memset() global for CONFIG_KERNEL_BZIP2=y m32r: add memcpy() for CONFIG_KERNEL_GZIP=y m32r: consistently use "suffix-$(...)" m32r: fix 'fix breakage from "m32r: use generic ptrace_resume code"' fallout m32r: fix pull clearing RESTORE_SIGMASK into block_sigmask() fallout m32r: remove duplicate definition of PTRACE_O_TRACESYSGOOD mn10300: fix "pull clearing RESTORE_SIGMASK into block_sigmask()" fallout bootmem: make ___alloc_bootmem_node_nopanic() really nopanic
2012-07-17mm: fix lost kswapd wakeup in kswapd_stop()Aaditya Kumar
Offlining memory may block forever, waiting for kswapd() to wake up because kswapd() does not check the event kthread->should_stop before sleeping. The proper pattern, from Documentation/memory-barriers.txt, is: --- waker --- event_indicated = 1; wake_up_process(event_daemon); --- sleeper --- for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (event_indicated) break; schedule(); } set_current_state() may be wrapped by: prepare_to_wait(); In the kswapd() case, event_indicated is kthread->should_stop. === offlining memory (waker) === kswapd_stop() kthread_stop() kthread->should_stop = 1 wake_up_process() wait_for_completion() === kswapd_try_to_sleep (sleeper) === kswapd_try_to_sleep() prepare_to_wait() . . schedule() . . finish_wait() The schedule() needs to be protected by a test of kthread->should_stop, which is wrapped by kthread_should_stop(). Reproducer: Do heavy file I/O in background. Do a memory offline/online in a tight loop Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-17bootmem: make ___alloc_bootmem_node_nopanic() really nopanicYinghai Lu
In reaction to commit 99ab7b19440a ("mm: sparse: fix usemap allocation above node descriptor section") Johannes said: | while backporting the below patch, I realised that your fix busted | f5bf18fa22f8 again. The problem was not a panicking version on | allocation failure but when the usemap size was too large such that | goal + size > limit triggers the BUG_ON in the bootmem allocator. So | we need a version that passes limit ONLY if the usemap is smaller than | the section. after checking the code, the name of ___alloc_bootmem_node_nopanic() does not reflect the fact. Make bootmem really not panic. Hope will kill bootmem sooner. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-17Merge branch 'fixes-for-linus' of ↵Linus Torvalds
git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull CMA and DMA-mapping fixes from Marek Szyprowski: "Another set of minor fixups for recently merged Contiguous Memory Allocator and ARM DMA-mapping changes. Those patches fix mysterious crashes on systems with CMA and Himem enabled as well as some corner cases caused by typical off-by-one bug." * 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: ARM: dma-mapping: modify condition check while freeing pages mm: cma: fix condition check when setting global cma area mm: cma: don't replace lowmem pages with highmem
2012-07-14don't pass nameidata to ->create()Al Viro
boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-12slob: Fix early boot kernel crashChristoph Lameter
Commit fd3142a59af2012a7c5dc72ec97a4935ff1c5fc6 broke slob since a piece of a change for a later patch slipped into it. Fengguang Wu writes: The commit crashes the kernel w/o any dmesg output (the attached one is created by the script as a summary for that run). This is very reproducible in kvm for the attached config. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-11memblock: free allocated memblock_reserved_regions laterYinghai Lu
memblock_free_reserved_regions() calls memblock_free(), but memblock_free() would double reserved.regions too, so we could free the old range for reserved.regions. Also tj said there is another bug which could be related to this. | I don't think we're saving any noticeable | amount by doing this "free - give it to page allocator - reserve | again" dancing. We should just allocate regions aligned to page | boundaries and free them later when memblock is no longer in use. in that case, when DEBUG_PAGEALLOC, will get panic: memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39 BUG: unable to handle kernel paging request at ffff88102febd948 IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155 PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447 RIP: 0010:[<ffffffff836a5774>] [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155 See the discussion at https://lkml.org/lkml/2012/6/13/469 So try to allocate with PAGE_SIZE alignment and free it later. Reported-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm: sparse: fix usemap allocation above node descriptor sectionYinghai Lu
After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint in alloc_bootmem_section"), usemap allocations may easily be placed outside the optimal section that holds the node descriptor, even if there is space available in that section. This results in unnecessary hotplug dependencies that need to have the node unplugged before the section holding the usemap. The reason is that the bootmem allocator doesn't guarantee a linear search starting from the passed allocation goal but may start out at a much higher address absent an upper limit. Fix this by trying the allocation with the limit at the section end, then retry without if that fails. This keeps the fix from f5bf18fa22f8 of not panicking if the allocation does not fit in the section, but still makes sure to try to stay within the section at first. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm: sparse: fix section usemap placement calculationYinghai Lu
Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the bootmem allocator") introduced a bug in the allocation goal calculation that put section usemaps not in the same section as the node descriptors, creating unnecessary hotplug dependencies between them: node 0 must be removed before remove section 16399 node 1 must be removed before remove section 16399 node 2 must be removed before remove section 16399 node 3 must be removed before remove section 16399 node 4 must be removed before remove section 16399 node 5 must be removed before remove section 16399 node 6 must be removed before remove section 16399 The reason is that it applies PAGE_SECTION_MASK to the physical address of the node descriptor when finding a suitable place to put the usemap, when this mask is actually intended to be used with PFNs. Because the PFN mask is wider, the target address will point beyond the wanted section holding the node descriptor and the node must be offlined before the section holding the usemap can go. Fix this by extending the mask to address width before use. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11shmem: cleanup shmem_add_to_page_cacheHugh Dickins
shmem_add_to_page_cache() has three callsites, but only one of them wants the radix_tree_preload() (an exceptional entry guarantees that the radix tree node is present in the other cases), and only that site can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the other cases). We did it this way originally to reflect add_to_page_cache_locked(); but it's confusing now, so move the radix_tree preloading and mem_cgroup uncharging to that one caller. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11shmem: fix negative rss in memcg memory.statHugh Dickins
When adding the page_private checks before calling shmem_replace_page(), I did realize that there is a further race, but thought it too unlikely to need a hurried fix. But independently I've been chasing why a mem cgroup's memory.stat sometimes shows negative rss after all tasks have gone: I expected it to be a stats gathering bug, but actually it's shmem swapping's fault. It's an old surprise, that when you lock_page(lookup_swap_cache(swap)), the page may have been removed from swapcache before getting the lock; or it may have been freed and reused and be back in swapcache; and it can even be using the same swap location as before (page_private same). The swapoff case is already secure against this (swap cannot be reused until the whole area has been swapped off, and a new swapped on); and shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for the expected radix_tree entry - but a little too late. By that time, we might have already decided to shmem_replace_page(): I don't know of a problem from that, but I'd feel more at ease not to do so spuriously. And we have already done mem_cgroup_cache_charge(), on perhaps the wrong mem cgroup: and this charge is not then undone on the error path, because PageSwapCache ends up preventing that. It's this last case which causes the occasional negative rss in memory.stat: the page is charged here as cache, but (sometimes) found to be anon when eventually it's uncharged - and in between, it's an undeserved charge on the wrong memcg. Fix this by adding an earlier check on the radix_tree entry: it's inelegant to descend the tree twice, but swapping is not the fast path, and a better solution would need a pair (try+commit) of memcg calls, and a rework of shmem_replace_page() to keep out of the swapcache. We can use the added shmem_confirm_swap() function to replace the find_get_page+page_cache_release we were already doing on the error path. And add a comment on that -EEXIST: it seems a peculiar errno to be using, but originates from its use in radix_tree_insert(). [It can be surprising to see positive rss left in a memcg's memory.stat after all tasks have gone, since it is supposed to count anonymous but not shmem. Aside from sharing anon pages via fork with a task in some other memcg, it often happens after swapping: because a swap page can't be freed while under writeback, nor while locked. So it's not an error, and these residual pages are easily freed once pressure demands.] Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11tmpfs: revert SEEK_DATA and SEEK_HOLEHugh Dickins
Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"). I believe it's correct, and it's been nice to have from rc1 to rc6; but as the original commit said: I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it would be of any use to them on tmpfs. This code adds 92 lines and 752 bytes on x86_64 - is that bloat or worthwhile? Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to the dumb generic support for v3.5. We can always reinstate it later if useful, and anyone needing it in a hurry can just get it out of git. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Josef Bacik <josef@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Marco Stornelli <marco.stornelli@gmail.com> Cc: Jeff liu <jeff.liu@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() failsWen Congyang
We should goto error to release memory resource if hotadd_new_pgdat() failed. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki ISIMATU <isimatu.yasuaki@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11mm, thp: abort compaction if migration page cannot be charged to memcgDavid Rientjes
If page migration cannot charge the temporary page to the memcg, migrate_pages() will return -ENOMEM. This isn't considered in memory compaction however, and the loop continues to iterate over all pageblocks trying to isolate and migrate pages. If a small number of very large memcgs happen to be oom, however, these attempts will mostly be futile leading to an enormous amout of cpu consumption due to the page migration failures. This patch will short circuit and fail memory compaction if migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case some migrations were successful so that the page allocator will retry. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11memory hotplug: fix invalid memory access caused by stale kswapd pointerJiang Liu
kswapd_stop() is called to destroy the kswapd work thread when all memory of a NUMA node has been offlined. But kswapd_stop() only terminates the work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale pointer will prevent kswapd_run() from creating a new work thread when adding memory to the memory-less NUMA node again. Eventually the stale pointer may cause invalid memory access. An example stack dump as below. It's reproduced with 2.6.32, but latest kernel has the same issue. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051a94>] exit_creds+0x12/0x78 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/memory/memory391/state CPU 11 Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285 RIP: 0010:exit_creds+0x12/0x78 RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502 RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000 RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10 R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000 R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001 FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600) Stack: ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1 0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003 Call Trace: __put_task_struct+0x5d/0x97 kthread_stop+0x50/0x58 offline_pages+0x324/0x3da memory_block_change_state+0x179/0x1db store_mem_state+0x9e/0xbb sysfs_write_file+0xd0/0x107 vfs_write+0xad/0x169 sys_write+0x45/0x6e system_call_fastpath+0x16/0x1b Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0 RIP exit_creds+0x12/0x78 RSP <ffff8806044f1d78> CR2: 0000000000000000 [akpm@linux-foundation.org: add pglist_data.kswapd locking comments] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11Merge branch 'mce-ripvfix' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce Merge memory fault handling fix from Tony Luck. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-11x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faultsTony Luck
In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine check recovery if it is safe") we fixed mce_notify_process() to force a signal to the current process if it was not restartable (RIPV bit not set in MCG_STATUS). But doing it here means that the process doesn't get told the virtual address of the fault via siginfo_t->si_addr. This would prevent application level recovery from the fault. Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so that we will provide the right information with the signal. Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Cc: stable@kernel.org # 3.4+
2012-07-10mm, slub: ensure irqs are enabled for kmemcheckDavid Rientjes
kmemcheck_alloc_shadow() requires irqs to be enabled, so wait to disable them until after its called for __GFP_WAIT allocations. This fixes a warning for such allocations: WARNING: at kernel/lockdep.c:2739 lockdep_trace_alloc+0x14e/0x1c0() Acked-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Move kmem_cache_create mutex handling to common codeChristoph Lameter
Move the mutex handling into the common kmem_cache_create() function. Then we can also move more checks out of SLAB's kmem_cache_create() into the common code. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Use a common mutex definitionChristoph Lameter
Use the mutex definition from SLAB and make it the common way to take a sleeping lock. This has the effect of using a mutex instead of a rw semaphore for SLUB. SLOB gains the use of a mutex for kmem_cache_create serialization. Not needed now but SLOB may acquire some more features later (like slabinfo / sysfs support) through the expansion of the common code that will need this. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Common definition for boot state of the slab allocatorsChristoph Lameter
All allocators have some sort of support for the bootstrap status. Setup a common definition for the boot states and make all slab allocators use that definition. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Extract common code for kmem_cache_create()Christoph Lameter
Kmem_cache_create() does a variety of sanity checks but those vary depending on the allocator. Use the strictest tests and put them into a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM. This patch has the effect of adding sanity checks for SLUB and SLOB under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09slub: remove invalid reference to list iterator variableJulia Lawall
If list_for_each_entry, etc complete a traversal of the list, the iterator variable ends up pointing to an address at an offset from the list head, and not a meaningful structure. Thus this value should not be used after the end of the iterator. The patch replaces s->name by al->name, which is referenced nearby. This problem was found using Coccinelle (http://coccinelle.lip6.fr/). Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-06mm: Hold a file reference in madvise_removeAndy Lutomirski
Otherwise the code races with munmap (causing a use-after-free of the vma) or with close (causing a use-after-free of the struct file). The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix mmap_sem i_mutex deadlock") Cc: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-06mm: cma: don't replace lowmem pages with highmemRabin Vincent
The filesystem layer expects pages in the block device's mapping to not be in highmem (the mapping's gfp mask is set in bdget()), but CMA can currently replace lowmem pages with highmem pages, leading to crashes in filesystem code such as the one below: Unable to handle kernel NULL pointer dereference at virtual address 00000400 pgd = c0c98000 [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000 Internal error: Oops: 817 [#1] PREEMPT SMP ARM CPU: 0 Not tainted (3.5.0-rc5+ #80) PC is at __memzero+0x24/0x80 ... Process fsstress (pid: 323, stack limit = 0xc0cbc2f0) Backtrace: [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98) [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc) r4:c15337f0 [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98) [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac) r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0 [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24) r6:beccdcf0 r5:00074000 r4:beccdbbc [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30) Fix this by replacing only highmem pages with highmem. Reported-by: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2012-07-03Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block bits from Jens Axboe: "As vacation is coming up, thought I'd better get rid of my pending changes in my for-linus branch for this iteration. It contains: - Two patches for mtip32xx. Killing a non-compliant sysfs interface and moving it to debugfs, where it belongs. - A few patches from Asias. Two legit bug fixes, and one killing an interface that is no longer in use. - A patch from Jan, making the annoying partition ioctl warning a bit less annoying, by restricting it to !CAP_SYS_RAWIO only. - Three bug fixes for drbd from Lars Ellenberg. - A fix for an old regression for umem, it hasn't really worked since the plugging scheme was changed in 3.0. - A few fixes from Tejun. - A splice fix from Eric Dumazet, fixing an issue with pipe resizing." * 'for-linus' of git://git.kernel.dk/linux-block: scsi: Silence unnecessary warnings about ioctl to partition block: Drop dead function blk_abort_queue() block: Mitigate lock unbalance caused by lock switching block: Avoid missed wakeup in request waitqueue umem: fix up unplugging splice: fix racy pipe->buffers uses drbd: fix null pointer dereference with on-congestion policy when diskless drbd: fix list corruption by failing but already aborted reads drbd: fix access of unallocated pages and kernel panic xen/blkfront: Add WARN to deal with misbehaving backends. blkcg: drop local variable @q from blkg_destroy() mtip32xx: Create debugfs entries for troubleshooting mtip32xx: Remove 'registers' and 'flags' from sysfs blkcg: fix blkg_alloc() failure path block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED block: fix return value on cfq_init() failure mtip32xx: Remove version.h header file inclusion xen/blkback: Copy id field when doing BLKIF_DISCARD.
2012-07-02slab: move FULL state transition to an initcallGlauber Costa
During kmem_cache_init_late(), we transition to the LATE state, and after some more work, to the FULL state, its last state This is quite different from slub, that will only transition to its last state (previously SYSFS), in a (late)initcall, after a lot more of the kernel is ready. This means that in slab, we have no way to taking actions dependent on the initialization of other pieces of the kernel that are supposed to start way after kmem_init_late(), such as cgroups initialization. To achieve more consistency in this behavior, that patch only transitions to the UP state in kmem_init_late. In my analysis, setup_cpu_cache() should be happy to test for >= UP, instead of == FULL. It also has passed some tests I've made. We then only mark FULL state after the reap timers are in place, meaning that no further setup is expected. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro"Feng Tang
Commit 8c138b only sits in Pekka's and linux-next tree now, which tries to replace obj_size(cachep) with cachep->object_size, but has a typo in kmem_cache_free() by using "size" instead of "object_size", which casues some regressions. Reported-and-tested-by: Fengguang Wu <wfg@linux.intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02mm, slab: Build fix for recent kmem_cache changesThierry Reding
Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct kmem_cache") renamed the kmem_cache structure's "next" field to "list" but forgot to update one instance in leaks_show(). Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: rename gfpflags to allocflagsGlauber Costa
A consistent name with slub saves us an acessor function. In both caches, this field represents the same thing. We would like to use it from the mem_cgroup code. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-29Merge branch 'master' into for-nextJiri Kosina
Conflicts: include/linux/mmzone.h Synced with Linus' tree so that trivial patch can be applied on top of up-to-date code properly. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
2012-06-28mm/vmscan: cleanup comment error in balance_pgdatWanpeng Li
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-06-28mm: fix page reclaim comment errorWanpeng Li
Since there are five lists in LRU cache, the array nr in get_scan_count should be: nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan nr[2] = file inactive pages to scan; nr[3] = file active pages to scan Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-06-27mm/mmu_gather: enable tlb flush range in generic mmu_gatherAlex Shi
This patch enabled the tlb flush range support in generic mmu layer. Most of arch has self tlb flush range support, like ARM/IA64 etc. X86 arch has no this support in hardware yet. But another instruction 'invlpg' can implement this function in some degree. So, enable this feather in generic layer for x86 now. and maybe useful for other archs in further. Generic mmu_gather struct is protected by micro HAVE_GENERIC_MMU_GATHER. Other archs that has flush range supported own self mmu_gather struct. So, now this change is safe for them. In future we may unify this struct and related functions on multiple archs. Thanks for Peter Zijlstra time and time reminder for multiple architecture code safe! Signed-off-by: Alex Shi <alex.shi@intel.com> Link: http://lkml.kernel.org/r/1340845344-27557-7-git-send-email-alex.shi@intel.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-06-25mempool: add @gfp_mask to mempool_create_node()Tejun Heo
mempool_create_node() currently assumes %GFP_KERNEL. Its only user, blk_init_free_list(), is about to be updated to use other allocation flags - add @gfp_mask argument to the function. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-20mm, mempolicy: fix mbind() to do synchronous migrationDavid Rientjes
If the range passed to mbind() is not allocated on nodes set in the nodemask, it migrates the pages to respect the constraint. The final formal of migrate_pages() is a mode of type enum migrate_mode, not a boolean. do_mbind() is currently passing "true" which is the equivalent of MIGRATE_SYNC_LIGHT. This should instead be MIGRATE_SYNC for synchronous page migration. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm/memblock: fix overlapping allocation when doubling reserved arrayGreg Pearson
__alloc_memory_core_early() asks memblock for a range of memory then try to reserve it. If the reserved region array lacks space for the new range, memblock_double_array() is called to allocate more space for the array. If memblock is used to allocate memory for the new array it can end up using a range that overlaps with the range originally allocated in __alloc_memory_core_early(), leading to possible data corruption. With this patch memblock_double_array() now calls memblock_find_in_range() with a narrowed candidate range (in cases where the reserved.regions array is being doubled) so any memory allocated will not overlap with the original range that was being reserved. The range is narrowed by passing in the starting address and size of the previously allocated range. Then the range above the ending address is searched and if a candidate is not found, the range below the starting address is searched. Signed-off-by: Greg Pearson <greg.pearson@hp.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm/memory.c: fix kernel-doc warningsRandy Dunlap
Fix kernel-doc warnings in mm/memory.c: Warning(mm/memory.c:1377): No description found for parameter 'start' Warning(mm/memory.c:1377): Excess function parameter 'address' description in 'zap_page_range' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm: fix kernel-doc warningsWanpeng Li
Fix kernel-doc warnings such as Warning(../mm/page_cgroup.c:432): No description found for parameter 'id' Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record' Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm, thp: print useful information when mmap_sem is unlocked in zap_pmd_rangeDavid Rientjes
Andrea asked for addr, end, vma->vm_start, and vma->vm_end to be emitted when !rwsem_is_locked(&tlb->mm->mmap_sem). Otherwise, debugging the underlying issue is more difficult. Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20memcg: fix use_hierarchy css_is_ancestor oops regressionHugh Dickins
If use_hierarchy is set, reclaim testing soon oopses in css_is_ancestor() called from __mem_cgroup_same_or_subtree() called from page_referenced(): when processes are exiting, it's easy for mm_match_cgroup() to pass along a NULL memcg coming from a NULL mm->owner. Check for that in __mem_cgroup_same_or_subtree(). Return true or false? False because we cannot know if it was in the hierarchy, but also false because it's better not to count a reference from an exiting process. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm, oom: fix and cleanup oom score calculationsDavid Rientjes
The divide in p->signal->oom_score_adj * totalpages / 1000 within oom_badness() was causing an overflow of the signed long data type. This adds both the root bias and p->signal->oom_score_adj before doing the normalization which fixes the issue and also cleans up the calculation. Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20slub: refactoring unfreeze_partials()Joonsoo Kim
Current implementation of unfreeze_partials() is so complicated, but benefit from it is insignificant. In addition many code in do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab. Under current implementation which test status of cpu partial slab and acquire list_lock in do {} while loop, we don't need to acquire a list_lock and gain a little benefit when front of the cpu partial slab is to be discarded, but this is a rare case. In case that add_partial is performed and cmpxchg_double_slab is failed, remove_partial should be called case by case. I think that these are disadvantages of current implementation, so I do refactoring unfreeze_partials(). Minimizing code in do {} while loop introduce a reduced fail rate of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256' when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done. ** before ** Cmpxchg_double Looping ------------------------ Locked Cmpxchg Double redos 182685 Unlocked Cmpxchg Double redos 0 ** after ** Cmpxchg_double Looping ------------------------ Locked Cmpxchg Double redos 177995 Unlocked Cmpxchg Double redos 1 We can see cmpxchg_double_slab fail rate is improved slightly. Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'. ** before ** Performance counter stats for './hackbench 50 process 4000' (30 runs): 108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% ) 2,919,550 context-switches # 0.027 M/sec ( +- 3.07% ) 100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% ) 124,201 page-faults # 0.001 M/sec ( +- 0.15% ) 401,500,234,387 cycles # 3.700 GHz ( +- 0.24% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% ) 45,934,956,860 branches # 423.297 M/sec ( +- 0.14% ) 188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% ) 13.691837307 seconds time elapsed ( +- 0.24% ) ** after ** Performance counter stats for './hackbench 50 process 4000' (30 runs): 107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% ) 2,834,781 context-switches # 0.026 M/sec ( +- 2.33% ) 93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% ) 123,967 page-faults # 0.001 M/sec ( +- 0.15% ) 398,781,421,836 cycles # 3.700 GHz ( +- 0.22% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% ) 45,855,370,128 branches # 425.436 M/sec ( +- 0.10% ) 169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% ) 13.596272341 seconds time elapsed ( +- 0.22% ) No regression is found, but rather we can see slightly better result. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-20slub: use __cmpxchg_double_slab() at interrupt disabled placeJoonsoo Kim
get_freelist(), unfreeze_partials() are only called with interrupt disabled, so __cmpxchg_double_slab() is suitable. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-20slab/mempolicy: always use local policy from interrupt contextAndi Kleen
slab_node() could access current->mempolicy from interrupt context. However there's a race condition during exit where the mempolicy is first freed and then the pointer zeroed. Using this from interrupts seems bogus anyways. The interrupt will interrupt a random process and therefore get a random mempolicy. Many times, this will be idle's, which noone can change. Just disable this here and always use local for slab from interrupts. I also cleaned up the callers of slab_node a bit which always passed the same argument. I believe the original mempolicy code did that in fact, so it's likely a regression. v2: send version with correct logic v3: simplify. fix typo. Reported-by: Arun Sharma <asharma@fb.com> Cc: penberg@kernel.org Cc: cl@linux.com Signed-off-by: Andi Kleen <ak@linux.intel.com> [tdmackey@twitter.com: Rework control flow based on feedback from cl@linux.com, fix logic, and cleanup current task_struct reference] Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Mackey <tdmackey@twitter.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-15swap: fix shmem swapping when more than 8 areasHugh Dickins
Minchan Kim reports that when a system has many swap areas, and tmpfs swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read back the page cannot locate it, and the read fails with -ENOMEM. Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry( swp_entry_to_pte()) technique for determining maximum usable swap offset, without stopping to realize that that actually depends upon the pte swap encoding shifting swap offset to the higher bits and truncating it there. Whereas our radix_tree swap encoding leaves offset in the lower bits: it's swap "type" (that is, index of swap area) that was truncated. Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header(). This does not reduce the usable size of a swap area any further, it leaves it as claimed when making the original commit: no change from 3.0 on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB per swapfile on i386 with PAE. It's not a change I would have risked five years ago, but with x86_64 supported for ten years, I believe it's appropriate now. Hmm, and what if some architecture implements its swap pte with offset encoded below type? That would equally break the maximum usable swap offset check. Happily, they all follow the same tradition of encoding offset above type, but I'll prepare a check on that for next. Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-15Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core updates (RCU and locking) from Ingo Molnar: "Most of the diffstat comes from the RCU slow boot regression fixes, but there's also a debuggability improvements/fixes." * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: memblock: Document memblock_is_region_{memory,reserved}() rcu: Precompute RCU_FAST_NO_HZ timer offsets rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks rcu: RCU_FAST_NO_HZ detection of callback adoption spinlock: Indicate that a lockup is only suspected kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop() panic: Make panic_on_oops configurable
2012-06-14slab: Get rid of obj_size macroChristoph Lameter
The size of the slab object is frequently needed. Since we now have a size field directly in the kmem_cache structure there is no need anymore of the obj_size macro/function. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>