Age | Commit message | Author
2011-05-25mm: mmu_gather reworkPeter Zijlstra
Rework the existing mmu_gather infrastructure. The direct purpose of these patches was to allow preemptible mmu_gather, but even without that I think these patches provide an improvement to the status quo. The first 9 patches rework the mmu_gather infrastructure. For review purposes I've split them into generic and per-arch patches with the last of those a generic cleanup. The next patch provides generic RCU page-table freeing, and the followup is a patch converting s390 to use this. I've also got 4 patches from DaveM lined up (not included in this series) that use this to implement gup_fast() for sparc64. Then there is one patch that extends the generic mmu_gather batching. After that follow the mm preemptibility patches, these make part of the mm a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to mutexes which together with the mmu_gather rework makes mmu_gather preemptible as well. Making i_mmap_lock a mutex also enables a clean-up of the truncate code. This also allows for preemptible mmu_notifiers, something that XPMEM I think wants. Furthermore, it removes the new and universally detested unmap_mutex. This patch: Remove the first obstacle towards a fully preemptible mmu_gather. The current scheme assumes mmu_gather is always done with preemption disabled and uses per-cpu storage for the page batches. Change this to try and allocate a page for batching and in case of failure, use a small on-stack array to make some progress. Preemptible mmu_gather is desired in general and usable once i_mmap_lock becomes a mutex. Doing it before the mutex conversion saves us from having to rework the code by moving the mmu_gather bits inside the pte_lock. Also avoid flushing the tlb batches from under the pte lock, this is useful even without the i_mmap_lock conversion as it significantly reduces pte lock hold times.
[akpm@linux-foundation.org: fix comment tpyo] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
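The "allocate a page for batching, else fall back to a small on-stack array" idea from the mmu_gather rework can be sketched in userspace C. This is only an illustration under assumptions: struct gather, the slot counts, and every function name below are hypothetical and are not the kernel's mmu_gather API.

```c
#include <stddef.h>
#include <stdlib.h>

#define LOCAL_SLOTS 8    /* hypothetical size of the embedded fallback array */
#define HEAP_SLOTS  512  /* hypothetical capacity of an allocated batch */

struct gather {
	void **slots;             /* storage currently used for batching */
	size_t nr, max;           /* entries used / capacity */
	size_t flushes;           /* number of flushes performed so far */
	int on_heap;              /* nonzero if slots came from the allocator */
	void *local[LOCAL_SLOTS]; /* small fallback, lives with the struct */
};

/* Allocator that always fails, to exercise the fallback path. */
static void *failing_alloc(size_t n) { (void)n; return NULL; }

/* Try to get a big batch; on failure use the small embedded array. */
static void gather_init(struct gather *g, void *(*alloc)(size_t))
{
	void **page = alloc(HEAP_SLOTS * sizeof(void *));

	g->nr = 0;
	g->flushes = 0;
	if (page) {
		g->slots = page;
		g->max = HEAP_SLOTS;
		g->on_heap = 1;
	} else {
		g->slots = g->local;
		g->max = LOCAL_SLOTS;
		g->on_heap = 0;
	}
}

/* Stand-in for "flush the TLB and free the batched pages". */
static void gather_flush(struct gather *g)
{
	if (g->nr) {
		g->nr = 0;
		g->flushes++;
	}
}

/* Queue one entry, flushing first if the batch is full. */
static void gather_add(struct gather *g, void *entry)
{
	if (g->nr == g->max)
		gather_flush(g);
	g->slots[g->nr++] = entry;
}

/* Flush what is left and release an allocated batch. */
static void gather_finish(struct gather *g)
{
	gather_flush(g);
	if (g->on_heap)
		free(g->slots);
}
```

With the failing allocator the gather still makes progress, just with more frequent flushes; with malloc as the allocator the same 20 entries fit in a single batch.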
2011-05-25mm: make expand_downwards() symmetrical with expand_upwards()Michal Hocko
Currently we have expand_upwards exported while expand_downwards is accessible only via expand_stack or expand_stack_downwards. check_stack_guard_page is a nice example of the asymmetry. It uses expand_stack for VM_GROWSDOWN while expand_upwards is called for VM_GROWSUP case. Let's clean this up by exporting both functions and make those names consistent. Let's use expand_{upwards,downwards} because expanding doesn't always involve stack manipulation (an example is ia64_do_page_fault which uses expand_upwards for registers backing store expansion). expand_downwards has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards version in the early process initialization phase for growsup configuration. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hugh Dickins <hughd@google.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm/vmalloc: remove guard page from between vmap blocksJohannes Weiner
The vmap allocator is used to, among other things, allocate per-cpu vmap blocks, where each vmap block is naturally aligned to its own size. Obviously, leaving a guard page after each vmap area forbids packing vmap blocks efficiently and can make the kernel run out of possible vmap blocks long before overall vmap space is exhausted. The new interface to map a user-supplied page array into linear vmalloc space (vm_map_ram) insists on allocating from a vmap block (instead of falling back to a custom area) when the area size is below a certain threshold. With heavy users of this interface (e.g. XFS) and limited vmalloc space on 32-bit, vmap block exhaustion is a real problem. Remove the guard page from the core vmap allocator. vmalloc and the old vmap interface enforce a guard page on their own at a higher level. Note that without this patch, we had accidental guard pages after those vm_map_ram areas that happened to be at the end of a vmap block, but not between every area. This patch removes this accidental guard page only. If we want guard pages after every vm_map_ram area, this should be done separately. And just like with vmalloc and the old interface on a different level, not in the core allocator. Mel pointed out: "If necessary, the guard page could be reintroduced as a debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems reasonable." Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25include/linux/gfp.h: convert BUG_ON() into VM_BUG_ON()Dave Hansen
VM_BUG_ON() is effectively a BUG_ON() under #ifdef CONFIG_DEBUG_VM. That is exactly what we have here now, and two different folks have suggested doing it this way. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
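The relationship between the two macros can be sketched as follows. This is a userspace stand-in, not the kernel's definitions: the BUG_ON() body, the sizeof-based no-op, and checked_bit() are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for BUG_ON(): report the location and abort. */
#define BUG_ON(cond) \
	do { \
		if (cond) { \
			fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); \
			abort(); \
		} \
	} while (0)

#ifdef CONFIG_DEBUG_VM
/* Debug builds: the check is a real BUG_ON(). */
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
/* Other builds: type-check the expression but never evaluate it. */
#define VM_BUG_ON(cond) do { (void)sizeof(cond); } while (0)
#endif

/* Hypothetical caller: validates its argument only in debug builds. */
static int checked_bit(int bit)
{
	VM_BUG_ON(bit < 0);
	return bit;
}
```

Built without -DCONFIG_DEBUG_VM, checked_bit(-1) simply returns -1; built with it, the same call aborts.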
2011-05-25include/linux/gfp.h: work around apparent sparse confusionDave Hansen
Running sparse on page_alloc.c today, it errors out: include/linux/gfp.h:254:17: error: bad constant expression include/linux/gfp.h:254:17: error: cannot size expression which is a line in gfp_zone(): BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1); That's really unfortunate, because it ends up hiding all of the other legitimate sparse messages like this: mm/page_alloc.c:5315:59: warning: incorrect type in argument 1 (different base types) mm/page_alloc.c:5315:59: expected unsigned long [unsigned] [usertype] size mm/page_alloc.c:5315:59: got restricted gfp_t [usertype] <noident> ... Having sparse be able to catch these very oopsable bugs is a lot more important than keeping a BUILD_BUG_ON(). Kill the BUILD_BUG_ON(). Compiles on x86_64 with and without CONFIG_DEBUG_VM=y. defconfig boots fine for me. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25oom: replace PF_OOM_ORIGIN with toggling oom_score_adjDavid Rientjes
There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
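The save, raise, and restore sequence described above can be sketched like this. struct task and oom_adj_swap() are hypothetical simplifications (the real code operates on the current task under the appropriate locking); OOM_SCORE_ADJ_MAX is 1000, as in the kernel.

```c
#define OOM_SCORE_ADJ_MAX 1000

/* Hypothetical, stripped-down task: only the field this sketch needs. */
struct task {
	int oom_score_adj;
};

/* Install a new oom_score_adj and hand back the previous value. */
static int oom_adj_swap(struct task *t, int new_val)
{
	int old = t->oom_score_adj;

	t->oom_score_adj = new_val;
	return old;
}

/*
 * Hypothetical memory-hungry operation (think swapoff or a ksm run):
 * prefer this task for OOM killing while it runs, then put the old
 * value back. Returns the value that was in effect during the work.
 */
static int run_hungry_operation(struct task *t)
{
	int during;
	int saved = oom_adj_swap(t, OOM_SCORE_ADJ_MAX);

	during = t->oom_score_adj;   /* the real work would happen here */
	oom_adj_swap(t, saved);      /* reinstate the caller's setting */
	return during;
}
```

The point of the design is that no dedicated per-process flag is needed: the existing tunable carries the "kill me first" state for the duration of the operation.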
2011-05-25mm/compaction: reverse the change that forbade sync migraton with __GFP_NO_KSWAPDAndrea Arcangeli
It's uncertain this has been beneficial, so it's safer to undo it. All other compaction users would still go in synchronous mode if a first attempt at async compaction failed. Hopefully we don't need to force special behavior for THP (which is the only __GFP_NO_KSWAPD user so far and it's the easiest to exercise and to be noticeable). This also makes __GFP_NO_KSWAPD return to its original strict semantics specific to bypassing kswapd, as THP allocations have khugepaged for the async THP allocations/compactions. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Alex Villacis Lasso <avillaci@fiec.espol.edu.ec> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occurKOSAKI Motohiro
Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [akpm@linux-foundation.org: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occursKOSAKI Motohiro
Currently, memory hotplug calls setup_per_zone_wmarks() and calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve(). This means the number of reserved pages isn't updated even when memory hotplug occurs. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() should be __meminit.KOSAKI Motohiro
Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens") introduced invalid section references. Now, setup_per_zone_inactive_ratio() is marked __init and then it can't be referenced from memory hotplug code. This patch marks it as __meminit and also marks caller as __ref. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25x86,mm: make pagefault killableKOSAKI Motohiro
When an oom killing occurs, almost all processes are getting stuck at the following two points. 1) __alloc_pages_nodemask 2) __lock_page_or_retry 1) is not very problematic because TIF_MEMDIE leads to an allocation failure and getting out from page allocator. 2) is more problematic. In an OOM situation, zones typically don't have page cache at all and memory starvation might lead to greatly reduced IO performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly, meaning that a fork bomb may create new process quickly rather than the oom-killer killing it. Then, the system may become livelocked. This patch makes the pagefault interruptible by SIGKILL. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: introduce wait_on_page_locked_killable()KOSAKI Motohiro
Commit 2687a356 ("Add lock_page_killable") introduced a killable lock_page(). Similarly, this patch introduces a killable wait_on_page_locked(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: per-node vmstat: show proper vmstatsKOSAKI Motohiro
Commit 2ac390370a ("writeback: add /sys/devices/system/node/<node>/vmstat") added a vmstat entry. But strangely it only shows nr_written and nr_dirtied. # cat /sys/devices/system/node/node20/vmstat nr_written 0 nr_dirtied 0 Of course, that is not adequate. With this patch, the vmstat file shows all VM statistics, as /proc/vmstat does. # cat /sys/devices/system/node/node0/vmstat nr_free_pages 899224 nr_inactive_anon 201 nr_active_anon 17380 nr_inactive_file 31572 nr_active_file 28277 nr_unevictable 0 nr_mlock 0 nr_anon_pages 17321 nr_mapped 8640 nr_file_pages 60107 nr_dirty 33 nr_writeback 0 nr_slab_reclaimable 6850 nr_slab_unreclaimable 7604 nr_page_table_pages 3105 nr_kernel_stack 175 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 nr_writeback_temp 0 nr_isolated_anon 0 nr_isolated_file 0 nr_shmem 260 nr_dirtied 1050 nr_written 938 numa_hit 962872 numa_miss 0 numa_foreign 0 numa_interleave 8617 numa_local 962872 numa_other 0 nr_anon_transparent_hugepages 0 [akpm@linux-foundation.org: no externs in .c files] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Rubin <mrubin@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: nommu: fix a compile warning in do_mmap_pgoff()Namhyung Kim
Because 'ret' is declared as int, not unsigned long, there is no need to cast the error constants into unsigned long. If you compile this code on a 64-bit machine somehow, you'll see the following warning: CC mm/nommu.o mm/nommu.c: In function `do_mmap_pgoff': mm/nommu.c:1411: warning: overflow in implicit constant conversion Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: nommu: fix a potential memory leak in do_mmap_private()Namhyung Kim
If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a memory leak between @region->vm_end and @region->vm_top. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: nommu: check the vma list when unmapping file-mapped vmaNamhyung Kim
Now we have the sorted vma list, use it in do_munmap() to check that we have an exact match. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: nommu: find vma using the sorted vma listNamhyung Kim
Now we have the sorted vma list, use it in the find_vma[_exact]() rather than doing linear search on the rb-tree. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: nommu: don't scan the vma list when deletingNamhyung Kim
Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made it a doubly linked list, we don't need to scan the list when deleting @vma. And the original code didn't update the prev pointer. Fix it too. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
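Why a doubly linked list removes the need to scan can be shown with a minimal sketch. The structs below are heavily simplified stand-ins that keep only the link fields (vm_prev, vm_next, and the mm->mmap list head); unlink_vma() is a hypothetical helper name, not the function touched by the patch.

```c
#include <stddef.h>

/* Simplified VMA: only the list links matter for this sketch. */
struct vma {
	struct vma *vm_prev;
	struct vma *vm_next;
};

/* Simplified mm: just the head of the VMA list. */
struct mm {
	struct vma *mmap;
};

/*
 * Remove @vma from @mm's list in constant time. Both neighbours are
 * reachable from @vma itself, so no walk from the list head is needed.
 * The successor's vm_prev must be fixed up as well, which is the update
 * the commit message says the original code was missing.
 */
static void unlink_vma(struct mm *mm, struct vma *vma)
{
	if (vma->vm_prev)
		vma->vm_prev->vm_next = vma->vm_next;
	else
		mm->mmap = vma->vm_next;   /* vma was the list head */

	if (vma->vm_next)
		vma->vm_next->vm_prev = vma->vm_prev;
}
```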
2011-05-25mm: nommu: sort mm->mmap list properlyNamhyung Kim
When I was reading nommu code, I found that it handles the vma list/tree in an unusual way. IIUC, because there can be more than one identical/overlapping vma in the list/tree, it sorts the tree more strictly and does a linear search on the tree. But this isn't applied to the list (i.e. the list could be constructed in a different order than the tree so that we can't use the list when finding the first vma in that order). Since inserting/sorting a vma in the tree and link is done at the same time, we can easily construct both of them in the same order. And since linear searching on the tree could be more costly than doing it on the list, it can be converted to use the list. Also, after commit 297c5eee3724 ("mm: make the vma list be doubly linked") made the list doubly linked, a couple of places needed to be fixed to construct the list properly. Patch 1/6 is a preparation. It keeps the list sorted the same as the tree and constructs the doubly-linked list properly. Patch 2/6 is a simple optimization for the vma deletion. Patch 3/6 and 4/6 convert tree traversal to list traversal and the rest are simple fixes and cleanups. This patch: @vma added into @mm should be sorted by start addr, end addr and VMA struct addr in that order because we may get identical VMAs in the @mm. However this was true only for the rbtree, not for the list. This patch fixes this by remembering 'rb_prev' during the tree traversal like find_vma_prepare() does and linking the @vma via __vma_link_list(). After this patch, we can iterate the whole VMAs in correct order simply by using @mm->mmap list. [akpm@linux-foundation.org: avoid duplicating __vma_link_list()] Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
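The three-level ordering (start address, then end address, then the address of the VMA struct itself) can be written as a comparator. This is a sketch only: the struct is reduced to the two range fields and vma_cmp() is a hypothetical name, not a function from mm/nommu.c.

```c
#include <stdint.h>

/* Simplified VMA: only the address range is kept. */
struct vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Order by start address, then end address, then by the address of
 * the struct itself. The last key is the tie-breaker that gives two
 * VMAs covering an identical range a stable relative order.
 * Returns <0, 0 or >0 like a qsort()-style comparator.
 */
static int vma_cmp(const struct vma *a, const struct vma *b)
{
	uintptr_t pa = (uintptr_t)a;
	uintptr_t pb = (uintptr_t)b;

	if (a->vm_start != b->vm_start)
		return a->vm_start < b->vm_start ? -1 : 1;
	if (a->vm_end != b->vm_end)
		return a->vm_end < b->vm_end ? -1 : 1;
	return (pa > pb) - (pa < pb);
}
```

If the tree and the list both insert according to the same comparison, walking either one visits the VMAs in the same order, which is what lets later patches in the series replace tree walks with list walks.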
2011-05-25mm: remove unused zone_idx variable from set_migratetype_isolateSergey Senozhatsky
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mmap: avoid merging cloned VMAsShaohua Li
Avoid merging a VMA with another VMA which is cloned from the parent process. The cloned VMA shares the anon_vma lock with the parent process's VMA. If we do the merge, more vmas (even if the new range is only for the current process) use the parent process's anon_vma lock. This introduces scalability issues. find_mergeable_anon_vma() already considers this case. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mmap: avoid unnecessary anon_vma lockShaohua Li
If we only change vma->vm_end, we can avoid taking the anon_vma lock even if 'insert' isn't NULL, which is the case of split_vma. As I understand it, we needed the lock before because rmap must get the 'insert' VMA when we adjust the old VMA's vm_end (the 'insert' VMA was linked to the anon_vma list in __insert_vm_struct before). But now this isn't true any more. The 'insert' VMA is already linked to the anon_vma list in __split_vma() (with anon_vma_clone()) instead of __insert_vm_struct. There is no race in which rmap can't get the required VMAs. So the anon_vma lock is unnecessary, and this saves one lock acquisition in the brk case and improves scalability. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mmap: add alignment for some variablesShaohua Li
Make some variables have correct alignment/section to avoid cache issue. In a workload which heavily does mmap/munmap, the variables will be used frequently. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25arch, mm: filter disallowed nodes from arch specific show_mem functionsDavid Rientjes
Architectures that implement their own show_mem() function did not pass the filter argument to show_free_areas() to appropriately avoid emitting the state of nodes that are disallowed in the current context. This patch now passes the filter argument to show_free_areas() so those nodes are now avoided. This patch also removes the show_free_areas() wrapper around __show_free_areas() and converts existing callers to pass an empty filter. ia64 emits additional information for each node, so skip_free_areas_zone() must be made global to filter disallowed nodes and it is converted to use a nid argument rather than a zone for this use case. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <jejb@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25xtensa/mm: remove WANT_PAGE_VIRTUALSebastian Andrzej Siewior
This is not useful: it provides page->virtual and is used with highmem. xtensa has no support for highmem and those HIGHMEM bits which are found by grep are partly implemented. The interesting functions like kmap() are missing. If someone actually implements the complete HIGHMEM support he could use HASHED_PAGE_VIRTUAL like most others do. Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Chris Zankel <chris@zankel.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25xtensa/mm: remove pgtable.cSebastian Andrzej Siewior
It is not referenced by any Makefile. pte_alloc_one_kernel() and pte_alloc_one() are implemented in arch/xtensa/include/asm/pgalloc.h. Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Chris Zankel <chris@zankel.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25drivers/video/backlight/adp5520_bl.c: check strict_strtoul() return valueLiu Yuan
It should check if strict_strtoul() succeeds. [akpm@linux-foundation.org: don't override strict_strtoul() return value] Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Acked-by: Michael Hennerich <michael.hennerich@analog.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: vmscan: correctly check if reclaimer should schedule during shrink_slabMinchan Kim
It has been reported on some laptops that kswapd is consuming large amounts of CPU and not being scheduled when SLUB is enabled during large amounts of file copying. It is expected that this is due to kswapd missing every cond_resched() point because: shrink_page_list() calls cond_resched() if inactive pages were isolated which in turn may not happen if all_unreclaimable is set in shrink_zones(). If for whatever reason, all_unreclaimable is set on all zones, we can miss calling cond_resched(). balance_pgdat() only calls cond_resched() if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced so it loops again for order-0 and returns that it was reclaiming for order-0 to kswapd(). It can then find that a caller has rewoken kswapd for a high-order and re-enters balance_pgdat() without ever calling cond_resched(). shrink_slab() only calls cond_resched() if we are reclaiming slab pages. If there are a large number of direct reclaimers, the shrinker_rwsem can be contended and prevent kswapd calling cond_resched(). This patch modifies the shrink_slab() case. If the semaphore is contended, the caller will still check cond_resched(). After each successful call into a shrinker, the check for cond_resched() remains in case one shrinker is particularly slow. 
[mgorman@suse.de: preserve call to cond_resched after each call into shrinker] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
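The control flow described for shrink_slab() can be sketched in plain C as below. Everything here is a stand-in: resched_point() merely counts how often a scheduling point is reached, lock_available models whether a trylock on the shrinker semaphore succeeded, and the return values are illustrative rather than taken from the kernel source.

```c
/* Counts scheduling points reached; stand-in for cond_resched(). */
static int resched_points;

static void resched_point(void)
{
	resched_points++;
}

/* Stand-in shrinker that pretends to free a fixed number of objects. */
static unsigned long run_one_shrinker(void)
{
	return 10;
}

/*
 * Sketch of the fixed flow: whether or not the lock was obtained,
 * the function passes through one scheduling point before returning.
 * After each shrinker call there is an additional scheduling point,
 * in case one shrinker is particularly slow.
 */
static unsigned long shrink_slab_sketch(int lock_available, int nr_shrinkers)
{
	unsigned long ret = 0;
	int i;

	if (!lock_available) {
		/* Contended: skip the work but still reach "out". */
		ret = 1;
		goto out;
	}

	for (i = 0; i < nr_shrinkers; i++) {
		ret += run_one_shrinker();
		resched_point();
	}
out:
	resched_point();
	return ret;
}
```

Before the change, the contended path would have returned without ever reaching a scheduling point, which is how kswapd could spin without yielding.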
2011-05-25mm: vmscan: correct use of pgdat_balanced in sleeping_prematurelyJohannes Weiner
There are a few reports of people experiencing hangs when copying large amounts of data with kswapd using a large amount of CPU which appear to be due to recent reclaim changes. SLUB using high orders is the trigger but not the root cause as SLUB has been using high orders for a while. The root cause was bugs introduced into reclaim which are addressed by the following two patches. Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") to allow kswapd to go to sleep when balanced for high orders. Patch 2 notes that it is possible for kswapd to miss every cond_resched() and updates shrink_slab() so it'll at least reach that scheduling point. Chris Wood reports that these two patches in isolation are sufficient to prevent the system hanging. AFAIK, they should also resolve similar hangs experienced by James Bottomley. This patch: Johannes Weiner pointed out that the logic in commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") is backwards. Instead of allowing kswapd to go to sleep when balancing for high order allocations, it keeps kswapd running uselessly. 
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Colin King <colin.king@canonical.com> Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25slub: Fix double bit unlock in debug modeChristoph Lameter
Commit 442b06bcea23 ("slub: Remove node check in slab_free") added a call to deactivate_slab() in the debug case in __slab_alloc(), which unlocks the current slab used for allocation. Going to the label 'unlock_out' then does it again. Also, in the debug case we do not need all the other processing that the 'unlock_out' path does. We always fall back to the slow path in the debug case. So the tid update is useless. Similarly, ALLOC_SLOWPATH would just be incremented for all allocations. Also a pretty useless thing. So simply restore irq flags and return the object. Signed-off-by: Christoph Lameter <cl@linux.com> Reported-and-bisected-by: James Morris <jmorris@namei.org> Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Jens Axboe <jaxboe@fusionio.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-24Merge branch 'for-linus/2640/i2c' of git://git.fluff.org/bjdooks/linuxLinus Torvalds
* 'for-linus/2640/i2c' of git://git.fluff.org/bjdooks/linux: (21 commits) mach-ux500: set proper I2C platform data from MOP500s i2c-nomadik: break out single messsage transmission i2c-nomadik: reset the hw after status check i2c-nomadik: remove the unnecessary delay i2c-nomadik: change the TX and RX threshold i2c-nomadik: add code to retry on timeout failure i2c-nomadik: use pm_runtime API i2c-nomadik: print abort cause only on abort tag i2c-nomadik: correct adapter timeout initialization i2c-nomadik: remove the redundant error message i2c-nomadik: corrrect returned error numbers i2c-nomadik: fix speed enumerator i2c-nomadik: make i2c timeout specific per i2c bus i2c-nomadik: add regulator support i2c: i2c-sh_mobile bus speed platform data V2 i2c: i2c-sh_mobile clock string removal i2c-eg20t: Support new device ML7223 IOH i2c: tegra: Add de-bounce cycles. i2c: tegra: fix repeated start handling i2c: tegra: recover from spurious interrupt storm ...
2011-05-25Merge branches 'for-2639/i2c-eg20t', 'for-2639/i2c-shmobile', 'for-2639/i2c-tegra' and 'for-2639/i2c-nomadik2' into for-linus/2640/i2cBen Dooks
2011-05-25mach-ux500: set proper I2C platform data from MOP500sLinus Walleij
This specifies the new per-platform timeout per I2C bus, switches the I2C buses to fast mode, and increases the FIFO depth to 8 for reads and writes. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: break out single messsage transmissionLinus Walleij
Reduce code size in the message transfer function by factoring out a single-message transfer function. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: reset the hw after status checkVirupax Sadashivpetimath
In case of an I2C timeout, reset the HW only after the HW status has been read; otherwise the status will be lost. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Jonas Aberg <jonas.aberg@stericsson.com> Reviewed-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: remove the unnecessary delayVirupax Sadashivpetimath
The delay in the driver does not seem to be needed, so remove it. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Markus Grape <markus.grape@stericsson.com> Tested-by: Per Persson <per.xb.persson@stericsson.com> Tested-by: Chethan Krishna N <chethan.krishna@stericsson.com> Reviewed-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: change the TX and RX thresholdVirupax Sadashivpetimath
1) Increase RX FIFO threshold so that there is a reduction in the number of interrupts handled to complete a transaction. 2) Fill TX FIFO in the write function. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Jonas Aberg <jonas.aberg@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: add code to retry on timeout failureVirupax Sadashivpetimath
The i2c-nomadik controller has been seen to randomly stop generating interrupts, leading to an I2C timeout. As a workaround, retry the ongoing transfer on failure. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Jonas ABERG <jonas.aberg@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
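The retry workaround described above can be sketched in plain C. This is only an illustration of the general retry-on-failure pattern, not the driver's actual code: the names sketch_xfer_fn, sketch_xfer_with_retry and sketch_flaky_xfer are hypothetical, and the fake transfer merely simulates a controller that fails a given number of times.

```c
#include <errno.h>

/* Hypothetical transfer callback: returns 0 on success, a negative errno on failure. */
typedef int (*sketch_xfer_fn)(void *ctx);

/*
 * Attempt the transfer up to max_tries times.
 * Returns 0 on the first success, otherwise the error from the last attempt
 * (or -EINVAL if max_tries is not positive).
 */
static int sketch_xfer_with_retry(sketch_xfer_fn xfer, void *ctx, int max_tries)
{
    int ret = -EINVAL;

    for (int i = 0; i < max_tries; i++) {
        ret = xfer(ctx);
        if (ret == 0)
            break;      /* transfer completed, stop retrying */
    }
    return ret;
}

/* Fake transfer for demonstration: times out until the failure counter reaches zero. */
static int sketch_flaky_xfer(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -ETIMEDOUT;
    }
    return 0;
}
```

With a failure counter of 2 and max_tries of 3, the third attempt succeeds; with a counter larger than max_tries, the caller sees the timeout error from the last attempt.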
2011-05-25i2c-nomadik: use pm_runtime APIRabin Vincent
Use the pm_runtime API for pins control. Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com> Reviewed-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Reviewed-by: Jonas Aberg <jonas.aberg@stericsson.com> [deleted some surplus runtime PM code] Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: print abort cause only on abort tagVirupax Sadashivpetimath
Modify the code to: 1) Print the cause of the I2C failure only if the status is set to ABORT. 2) Print the slave address on send/receive failure, which helps identify which slave failed. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Jonas Aberg <jonas.aberg@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: correct adapter timeout initializationVirupax Sadashivpetimath
Fix the adapter timeout initialization: the value was given in milliseconds, but it must be expressed in jiffies. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
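The unit mismatch behind this fix can be illustrated with a small userspace sketch. In the kernel the conversion is done by the msecs_to_jiffies() helper; the function below is a simplified stand-in, and SKETCH_HZ and sketch_msecs_to_jiffies are hypothetical names with an assumed tick rate of 100 per second.

```c
/* Assumed tick rate for illustration only; the kernel's HZ is a build-time constant. */
#define SKETCH_HZ 100UL

/*
 * Convert milliseconds to timer ticks, rounding up so that a non-zero
 * timeout never collapses to zero ticks.
 */
static unsigned long sketch_msecs_to_jiffies(unsigned long ms)
{
    return (ms * SKETCH_HZ + 999UL) / 1000UL;
}
```

At this assumed tick rate a 200 ms timeout is 20 ticks. Storing the raw value 200 in a field that is interpreted as ticks would instead give a 2 second timeout, which is the kind of error the commit corrects.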
2011-05-25i2c-nomadik: remove the redundant error messagesrinidhi kasagar
The abort cause string itself is an error, so remove the redundant explicit error message. Signed-off-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Reviewed-by: Jonas Aberg <jonas.aberg@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: corrrect returned error numbersVirupax Sadashivpetimath
The code was returning bad error numbers or just -1 in some cases. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: fix speed enumeratorLinus Walleij
The I2C speed enumerators in the i2c-nomadik header file were in the wrong order. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: make i2c timeout specific per i2c busVirupax Sadashivpetimath
Add option to have different i2c timeout delay for different i2c buses specified in platform data. Default to the old value unless specified. Signed-off-by: Virupax Sadashivpetimath <virupax.sadashivpetimath@stericsson.com> Reviewed-by: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-25i2c-nomadik: add regulator supportJonas Aberg
This on-chip I2C controller needs to fetch the regulator representing its voltage domain so that it won't be switched off. Signed-off-by: Jonas Aberg <jonas.aberg@stericsson.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Dooks <ben-linux@fluff.org>
2011-05-24Merge branch 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/stagingLinus Torvalds
* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging: i2c-parport: Various cleanups i2c-i801: Don't depend on other kernel driver config options i2c-i801: Check for vendor Fujitsu before probing for apanel i2c-i801: Don't probe for slaves on IDF channels i2c-i801: SMBus patch for Intel Panther Point DeviceIDs i2c/writing-clients: Fix foo_driver.id_table
2011-05-24Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6Linus Torvalds
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: jbd: Fix comment to match the code in journal_start() jbd/jbd2: remove obsolete summarise_journal_usage. jbd: Fix forever sleeping process in do_get_write_access() ext2: fix error msg when mounting fs with too-large blocksize jbd: fix fsync() tid wraparound bug ext3: Fix fs corruption when make_indexed_dir() fails ext3: Fix lock inversion in ext3_symlink()
2011-05-24Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlmLinus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: make plock operation killable dlm: remove shared message stub for recovery dlm: delayed reply message warning dlm: Remove superfluous call to recalc_sigpending()
2011-05-24Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits) TOMOYO: Fix wrong domainname validation. SELINUX: add /sys/fs/selinux mount point to put selinuxfs CRED: Fix load_flat_shared_library() to initialise bprm correctly SELinux: introduce path_has_perm flex_array: allow 0 length elements flex_arrays: allow zero length flex arrays flex_array: flex_array_prealloc takes a number of elements, not an end SELinux: pass last path component in may_create SELinux: put name based create rules in a hashtable SELinux: generic hashtab entry counter SELinux: calculate and print hashtab stats with a generic function SELinux: skip filename trans rules if ttype does not match parent dir SELinux: rename filename_compute_type argument to *type instead of *con SELinux: fix comment to state filename_compute_type takes an objname not a qstr SMACK: smack_file_lock can use the struct path LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE SELINUX: Make selinux cache VFS RCU walks safe SECURITY: Move exec_permission RCU checks into security modules SELinux: security_read_policy should take a size_t not ssize_t ...