summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2009-06-16page allocator: do not disable interrupts in free_page_mlock()Mel Gorman
free_page_mlock() tests and clears PG_mlocked using locked versions of the bit operations. If set, it disables interrupts to update counters and this happens on every page free even though interrupts are disabled very shortly afterwards a second time. This is wasteful. This patch splits what free_page_mlock() does. The bit check is still made. However, the update of counters is delayed until the interrupts are disabled and the non-lock version for clearing the bit is used. One potential weirdness with this split is that the counters do not get updated if the bad_page() check is triggered but a system showing bad pages is getting screwed already. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not call get_pageblock_migratetype() more than necessaryMel Gorman
get_pageblock_migratetype() is potentially called twice for every page free. Once, when being freed to the pcp lists and once when being freed back to buddy. When freeing from the pcp lists, it is known what the pageblock type was at the time of free so use it rather than rechecking. In low memory situations under memory pressure, this might skew anti-fragmentation slightly but the interference is minimal and decisions that are fragmenting memory are being made anyway. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: inline __rmqueue_fallback()Mel Gorman
__rmqueue_fallback() is in the slow path but has only one call site. Because there is only one call-site, this function can then be inlined without causing text bloat. On an x86-based config, it made no difference as the savings were padded out by NOP instructions. Milage varies but text will either decrease in size or remain static. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: inline buffered_rmqueue()Mel Gorman
buffered_rmqueue() is in the fast path so inline it. Because it only has one call site, this function can then be inlined without causing text bloat. On an x86-based config, it made no difference as the savings were padded out by NOP instructions. Milage varies but text will either decrease in size or remain static. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: inline __rmqueue_smallest()Mel Gorman
Inline __rmqueue_smallest by altering flow very slightly so that there is only one call site. Because there is only one call-site, this function can then be inlined without causing text bloat. On an x86-based config, this patch reduces text by 16 bytes. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: remove a branch by assuming __GFP_HIGH == ALLOC_HIGHMel Gorman
Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these flags are equal to each other, we can eliminate a branch. [akpm@linux-foundation.org: Suggested the hack] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: calculate the alloc_flags for allocation only oncePeter Zijlstra
Factor out the mapping between GFP and alloc_flags only once. Once factored out, it only needs to be calculated once but some care must be taken. [neilb@suse.de says] As the test: - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) - && !in_interrupt()) { - if (!(gfp_mask & __GFP_NOMEMALLOC)) { has been replaced with a slightly weaker one: + if (alloc_flags & ALLOC_NO_WATERMARKS) { Without care, this would allow recursion into the allocator via direct reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set but TF_MEMDIE callers are now allowed to directly reclaim where they would have been prevented in the past. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: calculate the migratetype for allocation only onceMel Gorman
GFP mask is converted into a migratetype when deciding which pagelist to take a page from. However, it is happening multiple times per allocation, at least once per zone traversed. Calculate it once. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: calculate the preferred zone for allocation only onceMel Gorman
get_page_from_freelist() can be called multiple times for an allocation. Part of this calculates the preferred_zone which is the first usable zone in the zonelist but the zone depends on the GFP flags specified at the beginning of the allocation call. This patch calculates preferred_zone once. It's safe to do this because if preferred_zone is NULL at the start of the call, no amount of direct reclaim or other actions will change the fact the allocation will fail. [akpm@linux-foundation.org: remove (void) casts] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: move check for disabled anti-fragmentation out of fastpathMel Gorman
On low-memory systems, anti-fragmentation gets disabled as there is nothing it can do and it would just incur overhead shuffling pages between lists constantly. Currently the check is made in the free page fast path for every page. This patch moves it to a slow path. On machines with low memory, there will be small amount of additional overhead as pages get shuffled between lists but it should quickly settle. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: break up the allocator entry point into fast and slow pathsMel Gorman
The core of the page allocator is one giant function which allocates memory on the stack and makes calculations that may not be needed for every allocation. This patch breaks up the allocator path into fast and slow paths for clarity. Note the slow paths are still inlined but the entry is marked unlikely. If they were not inlined, it actally increases text size to generate the as there is only one call site. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: check only once if the zonelist is suitable for the allocationMel Gorman
It is possible with __GFP_THISNODE that no zones are suitable. This patch makes sure the check is only made once. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not check NUMA node ID when the caller knows the node is ↵Mel Gorman
valid Callers of alloc_pages_node() can optionally specify -1 as a node to mean "allocate from the current node". However, a number of the callers in fast paths know for a fact their node is valid. To avoid a comparison and branch, this patch adds alloc_pages_exact_node() that only checks the nid with VM_BUG_ON(). Callers that know their node is valid are then converted. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: do not sanity check order in the fast pathMel Gorman
No user of the allocator API should be passing in an order >= MAX_ORDER but we check for it on each and every allocation. Delete this check and make it a VM_BUG_ON check further down the call path. [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask()Mel Gorman
The start of a large patch series to clean up and optimise the page allocator. The performance improvements are in a wide range depending on the exact machine but the results I've seen so fair are approximately; kernbench: 0 to 0.12% (elapsed time) 0.49% to 3.20% (sys time) aim9: -4% to 30% (for page_test and brk_test) tbench: -1% to 4% hackbench: -2.5% to 3.45% (mostly within the noise though) netperf-udp -1.34% to 4.06% (varies between machines a bit) netperf-tcp -0.44% to 5.22% (varies between machines a bit) I haven't sysbench figures at hand, but previously they were within the -0.5% to 2% range. On netperf, the client and server were bound to opposite number CPUs to maximise the problems with cache line bouncing of the struct pages so I expect different people to report different results for netperf depending on their exact machine and how they ran the test (different machines, same cpus client/server, shared cache but two threads client/server, different socket client/server etc). I also measured the vmlinux sizes for a single x86-based config with CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config is based on the Debian Lenny kernel config so I expect it to be reasonably typical. This patch: __alloc_pages_internal is the core page allocator function but essentially it is an alias of __alloc_pages_nodemask. Naming a publicly available and exported function "internal" is also a big ugly. This patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old nodemask function. Warning - This patch renames an exported symbol. No kernel driver is affected by external drivers calling __alloc_pages_internal() should change the call to __alloc_pages_nodemask() without any alteration of parameters. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: alloc_large_system_hash check orderHugh Dickins
On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(), to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash() had better make its own check on the order. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: David Miller <davem@davemloft.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16cpuset,mm: update tasks' mems_allowed in timeMiao Xie
Fix allocating page cache/slab object on the unallowed node when memory spread is set by updating tasks' mems_allowed after its cpuset's mems is changed. In order to update tasks' mems_allowed in time, we must modify the code of memory policy. Because the memory policy is applied in the process's context originally. After applying this patch, one task directly manipulates anothers mems_allowed, and we use alloc_lock in the task_struct to protect mems_allowed and memory policy of the task. But in the fast path, we didn't use lock to protect them, because adding a lock may lead to performance regression. But if we don't add a lock,the task might see no nodes when changing cpuset's mems_allowed to some non-overlapping set. In order to avoid it, we set all new allowed nodes, then clear newly disallowed ones. [lee.schermerhorn@hp.com: The rework of mpol_new() to extract the adjusting of the node mask to apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind() with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local allocation. Fix this by adding the check for MPOL_PREFERRED and empty node mask to mpol_new_mpolicy(). Remove the now unneeded 'nodes = NULL' from mpol_new(). Note that mpol_new_mempolicy() is always called with a non-NULL 'nodes' parameter now that it has been removed from mpol_new(). Therefore, we don't need to test nodes for NULL before testing it for 'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to verify this assumption.] [lee.schermerhorn@hp.com: I don't think the function name 'mpol_new_mempolicy' is descriptive enough to differentiate it from mpol_new(). This function applies cpuset set context, usually constraining nodes to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag is set, it also translates the nodes. So I settled on 'mpol_set_nodemask()', because the comment block for mpol_new() mentions that we need to call this function to "set nodes". Some additional minor line length, whitespace and typo cleanup.] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Paul Menage <menage@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm/page-writeback.c: dirty limit type should be unsigned longH Hartley Sweeten
get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit() with variable pbdi_dirty as one of the arguments. This variable is an unsigned long * but both functions expect it to be a long *. This causes the following sparse warnings: warning: incorrect type in argument 3 (different signedness) expected long *pbdi_dirty got unsigned long *pbdi_dirty warning: incorrect type in argument 2 (different signedness) expected long *pdirty got unsigned long *pbdi_dirty Fix the warnings by changing the long * to unsigned long * in both functions. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNCKOSAKI Motohiro
Commit 33c120ed2843090e2bd316de1588b8bf8b96cbde ("more aggressively use lumpy reclaim") increased how aggressive lumpy reclaim was by isolating both active and inactive pages for asynchronous lumpy reclaim on costly-high-order pages and for cheap-high-order when memory pressure is high. However, if the system is under heavy pressure and there are dirty pages, asynchronous IO may not be sufficient to reclaim a suitable page in time. This patch causes the caller to enter synchronous lumpy reclaim for costly-high-order pages and for cheap-high-order pages when under memory pressure. Minchan.kim@gmail.com said: Andy added synchronous lumpy reclaim with c661b078fd62abe06fd11fab4ac5e4eeafe26b6d. At that time, lumpy reclaim is not agressive. His intension is just for high-order users.(above PAGE_ALLOC_COSTLY_ORDER). After some time, Rik added aggressive lumpy reclaim with 33c120ed2843090e2bd316de1588b8bf8b96cbde. His intention was to do lumpy reclaim when high-order users and trouble getting a small set of contiguous pages. So we also have to add synchronous pageout for small set of contiguous pages. Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <Minchan.kim@gmail.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: clean up get_user_pages_fast() documentationNick Piggin
Move more documentation for get_user_pages_fast into the new kerneldoc comment. Add some comments for get_user_pages as well. Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't there initially because it was once a static inline function. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Andy Grover <andy.grover@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: enforce full sync mmap readahead sizeWu Fengguang
Now that we do readahead for sequential mmap reads, here is a simple evaluation of the impacts, and one further optimization. It's an NFS-root debian desktop system, readahead size = 60 pages. The numbers are grabbed after a fresh boot into console. approach pgmajfault RA miss ratio mmap IO count avg IO size(pages) A 383 31.6% 383 11 B 225 32.4% 390 11 C 224 32.6% 307 13 case A: mmap sync/async readahead disabled case B: mmap sync/async readahead enabled, with enforced full async readahead size case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size or: A = vanilla 2.6.30-rc1 B = A plus mmap readahead C = B plus this patch The numbers show that - there are good possibilities for random mmap reads to trigger readahead - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead - case C can further reduce IO count by 1/4 - readahead miss ratios are not quite affected The theory is - readahead is _good_ for clustered random reads, and can perform _better_ than readaround because they could be _async_. - async readahead size is guaranteed to be larger than readaround size, and they are _async_, hence will mostly behave better However for B - sync readahead size could be smaller than readaround size, hence may make things worse by produce more smaller IOs which will be fixed by this patch. Final conclusion: - mmap readahead reduced major faults by 1/3 and no obvious overheads; - mmap io can be further reduced by 1/4 with this patch. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: remove redundant test in shrink_readahead_size_eio()Wu Fengguang
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: introduce context readahead algorithmWu Fengguang
Introduce page cache context based readahead algorithm. This is to better support concurrent read streams in general. RATIONALE --------- The current readahead algorithm detects interleaved reads in a _passive_ way. Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,... By checking for (offset == prev_offset + 1), it will discover the sequentialness between 3,4 and between 1004,1005, and start doing sequential readahead for the individual streams since page 4 and page 1005. The context readahead algorithm guarantees to discover the sequentialness no matter how the streams are interleaved. For the above example, it will start sequential readahead since page 2 and 1002. The trick is to poke for page @offset-1 in the page cache when it has no other clues on the sequentialness of request @offset: if the current requenst belongs to a sequential stream, that stream must have accessed page @offset-1 recently, and the page will still be cached now. So if page @offset-1 is there, we can take request @offset as a sequential access. BENEFICIARIES ------------- - strictly interleaved reads i.e. 1,1001,2,1002,3,1003,... the current readahead will take them as silly random reads; the context readahead will take them as two sequential streams. - cooperative IO processes i.e. NFS and SCST They create a thread pool, farming off (sequential) IO requests to different threads which will be performing interleaved IO. It was not easy(or possible) to reliably tell from file->f_ra all those cooperative processes working on the same sequential stream, since they will have different file->f_ra instances. And NFSD's file->f_ra is particularly unusable, since their file objects are dynamically created for each request. The nfsd does have code trying to restore the f_ra bits, but not satisfactory. The new scheme is to detect the sequential pattern via looking up the page cache, which provides one single and consistent view of the pages recently accessed. That makes sequential detection for cooperative processes possible. USER REPORT ----------- Vladislav recommends the addition of context readahead as a result of his SCST benchmarks. It leads to 6%~40% performance gains in various cases and achieves equal performance in others. http://lkml.org/lkml/2009/3/19/239 OVERHEADS --------- In theory, it introduces one extra page cache lookup per random read. However the below benchmark shows context readahead to be slightly faster, wondering.. Randomly reading 200MB amount of data on a sparse file, repeat 20 times for each block size. The average throughputs are: original ra context ra gain 4K random reads: 65.561MB/s 65.648MB/s +0.1% 16K random reads: 124.767MB/s 124.951MB/s +0.1% 64K random reads: 162.123MB/s 162.278MB/s +0.1% Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Tested-by: Vladislav Bolkhovitin <vst@vlnb.net> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: move the random read case to bottomWu Fengguang
Split all readahead cases, and move the random one to bottom. No behavior changes. This is to prepare for the introduction of context readahead, and make it easy for inserting accounting/tracing points for each case. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: record mmap read-around states in file_ra_stateWu Fengguang
Mmap read-around now shares the same code style and data structure with readahead code. This also removes do_page_cache_readahead(). Its last user, mmap read-around, has been changed to call ra_submit(). The no-readahead-if-congested logic is dumped by the way. Users will be pretty sensitive about the slow loading of executables. So it's unfavorable to disabled mmap read-around on a congested queue. [akpm@linux-foundation.org: coding-style fixes] Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: enforce full readahead size on async mmap readaheadWu Fengguang
We need this in one particular case and two more general ones. Now we do async readahead for sequential mmap reads, and do it with the help of PG_readahead. For normal reads, PG_readahead is the sufficient condition to do a sequential readahead. But unfortunately, for mmap reads, there is a tiny nuisance: [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4 [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17 [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32 [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~ An unfavorably small readahead. The original dumb read-around size could be more efficient. That happened because ld-linux.so does a read(832) in L1 before mmap(), which triggers a 4-page readahead, with the second page tagged PG_readahead. L0: open("/lib/libc.so.6", O_RDONLY) = 3 L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0 L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000 L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0 L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000 L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000 L7: close(3) = 0 In general, the PG_readahead flag will also be hit in cases - sequential reads - clustered random reads A full readahead size is desirable in both cases. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: sequential mmap readaheadWu Fengguang
Auto-detect sequential mmap reads and do readahead for them. The sequential mmap readahead will be triggered when - sync readahead: it's a major fault and (prev_offset == offset-1); - async readahead: minor fault on PG_readahead page with valid readahead state. The benefits of doing readahead instead of read-around: - less I/O wait thanks to async readahead - double real I/O size and no more cache hits The single stream case is improved a little. For 100,000 sequential mmap reads: user system cpu total (1-1) plain -mm, 128KB readaround: 3.224 2.554 48.40% 11.838 (1-2) plain -mm, 256KB readaround: 3.170 2.392 46.20% 11.976 (2) patched -mm, 128KB readahead: 3.117 2.448 47.33% 11.607 The patched (2) has smallest total time, since it has no cache hit overheads and less I/O block time(thanks to async readahead). Here the I/O size makes no much difference, since there's only one single stream. Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB, since the half of the read-around pages will be readahead cache hits. This is going to make _real_ differences for _concurrent_ IO streams. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: clean up and simplify the code for filemap page fault readaheadLinus Torvalds
This shouldn't really change behavior all that much, but the single rather complex function with read-ahead inside a loop etc is broken up into more manageable pieces. The behaviour is also less subtle, with the read-ahead being done up-front rather than inside some subtle loop and thus avoiding the now unnecessary extra state variables (ie "did_readaround" is gone). Fengguang: the code split in fact fixed a bug reported by Pavel Levshin: the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in which case the original code will directly jump to no_cached_page reading. Cc: Pavel Levshin <lpk@581.spb.su> Cc: <wli@movementarian.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: remove sync/async readahead call dependencyWu Fengguang
The readahead call scheme is error-prone in that it expects the call sites to check for async readahead after doing a sync one. I.e. if (!page) page_cache_sync_readahead(); page = find_get_page(); if (page && PageReadahead(page)) page_cache_async_readahead(); This is because PG_readahead could be set by a sync readahead for the _current_ newly faulted in page, and the readahead code simply expects one more callback on the same page to start the async readahead. If the caller fails to do so, it will miss the PG_readahead bits and never able to start an async readahead. Eliminate this insane constraint by piggy-backing the async part into the current readahead window. Now if an async readahead should be started immediately after a sync one, the readahead logic itself will do it. So the following code becomes valid: (the 'else' in particular) if (!page) page_cache_sync_readahead(); else if (PageReadahead(page)) page_cache_async_readahead(); Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: increase interleaved readahead sizeWu Fengguang
Make sure interleaved readahead size is larger than request size. This also makes the readahead window grow up more quickly. Reported-by: Xu Chenfeng <xcf@ustc.edu.cn> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: remove one unnecessary radix tree lookupWu Fengguang
(hit_readahead_marker != 0) means the page at @offset is present, so we can search for non-present page starting from @offset+1. Reported-by: Xu Chenfeng <xcf@ustc.edu.cn> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: apply max_sane_readahead() limit in ondemand_readahead()Wu Fengguang
Just in case someone aggressively sets a huge readahead size. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16readahead: move max_sane_readahead() calls into force_page_cache_readahead()Wu Fengguang
Impact: code simplification. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16mm: consolidate init_mm definitionAlexey Dobriyan
* create mm/init-mm.c, move init_mm there * remove INIT_MM, initialize init_mm with C99 initializer * unexport init_mm on all arches: init_mm is already unexported on x86. One strange place is some OMAP driver (drivers/video/omap/) which won't build modular, but it's already wants get_vm_area() export. Somebody should look there. [akpm@linux-foundation.org: add missing #includes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Frysinger <vapier.adi@gmail.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-12Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6Linus Torvalds
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (30 commits) [S390] wire up sys_perf_counter_open [S390] wire up sys_rt_tgsigqueueinfo [S390] ftrace: add system call tracer support [S390] ftrace: add function graph tracer support [S390] ftrace: add function trace mcount test support [S390] ftrace: add dynamic ftrace support [S390] kprobes: use probe_kernel_write [S390] maccess: arch specific probe_kernel_write() implementation [S390] maccess: add weak attribute to probe_kernel_write [S390] profile_tick called twice [S390] dasd: forward internal errors to dasd_sleep_on caller [S390] dasd: sync after async probe [S390] dasd: check_characteristics cleanup [S390] dasd: no High Performance FICON in 31-bit mode [S390] dcssblk: revert devt conversion [S390] qdio: fix access beyond ARRAY_SIZE of irq_ptr->{in,out}put_qs [S390] vmalloc: add vmalloc kernel parameter support [S390] uaccess: use might_fault() instead of might_sleep() [S390] 3270: lock dependency fixes [S390] 3270: do not register with tty_register_device ...
2009-06-12PM/Suspend: Do not shrink memory before suspendRafael J. Wysocki
Remove the shrinking of memory from the suspend-to-RAM code, where it is not really necessary. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@tuxonice.net> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
2009-06-12slab: setup cpu caches later on when interrupts are enabledPekka Enberg
Fixes the following boot-time warning: [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x56/0x1bc() [ 0.000000] Hardware name: [ 0.000000] Modules linked in: [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #492 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff8149e021>] ? _spin_unlock+0x4f/0x5c [ 0.000000] [<ffffffff8108f11b>] ? smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9 [ 0.000000] [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16 [ 0.000000] [<ffffffff8108f11b>] smp_call_function_many+0x56/0x1bc [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff8108f2be>] smp_call_function+0x3d/0x68 [ 0.000000] [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54 [ 0.000000] [<ffffffff81066fd8>] on_each_cpu+0x31/0x7c [ 0.000000] [<ffffffff810f64f5>] do_tune_cpucache+0x119/0x454 [ 0.000000] [<ffffffff81087080>] ? lockdep_init_map+0x94/0x10b [ 0.000000] [<ffffffff818133b0>] ? kmem_cache_init+0x421/0x593 [ 0.000000] [<ffffffff810f69cf>] enable_cpucache+0x68/0xad [ 0.000000] [<ffffffff818133c3>] kmem_cache_init+0x434/0x593 [ 0.000000] [<ffffffff8180987c>] ? mem_init+0x156/0x161 [ 0.000000] [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9 [ 0.000000] [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae [ 0.000000] [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8 [ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]--- Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12slab,slub: don't enable interrupts during early bootPekka Enberg
As explained by Benjamin Herrenschmidt: Oh and btw, your patch alone doesn't fix powerpc, because it's missing a whole bunch of GFP_KERNEL's in the arch code... You would have to grep the entire kernel for things that check slab_is_available() and even then you'll be missing some. For example, slab_is_available() didn't always exist, and so in the early days on powerpc, we used a mem_init_done global that is set form mem_init() (not perfect but works in practice). And we still have code using that to do the test. Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators in early boot code to avoid enabling interrupts. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12slab: fix gfp flag in setup_cpu_cache()Pekka Enberg
Fixes the following warning during bootup when compiling with CONFIG_SLAB: [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at kernel/lockdep.c:2282 lockdep_trace_alloc+0x91/0xb9() [ 0.000000] Hardware name: [ 0.000000] Modules linked in: [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #491 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff81087d84>] ? lockdep_trace_alloc+0x91/0xb9 [ 0.000000] [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9 [ 0.000000] [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16 [ 0.000000] [<ffffffff81087d84>] lockdep_trace_alloc+0x91/0xb9 [ 0.000000] [<ffffffff810f5b03>] kmem_cache_alloc_node_notrace+0x26/0xdf [ 0.000000] [<ffffffff81487f4e>] ? setup_cpu_cache+0x7e/0x210 [ 0.000000] [<ffffffff81487fe3>] setup_cpu_cache+0x113/0x210 [ 0.000000] [<ffffffff810f73ff>] kmem_cache_create+0x409/0x486 [ 0.000000] [<ffffffff818131c1>] kmem_cache_init+0x232/0x593 [ 0.000000] [<ffffffff8180987c>] ? mem_init+0x156/0x161 [ 0.000000] [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9 [ 0.000000] [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae [ 0.000000] [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8 [ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]--- Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12[S390] maccess: add weak attribute to probe_kernel_writeHeiko Carstens
probe_kernel_write() gets used to write to the kernel address space. E.g. to patch the kernel (kgdb, ftrace, kprobes...). Some architectures however enable write protection for the kernel text section, so that writes to this region would fault. This patch allows to specify an architecture specific version of probe_kernel_write() which allows to handle and bypass write protection of the text segment. That way it is still possible to catch random writes to kernel text and explicitly allow writes via this interface. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-06-12memcg: fix page_cgroup fatal error in FLATMEMKAMEZAWA Hiroyuki
Now, SLAB is configured in very early stage and it can be used in init routine now. But replacing alloc_bootmem() in FLAT/DISCONTIGMEM's page_cgroup() initialization breaks the allocation, now. (Works well in SPARSEMEM case...it supports MEMORY_HOTPLUG and size of page_cgroup is in reasonable size (< 1 << MAX_ORDER.) This patch revive FLATMEM+memory cgroup by using alloc_bootmem. In future, We stop to support FLATMEM (if no users) or rewrite codes for flatmem completely.But this will adds more messy codes and overheads. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11Merge branch 'for-linus' of git://linux-arm.org/linux-2.6Linus Torvalds
* 'for-linus' of git://linux-arm.org/linux-2.6: kmemleak: Add the corresponding MAINTAINERS entry kmemleak: Simple testing module for kmemleak kmemleak: Enable the building of the memory leak detector kmemleak: Remove some of the kmemleak false positives kmemleak: Add modules support kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash kmemleak: Add the vmalloc memory allocation/freeing hooks kmemleak: Add the slub memory allocation/freeing hooks kmemleak: Add the slob memory allocation/freeing hooks kmemleak: Add the slab memory allocation/freeing hooks kmemleak: Add documentation on the memory leak detector kmemleak: Add the base support Manual conflict resolution (with the slab/earlyboot changes) in: drivers/char/vt.c init/main.c mm/slab.c
2009-06-11Merge branch 'perfcounters-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits) perf_counter: Turn off by default perf_counter: Add counter->id to the throttle event perf_counter: Better align code perf_counter: Rename L2 to LL cache perf_counter: Standardize event names perf_counter: Rename enums perf_counter tools: Clean up u64 usage perf_counter: Rename perf_counter_limit sysctl perf_counter: More paranoia settings perf_counter: powerpc: Implement generalized cache events for POWER processors perf_counters: powerpc: Add support for POWER7 processors perf_counter: Accurate period data perf_counter: Introduce struct for sample data perf_counter tools: Normalize data using per sample period data perf_counter: Annotate exit ctx recursion perf_counter tools: Propagate signals properly perf_counter tools: Small frequency related fixes perf_counter: More aggressive frequency adjustment perf_counter/x86: Fix the model number of Intel Core2 processors perf_counter, x86: Correct some event and umask values for Intel processors ...
2009-06-11Merge branch 'topic/slab/earlyboot' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: vgacon: use slab allocator instead of the bootmem allocator irq: use kcalloc() instead of the bootmem allocator sched: use slab in cpupri_init() sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var() memcg: don't use bootmem allocator in setup code irq/cpumask: make memoryless node zero happy x86: remove some alloc_bootmem_cpumask_var calling vt: use kzalloc() instead of the bootmem allocator sched: use kzalloc() instead of the bootmem allocator init: introduce mm_init() vmalloc: use kzalloc() instead of alloc_bootmem() slab: setup allocators earlier in the boot sequence bootmem: fix slab fallback on numa bootmem: use slab if bootmem is no longer available
2009-06-11Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits) block: add request clone interface (v2) floppy: fix hibernation ramdisk: remove long-deprecated "ramdisk=" boot-time parameter fs/bio.c: add missing __user annotation block: prevent possible io_context->refcount overflow Add serial number support for virtio_blk, V4a block: Add missing bounce_pfn stacking and fix comments Revert "block: Fix bounce limit setting in DM" cciss: decode unit attention in SCSI error handling code cciss: Remove no longer needed sendcmd reject processing code cciss: change SCSI error handling routines to work with interrupts enabled. cciss: separate error processing and command retrying code in sendcmd_withirq_core() cciss: factor out fix target status processing code from sendcmd functions cciss: simplify interface of sendcmd() and sendcmd_withirq() cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code cciss: Use schedule_timeout_uninterruptible in SCSI error handling code block: needs to set the residual length of a bidi request Revert "block: implement blkdev_readpages" block: Fix bounce limit setting in DM Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt ... Manually fix conflicts with tracing updates in: block/blk-sysfs.c drivers/ide/ide-atapi.c drivers/ide/ide-cd.c drivers/ide/ide-floppy.c drivers/ide/ide-tape.c include/trace/events/block.h kernel/trace/blktrace.c
2009-06-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits) nommu: Provide mmap_min_addr definition. TOMOYO: Add description of lists and structures. TOMOYO: Remove unused field. integrity: ima audit dentry_open failure TOMOYO: Remove unused parameter. security: use mmap_min_addr indepedently of security models TOMOYO: Simplify policy reader. TOMOYO: Remove redundant markers. SELinux: define audit permissions for audit tree netlink messages TOMOYO: Remove unused mutex. tomoyo: avoid get+put of task_struct smack: Remove redundant initialization. integrity: nfsd imbalance bug fix rootplug: Remove redundant initialization. smack: do not beyond ARRAY_SIZE of data integrity: move ima_counts_get integrity: path_check update IMA: Add __init notation to ima functions IMA: Minimal IMA policy and boot param for TCB IMA policy selinux: remove obsolete read buffer limit from sel_read_bool ...
2009-06-11memcg: don't use bootmem allocator in setup codeYinghai Lu
The bootmem allocator is no longer available for page_cgroup_init() because we set up the kernel slab allocator much earlier now. Cc: Ingo Molnar <mingo@elte.hu> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11vmalloc: use kzalloc() instead of alloc_bootmem()Pekka Enberg
We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of the bootmem allocator when initializing vmalloc data structures. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11slab: setup allocators earlier in the boot sequencePekka Enberg
This patch makes kmalloc() available earlier in the boot sequence so we can get rid of some bootmem allocations. The bulk of the changes are due to kmem_cache_init() being called with interrupts disabled which requires some changes to allocator boostrap code. Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps before we call mem_init() during boot as reported by Ingo Molnar: We have a hard crash in the WP-protect code: [ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000 [ 0.000000] EDI 00000188 ESI 00000ac7 EBP c17eaf9c ESP c17eaf8c [ 0.000000] EBX 000014e0 EDX 0000000e ECX 01856067 EAX 00000001 [ 0.000000] err 00000003 EIP c10135b1 CS 00000060 flg 00010002 [ 0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc [ 0.000000] 00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0 [ 0.000000] c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203 [ 0.000000] Call Trace: [ 0.000000] [<c15357c2>] ? printk+0x14/0x16 [ 0.000000] [<c10135b1>] ? do_test_wp_bit+0x19/0x23 [ 0.000000] [<c17fd410>] ? test_wp_bit+0x26/0x64 [ 0.000000] [<c17fd7e5>] ? mem_init+0x1ba/0x1d8 [ 0.000000] [<c17f1668>] ? start_kernel+0x164/0x2f7 [ 0.000000] [<c17f1322>] ? unknown_bootoption+0x0/0x19c [ 0.000000] [<c17f106a>] ? __init_begin+0x6a/0x6f Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11bootmem: fix slab fallback on numaPekka Enberg
If the user requested bootmem allocation on a specific node, we should use kzalloc_node() for the fallback allocation. Cc: Ingo Molnar <mingo@elte.hu> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>