Age | Commit message (Collapse) | Author |
|
shrink_zone() can deactivate active anon pages even if we don't have a
swap device. Many embedded products don't have a swap device. So the
deactivation of anon pages is unnecessary.
This patch prevents unnecessary deactivation of anon lru pages. But, it
don't prevent aging of anon pages to swap out.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
migrate_prep() is fairly expensive (72us on 16-core barcelona 1.9GHz).
Commit 3140a2273009c01c27d316f35ab76a37e105fdd8 improved move_pages()
throughput by breaking it into chunks, but it also made migrate_prep() be
called once per chunk (every 128pages or so) instead of once per
move_pages().
This patch reverts to calling migrate_prep() only once per chunk as we did
before 2.6.29. It is also a followup to commit
0aedadf91a70a11c4a3e7c7d99b21e5528af8d5d ("mm: move migrate_prep out from
under mmap_sem").
This improves migration throughput on the above machine from 600MB/s to
750MB/s.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, the following scenario appears to be possible in theory:
* Tasks are frozen for hibernation or suspend.
* Free pages are almost exhausted.
* Certain piece of code in the suspend code path attempts to allocate
some memory using GFP_KERNEL and allocation order less than or
equal to PAGE_ALLOC_COSTLY_ORDER.
* __alloc_pages_internal() cannot find a free page so it invokes the
OOM killer.
* The OOM killer attempts to kill a task, but the task is frozen, so
it doesn't die immediately.
* __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
to find a free page and invokes the OOM killer.
* No progress can be made.
Although it is now hard to trigger during hibernation due to the memory
shrinking carried out by the hibernation code, it is theoretically
possible to trigger during suspend after the memory shrinking has been
removed from that code path. Moreover, since memory allocations are
going to be used for the hibernation memory shrinking, it will be even
more likely to happen during hibernation.
To prevent it from happening, introduce the oom_killer_disabled switch
that will cause __alloc_pages_internal() to fail in the situations in
which the OOM killer would have been called and make the freezer set
this switch after tasks have been successfully frozen.
[akpm@linux-foundation.org: be nicer to the namespace]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The posix_madvise() function succeeds (and does nothing) when called with
parameters (NULL, 0, -1); according to LSB tests, it should fail with
EINVAL because -1 is not a valid flag.
When called with a valid address and size, it correctly fails.
So perform an initial check for valid flags first.
Reported-by: Jiri Dluhos <jdluhos@novell.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-and-Tested-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should
detect and suitably handle this (and not by lamely moving the infinite
loop up to the caller level either).
Attempting to use __GFP_NOFAIL for a higher-order allocation is even
worse, so add a once-off runtime check for this to slap people around for
even thinking about trying it.
Cc: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Analoguous to follow_phys(), add a helper that looks up the PFN at a
user virtual address in an IO mapping or a raw PFN mapping.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Magnus Damm <magnus.damm@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Magnus Damm <magnus.damm@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A generic readonly page table lookup helper to map an address space and an
address from it to a pte.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Magnus Damm <magnus.damm@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The caller of setup_per_zone_inactive_ratio is an __init function. There
is no need to keep the callee after it completed as well. Also fix a
comment.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
int_sqrt() returns 0 if its argument is zero so call it if only needed.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This effectively lifts the unit of updates to nr_inactive_* and
pgdeactivate from PAGEVEC_SIZE=14 to SWAP_CLUSTER_MAX=32, or
MAX_ORDER_NR_PAGES=1024 for reclaim_zone().
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The lru->nr_saved_scan's are not meaningful counters for even kernel
developers. They typically are smaller than 32 and are always 0 for large
lists. So remove them from /proc/zoneinfo.
Hopefully this interface change won't break too many scripts.
/proc/zoneinfo is too unstructured to be script friendly, and I wonder the
affected scripts - if there are any - are still bleeding since the not
long ago commit "vmscan: split LRU lists into anon & file sets", which
also touched the "scanned" line :)
If we are to re-export accumulated vmscan counts in the future, they can
go to new lines in /proc/zoneinfo instead of the current form, or to
/sys/devices/system/node/node0/meminfo?
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The vmscan batching logic is twisting. Move it into a standalone function
nr_scan_try_batch() and document it. No behavior change.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When the file LRU lists are dominated by streaming IO pages, evict those
pages first, before considering evicting other pages.
This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:
1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted
The pages freed in this way can either be reused for streaming IO, or
allocated for something else. If the pages are used for streaming IO,
this pageout pattern continues. Otherwise, we will fall back to the
normal pageout pattern.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Elladan <elladan@eskimo.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A series of patches to enhance the /proc/pagemap interface and to add a
userspace executable which can be used to present the pagemap data.
Export 10 more flags to end users (and more for kernel developers):
11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_HUGE hugeTLB pages
18. KPF_UNEVICTABLE page is in the unevictable LRU list
19. KPF_HWPOISON hardware detected corruption
20. KPF_NOPAGE (pseudo flag) no page frame at the address
(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.
a simple demo of the page-types tool
# ./page-types -h
page-types [options]
-r|--raw Raw mode, for kernel developers
-a|--addr addr-spec Walk a range of pages
-b|--bits bits-spec Walk pages with specified bits
-l|--list Show page details in ranges
-L|--list-each Show page details one by one
-N|--no-summary Don't show summay info
-h|--help Show this usage message
addr-spec:
N one page at offset N (unit: pages)
N+M pages range from N to N+M-1
N,M pages range from N to M-1
N, pages range from N to end
,M pages range from 0 to M
bits-spec:
bit1,bit2 (flags & (bit1|bit2)) != 0
bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1
bit1,~bit2 (flags & (bit1|bit2)) == bit1
=bit1,bit2 flags == (bit1|bit2)
bit-names:
locked error referenced uptodate
dirty lru active slab
writeback reclaim buddy mmap
anonymous swapcache swapbacked compound_head
compound_tail huge unevictable hwpoison
nopage reserved(r) mlocked(r) mappedtodisk(r)
private(r) private_2(r) owner_private(r) arch(r)
uncached(r) readahead(o) slob_free(o) slub_frozen(o)
slub_debug(o)
(r) raw mode bits (o) overloaded bits
# ./page-types
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000000 487369 1903 _________________________________
0x0000000000000014 5 0 __R_D____________________________ referenced,dirty
0x0000000000000020 1 0 _____l___________________________ lru
0x0000000000000024 34 0 __R__l___________________________ referenced,lru
0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru
0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead
0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru
0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
0x0000000000000040 8344 32 ______A__________________________ active
0x0000000000000060 1 0 _____lA__________________________ lru,active
0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active
0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active
0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400 503 1 __________B______________________ buddy
0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000 492 1 ____________a____________________ anonymous
0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007
# ./page-types -r
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000000 468002 1828 _________________________________
0x0000000100000000 19102 74 _____________________r___________ reserved
0x0000000000008000 41 0 _______________H_________________ compound_head
0x0000000000010000 188 0 ________________T________________ compound_tail
0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head
0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail
0x0000000000000020 1 0 _____l___________________________ lru
0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private
0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru
0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead
0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk
0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead
0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru
0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk
0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private
0x0000000800000040 8124 31 ______A_________________P________ active,private
0x0000000000000040 219 0 ______A__________________________ active
0x0000000800000060 1 0 _____lA_________________P________ lru,active,private
0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active
0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk
0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private
0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active
0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk
0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private
0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private
0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private
0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400 538 2 __________B______________________ buddy
0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000 492 1 ____________a____________________ anonymous
0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked
0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked
0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007
# ./page-types --raw --list --no-summary --bits reserved
offset count flags
0 15 _____________________r___________
31 4 _____________________r___________
159 97 _____________________r___________
4096 2067 _____________________r___________
6752 2390 _____________________r___________
9355 3 _____________________r___________
9728 14526 _____________________r___________
This patch:
Introduce PageHuge(), which identifies huge/gigantic pages by their
dedicated compound destructor functions.
Also move prep_compound_gigantic_page() to hugetlb.c and make
__free_pages_ok() non-static.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
logic
alloc_large_system_hash() has logic for freeing pages at the end of an
excessively large power-of-two buffer that is a duplicate of what is in
alloc_pages_exact(). This patch converts alloc_large_system_hash() to use
alloc_pages_exact().
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Callers may speculatively call different allocators in order of preference
trying to allocate a buffer of a given size. The order needed to allocate
this may be larger than what the page allocator can normally handle.
While the allocator mostly does the right thing, it should not direct
reclaim or wakeup kswapd with a bogus order. This patch sanity checks the
order in the slow path and returns NULL if it is too large.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, free_page_mlock() is only called from page_alloc.c. Thus, we
can move it to page_alloc.c.
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SLAB currently avoids checking a bitmap repeatedly by checking once and
storing a flag. When the addition of nr_online_nodes as a cheaper version
of num_online_nodes(), this check can be replaced by nr_online_nodes.
(Christoph did a patch that this is lifted almost verbatim from)
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
fast paths
num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered
based on cpusets or the zonelist cache. This is actually a heavy function
and touches a number of cache lines.
This patch stores the number of online nodes at boot time and updates the
value when nodes get onlined and offlined. The value is then used in a
number of important paths in place of num_online_nodes().
[rientjes@google.com: do not override definition of node_set_online() with macro]
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Local interrupts are disabled when freeing pages to the PCP list. Part of
that free checks what the migratetype of the pageblock the page is in but
it checks this with interrupts disabled and interupts should never be
disabled longer than necessary. This patch checks the pagetype with
interrupts enabled with the impact that it is possible a page is freed to
the wrong list when a pageblock changes type. As that block is now
already considered mixed from an anti-fragmentation perspective, it's not
of vital importance.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES
counter must be updated. In the case of bulk per-cpu page freeing, it's
updated once per page. This retouches cache lines more than necessary.
Update the counters one per per-cpu bulk free.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine
which watermark to use.
This patch uses the flags as an array index into a watermark array that is
indexed with WMARK_* defines accessed via helpers. All call sites that
use zone->pages_* are updated to use the helpers for accessing the values
and the array offsets for setting.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
sanity checks
A number of sanity checks are made on each page allocation and free
including that the page count is zero. page_count() checks for compound
pages and checks the count of the head page if true. However, in these
paths, we do not care if the page is compound or not as the count of each
tail page should also be zero.
This patch makes two changes to the use of page_count() in the free path.
It converts one check of page_count() to a VM_BUG_ON() as the count should
have been unconditionally checked earlier in the free path. It also
avoids checking for compound pages.
[mel@csn.ul.ie: Wrote changelog]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a zonelist cache which is used to track zones that are not in the
allowed cpuset or found to be recently full. This is to reduce cache
footprint on large machines. On smaller machines, it just incurs cost for
no gain. This patch only uses the zonelist cache when there are NUMA
nodes.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If set, it disables interrupts to update counters and
this happens on every page free even though interrupts are disabled very
shortly afterwards a second time. This is wasteful.
This patch splits what free_page_mlock() does. The bit check is still
made. However, the update of counters is delayed until the interrupts are
disabled and the non-lock version for clearing the bit is used. One
potential weirdness with this split is that the counters do not get
updated if the bad_page() check is triggered but a system showing bad
pages is getting screwed already.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
get_pageblock_migratetype() is potentially called twice for every page
free. Once, when being freed to the pcp lists and once when being freed
back to buddy. When freeing from the pcp lists, it is known what the
pageblock type was at the time of free so use it rather than rechecking.
In low memory situations under memory pressure, this might skew
anti-fragmentation slightly but the interference is minimal and decisions
that are fragmenting memory are being made anyway.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__rmqueue_fallback() is in the slow path but has only one call site.
Because there is only one call-site, this function can then be inlined
without causing text bloat. On an x86-based config, it made no difference
as the savings were padded out by NOP instructions. Milage varies but
text will either decrease in size or remain static.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
buffered_rmqueue() is in the fast path so inline it. Because it only has
one call site, this function can then be inlined without causing text
bloat. On an x86-based config, it made no difference as the savings were
padded out by NOP instructions. Milage varies but text will either
decrease in size or remain static.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Inline __rmqueue_smallest by altering flow very slightly so that there is
only one call site. Because there is only one call-site, this function
can then be inlined without causing text bloat. On an x86-based config,
this patch reduces text by 16 bytes.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
flags are equal to each other, we can eliminate a branch.
[akpm@linux-foundation.org: Suggested the hack]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Factor out the mapping between GFP and alloc_flags only once. Once
factored out, it only needs to be calculated once but some care must be
taken.
[neilb@suse.de says]
As the test:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
has been replaced with a slightly weaker one:
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
Without care, this would allow recursion into the allocator via direct
reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set but
TF_MEMDIE callers are now allowed to directly reclaim where they would
have been prevented in the past.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
GFP mask is converted into a migratetype when deciding which pagelist to
take a page from. However, it is happening multiple times per allocation,
at least once per zone traversed. Calculate it once.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable zone
in the zonelist but the zone depends on the GFP flags specified at the
beginning of the allocation call. This patch calculates preferred_zone
once. It's safe to do this because if preferred_zone is NULL at the start
of the call, no amount of direct reclaim or other actions will change the
fact the allocation will fail.
[akpm@linux-foundation.org: remove (void) casts]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On low-memory systems, anti-fragmentation gets disabled as there is
nothing it can do and it would just incur overhead shuffling pages between
lists constantly. Currently the check is made in the free page fast path
for every page. This patch moves it to a slow path. On machines with low
memory, there will be small amount of additional overhead as pages get
shuffled between lists but it should quickly settle.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The core of the page allocator is one giant function which allocates
memory on the stack and makes calculations that may not be needed for
every allocation. This patch breaks up the allocator path into fast and
slow paths for clarity. Note the slow paths are still inlined but the
entry is marked unlikely. If they were not inlined, it actally increases
text size to generate the as there is only one call site.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It is possible with __GFP_THISNODE that no zones are suitable. This patch
makes sure the check is only made once.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
valid
Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in
fast paths know for a fact their node is valid. To avoid a comparison and
branch, this patch adds alloc_pages_exact_node() that only checks the nid
with VM_BUG_ON(). Callers that know their node is valid are then
converted.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.
[akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The start of a large patch series to clean up and optimise the page
allocator.
The performance improvements are in a wide range depending on the exact
machine but the results I've seen so fair are approximately;
kernbench: 0 to 0.12% (elapsed time)
0.49% to 3.20% (sys time)
aim9: -4% to 30% (for page_test and brk_test)
tbench: -1% to 4%
hackbench: -2.5% to 3.45% (mostly within the noise though)
netperf-udp -1.34% to 4.06% (varies between machines a bit)
netperf-tcp -0.44% to 5.22% (varies between machines a bit)
I haven't sysbench figures at hand, but previously they were within the
-0.5% to 2% range.
On netperf, the client and server were bound to opposite number CPUs to
maximise the problems with cache line bouncing of the struct pages so I
expect different people to report different results for netperf depending
on their exact machine and how they ran the test (different machines, same
cpus client/server, shared cache but two threads client/server, different
socket client/server etc).
I also measured the vmlinux sizes for a single x86-based config with
CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the
.config is based on the Debian Lenny kernel config so I expect it to be
reasonably typical.
This patch:
__alloc_pages_internal is the core page allocator function but essentially
it is an alias of __alloc_pages_nodemask. Naming a publicly available and
exported function "internal" is also a big ugly. This patch renames
__alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
nodemask function.
Warning - This patch renames an exported symbol. No kernel driver is
affected by external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash()
had better make its own check on the order.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.
In order to update tasks' mems_allowed in time, we must modify the code of
memory policy. Because the memory policy is applied in the process's
context originally. After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.
But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression. But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set. In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.
[lee.schermerhorn@hp.com:
The rework of mpol_new() to extract the adjusting of the node mask to
apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
allocation. Fix this by adding the check for MPOL_PREFERRED and empty
node mask to mpol_new_mpolicy().
Remove the now unneeded 'nodes = NULL' from mpol_new().
Note that mpol_new_mempolicy() is always called with a non-NULL
'nodes' parameter now that it has been removed from mpol_new().
Therefore, we don't need to test nodes for NULL before testing it for
'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
verify this assumption.]
[lee.schermerhorn@hp.com:
I don't think the function name 'mpol_new_mempolicy' is descriptive
enough to differentiate it from mpol_new().
This function applies cpuset set context, usually constraining nodes
to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
is set, it also translates the nodes. So I settled on
'mpol_set_nodemask()', because the comment block for mpol_new() mentions
that we need to call this function to "set nodes".
Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
with variable pbdi_dirty as one of the arguments. This variable is an
unsigned long * but both functions expect it to be a long *. This causes
the following sparse warnings:
warning: incorrect type in argument 3 (different signedness)
expected long *pbdi_dirty
got unsigned long *pbdi_dirty
warning: incorrect type in argument 2 (different signedness)
expected long *pdirty
got unsigned long *pbdi_dirty
Fix the warnings by changing the long * to unsigned long * in both
functions.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 33c120ed2843090e2bd316de1588b8bf8b96cbde ("more aggressively use
lumpy reclaim") increased how aggressive lumpy reclaim was by isolating
both active and inactive pages for asynchronous lumpy reclaim on
costly-high-order pages and for cheap-high-order when memory pressure is
high. However, if the system is under heavy pressure and there are dirty
pages, asynchronous IO may not be sufficient to reclaim a suitable page in
time.
This patch causes the caller to enter synchronous lumpy reclaim for
costly-high-order pages and for cheap-high-order pages when under memory
pressure.
Minchan.kim@gmail.com said:
Andy added synchronous lumpy reclaim with
c661b078fd62abe06fd11fab4ac5e4eeafe26b6d. At that time, lumpy reclaim is
not agressive. His intension is just for high-order users.(above
PAGE_ALLOC_COSTLY_ORDER).
After some time, Rik added aggressive lumpy reclaim with
33c120ed2843090e2bd316de1588b8bf8b96cbde. His intention was to do lumpy
reclaim when high-order users and trouble getting a small set of
contiguous pages.
So we also have to add synchronous pageout for small set of contiguous
pages.
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <Minchan.kim@gmail.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move more documentation for get_user_pages_fast into the new kerneldoc comment.
Add some comments for get_user_pages as well.
Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
there initially because it was once a static inline function.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Andy Grover <andy.grover@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that we do readahead for sequential mmap reads, here is a simple
evaluation of the impacts, and one further optimization.
It's an NFS-root debian desktop system, readahead size = 60 pages.
The numbers are grabbed after a fresh boot into console.
approach pgmajfault RA miss ratio mmap IO count avg IO size(pages)
A 383 31.6% 383 11
B 225 32.4% 390 11
C 224 32.6% 307 13
case A: mmap sync/async readahead disabled
case B: mmap sync/async readahead enabled, with enforced full async readahead size
case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
or:
A = vanilla 2.6.30-rc1
B = A plus mmap readahead
C = B plus this patch
The numbers show that
- there are good possibilities for random mmap reads to trigger readahead
- 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
- case C can further reduce IO count by 1/4
- readahead miss ratios are not quite affected
The theory is
- readahead is _good_ for clustered random reads, and can perform
_better_ than readaround because they could be _async_.
- async readahead size is guaranteed to be larger than readaround
size, and they are _async_, hence will mostly behave better
However for B
- sync readahead size could be smaller than readaround size, hence may
make things worse by produce more smaller IOs
which will be fixed by this patch.
Final conclusion:
- mmap readahead reduced major faults by 1/3 and no obvious overheads;
- mmap io can be further reduced by 1/4 with this patch.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce page cache context based readahead algorithm.
This is to better support concurrent read streams in general.
RATIONALE
---------
The current readahead algorithm detects interleaved reads in a _passive_ way.
Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
By checking for (offset == prev_offset + 1), it will discover the sequentialness
between 3,4 and between 1004,1005, and start doing sequential readahead for the
individual streams since page 4 and page 1005.
The context readahead algorithm guarantees to discover the sequentialness no
matter how the streams are interleaved. For the above example, it will start
sequential readahead since page 2 and 1002.
The trick is to poke for page @offset-1 in the page cache when it has no other
clues on the sequentialness of request @offset: if the current requenst belongs
to a sequential stream, that stream must have accessed page @offset-1 recently,
and the page will still be cached now. So if page @offset-1 is there, we can
take request @offset as a sequential access.
BENEFICIARIES
-------------
- strictly interleaved reads i.e. 1,1001,2,1002,3,1003,...
the current readahead will take them as silly random reads;
the context readahead will take them as two sequential streams.
- cooperative IO processes i.e. NFS and SCST
They create a thread pool, farming off (sequential) IO requests to different
threads which will be performing interleaved IO.
It was not easy(or possible) to reliably tell from file->f_ra all those
cooperative processes working on the same sequential stream, since they will
have different file->f_ra instances. And NFSD's file->f_ra is particularly
unusable, since their file objects are dynamically created for each request.
The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
The new scheme is to detect the sequential pattern via looking up the page
cache, which provides one single and consistent view of the pages recently
accessed. That makes sequential detection for cooperative processes possible.
USER REPORT
-----------
Vladislav recommends the addition of context readahead as a result of his SCST
benchmarks. It leads to 6%~40% performance gains in various cases and achieves
equal performance in others. http://lkml.org/lkml/2009/3/19/239
OVERHEADS
---------
In theory, it introduces one extra page cache lookup per random read. However
the below benchmark shows context readahead to be slightly faster, wondering..
Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
each block size. The average throughputs are:
original ra context ra gain
4K random reads: 65.561MB/s 65.648MB/s +0.1%
16K random reads: 124.767MB/s 124.951MB/s +0.1%
64K random reads: 162.123MB/s 162.278MB/s +0.1%
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Vladislav Bolkhovitin <vst@vlnb.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Split all readahead cases, and move the random one to bottom.
No behavior changes.
This is to prepare for the introduction of context readahead, and make it
easy for inserting accounting/tracing points for each case.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Vladislav Bolkhovitin <vst@vlnb.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Mmap read-around now shares the same code style and data structure with
readahead code.
This also removes do_page_cache_readahead(). Its last user, mmap
read-around, has been changed to call ra_submit().
The no-readahead-if-congested logic is dumped by the way. Users will be
pretty sensitive about the slow loading of executables. So it's
unfavorable to disabled mmap read-around on a congested queue.
[akpm@linux-foundation.org: coding-style fixes]
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|