From 345227d705f2318e9bc088e79fe71a38bb5fe82b Mon Sep 17 00:00:00 2001 From: "Gustavo F. Padovan" Date: Fri, 20 May 2011 21:23:37 +0200 Subject: backing-dev: Kill set but not used var in bdi_debug_stats_show() Signed-off-by: Gustavo F. Padovan Signed-off-by: Jens Axboe --- mm/backing-dev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index befc87531e4..f032e6e1e09 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -63,10 +63,10 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; - unsigned long nr_dirty, nr_io, nr_more_io, nr_wb; + unsigned long nr_dirty, nr_io, nr_more_io; struct inode *inode; - nr_wb = nr_dirty = nr_io = nr_more_io = 0; + nr_dirty = nr_io = nr_more_io = 0; spin_lock(&inode_wb_list_lock); list_for_each_entry(inode, &wb->b_dirty, i_wb_list) nr_dirty++; -- cgit v1.2.3-70-g09d2 From a71ae47a2cbfa542c69f695809124da4e4dd9e8f Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Wed, 25 May 2011 09:47:43 -0500 Subject: slub: Fix double bit unlock in debug mode Commit 442b06bcea23 ("slub: Remove node check in slab_free") added a call to deactivate_slab() in the debug case in __slab_alloc(), which unlocks the current slab used for allocation. Going to the label 'unlock_out' then does it again. Also, in the debug case we do not need all the other processing that the 'unlock_out' path does. We always fall back to the slow path in the debug case. So the tid update is useless. Similarly, ALLOC_SLOWPATH would just be incremented for all allocations. Also a pretty useless thing. So simply restore irq flags and return the object. Signed-off-by: Christoph Lameter Reported-and-bisected-by: James Morris Reported-by: Ingo Molnar Reported-by: Jens Axboe Cc: Pekka Enberg Signed-off-by: Linus Torvalds --- mm/slub.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index 4ea7f1a22a9..4aad32d2e60 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1884,7 +1884,8 @@ debug: deactivate_slab(s, c); c->page = NULL; c->node = NUMA_NO_NODE; - goto unlock_out; + local_irq_restore(flags); + return object; } /* -- cgit v1.2.3-70-g09d2 From afc7e326a3f5bafc41324d7926c324414e343ee5 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 24 May 2011 17:11:09 -0700 Subject: mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely There are a few reports of people experiencing hangs when copying large amounts of data with kswapd using a large amount of CPU which appear to be due to recent reclaim changes. SLUB using high orders is the trigger but not the root cause as SLUB has been using high orders for a while. The root cause was bugs introduced into reclaim which are addressed by the following two patches. Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") to allow kswapd to go to sleep when balanced for high orders. Patch 2 notes that it is possible for kswapd to miss every cond_resched() and updates shrink_slab() so it'll at least reach that scheduling point. Chris Wood reports that these two patches in isolation are sufficient to prevent the system hanging. AFAIK, they should also resolve similar hangs experienced by James Bottomley. This patch: Johannes Weiner poined out that the logic in commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") is backwards. Instead of allowing kswapd to go to sleep when balancing for high order allocations, it keeps it kswapd running uselessly. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Signed-off-by: Johannes Weiner Reviewed-by: Wu Fengguang Cc: James Bottomley Tested-by: Colin King Cc: Raghavendra D Prabhu Cc: Jan Kara Cc: Chris Mason Cc: Christoph Lameter Cc: Pekka Enberg Cc: Rik van Riel Reviewed-by: Minchan Kim Reviewed-by: Wu Fengguang Cc: [2.6.38+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index c9177202c8c..66698f603aa 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2287,7 +2287,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, * must be balanced */ if (order) - return pgdat_balanced(pgdat, balanced, classzone_idx); + return !pgdat_balanced(pgdat, balanced, classzone_idx); else return !all_zones_ok; } -- cgit v1.2.3-70-g09d2 From f06590bd718ed950c98828e30ef93204028f3210 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 24 May 2011 17:11:11 -0700 Subject: mm: vmscan: correctly check if reclaimer should schedule during shrink_slab It has been reported on some laptops that kswapd is consuming large amounts of CPU and not being scheduled when SLUB is enabled during large amounts of file copying. It is expected that this is due to kswapd missing every cond_resched() point because; shrink_page_list() calls cond_resched() if inactive pages were isolated which in turn may not happen if all_unreclaimable is set in shrink_zones(). If for whatver reason, all_unreclaimable is set on all zones, we can miss calling cond_resched(). balance_pgdat() only calls cond_resched if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced so it loops again for order-0 and returns that it was reclaiming for order-0 to kswapd(). It can then find that a caller has rewoken kswapd for a high-order and re-enters balance_pgdat() without ever calling cond_resched(). shrink_slab only calls cond_resched() if we are reclaiming slab pages. If there are a large number of direct reclaimers, the shrinker_rwsem can be contended and prevent kswapd calling cond_resched(). This patch modifies the shrink_slab() case. If the semaphore is contended, the caller will still check cond_resched(). After each successful call into a shrinker, the check for cond_resched() remains in case one shrinker is particularly slow. [mgorman@suse.de: preserve call to cond_resched after each call into shrinker] Signed-off-by: Mel Gorman Signed-off-by: Minchan Kim Cc: Rik van Riel Cc: Johannes Weiner Cc: Wu Fengguang Cc: James Bottomley Tested-by: Colin King Cc: Raghavendra D Prabhu Cc: Jan Kara Cc: Chris Mason Cc: Christoph Lameter Cc: Pekka Enberg Cc: Rik van Riel Cc: [2.6.38+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 66698f603aa..d303b60f4c2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -231,8 +231,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, if (scanned == 0) scanned = SWAP_CLUSTER_MAX; - if (!down_read_trylock(&shrinker_rwsem)) - return 1; /* Assume we'll be able to shrink next time */ + if (!down_read_trylock(&shrinker_rwsem)) { + /* Assume we'll be able to shrink next time */ + ret = 1; + goto out; + } list_for_each_entry(shrinker, &shrinker_list, list) { unsigned long long delta; @@ -283,6 +286,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, shrinker->nr += total_scan; } up_read(&shrinker_rwsem); +out: + cond_resched(); return ret; } -- cgit v1.2.3-70-g09d2 From 7bf02ea22c6cdd09e2d3f1d3c3fe366b834ae9af Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 24 May 2011 17:11:16 -0700 Subject: arch, mm: filter disallowed nodes from arch specific show_mem functions Architectures that implement their own show_mem() function did not pass the filter argument to show_free_areas() to appropriately avoid emitting the state of nodes that are disallowed in the current context. This patch now passes the filter argument to show_free_areas() so those nodes are now avoided. This patch also removes the show_free_areas() wrapper around __show_free_areas() and converts existing callers to pass an empty filter. ia64 emits additional information for each node, so skip_free_areas_zone() must be made global to filter disallowed nodes and it is converted to use a nid argument rather than a zone for this use case. Signed-off-by: David Rientjes Cc: Russell King Cc: Tony Luck Cc: Fenghua Yu Cc: Kyle McMartin Cc: Helge Deller Cc: James Bottomley Cc: "David S. Miller" Cc: Guan Xuetao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm/mm/init.c | 2 +- arch/ia64/mm/contig.c | 10 ++++++---- arch/ia64/mm/discontig.c | 10 ++++++---- arch/parisc/mm/init.c | 2 +- arch/sparc/kernel/setup_32.c | 2 +- arch/sparc/mm/init_32.c | 2 +- arch/unicore32/mm/init.c | 2 +- drivers/net/ioc3-eth.c | 2 +- drivers/tty/serial/68328serial.c | 2 +- include/linux/mm.h | 6 +++--- lib/show_mem.c | 2 +- mm/nommu.c | 6 +++--- mm/page_alloc.c | 22 ++++++++-------------- 13 files changed, 34 insertions(+), 36 deletions(-) (limited to 'mm') diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index 76f82ae44ef..3f17ea146f0 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -85,7 +85,7 @@ void show_mem(unsigned int filter) struct meminfo * mi = &meminfo; printk("Mem-info:\n"); - show_free_areas(); + show_free_areas(filter); for_each_bank (i, mi) { struct membank *bank = &mi->bank[i]; diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c index 9a018cde5d8..f114a3b14c6 100644 --- a/arch/ia64/mm/contig.c +++ b/arch/ia64/mm/contig.c @@ -44,13 +44,16 @@ void show_mem(unsigned int filter) pg_data_t *pgdat; printk(KERN_INFO "Mem-info:\n"); - show_free_areas(); + show_free_areas(filter); printk(KERN_INFO "Node memory in pages:\n"); for_each_online_pgdat(pgdat) { unsigned long present; unsigned long flags; int shared = 0, cached = 0, reserved = 0; + int nid = pgdat->node_id; + if (skip_free_areas_node(filter, nid)) + continue; pgdat_resize_lock(pgdat, &flags); present = pgdat->node_present_pages; for(i = 0; i < pgdat->node_spanned_pages; i++) { @@ -64,8 +67,7 @@ void show_mem(unsigned int filter) if (max_gap < LARGE_GAP) continue; #endif - i = vmemmap_find_next_valid_pfn(pgdat->node_id, - i) - 1; + i = vmemmap_find_next_valid_pfn(nid, i) - 1; continue; } if (PageReserved(page)) @@ -81,7 +83,7 @@ void show_mem(unsigned int filter) total_cached += cached; total_shared += shared; printk(KERN_INFO "Node %4d: RAM: %11ld, rsvd: %8d, " - "shrd: %10d, swpd: %10d\n", pgdat->node_id, + "shrd: %10d, swpd: %10d\n", nid, present, reserved, shared, cached); } printk(KERN_INFO "%ld pages of RAM\n", total_present); diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 82ab1bc6afb..c641333cd99 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -622,13 +622,16 @@ void show_mem(unsigned int filter) pg_data_t *pgdat; printk(KERN_INFO "Mem-info:\n"); - show_free_areas(); + show_free_areas(filter); printk(KERN_INFO "Node memory in pages:\n"); for_each_online_pgdat(pgdat) { unsigned long present; unsigned long flags; int shared = 0, cached = 0, reserved = 0; + int nid = pgdat->node_id; + if (skip_free_areas_node(filter, nid)) + continue; pgdat_resize_lock(pgdat, &flags); present = pgdat->node_present_pages; for(i = 0; i < pgdat->node_spanned_pages; i++) { @@ -638,8 +641,7 @@ void show_mem(unsigned int filter) if (pfn_valid(pgdat->node_start_pfn + i)) page = pfn_to_page(pgdat->node_start_pfn + i); else { - i = vmemmap_find_next_valid_pfn(pgdat->node_id, - i) - 1; + i = vmemmap_find_next_valid_pfn(nid, i) - 1; continue; } if (PageReserved(page)) @@ -655,7 +657,7 @@ void show_mem(unsigned int filter) total_cached += cached; total_shared += shared; printk(KERN_INFO "Node %4d: RAM: %11ld, rsvd: %8d, " - "shrd: %10d, swpd: %10d\n", pgdat->node_id, + "shrd: %10d, swpd: %10d\n", nid, present, reserved, shared, cached); } printk(KERN_INFO "%ld pages of RAM\n", total_present); diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c index 5fa1e273006..c5c9c65e502 100644 --- a/arch/parisc/mm/init.c +++ b/arch/parisc/mm/init.c @@ -686,7 +686,7 @@ void show_mem(unsigned int filter) int shared = 0, cached = 0; printk(KERN_INFO "Mem-info:\n"); - show_free_areas(); + show_free_areas(filter); #ifndef CONFIG_DISCONTIGMEM i = max_mapnr; while (i-- > 0) { diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c index 3609bdee9ed..3249d3f3234 100644 --- a/arch/sparc/kernel/setup_32.c +++ b/arch/sparc/kernel/setup_32.c @@ -82,7 +82,7 @@ static void prom_sync_me(void) "nop\n\t" : : "r" (&trapbase)); prom_printf("PROM SYNC COMMAND...\n"); - show_free_areas(); + show_free_areas(0); if(current->pid != 0) { local_irq_enable(); sys_sync(); diff --git a/arch/sparc/mm/init_32.c b/arch/sparc/mm/init_32.c index 4c31e2b6e71..28c2cc81c9a 100644 --- a/arch/sparc/mm/init_32.c +++ b/arch/sparc/mm/init_32.c @@ -78,7 +78,7 @@ void __init kmap_init(void) void show_mem(unsigned int filter) { printk("Mem-info:\n"); - show_free_areas(); + show_free_areas(filter); printk("Free swap: %6ldkB\n", nr_swap_pages << (PAGE_SHIFT-10)); printk("%ld pages of RAM\n", totalram_pages); diff --git a/arch/unicore32/mm/init.c b/arch/unicore32/mm/init.c index 1fc02633f70..2d3e7112d2a 100644 --- a/arch/unicore32/mm/init.c +++ b/arch/unicore32/mm/init.c @@ -62,7 +62,7 @@ void show_mem(unsigned int filter) struct meminfo *mi = &meminfo; printk(KERN_DEFAULT "Mem-info:\n"); - show_free_areas(); + show_free_areas(filter); for_each_bank(i, mi) { struct membank *bank = &mi->bank[i]; diff --git a/drivers/net/ioc3-eth.c b/drivers/net/ioc3-eth.c index 96c95617195..32f07f868d8 100644 --- a/drivers/net/ioc3-eth.c +++ b/drivers/net/ioc3-eth.c @@ -915,7 +915,7 @@ static void ioc3_alloc_rings(struct net_device *dev) skb = ioc3_alloc_skb(RX_BUF_ALLOC_SIZE, GFP_ATOMIC); if (!skb) { - show_free_areas(); + show_free_areas(0); continue; } diff --git a/drivers/tty/serial/68328serial.c b/drivers/tty/serial/68328serial.c index d5bfd41707e..e0a77540b8c 100644 --- a/drivers/tty/serial/68328serial.c +++ b/drivers/tty/serial/68328serial.c @@ -281,7 +281,7 @@ static void receive_chars(struct m68k_serial *info, unsigned short rx) #ifdef CONFIG_MAGIC_SYSRQ } else if (ch == 0x10) { /* ^P */ show_state(); - show_free_areas(); + show_free_areas(0); show_buffers(); /* show_net_buffers(); */ return; diff --git a/include/linux/mm.h b/include/linux/mm.h index 6507dde38b1..1746f67c33d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -862,13 +862,13 @@ extern void pagefault_out_of_memory(void); #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) /* - * Flags passed to show_mem() and __show_free_areas() to suppress output in + * Flags passed to show_mem() and show_free_areas() to suppress output in * various contexts. */ #define SHOW_MEM_FILTER_NODES (0x0001u) /* filter disallowed nodes */ -extern void show_free_areas(void); -extern void __show_free_areas(unsigned int flags); +extern void show_free_areas(unsigned int flags); +extern bool skip_free_areas_node(unsigned int flags, int nid); int shmem_lock(struct file *file, int lock, struct user_struct *user); struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags); diff --git a/lib/show_mem.c b/lib/show_mem.c index 90cbe4bb596..4407f8c9b1f 100644 --- a/lib/show_mem.c +++ b/lib/show_mem.c @@ -16,7 +16,7 @@ void show_mem(unsigned int filter) nonshared = 0, highmem = 0; printk("Mem-Info:\n"); - __show_free_areas(filter); + show_free_areas(filter); for_each_online_pgdat(pgdat) { unsigned long i, flags; diff --git a/mm/nommu.c b/mm/nommu.c index c4c542c736a..5afce48ce33 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1235,7 +1235,7 @@ error_free: enomem: printk("Allocation of length %lu from process %d (%s) failed\n", len, current->pid, current->comm); - show_free_areas(); + show_free_areas(0); return -ENOMEM; } @@ -1468,14 +1468,14 @@ error_getting_vma: printk(KERN_WARNING "Allocation of vma for %lu byte allocation" " from process %d failed\n", len, current->pid); - show_free_areas(); + show_free_areas(0); return -ENOMEM; error_getting_region: printk(KERN_WARNING "Allocation of vm region for %lu byte allocation" " from process %d failed\n", len, current->pid); - show_free_areas(); + show_free_areas(0); return -ENOMEM; } EXPORT_SYMBOL(do_mmap_pgoff); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9d5498e2d0f..b1447522d34 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2473,10 +2473,10 @@ void si_meminfo_node(struct sysinfo *val, int nid) #endif /* - * Determine whether the zone's node should be displayed or not, depending on - * whether SHOW_MEM_FILTER_NODES was passed to __show_free_areas(). + * Determine whether the node should be displayed or not, depending on whether + * SHOW_MEM_FILTER_NODES was passed to show_free_areas(). */ -static bool skip_free_areas_zone(unsigned int flags, const struct zone *zone) +bool skip_free_areas_node(unsigned int flags, int nid) { bool ret = false; @@ -2484,8 +2484,7 @@ static bool skip_free_areas_zone(unsigned int flags, const struct zone *zone) goto out; get_mems_allowed(); - ret = !node_isset(zone->zone_pgdat->node_id, - cpuset_current_mems_allowed); + ret = !node_isset(nid, cpuset_current_mems_allowed); put_mems_allowed(); out: return ret; @@ -2500,13 +2499,13 @@ out: * Suppresses nodes that are not allowed by current's cpuset if * SHOW_MEM_FILTER_NODES is passed. */ -void __show_free_areas(unsigned int filter) +void show_free_areas(unsigned int filter) { int cpu; struct zone *zone; for_each_populated_zone(zone) { - if (skip_free_areas_zone(filter, zone)) + if (skip_free_areas_node(filter, zone_to_nid(zone))) continue; show_node(zone); printk("%s per-cpu:\n", zone->name); @@ -2549,7 +2548,7 @@ void __show_free_areas(unsigned int filter) for_each_populated_zone(zone) { int i; - if (skip_free_areas_zone(filter, zone)) + if (skip_free_areas_node(filter, zone_to_nid(zone))) continue; show_node(zone); printk("%s" @@ -2618,7 +2617,7 @@ void __show_free_areas(unsigned int filter) for_each_populated_zone(zone) { unsigned long nr[MAX_ORDER], flags, order, total = 0; - if (skip_free_areas_zone(filter, zone)) + if (skip_free_areas_node(filter, zone_to_nid(zone))) continue; show_node(zone); printk("%s: ", zone->name); @@ -2639,11 +2638,6 @@ void __show_free_areas(unsigned int filter) show_swap_cache_info(); } -void show_free_areas(void) -{ - __show_free_areas(0); -} - static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) { zoneref->zone = zone; -- cgit v1.2.3-70-g09d2 From 34679d7eac9ecc20face093db9aa610f1e9c893a Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Tue, 24 May 2011 17:11:18 -0700 Subject: mmap: add alignment for some variables Make some variables have correct alignment/section to avoid cache issue. In a workload which heavily does mmap/munmap, the variables will be used frequently. Signed-off-by: Shaohua Li Cc: Andi Kleen Cc: Rik van Riel Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index 772140c53ab..eaec3df82a2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -84,10 +84,14 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags) } EXPORT_SYMBOL(vm_get_page_prot); -int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */ -int sysctl_overcommit_ratio = 50; /* default is 50% */ +int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; /* heuristic overcommit */ +int sysctl_overcommit_ratio __read_mostly = 50; /* default is 50% */ int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; -struct percpu_counter vm_committed_as; +/* + * Make sure vm_committed_as in one cacheline and not cacheline shared with + * other variables. It can be updated by several CPUs frequently. + */ +struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp; /* * Check that a process has enough memory to allocate a new virtual -- cgit v1.2.3-70-g09d2 From 5f70b962ccc2f2e6259417cf3d1233dc9e16cf5e Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Tue, 24 May 2011 17:11:19 -0700 Subject: mmap: avoid unnecessary anon_vma lock If we only change vma->vm_end, we can avoid taking anon_vma lock even if 'insert' isn't NULL, which is the case of split_vma. As I understand it, we need the lock before because rmap must get the 'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA is linked to anon_vma list in __insert_vm_struct before). But now this isn't true any more. The 'insert' VMA is already linked to anon_vma list in __split_vma(with anon_vma_clone()) instead of __insert_vm_struct. There is no race rmap can't get required VMAs. So the anon_vma lock is unnecessary, and this can reduce one locking in brk case and improve scalability. Signed-off-by: Shaohua Li Cc: Rik van Riel Acked-by: Hugh Dickins Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index eaec3df82a2..15b1fae57ef 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -609,7 +609,7 @@ again: remove_next = 1 + (end > next->vm_end); * lock may be shared between many sibling processes. Skipping * the lock for brk adjustments makes a difference sometimes. */ - if (vma->anon_vma && (insert || importer || start != vma->vm_start)) { + if (vma->anon_vma && (importer || start != vma->vm_start)) { anon_vma = vma->anon_vma; anon_vma_lock(anon_vma); } -- cgit v1.2.3-70-g09d2 From 965f55dea0e331152fa53941a51e4e16f9f06fae Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Tue, 24 May 2011 17:11:20 -0700 Subject: mmap: avoid merging cloned VMAs Avoid merging a VMA with another VMA which is cloned from the parent process. The cloned VMA shares the anon_vma lock with the parent process's VMA. If we do the merge, more vmas (even the new range is only for current process) use the perent process's anon_vma lock. This introduces scalability issues. find_mergeable_anon_vma() already considers this case. Signed-off-by: Shaohua Li Cc: Rik van Riel Cc: Hugh Dickins Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index 15b1fae57ef..4fb5464f770 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -703,9 +703,17 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma, } static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1, - struct anon_vma *anon_vma2) + struct anon_vma *anon_vma2, + struct vm_area_struct *vma) { - return !anon_vma1 || !anon_vma2 || (anon_vma1 == anon_vma2); + /* + * The list_is_singular() test is to avoid merging VMA cloned from + * parents. This can improve scalability caused by anon_vma lock. + */ + if ((!anon_vma1 || !anon_vma2) && (!vma || + list_is_singular(&vma->anon_vma_chain))) + return 1; + return anon_vma1 == anon_vma2; } /* @@ -724,7 +732,7 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff) { if (is_mergeable_vma(vma, file, vm_flags) && - is_mergeable_anon_vma(anon_vma, vma->anon_vma)) { + is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { if (vma->vm_pgoff == vm_pgoff) return 1; } @@ -743,7 +751,7 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff) { if (is_mergeable_vma(vma, file, vm_flags) && - is_mergeable_anon_vma(anon_vma, vma->anon_vma)) { + is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { pgoff_t vm_pglen; vm_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; if (vma->vm_pgoff + vm_pglen == vm_pgoff) @@ -821,7 +829,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen) && is_mergeable_anon_vma(prev->anon_vma, - next->anon_vma)) { + next->anon_vma, NULL)) { /* cases 1, 6 */ err = vma_adjust(prev, prev->vm_start, next->vm_end, prev->vm_pgoff, NULL); -- cgit v1.2.3-70-g09d2 From ac3bbec5ec69b973317677e038de2d1a0c90c18c Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Tue, 24 May 2011 17:11:21 -0700 Subject: mm: remove unused zone_idx variable from set_migratetype_isolate Signed-off-by: Sergey Senozhatsky Reviewed-by: Christoph Lameter Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b1447522d34..eb19ba8396c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5502,10 +5502,8 @@ int set_migratetype_isolate(struct page *page) struct memory_isolate_notify arg; int notifier_ret; int ret = -EBUSY; - int zone_idx; zone = page_zone(page); - zone_idx = zone_idx(zone); spin_lock_irqsave(&zone->lock, flags); -- cgit v1.2.3-70-g09d2 From 6038def0d11b322019d0dbb43f2a611247dfbdb6 Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 24 May 2011 17:11:22 -0700 Subject: mm: nommu: sort mm->mmap list properly When I was reading nommu code, I found that it handles the vma list/tree in an unusual way. IIUC, because there can be more than one identical/overrapped vmas in the list/tree, it sorts the tree more strictly and does a linear search on the tree. But it doesn't applied to the list (i.e. the list could be constructed in a different order than the tree so that we can't use the list when finding the first vma in that order). Since inserting/sorting a vma in the tree and link is done at the same time, we can easily construct both of them in the same order. And linear searching on the tree could be more costly than doing it on the list, it can be converted to use the list. Also, after the commit 297c5eee3724 ("mm: make the vma list be doubly linked") made the list be doubly linked, there were a couple of code need to be fixed to construct the list properly. Patch 1/6 is a preparation. It maintains the list sorted same as the tree and construct doubly-linked list properly. Patch 2/6 is a simple optimization for the vma deletion. Patch 3/6 and 4/6 convert tree traversal to list traversal and the rest are simple fixes and cleanups. This patch: @vma added into @mm should be sorted by start addr, end addr and VMA struct addr in that order because we may get identical VMAs in the @mm. However this was true only for the rbtree, not for the list. This patch fixes this by remembering 'rb_prev' during the tree traversal like find_vma_prepare() does and linking the @vma via __vma_link_list(). After this patch, we can iterate the whole VMAs in correct order simply by using @mm->mmap list. [akpm@linux-foundation.org: avoid duplicating __vma_link_list()] Signed-off-by: Namhyung Kim Acked-by: Greg Ungerer Cc: David Howells Cc: Paul Mundt Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 4 ++++ mm/mmap.c | 23 ----------------------- mm/nommu.c | 38 ++++++++++++++++---------------------- mm/util.c | 24 ++++++++++++++++++++++++ 4 files changed, 44 insertions(+), 45 deletions(-) (limited to 'mm') diff --git a/mm/internal.h b/mm/internal.h index 9d0ced8e505..d071d380fb4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -66,6 +66,10 @@ static inline unsigned long page_order(struct page *page) return page_private(page); } +/* mm/util.c */ +void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, + struct vm_area_struct *prev, struct rb_node *rb_parent); + #ifdef CONFIG_MMU extern long mlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); diff --git a/mm/mmap.c b/mm/mmap.c index 4fb5464f770..e76f8d75288 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -398,29 +398,6 @@ find_vma_prepare(struct mm_struct *mm, unsigned long addr, return vma; } -static inline void -__vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, - struct vm_area_struct *prev, struct rb_node *rb_parent) -{ - struct vm_area_struct *next; - - vma->vm_prev = prev; - if (prev) { - next = prev->vm_next; - prev->vm_next = vma; - } else { - mm->mmap = vma; - if (rb_parent) - next = rb_entry(rb_parent, - struct vm_area_struct, vm_rb); - else - next = NULL; - } - vma->vm_next = next; - if (next) - next->vm_prev = vma; -} - void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma, struct rb_node **rb_link, struct rb_node *rb_parent) { diff --git a/mm/nommu.c b/mm/nommu.c index 5afce48ce33..0b16cb4c517 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -680,9 +680,9 @@ static void protect_vma(struct vm_area_struct *vma, unsigned long flags) */ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) { - struct vm_area_struct *pvma, **pp, *next; + struct vm_area_struct *pvma, *prev; struct address_space *mapping; - struct rb_node **p, *parent; + struct rb_node **p, *parent, *rb_prev; kenter(",%p", vma); @@ -703,7 +703,7 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) } /* add the VMA to the tree */ - parent = NULL; + parent = rb_prev = NULL; p = &mm->mm_rb.rb_node; while (*p) { parent = *p; @@ -713,17 +713,20 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) * (the latter is necessary as we may get identical VMAs) */ if (vma->vm_start < pvma->vm_start) p = &(*p)->rb_left; - else if (vma->vm_start > pvma->vm_start) + else if (vma->vm_start > pvma->vm_start) { + rb_prev = parent; p = &(*p)->rb_right; - else if (vma->vm_end < pvma->vm_end) + } else if (vma->vm_end < pvma->vm_end) p = &(*p)->rb_left; - else if (vma->vm_end > pvma->vm_end) + else if (vma->vm_end > pvma->vm_end) { + rb_prev = parent; p = &(*p)->rb_right; - else if (vma < pvma) + } else if (vma < pvma) p = &(*p)->rb_left; - else if (vma > pvma) + else if (vma > pvma) { + rb_prev = parent; p = &(*p)->rb_right; - else + } else BUG(); } @@ -731,20 +734,11 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) rb_insert_color(&vma->vm_rb, &mm->mm_rb); /* add VMA to the VMA list also */ - for (pp = &mm->mmap; (pvma = *pp); pp = &(*pp)->vm_next) { - if (pvma->vm_start > vma->vm_start) - break; - if (pvma->vm_start < vma->vm_start) - continue; - if (pvma->vm_end < vma->vm_end) - break; - } + prev = NULL; + if (rb_prev) + prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb); - next = *pp; - *pp = vma; - vma->vm_next = next; - if (next) - next->vm_prev = vma; + __vma_link_list(mm, vma, prev, parent); } /* diff --git a/mm/util.c b/mm/util.c index e7b103a6fd2..88ea1bd661c 100644 --- a/mm/util.c +++ b/mm/util.c @@ -6,6 +6,8 @@ #include #include +#include "internal.h" + #define CREATE_TRACE_POINTS #include @@ -215,6 +217,28 @@ char *strndup_user(const char __user *s, long n) } EXPORT_SYMBOL(strndup_user); +void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, + struct vm_area_struct *prev, struct rb_node *rb_parent) +{ + struct vm_area_struct *next; + + vma->vm_prev = prev; + if (prev) { + next = prev->vm_next; + prev->vm_next = vma; + } else { + mm->mmap = vma; + if (rb_parent) + next = rb_entry(rb_parent, + struct vm_area_struct, vm_rb); + else + next = NULL; + } + vma->vm_next = next; + if (next) + next->vm_prev = vma; +} + #if defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT) void arch_pick_mmap_layout(struct mm_struct *mm) { -- cgit v1.2.3-70-g09d2 From b951bf2c4693bfc9744e11293be859209f65f579 Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 24 May 2011 17:11:23 -0700 Subject: mm: nommu: don't scan the vma list when deleting Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made it a doubly linked list, we don't need to scan the list when deleting @vma. And the original code didn't update the prev pointer. Fix it too. Signed-off-by: Namhyung Kim Acked-by: Greg Ungerer Cc: David Howells Cc: Paul Mundt Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index 0b16cb4c517..2454d39dc06 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -746,7 +746,6 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) */ static void delete_vma_from_mm(struct vm_area_struct *vma) { - struct vm_area_struct **pp; struct address_space *mapping; struct mm_struct *mm = vma->vm_mm; @@ -769,12 +768,14 @@ static void delete_vma_from_mm(struct vm_area_struct *vma) /* remove from the MM's tree and list */ rb_erase(&vma->vm_rb, &mm->mm_rb); - for (pp = &mm->mmap; *pp; pp = &(*pp)->vm_next) { - if (*pp == vma) { - *pp = vma->vm_next; - break; - } - } + + if (vma->vm_prev) + vma->vm_prev->vm_next = vma->vm_next; + else + mm->mmap = vma->vm_next; + + if (vma->vm_next) + vma->vm_next->vm_prev = vma->vm_prev; vma->vm_mm = NULL; } -- cgit v1.2.3-70-g09d2 From e922c4c5360980bfeb862b3ec307d36bb344dcae Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 24 May 2011 17:11:24 -0700 Subject: mm: nommu: find vma using the sorted vma list Now we have the sorted vma list, use it in the find_vma[_exact]() rather than doing linear search on the rb-tree. Signed-off-by: Namhyung Kim Acked-by: Greg Ungerer Cc: David Howells Cc: Paul Mundt Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index 2454d39dc06..e5318f8efde 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -804,17 +804,15 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma) struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma; - struct rb_node *n = mm->mm_rb.rb_node; /* check the cache first */ vma = mm->mmap_cache; if (vma && vma->vm_start <= addr && vma->vm_end > addr) return vma; - /* trawl the tree (there may be multiple mappings in which addr + /* trawl the list (there may be multiple mappings in which addr * resides) */ - for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) { - vma = rb_entry(n, struct vm_area_struct, vm_rb); + for (vma = mm->mmap; vma; vma = vma->vm_next) { if (vma->vm_start > addr) return NULL; if (vma->vm_end > addr) { @@ -854,7 +852,6 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm, unsigned long len) { struct vm_area_struct *vma; - struct rb_node *n = mm->mm_rb.rb_node; unsigned long end = addr + len; /* check the cache first */ @@ -862,10 +859,9 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm, if (vma && vma->vm_start == addr && vma->vm_end == end) return vma; - /* trawl the tree (there may be multiple mappings in which addr + /* trawl the list (there may be multiple mappings in which addr * resides) */ - for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) { - vma = rb_entry(n, struct vm_area_struct, vm_rb); + for (vma = mm->mmap; vma; vma = vma->vm_next) { if (vma->vm_start < addr) continue; if (vma->vm_start > addr) -- cgit v1.2.3-70-g09d2 From d75a310c42c616c168953ed45c1091074f97828c Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 24 May 2011 17:11:25 -0700 Subject: mm: nommu: check the vma list when unmapping file-mapped vma Now we have the sorted vma list, use it in do_munmap() to check that we have an exact match. Signed-off-by: Namhyung Kim Acked-by: Greg Ungerer Cc: David Howells Cc: Paul Mundt Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index e5318f8efde..0563fd9003d 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1635,7 +1635,6 @@ static int shrink_vma(struct mm_struct *mm, int do_munmap(struct mm_struct *mm, unsigned long start, size_t len) { struct vm_area_struct *vma; - struct rb_node *rb; unsigned long end = start + len; int ret; @@ -1668,9 +1667,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len) } if (end == vma->vm_end) goto erase_whole_vma; - rb = rb_next(&vma->vm_rb); - vma = rb_entry(rb, struct vm_area_struct, vm_rb); - } while (rb); + vma = vma->vm_next; + } while (vma); kleave(" = -EINVAL [split file]"); return -EINVAL; } else { -- cgit v1.2.3-70-g09d2 From 7223bb4a829628bdf37d544ed4363d99bac1ade6 Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 24 May 2011 17:11:26 -0700 Subject: mm: nommu: fix a potential memory leak in do_mmap_private() If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a memory leak between @region->vm_end and @region->vm_top. Signed-off-by: Namhyung Kim Acked-by: Greg Ungerer Cc: David Howells Cc: Paul Mundt Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index 0563fd9003d..57ae6126b29 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1217,7 +1217,7 @@ static int do_mmap_private(struct vm_area_struct *vma, return 0; error_free: - free_page_series(region->vm_start, region->vm_end); + free_page_series(region->vm_start, region->vm_top); region->vm_start = vma->vm_start = 0; region->vm_end = vma->vm_end = 0; region->vm_top = 0; -- cgit v1.2.3-70-g09d2 From bb005a59e08733bb416126dc184f73120fc6366b Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 24 May 2011 17:11:27 -0700 Subject: mm: nommu: fix a compile warning in do_mmap_pgoff() Because 'ret' is declared as int, not unsigned long, no need to cast the error contants into unsigned long. If you compile this code on a 64-bit machine somehow, you'll see following warning: CC mm/nommu.o mm/nommu.c: In function `do_mmap_pgoff': mm/nommu.c:1411: warning: overflow in implicit constant conversion Signed-off-by: Namhyung Kim Acked-by: Greg Ungerer Cc: David Howells Cc: Paul Mundt Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index 57ae6126b29..92e1a47d1e5 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1376,15 +1376,15 @@ unsigned long do_mmap_pgoff(struct file *file, if (capabilities & BDI_CAP_MAP_DIRECT) { addr = file->f_op->get_unmapped_area(file, addr, len, pgoff, flags); - if (IS_ERR((void *) addr)) { + if (IS_ERR_VALUE(addr)) { ret = addr; - if (ret != (unsigned long) -ENOSYS) + if (ret != -ENOSYS) goto error_just_free; /* the driver refused to tell us where to site * the mapping so we'll have to attempt to copy * it */ - ret = (unsigned long) -ENODEV; + ret = -ENODEV; if (!(capabilities & BDI_CAP_MAP_COPY)) goto error_just_free; -- cgit v1.2.3-70-g09d2 From fa25c503dfa203b921199ea42c0046c89f2ed49f Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:11:28 -0700 Subject: mm: per-node vmstat: show proper vmstats commit 2ac390370a ("writeback: add /sys/devices/system/node//vmstat") added vmstat entry. But strangely it only show nr_written and nr_dirtied. # cat /sys/devices/system/node/node20/vmstat nr_written 0 nr_dirtied 0 Of course, It's not adequate. With this patch, the vmstat show all vm stastics as /proc/vmstat. # cat /sys/devices/system/node/node0/vmstat nr_free_pages 899224 nr_inactive_anon 201 nr_active_anon 17380 nr_inactive_file 31572 nr_active_file 28277 nr_unevictable 0 nr_mlock 0 nr_anon_pages 17321 nr_mapped 8640 nr_file_pages 60107 nr_dirty 33 nr_writeback 0 nr_slab_reclaimable 6850 nr_slab_unreclaimable 7604 nr_page_table_pages 3105 nr_kernel_stack 175 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 nr_writeback_temp 0 nr_isolated_anon 0 nr_isolated_file 0 nr_shmem 260 nr_dirtied 1050 nr_written 938 numa_hit 962872 numa_miss 0 numa_foreign 0 numa_interleave 8617 numa_local 962872 numa_other 0 nr_anon_transparent_hugepages 0 [akpm@linux-foundation.org: no externs in .c files] Signed-off-by: KOSAKI Motohiro Cc: Michael Rubin Cc: Wu Fengguang Acked-by: David Rientjes Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/node.c | 14 ++- include/linux/vmstat.h | 4 +- mm/vmstat.c | 261 +++++++++++++++++++++++++------------------------ 3 files changed, 144 insertions(+), 135 deletions(-) (limited to 'mm') diff --git a/drivers/base/node.c b/drivers/base/node.c index b3b72d64e80..793f796c4da 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -179,11 +180,14 @@ static ssize_t node_read_vmstat(struct sys_device *dev, struct sysdev_attribute *attr, char *buf) { int nid = dev->id; - return sprintf(buf, - "nr_written %lu\n" - "nr_dirtied %lu\n", - node_page_state(nid, NR_WRITTEN), - node_page_state(nid, NR_DIRTIED)); + int i; + int n = 0; + + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) + n += sprintf(buf+n, "%s %lu\n", vmstat_text[i], + node_page_state(nid, i)); + + return n; } static SYSDEV_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL); diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 2b3831b58aa..e73d1030f2f 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -313,6 +313,8 @@ static inline void __dec_zone_page_state(struct page *page, #define set_pgdat_percpu_threshold(pgdat, callback) { } static inline void refresh_cpu_vm_stats(int cpu) { } -#endif +#endif /* CONFIG_SMP */ + +extern const char * const vmstat_text[]; #endif /* _LINUX_VMSTAT_H */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 897ea9e8823..209546a8bdd 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -659,6 +659,138 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat, } #endif +#if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS) +#ifdef CONFIG_ZONE_DMA +#define TEXT_FOR_DMA(xx) xx "_dma", +#else +#define TEXT_FOR_DMA(xx) +#endif + +#ifdef CONFIG_ZONE_DMA32 +#define TEXT_FOR_DMA32(xx) xx "_dma32", +#else +#define TEXT_FOR_DMA32(xx) +#endif + +#ifdef CONFIG_HIGHMEM +#define TEXT_FOR_HIGHMEM(xx) xx "_high", +#else +#define TEXT_FOR_HIGHMEM(xx) +#endif + +#define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \ + TEXT_FOR_HIGHMEM(xx) xx "_movable", + +const char * const vmstat_text[] = { + /* Zoned VM counters */ + "nr_free_pages", + "nr_inactive_anon", + "nr_active_anon", + "nr_inactive_file", + "nr_active_file", + "nr_unevictable", + "nr_mlock", + "nr_anon_pages", + "nr_mapped", + "nr_file_pages", + "nr_dirty", + "nr_writeback", + "nr_slab_reclaimable", + "nr_slab_unreclaimable", + "nr_page_table_pages", + "nr_kernel_stack", + "nr_unstable", + "nr_bounce", + "nr_vmscan_write", + "nr_writeback_temp", + "nr_isolated_anon", + "nr_isolated_file", + "nr_shmem", + "nr_dirtied", + "nr_written", + +#ifdef CONFIG_NUMA + "numa_hit", + "numa_miss", + "numa_foreign", + "numa_interleave", + "numa_local", + "numa_other", +#endif + "nr_anon_transparent_hugepages", + "nr_dirty_threshold", + "nr_dirty_background_threshold", + +#ifdef CONFIG_VM_EVENT_COUNTERS + "pgpgin", + "pgpgout", + "pswpin", + "pswpout", + + TEXTS_FOR_ZONES("pgalloc") + + "pgfree", + "pgactivate", + "pgdeactivate", + + "pgfault", + "pgmajfault", + + TEXTS_FOR_ZONES("pgrefill") + TEXTS_FOR_ZONES("pgsteal") + TEXTS_FOR_ZONES("pgscan_kswapd") + TEXTS_FOR_ZONES("pgscan_direct") + +#ifdef CONFIG_NUMA + "zone_reclaim_failed", +#endif + "pginodesteal", + "slabs_scanned", + "kswapd_steal", + "kswapd_inodesteal", + "kswapd_low_wmark_hit_quickly", + "kswapd_high_wmark_hit_quickly", + "kswapd_skip_congestion_wait", + "pageoutrun", + "allocstall", + + "pgrotated", + +#ifdef CONFIG_COMPACTION + "compact_blocks_moved", + "compact_pages_moved", + "compact_pagemigrate_failed", + "compact_stall", + "compact_fail", + "compact_success", +#endif + +#ifdef CONFIG_HUGETLB_PAGE + "htlb_buddy_alloc_success", + "htlb_buddy_alloc_fail", +#endif + "unevictable_pgs_culled", + "unevictable_pgs_scanned", + "unevictable_pgs_rescued", + "unevictable_pgs_mlocked", + "unevictable_pgs_munlocked", + "unevictable_pgs_cleared", + "unevictable_pgs_stranded", + "unevictable_pgs_mlockfreed", + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + "thp_fault_alloc", + "thp_fault_fallback", + "thp_collapse_alloc", + "thp_collapse_alloc_failed", + "thp_split", +#endif + +#endif /* CONFIG_VM_EVENTS_COUNTERS */ +}; +#endif /* CONFIG_PROC_FS || CONFIG_SYSFS */ + + #ifdef CONFIG_PROC_FS static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, struct zone *zone) @@ -831,135 +963,6 @@ static const struct file_operations pagetypeinfo_file_ops = { .release = seq_release, }; -#ifdef CONFIG_ZONE_DMA -#define TEXT_FOR_DMA(xx) xx "_dma", -#else -#define TEXT_FOR_DMA(xx) -#endif - -#ifdef CONFIG_ZONE_DMA32 -#define TEXT_FOR_DMA32(xx) xx "_dma32", -#else -#define TEXT_FOR_DMA32(xx) -#endif - -#ifdef CONFIG_HIGHMEM -#define TEXT_FOR_HIGHMEM(xx) xx "_high", -#else -#define TEXT_FOR_HIGHMEM(xx) -#endif - -#define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \ - TEXT_FOR_HIGHMEM(xx) xx "_movable", - -static const char * const vmstat_text[] = { - /* Zoned VM counters */ - "nr_free_pages", - "nr_inactive_anon", - "nr_active_anon", - "nr_inactive_file", - "nr_active_file", - "nr_unevictable", - "nr_mlock", - "nr_anon_pages", - "nr_mapped", - "nr_file_pages", - "nr_dirty", - "nr_writeback", - "nr_slab_reclaimable", - "nr_slab_unreclaimable", - "nr_page_table_pages", - "nr_kernel_stack", - "nr_unstable", - "nr_bounce", - "nr_vmscan_write", - "nr_writeback_temp", - "nr_isolated_anon", - "nr_isolated_file", - "nr_shmem", - "nr_dirtied", - "nr_written", - -#ifdef CONFIG_NUMA - "numa_hit", - "numa_miss", - "numa_foreign", - "numa_interleave", - "numa_local", - "numa_other", -#endif - "nr_anon_transparent_hugepages", - "nr_dirty_threshold", - "nr_dirty_background_threshold", - -#ifdef CONFIG_VM_EVENT_COUNTERS - "pgpgin", - "pgpgout", - "pswpin", - "pswpout", - - TEXTS_FOR_ZONES("pgalloc") - - "pgfree", - "pgactivate", - "pgdeactivate", - - "pgfault", - "pgmajfault", - - TEXTS_FOR_ZONES("pgrefill") - TEXTS_FOR_ZONES("pgsteal") - TEXTS_FOR_ZONES("pgscan_kswapd") - TEXTS_FOR_ZONES("pgscan_direct") - -#ifdef CONFIG_NUMA - "zone_reclaim_failed", -#endif - "pginodesteal", - "slabs_scanned", - "kswapd_steal", - "kswapd_inodesteal", - "kswapd_low_wmark_hit_quickly", - "kswapd_high_wmark_hit_quickly", - "kswapd_skip_congestion_wait", - "pageoutrun", - "allocstall", - - "pgrotated", - -#ifdef CONFIG_COMPACTION - "compact_blocks_moved", - "compact_pages_moved", - "compact_pagemigrate_failed", - "compact_stall", - "compact_fail", - "compact_success", -#endif - -#ifdef CONFIG_HUGETLB_PAGE - "htlb_buddy_alloc_success", - "htlb_buddy_alloc_fail", -#endif - "unevictable_pgs_culled", - "unevictable_pgs_scanned", - "unevictable_pgs_rescued", - "unevictable_pgs_mlocked", - "unevictable_pgs_munlocked", - "unevictable_pgs_cleared", - "unevictable_pgs_stranded", - "unevictable_pgs_mlockfreed", - -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - "thp_fault_alloc", - "thp_fault_fallback", - "thp_collapse_alloc", - "thp_collapse_alloc_failed", - "thp_split", -#endif - -#endif /* CONFIG_VM_EVENTS_COUNTERS */ -}; - static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, struct zone *zone) { -- cgit v1.2.3-70-g09d2 From f62e00cc3a00bfbd394a79fc22b334c31f91bd5f Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:11:29 -0700 Subject: mm: introduce wait_on_page_locked_killable() commit 2687a356 ("Add lock_page_killable") introduced killable lock_page(). Similarly this patch introdues killable wait_on_page_locked(). Signed-off-by: KOSAKI Motohiro Acked-by: KAMEZAWA Hiroyuki Reviewed-by: Minchan Kim Cc: Matthew Wilcox Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 9 +++++++++ mm/filemap.c | 11 +++++++++++ 2 files changed, 20 insertions(+) (limited to 'mm') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c1195065264..ea268080380 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -357,6 +357,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm, */ extern void wait_on_page_bit(struct page *page, int bit_nr); +extern int wait_on_page_bit_killable(struct page *page, int bit_nr); + +static inline int wait_on_page_locked_killable(struct page *page) +{ + if (PageLocked(page)) + return wait_on_page_bit_killable(page, PG_locked); + return 0; +} + /* * Wait for a page to be unlocked. * diff --git a/mm/filemap.c b/mm/filemap.c index c641edf553a..dea8a38bb2b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -562,6 +562,17 @@ void wait_on_page_bit(struct page *page, int bit_nr) } EXPORT_SYMBOL(wait_on_page_bit); +int wait_on_page_bit_killable(struct page *page, int bit_nr) +{ + DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); + + if (!test_bit(bit_nr, &page->flags)) + return 0; + + return __wait_on_bit(page_waitqueue(page), &wait, + sleep_on_page_killable, TASK_KILLABLE); +} + /** * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue * @page: Page defining the wait queue of interest -- cgit v1.2.3-70-g09d2 From 37b23e0525d393d48a7d59f870b3bc061a30ccdb Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:11:30 -0700 Subject: x86,mm: make pagefault killable When an oom killing occurs, almost all processes are getting stuck at the following two points. 1) __alloc_pages_nodemask 2) __lock_page_or_retry 1) is not very problematic because TIF_MEMDIE leads to an allocation failure and getting out from page allocator. 2) is more problematic. In an OOM situation, zones typically don't have page cache at all and memory starvation might lead to greatly reduced IO performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly, meaning that a fork bomb may create new process quickly rather than the oom-killer killing it. Then, the system may become livelocked. This patch makes the pagefault interruptible by SIGKILL. Signed-off-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Cc: Minchan Kim Cc: Matthew Wilcox Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/mm/fault.c | 12 +++++++++++- include/linux/mm.h | 1 + mm/filemap.c | 31 ++++++++++++++++++++++++------- 3 files changed, 36 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index bcb394dfbb3..f7a2a054a3c 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -965,7 +965,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | (write ? FAULT_FLAG_WRITE : 0); tsk = current; @@ -1138,6 +1138,16 @@ good_area: return; } + /* + * Pagefault was interrupted by SIGKILL. We have no reason to + * continue pagefault. + */ + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) { + if (!(error_code & PF_USER)) + no_context(regs, error_code, address); + return; + } + /* * Major/minor page fault accounting is only done on the * initial attempt. If we go through a retry, it is extremely diff --git a/include/linux/mm.h b/include/linux/mm.h index 1746f67c33d..57d3d5fade1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -153,6 +153,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */ #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ +#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/mm/filemap.c b/mm/filemap.c index dea8a38bb2b..8144f87dcbb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -654,15 +654,32 @@ EXPORT_SYMBOL_GPL(__lock_page_killable); int __lock_page_or_retry(struct page *page, struct mm_struct *mm, unsigned int flags) { - if (!(flags & FAULT_FLAG_ALLOW_RETRY)) { - __lock_page(page); - return 1; - } else { - if (!(flags & FAULT_FLAG_RETRY_NOWAIT)) { - up_read(&mm->mmap_sem); + if (flags & FAULT_FLAG_ALLOW_RETRY) { + /* + * CAUTION! In this case, mmap_sem is not released + * even though return 0. + */ + if (flags & FAULT_FLAG_RETRY_NOWAIT) + return 0; + + up_read(&mm->mmap_sem); + if (flags & FAULT_FLAG_KILLABLE) + wait_on_page_locked_killable(page); + else wait_on_page_locked(page); - } return 0; + } else { + if (flags & FAULT_FLAG_KILLABLE) { + int ret; + + ret = __lock_page_killable(page); + if (ret) { + up_read(&mm->mmap_sem); + return 0; + } + } else + __lock_page(page); + return 1; } } -- cgit v1.2.3-70-g09d2 From 839a4fcc8af7412be2efd11f0bd0504757f79f08 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:11:31 -0700 Subject: mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() should be __meminit. Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens") introduced invalid section references. Now, setup_per_zone_inactive_ratio() is marked __init and then it can't be referenced from memory hotplug code. This patch marks it as __meminit and also marks caller as __ref. Signed-off-by: KOSAKI Motohiro Reviewed-by: Minchan Kim Cc: Yasunori Goto Cc: Rik van Riel Cc: Johannes Weiner Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 4 ++-- mm/page_alloc.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 9ca1d604f7c..2c4edc459fb 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -400,7 +400,7 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages, } -int online_pages(unsigned long pfn, unsigned long nr_pages) +int __ref online_pages(unsigned long pfn, unsigned long nr_pages) { unsigned long onlined_pages = 0; struct zone *zone; @@ -795,7 +795,7 @@ check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn) return offlined; } -static int offline_pages(unsigned long start_pfn, +static int __ref offline_pages(unsigned long start_pfn, unsigned long end_pfn, unsigned long timeout) { unsigned long pfn, nr_pages, expire; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eb19ba8396c..56d0be36be9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5094,7 +5094,7 @@ void setup_per_zone_wmarks(void) * 1TB 101 10GB * 10TB 320 32GB */ -void calculate_zone_inactive_ratio(struct zone *zone) +void __meminit calculate_zone_inactive_ratio(struct zone *zone) { unsigned int gb, ratio; @@ -5108,7 +5108,7 @@ void calculate_zone_inactive_ratio(struct zone *zone) zone->inactive_ratio = ratio; } -static void __init setup_per_zone_inactive_ratio(void) +static void __meminit setup_per_zone_inactive_ratio(void) { struct zone *zone; -- cgit v1.2.3-70-g09d2 From 1b79acc91115ba47e744b70bb166b77bd94f5855 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:11:32 -0700 Subject: mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occurs Currently, memory hotplug calls setup_per_zone_wmarks() and calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve(). It means the number of reserved pages aren't updated even if memory hot plug occur. This patch fixes it. Signed-off-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Mel Gorman Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- mm/memory_hotplug.c | 9 +++++---- mm/page_alloc.c | 4 ++-- 3 files changed, 8 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/include/linux/mm.h b/include/linux/mm.h index 57d3d5fade1..e173cd297d8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1381,7 +1381,7 @@ extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, enum memmap_context); extern void setup_per_zone_wmarks(void); -extern void calculate_zone_inactive_ratio(struct zone *zone); +extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); extern void __init mmap_init(void); extern void show_mem(unsigned int flags); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2c4edc459fb..59ac18fefd6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -459,8 +459,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages) zone_pcp_update(zone); mutex_unlock(&zonelists_mutex); - setup_per_zone_wmarks(); - calculate_zone_inactive_ratio(zone); + + init_per_zone_wmark_min(); + if (onlined_pages) { kswapd_run(zone_to_nid(zone)); node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); @@ -893,8 +894,8 @@ repeat: zone->zone_pgdat->node_present_pages -= offlined_pages; totalram_pages -= offlined_pages; - setup_per_zone_wmarks(); - calculate_zone_inactive_ratio(zone); + init_per_zone_wmark_min(); + if (!node_present_pages(node)) { node_clear_state(node, N_HIGH_MEMORY); kswapd_stop(node); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 56d0be36be9..e133cea3693 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5094,7 +5094,7 @@ void setup_per_zone_wmarks(void) * 1TB 101 10GB * 10TB 320 32GB */ -void __meminit calculate_zone_inactive_ratio(struct zone *zone) +static void __meminit calculate_zone_inactive_ratio(struct zone *zone) { unsigned int gb, ratio; @@ -5140,7 +5140,7 @@ static void __meminit setup_per_zone_inactive_ratio(void) * 8192MB: 11584k * 16384MB: 16384k */ -static int __init init_per_zone_wmark_min(void) +int __meminit init_per_zone_wmark_min(void) { unsigned long lowmem_kbytes; -- cgit v1.2.3-70-g09d2 From a6cccdc36c966e51fd969560d870cfd37afbfa9c Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:11:33 -0700 Subject: mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [akpm@linux-foundation.org: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Mel Gorman Acked-by: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vmstat.h | 3 +++ mm/page_alloc.c | 2 ++ mm/vmstat.c | 3 +-- 3 files changed, 6 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index e73d1030f2f..51359837511 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -261,6 +261,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); void refresh_cpu_vm_stats(int); +void refresh_zone_stat_thresholds(void); int calculate_pressure_threshold(struct zone *zone); int calculate_normal_threshold(struct zone *zone); @@ -313,6 +314,8 @@ static inline void __dec_zone_page_state(struct page *page, #define set_pgdat_percpu_threshold(pgdat, callback) { } static inline void refresh_cpu_vm_stats(int cpu) { } +static inline void refresh_zone_stat_thresholds(void) { } + #endif /* CONFIG_SMP */ extern const char * const vmstat_text[]; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e133cea3693..77773329aa7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -5152,6 +5153,7 @@ int __meminit init_per_zone_wmark_min(void) if (min_free_kbytes > 65536) min_free_kbytes = 65536; setup_per_zone_wmarks(); + refresh_zone_stat_thresholds(); setup_per_zone_lowmem_reserve(); setup_per_zone_inactive_ratio(); return 0; diff --git a/mm/vmstat.c b/mm/vmstat.c index 209546a8bdd..20c18b7694b 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -157,7 +157,7 @@ int calculate_normal_threshold(struct zone *zone) /* * Refresh the thresholds for each zone. */ -static void refresh_zone_stat_thresholds(void) +void refresh_zone_stat_thresholds(void) { struct zone *zone; int cpu; @@ -1201,7 +1201,6 @@ static int __init setup_vmstat(void) #ifdef CONFIG_SMP int cpu; - refresh_zone_stat_thresholds(); register_cpu_notifier(&vmstat_notifier); for_each_online_cpu(cpu) -- cgit v1.2.3-70-g09d2 From c6a140bf164829769499b5e50d380893da39b29e Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Tue, 24 May 2011 17:11:38 -0700 Subject: mm/compaction: reverse the change that forbade sync migraton with __GFP_NO_KSWAPD It's uncertain this has been beneficial, so it's safer to undo it. All other compaction users would still go in synchronous mode if a first attempt at async compaction failed. Hopefully we don't need to force special behavior for THP (which is the only __GFP_NO_KSWAPD user so far and it's the easier to exercise and to be noticeable). This also make __GFP_NO_KSWAPD return to its original strict semantics specific to bypass kswapd, as THP allocations have khugepaged for the async THP allocations/compactions. Signed-off-by: Andrea Arcangeli Cc: Alex Villacis Lasso Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 77773329aa7..83eaa2eb72f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2107,7 +2107,7 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); + sync_migration = true; /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, -- cgit v1.2.3-70-g09d2 From 72788c385604523422592249c19cba0187021e9b Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 24 May 2011 17:11:40 -0700 Subject: oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes Reviewed-by: KOSAKI Motohiro Reviewed-by: Minchan Kim Cc: Hugh Dickins Cc: Izik Eidus Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/oom.h | 2 ++ include/linux/sched.h | 1 - mm/ksm.c | 7 +++++-- mm/oom_kill.c | 36 +++++++++++++++++++++++++++--------- mm/swapfile.c | 6 ++++-- 5 files changed, 38 insertions(+), 14 deletions(-) (limited to 'mm') diff --git a/include/linux/oom.h b/include/linux/oom.h index 5e3aa8311c5..4952fb874ad 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -40,6 +40,8 @@ enum oom_constraint { CONSTRAINT_MEMCG, }; +extern int test_set_oom_score_adj(int new_val); + extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem, const nodemask_t *nodemask, unsigned long totalpages); extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); diff --git a/include/linux/sched.h b/include/linux/sched.h index aaf71e08222..44b8faaac7c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1753,7 +1753,6 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ #define PF_KSWAPD 0x00040000 /* I am kswapd */ -#define PF_OOM_ORIGIN 0x00080000 /* Allocating much memory to others */ #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ diff --git a/mm/ksm.c b/mm/ksm.c index 942dfc73a2f..d708b3ef226 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include "internal.h" @@ -1894,9 +1895,11 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, if (ksm_run != flags) { ksm_run = flags; if (flags & KSM_RUN_UNMERGE) { - current->flags |= PF_OOM_ORIGIN; + int oom_score_adj; + + oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); err = unmerge_and_remove_all_rmap_items(); - current->flags &= ~PF_OOM_ORIGIN; + test_set_oom_score_adj(oom_score_adj); if (err) { ksm_run = KSM_RUN_STOP; count = err; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index f52e85c80e8..e4b0991ca35 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -38,6 +38,33 @@ int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; static DEFINE_SPINLOCK(zone_scan_lock); +/** + * test_set_oom_score_adj() - set current's oom_score_adj and return old value + * @new_val: new oom_score_adj value + * + * Sets the oom_score_adj value for current to @new_val with proper + * synchronization and returns the old value. Usually used to temporarily + * set a value, save the old value in the caller, and then reinstate it later. + */ +int test_set_oom_score_adj(int new_val) +{ + struct sighand_struct *sighand = current->sighand; + int old_val; + + spin_lock_irq(&sighand->siglock); + old_val = current->signal->oom_score_adj; + if (new_val != old_val) { + if (new_val == OOM_SCORE_ADJ_MIN) + atomic_inc(¤t->mm->oom_disable_count); + else if (old_val == OOM_SCORE_ADJ_MIN) + atomic_dec(¤t->mm->oom_disable_count); + current->signal->oom_score_adj = new_val; + } + spin_unlock_irq(&sighand->siglock); + + return old_val; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -154,15 +181,6 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem, return 0; } - /* - * When the PF_OOM_ORIGIN bit is set, it indicates the task should have - * priority for oom killing. - */ - if (p->flags & PF_OOM_ORIGIN) { - task_unlock(p); - return 1000; - } - /* * The memory controller may have a limit of 0 bytes, so avoid a divide * by zero, if necessary. diff --git a/mm/swapfile.c b/mm/swapfile.c index 8c6b3ce38f0..d537d29e9b7 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -1555,6 +1556,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) struct address_space *mapping; struct inode *inode; char *pathname; + int oom_score_adj; int i, type, prev; int err; @@ -1613,9 +1615,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) p->flags &= ~SWP_WRITEOK; spin_unlock(&swap_lock); - current->flags |= PF_OOM_ORIGIN; + oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); err = try_to_unuse(type); - current->flags &= ~PF_OOM_ORIGIN; + test_set_oom_score_adj(oom_score_adj); if (err) { /* -- cgit v1.2.3-70-g09d2 From 248ac0e1943ad1796393d281b096184719eb3f97 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 24 May 2011 17:11:43 -0700 Subject: mm/vmalloc: remove guard page from between vmap blocks The vmap allocator is used to, among other things, allocate per-cpu vmap blocks, where each vmap block is naturally aligned to its own size. Obviously, leaving a guard page after each vmap area forbids packing vmap blocks efficiently and can make the kernel run out of possible vmap blocks long before overall vmap space is exhausted. The new interface to map a user-supplied page array into linear vmalloc space (vm_map_ram) insists on allocating from a vmap block (instead of falling back to a custom area) when the area size is below a certain threshold. With heavy users of this interface (e.g. XFS) and limited vmalloc space on 32-bit, vmap block exhaustion is a real problem. Remove the guard page from the core vmap allocator. vmalloc and the old vmap interface enforce a guard page on their own at a higher level. Note that without this patch, we had accidental guard pages after those vm_map_ram areas that happened to be at the end of a vmap block, but not between every area. This patch removes this accidental guard page only. If we want guard pages after every vm_map_ram area, this should be done separately. And just like with vmalloc and the old interface on a different level, not in the core allocator. Mel pointed out: "If necessary, the guard page could be reintroduced as a debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems reasonable." Signed-off-by: Johannes Weiner Cc: Nick Piggin Cc: Dave Chinner Acked-by: Mel Gorman Cc: Hugh Dickins Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 5d6030235d7..4581ddcdda5 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -375,7 +375,7 @@ nocache: /* find starting point for our search */ if (free_vmap_cache) { first = rb_entry(free_vmap_cache, struct vmap_area, rb_node); - addr = ALIGN(first->va_end + PAGE_SIZE, align); + addr = ALIGN(first->va_end, align); if (addr < vstart) goto nocache; if (addr + size - 1 < addr) @@ -406,10 +406,10 @@ nocache: } /* from the starting point, walk areas until a suitable hole is found */ - while (addr + size >= first->va_start && addr + size <= vend) { + while (addr + size > first->va_start && addr + size <= vend) { if (addr + cached_hole_size < first->va_start) cached_hole_size = first->va_start - addr; - addr = ALIGN(first->va_end + PAGE_SIZE, align); + addr = ALIGN(first->va_end, align); if (addr + size - 1 < addr) goto overflow; -- cgit v1.2.3-70-g09d2 From d05f3169c0fbca16132ec7c2be71685c6de638b5 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 24 May 2011 17:11:44 -0700 Subject: mm: make expand_downwards() symmetrical with expand_upwards() Currently we have expand_upwards exported while expand_downwards is accessible only via expand_stack or expand_stack_downwards. check_stack_guard_page is a nice example of the asymmetry. It uses expand_stack for VM_GROWSDOWN while expand_upwards is called for VM_GROWSUP case. Let's clean this up by exporting both functions and make those names consistent. Let's use expand_{upwards,downwards} because expanding doesn't always involve stack manipulation (an example is ia64_do_page_fault which uses expand_upwards for registers backing store expansion). expand_downwards has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards version in the early process initialization phase for growsup configuration. Signed-off-by: Michal Hocko Acked-by: Hugh Dickins Cc: James Bottomley Cc: "Luck, Tony" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/exec.c | 2 +- include/linux/mm.h | 8 +++++--- mm/memory.c | 2 +- mm/mmap.c | 7 +------ 4 files changed, 8 insertions(+), 11 deletions(-) (limited to 'mm') diff --git a/fs/exec.c b/fs/exec.c index c1cf372f17a..e276d5e0abb 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -200,7 +200,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, #ifdef CONFIG_STACK_GROWSUP if (write) { - ret = expand_stack_downwards(bprm->vma, pos); + ret = expand_downwards(bprm->vma, pos); if (ret < 0) return NULL; } diff --git a/include/linux/mm.h b/include/linux/mm.h index e173cd297d8..d2948af126c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1518,15 +1518,17 @@ unsigned long ra_submit(struct file_ra_state *ra, struct address_space *mapping, struct file *filp); -/* Do stack extension */ +/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); + +/* CONFIG_STACK_GROWSUP still needs to to grow downwards at some places */ +extern int expand_downwards(struct vm_area_struct *vma, + unsigned long address); #if VM_GROWSUP extern int expand_upwards(struct vm_area_struct *vma, unsigned long address); #else #define expand_upwards(vma, address) do { } while (0) #endif -extern int expand_stack_downwards(struct vm_area_struct *vma, - unsigned long address); /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr); diff --git a/mm/memory.c b/mm/memory.c index 61e66f02656..4c6ea10f3d1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2966,7 +2966,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo if (prev && prev->vm_end == address) return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM; - expand_stack(vma, address - PAGE_SIZE); + expand_downwards(vma, address - PAGE_SIZE); } if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) { struct vm_area_struct *next = vma->vm_next; diff --git a/mm/mmap.c b/mm/mmap.c index e76f8d75288..adb12527fd0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1774,7 +1774,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) /* * vma is the first one with address < vma->vm_start. Have to extend vma. */ -static int expand_downwards(struct vm_area_struct *vma, +int expand_downwards(struct vm_area_struct *vma, unsigned long address) { int error; @@ -1821,11 +1821,6 @@ static int expand_downwards(struct vm_area_struct *vma, return error; } -int expand_stack_downwards(struct vm_area_struct *vma, unsigned long address) -{ - return expand_downwards(vma, address); -} - #ifdef CONFIG_STACK_GROWSUP int expand_stack(struct vm_area_struct *vma, unsigned long address) { -- cgit v1.2.3-70-g09d2 From d16dfc550f5326a4000f3322582a7c05dec91d7a Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:11:45 -0700 Subject: mm: mmu_gather rework Rework the existing mmu_gather infrastructure. The direct purpose of these patches was to allow preemptible mmu_gather, but even without that I think these patches provide an improvement to the status quo. The first 9 patches rework the mmu_gather infrastructure. For review purpose I've split them into generic and per-arch patches with the last of those a generic cleanup. The next patch provides generic RCU page-table freeing, and the followup is a patch converting s390 to use this. I've also got 4 patches from DaveM lined up (not included in this series) that uses this to implement gup_fast() for sparc64. Then there is one patch that extends the generic mmu_gather batching. After that follow the mm preemptibility patches, these make part of the mm a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to mutexes which together with the mmu_gather rework makes mmu_gather preemptible as well. Making i_mmap_lock a mutex also enables a clean-up of the truncate code. This also allows for preemptible mmu_notifiers, something that XPMEM I think wants. Furthermore, it removes the new and universially detested unmap_mutex. This patch: Remove the first obstacle towards a fully preemptible mmu_gather. The current scheme assumes mmu_gather is always done with preemption disabled and uses per-cpu storage for the page batches. Change this to try and allocate a page for batching and in case of failure, use a small on-stack array to make some progress. Preemptible mmu_gather is desired in general and usable once i_mmap_lock becomes a mutex. Doing it before the mutex conversion saves us from having to rework the code by moving the mmu_gather bits inside the pte_lock. Also avoid flushing the tlb batches from under the pte lock, this is useful even without the i_mmap_lock conversion as it significantly reduces pte lock hold times. [akpm@linux-foundation.org: fix comment tpyo] Signed-off-by: Peter Zijlstra Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Hugh Dickins Acked-by: Mel Gorman Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/exec.c | 10 ++--- include/asm-generic/tlb.h | 96 ++++++++++++++++++++++++++++++++++------------- include/linux/mm.h | 2 +- mm/memory.c | 46 +++++++++++------------ mm/mmap.c | 18 ++++----- 5 files changed, 107 insertions(+), 65 deletions(-) (limited to 'mm') diff --git a/fs/exec.c b/fs/exec.c index e276d5e0abb..936f5776655 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -600,7 +600,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) unsigned long length = old_end - old_start; unsigned long new_start = old_start - shift; unsigned long new_end = old_end - shift; - struct mmu_gather *tlb; + struct mmu_gather tlb; BUG_ON(new_start > new_end); @@ -626,12 +626,12 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) return -ENOMEM; lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); + tlb_gather_mmu(&tlb, mm, 0); if (new_end > old_start) { /* * when the old and new regions overlap clear from new_end. */ - free_pgd_range(tlb, new_end, old_end, new_end, + free_pgd_range(&tlb, new_end, old_end, new_end, vma->vm_next ? vma->vm_next->vm_start : 0); } else { /* @@ -640,10 +640,10 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) * have constraints on va-space that make this illegal (IA64) - * for the others its just a little faster. */ - free_pgd_range(tlb, old_start, old_end, new_end, + free_pgd_range(&tlb, old_start, old_end, new_end, vma->vm_next ? vma->vm_next->vm_start : 0); } - tlb_finish_mmu(tlb, new_end, old_end); + tlb_finish_mmu(&tlb, new_end, old_end); /* * Shrink the vma to just the new range. Always succeeds. diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index e43f9766259..2d3547c8423 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -5,6 +5,8 @@ * Copyright 2001 Red Hat, Inc. * Based on code from mm/memory.c Copyright Linus Torvalds and others. * + * Copyright 2011 Red Hat, Inc., Peter Zijlstra + * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version @@ -22,51 +24,71 @@ * and page free order so much.. */ #ifdef CONFIG_SMP - #ifdef ARCH_FREE_PTR_NR - #define FREE_PTR_NR ARCH_FREE_PTR_NR - #else - #define FREE_PTE_NR 506 - #endif #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U) #else - #define FREE_PTE_NR 1 #define tlb_fast_mode(tlb) 1 #endif +/* + * If we can't allocate a page to make a big batch of page pointers + * to work on, then just handle a few from the on-stack structure. + */ +#define MMU_GATHER_BUNDLE 8 + /* struct mmu_gather is an opaque type used by the mm code for passing around * any data needed by arch specific code for tlb_remove_page. */ struct mmu_gather { struct mm_struct *mm; unsigned int nr; /* set to ~0U means fast mode */ + unsigned int max; /* nr < max */ unsigned int need_flush;/* Really unmapped some ptes? */ unsigned int fullmm; /* non-zero means full mm flush */ - struct page * pages[FREE_PTE_NR]; +#ifdef HAVE_ARCH_MMU_GATHER + struct arch_mmu_gather arch; +#endif + struct page **pages; + struct page *local[MMU_GATHER_BUNDLE]; }; -/* Users of the generic TLB shootdown code must declare this storage space. */ -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers); +static inline void __tlb_alloc_page(struct mmu_gather *tlb) +{ + unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0); + + if (addr) { + tlb->pages = (void *)addr; + tlb->max = PAGE_SIZE / sizeof(struct page *); + } +} /* tlb_gather_mmu - * Return a pointer to an initialized struct mmu_gather. + * Called to initialize an (on-stack) mmu_gather structure for page-table + * tear-down from @mm. The @fullmm argument is used when @mm is without + * users and we're going to destroy the full address space (exit/execve). */ -static inline struct mmu_gather * -tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush) +static inline void +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) { - struct mmu_gather *tlb = &get_cpu_var(mmu_gathers); - tlb->mm = mm; - /* Use fast mode if only one CPU is online */ - tlb->nr = num_online_cpus() > 1 ? 0U : ~0U; + tlb->max = ARRAY_SIZE(tlb->local); + tlb->pages = tlb->local; + + if (num_online_cpus() > 1) { + tlb->nr = 0; + __tlb_alloc_page(tlb); + } else /* Use fast mode if only one CPU is online */ + tlb->nr = ~0U; - tlb->fullmm = full_mm_flush; + tlb->fullmm = fullmm; - return tlb; +#ifdef HAVE_ARCH_MMU_GATHER + tlb->arch = ARCH_MMU_GATHER_INIT; +#endif } static inline void -tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) +tlb_flush_mmu(struct mmu_gather *tlb) { if (!tlb->need_flush) return; @@ -75,6 +97,13 @@ tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) if (!tlb_fast_mode(tlb)) { free_pages_and_swap_cache(tlb->pages, tlb->nr); tlb->nr = 0; + /* + * If we are using the local on-stack array of pages for MMU + * gather, try allocating an off-stack array again as we have + * recently freed pages. + */ + if (tlb->pages == tlb->local) + __tlb_alloc_page(tlb); } } @@ -85,29 +114,42 @@ tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) static inline void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) { - tlb_flush_mmu(tlb, start, end); + tlb_flush_mmu(tlb); /* keep the page table cache within bounds */ check_pgt_cache(); - put_cpu_var(mmu_gathers); + if (tlb->pages != tlb->local) + free_pages((unsigned long)tlb->pages, 0); } -/* tlb_remove_page +/* __tlb_remove_page * Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while * handling the additional races in SMP caused by other CPUs caching valid - * mappings in their TLBs. + * mappings in their TLBs. Returns the number of free page slots left. + * When out of page slots we must call tlb_flush_mmu(). */ -static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) +static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) { tlb->need_flush = 1; if (tlb_fast_mode(tlb)) { free_page_and_swap_cache(page); - return; + return 1; /* avoid calling tlb_flush_mmu() */ } tlb->pages[tlb->nr++] = page; - if (tlb->nr >= FREE_PTE_NR) - tlb_flush_mmu(tlb, 0, 0); + VM_BUG_ON(tlb->nr > tlb->max); + + return tlb->max - tlb->nr; +} + +/* tlb_remove_page + * Similar to __tlb_remove_page but will call tlb_flush_mmu() itself when + * required. + */ +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) +{ + if (!__tlb_remove_page(tlb, page)) + tlb_flush_mmu(tlb); } /** diff --git a/include/linux/mm.h b/include/linux/mm.h index d2948af126c..ffcce9bf2b5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -906,7 +906,7 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size); unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, +unsigned long unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c index 4c6ea10f3d1..19b2d44de9f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -912,12 +912,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, long *zap_work, struct zap_details *details) { struct mm_struct *mm = tlb->mm; + int force_flush = 0; pte_t *pte; spinlock_t *ptl; int rss[NR_MM_COUNTERS]; init_rss_vec(rss); - +again: pte = pte_offset_map_lock(mm, pmd, addr, &ptl); arch_enter_lazy_mmu_mode(); do { @@ -974,7 +975,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, page_remove_rmap(page); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); - tlb_remove_page(tlb, page); + force_flush = !__tlb_remove_page(tlb, page); + if (force_flush) + break; continue; } /* @@ -1001,6 +1004,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); + /* + * mmu_gather ran out of room to batch pages, we break out of + * the PTE lock to avoid doing the potential expensive TLB invalidate + * and page-free while holding it. + */ + if (force_flush) { + force_flush = 0; + tlb_flush_mmu(tlb); + if (addr != end) + goto again; + } + return addr; } @@ -1121,17 +1136,14 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb, * ensure that any thus-far unmapped pages are flushed before unmap_vmas() * drops the lock and schedules. */ -unsigned long unmap_vmas(struct mmu_gather **tlbp, +unsigned long unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { long zap_work = ZAP_BLOCK_SIZE; - unsigned long tlb_start = 0; /* For tlb_finish_mmu */ - int tlb_start_valid = 0; unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; struct mm_struct *mm = vma->vm_mm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); @@ -1152,11 +1164,6 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp, untrack_pfn_vma(vma, 0, 0); while (start != end) { - if (!tlb_start_valid) { - tlb_start = start; - tlb_start_valid = 1; - } - if (unlikely(is_vm_hugetlb_page(vma))) { /* * It is undesirable to test vma->vm_file as it @@ -1177,7 +1184,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp, start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, &zap_work, details); if (zap_work > 0) { @@ -1185,19 +1192,13 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp, break; } - tlb_finish_mmu(*tlbp, tlb_start, start); - if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) { - *tlbp = NULL; + if (i_mmap_lock) goto out; - } cond_resched(); } - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); - tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } @@ -1217,16 +1218,15 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; + struct mmu_gather tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); + tlb_gather_mmu(&tlb, mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) - tlb_finish_mmu(tlb, address, end); + tlb_finish_mmu(&tlb, address, end); return end; } diff --git a/mm/mmap.c b/mm/mmap.c index adb12527fd0..40d49986e71 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1903,17 +1903,17 @@ static void unmap_region(struct mm_struct *mm, unsigned long start, unsigned long end) { struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; - struct mmu_gather *tlb; + struct mmu_gather tlb; unsigned long nr_accounted = 0; lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); + tlb_gather_mmu(&tlb, mm, 0); update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, - next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, + next ? next->vm_start : 0); + tlb_finish_mmu(&tlb, start, end); } /* @@ -2255,7 +2255,7 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. */ void exit_mmap(struct mm_struct *mm) { - struct mmu_gather *tlb; + struct mmu_gather tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; unsigned long end; @@ -2280,14 +2280,14 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); - tlb = tlb_gather_mmu(mm, 1); + tlb_gather_mmu(&tlb, mm, 1); /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0); - tlb_finish_mmu(tlb, 0, end); + free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); + tlb_finish_mmu(&tlb, 0, end); /* * Walk the list again, actually closing and freeing it, -- cgit v1.2.3-70-g09d2 From 267239116987d64850ad2037d8e0f3071dc3b5ce Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:00 -0700 Subject: mm, powerpc: move the RCU page-table freeing into generic code In case other architectures require RCU freed page-tables to implement gup_fast() and software filled hashes and similar things, provide the means to do so by moving the logic into generic code. Signed-off-by: Peter Zijlstra Requested-by: David Miller Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: KAMEZAWA Hiroyuki Cc: Hugh Dickins Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/Kconfig | 3 ++ arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgalloc.h | 21 ++++++-- arch/powerpc/include/asm/tlb.h | 10 ---- arch/powerpc/mm/pgtable.c | 98 -------------------------------------- arch/powerpc/mm/tlb_hash32.c | 3 -- arch/powerpc/mm/tlb_hash64.c | 3 -- arch/powerpc/mm/tlb_nohash.c | 3 -- include/asm-generic/tlb.h | 56 ++++++++++++++++++++-- mm/memory.c | 77 ++++++++++++++++++++++++++++++ 10 files changed, 150 insertions(+), 125 deletions(-) (limited to 'mm') diff --git a/arch/Kconfig b/arch/Kconfig index 8d24bacaa61..26b0e2397a5 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -175,4 +175,7 @@ config HAVE_ARCH_JUMP_LABEL config HAVE_ARCH_MUTEX_CPU_RELAX bool +config HAVE_RCU_TABLE_FREE + bool + source "kernel/gcov/Kconfig" diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index a3128ca0fe1..423145a6f7b 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -140,6 +140,7 @@ config PPC select IRQ_PER_CPU select GENERIC_IRQ_SHOW select GENERIC_IRQ_SHOW_LEVEL + select HAVE_RCU_TABLE_FREE if SMP config EARLY_PRINTK bool diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h index df1b4cbb2e7..bf301ac62f3 100644 --- a/arch/powerpc/include/asm/pgalloc.h +++ b/arch/powerpc/include/asm/pgalloc.h @@ -31,14 +31,29 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage) #endif #ifdef CONFIG_SMP -extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift); -extern void pte_free_finish(struct mmu_gather *tlb); +struct mmu_gather; +extern void tlb_remove_table(struct mmu_gather *, void *); + +static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift) +{ + unsigned long pgf = (unsigned long)table; + BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE); + pgf |= shift; + tlb_remove_table(tlb, (void *)pgf); +} + +static inline void __tlb_remove_table(void *_table) +{ + void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE); + unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE; + + pgtable_free(table, shift); +} #else /* CONFIG_SMP */ static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift) { pgtable_free(table, shift); } -static inline void pte_free_finish(struct mmu_gather *tlb) { } #endif /* !CONFIG_SMP */ static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage, diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index 8f0ed7adcd1..e2b428b0f7b 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -28,16 +28,6 @@ #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) -#define HAVE_ARCH_MMU_GATHER 1 - -struct pte_freelist_batch; - -struct arch_mmu_gather { - struct pte_freelist_batch *batch; -}; - -#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, } - extern void tlb_flush(struct mmu_gather *tlb); /* Get the generic bits... */ diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 6e72788598f..af40c8768a7 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -33,104 +33,6 @@ #include "mmu_decl.h" -#ifdef CONFIG_SMP - -/* - * Handle batching of page table freeing on SMP. Page tables are - * queued up and send to be freed later by RCU in order to avoid - * freeing a page table page that is being walked without locks - */ - -static unsigned long pte_freelist_forced_free; - -struct pte_freelist_batch -{ - struct rcu_head rcu; - unsigned int index; - unsigned long tables[0]; -}; - -#define PTE_FREELIST_SIZE \ - ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \ - / sizeof(unsigned long)) - -static void pte_free_smp_sync(void *arg) -{ - /* Do nothing, just ensure we sync with all CPUs */ -} - -/* This is only called when we are critically out of memory - * (and fail to get a page in pte_free_tlb). - */ -static void pgtable_free_now(void *table, unsigned shift) -{ - pte_freelist_forced_free++; - - smp_call_function(pte_free_smp_sync, NULL, 1); - - pgtable_free(table, shift); -} - -static void pte_free_rcu_callback(struct rcu_head *head) -{ - struct pte_freelist_batch *batch = - container_of(head, struct pte_freelist_batch, rcu); - unsigned int i; - - for (i = 0; i < batch->index; i++) { - void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE); - unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE; - - pgtable_free(table, shift); - } - - free_page((unsigned long)batch); -} - -static void pte_free_submit(struct pte_freelist_batch *batch) -{ - call_rcu_sched(&batch->rcu, pte_free_rcu_callback); -} - -void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift) -{ - struct pte_freelist_batch **batchp = &tlb->arch.batch; - unsigned long pgf; - - if (atomic_read(&tlb->mm->mm_users) < 2) { - pgtable_free(table, shift); - return; - } - - if (*batchp == NULL) { - *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); - if (*batchp == NULL) { - pgtable_free_now(table, shift); - return; - } - (*batchp)->index = 0; - } - BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE); - pgf = (unsigned long)table | shift; - (*batchp)->tables[(*batchp)->index++] = pgf; - if ((*batchp)->index == PTE_FREELIST_SIZE) { - pte_free_submit(*batchp); - *batchp = NULL; - } -} - -void pte_free_finish(struct mmu_gather *tlb) -{ - struct pte_freelist_batch **batchp = &tlb->arch.batch; - - if (*batchp == NULL) - return; - pte_free_submit(*batchp); - *batchp = NULL; -} - -#endif /* CONFIG_SMP */ - static inline int is_exec_fault(void) { return current->thread.regs && TRAP(current->thread.regs) == 0x400; diff --git a/arch/powerpc/mm/tlb_hash32.c b/arch/powerpc/mm/tlb_hash32.c index d555cdb06bc..27b863c1494 100644 --- a/arch/powerpc/mm/tlb_hash32.c +++ b/arch/powerpc/mm/tlb_hash32.c @@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb) */ _tlbia(); } - - /* Push out batch of freed page tables */ - pte_free_finish(tlb); } /* diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c index 5c94ca34cd7..31f18207970 100644 --- a/arch/powerpc/mm/tlb_hash64.c +++ b/arch/powerpc/mm/tlb_hash64.c @@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb) __flush_tlb_pending(tlbbatch); put_cpu_var(ppc64_tlb_batch); - - /* Push out batch of freed page tables */ - pte_free_finish(tlb); } /** diff --git a/arch/powerpc/mm/tlb_nohash.c b/arch/powerpc/mm/tlb_nohash.c index 8eaf67d3204..0bdad3aecc6 100644 --- a/arch/powerpc/mm/tlb_nohash.c +++ b/arch/powerpc/mm/tlb_nohash.c @@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range); void tlb_flush(struct mmu_gather *tlb) { flush_tlb_mm(tlb->mm); - - /* Push out batch of freed page tables */ - pte_free_finish(tlb); } /* diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 2d3547c8423..74f80f6b6cf 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -29,6 +29,49 @@ #define tlb_fast_mode(tlb) 1 #endif +#ifdef CONFIG_HAVE_RCU_TABLE_FREE +/* + * Semi RCU freeing of the page directories. + * + * This is needed by some architectures to implement software pagetable walkers. + * + * gup_fast() and other software pagetable walkers do a lockless page-table + * walk and therefore needs some synchronization with the freeing of the page + * directories. The chosen means to accomplish that is by disabling IRQs over + * the walk. + * + * Architectures that use IPIs to flush TLBs will then automagically DTRT, + * since we unlink the page, flush TLBs, free the page. Since the disabling of + * IRQs delays the completion of the TLB flush we can never observe an already + * freed page. + * + * Architectures that do not have this (PPC) need to delay the freeing by some + * other means, this is that means. + * + * What we do is batch the freed directory pages (tables) and RCU free them. + * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling + * holds off grace periods. + * + * However, in order to batch these pages we need to allocate storage, this + * allocation is deep inside the MM code and can thus easily fail on memory + * pressure. To guarantee progress we fall back to single table freeing, see + * the implementation of tlb_remove_table_one(). + * + */ +struct mmu_table_batch { + struct rcu_head rcu; + unsigned int nr; + void *tables[0]; +}; + +#define MAX_TABLE_BATCH \ + ((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *)) + +extern void tlb_table_flush(struct mmu_gather *tlb); +extern void tlb_remove_table(struct mmu_gather *tlb, void *table); + +#endif + /* * If we can't allocate a page to make a big batch of page pointers * to work on, then just handle a few from the on-stack structure. @@ -40,13 +83,13 @@ */ struct mmu_gather { struct mm_struct *mm; +#ifdef CONFIG_HAVE_RCU_TABLE_FREE + struct mmu_table_batch *batch; +#endif unsigned int nr; /* set to ~0U means fast mode */ unsigned int max; /* nr < max */ unsigned int need_flush;/* Really unmapped some ptes? */ unsigned int fullmm; /* non-zero means full mm flush */ -#ifdef HAVE_ARCH_MMU_GATHER - struct arch_mmu_gather arch; -#endif struct page **pages; struct page *local[MMU_GATHER_BUNDLE]; }; @@ -82,8 +125,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) tlb->fullmm = fullmm; -#ifdef HAVE_ARCH_MMU_GATHER - tlb->arch = ARCH_MMU_GATHER_INIT; +#ifdef CONFIG_HAVE_RCU_TABLE_FREE + tlb->batch = NULL; #endif } @@ -94,6 +137,9 @@ tlb_flush_mmu(struct mmu_gather *tlb) return; tlb->need_flush = 0; tlb_flush(tlb); +#ifdef CONFIG_HAVE_RCU_TABLE_FREE + tlb_table_flush(tlb); +#endif if (!tlb_fast_mode(tlb)) { free_pages_and_swap_cache(tlb->pages, tlb->nr); tlb->nr = 0; diff --git a/mm/memory.c b/mm/memory.c index 19b2d44de9f..a77fd23ee68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct task_struct *task) #endif +#ifdef CONFIG_HAVE_RCU_TABLE_FREE + +/* + * See the comment near struct mmu_table_batch. + */ + +static void tlb_remove_table_smp_sync(void *arg) +{ + /* Simply deliver the interrupt */ +} + +static void tlb_remove_table_one(void *table) +{ + /* + * This isn't an RCU grace period and hence the page-tables cannot be + * assumed to be actually RCU-freed. + * + * It is however sufficient for software page-table walkers that rely on + * IRQ disabling. See the comment near struct mmu_table_batch. + */ + smp_call_function(tlb_remove_table_smp_sync, NULL, 1); + __tlb_remove_table(table); +} + +static void tlb_remove_table_rcu(struct rcu_head *head) +{ + struct mmu_table_batch *batch; + int i; + + batch = container_of(head, struct mmu_table_batch, rcu); + + for (i = 0; i < batch->nr; i++) + __tlb_remove_table(batch->tables[i]); + + free_page((unsigned long)batch); +} + +void tlb_table_flush(struct mmu_gather *tlb) +{ + struct mmu_table_batch **batch = &tlb->batch; + + if (*batch) { + call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu); + *batch = NULL; + } +} + +void tlb_remove_table(struct mmu_gather *tlb, void *table) +{ + struct mmu_table_batch **batch = &tlb->batch; + + tlb->need_flush = 1; + + /* + * When there's less then two users of this mm there cannot be a + * concurrent page-table walk. + */ + if (atomic_read(&tlb->mm->mm_users) < 2) { + __tlb_remove_table(table); + return; + } + + if (*batch == NULL) { + *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN); + if (*batch == NULL) { + tlb_remove_table_one(table); + return; + } + (*batch)->nr = 0; + } + (*batch)->tables[(*batch)->nr++] = table; + if ((*batch)->nr == MAX_TABLE_BATCH) + tlb_table_flush(tlb); +} + +#endif + /* * If a p?d_bad entry is found while walking page tables, report * the error, before resetting entry to p?d_none. Usually (but -- cgit v1.2.3-70-g09d2 From e303297e6c3a7b847c4731eb14006ca6b435ecca Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:01 -0700 Subject: mm: extended batches for generic mmu_gather Instead of using a single batch (the small on-stack, or an allocated page), try and extend the batch every time it runs out and only flush once either the extend fails or we're done. Signed-off-by: Peter Zijlstra Requested-by: Nick Piggin Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Hugh Dickins Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/asm-generic/tlb.h | 129 +++++++++++++++++++++++++++++----------------- mm/memory.c | 2 +- 2 files changed, 84 insertions(+), 47 deletions(-) (limited to 'mm') diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 74f80f6b6cf..5a946a08ff9 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -19,16 +19,6 @@ #include #include -/* - * For UP we don't need to worry about TLB flush - * and page free order so much.. - */ -#ifdef CONFIG_SMP - #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U) -#else - #define tlb_fast_mode(tlb) 1 -#endif - #ifdef CONFIG_HAVE_RCU_TABLE_FREE /* * Semi RCU freeing of the page directories. @@ -78,6 +68,16 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table); */ #define MMU_GATHER_BUNDLE 8 +struct mmu_gather_batch { + struct mmu_gather_batch *next; + unsigned int nr; + unsigned int max; + struct page *pages[0]; +}; + +#define MAX_GATHER_BATCH \ + ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *)) + /* struct mmu_gather is an opaque type used by the mm code for passing around * any data needed by arch specific code for tlb_remove_page. */ @@ -86,22 +86,48 @@ struct mmu_gather { #ifdef CONFIG_HAVE_RCU_TABLE_FREE struct mmu_table_batch *batch; #endif - unsigned int nr; /* set to ~0U means fast mode */ - unsigned int max; /* nr < max */ - unsigned int need_flush;/* Really unmapped some ptes? */ - unsigned int fullmm; /* non-zero means full mm flush */ - struct page **pages; - struct page *local[MMU_GATHER_BUNDLE]; + unsigned int need_flush : 1, /* Did free PTEs */ + fast_mode : 1; /* No batching */ + + unsigned int fullmm; + + struct mmu_gather_batch *active; + struct mmu_gather_batch local; + struct page *__pages[MMU_GATHER_BUNDLE]; }; -static inline void __tlb_alloc_page(struct mmu_gather *tlb) +/* + * For UP we don't need to worry about TLB flush + * and page free order so much.. + */ +#ifdef CONFIG_SMP + #define tlb_fast_mode(tlb) (tlb->fast_mode) +#else + #define tlb_fast_mode(tlb) 1 +#endif + +static inline int tlb_next_batch(struct mmu_gather *tlb) { - unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0); + struct mmu_gather_batch *batch; - if (addr) { - tlb->pages = (void *)addr; - tlb->max = PAGE_SIZE / sizeof(struct page *); + batch = tlb->active; + if (batch->next) { + tlb->active = batch->next; + return 1; } + + batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0); + if (!batch) + return 0; + + batch->next = NULL; + batch->nr = 0; + batch->max = MAX_GATHER_BATCH; + + tlb->active->next = batch; + tlb->active = batch; + + return 1; } /* tlb_gather_mmu @@ -114,16 +140,13 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) { tlb->mm = mm; - tlb->max = ARRAY_SIZE(tlb->local); - tlb->pages = tlb->local; - - if (num_online_cpus() > 1) { - tlb->nr = 0; - __tlb_alloc_page(tlb); - } else /* Use fast mode if only one CPU is online */ - tlb->nr = ~0U; - - tlb->fullmm = fullmm; + tlb->fullmm = fullmm; + tlb->need_flush = 0; + tlb->fast_mode = (num_possible_cpus() == 1); + tlb->local.next = NULL; + tlb->local.nr = 0; + tlb->local.max = ARRAY_SIZE(tlb->__pages); + tlb->active = &tlb->local; #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb->batch = NULL; @@ -133,6 +156,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) static inline void tlb_flush_mmu(struct mmu_gather *tlb) { + struct mmu_gather_batch *batch; + if (!tlb->need_flush) return; tlb->need_flush = 0; @@ -140,17 +165,15 @@ tlb_flush_mmu(struct mmu_gather *tlb) #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb_table_flush(tlb); #endif - if (!tlb_fast_mode(tlb)) { - free_pages_and_swap_cache(tlb->pages, tlb->nr); - tlb->nr = 0; - /* - * If we are using the local on-stack array of pages for MMU - * gather, try allocating an off-stack array again as we have - * recently freed pages. - */ - if (tlb->pages == tlb->local) - __tlb_alloc_page(tlb); + + if (tlb_fast_mode(tlb)) + return; + + for (batch = &tlb->local; batch; batch = batch->next) { + free_pages_and_swap_cache(batch->pages, batch->nr); + batch->nr = 0; } + tlb->active = &tlb->local; } /* tlb_finish_mmu @@ -160,13 +183,18 @@ tlb_flush_mmu(struct mmu_gather *tlb) static inline void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) { + struct mmu_gather_batch *batch, *next; + tlb_flush_mmu(tlb); /* keep the page table cache within bounds */ check_pgt_cache(); - if (tlb->pages != tlb->local) - free_pages((unsigned long)tlb->pages, 0); + for (batch = tlb->local.next; batch; batch = next) { + next = batch->next; + free_pages((unsigned long)batch, 0); + } + tlb->local.next = NULL; } /* __tlb_remove_page @@ -177,15 +205,24 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) */ static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) { + struct mmu_gather_batch *batch; + tlb->need_flush = 1; + if (tlb_fast_mode(tlb)) { free_page_and_swap_cache(page); return 1; /* avoid calling tlb_flush_mmu() */ } - tlb->pages[tlb->nr++] = page; - VM_BUG_ON(tlb->nr > tlb->max); - return tlb->max - tlb->nr; + batch = tlb->active; + batch->pages[batch->nr++] = page; + VM_BUG_ON(batch->nr > batch->max); + if (batch->nr == batch->max) { + if (!tlb_next_batch(tlb)) + return 0; + } + + return batch->max - batch->nr; } /* tlb_remove_page diff --git a/mm/memory.c b/mm/memory.c index a77fd23ee68..17193d74f30 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -994,8 +994,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, spinlock_t *ptl; int rss[NR_MM_COUNTERS]; - init_rss_vec(rss); again: + init_rss_vec(rss); pte = pte_offset_map_lock(mm, pmd, addr, &ptl); arch_enter_lazy_mmu_mode(); do { -- cgit v1.2.3-70-g09d2 From 97a894136f29802da19a15541de3c019e1ca147e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:04 -0700 Subject: mm: Remove i_mmap_lock lockbreak Hugh says: "The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped? " Counter points: - cpu contention makes the spin stop (need_resched()) - zap pages should be freeing pages at a higher rate than reclaim ever can I think the simplification of the truncate code is definitely worth it. Effectively reverts: 2aa15890f3c ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem. Signed-off-by: Peter Zijlstra Reviewed-by: KAMEZAWA Hiroyuki Cc: Hugh Dickins Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/inode.c | 1 - include/linux/fs.h | 2 - include/linux/mm.h | 2 - include/linux/mm_types.h | 1 - kernel/fork.c | 1 - mm/memory.c | 195 +++++++---------------------------------------- mm/mmap.c | 13 +--- mm/mremap.c | 1 - 8 files changed, 28 insertions(+), 188 deletions(-) (limited to 'mm') diff --git a/fs/inode.c b/fs/inode.c index 05f4fa52132..7a7284c71ab 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -331,7 +331,6 @@ void address_space_init_once(struct address_space *mapping) spin_lock_init(&mapping->private_lock); INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap); INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); - mutex_init(&mapping->unmap_mutex); } EXPORT_SYMBOL(address_space_init_once); diff --git a/include/linux/fs.h b/include/linux/fs.h index cdf9495df20..5d2c86bdf5b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -635,7 +635,6 @@ struct address_space { struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ spinlock_t i_mmap_lock; /* protect tree, count, list */ - unsigned int truncate_count; /* Cover race condition with truncate */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ @@ -644,7 +643,6 @@ struct address_space { spinlock_t private_lock; /* for use by the address_space */ struct list_head private_list; /* ditto */ struct address_space *assoc_mapping; /* ditto */ - struct mutex unmap_mutex; /* to protect unmapping */ } __attribute__((aligned(sizeof(long)))); /* * On most architectures that alignment is already the case; but diff --git a/include/linux/mm.h b/include/linux/mm.h index ffcce9bf2b5..2ad0ac8c3f3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -895,8 +895,6 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ - unsigned long truncate_count; /* Compare vm_truncate_count */ }; struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 02aa5619709..201998e5b53 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -175,7 +175,6 @@ struct vm_area_struct { units, *not* PAGE_CACHE_SIZE */ struct file * vm_file; /* File we map to (can be NULL). */ void * vm_private_data; /* was vm_pte (shared mem) */ - unsigned long vm_truncate_count;/* truncate_count or restart_addr */ #ifndef CONFIG_MMU struct vm_region *vm_region; /* NOMMU mapping region */ diff --git a/kernel/fork.c b/kernel/fork.c index 2b44d82b823..4eef925477f 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -386,7 +386,6 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) spin_lock(&mapping->i_mmap_lock); if (tmp->vm_flags & VM_SHARED) mapping->i_mmap_writable++; - tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(mapping); /* insert tmp into the share list, just after mpnt */ vma_prio_tree_add(tmp, mpnt); diff --git a/mm/memory.c b/mm/memory.c index 17193d74f30..18655878b9f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -986,13 +986,13 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { struct mm_struct *mm = tlb->mm; int force_flush = 0; - pte_t *pte; - spinlock_t *ptl; int rss[NR_MM_COUNTERS]; + spinlock_t *ptl; + pte_t *pte; again: init_rss_vec(rss); @@ -1001,12 +1001,9 @@ again: do { pte_t ptent = *pte; if (pte_none(ptent)) { - (*zap_work)--; continue; } - (*zap_work) -= PAGE_SIZE; - if (pte_present(ptent)) { struct page *page; @@ -1075,7 +1072,7 @@ again: print_bad_pte(vma, addr, ptent, NULL); } pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0)); + } while (pte++, addr += PAGE_SIZE, addr != end); add_mm_rss_vec(mm, rss); arch_leave_lazy_mmu_mode(); @@ -1099,7 +1096,7 @@ again: static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { pmd_t *pmd; unsigned long next; @@ -1111,19 +1108,15 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); - } else if (zap_huge_pmd(tlb, vma, pmd)) { - (*zap_work)--; + } else if (zap_huge_pmd(tlb, vma, pmd)) continue; - } /* fall through */ } - if (pmd_none_or_clear_bad(pmd)) { - (*zap_work)--; + if (pmd_none_or_clear_bad(pmd)) continue; - } - next = zap_pte_range(tlb, vma, pmd, addr, next, - zap_work, details); - } while (pmd++, addr = next, (addr != end && *zap_work > 0)); + next = zap_pte_range(tlb, vma, pmd, addr, next, details); + cond_resched(); + } while (pmd++, addr = next, addr != end); return addr; } @@ -1131,7 +1124,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, static inline unsigned long zap_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { pud_t *pud; unsigned long next; @@ -1139,13 +1132,10 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, pud = pud_offset(pgd, addr); do { next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) { - (*zap_work)--; + if (pud_none_or_clear_bad(pud)) continue; - } - next = zap_pmd_range(tlb, vma, pud, addr, next, - zap_work, details); - } while (pud++, addr = next, (addr != end && *zap_work > 0)); + next = zap_pmd_range(tlb, vma, pud, addr, next, details); + } while (pud++, addr = next, addr != end); return addr; } @@ -1153,7 +1143,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, static unsigned long unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, - long *zap_work, struct zap_details *details) + struct zap_details *details) { pgd_t *pgd; unsigned long next; @@ -1167,13 +1157,10 @@ static unsigned long unmap_page_range(struct mmu_gather *tlb, pgd = pgd_offset(vma->vm_mm, addr); do { next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) { - (*zap_work)--; + if (pgd_none_or_clear_bad(pgd)) continue; - } - next = zap_pud_range(tlb, vma, pgd, addr, next, - zap_work, details); - } while (pgd++, addr = next, (addr != end && *zap_work > 0)); + next = zap_pud_range(tlb, vma, pgd, addr, next, details); + } while (pgd++, addr = next, addr != end); tlb_end_vma(tlb, vma); mem_cgroup_uncharge_end(); @@ -1218,9 +1205,7 @@ unsigned long unmap_vmas(struct mmu_gather *tlb, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { - long zap_work = ZAP_BLOCK_SIZE; unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; struct mm_struct *mm = vma->vm_mm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); @@ -1253,33 +1238,15 @@ unsigned long unmap_vmas(struct mmu_gather *tlb, * Since no pte has actually been setup, it is * safe to do nothing in this case. */ - if (vma->vm_file) { + if (vma->vm_file) unmap_hugepage_range(vma, start, end, NULL); - zap_work -= (end - start) / - pages_per_huge_page(hstate_vma(vma)); - } start = end; } else - start = unmap_page_range(tlb, vma, - start, end, &zap_work, details); - - if (zap_work > 0) { - BUG_ON(start != end); - break; - } - - if (need_resched() || - (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) - goto out; - cond_resched(); - } - - zap_work = ZAP_BLOCK_SIZE; + start = unmap_page_range(tlb, vma, start, end, details); } } -out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -2612,96 +2579,11 @@ unwritable_page: return ret; } -/* - * Helper functions for unmap_mapping_range(). - * - * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __ - * - * We have to restart searching the prio_tree whenever we drop the lock, - * since the iterator is only valid while the lock is held, and anyway - * a later vma might be split and reinserted earlier while lock dropped. - * - * The list of nonlinear vmas could be handled more efficiently, using - * a placeholder, but handle it in the same way until a need is shown. - * It is important to search the prio_tree before nonlinear list: a vma - * may become nonlinear and be shifted from prio_tree to nonlinear list - * while the lock is dropped; but never shifted from list to prio_tree. - * - * In order to make forward progress despite restarting the search, - * vm_truncate_count is used to mark a vma as now dealt with, so we can - * quickly skip it next time around. Since the prio_tree search only - * shows us those vmas affected by unmapping the range in question, we - * can't efficiently keep all vmas in step with mapping->truncate_count: - * so instead reset them all whenever it wraps back to 0 (then go to 1). - * mapping->truncate_count and vma->vm_truncate_count are protected by - * i_mmap_lock. - * - * In order to make forward progress despite repeatedly restarting some - * large vma, note the restart_addr from unmap_vmas when it breaks out: - * and restart from that address when we reach that vma again. It might - * have been split or merged, shrunk or extended, but never shifted: so - * restart_addr remains valid so long as it remains in the vma's range. - * unmap_mapping_range forces truncate_count to leap over page-aligned - * values so we can save vma's restart_addr in its truncate_count field. - */ -#define is_restart_addr(truncate_count) (!((truncate_count) & ~PAGE_MASK)) - -static void reset_vma_truncate_counts(struct address_space *mapping) -{ - struct vm_area_struct *vma; - struct prio_tree_iter iter; - - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) - vma->vm_truncate_count = 0; - list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) - vma->vm_truncate_count = 0; -} - -static int unmap_mapping_range_vma(struct vm_area_struct *vma, +static void unmap_mapping_range_vma(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, struct zap_details *details) { - unsigned long restart_addr; - int need_break; - - /* - * files that support invalidating or truncating portions of the - * file from under mmaped areas must have their ->fault function - * return a locked page (and set VM_FAULT_LOCKED in the return). - * This provides synchronisation against concurrent unmapping here. - */ - -again: - restart_addr = vma->vm_truncate_count; - if (is_restart_addr(restart_addr) && start_addr < restart_addr) { - start_addr = restart_addr; - if (start_addr >= end_addr) { - /* Top of vma has been split off since last time */ - vma->vm_truncate_count = details->truncate_count; - return 0; - } - } - - restart_addr = zap_page_range(vma, start_addr, - end_addr - start_addr, details); - need_break = need_resched() || spin_needbreak(details->i_mmap_lock); - - if (restart_addr >= end_addr) { - /* We have now completed this vma: mark it so */ - vma->vm_truncate_count = details->truncate_count; - if (!need_break) - return 0; - } else { - /* Note restart_addr in vma's truncate_count field */ - vma->vm_truncate_count = restart_addr; - if (!need_break) - goto again; - } - - spin_unlock(details->i_mmap_lock); - cond_resched(); - spin_lock(details->i_mmap_lock); - return -EINTR; + zap_page_range(vma, start_addr, end_addr - start_addr, details); } static inline void unmap_mapping_range_tree(struct prio_tree_root *root, @@ -2711,12 +2593,8 @@ static inline void unmap_mapping_range_tree(struct prio_tree_root *root, struct prio_tree_iter iter; pgoff_t vba, vea, zba, zea; -restart: vma_prio_tree_foreach(vma, &iter, root, details->first_index, details->last_index) { - /* Skip quickly over those we have already dealt with */ - if (vma->vm_truncate_count == details->truncate_count) - continue; vba = vma->vm_pgoff; vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1; @@ -2728,11 +2606,10 @@ restart: if (zea > vea) zea = vea; - if (unmap_mapping_range_vma(vma, + unmap_mapping_range_vma(vma, ((zba - vba) << PAGE_SHIFT) + vma->vm_start, ((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start, - details) < 0) - goto restart; + details); } } @@ -2747,15 +2624,9 @@ static inline void unmap_mapping_range_list(struct list_head *head, * across *all* the pages in each nonlinear VMA, not just the pages * whose virtual address lies outside the file truncation point. */ -restart: list_for_each_entry(vma, head, shared.vm_set.list) { - /* Skip quickly over those we have already dealt with */ - if (vma->vm_truncate_count == details->truncate_count) - continue; details->nonlinear_vma = vma; - if (unmap_mapping_range_vma(vma, vma->vm_start, - vma->vm_end, details) < 0) - goto restart; + unmap_mapping_range_vma(vma, vma->vm_start, vma->vm_end, details); } } @@ -2794,26 +2665,14 @@ void unmap_mapping_range(struct address_space *mapping, details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - details.i_mmap_lock = &mapping->i_mmap_lock; - mutex_lock(&mapping->unmap_mutex); - spin_lock(&mapping->i_mmap_lock); - - /* Protect against endless unmapping loops */ - mapping->truncate_count++; - if (unlikely(is_restart_addr(mapping->truncate_count))) { - if (mapping->truncate_count == 0) - reset_vma_truncate_counts(mapping); - mapping->truncate_count++; - } - details.truncate_count = mapping->truncate_count; + spin_lock(&mapping->i_mmap_lock); if (unlikely(!prio_tree_empty(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); spin_unlock(&mapping->i_mmap_lock); - mutex_unlock(&mapping->unmap_mutex); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/mmap.c b/mm/mmap.c index 40d49986e71..50cb04bb56b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -445,10 +445,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, if (vma->vm_file) mapping = vma->vm_file->f_mapping; - if (mapping) { + if (mapping) spin_lock(&mapping->i_mmap_lock); - vma->vm_truncate_count = mapping->truncate_count; - } __vma_link(mm, vma, prev, rb_link, rb_parent); __vma_link_file(vma); @@ -558,16 +556,7 @@ again: remove_next = 1 + (end > next->vm_end); if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; spin_lock(&mapping->i_mmap_lock); - if (importer && - vma->vm_truncate_count != next->vm_truncate_count) { - /* - * unmap_mapping_range might be in progress: - * ensure that the expanding vma is rescanned. - */ - importer->vm_truncate_count = 0; - } if (insert) { - insert->vm_truncate_count = vma->vm_truncate_count; /* * Put into prio_tree now, so instantiated pages * are visible to arm/parisc __flush_dcache_page diff --git a/mm/mremap.c b/mm/mremap.c index a7c1f9f9b94..909e1e1e99b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -94,7 +94,6 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, */ mapping = vma->vm_file->f_mapping; spin_lock(&mapping->i_mmap_lock); - new_vma->vm_truncate_count = 0; } /* -- cgit v1.2.3-70-g09d2 From 3d48ae45e72390ddf8cc5256ac32ed6f7a19cbea Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:06 -0700 Subject: mm: Convert i_mmap_lock to a mutex Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra Acked-by: Hugh Dickins Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/lockstat.txt | 2 +- Documentation/vm/locking | 2 +- arch/x86/mm/hugetlbpage.c | 4 ++-- fs/hugetlbfs/inode.c | 4 ++-- fs/inode.c | 2 +- include/linux/fs.h | 2 +- include/linux/mmu_notifier.h | 2 +- kernel/fork.c | 4 ++-- mm/filemap.c | 10 +++++----- mm/filemap_xip.c | 4 ++-- mm/fremap.c | 4 ++-- mm/hugetlb.c | 14 +++++++------- mm/memory-failure.c | 4 ++-- mm/memory.c | 4 ++-- mm/mmap.c | 22 +++++++++++----------- mm/mremap.c | 4 ++-- mm/rmap.c | 28 ++++++++++++++-------------- 17 files changed, 58 insertions(+), 58 deletions(-) (limited to 'mm') diff --git a/Documentation/lockstat.txt b/Documentation/lockstat.txt index 65f4c795015..9c0a80d17a2 100644 --- a/Documentation/lockstat.txt +++ b/Documentation/lockstat.txt @@ -136,7 +136,7 @@ View the top contending locks: dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24 &inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74 &zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06 - &inode->i_data.i_mmap_lock: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44 + &inode->i_data.i_mmap_mutex: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44 &q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52 &rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62 &rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63 diff --git a/Documentation/vm/locking b/Documentation/vm/locking index 25fadb44876..f61228bd639 100644 --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_mutex and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock. The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c index d4203988504..f581a18c0d4 100644 --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -72,7 +72,7 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -97,7 +97,7 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index b9eeb1cd03f..e7a035781b7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -412,10 +412,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset) pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c index 7a7284c71ab..88aa3bcc768 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -326,7 +326,7 @@ void address_space_init_once(struct address_space *mapping) memset(mapping, 0, sizeof(*mapping)); INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC); spin_lock_init(&mapping->tree_lock); - spin_lock_init(&mapping->i_mmap_lock); + mutex_init(&mapping->i_mmap_mutex); INIT_LIST_HEAD(&mapping->private_list); spin_lock_init(&mapping->private_lock); INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h index 5d2c86bdf5b..5bb9e826019 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -634,7 +634,7 @@ struct address_space { unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock; /* protect tree, count, list */ + struct mutex i_mmap_mutex; /* protect tree, count, list */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index cc2e7dfea9d..a877dfc243e 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -150,7 +150,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_lock or anon_vma->lock). + * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->lock). * 3. No other concurrent thread can access the list (release) */ struct mmu_notifier { diff --git a/kernel/fork.c b/kernel/fork.c index 4eef925477f..927692734bc 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -383,14 +383,14 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) get_file(file); if (tmp->vm_flags & VM_DENYWRITE) atomic_dec(&inode->i_writecount); - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); if (tmp->vm_flags & VM_SHARED) mapping->i_mmap_writable++; flush_dcache_mmap_lock(mapping); /* insert tmp into the share list, just after mpnt */ vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); } /* diff --git a/mm/filemap.c b/mm/filemap.c index 8144f87dcbb..88354ae0b1f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -58,16 +58,16 @@ /* * Lock ordering: * - * ->i_mmap_lock (truncate_pagecache) + * ->i_mmap_mutex (truncate_pagecache) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_lock (truncate->unmap_mapping_range) + * ->i_mmap_mutex (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_lock + * ->i_mmap_mutex * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -84,7 +84,7 @@ * sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_lock + * ->i_mmap_mutex * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock @@ -106,7 +106,7 @@ * * (code doesn't rely on that order, so you could switch it around) * ->tasklist_lock (memory_failure, collect_procs_ao) - * ->i_mmap_lock + * ->i_mmap_mutex */ /* diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 83364df74a3..93356cd1282 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -183,7 +183,7 @@ __xip_unmap (struct address_space * mapping, return; retry: - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; address = vma->vm_start + @@ -201,7 +201,7 @@ retry: page_cache_release(page); } } - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); if (locked) { mutex_unlock(&xip_sparse_mutex); diff --git a/mm/fremap.c b/mm/fremap.c index ec520c7b28d..7f4123056e0 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -211,13 +211,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, } goto out; } - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); flush_dcache_mmap_lock(mapping); vma->vm_flags |= VM_NONLINEAR; vma_prio_tree_remove(vma, &mapping->i_mmap); vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); } if (vma->vm_flags & VM_LOCKED) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bbb4a5bbb95..5fd68b95c67 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2205,7 +2205,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long sz = huge_page_size(h); /* - * A page gathering list, protected by per file i_mmap_lock. The + * A page gathering list, protected by per file i_mmap_mutex. The * lock is used to avoid list corruption from multiple unmapping * of the same page since we are using page->lru. */ @@ -2274,9 +2274,9 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, struct page *ref_page) { - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex); __unmap_hugepage_range(vma, start, end, ref_page); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); } /* @@ -2308,7 +2308,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, * this mapping should be shared between all the VMAs, * __unmap_hugepage_range() is called as the lock is already held */ - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) { /* Do not unmap the current VMA */ if (iter_vma == vma) @@ -2326,7 +2326,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, address, address + huge_page_size(h), page); } - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); return 1; } @@ -2810,7 +2810,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma, BUG_ON(address >= end); flush_cache_range(vma, address, end); - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex); spin_lock(&mm->page_table_lock); for (; address < end; address += huge_page_size(h)) { ptep = huge_pte_offset(mm, address); @@ -2825,7 +2825,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma, } } spin_unlock(&mm->page_table_lock); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); flush_tlb_range(vma, start, end); } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 2b9a5eef39e..12178ec32ab 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -429,7 +429,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, */ read_lock(&tasklist_lock); - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); for_each_process(tsk) { pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -449,7 +449,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, add_to_kill(tsk, page, vma, to_kill, tkc); } } - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); read_unlock(&tasklist_lock); } diff --git a/mm/memory.c b/mm/memory.c index 18655878b9f..7bbe4d3df75 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2667,12 +2667,12 @@ void unmap_mapping_range(struct address_space *mapping, details.last_index = ULONG_MAX; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); if (unlikely(!prio_tree_empty(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/mmap.c b/mm/mmap.c index 50cb04bb56b..26efbfca0b2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -194,7 +194,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_lock + * Requires inode->i_mapping->i_mmap_mutex */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -222,9 +222,9 @@ void unlink_file_vma(struct vm_area_struct *vma) if (file) { struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); __remove_shared_vm_struct(vma, file, mapping); - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); } } @@ -446,13 +446,13 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mapping = vma->vm_file->f_mapping; if (mapping) - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); __vma_link(mm, vma, prev, rb_link, rb_parent); __vma_link_file(vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); mm->map_count++; validate_mm(mm); @@ -555,7 +555,7 @@ again: remove_next = 1 + (end > next->vm_end); mapping = file->f_mapping; if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); if (insert) { /* * Put into prio_tree now, so instantiated pages @@ -622,7 +622,7 @@ again: remove_next = 1 + (end > next->vm_end); if (anon_vma) anon_vma_unlock(anon_vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); if (remove_next) { if (file) { @@ -2290,7 +2290,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. + * then i_mmap_mutex is taken here. */ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) { @@ -2532,7 +2532,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) */ if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags)) BUG(); - spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem); + mutex_lock_nest_lock(&mapping->i_mmap_mutex, &mm->mmap_sem); } } @@ -2559,7 +2559,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_lock or anon_vma->lock outside the mmap_sem never + * taking i_mmap_mutex or anon_vma->lock outside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * @@ -2631,7 +2631,7 @@ static void vm_unlock_mapping(struct address_space *mapping) * AS_MM_ALL_LOCKS can't change to 0 from under us * because we hold the mm_all_locks_mutex. */ - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); if (!test_and_clear_bit(AS_MM_ALL_LOCKS, &mapping->flags)) BUG(); diff --git a/mm/mremap.c b/mm/mremap.c index 909e1e1e99b..506fa44403d 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -93,7 +93,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, * and we propagate stale pages into the dst afterward. */ mapping = vma->vm_file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); } /* @@ -122,7 +122,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, pte_unmap(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } diff --git a/mm/rmap.c b/mm/rmap.c index 522e4a93cad..f0ef7ea5423 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_lock + * mapping->i_mmap_mutex * anon_vma->lock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -646,14 +646,14 @@ static int page_referenced_file(struct page *page, * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_lock. + * so we can safely take mapping->i_mmap_mutex. */ BUG_ON(!PageLocked(page)); - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); /* - * i_mmap_lock does not stabilize mapcount at all, but mapcount + * i_mmap_mutex does not stabilize mapcount at all, but mapcount * is more likely to be accurate if we note it after spinning. */ mapcount = page_mapcount(page); @@ -675,7 +675,7 @@ static int page_referenced_file(struct page *page, break; } - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); return referenced; } @@ -762,7 +762,7 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page) BUG_ON(PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (vma->vm_flags & VM_SHARED) { unsigned long address = vma_address(page, vma); @@ -771,7 +771,7 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page) ret += page_mkclean_one(page, vma, address); } } - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); return ret; } @@ -1119,7 +1119,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->lock or mapping->i_mmap_lock. + * we now hold anon_vma->lock or mapping->i_mmap_mutex. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. @@ -1345,7 +1345,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags) unsigned long max_nl_size = 0; unsigned int mapcount; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { unsigned long address = vma_address(page, vma); if (address == -EFAULT) @@ -1391,7 +1391,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags) mapcount = page_mapcount(page); if (!mapcount) goto out; - cond_resched_lock(&mapping->i_mmap_lock); + cond_resched(); max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; if (max_nl_cursor == 0) @@ -1413,7 +1413,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags) } vma->vm_private_data = (void *) max_nl_cursor; } - cond_resched_lock(&mapping->i_mmap_lock); + cond_resched(); max_nl_cursor += CLUSTER_SIZE; } while (max_nl_cursor <= max_nl_size); @@ -1425,7 +1425,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags) list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); return ret; } @@ -1544,7 +1544,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *, if (!mapping) return ret; - spin_lock(&mapping->i_mmap_lock); + mutex_lock(&mapping->i_mmap_mutex); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { unsigned long address = vma_address(page, vma); if (address == -EFAULT) @@ -1558,7 +1558,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *, * never contain migration ptes. Decide what to do about this * limitation to linear when we need rmap_walk() on nonlinear. */ - spin_unlock(&mapping->i_mmap_lock); + mutex_unlock(&mapping->i_mmap_mutex); return ret; } -- cgit v1.2.3-70-g09d2 From 25aeeb046e695c3093a86aa9386128ffb3b1bc32 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:07 -0700 Subject: mm: revert page_lock_anon_vma() lock annotation Its beyond ugly and gets in the way. Signed-off-by: Peter Zijlstra Acked-by: Hugh Dickins Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Cc: Namhyung Kim Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 15 +-------------- mm/rmap.c | 4 +--- 2 files changed, 2 insertions(+), 17 deletions(-) (limited to 'mm') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 830e65dc01e..590c291a8cd 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -218,20 +218,7 @@ int try_to_munlock(struct page *); /* * Called by memory-failure.c to kill processes. */ -struct anon_vma *__page_lock_anon_vma(struct page *page); - -static inline struct anon_vma *page_lock_anon_vma(struct page *page) -{ - struct anon_vma *anon_vma; - - __cond_lock(RCU, anon_vma = __page_lock_anon_vma(page)); - - /* (void) is needed to make gcc happy */ - (void) __cond_lock(&anon_vma->root->lock, anon_vma); - - return anon_vma; -} - +struct anon_vma *page_lock_anon_vma(struct page *page); void page_unlock_anon_vma(struct anon_vma *anon_vma); int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); diff --git a/mm/rmap.c b/mm/rmap.c index f0ef7ea5423..c6044761617 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -323,7 +323,7 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. */ -struct anon_vma *__page_lock_anon_vma(struct page *page) +struct anon_vma *page_lock_anon_vma(struct page *page) { struct anon_vma *anon_vma, *root_anon_vma; unsigned long anon_mapping; @@ -357,8 +357,6 @@ out: } void page_unlock_anon_vma(struct anon_vma *anon_vma) - __releases(&anon_vma->root->lock) - __releases(RCU) { anon_vma_unlock(anon_vma); rcu_read_unlock(); -- cgit v1.2.3-70-g09d2 From 6111e4ca6829a0e8b092b8e5eeb6b5366091f29c Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:08 -0700 Subject: mm: improve page_lock_anon_vma() comment A slightly more verbose comment to go along with the trickery in page_lock_anon_vma(). Signed-off-by: Peter Zijlstra Reviewed-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Mel Gorman Acked-by: Hugh Dickins Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/rmap.c b/mm/rmap.c index c6044761617..cc140811af5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -320,8 +320,22 @@ void __init anon_vma_init(void) } /* - * Getting a lock on a stable anon_vma from a page off the LRU is - * tricky: page_lock_anon_vma rely on RCU to guard against the races. + * Getting a lock on a stable anon_vma from a page off the LRU is tricky! + * + * Since there is no serialization what so ever against page_remove_rmap() + * the best this function can do is return a locked anon_vma that might + * have been relevant to this page. + * + * The page might have been remapped to a different anon_vma or the anon_vma + * returned may already be freed (and even reused). + * + * All users of this function must be very careful when walking the anon_vma + * chain and verify that the page in question is indeed mapped in it + * [ something equivalent to page_mapped_in_vma() ]. + * + * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap() + * that the anon_vma pointer from page->mapping is valid if there is a + * mapcount, we can dereference the anon_vma after observing those. */ struct anon_vma *page_lock_anon_vma(struct page *page) { -- cgit v1.2.3-70-g09d2 From 746b18d421da7f27e948e8af1ad82b6d0309324d Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:10 -0700 Subject: mm: use refcounts for page_lock_anon_vma() Convert page_lock_anon_vma() over to use refcounts. This is done to prepare for the conversion of anon_vma from spinlock to mutex. Sadly this inceases the cost of page_lock_anon_vma() from one to two atomics, a follow up patch addresses this, lets keep that simple for now. Signed-off-by: Peter Zijlstra Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: KOSAKI Motohiro Acked-by: Hugh Dickins Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: Mel Gorman Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 17 ++++------------- mm/rmap.c | 42 +++++++++++++++++++++++++++--------------- 2 files changed, 31 insertions(+), 28 deletions(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index 34132f8e910..e4a5c912983 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -721,15 +721,11 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, * Only page_lock_anon_vma() understands the subtleties of * getting a hold on an anon_vma from outside one of its mms. */ - anon_vma = page_lock_anon_vma(page); + anon_vma = page_get_anon_vma(page); if (anon_vma) { /* - * Take a reference count on the anon_vma if the - * page is mapped so that it is guaranteed to - * exist when the page is remapped later + * Anon page */ - get_anon_vma(anon_vma); - page_unlock_anon_vma(anon_vma); } else if (PageSwapCache(page)) { /* * We cannot be sure that the anon_vma of an unmapped @@ -857,13 +853,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, lock_page(hpage); } - if (PageAnon(hpage)) { - anon_vma = page_lock_anon_vma(hpage); - if (anon_vma) { - get_anon_vma(anon_vma); - page_unlock_anon_vma(anon_vma); - } - } + if (PageAnon(hpage)) + anon_vma = page_get_anon_vma(hpage); try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); diff --git a/mm/rmap.c b/mm/rmap.c index cc140811af5..d271845d7d1 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -337,9 +337,9 @@ void __init anon_vma_init(void) * that the anon_vma pointer from page->mapping is valid if there is a * mapcount, we can dereference the anon_vma after observing those. */ -struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *page_get_anon_vma(struct page *page) { - struct anon_vma *anon_vma, *root_anon_vma; + struct anon_vma *anon_vma = NULL; unsigned long anon_mapping; rcu_read_lock(); @@ -350,30 +350,42 @@ struct anon_vma *page_lock_anon_vma(struct page *page) goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - root_anon_vma = ACCESS_ONCE(anon_vma->root); - spin_lock(&root_anon_vma->lock); + if (!atomic_inc_not_zero(&anon_vma->refcount)) { + anon_vma = NULL; + goto out; + } /* * If this page is still mapped, then its anon_vma cannot have been - * freed. But if it has been unmapped, we have no security against - * the anon_vma structure being freed and reused (for another anon_vma: - * SLAB_DESTROY_BY_RCU guarantees that - so the spin_lock above cannot - * corrupt): with anon_vma_prepare() or anon_vma_fork() redirecting - * anon_vma->root before page_unlock_anon_vma() is called to unlock. + * freed. But if it has been unmapped, we have no security against the + * anon_vma structure being freed and reused (for another anon_vma: + * SLAB_DESTROY_BY_RCU guarantees that - so the atomic_inc_not_zero() + * above cannot corrupt). */ - if (page_mapped(page)) - return anon_vma; - - spin_unlock(&root_anon_vma->lock); + if (!page_mapped(page)) { + put_anon_vma(anon_vma); + anon_vma = NULL; + } out: rcu_read_unlock(); - return NULL; + + return anon_vma; +} + +struct anon_vma *page_lock_anon_vma(struct page *page) +{ + struct anon_vma *anon_vma = page_get_anon_vma(page); + + if (anon_vma) + anon_vma_lock(anon_vma); + + return anon_vma; } void page_unlock_anon_vma(struct anon_vma *anon_vma) { anon_vma_unlock(anon_vma); - rcu_read_unlock(); + put_anon_vma(anon_vma); } /* -- cgit v1.2.3-70-g09d2 From 2b575eb64f7a9c701fb4bfdb12388ac547f6c2b6 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:11 -0700 Subject: mm: convert anon_vma->lock to a mutex Straightforward conversion of anon_vma->lock to a mutex. Signed-off-by: Peter Zijlstra Acked-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 8 ++------ include/linux/mmu_notifier.h | 2 +- include/linux/rmap.h | 14 +++++++------- mm/huge_memory.c | 4 ++-- mm/mmap.c | 10 +++++----- mm/rmap.c | 8 ++++---- 6 files changed, 21 insertions(+), 25 deletions(-) (limited to 'mm') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8847c8c2979..48c32ebf65a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -92,12 +92,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd); #define wait_split_huge_page(__anon_vma, __pmd) \ do { \ pmd_t *____pmd = (__pmd); \ - spin_unlock_wait(&(__anon_vma)->root->lock); \ - /* \ - * spin_unlock_wait() is just a loop in C and so the \ - * CPU can reorder anything around it. \ - */ \ - smp_mb(); \ + anon_vma_lock(__anon_vma); \ + anon_vma_unlock(__anon_vma); \ BUG_ON(pmd_trans_splitting(*____pmd) || \ pmd_trans_huge(*____pmd)); \ } while (0) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index a877dfc243e..1d1b1e13f79 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -150,7 +150,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->lock). + * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->mutex). * 3. No other concurrent thread can access the list (release) */ struct mmu_notifier { diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 590c291a8cd..2148b122779 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -7,7 +7,7 @@ #include #include #include -#include +#include #include /* @@ -26,7 +26,7 @@ */ struct anon_vma { struct anon_vma *root; /* Root of this anon_vma tree */ - spinlock_t lock; /* Serialize access to vma list */ + struct mutex mutex; /* Serialize access to vma list */ /* * The refcount is taken on an anon_vma when there is no * guarantee that the vma of page tables will exist for @@ -64,7 +64,7 @@ struct anon_vma_chain { struct vm_area_struct *vma; struct anon_vma *anon_vma; struct list_head same_vma; /* locked by mmap_sem & page_table_lock */ - struct list_head same_anon_vma; /* locked by anon_vma->lock */ + struct list_head same_anon_vma; /* locked by anon_vma->mutex */ }; #ifdef CONFIG_MMU @@ -93,24 +93,24 @@ static inline void vma_lock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->root->lock); + mutex_lock(&anon_vma->root->mutex); } static inline void vma_unlock_anon_vma(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->root->lock); + mutex_unlock(&anon_vma->root->mutex); } static inline void anon_vma_lock(struct anon_vma *anon_vma) { - spin_lock(&anon_vma->root->lock); + mutex_lock(&anon_vma->root->mutex); } static inline void anon_vma_unlock(struct anon_vma *anon_vma) { - spin_unlock(&anon_vma->root->lock); + mutex_unlock(&anon_vma->root->mutex); } /* diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 83326ad66d9..90eef404ec2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1139,7 +1139,7 @@ static int __split_huge_page_splitting(struct page *page, * We can't temporarily set the pmd to null in order * to split it, the pmd must remain marked huge at all * times or the VM won't take the pmd_trans_huge paths - * and it won't wait on the anon_vma->root->lock to + * and it won't wait on the anon_vma->root->mutex to * serialize against split_huge_page*. */ pmdp_splitting_flush_notify(vma, address, pmd); @@ -1333,7 +1333,7 @@ static int __split_huge_page_map(struct page *page, return ret; } -/* must be called with anon_vma->root->lock hold */ +/* must be called with anon_vma->root->mutex hold */ static void __split_huge_page(struct page *page, struct anon_vma *anon_vma) { diff --git a/mm/mmap.c b/mm/mmap.c index 26efbfca0b2..ac2631b7477 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2502,15 +2502,15 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) * The LSB of head.next can't change from under us * because we hold the mm_all_locks_mutex. */ - spin_lock_nest_lock(&anon_vma->root->lock, &mm->mmap_sem); + mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem); /* * We can safely modify head.next after taking the - * anon_vma->root->lock. If some other vma in this mm shares + * anon_vma->root->mutex. If some other vma in this mm shares * the same anon_vma we won't take it again. * * No need of atomic instructions here, head.next * can't change from under us thanks to the - * anon_vma->root->lock. + * anon_vma->root->mutex. */ if (__test_and_set_bit(0, (unsigned long *) &anon_vma->root->head.next)) @@ -2559,7 +2559,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_mutex or anon_vma->lock outside the mmap_sem never + * taking i_mmap_mutex or anon_vma->mutex outside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * @@ -2615,7 +2615,7 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma) * * No need of atomic instructions here, head.next * can't change from under us until we release the - * anon_vma->root->lock. + * anon_vma->root->mutex. */ if (!__test_and_clear_bit(0, (unsigned long *) &anon_vma->root->head.next)) diff --git a/mm/rmap.c b/mm/rmap.c index d271845d7d1..ce29d405d09 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -25,7 +25,7 @@ * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex - * anon_vma->lock + * anon_vma->mutex * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) * swap_lock (in swap_duplicate, swap_info_get) @@ -40,7 +40,7 @@ * * (code doesn't rely on that order so it could be switched around) * ->tasklist_lock - * anon_vma->lock (memory_failure, collect_procs_anon) + * anon_vma->mutex (memory_failure, collect_procs_anon) * pte map lock */ @@ -307,7 +307,7 @@ static void anon_vma_ctor(void *data) { struct anon_vma *anon_vma = data; - spin_lock_init(&anon_vma->lock); + mutex_init(&anon_vma->mutex); atomic_set(&anon_vma->refcount, 0); INIT_LIST_HEAD(&anon_vma->head); } @@ -1143,7 +1143,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->lock or mapping->i_mmap_mutex. + * we now hold anon_vma->mutex or mapping->i_mmap_mutex. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. -- cgit v1.2.3-70-g09d2 From 88c22088bf235f50b09a10bd9f022b0472bcb6b5 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:13 -0700 Subject: mm: optimize page_lock_anon_vma() fast-path Optimize the page_lock_anon_vma() fast path to be one atomic op, instead of two. Signed-off-by: Peter Zijlstra Reviewed-by: KAMEZAWA Hiroyuki Cc: Benjamin Herrenschmidt Cc: David Miller Cc: Martin Schwidefsky Cc: Russell King Cc: Paul Mundt Cc: Jeff Dike Cc: Richard Weinberger Cc: Tony Luck Cc: Hugh Dickins Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: Nick Piggin Cc: Namhyung Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 82 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/rmap.c b/mm/rmap.c index ce29d405d09..3a39b518a65 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -86,6 +86,29 @@ static inline struct anon_vma *anon_vma_alloc(void) static inline void anon_vma_free(struct anon_vma *anon_vma) { VM_BUG_ON(atomic_read(&anon_vma->refcount)); + + /* + * Synchronize against page_lock_anon_vma() such that + * we can safely hold the lock without the anon_vma getting + * freed. + * + * Relies on the full mb implied by the atomic_dec_and_test() from + * put_anon_vma() against the acquire barrier implied by + * mutex_trylock() from page_lock_anon_vma(). This orders: + * + * page_lock_anon_vma() VS put_anon_vma() + * mutex_trylock() atomic_dec_and_test() + * LOCK MB + * atomic_read() mutex_is_locked() + * + * LOCK should suffice since the actual taking of the lock must + * happen _before_ what follows. + */ + if (mutex_is_locked(&anon_vma->root->mutex)) { + anon_vma_lock(anon_vma); + anon_vma_unlock(anon_vma); + } + kmem_cache_free(anon_vma_cachep, anon_vma); } @@ -372,20 +395,75 @@ out: return anon_vma; } +/* + * Similar to page_get_anon_vma() except it locks the anon_vma. + * + * Its a little more complex as it tries to keep the fast path to a single + * atomic op -- the trylock. If we fail the trylock, we fall back to getting a + * reference like with page_get_anon_vma() and then block on the mutex. + */ struct anon_vma *page_lock_anon_vma(struct page *page) { - struct anon_vma *anon_vma = page_get_anon_vma(page); + struct anon_vma *anon_vma = NULL; + unsigned long anon_mapping; - if (anon_vma) - anon_vma_lock(anon_vma); + rcu_read_lock(); + anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); + if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) + goto out; + if (!page_mapped(page)) + goto out; + + anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); + if (mutex_trylock(&anon_vma->root->mutex)) { + /* + * If we observe a !0 refcount, then holding the lock ensures + * the anon_vma will not go away, see __put_anon_vma(). + */ + if (!atomic_read(&anon_vma->refcount)) { + anon_vma_unlock(anon_vma); + anon_vma = NULL; + } + goto out; + } + + /* trylock failed, we got to sleep */ + if (!atomic_inc_not_zero(&anon_vma->refcount)) { + anon_vma = NULL; + goto out; + } + if (!page_mapped(page)) { + put_anon_vma(anon_vma); + anon_vma = NULL; + goto out; + } + + /* we pinned the anon_vma, its safe to sleep */ + rcu_read_unlock(); + anon_vma_lock(anon_vma); + + if (atomic_dec_and_test(&anon_vma->refcount)) { + /* + * Oops, we held the last refcount, release the lock + * and bail -- can't simply use put_anon_vma() because + * we'll deadlock on the anon_vma_lock() recursion. + */ + anon_vma_unlock(anon_vma); + __put_anon_vma(anon_vma); + anon_vma = NULL; + } + + return anon_vma; + +out: + rcu_read_unlock(); return anon_vma; } void page_unlock_anon_vma(struct anon_vma *anon_vma) { anon_vma_unlock(anon_vma); - put_anon_vma(anon_vma); } /* -- cgit v1.2.3-70-g09d2 From 9547d01bfb9c351dc19067f8a4cea9d3955f4125 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 24 May 2011 17:12:14 -0700 Subject: mm: uninline large generic tlb.h functions Some of these functions have grown beyond inline sanity, move them out-of-line. Signed-off-by: Peter Zijlstra Requested-by: Andrew Morton Requested-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/asm-generic/tlb.h | 135 +++++----------------------------------------- mm/memory.c | 124 +++++++++++++++++++++++++++++++++++++++++- 2 files changed, 135 insertions(+), 124 deletions(-) (limited to 'mm') diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 5a946a08ff9..e58fa777fa0 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -96,134 +96,25 @@ struct mmu_gather { struct page *__pages[MMU_GATHER_BUNDLE]; }; -/* - * For UP we don't need to worry about TLB flush - * and page free order so much.. - */ -#ifdef CONFIG_SMP - #define tlb_fast_mode(tlb) (tlb->fast_mode) -#else - #define tlb_fast_mode(tlb) 1 -#endif +#define HAVE_GENERIC_MMU_GATHER -static inline int tlb_next_batch(struct mmu_gather *tlb) +static inline int tlb_fast_mode(struct mmu_gather *tlb) { - struct mmu_gather_batch *batch; - - batch = tlb->active; - if (batch->next) { - tlb->active = batch->next; - return 1; - } - - batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0); - if (!batch) - return 0; - - batch->next = NULL; - batch->nr = 0; - batch->max = MAX_GATHER_BATCH; - - tlb->active->next = batch; - tlb->active = batch; - +#ifdef CONFIG_SMP + return tlb->fast_mode; +#else + /* + * For UP we don't need to worry about TLB flush + * and page free order so much.. + */ return 1; -} - -/* tlb_gather_mmu - * Called to initialize an (on-stack) mmu_gather structure for page-table - * tear-down from @mm. The @fullmm argument is used when @mm is without - * users and we're going to destroy the full address space (exit/execve). - */ -static inline void -tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) -{ - tlb->mm = mm; - - tlb->fullmm = fullmm; - tlb->need_flush = 0; - tlb->fast_mode = (num_possible_cpus() == 1); - tlb->local.next = NULL; - tlb->local.nr = 0; - tlb->local.max = ARRAY_SIZE(tlb->__pages); - tlb->active = &tlb->local; - -#ifdef CONFIG_HAVE_RCU_TABLE_FREE - tlb->batch = NULL; #endif } -static inline void -tlb_flush_mmu(struct mmu_gather *tlb) -{ - struct mmu_gather_batch *batch; - - if (!tlb->need_flush) - return; - tlb->need_flush = 0; - tlb_flush(tlb); -#ifdef CONFIG_HAVE_RCU_TABLE_FREE - tlb_table_flush(tlb); -#endif - - if (tlb_fast_mode(tlb)) - return; - - for (batch = &tlb->local; batch; batch = batch->next) { - free_pages_and_swap_cache(batch->pages, batch->nr); - batch->nr = 0; - } - tlb->active = &tlb->local; -} - -/* tlb_finish_mmu - * Called at the end of the shootdown operation to free up any resources - * that were required. - */ -static inline void -tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) -{ - struct mmu_gather_batch *batch, *next; - - tlb_flush_mmu(tlb); - - /* keep the page table cache within bounds */ - check_pgt_cache(); - - for (batch = tlb->local.next; batch; batch = next) { - next = batch->next; - free_pages((unsigned long)batch, 0); - } - tlb->local.next = NULL; -} - -/* __tlb_remove_page - * Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while - * handling the additional races in SMP caused by other CPUs caching valid - * mappings in their TLBs. Returns the number of free page slots left. - * When out of page slots we must call tlb_flush_mmu(). - */ -static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) -{ - struct mmu_gather_batch *batch; - - tlb->need_flush = 1; - - if (tlb_fast_mode(tlb)) { - free_page_and_swap_cache(page); - return 1; /* avoid calling tlb_flush_mmu() */ - } - - batch = tlb->active; - batch->pages[batch->nr++] = page; - VM_BUG_ON(batch->nr > batch->max); - if (batch->nr == batch->max) { - if (!tlb_next_batch(tlb)) - return 0; - } - - return batch->max - batch->nr; -} +void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm); +void tlb_flush_mmu(struct mmu_gather *tlb); +void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end); +int __tlb_remove_page(struct mmu_gather *tlb, struct page *page); /* tlb_remove_page * Similar to __tlb_remove_page but will call tlb_flush_mmu() itself when diff --git a/mm/memory.c b/mm/memory.c index 7bbe4d3df75..b73f677f0bb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -182,7 +182,7 @@ void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) { __sync_task_rss_stat(task, mm); } -#else +#else /* SPLIT_RSS_COUNTING */ #define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member) #define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member) @@ -191,8 +191,128 @@ static void check_sync_rss_stat(struct task_struct *task) { } +#endif /* SPLIT_RSS_COUNTING */ + +#ifdef HAVE_GENERIC_MMU_GATHER + +static int tlb_next_batch(struct mmu_gather *tlb) +{ + struct mmu_gather_batch *batch; + + batch = tlb->active; + if (batch->next) { + tlb->active = batch->next; + return 1; + } + + batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0); + if (!batch) + return 0; + + batch->next = NULL; + batch->nr = 0; + batch->max = MAX_GATHER_BATCH; + + tlb->active->next = batch; + tlb->active = batch; + + return 1; +} + +/* tlb_gather_mmu + * Called to initialize an (on-stack) mmu_gather structure for page-table + * tear-down from @mm. The @fullmm argument is used when @mm is without + * users and we're going to destroy the full address space (exit/execve). + */ +void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm) +{ + tlb->mm = mm; + + tlb->fullmm = fullmm; + tlb->need_flush = 0; + tlb->fast_mode = (num_possible_cpus() == 1); + tlb->local.next = NULL; + tlb->local.nr = 0; + tlb->local.max = ARRAY_SIZE(tlb->__pages); + tlb->active = &tlb->local; + +#ifdef CONFIG_HAVE_RCU_TABLE_FREE + tlb->batch = NULL; +#endif +} + +void tlb_flush_mmu(struct mmu_gather *tlb) +{ + struct mmu_gather_batch *batch; + + if (!tlb->need_flush) + return; + tlb->need_flush = 0; + tlb_flush(tlb); +#ifdef CONFIG_HAVE_RCU_TABLE_FREE + tlb_table_flush(tlb); #endif + if (tlb_fast_mode(tlb)) + return; + + for (batch = &tlb->local; batch; batch = batch->next) { + free_pages_and_swap_cache(batch->pages, batch->nr); + batch->nr = 0; + } + tlb->active = &tlb->local; +} + +/* tlb_finish_mmu + * Called at the end of the shootdown operation to free up any resources + * that were required. + */ +void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) +{ + struct mmu_gather_batch *batch, *next; + + tlb_flush_mmu(tlb); + + /* keep the page table cache within bounds */ + check_pgt_cache(); + + for (batch = tlb->local.next; batch; batch = next) { + next = batch->next; + free_pages((unsigned long)batch, 0); + } + tlb->local.next = NULL; +} + +/* __tlb_remove_page + * Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while + * handling the additional races in SMP caused by other CPUs caching valid + * mappings in their TLBs. Returns the number of free page slots left. + * When out of page slots we must call tlb_flush_mmu(). + */ +int __tlb_remove_page(struct mmu_gather *tlb, struct page *page) +{ + struct mmu_gather_batch *batch; + + tlb->need_flush = 1; + + if (tlb_fast_mode(tlb)) { + free_page_and_swap_cache(page); + return 1; /* avoid calling tlb_flush_mmu() */ + } + + batch = tlb->active; + batch->pages[batch->nr++] = page; + if (batch->nr == batch->max) { + if (!tlb_next_batch(tlb)) + return 0; + } + VM_BUG_ON(batch->nr > batch->max); + + return batch->max - batch->nr; +} + +#endif /* HAVE_GENERIC_MMU_GATHER */ + #ifdef CONFIG_HAVE_RCU_TABLE_FREE /* @@ -268,7 +388,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) tlb_table_flush(tlb); } -#endif +#endif /* CONFIG_HAVE_RCU_TABLE_FREE */ /* * If a p?d_bad entry is found while walking page tables, report -- cgit v1.2.3-70-g09d2 From 692e0b35427a088bf75d9363788c61c7edbe93a5 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Tue, 24 May 2011 17:12:14 -0700 Subject: mm: thp: optimize memcg charge in khugepaged We don't need to hold the mmmap_sem through mem_cgroup_newpage_charge(), the mmap_sem is only hold for keeping the vma stable and we don't need the vma stable anymore after we return from alloc_hugepage_vma(). Signed-off-by: Andrea Arcangeli Cc: Johannes Weiner Cc: Hugh Dickins Cc: David Rientjes Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) (limited to 'mm') diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 90eef404ec2..615d9743a3c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1771,12 +1771,9 @@ static void collapse_huge_page(struct mm_struct *mm, VM_BUG_ON(address & ~HPAGE_PMD_MASK); #ifndef CONFIG_NUMA + up_read(&mm->mmap_sem); VM_BUG_ON(!*hpage); new_page = *hpage; - if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) { - up_read(&mm->mmap_sem); - return; - } #else VM_BUG_ON(*hpage); /* @@ -1791,22 +1788,26 @@ static void collapse_huge_page(struct mm_struct *mm, */ new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address, node, __GFP_OTHER_NODE); + + /* + * After allocating the hugepage, release the mmap_sem read lock in + * preparation for taking it in write mode. + */ + up_read(&mm->mmap_sem); if (unlikely(!new_page)) { - up_read(&mm->mmap_sem); count_vm_event(THP_COLLAPSE_ALLOC_FAILED); *hpage = ERR_PTR(-ENOMEM); return; } +#endif + count_vm_event(THP_COLLAPSE_ALLOC); if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) { - up_read(&mm->mmap_sem); +#ifdef CONFIG_NUMA put_page(new_page); +#endif return; } -#endif - - /* after allocating the hugepage upgrade to mmap_sem write mode */ - up_read(&mm->mmap_sem); /* * Prevent all access to pagetables with the exception of -- cgit v1.2.3-70-g09d2 From de03c72cfce5b263a674d04348b58475ec50163c Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 24 May 2011 17:12:15 -0700 Subject: mm: convert mm->cpu_vm_cpumask into cpumask_var_t cpumask_t is very big struct and cpu_vm_mask is placed wrong position. It might lead to reduce cache hit ratio. This patch has two change. 1) Move the place of cpumask into last of mm_struct. Because usually cpumask is accessed only front bits when the system has cpu-hotplug capability 2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce memory footprint if cpumask_size() will use nr_cpumask_bits properly in future. In addition, this patch change the name of cpu_vm_mask with cpu_vm_mask_var. It may help to detect out of tree cpu_vm_mask users. This patch has no functional change. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro Cc: David Howells Cc: Koichi Yasutake Cc: Hugh Dickins Cc: Chris Metcalf Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cachetlb.txt | 2 +- arch/x86/kernel/tboot.c | 1 - include/linux/mm_types.h | 9 ++++++--- include/linux/sched.h | 1 + init/main.c | 2 ++ kernel/fork.c | 37 ++++++++++++++++++++++++++++++++++--- mm/init-mm.c | 1 - 7 files changed, 44 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index 9164ae3b83b..9b728dc1753 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -16,7 +16,7 @@ on all processors in the system. Don't let this scare you into thinking SMP cache/tlb flushing must be so inefficient, this is in fact an area where many optimizations are possible. For example, if it can be proven that a user address space has never executed -on a cpu (see vma->cpu_vm_mask), one need not perform a flush +on a cpu (see mm_cpumask()), one need not perform a flush for this address space on that cpu. First, the TLB flushing interfaces, since they are the simplest. The diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index 998e972f3b1..30ac65df7d4 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -110,7 +110,6 @@ static struct mm_struct tboot_mm = { .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), .mmlist = LIST_HEAD_INIT(init_mm.mmlist), - .cpu_vm_mask = CPU_MASK_ALL, }; static inline void switch_to_tboot_pt(void) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 201998e5b53..c2f9ea7922f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -265,8 +265,6 @@ struct mm_struct { struct linux_binfmt *binfmt; - cpumask_t cpu_vm_mask; - /* Architecture-specific MM context */ mm_context_t context; @@ -316,9 +314,14 @@ struct mm_struct { #ifdef CONFIG_TRANSPARENT_HUGEPAGE pgtable_t pmd_huge_pte; /* protected by page_table_lock */ #endif + + cpumask_var_t cpu_vm_mask_var; }; /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */ -#define mm_cpumask(mm) (&(mm)->cpu_vm_mask) +static inline cpumask_t *mm_cpumask(struct mm_struct *mm) +{ + return mm->cpu_vm_mask_var; +} #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 44b8faaac7c..f18300eddfc 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2176,6 +2176,7 @@ static inline void mmdrop(struct mm_struct * mm) if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); } +extern int mm_init_cpumask(struct mm_struct *mm, struct mm_struct *oldmm); /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); diff --git a/init/main.c b/init/main.c index 48df882d51d..22da33918ae 100644 --- a/init/main.c +++ b/init/main.c @@ -509,6 +509,8 @@ asmlinkage void __init start_kernel(void) sort_main_extable(); trap_init(); mm_init(); + BUG_ON(mm_init_cpumask(&init_mm, 0)); + /* * Set up the scheduler prior starting any interrupts (such as the * timer interrupt). Full topology setup happens at smp_init() diff --git a/kernel/fork.c b/kernel/fork.c index 927692734bc..8e7e135d081 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -485,6 +485,20 @@ static void mm_init_aio(struct mm_struct *mm) #endif } +int mm_init_cpumask(struct mm_struct *mm, struct mm_struct *oldmm) +{ +#ifdef CONFIG_CPUMASK_OFFSTACK + if (!alloc_cpumask_var(&mm->cpu_vm_mask_var, GFP_KERNEL)) + return -ENOMEM; + + if (oldmm) + cpumask_copy(mm_cpumask(mm), mm_cpumask(oldmm)); + else + memset(mm_cpumask(mm), 0, cpumask_size()); +#endif + return 0; +} + static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p) { atomic_set(&mm->mm_users, 1); @@ -521,10 +535,20 @@ struct mm_struct * mm_alloc(void) struct mm_struct * mm; mm = allocate_mm(); - if (mm) { - memset(mm, 0, sizeof(*mm)); - mm = mm_init(mm, current); + if (!mm) + return NULL; + + memset(mm, 0, sizeof(*mm)); + mm = mm_init(mm, current); + if (!mm) + return NULL; + + if (mm_init_cpumask(mm, NULL)) { + mm_free_pgd(mm); + free_mm(mm); + return NULL; } + return mm; } @@ -536,6 +560,7 @@ struct mm_struct * mm_alloc(void) void __mmdrop(struct mm_struct *mm) { BUG_ON(mm == &init_mm); + free_cpumask_var(mm->cpu_vm_mask_var); mm_free_pgd(mm); destroy_context(mm); mmu_notifier_mm_destroy(mm); @@ -690,6 +715,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk) if (!mm_init(mm, tsk)) goto fail_nomem; + if (mm_init_cpumask(mm, oldmm)) + goto fail_nocpumask; + if (init_new_context(tsk, mm)) goto fail_nocontext; @@ -716,6 +744,9 @@ fail_nomem: return NULL; fail_nocontext: + free_cpumask_var(mm->cpu_vm_mask_var); + +fail_nocpumask: /* * If init_new_context() failed, we cannot use mmput() to free the mm * because it calls destroy_context() diff --git a/mm/init-mm.c b/mm/init-mm.c index 1d29cdfe8eb..4019979b263 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -21,6 +21,5 @@ struct mm_struct init_mm = { .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), .mmlist = LIST_HEAD_INIT(init_mm.mmlist), - .cpu_vm_mask = CPU_MASK_ALL, INIT_MM_CONTEXT(init_mm) }; -- cgit v1.2.3-70-g09d2 From a238ab5b0239575c179f4976064192c3f7409dad Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Tue, 24 May 2011 17:12:16 -0700 Subject: mm: break out page allocation warning code This originally started as a simple patch to give vmalloc() some more verbose output on failure on top of the plain page allocator messages. Johannes suggested that it might be nicer to lead with the vmalloc() info _before_ the page allocator messages. But, I do think there's a lot of value in what __alloc_pages_slowpath() does with its filtering and so forth. This patch creates a new function which other allocators can call instead of relying on the internal page allocator warnings. It also gives this function private rate-limiting which separates it from other printk_ratelimit() users. Signed-off-by: Dave Hansen Cc: Johannes Weiner Cc: David Rientjes Cc: Michal Nazarewicz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 ++ mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++------------------ 2 files changed, 43 insertions(+), 21 deletions(-) (limited to 'mm') diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ad0ac8c3f3..8e2f2cd821f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1387,6 +1387,8 @@ extern void si_meminfo(struct sysinfo * val); extern void si_meminfo_node(struct sysinfo *val, int nid); extern int after_bootmem; +extern void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...); + extern void setup_per_cpu_pageset(void); extern void zone_pcp_update(struct zone *zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 83eaa2eb72f..44019da9632 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -1736,6 +1737,45 @@ static inline bool should_suppress_show_mem(void) return ret; } +static DEFINE_RATELIMIT_STATE(nopage_rs, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + +void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...) +{ + va_list args; + unsigned int filter = SHOW_MEM_FILTER_NODES; + + if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs)) + return; + + /* + * This documents exceptions given to allocations in certain + * contexts that are allowed to allocate outside current's set + * of allowed nodes. + */ + if (!(gfp_mask & __GFP_NOMEMALLOC)) + if (test_thread_flag(TIF_MEMDIE) || + (current->flags & (PF_MEMALLOC | PF_EXITING))) + filter &= ~SHOW_MEM_FILTER_NODES; + if (in_interrupt() || !(gfp_mask & __GFP_WAIT)) + filter &= ~SHOW_MEM_FILTER_NODES; + + if (fmt) { + printk(KERN_WARNING); + va_start(args, fmt); + vprintk(fmt, args); + va_end(args); + } + + pr_warning("%s: page allocation failure: order:%d, mode:0x%x\n", + current->comm, order, gfp_mask); + + dump_stack(); + if (!should_suppress_show_mem()) + show_mem(filter); +} + static inline int should_alloc_retry(gfp_t gfp_mask, unsigned int order, unsigned long pages_reclaimed) @@ -2178,27 +2218,7 @@ rebalance: } nopage: - if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { - unsigned int filter = SHOW_MEM_FILTER_NODES; - - /* - * This documents exceptions given to allocations in certain - * contexts that are allowed to allocate outside current's set - * of allowed nodes. - */ - if (!(gfp_mask & __GFP_NOMEMALLOC)) - if (test_thread_flag(TIF_MEMDIE) || - (current->flags & (PF_MEMALLOC | PF_EXITING))) - filter &= ~SHOW_MEM_FILTER_NODES; - if (in_interrupt() || !wait) - filter &= ~SHOW_MEM_FILTER_NODES; - - pr_warning("%s: page allocation failure. order:%d, mode:0x%x\n", - current->comm, order, gfp_mask); - dump_stack(); - if (!should_suppress_show_mem()) - show_mem(filter); - } + warn_alloc_failed(gfp_mask, order, NULL); return page; got_pg: if (kmemcheck_enabled) -- cgit v1.2.3-70-g09d2 From 22943ab116af1ead4dc112ec408a93cf1365b34a Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Tue, 24 May 2011 17:12:18 -0700 Subject: mm: print vmalloc() state after allocation failures I was tracking down a page allocation failure that ended up in vmalloc(). Since vmalloc() uses 0-order pages, if somebody asks for an insane amount of memory, we'll still get a warning with "order:0" in it. That's not very useful. During recovery, vmalloc() also nicely frees all of the memory that it got up to the point of the failure. That is wonderful, but it also quickly hides any issues. We have a much different sitation if vmalloc() repeatedly fails 10GB in to: vmalloc(100 * 1<<30); versus repeatedly failing 4096 bytes in to a: vmalloc(8192); This patch will print out messages that look like this: [ 68.123503] vmalloc: allocation failure, allocated 6680576 of 13426688 bytes [ 68.124218] bash: page allocation failure: order:0, mode:0xd2 [ 68.124811] Pid: 3770, comm: bash Not tainted 2.6.39-rc3-00082-g85f2e68-dirty #333 [ 68.125579] Call Trace: [ 68.125853] [] warn_alloc_failed+0x146/0x170 [ 68.126464] [] ? printk+0x6c/0x70 [ 68.126791] [] ? alloc_pages_current+0x94/0xe0 [ 68.127661] [] __vmalloc_node_range+0x237/0x290 ... The 'order' variable is added for clarity when calling warn_alloc_failed() to avoid having an unexplained '0' as an argument. The 'tmp_mask' is because adding an open-coded '| __GFP_NOWARN' would take us over 80 columns for the alloc_pages_node() call. If we are going to add a line, it might as well be one that makes the sucker easier to read. As a side issue, I also noticed that ctl_ioctl() does vmalloc() based solely on an unverified value passed in from userspace. Granted, it's under CAP_SYS_ADMIN, but it still frightens me a bit. Signed-off-by: Dave Hansen Cc: Johannes Weiner Cc: David Rientjes Cc: Michal Nazarewicz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 4581ddcdda5..b5ccf3158d8 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1534,6 +1534,7 @@ static void *__vmalloc_node(unsigned long size, unsigned long align, static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, pgprot_t prot, int node, void *caller) { + const int order = 0; struct page **pages; unsigned int nr_pages, array_size, i; gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; @@ -1560,11 +1561,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, for (i = 0; i < area->nr_pages; i++) { struct page *page; + gfp_t tmp_mask = gfp_mask | __GFP_NOWARN; if (node < 0) - page = alloc_page(gfp_mask); + page = alloc_page(tmp_mask); else - page = alloc_pages_node(node, gfp_mask, 0); + page = alloc_pages_node(node, tmp_mask, order); if (unlikely(!page)) { /* Successfully allocated i pages, free them in __vunmap() */ @@ -1579,6 +1581,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, return area->addr; fail: + warn_alloc_failed(gfp_mask, order, "vmalloc: allocation failure, " + "allocated %ld of %ld bytes\n", + (area->nr_pages*PAGE_SIZE), area->size); vfree(area->addr); return NULL; } -- cgit v1.2.3-70-g09d2 From 700c2a46e88265326764197d5b8842490bae5569 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Tue, 24 May 2011 17:12:19 -0700 Subject: mem-hotplug: call isolate_lru_page with elevated refcount isolate_lru_page() must be called only with stable reference to page. So, let's grab normal page reference. Signed-off-by: Konstantin Khlebnikov Cc: Andi Kleen Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Mel Gorman Cc: Lee Schermerhorn Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 59ac18fefd6..89b0391f14a 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -706,7 +706,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) if (!pfn_valid(pfn)) continue; page = pfn_to_page(pfn); - if (!page_count(page)) + if (!get_page_unless_zero(page)) continue; /* * We can skip free pages. And we can only deal with pages on @@ -714,6 +714,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) */ ret = isolate_lru_page(page); if (!ret) { /* Success */ + put_page(page); list_add_tail(&page->lru, &source); move_pages--; inc_zone_page_state(page, NR_ISOLATED_ANON + @@ -725,6 +726,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) pfn); dump_page(page); #endif + put_page(page); /* Because we don't have big zone->lock. we should check this again here. */ if (page_count(page)) { -- cgit v1.2.3-70-g09d2 From bd486285f24ac2fd1ff64688fb0729712c5712c4 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Tue, 24 May 2011 17:12:20 -0700 Subject: mem-hwpoison: fix page refcount around isolate_lru_page() Drop first page reference only after calling isolate_lru_page() to keep page stable reference while isolating. Signed-off-by: Konstantin Khlebnikov Cc: Andi Kleen Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Mel Gorman Cc: Lee Schermerhorn Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 12178ec32ab..369b80e8141 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1440,16 +1440,12 @@ int soft_offline_page(struct page *page, int flags) */ ret = invalidate_inode_page(page); unlock_page(page); - /* - * Drop count because page migration doesn't like raised - * counts. The page could get re-allocated, but if it becomes - * LRU the isolation will just fail. * RED-PEN would be better to keep it isolated here, but we * would need to fix isolation locking first. */ - put_page(page); if (ret == 1) { + put_page(page); ret = 0; pr_info("soft_offline: %#lx: invalidated\n", pfn); goto done; @@ -1461,6 +1457,11 @@ int soft_offline_page(struct page *page, int flags) * handles a large number of cases for us. */ ret = isolate_lru_page(page); + /* + * Drop page reference which is came from get_any_page() + * successful isolate_lru_page() already took another one. + */ + put_page(page); if (!ret) { LIST_HEAD(pagelist); -- cgit v1.2.3-70-g09d2 From 0c917313a8d84fcc0c376db3f7edb7c06f06f920 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Tue, 24 May 2011 17:12:21 -0700 Subject: mm: strictly require elevated page refcount in isolate_lru_page() isolate_lru_page() must be called only with stable reference to the page, this is what is written in the comment above it, this is reasonable. current isolate_lru_page() users and its page extra reference sources: mm/huge_memory.c: __collapse_huge_page_isolate() - reference from pte mm/memcontrol.c: mem_cgroup_move_parent() - get_page_unless_zero() mem_cgroup_move_charge_pte_range() - reference from pte mm/memory-failure.c: soft_offline_page() - fixed, reference from get_any_page() delete_from_lru_cache() - reference from caller or get_page_unless_zero() [ seems like there bug, because __memory_failure() can call page_action() for hpages tail, but it is ok for isolate_lru_page(), tail getted and not in lru] mm/memory_hotplug.c: do_migrate_range() - fixed, get_page_unless_zero() mm/mempolicy.c: migrate_page_add() - reference from pte mm/migrate.c: do_move_page_to_node_array() - reference from follow_page() mlock.c: - various external references mm/vmscan.c: putback_lru_page() - reference from isolate_lru_page() It seems that all isolate_lru_page() users are ready now for this restriction. So, let's replace redundant get_page_unless_zero() with get_page() and add page initial reference count check with VM_BUG_ON() Signed-off-by: Konstantin Khlebnikov Cc: Andi Kleen Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Mel Gorman Cc: Lee Schermerhorn Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index d303b60f4c2..890f54184d9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1207,13 +1207,16 @@ int isolate_lru_page(struct page *page) { int ret = -EBUSY; + VM_BUG_ON(!page_count(page)); + if (PageLRU(page)) { struct zone *zone = page_zone(page); spin_lock_irq(&zone->lru_lock); - if (PageLRU(page) && get_page_unless_zero(page)) { + if (PageLRU(page)) { int lru = page_lru(page); ret = 0; + get_page(page); ClearPageLRU(page); del_page_from_lru_list(zone, page, lru); -- cgit v1.2.3-70-g09d2 From 6d3163ce86dd386b4f7bda80241d7fea2bc0bb1d Mon Sep 17 00:00:00 2001 From: Arve HjønnevÃ¥g Date: Tue, 24 May 2011 17:12:24 -0700 Subject: mm: check if any page in a pageblock is reserved before marking it MIGRATE_RESERVE MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This fixes a problem where the first pageblock got marked MIGRATE_RESERVE even though it only had a few free pages. eg, On current ARM port, The kernel starts at offset 0x8000 to leave room for boot parameters, and the memory is freed later. This in turn caused no contiguous memory to be reserved and frequent kswapd wakeups that emptied the caches to get more contiguous memory. Unfortunatelly, ARM needs order-2 allocation for pgd (see arm/mm/pgd.c#pgd_alloc()). Therefore the issue is not minor nor easy avoidable. [kosaki.motohiro@jp.fujitsu.com: added some explanation] [kosaki.motohiro@jp.fujitsu.com: add !pfn_valid_within() to check] [minchan.kim@gmail.com: check end_pfn in pageblock_is_reserved] Signed-off-by: John Stultz Signed-off-by: Arve HjønnevÃ¥g Signed-off-by: KOSAKI Motohiro Acked-by: Mel Gorman Acked-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 44019da9632..01e6b614839 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3328,6 +3328,20 @@ static inline unsigned long wait_table_bits(unsigned long size) #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1)) +/* + * Check if a pageblock contains reserved pages + */ +static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) +{ + unsigned long pfn; + + for (pfn = start_pfn; pfn < end_pfn; pfn++) { + if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) + return 1; + } + return 0; +} + /* * Mark a number of pageblocks as MIGRATE_RESERVE. The number * of blocks reserved is based on min_wmark_pages(zone). The memory within @@ -3337,7 +3351,7 @@ static inline unsigned long wait_table_bits(unsigned long size) */ static void setup_zone_migrate_reserve(struct zone *zone) { - unsigned long start_pfn, pfn, end_pfn; + unsigned long start_pfn, pfn, end_pfn, block_end_pfn; struct page *page; unsigned long block_migratetype; int reserve; @@ -3367,7 +3381,8 @@ static void setup_zone_migrate_reserve(struct zone *zone) continue; /* Blocks with reserved pages will never free, skip them. */ - if (PageReserved(page)) + block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn); + if (pageblock_is_reserved(pfn, block_end_pfn)) continue; block_migratetype = get_pageblock_migratetype(page); -- cgit v1.2.3-70-g09d2 From 7b1de5868b124d8f399d8791ed30a9b679d64d4d Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 24 May 2011 17:12:25 -0700 Subject: readahead: readahead page allocations are OK to fail Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations. readahead page allocations are completely optional. They are OK to fail and in particular shall not trigger OOM on themselves. Reported-by: Dave Young Reviewed-by: KOSAKI Motohiro Signed-off-by: Wu Fengguang Reviewed-by: Minchan Kim Reviewed-by: Pekka Enberg Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 6 ++++++ mm/readahead.c | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index ea268080380..716875e5352 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -219,6 +219,12 @@ static inline struct page *page_cache_alloc_cold(struct address_space *x) return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); } +static inline struct page *page_cache_alloc_readahead(struct address_space *x) +{ + return __page_cache_alloc(mapping_gfp_mask(x) | + __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN); +} + typedef int filler_t(void *, struct page *); extern struct page * find_get_page(struct address_space *mapping, diff --git a/mm/readahead.c b/mm/readahead.c index 2c0cc489e28..867f9dd82dc 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -180,7 +180,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp, if (page) continue; - page = page_cache_alloc_cold(mapping); + page = page_cache_alloc_readahead(mapping); if (!page) break; page->index = page_offset; -- cgit v1.2.3-70-g09d2 From a09ed5e00084448453c8bada4dcd31e5fbfc2f21 Mon Sep 17 00:00:00 2001 From: Ying Han Date: Tue, 24 May 2011 17:12:26 -0700 Subject: vmscan: change shrink_slab() interfaces by passing shrink_control Consolidate the existing parameters to shrink_slab() into a new shrink_control struct. This is needed later to pass the same struct to shrinkers. Signed-off-by: Ying Han Cc: KOSAKI Motohiro Cc: Minchan Kim Acked-by: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Cc: Hugh Dickins Cc: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/drop_caches.c | 6 +++++- include/linux/mm.h | 13 +++++++++++-- mm/memory-failure.c | 7 ++++++- mm/vmscan.c | 46 +++++++++++++++++++++++++++++++++------------- 4 files changed, 55 insertions(+), 17 deletions(-) (limited to 'mm') diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 98b77c89494..440999c2435 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -40,9 +40,13 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) static void drop_slab(void) { int nr_objects; + struct shrink_control shrink = { + .gfp_mask = GFP_KERNEL, + .nr_scanned = 1000, + }; do { - nr_objects = shrink_slab(1000, GFP_KERNEL, 1000); + nr_objects = shrink_slab(&shrink, 1000); } while (nr_objects > 10); } diff --git a/include/linux/mm.h b/include/linux/mm.h index 8e2f2cd821f..32cfa9602d0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1161,6 +1161,15 @@ static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) } #endif +/* + * This struct is used to pass information from page reclaim to the shrinkers. + * We consolidate the values for easier extention later. + */ +struct shrink_control { + unsigned long nr_scanned; + gfp_t gfp_mask; +}; + /* * A callback you can register to apply pressure to ageable caches. * @@ -1630,8 +1639,8 @@ int in_gate_area_no_mm(unsigned long addr); int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); -unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, - unsigned long lru_pages); +unsigned long shrink_slab(struct shrink_control *shrink, + unsigned long lru_pages); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 369b80e8141..341341b2b47 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -239,7 +239,12 @@ void shake_page(struct page *p, int access) if (access) { int nr; do { - nr = shrink_slab(1000, GFP_KERNEL, 1000); + struct shrink_control shrink = { + .gfp_mask = GFP_KERNEL, + .nr_scanned = 1000, + }; + + nr = shrink_slab(&shrink, 1000); if (page_count(p) == 1) break; } while (nr > 10); diff --git a/mm/vmscan.c b/mm/vmscan.c index 890f54184d9..e4e245ed1a5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -222,11 +222,13 @@ EXPORT_SYMBOL(unregister_shrinker); * * Returns the number of slab objects which we shrunk. */ -unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, - unsigned long lru_pages) +unsigned long shrink_slab(struct shrink_control *shrink, + unsigned long lru_pages) { struct shrinker *shrinker; unsigned long ret = 0; + unsigned long scanned = shrink->nr_scanned; + gfp_t gfp_mask = shrink->gfp_mask; if (scanned == 0) scanned = SWAP_CLUSTER_MAX; @@ -2035,7 +2037,8 @@ static bool all_unreclaimable(struct zonelist *zonelist, * else, the number of pages reclaimed */ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, - struct scan_control *sc) + struct scan_control *sc, + struct shrink_control *shrink) { int priority; unsigned long total_scanned = 0; @@ -2069,7 +2072,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, lru_pages += zone_reclaimable_pages(zone); } - shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); + shrink->nr_scanned = sc->nr_scanned; + shrink_slab(shrink, lru_pages); if (reclaim_state) { sc->nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; @@ -2141,12 +2145,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .mem_cgroup = NULL, .nodemask = nodemask, }; + struct shrink_control shrink = { + .gfp_mask = sc.gfp_mask, + }; trace_mm_vmscan_direct_reclaim_begin(order, sc.may_writepage, gfp_mask); - nr_reclaimed = do_try_to_free_pages(zonelist, &sc); + nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); trace_mm_vmscan_direct_reclaim_end(nr_reclaimed); @@ -2206,17 +2213,20 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, .order = 0, .mem_cgroup = mem_cont, .nodemask = NULL, /* we don't care the placement */ + .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | + (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), + }; + struct shrink_control shrink = { + .gfp_mask = sc.gfp_mask, }; - sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | - (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); zonelist = NODE_DATA(numa_node_id())->node_zonelists; trace_mm_vmscan_memcg_reclaim_begin(0, sc.may_writepage, sc.gfp_mask); - nr_reclaimed = do_try_to_free_pages(zonelist, &sc); + nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); @@ -2344,6 +2354,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, .order = order, .mem_cgroup = NULL, }; + struct shrink_control shrink = { + .gfp_mask = sc.gfp_mask, + }; loop_again: total_scanned = 0; sc.nr_reclaimed = 0; @@ -2443,8 +2456,8 @@ loop_again: end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; - nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); + shrink.nr_scanned = sc.nr_scanned; + nr_slab = shrink_slab(&shrink, lru_pages); sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; @@ -2796,7 +2809,10 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) .swappiness = vm_swappiness, .order = 0, }; - struct zonelist * zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); + struct shrink_control shrink = { + .gfp_mask = sc.gfp_mask, + }; + struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); struct task_struct *p = current; unsigned long nr_reclaimed; @@ -2805,7 +2821,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - nr_reclaimed = do_try_to_free_pages(zonelist, &sc); + nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); p->reclaim_state = NULL; lockdep_clear_current_reclaim_state(); @@ -2980,6 +2996,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) .swappiness = vm_swappiness, .order = order, }; + struct shrink_control shrink = { + .gfp_mask = sc.gfp_mask, + }; unsigned long nr_slab_pages0, nr_slab_pages1; cond_resched(); @@ -3006,6 +3025,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) } nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); + shrink.nr_scanned = sc.nr_scanned; if (nr_slab_pages0 > zone->min_slab_pages) { /* * shrink_slab() does not currently allow us to determine how @@ -3021,7 +3041,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) unsigned long lru_pages = zone_reclaimable_pages(zone); /* No reclaimable slab or very low memory pressure */ - if (!shrink_slab(sc.nr_scanned, gfp_mask, lru_pages)) + if (!shrink_slab(&shrink, lru_pages)) break; /* Freed enough memory */ -- cgit v1.2.3-70-g09d2 From 1495f230fa7750479c79e3656286b9183d662077 Mon Sep 17 00:00:00 2001 From: Ying Han Date: Tue, 24 May 2011 17:12:27 -0700 Subject: vmscan: change shrinker API by passing shrink_control struct Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix warning] [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API] [akpm@linux-foundation.org: fix xfs warning] [akpm@linux-foundation.org: update gfs2] Signed-off-by: Ying Han Cc: KOSAKI Motohiro Cc: Minchan Kim Acked-by: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Cc: Hugh Dickins Cc: Dave Hansen Cc: Steven Whitehouse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/kvm/mmu.c | 3 ++- drivers/gpu/drm/i915/i915_gem.c | 9 +++------ drivers/gpu/drm/ttm/ttm_page_alloc.c | 4 +++- drivers/staging/zcache/zcache.c | 5 ++++- fs/dcache.c | 8 ++++++-- fs/drop_caches.c | 3 +-- fs/gfs2/glock.c | 5 ++++- fs/gfs2/quota.c | 12 +++++++----- fs/gfs2/quota.h | 4 +++- fs/inode.c | 6 +++++- fs/mbcache.c | 10 ++++++---- fs/nfs/dir.c | 5 ++++- fs/nfs/internal.h | 2 +- fs/quota/dquot.c | 5 ++++- fs/xfs/linux-2.6/xfs_buf.c | 4 ++-- fs/xfs/linux-2.6/xfs_sync.c | 5 +++-- fs/xfs/quota/xfs_qm.c | 6 +++--- include/linux/mm.h | 19 +++++++++++-------- mm/memory-failure.c | 3 +-- mm/vmscan.c | 34 +++++++++++++++++++--------------- net/sunrpc/auth.c | 4 +++- 21 files changed, 95 insertions(+), 61 deletions(-) (limited to 'mm') diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 28418054b88..bd14bb4c859 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3545,10 +3545,11 @@ static int kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm, return kvm_mmu_prepare_zap_page(kvm, page, invalid_list); } -static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask) +static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc) { struct kvm *kvm; struct kvm *kvm_freed = NULL; + int nr_to_scan = sc->nr_to_scan; if (nr_to_scan == 0) goto out; diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index c6289034e29..0b2e167d2bc 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -56,9 +56,7 @@ static int i915_gem_phys_pwrite(struct drm_device *dev, static void i915_gem_free_object_tail(struct drm_i915_gem_object *obj); static int i915_gem_inactive_shrink(struct shrinker *shrinker, - int nr_to_scan, - gfp_t gfp_mask); - + struct shrink_control *sc); /* some bookkeeping */ static void i915_gem_info_add_obj(struct drm_i915_private *dev_priv, @@ -4092,9 +4090,7 @@ i915_gpu_is_active(struct drm_device *dev) } static int -i915_gem_inactive_shrink(struct shrinker *shrinker, - int nr_to_scan, - gfp_t gfp_mask) +i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc) { struct drm_i915_private *dev_priv = container_of(shrinker, @@ -4102,6 +4098,7 @@ i915_gem_inactive_shrink(struct shrinker *shrinker, mm.inactive_shrinker); struct drm_device *dev = dev_priv->dev; struct drm_i915_gem_object *obj, *next; + int nr_to_scan = sc->nr_to_scan; int cnt; if (!mutex_trylock(&dev->struct_mutex)) diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c index 9d9d92945f8..d948575717b 100644 --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c @@ -395,12 +395,14 @@ static int ttm_pool_get_num_unused_pages(void) /** * Callback for mm to request pool to reduce number of page held. */ -static int ttm_pool_mm_shrink(struct shrinker *shrink, int shrink_pages, gfp_t gfp_mask) +static int ttm_pool_mm_shrink(struct shrinker *shrink, + struct shrink_control *sc) { static atomic_t start_pool = ATOMIC_INIT(0); unsigned i; unsigned pool_offset = atomic_add_return(1, &start_pool); struct ttm_page_pool *pool; + int shrink_pages = sc->nr_to_scan; pool_offset = pool_offset % NUM_POOLS; /* select start pool in round robin fashion */ diff --git a/drivers/staging/zcache/zcache.c b/drivers/staging/zcache/zcache.c index b8a2b30a157..77ac2d4d3ef 100644 --- a/drivers/staging/zcache/zcache.c +++ b/drivers/staging/zcache/zcache.c @@ -1181,9 +1181,12 @@ static bool zcache_freeze; /* * zcache shrinker interface (only useful for ephemeral pages, so zbud only) */ -static int shrink_zcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) +static int shrink_zcache_memory(struct shrinker *shrink, + struct shrink_control *sc) { int ret = -1; + int nr = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; if (nr >= 0) { if (!(gfp_mask & __GFP_FS)) diff --git a/fs/dcache.c b/fs/dcache.c index 18b2a1f10ed..37f72ee5bf7 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1220,7 +1220,7 @@ void shrink_dcache_parent(struct dentry * parent) EXPORT_SYMBOL(shrink_dcache_parent); /* - * Scan `nr' dentries and return the number which remain. + * Scan `sc->nr_slab_to_reclaim' dentries and return the number which remain. * * We need to avoid reentering the filesystem if the caller is performing a * GFP_NOFS allocation attempt. One example deadlock is: @@ -1231,8 +1231,12 @@ EXPORT_SYMBOL(shrink_dcache_parent); * * In this case we return -1 to tell the caller that we baled. */ -static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) +static int shrink_dcache_memory(struct shrinker *shrink, + struct shrink_control *sc) { + int nr = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; + if (nr) { if (!(gfp_mask & __GFP_FS)) return -1; diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 440999c2435..c00e055b628 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -42,11 +42,10 @@ static void drop_slab(void) int nr_objects; struct shrink_control shrink = { .gfp_mask = GFP_KERNEL, - .nr_scanned = 1000, }; do { - nr_objects = shrink_slab(&shrink, 1000); + nr_objects = shrink_slab(&shrink, 1000, 1000); } while (nr_objects > 10); } diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index a2a6abbccc0..2792a790e50 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -1346,11 +1346,14 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret) } -static int gfs2_shrink_glock_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) +static int gfs2_shrink_glock_memory(struct shrinker *shrink, + struct shrink_control *sc) { struct gfs2_glock *gl; int may_demote; int nr_skipped = 0; + int nr = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; LIST_HEAD(skipped); if (nr == 0) diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index e23d9864c41..42e8d23bc04 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -38,6 +38,7 @@ #include #include +#include #include #include #include @@ -77,19 +78,20 @@ static LIST_HEAD(qd_lru_list); static atomic_t qd_lru_count = ATOMIC_INIT(0); static DEFINE_SPINLOCK(qd_lru_lock); -int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) +int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc) { struct gfs2_quota_data *qd; struct gfs2_sbd *sdp; + int nr_to_scan = sc->nr_to_scan; - if (nr == 0) + if (nr_to_scan == 0) goto out; - if (!(gfp_mask & __GFP_FS)) + if (!(sc->gfp_mask & __GFP_FS)) return -1; spin_lock(&qd_lru_lock); - while (nr && !list_empty(&qd_lru_list)) { + while (nr_to_scan && !list_empty(&qd_lru_list)) { qd = list_entry(qd_lru_list.next, struct gfs2_quota_data, qd_reclaim); sdp = qd->qd_gl->gl_sbd; @@ -110,7 +112,7 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) spin_unlock(&qd_lru_lock); kmem_cache_free(gfs2_quotad_cachep, qd); spin_lock(&qd_lru_lock); - nr--; + nr_to_scan--; } spin_unlock(&qd_lru_lock); diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h index e7d236ca48b..90bf1c302a9 100644 --- a/fs/gfs2/quota.h +++ b/fs/gfs2/quota.h @@ -12,6 +12,7 @@ struct gfs2_inode; struct gfs2_sbd; +struct shrink_control; #define NO_QUOTA_CHANGE ((u32)-1) @@ -51,7 +52,8 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip) return ret; } -extern int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask); +extern int gfs2_shrink_qd_memory(struct shrinker *shrink, + struct shrink_control *sc); extern const struct quotactl_ops gfs2_quotactl_ops; #endif /* __QUOTA_DOT_H__ */ diff --git a/fs/inode.c b/fs/inode.c index 88aa3bcc768..990d284877a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -751,8 +751,12 @@ static void prune_icache(int nr_to_scan) * This function is passed the number of inodes to scan, and it returns the * total number of remaining possibly-reclaimable inodes. */ -static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) +static int shrink_icache_memory(struct shrinker *shrink, + struct shrink_control *sc) { + int nr = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; + if (nr) { /* * Nasty deadlock avoidance. We may hold various FS locks, diff --git a/fs/mbcache.c b/fs/mbcache.c index 2f174be0655..8c32ef3ba88 100644 --- a/fs/mbcache.c +++ b/fs/mbcache.c @@ -90,7 +90,8 @@ static DEFINE_SPINLOCK(mb_cache_spinlock); * What the mbcache registers as to get shrunk dynamically. */ -static int mb_cache_shrink_fn(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask); +static int mb_cache_shrink_fn(struct shrinker *shrink, + struct shrink_control *sc); static struct shrinker mb_cache_shrinker = { .shrink = mb_cache_shrink_fn, @@ -156,18 +157,19 @@ forget: * gets low. * * @shrink: (ignored) - * @nr_to_scan: Number of objects to scan - * @gfp_mask: (ignored) + * @sc: shrink_control passed from reclaim * * Returns the number of objects which are present in the cache. */ static int -mb_cache_shrink_fn(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask) +mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc) { LIST_HEAD(free_list); struct mb_cache *cache; struct mb_cache_entry *entry, *tmp; int count = 0; + int nr_to_scan = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; mb_debug("trying to free %d entries", nr_to_scan); spin_lock(&mb_cache_spinlock); diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 7237672216c..424e47773a8 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2042,11 +2042,14 @@ static void nfs_access_free_list(struct list_head *head) } } -int nfs_access_cache_shrinker(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask) +int nfs_access_cache_shrinker(struct shrinker *shrink, + struct shrink_control *sc) { LIST_HEAD(head); struct nfs_inode *nfsi, *next; struct nfs_access_entry *cache; + int nr_to_scan = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL) return (nr_to_scan == 0) ? 0 : -1; diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index ce118ce885d..2df6ca7b589 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -234,7 +234,7 @@ extern int nfs_init_client(struct nfs_client *clp, /* dir.c */ extern int nfs_access_cache_shrinker(struct shrinker *shrink, - int nr_to_scan, gfp_t gfp_mask); + struct shrink_control *sc); /* inode.c */ extern struct workqueue_struct *nfsiod_workqueue; diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c index d3c032f5fa0..5b572c89e6c 100644 --- a/fs/quota/dquot.c +++ b/fs/quota/dquot.c @@ -691,8 +691,11 @@ static void prune_dqcache(int count) * This is called from kswapd when we think we need some * more memory */ -static int shrink_dqcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask) +static int shrink_dqcache_memory(struct shrinker *shrink, + struct shrink_control *sc) { + int nr = sc->nr_to_scan; + if (nr) { spin_lock(&dq_list_lock); prune_dqcache(nr); diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index 52b2b5da566..5e68099db2a 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -1422,12 +1422,12 @@ restart: int xfs_buftarg_shrink( struct shrinker *shrink, - int nr_to_scan, - gfp_t mask) + struct shrink_control *sc) { struct xfs_buftarg *btp = container_of(shrink, struct xfs_buftarg, bt_shrinker); struct xfs_buf *bp; + int nr_to_scan = sc->nr_to_scan; LIST_HEAD(dispose); if (!nr_to_scan) diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c index cb1bb2080e4..8ecad5ff9f9 100644 --- a/fs/xfs/linux-2.6/xfs_sync.c +++ b/fs/xfs/linux-2.6/xfs_sync.c @@ -1032,13 +1032,14 @@ xfs_reclaim_inodes( static int xfs_reclaim_inode_shrink( struct shrinker *shrink, - int nr_to_scan, - gfp_t gfp_mask) + struct shrink_control *sc) { struct xfs_mount *mp; struct xfs_perag *pag; xfs_agnumber_t ag; int reclaimable; + int nr_to_scan = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; mp = container_of(shrink, struct xfs_mount, m_inode_shrink); if (nr_to_scan) { diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c index 69228aa8605..b94dace4e78 100644 --- a/fs/xfs/quota/xfs_qm.c +++ b/fs/xfs/quota/xfs_qm.c @@ -60,7 +60,7 @@ STATIC void xfs_qm_list_destroy(xfs_dqlist_t *); STATIC int xfs_qm_init_quotainos(xfs_mount_t *); STATIC int xfs_qm_init_quotainfo(xfs_mount_t *); -STATIC int xfs_qm_shake(struct shrinker *, int, gfp_t); +STATIC int xfs_qm_shake(struct shrinker *, struct shrink_control *); static struct shrinker xfs_qm_shaker = { .shrink = xfs_qm_shake, @@ -2009,10 +2009,10 @@ xfs_qm_shake_freelist( STATIC int xfs_qm_shake( struct shrinker *shrink, - int nr_to_scan, - gfp_t gfp_mask) + struct shrink_control *sc) { int ndqused, nfree, n; + gfp_t gfp_mask = sc->gfp_mask; if (!kmem_shake_allow(gfp_mask)) return 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 32cfa9602d0..5cbbf78eaac 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1166,18 +1166,20 @@ static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) * We consolidate the values for easier extention later. */ struct shrink_control { - unsigned long nr_scanned; gfp_t gfp_mask; + + /* How many slab objects shrinker() should scan and try to reclaim */ + unsigned long nr_to_scan; }; /* * A callback you can register to apply pressure to ageable caches. * - * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'. It should - * look through the least-recently-used 'nr_to_scan' entries and - * attempt to free them up. It should return the number of objects - * which remain in the cache. If it returns -1, it means it cannot do - * any scanning at this time (eg. there is a risk of deadlock). + * 'sc' is passed shrink_control which includes a count 'nr_to_scan' + * and a 'gfpmask'. It should look through the least-recently-used + * 'nr_to_scan' entries and attempt to free them up. It should return + * the number of objects which remain in the cache. If it returns -1, it means + * it cannot do any scanning at this time (eg. there is a risk of deadlock). * * The 'gfpmask' refers to the allocation we are currently trying to * fulfil. @@ -1186,7 +1188,7 @@ struct shrink_control { * querying the cache size, so a fastpath for that case is appropriate. */ struct shrinker { - int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask); + int (*shrink)(struct shrinker *, struct shrink_control *sc); int seeks; /* seeks to recreate an obj */ /* These are for internal use */ @@ -1640,7 +1642,8 @@ int in_gate_area_no_mm(unsigned long addr); int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); unsigned long shrink_slab(struct shrink_control *shrink, - unsigned long lru_pages); + unsigned long nr_pages_scanned, + unsigned long lru_pages); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 341341b2b47..5c8f7e08928 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -241,10 +241,9 @@ void shake_page(struct page *p, int access) do { struct shrink_control shrink = { .gfp_mask = GFP_KERNEL, - .nr_scanned = 1000, }; - nr = shrink_slab(&shrink, 1000); + nr = shrink_slab(&shrink, 1000, 1000); if (page_count(p) == 1) break; } while (nr > 10); diff --git a/mm/vmscan.c b/mm/vmscan.c index e4e245ed1a5..7e0116150dc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -202,6 +202,14 @@ void unregister_shrinker(struct shrinker *shrinker) } EXPORT_SYMBOL(unregister_shrinker); +static inline int do_shrinker_shrink(struct shrinker *shrinker, + struct shrink_control *sc, + unsigned long nr_to_scan) +{ + sc->nr_to_scan = nr_to_scan; + return (*shrinker->shrink)(shrinker, sc); +} + #define SHRINK_BATCH 128 /* * Call the shrink functions to age shrinkable caches @@ -223,15 +231,14 @@ EXPORT_SYMBOL(unregister_shrinker); * Returns the number of slab objects which we shrunk. */ unsigned long shrink_slab(struct shrink_control *shrink, + unsigned long nr_pages_scanned, unsigned long lru_pages) { struct shrinker *shrinker; unsigned long ret = 0; - unsigned long scanned = shrink->nr_scanned; - gfp_t gfp_mask = shrink->gfp_mask; - if (scanned == 0) - scanned = SWAP_CLUSTER_MAX; + if (nr_pages_scanned == 0) + nr_pages_scanned = SWAP_CLUSTER_MAX; if (!down_read_trylock(&shrinker_rwsem)) { /* Assume we'll be able to shrink next time */ @@ -244,8 +251,8 @@ unsigned long shrink_slab(struct shrink_control *shrink, unsigned long total_scan; unsigned long max_pass; - max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask); - delta = (4 * scanned) / shrinker->seeks; + max_pass = do_shrinker_shrink(shrinker, shrink, 0); + delta = (4 * nr_pages_scanned) / shrinker->seeks; delta *= max_pass; do_div(delta, lru_pages + 1); shrinker->nr += delta; @@ -272,9 +279,9 @@ unsigned long shrink_slab(struct shrink_control *shrink, int shrink_ret; int nr_before; - nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask); - shrink_ret = (*shrinker->shrink)(shrinker, this_scan, - gfp_mask); + nr_before = do_shrinker_shrink(shrinker, shrink, 0); + shrink_ret = do_shrinker_shrink(shrinker, shrink, + this_scan); if (shrink_ret == -1) break; if (shrink_ret < nr_before) @@ -2072,8 +2079,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, lru_pages += zone_reclaimable_pages(zone); } - shrink->nr_scanned = sc->nr_scanned; - shrink_slab(shrink, lru_pages); + shrink_slab(shrink, sc->nr_scanned, lru_pages); if (reclaim_state) { sc->nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; @@ -2456,8 +2462,7 @@ loop_again: end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; - shrink.nr_scanned = sc.nr_scanned; - nr_slab = shrink_slab(&shrink, lru_pages); + nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages); sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; @@ -3025,7 +3030,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) } nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); - shrink.nr_scanned = sc.nr_scanned; if (nr_slab_pages0 > zone->min_slab_pages) { /* * shrink_slab() does not currently allow us to determine how @@ -3041,7 +3045,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) unsigned long lru_pages = zone_reclaimable_pages(zone); /* No reclaimable slab or very low memory pressure */ - if (!shrink_slab(&shrink, lru_pages)) + if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages)) break; /* Freed enough memory */ diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c index 67e31276682..cd6e4aa19db 100644 --- a/net/sunrpc/auth.c +++ b/net/sunrpc/auth.c @@ -326,10 +326,12 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan) * Run memory cache shrinker. */ static int -rpcauth_cache_shrinker(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask) +rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc) { LIST_HEAD(free); int res; + int nr_to_scan = sc->nr_to_scan; + gfp_t gfp_mask = sc->gfp_mask; if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL) return (nr_to_scan == 0) ? 0 : -1; -- cgit v1.2.3-70-g09d2 From 275b12bf5486f6f531111fd3d7dbbf01df427cfe Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 24 May 2011 17:12:28 -0700 Subject: readahead: return early when readahead is disabled Reduce readahead overheads by returning early in do_sync_mmap_readahead(). tmpfs has ra_pages=0 and it can page fault really fast (not constraint by IO if not swapping). Signed-off-by: Wu Fengguang Tested-by: Tim Chen Reported-by: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index 88354ae0b1f..c974a286389 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1556,6 +1556,8 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, /* If we don't want any read-ahead, don't bother */ if (VM_RandomReadHint(vma)) return; + if (!ra->ra_pages) + return; if (VM_SequentialReadHint(vma) || offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) { @@ -1578,12 +1580,10 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, * mmap read-around */ ra_pages = max_sane_readahead(ra->ra_pages); - if (ra_pages) { - ra->start = max_t(long, 0, offset - ra_pages/2); - ra->size = ra_pages; - ra->async_size = 0; - ra_submit(ra, mapping, file); - } + ra->start = max_t(long, 0, offset - ra_pages / 2); + ra->size = ra_pages; + ra->async_size = 0; + ra_submit(ra, mapping, file); } /* -- cgit v1.2.3-70-g09d2 From 207d04baa3591a354711e863dd90087fc75873b3 Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Tue, 24 May 2011 17:12:29 -0700 Subject: readahead: reduce unnecessary mmap_miss increases The original INT_MAX is too large, reduce it to - avoid unnecessarily dirtying/bouncing the cache line - restore mmap read-around faster on changed access pattern Background: in the mosbench exim benchmark which does multi-threaded page faults on shared struct file, the ra->mmap_miss updates are found to cause excessive cache line bouncing on tmpfs. The ra state updates are needless for tmpfs because it actually disabled readahead totally (shmem_backing_dev_info.ra_pages == 0). Tested-by: Tim Chen Signed-off-by: Andi Kleen Signed-off-by: Wu Fengguang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index c974a286389..e5131392d32 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1566,7 +1566,8 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, return; } - if (ra->mmap_miss < INT_MAX) + /* Avoid banging the cache line if not needed */ + if (ra->mmap_miss < MMAP_LOTSAMISS * 10) ra->mmap_miss++; /* -- cgit v1.2.3-70-g09d2 From 2cbea1d3ab11946885d37a2461072ee4d687cb4e Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 24 May 2011 17:12:30 -0700 Subject: readahead: trigger mmap sequential readahead on PG_readahead Previously the mmap sequential readahead is triggered by updating ra->prev_pos on each page fault and compare it with current page offset. It costs dirtying the cache line on each _minor_ page fault. So remove the ra->prev_pos recording, and instead tag PG_readahead to trigger the possible sequential readahead. It's not only more simple, but also will work more reliably and reduce cache line bouncing on concurrent page faults on shared struct file. In the mosbench exim benchmark which does multi-threaded page faults on shared struct file, the ra->mmap_miss and ra->prev_pos updates are found to cause excessive cache line bouncing on tmpfs, which actually disabled readahead totally (shmem_backing_dev_info.ra_pages == 0). So remove the ra->prev_pos recording, and instead tag PG_readahead to trigger the possible sequential readahead. It's not only more simple, but also will work more reliably on concurrent reads on shared struct file. Signed-off-by: Wu Fengguang Tested-by: Tim Chen Reported-by: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index e5131392d32..68e782b3d3d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1559,8 +1559,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, if (!ra->ra_pages) return; - if (VM_SequentialReadHint(vma) || - offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) { + if (VM_SequentialReadHint(vma)) { page_cache_sync_readahead(mapping, ra, file, offset, ra->ra_pages); return; @@ -1583,7 +1582,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma, ra_pages = max_sane_readahead(ra->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); ra->size = ra_pages; - ra->async_size = 0; + ra->async_size = ra_pages / 4; ra_submit(ra, mapping, file); } @@ -1689,7 +1688,6 @@ retry_find: return VM_FAULT_SIGBUS; } - ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT; vmf->page = page; return ret | VM_FAULT_LOCKED; -- cgit v1.2.3-70-g09d2 From 821ed6bbed3cf41c4050a431eeb822b33868d36a Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 24 May 2011 17:12:31 -0700 Subject: mm: filter unevictable page out in deactivate_page() It's pointless that deactive_page's operates on unevictable pages. This patch removes unnecessary overhead which might be a bit problem in case that there are many unevictable page in system(ex, mprotect workload) [akpm@linux-foundation.org: tidy up comment] Reviewed-by: KOSAKI Motohiro Signed-off-by: Minchan Kim Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swap.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'mm') diff --git a/mm/swap.c b/mm/swap.c index 5602f1a1b1e..2f365d1a4bb 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -476,6 +476,13 @@ static void drain_cpu_pagevecs(int cpu) */ void deactivate_page(struct page *page) { + /* + * In a workload with many unevictable page such as mprotect, unevictable + * page deactivation for accelerating reclaim is pointless. + */ + if (PageUnevictable(page)) + return; + if (likely(get_page_unless_zero(page))) { struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs); -- cgit v1.2.3-70-g09d2 From a3bc42f584cf9024580adeb4031d4202dac05858 Mon Sep 17 00:00:00 2001 From: Daniel Kiper Date: Tue, 24 May 2011 17:12:31 -0700 Subject: mm: remove dependency on CONFIG_FLATMEM from online_page() online_pages() is only compiled for CONFIG_MEMORY_HOTPLUG_SPARSE, so there is no need to support CONFIG_FLATMEM code within it. This patch removes code that is never used. Signed-off-by: Daniel Kiper Acked-by: Dave Hansen Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 89b0391f14a..9f646374e32 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -374,10 +374,6 @@ void online_page(struct page *page) totalhigh_pages++; #endif -#ifdef CONFIG_FLATMEM - max_mapnr = max(pfn, max_mapnr); -#endif - ClearPageReserved(page); init_page_count(page); __free_page(page); -- cgit v1.2.3-70-g09d2 From a197b59ae6e8bee56fcef37ea2482dc08414e2ac Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 24 May 2011 17:12:35 -0700 Subject: mm: fail GFP_DMA allocations when ZONE_DMA is not configured The page allocator will improperly return a page from ZONE_NORMAL even when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller expects DMA memory, perhaps for ISA devices with 16-bit address registers, and may get higher memory resulting in undefined behavior. This patch causes the page allocator to return NULL in such circumstances with a warning emitted to the kernel log on the first occurrence. Signed-off-by: David Rientjes Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 01e6b614839..10a8c6da385 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2247,6 +2247,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (should_fail_alloc_page(gfp_mask, order)) return NULL; +#ifndef CONFIG_ZONE_DMA + if (WARN_ON_ONCE(gfp_mask & __GFP_DMA)) + return NULL; +#endif /* * Check the zones suitable for the gfp_mask contain at least one -- cgit v1.2.3-70-g09d2 From 4eb317072be81bd93906f768679f745bc574e6b7 Mon Sep 17 00:00:00 2001 From: Yinghai Lu Date: Tue, 24 May 2011 17:12:38 -0700 Subject: memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high() The bootmem wrapper with memblock supports top-down now, so we no longer need this trick. Signed-off-by: Yinghai LU Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Olaf Hering Cc: Tejun Heo Cc: Lucas De Marchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nobootmem.c | 23 ----------------------- 1 file changed, 23 deletions(-) (limited to 'mm') diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 9109049f0bb..6e93dc7f258 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -307,30 +307,7 @@ void * __init __alloc_bootmem_node(pg_data_t *pgdat, unsigned long size, void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal) { -#ifdef MAX_DMA32_PFN - unsigned long end_pfn; - - if (WARN_ON_ONCE(slab_is_available())) - return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id); - - /* update goal according ...MAX_DMA32_PFN */ - end_pfn = pgdat->node_start_pfn + pgdat->node_spanned_pages; - - if (end_pfn > MAX_DMA32_PFN + (128 >> (20 - PAGE_SHIFT)) && - (goal >> PAGE_SHIFT) < MAX_DMA32_PFN) { - void *ptr; - unsigned long new_goal; - - new_goal = MAX_DMA32_PFN << PAGE_SHIFT; - ptr = __alloc_memory_core_early(pgdat->node_id, size, align, - new_goal, -1ULL); - if (ptr) - return ptr; - } -#endif - return __alloc_bootmem_node(pgdat, size, align, goal); - } #ifdef CONFIG_SPARSEMEM -- cgit v1.2.3-70-g09d2 From b09e0fa4b4ea66266058eead43350bd7d55fec67 Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Tue, 24 May 2011 17:12:39 -0700 Subject: tmpfs: implement generic xattr support Implement generic xattrs for tmpfs filesystems. The Feodra project, while trying to replace suid apps with file capabilities, realized that tmpfs, which is used on the build systems, does not support file capabilities and thus cannot be used to build packages which use file capabilities. Xattrs are also needed for overlayfs. The xattr interface is a bit odd. If a filesystem does not implement any {get,set,list}xattr functions the VFS will call into some random LSM hooks and the running LSM can then implement some method for handling xattrs. SELinux for example provides a method to support security.selinux but no other security.* xattrs. As it stands today when one enables CONFIG_TMPFS_POSIX_ACL tmpfs will have xattr handler routines specifically to handle acls. Because of this tmpfs would loose the VFS/LSM helpers to support the running LSM. To make up for that tmpfs had stub functions that did nothing but call into the LSM hooks which implement the helpers. This new patch does not use the LSM fallback functions and instead just implements a native get/set/list xattr feature for the full security.* and trusted.* namespace like a normal filesystem. This means that tmpfs can now support both security.selinux and security.capability, which was not previously possible. The basic implementation is that I attach a: struct shmem_xattr { struct list_head list; /* anchored by shmem_inode_info->xattr_list */ char *name; size_t size; char value[0]; }; Into the struct shmem_inode_info for each xattr that is set. This implementation could easily support the user.* namespace as well, except some care needs to be taken to prevent large amounts of unswappable memory being allocated for unprivileged users. [mszeredi@suse.cz: new config option, suport trusted.*, support symlinks] Signed-off-by: Eric Paris Signed-off-by: Miklos Szeredi Acked-by: Serge Hallyn Tested-by: Serge Hallyn Cc: Kyle McMartin Acked-by: Hugh Dickins Tested-by: Jordi Pujol Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/Kconfig | 18 ++- include/linux/shmem_fs.h | 8 +- mm/shmem.c | 320 +++++++++++++++++++++++++++++++++++++++-------- 3 files changed, 290 insertions(+), 56 deletions(-) (limited to 'mm') diff --git a/fs/Kconfig b/fs/Kconfig index f3aa9b08b22..979992dcb38 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -121,9 +121,25 @@ config TMPFS See for details. +config TMPFS_XATTR + bool "Tmpfs extended attributes" + depends on TMPFS + default n + help + Extended attributes are name:value pairs associated with inodes by + the kernel or by users (see the attr(5) manual page, or visit + for details). + + Currently this enables support for the trusted.* and + security.* namespaces. + + If unsure, say N. + + You need this for POSIX ACL support on tmpfs. + config TMPFS_POSIX_ACL bool "Tmpfs POSIX Access Control Lists" - depends on TMPFS + depends on TMPFS_XATTR select GENERIC_ACL help POSIX Access Control Lists (ACLs) support permissions for users and diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 399be5ad2f9..2b7fec84051 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,8 @@ #define SHMEM_NR_DIRECT 16 +#define SHMEM_SYMLINK_INLINE_LEN (SHMEM_NR_DIRECT * sizeof(swp_entry_t)) + struct shmem_inode_info { spinlock_t lock; unsigned long flags; @@ -17,8 +19,12 @@ struct shmem_inode_info { unsigned long next_index; /* highest alloced index + 1 */ struct shared_policy policy; /* NUMA memory alloc policy */ struct page *i_indirect; /* top indirect blocks page */ - swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* first blocks */ + union { + swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* first blocks */ + char inline_symlink[SHMEM_SYMLINK_INLINE_LEN]; + }; struct list_head swaplist; /* chain of maybes on swap */ + struct list_head xattr_list; /* list of shmem_xattr */ struct inode vfs_inode; }; diff --git a/mm/shmem.c b/mm/shmem.c index ba4ad28b7db..69edb45a9f2 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -99,6 +99,13 @@ static struct vfsmount *shm_mnt; /* Pretend that each entry is of this size in directory's i_size */ #define BOGO_DIRENT_SIZE 20 +struct shmem_xattr { + struct list_head list; /* anchored by shmem_inode_info->xattr_list */ + char *name; /* xattr name */ + size_t size; + char value[0]; +}; + /* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ @@ -822,6 +829,7 @@ static int shmem_notify_change(struct dentry *dentry, struct iattr *attr) static void shmem_evict_inode(struct inode *inode) { struct shmem_inode_info *info = SHMEM_I(inode); + struct shmem_xattr *xattr, *nxattr; if (inode->i_mapping->a_ops == &shmem_aops) { truncate_inode_pages(inode->i_mapping, 0); @@ -834,6 +842,11 @@ static void shmem_evict_inode(struct inode *inode) mutex_unlock(&shmem_swaplist_mutex); } } + + list_for_each_entry_safe(xattr, nxattr, &info->xattr_list, list) { + kfree(xattr->name); + kfree(xattr); + } BUG_ON(inode->i_blocks); shmem_free_inode(inode->i_sb); end_writeback(inode); @@ -1615,6 +1628,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode spin_lock_init(&info->lock); info->flags = flags & VM_NORESERVE; INIT_LIST_HEAD(&info->swaplist); + INIT_LIST_HEAD(&info->xattr_list); cache_no_acl(inode); switch (mode & S_IFMT) { @@ -2014,9 +2028,9 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s info = SHMEM_I(inode); inode->i_size = len-1; - if (len <= (char *)inode - (char *)info) { + if (len <= SHMEM_SYMLINK_INLINE_LEN) { /* do it inline */ - memcpy(info, symname, len); + memcpy(info->inline_symlink, symname, len); inode->i_op = &shmem_symlink_inline_operations; } else { error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL); @@ -2042,7 +2056,7 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s static void *shmem_follow_link_inline(struct dentry *dentry, struct nameidata *nd) { - nd_set_link(nd, (char *)SHMEM_I(dentry->d_inode)); + nd_set_link(nd, SHMEM_I(dentry->d_inode)->inline_symlink); return NULL; } @@ -2066,63 +2080,253 @@ static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *co } } -static const struct inode_operations shmem_symlink_inline_operations = { - .readlink = generic_readlink, - .follow_link = shmem_follow_link_inline, -}; - -static const struct inode_operations shmem_symlink_inode_operations = { - .readlink = generic_readlink, - .follow_link = shmem_follow_link, - .put_link = shmem_put_link, -}; - -#ifdef CONFIG_TMPFS_POSIX_ACL +#ifdef CONFIG_TMPFS_XATTR /* - * Superblocks without xattr inode operations will get security.* xattr - * support from the VFS "for free". As soon as we have any other xattrs + * Superblocks without xattr inode operations may get some security.* xattr + * support from the LSM "for free". As soon as we have any other xattrs * like ACLs, we also need to implement the security.* handlers at * filesystem level, though. */ -static size_t shmem_xattr_security_list(struct dentry *dentry, char *list, - size_t list_len, const char *name, - size_t name_len, int handler_flags) +static int shmem_xattr_get(struct dentry *dentry, const char *name, + void *buffer, size_t size) { - return security_inode_listsecurity(dentry->d_inode, list, list_len); -} + struct shmem_inode_info *info; + struct shmem_xattr *xattr; + int ret = -ENODATA; -static int shmem_xattr_security_get(struct dentry *dentry, const char *name, - void *buffer, size_t size, int handler_flags) -{ - if (strcmp(name, "") == 0) - return -EINVAL; - return xattr_getsecurity(dentry->d_inode, name, buffer, size); + info = SHMEM_I(dentry->d_inode); + + spin_lock(&info->lock); + list_for_each_entry(xattr, &info->xattr_list, list) { + if (strcmp(name, xattr->name)) + continue; + + ret = xattr->size; + if (buffer) { + if (size < xattr->size) + ret = -ERANGE; + else + memcpy(buffer, xattr->value, xattr->size); + } + break; + } + spin_unlock(&info->lock); + return ret; } -static int shmem_xattr_security_set(struct dentry *dentry, const char *name, - const void *value, size_t size, int flags, int handler_flags) +static int shmem_xattr_set(struct dentry *dentry, const char *name, + const void *value, size_t size, int flags) { - if (strcmp(name, "") == 0) - return -EINVAL; - return security_inode_setsecurity(dentry->d_inode, name, value, - size, flags); + struct inode *inode = dentry->d_inode; + struct shmem_inode_info *info = SHMEM_I(inode); + struct shmem_xattr *xattr; + struct shmem_xattr *new_xattr = NULL; + size_t len; + int err = 0; + + /* value == NULL means remove */ + if (value) { + /* wrap around? */ + len = sizeof(*new_xattr) + size; + if (len <= sizeof(*new_xattr)) + return -ENOMEM; + + new_xattr = kmalloc(len, GFP_KERNEL); + if (!new_xattr) + return -ENOMEM; + + new_xattr->name = kstrdup(name, GFP_KERNEL); + if (!new_xattr->name) { + kfree(new_xattr); + return -ENOMEM; + } + + new_xattr->size = size; + memcpy(new_xattr->value, value, size); + } + + spin_lock(&info->lock); + list_for_each_entry(xattr, &info->xattr_list, list) { + if (!strcmp(name, xattr->name)) { + if (flags & XATTR_CREATE) { + xattr = new_xattr; + err = -EEXIST; + } else if (new_xattr) { + list_replace(&xattr->list, &new_xattr->list); + } else { + list_del(&xattr->list); + } + goto out; + } + } + if (flags & XATTR_REPLACE) { + xattr = new_xattr; + err = -ENODATA; + } else { + list_add(&new_xattr->list, &info->xattr_list); + xattr = NULL; + } +out: + spin_unlock(&info->lock); + if (xattr) + kfree(xattr->name); + kfree(xattr); + return err; } -static const struct xattr_handler shmem_xattr_security_handler = { - .prefix = XATTR_SECURITY_PREFIX, - .list = shmem_xattr_security_list, - .get = shmem_xattr_security_get, - .set = shmem_xattr_security_set, -}; static const struct xattr_handler *shmem_xattr_handlers[] = { +#ifdef CONFIG_TMPFS_POSIX_ACL &generic_acl_access_handler, &generic_acl_default_handler, - &shmem_xattr_security_handler, +#endif NULL }; + +static int shmem_xattr_validate(const char *name) +{ + struct { const char *prefix; size_t len; } arr[] = { + { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN }, + { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN } + }; + int i; + + for (i = 0; i < ARRAY_SIZE(arr); i++) { + size_t preflen = arr[i].len; + if (strncmp(name, arr[i].prefix, preflen) == 0) { + if (!name[preflen]) + return -EINVAL; + return 0; + } + } + return -EOPNOTSUPP; +} + +static ssize_t shmem_getxattr(struct dentry *dentry, const char *name, + void *buffer, size_t size) +{ + int err; + + /* + * If this is a request for a synthetic attribute in the system.* + * namespace use the generic infrastructure to resolve a handler + * for it via sb->s_xattr. + */ + if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) + return generic_getxattr(dentry, name, buffer, size); + + err = shmem_xattr_validate(name); + if (err) + return err; + + return shmem_xattr_get(dentry, name, buffer, size); +} + +static int shmem_setxattr(struct dentry *dentry, const char *name, + const void *value, size_t size, int flags) +{ + int err; + + /* + * If this is a request for a synthetic attribute in the system.* + * namespace use the generic infrastructure to resolve a handler + * for it via sb->s_xattr. + */ + if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) + return generic_setxattr(dentry, name, value, size, flags); + + err = shmem_xattr_validate(name); + if (err) + return err; + + if (size == 0) + value = ""; /* empty EA, do not remove */ + + return shmem_xattr_set(dentry, name, value, size, flags); + +} + +static int shmem_removexattr(struct dentry *dentry, const char *name) +{ + int err; + + /* + * If this is a request for a synthetic attribute in the system.* + * namespace use the generic infrastructure to resolve a handler + * for it via sb->s_xattr. + */ + if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) + return generic_removexattr(dentry, name); + + err = shmem_xattr_validate(name); + if (err) + return err; + + return shmem_xattr_set(dentry, name, NULL, 0, XATTR_REPLACE); +} + +static bool xattr_is_trusted(const char *name) +{ + return !strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN); +} + +static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size) +{ + bool trusted = capable(CAP_SYS_ADMIN); + struct shmem_xattr *xattr; + struct shmem_inode_info *info; + size_t used = 0; + + info = SHMEM_I(dentry->d_inode); + + spin_lock(&info->lock); + list_for_each_entry(xattr, &info->xattr_list, list) { + size_t len; + + /* skip "trusted." attributes for unprivileged callers */ + if (!trusted && xattr_is_trusted(xattr->name)) + continue; + + len = strlen(xattr->name) + 1; + used += len; + if (buffer) { + if (size < used) { + used = -ERANGE; + break; + } + memcpy(buffer, xattr->name, len); + buffer += len; + } + } + spin_unlock(&info->lock); + + return used; +} +#endif /* CONFIG_TMPFS_XATTR */ + +static const struct inode_operations shmem_symlink_inline_operations = { + .readlink = generic_readlink, + .follow_link = shmem_follow_link_inline, +#ifdef CONFIG_TMPFS_XATTR + .setxattr = shmem_setxattr, + .getxattr = shmem_getxattr, + .listxattr = shmem_listxattr, + .removexattr = shmem_removexattr, +#endif +}; + +static const struct inode_operations shmem_symlink_inode_operations = { + .readlink = generic_readlink, + .follow_link = shmem_follow_link, + .put_link = shmem_put_link, +#ifdef CONFIG_TMPFS_XATTR + .setxattr = shmem_setxattr, + .getxattr = shmem_getxattr, + .listxattr = shmem_listxattr, + .removexattr = shmem_removexattr, #endif +}; static struct dentry *shmem_get_parent(struct dentry *child) { @@ -2402,8 +2606,10 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) sb->s_magic = TMPFS_MAGIC; sb->s_op = &shmem_ops; sb->s_time_gran = 1; -#ifdef CONFIG_TMPFS_POSIX_ACL +#ifdef CONFIG_TMPFS_XATTR sb->s_xattr = shmem_xattr_handlers; +#endif +#ifdef CONFIG_TMPFS_POSIX_ACL sb->s_flags |= MS_POSIXACL; #endif @@ -2501,11 +2707,13 @@ static const struct file_operations shmem_file_operations = { static const struct inode_operations shmem_inode_operations = { .setattr = shmem_notify_change, .truncate_range = shmem_truncate_range, +#ifdef CONFIG_TMPFS_XATTR + .setxattr = shmem_setxattr, + .getxattr = shmem_getxattr, + .listxattr = shmem_listxattr, + .removexattr = shmem_removexattr, +#endif #ifdef CONFIG_TMPFS_POSIX_ACL - .setxattr = generic_setxattr, - .getxattr = generic_getxattr, - .listxattr = generic_listxattr, - .removexattr = generic_removexattr, .check_acl = generic_check_acl, #endif @@ -2523,23 +2731,27 @@ static const struct inode_operations shmem_dir_inode_operations = { .mknod = shmem_mknod, .rename = shmem_rename, #endif +#ifdef CONFIG_TMPFS_XATTR + .setxattr = shmem_setxattr, + .getxattr = shmem_getxattr, + .listxattr = shmem_listxattr, + .removexattr = shmem_removexattr, +#endif #ifdef CONFIG_TMPFS_POSIX_ACL .setattr = shmem_notify_change, - .setxattr = generic_setxattr, - .getxattr = generic_getxattr, - .listxattr = generic_listxattr, - .removexattr = generic_removexattr, .check_acl = generic_check_acl, #endif }; static const struct inode_operations shmem_special_inode_operations = { +#ifdef CONFIG_TMPFS_XATTR + .setxattr = shmem_setxattr, + .getxattr = shmem_getxattr, + .listxattr = shmem_listxattr, + .removexattr = shmem_removexattr, +#endif #ifdef CONFIG_TMPFS_POSIX_ACL .setattr = shmem_notify_change, - .setxattr = generic_setxattr, - .getxattr = generic_getxattr, - .listxattr = generic_listxattr, - .removexattr = generic_removexattr, .check_acl = generic_check_acl, #endif }; -- cgit v1.2.3-70-g09d2 From d98f6cb67fb5b9376d4957d7ba9f32eac35c2e08 Mon Sep 17 00:00:00 2001 From: Stephen Wilson Date: Tue, 24 May 2011 17:12:41 -0700 Subject: mm: export get_vma_policy() In commit 48fce3429d ("mempolicies: unexport get_vma_policy()") get_vma_policy() was marked static as all clients were local to mempolicy.c. However, the decision to generate /proc/pid/numa_maps in the numa memory policy code and outside the procfs subsystem introduces an artificial interdependency between the two systems. Exporting get_vma_policy() once again is the first step to clean up this interdependency. Signed-off-by: Stephen Wilson Reviewed-by: KOSAKI Motohiro Cc: Hugh Dickins Cc: David Rientjes Cc: Lee Schermerhorn Cc: Alexey Dobriyan Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mempolicy.h | 3 +++ mm/mempolicy.c | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 31ac26ca4ac..c2f603218fb 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -199,6 +199,9 @@ void mpol_free_shared_policy(struct shared_policy *p); struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx); +struct mempolicy *get_vma_policy(struct task_struct *tsk, + struct vm_area_struct *vma, unsigned long addr); + extern void numa_default_policy(void); extern void numa_policy_init(void); extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new, diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 959a8b8c735..5bfb03ef3cb 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1489,7 +1489,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * freeing by another task. It is the caller's responsibility to free the * extra reference for shared policies. */ -static struct mempolicy *get_vma_policy(struct task_struct *task, +struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = task->mempolicy; -- cgit v1.2.3-70-g09d2 From 29ea2f6982f1edc4302729116f2246dd7b45471d Mon Sep 17 00:00:00 2001 From: Stephen Wilson Date: Tue, 24 May 2011 17:12:42 -0700 Subject: mm: use walk_page_range() instead of custom page table walking code Converting show_numa_map() to use the generic routine decouples the function from mempolicy.c, allowing it to be moved out of the mm subsystem and into fs/proc. Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk logic implemented by check_pte_range() failed to account for such pages as they were not applicable to the page migration case. Signed-off-by: Stephen Wilson Reviewed-by: KOSAKI Motohiro Cc: Hugh Dickins Cc: David Rientjes Cc: Lee Schermerhorn Cc: Alexey Dobriyan Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 68 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 5bfb03ef3cb..945e85de2d4 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2531,6 +2531,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context) } struct numa_maps { + struct vm_area_struct *vma; unsigned long pages; unsigned long anon; unsigned long active; @@ -2568,6 +2569,41 @@ static void gather_stats(struct page *page, void *private, int pte_dirty) md->node[page_to_nid(page)]++; } +static int gather_pte_stats(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) +{ + struct numa_maps *md; + spinlock_t *ptl; + pte_t *orig_pte; + pte_t *pte; + + md = walk->private; + orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + do { + struct page *page; + int nid; + + if (!pte_present(*pte)) + continue; + + page = vm_normal_page(md->vma, addr, *pte); + if (!page) + continue; + + if (PageReserved(page)) + continue; + + nid = page_to_nid(page); + if (!node_isset(nid, node_states[N_HIGH_MEMORY])) + continue; + + gather_stats(page, md, pte_dirty(*pte)); + + } while (pte++, addr += PAGE_SIZE, addr != end); + pte_unmap_unlock(orig_pte, ptl); + return 0; +} + #ifdef CONFIG_HUGETLB_PAGE static void check_huge_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, @@ -2597,12 +2633,35 @@ static void check_huge_range(struct vm_area_struct *vma, gather_stats(page, md, pte_dirty(*ptep)); } } + +static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long end, struct mm_walk *walk) +{ + struct page *page; + + if (pte_none(*pte)) + return 0; + + page = pte_page(*pte); + if (!page) + return 0; + + gather_stats(page, walk->private, pte_dirty(*pte)); + return 0; +} + #else static inline void check_huge_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, struct numa_maps *md) { } + +static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long end, struct mm_walk *walk) +{ + return 0; +} #endif /* @@ -2615,6 +2674,7 @@ int show_numa_map(struct seq_file *m, void *v) struct numa_maps *md; struct file *file = vma->vm_file; struct mm_struct *mm = vma->vm_mm; + struct mm_walk walk = {}; struct mempolicy *pol; int n; char buffer[50]; @@ -2626,6 +2686,13 @@ int show_numa_map(struct seq_file *m, void *v) if (!md) return 0; + md->vma = vma; + + walk.hugetlb_entry = gather_hugetbl_stats; + walk.pmd_entry = gather_pte_stats; + walk.private = md; + walk.mm = mm; + pol = get_vma_policy(priv->task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); @@ -2642,13 +2709,7 @@ int show_numa_map(struct seq_file *m, void *v) seq_printf(m, " stack"); } - if (is_vm_hugetlb_page(vma)) { - check_huge_range(vma, vma->vm_start, vma->vm_end, md); - seq_printf(m, " huge"); - } else { - check_pgd_range(vma, vma->vm_start, vma->vm_end, - &node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md); - } + walk_page_range(vma->vm_start, vma->vm_end, &walk); if (!md->pages) goto out; -- cgit v1.2.3-70-g09d2 From b1f72d1857bb0de19ce20a59f3f85e6dc47bdec8 Mon Sep 17 00:00:00 2001 From: Stephen Wilson Date: Tue, 24 May 2011 17:12:43 -0700 Subject: mm: remove MPOL_MF_STATS Mapping statistics in a NUMA environment is now computed using the generic walk_page_range() logic. Remove the old/equivalent functionality. Signed-off-by: Stephen Wilson Reviewed-by: KOSAKI Motohiro Cc: Hugh Dickins Cc: David Rientjes Cc: Lee Schermerhorn Cc: Alexey Dobriyan Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 945e85de2d4..4e188766fd4 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -99,7 +99,6 @@ /* Internal flags */ #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */ #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */ -#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */ static struct kmem_cache *policy_cache; static struct kmem_cache *sn_cache; @@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) continue; - if (flags & MPOL_MF_STATS) - gather_stats(page, private, pte_dirty(*pte)); - else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) migrate_page_add(page, private, flags); else break; -- cgit v1.2.3-70-g09d2 From 722e2ee09b8dfc2ac5eedb802dc0d227702df084 Mon Sep 17 00:00:00 2001 From: Stephen Wilson Date: Tue, 24 May 2011 17:12:44 -0700 Subject: mm: make gather_stats() type-safe and remove forward declaration Improve the prototype of gather_stats() to take a struct numa_maps as argument instead of a generic void *. Update all callers to make the required type explicit. Since gather_stats() is not needed before its definition and is scheduled to be moved out of mempolicy.c the declaration is removed as well. Signed-off-by: Stephen Wilson Reviewed-by: KOSAKI Motohiro Cc: Hugh Dickins Cc: David Rientjes Cc: Lee Schermerhorn Cc: Alexey Dobriyan Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4e188766fd4..6c5306544df 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -456,7 +456,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { }, }; -static void gather_stats(struct page *, void *, int pte_dirty); static void migrate_page_add(struct page *page, struct list_head *pagelist, unsigned long flags); @@ -2539,9 +2538,8 @@ struct numa_maps { unsigned long node[MAX_NUMNODES]; }; -static void gather_stats(struct page *page, void *private, int pte_dirty) +static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty) { - struct numa_maps *md = private; int count = page_mapcount(page); md->pages++; @@ -2634,6 +2632,7 @@ static void check_huge_range(struct vm_area_struct *vma, static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { + struct numa_maps *md; struct page *page; if (pte_none(*pte)) @@ -2643,7 +2642,8 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, if (!page) return 0; - gather_stats(page, walk->private, pte_dirty(*pte)); + md = walk->private; + gather_stats(page, md, pte_dirty(*pte)); return 0; } -- cgit v1.2.3-70-g09d2 From 9840e37239183a947a15d617c67e418c6e505dd8 Mon Sep 17 00:00:00 2001 From: Stephen Wilson Date: Tue, 24 May 2011 17:12:45 -0700 Subject: mm: remove check_huge_range() This function has been superseded by gather_hugetbl_stats() and is no longer needed. Signed-off-by: Stephen Wilson Reviewed-by: KOSAKI Motohiro Cc: Hugh Dickins Cc: David Rientjes Cc: Lee Schermerhorn Cc: Alexey Dobriyan Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 35 ----------------------------------- 1 file changed, 35 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 6c5306544df..3b6e4057fab 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2600,35 +2600,6 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, } #ifdef CONFIG_HUGETLB_PAGE -static void check_huge_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, - struct numa_maps *md) -{ - unsigned long addr; - struct page *page; - struct hstate *h = hstate_vma(vma); - unsigned long sz = huge_page_size(h); - - for (addr = start; addr < end; addr += sz) { - pte_t *ptep = huge_pte_offset(vma->vm_mm, - addr & huge_page_mask(h)); - pte_t pte; - - if (!ptep) - continue; - - pte = *ptep; - if (pte_none(pte)) - continue; - - page = pte_page(pte); - if (!page) - continue; - - gather_stats(page, md, pte_dirty(*ptep)); - } -} - static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -2648,12 +2619,6 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, } #else -static inline void check_huge_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, - struct numa_maps *md) -{ -} - static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { -- cgit v1.2.3-70-g09d2 From f69ff943df6972aae96c10733b6847fa094d8a59 Mon Sep 17 00:00:00 2001 From: Stephen Wilson Date: Tue, 24 May 2011 17:12:47 -0700 Subject: mm: proc: move show_numa_map() to fs/proc/task_mmu.c Moving show_numa_map() from mempolicy.c to task_mmu.c solves several issues. - Having the show() operation "miles away" from the corresponding seq_file iteration operations is a maintenance burden. - The need to export ad hoc info like struct proc_maps_private is eliminated. - The implementation of show_numa_map() can be improved in a simple manner by cooperating with the other seq_file operations (start, stop, etc) -- something that would be messy to do without this change. Signed-off-by: Stephen Wilson Reviewed-by: KOSAKI Motohiro Cc: Hugh Dickins Cc: David Rientjes Cc: Lee Schermerhorn Cc: Alexey Dobriyan Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++- mm/mempolicy.c | 183 ---------------------------------------------------- 2 files changed, 182 insertions(+), 185 deletions(-) (limited to 'mm') diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 318d8654989..2ed53d18b2e 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -858,8 +858,188 @@ const struct file_operations proc_pagemap_operations = { #endif /* CONFIG_PROC_PAGE_MONITOR */ #ifdef CONFIG_NUMA -extern int show_numa_map(struct seq_file *m, void *v); +struct numa_maps { + struct vm_area_struct *vma; + unsigned long pages; + unsigned long anon; + unsigned long active; + unsigned long writeback; + unsigned long mapcount_max; + unsigned long dirty; + unsigned long swapcache; + unsigned long node[MAX_NUMNODES]; +}; + +static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty) +{ + int count = page_mapcount(page); + + md->pages++; + if (pte_dirty || PageDirty(page)) + md->dirty++; + + if (PageSwapCache(page)) + md->swapcache++; + + if (PageActive(page) || PageUnevictable(page)) + md->active++; + + if (PageWriteback(page)) + md->writeback++; + + if (PageAnon(page)) + md->anon++; + + if (count > md->mapcount_max) + md->mapcount_max = count; + + md->node[page_to_nid(page)]++; +} + +static int gather_pte_stats(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) +{ + struct numa_maps *md; + spinlock_t *ptl; + pte_t *orig_pte; + pte_t *pte; + + md = walk->private; + orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + do { + struct page *page; + int nid; + + if (!pte_present(*pte)) + continue; + + page = vm_normal_page(md->vma, addr, *pte); + if (!page) + continue; + + if (PageReserved(page)) + continue; + + nid = page_to_nid(page); + if (!node_isset(nid, node_states[N_HIGH_MEMORY])) + continue; + + gather_stats(page, md, pte_dirty(*pte)); + + } while (pte++, addr += PAGE_SIZE, addr != end); + pte_unmap_unlock(orig_pte, ptl); + return 0; +} +#ifdef CONFIG_HUGETLB_PAGE +static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long end, struct mm_walk *walk) +{ + struct numa_maps *md; + struct page *page; + + if (pte_none(*pte)) + return 0; + + page = pte_page(*pte); + if (!page) + return 0; + + md = walk->private; + gather_stats(page, md, pte_dirty(*pte)); + return 0; +} + +#else +static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long end, struct mm_walk *walk) +{ + return 0; +} +#endif + +/* + * Display pages allocated per node and memory policy via /proc. + */ +static int show_numa_map(struct seq_file *m, void *v) +{ + struct proc_maps_private *priv = m->private; + struct vm_area_struct *vma = v; + struct numa_maps *md; + struct file *file = vma->vm_file; + struct mm_struct *mm = vma->vm_mm; + struct mm_walk walk = {}; + struct mempolicy *pol; + int n; + char buffer[50]; + + if (!mm) + return 0; + + md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL); + if (!md) + return 0; + + md->vma = vma; + + walk.hugetlb_entry = gather_hugetbl_stats; + walk.pmd_entry = gather_pte_stats; + walk.private = md; + walk.mm = mm; + + pol = get_vma_policy(priv->task, vma, vma->vm_start); + mpol_to_str(buffer, sizeof(buffer), pol, 0); + mpol_cond_put(pol); + + seq_printf(m, "%08lx %s", vma->vm_start, buffer); + + if (file) { + seq_printf(m, " file="); + seq_path(m, &file->f_path, "\n\t= "); + } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { + seq_printf(m, " heap"); + } else if (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack) { + seq_printf(m, " stack"); + } + + walk_page_range(vma->vm_start, vma->vm_end, &walk); + + if (!md->pages) + goto out; + + if (md->anon) + seq_printf(m, " anon=%lu", md->anon); + + if (md->dirty) + seq_printf(m, " dirty=%lu", md->dirty); + + if (md->pages != md->anon && md->pages != md->dirty) + seq_printf(m, " mapped=%lu", md->pages); + + if (md->mapcount_max > 1) + seq_printf(m, " mapmax=%lu", md->mapcount_max); + + if (md->swapcache) + seq_printf(m, " swapcache=%lu", md->swapcache); + + if (md->active < md->pages && !is_vm_hugetlb_page(vma)) + seq_printf(m, " active=%lu", md->active); + + if (md->writeback) + seq_printf(m, " writeback=%lu", md->writeback); + + for_each_node_state(n, N_HIGH_MEMORY) + if (md->node[n]) + seq_printf(m, " N%d=%lu", n, md->node[n]); +out: + seq_putc(m, '\n'); + kfree(md); + + if (m->count < m->size) + m->version = (vma != priv->tail_vma) ? vma->vm_start : 0; + return 0; +} static const struct seq_operations proc_pid_numa_maps_op = { .start = m_start, .next = m_next, @@ -878,4 +1058,4 @@ const struct file_operations proc_numa_maps_operations = { .llseek = seq_lseek, .release = seq_release_private, }; -#endif +#endif /* CONFIG_NUMA */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3b6e4057fab..e7fb9d25c54 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2525,186 +2525,3 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context) } return p - buffer; } - -struct numa_maps { - struct vm_area_struct *vma; - unsigned long pages; - unsigned long anon; - unsigned long active; - unsigned long writeback; - unsigned long mapcount_max; - unsigned long dirty; - unsigned long swapcache; - unsigned long node[MAX_NUMNODES]; -}; - -static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty) -{ - int count = page_mapcount(page); - - md->pages++; - if (pte_dirty || PageDirty(page)) - md->dirty++; - - if (PageSwapCache(page)) - md->swapcache++; - - if (PageActive(page) || PageUnevictable(page)) - md->active++; - - if (PageWriteback(page)) - md->writeback++; - - if (PageAnon(page)) - md->anon++; - - if (count > md->mapcount_max) - md->mapcount_max = count; - - md->node[page_to_nid(page)]++; -} - -static int gather_pte_stats(pmd_t *pmd, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - struct numa_maps *md; - spinlock_t *ptl; - pte_t *orig_pte; - pte_t *pte; - - md = walk->private; - orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); - do { - struct page *page; - int nid; - - if (!pte_present(*pte)) - continue; - - page = vm_normal_page(md->vma, addr, *pte); - if (!page) - continue; - - if (PageReserved(page)) - continue; - - nid = page_to_nid(page); - if (!node_isset(nid, node_states[N_HIGH_MEMORY])) - continue; - - gather_stats(page, md, pte_dirty(*pte)); - - } while (pte++, addr += PAGE_SIZE, addr != end); - pte_unmap_unlock(orig_pte, ptl); - return 0; -} - -#ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, struct mm_walk *walk) -{ - struct numa_maps *md; - struct page *page; - - if (pte_none(*pte)) - return 0; - - page = pte_page(*pte); - if (!page) - return 0; - - md = walk->private; - gather_stats(page, md, pte_dirty(*pte)); - return 0; -} - -#else -static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, - unsigned long addr, unsigned long end, struct mm_walk *walk) -{ - return 0; -} -#endif - -/* - * Display pages allocated per node and memory policy via /proc. - */ -int show_numa_map(struct seq_file *m, void *v) -{ - struct proc_maps_private *priv = m->private; - struct vm_area_struct *vma = v; - struct numa_maps *md; - struct file *file = vma->vm_file; - struct mm_struct *mm = vma->vm_mm; - struct mm_walk walk = {}; - struct mempolicy *pol; - int n; - char buffer[50]; - - if (!mm) - return 0; - - md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL); - if (!md) - return 0; - - md->vma = vma; - - walk.hugetlb_entry = gather_hugetbl_stats; - walk.pmd_entry = gather_pte_stats; - walk.private = md; - walk.mm = mm; - - pol = get_vma_policy(priv->task, vma, vma->vm_start); - mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - - seq_printf(m, "%08lx %s", vma->vm_start, buffer); - - if (file) { - seq_printf(m, " file="); - seq_path(m, &file->f_path, "\n\t= "); - } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { - seq_printf(m, " heap"); - } else if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { - seq_printf(m, " stack"); - } - - walk_page_range(vma->vm_start, vma->vm_end, &walk); - - if (!md->pages) - goto out; - - if (md->anon) - seq_printf(m," anon=%lu",md->anon); - - if (md->dirty) - seq_printf(m," dirty=%lu",md->dirty); - - if (md->pages != md->anon && md->pages != md->dirty) - seq_printf(m, " mapped=%lu", md->pages); - - if (md->mapcount_max > 1) - seq_printf(m, " mapmax=%lu", md->mapcount_max); - - if (md->swapcache) - seq_printf(m," swapcache=%lu", md->swapcache); - - if (md->active < md->pages && !is_vm_hugetlb_page(vma)) - seq_printf(m," active=%lu", md->active); - - if (md->writeback) - seq_printf(m," writeback=%lu", md->writeback); - - for_each_node_state(n, N_HIGH_MEMORY) - if (md->node[n]) - seq_printf(m, " N%d=%lu", n, md->node[n]); -out: - seq_putc(m, '\n'); - kfree(md); - - if (m->count < m->size) - m->version = (vma != priv->tail_vma) ? vma->vm_start : 0; - return 0; -} -- cgit v1.2.3-70-g09d2 From a2c8990aed5ab000491732b07c8c4465d1b389b8 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 24 May 2011 17:12:50 -0700 Subject: memsw: remove noswapaccount kernel parameter The noswapaccount parameter has been deprecated since 2.6.38 without any complaints from users so we can remove it. swapaccount=0|1 can be used instead. As we are removing the parameter we can also clean up swapaccount because it doesn't have to accept an empty string anymore (to match noswapaccount) and so we can push = into __setup macro rather than checking "=1" resp. "=0" strings Signed-off-by: Michal Hocko Cc: Hiroyuki Kamezawa Cc: Daisuke Nishimura Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 3 --- mm/memcontrol.c | 13 +++---------- 2 files changed, 3 insertions(+), 13 deletions(-) (limited to 'mm') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 7c6624e7a5c..5438a2d7907 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1777,9 +1777,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted. nosoftlockup [KNL] Disable the soft-lockup detector. - noswapaccount [KNL] Disable accounting of swap in memory resource - controller. (See Documentation/cgroups/memory.txt) - nosync [HW,M68K] Disables sync negotiation for all devices. notsc [BUGS=X86-32] Disable Time Stamp Counter diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 010f9166fa6..d5fd3dcd3f2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5169,19 +5169,12 @@ struct cgroup_subsys mem_cgroup_subsys = { static int __init enable_swap_account(char *s) { /* consider enabled if no parameter or 1 is given */ - if (!(*s) || !strcmp(s, "=1")) + if (!strcmp(s, "1")) really_do_swap_account = 1; - else if (!strcmp(s, "=0")) + else if (!strcmp(s, "0")) really_do_swap_account = 0; return 1; } -__setup("swapaccount", enable_swap_account); +__setup("swapaccount=", enable_swap_account); -static int __init disable_swap_account(char *s) -{ - printk_once("noswapaccount is deprecated and will be removed in 2.6.40. Use swapaccount=0 instead\n"); - enable_swap_account("=0"); - return 1; -} -__setup("noswapaccount", disable_swap_account); #endif -- cgit v1.2.3-70-g09d2 From cfa54a0fcfc1017c6f122b6f21aaba36daa07f71 Mon Sep 17 00:00:00 2001 From: Andrew Barry Date: Tue, 24 May 2011 17:12:52 -0700 Subject: mm/page_alloc.c: prevent unending loop in __alloc_pages_slowpath() I believe I found a problem in __alloc_pages_slowpath, which allows a process to get stuck endlessly looping, even when lots of memory is available. Running an I/O and memory intensive stress-test I see a 0-order page allocation with __GFP_IO and __GFP_WAIT, running on a system with very little free memory. Right about the same time that the stress-test gets killed by the OOM-killer, the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even though most of the systems memory was freed by the oom-kill of the stress-test. The utility ends up looping from the rebalance label down through the wait_iff_congested continiously. Because order=0, __alloc_pages_direct_compact skips the call to get_page_from_freelist. Because all of the reclaimable memory on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then jumps back to rebalance without ever trying to get_page_from_freelist. This loop repeats infinitely. The test case is pretty pathological. Running a mix of I/O stress-tests that do a lot of fork() and consume all of the system memory, I can pretty reliably hit this on 600 nodes, in about 12 hours. 32GB/node. Signed-off-by: Andrew Barry Signed-off-by: Minchan Kim Reviewed-by: Rik van Riel Acked-by: Mel Gorman Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 10a8c6da385..2a00f17c3bf 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2106,6 +2106,7 @@ restart: first_zones_zonelist(zonelist, high_zoneidx, NULL, &preferred_zone); +rebalance: /* This is the last chance, in general, before the goto nopage. */ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist, high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS, @@ -2113,7 +2114,6 @@ restart: if (page) goto got_pg; -rebalance: /* Allocate without watermarks if the context allows */ if (alloc_flags & ALLOC_NO_WATERMARKS) { page = __alloc_pages_high_priority(gfp_mask, order, -- cgit v1.2.3-70-g09d2 From eb709b0d062efd653a61183af8e27b2711c3cf5c Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Tue, 24 May 2011 17:12:55 -0700 Subject: mm: batch activate_page() to reduce lock contention The zone->lru_lock is heavily contented in workload where activate_page() is frequently used. We could do batch activate_page() to reduce the lock contention. The batched pages will be added into zone list when the pool is full or page reclaim is trying to drain them. For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes, processes shared map to the file. Each process read access the whole file and then exit. The process exit will do unmap_vmas() and cause a lot of activate_page() call. In such workload, we saw about 58% total time reduction with below patch. Other workloads with a lot of activate_page also benefits a lot too. Andrew Morton suggested activate_page() and putback_lru_pages() should follow the same path to active pages, but this is hard to implement (see commit 7a608572a282a ("Revert "mm: batch activate_page() to reduce lock contention")). On the other hand, do we really need putback_lru_pages() to follow the same path? I tested several FIO/FFSB benchmark (about 20 scripts for each benchmark) in 3 machines here from 2 sockets to 4 sockets. My test doesn't show anything significant with/without below patch (there is slight difference but mostly some noise which we found even without below patch before). Below patch basically returns to the same as my first post. I tested some microbenchmarks: case-anon-cow-rand-mt 0.58% case-anon-cow-rand -3.30% case-anon-cow-seq-mt -0.51% case-anon-cow-seq -5.68% case-anon-r-rand-mt 0.23% case-anon-r-rand 0.81% case-anon-r-seq-mt -0.71% case-anon-r-seq -1.99% case-anon-rx-rand-mt 2.11% case-anon-rx-seq-mt 3.46% case-anon-w-rand-mt -0.03% case-anon-w-rand -0.50% case-anon-w-seq-mt -1.08% case-anon-w-seq -0.12% case-anon-wx-rand-mt -5.02% case-anon-wx-seq-mt -1.43% case-fork 1.65% case-fork-sleep -0.07% case-fork-withmem 1.39% case-hugetlb -0.59% case-lru-file-mmap-read-mt -0.54% case-lru-file-mmap-read 0.61% case-lru-file-mmap-read-rand -2.24% case-lru-file-readonce -0.64% case-lru-file-readtwice -11.69% case-lru-memcg -1.35% case-mmap-pread-rand-mt 1.88% case-mmap-pread-rand -15.26% case-mmap-pread-seq-mt 0.89% case-mmap-pread-seq -69.72% case-mmap-xread-rand-mt 0.71% case-mmap-xread-seq-mt 0.38% The most significent are: case-lru-file-readtwice -11.69% case-mmap-pread-rand -15.26% case-mmap-pread-seq -69.72% which use activate_page a lot. others are basically variations because each run has slightly difference. In UP case, 'size mm/swap.o' before the two patches: text data bss dec hex filename 6466 896 4 7366 1cc6 mm/swap.o after the two patches: text data bss dec hex filename 6343 896 4 7243 1c4b mm/swap.o Signed-off-by: Shaohua Li Cc: KOSAKI Motohiro Cc: Hiroyuki Kamezawa Cc: Andi Kleen Cc: Minchan Kim Cc: Rik van Riel Cc: Mel Gorman Cc: Johannes Weiner Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swap.c | 45 ++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 40 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/swap.c b/mm/swap.c index 2f365d1a4bb..3a442f18b0b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -272,14 +272,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page, memcg_reclaim_stat->recent_rotated[file]++; } -/* - * FIXME: speed this up? - */ -void activate_page(struct page *page) +static void __activate_page(struct page *page, void *arg) { struct zone *zone = page_zone(page); - spin_lock_irq(&zone->lru_lock); if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { int file = page_is_file_cache(page); int lru = page_lru_base_type(page); @@ -292,8 +288,45 @@ void activate_page(struct page *page) update_page_reclaim_stat(zone, page, file, 1); } +} + +#ifdef CONFIG_SMP +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); + +static void activate_page_drain(int cpu) +{ + struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu); + + if (pagevec_count(pvec)) + pagevec_lru_move_fn(pvec, __activate_page, NULL); +} + +void activate_page(struct page *page) +{ + if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { + struct pagevec *pvec = &get_cpu_var(activate_page_pvecs); + + page_cache_get(page); + if (!pagevec_add(pvec, page)) + pagevec_lru_move_fn(pvec, __activate_page, NULL); + put_cpu_var(activate_page_pvecs); + } +} + +#else +static inline void activate_page_drain(int cpu) +{ +} + +void activate_page(struct page *page) +{ + struct zone *zone = page_zone(page); + + spin_lock_irq(&zone->lru_lock); + __activate_page(page, NULL); spin_unlock_irq(&zone->lru_lock); } +#endif /* * Mark a page as having seen activity. @@ -464,6 +497,8 @@ static void drain_cpu_pagevecs(int cpu) pvec = &per_cpu(lru_deactivate_pvecs, cpu); if (pagevec_count(pvec)) pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); + + activate_page_drain(cpu); } /** -- cgit v1.2.3-70-g09d2 From f67d9b1576c1c6e02100f8b27f4e9d66bbeb4d49 Mon Sep 17 00:00:00 2001 From: Bob Liu Date: Tue, 24 May 2011 17:12:56 -0700 Subject: nommu: add page alignment to mmap Currently on nommu arch mmap(),mremap() and munmap() doesn't do page_align() which isn't consist with mmu arch and cause some issues. First, some drivers' mmap() function depends on vma->vm_end - vma->start is page aligned which is true on mmu arch but not on nommu. eg: uvc camera driver. Second munmap() may return -EINVAL[split file] error in cases when end is not page aligned(passed into from userspace) but vma->vm_end is aligned dure to split or driver's mmap() ops. Add page alignment to fix those issues. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Bob Liu Cc: David Howells Cc: Paul Mundt Cc: Greg Ungerer Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/mm/nommu.c b/mm/nommu.c index 92e1a47d1e5..1fd0c51b10a 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1124,7 +1124,7 @@ static int do_mmap_private(struct vm_area_struct *vma, unsigned long capabilities) { struct page *pages; - unsigned long total, point, n, rlen; + unsigned long total, point, n; void *base; int ret, order; @@ -1148,13 +1148,12 @@ static int do_mmap_private(struct vm_area_struct *vma, * make a private copy of the data and map that instead */ } - rlen = PAGE_ALIGN(len); /* allocate some memory to hold the mapping * - note that this may not return a page-aligned address if the object * we're allocating is smaller than a page */ - order = get_order(rlen); + order = get_order(len); kdebug("alloc order %d for %lx", order, len); pages = alloc_pages(GFP_KERNEL, order); @@ -1164,7 +1163,7 @@ static int do_mmap_private(struct vm_area_struct *vma, total = 1 << order; atomic_long_add(total, &mmap_pages_allocated); - point = rlen >> PAGE_SHIFT; + point = len >> PAGE_SHIFT; /* we allocated a power-of-2 sized page set, so we may want to trim off * the excess */ @@ -1186,7 +1185,7 @@ static int do_mmap_private(struct vm_area_struct *vma, base = page_address(pages); region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY; region->vm_start = (unsigned long) base; - region->vm_end = region->vm_start + rlen; + region->vm_end = region->vm_start + len; region->vm_top = region->vm_start + (total << PAGE_SHIFT); vma->vm_start = region->vm_start; @@ -1202,15 +1201,15 @@ static int do_mmap_private(struct vm_area_struct *vma, old_fs = get_fs(); set_fs(KERNEL_DS); - ret = vma->vm_file->f_op->read(vma->vm_file, base, rlen, &fpos); + ret = vma->vm_file->f_op->read(vma->vm_file, base, len, &fpos); set_fs(old_fs); if (ret < 0) goto error_free; /* clear the last little bit */ - if (ret < rlen) - memset(base + ret, 0, rlen - ret); + if (ret < len) + memset(base + ret, 0, len - ret); } @@ -1259,6 +1258,7 @@ unsigned long do_mmap_pgoff(struct file *file, /* we ignore the address hint */ addr = 0; + len = PAGE_ALIGN(len); /* we've determined that we can make the mapping, now translate what we * now know into VMA flags */ @@ -1635,14 +1635,17 @@ static int shrink_vma(struct mm_struct *mm, int do_munmap(struct mm_struct *mm, unsigned long start, size_t len) { struct vm_area_struct *vma; - unsigned long end = start + len; + unsigned long end; int ret; kenter(",%lx,%zx", start, len); + len = PAGE_ALIGN(len); if (len == 0) return -EINVAL; + end = start + len; + /* find the first potentially overlapping VMA */ vma = find_vma(mm, start); if (!vma) { @@ -1762,6 +1765,8 @@ unsigned long do_mremap(unsigned long addr, struct vm_area_struct *vma; /* insanity checks first */ + old_len = PAGE_ALIGN(old_len); + new_len = PAGE_ALIGN(new_len); if (old_len == 0 || new_len == 0) return (unsigned long) -EINVAL; -- cgit v1.2.3-70-g09d2 From 49a78d085fa6b44d6ed791923c7172a6433589c2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Wed, 25 May 2011 18:06:54 -0700 Subject: slub: remove no-longer used 'unlock_out' label Commit a71ae47a2cbf ("slub: Fix double bit unlock in debug mode") removed the only goto to this label, resulting in mm/slub.c: In function '__slab_alloc': mm/slub.c:1834: warning: label 'unlock_out' defined but not used fixed trivially by the removal of the label itself too. Reported-by: Stephen Rothwell Cc: Christoph Lameter Signed-off-by: Linus Torvalds --- mm/slub.c | 1 - 1 file changed, 1 deletion(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index 4aad32d2e60..7be0223531b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1831,7 +1831,6 @@ load_freelist: page->inuse = page->objects; page->freelist = NULL; -unlock_out: slab_unlock(page); c->tid = next_tid(c->tid); local_irq_restore(flags); -- cgit v1.2.3-70-g09d2