From 75485363ce8552698bfb9970d901f755d5713cca Mon Sep 17 00:00:00 2001
From: Mel Gorman
Date: Wed, 3 Jul 2013 15:01:42 -0700
Subject: mm: vmscan: limit the number of pages kswapd reclaims at each priority

This series does not fix all the current known problems with reclaim but it addresses one important swapping bug when there is background IO.

Changelog since V3
- Drop the slab shrink changes in light of Glauber's series; discussions highlighted a number of potential problems with the patch. (mel)
- Rebased to 3.10-rc1

Changelog since V2
- Preserve ratio properly for proportional scanning (kamezawa)

Changelog since V1
- Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
- Reformat comment in shrink_page_list (andi)
- Clarify some comments (dhillf)
- Rework how the proportional scanning is preserved
- Add PageReclaim check before kswapd starts writeback
- Reset sc.nr_reclaimed on every full zone scan

Kswapd and page reclaim behaviour has been screwy in one way or the other for a long time. Very broadly speaking it worked in the far past because machines were limited in memory, so it did not have that many pages to scan and it stalled in congestion_wait() frequently to prevent it going completely nuts. In recent times it has behaved very unsatisfactorily, with some of the problems compounded by the removal of stall logic and the introduction of transparent hugepage support with high-order reclaims.

There are many variations of bugs that are rooted in this area. One example is reports of a large copy operation or backup causing the machine to grind to a halt or applications being pushed to swap. Sometimes in low memory situations a large percentage of memory suddenly gets reclaimed. In other cases an application starts and kswapd hits 100% CPU usage for prolonged periods of time, and so on. There is now talk of introducing features like an extra free kbytes tunable to work around aspects of the problem instead of trying to deal with it. The problem is compounded by the fact that it can be very workload and machine specific.

This series aims at addressing some of the worst of these problems without attempting to fundamentally alter how page reclaim works.

Patches 1-2 limit the number of pages kswapd reclaims while still obeying the anon/file proportion of the LRUs it should be scanning.

Patches 3-4 control how and when kswapd raises its scanning priority and delete the scanning restart logic, which is tricky to follow.

Patch 5 notes that it is too easy for kswapd to reach priority 0 when scanning and then reclaim the world. Down with that sort of thing.

Patch 6 notes that kswapd starts writeback based on scanning priority, which is not necessarily related to dirty pages. It instead has kswapd write back pages if a number of unqueued dirty pages have recently been encountered at the tail of the LRU.

Patch 7 notes that sometimes kswapd should stall waiting on IO to complete to reduce LRU churn and the likelihood that it'll reclaim young clean pages or push applications to swap. It causes kswapd to block on IO if it detects that pages being reclaimed under writeback are recycling through the LRU before the IO completes.

Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they are applied.

This was tested using memcached+memcachetest while some background IO was in progress, as implemented by the parallel IO tests in MM Tests. memcachetest benchmarks how many operations/second memcached can service and it is run multiple times.
It starts with no background IO and then re-runs the test with larger amounts of IO in the background to roughly simulate a large copy in progress. The expectation is that the IO should have little or no impact on memcachetest, which is running entirely in memory.

                                 3.10.0-rc1            3.10.0-rc1
                                    vanilla        lessdisrupt-v4
Ops memcachetest-0M          22155.00 (  0.00%)   22180.00 (  0.11%)
Ops memcachetest-715M        22720.00 (  0.00%)   22355.00 ( -1.61%)
Ops memcachetest-2385M        3939.00 (  0.00%)   23450.00 (495.33%)
Ops memcachetest-4055M        3628.00 (  0.00%)   24341.00 (570.92%)
Ops io-duration-0M               0.00 (  0.00%)       0.00 (  0.00%)
Ops io-duration-715M            12.00 (  0.00%)       7.00 ( 41.67%)
Ops io-duration-2385M          118.00 (  0.00%)      21.00 ( 82.20%)
Ops io-duration-4055M          162.00 (  0.00%)      36.00 ( 77.78%)
Ops swaptotal-0M                 0.00 (  0.00%)       0.00 (  0.00%)
Ops swaptotal-715M          140134.00 (  0.00%)      18.00 ( 99.99%)
Ops swaptotal-2385M         392438.00 (  0.00%)       0.00 (  0.00%)
Ops swaptotal-4055M         449037.00 (  0.00%)   27864.00 ( 93.79%)
Ops swapin-0M                    0.00 (  0.00%)       0.00 (  0.00%)
Ops swapin-715M                  0.00 (  0.00%)       0.00 (  0.00%)
Ops swapin-2385M            148031.00 (  0.00%)       0.00 (  0.00%)
Ops swapin-4055M            135109.00 (  0.00%)       0.00 (  0.00%)
Ops minorfaults-0M         1529984.00 (  0.00%) 1530235.00 ( -0.02%)
Ops minorfaults-715M       1794168.00 (  0.00%) 1613750.00 ( 10.06%)
Ops minorfaults-2385M      1739813.00 (  0.00%) 1609396.00 (  7.50%)
Ops minorfaults-4055M      1754460.00 (  0.00%) 1614810.00 (  7.96%)
Ops majorfaults-0M               0.00 (  0.00%)       0.00 (  0.00%)
Ops majorfaults-715M           185.00 (  0.00%)     180.00 (  2.70%)
Ops majorfaults-2385M        24472.00 (  0.00%)     101.00 ( 99.59%)
Ops majorfaults-4055M        22302.00 (  0.00%)     229.00 ( 98.97%)

Note how the vanilla kernel's performance collapses when there is enough IO taking place in the background. This drop in performance is part of what users complain of when they start backups. Note how the swapin and major fault figures indicate that processes were being pushed to swap prematurely. With the series applied, there is no noticeable performance drop and, while there is still some swap activity, it's tiny.

20 iterations of this test were run in total and averaged. Every 5 iterations, additional IO was generated in the background using dd to measure how the workload was impacted. The 0M, 715M, 2385M and 4055M sub-blocks refer to the amount of IO going on in the background at each iteration. So memcachetest-2385M is reporting how many transactions/second memcachetest recorded on average over 5 iterations while there was 2385M of IO going on in the background.

There are six blocks of information reported here.

memcachetest is the transactions/second reported by memcachetest. In the vanilla kernel, note that performance drops from around 22K/sec to just under 4K/second when there is 2385M of IO going on in the background. This is one type of performance collapse users complain about if a large cp or backup starts in the background.

io-duration refers to how long it takes for the background IO to complete. It shows that with the patched kernel the IO completes faster while not interfering with the memcache workload.

swaptotal is the total amount of swap traffic. With the patched kernel, the total amount of swapping is much reduced, although it is still not zero.

swapin in this case is an indication as to whether we are swap thrashing. The closer the swapin/swapout ratio is to 1, the worse the thrashing is. Note that with the patched kernel there is no swapin activity, indicating that all the pages swapped out were really inactive, unused pages.

minorfaults are just minor faults.
An increased number of minor faults can indicate that page reclaim is unmapping the pages but not swapping them out before they are faulted back in. With the patched kernel, there is only a small change in minor faults majorfaults are just major faults in the target workload and a high number can indicate that a workload is being prematurely swapped. With the patched kernel, major faults are much reduced. As there are no swapin's recorded so it's not being swapped. The likely explanation is that that libraries or configuration files used by the workload during startup get paged out by the background IO. Overall with the series applied, there is no noticable performance drop due to background IO and while there is still some swap activity, it's tiny and the lack of swapins imply that the swapped pages were inactive and unused. 3.10.0-rc1 3.10.0-rc1 vanilla lessdisrupt-v4 Page Ins 1234608 101892 Page Outs 12446272 11810468 Swap Ins 283406 0 Swap Outs 698469 27882 Direct pages scanned 0 136480 Kswapd pages scanned 6266537 5369364 Kswapd pages reclaimed 1088989 930832 Direct pages reclaimed 0 120901 Kswapd efficiency 17% 17% Kswapd velocity 5398.371 4635.115 Direct efficiency 100% 88% Direct velocity 0.000 117.817 Percentage direct scans 0% 2% Page writes by reclaim 1655843 4009929 Page writes file 957374 3982047 Page writes anon 698469 27882 Page reclaim immediate 5245 1745 Page rescued immediate 0 0 Slabs scanned 33664 25216 Direct inode steals 0 0 Kswapd inode steals 19409 778 Kswapd skipped wait 0 0 THP fault alloc 35 30 THP collapse alloc 472 401 THP splits 27 22 THP fault fallback 0 0 THP collapse fail 0 1 Compaction stalls 0 4 Compaction success 0 0 Compaction failures 0 4 Page migrate success 0 0 Page migrate failure 0 0 Compaction pages isolated 0 0 Compaction migrate scanned 0 0 Compaction free scanned 0 0 Compaction cost 0 0 NUMA PTE updates 0 0 NUMA hint faults 0 0 NUMA hint local faults 0 0 NUMA pages migrated 0 0 AutoNUMA cost 0 0 Unfortunately, note that there is a small amount of direct reclaim due to kswapd no longer reclaiming the world. ftrace indicates that the direct reclaim stalls are mostly harmless with the vast bulk of the stalls incurred by dd 23 tclsh-3367 38 memcachetest-13733 49 memcachetest-12443 57 tee-3368 1541 dd-13826 1981 dd-12539 A consequence of the direct reclaim for dd is that the processes for the IO workload may show a higher system CPU usage. There is also a risk that kswapd not reclaiming the world may mean that it stays awake balancing zones, does not stall on the appropriate events and continually scans pages it cannot reclaim consuming CPU. This will be visible as continued high CPU usage but in my own tests I only saw a single spike lasting less than a second and I did not observe any problems related to reclaim while running the series on my desktop. This patch: The number of pages kswapd can reclaim is bound by the number of pages it scans which is related to the size of the zone and the scanning priority. In many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large number of pages it cannot reclaim, it will raise the priority and potentially discard a large percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible effect is a reclaim "spike" where a large percentage of memory is suddenly freed. 
It would be bad enough if this was just unused memory but because of how anon/file pages are balanced it is possible that applications get pushed to swap unnecessarily. This patch limits the number of pages kswapd will reclaim to the high watermark. Reclaim will still overshoot due to it not being a hard limit as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it prevents kswapd reclaiming the world at higher priorities. The number of pages it reclaims is not adjusted for high-order allocations as kswapd will reclaim excessively if it is to balance zones for high-order allocations. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Reviewed-by: Michal Hocko Acked-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 49 +++++++++++++++++++++++++++++-------------------- 1 file changed, 29 insertions(+), 20 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index fa6a85378ee..cdbc0699ea2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2600,6 +2600,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, return pgdat_balanced(pgdat, order, classzone_idx); } +/* + * kswapd shrinks the zone by the number of pages required to reach + * the high watermark. + */ +static void kswapd_shrink_zone(struct zone *zone, + struct scan_control *sc, + unsigned long lru_pages) +{ + unsigned long nr_slab; + struct reclaim_state *reclaim_state = current->reclaim_state; + struct shrink_control shrink = { + .gfp_mask = sc->gfp_mask, + }; + + /* Reclaim above the high watermark. */ + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); + shrink_zone(zone, sc); + + reclaim_state->reclaimed_slab = 0; + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages); + sc->nr_reclaimed += reclaim_state->reclaimed_slab; + + if (nr_slab == 0 && !zone_reclaimable(zone)) + zone->all_unreclaimable = 1; +} + /* * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at high_wmark_pages(zone). @@ -2627,24 +2653,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, bool pgdat_is_balanced = false; int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ - struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .may_unmap = 1, .may_swap = 1, - /* - * kswapd doesn't want to be bailed out while reclaim. because - * we want to put equal scanning pressure on each zone. 
- */ - .nr_to_reclaim = ULONG_MAX, .order = order, .target_mem_cgroup = NULL, }; - struct shrink_control shrink = { - .gfp_mask = sc.gfp_mask, - }; loop_again: sc.priority = DEF_PRIORITY; sc.nr_reclaimed = 0; @@ -2716,7 +2733,7 @@ loop_again: */ for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; - int nr_slab, testorder; + int testorder; unsigned long balance_gap; if (!populated_zone(zone)) @@ -2764,16 +2781,8 @@ loop_again: if ((buffer_heads_over_limit && is_highmem_idx(i)) || !zone_balanced(zone, testorder, - balance_gap, end_zone)) { - shrink_zone(zone, &sc); - - reclaim_state->reclaimed_slab = 0; - nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages); - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - - if (nr_slab == 0 && !zone_reclaimable(zone)) - zone->all_unreclaimable = 1; - } + balance_gap, end_zone)) + kswapd_shrink_zone(zone, &sc, lru_pages); /* * If we're getting trouble reclaiming, start doing -- cgit v1.2.3-70-g09d2 From e82e0561dae9f3ae5a21fc2d3d3ccbe69d90be46 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:44 -0700 Subject: mm: vmscan: obey proportional scanning requirements for kswapd Simplistically, the anon and file LRU lists are scanned proportionally depending on the value of vm.swappiness although there are other factors taken into account by get_scan_count(). The patch "mm: vmscan: Limit the number of pages kswapd reclaims" limits the number of pages kswapd reclaims but it breaks this proportional scanning and may evenly shrink anon/file LRUs regardless of vm.swappiness. This patch preserves the proportional scanning and reclaim. It does mean that kswapd will reclaim more than requested but the number of pages will be related to the high watermark. [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify] [kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target] [hannes@cmpxchg.org: Account for already scanned pages properly] Signed-off-by: Mel Gorman Acked-by: Rik van Riel Reviewed-by: Michal Hocko Cc: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 59 insertions(+), 8 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index cdbc0699ea2..26ad67f1962 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1822,17 +1822,25 @@ out: static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; + unsigned long targets[NR_LRU_LISTS]; unsigned long nr_to_scan; enum lru_list lru; unsigned long nr_reclaimed = 0; unsigned long nr_to_reclaim = sc->nr_to_reclaim; struct blk_plug plug; + bool scan_adjusted = false; get_scan_count(lruvec, sc, nr); + /* Record the original scan target for proportional adjustments later */ + memcpy(targets, nr, sizeof(nr)); + blk_start_plug(&plug); while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { + unsigned long nr_anon, nr_file, percentage; + unsigned long nr_scanned; + for_each_evictable_lru(lru) { if (nr[lru]) { nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); @@ -1842,17 +1850,60 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) lruvec, sc); } } + + if (nr_reclaimed < nr_to_reclaim || scan_adjusted) + continue; + /* - * On large memory systems, scan >> priority can become - * really large. 
This is fine for the starting priority; - * we want to put equal scanning pressure on each zone. - * However, if the VM has a harder time of freeing pages, - * with multiple processes reclaiming pages, the total - * freeing target can get unreasonably large. + * For global direct reclaim, reclaim only the number of pages + * requested. Less care is taken to scan proportionally as it + * is more important to minimise direct reclaim stall latency + * than it is to properly age the LRU lists. */ - if (nr_reclaimed >= nr_to_reclaim && - sc->priority < DEF_PRIORITY) + if (global_reclaim(sc) && !current_is_kswapd()) break; + + /* + * For kswapd and memcg, reclaim at least the number of pages + * requested. Ensure that the anon and file LRUs shrink + * proportionally what was requested by get_scan_count(). We + * stop reclaiming one LRU and reduce the amount scanning + * proportional to the original scan target. + */ + nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; + nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; + + if (nr_file > nr_anon) { + unsigned long scan_target = targets[LRU_INACTIVE_ANON] + + targets[LRU_ACTIVE_ANON] + 1; + lru = LRU_BASE; + percentage = nr_anon * 100 / scan_target; + } else { + unsigned long scan_target = targets[LRU_INACTIVE_FILE] + + targets[LRU_ACTIVE_FILE] + 1; + lru = LRU_FILE; + percentage = nr_file * 100 / scan_target; + } + + /* Stop scanning the smaller of the LRU */ + nr[lru] = 0; + nr[lru + LRU_ACTIVE] = 0; + + /* + * Recalculate the other LRU scan count based on its original + * scan target and the percentage scanning already complete + */ + lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; + nr_scanned = targets[lru] - nr[lru]; + nr[lru] = targets[lru] * (100 - percentage) / 100; + nr[lru] -= min(nr[lru], nr_scanned); + + lru += LRU_ACTIVE; + nr_scanned = targets[lru] - nr[lru]; + nr[lru] = targets[lru] * (100 - percentage) / 100; + nr[lru] -= min(nr[lru], nr_scanned); + + scan_adjusted = true; } blk_finish_plug(&plug); sc->nr_reclaimed += nr_reclaimed; -- cgit v1.2.3-70-g09d2 From b8e83b942a16eb73e63406592d3178207a4f07a1 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:45 -0700 Subject: mm: vmscan: flatten kswapd priority loop kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered balanced. It then rechecks if it needs to restart at DEF_PRIORITY and whether high-order reclaim needs to be reset. This is not wrong per-se but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY may require several restarts before it has scanned enough pages to meet the high watermark even at 100% efficiency. This patch irons out the logic a bit by controlling when priority is raised and removing the "goto loop_again". This patch has kswapd raise the scanning priority until it is scanning enough pages that it could meet the high watermark in one shrink of the LRU lists if it is able to reclaim at 100% efficiency. It will not raise the scanning prioirty higher unless it is failing to reclaim any pages. To avoid infinite looping for high-order allocation requests kswapd will not reclaim for high-order allocations when it has reclaimed at least twice the number of pages as the allocation request. 
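As a reader aid for the proportional scanning change above ("mm: vmscan: obey proportional scanning requirements for kswapd"), the following is a minimal userspace sketch of the arithmetic added to shrink_lruvec(). It is not kernel code: the numbers are invented, the MIN macro stands in for the kernel's min(), and only a single list per type is shown rather than the separate active/inactive lists the real code adjusts.

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	/* Hypothetical scan targets from get_scan_count(), one list per type */
	unsigned long target_anon = 200, target_file = 1000;
	/* Pages still left to scan at the point nr_to_reclaim was first met */
	unsigned long nr_anon = 120, nr_file = 600;
	unsigned long percentage, nr_scanned;

	/* anon is the smaller list: what share of its target is still unscanned? */
	percentage = nr_anon * 100 / (target_anon + 1);

	/* Stop scanning the smaller LRU entirely */
	nr_anon = 0;

	/* Cut the larger LRU back to the same proportion of its original target */
	nr_scanned = target_file - nr_file;
	nr_file = target_file * (100 - percentage) / 100;
	nr_file -= MIN(nr_file, nr_scanned);

	printf("file pages still to scan: %lu (percentage=%lu)\n",
	       nr_file, percentage);
	return 0;
}

With these made-up numbers, 59% of the anon target is still unscanned, so the file LRU is scaled back to 41% of its 1000-page target and, having already scanned 400 pages, is left with roughly 10 pages to scan, preserving the anon/file ratio chosen by get_scan_count().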
Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 86 +++++++++++++++++++++++++++++-------------------------------- 1 file changed, 41 insertions(+), 45 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 26ad67f1962..1c10ee51221 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2654,8 +2654,12 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, /* * kswapd shrinks the zone by the number of pages required to reach * the high watermark. + * + * Returns true if kswapd scanned at least the requested number of pages to + * reclaim. This is used to determine if the scanning priority needs to be + * raised. */ -static void kswapd_shrink_zone(struct zone *zone, +static bool kswapd_shrink_zone(struct zone *zone, struct scan_control *sc, unsigned long lru_pages) { @@ -2675,6 +2679,8 @@ static void kswapd_shrink_zone(struct zone *zone, if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; + + return sc->nr_scanned >= sc->nr_to_reclaim; } /* @@ -2701,26 +2707,26 @@ static void kswapd_shrink_zone(struct zone *zone, static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int *classzone_idx) { - bool pgdat_is_balanced = false; int i; int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; struct scan_control sc = { .gfp_mask = GFP_KERNEL, + .priority = DEF_PRIORITY, .may_unmap = 1, .may_swap = 1, + .may_writepage = !laptop_mode, .order = order, .target_mem_cgroup = NULL, }; -loop_again: - sc.priority = DEF_PRIORITY; - sc.nr_reclaimed = 0; - sc.may_writepage = !laptop_mode; count_vm_event(PAGEOUTRUN); do { unsigned long lru_pages = 0; + bool raise_priority = true; + + sc.nr_reclaimed = 0; /* * Scan in the highmem->dma direction for the highest @@ -2762,10 +2768,8 @@ loop_again: } } - if (i < 0) { - pgdat_is_balanced = true; + if (i < 0) goto out; - } for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; @@ -2832,8 +2836,16 @@ loop_again: if ((buffer_heads_over_limit && is_highmem_idx(i)) || !zone_balanced(zone, testorder, - balance_gap, end_zone)) - kswapd_shrink_zone(zone, &sc, lru_pages); + balance_gap, end_zone)) { + /* + * There should be no need to raise the + * scanning priority if enough pages are + * already being scanned that high + * watermark would be met at 100% efficiency. + */ + if (kswapd_shrink_zone(zone, &sc, lru_pages)) + raise_priority = false; + } /* * If we're getting trouble reclaiming, start doing @@ -2868,46 +2880,29 @@ loop_again: pfmemalloc_watermark_ok(pgdat)) wake_up(&pgdat->pfmemalloc_wait); - if (pgdat_balanced(pgdat, order, *classzone_idx)) { - pgdat_is_balanced = true; - break; /* kswapd: all done */ - } - /* - * We do this so kswapd doesn't build up large priorities for - * example when it is freeing in parallel with allocators. It - * matches the direct reclaim path behaviour in terms of impact - * on zone->*_priority. + * Fragmentation may mean that the system cannot be rebalanced + * for high-order allocations in all zones. If twice the + * allocation size has been reclaimed and the zones are still + * not balanced then recheck the watermarks at order-0 to + * prevent kswapd reclaiming excessively. Assume that a + * process requested a high-order can direct reclaim/compact. 
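The compaction decision described above boils down to two conditions: no zone was already balanced for the requested order, and at least the number of pages kswapd attempted to reclaim was actually reclaimed. Below is a self-contained sketch of just that control flow; struct zone_sketch, its fields and all the numbers are invented for illustration and this is not the kernel implementation (see the kswapd_shrink_zone()/balance_pgdat() diff that follows).

#include <stdbool.h>
#include <stdio.h>

struct zone_sketch {
	bool balanced_for_order;     /* low watermark already met at this order */
	unsigned long nr_to_reclaim; /* per-zone goal, i.e. the high watermark  */
	unsigned long nr_reclaimed;  /* what one shrink pass actually achieved  */
};

int main(void)
{
	/* Hypothetical two-zone node and a non-zero allocation order */
	struct zone_sketch zones[] = {
		{ false, 128, 150 },
		{ false, 256, 300 },
	};
	unsigned long nr_attempted = 0, nr_reclaimed = 0;
	bool pgdat_needs_compaction = true;	/* order > 0 */

	for (unsigned int i = 0; i < sizeof(zones) / sizeof(zones[0]); i++) {
		/* A balanced zone implies the necessary pages already exist */
		if (zones[i].balanced_for_order)
			pgdat_needs_compaction = false;
		nr_attempted += zones[i].nr_to_reclaim;
		nr_reclaimed += zones[i].nr_reclaimed;
	}

	if (pgdat_needs_compaction && nr_reclaimed > nr_attempted)
		printf("compact the node at this priority\n");
	else
		printf("skip compaction at this priority\n");

	return 0;
}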
*/ - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) - break; - } while (--sc.priority >= 0); - -out: - if (!pgdat_is_balanced) { - cond_resched(); + if (order && sc.nr_reclaimed >= 2UL << order) + order = sc.order = 0; - try_to_freeze(); + /* Check if kswapd should be suspending */ + if (try_to_freeze() || kthread_should_stop()) + break; /* - * Fragmentation may mean that the system cannot be - * rebalanced for high-order allocations in all zones. - * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX, - * it means the zones have been fully scanned and are still - * not balanced. For high-order allocations, there is - * little point trying all over again as kswapd may - * infinite loop. - * - * Instead, recheck all watermarks at order-0 as they - * are the most important. If watermarks are ok, kswapd will go - * back to sleep. High-order users can still perform direct - * reclaim if they wish. + * Raise priority if scanning rate is too low or there was no + * progress in reclaiming pages */ - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX) - order = sc.order = 0; - - goto loop_again; - } + if (raise_priority || !sc.nr_reclaimed) + sc.priority--; + } while (sc.priority >= 0 && + !pgdat_balanced(pgdat, order, *classzone_idx)); /* * If kswapd was reclaiming at a higher order, it has the option of @@ -2936,6 +2931,7 @@ out: compact_pgdat(pgdat, order); } +out: /* * Return the order we were reclaiming at so prepare_kswapd_sleep() * makes a decision on the order we were last reclaiming at. However, -- cgit v1.2.3-70-g09d2 From 2ab44f434586b8ccb11f781b4c2730492e6628f5 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:47 -0700 Subject: mm: vmscan: decide whether to compact the pgdat based on reclaim progress In the past, kswapd makes a decision on whether to compact memory after the pgdat was considered balanced. This more or less worked but it is late to make such a decision and does not fit well now that kswapd makes a decision whether to exit the zone scanning loop depending on reclaim progress. This patch will compact a pgdat if at least the requested number of pages were reclaimed from unbalanced zones for a given priority. If any zone is currently balanced, kswapd will not call compaction as it is expected the necessary pages are already available. 
Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 59 ++++++++++++++++++++++++++++++----------------------------- 1 file changed, 30 insertions(+), 29 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1c10ee51221..cd0980393ba 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2661,7 +2661,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, */ static bool kswapd_shrink_zone(struct zone *zone, struct scan_control *sc, - unsigned long lru_pages) + unsigned long lru_pages, + unsigned long *nr_attempted) { unsigned long nr_slab; struct reclaim_state *reclaim_state = current->reclaim_state; @@ -2677,6 +2678,9 @@ static bool kswapd_shrink_zone(struct zone *zone, nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages); sc->nr_reclaimed += reclaim_state->reclaimed_slab; + /* Account for the number of pages attempted to reclaim */ + *nr_attempted += sc->nr_to_reclaim; + if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; @@ -2724,7 +2728,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, do { unsigned long lru_pages = 0; + unsigned long nr_attempted = 0; bool raise_priority = true; + bool pgdat_needs_compaction = (order > 0); sc.nr_reclaimed = 0; @@ -2774,7 +2780,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; + if (!populated_zone(zone)) + continue; + lru_pages += zone_reclaimable_pages(zone); + + /* + * If any zone is currently balanced then kswapd will + * not call compaction as it is expected that the + * necessary pages are already available. + */ + if (pgdat_needs_compaction && + zone_watermark_ok(zone, order, + low_wmark_pages(zone), + *classzone_idx, 0)) + pgdat_needs_compaction = false; } /* @@ -2843,7 +2863,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, * already being scanned that high * watermark would be met at 100% efficiency. */ - if (kswapd_shrink_zone(zone, &sc, lru_pages)) + if (kswapd_shrink_zone(zone, &sc, lru_pages, + &nr_attempted)) raise_priority = false; } @@ -2895,6 +2916,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, if (try_to_freeze() || kthread_should_stop()) break; + /* + * Compact if necessary and kswapd is reclaiming at least the + * high watermark number of pages as requsted + */ + if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted) + compact_pgdat(pgdat, order); + /* * Raise priority if scanning rate is too low or there was no * progress in reclaiming pages @@ -2904,33 +2932,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, } while (sc.priority >= 0 && !pgdat_balanced(pgdat, order, *classzone_idx)); - /* - * If kswapd was reclaiming at a higher order, it has the option of - * sleeping without all zones being balanced. Before it does, it must - * ensure that the watermarks for order-0 on *all* zones are met and - * that the congestion flags are cleared. The congestion flag must - * be cleared as kswapd is the only mechanism that clears the flag - * and it is potentially going to sleep here. 
- */ - if (order) { - int zones_need_compaction = 1; - - for (i = 0; i <= end_zone; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!populated_zone(zone)) - continue; - - /* Check if the memory needs to be defragmented. */ - if (zone_watermark_ok(zone, order, - low_wmark_pages(zone), *classzone_idx, 0)) - zones_need_compaction = 0; - } - - if (zones_need_compaction) - compact_pgdat(pgdat, order); - } - out: /* * Return the order we were reclaiming at so prepare_kswapd_sleep() -- cgit v1.2.3-70-g09d2 From 9aa41348a8d11427feec350b21dcdd4330fd20c4 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:48 -0700 Subject: mm: vmscan: do not allow kswapd to scan at maximum priority Page reclaim at priority 0 will scan the entire LRU as priority 0 is considered to be a near OOM condition. Kswapd can reach priority 0 quite easily if it is encountering a large number of pages it cannot reclaim such as pages under writeback. When this happens, kswapd reclaims very aggressively even though there may be no real risk of allocation failure or OOM. This patch prevents kswapd reaching priority 0 and trying to reclaim the world. Direct reclaimers will still reach priority 0 in the event of an OOM situation. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Acked-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index cd0980393ba..1505c573719 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2929,7 +2929,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, */ if (raise_priority || !sc.nr_reclaimed) sc.priority--; - } while (sc.priority >= 0 && + } while (sc.priority >= 1 && !pgdat_balanced(pgdat, order, *classzone_idx)); out: -- cgit v1.2.3-70-g09d2 From d43006d503ac921c7df4f94d13c17db6f13c9d26 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:50 -0700 Subject: mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority Currently kswapd queues dirty pages for writeback if scanning at an elevated priority but the priority kswapd scans at is not related to the number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten kswapd priority loop", the priority is related to the size of the LRU and the zone watermark which is no indication as to whether kswapd should write pages or not. This patch tracks if an excessive number of unqueued dirty pages are being encountered at the end of the LRU. If so, it indicates that dirty pages are being recycled before flusher threads can clean them and flags the zone so that kswapd will start writing pages until the zone is balanced. 
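A minimal sketch of the threshold this patch introduces follows; the function name and signature here are invented for illustration, while the real test lives inline in shrink_inactive_list() as the diff below shows.

#include <stdbool.h>
#include <stdio.h>

#define DEF_PRIORITY 12

/* Were enough of the isolated pages dirty but not yet queued for IO? */
static bool tail_of_lru_is_dirty(unsigned long nr_taken,
				 unsigned long nr_unqueued_dirty,
				 int priority)
{
	return nr_unqueued_dirty &&
	       nr_unqueued_dirty >= (nr_taken >> (DEF_PRIORITY - priority));
}

int main(void)
{
	/* 32 pages isolated at priority 10: 32 >> 2 = 8 unqueued dirty pages trip it */
	printf("%d\n", tail_of_lru_is_dirty(32, 8, 10));
	return 0;
}

At DEF_PRIORITY the threshold is the whole batch, so every isolated page would have to be unqueued dirty; as kswapd raises its scanning priority the threshold shrinks. Once it trips, ZONE_TAIL_LRU_DIRTY is set and kswapd is allowed to write file pages until the zone is balanced and the flag is cleared.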
Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 9 +++++++++ mm/vmscan.c | 31 +++++++++++++++++++++++++------ 2 files changed, 34 insertions(+), 6 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5c76737d836..2aaf72f7e34 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -495,6 +495,10 @@ typedef enum { ZONE_CONGESTED, /* zone has many dirty pages backed by * a congested BDI */ + ZONE_TAIL_LRU_DIRTY, /* reclaim scanning has recently found + * many dirty file pages at the tail + * of the LRU. + */ } zone_flags_t; static inline void zone_set_flag(struct zone *zone, zone_flags_t flag) @@ -517,6 +521,11 @@ static inline int zone_is_reclaim_congested(const struct zone *zone) return test_bit(ZONE_CONGESTED, &zone->flags); } +static inline int zone_is_reclaim_dirty(const struct zone *zone) +{ + return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags); +} + static inline int zone_is_reclaim_locked(const struct zone *zone) { return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags); diff --git a/mm/vmscan.c b/mm/vmscan.c index 1505c573719..d6c916d808b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -676,13 +676,14 @@ static unsigned long shrink_page_list(struct list_head *page_list, struct zone *zone, struct scan_control *sc, enum ttu_flags ttu_flags, - unsigned long *ret_nr_dirty, + unsigned long *ret_nr_unqueued_dirty, unsigned long *ret_nr_writeback, bool force_reclaim) { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); int pgactivate = 0; + unsigned long nr_unqueued_dirty = 0; unsigned long nr_dirty = 0; unsigned long nr_congested = 0; unsigned long nr_reclaimed = 0; @@ -808,14 +809,17 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageDirty(page)) { nr_dirty++; + if (!PageWriteback(page)) + nr_unqueued_dirty++; + /* * Only kswapd can writeback filesystem pages to - * avoid risk of stack overflow but do not writeback - * unless under significant pressure. + * avoid risk of stack overflow but only writeback + * if many dirty pages have been encountered. */ if (page_is_file_cache(page) && (!current_is_kswapd() || - sc->priority >= DEF_PRIORITY - 2)) { + !zone_is_reclaim_dirty(zone))) { /* * Immediately reclaim when written back. * Similar in principal to deactivate_page() @@ -960,7 +964,7 @@ keep: list_splice(&ret_pages, page_list); count_vm_events(PGACTIVATE, pgactivate); mem_cgroup_uncharge_end(); - *ret_nr_dirty += nr_dirty; + *ret_nr_unqueued_dirty += nr_unqueued_dirty; *ret_nr_writeback += nr_writeback; return nr_reclaimed; } @@ -1373,6 +1377,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, (nr_taken >> (DEF_PRIORITY - sc->priority))) wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); + /* + * Similarly, if many dirty pages are encountered that are not + * currently being written then flag that kswapd should start + * writing back pages. 
+ */ + if (global_reclaim(sc) && nr_dirty && + nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority))) + zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); + trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, zone_idx(zone), nr_scanned, nr_reclaimed, @@ -2769,8 +2782,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, end_zone = i; break; } else { - /* If balanced, clear the congested flag */ + /* + * If balanced, clear the dirty and congested + * flags + */ zone_clear_flag(zone, ZONE_CONGESTED); + zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); } } @@ -2888,8 +2905,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, * possible there are dirty pages backed by * congested BDIs but as pressure is relieved, * speculatively avoid congestion waits + * or writing pages from kswapd context. */ zone_clear_flag(zone, ZONE_CONGESTED); + zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); } /* -- cgit v1.2.3-70-g09d2 From 283aba9f9e0e4882bf09bd37a2983379a6fae805 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:51 -0700 Subject: mm: vmscan: block kswapd if it is encountering pages under writeback Historically, kswapd used to congestion_wait() at higher priorities if it was not making forward progress. This made no sense as the failure to make progress could be completely independent of IO. It was later replaced by wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't wait on congested zones in balance_pgdat()) as it was duplicating logic in shrink_inactive_list(). This is problematic. If kswapd encounters many pages under writeback and it continues to scan until it reaches the high watermark then it will quickly skip over the pages under writeback and reclaim clean young pages or push applications out to swap. The use of wait_iff_congested() is not suited to kswapd as it will only stall if the underlying BDI is really congested or a direct reclaimer was unable to write to the underlying BDI. kswapd bypasses the BDI congestion as it sets PF_SWAPWRITE but even if this was taken into account then it would cause direct reclaimers to stall on writeback which is not desirable. This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is encountering too many pages under writeback. If this flag is set and kswapd encounters a PageReclaim page under writeback then it'll assume that the LRU lists are being recycled too quickly before IO can complete and block waiting for some IO to complete. Signed-off-by: Mel Gorman Reviewed-by: Michal Hocko Acked-by: Rik van Riel Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 8 +++++ mm/vmscan.c | 82 ++++++++++++++++++++++++++++++++++++-------------- 2 files changed, 68 insertions(+), 22 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2aaf72f7e34..fce64afba04 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -499,6 +499,9 @@ typedef enum { * many dirty file pages at the tail * of the LRU. 
*/ + ZONE_WRITEBACK, /* reclaim scanning has recently found + * many pages under writeback + */ } zone_flags_t; static inline void zone_set_flag(struct zone *zone, zone_flags_t flag) @@ -526,6 +529,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone) return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags); } +static inline int zone_is_reclaim_writeback(const struct zone *zone) +{ + return test_bit(ZONE_WRITEBACK, &zone->flags); +} + static inline int zone_is_reclaim_locked(const struct zone *zone) { return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags); diff --git a/mm/vmscan.c b/mm/vmscan.c index d6c916d808b..1109de0c35b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -724,25 +724,55 @@ static unsigned long shrink_page_list(struct list_head *page_list, may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); + /* + * If a page at the tail of the LRU is under writeback, there + * are three cases to consider. + * + * 1) If reclaim is encountering an excessive number of pages + * under writeback and this page is both under writeback and + * PageReclaim then it indicates that pages are being queued + * for IO but are being recycled through the LRU before the + * IO can complete. Waiting on the page itself risks an + * indefinite stall if it is impossible to writeback the + * page due to IO error or disconnected storage so instead + * block for HZ/10 or until some IO completes then clear the + * ZONE_WRITEBACK flag to recheck if the condition exists. + * + * 2) Global reclaim encounters a page, memcg encounters a + * page that is not marked for immediate reclaim or + * the caller does not have __GFP_IO. In this case mark + * the page for immediate reclaim and continue scanning. + * + * __GFP_IO is checked because a loop driver thread might + * enter reclaim, and deadlock if it waits on a page for + * which it is needed to do the write (loop masks off + * __GFP_IO|__GFP_FS for this reason); but more thought + * would probably show more reasons. + * + * Don't require __GFP_FS, since we're not going into the + * FS, just waiting on its writeback completion. Worryingly, + * ext4 gfs2 and xfs allocate pages with + * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing + * may_enter_fs here is liable to OOM on them. + * + * 3) memcg encounters a page that is not already marked + * PageReclaim. memcg does not have any dirty pages + * throttling so we could easily OOM just because too many + * pages are in writeback and there is nothing else to + * reclaim. Wait for the writeback to complete. + */ if (PageWriteback(page)) { - /* - * memcg doesn't have any dirty pages throttling so we - * could easily OOM just because too many pages are in - * writeback and there is nothing else to reclaim. - * - * Check __GFP_IO, certainly because a loop driver - * thread might enter reclaim, and deadlock if it waits - * on a page for which it is needed to do the write - * (loop masks off __GFP_IO|__GFP_FS for this reason); - * but more thought would probably show more reasons. - * - * Don't require __GFP_FS, since we're not going into - * the FS, just waiting on its writeback completion. - * Worryingly, ext4 gfs2 and xfs allocate pages with - * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so - * testing may_enter_fs here is liable to OOM on them. 
- */ - if (global_reclaim(sc) || + /* Case 1 above */ + if (current_is_kswapd() && + PageReclaim(page) && + zone_is_reclaim_writeback(zone)) { + unlock_page(page); + congestion_wait(BLK_RW_ASYNC, HZ/10); + zone_clear_flag(zone, ZONE_WRITEBACK); + goto keep; + + /* Case 2 above */ + } else if (global_reclaim(sc) || !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { /* * This is slightly racy - end_page_writeback() @@ -757,9 +787,13 @@ static unsigned long shrink_page_list(struct list_head *page_list, */ SetPageReclaim(page); nr_writeback++; + goto keep_locked; + + /* Case 3 above */ + } else { + wait_on_page_writeback(page); } - wait_on_page_writeback(page); } if (!force_reclaim) @@ -1374,8 +1408,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * isolated page is PageWriteback */ if (nr_writeback && nr_writeback >= - (nr_taken >> (DEF_PRIORITY - sc->priority))) + (nr_taken >> (DEF_PRIORITY - sc->priority))) { wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); + zone_set_flag(zone, ZONE_WRITEBACK); + } /* * Similarly, if many dirty pages are encountered that are not @@ -2669,8 +2705,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, * the high watermark. * * Returns true if kswapd scanned at least the requested number of pages to - * reclaim. This is used to determine if the scanning priority needs to be - * raised. + * reclaim or if the lack of progress was due to pages under writeback. + * This is used to determine if the scanning priority needs to be raised. */ static bool kswapd_shrink_zone(struct zone *zone, struct scan_control *sc, @@ -2697,6 +2733,8 @@ static bool kswapd_shrink_zone(struct zone *zone, if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; + zone_clear_flag(zone, ZONE_WRITEBACK); + return sc->nr_scanned >= sc->nr_to_reclaim; } -- cgit v1.2.3-70-g09d2 From b7ea3c417b6c2e74ca1cb051568f60377908928d Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:53 -0700 Subject: mm: vmscan: check if kswapd should writepage once per pgdat scan Currently kswapd checks if it should start writepage as it shrinks each zone without taking into consideration if the zone is balanced or not. This is not wrong as such but it does not make much sense either. This patch checks once per pgdat scan if kswapd should be writing pages. Signed-off-by: Mel Gorman Reviewed-by: Michal Hocko Acked-by: Rik van Riel Acked-by: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1109de0c35b..a2d0c684261 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2852,6 +2852,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, pgdat_needs_compaction = false; } + /* + * If we're getting trouble reclaiming, start doing writepage + * even in laptop mode. + */ + if (sc.priority < DEF_PRIORITY - 2) + sc.may_writepage = 1; + /* * Now scan the zone in the dma->highmem direction, stopping * at the last zone which needs scanning. @@ -2923,13 +2930,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, raise_priority = false; } - /* - * If we're getting trouble reclaiming, start doing - * writepage even in laptop mode. 
- */ - if (sc.priority < DEF_PRIORITY - 2) - sc.may_writepage = 1; - if (zone->all_unreclaimable) { if (end_zone && end_zone == i) end_zone--; -- cgit v1.2.3-70-g09d2 From 7c954f6de6b630de30f265a079aad359f159ebe9 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:54 -0700 Subject: mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone() balance_pgdat() is very long and some of the logic can and should be internal to kswapd_shrink_zone(). Move it so the flow of balance_pgdat() is marginally easier to follow. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Michal Hocko Acked-by: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 110 +++++++++++++++++++++++++++++------------------------------- 1 file changed, 54 insertions(+), 56 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index a2d0c684261..4a43c289b23 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2709,18 +2709,53 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, * This is used to determine if the scanning priority needs to be raised. */ static bool kswapd_shrink_zone(struct zone *zone, + int classzone_idx, struct scan_control *sc, unsigned long lru_pages, unsigned long *nr_attempted) { unsigned long nr_slab; + int testorder = sc->order; + unsigned long balance_gap; struct reclaim_state *reclaim_state = current->reclaim_state; struct shrink_control shrink = { .gfp_mask = sc->gfp_mask, }; + bool lowmem_pressure; /* Reclaim above the high watermark. */ sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); + + /* + * Kswapd reclaims only single pages with compaction enabled. Trying + * too hard to reclaim until contiguous free pages have become + * available can hurt performance by evicting too much useful data + * from memory. Do not reclaim more than needed for compaction. + */ + if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && + compaction_suitable(zone, sc->order) != + COMPACT_SKIPPED) + testorder = 0; + + /* + * We put equal pressure on every zone, unless one zone has way too + * many pages free already. The "too many pages" is defined as the + * high wmark plus a "gap" where the gap is either the low + * watermark or 1% of the zone, whichever is smaller. + */ + balance_gap = min(low_wmark_pages(zone), + (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / + KSWAPD_ZONE_BALANCE_GAP_RATIO); + + /* + * If there is no low memory pressure or the zone is balanced then no + * reclaim is necessary + */ + lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone)); + if (!lowmem_pressure && zone_balanced(zone, testorder, + balance_gap, classzone_idx)) + return true; + shrink_zone(zone, sc); reclaim_state->reclaimed_slab = 0; @@ -2735,6 +2770,18 @@ static bool kswapd_shrink_zone(struct zone *zone, zone_clear_flag(zone, ZONE_WRITEBACK); + /* + * If a zone reaches its high watermark, consider it to be no longer + * congested. It's possible there are dirty pages backed by congested + * BDIs but as pressure is relieved, speculatively avoid congestion + * waits. 
+ */ + if (!zone->all_unreclaimable && + zone_balanced(zone, testorder, 0, classzone_idx)) { + zone_clear_flag(zone, ZONE_CONGESTED); + zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); + } + return sc->nr_scanned >= sc->nr_to_reclaim; } @@ -2870,8 +2917,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, */ for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; - int testorder; - unsigned long balance_gap; if (!populated_zone(zone)) continue; @@ -2892,61 +2937,14 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, sc.nr_reclaimed += nr_soft_reclaimed; /* - * We put equal pressure on every zone, unless - * one zone has way too many pages free - * already. The "too many pages" is defined - * as the high wmark plus a "gap" where the - * gap is either the low watermark or 1% - * of the zone, whichever is smaller. - */ - balance_gap = min(low_wmark_pages(zone), - (zone->managed_pages + - KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / - KSWAPD_ZONE_BALANCE_GAP_RATIO); - /* - * Kswapd reclaims only single pages with compaction - * enabled. Trying too hard to reclaim until contiguous - * free pages have become available can hurt performance - * by evicting too much useful data from memory. - * Do not reclaim more than needed for compaction. + * There should be no need to raise the scanning + * priority if enough pages are already being scanned + * that that high watermark would be met at 100% + * efficiency. */ - testorder = order; - if (IS_ENABLED(CONFIG_COMPACTION) && order && - compaction_suitable(zone, order) != - COMPACT_SKIPPED) - testorder = 0; - - if ((buffer_heads_over_limit && is_highmem_idx(i)) || - !zone_balanced(zone, testorder, - balance_gap, end_zone)) { - /* - * There should be no need to raise the - * scanning priority if enough pages are - * already being scanned that high - * watermark would be met at 100% efficiency. - */ - if (kswapd_shrink_zone(zone, &sc, lru_pages, - &nr_attempted)) - raise_priority = false; - } - - if (zone->all_unreclaimable) { - if (end_zone && end_zone == i) - end_zone--; - continue; - } - - if (zone_balanced(zone, testorder, 0, end_zone)) - /* - * If a zone reaches its high watermark, - * consider it to be no longer congested. It's - * possible there are dirty pages backed by - * congested BDIs but as pressure is relieved, - * speculatively avoid congestion waits - * or writing pages from kswapd context. - */ - zone_clear_flag(zone, ZONE_CONGESTED); - zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); + if (kswapd_shrink_zone(zone, end_zone, &sc, + lru_pages, &nr_attempted)) + raise_priority = false; } /* -- cgit v1.2.3-70-g09d2 From e2be15f6c3eecedfbe1550cca8d72c5057abbbd2 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:57 -0700 Subject: mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered Further testing of the "Reduce system disruption due to kswapd" discovered a few problems. First and foremost, it's possible for pages under writeback to be freed which will lead to badness. Second, as pages were not being swapped the file LRU was being scanned faster and clean file pages were being reclaimed. In some cases this results in increased read IO to re-read data from disk. Third, more pages were being written from kswapd context which can adversly affect IO performance. 
Lastly, it was observed that PageDirty pages are not necessarily dirty on all filesystems (buffers can be clean while PageDirty is set and ->writepage generates no IO) and not all filesystems set PageWriteback when the page is being written (e.g. ext3). This disconnect confuses the reclaim stalling logic. This follow-up series is aimed at these problems. The tests were based on three kernels vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline mmotm-20130522 is mmotm as of 22nd May with "Reduce system disruption due to kswapd" applied on top as per what should be in Andrew's tree right now lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel The first test used memcached+memcachetest while some background IO was in progress as implemented by the parallel IO tests implement in MM Tests. memcachetest benchmarks how many operations/second memcached can service. It starts with no background IO on a freshly created ext4 filesystem and then re-runs the test with larger amounts of IO in the background to roughly simulate a large copy in progress. The expectation is that the IO should have little or no impact on memcachetest which is running entirely in memory. parallelio 3.9.0 3.9.0 3.9.0 vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10 Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%) Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%) Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%) Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%) Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%) Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%) Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops minorfaults-0M 1536429.00 ( 0.00%) 1531632.00 ( 0.31%) 1533541.00 ( 0.19%) Ops minorfaults-715M 1786996.00 ( 0.00%) 1612148.00 ( 9.78%) 1608832.00 ( 9.97%) Ops minorfaults-2385M 1757952.00 ( 0.00%) 1614874.00 ( 8.14%) 1613541.00 ( 8.21%) Ops minorfaults-4055M 1774460.00 ( 0.00%) 1633400.00 ( 7.95%) 1630881.00 ( 8.09%) Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%) Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%) Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%) memcachetest is the transactions/second reported by memcachetest. In the vanilla kernel note that performance drops from around 23K/sec to just over 4K/second when there is 2385M of IO going on in the background. With current mmotm, there is no collapse in performance and with this follow-up series there is little change. swaptotal is the total amount of swap traffic. With mmotm and the follow-up series, the total amount of swapping is much reduced. 
3.9.0 3.9.0 3.9.0 vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10 Minor Faults 11160152 10706748 10622316 Major Faults 46305 755 678 Swap Ins 260249 0 0 Swap Outs 683860 18 18 Direct pages scanned 0 678 2520 Kswapd pages scanned 6046108 8814900 1639279 Kswapd pages reclaimed 1081954 1172267 1094635 Direct pages reclaimed 0 566 2304 Kswapd efficiency 17% 13% 66% Kswapd velocity 5217.560 7618.953 1414.879 Direct efficiency 100% 83% 91% Direct velocity 0.000 0.586 2.175 Percentage direct scans 0% 0% 0% Zone normal velocity 5105.086 6824.681 671.158 Zone dma32 velocity 112.473 794.858 745.896 Zone dma velocity 0.000 0.000 0.000 Page writes by reclaim 1929612.000 6861768.000 32821.000 Page writes file 1245752 6861750 32803 Page writes anon 683860 18 18 Page reclaim immediate 7484 40 239 Sector Reads 1130320 93996 86900 Sector Writes 13508052 10823500 11804436 Page rescued immediate 0 0 0 Slabs scanned 33536 27136 18560 Direct inode steals 0 0 0 Kswapd inode steals 8641 1035 0 Kswapd skipped wait 0 0 0 THP fault alloc 8 37 33 THP collapse alloc 508 552 515 THP splits 24 1 1 THP fault fallback 0 0 0 THP collapse fail 0 0 0 There are a number of observations to make here 1. Swap outs are almost eliminated. Swap ins are 0 indicating that the pages swapped were really unused anonymous pages. Related to that, major faults are much reduced. 2. kswapd efficiency was impacted by the initial series but with these follow-up patches, the efficiency is now at 66% indicating that far fewer pages were skipped during scanning due to dirty or writeback pages. 3. kswapd velocity is reduced indicating that fewer pages are being scanned with the follow-up series as kswapd now stalls when the tail of the LRU queue is full of unqueued dirty pages. The stall gives flushers a chance to catch-up so kswapd can reclaim clean pages when it wakes 4. In light of Zlatko's recent reports about zone scanning imbalances, mmtests now reports scanning velocity on a per-zone basis. With mainline, you can see that the scanning activity is dominated by the Normal zone with over 45 times more scanning in Normal than the DMA32 zone. With the series currently in mmotm, the ratio is slightly better but it is still the case that the bulk of scanning is in the highest zone. With this follow-up series, the ratio of scanning between the Normal and DMA32 zone is roughly equal. 5. As Dave Chinner observed, the current patches in mmotm increased the number of pages written from kswapd context which is expected to adversly impact IO performance. With the follow-up patches, far fewer pages are written from kswapd context than the mainline kernel 6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With the follow-up series, there is less slab shrinking activity and no inodes were reclaimed. 7. Note that "Sectors Read" is drastically reduced implying that the source data being used for the IO is not being aggressively discarded due to page reclaim skipping over dirty pages and reclaiming clean pages. Note that the reducion in reads could also be due to inode data not being re-read from disk after a slab shrink. 
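For reference, the derived figures in these reports follow directly from the raw counters above: kswapd efficiency is pages reclaimed as a percentage of pages scanned (for the follow-up kernel, 1094635 / 1639279 is about 66%, matching the table), and the velocity lines are scan rates, i.e. pages scanned divided by the elapsed test time, which is why a lower kswapd velocity at similar or better efficiency translates into less reclaim work overall.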
                          3.9.0               3.9.0                  3.9.0
                        vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
Mean sda-avgqz           166.99               32.09                  33.44
Mean sda-await           853.64              192.76                 185.43
Mean sda-r_await           6.31                9.24                   5.97
Mean sda-w_await        2992.81              202.65                 192.43
Max sda-avgqz           1409.91              718.75                 698.98
Max sda-await           6665.74             3538.00                3124.23
Max sda-r_await           58.96              111.95                  58.00
Max sda-w_await        28458.94             3977.29                3148.61

In light of the changes in writes from reclaim context, the number of reads and Dave Chinner's concerns about IO performance, I took a closer look at the IO stats for the test disk. A few observations

1. The average queue size is reduced by the initial series and is roughly
   the same with this follow-up.

2. Average wait times for writes are reduced and as the IO is completing
   faster it at least implies that the gain is because flushers are
   writing the files efficiently instead of page reclaim getting in the
   way.

3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.

Jan Kara asked how NFS is affected by all of this. Unstable pages can be taken into account as one of the patches in the series shows but it is still the case that filesystems with unusual handling of dirty or writeback pages could still be treated better.

Tests like postmark, fsmark and largedd showed up nothing useful. On my test setup, pages are simply not being written back from reclaim context with or without the patches and there are no changes in performance. My test setup probably is just not strong enough network-wise to be really interesting.

I ran a longer-lived memcached test with IO going to NFS instead of a local disk

parallelio
                                          3.9.0                   3.9.0                   3.9.0
                                        vanilla      mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
Ops memcachetest-0M        23323.00 (  0.00%)      23241.00 ( -0.35%)      23321.00 ( -0.01%)
Ops memcachetest-715M      25526.00 (  0.00%)      24763.00 ( -2.99%)      23242.00 ( -8.95%)
Ops memcachetest-2385M      8814.00 (  0.00%)      26924.00 (205.47%)      23521.00 (166.86%)
Ops memcachetest-4055M      5835.00 (  0.00%)      26827.00 (359.76%)      25560.00 (338.05%)
Ops io-duration-0M             0.00 (  0.00%)          0.00 (  0.00%)          0.00 (  0.00%)
Ops io-duration-715M          65.00 (  0.00%)         71.00 ( -9.23%)         11.00 ( 83.08%)
Ops io-duration-2385M        129.00 (  0.00%)         94.00 ( 27.13%)         53.00 ( 58.91%)
Ops io-duration-4055M        301.00 (  0.00%)        100.00 ( 66.78%)        108.00 ( 64.12%)
Ops swaptotal-0M               0.00 (  0.00%)          0.00 (  0.00%)          0.00 (  0.00%)
Ops swaptotal-715M         14394.00 (  0.00%)        949.00 ( 93.41%)         63.00 ( 99.56%)
Ops swaptotal-2385M       401483.00 (  0.00%)      24437.00 ( 93.91%)      30118.00 ( 92.50%)
Ops swaptotal-4055M       554123.00 (  0.00%)      35688.00 ( 93.56%)      63082.00 ( 88.62%)
Ops swapin-0M                  0.00 (  0.00%)          0.00 (  0.00%)          0.00 (  0.00%)
Ops swapin-715M             4522.00 (  0.00%)        560.00 ( 87.62%)         63.00 ( 98.61%)
Ops swapin-2385M          169861.00 (  0.00%)       5026.00 ( 97.04%)      13917.00 ( 91.81%)
Ops swapin-4055M          192374.00 (  0.00%)      10056.00 ( 94.77%)      25729.00 ( 86.63%)
Ops minorfaults-0M       1445969.00 (  0.00%)    1520878.00 ( -5.18%)    1454024.00 ( -0.56%)
Ops minorfaults-715M     1557288.00 (  0.00%)    1528482.00 (  1.85%)    1535776.00 (  1.38%)
Ops minorfaults-2385M    1692896.00 (  0.00%)    1570523.00 (  7.23%)    1559622.00 (  7.87%)
Ops minorfaults-4055M    1654985.00 (  0.00%)    1581456.00 (  4.44%)    1596713.00 (  3.52%)
Ops majorfaults-0M             0.00 (  0.00%)          1.00 (-99.00%)          0.00 (  0.00%)
Ops majorfaults-715M         763.00 (  0.00%)        265.00 ( 65.27%)         75.00 ( 90.17%)
Ops majorfaults-2385M      23861.00 (  0.00%)        894.00 ( 96.25%)       2189.00 ( 90.83%)
Ops majorfaults-4055M      27210.00 (  0.00%)       1569.00 ( 94.23%)       4088.00 ( 84.98%)

1. Performance does not collapse due to IO which is good. IO is also
   completing faster.
   Note that with mmotm, IO completes in a third of the time and faster
   again with this series applied.

2. Swapping is reduced, although not eliminated. The figures for the
   follow-up look bad but it does vary a bit as the stalling is not
   perfect for NFS or for filesystems like ext3 with unusual handling of
   dirty and writeback pages.

3. There are swapins, particularly with larger amounts of IO, indicating
   that active pages are being reclaimed. However, the number is much
   reduced.

                                  3.9.0               3.9.0                  3.9.0
                                vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
Minor Faults                   36339175            35025445               35219699
Major Faults                     310964               27108                  51887
Swap Ins                        2176399              173069                 333316
Swap Outs                       3344050              357228                 504824
Direct pages scanned               8972               77283                  43242
Kswapd pages scanned           20899983             8939566               14772851
Kswapd pages reclaimed          6193156             5172605                5231026
Direct pages reclaimed             8450               73802                  39514
Kswapd efficiency                   29%                 57%                    35%
Kswapd velocity                3929.743            1847.499               3058.840
Direct efficiency                   94%                 95%                    91%
Direct velocity                   1.687              15.972                  8.954
Percentage direct scans              0%                  0%                     0%
Zone normal velocity           3721.907             939.103               2185.142
Zone dma32 velocity             209.522             924.368                882.651
Zone dma velocity                 0.000               0.000                  0.000
Page writes by reclaim      4082185.000          526319.000             537114.000
Page writes file                 738135              169091                  32290
Page writes anon                3344050              357228                 504824
Page reclaim immediate             9524                 170                5595843
Sector Reads                    8909900              861192                1483680
Sector Writes                  13428980             1488744                2076800
Page rescued immediate                0                   0                      0
Slabs scanned                     38016               31744                  28672
Direct inode steals                   0                   0                      0
Kswapd inode steals                 424                   0                      0
Kswapd skipped wait                   0                   0                      0
THP fault alloc                      14                  15                    119
THP collapse alloc                 1767                1569                   1618
THP splits                           30                  29                     25
THP fault fallback                    0                   0                      0
THP collapse fail                     8                   5                      0
Compaction stalls                    17                  41                    100
Compaction success                    7                  31                     95
Compaction failures                  10                  10                      5
Page migrate success               7083               22157                  62217
Page migrate failure                  0                   0                      0
Compaction pages isolated         14847               48758                 135830
Compaction migrate scanned        18328               48398                 138929
Compaction free scanned         2000255              355827                1720269
Compaction cost                       7                  24                     68

I guess the main takeaway again is the much reduced page writes from reclaim context and reduced reads.

                          3.9.0               3.9.0                  3.9.0
                        vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
Mean sda-avgqz            23.58                0.35                   0.44
Mean sda-await           133.47               15.72                  15.46
Mean sda-r_await           4.72                4.69                   3.95
Mean sda-w_await         507.69               28.40                  33.68
Max sda-avgqz            680.60               12.25                  23.14
Max sda-await           3958.89              221.83                 286.22
Max sda-r_await           63.86               61.23                  67.29
Max sda-w_await        11710.38              883.57                1767.28

And as before, write wait times are much reduced.

This patch:

The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority" decides whether to write back pages from reclaim context based on the number of dirty pages encountered. This situation is flagged too easily and flushers are not given the chance to catch up, resulting in more pages being written from reclaim context and potentially impacting IO performance. The check for PageWriteback is also misplaced as it happens within a PageDirty check, which is nonsense as the dirty bit may have been cleared for IO. The accounting is updated very late and pages that are already under writeback, were reactivated, could not be unmapped or could not be released are all missed. Similarly, a page is considered congested for reasons other than being congested and pages that cannot be written out in the correct context are skipped. Finally, it considers stalling and writing back filesystem pages due to encountering dirty anonymous pages at the tail of the LRU which is dumb.
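As a rough illustration of the accounting change, the per-page bookkeeping ends up looking something like the snippet below; this is a simplified sketch only and the authoritative change is the diff further below.

/*
 * Simplified sketch of the per-page accounting this patch moves to the
 * top of the shrink_page_list() loop. Taking the counts before any
 * PageDirty/PageWriteback handling means reactivated pages, pages that
 * could not be unmapped and pages already under writeback are no
 * longer missed.
 */
page_check_dirty_writeback(page, &dirty, &writeback);
if (dirty || writeback)
	nr_dirty++;
if (dirty && !writeback)
	nr_unqueued_dirty++;

/* Congestion now reflects the state of the backing device */
mapping = page_mapping(page);
if (mapping && bdi_write_congested(mapping->backing_dev_info))
	nr_congested++;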
This patch causes kswapd to begin writing filesystem pages from reclaim context only if page reclaim found that all filesystem pages at the tail of the LRU were unqueued dirty pages. Before it starts writing filesystem pages, it will stall to give flushers a chance to catch up. The decision on whether wait_iff_congested is also now determined by dirty filesystem pages only. Congested pages are based on whether the underlying BDI is congested regardless of the context of the reclaiming process. Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Cc: Zlatko Calusic Cc: dormando Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 48 insertions(+), 13 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 4a43c289b23..999ef0b9399 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -669,6 +669,25 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_RECLAIM; } +/* Check if a page is dirty or under writeback */ +static void page_check_dirty_writeback(struct page *page, + bool *dirty, bool *writeback) +{ + /* + * Anonymous pages are not handled by flushers and must be written + * from reclaim context. Do not stall reclaim based on them + */ + if (!page_is_file_cache(page)) { + *dirty = false; + *writeback = false; + return; + } + + /* By default assume that the page flags are accurate */ + *dirty = PageDirty(page); + *writeback = PageWriteback(page); +} + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -697,6 +716,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, struct page *page; int may_enter_fs; enum page_references references = PAGEREF_RECLAIM_CLEAN; + bool dirty, writeback; cond_resched(); @@ -724,6 +744,24 @@ static unsigned long shrink_page_list(struct list_head *page_list, may_enter_fs = (sc->gfp_mask & __GFP_FS) || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); + /* + * The number of dirty pages determines if a zone is marked + * reclaim_congested which affects wait_iff_congested. kswapd + * will stall and start writing pages if the tail of the LRU + * is all dirty unqueued pages. + */ + page_check_dirty_writeback(page, &dirty, &writeback); + if (dirty || writeback) + nr_dirty++; + + if (dirty && !writeback) + nr_unqueued_dirty++; + + /* Treat this page as congested if underlying BDI is */ + mapping = page_mapping(page); + if (mapping && bdi_write_congested(mapping->backing_dev_info)) + nr_congested++; + /* * If a page at the tail of the LRU is under writeback, there * are three cases to consider. 
@@ -819,9 +857,10 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (!add_to_swap(page, page_list)) goto activate_locked; may_enter_fs = 1; - } - mapping = page_mapping(page); + /* Adding to swap updated mapping */ + mapping = page_mapping(page); + } /* * The page is mapped into the page tables of one or more @@ -841,11 +880,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, } if (PageDirty(page)) { - nr_dirty++; - - if (!PageWriteback(page)) - nr_unqueued_dirty++; - /* * Only kswapd can writeback filesystem pages to * avoid risk of stack overflow but only writeback @@ -876,7 +910,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, /* Page is dirty, try to write it out here */ switch (pageout(page, mapping, sc)) { case PAGE_KEEP: - nr_congested++; goto keep_locked; case PAGE_ACTIVATE: goto activate_locked; @@ -1318,7 +1351,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, unsigned long nr_scanned; unsigned long nr_reclaimed = 0; unsigned long nr_taken; - unsigned long nr_dirty = 0; + unsigned long nr_unqueued_dirty = 0; unsigned long nr_writeback = 0; isolate_mode_t isolate_mode = 0; int file = is_file_lru(lru); @@ -1361,7 +1394,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, - &nr_dirty, &nr_writeback, false); + &nr_unqueued_dirty, &nr_writeback, false); spin_lock_irq(&zone->lru_lock); @@ -1416,11 +1449,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, /* * Similarly, if many dirty pages are encountered that are not * currently being written then flag that kswapd should start - * writing back pages. + * writing back pages and stall to give a chance for flushers + * to catch up. */ - if (global_reclaim(sc) && nr_dirty && - nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority))) + if (global_reclaim(sc) && nr_unqueued_dirty == nr_taken) { + congestion_wait(BLK_RW_ASYNC, HZ/10); zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); + } trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, zone_idx(zone), -- cgit v1.2.3-70-g09d2 From b1a6f21e3b2315d46ae8af88a8f4eb8ea2763107 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:01:58 -0700 Subject: mm: vmscan: stall page reclaim after a list of pages have been processed Commit "mm: vmscan: Block kswapd if it is encountering pages under writeback" blocks page reclaim if it encounters pages under writeback marked for immediate reclaim. It blocks while pages are still isolated from the LRU which is unnecessary. This patch defers the blocking until after the isolated pages have been processed and tidies up some of the comments. 
Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Cc: Zlatko Calusic Cc: dormando Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++---------------- 1 file changed, 33 insertions(+), 16 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 999ef0b9399..5b1a79c8f0c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -697,6 +697,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, enum ttu_flags ttu_flags, unsigned long *ret_nr_unqueued_dirty, unsigned long *ret_nr_writeback, + unsigned long *ret_nr_immediate, bool force_reclaim) { LIST_HEAD(ret_pages); @@ -707,6 +708,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, unsigned long nr_congested = 0; unsigned long nr_reclaimed = 0; unsigned long nr_writeback = 0; + unsigned long nr_immediate = 0; cond_resched(); @@ -773,8 +775,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, * IO can complete. Waiting on the page itself risks an * indefinite stall if it is impossible to writeback the * page due to IO error or disconnected storage so instead - * block for HZ/10 or until some IO completes then clear the - * ZONE_WRITEBACK flag to recheck if the condition exists. + * note that the LRU is being scanned too quickly and the + * caller can stall after page list has been processed. * * 2) Global reclaim encounters a page, memcg encounters a * page that is not marked for immediate reclaim or @@ -804,10 +806,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (current_is_kswapd() && PageReclaim(page) && zone_is_reclaim_writeback(zone)) { - unlock_page(page); - congestion_wait(BLK_RW_ASYNC, HZ/10); - zone_clear_flag(zone, ZONE_WRITEBACK); - goto keep; + nr_immediate++; + goto keep_locked; /* Case 2 above */ } else if (global_reclaim(sc) || @@ -1033,6 +1033,7 @@ keep: mem_cgroup_uncharge_end(); *ret_nr_unqueued_dirty += nr_unqueued_dirty; *ret_nr_writeback += nr_writeback; + *ret_nr_immediate += nr_immediate; return nr_reclaimed; } @@ -1044,7 +1045,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, .priority = DEF_PRIORITY, .may_unmap = 1, }; - unsigned long ret, dummy1, dummy2; + unsigned long ret, dummy1, dummy2, dummy3; struct page *page, *next; LIST_HEAD(clean_pages); @@ -1057,7 +1058,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, ret = shrink_page_list(&clean_pages, zone, &sc, TTU_UNMAP|TTU_IGNORE_ACCESS, - &dummy1, &dummy2, true); + &dummy1, &dummy2, &dummy3, true); list_splice(&clean_pages, page_list); __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret); return ret; @@ -1353,6 +1354,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, unsigned long nr_taken; unsigned long nr_unqueued_dirty = 0; unsigned long nr_writeback = 0; + unsigned long nr_immediate = 0; isolate_mode_t isolate_mode = 0; int file = is_file_lru(lru); struct zone *zone = lruvec_zone(lruvec); @@ -1394,7 +1396,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, - &nr_unqueued_dirty, &nr_writeback, false); + &nr_unqueued_dirty, &nr_writeback, &nr_immediate, + false); spin_lock_irq(&zone->lru_lock); @@ -1447,14 +1450,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, } /* - * Similarly, if many dirty pages are encountered that 
are not - * currently being written then flag that kswapd should start - * writing back pages and stall to give a chance for flushers - * to catch up. + * memcg will stall in page writeback so only consider forcibly + * stalling for global reclaim */ - if (global_reclaim(sc) && nr_unqueued_dirty == nr_taken) { - congestion_wait(BLK_RW_ASYNC, HZ/10); - zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); + if (global_reclaim(sc)) { + /* + * If dirty pages are scanned that are not queued for IO, it + * implies that flushers are not keeping up. In this case, flag + * the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing + * pages from reclaim context. It will forcibly stall in the + * next check. + */ + if (nr_unqueued_dirty == nr_taken) + zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); + + /* + * In addition, if kswapd scans pages marked marked for + * immediate reclaim and under writeback (nr_immediate), it + * implies that pages are cycling through the LRU faster than + * they are written so also forcibly stall. + */ + if (nr_unqueued_dirty == nr_taken || nr_immediate) + congestion_wait(BLK_RW_ASYNC, HZ/10); } trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, -- cgit v1.2.3-70-g09d2 From f7ab8db791a8692f5ed4201dbae25722c1732a8d Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:02:00 -0700 Subject: mm: vmscan: set zone flags before blocking In shrink_page_list a decision may be made to stall and flag a zone as ZONE_WRITEBACK so that if a large number of unqueued dirty pages are encountered later then the reclaimer will stall. Set ZONE_WRITEBACK before potentially going to sleep so it is noticed sooner. Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Cc: Zlatko Calusic Cc: dormando Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5b1a79c8f0c..5f80d018bff 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1445,8 +1445,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, */ if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY - sc->priority))) { - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); zone_set_flag(zone, ZONE_WRITEBACK); + wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); } /* -- cgit v1.2.3-70-g09d2 From 8e950282804558e4605401b9c79c1d34f0d73507 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:02:02 -0700 Subject: mm: vmscan: move direct reclaim wait_iff_congested into shrink_list shrink_inactive_list makes decisions on whether to stall based on the number of dirty pages encountered. The wait_iff_congested() call in shrink_page_list does no such thing and it's arbitrary. This patch moves the decision on whether to set ZONE_CONGESTED and the wait_iff_congested call into shrink_page_list. This keeps all the decisions on whether to stall or not in the one place. 
Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Cc: Zlatko Calusic Cc: dormando Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 62 ++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 33 insertions(+), 29 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5f80d018bff..4898daf074c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -695,7 +695,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, struct zone *zone, struct scan_control *sc, enum ttu_flags ttu_flags, + unsigned long *ret_nr_dirty, unsigned long *ret_nr_unqueued_dirty, + unsigned long *ret_nr_congested, unsigned long *ret_nr_writeback, unsigned long *ret_nr_immediate, bool force_reclaim) @@ -1017,20 +1019,13 @@ keep: VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } - /* - * Tag a zone as congested if all the dirty pages encountered were - * backed by a congested BDI. In this case, reclaimers should just - * back off and wait for congestion to clear because further reclaim - * will encounter the same problem - */ - if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc)) - zone_set_flag(zone, ZONE_CONGESTED); - free_hot_cold_page_list(&free_pages, 1); list_splice(&ret_pages, page_list); count_vm_events(PGACTIVATE, pgactivate); mem_cgroup_uncharge_end(); + *ret_nr_dirty += nr_dirty; + *ret_nr_congested += nr_congested; *ret_nr_unqueued_dirty += nr_unqueued_dirty; *ret_nr_writeback += nr_writeback; *ret_nr_immediate += nr_immediate; @@ -1045,7 +1040,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, .priority = DEF_PRIORITY, .may_unmap = 1, }; - unsigned long ret, dummy1, dummy2, dummy3; + unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5; struct page *page, *next; LIST_HEAD(clean_pages); @@ -1057,8 +1052,8 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, } ret = shrink_page_list(&clean_pages, zone, &sc, - TTU_UNMAP|TTU_IGNORE_ACCESS, - &dummy1, &dummy2, &dummy3, true); + TTU_UNMAP|TTU_IGNORE_ACCESS, + &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true); list_splice(&clean_pages, page_list); __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret); return ret; @@ -1352,6 +1347,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, unsigned long nr_scanned; unsigned long nr_reclaimed = 0; unsigned long nr_taken; + unsigned long nr_dirty = 0; + unsigned long nr_congested = 0; unsigned long nr_unqueued_dirty = 0; unsigned long nr_writeback = 0; unsigned long nr_immediate = 0; @@ -1396,8 +1393,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, - &nr_unqueued_dirty, &nr_writeback, &nr_immediate, - false); + &nr_dirty, &nr_unqueued_dirty, &nr_congested, + &nr_writeback, &nr_immediate, + false); spin_lock_irq(&zone->lru_lock); @@ -1431,7 +1429,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * same way balance_dirty_pages() manages. * * This scales the number of dirty pages that must be under writeback - * before throttling depending on priority. It is a simple backoff + * before a zone gets flagged ZONE_WRITEBACK. It is a simple backoff * function that has the most effect in the range DEF_PRIORITY to * DEF_PRIORITY-2 which is the priority reclaim is considered to be * in trouble and reclaim is considered to be in trouble. 
@@ -1442,18 +1440,27 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * ... * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any * isolated page is PageWriteback + * + * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number + * of pages under pages flagged for immediate reclaim and stall if any + * are encountered in the nr_immediate check below. */ if (nr_writeback && nr_writeback >= - (nr_taken >> (DEF_PRIORITY - sc->priority))) { + (nr_taken >> (DEF_PRIORITY - sc->priority))) zone_set_flag(zone, ZONE_WRITEBACK); - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); - } /* * memcg will stall in page writeback so only consider forcibly * stalling for global reclaim */ if (global_reclaim(sc)) { + /* + * Tag a zone as congested if all the dirty pages scanned were + * backed by a congested BDI and wait_iff_congested will stall. + */ + if (nr_dirty && nr_dirty == nr_congested) + zone_set_flag(zone, ZONE_CONGESTED); + /* * If dirty pages are scanned that are not queued for IO, it * implies that flushers are not keeping up. In this case, flag @@ -1474,6 +1481,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, congestion_wait(BLK_RW_ASYNC, HZ/10); } + /* + * Stall direct reclaim for IO completions if underlying BDIs or zone + * is congested. Allow kswapd to continue until it starts encountering + * unqueued dirty pages or cycling through the LRU too quickly. + */ + if (!sc->hibernation_mode && !current_is_kswapd()) + wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); + trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, zone_idx(zone), nr_scanned, nr_reclaimed, @@ -2374,17 +2389,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, WB_REASON_TRY_TO_FREE_PAGES); sc->may_writepage = 1; } - - /* Take a nap, wait for some writeback to complete */ - if (!sc->hibernation_mode && sc->nr_scanned && - sc->priority < DEF_PRIORITY - 2) { - struct zone *preferred_zone; - - first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask), - &cpuset_current_mems_allowed, - &preferred_zone); - wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10); - } } while (--sc->priority >= 0); out: -- cgit v1.2.3-70-g09d2 From d04e8acd03e5c3421ef18e3da7bc88d56179ca42 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:02:03 -0700 Subject: mm: vmscan: treat pages marked for immediate reclaim as zone congestion Currently a zone will only be marked congested if the underlying BDI is congested but if dirty pages are spread across zones it is possible that an individual zone is full of dirty pages without being congested. The impact is that zone gets scanned very quickly potentially reclaiming really clean pages. This patch treats pages marked for immediate reclaim as congested for the purposes of marking a zone ZONE_CONGESTED and stalling in wait_iff_congested. 
Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Cc: Zlatko Calusic Cc: dormando Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 4898daf074c..bf4778479e3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -761,9 +761,15 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (dirty && !writeback) nr_unqueued_dirty++; - /* Treat this page as congested if underlying BDI is */ + /* + * Treat this page as congested if the underlying BDI is or if + * pages are cycling through the LRU so quickly that the + * pages marked for immediate reclaim are making it to the + * end of the LRU a second time. + */ mapping = page_mapping(page); - if (mapping && bdi_write_congested(mapping->backing_dev_info)) + if ((mapping && bdi_write_congested(mapping->backing_dev_info)) || + (writeback && PageReclaim(page))) nr_congested++; /* -- cgit v1.2.3-70-g09d2 From b45972265f823ed01eae0867a176320071665787 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:02:05 -0700 Subject: mm: vmscan: take page buffers dirty and locked state into account Page reclaim keeps track of dirty and under writeback pages and uses it to determine if wait_iff_congested() should stall or if kswapd should begin writing back pages. This fails to account for buffer pages that can be under writeback but not PageWriteback which is the case for filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages can have all the buffers clean and writepage does no IO so it should not be accounted as congested. This patch adds an address_space operation that filesystems may optionally use to check if a page is really dirty or really under writeback. An implementation is provided for for buffer_heads is added and used for block operations and ext3 in ordered mode. By default the page flags are obeyed. Credit goes to Jan Kara for identifying that the page flags alone are not sufficient for ext3 and sanity checking a number of ideas on how the problem could be addressed. Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Cc: Zlatko Calusic Cc: dormando Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/block_dev.c | 1 + fs/buffer.c | 34 ++++++++++++++++++++++++++++++++++ fs/ext3/inode.c | 1 + include/linux/buffer_head.h | 3 +++ include/linux/fs.h | 1 + mm/vmscan.c | 10 ++++++++++ 6 files changed, 50 insertions(+) (limited to 'mm/vmscan.c') diff --git a/fs/block_dev.c b/fs/block_dev.c index 431b6a04ebf..bb43ce081d6 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1562,6 +1562,7 @@ static const struct address_space_operations def_blk_aops = { .writepages = generic_writepages, .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, + .is_dirty_writeback = buffer_check_dirty_writeback, }; const struct file_operations def_blk_fops = { diff --git a/fs/buffer.c b/fs/buffer.c index f93392e2df1..4d7433534f5 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -82,6 +82,40 @@ void unlock_buffer(struct buffer_head *bh) } EXPORT_SYMBOL(unlock_buffer); +/* + * Returns if the page has dirty or writeback buffers. If all the buffers + * are unlocked and clean then the PageDirty information is stale. 
If + * any of the pages are locked, it is assumed they are locked for IO. + */ +void buffer_check_dirty_writeback(struct page *page, + bool *dirty, bool *writeback) +{ + struct buffer_head *head, *bh; + *dirty = false; + *writeback = false; + + BUG_ON(!PageLocked(page)); + + if (!page_has_buffers(page)) + return; + + if (PageWriteback(page)) + *writeback = true; + + head = page_buffers(page); + bh = head; + do { + if (buffer_locked(bh)) + *writeback = true; + + if (buffer_dirty(bh)) + *dirty = true; + + bh = bh->b_this_page; + } while (bh != head); +} +EXPORT_SYMBOL(buffer_check_dirty_writeback); + /* * Block until a buffer comes unlocked. This doesn't stop it * from becoming locked again - you have to lock it yourself diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index f67668f724b..2bd85486b87 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1985,6 +1985,7 @@ static const struct address_space_operations ext3_ordered_aops = { .direct_IO = ext3_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .is_dirty_writeback = buffer_check_dirty_writeback, .error_remove_page = generic_error_remove_page, }; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index f5a3b838ddb..91fa9a94ae9 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -139,6 +139,9 @@ BUFFER_FNS(Prio, prio) }) #define page_has_buffers(page) PagePrivate(page) +void buffer_check_dirty_writeback(struct page *page, + bool *dirty, bool *writeback); + /* * Declarations */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 2b82c804149..99be011e00d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -380,6 +380,7 @@ struct address_space_operations { int (*launder_page) (struct page *); int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); + void (*is_dirty_writeback) (struct page *, bool *, bool *); int (*error_remove_page)(struct address_space *, struct page *); /* swapfile support */ diff --git a/mm/vmscan.c b/mm/vmscan.c index bf4778479e3..c8579439984 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -673,6 +673,8 @@ static enum page_references page_check_references(struct page *page, static void page_check_dirty_writeback(struct page *page, bool *dirty, bool *writeback) { + struct address_space *mapping; + /* * Anonymous pages are not handled by flushers and must be written * from reclaim context. Do not stall reclaim based on them @@ -686,6 +688,14 @@ static void page_check_dirty_writeback(struct page *page, /* By default assume that the page flags are accurate */ *dirty = PageDirty(page); *writeback = PageWriteback(page); + + /* Verify dirty/writeback state if the filesystem supports it */ + if (!page_has_private(page)) + return; + + mapping = page_mapping(page); + if (mapping && mapping->a_ops->is_dirty_writeback) + mapping->a_ops->is_dirty_writeback(page, dirty, writeback); } /* -- cgit v1.2.3-70-g09d2 From c53954a092d07c5684d31ea1fc813d262cff08a5 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:02:34 -0700 Subject: mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru Similar to __pagevec_lru_add, this patch removes the LRU parameter from __lru_cache_add and lru_cache_add_lru as the caller does not control the exact LRU the page gets added to. lru_cache_add_lru gets renamed to lru_cache_add the name is silly without the lru parameter. 
With the parameter removed, it is required that the caller indicate if they want the page added to the active or inactive list by setting or clearing PageActive respectively. [akpm@linux-foundation.org: Suggested the patch] [gang.chen@asianux.com: fix used-unintialized warning] Signed-off-by: Mel Gorman Signed-off-by: Chen Gang Cc: Jan Kara Cc: Rik van Riel Acked-by: Johannes Weiner Cc: Alexey Lyahkov Cc: Andrew Perepechko Cc: Robin Dong Cc: Theodore Tso Cc: Hugh Dickins Cc: Rik van Riel Cc: Bernd Schubert Cc: David Howells Cc: Trond Myklebust Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 11 +++++++---- mm/rmap.c | 7 ++++--- mm/swap.c | 17 +++++++---------- mm/vmscan.c | 5 ++--- 4 files changed, 20 insertions(+), 20 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/swap.h b/include/linux/swap.h index 1701ce4be74..85d74373002 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -10,6 +10,7 @@ #include #include #include +#include #include struct notifier_block; @@ -233,8 +234,8 @@ extern unsigned long nr_free_pagecache_pages(void); /* linux/mm/swap.c */ -extern void __lru_cache_add(struct page *, enum lru_list lru); -extern void lru_cache_add_lru(struct page *, enum lru_list lru); +extern void __lru_cache_add(struct page *); +extern void lru_cache_add(struct page *); extern void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *head); extern void activate_page(struct page *); @@ -254,12 +255,14 @@ extern void add_page_to_unevictable_list(struct page *page); */ static inline void lru_cache_add_anon(struct page *page) { - __lru_cache_add(page, LRU_INACTIVE_ANON); + ClearPageActive(page); + __lru_cache_add(page); } static inline void lru_cache_add_file(struct page *page) { - __lru_cache_add(page, LRU_INACTIVE_FILE); + ClearPageActive(page); + __lru_cache_add(page); } /* linux/mm/vmscan.c */ diff --git a/mm/rmap.c b/mm/rmap.c index 6280da86b5d..e22ceeb6e5e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1093,9 +1093,10 @@ void page_add_new_anon_rmap(struct page *page, else __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); __page_set_anon_rmap(page, vma, address, 1); - if (!mlocked_vma_newpage(vma, page)) - lru_cache_add_lru(page, LRU_ACTIVE_ANON); - else + if (!mlocked_vma_newpage(vma, page)) { + SetPageActive(page); + lru_cache_add(page); + } else add_page_to_unevictable_list(page); } diff --git a/mm/swap.c b/mm/swap.c index 6a9d0c43924..4a1d0d2c52f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -494,15 +494,10 @@ EXPORT_SYMBOL(mark_page_accessed); * pagevec is drained. This gives a chance for the caller of __lru_cache_add() * have the page added to the active list using mark_page_accessed(). */ -void __lru_cache_add(struct page *page, enum lru_list lru) +void __lru_cache_add(struct page *page) { struct pagevec *pvec = &get_cpu_var(lru_add_pvec); - if (is_active_lru(lru)) - SetPageActive(page); - else - ClearPageActive(page); - page_cache_get(page); if (!pagevec_space(pvec)) __pagevec_lru_add(pvec); @@ -512,11 +507,10 @@ void __lru_cache_add(struct page *page, enum lru_list lru) EXPORT_SYMBOL(__lru_cache_add); /** - * lru_cache_add_lru - add a page to a page list + * lru_cache_add - add a page to a page list * @page: the page to be added to the LRU. - * @lru: the LRU list to which the page is added. 
*/ -void lru_cache_add_lru(struct page *page, enum lru_list lru) +void lru_cache_add(struct page *page) { if (PageActive(page)) { VM_BUG_ON(PageUnevictable(page)); @@ -525,7 +519,7 @@ void lru_cache_add_lru(struct page *page, enum lru_list lru) } VM_BUG_ON(PageLRU(page)); - __lru_cache_add(page, lru); + __lru_cache_add(page); } /** @@ -745,6 +739,9 @@ void release_pages(struct page **pages, int nr, int cold) del_page_from_lru_list(page, lruvec, page_off_lru(page)); } + /* Clear Active bit in case of parallel mark_page_accessed */ + ClearPageActive(page); + list_add(&page->lru, &pages_to_free); } if (zone) diff --git a/mm/vmscan.c b/mm/vmscan.c index c8579439984..99b3ac7771a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -546,7 +546,6 @@ int remove_mapping(struct address_space *mapping, struct page *page) void putback_lru_page(struct page *page) { int lru; - int active = !!TestClearPageActive(page); int was_unevictable = PageUnevictable(page); VM_BUG_ON(PageLRU(page)); @@ -561,8 +560,8 @@ redo: * unevictable page on [in]active list. * We know how to handle that. */ - lru = active + page_lru_base_type(page); - lru_cache_add_lru(page, lru); + lru = page_lru_base_type(page); + lru_cache_add(page); } else { /* * Put unevictable pages directly on zone's unevictable -- cgit v1.2.3-70-g09d2 From 5a1c9cbc1550f93335d7c03eb6c271e642deff04 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 8 Jul 2013 16:00:24 -0700 Subject: mm: vmscan: do not continue scanning if reclaim was aborted for compaction Direct reclaim is not aborting to allow compaction to go ahead properly. do_try_to_free_pages is told to abort reclaim which is happily ignores and instead increases priority instead until it reaches 0 and starts shrinking file/anon equally. This patch corrects the situation by aborting reclaim when requested instead of raising priority. Signed-off-by: Mel Gorman Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Dave Chinner Cc: Kamezawa Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 99b3ac7771a..2385663ae5e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2361,8 +2361,10 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, aborted_reclaim = shrink_zones(zonelist, sc); /* - * Don't shrink slabs when reclaiming memory from - * over limit cgroups + * Don't shrink slabs when reclaiming memory from over limit + * cgroups but do shrink slab at least once when aborting + * reclaim for compaction to avoid unevenly scanning file/anon + * LRU pages over slab pages. */ if (global_reclaim(sc)) { unsigned long lru_pages = 0; @@ -2404,7 +2406,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, WB_REASON_TRY_TO_FREE_PAGES); sc->may_writepage = 1; } - } while (--sc->priority >= 0); + } while (--sc->priority >= 0 && !aborted_reclaim); out: delayacct_freepages_end(); -- cgit v1.2.3-70-g09d2 From 918fc718c5922520c499ad60f61b8df86b998ae9 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 8 Jul 2013 16:00:25 -0700 Subject: mm: vmscan: do not scale writeback pages when deciding whether to set ZONE_WRITEBACK After the patch "mm: vmscan: Flatten kswapd priority loop" was merged the scanning priority of kswapd changed. The priority now rises until it is scanning enough pages to meet the high watermark. 
shrink_inactive_list sets ZONE_WRITEBACK if a number of pages were encountered under writeback but this value is scaled based on the priority. As kswapd frequently scans with a higher priority now it is relatively easy to set ZONE_WRITEBACK. This patch removes the scaling and treates writeback pages similar to how it treats unqueued dirty pages and congested pages. The user-visible effect should be that kswapd will writeback fewer pages from reclaim context. Signed-off-by: Mel Gorman Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Dave Chinner Cc: Kamezawa Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 2385663ae5e..2cff0d491c6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1443,25 +1443,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * as there is no guarantee the dirtying process is throttled in the * same way balance_dirty_pages() manages. * - * This scales the number of dirty pages that must be under writeback - * before a zone gets flagged ZONE_WRITEBACK. It is a simple backoff - * function that has the most effect in the range DEF_PRIORITY to - * DEF_PRIORITY-2 which is the priority reclaim is considered to be - * in trouble and reclaim is considered to be in trouble. - * - * DEF_PRIORITY 100% isolated pages must be PageWriteback to throttle - * DEF_PRIORITY-1 50% must be PageWriteback - * DEF_PRIORITY-2 25% must be PageWriteback, kswapd in trouble - * ... - * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any - * isolated page is PageWriteback - * * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number * of pages under pages flagged for immediate reclaim and stall if any * are encountered in the nr_immediate check below. */ - if (nr_writeback && nr_writeback >= - (nr_taken >> (DEF_PRIORITY - sc->priority))) + if (nr_writeback && nr_writeback == nr_taken) zone_set_flag(zone, ZONE_WRITEBACK); /* -- cgit v1.2.3-70-g09d2