[PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O

It turns out that it is advantageous to leave a small portion of unmapped file backed pages if all of a zone's pages (or almost all pages) are allocated and so the page allocator has to go off-node. This allows recently used file I/O buffers to stay on the node and reduces the times that zone reclaim is invoked if file I/O occurs when we run out of memory in a zone. The problem is that zone reclaim runs too frequently when the page cache is used for file I/O (read write and therefore unmapped pages!) alone and we have almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped pages. File I/O will use these pages for the next read/write requests and the unmapped pages increase. After the zone has filled up again zone reclaim will remove it again after only 32 pages. This cycle is too inefficient and there are potentially too many zone reclaim cycles. With the 1% boundary we may still remove all unmapped pages for file I/O in zone reclaim pass. However. it will take a large number of read and writes to get back to 1% again where we trigger zone reclaim again. The zone reclaim 2.6.16/17 does not show this behavior because we have a 30 second timeout. [akpm@osdl.org: rename the /proc file and the variable] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
author: Christoph Lameter <clameter@sgi.com> 2006-07-03 00:24:13 -0700
committer: Linus Torvalds <torvalds@g5.osdl.org> 2006-07-03 15:26:59 -0700
commit: 9614634fe6a138fd8ae044950700d2af8d203f97 (patch)
tree: 9b020c1d36d8625f4048c057058efb2e17c81973 /mm
parent: cb6358eb69d9854f65f2979c0ce9280eee041828 (diff)
2 files changed, 36 insertions, 13 deletions
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e792a583f3..54a4f5375bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2005,6 +2005,10 @@ static void __meminit free_area_init_core(struct pglist_data *pgdat,
 
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
+#ifdef CONFIG_NUMA
+		zone->min_unmapped_ratio = (realsize*sysctl_min_unmapped_ratio)
+						/ 100;
+#endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
 		spin_lock_init(&zone->lru_lock);
@@ -2298,6 +2302,24 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	for_each_zone(zone)
+		zone->min_unmapped_ratio = (zone->present_pages *
+				sysctl_min_unmapped_ratio) / 100;
+	return 0;
+}
+#endif
+
 /*
  * lowmem_reserve_ratio_sysctl_handler - just a wrapper around
  *	proc_dointvec() so that we can call setup_per_zone_lowmem_reserve()
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ff2ebe9458a..5d4c4d02254 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1503,10 +1503,6 @@ module_init(kswapd_init)
  *
  * If non-zero call zone_reclaim when the number of free pages falls below
  * the watermarks.
- *
- * In the future we may add flags to the mode. However, the page allocator
- * should only have to check that zone_reclaim_mode != 0 before calling
- * zone_reclaim().
  */
 int zone_reclaim_mode __read_mostly;
 
@@ -1524,6 +1520,12 @@ int zone_reclaim_mode __read_mostly;
 #define ZONE_RECLAIM_PRIORITY 4
 
 /*
+ * Percentage of pages in a zone that must be unmapped for zone_reclaim to
+ * occur.
+ */
+int sysctl_min_unmapped_ratio = 1;
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -1590,18 +1592,17 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	int node_id;
 
 	/*
-	 * Do not reclaim if there are not enough reclaimable pages in this
-	 * zone that would satify this allocations.
+	 * Zone reclaim reclaims unmapped file backed pages.
 	 *
-	 * All unmapped pagecache pages are reclaimable.
-	 *
-	 * Both counters may be temporarily off a bit so we use
-	 * SWAP_CLUSTER_MAX as the boundary. It may also be good to
-	 * leave a few frequently used unmapped pagecache pages around.
+	 * A small portion of unmapped file backed pages is needed for
+	 * file I/O otherwise pages read by file I/O will be immediately
+	 * thrown out if the zone is overallocated. So we do not reclaim
+	 * if less than a specified percentage of the zone is used by
+	 * unmapped file backed pages.
 	 */
 	if (zone_page_state(zone, NR_FILE_PAGES) -
-		zone_page_state(zone, NR_FILE_MAPPED) < SWAP_CLUSTER_MAX)
-			return 0;
+	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_ratio)
+		return 0;
 
 	/*
 	 * Avoid concurrent zone reclaims, do not reclaim in a zone that does
author	Christoph Lameter <clameter@sgi.com>	2006-07-03 00:24:13 -0700
committer	Linus Torvalds <torvalds@g5.osdl.org>	2006-07-03 15:26:59 -0700
commit	9614634fe6a138fd8ae044950700d2af8d203f97 (patch)
tree	9b020c1d36d8625f4048c057058efb2e17c81973 /mm
parent	cb6358eb69d9854f65f2979c0ce9280eee041828 (diff)