From bf3f3bc5e734706730c12a323f9b2068052aa1f0 Mon Sep 17 00:00:00 2001
From: Nick Piggin
Date: Tue, 6 Jan 2009 14:38:55 -0800
Subject: mm: don't mark_page_accessed in fault path

Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at unmap-time if the pte is young, has a number of problems.

mark_page_accessed is supposed to be roughly the equivalent of a young pte for unmapped references. Unfortunately it doesn't come with any context: after being called, reclaim doesn't know who touched the page or why.

So calling mark_page_accessed not only adds extra lru or PG_referenced manipulations for pages that are already going to have pte_young ptes anyway, but it also adds these references, which are difficult to work with from the context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not wish to contribute to the page being referenced).

Then, simply doing SetPageReferenced when zapping a pte and finding it is young is not a really good solution either. SetPageReferenced does not correctly promote the page to the active list, for example. So after removing mark_page_accessed from the fault path, several mmap()+touch+munmap() cycles would have a very different result from several read(2) calls, which is not really desirable.

Signed-off-by: Nick Piggin
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/filemap.c | 1 -
 1 file changed, 1 deletion(-)

(limited to 'mm/filemap.c')

diff --git a/mm/filemap.c b/mm/filemap.c
index f5769b4dc07..f9d88183f69 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1530,7 +1530,6 @@ retry_find:
 	/*
 	 * Found the page and have a reference on it.
 	 */
-	mark_page_accessed(page);
 	ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
--
cgit v1.2.3-70-g09d2
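
To make the promotion point above concrete, the following is a simplified userspace model of the behaviour the message describes, not the actual mm/swap.c implementation; the struct, field names and helper functions are illustrative assumptions. It shows why repeated read(2)-style touches (mark_page_accessed) can move a page to the active list, while repeated bare SetPageReferenced-style touches never do.

#include <stdio.h>

/* Simplified model of the promotion behaviour discussed above; this is an
 * illustration with assumed names, not the real mm/swap.c code. */
struct page_model {
	int referenced;		/* stands in for PG_referenced */
	int active;		/* stands in for "on the active LRU list" */
};

/* Roughly what mark_page_accessed() achieves: two touches promote the page. */
static void touch_like_mark_page_accessed(struct page_model *page)
{
	if (page->referenced && !page->active) {
		page->active = 1;	/* second touch: promote to the active list */
		page->referenced = 0;
	} else if (!page->referenced) {
		page->referenced = 1;	/* first touch: remember the reference */
	}
}

/* What a bare SetPageReferenced() at unmap time gives you: the flag is set,
 * but the page is never promoted, however often it is touched. */
static void touch_like_set_page_referenced(struct page_model *page)
{
	page->referenced = 1;
}

int main(void)
{
	struct page_model read_page = { 0, 0 }, mmap_page = { 0, 0 };

	/* Two read(2)-style touches vs. two mmap+touch+munmap-style touches. */
	touch_like_mark_page_accessed(&read_page);
	touch_like_mark_page_accessed(&read_page);
	touch_like_set_page_referenced(&mmap_page);
	touch_like_set_page_referenced(&mmap_page);

	printf("read-style page active:  %d\n", read_page.active);	/* 1 */
	printf("unmap-style page active: %d\n", mmap_page.active);	/* 0 */
	return 0;
}
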
From 05fe478dd04e02fa230c305ab9b5616669821dd3 Mon Sep 17 00:00:00 2001
From: Nick Piggin
Date: Tue, 6 Jan 2009 14:39:08 -0800
Subject: mm: write_cache_pages integrity fix

In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity.

The callers tend to set it to values that could break data integrity semantics easily in practice. For example, nr_to_write can be set to mapping->nrpages * 2; however, if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might write out two of these newly dirty pages, while not writing out the old page that should have been written out.

Fix this by ignoring nr_to_write if it is a data integrity sync.

This is a data integrity bug.

The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers.

 "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow."

What we do today is return success after an arbitrary number of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied.

This change does indeed make the possibility of long stalls larger, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened.

There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity.

Signed-off-by: Nick Piggin
Cc: Chris Mason
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/filemap.c        | 2 +-
 mm/page-writeback.c | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

(limited to 'mm/filemap.c')

diff --git a/mm/filemap.c b/mm/filemap.c
index f9d88183f69..9c5e6235cc7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -210,7 +210,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = LONG_MAX,
 		.range_start = start,
 		.range_end = end,
 	};
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2e847cdcad0..5edca676e2c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -963,8 +963,10 @@ retry:
 			}
 		}

-		if (--nr_to_write <= 0)
-			done = 1;
+		if (wbc->sync_mode == WB_SYNC_NONE) {
+			if (--wbc->nr_to_write <= 0)
+				done = 1;
+		}
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
 			wbc->encountered_congestion = 1;
 			done = 1;
--
cgit v1.2.3-70-g09d2
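
The rule the patch introduces can also be shown outside of diff context. This is a toy model of the post-patch loop, with invented types and a made-up dirty-page list; the real logic is write_cache_pages() in mm/page-writeback.c as patched above. It demonstrates how an nr_to_write cap of 2 can skip the one old dirty page an fsync() actually needs written, which is exactly what a WB_SYNC_ALL walk must no longer do.

#include <stdio.h>

/* Toy model of the nr_to_write behaviour described above.  The types, names
 * and the dirty-page list are invented for illustration only. */
enum sync_mode_model { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };

struct wbc_model {
	enum sync_mode_model sync_mode;
	long nr_to_write;
};

/* Walk the "dirty pages" (just indices here) and return how many of the
 * n pages were written before the walk stopped. */
static int writeback_model(struct wbc_model *wbc, const long long *dirty, int n)
{
	int written = 0;

	for (int i = 0; i < n; i++) {
		printf("  wrote page at index %lld\n", dirty[i]);
		written++;
		if (wbc->sync_mode == MODEL_WB_SYNC_NONE &&
		    --wbc->nr_to_write <= 0)
			break;		/* opportunistic writeback may stop early */
		/* MODEL_WB_SYNC_ALL: nr_to_write never ends the walk early */
	}
	return written;
}

int main(void)
{
	/* Two pages dirtied at offset 0 after fsync() started, plus the old
	 * dirty page far into the file that fsync() actually needs written. */
	const long long dirty[] = { 0, 1, 1000000000000000LL };
	struct wbc_model lazy = { MODEL_WB_SYNC_NONE, 2 };	/* "nrpages * 2"-style cap */
	struct wbc_model sync = { MODEL_WB_SYNC_ALL, 2 };

	printf("WB_SYNC_NONE wrote %d of 3 pages\n", writeback_model(&lazy, dirty, 3));	/* 2 */
	printf("WB_SYNC_ALL  wrote %d of 3 pages\n", writeback_model(&sync, dirty, 3));	/* 3 */
	return 0;
}
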
From 48b47c561e41525061b5bc0cfd67d6367fd11dc4 Mon Sep 17 00:00:00 2001
From: Nick Piggin
Date: Tue, 6 Jan 2009 14:40:22 -0800
Subject: mm: direct IO starvation improvement

Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs.

Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting.

This fixes the immediate DM deadlocks when doing direct IO reads to a block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases.

Signed-off-by: Nick Piggin
Reviewed-by: Jeff Moyer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/filemap.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

(limited to 'mm/filemap.c')

diff --git a/mm/filemap.c b/mm/filemap.c
index 9c5e6235cc7..f3555fb806d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1317,7 +1317,8 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 			goto out; /* skip atime */
 		size = i_size_read(inode);
 		if (pos < size) {
-			retval = filemap_write_and_wait(mapping);
+			retval = filemap_write_and_wait_range(mapping, pos,
+					pos + iov_length(iov, nr_segs) - 1);
 			if (!retval) {
 				retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
@@ -2059,18 +2060,10 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	if (count != ocount)
 		*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);

-	/*
-	 * Unmap all mmappings of the file up-front.
-	 *
-	 * This will cause any pte dirty bits to be propagated into the
-	 * pageframes for the subsequent filemap_write_and_wait().
-	 */
 	write_len = iov_length(iov, *nr_segs);
 	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
-	if (mapping_mapped(mapping))
-		unmap_mapping_range(mapping, pos, write_len, 0);

-	written = filemap_write_and_wait(mapping);
+	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
 	if (written)
 		goto out;

@@ -2290,7 +2283,8 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	 * the file data here, to try to honour O_DIRECT expectations.
 	 */
 	if (unlikely(file->f_flags & O_DIRECT) && written)
-		status = filemap_write_and_wait(mapping);
+		status = filemap_write_and_wait_range(mapping,
+					pos, pos + written - 1);

 	return written ? written : status;
 }
--
cgit v1.2.3-70-g09d2
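
For a sense of the range arithmetic, here is a small self-contained example using the same quantities as the diff (pos, write_len); the concrete numbers are assumptions chosen for illustration. Before the patch, filemap_write_and_wait(mapping) covered the whole file; after it, only the bytes actually touched by the direct IO are written back and waited on.

#include <stdio.h>

/* Illustration of the inclusive byte range the patched code passes to
 * filemap_write_and_wait_range() for a direct IO of write_len bytes at
 * offset pos.  Before the patch, the entire mapping was written and waited
 * on, no matter how small the IO was.  The example numbers are assumptions. */
int main(void)
{
	long long pos = 1LL << 30;	/* a 4K direct IO at a 1 GiB offset */
	long long write_len = 4096;

	long long start = pos;			/* first byte of the IO: 1073741824 */
	long long end = pos + write_len - 1;	/* last byte, inclusive: 1073745919 */

	/* Only the pagecache covering [start, end] has to be synced now. */
	printf("sync range: [%lld, %lld] (%lld bytes)\n",
	       start, end, end - start + 1);
	return 0;
}
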
From 67d58ac47d25f7e2a105248a4aea6113131ab874 Mon Sep 17 00:00:00 2001
From: Nick Piggin
Date: Tue, 6 Jan 2009 14:40:28 -0800
Subject: mm: pagecache gfp flags fix

Frustratingly, gfp_t is really divided into two classes of flags. One class is the context dependent ones (can we sleep? can we enter the filesystem? the block subsystem? should we use some extra reserves, etc.). The other ones are the type of memory required and depend on how the algorithm is implemented rather than the point at which the memory is allocated (highmem? dma memory? etc).

Some of the functions which allocate a page and add it to page cache take a gfp_t, but sometimes those functions or their callers aren't really doing the right thing: when allocating a pagecache page, the memory type should be mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type should be kernel mapped (not highmem) memory. The gfp_t argument should only really be needed for context dependent options.

This patch doesn't really solve that tangle in a nice way, but it does attempt to fix a couple of bugs.

- find_or_create_page changes its radix-tree allocation to only include the main context dependent flags, so that the pagecache page may be allocated from arbitrary types of memory without affecting the radix-tree. In practice, slab allocations don't come from highmem anyway, and the radix-tree only uses slab allocations. So there isn't a practical change (unless some fs uses GFP_DMA for pages).

- grab_cache_page_nowait() is changed to allocate radix-tree nodes with GFP_NOFS, because it is not supposed to reenter the filesystem. This bug could cause lock recursion if a filesystem is not expecting the function to reenter the fs (as per documentation).

Filesystems should be careful about exactly what semantics they want and what they get when fiddling with gfp_t masks to allocate pagecache. One should be as liberal as possible with the type of memory that can be used, and the same for the context specific flags.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/filemap.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

(limited to 'mm/filemap.c')

diff --git a/mm/filemap.c b/mm/filemap.c
index f3555fb806d..2f55a1e2baf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -741,7 +741,14 @@ repeat:
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			return NULL;
-		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		/*
+		 * We want a regular kernel memory (not highmem or DMA etc)
+		 * allocation for the radix tree nodes, but we need to honour
+		 * the context-specific requirements the caller has asked for.
+		 * GFP_RECLAIM_MASK collects those requirements.
+		 */
+		err = add_to_page_cache_lru(page, mapping, index,
+			(gfp_mask & GFP_RECLAIM_MASK));
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -950,7 +957,7 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 		return NULL;
 	}
 	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
+	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
 	}
--
cgit v1.2.3-70-g09d2
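
The two flag classes the message describes can be illustrated with a short self-contained model. The flag values and the reclaim mask below are invented stand-ins, not the definitions in <linux/gfp.h> or mm/internal.h; only the names appearing in the diff (GFP_RECLAIM_MASK, mapping_gfp_mask, __page_cache_alloc, add_to_page_cache_lru) are real. The point is that the memory-type bits may differ between the pagecache page and the radix-tree nodes, while the context bits are forwarded unchanged.

#include <stdio.h>

/* Model of the two gfp classes described above.  These flag values and the
 * reclaim mask are illustrative assumptions, not the kernel's definitions. */
#define MODEL_GFP_WAIT     0x01u	/* context: may sleep */
#define MODEL_GFP_IO       0x02u	/* context: may start IO */
#define MODEL_GFP_FS       0x04u	/* context: may re-enter the filesystem */
#define MODEL_GFP_HIGHMEM  0x10u	/* memory type: highmem is acceptable */
#define MODEL_GFP_DMA      0x20u	/* memory type: must be DMA memory */

/* Only the context bits may be forwarded to the radix-tree node allocation. */
#define MODEL_RECLAIM_MASK (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

int main(void)
{
	/* A mapping that allows highmem pages and full reclaim context. */
	unsigned int page_gfp = MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS |
				MODEL_GFP_HIGHMEM;

	/* Radix-tree nodes: same context, but never highmem/DMA memory. */
	unsigned int tree_gfp = page_gfp & MODEL_RECLAIM_MASK;

	printf("page gfp = 0x%02x\n", page_gfp);	/* 0x17 */
	printf("tree gfp = 0x%02x\n", tree_gfp);	/* 0x07: memory-type bits stripped */
	return 0;
}
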