summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2008-02-07Memory controller: add switch to control what type of pages to limitBalbir Singh
Choose if we want cached pages to be accounted or not. By default both are accounted for. A new set of tunables are added. echo -n 1 > mem_control_type switches the accounting to account for only mapped pages echo -n 3 > mem_control_type switches the behaviour back [bunk@kernel.org: mm/memcontrol.c: clenups] [akpm@linux-foundation.org: fix sparc32 build] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: OOM handlingPavel Emelianov
Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller improve user interfaceBalbir Singh
Change the interface to use bytes instead of pages. Page sizes can vary across platforms and configurations. A new strategy routine has been added to the resource counters infrastructure to format the data as desired. Suggested by David Rientjes, Andrew Morton and Herbert Poetzl Tested on a UML setup with the config for memory control enabled. [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: add per cgroup LRU and reclaimBalbir Singh
Add the page_cgroup to the per cgroup LRU. The reclaim algorithm has been modified to make the isolate_lru_pages() as a pluggable component. The scan_control data structure now accepts the cgroup on behalf of which reclaims are carried out. try_to_free_pages() has been extended to become cgroup aware. [akpm@linux-foundation.org: fix warning] [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member] [bunk@kernel.org: make do_try_to_free_pages() static] [hugh@veritas.com: memcgroup: fix try_to_free order] [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: task migrationBalbir Singh
Allow tasks to migrate from one cgroup to the other. We migrate mm_struct's mem_cgroup only when the thread group id migrates. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: memory accountingBalbir Singh
Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache(), __delete_from_page_cache(). Swap cache is also accounted for. Each page's page_cgroup is protected with the last bit of the page_cgroup pointer, this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the page_cgroup to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache. Credits go to Vaidyanathan Srinivasan for helping with reference counting work of the page cgroup. Almost all of the page cache accounting code has help from Vaidyanathan Srinivasan. [hugh@veritas.com: fix swapoff breakage] [akpm@linux-foundation.org: fix locking] Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: <Valdis.Kletnieks@vt.edu> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: accounting setupPavel Emelianov
Basic setup routines, the mm_struct has a pointer to the cgroup that it belongs to and the the page has a page_cgroup associated with it. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07Memory controller: cgroups setupBalbir Singh
Setup the memory cgroup and add basic hooks and controls to integrate and work with the cgroup. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07memcgroup: temporarily revert swapoff modHugh Dickins
This patch precisely reverts the "swapoff: scan ptes preemptibly" patch just presented. It's a temporary measure to allow existing memory controller patches to apply without rejects: in due course they should be rendered down into one sensible patch, and this reversion disappear. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: x86: fix deadlock, make pgd_lock irq-safe virtio: fix trivial build bug x86: fix mttr trimming x86: delay CPA self-test and repeat it x86: fix 64-bit sections generic: add __FINITDATA x86: remove suprious ifdefs from pageattr.c x86: mark the .rodata section also NX x86: fix iret exception recovery on 64-bit cpuidle: dubious one-bit signed bitfield in cpuidle.h x86: fix sparse warnings in powernow-k8.c x86: fix sparse error in traps_32.c x86: trivial sparse/checkpatch in quirks.c x86 ptrace: disallow null cs/ss MAINTAINERS: RDC R-321x SoC maintainer brk randomization: introduce CONFIG_COMPAT_BRK brk: check the lower bound properly x86: remove X2 workaround x86: make spurious fault handler aware of large mappings x86: make traps on entry code be debuggable in user space, 64-bit
2008-02-06brk randomization: introduce CONFIG_COMPAT_BRKIngo Molnar
based on similar patch from: Pavel Machek <pavel@ucw.cz> Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free (but not obliged to) randomize the brk area. Heap randomization breaks ancient binaries, so we keep COMPAT_BRK enabled by default. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-06brk: check the lower bound properlyJiri Kosina
There is a check in sys_brk(), that tries to make sure that we do not underflow the area that is dedicated to brk heap. The check is however wrong, as it assumes that brk area starts immediately after the end of the code (+bss), which is wrong for example in environments with randomized brk start. The proper way is to check whether the address is not below the start_brk address. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-06PERCPU : __percpu_alloc_mask() can dynamically size percpu_data storageEric Dumazet
Instead of allocating a fix sized array of NR_CPUS pointers for percpu_data, we can use nr_cpu_ids, which is generally < NR_CPUS. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Merge branch 'dmapool' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'dmapool' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: pool: Improve memory usage for devices which can't cross boundaries Change dmapool free block management dmapool: Tidy up includes and add comments dmapool: Validate parameters to dma_pool_create Avoid taking waitqueue lock in dmapool dmapool: Fix style problems Move dmapool.c to mm/ directory
2008-02-05nommu: add new vmalloc_user() and remap_vmalloc_range() interfaces.Paul Mundt
This builds on top of the earlier vmalloc_32_user() work introduced by b50731732f926d6c49fd0724616a7344c31cd5cf, as we now have places in the nommu allmodconfig that hit up against these missing APIs. As vmalloc_32_user() is already implemented, this is moved over to vmalloc_user() and simply made a wrapper. As all current nommu platforms are 32-bit addressable, there's no special casing we have to do for ZONE_DMA and things of that nature as per GFP_VMALLOC32. remap_vmalloc_range() needs to check VM_USERMAP in order to figure out whether we permit the remap or not, which means that we also have to rework the vmalloc_user() code to grovel for the VMA and set the flag. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David McCullough <david_mccullough@securecomputing.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05oom_kill: remove uid==0 checksSerge E. Hallyn
Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means "give me extra resources" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Add 64-bit capability support to the kernelAndrew Morgan
The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05VFS/Security: Rework inode_getsecurity and callers to return resulting bufferDavid P. Quigley
This patch modifies the interface to inode_getsecurity to have the function return a buffer containing the security blob and its length via parameters instead of relying on the calling function to give it an appropriately sized buffer. Security blobs obtained with this function should be freed using the release_secctx LSM hook. This alleviates the problem of the caller having to guess a length and preallocate a buffer for this function allowing it to be used elsewhere for Labeled NFS. The patch also removed the unused err parameter. The conversion is similar to the one performed by Al Viro for the security_getprocattr hook. Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05slob: reduce external fragmentation by using three free listsMatt Mackall
By putting smaller objects on their own list, we greatly reduce overall external fragmentation and increase repeatability. This reduces total SLOB overhead from > 50% to ~6% on a simple boot test. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05slob: fix free block merging at head of subpageMatt Mackall
We weren't merging freed blocks at the beginning of the free list. Fixing this showed a 2.5% efficiency improvement in a userspace test harness. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05writeback: speed up writeback of big dirty filesFengguang Wu
After making dirty a 100M file, the normal behavior is to start the writeback for all data after 30s delays. But sometimes the following happens instead: - after 30s: ~4M - after 5s: ~4M - after 5s: all remaining 92M Some analyze shows that the internal io dispatch queues goes like this: s_io s_more_io ------------------------- 1) 100M,1K 0 2) 1K 96M 3) 0 96M 1) initial state with a 100M file and a 1K file 2) 4M written, nr_to_write <= 0, so write more 3) 1K written, nr_to_write > 0, no more writes(BUG) nr_to_write > 0 in (3) fools the upper layer to think that data have all been written out. The big dirty file is actually still sitting in s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes in s_dirty. It is also not an option to draw inodes from both s_more_io and s_dirty, an let the loop go on: this might lead to live locks, and might also starve other superblocks in sync time(well kupdate may still starve some superblocks, that's another bug). We have to return when a full scan of s_io completes. So nr_to_write > 0 does not necessarily mean that "all data are written". This patch introduces a flag writeback_control.more_io to indicate that more io should be done. With it the big dirty file no longer has to wait for the next kupdate invokation 5s later. In sync_sb_inodes() we only set more_io on super_blocks we actually visited. This avoids the interaction between two pdflush deamons. Also in __sync_single_inode() we don't blindly keep requeuing the io if the filesystem cannot progress. Failing to do so may lead to 100% iowait. Tested-by: Mike Snitzer <snitzer@gmail.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: fix section mismatch warning in sparse.cSam Ravnborg
Fix following warning: WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node() static sparse_early_usemap_alloc() were used only by sparse_init() and with sparse_init() annotated _init it is safe to annotate sparse_early_usemap_alloc with __init too. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: fix PageUptodate data raceNick Piggin
After running SetPageUptodate, preceeding stores to the page contents to actually bring it uptodate may not be ordered with the store to set the page uptodate. Therefore, another CPU which checks PageUptodate is true, then reads the page contents can get stale data. Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after PageUptodate. Many places that test PageUptodate, do so with the page locked, and this would be enough to ensure memory ordering in those places if SetPageUptodate were only called while the page is locked. Unfortunately that is not always the case for some filesystems, but it could be an idea for the future. Also bring the handling of anonymous page uptodateness in line with that of file backed page management, by marking anon pages as uptodate when they _are_ uptodate, rather than when our implementation requires that they be marked as such. Doing allows us to get rid of the smp_wmb's in the page copying functions, which were especially added for anonymous pages for an analogous memory ordering problem. Both file and anonymous pages are handled with the same barriers. FAQ: Q. Why not do this in flush_dcache_page? A. Firstly, flush_dcache_page handles only one side (the smb side) of the ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away memory barriers in a completely unrelated function is nasty; at least in the PageUptodate macros, they are located together with (half) the operations involved in the ordering. Thirdly, the smp_wmb is only required when first bringing the page uptodate, wheras flush_dcache_page should be called each time it is written to through the kernel mapping. It is logically the wrong place to put it. Q. Why does this increase my text size / reduce my performance / etc. A. Because it is adding the necessary instructions to eliminate the data-race. Q. Can it be improved? A. Yes, eg. if you were to create a rule that all SetPageUptodate operations run under the page lock, we could avoid the smp_rmb places where PageUptodate is queried under the page lock. Requires audit of all filesystems and at least some would need reworking. That's great you're interested, I'm eagerly awaiting your patches. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05page migraton: handle orphaned pagesShaohua Li
Orphaned page might have fs-private metadata, the page is truncated. As the page hasn't mapping, page migration refuse to migrate the page. It appears the page is only freed in page reclaim and if zone watermark is low, the page is never freed, as a result migration always fail. I thought we could free the metadata so such page can be freed in migration and make migration more reliable. [akpm@linux-foundation.org: go direct to try_to_free_buffers()] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05check ADVICE of fadvise64_64 even if get_xip_page is givenMasatake YAMATO
I've written some test programs in ltp project. During writing I met an problem which I cannot solve in user land. So I wrote a patch for linux kernel. Please, include this patch if acceptable. The test program tests the 4th parameter of fadvise64_64: long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice); My test case calls fadvise64_64 with invalid advice value and checks errno is set to EINVAL. About the advice parameter man page says: ... Permissible values for advice include: POSIX_FADV_NORMAL ... POSIX_FADV_SEQUENTIAL ... POSIX_FADV_RANDOM ... POSIX_FADV_NOREUSE ... POSIX_FADV_WILLNEED ... POSIX_FADV_DONTNEED ... ERRORS ... EINVAL An invalid value was specified for advice. However, I got a bug report that the system call invocations in my test case returned 0 unexpectedly. I've inspected the kernel code: asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice) { struct file *file = fget(fd); struct address_space *mapping; struct backing_dev_info *bdi; loff_t endbyte; /* inclusive */ pgoff_t start_index; pgoff_t end_index; unsigned long nrpages; int ret = 0; if (!file) return -EBADF; if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) { ret = -ESPIPE; goto out; } mapping = file->f_mapping; if (!mapping || len < 0) { ret = -EINVAL; goto out; } if (mapping->a_ops->get_xip_page) /* no bad return value, but ignore advice */ goto out; ... out: fput(file); return ret; } I found the advice parameter is just ignored in the case mapping->a_ops->get_xip_page is given. This behavior is different from what is written on the man page. Is this o.k.? get_xip_page is given if CONFIG_EXT2_FS_XIP is true. Anyway I cannot find the easy way to detect get_xip_page field is given or CONFIG_EXT2_FS_XIP is true from the user space. I propose the following patch which checks the advice parameter even if get_xip_page is given. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Include count of pagecache pages in show_mem() outputLarry Woodman
The show_mem() output does not include the total number of pagecache pages. This would be helpful when analyzing the debug information in the /var/log/messages file after OOM kills occur. This patch includes the total pagecache pages in that output. Signed-off-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Fix dirty page accounting leak with ext3 data=journalBjorn Steinbrink
In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make try_to_free_buffers() not race with dirty pages"), try_to_free_buffers was changed to bail out if the page was dirty. That in turn caused truncate_complete_page to leak massive amounts of memory, because the dirty bit was only cleared after the call to try_to_free_buffers. So the call to cancel_dirty_page was moved up to have the dirty bit cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate: clear page dirtiness before running try_to_free_buffers()"). The problem with that fix is, that the page can be redirtied after cancel_dirty_page was called, eg. like this: truncate_complete_page() cancel_dirty_page() // PG_dirty cleared, decr. dirty pages do_invalidatepage() ext3_invalidatepage() journal_invalidatepage() journal_unmap_buffer() __dispose_buffer() __journal_unfile_buffer() __journal_temp_unlink_buffer() mark_buffer_dirty(); // PG_dirty set, incr. dirty pages And then we end up with dirty pages being wrongly accounted. As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect 'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers were reverted, so the original reason for the massive memory leak is gone, and we can also revert the move of the call to cancel_dirty_page from truncate_complete_page and get the accounting right again. I'm not sure if it matters, but opposed to the final check in __remove_from_page_cache, this one also cares about the task io accounting, so maybe we want to use this instead, although it's not quite the clean fix either. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Cc: Jan Kara <jack@ucw.cz> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Osterried <osterried@jesse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05set_page_refcounted() VM_BUG_ON fixQi Yong
The current PageTail semantic is that a PageTail page is first a PageCompound page. So remove the redundant PageCompound test in set_page_refcounted(). Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: remove fastcall from mm/Harvey Harrison
fastcall is always defined to be empty, remove it [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05page allocator: remove unused arguments in zone_init_free_lists()Andi Kleen
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: don't waste swap on locked pagesHugh Dickins
try_to_unmap always fails on a page found in a VM_LOCKED vma (unless migrating), and recycles it back to the active list. But if it's an anonymous page, we've already allocated swap to it: just wasting swap. Spot locked pages in page_referenced_one and treat them as referenced. Signed-off-by: Hugh Dickins <hugh@veritas.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ethan Solomita <solo@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05vmstat: remove prefetchChristoph Lameter
Remove the prefetch logic in order to avoid touching impossible per cpu areas. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm/page-writeback: highmem_is_dirtyable optionBron Gondwana
Add vm.highmem_is_dirtyable toggle A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of approximately 2Gb size which contains a hash format that is written randomly by the dbclean process. On 2.6.16 this process took a few minutes. With lowmem only accounting of dirty ratios, this takes about 12 hours of 100% disk IO, all random writes. Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to add the highmem back to the total available memory count. [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build] Signed-off-by: Bron Gondwana <brong@fastmail.fm> Cc: Ethan Solomita <solo@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Page allocator: get rid of the list of cold pagesChristoph Lameter
We have repeatedly discussed if the cold pages still have a point. There is one way to join the two lists: Use a single list and put the cold pages at the end and the hot pages at the beginning. That way a single list can serve for both types of allocations. The discussion of the RFC for this and Mel's measurements indicate that there may not be too much of a point left to having separate lists for hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: don't allow ioremapping of ranges larger than vmalloc spaceRobert Bragg
When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the vmlist search routine in __get_vm_area_node can mistakenly allow a driver to ioremap a range larger than vmalloc space. If at the time of the ioremap all existing vmlist areas sit below the determined alignment then the search routine continues past all entries and exits the for loop - straight into the found: label - without ever testing for integer wrapping or that the requested size fits. We were seeing a driver successfully ioremap 128M of flash even though there was only 120M of vmalloc space. From that point the system was left with the remainder of the first 16M of space to vmalloc/ioremap within. Signed-off-by: Robert Bragg <robert@sixbynine.org> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05vmstat: small revisions to refresh_cpu_vm_stats()Christoph Lameter
1. Add comments explaining how the function can be called. 2. Collect global diffs in a local array and only spill them once into the global counters when the zone scan is finished. This means that we only touch each global counter once instead of each time we fold cpu counters into zone counters. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05arch_rebalance_pgtables callMartin Schwidefsky
In order to change the layout of the page tables after an mmap has crossed the adress space limit of the current page table layout a architecture hook in get_unmapped_area is needed. The arguments are the address of the new mapping and the length of it. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05add mm argument to pte/pmd/pud/pgd_freeBenjamin Herrenschmidt
(with Martin Schwidefsky <schwidefsky@de.ibm.com>) The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as first argument. The free functions do not get the mm_struct argument. This is 1) asymmetrical and 2) to do mm related page table allocations the mm argument is needed on the free function as well. [kamalesh@linux.vnet.ibm.com: i386 fix] [akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Page allocator: clean up pcp draining functionsChristoph Lameter
- Add comments explaing how drain_pages() works. - Eliminate useless functions - Rename drain_all_local_pages to drain_all_pages(). It does drain all pages not only those of the local processor. - Eliminate useless interrupt off / on sequences. drain_pages() disables interrupts on its own. The execution thread is pinned to processor by the caller. So there is no need to disable interrupts. - Put drain_all_pages() declaration in gfp.h and remove the declarations from suspend.h and from mm/memory_hotplug.c - Make software suspend call drain_all_pages(). The draining of processor local pages is may not the right approach if software suspend wants to support SMP. If they call drain_all_pages then we can make drain_pages() static. [akpm@linux-foundation.org: fix build] Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05radix-tree: avoid atomic allocations for preloaded insertionsNick Piggin
Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05make __vmalloc_area_node() staticAdrian Bunk
__vmalloc_area_node() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Remove unused code from mm/tiny-shmem.cBalbir Singh
This code in mm/tiny-shmem.c is under #if 0 - remove it. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm/page-writeback.c: make a function staticAdrian Bunk
task_dirty_limit() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: make page monitoring /proc file optionalMatt Mackall
Make /proc/ page monitoring configurable This puts the following files under an embedded config option: /proc/pid/clear_refs /proc/pid/smaps /proc/pid/pagemap /proc/kpagecount /proc/kpageflags [akpm@linux-foundation.org: Kconfig fix] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: introduce a generic page walkerMatt Mackall
Introduce a general page table walker Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: move is_swap_pteMatt Mackall
Move is_swap_pte helper function to swapops.h for use by pagemap code Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05clean up vmtruncateChristoph Hellwig
vmtruncate is a twisted maze of gotos, this patch cleans it up to have a proper if else for the two major cases of extending and truncating truncate and thus makes it a lot more readable while keeping exactly the same functinality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: fix shmem_swaplist racesHugh Dickins
Intensive swapoff testing shows shmem_unuse spinning on an entry in shmem_swaplist pointing to itself: how does that come about? Days pass... First guess is this: shmem_delete_inode tests list_empty without taking the global mutex (so the swapping case doesn't slow down the common case); but there's an instant in shmem_unuse_inode's list_move_tail when the list entry may appear empty (a rare case, because it's actually moving the head not the the list member). So there's a danger of leaving the inode on the swaplist when it's freed, then reinitialized to point to itself when reused. Fix that by skipping the list_move_tail when it's a no-op, which happens to plug this. But this same spinning then surfaces on another machine. Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we still hold page lock, which would hold off inode deletion if the page were in pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache doesn't wait on locked pages). Hmm: we could put the the inode on swaplist earlier, but then shmem_unuse_inode could never prune unswapped inodes. Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode; though I am a little uneasy about the iput which has to follow - it works, and I see nothing wrong with it, but it is surprising that shmem inode deletion may now occur below shmem_writepage. Revisit this fix later? And while we're looking at these races: the way shmem_unuse tests swapped without holding info->lock looks unsafe, if we've more than one swap area: a racing shmem_writepage on another page of the same inode could be putting it in swapcache, just as we're deciding to remove the inode from swaplist - there's a danger of going on swap without being listed, so a later swapoff would hang, being unable to locate the entry. Move that test and removal down into shmem_unuse_inode, once info->lock is held. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: radix_tree_preloadingHugh Dickins
Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache or swap cache, without any radix tree preload: so tending to deplete emergency reserves of memory. GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's being called under memory pressure, so must not wait for more memory to become available. But shmem_unuse_inode now has a window in which it can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its add_to_page_cache. shmem_getpage is not so straightforward: its filepage/swappage integrity relies upon exchanging between caches under spinlock, and it would need a lot of restructuring to place the preloads correctly. Instead, follow its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in add_to_page_cache, and begin each circuit of the repeat loop with a sleeping radix_tree_preload, followed immediately by radix_tree_preload_end - that won't guarantee success in the next add_to_page_cache, but doesn't need to. And we can then remove that bothersome congestion_wait: when needed, it'll automatically get done in the course of the radix_tree_preload. Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: open a window in shmem_unuse_inodeHugh Dickins
There are a couple of reasons (patches follow) why it would be good to open a window for sleep in shmem_unuse_inode, between its search for a matching swap entry, and its handling of the entry found. shmem_unuse_inode must then use igrab to hold the inode against deletion in that window, and its corresponding iput might result in deletion: so it had better unlock_page before the iput, and might as well release the page too. Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse into shmem_unuse_inode, in the case when it finds a match. Let try_to_unuse break on error in the shmem_unuse case, as it does in the unuse_mm case: though at this point in the series, no error to break on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>