summaryrefslogtreecommitdiffstats
path: root/mm/shmem.c
AgeCommit message (Collapse)Author
2006-10-20[PATCH] separate bdi congestion functions from queue congestion functionsAndrew Morton
Separate out the concept of "queue congestion" from "backing-dev congestion". Congestion is a backing-dev concept, not a queue concept. The blk_* congestion functions are retained, as wrappers around the core backing-dev congestion functions. This proper layering is needed so that NFS can cleanly use the congestion functions, and so that CONFIG_BLOCK=n actually links. Cc: "Thomas Maier" <balagi@justmail.de> Cc: "Jens Axboe" <jens.axboe@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: David Howells <dhowells@redhat.com> Cc: Peter Osterlund <petero2@telia.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17[PATCH] knfsd: add nfs-export support to tmpfsDavid M. Grimes
We need to encode a decode the 'file' part of a handle. We simply use the inode number and generation number to construct the filehandle. The generation number is the time when the file was created. As inode numbers cycle through the full 32 bits before being reused, there is no real chance of the same inum being allocated to different files in the same second so this is suitably unique. Using time-of-day rather than e.g. jiffies makes it less likely that the same filehandle can be created after a reboot. In order to be able to decode a filehandle we need to be able to lookup by inum, which means that the inode needs to be added to the inode hash table (tmpfs doesn't currently hash inodes as there is never a need to lookup by inum). To avoid overhead when not exporting, we only hash an inode when it is first exported. This requires a lock to ensure it isn't hashed twice. This code is separate from the patch posted in June06 from Atal Shargorodsky which provided the same functionality, but does borrow slightly from it. Locking comment: Most filesystems that hash their inodes do so at the point where the 'struct inode' is initialised, and that has suitable locking (I_NEW). Here in shmem, we are hashing the inode later, the first time we need an NFS file handle for it. We no longer have I_NEW to ensure only one thread tries to add it to the hash table. Cc: Atal Shargorodsky <atal@codefidence.com> Cc: Gilad Ben-Yossef <gilad@codefidence.com> Signed-off-by: David M. Grimes <dgrimes@navisite.com> Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] r/o bind mount prepwork: inc_nlink() helperDave Hansen
This is mostly included for parity with dec_nlink(), where we will have some more hooks. This one should stay pretty darn straightforward for now. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] r/o bind mounts: unlink: monitor i_nlinkDave Hansen
When a filesystem decrements i_nlink to zero, it means that a write must be performed in order to drop the inode from the filesystem. We're shortly going to have keep filesystems from being remounted r/o between the time that this i_nlink decrement and that write occurs. So, add a little helper function to do the decrements. We'll tie into it in a bit to note when i_nlink hits zero. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] Access Control Lists for tmpfsAndreas Gruenbacher
Add access control lists for tmpfs. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27[PATCH] inode-diet: Eliminate i_blksize from the inode structureTheodore Ts'o
This eliminates the i_blksize field from struct inode. Filesystems that want to provide a per-inode st_blksize can do so by providing their own getattr routine instead of using the generic_fillattr() function. Note that some filesystems were providing pretty much random (and incorrect) values for i_blksize. [bunk@stusta.de: cleanup] [akpm@osdl.org: generic_fillattr() fix] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27[PATCH] Really ignore kmem_cache_destroy return valueAlexey Dobriyan
* Rougly half of callers already do it by not checking return value * Code in drivers/acpi/osl.c does the following to be sure: (void)kmem_cache_destroy(cache); * Those who check it printk something, however, slab_error already printed the name of failed cache. * XFS BUGs on failed kmem_cache_destroy which is not the decision low-level filesystem driver should make. Converted to ignore. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] reduce MAX_NR_ZONES: move HIGHMEM counters into highmem.c/.hChristoph Lameter
Move totalhigh_pages and nr_free_highpages() into highmem.c/.h Move the totalhigh_pages definition into highmem.c/.h. Move the nr_free_highpages function into highmem.c [yoichi_yuasa@tripeaks.co.jp: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: Remove obsolete #include <linux/config.h> remove obsolete swsusp_encrypt arch/arm26/Kconfig typos Documentation/IPMI typos Kconfig: Typos in net/sched/Kconfig v9fs: do not include linux/version.h Documentation/DocBook/mtdnand.tmpl: typo fixes typo fixes: specfic -> specific typo fixes in Documentation/networking/pktgen.txt typo fixes: occuring -> occurring typo fixes: infomation -> information typo fixes: disadvantadge -> disadvantage typo fixes: aquire -> acquire typo fixes: mecanism -> mechanism typo fixes: bandwith -> bandwidth fix a typo in the RTC_CLASS help text smb is no longer maintained Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30[PATCH] Light weight event countersChristoph Lameter
The remaining counters in page_state after the zoned VM counter patches have been applied are all just for show in /proc/vmstat. They have no essential function for the VM. We use a simple increment of per cpu variables. In order to avoid the most severe races we disable preempt. Preempt does not prevent the race between an increment and an interrupt handler incrementing the same statistics counter. However, that race is exceedingly rare, we may only loose one increment or so and there is no requirement (at least not in kernel) that the vm event counters have to be accurate. In the non preempt case this results in a simple increment for each counter. For many architectures this will be reduced by the compiler to a single instruction. This single instruction is atomic for i386 and x86_64. And therefore even the rare race condition in an interrupt is avoided for both architectures in most cases. The patchset also adds an off switch for embedded systems that allows a building of linux kernels without these counters. The implementation of these counters is through inline code that hopefully results in only a single instruction increment instruction being emitted (i386, x86_64) or in the increment being hidden though instruction concurrency (EPIC architectures such as ia64 can get that done). Benefits: - VM event counter operations usually reduce to a single inline instruction on i386 and x86_64. - No interrupt disable, only preempt disable for the preempt case. Preempt disable can also be avoided by moving the counter into a spinlock. - Handling is similar to zoned VM counters. - Simple and easily extendable. - Can be omitted to reduce memory use for embedded use. References: RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2 RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2 local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2 V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2 V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2 V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/devfs-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/devfs-2.6: (22 commits) [PATCH] devfs: Remove it from the feature_removal.txt file [PATCH] devfs: Last little devfs cleanups throughout the kernel tree. [PATCH] devfs: Rename TTY_DRIVER_NO_DEVFS to TTY_DRIVER_DYNAMIC_DEV [PATCH] devfs: Remove the tty_driver devfs_name field as it's no longer needed [PATCH] devfs: Remove the line_driver devfs_name field as it's no longer needed [PATCH] devfs: Remove the videodevice devfs_name field as it's no longer needed [PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed [PATCH] devfs: Remove the miscdevice devfs_name field as it's no longer needed [PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree [PATCH] devfs: Remove devfs_remove() function from the kernel tree [PATCH] devfs: Remove devfs_mk_cdev() function from the kernel tree [PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree [PATCH] devfs: Remove devfs_mk_symlink() function from the kernel tree [PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree [PATCH] devfs: Remove devfs_*_tape() functions from the kernel tree [PATCH] devfs: Remove devfs support from the sound subsystem [PATCH] devfs: Remove devfs support from the ide subsystem. [PATCH] devfs: Remove devfs support from the serial subsystem [PATCH] devfs: Remove devfs from the init code [PATCH] devfs: Remove devfs from the partition code ...
2006-06-28[PATCH] mark address_space_operations constChristoph Hellwig
Same as with already do with the file operations: keep them in .rodata and prevents people from doing runtime patching. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] devfs: Remove the devfs_fs_kernel.h file from the treeGreg Kroah-Hartman
Also fixes up all files that #include it. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26[PATCH] devfs: Remove devfs_mk_dir() function from the kernel treeGreg Kroah-Hartman
Removes the devfs_mk_dir() function and all callers of it. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-24Merge branch 'master' of /home/trondmy/kernel/linux-2.6/Trond Myklebust
Conflicts: fs/nfs/inode.c fs/super.c Fix conflicts between patch 'NFS: Split fs/nfs/inode.c' and patch 'VFS: Permit filesystem to override root dentry on mount'
2006-06-23[PATCH] migration: remove unnecessary PageSwapCache checksChristoph Lameter
Remove two unnecessary PageSwapCache checks. The page refcount is raised and therefore page migration cannot occur in both functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] VFS: Permit filesystem to perform statfs with a known root dentryDavid Howells
Give the statfs superblock operation a dentry pointer rather than a superblock pointer. This complements the get_sb() patch. That reduced the significance of sb->s_root, allowing NFS to place a fake root there. However, NFS does require a dentry to use as a target for the statfs operation. This permits the root in the vfsmount to be used instead. linux/mount.h has been added where necessary to make allyesconfig build successfully. Interest has also been expressed for use with the FUSE and XFS filesystems. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] VFS: Permit filesystem to override root dentry on mountDavid Howells
Extend the get_sb() filesystem operation to take an extra argument that permits the VFS to pass in the target vfsmount that defines the mountpoint. The filesystem is then required to manually set the superblock and root dentry pointers. For most filesystems, this should be done with simple_set_mnt() which will set the superblock pointer and then set the root dentry to the superblock's s_root (as per the old default behaviour). The get_sb() op now returns an integer as there's now no need to return the superblock pointer. This patch permits a superblock to be implicitly shared amongst several mount points, such as can be done with NFS to avoid potential inode aliasing. In such a case, simple_set_mnt() would not be called, and instead the mnt_root and mnt_sb would be set directly. The patch also makes the following changes: (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount pointer argument and return an integer, so most filesystems have to change very little. (*) If one of the convenience function is not used, then get_sb() should normally call simple_set_mnt() to instantiate the vfsmount. This will always return 0, and so can be tail-called from get_sb(). (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the dcache upon superblock destruction rather than shrink_dcache_anon(). This is required because the superblock may now have multiple trees that aren't actually bound to s_root, but that still need to be cleaned up. The currently called functions assume that the whole tree is rooted at s_root, and that anonymous dentries are not the roots of trees which results in dentries being left unculled. However, with the way NFS superblock sharing are currently set to be implemented, these assumptions are violated: the root of the filesystem is simply a dummy dentry and inode (the real inode for '/' may well be inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries with child trees. [*] Anonymous until discovered from another tree. (*) The documentation has been adjusted, including the additional bit of changing ext2_* into foo_* in the documentation. [akpm@osdl.org: convert ipath_fs, do other stuff] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-20Merge branch 'master' of /home/trondmy/kernel/linux-2.6/Trond Myklebust
2006-06-12[PATCH] tmpfs: Decrement i_nlink correctly in shmem_rmdir()Sergey Vlasov
shmem_rmdir() must undo the increment of i_nlink done in shmem_get_inode() for directories, otherwise at least IN_DELETE_SELF inotify event generation is broken. Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-12[PATCH] tmpfs: time granularity fix for [acm]time going backwardsRobin H. Johnson
I noticed a strange behavior in a tmpfs file system the other day, while building packages - occasionally, and seemingly at random, make decided to rebuild a target. However, only on tmpfs. A file would be created, and if checked, it had a sub-second timestamp. However, after an utimes related call where sub-seconds should be set, they were zeroed instead. In the case that a file was created, and utimes(...,NULL) was used on it in the same second, the timestamp on the file moved backwards. After some digging, I found that this was being caused by tmpfs not having a time granularity set, thus inheriting the default 1 second granularity. Hugh adds: yes, we missed tmpfs when the s_time_gran mods went into 2.6.11. Unfortunately, the granularity of CURRENT_TIME, often used in filesystems, does not match the default granularity set by alloc_super. A few more such discrepancies have been found, but this is the most important to fix now. Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-09VFS: Unexport do_kern_mount() and clean up simple_pin_fs()Trond Myklebust
Replace all module uses with the new vfs_kern_mount() interface, and fix up simple_pin_fs(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-04-22[PATCH] add migratepage address space op to shmemLee Schermerhorn
Basic problem: pages of a shared memory segment can only be migrated once. In 2.6.16 through 2.6.17-rc1, shared memory mappings do not have a migratepage address space op. Therefore, migrate_pages() falls back to default processing. In this path, it will try to pageout() dirty pages. Once a shared memory page has been migrated it becomes dirty, so migrate_pages() will try to page it out. However, because the page count is 3 [cache + current + pte], pageout() will return PAGE_KEEP because is_page_cache_freeable() returns false. This will abort all subsequent migrations. This patch adds a migratepage address space op to shared memory segments to avoid taking the default path. We use the "migrate_page()" function because it knows how to migrate dirty pages. This allows shared memory segment pages to migrate, subject to other conditions such as # pte's referencing the page [page_mapcount(page)], when requested. I think this is safe. If we're migrating a shared memory page, then we found the page via a page table, so it must be in memory. Can be verified with memtoy and the shmem-mbind-test script, both available at: http://free.linux.hp.com/~lts/Tools/ Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] shmem: inline to avoid warningHugh Dickins
shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings: shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and only called from that one site, so mark it inline like its non-NUMA stub. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: kill kmem_cache_t usagePekka Enberg
We have struct kmem_cache now so use it instead of the old typedef. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-21[PATCH] tmpfs: fix mount mpol nodelist parsingHugh Dickins
I've been dissatisfied with the mpol_nodelist mount option which was added to tmpfs earlier in -rc. Replace it by mpol=policy:nodelist. And it was broken: a nodelist is a comma-separated list of numbers and ranges; the mount options are a comma-separated list of token=values. Whoops, blindly strsep'ing on commas doesn't work so well: since we've no numeric tokens, and unlikely to add them, use that to distinguish. Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject all its options as invalid if not NUMA. /proc shows MPOL_PREFERRED as "prefer", so use that name for the policy instead of "preferred". Enforce that mpol=default has no nodelist; that mpol=prefer has one node only; that mpol=bind has a nodelist; but let mpol=interleave use node_online_map if no nodelist given. Describe this in tmpfs.txt. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Robin Holt <holt@sgi.com> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: PageSwapCache checksChristoph Lameter
Check for PageSwapCache after looking up and locking a swap page. The page migration code may change a swap pte to point to a different page under lock_page(). If that happens then the vm must retry the lookup operation in the swap space to find the correct page number. There are a couple of locations in the VM where a lock_page() is done on a swap page. In these locations we need to check afterwards if the page was migrated. If the page was migrated then the old page that was looked up before was freed and no longer has the PageSwapCache bit set. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] Add tmpfs options for memory placement policiesRobin Holt
Anything that writes into a tmpfs filesystem is liable to disproportionately decrease the available memory on a particular node. Since there's no telling what sort of application (e.g. dd/cp/cat) might be dropping large files there, this lets the admin choose the appropriate default behavior for their site's situation. Introduce a tmpfs mount option which allows specifying a memory policy and a second option to specify the nodelist for that policy. With the default policy, tmpfs will behave as it does today. This patch adds support for preferred, bind, and interleave policies. The default policy will cause pages to be added to tmpfs files on the node which is doing the writing. Some jobs expect a single process to create and manage the tmpfs files. This results in a node which has a significantly reduced number of free pages. With this patch, the administrator can specify the policy and nodes for that policy where they would prefer allocations. This patch was originally written by Brent Casavant and Hugh Dickins. I added support for the bind and preferred policies and the mpol_nodelist mount option. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_semJes Sorensen
This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-06[PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMUDavid Howells
The attached patch makes the SYSV IPC shared memory facilities use the new ramfs facilities on a no-MMU kernel. The following changes are made: (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to allow the IPC SHM facilities to commune with the tiny-shmem and shmem code. (2) ramfs files now need resizing using do_truncate() rather than by modifying the inode size directly (see shmem_file_setup()). This causes ramfs to attempt to bind a block of pages of sufficient size to the inode. (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing storeBadari Pulavarty
Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-03[PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATEZach Brown
readpage(), prepare_write(), and commit_write() callers are updated to understand the special return code AOP_TRUNCATED_PAGE in the style of writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that the callee has unlocked the page and that the operation should be tried again with a new page. OCFS2 uses this to detect and work around a lock inversion in its aop methods. There should be no change in behaviour for methods that don't return AOP_TRUNCATED_PAGE. WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are made enums so that kerneldoc can be used to document their semantics. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2005-10-29[PATCH] mm: split page table lockHugh Dickins
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] core remove PageReservedNick Piggin
Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: page fault handlers tidyupHugh Dickins
Impose a little more consistency on the page fault handlers do_wp_page, do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their arguments in the same order, called the same names? break_cow is all very well, but what it did was inlined elsewhere: easier to compare if it's brought back into do_wp_page. do_file_page's fallback to do_no_page dates from a time when we were testing pte_file by using it wherever possible: currently it's peculiar to nonlinear vmas, so just check that. BUG_ON if not? Better not, it's probably page table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's use that for do_wp_page's invalid pfn too. Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it: restored (and say "pud" not "pmd" in its pud_ERROR). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28[PATCH] gfp_t: mm/* (easy parts)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08[PATCH] gfp flags annotations - part 1Al Viro
- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] tmpfs: Enable atomic inode security labelingStephen Smalley
This patch modifies tmpfs to call the inode_init_security LSM hook to set up the incore inode security state for new inodes before the inode becomes accessible via the dcache. As there is no underlying storage of security xattrs in this case, it is not necessary for the hook to return the (name, value, len) triple to the tmpfs code, so this patch also modifies the SELinux hook function to correctly handle the case where the (name, value, len) pointers are NULL. The hook call is needed in tmpfs in order to support proper security labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With this change in place, we should then be able to remove the security_inode_post_create/mkdir/... hooks safely. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] update filesystems for new delete_inode behaviorMark Fasheh
Update the file systems in fs/ implementing a delete_inode() callback to call truncate_inode_pages(). One implementation note: In developing this patch I put the calls to truncate_inode_pages() at the very top of those filesystems delete_inode() callbacks in order to retain the previous behavior. I'm guessing that some of those could probably be optimized. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] Additions to .data.read_mostly sectionRavikiran G Thirumalai
Mark variables which are usually accessed for reads with __readmostly. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] Generic VFS fallback for security xattrsStephen Smalley
This patch modifies the VFS setxattr, getxattr, and listxattr code to fall back to the security module for security xattrs if the filesystem does not support xattrs natively. This allows security modules to export the incore inode security label information to userspace even if the filesystem does not provide xattr storage, and eliminates the need to individually patch various pseudo filesystem types to provide such access. The patch removes the existing xattr code from devpts and tmpfs as it is then no longer needed. The patch restructures the code flow slightly to reduce duplication between the normal path and the fallback path, but this should only have one user-visible side effect - a program may get -EACCES rather than -EOPNOTSUPP if policy denied access but the filesystem didn't support the operation anyway. Note that the post_setxattr hook call is not needed in the fallback case, as the inode_setsecurity hook call handles the incore inode security state update directly. In contrast, we do call fsnotify in both cases. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] shmem_populate: avoid an useless check, and some commentsPaolo 'Blaisorblade' Giarrusso
Either shmem_getpage returns a failure, or it found a page, or it was told it couldn't do any I/O. So it's useless to check nonblock in the else branch. We could add a BUG() there but I preferred to comment the offending function. This was taken out from one Ingo Molnar's old patch I'm resurrecting. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19Fix nasty ncpfs symlink handling bug.Linus Torvalds
This bug could cause oopses and page state corruption, because ncpfs used the generic page-cache symlink handlign functions. But those functions only work if the page cache is guaranteed to be "stable", ie a page that was installed when the symlink walk was started has to still be installed in the page cache at the end of the walk. We could have fixed ncpfs to not use the generic helper routines, but it is in many ways much cleaner to instead improve on the symlink walking helper routines so that they don't require that absolute stability. We do this by allowing "follow_link()" to return a error-pointer as a cookie, which is fed back to the cleanup "put_link()" routine. This also simplifies NFS symlink handling. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] shmem: restore superblock infoHugh Dickins
To improve shmem scalability, we allowed tmpfs instances which don't need their blocks or inodes limited not to count them, and not to allocate any sbinfo. Which was okay when the only use for the sbinfo was accounting blocks and inodes; but since then a couple of unrelated projects extending tmpfs want to store other data in the sbinfo. Whether either extension reaches mainline is beside the point: I'm guilty of a bad design decision, and should restore sbinfo to make any such future extensions easier. So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited inodes. Brent Casavant verified (many months ago) that this does not perceptibly impact the scalability (since the unlimited sbinfo cacheline is repeatedly accessed but only once dirtied). And merge shmem_set_size into its sole caller shmem_remount_fs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!