From d47992f86b307985b3215bcf141d56d1849d71df Mon Sep 17 00:00:00 2001 From: Lukas Czerner Date: Tue, 21 May 2013 23:17:23 -0400 Subject: mm: change invalidatepage prototype to accept length Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner Cc: Andrew Morton Cc: Hugh Dickins --- Documentation/filesystems/Locking | 6 +++--- Documentation/filesystems/vfs.txt | 20 ++++++++++---------- 2 files changed, 13 insertions(+), 13 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 0706d32a61e..cbbac3fa0eb 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -189,7 +189,7 @@ prototypes: loff_t pos, unsigned len, unsigned copied, struct page *page, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); - int (*invalidatepage) (struct page *, unsigned long); + void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, int); void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, @@ -310,8 +310,8 @@ filesystems and by the swapper. The latter will eventually go away. Please, keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop -some or all of the buffers from the page when it is being truncated. It -returns zero on success. If ->invalidatepage is zero, the kernel uses +some or all of the buffers from the page when it is being truncated. It +returns zero on success. If ->invalidatepage is zero, the kernel uses block_invalidatepage() instead. ->releasepage() is called when the kernel is about to try to drop the diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index bc4b06b3160..e445b95a002 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -549,7 +549,7 @@ struct address_space_operations ------------------------------- This describes how the VFS can manipulate mapping of a file to page cache in -your filesystem. As of kernel 2.6.22, the following members are defined: +your filesystem. 
The following members are defined: struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); @@ -566,7 +566,7 @@ struct address_space_operations { loff_t pos, unsigned len, unsigned copied, struct page *page, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); - int (*invalidatepage) (struct page *, unsigned long); + void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, int); void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, @@ -685,14 +685,14 @@ struct address_space_operations { invalidatepage: If a page has PagePrivate set, then invalidatepage will be called when part or all of the page is to be removed from the address space. This generally corresponds to either a - truncation or a complete invalidation of the address space - (in the latter case 'offset' will always be 0). - Any private data associated with the page should be updated - to reflect this truncation. If offset is 0, then - the private data should be released, because the page - must be able to be completely discarded. This may be done by - calling the ->releasepage function, but in this case the - release MUST succeed. + truncation, punch hole or a complete invalidation of the address + space (in the latter case 'offset' will always be 0 and 'length' + will be PAGE_CACHE_SIZE). Any private data associated with the page + should be updated to reflect this truncation. If offset is 0 and + length is PAGE_CACHE_SIZE, then the private data should be released, + because the page must be able to be completely discarded. This may + be done by calling the ->releasepage function, but in this case the + release MUST succeed. releasepage: releasepage is called on PagePrivate pages to indicate that the page should be freed if possible. 
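
As a rough sketch of the new invalidatepage calling convention described above (hypothetical code, not part of the patch; only the (page, offset, length) prototype and block_invalidatepage() come from this change), a filesystem hook that simply defers to the generic helper could look like this:

    static void example_invalidatepage(struct page *page, unsigned int offset,
                                       unsigned int length)
    {
            /*
             * Hypothetical example: drop only the buffers that fall inside
             * the invalidated byte range of the page.
             */
            block_invalidatepage(page, offset, length);
    }

With offset == 0 and length == PAGE_CACHE_SIZE this reduces to the full-page invalidation case described above.
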
->releasepage -- cgit v1.2.3-70-g09d2 From f884ab15afdc5514e88105c92a4e2e1e6539869a Mon Sep 17 00:00:00 2001 From: Anatol Pomozov Date: Wed, 8 May 2013 16:56:16 -0700 Subject: doc: fix misspellings with 'codespell' tool Signed-off-by: Anatol Pomozov Signed-off-by: Jiri Kosina --- Documentation/DocBook/drm.tmpl | 6 +++--- Documentation/DocBook/media/dvb/frontend.xml | 2 +- Documentation/DocBook/media/v4l/controls.xml | 2 +- Documentation/DocBook/writing_usb_driver.tmpl | 2 +- Documentation/bcache.txt | 10 +++++----- Documentation/block/queue-sysfs.txt | 2 +- Documentation/cgroups/memory.txt | 2 +- Documentation/device-mapper/cache.txt | 2 +- .../devicetree/bindings/arm/samsung/interrupt-combiner.txt | 6 +++--- Documentation/devicetree/bindings/arm/spear/shirq.txt | 2 +- Documentation/devicetree/bindings/clock/silabs,si5351.txt | 2 +- Documentation/devicetree/bindings/mmc/synopsis-dw-mshc.txt | 2 +- Documentation/devicetree/bindings/powerpc/4xx/emac.txt | 2 +- Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt | 2 +- .../devicetree/bindings/timer/samsung,exynos4210-mct.txt | 2 +- Documentation/devicetree/bindings/usb/am33xx-usb.txt | 2 +- Documentation/devicetree/bindings/usb/omap-usb.txt | 2 +- Documentation/dynamic-debug-howto.txt | 2 +- Documentation/fb/cirrusfb.txt | 2 +- Documentation/filesystems/jfs.txt | 2 +- Documentation/filesystems/qnx6.txt | 2 +- Documentation/filesystems/vfat.txt | 2 +- Documentation/laptops/dslm.c | 2 +- Documentation/media-framework.txt | 2 +- Documentation/metag/kernel-ABI.txt | 2 +- Documentation/misc-devices/mei/mei.txt | 2 +- Documentation/networking/ip-sysctl.txt | 2 +- Documentation/networking/netlink_mmap.txt | 12 ++++++------ Documentation/pinctrl.txt | 2 +- Documentation/thermal/exynos_thermal_emulation | 2 +- Documentation/virtual/kvm/api.txt | 6 +++--- Documentation/vm/pagemap.txt | 2 +- Documentation/w1/slaves/w1_ds28e04 | 2 +- 33 files changed, 48 insertions(+), 48 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/DocBook/drm.tmpl b/Documentation/DocBook/drm.tmpl index f9df3b872c1..6dd8d10d6b7 100644 --- a/Documentation/DocBook/drm.tmpl +++ b/Documentation/DocBook/drm.tmpl @@ -434,7 +434,7 @@ char *date; The DRM core includes two memory managers, namely Translation Table Maps (TTM) and Graphics Execution Manager (GEM). TTM was the first DRM memory manager to be developed and tried to be a one-size-fits-them all - solution. It provides a single userspace API to accomodate the need of + solution. It provides a single userspace API to accommodate the need of all hardware, supporting both Unified Memory Architecture (UMA) devices and devices with dedicated video RAM (i.e. most discrete video cards). This resulted in a large, complex piece of code that turned out to be @@ -701,7 +701,7 @@ char *date; Similar to global names, GEM file descriptors are also used to share GEM objects across processes. They offer additional security: as file - descriptors must be explictly sent over UNIX domain sockets to be shared + descriptors must be explicitly sent over UNIX domain sockets to be shared between applications, they can't be guessed like the globally unique GEM names. @@ -1154,7 +1154,7 @@ int max_width, max_height; The page_flip operation schedules a page flip. - Once any pending rendering targetting the new frame buffer has + Once any pending rendering targeting the new frame buffer has completed, the CRTC will be reprogrammed to display that frame buffer after the next vertical refresh. 
The operation must return immediately without waiting for rendering or page flip to complete and must block diff --git a/Documentation/DocBook/media/dvb/frontend.xml b/Documentation/DocBook/media/dvb/frontend.xml index df39ba395df..0d6e81bd9ed 100644 --- a/Documentation/DocBook/media/dvb/frontend.xml +++ b/Documentation/DocBook/media/dvb/frontend.xml @@ -233,7 +233,7 @@ typedef enum fe_status { The frontend FEC inner coding (Viterbi, LDPC or other inner code) is stable FE_HAS_SYNC -Syncronization bytes was found +Synchronization bytes was found FE_HAS_LOCK The DVB were locked and everything is working diff --git a/Documentation/DocBook/media/v4l/controls.xml b/Documentation/DocBook/media/v4l/controls.xml index 8d7a77928d4..c2fc9ec1417 100644 --- a/Documentation/DocBook/media/v4l/controls.xml +++ b/Documentation/DocBook/media/v4l/controls.xml @@ -3147,7 +3147,7 @@ giving priority to the center of the metered area. A multi-zone metering. The light intensity is measured in several points of the frame and the the results are combined. The algorithm of the zones selection and their significance in calculating the -final value is device dependant. +final value is device dependent. diff --git a/Documentation/DocBook/writing_usb_driver.tmpl b/Documentation/DocBook/writing_usb_driver.tmpl index bd97a13fa5a..3210dcf741c 100644 --- a/Documentation/DocBook/writing_usb_driver.tmpl +++ b/Documentation/DocBook/writing_usb_driver.tmpl @@ -83,7 +83,7 @@ Because each different protocol causes a new driver to be created, I have - written a generic USB driver skeleton, modeled after the pci-skeleton.c + written a generic USB driver skeleton, modelled after the pci-skeleton.c file in the kernel source tree upon which many PCI network drivers have been based. This USB skeleton can be found at drivers/usb/usb-skeleton.c in the kernel source tree. In this article I will walk through the basics diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt index 77db8809bd9..2cdd4cfc691 100644 --- a/Documentation/bcache.txt +++ b/Documentation/bcache.txt @@ -181,7 +181,7 @@ want for getting the best possible numbers when benchmarking. In practice this isn't an issue because as soon as a write comes along it'll cause the btree node to be split, and you need almost no write traffic for - this to not show up enough to be noticable (especially since bcache's btree + this to not show up enough to be noticeable (especially since bcache's btree nodes are huge and index large regions of the device). But when you're benchmarking, if you're trying to warm the cache by reading a bunch of data and there's no other traffic - that can be a problem. @@ -222,7 +222,7 @@ running it's in passthrough mode or caching). sequential_cutoff - A sequential IO will bypass the cache once it passes this threshhold; the + A sequential IO will bypass the cache once it passes this threshold; the most recent 128 IOs are tracked so sequential IO can be detected even when it isn't all done at once. @@ -296,7 +296,7 @@ cache_miss_collisions since the synchronization for cache misses was rewritten) cache_readaheads - Count of times readahead occured. + Count of times readahead occurred. 
SYSFS - CACHE SET: @@ -359,7 +359,7 @@ unregister SYSFS - CACHE SET INTERNAL: This directory also exposes timings for a number of internal operations, with -separate files for average duration, average frequency, last occurence and max +separate files for average duration, average frequency, last occurrence and max duration: garbage collection, btree read, btree node sorts and btree splits. active_journal_entries @@ -414,7 +414,7 @@ freelist_percent space. io_errors - Number of errors that have occured, decayed by io_error_halflife. + Number of errors that have occurred, decayed by io_error_halflife. metadata_written Sum of all non data writes (btree writes and all other metadata). diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index e54ac1d5340..7d2d046c265 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -93,7 +93,7 @@ To avoid priority inversion through request starvation, a request queue maintains a separate request pool per each cgroup when CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such per-block-cgroup request pool. IOW, if there are N block cgroups, -each request queue may have upto N request pools, each independently +each request queue may have up to N request pools, each independently regulated by nr_requests. optimal_io_size (RO) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index ddf4f93967a..3aaf7870a93 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -304,7 +304,7 @@ kernel memory, we prevent new processes from being created when the kernel memory usage is too high. * slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy -of each kmem_cache is created everytime the cache is touched by the first time +of each kmem_cache is created every time the cache is touched by the first time from inside the memcg. The creation is done lazily, so some objects can still be skipped while the cache is being created. All objects in a slab page should belong to the same memcg. This only fails to hold when a task is migrated to a diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt index f50470abe24..e8cdf7241b6 100644 --- a/Documentation/device-mapper/cache.txt +++ b/Documentation/device-mapper/cache.txt @@ -87,7 +87,7 @@ Migration throttling Migrating data between the origin and cache device uses bandwidth. The user can set a throttle to prevent more than a certain amount of -migration occuring at any one time. Currently we're not taking any +migration occurring at any one time. Currently we're not taking any account of normal io traffic going to the devices. More work needs doing here to avoid migrating during those peak io moments. diff --git a/Documentation/devicetree/bindings/arm/samsung/interrupt-combiner.txt b/Documentation/devicetree/bindings/arm/samsung/interrupt-combiner.txt index f2f2171e530..9e5f73412cd 100644 --- a/Documentation/devicetree/bindings/arm/samsung/interrupt-combiner.txt +++ b/Documentation/devicetree/bindings/arm/samsung/interrupt-combiner.txt @@ -5,7 +5,7 @@ can combine interrupt sources as a group and provide a single interrupt request for the group. The interrupt request from each group are connected to a parent interrupt controller, such as GIC in case of Exynos4210. -The interrupt combiner controller consists of multiple combiners. Upto eight +The interrupt combiner controller consists of multiple combiners. 
Up to eight interrupt sources can be connected to a combiner. The combiner outputs one combined interrupt for its eight interrupt sources. The combined interrupt is usually connected to a parent interrupt controller. @@ -14,8 +14,8 @@ A single node in the device tree is used to describe the interrupt combiner controller module (which includes multiple combiners). A combiner in the interrupt controller module shares config/control registers with other combiners. For example, a 32-bit interrupt enable/disable config register -can accommodate upto 4 interrupt combiners (with each combiner supporting -upto 8 interrupt sources). +can accommodate up to 4 interrupt combiners (with each combiner supporting +up to 8 interrupt sources). Required properties: - compatible: should be "samsung,exynos4210-combiner". diff --git a/Documentation/devicetree/bindings/arm/spear/shirq.txt b/Documentation/devicetree/bindings/arm/spear/shirq.txt index 13fbb8866bd..715a013ed4b 100644 --- a/Documentation/devicetree/bindings/arm/spear/shirq.txt +++ b/Documentation/devicetree/bindings/arm/spear/shirq.txt @@ -14,7 +14,7 @@ A single node in the device tree is used to describe the shared interrupt multiplexor (one node for all groups). A group in the interrupt controller shares config/control registers with other groups. For example, a 32-bit interrupt enable/disable config register can -accommodate upto 4 interrupt groups. +accommodate up to 4 interrupt groups. Required properties: - compatible: should be, either of diff --git a/Documentation/devicetree/bindings/clock/silabs,si5351.txt b/Documentation/devicetree/bindings/clock/silabs,si5351.txt index cc374651662..04feb13d6b2 100644 --- a/Documentation/devicetree/bindings/clock/silabs,si5351.txt +++ b/Documentation/devicetree/bindings/clock/silabs,si5351.txt @@ -4,7 +4,7 @@ Reference [1] Si5351A/B/C Data Sheet http://www.silabs.com/Support%20Documents/TechnicalDocs/Si5351.pdf -The Si5351a/b/c are programmable i2c clock generators with upto 8 output +The Si5351a/b/c are programmable i2c clock generators with up to 8 output clocks. Si5351a also has a reduced pin-count package (MSOP10) where only 3 output clocks are accessible. The internal structure of the clock generators can be found in [1]. diff --git a/Documentation/devicetree/bindings/mmc/synopsis-dw-mshc.txt b/Documentation/devicetree/bindings/mmc/synopsis-dw-mshc.txt index 726fd2122a1..1180d7814af 100644 --- a/Documentation/devicetree/bindings/mmc/synopsis-dw-mshc.txt +++ b/Documentation/devicetree/bindings/mmc/synopsis-dw-mshc.txt @@ -51,7 +51,7 @@ Optional properties: * card-detect-delay: Delay in milli-seconds before detecting card after card insert event. The default value is 0. -* supports-highspeed: Enables support for high speed cards (upto 50MHz) +* supports-highspeed: Enables support for high speed cards (up to 50MHz) * broken-cd: as documented in mmc core bindings. diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt index 2161334a7ca..712baf6c3e2 100644 --- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt +++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt @@ -1,7 +1,7 @@ 4xx/Axon EMAC ethernet nodes The EMAC ethernet controller in IBM and AMCC 4xx chips, and also - the Axon bridge. To operate this needs to interact with a ths + the Axon bridge. To operate this needs to interact with a this special McMAL DMA controller, and sometimes an RGMII or ZMII interface. 
In addition to the nodes and properties described below, the node for the OPB bus on which the EMAC sits must have a diff --git a/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt b/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt index 8bf89c64364..f11f295c845 100644 --- a/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt +++ b/Documentation/devicetree/bindings/spi/brcm,bcm2835-spi.txt @@ -2,7 +2,7 @@ Broadcom BCM2835 SPI0 controller The BCM2835 contains two forms of SPI master controller, one known simply as SPI0, and the other known as the "Universal SPI Master"; part of the -auxilliary block. This binding applies to the SPI0 controller. +auxiliary block. This binding applies to the SPI0 controller. Required properties: - compatible: Should be "brcm,bcm2835-spi". diff --git a/Documentation/devicetree/bindings/timer/samsung,exynos4210-mct.txt b/Documentation/devicetree/bindings/timer/samsung,exynos4210-mct.txt index cb47bfbcaee..b5a86d20ee3 100644 --- a/Documentation/devicetree/bindings/timer/samsung,exynos4210-mct.txt +++ b/Documentation/devicetree/bindings/timer/samsung,exynos4210-mct.txt @@ -44,7 +44,7 @@ Example 1: In this example, the system uses only the first global timer }; Example 2: In this example, the MCT global and local timer interrupts are - connected to two seperate interrupt controllers. Hence, an + connected to two separate interrupt controllers. Hence, an interrupt-map is created to map the interrupts to the respective interrupt controllers. diff --git a/Documentation/devicetree/bindings/usb/am33xx-usb.txt b/Documentation/devicetree/bindings/usb/am33xx-usb.txt index ea840f7f925..dc9dc8c87f1 100644 --- a/Documentation/devicetree/bindings/usb/am33xx-usb.txt +++ b/Documentation/devicetree/bindings/usb/am33xx-usb.txt @@ -12,7 +12,7 @@ AM33XX MUSB GLUE represents PERIPHERAL. - port1-mode : Should be "1" to represent HOST. "3" signifies OTG and "2" represents PERIPHERAL. - - power : Should be "250". This signifies the controller can supply upto + - power : Should be "250". This signifies the controller can supply up to 500mA when operating in host mode. Example: diff --git a/Documentation/devicetree/bindings/usb/omap-usb.txt b/Documentation/devicetree/bindings/usb/omap-usb.txt index d4769f343d6..57e71f6817d 100644 --- a/Documentation/devicetree/bindings/usb/omap-usb.txt +++ b/Documentation/devicetree/bindings/usb/omap-usb.txt @@ -16,7 +16,7 @@ OMAP MUSB GLUE specifying ULPI and UTMI respectively. - mode : Should be "3" to represent OTG. "1" signifies HOST and "2" represents PERIPHERAL. - - power : Should be "50". This signifies the controller can supply upto + - power : Should be "50". This signifies the controller can supply up to 100mA when operating in host mode. - usb-phy : the phandle for the PHY device diff --git a/Documentation/dynamic-debug-howto.txt b/Documentation/dynamic-debug-howto.txt index 72322c6d735..1bbdcfcf1f1 100644 --- a/Documentation/dynamic-debug-howto.txt +++ b/Documentation/dynamic-debug-howto.txt @@ -279,7 +279,7 @@ The dyndbg option is a "fake" module parameter, which means: - modules do not need to define it explicitly - every module gets it tacitly, whether they use pr_debug or not -- it doesnt appear in /sys/module/$module/parameters/ +- it doesn't appear in /sys/module/$module/parameters/ To see it, grep the control file, or inspect /proc/cmdline. 
For CONFIG_DYNAMIC_DEBUG kernels, any settings given at boot-time (or diff --git a/Documentation/fb/cirrusfb.txt b/Documentation/fb/cirrusfb.txt index f9436843e99..f75950d330a 100644 --- a/Documentation/fb/cirrusfb.txt +++ b/Documentation/fb/cirrusfb.txt @@ -55,7 +55,7 @@ Version 1.9.4.4 * Overhaul color register routines. * Associated with the above, console colors are now obtained from a LUT called 'palette' instead of from the VGA registers. This code was - modeled after that in atyfb and matroxfb. + modelled after that in atyfb and matroxfb. * Code cleanup, add comments. * Overhaul SR07 handling. * Bug fixes. diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt index f7433355394..41fd757997b 100644 --- a/Documentation/filesystems/jfs.txt +++ b/Documentation/filesystems/jfs.txt @@ -42,7 +42,7 @@ nodiscard(*) block device when blocks are freed. This is useful for SSD devices and sparse/thinly-provisioned LUNs. The FITRIM ioctl command is also available together with the nodiscard option. The value of minlen specifies the minimum blockcount, when - a TRIM command to the block device is considered usefull. + a TRIM command to the block device is considered useful. When no value is given to the discard option, it defaults to 64 blocks, which means 256KiB in JFS. The minlen value of discard overrides the minlen value given diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.txt index e59f2f09f56..99e90184a72 100644 --- a/Documentation/filesystems/qnx6.txt +++ b/Documentation/filesystems/qnx6.txt @@ -148,7 +148,7 @@ smaller than addressing space in the bitmap. Bitmap system area ------------------ -The bitmap itself is devided into three parts. +The bitmap itself is divided into three parts. First the system area, that is split into two halfs. Then userspace. diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index 4a93e98b290..aa1f459fa6c 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -307,7 +307,7 @@ the following: - + diff --git a/Documentation/laptops/dslm.c b/Documentation/laptops/dslm.c index 72ff290c5fc..d5dd2d4b04d 100644 --- a/Documentation/laptops/dslm.c +++ b/Documentation/laptops/dslm.c @@ -2,7 +2,7 @@ * dslm.c * Simple Disk Sleep Monitor * by Bartek Kania - * Licenced under the GPL + * Licensed under the GPL */ #include #include diff --git a/Documentation/media-framework.txt b/Documentation/media-framework.txt index 77bd0a42f19..eeced24e56a 100644 --- a/Documentation/media-framework.txt +++ b/Documentation/media-framework.txt @@ -18,7 +18,7 @@ Abstract media device model Discovering a device internal topology, and configuring it at runtime, is one of the goals of the media framework. To achieve this, hardware devices are -modeled as an oriented graph of building blocks called entities connected +modelled as an oriented graph of building blocks called entities connected through pads. An entity is a basic media hardware building block. It can correspond to diff --git a/Documentation/metag/kernel-ABI.txt b/Documentation/metag/kernel-ABI.txt index 7b8dee83b9c..62821660319 100644 --- a/Documentation/metag/kernel-ABI.txt +++ b/Documentation/metag/kernel-ABI.txt @@ -189,7 +189,7 @@ call: 64-bit arguments are placed in matching pairs of registers (i.e. the same register number in both D0 and D1 units), with the least significant half in D0 -and the most significant half in D1, leaving a gap where necessary. 
Futher +and the most significant half in D1, leaving a gap where necessary. Further arguments are stored on the stack in reverse order (earlier arguments at higher addresses): diff --git a/Documentation/misc-devices/mei/mei.txt b/Documentation/misc-devices/mei/mei.txt index 6ec70295071..15bba1aeba9 100644 --- a/Documentation/misc-devices/mei/mei.txt +++ b/Documentation/misc-devices/mei/mei.txt @@ -120,7 +120,7 @@ The Intel MEI Driver supports the following IOCTL command: Notes: max_msg_length (MTU) in client properties describes the maximum data that can be sent or received. (e.g. if MTU=2K, can send - requests up to bytes 2k and received responses upto 2k bytes). + requests up to bytes 2k and received responses up to 2k bytes). Intel ME Applications: ============== diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index f98ca633b52..398d0fb1dd0 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -183,7 +183,7 @@ tcp_early_retrans - INTEGER for triggering fast retransmit when the amount of outstanding data is small and when no previously unsent data can be transmitted (such that limited transmit could be used). Also controls the use of - Tail loss probe (TLP) that converts RTOs occuring due to tail + Tail loss probe (TLP) that converts RTOs occurring due to tail losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01). Possible values: 0 disables ER diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt index 1c2dab40962..9bd0f5211e9 100644 --- a/Documentation/networking/netlink_mmap.txt +++ b/Documentation/networking/netlink_mmap.txt @@ -54,7 +54,7 @@ it will use an allocated socket buffer as usual and the contents will be copied to the ring on transmission, nullifying most of the performance gains. Dumps of kernel databases automatically support memory mapped I/O. -Conversion of the transmit path involves changing message contruction to +Conversion of the transmit path involves changing message construction to use memory from the TX ring instead of (usually) a buffer declared on the stack and setting up the frame header approriately. Optionally poll() can be used to wait for free frames in the TX ring. @@ -65,8 +65,8 @@ Structured and definitions for using memory mapped I/O are contained in RX and TX rings ---------------- -Each ring contains a number of continous memory blocks, containing frames of -fixed size dependant on the parameters used for ring setup. +Each ring contains a number of continuous memory blocks, containing frames of +fixed size dependent on the parameters used for ring setup. Ring: [ block 0 ] [ frame 0 ] @@ -80,7 +80,7 @@ Ring: [ block 0 ] [ frame 2 * n + 1 ] The blocks are only visible to the kernel, from the point of view of user-space -the ring just contains the frames in a continous memory zone. +the ring just contains the frames in a continuous memory zone. The ring parameters used for setting up the ring are defined as follows: @@ -91,7 +91,7 @@ struct nl_mmap_req { unsigned int nm_frame_nr; }; -Frames are grouped into blocks, where each block is a continous region of memory +Frames are grouped into blocks, where each block is a continuous region of memory and holds nm_block_size / nm_frame_size frames. The total number of frames in the ring is nm_frame_nr. 
The following invariants hold: @@ -113,7 +113,7 @@ Some parameters are constrained, specifically: - nm_frame_nr must equal the actual number of frames as specified above. -When the kernel can't allocate phsyically continous memory for a ring block, +When the kernel can't allocate physically continuous memory for a ring block, it will fall back to use physically discontinous memory. This might affect performance negatively, in order to avoid this the nm_frame_size parameter should be chosen to be as small as possible for the required frame size and diff --git a/Documentation/pinctrl.txt b/Documentation/pinctrl.txt index 447fd4cd54e..d3c6d3dd7d4 100644 --- a/Documentation/pinctrl.txt +++ b/Documentation/pinctrl.txt @@ -298,7 +298,7 @@ Since the pin controller subsystem have its pinspace local to the pin controller we need a mapping so that the pin control subsystem can figure out which pin controller handles control of a certain GPIO pin. Since a single pin controller may be muxing several GPIO ranges (typically SoCs that have -one set of pins but internally several GPIO silicon blocks, each modeled as +one set of pins but internally several GPIO silicon blocks, each modelled as a struct gpio_chip) any number of GPIO ranges can be added to a pin controller instance like this: diff --git a/Documentation/thermal/exynos_thermal_emulation b/Documentation/thermal/exynos_thermal_emulation index 36a3e79c120..b15efec6ca2 100644 --- a/Documentation/thermal/exynos_thermal_emulation +++ b/Documentation/thermal/exynos_thermal_emulation @@ -20,7 +20,7 @@ When it's enabled, sysfs node will be created as The sysfs node, 'emul_node', will contain value 0 for the initial state. When you input any temperature you want to update to sysfs node, it automatically enable emulation mode and current temperature will be changed into it. -(Exynos also supports user changable delay time which would be used to delay of +(Exynos also supports user changeable delay time which would be used to delay of changing temperature. However, this node only uses same delay of real sensing time, 938us.) Exynos emulation mode requires synchronous of value changing and enabling. It means when you diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda9164..c2db6e3fb45 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1683,7 +1683,7 @@ The parameter is defined like this: This ioctl maps the memory at "user_addr" with the length "length" to the vcpu's address space starting at "vcpu_addr". All parameters need to -be alligned by 1 megabyte. +be aligned by 1 megabyte. 4.66 KVM_S390_UCAS_UNMAP @@ -1703,7 +1703,7 @@ The parameter is defined like this: This ioctl unmaps the memory in the vcpu's address space starting at "vcpu_addr" with the length "length". The field "user_addr" is ignored. -All parameters need to be alligned by 1 megabyte. +All parameters need to be aligned by 1 megabyte. 4.67 KVM_S390_VCPU_FAULT @@ -2019,7 +2019,7 @@ be OR'ed into the "vsid" argument of the slbmte instruction. The "enc" array is a list which for each of those segment base page size provides the list of supported actual page sizes (which can be only larger or equal to the base page size), along with the -corresponding encoding in the hash PTE. Similarily, the array is +corresponding encoding in the hash PTE. 
Similarly, the array is 8 entries sorted by increasing sizes and an entry with a "0" shift is an empty entry and a terminator: diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 7587493c67f..5ef3dd38bb6 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -147,5 +147,5 @@ once. Other notes: Reading from any of the files will return -EINVAL if you are not starting -the read on an 8-byte boundary (e.g., if you seeked an odd number of bytes +the read on an 8-byte boundary (e.g., if you sought an odd number of bytes into the file), or if the size of the read is not a multiple of 8 bytes. diff --git a/Documentation/w1/slaves/w1_ds28e04 b/Documentation/w1/slaves/w1_ds28e04 index 85bc9a7e02f..7819b65cfa4 100644 --- a/Documentation/w1/slaves/w1_ds28e04 +++ b/Documentation/w1/slaves/w1_ds28e04 @@ -24,7 +24,7 @@ Memory Access A write operation on the "eeprom" file writes the given byte sequence to the EEPROM of the DS28E04. If CRC checking mode is enabled only - fully alligned blocks of 32 bytes with valid CRC16 values (in bytes 30 + fully aligned blocks of 32 bytes with valid CRC16 values (in bytes 30 and 31) are allowed to be written. PIO Access -- cgit v1.2.3-70-g09d2 From d3eaace84e40bf946129e516dcbd617173c1cf14 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 5 Jun 2013 12:09:09 +1000 Subject: xfs: disable noattr2/attr2 mount options for CRC enabled filesystems attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Signed-off-by: Ben Myers --- Documentation/filesystems/xfs.txt | 3 +++ fs/xfs/xfs_super.c | 11 +++++++++++ 2 files changed, 14 insertions(+) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index 3e4b3dd1e04..83577f0232a 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -33,6 +33,9 @@ When mounting an XFS filesystem, the following options are accepted. removing extended attributes) the on-disk superblock feature bit field will be updated to reflect this format being in use. + CRC enabled filesystems always use the attr2 format, and so + will reject the noattr2 mount option if it is set. + barrier Enables the use of block layer write barriers for writes into the journal and unwritten extent conversion. This allows for diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index ea341cea68c..3033ba5e976 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1372,6 +1372,17 @@ xfs_finish_flags( } } + /* + * V5 filesystems always use attr2 format for attributes. + */ + if (xfs_sb_version_hascrc(&mp->m_sb) && + (mp->m_flags & XFS_MOUNT_NOATTR2)) { + xfs_warn(mp, +"Cannot mount a V5 filesystem as %s. %s is always enabled for V5 filesystems.", + MNTOPT_NOATTR2, MNTOPT_ATTR2); + return XFS_ERROR(EINVAL); + } + /* * mkfs'ed attr2 will turn on attr2 mount unless explicitly * told by noattr2 to turn it off -- cgit v1.2.3-70-g09d2 From 696c018c7718f5e33e1107da19c4d64a25018878 Mon Sep 17 00:00:00 2001 From: Namjae Jeon Date: Sun, 16 Jun 2013 09:48:48 +0900 Subject: f2fs: add remount_fs callback support Add the f2fs_remount function call which will be used during the filesystem remounting. This function will help us to change the mount options specific to f2fs. 
Also modify the f2fs background_gc mount option, which will allow the user to dynamically trun on/off the garbage collection in f2fs based on the background_gc value. If background_gc=on, Garbage collection will be turned off & if background_gc=off, Garbage collection will be truned on. By default the garbage collection is on in f2fs. Change Log: v2: Incorporated the review comments by Gu Zheng. Removing the restore part for VFS flags Updating comments with proper flag conditions Display GC background option as ON/OFF Revised conditions to stop GC in case of remount v1: Initial changes for adding remount_fs callback support. Cc: Gu Zheng Signed-off-by: Namjae Jeon Signed-off-by: Pankaj Kumar Reviewed-by: Gu Zheng [Jaegeuk Kim: change /** with /* for the coding style] Signed-off-by: Jaegeuk Kim --- Documentation/filesystems/f2fs.txt | 9 +- fs/f2fs/super.c | 235 ++++++++++++++++++++++++------------- 2 files changed, 160 insertions(+), 84 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt index bd3c56c6738..b91e2f26b67 100644 --- a/Documentation/filesystems/f2fs.txt +++ b/Documentation/filesystems/f2fs.txt @@ -98,8 +98,13 @@ Cleaning Overhead MOUNT OPTIONS ================================================================================ -background_gc_off Turn off cleaning operations, namely garbage collection, - triggered in background when I/O subsystem is idle. +background_gc=%s Turn on/off cleaning operations, namely garbage + collection, triggered in background when I/O subsystem is + idle. If background_gc=on, it will turn on the garbage + collection and if background_gc=off, garbage collection + will be truned off. + Default value for this option is on. So garbage + collection is on by default. disable_roll_forward Disable the roll-forward recovery routine discard Issue discard/TRIM commands when a segment is cleaned. no_heap Disable heap-style segment allocation which finds free diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index ba56549bb2f..75c7dc363e9 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -34,7 +34,7 @@ static struct kmem_cache *f2fs_inode_cachep; enum { - Opt_gc_background_off, + Opt_gc_background, Opt_disable_roll_forward, Opt_discard, Opt_noheap, @@ -46,7 +46,7 @@ enum { }; static match_table_t f2fs_tokens = { - {Opt_gc_background_off, "background_gc_off"}, + {Opt_gc_background, "background_gc=%s"}, {Opt_disable_roll_forward, "disable_roll_forward"}, {Opt_discard, "discard"}, {Opt_noheap, "no_heap"}, @@ -76,6 +76,91 @@ static void init_once(void *foo) inode_init_once(&fi->vfs_inode); } +static int parse_options(struct super_block *sb, char *options) +{ + struct f2fs_sb_info *sbi = F2FS_SB(sb); + substring_t args[MAX_OPT_ARGS]; + char *p, *name; + int arg = 0; + + if (!options) + return 0; + + while ((p = strsep(&options, ",")) != NULL) { + int token; + if (!*p) + continue; + /* + * Initialize args struct so we know whether arg was + * found; some options take optional arguments. 
+ */ + args[0].to = args[0].from = NULL; + token = match_token(p, f2fs_tokens, args); + + switch (token) { + case Opt_gc_background: + name = match_strdup(&args[0]); + + if (!name) + return -ENOMEM; + if (!strncmp(name, "on", 2)) + set_opt(sbi, BG_GC); + else if (!strncmp(name, "off", 3)) + clear_opt(sbi, BG_GC); + else { + kfree(name); + return -EINVAL; + } + kfree(name); + break; + case Opt_disable_roll_forward: + set_opt(sbi, DISABLE_ROLL_FORWARD); + break; + case Opt_discard: + set_opt(sbi, DISCARD); + break; + case Opt_noheap: + set_opt(sbi, NOHEAP); + break; +#ifdef CONFIG_F2FS_FS_XATTR + case Opt_nouser_xattr: + clear_opt(sbi, XATTR_USER); + break; +#else + case Opt_nouser_xattr: + f2fs_msg(sb, KERN_INFO, + "nouser_xattr options not supported"); + break; +#endif +#ifdef CONFIG_F2FS_FS_POSIX_ACL + case Opt_noacl: + clear_opt(sbi, POSIX_ACL); + break; +#else + case Opt_noacl: + f2fs_msg(sb, KERN_INFO, "noacl options not supported"); + break; +#endif + case Opt_active_logs: + if (args->from && match_int(args, &arg)) + return -EINVAL; + if (arg != 2 && arg != 4 && arg != NR_CURSEG_TYPE) + return -EINVAL; + sbi->active_logs = arg; + break; + case Opt_disable_ext_identify: + set_opt(sbi, DISABLE_EXT_IDENTIFY); + break; + default: + f2fs_msg(sb, KERN_ERR, + "Unrecognized mount option \"%s\" or missing value", + p); + return -EINVAL; + } + } + return 0; +} + static struct inode *f2fs_alloc_inode(struct super_block *sb) { struct f2fs_inode_info *fi; @@ -225,10 +310,10 @@ static int f2fs_show_options(struct seq_file *seq, struct dentry *root) { struct f2fs_sb_info *sbi = F2FS_SB(root->d_sb); - if (test_opt(sbi, BG_GC)) - seq_puts(seq, ",background_gc_on"); + if (!(root->d_sb->s_flags & MS_RDONLY) && test_opt(sbi, BG_GC)) + seq_printf(seq, ",background_gc=%s", "on"); else - seq_puts(seq, ",background_gc_off"); + seq_printf(seq, ",background_gc=%s", "off"); if (test_opt(sbi, DISABLE_ROLL_FORWARD)) seq_puts(seq, ",disable_roll_forward"); if (test_opt(sbi, DISCARD)) @@ -255,6 +340,58 @@ static int f2fs_show_options(struct seq_file *seq, struct dentry *root) return 0; } +static int f2fs_remount(struct super_block *sb, int *flags, char *data) +{ + struct f2fs_sb_info *sbi = F2FS_SB(sb); + struct f2fs_mount_info org_mount_opt; + int err, active_logs; + + /* + * Save the old mount options in case we + * need to restore them. + */ + org_mount_opt = sbi->mount_opt; + active_logs = sbi->active_logs; + + /* parse mount options */ + err = parse_options(sb, data); + if (err) + goto restore_opts; + + /* + * Previous and new state of filesystem is RO, + * so no point in checking GC conditions. + */ + if ((sb->s_flags & MS_RDONLY) && (*flags & MS_RDONLY)) + goto skip; + + /* + * We stop the GC thread if FS is mounted as RO + * or if background_gc = off is passed in mount + * option. Also sync the filesystem. + */ + if ((*flags & MS_RDONLY) || !test_opt(sbi, BG_GC)) { + if (sbi->gc_thread) { + stop_gc_thread(sbi); + f2fs_sync_fs(sb, 1); + } + } else if (test_opt(sbi, BG_GC) && !sbi->gc_thread) { + err = start_gc_thread(sbi); + if (err) + goto restore_opts; + } +skip: + /* Update the POSIXACL Flag */ + sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | + (test_opt(sbi, POSIX_ACL) ? 
MS_POSIXACL : 0); + return 0; + +restore_opts: + sbi->mount_opt = org_mount_opt; + sbi->active_logs = active_logs; + return err; +} + static struct super_operations f2fs_sops = { .alloc_inode = f2fs_alloc_inode, .drop_inode = f2fs_drop_inode, @@ -268,6 +405,7 @@ static struct super_operations f2fs_sops = { .freeze_fs = f2fs_freeze, .unfreeze_fs = f2fs_unfreeze, .statfs = f2fs_statfs, + .remount_fs = f2fs_remount, }; static struct inode *f2fs_nfs_get_inode(struct super_block *sb, @@ -315,79 +453,6 @@ static const struct export_operations f2fs_export_ops = { .get_parent = f2fs_get_parent, }; -static int parse_options(struct super_block *sb, char *options) -{ - struct f2fs_sb_info *sbi = F2FS_SB(sb); - substring_t args[MAX_OPT_ARGS]; - char *p; - int arg = 0; - - if (!options) - return 0; - - while ((p = strsep(&options, ",")) != NULL) { - int token; - if (!*p) - continue; - /* - * Initialize args struct so we know whether arg was - * found; some options take optional arguments. - */ - args[0].to = args[0].from = NULL; - token = match_token(p, f2fs_tokens, args); - - switch (token) { - case Opt_gc_background_off: - clear_opt(sbi, BG_GC); - break; - case Opt_disable_roll_forward: - set_opt(sbi, DISABLE_ROLL_FORWARD); - break; - case Opt_discard: - set_opt(sbi, DISCARD); - break; - case Opt_noheap: - set_opt(sbi, NOHEAP); - break; -#ifdef CONFIG_F2FS_FS_XATTR - case Opt_nouser_xattr: - clear_opt(sbi, XATTR_USER); - break; -#else - case Opt_nouser_xattr: - f2fs_msg(sb, KERN_INFO, - "nouser_xattr options not supported"); - break; -#endif -#ifdef CONFIG_F2FS_FS_POSIX_ACL - case Opt_noacl: - clear_opt(sbi, POSIX_ACL); - break; -#else - case Opt_noacl: - f2fs_msg(sb, KERN_INFO, "noacl options not supported"); - break; -#endif - case Opt_active_logs: - if (args->from && match_int(args, &arg)) - return -EINVAL; - if (arg != 2 && arg != 4 && arg != NR_CURSEG_TYPE) - return -EINVAL; - sbi->active_logs = arg; - break; - case Opt_disable_ext_identify: - set_opt(sbi, DISABLE_EXT_IDENTIFY); - break; - default: - f2fs_msg(sb, KERN_ERR, - "Unrecognized mount option \"%s\" or missing value", - p); - return -EINVAL; - } - } - return 0; -} - static loff_t max_file_size(unsigned bits) { loff_t result = ADDRS_PER_INODE; @@ -686,10 +751,16 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent) "Cannot recover all fsync data errno=%ld", err); } - /* After POR, we can run background GC thread */ - err = start_gc_thread(sbi); - if (err) - goto fail; + /* + * If filesystem is not mounted as read-only then + * do start the gc_thread. + */ + if (!(sb->s_flags & MS_RDONLY)) { + /* After POR, we can run background GC thread.*/ + err = start_gc_thread(sbi); + if (err) + goto fail; + } err = f2fs_build_stats(sbi); if (err) -- cgit v1.2.3-70-g09d2 From 5c0ba4e0762e6dabd14a5c276652e2defec38de7 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Wed, 15 May 2013 13:52:59 -0400 Subject: [readdir] introduce iterate_dir() and dir_context iterate_dir(): new helper, replacing vfs_readdir(). struct dir_context: contains the readdir callback (and will get more stuff in it), embedded into whatever data that callback wants to deal with; eventually, we'll be passing it to ->readdir() replacement instead of (data,filldir) pair. 
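
As a rough illustration of the conversion pattern (hypothetical names; only struct dir_context, its actor field and iterate_dir() come from this series), a caller embeds the context at the start of its private buffer and passes that instead of the old (data, filldir) pair:

    struct example_callback {
            struct dir_context ctx; /* placed first: the actor's __buf is really &buf.ctx */
            int entries;            /* caller-private state */
    };

    static int example_filldir(void *__buf, const char *name, int namlen,
                               loff_t offset, u64 ino, unsigned int d_type)
    {
            /* Hypothetical actor with the existing filldir_t signature. */
            struct example_callback *buf = __buf;

            buf->entries++;
            return 0;
    }

    /* ... in the caller, replacing vfs_readdir(file, example_filldir, &buf): */
    struct example_callback buf = { .ctx.actor = example_filldir };
    int error = iterate_dir(file, &buf.ctx);
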
Signed-off-by: Al Viro --- Documentation/filesystems/porting | 3 +++ arch/alpha/kernel/osf_sys.c | 4 +++- arch/parisc/hpux/fs.c | 4 +++- fs/compat.c | 12 +++++++++--- fs/ecryptfs/file.c | 4 +++- fs/exportfs/expfs.c | 4 +++- fs/nfsd/nfs4recover.c | 13 +++++++++---- fs/nfsd/vfs.c | 4 +++- fs/readdir.c | 21 +++++++++++++-------- include/linux/fs.h | 4 ++++ 10 files changed, 53 insertions(+), 20 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 4db22f6491e..85a4a033bae 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -445,3 +445,6 @@ object doesn't exist. It's remote/distributed ones that might care... [mandatory] FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate() in your dentry operations instead. +-- +[mandatory] + vfs_readdir() is gone; switch to iterate_dir() instead diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index b9e37ad6fa1..ac19c7299d8 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -96,6 +96,7 @@ struct osf_dirent { }; struct osf_dirent_callback { + struct dir_context ctx; struct osf_dirent __user *dirent; long __user *basep; unsigned int count; @@ -155,8 +156,9 @@ SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd, buf.basep = basep; buf.count = count; buf.error = 0; + buf.ctx.actor = osf_filldir; - error = vfs_readdir(arg.file, osf_filldir, &buf); + error = iterate_dir(arg.file, &buf.ctx); if (error >= 0) error = buf.error; if (count != buf.count) diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c index 838b479a42c..fc2cbee86e3 100644 --- a/arch/parisc/hpux/fs.c +++ b/arch/parisc/hpux/fs.c @@ -60,6 +60,7 @@ struct hpux_dirent { }; struct getdents_callback { + struct dir_context ctx; struct hpux_dirent __user *current_dir; struct hpux_dirent __user *previous; int count; @@ -121,8 +122,9 @@ int hpux_getdents(unsigned int fd, struct hpux_dirent __user *dirent, unsigned i buf.previous = NULL; buf.count = count; buf.error = 0; + buf.ctx.actor = filldir; - error = vfs_readdir(arg.file, filldir, &buf); + error = iterate_dir(arg.file, &buf.ctx); if (error >= 0) error = buf.error; lastdirent = buf.previous; diff --git a/fs/compat.c b/fs/compat.c index fc3b55dce18..2279b59e81f 100644 --- a/fs/compat.c +++ b/fs/compat.c @@ -832,6 +832,7 @@ struct compat_old_linux_dirent { }; struct compat_readdir_callback { + struct dir_context ctx; struct compat_old_linux_dirent __user *dirent; int result; }; @@ -880,8 +881,9 @@ asmlinkage long compat_sys_old_readdir(unsigned int fd, buf.result = 0; buf.dirent = dirent; + buf.ctx.actor = compat_fillonedir; - error = vfs_readdir(f.file, compat_fillonedir, &buf); + error = iterate_dir(f.file, &buf.ctx); if (buf.result) error = buf.result; @@ -897,6 +899,7 @@ struct compat_linux_dirent { }; struct compat_getdents_callback { + struct dir_context ctx; struct compat_linux_dirent __user *current_dir; struct compat_linux_dirent __user *previous; int count; @@ -965,8 +968,9 @@ asmlinkage long compat_sys_getdents(unsigned int fd, buf.previous = NULL; buf.count = count; buf.error = 0; + buf.ctx.actor = compat_filldir; - error = vfs_readdir(f.file, compat_filldir, &buf); + error = iterate_dir(f.file, &buf.ctx); if (error >= 0) error = buf.error; lastdirent = buf.previous; @@ -983,6 +987,7 @@ asmlinkage long compat_sys_getdents(unsigned int fd, #ifndef __ARCH_OMIT_COMPAT_SYS_GETDENTS64 struct compat_getdents_callback64 { + struct dir_context ctx; struct 
linux_dirent64 __user *current_dir; struct linux_dirent64 __user *previous; int count; @@ -1050,8 +1055,9 @@ asmlinkage long compat_sys_getdents64(unsigned int fd, buf.previous = NULL; buf.count = count; buf.error = 0; + buf.ctx.actor = compat_filldir64; - error = vfs_readdir(f.file, compat_filldir64, &buf); + error = iterate_dir(f.file, &buf.ctx); if (error >= 0) error = buf.error; lastdirent = buf.previous; diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c index a7abbea2c09..041379a646b 100644 --- a/fs/ecryptfs/file.c +++ b/fs/ecryptfs/file.c @@ -68,6 +68,7 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb, } struct ecryptfs_getdents_callback { + struct dir_context ctx; void *dirent; struct dentry *dentry; filldir_t filldir; @@ -126,7 +127,8 @@ static int ecryptfs_readdir(struct file *file, void *dirent, filldir_t filldir) buf.filldir = filldir; buf.filldir_called = 0; buf.entries_written = 0; - rc = vfs_readdir(lower_file, ecryptfs_filldir, (void *)&buf); + buf.ctx.actor = ecryptfs_filldir; + rc = iterate_dir(lower_file, &buf.ctx); file->f_pos = lower_file->f_pos; if (rc < 0) goto out; diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c index 262fc994098..7cb190426ce 100644 --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -212,6 +212,7 @@ reconnect_path(struct vfsmount *mnt, struct dentry *target_dir, char *nbuf) } struct getdents_callback { + struct dir_context ctx; char *name; /* name that was found. It already points to a buffer NAME_MAX+1 is size */ unsigned long ino; /* the inum we are looking for */ @@ -278,10 +279,11 @@ static int get_name(const struct path *path, char *name, struct dentry *child) buffer.ino = child->d_inode->i_ino; buffer.found = 0; buffer.sequence = 0; + buffer.ctx.actor = filldir_one; while (1) { int old_seq = buffer.sequence; - error = vfs_readdir(file, filldir_one, &buffer); + error = iterate_dir(file, &buffer.ctx); if (buffer.found) { error = 0; break; diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c index 4e9a21db867..4f8cc6ba7c2 100644 --- a/fs/nfsd/nfs4recover.c +++ b/fs/nfsd/nfs4recover.c @@ -263,7 +263,10 @@ nfsd4_list_rec_dir(recdir_func *f, struct nfsd_net *nn) { const struct cred *original_cred; struct dentry *dir = nn->rec_file->f_path.dentry; - LIST_HEAD(names); + struct { + struct dir_context ctx; + struct list_head names; + } ctx; int status; status = nfs4_save_creds(&original_cred); @@ -276,11 +279,13 @@ nfsd4_list_rec_dir(recdir_func *f, struct nfsd_net *nn) return status; } - status = vfs_readdir(nn->rec_file, nfsd4_build_namelist, &names); + INIT_LIST_HEAD(&ctx.names); + ctx.ctx.actor = nfsd4_build_namelist; + status = iterate_dir(nn->rec_file, &ctx.ctx); mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT); - while (!list_empty(&names)) { + while (!list_empty(&ctx.names)) { struct name_list *entry; - entry = list_entry(names.next, struct name_list, list); + entry = list_entry(ctx.names.next, struct name_list, list); if (!status) { struct dentry *dentry; dentry = lookup_one_len(entry->name, dir, HEXDIR_LEN-1); diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 84ce601d806..f939ba9bf8e 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1912,6 +1912,7 @@ struct buffered_dirent { }; struct readdir_data { + struct dir_context ctx; char *dirent; size_t used; int full; @@ -1949,6 +1950,7 @@ static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func, int size; loff_t offset; + buf.ctx.actor = nfsd_buffered_filldir; buf.dirent = (void *)__get_free_page(GFP_KERNEL); if (!buf.dirent) return 
nfserrno(-ENOMEM); @@ -1963,7 +1965,7 @@ static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func, buf.used = 0; buf.full = 0; - host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf); + host_err = iterate_dir(file, &buf.ctx); if (buf.full) host_err = 0; diff --git a/fs/readdir.c b/fs/readdir.c index fee38e04fae..5b620a2b45e 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -20,7 +20,7 @@ #include -int vfs_readdir(struct file *file, filldir_t filler, void *buf) +int iterate_dir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); int res = -ENOTDIR; @@ -37,15 +37,14 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf) res = -ENOENT; if (!IS_DEADDIR(inode)) { - res = file->f_op->readdir(file, buf, filler); + res = file->f_op->readdir(file, ctx, ctx->actor); file_accessed(file); } mutex_unlock(&inode->i_mutex); out: return res; } - -EXPORT_SYMBOL(vfs_readdir); +EXPORT_SYMBOL(iterate_dir); /* * Traditional linux readdir() handling.. @@ -66,6 +65,7 @@ struct old_linux_dirent { }; struct readdir_callback { + struct dir_context ctx; struct old_linux_dirent __user * dirent; int result; }; @@ -73,7 +73,7 @@ struct readdir_callback { static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset, u64 ino, unsigned int d_type) { - struct readdir_callback * buf = (struct readdir_callback *) __buf; + struct readdir_callback *buf = (struct readdir_callback *) __buf; struct old_linux_dirent __user * dirent; unsigned long d_ino; @@ -112,10 +112,11 @@ SYSCALL_DEFINE3(old_readdir, unsigned int, fd, if (!f.file) return -EBADF; + buf.ctx.actor = fillonedir; buf.result = 0; buf.dirent = dirent; - error = vfs_readdir(f.file, fillonedir, &buf); + error = iterate_dir(f.file, &buf.ctx); if (buf.result) error = buf.result; @@ -137,6 +138,7 @@ struct linux_dirent { }; struct getdents_callback { + struct dir_context ctx; struct linux_dirent __user * current_dir; struct linux_dirent __user * previous; int count; @@ -205,8 +207,9 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd, buf.previous = NULL; buf.count = count; buf.error = 0; + buf.ctx.actor = filldir; - error = vfs_readdir(f.file, filldir, &buf); + error = iterate_dir(f.file, &buf.ctx); if (error >= 0) error = buf.error; lastdirent = buf.previous; @@ -221,6 +224,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd, } struct getdents_callback64 { + struct dir_context ctx; struct linux_dirent64 __user * current_dir; struct linux_dirent64 __user * previous; int count; @@ -285,8 +289,9 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, buf.previous = NULL; buf.count = count; buf.error = 0; + buf.ctx.actor = filldir64; - error = vfs_readdir(f.file, filldir64, &buf); + error = iterate_dir(f.file, &buf.ctx); if (error >= 0) error = buf.error; lastdirent = buf.previous; diff --git a/include/linux/fs.h b/include/linux/fs.h index 65c2be22b60..643e5b6cbaf 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1506,6 +1506,9 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); * to have different dirent layouts depending on the binary type. 
*/ typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned); +struct dir_context { + filldir_t actor; +}; struct block_device_operations; /* These macros are for out of kernel modules to test that @@ -2494,6 +2497,7 @@ loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); extern int vfs_readdir(struct file *, filldir_t, void *); +extern int iterate_dir(struct file *, struct dir_context *); extern int vfs_stat(const char __user *, struct kstat *); extern int vfs_lstat(const char __user *, struct kstat *); -- cgit v1.2.3-70-g09d2 From 2233f31aade393641f0eaed43a71110e629bb900 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Wed, 22 May 2013 21:44:23 -0400 Subject: [readdir] ->readdir() is gone everything's converted to ->iterate() Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 2 +- Documentation/filesystems/porting | 3 +++ Documentation/filesystems/vfs.txt | 4 ++-- fs/bad_inode.c | 4 ++-- fs/exportfs/expfs.c | 2 +- fs/readdir.c | 13 ++++--------- include/linux/fs.h | 1 - 7 files changed, 13 insertions(+), 16 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 0706d32a61e..bdd82b2339d 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -414,7 +414,7 @@ prototypes: ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); - int (*readdir) (struct file *, void *, filldir_t); + int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 85a4a033bae..206a1bdc732 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -448,3 +448,6 @@ in your dentry operations instead. -- [mandatory] vfs_readdir() is gone; switch to iterate_dir() instead +-- +[mandatory] + ->readdir() is gone now; switch to ->iterate() diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index bc4b06b3160..4a35f6614a6 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -777,7 +777,7 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); - int (*readdir) (struct file *, void *, filldir_t); + int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); @@ -815,7 +815,7 @@ otherwise noted. 
aio_write: called by io_submit(2) and other asynchronous I/O operations - readdir: called when the VFS needs to read the directory contents + iterate: called when the VFS needs to read the directory contents poll: called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there diff --git a/fs/bad_inode.c b/fs/bad_inode.c index 922ad460bff..7c93953030f 100644 --- a/fs/bad_inode.c +++ b/fs/bad_inode.c @@ -45,7 +45,7 @@ static ssize_t bad_file_aio_write(struct kiocb *iocb, const struct iovec *iov, return -EIO; } -static int bad_file_readdir(struct file *filp, void *dirent, filldir_t filldir) +static int bad_file_readdir(struct file *file, struct dir_context *ctx) { return -EIO; } @@ -152,7 +152,7 @@ static const struct file_operations bad_file_ops = .write = bad_file_write, .aio_read = bad_file_aio_read, .aio_write = bad_file_aio_write, - .readdir = bad_file_readdir, + .iterate = bad_file_readdir, .poll = bad_file_poll, .unlocked_ioctl = bad_file_unlocked_ioctl, .compat_ioctl = bad_file_compat_ioctl, diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c index 6c8ef1dd4bd..78072e65f92 100644 --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -272,7 +272,7 @@ static int get_name(const struct path *path, char *name, struct dentry *child) goto out; error = -EINVAL; - if (!file->f_op->readdir && !file->f_op->iterate) + if (!file->f_op->iterate) goto out_close; buffer.name = name; diff --git a/fs/readdir.c b/fs/readdir.c index 5d6578affbb..a6245c9fd0e 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -24,7 +24,7 @@ int iterate_dir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); int res = -ENOTDIR; - if (!file->f_op || (!file->f_op->readdir && !file->f_op->iterate)) + if (!file->f_op || !file->f_op->iterate) goto out; res = security_file_permission(file, MAY_READ); @@ -37,14 +37,9 @@ int iterate_dir(struct file *file, struct dir_context *ctx) res = -ENOENT; if (!IS_DEADDIR(inode)) { - if (file->f_op->iterate) { - ctx->pos = file->f_pos; - res = file->f_op->iterate(file, ctx); - file->f_pos = ctx->pos; - } else { - res = file->f_op->readdir(file, ctx, ctx->actor); - ctx->pos = file->f_pos; - } + ctx->pos = file->f_pos; + res = file->f_op->iterate(file, ctx); + file->f_pos = ctx->pos; file_accessed(file); } mutex_unlock(&inode->i_mutex); diff --git a/include/linux/fs.h b/include/linux/fs.h index aa9770c7e8d..237af62976a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1526,7 +1526,6 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); - int (*readdir) (struct file *, void *, filldir_t); int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); -- cgit v1.2.3-70-g09d2 From da53be12bbb4fabbe2e9f6f908de0cf478b5161d Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 21 May 2013 15:22:44 -0700 Subject: Don't pass inode to ->d_hash() and ->d_compare() Instances either don't look at it at all (the majority of cases) or only want it to find the superblock (which can be had as dentry->d_sb). 
A few cases that want more are actually safe with dentry->d_inode - the only precaution needed is the check that it hadn't been replaced with NULL by rmdir() or by overwriting rename(), which case should be simply treated as cache miss. Signed-off-by: Linus Torvalds Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 6 ++--- Documentation/filesystems/vfs.txt | 19 +++++++--------- fs/adfs/dir.c | 6 ++--- fs/affs/namei.c | 26 +++++++-------------- fs/cifs/dir.c | 9 +++----- fs/dcache.c | 27 ++++++++-------------- fs/efivarfs/super.c | 9 ++++---- fs/fat/namei_msdos.c | 6 ++--- fs/fat/namei_vfat.c | 12 ++++------ fs/gfs2/dentry.c | 3 +-- fs/hfs/hfs_fs.h | 7 ++---- fs/hfs/string.c | 6 ++--- fs/hfsplus/hfsplus_fs.h | 7 ++---- fs/hfsplus/unicode.c | 7 ++---- fs/hpfs/dentry.c | 7 ++---- fs/isofs/inode.c | 48 +++++++++++++-------------------------- fs/isofs/namei.c | 3 +-- fs/jfs/namei.c | 7 ++---- fs/namei.c | 7 +++--- fs/ncpfs/dir.c | 32 +++++++++++++++++++------- fs/proc/proc_sysctl.c | 7 +++--- fs/sysv/namei.c | 3 +-- include/linux/dcache.h | 9 +++----- 23 files changed, 108 insertions(+), 165 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index bdd82b2339d..f94a362f408 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -11,10 +11,8 @@ be able to use diff(1). prototypes: int (*d_revalidate)(struct dentry *, unsigned int); int (*d_weak_revalidate)(struct dentry *, unsigned int); - int (*d_hash)(const struct dentry *, const struct inode *, - struct qstr *); - int (*d_compare)(const struct dentry *, const struct inode *, - const struct dentry *, const struct inode *, + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, unsigned int, const char *, const struct qstr *); int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 4a35f6614a6..51ba44e3fc4 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -901,10 +901,8 @@ defined: struct dentry_operations { int (*d_revalidate)(struct dentry *, unsigned int); int (*d_weak_revalidate)(struct dentry *, unsigned int); - int (*d_hash)(const struct dentry *, const struct inode *, - struct qstr *); - int (*d_compare)(const struct dentry *, const struct inode *, - const struct dentry *, const struct inode *, + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, unsigned int, const char *, const struct qstr *); int (*d_delete)(const struct dentry *); void (*d_release)(struct dentry *); @@ -949,25 +947,24 @@ struct dentry_operations { d_hash: called when the VFS adds a dentry to the hash table. The first dentry passed to d_hash is the parent directory that the name is - to be hashed into. The inode is the dentry's inode. + to be hashed into. Same locking and synchronisation rules as d_compare regarding what is safe to dereference etc. d_compare: called to compare a dentry name with a given name. The first dentry is the parent of the dentry to be compared, the second is - the parent's inode, then the dentry and inode (may be NULL) of the - child dentry. len and name string are properties of the dentry to be - compared. qstr is the name to compare it with. + the child dentry. len and name string are properties of the dentry + to be compared. 
qstr is the name to compare it with. Must be constant and idempotent, and should not take locks if - possible, and should not or store into the dentry or inodes. - Should not dereference pointers outside the dentry or inodes without + possible, and should not or store into the dentry. + Should not dereference pointers outside the dentry without lots of care (eg. d_parent, d_inode, d_name should not be used). However, our vfsmount is pinned, and RCU held, so the dentries and inodes won't disappear, neither will our sb or filesystem module. - ->i_sb and ->d_sb may be used. + ->d_sb may be used. It is a tricky calling convention because it needs to be called under "rcu-walk", ie. without any locks or references on things. diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c index ade28bb058e..0d138c0de29 100644 --- a/fs/adfs/dir.c +++ b/fs/adfs/dir.c @@ -191,8 +191,7 @@ const struct file_operations adfs_dir_operations = { }; static int -adfs_hash(const struct dentry *parent, const struct inode *inode, - struct qstr *qstr) +adfs_hash(const struct dentry *parent, struct qstr *qstr) { const unsigned int name_len = ADFS_SB(parent->d_sb)->s_namelen; const unsigned char *name; @@ -228,8 +227,7 @@ adfs_hash(const struct dentry *parent, const struct inode *inode, * requirements of the underlying filesystem. */ static int -adfs_compare(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +adfs_compare(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { int i; diff --git a/fs/affs/namei.c b/fs/affs/namei.c index ff65884a783..c36cbb4537a 100644 --- a/fs/affs/namei.c +++ b/fs/affs/namei.c @@ -13,18 +13,12 @@ typedef int (*toupper_t)(int); static int affs_toupper(int ch); -static int affs_hash_dentry(const struct dentry *, - const struct inode *, struct qstr *); -static int affs_compare_dentry(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int affs_hash_dentry(const struct dentry *, struct qstr *); +static int affs_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); static int affs_intl_toupper(int ch); -static int affs_intl_hash_dentry(const struct dentry *, - const struct inode *, struct qstr *); -static int affs_intl_compare_dentry(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int affs_intl_hash_dentry(const struct dentry *, struct qstr *); +static int affs_intl_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); const struct dentry_operations affs_dentry_operations = { @@ -86,14 +80,12 @@ __affs_hash_dentry(struct qstr *qstr, toupper_t toupper) } static int -affs_hash_dentry(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +affs_hash_dentry(const struct dentry *dentry, struct qstr *qstr) { return __affs_hash_dentry(qstr, affs_toupper); } static int -affs_intl_hash_dentry(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +affs_intl_hash_dentry(const struct dentry *dentry, struct qstr *qstr) { return __affs_hash_dentry(qstr, affs_intl_toupper); } @@ -131,15 +123,13 @@ static inline int __affs_compare_dentry(unsigned int len, } static int -affs_compare_dentry(const struct dentry *parent, const struct inode *pinode, - 
const struct dentry *dentry, const struct inode *inode, +affs_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { return __affs_compare_dentry(len, str, name, affs_toupper); } static int -affs_intl_compare_dentry(const struct dentry *parent,const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +affs_intl_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { return __affs_compare_dentry(len, str, name, affs_intl_toupper); diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c index 5699b5036ed..5175aebf673 100644 --- a/fs/cifs/dir.c +++ b/fs/cifs/dir.c @@ -822,8 +822,7 @@ const struct dentry_operations cifs_dentry_ops = { /* d_delete: cifs_d_delete, */ /* not needed except for debugging */ }; -static int cifs_ci_hash(const struct dentry *dentry, const struct inode *inode, - struct qstr *q) +static int cifs_ci_hash(const struct dentry *dentry, struct qstr *q) { struct nls_table *codepage = CIFS_SB(dentry->d_sb)->local_nls; unsigned long hash; @@ -838,12 +837,10 @@ static int cifs_ci_hash(const struct dentry *dentry, const struct inode *inode, return 0; } -static int cifs_ci_compare(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int cifs_ci_compare(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { - struct nls_table *codepage = CIFS_SB(pinode->i_sb)->local_nls; + struct nls_table *codepage = CIFS_SB(parent->d_sb)->local_nls; if ((name->len == len) && (nls_strnicmp(codepage, name->name, str, len) == 0)) diff --git a/fs/dcache.c b/fs/dcache.c index b692c7e097c..3199fe6863a 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1723,7 +1723,7 @@ EXPORT_SYMBOL(d_add_ci); * Do the slow-case of the dentry name compare. * * Unlike the dentry_cmp() function, we need to atomically - * load the name, length and inode information, so that the + * load the name and length information, so that the * filesystem can rely on them, and can use the 'name' and * 'len' information without worrying about walking off the * end of memory etc. @@ -1741,22 +1741,18 @@ enum slow_d_compare { static noinline enum slow_d_compare slow_dentry_cmp( const struct dentry *parent, - struct inode *inode, struct dentry *dentry, unsigned int seq, const struct qstr *name) { int tlen = dentry->d_name.len; const char *tname = dentry->d_name.name; - struct inode *i = dentry->d_inode; if (read_seqcount_retry(&dentry->d_seq, seq)) { cpu_relax(); return D_COMP_SEQRETRY; } - if (parent->d_op->d_compare(parent, inode, - dentry, i, - tlen, tname, name)) + if (parent->d_op->d_compare(parent, dentry, tlen, tname, name)) return D_COMP_NOMATCH; return D_COMP_OK; } @@ -1766,7 +1762,6 @@ static noinline enum slow_d_compare slow_dentry_cmp( * @parent: parent dentry * @name: qstr of name we wish to find * @seqp: returns d_seq value at the point where the dentry was found - * @inode: returns dentry->d_inode when the inode was found valid. 
* Returns: dentry, or NULL * * __d_lookup_rcu is the dcache lookup function for rcu-walk name @@ -1793,7 +1788,7 @@ static noinline enum slow_d_compare slow_dentry_cmp( */ struct dentry *__d_lookup_rcu(const struct dentry *parent, const struct qstr *name, - unsigned *seqp, struct inode *inode) + unsigned *seqp) { u64 hashlen = name->hash_len; const unsigned char *str = name->name; @@ -1827,11 +1822,10 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent, seqretry: /* * The dentry sequence count protects us from concurrent - * renames, and thus protects inode, parent and name fields. + * renames, and thus protects parent and name fields. * * The caller must perform a seqcount check in order - * to do anything useful with the returned dentry, - * including using the 'd_inode' pointer. + * to do anything useful with the returned dentry. * * NOTE! We do a "raw" seqcount_begin here. That means that * we don't wait for the sequence count to stabilize if it @@ -1845,12 +1839,12 @@ seqretry: continue; if (d_unhashed(dentry)) continue; - *seqp = seq; if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) { if (dentry->d_name.hash != hashlen_hash(hashlen)) continue; - switch (slow_dentry_cmp(parent, inode, dentry, seq, name)) { + *seqp = seq; + switch (slow_dentry_cmp(parent, dentry, seq, name)) { case D_COMP_OK: return dentry; case D_COMP_NOMATCH: @@ -1862,6 +1856,7 @@ seqretry: if (dentry->d_name.hash_len != hashlen) continue; + *seqp = seq; if (!dentry_cmp(dentry, str, hashlen_len(hashlen))) return dentry; } @@ -1959,9 +1954,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name) if (parent->d_flags & DCACHE_OP_COMPARE) { int tlen = dentry->d_name.len; const char *tname = dentry->d_name.name; - if (parent->d_op->d_compare(parent, parent->d_inode, - dentry, dentry->d_inode, - tlen, tname, name)) + if (parent->d_op->d_compare(parent, dentry, tlen, tname, name)) goto next; } else { if (dentry->d_name.len != len) @@ -1998,7 +1991,7 @@ struct dentry *d_hash_and_lookup(struct dentry *dir, struct qstr *name) */ name->hash = full_name_hash(name->name, name->len); if (dir->d_flags & DCACHE_OP_HASH) { - int err = dir->d_op->d_hash(dir, dir->d_inode, name); + int err = dir->d_op->d_hash(dir, name); if (unlikely(err < 0)) return ERR_PTR(err); } diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c index 141aee31884..a8766b880c0 100644 --- a/fs/efivarfs/super.c +++ b/fs/efivarfs/super.c @@ -45,8 +45,8 @@ static struct super_block *efivarfs_sb; * So we need to perform a case-sensitive match on part 1 and a * case-insensitive match on part 2. 
*/ -static int efivarfs_d_compare(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int efivarfs_d_compare(const struct dentry *parent, + const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { @@ -63,8 +63,7 @@ static int efivarfs_d_compare(const struct dentry *parent, const struct inode *p return strncasecmp(name->name + guid, str + guid, EFI_VARIABLE_GUID_LEN); } -static int efivarfs_d_hash(const struct dentry *dentry, - const struct inode *inode, struct qstr *qstr) +static int efivarfs_d_hash(const struct dentry *dentry, struct qstr *qstr) { unsigned long hash = init_name_hash(); const unsigned char *s = qstr->name; @@ -108,7 +107,7 @@ static struct dentry *efivarfs_alloc_dentry(struct dentry *parent, char *name) q.name = name; q.len = strlen(name); - err = efivarfs_d_hash(NULL, NULL, &q); + err = efivarfs_d_hash(NULL, &q); if (err) return ERR_PTR(err); diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index 081b759cff8..a783b0e1272 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -148,8 +148,7 @@ static int msdos_find(struct inode *dir, const unsigned char *name, int len, * that the existing dentry can be used. The msdos fs routines will * return ENOENT or EINVAL as appropriate. */ -static int msdos_hash(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +static int msdos_hash(const struct dentry *dentry, struct qstr *qstr) { struct fat_mount_options *options = &MSDOS_SB(dentry->d_sb)->options; unsigned char msdos_name[MSDOS_NAME]; @@ -165,8 +164,7 @@ static int msdos_hash(const struct dentry *dentry, const struct inode *inode, * Compare two msdos names. If either of the names are invalid, * we fall back to doing the standard name comparison. */ -static int msdos_cmp(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int msdos_cmp(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { struct fat_mount_options *options = &MSDOS_SB(parent->d_sb)->options; diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 2da952036a3..6df8d3d885e 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -107,8 +107,7 @@ static unsigned int vfat_striptail_len(const struct qstr *qstr) * that the existing dentry can be used. The vfat fs routines will * return ENOENT or EINVAL as appropriate. */ -static int vfat_hash(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +static int vfat_hash(const struct dentry *dentry, struct qstr *qstr) { qstr->hash = full_name_hash(qstr->name, vfat_striptail_len(qstr)); return 0; @@ -120,8 +119,7 @@ static int vfat_hash(const struct dentry *dentry, const struct inode *inode, * that the existing dentry can be used. The vfat fs routines will * return ENOENT or EINVAL as appropriate. */ -static int vfat_hashi(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +static int vfat_hashi(const struct dentry *dentry, struct qstr *qstr) { struct nls_table *t = MSDOS_SB(dentry->d_sb)->nls_io; const unsigned char *name; @@ -142,8 +140,7 @@ static int vfat_hashi(const struct dentry *dentry, const struct inode *inode, /* * Case insensitive compare of two vfat names. 
*/ -static int vfat_cmpi(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int vfat_cmpi(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { struct nls_table *t = MSDOS_SB(parent->d_sb)->nls_io; @@ -162,8 +159,7 @@ static int vfat_cmpi(const struct dentry *parent, const struct inode *pinode, /* * Case sensitive compare of two vfat names. */ -static int vfat_cmp(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int vfat_cmp(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { unsigned int alen, blen; diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c index 4fddb3c22d2..f2448ab2aac 100644 --- a/fs/gfs2/dentry.c +++ b/fs/gfs2/dentry.c @@ -109,8 +109,7 @@ fail: return 0; } -static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode, - struct qstr *str) +static int gfs2_dhash(const struct dentry *dentry, struct qstr *str) { str->hash = gfs2_disk_hash(str->name, str->len); return 0; diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h index a73b11839a4..0524cda47a6 100644 --- a/fs/hfs/hfs_fs.h +++ b/fs/hfs/hfs_fs.h @@ -229,13 +229,10 @@ extern int hfs_part_find(struct super_block *, sector_t *, sector_t *); /* string.c */ extern const struct dentry_operations hfs_dentry_operations; -extern int hfs_hash_dentry(const struct dentry *, const struct inode *, - struct qstr *); +extern int hfs_hash_dentry(const struct dentry *, struct qstr *); extern int hfs_strcmp(const unsigned char *, unsigned int, const unsigned char *, unsigned int); -extern int hfs_compare_dentry(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +extern int hfs_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); /* trans.c */ diff --git a/fs/hfs/string.c b/fs/hfs/string.c index 495a976a3cc..85b610c3909 100644 --- a/fs/hfs/string.c +++ b/fs/hfs/string.c @@ -51,8 +51,7 @@ static unsigned char caseorder[256] = { /* * Hash a string to an integer in a case-independent way */ -int hfs_hash_dentry(const struct dentry *dentry, const struct inode *inode, - struct qstr *this) +int hfs_hash_dentry(const struct dentry *dentry, struct qstr *this) { const unsigned char *name = this->name; unsigned int hash, len = this->len; @@ -93,8 +92,7 @@ int hfs_strcmp(const unsigned char *s1, unsigned int len1, * Test for equality of two strings in the HFS filename character ordering. 
* return 1 on failure and 0 on success */ -int hfs_compare_dentry(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +int hfs_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { const unsigned char *n1, *n2; diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h index 60b0a3388b2..ede79317cfb 100644 --- a/fs/hfsplus/hfsplus_fs.h +++ b/fs/hfsplus/hfsplus_fs.h @@ -495,11 +495,8 @@ int hfsplus_uni2asc(struct super_block *, const struct hfsplus_unistr *, char *, int *); int hfsplus_asc2uni(struct super_block *, struct hfsplus_unistr *, int, const char *, int); -int hfsplus_hash_dentry(const struct dentry *dentry, - const struct inode *inode, struct qstr *str); -int hfsplus_compare_dentry(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str); +int hfsplus_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); /* wrapper.c */ diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c index 2c2e47dcfdd..e8ef121a4d8 100644 --- a/fs/hfsplus/unicode.c +++ b/fs/hfsplus/unicode.c @@ -334,8 +334,7 @@ int hfsplus_asc2uni(struct super_block *sb, * Composed unicode characters are decomposed and case-folding is performed * if the appropriate bits are (un)set on the superblock. */ -int hfsplus_hash_dentry(const struct dentry *dentry, const struct inode *inode, - struct qstr *str) +int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str) { struct super_block *sb = dentry->d_sb; const char *astr; @@ -386,9 +385,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, const struct inode *inode, * Composed unicode characters are decomposed and case-folding is performed * if the appropriate bits are (un)set on the superblock. */ -int hfsplus_compare_dentry(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +int hfsplus_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { struct super_block *sb = parent->d_sb; diff --git a/fs/hpfs/dentry.c b/fs/hpfs/dentry.c index 05d4816e4e7..fa27980f222 100644 --- a/fs/hpfs/dentry.c +++ b/fs/hpfs/dentry.c @@ -12,8 +12,7 @@ * Note: the dentry argument is the parent dentry. 
*/ -static int hpfs_hash_dentry(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +static int hpfs_hash_dentry(const struct dentry *dentry, struct qstr *qstr) { unsigned long hash; int i; @@ -35,9 +34,7 @@ static int hpfs_hash_dentry(const struct dentry *dentry, const struct inode *ino return 0; } -static int hpfs_compare_dentry(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int hpfs_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { unsigned al = len; diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c index d9b8aebdeb2..c348d6d8862 100644 --- a/fs/isofs/inode.c +++ b/fs/isofs/inode.c @@ -28,31 +28,23 @@ #define BEQUIET -static int isofs_hashi(const struct dentry *parent, const struct inode *inode, - struct qstr *qstr); -static int isofs_hash(const struct dentry *parent, const struct inode *inode, - struct qstr *qstr); +static int isofs_hashi(const struct dentry *parent, struct qstr *qstr); +static int isofs_hash(const struct dentry *parent, struct qstr *qstr); static int isofs_dentry_cmpi(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, + const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); static int isofs_dentry_cmp(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, + const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); #ifdef CONFIG_JOLIET -static int isofs_hashi_ms(const struct dentry *parent, const struct inode *inode, - struct qstr *qstr); -static int isofs_hash_ms(const struct dentry *parent, const struct inode *inode, - struct qstr *qstr); +static int isofs_hashi_ms(const struct dentry *parent, struct qstr *qstr); +static int isofs_hash_ms(const struct dentry *parent, struct qstr *qstr); static int isofs_dentry_cmpi_ms(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, + const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); static int isofs_dentry_cmp_ms(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, + const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name); #endif @@ -265,30 +257,26 @@ static int isofs_dentry_cmp_common( } static int -isofs_hash(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +isofs_hash(const struct dentry *dentry, struct qstr *qstr) { return isofs_hash_common(dentry, qstr, 0); } static int -isofs_hashi(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +isofs_hashi(const struct dentry *dentry, struct qstr *qstr) { return isofs_hashi_common(dentry, qstr, 0); } static int -isofs_dentry_cmp(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +isofs_dentry_cmp(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { return isofs_dentry_cmp_common(len, str, name, 0, 0); } static int -isofs_dentry_cmpi(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +isofs_dentry_cmpi(const struct dentry *parent, const struct dentry *dentry, unsigned 
int len, const char *str, const struct qstr *name) { return isofs_dentry_cmp_common(len, str, name, 0, 1); @@ -296,30 +284,26 @@ isofs_dentry_cmpi(const struct dentry *parent, const struct inode *pinode, #ifdef CONFIG_JOLIET static int -isofs_hash_ms(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +isofs_hash_ms(const struct dentry *dentry, struct qstr *qstr) { return isofs_hash_common(dentry, qstr, 1); } static int -isofs_hashi_ms(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +isofs_hashi_ms(const struct dentry *dentry, struct qstr *qstr) { return isofs_hashi_common(dentry, qstr, 1); } static int -isofs_dentry_cmp_ms(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +isofs_dentry_cmp_ms(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { return isofs_dentry_cmp_common(len, str, name, 1, 0); } static int -isofs_dentry_cmpi_ms(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +isofs_dentry_cmpi_ms(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { return isofs_dentry_cmp_common(len, str, name, 1, 1); diff --git a/fs/isofs/namei.c b/fs/isofs/namei.c index c167028844e..95295640d9c 100644 --- a/fs/isofs/namei.c +++ b/fs/isofs/namei.c @@ -37,8 +37,7 @@ isofs_cmp(struct dentry *dentry, const char *compare, int dlen) qstr.name = compare; qstr.len = dlen; - return dentry->d_op->d_compare(NULL, NULL, NULL, NULL, - dentry->d_name.len, dentry->d_name.name, &qstr); + return dentry->d_op->d_compare(NULL, NULL, dentry->d_name.len, dentry->d_name.name, &qstr); } /* diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index 89186b7b900..8b19027291d 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -1538,8 +1538,7 @@ const struct file_operations jfs_dir_operations = { .llseek = generic_file_llseek, }; -static int jfs_ci_hash(const struct dentry *dir, const struct inode *inode, - struct qstr *this) +static int jfs_ci_hash(const struct dentry *dir, struct qstr *this) { unsigned long hash; int i; @@ -1552,9 +1551,7 @@ static int jfs_ci_hash(const struct dentry *dir, const struct inode *inode, return 0; } -static int jfs_ci_compare(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int jfs_ci_compare(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { int i, result = 1; diff --git a/fs/namei.c b/fs/namei.c index 66998b06d82..b2beee7a733 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1352,7 +1352,7 @@ static int lookup_fast(struct nameidata *nd, */ if (nd->flags & LOOKUP_RCU) { unsigned seq; - dentry = __d_lookup_rcu(parent, &nd->last, &seq, nd->inode); + dentry = __d_lookup_rcu(parent, &nd->last, &seq); if (!dentry) goto unlazy; @@ -1787,8 +1787,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) struct dentry *parent = nd->path.dentry; nd->flags &= ~LOOKUP_JUMPED; if (unlikely(parent->d_flags & DCACHE_OP_HASH)) { - err = parent->d_op->d_hash(parent, nd->inode, - &this); + err = parent->d_op->d_hash(parent, &this); if (err < 0) break; } @@ -2121,7 +2120,7 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len) * to use its own hash.. 
*/ if (base->d_flags & DCACHE_OP_HASH) { - int err = base->d_op->d_hash(base, base->d_inode, &this); + int err = base->d_op->d_hash(base, &this); if (err < 0) return ERR_PTR(err); } diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c index 3bc105d36f1..3be047474bf 100644 --- a/fs/ncpfs/dir.c +++ b/fs/ncpfs/dir.c @@ -73,10 +73,8 @@ const struct inode_operations ncp_dir_inode_operations = * Dentry operations routines */ static int ncp_lookup_validate(struct dentry *, unsigned int); -static int ncp_hash_dentry(const struct dentry *, const struct inode *, - struct qstr *); -static int ncp_compare_dentry(const struct dentry *, const struct inode *, - const struct dentry *, const struct inode *, +static int ncp_hash_dentry(const struct dentry *, struct qstr *); +static int ncp_compare_dentry(const struct dentry *, const struct dentry *, unsigned int, const char *, const struct qstr *); static int ncp_delete_dentry(const struct dentry *); @@ -119,11 +117,19 @@ static inline int ncp_case_sensitive(const struct inode *i) /* * Note: leave the hash unchanged if the directory * is case-sensitive. + * + * Accessing the parent inode can be racy under RCU pathwalking. + * Use ACCESS_ONCE() to make sure we use _one_ particular inode, + * the callers will handle races. */ static int -ncp_hash_dentry(const struct dentry *dentry, const struct inode *inode, - struct qstr *this) +ncp_hash_dentry(const struct dentry *dentry, struct qstr *this) { + struct inode *inode = ACCESS_ONCE(dentry->d_inode); + + if (!inode) + return 0; + if (!ncp_case_sensitive(inode)) { struct super_block *sb = dentry->d_sb; struct nls_table *t; @@ -140,14 +146,24 @@ ncp_hash_dentry(const struct dentry *dentry, const struct inode *inode, return 0; } +/* + * Accessing the parent inode can be racy under RCU pathwalking. + * Use ACCESS_ONCE() to make sure we use _one_ particular inode, + * the callers will handle races. + */ static int -ncp_compare_dentry(const struct dentry *parent, const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +ncp_compare_dentry(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { + struct inode *pinode; + if (len != name->len) return 1; + pinode = ACCESS_ONCE(parent->d_inode); + if (!pinode) + return 1; + if (ncp_case_sensitive(pinode)) return strncmp(str, name->name, len); diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index f3a570e7c25..71290463a1d 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -796,15 +796,16 @@ static int sysctl_is_seen(struct ctl_table_header *p) return res; } -static int proc_sys_compare(const struct dentry *parent, - const struct inode *pinode, - const struct dentry *dentry, const struct inode *inode, +static int proc_sys_compare(const struct dentry *parent, const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { struct ctl_table_header *head; + struct inode *inode; + /* Although proc doesn't have negative dentries, rcu-walk means * that inode here can be NULL */ /* AV: can it, indeed? 
*/ + inode = ACCESS_ONCE(dentry->d_inode); if (!inode) return 1; if (name->len != len) diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c index 1c0d5f26476..731b2bbcaab 100644 --- a/fs/sysv/namei.c +++ b/fs/sysv/namei.c @@ -27,8 +27,7 @@ static int add_nondir(struct dentry *dentry, struct inode *inode) return err; } -static int sysv_hash(const struct dentry *dentry, const struct inode *inode, - struct qstr *qstr) +static int sysv_hash(const struct dentry *dentry, struct qstr *qstr) { /* Truncate the name in place, avoids having to define a compare function. */ diff --git a/include/linux/dcache.h b/include/linux/dcache.h index 86da7595ba3..f42dbe14547 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -146,10 +146,8 @@ enum dentry_d_lock_class struct dentry_operations { int (*d_revalidate)(struct dentry *, unsigned int); int (*d_weak_revalidate)(struct dentry *, unsigned int); - int (*d_hash)(const struct dentry *, const struct inode *, - struct qstr *); - int (*d_compare)(const struct dentry *, const struct inode *, - const struct dentry *, const struct inode *, + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, unsigned int, const char *, const struct qstr *); int (*d_delete)(const struct dentry *); void (*d_release)(struct dentry *); @@ -302,8 +300,7 @@ extern struct dentry *d_lookup(const struct dentry *, const struct qstr *); extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *); extern struct dentry *__d_lookup(const struct dentry *, const struct qstr *); extern struct dentry *__d_lookup_rcu(const struct dentry *parent, - const struct qstr *name, - unsigned *seq, struct inode *inode); + const struct qstr *name, unsigned *seq); /** * __d_rcu_to_refcount - take a refcount on dentry if sequence check is ok -- cgit v1.2.3-70-g09d2 From 1c8c601a8c0dc59fe64907dcd9d512a3d181ddc7 Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Fri, 21 Jun 2013 08:58:15 -0400 Subject: locks: protect most of the file_lock handling with i_lock Having a global lock that protects all of this code is a clear scalability problem. Instead of doing that, move most of the code to be protected by the i_lock instead. The exceptions are the global lists that the ->fl_link sits on, and the ->fl_block list. ->fl_link is what connects these structures to the global lists, so we must ensure that we hold those locks when iterating over or updating these lists. Furthermore, sound deadlock detection requires that we hold the blocked_list state steady while checking for loops. We also must ensure that the search and update to the list are atomic. For the checking and insertion side of the blocked_list, push the acquisition of the global lock into __posix_lock_file and ensure that checking and update of the blocked_list is done without dropping the lock in between. On the removal side, when waking up blocked lock waiters, take the global lock before walking the blocked list and dequeue the waiters from the global list prior to removal from the fl_block list. With this, deadlock detection should be race free while we minimize excessive file_lock_lock thrashing. Finally, in order to avoid a lock inversion problem when handling /proc/locks output we must ensure that manipulations of the fl_block list are also protected by the file_lock_lock. 
Signed-off-by: Jeff Layton Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 21 +++-- fs/afs/flock.c | 7 +- fs/ceph/locks.c | 2 +- fs/ceph/mds_client.c | 8 +- fs/cifs/cifsfs.c | 2 +- fs/cifs/file.c | 13 +-- fs/gfs2/file.c | 2 +- fs/lockd/svcsubs.c | 12 +-- fs/locks.c | 164 ++++++++++++++++++++++++-------------- fs/nfs/delegation.c | 10 +-- fs/nfs/nfs4state.c | 8 +- fs/nfsd/nfs4state.c | 8 +- include/linux/fs.h | 11 --- 13 files changed, 155 insertions(+), 113 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f94a362f408..c2963a74fbc 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -342,7 +342,7 @@ prototypes: locking rules: - file_lock_lock may block + inode->i_lock may block fl_copy_lock: yes no fl_release_private: maybe no @@ -355,12 +355,19 @@ prototypes: int (*lm_change)(struct file_lock **, int); locking rules: - file_lock_lock may block -lm_compare_owner: yes no -lm_notify: yes no -lm_grant: no no -lm_break: yes no -lm_change yes no + + inode->i_lock file_lock_lock may block +lm_compare_owner: yes[1] maybe no +lm_notify: yes yes no +lm_grant: no no no +lm_break: yes no no +lm_change yes no no + +[1]: ->lm_compare_owner is generally called with *an* inode->i_lock held. It +may not be the i_lock of the inode for either file_lock being compared! This is +the case with deadlock detection, since the code has to chase down the owners +of locks that may be entirely unrelated to the one on which the lock is being +acquired. When doing a search for deadlocks, the file_lock_lock is also held. --------------------------- buffer_head ----------------------------------- prototypes: diff --git a/fs/afs/flock.c b/fs/afs/flock.c index 2497bf306c7..a8cf2cff836 100644 --- a/fs/afs/flock.c +++ b/fs/afs/flock.c @@ -252,7 +252,8 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key) */ static int afs_do_setlk(struct file *file, struct file_lock *fl) { - struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host); + struct inode *inode = file_inode(file); + struct afs_vnode *vnode = AFS_FS_I(inode); afs_lock_type_t type; struct key *key = file->private_data; int ret; @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl) type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE; - lock_flocks(); + spin_lock(&inode->i_lock); /* make sure we've got a callback on this file and that our view of the * data version is up to date */ @@ -420,7 +421,7 @@ given_lock: afs_vnode_fetch_status(vnode, NULL, key); error: - unlock_flocks(); + spin_unlock(&inode->i_lock); _leave(" = %d", ret); return ret; diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index ebbf680378e..690f73f4242 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -192,7 +192,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) /** * Encode the flock and fcntl locks for the given inode into the ceph_filelock - * array. Must be called with lock_flocks() already held. + * array. Must be called with inode->i_lock already held. * If we encounter more of a specific lock type than expected, return -ENOSPC. 
*/ int ceph_encode_locks_to_buffer(struct inode *inode, diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 4d2920304be..74fd2898b2a 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2481,20 +2481,20 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap, struct ceph_filelock *flocks; encode_again: - lock_flocks(); + spin_lock(&inode->i_lock); ceph_count_locks(inode, &num_fcntl_locks, &num_flock_locks); - unlock_flocks(); + spin_unlock(&inode->i_lock); flocks = kmalloc((num_fcntl_locks+num_flock_locks) * sizeof(struct ceph_filelock), GFP_NOFS); if (!flocks) { err = -ENOMEM; goto out_free; } - lock_flocks(); + spin_lock(&inode->i_lock); err = ceph_encode_locks_to_buffer(inode, flocks, num_fcntl_locks, num_flock_locks); - unlock_flocks(); + spin_unlock(&inode->i_lock); if (err) { kfree(flocks); if (err == -ENOSPC) diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index 540c1ccfcdb..a445e71746f 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -765,7 +765,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence) static int cifs_setlease(struct file *file, long arg, struct file_lock **lease) { - /* note that this is called by vfs setlease with lock_flocks held + /* note that this is called by vfs setlease with i_lock held to protect *lease from going away */ struct inode *inode = file_inode(file); struct cifsFileInfo *cfile = file->private_data; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 1686e408564..0630710a9c3 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1092,6 +1092,7 @@ struct lock_to_push { static int cifs_push_posix_locks(struct cifsFileInfo *cfile) { + struct inode *inode = cfile->dentry->d_inode; struct cifs_tcon *tcon = tlink_tcon(cfile->tlink); struct file_lock *flock, **before; unsigned int count = 0, i = 0; @@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile) xid = get_xid(); - lock_flocks(); - cifs_for_each_lock(cfile->dentry->d_inode, before) { + spin_lock(&inode->i_lock); + cifs_for_each_lock(inode, before) { if ((*before)->fl_flags & FL_POSIX) count++; } - unlock_flocks(); + spin_unlock(&inode->i_lock); INIT_LIST_HEAD(&locks_to_send); @@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile) } el = locks_to_send.next; - lock_flocks(); - cifs_for_each_lock(cfile->dentry->d_inode, before) { + spin_lock(&inode->i_lock); + cifs_for_each_lock(inode, before) { flock = *before; if ((flock->fl_flags & FL_POSIX) == 0) continue; @@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile) lck->offset = flock->fl_start; el = el->next; } - unlock_flocks(); + spin_unlock(&inode->i_lock); list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) { int stored_rc; diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index b3333371aeb..cebfd404c1d 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -889,7 +889,7 @@ out_uninit: * cluster; until we do, disable leases (by just returning -EINVAL), * unless the administrator has requested purely local locking. 
* - * Locking: called under lock_flocks + * Locking: called under i_lock * * Returns: errno */ diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c index 97e87415b14..dc5c75930f0 100644 --- a/fs/lockd/svcsubs.c +++ b/fs/lockd/svcsubs.c @@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file, again: file->f_locks = 0; - lock_flocks(); /* protects i_flock list */ + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl; fl = fl->fl_next) { if (fl->fl_lmops != &nlmsvc_lock_operations) continue; @@ -181,7 +181,7 @@ again: if (match(lockhost, host)) { struct file_lock lock = *fl; - unlock_flocks(); + spin_unlock(&inode->i_lock); lock.fl_type = F_UNLCK; lock.fl_start = 0; lock.fl_end = OFFSET_MAX; @@ -193,7 +193,7 @@ again: goto again; } } - unlock_flocks(); + spin_unlock(&inode->i_lock); return 0; } @@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file) if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares) return 1; - lock_flocks(); + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl; fl = fl->fl_next) { if (fl->fl_lmops == &nlmsvc_lock_operations) { - unlock_flocks(); + spin_unlock(&inode->i_lock); return 1; } } - unlock_flocks(); + spin_unlock(&inode->i_lock); file->f_locks = 0; return 0; } diff --git a/fs/locks.c b/fs/locks.c index 89d898bce16..ce302d43822 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -153,27 +153,37 @@ int lease_break_time = 45; #define for_each_lock(inode, lockp) \ for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next) -/* The global file_lock_list is only used for displaying /proc/locks. */ +/* + * The global file_lock_list is only used for displaying /proc/locks. Protected + * by the file_lock_lock. + */ static LIST_HEAD(file_lock_list); -/* The blocked_list is used to find POSIX lock loops for deadlock detection. */ +/* + * The blocked_list is used to find POSIX lock loops for deadlock detection. + * Protected by file_lock_lock. + */ static LIST_HEAD(blocked_list); -/* Protects the two list heads above, plus the inode->i_flock list */ +/* + * This lock protects the blocked_list, and the file_lock_list. Generally, if + * you're accessing one of those lists, you want to be holding this lock. + * + * In addition, it also protects the fl->fl_block list, and the fl->fl_next + * pointer for file_lock structures that are acting as lock requests (in + * contrast to those that are acting as records of acquired locks). + * + * Note that when we acquire this lock in order to change the above fields, + * we often hold the i_lock as well. In certain cases, when reading the fields + * protected by this lock, we can skip acquiring it iff we already hold the + * i_lock. + * + * In particular, adding an entry to the fl_block list requires that you hold + * both the i_lock and the blocked_lock_lock (acquired in that order). Deleting + * an entry from the list however only requires the file_lock_lock. 
+ */ static DEFINE_SPINLOCK(file_lock_lock); -void lock_flocks(void) -{ - spin_lock(&file_lock_lock); -} -EXPORT_SYMBOL_GPL(lock_flocks); - -void unlock_flocks(void) -{ - spin_unlock(&file_lock_lock); -} -EXPORT_SYMBOL_GPL(unlock_flocks); - static struct kmem_cache *filelock_cache __read_mostly; static void locks_init_lock_heads(struct file_lock *fl) @@ -489,13 +499,17 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2) static inline void locks_insert_global_locks(struct file_lock *fl) { + spin_lock(&file_lock_lock); list_add_tail(&fl->fl_link, &file_lock_list); + spin_unlock(&file_lock_lock); } static inline void locks_delete_global_locks(struct file_lock *fl) { + spin_lock(&file_lock_lock); list_del_init(&fl->fl_link); + spin_unlock(&file_lock_lock); } static inline void @@ -512,6 +526,8 @@ locks_delete_global_blocked(struct file_lock *waiter) /* Remove waiter from blocker's block list. * When blocker ends up pointing to itself then the list is empty. + * + * Must be called with file_lock_lock held. */ static void __locks_delete_block(struct file_lock *waiter) { @@ -520,37 +536,47 @@ static void __locks_delete_block(struct file_lock *waiter) waiter->fl_next = NULL; } -/* - */ static void locks_delete_block(struct file_lock *waiter) { - lock_flocks(); + spin_lock(&file_lock_lock); __locks_delete_block(waiter); - unlock_flocks(); + spin_unlock(&file_lock_lock); } /* Insert waiter into blocker's block list. * We use a circular list so that processes can be easily woken up in * the order they blocked. The documentation doesn't require this but * it seems like the reasonable thing to do. + * + * Must be called with file_lock_lock held! */ -static void locks_insert_block(struct file_lock *blocker, - struct file_lock *waiter) +static void __locks_insert_block(struct file_lock *blocker, + struct file_lock *waiter) { BUG_ON(!list_empty(&waiter->fl_block)); waiter->fl_next = blocker; list_add_tail(&waiter->fl_block, &blocker->fl_block); if (IS_POSIX(blocker)) - locks_insert_global_blocked(request); + locks_insert_global_blocked(waiter); +} + +/* Must be called with i_lock held. */ +static void locks_insert_block(struct file_lock *blocker, + struct file_lock *waiter) +{ + spin_lock(&file_lock_lock); + __locks_insert_block(blocker, waiter); + spin_unlock(&file_lock_lock); } /* * Wake up processes blocked waiting for blocker. * - * Must be called with the file_lock_lock held! + * Must be called with the inode->i_lock held! */ static void locks_wake_up_blocks(struct file_lock *blocker) { + spin_lock(&file_lock_lock); while (!list_empty(&blocker->fl_block)) { struct file_lock *waiter; @@ -562,10 +588,13 @@ static void locks_wake_up_blocks(struct file_lock *blocker) else wake_up(&waiter->fl_wait); } + spin_unlock(&file_lock_lock); } /* Insert file lock fl into an inode's lock list at the position indicated * by pos. At the same time add the lock to the global file lock list. + * + * Must be called with the i_lock held! */ static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl) { @@ -583,6 +612,8 @@ static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl) * Wake up processes that are blocked waiting for this lock, * notify the FS that the lock has been cleared and * finally free the lock. + * + * Must be called with the i_lock held! 
*/ static void locks_delete_lock(struct file_lock **thisfl_p) { @@ -652,8 +683,9 @@ void posix_test_lock(struct file *filp, struct file_lock *fl) { struct file_lock *cfl; + struct inode *inode = file_inode(filp); - lock_flocks(); + spin_lock(&inode->i_lock); for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) { if (!IS_POSIX(cfl)) continue; @@ -666,7 +698,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) fl->fl_pid = pid_vnr(cfl->fl_nspid); } else fl->fl_type = F_UNLCK; - unlock_flocks(); + spin_unlock(&inode->i_lock); return; } EXPORT_SYMBOL(posix_test_lock); @@ -710,6 +742,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl) return NULL; } +/* Must be called with the file_lock_lock held! */ static int posix_locks_deadlock(struct file_lock *caller_fl, struct file_lock *block_fl) { @@ -745,7 +778,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) return -ENOMEM; } - lock_flocks(); + spin_lock(&inode->i_lock); if (request->fl_flags & FL_ACCESS) goto find_conflict; @@ -775,9 +808,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) * give it the opportunity to lock the file. */ if (found) { - unlock_flocks(); + spin_unlock(&inode->i_lock); cond_resched(); - lock_flocks(); + spin_lock(&inode->i_lock); } find_conflict: @@ -804,7 +837,7 @@ find_conflict: error = 0; out: - unlock_flocks(); + spin_unlock(&inode->i_lock); if (new_fl) locks_free_lock(new_fl); return error; @@ -834,7 +867,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str new_fl2 = locks_alloc_lock(); } - lock_flocks(); + spin_lock(&inode->i_lock); /* * New lock request. Walk all POSIX locks and look for conflicts. If * there are any, either return error or put the request on the @@ -852,11 +885,17 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str error = -EAGAIN; if (!(request->fl_flags & FL_SLEEP)) goto out; + /* + * Deadlock detection and insertion into the blocked + * locks list must be done while holding the same lock! + */ error = -EDEADLK; - if (posix_locks_deadlock(request, fl)) - goto out; - error = FILE_LOCK_DEFERRED; - locks_insert_block(fl, request); + spin_lock(&file_lock_lock); + if (likely(!posix_locks_deadlock(request, fl))) { + error = FILE_LOCK_DEFERRED; + __locks_insert_block(fl, request); + } + spin_unlock(&file_lock_lock); goto out; } } @@ -1006,7 +1045,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str locks_wake_up_blocks(left); } out: - unlock_flocks(); + spin_unlock(&inode->i_lock); /* * Free any unused locks. */ @@ -1081,14 +1120,14 @@ int locks_mandatory_locked(struct inode *inode) /* * Search the lock list for this inode for any POSIX locks. */ - lock_flocks(); + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (!IS_POSIX(fl)) continue; if (fl->fl_owner != owner) break; } - unlock_flocks(); + spin_unlock(&inode->i_lock); return fl ? 
-EAGAIN : 0; } @@ -1231,7 +1270,7 @@ int __break_lease(struct inode *inode, unsigned int mode) if (IS_ERR(new_fl)) return PTR_ERR(new_fl); - lock_flocks(); + spin_lock(&inode->i_lock); time_out_leases(inode); @@ -1281,11 +1320,11 @@ restart: break_time++; } locks_insert_block(flock, new_fl); - unlock_flocks(); + spin_unlock(&inode->i_lock); error = wait_event_interruptible_timeout(new_fl->fl_wait, !new_fl->fl_next, break_time); - lock_flocks(); - __locks_delete_block(new_fl); + spin_lock(&inode->i_lock); + locks_delete_block(new_fl); if (error >= 0) { if (error == 0) time_out_leases(inode); @@ -1302,7 +1341,7 @@ restart: } out: - unlock_flocks(); + spin_unlock(&inode->i_lock); locks_free_lock(new_fl); return error; } @@ -1355,9 +1394,10 @@ EXPORT_SYMBOL(lease_get_mtime); int fcntl_getlease(struct file *filp) { struct file_lock *fl; + struct inode *inode = file_inode(filp); int type = F_UNLCK; - lock_flocks(); + spin_lock(&inode->i_lock); time_out_leases(file_inode(filp)); for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl); fl = fl->fl_next) { @@ -1366,7 +1406,7 @@ int fcntl_getlease(struct file *filp) break; } } - unlock_flocks(); + spin_unlock(&inode->i_lock); return type; } @@ -1460,7 +1500,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp) * The (input) flp->fl_lmops->lm_break function is required * by break_lease(). * - * Called with file_lock_lock held. + * Called with inode->i_lock held. */ int generic_setlease(struct file *filp, long arg, struct file_lock **flp) { @@ -1529,11 +1569,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease) int vfs_setlease(struct file *filp, long arg, struct file_lock **lease) { + struct inode *inode = file_inode(filp); int error; - lock_flocks(); + spin_lock(&inode->i_lock); error = __vfs_setlease(filp, arg, lease); - unlock_flocks(); + spin_unlock(&inode->i_lock); return error; } @@ -1551,6 +1592,7 @@ static int do_fcntl_delete_lease(struct file *filp) static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg) { struct file_lock *fl, *ret; + struct inode *inode = file_inode(filp); struct fasync_struct *new; int error; @@ -1564,10 +1606,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg) return -ENOMEM; } ret = fl; - lock_flocks(); + spin_lock(&inode->i_lock); error = __vfs_setlease(filp, arg, &ret); if (error) { - unlock_flocks(); + spin_unlock(&inode->i_lock); locks_free_lock(fl); goto out_free_fasync; } @@ -1584,7 +1626,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg) new = NULL; error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0); - unlock_flocks(); + spin_unlock(&inode->i_lock); out_free_fasync: if (new) @@ -2108,7 +2150,7 @@ void locks_remove_flock(struct file *filp) fl.fl_ops->fl_release_private(&fl); } - lock_flocks(); + spin_lock(&inode->i_lock); before = &inode->i_flock; while ((fl = *before) != NULL) { @@ -2126,7 +2168,7 @@ void locks_remove_flock(struct file *filp) } before = &fl->fl_next; } - unlock_flocks(); + spin_unlock(&inode->i_lock); } /** @@ -2140,12 +2182,12 @@ posix_unblock_lock(struct file_lock *waiter) { int status = 0; - lock_flocks(); + spin_lock(&file_lock_lock); if (waiter->fl_next) __locks_delete_block(waiter); else status = -ENOENT; - unlock_flocks(); + spin_unlock(&file_lock_lock); return status; } EXPORT_SYMBOL(posix_unblock_lock); @@ -2259,7 +2301,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos) { loff_t *p = f->private; - lock_flocks(); + 
spin_lock(&file_lock_lock); *p = (*pos + 1); return seq_list_start(&file_lock_list, *pos); } @@ -2273,7 +2315,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos) static void locks_stop(struct seq_file *f, void *v) { - unlock_flocks(); + spin_unlock(&file_lock_lock); } static const struct seq_operations locks_seq_operations = { @@ -2320,7 +2362,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len) { struct file_lock *fl; int result = 1; - lock_flocks(); + + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (IS_POSIX(fl)) { if (fl->fl_type == F_RDLCK) @@ -2337,7 +2380,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len) result = 0; break; } - unlock_flocks(); + spin_unlock(&inode->i_lock); return result; } @@ -2360,7 +2403,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len) { struct file_lock *fl; int result = 1; - lock_flocks(); + + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (IS_POSIX(fl)) { if ((fl->fl_end < start) || (fl->fl_start > (start + len))) @@ -2375,7 +2419,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len) result = 0; break; } - unlock_flocks(); + spin_unlock(&inode->i_lock); return result; } diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c index 57db3244f4d..7ec4814e298 100644 --- a/fs/nfs/delegation.c +++ b/fs/nfs/delegation.c @@ -73,20 +73,20 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_ if (inode->i_flock == NULL) goto out; - /* Protect inode->i_flock using the file locks lock */ - lock_flocks(); + /* Protect inode->i_flock using the i_lock */ + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK))) continue; if (nfs_file_open_context(fl->fl_file) != ctx) continue; - unlock_flocks(); + spin_unlock(&inode->i_lock); status = nfs4_lock_delegation_recall(fl, state, stateid); if (status < 0) goto out; - lock_flocks(); + spin_lock(&inode->i_lock); } - unlock_flocks(); + spin_unlock(&inode->i_lock); out: return status; } diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index 1fab140764c..ff10b4aa534 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_ /* Guard against delegation returns and new lock/unlock calls */ down_write(&nfsi->rwsem); /* Protect inode->i_flock using the BKL */ - lock_flocks(); + spin_lock(&inode->i_lock); for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) { if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK))) continue; if (nfs_file_open_context(fl->fl_file)->state != state) continue; - unlock_flocks(); + spin_unlock(&inode->i_lock); status = ops->recover_lock(state, fl); switch (status) { case 0: @@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_ /* kill_proc(fl->fl_pid, SIGLOST, 1); */ status = 0; } - lock_flocks(); + spin_lock(&inode->i_lock); } - unlock_flocks(); + spin_unlock(&inode->i_lock); out: up_write(&nfsi->rwsem); return status; diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 316ec843dec..f17051838b4 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp) list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru); - /* only place dl_time is set. 
protected by lock_flocks*/ + /* Only place dl_time is set; protected by i_lock: */ dp->dl_time = get_seconds(); nfsd4_cb_recall(dp); } -/* Called from break_lease() with lock_flocks() held. */ +/* Called from break_lease() with i_lock held. */ static void nfsd_break_deleg_cb(struct file_lock *fl) { struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner; @@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner) struct inode *inode = filp->fi_inode; int status = 0; - lock_flocks(); + spin_lock(&inode->i_lock); for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) { if ((*flpp)->fl_owner == (fl_owner_t)lowner) { status = 1; @@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner) } } out: - unlock_flocks(); + spin_unlock(&inode->i_lock); return status; } diff --git a/include/linux/fs.h b/include/linux/fs.h index ed9fdaaf322..24fe998795e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1024,8 +1024,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **); extern int lease_modify(struct file_lock **, int); extern int lock_may_read(struct inode *, loff_t start, unsigned long count); extern int lock_may_write(struct inode *, loff_t start, unsigned long count); -extern void lock_flocks(void); -extern void unlock_flocks(void); #else /* !CONFIG_FILE_LOCKING */ static inline int fcntl_getlk(struct file *file, struct flock __user *user) { @@ -1166,15 +1164,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start, { return 1; } - -static inline void lock_flocks(void) -{ -} - -static inline void unlock_flocks(void) -{ -} - #endif /* !CONFIG_FILE_LOCKING */ -- cgit v1.2.3-70-g09d2 From 3999e49364193f7dbbba66e2be655fe91ba1fced Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Fri, 21 Jun 2013 08:58:19 -0400 Subject: locks: add a new "lm_owner_key" lock operation Currently, the hashing that the locking code uses to add these values to the blocked_hash is simply calculated using fl_owner field. That's valid in most cases except for server-side lockd, which validates the owner of a lock based on fl_owner and fl_pid. In the case where you have a small number of NFS clients doing a lot of locking between different processes, you could end up with all the blocked requests sitting in a very small number of hash buckets. Add a new lm_owner_key operation to the lock_manager_operations that will generate an unsigned long to use as the key in the hashtable. That function is only implemented for server-side lockd, and simply XORs the fl_owner and fl_pid. Signed-off-by: Jeff Layton Acked-by: J. 
Bruce Fields Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 16 +++++++++++----- fs/lockd/svclock.c | 12 ++++++++++++ fs/locks.c | 12 ++++++++++-- include/linux/fs.h | 1 + 4 files changed, 34 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index c2963a74fbc..2db7c9e492e 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -349,6 +349,7 @@ fl_release_private: maybe no ----------------------- lock_manager_operations --------------------------- prototypes: int (*lm_compare_owner)(struct file_lock *, struct file_lock *); + unsigned long (*lm_owner_key)(struct file_lock *); void (*lm_notify)(struct file_lock *); /* unblock callback */ int (*lm_grant)(struct file_lock *, struct file_lock *, int); void (*lm_break)(struct file_lock *); /* break_lease callback */ @@ -358,16 +359,21 @@ locking rules: inode->i_lock file_lock_lock may block lm_compare_owner: yes[1] maybe no +lm_owner_key yes[1] yes no lm_notify: yes yes no lm_grant: no no no lm_break: yes no no lm_change yes no no -[1]: ->lm_compare_owner is generally called with *an* inode->i_lock held. It -may not be the i_lock of the inode for either file_lock being compared! This is -the case with deadlock detection, since the code has to chase down the owners -of locks that may be entirely unrelated to the one on which the lock is being -acquired. When doing a search for deadlocks, the file_lock_lock is also held. +[1]: ->lm_compare_owner and ->lm_owner_key are generally called with +*an* inode->i_lock held. It may not be the i_lock of the inode +associated with either file_lock argument! This is the case with deadlock +detection, since the code has to chase down the owners of locks that may +be entirely unrelated to the one on which the lock is being acquired. +For deadlock detection however, the file_lock_lock is also held. The +fact that these locks are held ensures that the file_locks do not +disappear out from under you while doing the comparison or generating an +owner key. --------------------------- buffer_head ----------------------------------- prototypes: diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c index a469098682c..067778b0ccc 100644 --- a/fs/lockd/svclock.c +++ b/fs/lockd/svclock.c @@ -744,8 +744,20 @@ static int nlmsvc_same_owner(struct file_lock *fl1, struct file_lock *fl2) return fl1->fl_owner == fl2->fl_owner && fl1->fl_pid == fl2->fl_pid; } +/* + * Since NLM uses two "keys" for tracking locks, we need to hash them down + * to one for the blocked_hash. Here, we're just xor'ing the host address + * with the pid in order to create a key value for picking a hash bucket. 
+ */ +static unsigned long +nlmsvc_owner_key(struct file_lock *fl) +{ + return (unsigned long)fl->fl_owner ^ (unsigned long)fl->fl_pid; +} + const struct lock_manager_operations nlmsvc_lock_operations = { .lm_compare_owner = nlmsvc_same_owner, + .lm_owner_key = nlmsvc_owner_key, .lm_notify = nlmsvc_notify_blocked, .lm_grant = nlmsvc_grant_deferred, }; diff --git a/fs/locks.c b/fs/locks.c index 71d847cbbb6..6242e0b1c69 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -521,10 +521,18 @@ locks_delete_global_locks(struct file_lock *fl) spin_unlock(&file_lock_lock); } +static unsigned long +posix_owner_key(struct file_lock *fl) +{ + if (fl->fl_lmops && fl->fl_lmops->lm_owner_key) + return fl->fl_lmops->lm_owner_key(fl); + return (unsigned long)fl->fl_owner; +} + static inline void locks_insert_global_blocked(struct file_lock *waiter) { - hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner); + hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter)); } static inline void @@ -757,7 +765,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl) { struct file_lock *fl; - hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) { + hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) { if (posix_same_owner(fl, block_fl)) return fl->fl_next; } diff --git a/include/linux/fs.h b/include/linux/fs.h index fab064a3b65..a137a73fc1f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -908,6 +908,7 @@ struct file_lock_operations { struct lock_manager_operations { int (*lm_compare_owner)(struct file_lock *, struct file_lock *); + unsigned long (*lm_owner_key)(struct file_lock *); void (*lm_notify)(struct file_lock *); /* unblock callback */ int (*lm_grant)(struct file_lock *, struct file_lock *, int); void (*lm_break)(struct file_lock *); -- cgit v1.2.3-70-g09d2 From 7b2296afb392bc21a50f42e7c7f4b19d3fea8c6d Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Fri, 21 Jun 2013 08:58:20 -0400 Subject: locks: give the blocked_hash its own spinlock There's no reason we have to protect the blocked_hash and file_lock_list with the same spinlock. With the tests I have, breaking it in two gives a barely measurable performance benefit, but it seems reasonable to make this locking as granular as possible. Signed-off-by: Jeff Layton Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 16 +++++++-------- fs/locks.c | 41 +++++++++++++++++++++------------------ 2 files changed, 30 insertions(+), 27 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 2db7c9e492e..7d9ca7a83fc 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -357,20 +357,20 @@ prototypes: locking rules: - inode->i_lock file_lock_lock may block -lm_compare_owner: yes[1] maybe no -lm_owner_key yes[1] yes no -lm_notify: yes yes no -lm_grant: no no no -lm_break: yes no no -lm_change yes no no + inode->i_lock blocked_lock_lock may block +lm_compare_owner: yes[1] maybe no +lm_owner_key yes[1] yes no +lm_notify: yes yes no +lm_grant: no no no +lm_break: yes no no +lm_change yes no no [1]: ->lm_compare_owner and ->lm_owner_key are generally called with *an* inode->i_lock held. It may not be the i_lock of the inode associated with either file_lock argument! 
This is the case with deadlock detection, since the code has to chase down the owners of locks that may be entirely unrelated to the one on which the lock is being acquired. -For deadlock detection however, the file_lock_lock is also held. The +For deadlock detection however, the blocked_lock_lock is also held. The fact that these locks are held ensures that the file_locks do not disappear out from under you while doing the comparison or generating an owner key. diff --git a/fs/locks.c b/fs/locks.c index 6242e0b1c69..04e2c1fdb15 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -159,10 +159,11 @@ int lease_break_time = 45; * by the file_lock_lock. */ static HLIST_HEAD(file_lock_list); +static DEFINE_SPINLOCK(file_lock_lock); /* * The blocked_hash is used to find POSIX lock loops for deadlock detection. - * It is protected by file_lock_lock. + * It is protected by blocked_lock_lock. * * We hash locks by lockowner in order to optimize searching for the lock a * particular lockowner is waiting on. @@ -175,8 +176,8 @@ static HLIST_HEAD(file_lock_list); static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS); /* - * This lock protects the blocked_hash and the file_lock_list. Generally, if - * you're accessing one of those lists, you want to be holding this lock. + * This lock protects the blocked_hash. Generally, if you're accessing it, you + * want to be holding this lock. * * In addition, it also protects the fl->fl_block list, and the fl->fl_next * pointer for file_lock structures that are acting as lock requests (in @@ -191,7 +192,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS); * both the i_lock and the blocked_lock_lock (acquired in that order). Deleting * an entry from the list however only requires the file_lock_lock. */ -static DEFINE_SPINLOCK(file_lock_lock); +static DEFINE_SPINLOCK(blocked_lock_lock); static struct kmem_cache *filelock_cache __read_mostly; @@ -544,7 +545,7 @@ locks_delete_global_blocked(struct file_lock *waiter) /* Remove waiter from blocker's block list. * When blocker ends up pointing to itself then the list is empty. * - * Must be called with file_lock_lock held. + * Must be called with blocked_lock_lock held. */ static void __locks_delete_block(struct file_lock *waiter) { @@ -555,9 +556,9 @@ static void __locks_delete_block(struct file_lock *waiter) static void locks_delete_block(struct file_lock *waiter) { - spin_lock(&file_lock_lock); + spin_lock(&blocked_lock_lock); __locks_delete_block(waiter); - spin_unlock(&file_lock_lock); + spin_unlock(&blocked_lock_lock); } /* Insert waiter into blocker's block list. @@ -565,9 +566,9 @@ static void locks_delete_block(struct file_lock *waiter) * the order they blocked. The documentation doesn't require this but * it seems like the reasonable thing to do. * - * Must be called with both the i_lock and file_lock_lock held. The fl_block + * Must be called with both the i_lock and blocked_lock_lock held. The fl_block * list itself is protected by the file_lock_list, but by ensuring that the - * i_lock is also held on insertions we can avoid taking the file_lock_lock + * i_lock is also held on insertions we can avoid taking the blocked_lock_lock * in some cases when we see that the fl_block list is empty. 
*/ static void __locks_insert_block(struct file_lock *blocker, @@ -584,9 +585,9 @@ static void __locks_insert_block(struct file_lock *blocker, static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter) { - spin_lock(&file_lock_lock); + spin_lock(&blocked_lock_lock); __locks_insert_block(blocker, waiter); - spin_unlock(&file_lock_lock); + spin_unlock(&blocked_lock_lock); } /* @@ -601,12 +602,12 @@ static void locks_wake_up_blocks(struct file_lock *blocker) * blocked requests are only added to the list under the i_lock, and * the i_lock is always held here. Note that removal from the fl_block * list does not require the i_lock, so we must recheck list_empty() - * after acquiring the file_lock_lock. + * after acquiring the blocked_lock_lock. */ if (list_empty(&blocker->fl_block)) return; - spin_lock(&file_lock_lock); + spin_lock(&blocked_lock_lock); while (!list_empty(&blocker->fl_block)) { struct file_lock *waiter; @@ -618,7 +619,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker) else wake_up(&waiter->fl_wait); } - spin_unlock(&file_lock_lock); + spin_unlock(&blocked_lock_lock); } /* Insert file lock fl into an inode's lock list at the position indicated @@ -772,7 +773,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl) return NULL; } -/* Must be called with the file_lock_lock held! */ +/* Must be called with the blocked_lock_lock held! */ static int posix_locks_deadlock(struct file_lock *caller_fl, struct file_lock *block_fl) { @@ -920,12 +921,12 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str * locks list must be done while holding the same lock! */ error = -EDEADLK; - spin_lock(&file_lock_lock); + spin_lock(&blocked_lock_lock); if (likely(!posix_locks_deadlock(request, fl))) { error = FILE_LOCK_DEFERRED; __locks_insert_block(fl, request); } - spin_unlock(&file_lock_lock); + spin_unlock(&blocked_lock_lock); goto out; } } @@ -2212,12 +2213,12 @@ posix_unblock_lock(struct file_lock *waiter) { int status = 0; - spin_lock(&file_lock_lock); + spin_lock(&blocked_lock_lock); if (waiter->fl_next) __locks_delete_block(waiter); else status = -ENOENT; - spin_unlock(&file_lock_lock); + spin_unlock(&blocked_lock_lock); return status; } EXPORT_SYMBOL(posix_unblock_lock); @@ -2332,6 +2333,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos) loff_t *p = f->private; spin_lock(&file_lock_lock); + spin_lock(&blocked_lock_lock); *p = (*pos + 1); return seq_hlist_start(&file_lock_list, *pos); } @@ -2345,6 +2347,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos) static void locks_stop(struct seq_file *f, void *v) { + spin_unlock(&blocked_lock_lock); spin_unlock(&file_lock_lock); } -- cgit v1.2.3-70-g09d2 From 48bde8d3620f5f3c6ae9ff599eb404055ae51664 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Wed, 3 Jul 2013 16:19:23 +0400 Subject: Document ->tmpfile() Signed-off-by: Al Viro --- Documentation/filesystems/Locking | 2 ++ Documentation/filesystems/vfs.txt | 5 +++++ 2 files changed, 7 insertions(+) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 7d9ca7a83fc..e95d3131309 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -64,6 +64,7 @@ prototypes: int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened); + int (*tmpfile) (struct inode *, struct dentry *, umode_t); locking rules: all may block @@ 
-91,6 +92,7 @@ removexattr: yes fiemap: no update_time: no atomic_open: yes +tmpfile: no Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on victim. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 51ba44e3fc4..aeff462c722 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -360,6 +360,8 @@ struct inode_operations { int (*removexattr) (struct dentry *, const char *); void (*update_time)(struct inode *, struct timespec *, int); int (*atomic_open)(struct inode *, struct dentry *, + int (*tmpfile) (struct inode *, struct dentry *, umode_t); +} ____cacheline_aligned; struct file *, unsigned open_flag, umode_t create_mode, int *opened); }; @@ -472,6 +474,9 @@ otherwise noted. component is negative or needs lookup. Cached positive dentries are still handled by f_op->open(). + tmpfile: called in the end of O_TMPFILE open(). Optional, equivalent to + atomically creating, opening and unlinking a file in given directory. + The Address Space Object ======================== -- cgit v1.2.3-70-g09d2 From 0f8975ec4db2c8b5bd111b211292ca9be0feb6b8 Mon Sep 17 00:00:00 2001 From: Pavel Emelyanov Date: Wed, 3 Jul 2013 15:01:20 -0700 Subject: mm: soft-dirty bits for user memory changes tracking The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. Wait some time. 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: Pavel Emelyanov Cc: Matt Mackall Cc: Xiao Guangrong Cc: Glauber Costa Cc: Marcelo Tosatti Cc: KOSAKI Motohiro Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 7 +++++- Documentation/vm/soft-dirty.txt | 36 +++++++++++++++++++++++++++ arch/Kconfig | 3 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 24 ++++++++++++++++-- arch/x86/include/asm/pgtable_types.h | 12 +++++++++ fs/proc/task_mmu.c | 47 ++++++++++++++++++++++++++++++++---- include/asm-generic/pgtable.h | 22 +++++++++++++++++ mm/Kconfig | 12 +++++++++ mm/huge_memory.c | 2 +- mm/mremap.c | 2 +- 11 files changed, 158 insertions(+), 10 deletions(-) create mode 100644 Documentation/vm/soft-dirty.txt (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index fd8d0d594fc..fcc22c982a2 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -473,7 +473,8 @@ This file is only present if the CONFIG_MMU kernel configuration option is enabled. 
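The three tracking steps above can be exercised entirely from userspace. The following is only a minimal sketch, assuming a process inspecting one of its own pages on a kernel built with CONFIG_MEM_SOFT_DIRTY; the paths and the bit-55 pagemap layout follow the description above, and error handling is omitted:

/* Touch a page, clear the soft-dirty bits, dirty the page again, then
 * read bit 55 of its /proc/self/pagemap entry to see the soft-dirty flag. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int page_soft_dirty(const void *addr)
{
	uint64_t entry = 0;
	long psize = sysconf(_SC_PAGESIZE);
	off_t off = (uintptr_t)addr / psize * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	pread(fd, &entry, sizeof(entry), off);
	close(fd);
	return (entry >> 55) & 1;		/* bit 55: soft-dirty */
}

int main(void)
{
	static char page[4096];
	int fd;

	page[0] = 1;				/* make sure the PTE exists */

	fd = open("/proc/self/clear_refs", O_WRONLY);
	write(fd, "4", 1);			/* step 1: clear soft-dirty bits */
	close(fd);

	page[0] = 2;				/* step 2: write to the page */

	printf("soft-dirty: %d\n", page_soft_dirty(page));	/* step 3 */
	return 0;
}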
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG -bits on both physical and virtual pages associated with a process. +bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -482,6 +483,10 @@ To clear the bits for the anonymous pages associated with the process To clear the bits for the file mapped pages associated with the process > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit + > echo 4 > /proc/PID/clear_refs + Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt new file mode 100644 index 00000000000..9a12a5956bc --- /dev/null +++ b/Documentation/vm/soft-dirty.txt @@ -0,0 +1,36 @@ + SOFT-DIRTY PTEs + + The soft-dirty is a bit on a PTE which helps to track which pages a task +writes to. In order to do this tracking one should + + 1. Clear soft-dirty bits from the task's PTEs. + + This is done by writing "4" into the /proc/PID/clear_refs file of the + task in question. + + 2. Wait some time. + + 3. Read soft-dirty bits from the PTEs. + + This is done by reading from the /proc/PID/pagemap. The bit 55 of the + 64-bit qword is the soft-dirty one. If set, the respective PTE was + written to since step 1. + + + Internally, to do this tracking, the writable bit is cleared from PTEs +when the soft-dirty bit is cleared. So, after this, when the task tries to +modify a page at some virtual address the #PF occurs and the kernel sets +the soft-dirty bit on the respective PTE. + + Note, that although all the task's address space is marked as r/o after the +soft-dirty bits clear, the #PF-s that occur after that are processed fast. +This is so, since the pages are still mapped to physical memory, and thus all +the kernel does is finds this fact out and puts both writable and soft-dirty +bits on the PTE. + + + This feature is actively used by the checkpoint-restore project. 
You +can find more details about it on http://criu.org + + +-- Pavel Emelyanov, Apr 9, 2013 diff --git a/arch/Kconfig b/arch/Kconfig index a4429bcd609..8d2ae24b9f4 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -365,6 +365,9 @@ config HAVE_IRQ_TIME_ACCOUNTING config HAVE_ARCH_TRANSPARENT_HUGEPAGE bool +config HAVE_ARCH_SOFT_DIRTY + bool + config HAVE_MOD_ARCH_SPECIFIC bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b094816a7e0..10764a3d62c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -102,6 +102,7 @@ config X86 select HAVE_ARCH_SECCOMP_FILTER select BUILDTIME_EXTABLE_SORT select GENERIC_CMOS_UPDATE + select HAVE_ARCH_SOFT_DIRTY select CLOCKSOURCE_WATCHDOG select GENERIC_CLOCKEVENTS select ARCH_CLOCKSOURCE_DATA if X86_64 diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 5b0818bc896..7dc305a4605 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -207,7 +207,7 @@ static inline pte_t pte_mkexec(pte_t pte) static inline pte_t pte_mkdirty(pte_t pte) { - return pte_set_flags(pte, _PAGE_DIRTY); + return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } static inline pte_t pte_mkyoung(pte_t pte) @@ -271,7 +271,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd) static inline pmd_t pmd_mkdirty(pmd_t pmd) { - return pmd_set_flags(pmd, _PAGE_DIRTY); + return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } static inline pmd_t pmd_mkhuge(pmd_t pmd) @@ -294,6 +294,26 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd) return pmd_clear_flags(pmd, _PAGE_PRESENT); } +static inline int pte_soft_dirty(pte_t pte) +{ + return pte_flags(pte) & _PAGE_SOFT_DIRTY; +} + +static inline int pmd_soft_dirty(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SOFT_DIRTY; +} + +static inline pte_t pte_mksoft_dirty(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_SOFT_DIRTY); +} + +static inline pmd_t pmd_mksoft_dirty(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY); +} + /* * Mask out unsupported bits in a present pgprot. Non-present pgprots * can use those bits for other purposes, so leave them be. diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index e6423002c10..c98ac63aae4 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -55,6 +55,18 @@ #define _PAGE_HIDDEN (_AT(pteval_t, 0)) #endif +/* + * The same hidden bit is used by kmemcheck, but since kmemcheck + * works on kernel pages while soft-dirty engine on user space, + * they do not conflict with each other. + */ + +#ifdef CONFIG_MEM_SOFT_DIRTY +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN) +#else +#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #else diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 39d64129257..a18e065c1c3 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -692,13 +693,32 @@ enum clear_refs_types { CLEAR_REFS_ALL = 1, CLEAR_REFS_ANON, CLEAR_REFS_MAPPED, + CLEAR_REFS_SOFT_DIRTY, CLEAR_REFS_LAST, }; struct clear_refs_private { struct vm_area_struct *vma; + enum clear_refs_types type; }; +static inline void clear_soft_dirty(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte) +{ +#ifdef CONFIG_MEM_SOFT_DIRTY + /* + * The soft-dirty tracker uses #PF-s to catch writes + * to pages, so write-protect the pte as well. 
See the + * Documentation/vm/soft-dirty.txt for full description + * of how soft-dirty works. + */ + pte_t ptent = *pte; + ptent = pte_wrprotect(ptent); + ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY); + set_pte_at(vma->vm_mm, addr, pte, ptent); +#endif +} + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -718,6 +738,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, if (!pte_present(ptent)) continue; + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + clear_soft_dirty(vma, addr, pte); + continue; + } + page = vm_normal_page(vma, addr, ptent); if (!page) continue; @@ -759,6 +784,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, mm = get_task_mm(task); if (mm) { struct clear_refs_private cp = { + .type = type, }; struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range, @@ -766,6 +792,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, .private = &cp, }; down_read(&mm->mmap_sem); + if (type == CLEAR_REFS_SOFT_DIRTY) + mmu_notifier_invalidate_range_start(mm, 0, -1); for (vma = mm->mmap; vma; vma = vma->vm_next) { cp.vma = vma; if (is_vm_hugetlb_page(vma)) @@ -786,6 +814,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, walk_page_range(vma->vm_start, vma->vm_end, &clear_refs_walk); } + if (type == CLEAR_REFS_SOFT_DIRTY) + mmu_notifier_invalidate_range_end(mm, 0, -1); flush_tlb_mm(mm); up_read(&mm->mmap_sem); mmput(mm); @@ -827,6 +857,7 @@ struct pagemapread { /* in "new" pagemap pshift bits are occupied with more status bits */ #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT)) +#define __PM_SOFT_DIRTY (1LL) #define PM_PRESENT PM_STATUS(4LL) #define PM_SWAP PM_STATUS(2LL) #define PM_FILE PM_STATUS(1LL) @@ -868,6 +899,7 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, { u64 frame, flags; struct page *page = NULL; + int flags2 = 0; if (pte_present(pte)) { frame = pte_pfn(pte); @@ -888,13 +920,15 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, if (page && !PageAnon(page)) flags |= PM_FILE; + if (pte_soft_dirty(pte)) + flags2 |= __PM_SOFT_DIRTY; - *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags); + *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, - pmd_t pmd, int offset) + pmd_t pmd, int offset, int pmd_flags2) { /* * Currently pmd for thp is always present because thp can not be @@ -903,13 +937,13 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p */ if (pmd_present(pmd)) *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset) - | PM_STATUS2(pm->v2, 0) | PM_PRESENT); + | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT); else *pme = make_pme(PM_NOT_PRESENT(pm->v2)); } #else static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, - pmd_t pmd, int offset) + pmd_t pmd, int offset, int pmd_flags2) { } #endif @@ -926,12 +960,15 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); if (vma && pmd_trans_huge_lock(pmd, vma) == 1) { + int pmd_flags2; + + pmd_flags2 = (pmd_soft_dirty(*pmd) ? 
__PM_SOFT_DIRTY : 0); for (; addr != end; addr += PAGE_SIZE) { unsigned long offset; offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset); + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); err = add_to_pagemap(addr, &pme, pm); if (err) break; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index b1836987d50..a7126d28f4c 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -396,6 +396,28 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm, #define arch_start_context_switch(prev) do {} while (0) #endif +#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY +static inline int pte_soft_dirty(pte_t pte) +{ + return 0; +} + +static inline int pmd_soft_dirty(pmd_t pmd) +{ + return 0; +} + +static inline pte_t pte_mksoft_dirty(pte_t pte) +{ + return pte; +} + +static inline pmd_t pmd_mksoft_dirty(pmd_t pmd) +{ + return pmd; +} +#endif + #ifndef __HAVE_PFNMAP_TRACKING /* * Interfaces that can be used by architecture code to keep track of diff --git a/mm/Kconfig b/mm/Kconfig index f5e698e30d4..7e28ecfa8aa 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -477,3 +477,15 @@ config FRONTSWAP and swap data is stored as normal on the matching swap device. If unsure, say Y to enable frontswap. + +config MEM_SOFT_DIRTY + bool "Track memory changes" + depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY + select PROC_PAGE_MONITOR + help + This option enables memory changes tracking by introducing a + soft-dirty bit on pte-s. This bit it set when someone writes + into a page just as regular dirty bit, but unlike the latter + it can be cleared by hands. + + See Documentation/vm/soft-dirty.txt for more details. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 362c329b83f..d8b3b850150 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1429,7 +1429,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, if (ret == 1) { pmd = pmdp_get_and_clear(mm, old_addr, old_pmd); VM_BUG_ON(!pmd_none(*new_pmd)); - set_pmd_at(mm, new_addr, new_pmd, pmd); + set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd)); spin_unlock(&mm->page_table_lock); } out: diff --git a/mm/mremap.c b/mm/mremap.c index 463a25705ac..3708655378e 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, continue; pte = ptep_get_and_clear(mm, old_addr, old_pte); pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); - set_pte_at(mm, new_addr, new_pte, pte); + set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte)); } arch_leave_lazy_mmu_mode(); -- cgit v1.2.3-70-g09d2 From 26c0c5bf38159673f0ae28c38fc9f90dbeb4d4aa Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:04:45 -0700 Subject: documentation: update address_space_operations The documentation for address_space_operations is partially out of date. Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/vfs.txt | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 1f0ba30ae47..fc5d2a1d26c 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -559,7 +559,6 @@ your filesystem. 
The following members are defined: struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); - int (*sync_page)(struct page *); int (*writepages)(struct address_space *, struct writeback_control *); int (*set_page_dirty)(struct page *page); int (*readpages)(struct file *filp, struct address_space *mapping, @@ -581,6 +580,8 @@ struct address_space_operations { /* migrate the contents of a page to the specified target */ int (*migratepage) (struct page *, struct page *); int (*launder_page) (struct page *); + int (*is_partially_uptodate) (struct page *, read_descriptor_t *, + unsigned long); int (*error_remove_page) (struct mapping *mapping, struct page *page); int (*swap_activate)(struct file *); int (*swap_deactivate)(struct file *); @@ -612,13 +613,6 @@ struct address_space_operations { In this case, the page will be relocated, relocked and if that all succeeds, ->readpage will be called again. - sync_page: called by the VM to notify the backing store to perform all - queued I/O operations for a page. I/O operations for other pages - associated with this address_space object may also be performed. - - This function is optional and is called only for pages with - PG_Writeback set while waiting for the writeback to complete. - writepages: called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then the writeback_control will specify a range of pages that must be @@ -747,6 +741,11 @@ struct address_space_operations { prevent redirtying the page, it is kept locked during the whole operation. + is_partially_uptodate: Called by the VM when reading a file through the + pagecache when the underlying blocksize != pagesize. If the required + block is up to date then the read can complete without needing the IO + to bring the whole page up to date. + error_remove_page: normally set to generic_error_remove_page if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, -- cgit v1.2.3-70-g09d2 From 543cc115339baa44fbea877b3d8673aca652622f Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 3 Jul 2013 15:04:46 -0700 Subject: documentation: document the is_dirty_writeback aops callback Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/vfs.txt | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index fc5d2a1d26c..f93a88250a4 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -582,6 +582,7 @@ struct address_space_operations { int (*launder_page) (struct page *); int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); + void (*is_dirty_writeback) (struct page *, bool *, bool *); int (*error_remove_page) (struct mapping *mapping, struct page *page); int (*swap_activate)(struct file *); int (*swap_deactivate)(struct file *); @@ -746,6 +747,15 @@ struct address_space_operations { block is up to date then the read can complete without needing the IO to bring the whole page up to date. + is_dirty_writeback: Called by the VM when attempting to reclaim a page. + The VM uses dirty and writeback information to determine if it needs + to stall to allow flushers a chance to complete some IO. 
Ordinarily + it can use PageDirty and PageWriteback but some filesystems have + more complex state (unstable pages in NFS prevent reclaim) or + do not set those flags due to locking problems (jbd). This callback + allows a filesystem to indicate to the VM if a page should be + treated as dirty or writeback for the purposes of stalling. + error_remove_page: normally set to generic_error_remove_page if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, -- cgit v1.2.3-70-g09d2 From 3e5b7d8b491c3710b7e007eab0a643f923932e3d Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 10 Jul 2013 07:03:59 +1000 Subject: xfs: update mount options documentation Because it's horribly out of date. And mark various deprecated options as deprecated and give them a removal date. Signed-off-by: Dave Chinner Reviewed-by: Mark Tinguely Signed-off-by: Ben Myers --- Documentation/filesystems/xfs.txt | 317 +++++++++++++++++++++++++------------- 1 file changed, 209 insertions(+), 108 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index 83577f0232a..12525b17d9e 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -18,6 +18,8 @@ Mount Options ============= When mounting an XFS filesystem, the following options are accepted. +For boolean mount options, the names with the (*) suffix is the +default behaviour. allocsize=size Sets the buffered I/O end-of-file preallocation size when @@ -25,97 +27,128 @@ When mounting an XFS filesystem, the following options are accepted. Valid values for this option are page size (typically 4KiB) through to 1GiB, inclusive, in power-of-2 increments. - attr2/noattr2 - The options enable/disable (default is disabled for backward - compatibility on-disk) an "opportunistic" improvement to be - made in the way inline extended attributes are stored on-disk. - When the new form is used for the first time (by setting or - removing extended attributes) the on-disk superblock feature - bit field will be updated to reflect this format being in use. + The default behaviour is for dynamic end-of-file + preallocation size, which uses a set of heuristics to + optimise the preallocation size based on the current + allocation patterns within the file and the access patterns + to the file. Specifying a fixed allocsize value turns off + the dynamic behaviour. + + attr2 + noattr2 + The options enable/disable an "opportunistic" improvement to + be made in the way inline extended attributes are stored + on-disk. When the new form is used for the first time when + attr2 is selected (either when setting or removing extended + attributes) the on-disk superblock feature bit field will be + updated to reflect this format being in use. + + The default behaviour is determined by the on-disk feature + bit indicating that attr2 behaviour is active. If either + mount option it set, then that becomes the new default used + by the filesystem. CRC enabled filesystems always use the attr2 format, and so will reject the noattr2 mount option if it is set. - barrier - Enables the use of block layer write barriers for writes into - the journal and unwritten extent conversion. This allows for - drive level write caching to be enabled, for devices that - support write barriers. 
+ barrier (*) + nobarrier + Enables/disables the use of block layer write barriers for + writes into the journal and for data integrity operations. + This allows for drive level write caching to be enabled, for + devices that support write barriers. discard - Issue command to let the block device reclaim space freed by the - filesystem. This is useful for SSD devices, thinly provisioned - LUNs and virtual machine images, but may have a performance - impact. - - dmapi - Enable the DMAPI (Data Management API) event callouts. - Use with the "mtpt" option. - - grpid/bsdgroups and nogrpid/sysvgroups - These options define what group ID a newly created file gets. - When grpid is set, it takes the group ID of the directory in - which it is created; otherwise (the default) it takes the fsgid - of the current process, unless the directory has the setgid bit - set, in which case it takes the gid from the parent directory, - and also gets the setgid bit set if it is a directory itself. - - ihashsize=value - In memory inode hashes have been removed, so this option has - no function as of August 2007. Option is deprecated. - - ikeep/noikeep - When ikeep is specified, XFS does not delete empty inode clusters - and keeps them around on disk. ikeep is the traditional XFS - behaviour. When noikeep is specified, empty inode clusters - are returned to the free space pool. The default is noikeep for - non-DMAPI mounts, while ikeep is the default when DMAPI is in use. - - inode64 - Indicates that XFS is allowed to create inodes at any location - in the filesystem, including those which will result in inode - numbers occupying more than 32 bits of significance. This is - the default allocation option. Applications which do not handle - inode numbers bigger than 32 bits, should use inode32 option. + nodiscard (*) + Enable/disable the issuing of commands to let the block + device reclaim space freed by the filesystem. This is + useful for SSD devices, thinly provisioned LUNs and virtual + machine images, but may have a performance impact. + + Note: It is currently recommended that you use the fstrim + application to discard unused blocks rather than the discard + mount option because the performance impact of this option + is quite severe. + + grpid/bsdgroups + nogrpid/sysvgroups (*) + These options define what group ID a newly created file + gets. When grpid is set, it takes the group ID of the + directory in which it is created; otherwise it takes the + fsgid of the current process, unless the directory has the + setgid bit set, in which case it takes the gid from the + parent directory, and also gets the setgid bit set if it is + a directory itself. + + filestreams + Make the data allocator use the filestreams allocation mode + across the entire filesystem rather than just on directories + configured to use it. + + ikeep + noikeep (*) + When ikeep is specified, XFS does not delete empty inode + clusters and keeps them around on disk. When noikeep is + specified, empty inode clusters are returned to the free + space pool. inode32 - Indicates that XFS is limited to create inodes at locations which - will not result in inode numbers with more than 32 bits of - significance. This is provided for backwards compatibility, since - 64 bits inode numbers might cause problems for some applications - that cannot handle large inode numbers. 
- - largeio/nolargeio + inode64 (*) + When inode32 is specified, it indicates that XFS limits + inode creation to locations which will not result in inode + numbers with more than 32 bits of significance. + + When inode64 is specified, it indicates that XFS is allowed + to create inodes at any location in the filesystem, + including those which will result in inode numbers occupying + more than 32 bits of significance. + + inode32 is provided for backwards compatibility with older + systems and applications, since 64 bits inode numbers might + cause problems for some applications that cannot handle + large inode numbers. If applications are in use which do + not handle inode numbers bigger than 32 bits, the inode32 + option should be specified. + + + largeio + nolargeio (*) If "nolargeio" is specified, the optimal I/O reported in - st_blksize by stat(2) will be as small as possible to allow user - applications to avoid inefficient read/modify/write I/O. - If "largeio" specified, a filesystem that has a "swidth" specified - will return the "swidth" value (in bytes) in st_blksize. If the - filesystem does not have a "swidth" specified but does specify - an "allocsize" then "allocsize" (in bytes) will be returned - instead. - If neither of these two options are specified, then filesystem - will behave as if "nolargeio" was specified. + st_blksize by stat(2) will be as small as possible to allow + user applications to avoid inefficient read/modify/write + I/O. This is typically the page size of the machine, as + this is the granularity of the page cache. + + If "largeio" specified, a filesystem that was created with a + "swidth" specified will return the "swidth" value (in bytes) + in st_blksize. If the filesystem does not have a "swidth" + specified but does specify an "allocsize" then "allocsize" + (in bytes) will be returned instead. Otherwise the behaviour + is the same as if "nolargeio" was specified. logbufs=value - Set the number of in-memory log buffers. Valid numbers range - from 2-8 inclusive. - The default value is 8 buffers for filesystems with a - blocksize of 64KiB, 4 buffers for filesystems with a blocksize - of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB - and 2 buffers for all other configurations. Increasing the - number of buffers may increase performance on some workloads - at the cost of the memory used for the additional log buffers - and their associated control structures. + Set the number of in-memory log buffers. Valid numbers + range from 2-8 inclusive. + + The default value is 8 buffers. + + If the memory cost of 8 log buffers is too high on small + systems, then it may be reduced at some cost to performance + on metadata intensive workloads. The logbsize option below + controls the size of each buffer and so is also relevent to + this case. logbsize=value - Set the size of each in-memory log buffer. - Size may be specified in bytes, or in kilobytes with a "k" suffix. - Valid sizes for version 1 and version 2 logs are 16384 (16k) and - 32768 (32k). Valid sizes for version 2 logs also include - 65536 (64k), 131072 (128k) and 262144 (256k). - The default value for machines with more than 32MiB of memory - is 32768, machines with less memory use 16384 by default. + Set the size of each in-memory log buffer. The size may be + specified in bytes, or in kilobytes with a "k" suffix. + Valid sizes for version 1 and version 2 logs are 16384 (16k) + and 32768 (32k). Valid sizes for version 2 logs also + include 65536 (64k), 131072 (128k) and 262144 (256k). 
The + logbsize must be an integer multiple of the log + stripe unit configured at mkfs time. + + The default value for for version 1 logs is 32768, while the + default value for version 2 logs is MAX(32768, log_sunit). logdev=device and rtdev=device Use an external log (metadata journal) and/or real-time device. @@ -124,16 +157,11 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. - mtpt=mountpoint - Use with the "dmapi" option. The value specified here will be - included in the DMAPI mount event, and should be the path of - the actual mountpoint that is used. - noalign - Data allocations will not be aligned at stripe unit boundaries. - - noatime - Access timestamps are not updated when a file is read. + Data allocations will not be aligned at stripe unit + boundaries. This is only relevant to filesystems created + with non-zero data alignment parameters (sunit, swidth) by + mkfs. norecovery The filesystem will be mounted without running log recovery. @@ -144,8 +172,14 @@ When mounting an XFS filesystem, the following options are accepted. the mount will fail. nouuid - Don't check for double mounted file systems using the file system uuid. - This is useful to mount LVM snapshot volumes. + Don't check for double mounted file systems using the file + system uuid. This is useful to mount LVM snapshot volumes, + and often used in combination with "norecovery" for mounting + read-only snapshots. + + noquota + Forcibly turns off all quota accounting and enforcement + within the filesystem. uquota/usrquota/uqnoenforce/quota User disk quota accounting enabled, and limits (optionally) @@ -160,24 +194,64 @@ When mounting an XFS filesystem, the following options are accepted. enforced. Refer to xfs_quota(8) for further details. sunit=value and swidth=value - Used to specify the stripe unit and width for a RAID device or - a stripe volume. "value" must be specified in 512-byte block - units. - If this option is not specified and the filesystem was made on - a stripe volume or the stripe width or unit were specified for - the RAID device at mkfs time, then the mount system call will - restore the value from the superblock. For filesystems that - are made directly on RAID devices, these options can be used - to override the information in the superblock if the underlying - disk layout changes after the filesystem has been created. - The "swidth" option is required if the "sunit" option has been - specified, and must be a multiple of the "sunit" value. + Used to specify the stripe unit and width for a RAID device + or a stripe volume. "value" must be specified in 512-byte + block units. These options are only relevant to filesystems + that were created with non-zero data alignment parameters. + + The sunit and swidth parameters specified must be compatible + with the existing filesystem alignment characteristics. In + general, that means the only valid changes to sunit are + increasing it by a power-of-2 multiple. Valid swidth values + are any integer multiple of a valid sunit value. + + Typically the only time these mount options are necessary if + after an underlying RAID device has had it's geometry + modified, such as adding a new disk to a RAID5 lun and + reshaping it. swalloc Data allocations will be rounded up to stripe width boundaries when the current end of file is being extended and the file size is larger than the stripe width size. 
+ wsync + When specified, all filesystem namespace operations are + executed synchronously. This ensures that when the namespace + operation (create, unlink, etc) completes, the change to the + namespace is on stable storage. This is useful in HA setups + where failover must not result in clients seeing + inconsistent namespace presentation during or after a + failover event. + + +Deprecated Mount Options +======================== + + delaylog/nodelaylog + Delayed logging is the only logging method that XFS supports + now, so these mount options are now ignored. + + Due for removal in 3.12. + + ihashsize=value + In memory inode hashes have been removed, so this option has + no function as of August 2007. Option is deprecated. + + Due for removal in 3.12. + + irixsgid + This behaviour is now controlled by a sysctl, so the mount + option is ignored. + + Due for removal in 3.12. + + osyncisdsync + osyncisosync + O_SYNC and O_DSYNC are fully supported, so there is no need + for these options any more. + + Due for removal in 3.12. sysctls ======= @@ -189,15 +263,20 @@ The following sysctls are available for the XFS filesystem: in /proc/fs/xfs/stat. It then immediately resets to "0". fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) - The interval at which the xfssyncd thread flushes metadata - out to disk. This thread will flush log activity out, and - do some processing on unlinked inodes. + The interval at which the filesystem flushes metadata + out to disk and runs internal cache cleanup routines. - fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) - The interval at which xfsbufd scans the dirty metadata buffers list. + fs.xfs.filestream_centisecs (Min: 1 Default: 3000 Max: 360000) + The interval at which the filesystem ages filestreams cache + references and returns timed-out AGs back to the free stream + pool. - fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) - The age at which xfsbufd flushes dirty metadata buffers to disk. + fs.xfs.speculative_prealloc_lifetime + (Units: seconds Min: 1 Default: 300 Max: 86400) + The interval at which the background scanning for inodes + with unused speculative preallocation runs. The scan + removes unused preallocation from clean inodes and releases + the unused space back to the free pool. fs.xfs.error_level (Min: 0 Default: 3 Max: 11) A volume knob for error reporting when internal errors occur. @@ -254,9 +333,31 @@ The following sysctls are available for the XFS filesystem: by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. + fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodefrag" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) In "inode32" allocation mode, this option determines how many files the allocator attempts to allocate in the same allocation group before moving to the next allocation group. The intent is to control the rate at which the allocator moves between allocation groups when allocating extents for new files. + +Deprecated Sysctls +================== + + fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) + Dirty metadata is now tracked by the log subsystem and + flushing is driven by log space and idling demands. The + xfsbufd no longer exists, so this syctl does nothing. + + Due for removal in 3.14. 
+ + fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) + Dirty metadata is now tracked by the log subsystem and + flushing is driven by log space and idling demands. The + xfsbufd no longer exists, so this syctl does nothing. + + Due for removal in 3.14. -- cgit v1.2.3-70-g09d2
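As a rough illustration of the sunit/swidth and logbsize rules described in the mount options section above, a mount(2) caller might pass them as follows; the device, mount point, and values here are placeholders only, chosen so that swidth is an integer multiple of sunit and logbsize is valid for a version 2 log:

/* Illustrative sketch only: mount an XFS filesystem with explicit stripe
 * geometry, e.g. after the underlying RAID device has been reshaped. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* 512-byte units: sunit=512 is a 256KiB stripe unit, swidth = 8 * sunit */
	const char *opts = "sunit=512,swidth=4096,logbsize=256k";

	if (mount("/dev/sdb1", "/mnt/scratch", "xfs", 0, opts) != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}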