summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2010-03-01xfs: increase readdir buffer sizeEric Sandeen
While doing some testing of readdir perf a while back, I noticed that the buffer size we're using internally is smaller than what glibc gives us by default. Upping this size helped a bit, and seems safe. glibc's __alloc_dir() does: const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : 4 * BUFSIZ); const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : BUFSIZ); size_t allocation = default_allocation; #ifdef _STATBUF_ST_BLKSIZE if (statp != NULL && default_allocation < statp->st_blksize) allocation = statp->st_blksize; #endif and #define _G_BUFSIZ 8192 #define _IO_BUFSIZ _G_BUFSIZ # define BUFSIZ _IO_BUFSIZ so the default buffer is 4 * 8192 = 32768 (except in the unlikely case of blocks > 32k....) Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits) block: don't access jiffies when initialising io_context cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds block: fix for "Consolidate phys_segment and hw_segment limits" cfq-iosched: quantum check tweak blktrace: perform cleanup after setup error blkdev: fix merge_bvec_fn return value checks cfq-iosched: requests "in flight" vs "in driver" clarification cciss: Fix problem with scatter gather elements in the scsi half of the driver cciss: eliminate unnecessary pointer use in cciss scsi code cciss: do not use void pointer for scsi hba data cciss: factor out scatter gather chain block mapping code cciss: fix scatter gather chain block dma direction kludge cciss: simplify scatter gather code cciss: factor out scatter gather chain block allocation and freeing cciss: detect bad alignment of scsi commands at build time cciss: clarify command list padding calculation cfq-iosched: rethink seeky detection for SSDs cfq-iosched: rework seeky detection block: remove padding from io_context on 64bit builds block: Consolidate phys_segment and hw_segment limits ...
2010-03-01GFS2: print glock numbers in hexBob Peterson
This patch changes glock numbers from printing in decimal to hex. Since DLM prints corresponding resource IDs in hex, it makes debugging easier. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01GFS2: ordered writes are backwardsDave Chinner
When we queue data buffers for ordered write, the buffers are added to the head of the ordered write list. When the log needs to push these buffers to disk, it also walks the list from the head. The result is that the the ordered buffers are submitted to disk in reverse order. For large writes, this means that whenever the log flushes large streams of reverse sequential order buffers are pushed down into the block layers. The elevators don't handle this particularly well, so IO rates tend to be significantly lower than if the IO was issued in ascending block order. Queue new ordered buffers to the tail of the ordered buffer list to ensure that IO is dispatched in the order it was submitted. This should significantly improve large sequential write speeds. On a disk capable of 85MB/s, speeds increase from 50MB/s to 65MB/s for noop and from 38MB/s to 50MB/s for cfq. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01GFS2: Remove loopy umount codeSteven Whitehouse
As a consequence of the previous patch, we can now remove the loop which used to be required due to the circular dependency between the inodes and glocks. Instead we can just invalidate the inodes, and then clear up any glocks which are left. Also we no longer need the rwsem since there is no longer any danger of the inode invalidation calling back into the glock code (and from there back into the inode code). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01GFS2: Metadata address space clean upSteven Whitehouse
Since the start of GFS2, an "extra" inode has been used to store the metadata belonging to each inode. The only reason for using this inode was to have an extra address space, the other fields were unused. This means that the memory usage was rather inefficient. The reason for keeping each inode's metadata in a separate address space is that when glocks are requested on remote nodes, we need to be able to efficiently locate the data and metadata which relating to that glock (inode) in order to sync or sync and invalidate it (depending on the remotely requested lock mode). This patch adds a new type of glock, which has in addition to its normal fields, has an address space. This applies to all inode and rgrp glocks (but to no other glock types which remain as before). As a result, we no longer need to have the second inode. This results in three major improvements: 1. A saving of approx 25% of memory used in caching inodes 2. A removal of the circular dependency between inodes and glocks 3. No confusion between "normal" and "metadata" inodes in super.c Although the first of these is the more immediately apparent, the second is just as important as it now enables a number of clean ups at umount time. Those will be the subject of future patches. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-28Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller
Conflicts: drivers/firmware/iscsi_ibft.c
2010-03-01Merge branch 'next' into for-linusJames Morris
2010-02-28blkdev: fix merge_bvec_fn return value checksDmitry Monakhov
merge_bvec_fn() returns bvec->bv_len on success. So we have to check against this value. But in case of fs_optimization merge we compare with wrong value. This patch must be included in b428cd6da7e6559aca69aa2e3a526037d3f20403 But accidentally i've forgot to add this in the initial patch. To make things straight let's replace all such checks. In fact this makes code easy to understand. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-28Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits) rcu: Fix accelerated GPs for last non-dynticked CPU rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot rcu: Fix accelerated grace periods for last non-dynticked CPU rcu: Export rcu_scheduler_active rcu: Make rcu_read_lock_sched_held() take boot time into account rcu: Make lockdep_rcu_dereference() message less alarmist sched, cgroups: Fix module export rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information rcu: Fix rcutorture mod_timer argument to delay one jiffy rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection rcu: Convert to raw_spinlocks rcu: Stop overflowing signed integers rcu: Use canonical URL for Mathieu's dissertation rcu: Accelerate grace period if last non-dynticked CPU rcu: Fix citation of Mathieu's dissertation rcu: Documentation update for CONFIG_PROVE_RCU security: Apply lockdep-based checking to rcu_dereference() uses idr: Apply lockdep-based diagnostics to rcu_dereference() uses radix-tree: Disable RCU lockdep checking in radix tree vfs: Abstract rcu_dereference_check for files-fdtable use ...
2010-02-28exofs: groups supportBoaz Harrosh
* _calc_stripe_info() changes to accommodate for grouping calculations. Returns additional information * old _prepare_pages() becomes _prepare_one_group() which stores pages belonging to one device group. * New _prepare_for_striping iterates on all groups calling _prepare_one_group(). * Enable mounting of groups data_maps (group_width != 0) [QUESTION] what is faster A or B; A. x += stride; x = x % width + first_x; B x += stride if (x < last_x) x = first_x; Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: Prepare for groupsBoaz Harrosh
* Rename _offset_dev_unit_off() to _calc_stripe_info() and recieve a struct for the output params * In _prepare_for_striping we only need to call _calc_stripe_info() once. The other componets are easy to calculate from that. This code was inspired by what's done in truncate. * Some code shifts that make sense now but will make more sense when group support is added. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: Error recovery if object is missing from storageBoaz Harrosh
If an object is referenced by a directory but does not exist on a target, it is a very serious corruption that means: 1. Either a power failure with very slim chance of it happening. Because the directory update is always submitted much after object creation, but if a directory is written to one device and the object creation to another it might theoretically happen. 2. It only ever happened to me while developing with BUGs causing file corruption. Crashes could also cause it but they are more like case 1. In any way the object does not exist, so data is surely lost. If there is a mix-up in the obj-id or data-map, then lost objects can be salvaged by off-line fsck. The only recoverable information is the directory name. By letting it appear as a regular empty file, with date==0 (1970 Jan 1st) ownership to root, we enable recovery of the only useful information. And also enable deletion or over-write. I can see how this can hurt. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: convert io_state to use pages array instead of bio at inputBoaz Harrosh
* inode.c operations are full-pages based, and not actually true scatter-gather * Lets us use more pages at once upto 512 (from 249) in 64 bit * Brings us much much closer to be able to use exofs's io_state engine from objlayout driver. (Once I decide where to put the common code) After RAID0 patch the outer (input) bio was never used as a bio, but was simply a page carrier into the raid engine. Even in the simple mirror/single-dev arrangement pages info was copied into a second bio. It is now easer to just pass a pages array into the io_state and prepare bio(s) once. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: RAID0 supportBoaz Harrosh
We now support striping over mirror devices. Including variable sized stripe_unit. Some limits: * stripe_unit must be a multiple of PAGE_SIZE * stripe_unit * stripe_count is maximum upto 32-bit (4Gb) Tested RAID0 over mirrors, RAID0 only, mirrors only. All check. Design notes: * I'm not using a vectored raid-engine mechanism yet. Following the pnfs-objects-layout data-map structure, "Mirror" is just a private case of "group_width" == 1, and RAID0 is a private case of "Mirrors" == 1. The performance lose of the general case over the particular special case optimization is totally negligible, also considering the extra code size. * In general I added a prepare_stripes() stage that divides the to-be-io pages to the participating devices, the previous exofs_ios_write/read, now becomes _write/read_mirrors and a new write/read upper layer loops on all devices calling _write/read_mirrors. Effectively the prepare_stripes stage is the all secret. Also truncate need fixing to accommodate for striping. * In a RAID0 arrangement, in a regular usage scenario, if all inode layouts will start at the same device, the small files fill up the first device and the later devices stay empty, the farther the device the emptier it is. To fix that, each inode will start at a different stripe_unit, according to it's obj_id modulus number-of-stripe-units. And will then span all stripe-units in the same incrementing order wrapping back to the beginning of the device table. We call it a stripe-units moving window. Special consideration was taken to keep all devices in a mirror arrangement identical. So a broken osd-device could just be cloned from one of the mirrors and no FS scrubbing is needed. (We do that by rotating stripe-unit at a time and not a single device at a time.) TODO: We no longer verify object_length == inode->i_size in exofs_iget. (since i_size is stripped on multiple objects now). I should introduce a multiple-device attribute reading, and use it in exofs_iget. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: Define on-disk per-inode optional layout attributeBoaz Harrosh
* Layouts describe the way a file is spread on multiple devices. The layout information is stored in the objects attribute introduced in this patch. * There can be multiple generating function for the layout. Currently defined: - No attribute present - use below moving-window on global device table, all devices. (This is the only one currently used in exofs) - an obj_id generated moving window - the obj_id is a randomizing factor in the otherwise global map layout. - An explicit layout stored, including a data_map and a device index list. - More might be defined in future ... * There are two attributes defined of the same structure: A-data-files-layout - This layout is used by data-files. If present at a directory, all files of that directory will be created with this layout. A-meta-data-layout - This layout is used by a directory and other meta-data information. Also inherited at creation of subdirectories. * At creation time inodes are created with the layout specified above. A usermode utility may change the creation layout on a give directory or file. Which in the case of directories, will also apply to newly created files/subdirectories, children of that directory. In the simple unaltered case of a newly created exofs, no layout attributes are present, and all layouts adhere to the layout specified at the device-table. * In case of a future file system loaded in an old exofs-driver. At iget(), the generating_function is inspected and if not supported will return an IO error to the application and the inode will not be loaded. So not to damage any data. Note: After this patch we do not yet support any type of layout only the RAID0 patch that enables striping at the super-block level will add support for RAID0 layouts above. This way we are past and future compatible and fully bisectable. * Access to the device table is done by an accessor since it will change according to above information. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: unindent exofs_sbi_readBoaz Harrosh
The original idea was that a mirror read can be sub-divided to multiple devices. But this has very little gain and only at very large IOes so it's not going to be implemented soon. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: Move layout related members to a layout structureBoaz Harrosh
* Abstract away those members in exofs_sb_info that are related/needed by a layout into a new exofs_layout structure. Embed it in exofs_sb_info. * At exofs_io_state receive/keep a pointer to an exofs_layout. No need for an exofs_sb_info pointer, all we need is at exofs_layout. * Change any usage of above exofs_sb_info members to their new name. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: Recover in the case of read-passed-end-of-fileBoaz Harrosh
In check_io, implement the case of reading passed end of file, by clearing the pages and recover with no error. In a raid arrangement this can become a legitimate situation in case of holes in the file. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: Micro-optimize exofs_i_infoBoaz Harrosh
optimize the exofs_i_info struct usage by moving the embedded vfs_inode to be first. A compiler might optimize away an "add" operation with constant zero. (Which it cannot with other constants) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28exofs: debug print even lessBoaz Harrosh
* Last debug trimming left in some stupid print, remove them. Fixup some other prints * Shift printing from inode.c to ios.c * Add couple of prints when memory allocation fails. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-27ocfs2: send SIGXFSZ if new filesize exceeds limit -v2Wengang Wang
This patch makes ocfs2 send SIGXFSZ if new file size exceeds the rlimit. Processes may get SIGXFSZ on one node (in the cluster) while others will not on another if file size limits are different on the two nodes. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27ocfs2/userdlm: Add tracing in userdlmSunil Mushran
Make use of the newly added BASTS masklog to trace ASTs and BASTs in userdlm. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27ocfs2: Use a separate masklog for AST and BASTsSunil Mushran
This patch adds a new masklog and uses it allow tracing ASTs and BASTs in the dlmglue layer. This has been found to be very useful in debugging cluster locking issues. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26Remove EXPERIMENTAL from NFS_FSCACHEChristian Kujau
There's currently an open Ubuntu bug[0], with the intent to compile NFS_FSCACHE (and possibly AFS_FSCACHE, 9P_FSCACHE) into the standard Ubuntu kernel. However, since *_FSCACHE still depends on EXPERIMENTAL, this won't happen. As Arjan van de Ven pointed out[1], the EXPERIMENTAL flag doesn't mean that much any more, I propose the following patch to fs/nfs/Kconfig. I'd do the same for fs/9p/Kconfig and fs/afs/Kconfig, but as I did not test 9p or AFS, I feel it would not be appropriate for me to remove the flag. [0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/440522/comments/5 [1] http://lkml.org/lkml/2010/1/23/145 Signed-off-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: use bastmode in debugfs output dlm: Send lockspace name with uevents dlm: send reply before bast dlm: fix ordering of bast and cast
2010-02-26Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits) fs/xfs: Correct NULL test xfs: optimize log flushing in xfs_fsync xfs: only clear the suid bit once in xfs_write xfs: kill xfs_bawrite xfs: log changed inodes instead of writing them synchronously xfs: remove invalid barrier optimization from xfs_fsync xfs: kill the unused XFS_QMOPT_* flush flags V2 xfs: Use delay write promotion for dquot flushing xfs: Sort delayed write buffers before dispatch xfs: Don't issue buffer IO direct from AIL push V2 xfs: Use delayed write for inodes rather than async V2 xfs: Make inode reclaim states explicit xfs: more reserved blocks fixups xfs: turn off sign warnings xfs: don't hold onto reserved blocks on remount,ro xfs: quota limit statvfs available blocks xfs: replace KM_LARGE with explicit vmalloc use xfs: cleanup up xfs_log_force calling conventions xfs: kill XLOG_VEC_SET_TYPE xfs: remove duplicate buffer flags ...
2010-02-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-viptLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt: xfs: fix xfs to work with Virtually Indexed architectures sh: add mm API for DMA to vmalloc/vmap areas arm: add mm API for DMA to vmalloc/vmap areas parisc: add mm API for DMA to vmalloc/vmap areas mm: add coherence API for DMA to vmalloc/vmap areas
2010-02-26dlm: allow dlm do recovery during shutdownSrinivas Eeda
If a node down event happens while dlm shutdown in progress, dlm recovery should be done before dlm is shutdown. We can't migrate unrecovered locks, obviously. But dlm_reco_thread only does recovery if the dlm_state is in DLM_CTXT_JOINED. dlm_reco_thread should do recovery if dlm_state is in DLM_CTXT_JOINED or DLM_CTXT_IN_SHUTDOWN. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Only bug out in direct io write for reflinked extent.Tao Ma
In ocfs2_direct_IO_get_blocks, we only need to bug out in case of we are going to write a recounted extent rec. What a silly bug introduced by me! Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org
2010-02-26ocfs2: fix warning in ocfs2_file_aio_write()Coly Li
This patch fixes a compiling warning in ocfs2_file_aio_write(). Signed-off-by: Coly Li <coly.li@suse.de> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2_dlmfs: Enable the use of user cluster stacks.Joel Becker
Unlike ocfs2, dlmfs has no permanent storage. It can't store off a cluster stack it is supposed to be using. So it can't specify the stack name in ocfs2_cluster_connect(). Instead, we create ocfs2_cluster_connect_agnostic(), which simply uses the stack that is currently enabled. This is find for dlmfs, which will rely on the stack initialization. We add the "stackglue" capability to dlmfs's capability list. This lets userspace know dlmfs can be used with all cluster stacks. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2_dlmfs: Use the stackglue.Joel Becker
Rather than directly using o2dlm, dlmfs can now use the stackglue. This allows it to use userspace cluster stacks and fs/dlm. This commit forces o2cb for now. A latter commit will bump the protocol version and allow non-o2cb stacks. This is one big sed, really. LKM_xxMODE becomes DLM_LOCK_xx. LKM_flag becomes DLM_LKF_flag. We also learn to check that the LVB is valid before reading it. Any DLM can lose the contents of the LVB during a complicated recovery. userdlm should be checking this. Now it does. dlmfs will return 0 from read(2) if the LVB was invalid. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2_dlmfs: Don't honor truncate. The size of a dlmfs file is LVB_LENJoel Becker
We want folks using dlmfs to be able to use the LVB in places other than just write(2)/read(2). By ignoring truncate requests, we allow 'echo "contents" > /dlm/space/lockname' to work. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Pass the locking protocol into ocfs2_cluster_connect().Joel Becker
Inside the stackglue, the locking protocol structure is hanging off of the ocfs2_cluster_connection. This takes it one further; the locking protocol is passed into ocfs2_cluster_connect(). Now different cluster connections can have different locking protocols with distinct asts. Note that all locking protocols have to keep their maximum protocol version in lock-step. With the protocol structure set in ocfs2_cluster_connect(), there is no need for the stackglue to have a static pointer to a specific protocol structure. We can change initialization to only pass in the maximum protocol version. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Remove the ast pointers from ocfs2_stack_pluginsJoel Becker
With the full ocfs2_locking_protocol hanging off of the ocfs2_cluster_connection, ast wrappers can get the ast/bast pointers there. They don't need to get them from their plugin structure. The user plugin still needs the maximum locking protocol version, though. This changes the plugin structure so that it only holds the max version, not the entire ocfs2_locking_protocol pointer. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Hang the locking proto on the cluster conn and use it in asts.Joel Becker
With the ocfs2_cluster_connection hanging off of the ocfs2_dlm_lksb, we have access to it in the ast and bast wrapper functions. Attach the ocfs2_locking_protocol to the conn. Now, instead of refering to a static variable for ast/bast pointers, the wrappers can look at the connection. This means different connections can have different ast/bast pointers, and it reduces the need for the static pointer. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Attach the connection to the lksbJoel Becker
We're going to want it in the ast functions, so we convert union ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Pass lksbs back from stackglue ast/bast functions.Joel Becker
The stackglue ast and bast functions tried to maintain the fiction that their arguments were void pointers. In reality, stack_user.c had to know that the argument was an ocfs2_lock_res in order to get the status off of the lksb. That's ugly. This changes stackglue to always pass the lksb as the argument to ast and bast functions. The caller can always use container_of() to get the ocfs2_lock_res or user_dlm_lock_res. The net effect to the caller is zero. They still get back the lockres in their ast. stackglue gets cleaner, and now can use the lksb itself. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2_dlmfs: Move to its own directoryJoel Becker
We're going to remove the tie between ocfs2_dlmfs and o2dlm. ocfs2_dlmfs doesn't belong in the fs/ocfs2/dlm directory anymore. Here we move it to fs/ocfs2/dlmfs. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2_dlmfs: Use poll() to signify BASTs.Joel Becker
o2dlm's userspace filesystem is an easy way to use the DLM from userspace. It is intentionally simple. For example, it does not allow for asynchronous behavior or lock conversion. This is intentional to keep the interface simple. Because there is no asynchronous notification, there is no way for a process holding a lock to know another node needs the lock. This is the number one complaint of ocfs2_dlmfs users. Turns out, we can solve this very easily. We add poll() support to ocfs2_dlmfs. When a BAST is received, the lock's file descriptor will receive POLLIN. This is trivial to implement. Userdlm already has an appropriate waitqueue, and the lock knows when it is blocked. We add the "bast" capability to tell userspace this is available. Signed-off-by: Joel Becker <joel.becker@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2_dlmfs: Add capabilities parameter.Joel Becker
Over time, dlmfs has added some features that were not part of the initial ABI. Unfortunately, some of these features are not detectable via standard usage. For example, Linux's default poll always returns POLLIN, so there is no way for a caller of poll(2) to know when dlmfs added poll support. Instead, we provide this list of new capabilities. Capabilities is a read-only attribute. We do it as a module parameter so we can discover it whether dlmfs is built in, loaded, or even not loaded (via modinfo). The ABI features are local to this machine's dlmfs mount. This is distinct from the locking protocol, which is concerned with inter-node interaction. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Handle errors while setting external xattr values.Joel Becker
ocfs2 can store extended attribute values as large as a single file. It does this using a standard ocfs2 btree for the large value. However, the previous code did not handle all error cases cleanly. There are multiple problems to have. 1) We have trouble allocating space for a new xattr. This leaves us with an empty xattr. 2) We overwrote an existing local xattr with a value root, and now we have an error allocating the storage. This leaves us an empty xattr. where there used to be a value. The value is lost. 3) We have trouble truncating a reused value. This leaves us with the original entry pointing to the truncated original value. The value is lost. 4) We have trouble extending the storage on a reused value. This leaves us with the original value safely in place, but with more storage allocated when needed. This doesn't consider storing local xattrs (values that don't require a btree). Those only fail when the journal fails. Case (1) is easy. We just remove the xattr we added. We leak the storage because we can't safely remove it, but otherwise everything is happy. We'll print a warning about the leak. Case (4) is easy. We still have the original value in place. We can just leave the extra storage attached to this xattr. We return the error, but the old value is untouched. We print a warning about the storage. Case (2) and (3) are hard because we've lost the original values. In the old code, we ended up with values that could be partially read. That's not good. Instead, we just wipe the xattr entry and leak the storage. It stinks that the original value is lost, but now there isn't a partial value to be read. We'll print a big fat warning. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Set inline xattr entries with ocfs2_xa_set()Joel Becker
ocfs2_xattr_ibody_set() is the only remaining user of ocfs2_xattr_set_entry(). ocfs2_xattr_set_entry() actually does two things: it calls ocfs2_xa_set(), and it initializes the inline xattrs. Initializing the inline space really belongs in its own call. We lift the initialization to ocfs2_xattr_ibody_init(), called from ocfs2_xattr_ibody_set() only when necessary. Now ocfs2_xattr_ibody_set() can call ocfs2_xa_set() directly. ocfs2_xattr_set_entry() goes away. Another nice fact is that ocfs2_init_dinode_xa_loc() can trust i_xattr_inline_size. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Set xattr block entries with ocfs2_xa_set()Joel Becker
ocfs2_xattr_block_set() calls into ocfs2_xattr_set_entry() with just the HAS_XATTR flag. Most of the machinery of ocfs2_xattr_set_entry() is skipped. All that really happens other than the call to ocfs2_xa_set() is making sure the HAS_XATTR flag is set on the inode. But HAS_XATTR should be set when we also set di->i_xattr_loc. And that's done in ocfs2_create_xattr_block(). So let's move it there, and then ocfs2_xattr_block_set() can just call ocfs2_xa_set(). While we're there, ocfs2_create_xattr_block() can take the set_ctxt for a smaller argument list. It also learns to set HAS_XATTR_FL, because it knows for sure. ocfs2_create_empty_xatttr_block() in the reflink path fakes a set_ctxt to call ocfs2_create_xattr_block(). Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Let ocfs2_xa_prepare_entry() do space checks.Joel Becker
ocfs2_xattr_set_in_bucket() doesn't need to do its own hacky space checking. Let's let ocfs2_xa_prepare_entry() (via ocfs2_xa_set()) do the more accurate work. Whenever it doesn't have space, ocfs2_xattr_set_in_bucket() can try to get more space. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Gell into ocfs2_xa_set()Joel Becker
ocfs2_xa_set() wraps the ocfs2_xa_prepare_entry()/ocfs2_xa_store_value() logic. Both callers can now use the same routine. ocfs2_xa_remove() moves directly into ocfs2_xa_set(). Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Allocation in ocfs2_xa_prepare_entry(), values in ocfs2_xa_store_value()Joel Becker
ocfs2_xa_prepare_entry() gets all the logic to add, remove, or modify external value trees. Now, when it exits, the entry is ready to receive a value of any size. ocfs2_xa_remove() is added to handle the complete removal of an entry. It truncates the external value tree before calling ocfs2_xa_remove_entry(). ocfs2_xa_store_inline_value() becomes ocfs2_xa_store_value(). It can store any value. ocfs2_xattr_set_entry() loses all the allocation logic and just uses these functions. ocfs2_xattr_set_value_outside() disappears. ocfs2_xattr_set_in_bucket() uses these functions and makes ocfs2_xattr_set_entry_in_bucket() obsolete. That goes away, as does ocfs2_xattr_bucket_set_value_outside() and ocfs2_xattr_bucket_value_truncate(). Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Teach ocfs2_xa_loc how to do its own journal workJoel Becker
We're going to want to make sure our buffers get accessed and dirtied correctly. So have the xa_loc do the work. This includes storing the inode on ocfs2_xa_loc. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26ocfs2: Provide ocfs2_xa_fill_value_buf() for external value processingJoel Becker
We use the ocfs2_xattr_value_buf structure to manage external values. It lets the value tree code do its work regardless of the containing storage. ocfs2_xa_fill_value_buf() initializes a value buf from an ocfs2_xa_loc entry. Signed-off-by: Joel Becker <joel.becker@oracle.com>