summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2011-08-05Btrfs: force unplugs when switching from high to regular priority biosChris Mason
Btrfs does bio submissions from a worker thread, and each device has a list of high priority bios and regular priority bios. Synchronous writes go to the high priority thread while async writes go to regular list. This commit brings back an explicit unplug any time we switch from high to regular priority, which makes it easier for the block layer to give us low latencies. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: don't call writepages from within write_full_pageJosef Bacik
When doing a writepage we call writepages to try and write out any other dirty pages in the area. This could cause problems where we commit a transaction and then have somebody else dirtying metadata in the area as we could end up writing out a lot more than we care about, which could cause latency on anybody who is waiting for the transaction to completely finish committing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: Remove unused variable 'last_index' in file.cMitch Harder
The variable 'last_index' is calculated in the __btrfs_buffered_write function and passed as a parameter to the prepare_pages function, but is not used anywhere in the prepare_pages function. Remove instances of 'last_index' in these functions. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: clean up for find_first_extent_bit()Xiao Guangrong
find_first_extent_bit() and find_first_extent_bit_state() share most of the code, and we can just make the former call the latter. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: clean up for wait_extent_bit()Xiao Guangrong
We can just use cond_resched_lock(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: clean up for insert_state()Xiao Guangrong
Don't duplicate set_state_bits(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: remove unused members from struct extent_stateXiao Guangrong
These members are not used at all. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: clean up code for merging extent mapsLi Zefan
unpin_extent_cache() and add_extent_mapping() shares the same code that merges extent maps. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: clean up code for extent_map lookupLi Zefan
lookup_extent_map() and search_extent_map() can share most of code. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: clean up search_extent_mapping()Li Zefan
rb_node returned by __tree_search() can be a valid pointer or NULL, but won't be some errno. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: remove redundant code for dir item lookupLi Zefan
When we search a dir item with a specific hash code, we can just return NULL without further checking if btrfs_search_slot() returns 1. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: make acl functions really no-op if acl is not enabledLi Zefan
So there's no overhead for something we don't use. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: remove remaining ref-cache codeLi Zefan
Since commit f2a97a9dbd86eb1ef956bdf20e05c507b32beb96 ("btrfs: remove all unused functions"), there's no extern functions at all in ref-cache.c, so just remove the remaining dead code. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: remove a BUG_ON() in btrfs_commit_transaction()Li Zefan
wait_for_commit() always returns 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: use wait_event()Li Zefan
Use wait_event() when possible to avoid code duplication. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: check the nodatasum flag when writing compressed filesLi Zefan
If mounting with nodatasum option, we won't csum file data for general write or direct-io write, and this rule should also be applied when writing compressed files. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: copy string correctly in INO_LOOKUP ioctlLi Zefan
Memory areas [ptr, ptr+total_len] and [name, name+total_len] may overlap, so it's wrong to use memcpy(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: don't print the leaf if we had an errorJosef Bacik
In __btrfs_free_extent we will print the leaf if we fail to find the extent we wanted, but the problem is if we get an error we won't have a leaf so often this leads to a NULL pointer dereference and we lose the error that actually occurred. So only print the leaf if ret > 0, which means we didn't find the item we were looking for but we didn't error either. This way the error is preserved. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01btrfs: make btrfs_set_root_node voidMark Fasheh
This is fairly trivial - btrfs_set_root_node() - always returns zero so we can just make it void. All callers ignore the return code now anyway. I also made sure to check that none of the functions that btrfs_set_root_node() calls returns an error that we might have needed to catch and pass back. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: fix oops while writing data to SSD partitionsliubo
Here I have a two SSD-partitions btrfs, and they are defaultly set to "data=raid0, metadata=raid1", then I try to fill my btrfs partition till "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp". I get an oops panic from kernel BUG at fs/btrfs/extent-tree.c:5199!, which refers to find_free_extent's BUG_ON(index != get_block_group_index(block_group)); In SSD mode, in order to find enough space to alloc, we may check the block_group cache which has been checked sometime before, but the index is not updated, where it hits the BUG_ON. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Acked-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: Protect the readonly flag of block groupWuBo
The access for ro in btrfs_block_group_cache should be protected because of the racy lock in relocation. Signed-off-by: Wu Bo <wu.bo@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01btrfs: Make extent-io callbacks that never fail return voidJeff Mahoney
The set/clear bit and the extent split/merge hooks only ever return 0. Changing them to return void simplifies the error handling cases later. This patch changes the hook prototypes, the single implementation of each, and the functions that call them to return void instead. Since all four of these hooks execute under a spinlock, they're necessarily simple. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: fix readahead in file defragLi Zefan
We passed the wrong value to btrfs_force_ra(). Fix this by changing the argument of btrfs_force_ra() from last_index to nr_page. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs: return error to caller when btrfs_unlink() failesTsutomu Itoh
When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink() are error, the error code is returned to the caller instead of BUG_ON(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Btrfs:don't check the return value of __btrfs_add_inode_defragWanlong Gao
Don't need to check the return value of __btrfs_add_inode_defrag(), since it will always return 0. Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Merge branch 'alloc_path' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus
2011-07-27Merge branch 'integration' into for-linusChris Mason
2011-07-27Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errorsChris Mason
The btrfs transaction code will return any errors that come from reserve_metadata_bytes. We need to make sure we don't return funny things like 1 or EAGAIN. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: use the commit_root for reading free_space_inode crcsChris Mason
Now that we are using regular file crcs for the free space cache, we can deadlock if we try to read the free_space_inode while we are updating the crc tree. This commit fixes things by using the commit_root to read the crcs. This is safe because we the free space cache file would already be loaded if that block group had been changed in the current transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: reduce extent_state lock contention for metadataChris Mason
For metadata buffers that don't straddle pages (all of them), btrfs can safely use the page uptodate bits and extent_buffer uptodate bit instead of needing to use the extent_state tree. This greatly reduces contention on the state tree lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: remove lockdep magic from btrfs_next_leafChris Mason
Before the reader/writer locks, btrfs_next_leaf needed to keep the path blocking to avoid making lockdep upset. Now that btrfs_next_leaf only takes read locks, this isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: make a lockdep class for each rootChris Mason
This patch was originally from Tejun Heo. lockdep complains about the btrfs locking because we sometimes take btree locks from two different trees at the same time. The current classes are based only on level in the btree, which isn't enough information for lockdep to figure out if the lock is safe. This patch makes a class for each type of tree, and lumps all the FS trees that actually have files and directories into the same class. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: switch the btrfs tree locks to reader/writerChris Mason
The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: fix deadlock when throttling transactionsJosef Bacik
Hit this nice little deadlock. What happens is this __btrfs_end_transaction with throttle set, --use_count so it equals 0 btrfs_commit_transaction <somebody else actually manages to start the commit> btrfs_end_transaction --use_count so now its -1 <== BAD we just return and wait on the transaction This is bad because we just return after our use_count is -1 and don't let go of our num_writer count on the transaction, so the guy committing the transaction just sits there forever. Fix this by inc'ing our use_count if we're going to call commit_transaction so that if we call btrfs_end_transaction it's valid. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: stop using highmem for extent_buffersChris Mason
The extent_buffers have a very complex interface where we use HIGHMEM for metadata and try to cache a kmap mapping to access the memory. The next commit adds reader/writer locks, and concurrent use of this kmap cache would make it even more complex. This commit drops the ability to use HIGHMEM with extent buffers, and rips out all of the related code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: fix BUG_ON() caused by ENOSPC when relocating spaceMiao Xie
When we balanced the chunks across the devices, BUG_ON() in __finish_chunk_alloc() was triggered. ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:2568! [SNIP] Call Trace: [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs] [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs] [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs] [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs] [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs] [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs] [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs] [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs] [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs] [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs] [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs] [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs] [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs] [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160 [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs] [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs] [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30 [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs] [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs] [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs] [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs] [<ffffffff81159c63>] ? generic_permission+0x23/0xb0 [<ffffffff8115b415>] vfs_create+0xa5/0xc0 [<ffffffff8115ce6e>] do_last+0x5fe/0x880 [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0 [<ffffffff8115e029>] do_filp_open+0x49/0xa0 [<ffffffff8116a965>] ? alloc_fd+0x95/0x160 [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0 [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0 [<ffffffff8114f1e0>] sys_open+0x20/0x30 [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs] The reason is: Task1 Space balance task do_chunk_alloc() __finish_chunk_alloc() update device info in the chunk tree alloc system metadata block relocate system metadata block group set system metadata block group readonly, This block group is the only one that can allocate space. So there is no free space that can be allocated now. find no space and don't try to alloc new chunk, and then return ENOSPC BUG_ON() in __finish_chunk_alloc() was triggered. Fix this bug by allocating a new system metadata chunk before relocating the old one if we find there is no free space which can be allocated after setting the old block group to be read-only. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: tag pages for writeback in syncJosef Bacik
Everybody else does this, we need to do it too. If we're syncing, we need to tag the pages we're going to write for writeback so we don't end up writing the same stuff over and over again if somebody is constantly redirtying our file. This will keep us from having latencies with heavy sync workloads. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: fix enospc problems with delallocJosef Bacik
So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: don't flush delalloc arbitrarilyJosef Bacik
Kill the check to see if we have 512mb of reserved space in delalloc and shrink_delalloc if we do. This causes unexpected latencies and we have other logic to see if we need to throttle. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27Btrfs: use find_or_create_page instead of grab_cache_pageJosef Bacik
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we need GFP_NOFS so we don't deadlock. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-07-27Btrfs: use a worker thread to do cachingJosef Bacik
A user reported a deadlock when copying a bunch of files. This is because they were low on memory and kthreadd got hung up trying to migrate pages for an allocation when starting the caching kthread. The page was locked by the person starting the caching kthread. To fix this we just need to use the async thread stuff so that the threads are already created and we don't have to worry about deadlocks. Thanks, Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <josef@redhat.com>
2011-07-25btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshotMark Fasheh
In addition to properly handling allocation failure from btrfs_alloc_path, I also fixed up the kzalloc error handling code immediately below it. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-25btrfs: Don't BUG_ON alloc_path errors in find_next_chunkMark Fasheh
I also removed the BUG_ON from error return of find_next_chunk in init_first_rw_device(). It turns out that the only caller of init_first_rw_device() also BUGS on any nonzero return so no actual behavior change has occurred here. do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk() which can now return -ENOMEM. Instead of setting space_info->full on any error from btrfs_alloc_chunk() I catch and return every error value _except_ -ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: CIFS: Fix wrong length in cifs_iovec_read
2011-07-21vfs: drop conditional inode prefetch in __do_lookup_rcuLinus Torvalds
It seems to hurt performance in real life. Yes, the inode will be used later, but the conditional doesn't seem to predict all that well (negative dentries are not uncommon) and it looks like the cost of prefetching is simply higher than depending on the cache doing the right thing. As usual. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-21FS-Cache: Fix __fscache_uncache_all_inode_pages()'s outer loopJan Beulich
The compiler, at least for ix86 and m68k, validly warns that the comparison: next <= (loff_t)-1 is always true (and it's always true also for x86-64 and probably all other arches - as long as pgoff_t isn't wider than loff_t). The intention appears to be to avoid wrapping of "next", so rather than eliminating the pointless comparison, fix the loop to indeed get exited when "next" would otherwise wrap. On m68k the following warning is observed: fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages': fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Suresh Jayaraman <sjayaraman@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-21CIFS: Fix wrong length in cifs_iovec_readPavel Shilovsky
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-07-19fs/libfs.c: fix simple_attr_write() on 32bit machinesAkinobu Mita
Assume that /sys/kernel/debug/dummy64 is debugfs file created by debugfs_create_x64(). # cd /sys/kernel/debug # echo 0x1234567812345678 > dummy64 # cat dummy64 0x0000000012345678 # echo 0x80000000 > dummy64 # cat dummy64 0xffffffff80000000 A value larger than INT_MAX cannot be written to the debugfs file created by debugfs_create_u64 or debugfs_create_x64 on 32bit machine. Because simple_attr_write() uses simple_strtol() for the conversion. To fix this, use simple_strtoll() instead. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: vfs: fix race in rcu lookup of pruned dentry Fix cifs_get_root() [ Edited the last commit to get rid of a 'unused variable "seq"' warning due to Al editing the patch. - Linus ]
2011-07-19vfs: fix race in rcu lookup of pruned dentryLinus Torvalds
Don't update *inode in __follow_mount_rcu() until we'd verified that there is mountpoint there. Kudos to Hugh Dickins for catching that one in the first place and eventually figuring out the solution (and catching a braino in the earlier version of patch). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>