From 638f44163d57f87d0905fbed7d54202beff916fc Mon Sep 17 00:00:00 2001
From: Dave Chinner
Date: Fri, 30 Aug 2013 10:23:45 +1000
Subject: xfs: recovery of swap extents operations for CRC filesystems

This is the recovery side of the btree block owner change operation
performed by swapext on CRC enabled filesystems. We detect that an
owner change is needed by the flag that has been placed on the inode
log format flag field. Because the inode recovery is being replayed
after the buffers that make up the BMBT in the given checkpoint, we
can walk all the buffers and directly modify them when we see the
flag set on an inode.

Because the inode can be relogged and hence present in multiple
checkpoints with the "change owner" flag set, we could do multiple
passes across the inode to do this change. While this isn't optimal,
we can't simply ignore the flag, as there may be multiple independent
swap extent operations being replayed on the same inode in different
checkpoints.

Further, because the owner change operation uses ordered buffers, we
might have buffers that are newer on disk than the current checkpoint
and so already have the owner changed in them. Hence we cannot just
peek at a buffer in the tree and check that it has the correct owner
and assume that the change was completed.

So, for the moment just brute force the owner change every time we
see an inode with the flag set. Note that we have to be careful here
because the owner of the buffers may point to either the old owner or
the new owner. Currently the verifier can't verify the owner
directly, so there is no failure case here right now. If we verify
the owner exactly in future, then we'll have to take this into
account.

This was tested in terms of normal operation via xfstests - all of
the fsr tests now pass without failure. However, we really need to
modify xfs/227 to stress v3 inodes correctly to ensure we fully cover
this case for v5 filesystems.

In terms of recovery testing, I used a hacked version of xfs_fsr that
held the temp inode open for a few seconds before exiting so that the
filesystem could be shut down with an open owner change recovery flag
set on at least the temp inode. fsr leaves the temp inode unlinked and
in btree format, so this was necessary for the owner change to be
reliably replayed.

logprint confirmed the tmp inode in the log had the correct flag set:

INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88
    INODE: #regs:3 ino:0x44 flags:0x209 dsize:88
                                  ^^^^^

0x200 is set, indicating a data fork owner change needed to be
replayed on inode 0x44. A printk in the recovery code confirmed that
the inode change was recovered:

XFS (vdc): Mounting Filesystem
XFS (vdc): Starting recovery (logdev: internal)
recovering owner change ino 0x44
XFS (vdc): Version 5 superblock detected. This kernel L support enabled!
Use of these features in this kernel is at your own risk!
XFS (vdc): Ending recovery (logdev: internal)

The script used to test this was:

$ cat ./recovery-fsr.sh
#!/bin/bash

dev=/dev/vdc
mntpt=/mnt/scratch
testfile=$mntpt/testfile

umount $mntpt
mkfs.xfs -f -m crc=1 $dev
mount $dev $mntpt
chmod 777 $mntpt

for i in `seq 10000 -1 0`; do
	xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1
done
xfs_bmap -vp $testfile |head -20

xfs_fsr -d -v $testfile &
sleep 10
/home/dave/src/xfstests-dev/src/godown -f $mntpt
wait
umount $mntpt

xfs_logprint -t $dev |tail -20
time mount $dev $mntpt
xfs_bmap -vp $testfile
umount $mntpt
$

Signed-off-by: Dave Chinner
Reviewed-by: Mark Tinguely
Signed-off-by: Ben Myers
---
 fs/xfs/xfs_btree.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 047573f0270..5690e102243 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -3907,13 +3907,16 @@ xfs_btree_get_rec(
  * buffer as an ordered buffer and log it appropriately. We need to ensure that
  * we mark the region we change dirty so that if the buffer is relogged in
  * a subsequent transaction the changes we make here as an ordered buffer are
- * correctly relogged in that transaction.
+ * correctly relogged in that transaction.  If we are in recovery context, then
+ * just queue the modified buffer as delayed write buffer so the transaction
+ * recovery completion writes the changes to disk.
  */
 static int
 xfs_btree_block_change_owner(
 	struct xfs_btree_cur	*cur,
 	int			level,
-	__uint64_t		new_owner)
+	__uint64_t		new_owner,
+	struct list_head	*buffer_list)
 {
 	struct xfs_btree_block	*block;
 	struct xfs_buf		*bp;
@@ -3930,16 +3933,19 @@ xfs_btree_block_change_owner(
 		block->bb_u.s.bb_owner = cpu_to_be32(new_owner);
 
 	/*
-	 * Log owner change as an ordered buffer. If the block is a root block
-	 * hosted in an inode, we might not have a buffer pointer here and we
-	 * shouldn't attempt to log the change as the information is already
-	 * held in the inode and discarded when the root block is formatted into
-	 * the on-disk inode fork. We still change it, though, so everything is
-	 * consistent in memory.
+	 * If the block is a root block hosted in an inode, we might not have a
+	 * buffer pointer here and we shouldn't attempt to log the change as the
+	 * information is already held in the inode and discarded when the root
+	 * block is formatted into the on-disk inode fork. We still change it,
+	 * though, so everything is consistent in memory.
 	 */
 	if (bp) {
-		xfs_trans_ordered_buf(cur->bc_tp, bp);
-		xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
+		if (cur->bc_tp) {
+			xfs_trans_ordered_buf(cur->bc_tp, bp);
+			xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
+		} else {
+			xfs_buf_delwri_queue(bp, buffer_list);
+		}
 	} else {
 		ASSERT(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE);
 		ASSERT(level == cur->bc_nlevels - 1);
@@ -3956,7 +3962,8 @@ xfs_btree_block_change_owner(
 int
 xfs_btree_change_owner(
 	struct xfs_btree_cur	*cur,
-	__uint64_t		new_owner)
+	__uint64_t		new_owner,
+	struct list_head	*buffer_list)
 {
 	union xfs_btree_ptr     lptr;
 	int			level;
@@ -3986,7 +3993,8 @@ xfs_btree_change_owner(
 		/* for each buffer in the level */
 		do {
 			error = xfs_btree_block_change_owner(cur, level,
-								     new_owner);
+								     new_owner,
+								     buffer_list);
 		} while (!error);
 
 		if (error != ENOENT)
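
Reviewer note: the branching this patch adds to xfs_btree_block_change_owner() can be modelled in userspace roughly as below. This is only a sketch under stated assumptions, not kernel code: every model_* name is hypothetical, the 0x200 flag value is taken from the logprint note in the commit message, and the delayed write list is reduced to a counter that does not de-duplicate buffers. It shows the three outcomes (ordered logging when a transaction exists, delayed write queueing in recovery, in-memory only for an inode-hosted root block) and that the owner is overwritten unconditionally, which is what makes repeated brute-force replay harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed from the commit message: 0x200 marks a data fork owner change. */
#define MODEL_ILOG_DOWNER 0x200u

/* What the model did with the block's backing buffer (hypothetical enum). */
enum model_action {
	MODEL_LOGGED_ORDERED,	/* transaction present: log as ordered buffer */
	MODEL_QUEUED_DELWRI,	/* recovery: no transaction, queue delayed write */
	MODEL_IN_MEMORY_ONLY,	/* root block hosted in the inode, no buffer */
};

struct model_block {
	uint64_t owner;		/* stands in for the on-disk bb_owner field */
};

struct model_cursor {
	bool has_transaction;	/* stands in for cur->bc_tp != NULL */
	bool root_in_inode;	/* stands in for XFS_BTREE_ROOT_IN_INODE */
};

struct model_delwri_list {
	size_t queued;		/* number of queue calls; no de-duplication */
};

/* Recovery only walks the btree when the inode log flags request it. */
static bool model_needs_owner_change(uint32_t ilog_flags)
{
	return (ilog_flags & MODEL_ILOG_DOWNER) != 0;
}

static enum model_action
model_block_change_owner(const struct model_cursor *cur,
			 struct model_block *block,
			 bool has_buffer,
			 uint64_t new_owner,
			 struct model_delwri_list *buffer_list)
{
	/*
	 * Brute force: always overwrite. The block may already carry the
	 * new owner, so the old value is never inspected.
	 */
	block->owner = new_owner;

	if (!has_buffer) {
		/* Only an inode-hosted root block may lack a buffer. */
		assert(cur->root_in_inode);
		return MODEL_IN_MEMORY_ONLY;
	}

	if (cur->has_transaction)
		return MODEL_LOGGED_ORDERED;

	/* Recovery context: the caller must supply a delayed write list. */
	assert(buffer_list != NULL);
	buffer_list->queued++;
	return MODEL_QUEUED_DELWRI;
}
```

In this model a runtime caller passes a cursor with has_transaction set and may pass NULL for the list, while a recovery caller clears has_transaction and supplies the list, mirroring how the patch keys off cur->bc_tp.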