From 8ebc423238341b52912c7295b045a32477b33f09 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 7 Apr 2009 04:19:49 +0200 Subject: reiserfs: kill-the-BKL This patch is an attempt to remove the Bkl based locking scheme from reiserfs and is intended. It is a bit inspired from an old attempt by Peter Zijlstra: http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html The bkl is heavily used in this filesystem to prevent from concurrent write accesses on the filesystem. Reiserfs makes a deep use of the specific properties of the Bkl: - It can be acqquired recursively by a same task - It is released on the schedule() calls and reacquired when schedule() returns The two properties above are a roadmap for the reiserfs write locking so it's very hard to simply replace it with a common mutex. - We need a recursive-able locking unless we want to restructure several blocks of the code. - We need to identify the sites where the bkl was implictly relaxed (schedule, wait, sync, etc...) so that we can in turn release and reacquire our new lock explicitly. Such implicit releases of the lock are often required to let other resources producer/consumer do their job or we can suffer unexpected starvations or deadlocks. So the new lock that replaces the bkl here is a per superblock mutex with a specific property: it can be acquired recursively by a same task, like the bkl. For such purpose, we integrate a lock owner and a lock depth field on the superblock information structure. The first axis on this patch is to turn reiserfs_write_(un)lock() function into a wrapper to manage this mutex. Also some explicit calls to lock_kernel() have been converted to reiserfs_write_lock() helpers. The second axis is to find the important blocking sites (schedule...(), wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit release of the write lock on these locations before blocking. Then we can safely wait for those who can give us resources or those who need some. Typically this is a fight between the current writer, the reiserfs workqueue (aka the async commiter) and the pdflush threads. The third axis is a consequence of the second. The write lock is usually on top of a lock dependency chain which can include the journal lock, the flush lock or the commit lock. So it's dangerous to release and trying to reacquire the write lock while we still hold other locks. This is fine with the bkl: T1 T2 lock_kernel() mutex_lock(A) unlock_kernel() // do something lock_kernel() mutex_lock(A) -> already locked by T1 schedule() (and then unlock_kernel()) lock_kernel() mutex_unlock(A) .... This is not fine with a mutex: T1 T2 mutex_lock(write) mutex_lock(A) mutex_unlock(write) // do something mutex_lock(write) mutex_lock(A) -> already locked by T1 schedule() mutex_lock(write) -> already locked by T2 deadlock The solution in this patch is to provide a helper which releases the write lock and sleep a bit if we can't lock a mutex that depend on it. It's another simulation of the bkl behaviour. The last axis is to locate the fs callbacks that are called with the bkl held, according to Documentation/filesystem/Locking. Those are: - reiserfs_remount - reiserfs_fill_super - reiserfs_put_super Reiserfs didn't need to explicitly lock because of the context of these callbacks. But now we must take care of that with the new locking. After this patch, reiserfs suffers from a slight performance regression (for now). On UP, a high volume write with dd reports an average of 27 MB/s instead of 30 MB/s without the patch applied. Signed-off-by: Frederic Weisbecker Reviewed-by: Ingo Molnar Cc: Jeff Mahoney Cc: Peter Zijlstra Cc: Bron Gondwana Cc: Andrew Morton Cc: Linus Torvalds Cc: Alexander Viro LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- include/linux/reiserfs_fs_sb.h | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'include/linux/reiserfs_fs_sb.h') diff --git a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h index dab68bbed67..045c3721367 100644 --- a/include/linux/reiserfs_fs_sb.h +++ b/include/linux/reiserfs_fs_sb.h @@ -7,6 +7,8 @@ #ifdef __KERNEL__ #include #include +#include +#include #endif typedef enum { @@ -355,6 +357,13 @@ struct reiserfs_sb_info { struct reiserfs_journal *s_journal; /* pointer to journal information */ unsigned short s_mount_state; /* reiserfs state (valid, invalid) */ + /* Serialize writers access, replace the old bkl */ + struct mutex lock; + /* Owner of the lock (can be recursive) */ + struct task_struct *lock_owner; + /* Depth of the lock, start from -1 like the bkl */ + int lock_depth; + /* Comment? -Hans */ void (*end_io_handler) (struct buffer_head *, int); hashf_t s_hash_function; /* pointer to function which is used -- cgit v1.2.3-70-g09d2 From 08f14fc8963e585e65b71212ce8050607b9b6c36 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Sat, 16 May 2009 19:10:38 +0200 Subject: kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock When do_balance() balances the tree, a trick is performed to provide the ability for other tree writers/readers to check whether do_balance() is executing concurrently (requires CONFIG_REISERFS_CHECK). This is done to protect concurrent accesses to the tree. The trick is the following: When do_balance is called, a unique global variable called cur_tb takes a pointer to the current tree to be rebalanced. Once do_balance finishes its work, cur_tb takes the NULL value. Then, concurrent tree readers/writers just have to check the value of cur_tb to ensure do_balance isn't executing concurrently. If it is, then it proves that schedule() occured on do_balance(), which then relaxed the bkl that protected the tree. Now that the bkl has be turned into a mutex, this check is still fine even though do_balance() becomes preemptible: the write lock will not be automatically released on schedule(), so the tree is still protected. But this is only fine if we have a single reiserfs mountpoint. Indeed, because the bkl is a global lock, it didn't allowed concurrent executions between a tree reader/writer in a mount point and a do_balance() on another tree from another mountpoint. So assuming all these readers/writers weren't supposed to be reentrant, the current check now sometimes detect false positives with the current per-superblock mutex which allows this reentrancy. This patch keeps the concurrent tree accesses check but moves it per superblock, so that only trees from a same mount point are checked to be not accessed concurrently. [ Impact: fix spurious panic while running several reiserfs mount-points ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/do_balan.c | 17 +++++------------ fs/reiserfs/fix_node.c | 5 +---- fs/reiserfs/prints.c | 4 ---- fs/reiserfs/stree.c | 5 +---- include/linux/reiserfs_fs_sb.h | 11 +++++++++++ 5 files changed, 18 insertions(+), 24 deletions(-) (limited to 'include/linux/reiserfs_fs_sb.h') diff --git a/fs/reiserfs/do_balan.c b/fs/reiserfs/do_balan.c index 128d3f7c8aa..60c08044066 100644 --- a/fs/reiserfs/do_balan.c +++ b/fs/reiserfs/do_balan.c @@ -21,14 +21,6 @@ #include #include -#ifdef CONFIG_REISERFS_CHECK - -struct tree_balance *cur_tb = NULL; /* detects whether more than one - copy of tb exists as a means - of checking whether schedule - is interrupting do_balance */ -#endif - static inline void buffer_info_init_left(struct tree_balance *tb, struct buffer_info *bi) { @@ -1840,11 +1832,12 @@ static int check_before_balancing(struct tree_balance *tb) { int retval = 0; - if (cur_tb) { + if (REISERFS_SB(tb->tb_sb)->cur_tb) { reiserfs_panic(tb->tb_sb, "vs-12335", "suspect that schedule " "occurred based on cur_tb not being null at " "this point in code. do_balance cannot properly " - "handle schedule occurring while it runs."); + "handle concurrent tree accesses on a same " + "mount point."); } /* double check that buffers that we will modify are unlocked. (fix_nodes should already have @@ -1986,7 +1979,7 @@ static inline void do_balance_starts(struct tree_balance *tb) "check");*/ RFALSE(check_before_balancing(tb), "PAP-12340: locked buffers in TB"); #ifdef CONFIG_REISERFS_CHECK - cur_tb = tb; + REISERFS_SB(tb->tb_sb)->cur_tb = tb; #endif } @@ -1996,7 +1989,7 @@ static inline void do_balance_completed(struct tree_balance *tb) #ifdef CONFIG_REISERFS_CHECK check_leaf_level(tb); check_internal_levels(tb); - cur_tb = NULL; + REISERFS_SB(tb->tb_sb)->cur_tb = NULL; #endif /* reiserfs_free_block is no longer schedule safe. So, we need to diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c index 3a685e3f754..d2f31330dca 100644 --- a/fs/reiserfs/fix_node.c +++ b/fs/reiserfs/fix_node.c @@ -563,9 +563,6 @@ static int get_num_ver(int mode, struct tree_balance *tb, int h, return needed_nodes; } -#ifdef CONFIG_REISERFS_CHECK -extern struct tree_balance *cur_tb; -#endif /* Set parameters for balancing. * Performs write of results of analysis of balancing into structure tb, @@ -2368,7 +2365,7 @@ int fix_nodes(int op_mode, struct tree_balance *tb, return REPEAT_SEARCH; } #ifdef CONFIG_REISERFS_CHECK - if (cur_tb) { + if (REISERFS_SB(tb->tb_sb)->cur_tb) { print_cur_tb("fix_nodes"); reiserfs_panic(tb->tb_sb, "PAP-8305", "there is pending do_balance"); diff --git a/fs/reiserfs/prints.c b/fs/reiserfs/prints.c index 536eacaeb71..adbc6f53851 100644 --- a/fs/reiserfs/prints.c +++ b/fs/reiserfs/prints.c @@ -349,10 +349,6 @@ void reiserfs_debug(struct super_block *s, int level, const char *fmt, ...) . */ -#ifdef CONFIG_REISERFS_CHECK -extern struct tree_balance *cur_tb; -#endif - void __reiserfs_panic(struct super_block *sb, const char *id, const char *function, const char *fmt, ...) { diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 6b025a42d51..5fa7118f04e 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -222,9 +222,6 @@ static inline int bin_search(const void *key, /* Key to search for. */ return ITEM_NOT_FOUND; } -#ifdef CONFIG_REISERFS_CHECK -extern struct tree_balance *cur_tb; -#endif /* Minimal possible key. It is never in the tree. */ const struct reiserfs_key MIN_KEY = { 0, 0, {{0, 0},} }; @@ -711,7 +708,7 @@ int search_by_key(struct super_block *sb, const struct cpu_key *key, /* Key to s !key_in_buffer(search_path, key, sb), "PAP-5130: key is not in the buffer"); #ifdef CONFIG_REISERFS_CHECK - if (cur_tb) { + if (REISERFS_SB(sb)->cur_tb) { print_cur_tb("5140"); reiserfs_panic(sb, "PAP-5140", "schedule occurred in do_balance!"); diff --git a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h index 045c3721367..52c83b6a758 100644 --- a/include/linux/reiserfs_fs_sb.h +++ b/include/linux/reiserfs_fs_sb.h @@ -417,6 +417,17 @@ struct reiserfs_sb_info { char *s_qf_names[MAXQUOTAS]; int s_jquota_fmt; #endif +#ifdef CONFIG_REISERFS_CHECK + + struct tree_balance *cur_tb; /* + * Detects whether more than one + * copy of tb exists per superblock + * as a means of checking whether + * do_balance is executing concurrently + * against another tree reader/writer + * on a same mount point. + */ +#endif }; /* Definitions of reiserfs on-disk properties: */ -- cgit v1.2.3-70-g09d2