Age | Commit message (Collapse) | Author |
|
a) count file openers correctly; i_count use was completely wrong
b) use new mutex for exclusion between final close/open/truncate,
to protect tailpacking logics. i_mutex use was wrong and resulted
in deadlocks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
I have encountered the same problem that Eric Sandeen described in
this post
http://lkml.org/lkml/fancy/2010/4/23/467
while experimenting with stackable filesystems.
The reason seems to be that ecryptfs calls lookup_one_len() to get the
lower dentry, which in turn calls the lower parent dirs d_revalidate()
with a NULL nameidata object.
If ecryptfs is the underlaying filesystem, the NULL pointer dereference
occurs, since ecryptfs is not prepared to handle a NULL nameidata.
I know that this cant happen any more, since it is no longer allowed to
mount ecryptfs upon itself.
But maybe this patch it useful nevertheless, since the problem would still
apply for an underlaying filesystem that implements d_revalidate() and is
not prepared to handle a NULL nameidata (I dont know if there actually
is such a fs).
With this patch (against 2.6.35-rc5) ecryptfs uses the vfs_lookup_path()
function instead of lookup_one_len() which ensures that the nameidata
passed to the lower filesystems d_revalidate().
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
|
The comments in the code indicate that file_info should be released if the
function fails. This releasing is done at the label out_free, not out.
The semantic match that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,f1,l;
position p1,p2;
expression *ptr != NULL;
@@
x@p1 = kmem_cache_zalloc(...);
...
if (x == NULL) S
<... when != x
when != if (...) { <+...x...+> }
(
x->f1 = E
|
(x->f1 == NULL || ...)
|
f(...,x->f1,...)
)
...>
(
return <+...x...+>;
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmem_cache_zalloc %s" % (p1[0].file,p1[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: stable@kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
|
In ecryptfs_lookup_and_interpose_lower() the lower mount is not decremented
if allocation of a dentry info struct failed. As a result the lower filesystem
cant be unmounted any more (since it is considered busy). This patch corrects
the reference counting.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: stable@kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
|
Lower filesystems that only implemented unlocked_ioctl weren't being
passed ioctl calls because eCryptfs only checked for
lower_file->f_op->ioctl and returned -ENOTTY if it was NULL.
eCryptfs shouldn't implement ioctl(), since it doesn't require the BKL.
This patch introduces ecryptfs_unlocked_ioctl() and
ecryptfs_compat_ioctl(), which passes the calls on to the lower file
system.
https://bugs.launchpad.net/ecryptfs/+bug/469664
Reported-by: James Dupin <james.dupin@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
|
Fix warning seen with "make -j24 CONFIG_DEBUG_SECTION_MISMATCH=y V=1":
fs/ecryptfs/messaging.c: In function 'ecryptfs_process_response':
fs/ecryptfs/messaging.c:276: warning: 'daemon' may be used uninitialized in this function
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
|
|
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Merge reason: The staging tree has introduced the easycap
driver lately. We need the latest updates to pushdown the
bkl in its ioctl helper.
|
|
Checkpatch.pl in 2.6.34 added a check for spaces between tabs.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
|
|
Handling of autofs ioctl numbers does not need to be generic
and can easily be done directly in autofs itself.
This also pushes the BKL into autofs and autofs4 ioctl
methods.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Cc: Autofs <autofs@linux.kernel.org>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
|
|
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
|
|
quiet the warning:
fs/omfs/file.c: In function 'omfs_get_block':
fs/omfs/file.c:225: warning: 'new_block' may be used uninitialized in
this function
new_block is used properly by the call to omfs_grow_extent()
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
|
|
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
do_coredump: Do not take BKL
init: Remove the BKL from startup code
|
|
* 'for-2.6.36' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd4: fix file open accounting for RDWR opens
nfsd: don't allow setting maxblksize after svc created
nfsd: initialize nfsd versions before creating svc
net: sunrpc: removed duplicated #include
nfsd41: Fix a crash when a callback is retried
nfsd: fix startup/shutdown order bug
nfsd: minor nfsd read api cleanup
gcc-4.6: nfsd: fix initialized but not read warnings
nfsd4: share file descriptors between stateid's
nfsd4: fix openmode checking on IO using lock stateid
nfsd4: miscellaneous process_open2 cleanup
nfsd4: don't pretend to support write delegations
nfsd: bypass readahead cache when have struct file
nfsd: minor nfsd_svc() cleanup
nfsd: move more into nfsd_startup()
nfsd: just keep single lockd reference for nfsd
nfsd: clean up nfsd_create_serv error handling
nfsd: fix error handling in __write_ports_addxprt
nfsd: fix error handling when starting nfsd with rpcbind down
nfsd4: fix v4 state shutdown error paths
...
|
|
Fix the module init error handling. There are a bunch of goto labels for
aborting the init procedure at different points and just undoing what needs
undoing - they aren't all in the right places, however.
This can lead to an oops like the following:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff81042a31>] destroy_workqueue+0x17/0xc0
...
Modules linked in: kafs(+) dns_resolver rxkad af_rxrpc fscache
Pid: 2171, comm: insmod Not tainted 2.6.35-cachefs+ #319 DG965RY/
...
Process insmod (pid: 2171, threadinfo ffff88003ca6a000, task ffff88003dcc3050)
...
Call Trace:
[<ffffffffa0055994>] afs_callback_update_kill+0x10/0x12 [kafs]
[<ffffffffa007d1c5>] afs_init+0x190/0x1ce [kafs]
[<ffffffffa007d035>] ? afs_init+0x0/0x1ce [kafs]
[<ffffffff810001ef>] do_one_initcall+0x59/0x14e
[<ffffffff8105f7ee>] sys_init_module+0x9c/0x1de
[<ffffffff81001eab>] system_call_fastpath+0x16/0x1b
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.36' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (42 commits)
NFS: NFSv4.1 is no longer a "developer only" feature
NFS: NFS_V4 is no longer an EXPERIMENTAL feature
NFS: Fix /proc/mount for legacy binary interface
NFS: Fix the locking in nfs4_callback_getattr
SUNRPC: Defer deleting the security context until gss_do_free_ctx()
SUNRPC: prevent task_cleanup running on freed xprt
SUNRPC: Reduce asynchronous RPC task stack usage
SUNRPC: Move the bound cred to struct rpc_rqst
SUNRPC: Clean up of rpc_bindcred()
SUNRPC: Move remaining RPC client related task initialisation into clnt.c
SUNRPC: Ensure that rpc_exit() always wakes up a sleeping task
SUNRPC: Make the credential cache hashtable size configurable
SUNRPC: Store the hashtable size in struct rpc_cred_cache
NFS: Ensure the AUTH_UNIX credcache is allocated dynamically
NFS: Fix the NFS users of rpc_restart_call()
SUNRPC: The function rpc_restart_call() should return success/failure
NFSv4: Get rid of the bogus RPC_ASSASSINATED(task) checks
NFSv4: Clean up the process of renewing the NFSv4 lease
NFSv4.1: Handle NFS4ERR_DELAY on SEQUENCE correctly
NFS: nfs_rename() should not have to flush out writebacks
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add retrieve request
fuse: add store request
fuse: don't use atomic kmap
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (45 commits)
nilfs2: reject filesystem with unsupported block size
nilfs2: avoid rec_len overflow with 64KB block size
nilfs2: simplify nilfs_get_page function
nilfs2: reject incompatible filesystem
nilfs2: add feature set fields to super block
nilfs2: clarify byte offset in super block format
nilfs2: apply read-ahead for nilfs_btree_lookup_contig
nilfs2: introduce check flag to btree node buffer
nilfs2: add btree get block function with readahead option
nilfs2: add read ahead mode to nilfs_btnode_submit_block
nilfs2: fix buffer head leak in nilfs_btnode_submit_block
nilfs2: eliminate inline keywords in btree implementation
nilfs2: get maximum number of child nodes from bmap object
nilfs2: reduce repetitive calculation of max number of child nodes
nilfs2: optimize calculation of min/max number of btree node children
nilfs2: remove redundant pointer checks in bmap lookup functions
nilfs2: get rid of nilfs_bmap_union
nilfs2: unify bmap set_target_v operations
nilfs2: get rid of nilfs_btree uses
nilfs2: get rid of nilfs_direct uses
...
|
|
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: Adding error check after calling ext4_mb_regular_allocator()
ext4: Fix dirtying of journalled buffers in data=journal mode
ext4: re-inline ext4_rec_len_(to|from)_disk functions
jbd2: Remove t_handle_lock from start_this_handle()
jbd2: Change j_state_lock to be a rwlock_t
jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
ext4: Add mount options in superblock
ext4: force block allocation on quota_off
ext4: fix freeze deadlock under IO
ext4: drop inode from orphan list if ext4_delete_inode() fails
ext4: check to make make sure bd_dev is set before dereferencing it
jbd2: Make barrier messages less scary
ext4: don't print scary messages for allocation failures post-abort
ext4: fix EFBIG edge case when writing to large non-extent file
ext4: fix ext4_get_blocks references
ext4: Always journal quota file modifications
ext4: Fix potential memory leak in ext4_fill_super
ext4: Don't error out the fs if the user tries to make a file too big
ext4: allocate stripe-multiple IOs on stripe boundaries
ext4: move aio completion after unwritten extent conversion
...
Fix up conflicts in fs/ext4/inode.c as per Ted.
Fix up xfs conflicts as per earlier xfs merge.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
ext3: Fix dirtying of journalled buffers in data=journal mode
ext3: default to ordered mode
quota: Use mark_inode_dirty_sync instead of mark_inode_dirty
quota: Change quota error message to print out disk and function name
MAINTAINERS: Update entries of ext2 and ext3
MAINTAINERS: Update address of Andreas Dilger
ext3: Avoid filesystem corruption after a crash under heavy delete load
ext3: remove vestiges of nobh support
ext3: Fix set but unused variables
quota: clean up quota active checks
quota: Clean up the namespace in dqblk_xfs.h
quota: check quota reservation on remove_dquot_ref
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
fs/dlm: Drop unnecessary null test
dlm: use genl_register_family_with_ops()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: super.c Fix warning: variable 'sbi' set but not used
udf: remove duplicated #include
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[DNS RESOLVER] Minor typo correction
DNS: Fixes for the DNS query module
cifs: Include linux/err.h for IS_ERR and PTR_ERR
DNS: Make AFS go to the DNS for AFSDB records for unknown cells
DNS: Separate out CIFS DNS Resolver code
cifs: account for new creduid=0x%x parameter in spnego upcall string
cifs: reduce false positives with inode aliasing serverino autodisable
CIFS: Make cifs_convert_address() take a const src pointer and a length
cifs: show features compiled in as part of DebugData
cifs: update README
Fix up trivial conflicts in fs/cifs/cifsfs.c due to workqueue changes
|
|
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
workqueue: mark init_workqueues() as early_initcall()
workqueue: explain for_each_*cwq_cpu() iterators
fscache: fix build on !CONFIG_SYSCTL
slow-work: kill it
gfs2: use workqueue instead of slow-work
drm: use workqueue instead of slow-work
cifs: use workqueue instead of slow-work
fscache: drop references to slow-work
fscache: convert operation to use workqueue instead of slow-work
fscache: convert object to use workqueue instead of slow-work
workqueue: fix how cpu number is stored in work->data
workqueue: fix mayday_mask handling on UP
workqueue: fix build problem on !CONFIG_SMP
workqueue: fix locking in retry path of maybe_create_worker()
async: use workqueue for worker pool
workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
workqueue: implement unbound workqueue
workqueue: prepare for WQ_UNBOUND implementation
libata: take advantage of cmwq and remove concurrency limitations
workqueue: fix worker management invocation without pending works
...
Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
|
|
Currently, o2net_accept_one() is allowed to accept a connection from
listening node itself, such a fake connection will not be successfully
established due to no handshake detected afterwards, and later end up
with triggering connecting worker in a loop.
We're going to fix this by treating such connection request as 'invalid',
since we've got no chance of requesting connection from a node to itself
in a OCFS2 cluster.
The fix doesn't hurt user's scan for o2net-listener, it always gets a
successful connection from userpace.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
When we need to take both dlm_domain_lock and dlm->spinlock, we should take
them in order of: dlm_domain_lock then dlm->spinlock.
There is pathes disobey this order. That is calling dlm_lockres_put() with
dlm->spinlock held in dlm_run_purge_list. dlm_lockres_put() calls dlm_put() at
the ref and dlm_put() locks on dlm_domain_lock.
Fix:
Don't grab/put the dlm when the initialising/releasing lockres.
That grab is not required because we don't call dlm_unregister_domain()
based on refcount.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
In the following situation, there remains an incorrect bit in refmap on the
recovery master. Finally the recovery master will fail at purging the lockres
due to the incorrect bit in refmap.
1) node A has no interest on lockres A any longer, so it is purging it.
2) the owner of lockres A is node B, so node A is sending de-ref message
to node B.
3) at this time, node B crashed. node C becomes the recovery master. it recovers
lockres A(because the master is the dead node B).
4) node A migrated lockres A to node C with a refbit there.
5) node A failed to send de-ref message to node B because it crashed. The failure
is ignored. no other action is done for lockres A any more.
For mormal, re-send the deref message to it to recovery master can fix it. Well,
ignoring the failure of deref to the original master and not recovering the lockres
to recovery master has the same effect. And the later is simpler.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Hi,
Thanks a lot for all the review and comments so far;) I'd like to send
the improved (V4) version of this patch.
This patch fixes a deadlock in OCFS2 ACL. We found this bug in OCFS2
and Samba integration using scenario, the symptom is several smbd
processes will be hung under heavy workload. Finally we found out it
is the nested PR lock calling that leads to this deadlock:
node1 node2
gr PR
|
V
PR(EX)---> BAST:OCFS2_LOCK_BLOCKED
|
V
rq PR
|
V
wait=1
After requesting the 2nd PR lock, the process "smbd" went into D
state. It can only be woken up when the 1st PR lock's RO holder equals
zero. There should be an ocfs2_inode_unlock in the calling path later
on, which can decrement the RO holder. But since it has been in
uninterruptible sleep, the unlock function has no chance to be called.
The related stack trace is:
smbd D ffff8800013d0600 0 9522 5608 0x00000000
ffff88002ca7fb18 0000000000000282 ffff88002f964500 ffff88002ca7fa98
ffff8800013d0600 ffff88002ca7fae0 ffff88002f964340 ffff88002f964340
ffff88002ca7ffd8 ffff88002ca7ffd8 ffff88002f964340 ffff88002f964340
Call Trace:
[<ffffffff80350425>] schedule_timeout+0x175/0x210
[<ffffffff8034f580>] wait_for_common+0xf0/0x210
[<ffffffffa03e12b9>] __ocfs2_cluster_lock+0x3b9/0xa90 [ocfs2]
[<ffffffffa03e7665>] ocfs2_inode_lock_full_nested+0x255/0xdb0 [ocfs2]
[<ffffffffa0446019>] ocfs2_get_acl+0x69/0x120 [ocfs2]
[<ffffffffa0446368>] ocfs2_check_acl+0x28/0x80 [ocfs2]
[<ffffffff800e3507>] acl_permission_check+0x57/0xb0
[<ffffffff800e357d>] generic_permission+0x1d/0xc0
[<ffffffffa03eecea>] ocfs2_permission+0x10a/0x1d0 [ocfs2]
[<ffffffff800e3f65>] inode_permission+0x45/0x100
[<ffffffff800d86b3>] sys_chdir+0x53/0x90
[<ffffffff80007458>] system_call_fastpath+0x16/0x1b
[<00007f34a4ef6927>] 0x7f34a4ef6927
For details, please see:
https://bugzilla.novell.com/show_bug.cgi?id=614332 and
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1278
Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The refcount record calculation in ocfs2_calc_refcount_meta_credits
is too optimistic that we can always allocate contiguous clusters
and handle an already existed refcount rec as a whole. Actually
because of file system fragmentation, we may have the chance to split
a refcount record into 3 parts during the transaction. So consider
the worst case in record calculation.
Cc: stable@kernel.org
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
This patch fixes two problems in dlm_run_purgelist
1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.
2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.
This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
When we have to take both dlm->master_lock and lockres->spinlock,
take them in order
lockres->spinlock and then dlm->master_lock.
The patch fixes a violation of the rule.
We can simply move taking dlm->master_lock to where we have dropped res->spinlock
since when we access res->state and free mle memory we don't need master_lock's
protection.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Setting the acl while creating a new inode depends on
the error codes of posix_acl_create_masq. This patch fix
a issue of overwriting the error codes of it.
Reported-by: Pawel Zawora <pzawora@gmail.com>
Cc: <stable@kernel.org> [ .33, .34 ]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
should take care of the periodic background write-out. However, the write-out
will actually start only 'dirty_writeback_interval' centisecs later, so we can
delay the wake-up.
This change was requested by Nick Piggin who pointed out that if we delay the
wake-up, we weed out 2 unnecessary contex switches, which matters because
'__mark_inode_dirty()' is a hot-path function.
This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
sets up a timer to wake-up the bdi thread and returns. So the wake-up is
delayed.
We also delete the timer in bdi threads just before writing-back. And
synchronously delete it when unregistering bdi. At the unregister point the bdi
does not have any users, so no one can arm it again.
Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
this change as well.
This patch also moves the 'bdi_wb_init()' function down in the file to avoid
forward-declaration of 'bdi_wakeup_thread_delayed()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
bad for battery-driven devices.
There are two types of activities bdi threads do:
1. process bdi works from the 'bdi->work_list'
2. periodic write-back
So there are 2 sources of wake-up events for bdi threads:
1. 'bdi_queue_work()' - submits bdi works
2. '__mark_inode_dirty()' - adds dirty I/O to bdi's
The former already has bdi wake-up code. The latter does not, and this patch
adds it.
'__mark_inode_dirty()' is hot-path function, but this patch adds another
'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
the bdi has no dirty inodes. So adding this spinlock should be fine and should
not affect performance.
This patch makes sure bdi threads and the forker thread do not wake-up if there
is nothing to do. The forker thread will nevertheless wake up at least every
5 min. to check whether it has to kill a bdi thread. This can also be optimized,
but is not worth it.
This patch also tidies up the warning about unregistered bid, and turns it from
an ugly crocodile to a simple 'WARN()' statement.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Currently, bdi threads can decide to exit if there were no useful activities
for 5 minutes. However, this causes nasty races: we can easily oops in the
'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.
And even if we do not oops, but the bdi tread exits immediately after we wake
it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
in the bdi work processing.
This patch makes the forker thread to be the central place which not only
creates bdi threads, but also kills them if they were inactive long enough.
This better design-wise.
Another reason why this change was done is to prepare for the further changes
which will prevent the bdi threads from waking up every 5 sec and wasting
power. Indeed, when the task does not wake up periodically anymore, it won't be
able to exit either.
This patch also moves the the 'wake_up_bit()' call from the bdi thread to the
forker thread as well. So now the forker thread sets the BDI_pending bit, then
forks the task or kills it, then clears the bit and wakes up the waiting
process.
The only process which may wain on the bit is 'bdi_wb_shutdown()'. This
function was changed as well - now it first removes the bdi from the
'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
guaranteed that the forker thread won't race with it, because the bdi is not
visible. Note, the forker thread sets the 'BDI_pending' bit under the
'bdi->wb_lock' which is essential for proper serialization.
And additionally, when we change 'bdi->wb.task', we now take the
'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
would when raced with, say, 'bdi_queue_work()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Currently bdi threads use local variable 'last_active' which stores last time
when the bdi thread did some useful work. Move this local variable to 'struct
bdi_writeback'. This is just a preparation for the further patches which will
make the forker thread decide when bdi threads should be killed.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
The forker thread removes bdis from 'bdi_list' before forking the bdi thread.
But this is wrong for at least 2 reasons.
Reason #1: if we temporary remove a bdi from the list, we may miss works which
would otherwise be given to us.
Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are
always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when
it races with the forker thread, it can shut down the bdi thread
at the same time as the forker creates it.
This patch makes sure the forker thread never removes bdis from 'bdi_list'
(which was suggested by Christoph Hellwig).
In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to
hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending'
flag.
NOTE! The error path is interesting. Currently, when we fail to create a bdi
thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the
bdi from the list, we cannot move it to the tail either, because then we can
mess up the RCU readers which walk the list. And also, we'll have the race
described above in "Reason #2".
But I not think that adding to the tail is any important so I just do not do
that.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Currently, bdi threads ('bdi_writeback_thread()') can lose wake-ups. For
example, if 'bdi_queue_work()' is executed after the bdi thread have had
finished 'wb_do_writeback()' but before it called
'schedule_timeout_interruptible()'.
To fix this issue, we have to check whether we have works to process after we
have changed the task state to 'TASK_INTERRUPTIBLE'.
This patch also clean-ups handling of the cases when 'dirty_writeback_interval'
is zero or non-zero.
Additionally, this patch also removes unneeded 'list_empty_careful()' call.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
The write-back code mixes words "thread" and "task" for the same things. This
is not a big deal, but still an inconsistency.
hch: a convention I tend to use and I've seen in various places
is to always use _task for the storage of the task_struct pointer,
and thread everywhere else. This especially helps with having
foo_thread for the actual thread and foo_task for a global
variable keeping the task_struct pointer
This patch renames:
* 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
* 'bdi_forker_task()' -> 'bdi_forker_thread()'
because bdi threads are 'bdi_writeback_thread()', so these names are more
consistent.
This patch also amends commentaries and makes them refer the forker and bdi
threads as "thread", not "task".
Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
'static void' instead of 'void static' and make checkpatch.pl happy.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
CODA should not be using defines in the global name space of
that nature, prefix them with CODA_.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
83ba7b07 cleans up the writeback.
So we don't use wb any more in get_next_work_item.
Let's remove unnecessary argument.
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
SPLICE_F_NONBLOCK is clearly documented to only affect blocking on the
pipe. In __generic_file_splice_read(), however, it causes an EAGAIN
if the page is currently being read.
This makes it impossible to write an application that only wants
failure if the pipe is full. For example if the same process is
handling both ends of a pipe and isn't otherwise able to determine
whether a splice to the pipe will fill it or not.
We could make the read non-blocking on O_NONBLOCK or some other splice
flag, but for now this is the simplest fix.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.
This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.
The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.
Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Tracing high level background writeback events is good, but it doesn't
give the entire picture. Add visibility into write throttling to catch IO
dispatched by foreground throttling of processing dirtying lots of pages.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Trace queue/sched/exec parts of the writeback loop. This provides
insight into when and why flusher threads are scheduled to run. e.g
a sync invocation leaves traces like:
sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
This also lays the foundation for adding more writeback tracing to
provide deeper insight into the whole writeback path.
The original tracing code is from Jens Axboe, though this version is
a rewrite as a result of the code being traced changing
significantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
No real bugs I believe, just some dead code, and some
shut up code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Move all code for the writeback thread into fs/fs-writeback.c instead of
splitting it over two functions in two files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|