Age | Commit message (Collapse) | Author |
|
o2dlm's userspace filesystem is an easy way to use the DLM from
userspace. It is intentionally simple. For example, it does not allow
for asynchronous behavior or lock conversion. This is intentional to
keep the interface simple.
Because there is no asynchronous notification, there is no way for a
process holding a lock to know another node needs the lock. This is the
number one complaint of ocfs2_dlmfs users. Turns out, we can solve this
very easily. We add poll() support to ocfs2_dlmfs. When a BAST is
received, the lock's file descriptor will receive POLLIN.
This is trivial to implement. Userdlm already has an appropriate
waitqueue, and the lock knows when it is blocked.
We add the "bast" capability to tell userspace this is available.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Over time, dlmfs has added some features that were not part of the
initial ABI. Unfortunately, some of these features are not detectable
via standard usage. For example, Linux's default poll always returns
POLLIN, so there is no way for a caller of poll(2) to know when dlmfs
added poll support. Instead, we provide this list of new capabilities.
Capabilities is a read-only attribute. We do it as a module parameter
so we can discover it whether dlmfs is built in, loaded, or even not
loaded (via modinfo).
The ABI features are local to this machine's dlmfs mount. This is
distinct from the locking protocol, which is concerned with inter-node
interaction.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2 can store extended attribute values as large as a single file. It
does this using a standard ocfs2 btree for the large value. However,
the previous code did not handle all error cases cleanly.
There are multiple problems to have.
1) We have trouble allocating space for a new xattr. This leaves us
with an empty xattr.
2) We overwrote an existing local xattr with a value root, and now we
have an error allocating the storage. This leaves us an empty xattr.
where there used to be a value. The value is lost.
3) We have trouble truncating a reused value. This leaves us with the
original entry pointing to the truncated original value. The value
is lost.
4) We have trouble extending the storage on a reused value. This leaves
us with the original value safely in place, but with more storage
allocated when needed.
This doesn't consider storing local xattrs (values that don't require a
btree). Those only fail when the journal fails.
Case (1) is easy. We just remove the xattr we added. We leak the
storage because we can't safely remove it, but otherwise everything is
happy. We'll print a warning about the leak.
Case (4) is easy. We still have the original value in place. We can
just leave the extra storage attached to this xattr. We return the
error, but the old value is untouched. We print a warning about the
storage.
Case (2) and (3) are hard because we've lost the original values. In
the old code, we ended up with values that could be partially read.
That's not good. Instead, we just wipe the xattr entry and leak the
storage. It stinks that the original value is lost, but now there isn't
a partial value to be read. We'll print a big fat warning.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_xattr_ibody_set() is the only remaining user of
ocfs2_xattr_set_entry(). ocfs2_xattr_set_entry() actually does two
things: it calls ocfs2_xa_set(), and it initializes the inline xattrs.
Initializing the inline space really belongs in its own call.
We lift the initialization to ocfs2_xattr_ibody_init(), called from
ocfs2_xattr_ibody_set() only when necessary. Now
ocfs2_xattr_ibody_set() can call ocfs2_xa_set() directly.
ocfs2_xattr_set_entry() goes away.
Another nice fact is that ocfs2_init_dinode_xa_loc() can trust
i_xattr_inline_size.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_xattr_block_set() calls into ocfs2_xattr_set_entry() with just the
HAS_XATTR flag. Most of the machinery of ocfs2_xattr_set_entry() is
skipped. All that really happens other than the call to ocfs2_xa_set()
is making sure the HAS_XATTR flag is set on the inode.
But HAS_XATTR should be set when we also set di->i_xattr_loc. And
that's done in ocfs2_create_xattr_block(). So let's move it there, and
then ocfs2_xattr_block_set() can just call ocfs2_xa_set().
While we're there, ocfs2_create_xattr_block() can take the set_ctxt for
a smaller argument list. It also learns to set HAS_XATTR_FL, because it
knows for sure. ocfs2_create_empty_xatttr_block() in the reflink path
fakes a set_ctxt to call ocfs2_create_xattr_block().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_xattr_set_in_bucket() doesn't need to do its own hacky space
checking. Let's let ocfs2_xa_prepare_entry() (via ocfs2_xa_set()) do
the more accurate work. Whenever it doesn't have space,
ocfs2_xattr_set_in_bucket() can try to get more space.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_xa_set() wraps the ocfs2_xa_prepare_entry()/ocfs2_xa_store_value()
logic. Both callers can now use the same routine. ocfs2_xa_remove()
moves directly into ocfs2_xa_set().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_xa_prepare_entry() gets all the logic to add, remove, or modify
external value trees. Now, when it exits, the entry is ready to receive
a value of any size.
ocfs2_xa_remove() is added to handle the complete removal of an entry.
It truncates the external value tree before calling
ocfs2_xa_remove_entry().
ocfs2_xa_store_inline_value() becomes ocfs2_xa_store_value(). It can
store any value.
ocfs2_xattr_set_entry() loses all the allocation logic and just uses
these functions. ocfs2_xattr_set_value_outside() disappears.
ocfs2_xattr_set_in_bucket() uses these functions and makes
ocfs2_xattr_set_entry_in_bucket() obsolete. That goes away, as does
ocfs2_xattr_bucket_set_value_outside() and
ocfs2_xattr_bucket_value_truncate().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
We're going to want to make sure our buffers get accessed and dirtied
correctly. So have the xa_loc do the work. This includes storing the
inode on ocfs2_xa_loc.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
We use the ocfs2_xattr_value_buf structure to manage external values.
It lets the value tree code do its work regardless of the containing
storage. ocfs2_xa_fill_value_buf() initializes a value buf from an
ocfs2_xa_loc entry.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Previously the xattr code would send in a fake value, containing a tree
root, to the function that installed name+value pairs. Instead, we pass
the real value to ocfs2_xa_set_inline_value(), and it notices that the
value cannot fit. Thus, it installs a tree root.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
We create two new functions on ocfs2_xa_loc, ocfs2_xa_prepare_entry()
and ocfs2_xa_store_inline_value().
ocfs2_xa_prepare_entry() makes sure that the xl_entry field of
ocfs2_xa_loc is ready to receive an xattr. The entry will point to an
appropriately sized name+value region in storage. If an existing entry
can be reused, it will be. If no entry already exists, it will be
allocated. If there isn't space to allocate it, -ENOSPC will be
returned.
ocfs2_xa_store_inline_value() stores the data that goes into the 'value'
part of the name+value pair. For values that don't fit directly, this
stores the value tree root.
A number of operations are added to ocfs2_xa_loc_operations to support
these functions. This reflects the disparate behaviors of xattr blocks
and buckets.
With these functions, the overlapping ocfs2_xattr_set_entry_local() and
ocfs2_xattr_set_entry_normal() can be replaced with a single call
scheme.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
An ocfs2 xattr entry stores the text name and value as a pair in the
storage area. Obviously names and values can be variable-sized. If a
value is too large for the entry storage, a tree root is stored instead.
The name+value pair is also padded.
Because of this, there are a million places in the code that do:
if (needs_external_tree(value_size)
namevalue_size = pad(name_size) + tree_root_size;
else
namevalue_size = pad(name_size) + pad(value_size);
Let's create some convenience functions to make the code more readable.
There are three forms. The first takes the raw sizes. The second takes
an ocfs2_xattr_info structure. The third takes an existing
ocfs2_xattr_entry.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Rather than calculating strlen all over the place, let's store the
name length directly on ocfs2_xattr_info.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
struct ocfs2_xattr_info is a useful structure describing an xattr
you'd like to set. Let's put prefixes on the member fields so it's
easier to read and use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Add ocfs2_xa_remove_entry(), which will remove an xattr entry from its
storage via the ocfs2_xa_loc descriptor.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The ocfs2 extended attribute (xattr) code is very flexible. It can
store xattrs in the inode itself, in an external block, or in a tree of
data structures. This allows the number of xattrs to be bounded by the
filesystem size.
However, the code that manages each possible storage location is
different. Maintaining the ocfs2 xattr code requires changing each hunk
separately.
This patch is the start of a series introducing the ocfs2_xa_loc
structure. This structure wraps the on-disk details of an xattr
entry. The goal is that the generic xattr routines can use
ocfs2_xa_loc without knowing the underlying storage location.
This first pass merely implements the basic structure, initializing it,
and wiping the name+value pair of the entry.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Add current->comm to the standard mlog() output to help with debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
When ocfs2 has to do CoW for refcounted extents, we disable direct I/O
and go through the buffered I/O path. This makes the combined check
easier to read.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
|
|
The bast mode that appears in the debugfs output should be
useful on both master and process nodes. lkb_highbast is
currently printed, and is only useful on the master node.
lkb_bastmode is only useful on the process node. This
patch sets lkb_bastmode on the master node as well, and
uses that value in the debugfs print.
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Although it is possible to get this information from the path,
its much easier to provide the lockspace as a seperate env
variable.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
The must_resend flag is always true, not false. In any case, we can
just ignore it anyway.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
When the lock master processes a successful operation (request,
convert, cancel, or unlock), it will process the effects of the
change before sending the reply for the operation. The "effects"
of the operation are:
- blocking callbacks (basts) for any newly granted locks
- waiting or converting locks that can now be granted
The cast is queued on the local node when the reply from the lock
master is received. This means that a lock holder can receive a
bast for a lock mode that is doesn't yet know has been granted.
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
We used to try to avoid freeing and then reallocating the osd
struct. This is a bit fragile due to potential interactions with
other references (beyond o_requests), and may be the cause of
this crash:
[120633.442358] BUG: unable to handle kernel NULL pointer dereference at (null)
[120633.443292] IP: [<ffffffff812549b6>] rb_erase+0x11d/0x277
[120633.443292] PGD f7ff3067 PUD f7f53067 PMD 0
[120633.443292] Oops: 0000 [#1] PREEMPT SMP
[120633.443292] last sysfs file: /sys/kernel/uevent_seqnum
[120633.443292] CPU 1
[120633.443292] Modules linked in: ceph fan ac battery psmouse ehci_hcd ide_pci_generic ohci_hcd thermal processor button
[120633.443292] Pid: 3023, comm: ceph-msgr/1 Not tainted 2.6.32-rc2 #12 H8SSL
[120633.443292] RIP: 0010:[<ffffffff812549b6>] [<ffffffff812549b6>] rb_erase+0x11d/0x277
[120633.443292] RSP: 0018:ffff8800f7b13a50 EFLAGS: 00010246
[120633.443292] RAX: ffff880022907819 RBX: ffff880022907818 RCX: 0000000000000000
[120633.443292] RDX: ffff8800f7b13a80 RSI: ffff8800f587eb48 RDI: 0000000000000000
[120633.443292] RBP: ffff8800f7b13a60 R08: 0000000000000000 R09: 0000000000000004
[120633.443292] R10: 0000000000000000 R11: ffff8800c4441000 R12: ffff8800f587eb48
[120633.443292] R13: ffff8800f58eaa00 R14: ffff8800f413c000 R15: 0000000000000001
[120633.443292] FS: 00007fbef6e226e0(0000) GS:ffff880009200000(0000) knlGS:0000000000000000
[120633.443292] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[120633.443292] CR2: 0000000000000000 CR3: 00000000f7c53000 CR4: 00000000000006e0
[120633.443292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[120633.443292] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[120633.443292] Process ceph-msgr/1 (pid: 3023, threadinfo ffff8800f7b12000, task ffff8800f5858b40)
[120633.443292] Stack:
[120633.443292] ffff8800f413c000 ffff8800f587e9c0 ffff8800f7b13a80 ffffffffa0098a86
[120633.443292] <0> 00000000000006f1 0000000000000000 ffff8800f7b13af0 ffffffffa009959b
[120633.443292] <0> ffff8800f413c000 ffff880022a68400 ffff880022a68400 ffff8800f587e9c0
[120633.443292] Call Trace:
[120633.443292] [<ffffffffa0098a86>] __remove_osd+0x4d/0xbc [ceph]
[120633.443292] [<ffffffffa009959b>] __map_osds+0x199/0x4fa [ceph]
[120633.443292] [<ffffffffa00999f4>] ? __send_request+0xf8/0x186 [ceph]
[120633.443292] [<ffffffffa0099beb>] kick_requests+0x169/0x3cb [ceph]
[120633.443292] [<ffffffffa009a8c1>] ceph_osdc_handle_map+0x370/0x522 [ceph]
Since we're probably screwed anyway if a small kmalloc is
failing, don't bother with trying to be clever here.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Except for SCSI no device drivers distinguish between physical and
hardware segment limits. Consolidate the two into a single segment
limit.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (41 commits)
of: remove undefined request_OF_resource & release_OF_resource
of/sparc: Remove sparc-local declaration of allnodes and devtree_lock
of: move definition of of_chosen into common code.
of: remove unused extern reference to devtree_lock
of: put default string compare and #a/s-cell values into common header
of/flattree: Don't assume HAVE_LMB
of: protect linux/of.h with CONFIG_OF
proc_devtree: fix THIS_MODULE without module.h
of: Remove old and misplaced function declarations
of/flattree: Make the kernel accept ePAPR style phandle information
of/flattree: endian-convert members of boot_param_header
of: assume big-endian properties, adding conversions where necessary
of: use __be32 for cell value accessors
of/flattree: use OF_ROOT_NODE_{SIZE,ADDR}_CELLS DEFAULT for fdt parsing
of/flattree: use callback to setup initrd from /chosen
proc_devtree: include linux/of.h
of: make set_node_proc_entry private to proc_devtree.c
of: include linux/proc_fs.h
of/flattree: merge early_init_dt_scan_memory() common code
of: add 'of_' prefix to machine_is_compatible()
...
|
|
Move any out_sent messages to out_queue _before_ checking if
out_queue is empty and going to STANDBY, or else we may drop
something that was never acked.
And clean up the code a bit (less goto).
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
This fixes lock ABBA inversion, as the ->invalidate_authorizer()
op may need to take a lock (or even call back into the
messenger).
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
and fcheck_files().
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Conflicts:
include/linux/blkdev.h
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
convert it to a real mutex.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
The server's callback client should stop trying to connect to the
client's callback server as soon as it gets ECONNREFUSED.
The NFS server's callback client does not call rpc_ping(), but appears
to have it's own "ping" procedure, so it wasn't covered by commit
caabea8a.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
|
|
Jeff correctly noted that using unsigned ea length is more intuitive.
CC: Jeff Lyaton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
When both blocking and completion callbacks are queued for lock,
the dlm would always deliver the completion callback (cast) first.
In some cases the blocking callback (bast) is queued before the
cast, though, and should be delivered first. This patch keeps
track of the order in which they were queued and delivers them
in that order.
This patch also keeps track of the granted mode in the last cast
and eliminates the following bast if the bast mode is compatible
with the preceding cast mode. This happens when a remotely mastered
lock is demoted, e.g. EX->NL, in which case the local node queues
a cast immediately after sending the demote message. In this way
a cast can be queued for a mode, e.g. NL, that makes an in-transit
bast extraneous.
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
commit_transaction has the same value as journal->j_running_transaction,
so we can simplify the assert statement.
Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enough, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Replace intermediate EXT4_MOUNT_XXX flags manipulation to
corresponding macro.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
fallocate() may potentially instantiate blocks past EOF, depending
on the flags used when it is called.
e2fsck currently has a test for blocks past i_size, and it
sometimes trips up - noticeably on xfstests 013 which runs fsstress.
This patch from Jiayang does fix it up - it (along with
e2fsprogs updates and other patches recently from Aneesh) has
survived many fsstress runs in a row.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Remove fs/ntfs/ChangeLog. No need for such files since we have git.
Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The tid is in the message header, not body. Broken since 6df058c0.
No need to look at next mds session; just mark the request and be done.
(The old error path was broken too, but now it's gone.)
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Verify the mds session is currently registered before handling
incoming messages. Clean up message handlers to pull mds out
of session->s_mds instead of less trustworthy src field.
Clean up con_{get,put} debug output.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The destroy_inode path needs no inode locks since there are no
inode references. Update __ceph_remove_cap comment to reflect
that it is called without cap->session->s_mutex in this case.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Fix skipping of unexpected message types from osd, mon.
Clean up pr_info and debug output.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Add an "ea_name" parameter to CIFSSMBQAllEAs. When it's set make it
behave like CIFSSMBQueryEA does now. The current callers of
CIFSSMBQueryEA are converted to use CIFSSMBQAllEAs, and the old
CIFSSMBQueryEA function is removed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|