summaryrefslogtreecommitdiffstats
path: root/fs/ceph/file.c
AgeCommit message (Collapse)Author
2012-12-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph update from Sage Weil: "There are a few different groups of commits here. The largest is Alex's ongoing work to enable the coming RBD features (cloning, striping). There is some cleanup in libceph that goes along with it. Cyril and David have fixed some problems with NFS reexport (leaking dentries and page locks), and there is a batch of patches from Yan fixing problems with the fs client when running against a clustered MDS. There are a few bug fixes mixed in for good measure, many of which will be going to the stable trees once they're upstream. My apologies for the late pull. There is still a gremlin in the rbd map/unmap code and I was hoping to include the fix for that as well, but we haven't been able to confirm the fix is correct yet; I'll send that in a separate pull once it's nailed down." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits) rbd: get rid of rbd_{get,put}_dev() libceph: register request before unregister linger libceph: don't use rb_init_node() in ceph_osdc_alloc_request() libceph: init event->node in ceph_osdc_create_event() libceph: init osd->o_node in create_osd() libceph: report connection fault with warning libceph: socket can close in any connection state rbd: don't use ENOTSUPP rbd: remove linger unconditionally rbd: get rid of RBD_MAX_SEG_NAME_LEN libceph: avoid using freed osd in __kick_osd_requests() ceph: don't reference req after put rbd: do not allow remove of mounted-on image libceph: Unlock unprocessed pages in start_read() error path ceph: call handle_cap_grant() for cap import message ceph: Fix __ceph_do_pending_vmtruncate ceph: Don't add dirty inode to dirty list if caps is in migration ceph: Fix infinite loop in __wake_requests ceph: Don't update i_max_size when handling non-auth cap bdi_register: add __printf verification, fix arg mismatch ...
2012-12-17lseek: the "whence" argument is called "whence"Andrew Morton
But the kernel decided to call it "origin" instead. Fix most of the sites. Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-05ceph: Fix i_size update raceSage Weil
ceph_aio_write() has an optimization that marks cap EPH_CAP_FILE_WR dirty before data is copied to page cache and inode size is updated. If ceph_check_caps() flushes the dirty cap before the inode size is updated, MDS can miss the new inode size. The fix is move ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call __ceph_mark_dirty_caps() after inode size is updated. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01ceph: propagate layout error on osd request creationSage Weil
If we are creating an osd request and get an invalid layout, return an EINVAL to the caller. We switch up the return to have an error code instead of NULL implying -ENOMEM. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-02ceph: simplify+fix atomic_openSage Weil
The initial ->atomic_open op was carried over from the old intent code, which was incomplete and didn't really work. Replace it with a fresh method. In particular: * always attempt to do an atomic open+lookup, both for the create case and for lookups of existing files. * fix symlink handling by returning 1 to the VFS so that we can follow the link to its destination. This fixes a longstanding ceph bug (#2392). Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14kill struct opendataAl Viro
Just pass struct file *. Methods are happier that way... There's no need to return struct file * from finish_open() now, so let it return int. Next: saner prototypes for parts in namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14make ->atomic_open() return intAl Viro
Change of calling conventions: old new NULL 1 file 0 ERR_PTR(-ve) -ve Caller *knows* that struct file *; no need to return it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14->atomic_open() prototype change - pass int * instead of bool *Al Viro
... and let finish_open() report having opened the file via that sucker. Next step: don't modify od->filp at all. [AV: FILE_CREATE was already used by cifs; Miklos' fix folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14ceph: implement i_op->atomic_open()Miklos Szeredi
Add an ->atomic_open implementation which replaces the atomic lookup+open+create operation implemented via ->lookup and ->create operations. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14ceph: remove unused arg from ceph_lookup_open()Miklos Szeredi
What was the purpose of this? Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-07ceph: drop support for preferred_osd pgsSage Weil
This was an ill-conceived feature that has been removed from Ceph. Do this gracefully: - reject attempts to specify a preferred_osd via the ioctl - stop exposing this information via virtual xattrs - always fill in -1 for requests, in case we talk to an older server - don't calculate preferred_osd placements/pgids Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
2011-12-13ceph: fix SEEK_CUR, SEEK_SET regressionSage Weil
Commit 06222e491e663dac939f04b125c9dc52126a75c4 got the if wrong so that it always evaluates as true. This is semantically harmless, but makes SEEK_CUR and SEEK_SET needlessly query the server. Rewrite the if to explicitly enumerate the cases we DO need a valid i_size to make this code less fragile. Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-07ceph: use i_ceph_lock instead of i_lockSage Weil
We have been using i_lock to protect all kinds of data structures in the ceph_inode_info struct, including lists of inodes that we need to iterate over while avoiding races with inode destruction. That requires grabbing a reference to the inode with the list lock protected, but igrab() now takes i_lock to check the inode flags. Changing the list lock ordering would be a painful process. However, using a ceph-specific i_ceph_lock in the ceph inode instead of i_lock is a simple mechanical change and avoids the ordering constraints imposed by igrab(). Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) ceph: document unlocked d_parent accesses ceph: explicitly reference rename old_dentry parent dir in request ceph: document locking for ceph_set_dentry_offset ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug ceph: protect d_parent access in ceph_d_revalidate ceph: protect access to d_parent ceph: handle racing calls to ceph_init_dentry ceph: set dir complete frag after adding capability rbd: set blk_queue request sizes to object size ceph: set up readahead size when rsize is not passed rbd: cancel watch request when releasing the device ceph: ignore lease mask ceph: fix ceph_lookup_open intent usage ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC ceph: fix bad parent_inode calc in ceph_lookup_open ceph: avoid carrying Fw cap during write into page cache libceph: don't time out osd requests that haven't been received ceph: report f_bfree based on kb_avail rather than diffing. ceph: only queue capsnap if caps are dirty ceph: fix snap writeback when racing with writes ...
2011-07-26ceph: protect access to d_parentSage Weil
d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26ceph: fix ceph_lookup_open intent usageSage Weil
We weren't properly calling lookup_instantiate_filp when setting up the lookup intent, which could lead to file leakage on errors. So: - use separate helper for the hidden snapdir translation, immediately following the mds request - use ceph_finish_lookup for the final dentry/return value dance in the exit path - lookup_instantiate_filp on success Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNCSage Weil
We only need to put these on the directory unsafe list if they have side effects that fsync(2) should flush out. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26ceph: fix bad parent_inode calc in ceph_lookup_openSage Weil
We were always getting NULL here because the intent file f_dentry is always NULL at this point, which means we were always passing NULL to ceph_mdsc_do_request. In reality, this was fine, since this isn't currently ever a write operation that needs to get strung on the dir's unsafe list. Use the dir explicitly, and only pass it if this open has side-effects that a dir fsync should flush. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26ceph: avoid carrying Fw cap during write into page cacheSage Weil
The generic_file_aio_write call may block on balance_dirty_pages while we flush data to the OSDs. If we hold a reference to the FILE_WR cap during that interval revocation by the MDS (e.g., to do a stat(2)) may be very slow. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26ceph: add F_SYNC file flag to force sync (non-O_DIRECT) ioSage Weil
This allows us to force IO through the sync path which you normally only get when multiple clients are reading/writing to the same file or by mounting with -o sync. Among other things, this lets test programs verify correctness with a single mount. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-20fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseekJosef Bacik
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the properly due-dilligence is done. For example in NFS/CIFS we need to make sure the file size is update properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20don't transliterate lower bits of ->intent.open.flags to FMODE_...Al Viro
->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-13ceph: fix sync and dio writes across stripe boundariesSage Weil
We were iterating across stripe boundaries properly, but not moving the write buffer pointer forward. This caused us to rewrite the same data after the break. Fix by adjusting the data pointer forward, and recalculating the io and buffer alignment after the break. Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-13ceph: fix page alignment correctionsSage Weil
dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1 dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1 Reported-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07ceph: fix ENOENT logic in striped_readSage Weil
Getting ENOENT is equivalent to reading 0 bytes. Make that correction before setting up the hit_stripe and was_short flags. Fixes the following case: dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0 dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct Reported-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07ceph: fix short sync reads from the OSDSage Weil
If we get a short read from the OSD because the object is small, we need to zero the remainder of the buffer. For O_DIRECT reads, the attempted range is not trimmed to i_size by the VFS, so we were actually looping indefinitely. Fix by trimming by i_size, and the unconditionally zeroing the trailing range. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07ceph: use ihold when we already have an inode refSage Weil
We should use ihold whenever we already have a stable inode ref, even when we aren't holding i_lock. This avoids adding new and unnecessary locking dependencies. Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-04ceph: do not call __mark_dirty_inode under i_lockSage Weil
The __mark_dirty_inode helper now takes i_lock as of 250df6ed. Fix the one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the flags value so that the callers can do it outside of i_lock. Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21ceph: add request to the tail of unsafe write listHenry C Chang
In sync_write_wait(), we assume that the newest request is at the tail of unsafe write list. We should maintain the semantics here. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21ceph: remove request from unsafe list if it is canceled/timed outHenry C Chang
This fixes the list corruption warning like this: ------------[ cut here ]------------ WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81() Hardware name: X8DTU list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130). Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan] Pid: 10977, comm: smbd Tainted: G W 2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1 Call Trace: [<ffffffff8105753c>] warn_slowpath_common+0x7c/0x94 [<ffffffff810575ab>] warn_slowpath_fmt+0x41/0x43 [<ffffffff812351a3>] __list_add+0x68/0x81 [<ffffffffa014799d>] ceph_aio_write+0x614/0x8a2 [ceph] [<ffffffff8111d2a0>] do_sync_write+0xe8/0x125 [<ffffffff81075a1f>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811f21ec>] ? selinux_file_permission+0x5c/0xb3 [<ffffffff811e8521>] ? security_file_permission+0x16/0x18 [<ffffffff8111d864>] vfs_write+0xae/0x10b [<ffffffff8111d91b>] sys_pwrite64+0x5a/0x76 [<ffffffff81012d32>] system_call_fastpath+0x16/0x1b ---[ end trace 08573eb9f07ff6f4 ]--- Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17ceph: mark user pages dirty on direct-io readsHenry C Chang
For read operation, we have to set the argument _write_ of get_user_pages to 1 since we will write data to pages. Also, we need to SetPageDirty before releasing these pages. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15ceph: fix direct-io on non-page-aligned buffersHenry C Chang
The user buffer may be 512-byte aligned, not page-aligned. We were assuming the buffer was page-aligned and only accounting for non-page-aligned io offsets. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09ceph: make page alignment explicit in osd interfaceSage Weil
We used to infer alignment of IOs within a page based on the file offset, which assumed they matched. This broke with direct IO that was not aligned to pages (e.g., 512-byte aligned IO). We were also trusting the alignment specified in the OSD reply, which could have been adjusted by the server. Explicitly specify the page alignment when setting up OSD IO requests. Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09ceph: fix comment, remove extraneous argsSage Weil
The offset/length arguments aren't used. Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07ceph: fix open for write on clustered mdsSage Weil
Normally when we open a file we already have a cap, and simply update the wanted set. However, if we open a file for write, but don't have an auth cap, that doesn't work; we need to open a new cap with the auth MDS. Only reuse existing caps if we are opening for read or the existing cap is auth. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: factor out libceph from Ceph file systemYehuda Sadeh
This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07ceph: fix list_add usage on unsafe_writes listHenry C Chang
Fix argument order. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03ceph: whitespace cleanupSage Weil
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02ceph: add flock/fcntl lock supportGreg Farnum
Implement flock inode operation to support advisory file locking. All lock/unlock operations are synchronous with the MDS. Lock state is sent when reconnecting to a recovering MDS to restore the shared lock state. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01ceph: code cleanupYehuda Sadeh
Mainly fixing minor issues reported by sparse. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01ceph: perform lazy reads when file mode and caps permitSage Weil
If the file mode is marked as "lazy," perform cached/buffered reads when the caps permit it. Adjust the rdcache_gen and invalidation logic accordingly so that we manage our cache based on the FILE_CACHE -or- FILE_LAZYIO cap bits. Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01ceph: perform lazy writes when file mode and caps permitSage Weil
If we have marked a file as "lazy" (using the ceph ioctl), perform buffered writes when the MDS caps allow it. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-27ceph: use complete_all and wake_up_allYehuda Sadeh
This fixes an issue triggered by running concurrent syncs. One of the syncs would go through while the other would just hang indefinitely. In any case, we never actually want to wake a single waiter, so the *_all functions should be used. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: clean up on forwarded aborted mds request ceph: fix leak of osd authorizer ceph: close out mds, osd connections before stopping auth ceph: make lease code DN specific fs/ceph: Use ERR_CAST ceph: renew auth tickets before they expire ceph: do not resend mon requests on auth ticket renewal ceph: removed duplicated #includes ceph: avoid possible null dereference ceph: make mds requests killable, not interruptible sched: add wait_for_completion_killable_timeout
2010-05-29fs/ceph: Use ERR_CASTJulia Lawall
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of the returned value is the same as the type of the enclosing function. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits) ceph: reuse mon subscribe message instead of allocated anew ceph: avoid resending queued message to monitor ceph: Storage class should be before const qualifier ceph: all allocation functions should get gfp_mask ceph: specify max_bytes on readdir replies ceph: cleanup pool op strings ceph: Use kzalloc ceph: use common helper for aborted dir request invalidation ceph: cope with out of order (unsafe after safe) mds reply ceph: save peer feature bits in connection structure ceph: resync headers with userland ceph: use ceph. prefix for virtual xattrs ceph: throw out dirty caps metadata, data on session teardown ceph: attempt mds reconnect if mds closes our session ceph: clean up send_mds_reconnect interface ceph: wait for mds OPEN reply to indicate reconnect success ceph: only send cap releases when mds is OPEN|HUNG ceph: dicard cap releases on mds restart ceph: make mon client statfs handling more generic ceph: drop src address(es) from message header [new protocol feature] ...
2010-05-21sanitize vfs_fsync calling conventionsChristoph Hellwig
Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsynmc_range. The next step will be removig the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-17ceph: all allocation functions should get gfp_maskYehuda Sadeh
This is essential, as for the rados block device we'll need to run in different contexts that would need flags that are other than GFP_NOFS. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17ceph: make ceph_msg_new return NULL on failure; clean up, fix callersSage Weil
Returning ERR_PTR(-ENOMEM) is useless extra work. Return NULL on failure instead, and fix up the callers (about half of which were wrong anyway). Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17ceph: use ceph_sb_to_client instead of ceph_clientCheng Renquan
ceph_sb_to_client and ceph_client are really identical, we need to dump one; while function ceph_client is confusing with "struct ceph_client", ceph_sb_to_client's definition is more clear; so we'd better switch all call to ceph_sb_to_client. -static inline struct ceph_client *ceph_client(struct super_block *sb) -{ - return sb->s_fs_info; -} Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>