path: root/fs
Age | Commit message | Author
2013-05-04 | cifs: ignore the unc= and prefixpath= mount options | Jeff Layton
...as advertised for 3.10. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds
Pull second round of VFS updates from Al Viro: "Assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xtensa simdisk: fix braino in "xtensa simdisk: switch to proc_create_data()"
  hostfs: use kmalloc instead of kzalloc
  hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h>
  hostfs: remove "will unlock" comment
  vfs: use list_move instead of list_del/list_add
  proc_devtree: Replace include linux/module.h with linux/export.h
  create_mnt_ns: unidiomatic use of list_add()
  fs: remove dentry_lru_prune()
  Removed unused typedef to avoid "unused local typedef" warnings.
  kill fs/read_write.h
  fs: Fix hang with BSD accounting on frozen filesystem
  sun3_scsi: add ->show_info()
  nubus: Kill nubus_proc_detach_device()
  more mode_t whack-a-mole...
  do_coredump(): don't wait for thaw if coredump has already been interrupted
  do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks")
2013-05-04 | hostfs: use kmalloc instead of kzalloc | James Hogan
The inode info structure is zeroed at allocation with kzalloc, and then all but one of the fields (including the largest, vfs_inode) are initialised explicitly. Switch to using kmalloc and initialise the remaining field too. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h> | James Hogan
Move HOSTFS_SUPER_MAGIC to <linux/magic.h> to be with its magical friends from other file systems. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | hostfs: remove "will unlock" comment | James Hogan
A "will unlock" comment was added to hostfs in the following commit, along with a spinlock: Commit e9193059b1b3733695d5b80e667778311695aa73 ("hostfs: fix races in dentry_name() and inode_name()"). But the spinlock was subsequently removed in the following commit: Commit ec2447c278ee973d35f38e53ca16ba7f965ae33d ("hostfs: simplify locking"). Since the comment is no longer applicable, remove it. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | vfs: use list_move instead of list_del/list_add | Wei Yongjun
Use list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
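To make the substitution concrete, here is a minimal user-space sketch of a circular doubly linked list modelled on the kernel's list_head. The helper names mirror the kernel ones, but the bodies (and the list_count() helper, which is hypothetical) are simplified illustrations, not the kernel's implementation.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert 'entry' right after 'head'. */
void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink 'entry' from whatever list it is currently on. */
void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* One call that replaces the list_del() + list_add() pair. */
void list_move(struct list_head *entry, struct list_head *head)
{
    list_del(entry);
    list_add(entry, head);
}

/* Count the entries on a list (the head itself is not counted). */
size_t list_count(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

With this sketch, moving an entry from list a to list b is a single list_move(&x, &b) call, after which a is empty and x is the first entry on b.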
2013-05-04 | proc_devtree: Replace include linux/module.h with linux/export.h | Syam Sidhardhan
Since it uses only the THIS_MODULE macro, including <linux/export.h> is the right way to go here. Signed-off-by: Syam Sidhardhan <s.syam@samsung.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | create_mnt_ns: unidiomatic use of list_add() | Al Viro
While list_add(A, B) and list_add(B, A) are equivalent when both A and B are guaranteed to be empty, the usual idiom is list_add(what, where), not the other way round... Not a bug per se, but only by accident and it makes RTFS harder for no good reason. Spotted-by: Rajat Sharma <fs.rajat@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
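The equivalence claimed above can be checked with a small user-space model. This is a sketch only: list_add() here is a simplified stand-in for the kernel helper, and is_two_ring() is a hypothetical checker added for illustration.

```c
#include <stdbool.h>

struct list_head {
    struct list_head *next, *prev;
};

void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* list_add(what, where): insert 'what' right after 'where'. */
void list_add(struct list_head *what, struct list_head *where)
{
    what->next = where->next;
    what->prev = where;
    where->next->prev = what;
    where->next = what;
}

/* True if 'a' and 'b' form a ring containing exactly those two nodes. */
bool is_two_ring(const struct list_head *a, const struct list_head *b)
{
    return a->next == b && a->prev == b && b->next == a && b->prev == a;
}
```

Starting from two empty (self-linked) heads, both list_add(&a, &b) and list_add(&b, &a) produce the same two-node ring, which is why the swapped arguments were harmless. Once either list holds other entries, the argument order decides which list the new node joins.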
2013-05-04 | fs: remove dentry_lru_prune() | Yan, Zheng
When pruning a dentry, its ancestor dentry can also be pruned. But the ancestor dentry does not go through dput(), so it does not get put on the dentry LRU. Hence associating d_prune with removing the dentry from the LRU is wrong. The fix is to remove dentry_lru_prune() and call the file system's d_prune() callback directly when pruning dentries. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | Removed unused typedef to avoid "unused local typedef" warnings. | Han Shen
Fix warnings about unused local typedefs (reported by gcc 4.8). Signed-off-by: Han Shen (shenhan@google.com) Change-Id: I4bccc234f1390daa808d2b309ed112e20c0ac096 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | kill fs/read_write.h | Al Viro
fs/compat.c doesn't need it anymore, so let's just move the remaining contents (two typedefs) into fs/read_write.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | do_coredump(): don't wait for thaw if coredump has already been interrupted | Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04 | do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks") | Al Viro
Cc: stable@vger.kernel.org Bisected-by: Michael Leun <lkml20130126@newton.leun.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-03 | nfsd4: don't allow owner override on 4.1 CLAIM_FH opens | J. Bruce Fields
The Linux client is using CLAIM_FH to implement regular opens, not just recovery cases, so it depends on the server to check permissions correctly. Therefore the owner override, which may make sense in the delegation recovery case, isn't right in the CLAIM_FH case.
Symptoms: on a client with 49f9a0fafd844c32f2abada047c0b9a5ba0d6255 "NFSv4.1: Enable open-by-filehandle", Bryan noticed this:
  touch test.txt
  chmod 000 test.txt
  echo test > test.txt
succeeding.
Cc: stable@kernel.org Reported-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-03 | Merge branch 'for-3.10' of git://linux-nfs.org/~bfields/linux | Linus Torvalds
Pull nfsd changes from J Bruce Fields:
"Highlights include:
 - Some more DRC cleanup and performance work from Jeff Layton
 - A gss-proxy upcall from Simo Sorce: currently krb5 mounts to the server using credentials from Active Directory often fail due to limitations of the svcgssd upcall interface. This replacement lifts those limitations. The existing upcall is still supported for backwards compatibility.
 - More NFSv4.1 support: at this point, a user with a current client who upgrades from 4.0 to 4.1 should see no regressions. In theory we do everything a 4.1 server is required to do. Patches for a couple minor exceptions are ready for 3.11, and with those and some more testing I'd like to turn 4.1 on by default in 3.11."
Fix up semantic conflict as per Stephen Rothwell and linux-next: Commit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server RPCGSS authentication") adds two new users of "PDE(inode)->data", but we're supposed to use "PDE_DATA(inode)" instead since commit d9dda78bad87 ("procfs: new helper - PDE_DATA(inode)"). The old PDE() macro is no longer available since commit c30480b92cf4 ("proc: Make the PROC_I() and PDE() macros internal to procfs")
* 'for-3.10' of git://linux-nfs.org/~bfields/linux: (60 commits)
  NFSD: SECINFO doesn't handle unsupported pseudoflavors correctly
  NFSD: Simplify GSS flavor encoding in nfsd4_do_encode_secinfo()
  nfsd: make symbol nfsd_reply_cache_shrinker static
  svcauth_gss: fix error return code in rsc_parse()
  nfsd4: don't remap EISDIR errors in rename
  svcrpc: fix gss-proxy to respect user namespaces
  SUNRPC: gssp_procedures[] can be static
  SUNRPC: define {create,destroy}_use_gss_proxy_proc_entry in !PROC case
  nfsd4: better error return to indicate SSV non-support
  nfsd: fix EXDEV checking in rename
  SUNRPC: Use gssproxy upcall for server RPCGSS authentication.
  SUNRPC: Add RPC based upcall mechanism for RPCGSS auth
  SUNRPC: conditionally return endtime from import_sec_context
  SUNRPC: allow disabling idle timeout
  SUNRPC: attempt AF_LOCAL connect on setup
  nfsd: Decode and send 64bit time values
  nfsd4: put_client_renew_locked can be static
  nfsd4: remove unused macro
  nfsd4: remove some useless code
  nfsd4: implement SEQ4_STATUS_RECALLABLE_STATE_REVOKED
  ...
2013-05-03 | Merge tag 'jfs-3.10' of git://github.com/kleikamp/linux-shaggy | Linus Torvalds
Pull jfs fixes from David Kleikamp: "A couple fixes for jfs"
(What's with the unhelpful pull request "explanations" from fs people today?)
* tag 'jfs-3.10' of git://github.com/kleikamp/linux-shaggy:
  jfs: fix a couple races
  jfs: avoid undefined behavior from left-shifting by 32 bits
2013-05-03 | Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs | Linus Torvalds
Pull ext3/jbd fixes from Jan Kara: "A couple of ext3/jbd fixes"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  jbd: use kmem_cache_zalloc for allocating journal head
  jbd: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
  jbd: don't wait (forever) for stale tid caused by wraparound
  ext3: fix data=journal fast mount/umount hang
2013-05-03 | NFSv4.x: Fix handling of partially delegated locks | Trond Myklebust
If an NFS client receives a delegation for a file after it has taken a lock on that file, we can currently end up in a situation where we mistakenly skip unlocking that file. The following patch swaps an erroneous check in nfs4_proc_unlck for whether or not the file has a delegation to one which checks whether or not we hold a lock stateid for that file. Reported-by: Chuck Lever <Chuck.Lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>=3.7] Tested-by: Chuck Lever <Chuck.Lever@oracle.com>
2013-05-03 | Merge tag 'for-v3.10' of git://git.infradead.org/users/cbou/linux-pstore | Linus Torvalds
Pull pstore update from Anton Vorontsov:
 - A new platform data parameter to specify ECC configuration;
 - Rounding fixup to not waste memory in ecc_blocks;
 - Restore ECC information printouts;
 - A small code cleanup: use kmemdup where appropriate.
* tag 'for-v3.10' of git://git.infradead.org/users/cbou/linux-pstore:
  pstore/ram: Restore ecc information block
  pstore/ram: Allow specifying ecc parameters in platform data
  pstore/ram: Include ecc_size when calculating ecc_block
  pstore: Replace calls to kmalloc and memcpy with kmemdup
2013-05-03 | Merge branch 'for_next' into for_linus | Jan Kara
2013-05-03 | ext4: fix fio regression | Yan, Zheng
We (Linux Kernel Performance project) found a regression introduced by commit:
  f7fec032aa ext4: track all extent status in extent status tree
The commit causes about a 20% performance decrease in the fio random write test. The profiler shows that rb_next() uses a lot of CPU time. The call stack is:
  rb_next
  ext4_es_find_delayed_extent
  ext4_map_blocks
  _ext4_get_block
  ext4_get_block_write
  __blockdev_direct_IO
  ext4_direct_IO
  generic_file_direct_write
  __generic_file_aio_write
  ext4_file_write
  aio_rw_vect_retry
  aio_run_iocb
  do_io_submit
  sys_io_submit
  system_call_fastpath
  io_submit
  td_io_getevents
  io_u_queued_complete
  thread_main
  main
  __libc_start_main
The cause is that ext4_es_find_delayed_extent() doesn't have an upper bound: it keeps searching until a delayed extent is found. When there are a lot of non-delayed entries in the extent status tree, ext4_es_find_delayed_extent() may use a lot of CPU time.
Reported-by: LKP project <lkp@linux.intel.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Cc: "Theodore Ts'o" <tytso@mit.edu>
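The cost of an unbounded search, and one way to bound it, can be sketched in user space. Everything below is a hypothetical model: struct extent, find_delayed_in_range() and the sorted array stand in for the extent status tree and are not the ext4 code. The point is only that passing an upper bound lets the scan stop early.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent status entry. */
struct extent {
    unsigned long start;   /* first logical block */
    unsigned long len;     /* number of blocks */
    bool delayed;          /* true for a delayed extent */
};

/*
 * Return the index of the first delayed extent that overlaps
 * [lblk, end], or -1 if there is none. 'ext' must be sorted by start.
 * '*visited' receives the number of in-range entries examined.
 */
long find_delayed_in_range(const struct extent *ext, size_t n,
                           unsigned long lblk, unsigned long end,
                           size_t *visited)
{
    *visited = 0;
    for (size_t i = 0; i < n; i++) {
        if (ext[i].start + ext[i].len <= lblk)
            continue;            /* entirely before the range */
        if (ext[i].start > end)
            break;               /* past the upper bound: stop searching */
        (*visited)++;
        if (ext[i].delayed)
            return (long)i;
    }
    return -1;
}
```

With four non-delayed extents followed by one delayed extent, a query limited to the first two extents gives up after two entries instead of walking to the end of the collection.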
2013-05-02 | Merge tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs | Linus Torvalds
Pull xfs update from Ben Myers:
"For 3.10-rc1 we have a number of bug fixes and cleanups and a currently experimental feature from David Chinner, CRC protection for metadata. CRCs are enabled by using mkfs.xfs to create a filesystem with the feature bits set.
 - numerous fixes for speculative preallocation
 - don't verify buffers on IO errors
 - rename of random32 to prandom32
 - refactoring/rearrangement in xfs_bmap.c
 - removal of unused m_inode_shrink in struct xfs_mount
 - fix error handling of xfs_bufs and readahead
 - quota driven preallocation throttling
 - fix WARN_ON in xfs_vm_releasepage
 - add ratelimited printk for different alert levels
 - fix spurious forced shutdowns due to freed Extent Free Intents
 - remove some obsolete XLOG_CIL_HARD_SPACE_LIMIT() macros
 - remove some obsoleted comments
 - (experimental) CRC support for metadata"
* tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs: (46 commits)
  xfs: fix da node magic number mismatches
  xfs: Remote attr validation fixes and optimisations
  xfs: Teach dquot recovery about CONFIG_XFS_QUOTA
  xfs: add metadata CRC documentation
  xfs: implement extended feature masks
  xfs: add CRC checks to the superblock
  xfs: buffer type overruns blf_flags field
  xfs: add buffer types to directory and attribute buffers
  xfs: add CRC protection to remote attributes
  xfs: split remote attribute code out
  xfs: add CRCs to attr leaf blocks
  xfs: add CRCs to dir2/da node blocks
  xfs: shortform directory offsets change for dir3 format
  xfs: add CRC checking to dir2 leaf blocks
  xfs: add CRC checking to dir2 data blocks
  xfs: add CRC checking to dir2 free blocks
  xfs: add CRC checks to block format directory blocks
  xfs: add CRC checks to remote symlinks
  xfs: split out symlink code into it's own file.
  xfs: add version 3 inode format with CRCs
  ...
2013-05-02 | Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc | Linus Torvalds
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlights this time around are:
 - A pile of additional POWER8 bits and nits, such as updated performance counter support (Michael Ellerman), new branch history buffer support (Anshuman Khandual), base support for the new PCI host bridge when not using the hypervisor (Gavin Shan) and other random related bits and fixes from various contributors.
 - Some rework of our page table format by Aneesh Kumar which fixes a thing or two and paves the way for THP support. THP itself will not make it this time around however.
 - More Freescale updates, including Altivec support on the new e6500 cores, new PCI controller support, and a pile of new boards support and updates.
 - The usual batch of trivial cleanups & fixes"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
  powerpc: Fix build error for book3e
  powerpc: Context switch the new EBB SPRs
  powerpc: Turn on the EBB H/FSCR bits
  powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
  powerpc: Setup BHRB instructions facility in HFSCR for POWER8
  powerpc: Fix interrupt range check on debug exception
  powerpc: Update tlbie/tlbiel as per ISA doc
  powerpc: Print page size info during boot
  powerpc: print both base and actual page size on hash failure
  powerpc: Fix hpte_decode to use the correct decoding for page sizes
  powerpc: Decode the pte-lp-encoding bits correctly.
  powerpc: Use encode avpn where we need only avpn values
  powerpc: Reduce PTE table memory wastage
  powerpc: Move the pte free routines from common header
  powerpc: Reduce the PTE_INDEX_SIZE
  powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
  powerpc: New hugepage directory format
  powerpc: Don't truncate pgd_index wrongly
  powerpc: Don't hard code the size of pte page
  powerpc: Save DAR and DSISR in pt_regs on MCE
  ...
2013-05-01 | ceph: use ceph_create_snap_context() | Alex Elder
Now that we have a library routine to create snap contexts, use it. This is part of: http://tracker.ceph.com/issues/4857 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: kill off osd data write_request parameters | Alex Elder
In the incremental move toward supporting distinct data items in an osd request some of the functions had "write_request" parameters to indicate, basically, whether the data belonged to in_data or the out_data. Now that we maintain the data fields in the op structure there is no need to indicate the direction, so get rid of the "write_request" parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | ceph: fix printk format warnings in file.c | Randy Dunlap
Fix printk format warnings by using %zd for 'ssize_t' variables:
  fs/ceph/file.c:751:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]
  fs/ceph/file.c:762:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@inktank.com>
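The same conversion can be tried in user space with snprintf(). This is a sketch under the assumption that ssize_t is the signed counterpart of size_t on the target platform (true on Linux with glibc); format_ret() is a hypothetical helper, not code from fs/ceph/file.c.

```c
#include <stdio.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/*
 * Format a ssize_t with %zd, the conversion the commit switches to.
 * Using %ld instead would trigger the -Wformat warning quoted above on
 * platforms where ssize_t is not 'long int'.
 */
int format_ret(char *buf, size_t size, ssize_t ret)
{
    return snprintf(buf, size, "ret = %zd", ret);
}
```

The 'z' length modifier tells the formatter that the argument has the width of size_t, so the same format string is correct on both 32-bit and 64-bit builds.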
2013-05-01 | ceph: fix race between writepages and truncate | Yan, Zheng
ceph_writepages_start() reads inode->i_size in two places. It can get different values between successive reads, because truncate can change inode->i_size at any time. The race can lead to a mismatch between the data length of the osd request and the pages marked as writeback. When the osd request finishes, it clears writeback pages according to its data length, so some pages can be left in the writeback state forever. The fix is to read inode->i_size only once, save its value in a local variable, and use the local variable wherever i_size is needed. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
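The "read once, then use the local copy" pattern looks roughly like this in a user-space model. All names here (struct fake_inode, struct write_plan, plan_write(), FAKE_PAGE_SIZE) are hypothetical, and the sketch ignores how the kernel actually reads i_size; it only shows two derived values being computed from a single snapshot.

```c
#include <stddef.h>

#define FAKE_PAGE_SIZE 4096L   /* assumed page size for the sketch */

/* Hypothetical inode whose size another thread may change at any time. */
struct fake_inode {
    volatile long i_size;
};

struct write_plan {
    long data_len;        /* bytes placed in the request */
    long pages_marked;    /* pages flagged as under writeback */
};

/*
 * Build a write plan from a single snapshot of i_size, so the request
 * length and the number of pages marked always agree with each other.
 */
struct write_plan plan_write(const struct fake_inode *inode)
{
    long size = inode->i_size;   /* read exactly once */
    struct write_plan plan;

    plan.data_len = size;
    plan.pages_marked = (size + FAKE_PAGE_SIZE - 1) / FAKE_PAGE_SIZE;
    return plan;
}
```

Had plan_write() read inode->i_size separately for each field, a concurrent truncate between the two reads could leave data_len and pages_marked describing different file sizes.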
2013-05-01 | ceph: apply write checks in ceph_aio_write | Yan, Zheng
Copy the write checks from __generic_file_aio_write() to ceph_aio_write() so that these checks also cover the sync write path. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 | ceph: take i_mutex before getting Fw cap | Yan, Zheng
There is a deadlock, as illustrated below. The fix is to take i_mutex before getting the Fw cap reference.
        write                    truncate                  MDS
---------------------     --------------------      --------------
get Fw cap
                          lock i_mutex
lock i_mutex (blocked)
                          request setattr.size  ->
                                                <-   revoke Fw cap
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 | libceph: change how "safe" callback is used | Alex Elder
An osd request currently has two callbacks. They inform the initiator of the request when we've received confirmation from the target osd that a request was received, and when the osd indicates all changes described by the request are durable.
The only time the second callback is used is in the ceph file system for a synchronous write. There's a race that makes some handling of this case unsafe. This patch addresses this problem. The error handling for this callback is also kind of gross, and this patch changes that as well.
In ceph_sync_write(), if a safe callback is requested we want to add the request on the ceph inode's unsafe items list. Because items on this list must have their tid set (by ceph_osdc_start_request()), the request is added *after* the call to that function returns. The problem with this is that there's a race between starting the request and adding it to the unsafe items list; the request may already be complete before ceph_sync_write() even begins to put it on the list.
To address this, we change the way the "safe" callback is used. Rather than just calling it when the request is "safe", we use it to notify the initiator of the bounds (start and end) of the period during which the request is *unsafe*. So the initiator gets notified just before the request gets sent to the osd (when it is "unsafe"), and again when it's known the results are durable (it's no longer unsafe). The first call will get made in __send_request(), just before the request message gets sent to the messenger for the first time. That function is only called by __send_queued(), which is always called with the osd client's request mutex held.
We then have this callback function insert the request on the ceph inode's unsafe list when we're told the request is unsafe. This will avoid the race because this call will be made under protection of the osd client's request mutex. It also nicely groups the setup and cleanup of the state associated with managing unsafe requests.
The name of the "safe" callback field is changed to "unsafe" to better reflect its new purpose. It has a Boolean "unsafe" parameter to indicate whether the request is becoming unsafe or is now safe. Because the "msg" parameter wasn't used, we drop that.
This resolves the original problem reported in: http://tracker.ceph.com/issues/4706
Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
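The shape of the reworked callback can be sketched in a few lines of user-space C. This is an illustration only: struct request, track_unsafe(), send_request() and request_is_safe() are hypothetical names, a counter stands in for the inode's unsafe list, and the locking that the commit relies on (the osd client's request mutex) is left out.

```c
#include <stdbool.h>

struct request;

/* Called with unsafe=true just before sending, unsafe=false once durable. */
typedef void (*unsafe_cb_t)(struct request *req, bool unsafe);

struct request {
    unsafe_cb_t unsafe_callback;
    int *unsafe_count;     /* stand-in for the inode's unsafe-writes list */
};

/* Callback that tracks how many requests are currently unsafe. */
void track_unsafe(struct request *req, bool unsafe)
{
    if (unsafe)
        (*req->unsafe_count)++;   /* entering the unsafe window */
    else
        (*req->unsafe_count)--;   /* results are now durable */
}

/* Notify the initiator first, then hand the request to the transport. */
void send_request(struct request *req)
{
    if (req->unsafe_callback)
        req->unsafe_callback(req, true);
    /* ... the message would be queued for sending here ... */
}

/* Invoked when the osd reports that the changes are durable. */
void request_is_safe(struct request *req)
{
    if (req->unsafe_callback)
        req->unsafe_callback(req, false);
}
```

Because the bookkeeping happens inside the callback, and the first call is made before the request can possibly complete, the initiator no longer has to race with completion to register the request, and setup and cleanup live in one function.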
2013-05-01 | ceph: let osd client clean up for interrupted request | Alex Elder
In ceph_sync_write(), if a safe callback is supplied with a request, and an error is returned by ceph_osdc_wait_request(), a block of code is executed to remove the request from the unsafe writes list and drop references to capabilities acquired just prior to a call to ceph_osdc_wait_request(). The only function used for this callback is sync_write_commit(), and it does *exactly* what that block of error handling code does. Now in ceph_osdc_wait_request(), if an error occurs (due to an interrupt during a wait_for_completion_interruptible() call), complete_request() gets called, and that calls the request's safe_callback method if it's defined. So this means that this cleanup activity gets called twice in this case, which is erroneous (and in fact leads to a crash). Fix this by just letting the osd client handle the cleanup in the event of an interrupt. This resolves one problem mentioned in: http://tracker.ceph.com/issues/4706 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01 | ceph: fix symlink inode operations | Yan, Zheng
Add getattr/setattr and xattr-related methods. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 | ceph: Use pseudo-random numbers to choose mds | Sam Lang
We don't need to use up entropy to choose an mds, so use prandom_u32() to get a pseudo-random number. Also, we don't need to choose a random mds if only one mds is available, so add special casing for the common case. Fixes http://tracker.ceph.com/issues/3579 Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
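A user-space sketch of the selection logic follows. The function names are hypothetical, rand() stands in for the kernel's prandom_u32(), and the simple modulo reduction is an assumption of this sketch rather than something the commit message states.

```c
#include <stdlib.h>

/*
 * Pick an index in [0, n). 'next_random' is any pseudo-random source;
 * it is not consulted at all when there is at most one candidate.
 */
unsigned int choose_index(unsigned int n, unsigned int (*next_random)(void))
{
    if (n <= 1)
        return 0;                 /* common case: nothing to choose */
    return next_random() % n;
}

/* Example source built on rand(); the commit uses prandom_u32(). */
unsigned int rand_source(void)
{
    return (unsigned int)rand();
}
```

Calling choose_index(1, NULL) is safe because the single-candidate case returns before the random source is touched, which mirrors the special casing described above.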
2013-05-01 | libceph: add, don't set data for a message | Alex Elder
Change the names of the functions that put data on a pagelist to reflect that we're adding to whatever's already there rather than just setting it to the one thing. Currently only one data item is ever added to a message, but that's about to change. This resolves: http://tracker.ceph.com/issues/2770 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: combine initializing and setting osd data | Alex Elder
This ends up being a rather large patch but what it's doing is somewhat straightforward. Basically, this is replacing two calls with one. The first of the two calls is initializing a struct ceph_osd_data with data (either a page array, a page list, or a bio list); the second is setting an osd request op so it associates that data with one of the op's parameters. In place of those two will be a single function that initializes the op directly.
That means we sort of fan out a set of the needed functions:
 - extent ops with pages data
 - extent ops with pagelist data
 - extent ops with bio list data
and
 - class ops with page data for receiving a response
We also have to define another one, but it's only used internally:
 - class ops with pagelist data for request parameters
Note that we *still* haven't gotten rid of the osd request's r_data_in and r_data_out fields. All the osd ops refer to them for their data. For now, these data fields are pointers assigned to the appropriate r_data_* field when these new functions are called.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: specify osd op by index in request | Alex Elder
An osd request now holds all of its source op structures, and every place that initializes one of these is in fact initializing one of the entries in the osd request's array. So rather than supplying the address of the op to initialize, have the caller specify the osd request and an indication of which op it would like to initialize. This better hides the details of the op structure (and facilitates moving the data pointers they use). Since osd_req_op_init() is a common routine, and it's not used outside the osd client code, give it static scope. Also make it return the address of the specified op (so all the other init routines don't have to repeat that code). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
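The interface change can be illustrated with a small model. The types and the req_op_init() helper below are hypothetical simplifications, not the libceph definitions; in particular, returning NULL for an out-of-range index is a choice made for this sketch.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPS 2   /* small embedded array, as in the sketch's request */

struct osd_op {
    int opcode;
};

struct osd_request {
    unsigned int num_ops;
    struct osd_op ops[MAX_OPS];
};

/*
 * Initialize the op at index 'which' inside 'req' and return its
 * address, so callers never compute &req->ops[which] themselves.
 */
struct osd_op *req_op_init(struct osd_request *req, unsigned int which,
                           int opcode)
{
    struct osd_op *op;

    if (which >= req->num_ops)
        return NULL;              /* index outside the request's ops */
    op = &req->ops[which];
    memset(op, 0, sizeof(*op));
    op->opcode = opcode;
    return op;
}
```

Type-specific init routines can then start with a call such as op = req_op_init(req, which, opcode) and fill in their own fields, instead of each repeating the lookup.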
2013-05-01 | libceph: add data pointers in osd op structures | Alex Elder
An extent type osd operation currently implies that there will be corresponding data supplied in the data portion of the request (for write) or response (for read) message. Similarly, an osd class method operation implies a data item will be supplied to receive the response data from the operation. Add a ceph_osd_data pointer to each of those structures, and assign it to point to either the incoming or the outgoing data structure in the osd message. The data is not always available when an op is initially set up, so add two new functions to allow setting them after the op has been initialized. Begin to make use of the data item pointer available in the osd operation rather than the request data in or out structure in places where it's convenient. Add some assertions to verify pointers are always set the way they're expected to be. This is a sort of stepping stone toward really moving the data into the osd request ops, to allow for some validation before making that jump. This is the first in a series of patches that resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: keep source rather than message osd op array | Alex Elder
An osd request keeps a pointer to the osd operations (ops) array that it builds in its request message. In order to allow each op in the array to have its own distinct data, we will need to keep track of each op's data, and that information does not go over the wire. As long as we're tracking the data we might as well just track the entire (source) op definition for each of the ops. And if we're doing that, we'll have no more need to keep a pointer to the wire-encoded version. This patch makes the array of source ops be kept with the osd request structure, and uses that instead of the version encoded in the message in places where that was previously used. The array will be embedded in the request structure, and the maximum number of ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP to 2 to reduce the size of the structure. The result of doing this sort of ripples back up, and as a result various function parameters and local variables become unnecessary. Make r_num_ops be unsigned, and move the definition of struct ceph_osd_req_op earlier to ensure it's defined where needed. It does not yet add per-op data, that's coming soon. This resolves: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: a few more osd data cleanups | Alex Elder
These are very small changes that make use of osd_data local pointers as shorthands for structures being operated on. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: define osd data initialization helpers | Alex Elder
Define and use functions that encapsulate the initialization of a ceph_osd_data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | ceph: build osd request message later for writepages | Alex Elder
Hold off building the osd request message in ceph_writepages_start() until just before it will be submitted to the osd client for execution. We'll still create the request and allocate the page pointer array after we learn we have at least one page to write. A local variable will be used to keep track of the allocated array of pages. Wait until just before submitting the request for assigning that page array pointer to the request message. Create and use a new function osd_req_op_extent_update() whose purpose is to serve this one spot where the length value supplied when an osd request's op was initially formatted might need to get changed (reduced, never increased) before submitting the request. Previously, ceph_writepages_start() assigned the message header's data length because of this update. That's no longer necessary, because ceph_osdc_build_request() will recalculate the right value to use based on the content of the ops in the request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: hold off building osd request | Alex Elder
Defer building the osd request until just before submitting it in all callers except ceph_writepages_start(). (That caller will be handled in the next patch.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | ceph: kill ceph alloc_page_vec() | Alex Elder
There is a helper function alloc_page_vec() that, despite its generic sounding name, depends heavily on an osd request structure being populated with certain information. There is only one place this function is used, and it ends up being a bit simpler to just open code what it does, so get rid of the helper. The real motivation for this is deferring the building of the osd request message, and this is a step in that direction. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | ceph: define ceph_writepages_osd_request() | Alex Elder
Mostly for readability, define ceph_writepages_osd_request() and use it to allocate the osd request for ceph_writepages_start(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | libceph: don't build request in ceph_osdc_new_request() | Alex Elder
This patch moves the call to ceph_osdc_build_request() out of ceph_osdc_new_request() and into its caller. This is in order to defer formatting osd operation information into the request message until just before the request is started. The only unusual (ab)user of ceph_osdc_build_request() is ceph_writepages_start(), where the final length of the write request may change (downward) based on the current inode size or the oldest snapshot context with dirty data for the inode. The remaining callers don't change anything in the request after it has been built. This means the ops array is now supplied by the caller. It also means there is no need to pass the mtime to ceph_osdc_new_request() (it gets provided to ceph_osdc_build_request()). And rather than passing a do_sync flag, have the number of ops in the ops array supplied imply adding a second STARTSYNC operation after the READ or WRITE requested. This and some of the patches that follow are related to having the messenger (only) be responsible for filling the content of the message header, as described here: http://tracker.ceph.com/issues/4589 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 | ceph: use page_offset() in ceph_writepages_start() | Alex Elder
There's one spot in ceph_writepages_start() that open-codes what page_offset() does safely. Use the macro so we don't have to worry about wrapping. This resolves: http://tracker.ceph.com/issues/4648 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
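The wrapping hazard is easy to reproduce in user space. The sketch below assumes a 32-bit page index and 4 KiB pages (both assumptions, chosen to model a 32-bit architecture); the function names are hypothetical. The widened variant follows the same idea as page_offset(): convert to a 64-bit type before shifting.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Open-coded version: the shift happens in 32 bits and can wrap. */
int64_t offset_open_coded(uint32_t index)
{
    return (int64_t)(uint32_t)(index << SKETCH_PAGE_SHIFT);
}

/* Widen first, then shift, so offsets beyond 4 GiB are preserved. */
int64_t offset_widened(uint32_t index)
{
    return (int64_t)index << SKETCH_PAGE_SHIFT;
}
```

For page index 0x100000 the file offset is exactly 4 GiB: the widened version returns 4294967296, while the open-coded version wraps around to 0. For small indices the two agree, which is why such a bug can go unnoticed until files grow large.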
2013-05-01 | ceph: set up page array mempool with correct size | Alex Elder
In create_fs_client() a memory pool is set up to be used for arrays of pages that might be needed in ceph_writepages_start() if memory is tight. There are two problems with the way it's initialized:
 - The size provided is the number of pages we want in the array, but it should be the number of bytes required for that many page pointers.
 - The number of pages computed can end up being 0, while we will always need at least one page.
This patch fixes both of these problems.
This resolves the two simple problems defined in: http://tracker.ceph.com/issues/4603
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
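Both corrections fit in one small size computation. The helper below is a hypothetical user-space sketch: page_array_bytes(), wsize and page_size are illustrative names, and void * stands in for a page pointer.

```c
#include <stddef.h>

/*
 * Bytes needed for an array of page pointers that covers 'wsize' bytes
 * of data, never fewer than one pointer.
 */
size_t page_array_bytes(size_t wsize, size_t page_size)
{
    size_t page_count = wsize / page_size;

    if (page_count == 0)
        page_count = 1;                    /* always room for one page */
    return page_count * sizeof(void *);    /* bytes, not a page count */
}
```

A write size smaller than one page therefore still yields room for one pointer, and a write size of eight pages yields 8 * sizeof(void *) bytes rather than the bare number 8.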
2013-05-01 | libceph: wrap auth ops in wrapper functions | Sage Weil
Use wrapper functions that check whether the auth op exists so that callers do not need a bunch of conditional checks. Simplifies the external interface. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
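The wrapper pattern can be sketched as follows. The ops table, the client structure and every function name here are hypothetical, and treating a missing op as success (return 0) is an assumption of this sketch, not a statement about libceph's behaviour.

```c
#include <stddef.h>

/* Hypothetical ops table; any entry may be left NULL. */
struct auth_ops {
    int (*verify)(int token);
    void (*invalidate)(int *state);
};

struct auth_client {
    const struct auth_ops *ops;
    int state;
};

/* Wrapper: returns 0 (treated as success here) when the op is absent. */
int auth_verify(struct auth_client *ac, int token)
{
    if (ac->ops && ac->ops->verify)
        return ac->ops->verify(token);
    return 0;
}

/* Wrapper: silently does nothing when the op is absent. */
void auth_invalidate(struct auth_client *ac)
{
    if (ac->ops && ac->ops->invalidate)
        ac->ops->invalidate(&ac->state);
}

/* Example op: accept only token 42. */
int verify_is_42(int token)
{
    return token == 42 ? 0 : -1;
}
```

Callers simply write auth_verify(&ac, token) or auth_invalidate(&ac); the NULL checks live in one place instead of being repeated at every call site.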
2013-05-01 | libceph: add update_authorizer auth method | Sage Weil
Currently the messenger calls out to a get_authorizer con op, which will create a new authorizer if it doesn't yet have one. In the meantime, when we rotate our service keys, the authorizer doesn't get updated. Eventually it will be rejected by the server on a new connection attempt and get invalidated, and we will then rebuild a new authorizer, but this is not ideal. Instead, if we do have an authorizer, call a new update_authorizer op that will verify that the current authorizer is using the latest secret. If it is not, we will build a new one that does. This avoids the transient failure. This fixes one of the sorry sequence of events for bug http://tracker.ceph.com/issues/4282 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 | ceph: fix buffer pointer advance in ceph_sync_write | Henry C Chang
We should advance the user data pointer by _len_ instead of _written_. _len_ is the data length written in each iteration while _written_ is the accumulated data length we have written out. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Reviewed-by: Greg Farnum <greg@inktank.com> Tested-by: Sage Weil <sage@inktank.com>
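The difference between the two counters shows up in any chunked copy loop. The function below is a hypothetical user-space analogue (chunked_write() is not ceph code): the source pointer moves by the size of the current chunk, while the running total is kept separately.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy 'total' bytes from 'src' to 'dst' in chunks of at most 'chunk'
 * bytes. Returns the accumulated number of bytes written.
 */
size_t chunked_write(char *dst, const char *src, size_t total, size_t chunk)
{
    size_t written = 0;   /* accumulated across all iterations */

    if (chunk == 0)
        return 0;         /* nothing can make progress */

    while (written < total) {
        size_t len = total - written;   /* this iteration only */

        if (len > chunk)
            len = chunk;
        memcpy(dst + written, src, len);
        src += len;        /* advance by len, not by written */
        written += len;
    }
    return written;
}
```

Advancing src by written instead of len would give the right result for the first chunk only; from the second chunk on, the pointer would jump too far and skip source data.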