summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2014-02-17ceph: add missing init_acl() for mkdir() and atomic_open()Yan, Zheng
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17ceph: fix ceph_set_acl()Yan, Zheng
If acl is equivalent to file mode permission bits, ceph_set_acl() needs to remove any existing acl xattr. Use __ceph_setxattr() to handle both setting and removing acl xattr cases, it doesn't return -ENODATA when there is no acl xattr. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17ceph: fix ceph_removexattr()Yan, Zheng
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17ceph: remove xattr when null value is given to setxattr()Yan, Zheng
For the setxattr request, introduce a new flag CEPH_XATTR_REMOVE to distinguish null value case from the zero-length value case. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17ceph: properly handle XATTR_CREATE and XATTR_REPLACEYan, Zheng
return -EEXIST if XATTR_CREATE is set and xattr alread exists. return -ENODATA if XATTR_REPLACE is set but xattr does not exist. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17NFSv4: Use the correct net namespace in nfs4_update_serverTrond Myklebust
We need to use the same net namespace that was used to resolve the hostname and sockaddr arguments. Fixes: 32e62b7c3ef09 (NFS: Add nfs4_update_server) Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-16ext4: don't leave i_crtime.tv_sec uninitializedTheodore Ts'o
If the i_crtime field is not present in the inode, don't leave the field uninitialized. Fixes: ef7f38359 ("ext4: Add nanosecond timestamps") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-02-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have a small collection of fixes in my for-linus branch. The big thing that stands out is a revert of a new ioctl. Users haven't shipped yet in btrfs-progs, and Dave Sterba found a better way to export the information" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use right clone root offset for compressed extents btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol Btrfs: fix max_inline mount option Btrfs: fix a lockdep warning when cleaning up aborted transaction Revert "btrfs: add ioctl to export size of global metadata reservation"
2014-02-15ext4: fix online resize with a non-standard blocks per group settingTheodore Ts'o
The set_flexbg_block_bitmap() function assumed that the number of blocks in a blockgroup was sb->blocksize * 8, which is normally true, but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block bitmap corruption after: mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jon Bernard <jbernard@tuxion.com> Cc: stable@vger.kernel.org
2014-02-15ext4: fix online resize with very large inode tablesTheodore Ts'o
If a file system has a large number of inodes per block group, all of the metadata blocks in a flex_bg may be larger than what can fit in a single block group. Unfortunately, ext4_alloc_group_tables() in resize.c was never tested to see if it would handle this case correctly, and there were a large number of bugs which caused the following sequence to result in a BUG_ON: kernel bug at fs/ext4/resize.c:409! ... call trace: [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830 [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80 [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00 [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0 [<ffffffff811b9df2>] ? final_putname+0x22/0x50 [<ffffffff811c1371>] sys_ioctl+0x81/0xa0 [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0 rip [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180 This can be reproduced with the following command sequence: mke2fs -t ext4 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G To fix this, we need to make sure the right thing happens when a block group's inode table straddles two block groups, which means the following bugs had to be fixed: 1) Not clearing the BLOCK_UNINIT flag in the second block group in ext4_alloc_group_tables --- the was proximate cause of the BUG_ON. 2) Incorrectly determining how many block groups contained contiguous free blocks in ext4_alloc_group_tables(). 3) Incorrectly setting the start of the next block range to be marked in use after a discontinuity in setup_new_flex_group_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-02-15Btrfs: use right clone root offset for compressed extentsFilipe David Borba Manana
For non compressed extents, iterate_extent_inodes() gives us offsets that take into account the data offset from the file extent items, while for compressed extents it doesn't. Therefore we have to adjust them before placing them in a send clone instruction. Not doing this adjustment leads to the receiving end requesting for a wrong a file range to the clone ioctl, which results in different file content from the one in the original send root. Issue reproducible with the following excerpt from the test I made for xfstests: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo # will be used for incremental send to be able to issue clone operations $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1 $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \ -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2 $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \ -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2 $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \ -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap _scratch_unmount _scratch_mkfs _scratch_mount $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-15btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105Anand Jain
bdev is null when disk has disappeared and mounted with the degrade option stack trace --------- btrfs_sysfs_add_one+0x105/0x1c0 [btrfs] open_ctree+0x15f3/0x1fe0 [btrfs] btrfs_mount+0x5db/0x790 [btrfs] ? alloc_pages_current+0xa4/0x160 mount_fs+0x34/0x1b0 vfs_kern_mount+0x62/0xf0 do_mount+0x22e/0xa80 ? __get_free_pages+0x9/0x40 ? copy_mount_options+0x31/0x170 SyS_mount+0x7e/0xc0 system_call_fastpath+0x16/0x1b --------- reproducer: ------- mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd (detach a disk) devmgt detach /dev/sdc [1] mount -o degrade /dev/sdd /btrfs ------- [1] github.com/anajain/devmgt.git Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14CIFS: Fix too big maxBuf size for SMB3 mountsPavel Shilovsky
SMB3 servers can respond with MaxTransactSize of more than 4M that can cause a memory allocation error returned from kmalloc in a lock codepath. Also the client doesn't support multicredit requests now and allows buffer sizes of 65536 bytes only. Set MaxTransactSize to this maximum supported value. Cc: stable@vger.kernel.org # 3.7+ Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-14cifs: ensure that uncached writes handle unmapped areas correctlyJeff Layton
It's possible for userland to pass down an iovec via writev() that has a bogus user pointer in it. If that happens and we're doing an uncached write, then we can end up getting less bytes than we expect from the call to iov_iter_copy_from_user. This is CVE-2014-0069 cifs_iovec_write isn't set up to handle that situation however. It'll blindly keep chugging through the page array and not filling those pages with anything useful. Worse yet, we'll later end up with a negative number in wdata->tailsz, which will confuse the sending routines and cause an oops at the very least. Fix this by having the copy phase of cifs_iovec_write stop copying data in this situation and send the last write as a short one. At the same time, we want to avoid sending a zero-length write to the server, so break out of the loop and set rc to -EFAULT if that happens. This also allows us to handle the case where no address in the iovec is valid. [Note: Marking this for stable on v3.4+ kernels, but kernels as old as v2.6.38 may have a similar problem and may need similar fix] Cc: <stable@vger.kernel.org> # v3.4+ Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-14Btrfs: unset DCACHE_DISCONNECTED when mounting default subvolJosef Bacik
A user was running into errors from an NFS export of a subvolume that had a default subvol set. When we mount a default subvol we will use d_obtain_alias() to find an existing dentry for the subvolume in the case that the root subvol has already been mounted, or a dummy one is allocated in the case that the root subvol has not already been mounted. This allows us to connect the dentry later on if we wander into the path. However if we don't ever wander into the path we will keep DCACHE_DISCONNECTED set for a long time, which angers NFS. It doesn't appear to cause any problems but it is annoying nonetheless, so simply unset DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to use d_materialise_unique() instead which will make everything play nicely together and reconnect stuff if we wander into the defaul subvol path from a different way. With this patch I'm no longer getting the NFS errors when exporting a volume that has been mounted with a default subvol set. Thanks, cc: bfields@fieldses.org cc: ebiederm@xmission.com Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Btrfs: fix max_inline mount optionMitch Harder
Currently, the only mount option for max_inline that has any effect is max_inline=0. Any other value that is supplied to max_inline will be adjusted to a minimum of 4k. Since max_inline has an effective maximum of ~3900 bytes due to page size limitations, the current behaviour only has meaning for max_inline=0. This patch will allow the the max_inline mount option to accept non-zero values as indicated in the documentation. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Btrfs: fix a lockdep warning when cleaning up aborted transactionLiu Bo
Given now we have 2 spinlock for management of delayed refs, CONFIG_DEBUG_SPINLOCK=y helped me find this, [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258 [ 4723.414882] lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2 [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G W O 3.12.0+ #4 [ 4723.421321] Call Trace: [ 4723.421872] [<ffffffff81680fe7>] dump_stack+0x54/0x74 [ 4723.422753] [<ffffffff81681093>] spin_dump+0x8c/0x91 [ 4723.424979] [<ffffffff816810b9>] spin_bug+0x21/0x26 [ 4723.425846] [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90 [ 4723.434424] [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40 [ 4723.438747] [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs] [ 4723.443321] [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs] [ 4723.444692] [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [ 4723.450336] [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10 [ 4723.451332] [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs] [ 4723.452543] [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [ 4723.457833] [<ffffffff81079efa>] kthread+0xea/0xf0 [ 4723.458990] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140 [ 4723.460133] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [ 4723.460865] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140 [ 4723.496521] ------------[ cut here ]------------ ---------------------------------------------------------------------- The reason is that we get to call cond_resched_lock(&head_ref->lock) while still holding @delayed_refs->lock. So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire dance before and after actually processing delayed refs. Here we don't drop the lock, others are not able to add new delayed refs to head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Revert "btrfs: add ioctl to export size of global metadata reservation"Chris Mason
This reverts commit 01e219e8069516cdb98594d417b8bb8d906ed30d. David Sterba found a different way to provide these features without adding a new ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out for now until we finalize things. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> CC: Jeff Mahoney <jeffm@suse.com>
2014-02-14Merge branch 'for-3.14' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull two nfsd bugfixes from Bruce Fields. * 'for-3.14' of git://linux-nfs.org/~bfields/linux: lockd: send correct lock when granting a delayed lock. nfsd4: fix acl buffer overrun
2014-02-14Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO fixes from Jens Axboe: "Second round of updates and fixes for 3.14-rc2. Most of this stuff has been queued up for a while. The notable exception is the blk-mq changes, which are naturally a bit more in flux still. The pull request contains: - Two bug fixes for the new immutable vecs, causing crashes with raid or swap. From Kent. - Various blk-mq tweaks and fixes from Christoph. A fix for integrity bio's from Nic. - A few bcache fixes from Kent and Darrick Wong. - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas Swenson, and Roger Pau Monne. - Fix for a vec miscount with integrity vectors from Martin. - Minor annotations or fixes from Masanari Iida and Rashika Kheria. - Tweak to null_blk to do more normal FIFO processing of requests from Shlomo Pongratz. - Elevator switching bypass fix from Tejun. - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from me" * 'for-linus' of git://git.kernel.dk/linux-block: (31 commits) block: add cond_resched() to potentially long running ioctl discard loop xen-blkback: init persistent_purge_work work_struct blk-mq: pair blk_mq_start_request / blk_mq_requeue_request blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq block: Fix cloning of discard/write same bios block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show blk-mq: rework flush sequencing logic null_blk: use blk_complete_request and blk_mq_complete_request virtio_blk: use blk_mq_complete_request blk-mq: rework I/O completions fs: Add prototype declaration to appropriate header file include/linux/bio.h fs: Mark function as static in fs/bio-integrity.c block/null_blk: Fix completion processing from LIFO to FIFO block: Explicitly handle discard/write same segments block: Fix nr_vecs for inline integrity vectors blk-mq: Add bio_integrity setup to blk_mq_make_request blk-mq: initialize sg_reserved_size blk-mq: handle dma_drain_size blk-mq: divert __blk_put_request for MQ ops blk-mq: support at_head inserations for blk_execute_rq ...
2014-02-13jfs: set i_ctime when setting ACLDave Kleikamp
This fixes a regression in 3.14-rc1 where xfstests generic/307 fails. jfs sets the ctime on the inode when writing an xattr. Previously, jfs went ahead and stored an acl that can be completely represented in the traditional permission bits, so the ctime was always set in the xattr code. The new code doesn't bother storing the acl in that case, thus the ctime isn't getting set. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reported-by: Michael L. Semon <mlsemon35@gmail.com>
2014-02-13lockd: send correct lock when granting a delayed lock.NeilBrown
If an NFS client attempts to get a lock (using NLM) and the lock is not available, the server will remember the request and when the lock becomes available it will send a GRANT request to the client to provide the lock. If the client already held an adjacent lock, the GRANT callback will report the union of the existing and new locks, which can confuse the client. This happens because __posix_lock_file (called by vfs_lock_file) updates the passed-in file_lock structure when adjacent or over-lapping locks are found. To avoid this problem we take a copy of the two fields that can be changed (fl_start and fl_end) before the call and restore them afterwards. An alternate would be to allocate a 'struct file_lock', initialise it, use locks_copy_lock() to take a copy, then locks_release_private() after the vfs_lock_file() call. But that is a lot more work. Reported-by: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> -- v1 had a couple of issues (large on-stack struct and didn't really work properly). This version is much better tested. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-02-12ext4: don't try to modify s_flags if the the file system is read-onlyTheodore Ts'o
If an ext4 file system is created by some tool other than mke2fs (perhaps by someone who has a pathalogical fear of the GPL) that doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags, and that file system is then mounted read-only, don't try to modify the s_flags field. Otherwise, if dm_verity is in use, the superblock will change, causing an dm_verity failure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-02-12ext4: fix error paths in swap_inode_boot_loader() Zheng Liu
In swap_inode_boot_loader() we forgot to release ->i_mutex and resume unlocked dio for inode and inode_bl if there is an error starting the journal handle. This commit fixes this issue. Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org # v3.10+
2014-02-12ext4: fix xfstest generic/299 block validity failuresEric Whitney
Commit a115f749c1 (ext4: remove wait for unwritten extent conversion from ext4_truncate) exposed a bug in ext4_ext_handle_uninitialized_extents(). It can be triggered by xfstest generic/299 when run on a test file system created without a journal. This test continuously fallocates and truncates files to which random dio/aio writes are simultaneously performed by a separate process. The test completes successfully, but if the test filesystem is mounted with the block_validity option, a warning message stating that a logical block has been mapped to an illegal physical block is posted in the kernel log. The bug occurs when an extent is being converted to the written state by ext4_end_io_dio() and ext4_ext_handle_uninitialized_extents() discovers a mapping for an existing uninitialized extent. Although it sets EXT4_MAP_MAPPED in map->m_flags, it fails to set map->m_pblk to the discovered physical block number. Because map->m_pblk is not otherwise initialized or set by this function or its callers, its uninitialized value is returned to ext4_map_blocks(), where it is stored as a bogus mapping in the extent status tree. Since map->m_pblk can accidentally contain illegal values that are larger than the physical size of the file system, calls to check_block_validity() in ext4_map_blocks() that are enabled if the block_validity mount option is used can fail, resulting in the logged warning message. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.11+
2014-02-11nfsd4: fix acl buffer overrunJ. Bruce Fields
4ac7249ea5a0ceef9f8269f63f33cc873c3fac61 "nfsd: use get_acl and ->set_acl" forgets to set the size in the case get_acl() succeeds, so _posix_to_nfsv4_one() can then write past the end of its allocation. Symptoms were slab corruption warnings. Also, some minor cleanup while we're here. (Among other things, note that the first few lines guarantee that pacl is non-NULL.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-02-11block: Fix cloning of discard/write same biosKent Overstreet
Immutable biovecs changed the way bio segments are treated in such a way that bio_for_each_segment() cannot now do what we want for discard/write same bios, since bi_size means something completely different for them. Fortunately discard and write same bios never have more than a single biovec, so bio_for_each_segment() is unnecessary and not terribly meaningful for them, but we still have to special case them in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Tested-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds
Merge misc fixes from Andrew Morton: "A bunch of fixes" * emailed patches fron Andrew Morton <akpm@linux-foundation.org>: ocfs2: check existence of old dentry in ocfs2_link() ocfs2: update inode size after zeroing the hole ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size mm/memory-failure.c: move refcount only in !MF_COUNT_INCREASED smp.h: fix x86+cpu.c sparse warnings about arch nonboot CPU calls mm: fix page leak at nfs_symlink() slub: do not assert not having lock in removing freed partial gitignore: add all.config ocfs2: fix ocfs2_sync_file() if filesystem is readonly drivers/edac/edac_mc_sysfs.c: poll timeout cannot be zero fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem xen: properly account for _PAGE_NUMA during xen pte translations mm/slub.c: list_lock may not be held in some circumstances drivers/md/bcache/extents.c: use %zi to format size_t vmcore: prevent PT_NOTE p_memsz overflow during header update drivers/message/i2o/i2o_config.c: fix deadlock in compat_ioctl(I2OGETIOPS) Documentation/: update 00-INDEX files checkpatch: fix detection of git repository get_maintainer: fix detection of git repository drivers/misc/sgi-gru/grukdump.c: unlocking should be conditional in gru_dump_context()
2014-02-10ocfs2: check existence of old dentry in ocfs2_link()Xue jiufei
System call linkat first calls user_path_at(), check the existence of old dentry, and then calls vfs_link()->ocfs2_link() to do the actual work. There may exist a race when Node A create a hard link for file while node B rm it. Node A Node B user_path_at() ->ocfs2_lookup(), find old dentry exist rm file, add inode say inodeA to orphan_dir call ocfs2_link(),create a hard link for inodeA. rm the link, add inodeA to orphan_dir again When orphan_scan work start, it calls ocfs2_queue_orphans() to do the main work. It first tranverses entrys in orphan_dir, linking all inodes in this orphan_dir to a list look like this: inodeA->inodeB->...->inodeA When tranvering this list, it will fall into loop, calling iput() again and again. And finally trigger BUG_ON(inode->i_state & I_CLEAR). Signed-off-by: joyce <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10ocfs2: update inode size after zeroing the holeJunxiao Bi
fs-writeback will release the dirty pages without page lock whose offset are over inode size, the release happens at block_write_full_page_endio(). If not update, dirty pages in file holes may be released before flushed to the disk, then file holes will contain some non-zero data, this will cause sparse file md5sum error. To reproduce the bug, find a big sparse file with many holes, like vm image file, its actual size should be bigger than available mem size to make writeback work more frequently, tar it with -S option, then keep untar it and check its md5sum again and again until you get a wrong md5sum. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Younger Liu <younger.liu@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_sizeYounger Liu
The issue scenario is as following: - Create a small file and fallocate a large disk space for a file with FALLOC_FL_KEEP_SIZE option. - ftruncate the file back to the original size again. but the disk free space is not changed back. This is a real bug that be fixed in this patch. In order to solve the issue above, we modified ocfs2_setattr(), if attr->ia_size != i_size_read(inode), It calls ocfs2_truncate_file(), and truncate disk space to attr->ia_size. Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Jensen <shencanquan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10mm: fix page leak at nfs_symlink()Rafael Aquini
Changes in commit a0b8cab3b9b2 ("mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API") have introduced a call to add_to_page_cache_lru() which causes a leak in nfs_symlink() as now the page gets an extra refcount that is not dropped. Jan Stancek observed and reported the leak effect while running test8 from Connectathon Testsuite. After several iterations over the test case, which creates several symlinks on a NFS mountpoint, the test system was quickly getting into an out-of-memory scenario. This patch fixes the page leak by dropping that extra refcount add_to_page_cache_lru() is grabbing. Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Jeff Layton <jlayton@redhat.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: <stable@vger.kernel.org> [3.11.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10ocfs2: fix ocfs2_sync_file() if filesystem is readonlyYounger Liu
If filesystem is readonly, there is no need to flush drive's caches or force any uncommitted transactions. [akpm@linux-foundation.org: return -EROFS, not 0] Signed-off-by: Younger Liu <younger.liucn@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmemEric W. Biederman
Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order > 0 allocations can fail when there are copious amounts of free memory available. Using the pigeon hole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocate will succeed. A server churning memory with a lot of small requests and replies like memcached is a common case that if anything can will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code and by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes signification). We have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cong Wang <cwang@twopensource.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10vmcore: prevent PT_NOTE p_memsz overflow during header updateGreg Pearson
Currently, update_note_header_size_elf64() and update_note_header_size_elf32() will add the size of a PT_NOTE entry to real_sz even if that causes real_sz to exceeds max_sz. This patch corrects the while loop logic in those routines to ensure that does not happen and prints a warning if a PT_NOTE entry is dropped. If zero PT_NOTE entries are found or this condition is encountered because the only entry was dropped, a warning is printed and an error is returned. One possible negative side effect of exceeding the max_sz limit is an allocation failure in merge_note_headers_elf64() or merge_note_headers_elf32() which would produce console output such as the following while booting the crash kernel. vmalloc: allocation failure: 14076997632 bytes swapper/0: page allocation failure: order:0, mode:0x80d2 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7 Call Trace: dump_stack+0x19/0x1b warn_alloc_failed+0xf0/0x160 __vmalloc_node_range+0x19e/0x250 vmalloc_user+0x4c/0x70 merge_note_headers_elf64.constprop.9+0x116/0x24a vmcore_init+0x2d4/0x76c do_one_initcall+0xe2/0x190 kernel_init_freeable+0x17c/0x207 kernel_init+0xe/0x180 ret_from_fork+0x7c/0xb0 Kdump: vmcore not initialized kdump: dump target is /dev/sda4 kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/ kdump: saving vmcore-dmesg.txt Cannot open /proc/vmcore: No such file or directory kdump: saving vmcore-dmesg.txt failed kdump: saving vmcore kdump: saving vmcore failed This type of failure has been seen on a four socket prototype system with certain memory configurations. Most PT_NOTE sections have a single entry similar to: n_namesz = 0x5 n_descsz = 0x150 n_type = 0x1 Occasionally, a second entry is encountered with very large n_namesz and n_descsz sizes: n_namesz = 0x80000008 n_descsz = 0x510ae163 n_type = 0x80000008 Not yet sure of the source of these extra entries, they seem bogus, but they shouldn't cause crash dump to fail. Signed-off-by: Greg Pearson <greg.pearson@hp.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10[CIFS] Fix cifsacl mounts over smb2 to not call cifsSteve French
When mounting with smb2/smb3 (e.g. vers=2.1) and cifsacl mount option, it was trying to get the mode by querying the acl over the cifs rather than smb2 protocol. This patch makes that protocol independent and makes cifsacl smb2 mounts return a more intuitive operation not supported error (until we add a worker function for smb2_get_acl). Note that a previous patch fixed getxattr/setxattr for the CIFSACL xattr which would unconditionally call cifs_get_acl and cifs_set_acl (even when mounted smb2). I made those protocol independent last week (new protocol version operations "get_acl" and "set_acl" but did not add an smb2_get_acl and smb2_set_acl yet so those now simply return EOPNOTSUPP which at least is better than sending cifs requests on smb2 mount) The previous patches did not fix the one remaining case though ie mounting with "cifsacl" when getting mode from acl would unconditionally end up calling "cifs_get_acl_from_fid" even for smb2 - so made that protocol independent but to make that protocol independent had to make sure that the callers were passing the protocol independent handle structure (cifs_fid) instead of cifs specific _u16 network file handle (ie cifs_fid instead of cifs_fid->fid) Now mount with smb2 and cifsacl mount options will return EOPNOTSUP (instead of timing out) and a future patch will add smb2 operations (e.g. get_smb2_acl) to enable this. Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-10Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "Small fix from Jeff for writepages leak, and some fixes for ACLs and xattrs when SMB2 enabled. Am expecting another fix from Jeff and at least one more fix (for mounting SMB2 with cifsacl) in the next week" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: [CIFS] clean up page array when uncached write send fails cifs: use a flexarray in cifs_writedata retrieving CIFS ACLs when mounted with SMB2 fails dropping session Add protocol specific operation for CIFS xattrs
2014-02-10NFS: Do not set NFS_INO_INVALID_LABEL unless server supports labeled NFSTrond Myklebust
Commit aa9c2669626c (NFS: Client implementation of Labeled-NFS) introduces a performance regression. When nfs_zap_caches_locked is called, it sets the NFS_INO_INVALID_LABEL flag irrespectively of whether or not the NFS server supports security labels. Since that flag is never cleared, it means that all calls to nfs_revalidate_inode() will now trigger an on-the-wire GETATTR call. This patch ensures that we never set the NFS_INO_INVALID_LABEL unless the server advertises support for labeled NFS. It also causes nfs_setsecurity() to clear NFS_INO_INVALID_LABEL when it has successfully set the security label for the inode. Finally it gets rid of the NFS_INO_INVALID_LABEL cruft from nfs_update_inode, which has nothing to do with labeled NFS. Reported-by: Neil Brown <neilb@suse.de> Cc: stable@vger.kernel.org # 3.11+ Tested-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes, both -stable fodder. The O_SYNC bug is fairly old..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix a kmap leak in virtio_console fix O_SYNC|O_APPEND syncing the wrong range on write()
2014-02-10xfs: ensure correct log item buffer alignmentDave Chinner
On 32 bit platforms, the log item vector headers are not 64 bit aligned or sized. hence if we don't take care to align them correctly or pad the buffer appropriately for 8 byte alignment, we can end up with alignment issues when accessing the user buffer directly as a structure. To solve this, simply pad the buffer headers to 64 bit offset so that the data section is always 8 byte aligned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Michael L. Semon <mlsemon35@gmail.com> Tested-by: Michael L. Semon <mlsemon35@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-10xfs: ensure correct timestamp updates from truncateChristoph Hellwig
The VFS doesn't set the proper ATTR_CTIME and ATTR_MTIME values for truncate, so filesystems have to manually add them. The introduction of xfs_setattr_time accidentally broke this special case an caused a regression in generic/313. Fix this by removing the local mask variable in xfs_setattr_size so that we only have a single place to keep the attribute information. cc: <stable@vger.kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-09fs: Mark function as static in fs/bio-integrity.cRashika Kheria
Mark functions as static in bio-integrity.c because it is not used outside this file. This eliminates the following warnings in bio-integrity.c: fs/bio-integrity.c:224:5: warning: no previous prototype for ‘bio_integrity_tag’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-09fix O_SYNC|O_APPEND syncing the wrong range on write()Al Viro
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support) when sync_page_range() had been introduced; generic_file_write{,v}() correctly synced pos_after_write - written .. pos_after_write - 1 but generic_file_aio_write() synced pos_before_write .. pos_before_write + written - 1 instead. Which is not the same thing with O_APPEND, obviously. A couple of years later correct variant had been killed off when everything switched to use of generic_file_aio_write(). All users of generic_file_aio_write() are affected, and the same bug has been copied into other instances of ->aio_write(). The fix is trivial; the only subtle point is that generic_write_sync() ought to be inlined to avoid calculations useless for the majority of calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix data corruption when reading/updating compressed extents Btrfs: don't loop forever if we can't run because of the tree mod log btrfs: reserve no transaction units in btrfs_ioctl_set_features btrfs: commit transaction after setting label and features Btrfs: fix assert screwup for the pending move stuff
2014-02-08Btrfs: fix data corruption when reading/updating compressed extentsFilipe David Borba Manana
When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08Btrfs: don't loop forever if we can't run because of the tree mod logJosef Bacik
A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08btrfs: reserve no transaction units in btrfs_ioctl_set_featuresDavid Sterba
Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08btrfs: commit transaction after setting label and featuresJeff Mahoney
The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08Btrfs: fix assert screwup for the pending move stuffJosef Bacik
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set, which meant that a key part of Filipe's original patch was not being built in. This appears to be a mess up with merging Filipe's patch as it does not exist in his original patch. Fix this by changing how we make sure del_waiting_dir_move asserts that it did not error and take the function out of the ifdef check. This makes btrfs/030 pass with the assert on or off. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08Merge tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs fix from David Kleikamp: "Fix regression" * tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy: jfs: fix generic posix ACL regression