Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
"This is possibly the smallest ever set of GFS2 patches for a merge
window. Also, most of them are bug fixes this time.
Two of my three patches (moving gfs2_sync_meta and merging the two
writepage implementations) are clean ups with the third (taking the
glock ref in examine_bucket) being a fix for a difficult to hit race
condition.
The removal of an unused memory barrier is a clean up from Bob
Peterson, and the "spectator" relates to a rarely used mount option.
Ben Marzinski's patch fixes a corner case where the incorrect inode
flags were being set, resulting in incorrect behaviour on fsync"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: dirty inode correctly in gfs2_write_end
GFS2: Don't flag consistency error if first mounter is a spectator
GFS2: Remove unnecessary memory barrier
GFS2: Merge ordered and writeback writepage
GFS2: Take glock reference in examine_bucket()
GFS2: Move gfs2_sync_meta to lops.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
"This includes both the first pile of Ceph patches (which I sent to
torvalds@vger, sigh) and a few new patches that add support for
fscache for Ceph. That includes a few fscache core fixes that David
Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski
for putting this feature together)
This first batch of patches (included here) had (has) several
important RBD bug fixes, hole punch support, several different
cleanups in the page cache interactions, improvements in the truncate
code (new truncate mutex to avoid shenanigans with i_mutex), and a
series of fixes in the synchronous striping read/write code.
On top of that is a random collection of small fixes all across the
tree (error code checks and error path cleanup, obsolete wq flags,
etc)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
ceph: use d_invalidate() to invalidate aliases
ceph: remove ceph_lookup_inode()
ceph: trivial buildbot warnings fix
ceph: Do not do invalidate if the filesystem is mounted nofsc
ceph: page still marked private_2
ceph: ceph_readpage_to_fscache didn't check if marked
ceph: clean PgPrivate2 on returning from readpages
ceph: use fscache as a local presisent cache
fscache: Netfs function for cleanup post readpages
FS-Cache: Fix heading in documentation
CacheFiles: Implement interface to check cache consistency
FS-Cache: Add interface to check consistency of a cached object
rbd: fix null dereference in dout
rbd: fix buffer size for writes to images with snapshots
libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
rbd: fix I/O error propagation for reads
ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
ceph: allow sync_read/write return partial successed size of read/write.
ceph: fix bugs about handling short-read for sync read mode.
ceph: remove useless variable revoked_rdcache
...
|
|
Pull UML updates from Richard Weinberger:
"This pile contains mostly fixes and improvements for issues identified
by Richard W M Jones while adding UML as backend to libguestfs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Add irq chip um/mask handlers
um: prctl: Do not include linux/ptrace.h
um: Run UML in it's own session.
um: Cleanup SIGTERM handling
um: ubd: Introduce submit_request()
um: ubd: Add REQ_FLUSH suppport
um: Implement probe_kernel_read()
um: hostfs: Fix writeback
|
|
This is the fix that the last two commits indirectly led up to - making
sure that we don't call dput() in a bad context on the dentries we've
looked up in RCU mode after the sequence count validation fails.
This basically expands d_rcu_to_refcount() into the callers, and then
fixes the callers to delay the dput() in the failure case until _after_
we've dropped all locks and are no longer in an RCU-locked region.
The case of 'complete_walk()' was trivial, since its failure case did
the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
and as such that is just a pure expansion of the function with a trivial
movement of the resulting dput() to after 'unlock_rcu_walk()'.
In contrast, the unlazy_walk() case was much more complicated, because
not only does convert two different dentries from RCU to be reference
counted, but it used to not call unlock_rcu_walk() at all, and instead
just returned an error and let the caller clean everything up in
"terminate_walk()".
Happily, one of the dentries in question (called "parent" inside
unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
a refcount to anyway for the non-RCU case.
So what the new and improved unlazy_walk() does is to first turn that
dentry into a refcounted one, and once that is set up, the error cases
can continue to use the terminate_walk() helper for cleanup, but for the
non-RCU case. Which makes it possible to drop out of RCU mode if we
actually hit the sequence number failure case.
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This simplifies the RCU to refcounting code in particular.
I was originally intending to leave this for later, but walking through
all the dput() logic (see previous commit), I realized that the dput()
"might_sleep()" check was misleadingly weak. And I removed it as
misleading, both for performance profiling and for debugging.
However, the might_sleep() debugging case is actually true: the final
dput() can indeed sleep, if the inode of the dentry that you are
releasing ends up sleeping at iput time (see dentry_iput()). So the
problem with the might_sleep() in dput() wasn't that it wasn't true, it
was that it wasn't actually testing and triggering on the interesting
case.
In particular, just about *any* dput() can indeed sleep, if you happen
to race with another thread deleting the file in question, and you then
lose the race to the be the last dput() for that file. But because it's
a very rare race, the debugging code would never trigger it in practice.
Why is this problematic? The new d_rcu_to_refcount() (see commit
15570086b590: "vfs: reimplement d_rcu_to_refcount() using
lockref_get_or_lock()") does a dput() for the failure case, and it does
it under the RCU lock. So potentially sleeping really is a bug.
But there's no way I'm going to fix this with the previous complicated
"lockref_get_or_lock()" interface. And rather than revert to the old
and crufty nested dentry locking code (which did get this right by
delaying the reference count updates until they were verified to be
safe), let's make forward progress.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is me being a bit OCD after all the dentry optimization work this
merge window: profiles end up showing 'dput()' as a rather expensive
operation, and there were two unrelated bad reasons for that.
The first reason was reading d_lockref.count for debugging purposes,
which touches the lockref cacheline (for reads) before really need to.
More importantly, the debugging test in question is _wrong_, and has
hidden bugs. It's true that we can only sleep when the count goes down
to zero, but the test as-is hides the much more subtle bug that happens
if we race with somebody else deleting the file.
Anyway we _will_ touch that cacheline, but let's do it for a write and
in the right routine (ie in "lockref_put_or_lock()") which annotates the
costs better. So remove the misleading debug code.
The other was an unnecessary access to the cacheline that contains the
d_lru list, just to check whether we already were on the LRU list or
not. This is exactly what we have d_flags for, so that we can avoid
touching extra cache lines for the common case. So just add another bit
for "is this dentry on the LRU".
Finally, mark the tests properly likely/unlikely, so that the common
fast-paths are dense in the instruction stream.
This makes the profiles look much saner.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 2 (of many) from Al Viro:
"Mostly Miklos' series this time"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify dcache.c inlined helpers where possible
fuse: drop dentry on failed revalidate
fuse: clean up return in fuse_dentry_revalidate()
fuse: use d_materialise_unique()
sysfs: use check_submounts_and_drop()
nfs: use check_submounts_and_drop()
gfs2: use check_submounts_and_drop()
afs: use check_submounts_and_drop()
vfs: check unlinked ancestors before mount
vfs: check submounts and drop atomically
vfs: add d_walk()
vfs: restructure d_genocide()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.
The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.
Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"Nothing major for this kernel, just maintenance updates"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (21 commits)
apparmor: add the ability to report a sha1 hash of loaded policy
apparmor: export set of capabilities supported by the apparmor module
apparmor: add the profile introspection file to interface
apparmor: add an optional profile attachment string for profiles
apparmor: add interface files for profiles and namespaces
apparmor: allow setting any profile into the unconfined state
apparmor: make free_profile available outside of policy.c
apparmor: rework namespace free path
apparmor: update how unconfined is handled
apparmor: change how profile replacement update is done
apparmor: convert profile lists to RCU based locking
apparmor: provide base for multiple profiles to be replaced at once
apparmor: add a features/policy dir to interface
apparmor: enable users to query whether apparmor is enabled
apparmor: remove minimum size check for vmalloc()
Smack: parse multiple rules per write to load2, up to PAGE_SIZE-1 bytes
Smack: network label match fix
security: smack: add a hash table to quicken smk_find_entry()
security: smack: fix memleak in smk_write_rules_list()
xattr: Constify ->name member of "struct xattr".
...
|
|
We have to implement ->release() and trigger writeback from it.
Otherwise we might lose dirty pages at munmap().
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
d_invalidate() is the standard VFS method to invalidate dentry.
compare to d_delete(), it also try shrinking children dentries.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
commit 6f60f889 (ceph: fix freeing inode vs removing session caps race)
introduced ceph_lookup_inode(). But there is already a ceph_find_inode()
which provides similar function. So remove ceph_lookup_inode(), use
ceph_find_inode() instead.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <alex.elder@linary.org>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
The linux-next build bot found a three of warnings, this addresses all of them.
* non-ANSI function declaration of function 'ceph_fscache_register' and
'ceph_fscache_unregister'
* symbol 'ceph_cache_netfs' was not declared, now it's extern in the header.
* warning: "pr_fmt" redefined
Signed-off-by: Milosz Tanski <milosz@adfin.com>
|
|
Previously we would always try to enqueue work even if the filesystem is not
mounted with fscache enabled (or the file has no cookie). In the case of the
filesystem mouned nofsc (but with fscache compiled in) this would lead to a
crash.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
|
|
Previous patch that allowed us to cleanup most of the issues with pages marked
as private_2 when calling ceph_readpages. However, there seams to be a case in
the error case clean up in start read that still trigers this from time to
time. I've only seen this one a couple times.
BUG: Bad page state in process petabucket pfn:335b82
page:ffffea000cd6e080 count:0 mapcount:0 mapping: (null) index:0x0
page flags: 0x200000000001000(private_2)
Call Trace:
[<ffffffff81563442>] dump_stack+0x46/0x58
[<ffffffff8112c7f7>] bad_page+0xc7/0x120
[<ffffffff8112cd9e>] free_pages_prepare+0x10e/0x120
[<ffffffff8112e580>] free_hot_cold_page+0x40/0x160
[<ffffffff81132427>] __put_single_page+0x27/0x30
[<ffffffff81132d95>] put_page+0x25/0x40
[<ffffffffa02cb409>] ceph_readpages+0x2e9/0x6f0 [ceph]
[<ffffffff811313cf>] __do_page_cache_readahead+0x1af/0x260
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Previously ceph_readpage_to_fscache did not call if page was marked as cached
before calling fscache_write_page resulting in a BUG inside of fscache.
FS-Cache: Assertion failed
------------[ cut here ]------------
kernel BUG at fs/fscache/page.c:874!
invalid opcode: 0000 [#1] SMP
Call Trace:
[<ffffffffa02e6566>] __ceph_readpage_to_fscache+0x66/0x80 [ceph]
[<ffffffffa02caf84>] readpage_nounlock+0x124/0x210 [ceph]
[<ffffffffa02cb08d>] ceph_readpage+0x1d/0x40 [ceph]
[<ffffffff81126db6>] generic_file_aio_read+0x1f6/0x700
[<ffffffffa02c6fcc>] ceph_aio_read+0x5fc/0xab0 [ceph]
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
In some cases the ceph readapages code code bails without filling all the pages
already marked by fscache. When we return back to readahead code this causes
a BUG.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
|
|
Adding support for fscache to the Ceph filesystem. This would bring it to on
par with some of the other network filesystems in Linux (like NFS, AFS, etc...)
In order to mount the filesystem with fscache the 'fsc' mount option must be
passed.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Patches for Ceph FS-Cache support
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3, reiserfs, udf & isofs fixes from Jan Kara:
"The contains a bunch of ext3 cleanups and minor improvements, major
reiserfs locking changes which should hopefully fix deadlocks
introduced by BKL removal, and udf/isofs changes to refuse mounting fs
rw instead of mounting it ro automatically which makes eject button
work as expected for all media (see the changelog for why userspace
should be ok with this change)"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
jbd: use a single printk for jbd_debug()
reiserfs: locking, release lock around quota operations
reiserfs: locking, handle nested locks properly
reiserfs: locking, push write lock out of xattr code
jbd: relocate assert after state lock in journal_commit_transaction()
udf: Refuse RW mount of the filesystem instead of making it RO
udf: Standardize return values in mount sequence
isofs: Refuse RW mount of the filesystem instead of making it RO
ext3: allow specifying external journal by pathname mount option
jbd: remove unneeded semicolon
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes the following major enhancement patches:
- support inline xattrs
- add sysfs support to control GCs explicitly
- add proc entry to show the current segment usage information
- improve the GC/SSR performance
The other bug fixes are as follows:
- avoid the overflow on status calculation
- fix some error handling routines
- fix inconsistent xattr states after power-off-recovery
- fix incorrect xattr node offset definition
- fix deadlock condition in fsync
- fix the fdatasync routine for power-off-recovery"
* tag 'for-f2fs-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
f2fs: optimize gc for better performance
f2fs: merge more bios of node block writes
f2fs: avoid an overflow during utilization calculation
f2fs: trigger GC when there are prefree segments
f2fs: use strncasecmp() simplify the string comparison
f2fs: fix omitting to update inode page
f2fs: support the inline xattrs
f2fs: add the truncate_xattr_node function
f2fs: introduce __find_xattr for readability
f2fs: reserve the xattr space dynamically
f2fs: add flags for inline xattrs
f2fs: fix error return code in init_f2fs_fs()
f2fs: fix wrong BUG_ON condition
f2fs: fix memory leak when init f2fs filesystem fail
f2fs: fix a compound statement label error
f2fs: avoid writing inode redundantly when creating a file
f2fs: alloc_page() doesn't return an ERR_PTR
f2fs: should cover i_xattr_nid with its xattr node page lock
f2fs: check the free space first in new_node_page
f2fs: clean up the needless end 'return' of void function
...
|
|
Currently the fscache code expect the netfs to call fscache_readpages_or_alloc
inside the aops readpages callback. It marks all the pages in the list
provided by readahead with PG_private_2. In the cases that the netfs fails to
read all the pages (which is legal) it ends up returning to the readahead and
triggering a BUG. This happens because the page list still contains marked
pages.
This patch implements a simple fscache_readpages_cancel function that the netfs
should call before returning from readpages. It will revoke the pages from the
underlying cache backend and unmark them.
The problem was originally worked out in the Ceph devel tree, but it also
occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages()
will clean up the unprocessed pages in the case of an error.
This can be used to address the following oops:
[12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e
[12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping:
(null) index:0x0
[12410647.597298] page flags: 0x200000000001000(private_2)
...
[12410647.597334] Call Trace:
[12410647.597345] [<ffffffff815523f2>] dump_stack+0x19/0x1b
[12410647.597356] [<ffffffff8111def7>] bad_page+0xc7/0x120
[12410647.597359] [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120
[12410647.597361] [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170
[12410647.597363] [<ffffffff81123507>] __put_single_page+0x27/0x30
[12410647.597365] [<ffffffff81123df5>] put_page+0x25/0x40
[12410647.597376] [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph]
[12410647.597379] [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260
[12410647.597382] [<ffffffff81122ea1>] ra_submit+0x21/0x30
[12410647.597384] [<ffffffff81118f64>] filemap_fault+0x254/0x490
[12410647.597387] [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0
[12410647.597391] [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0
[12410647.597395] [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0
[12410647.597398] [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930
[12410647.597401] [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110
[12410647.597403] [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10
[12410647.597405] [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[12410647.597407] [<ffffffff8113f361>] handle_mm_fault+0x251/0x370
[12410647.597411] [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30
[12410647.597414] [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550
[12410647.597418] [<ffffffff8108011d>] ? up_write+0x1d/0x20
[12410647.597422] [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0
[12410647.597425] [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240
[12410647.597427] [<ffffffff8155c3ae>] do_page_fault+0xe/0x10
[12410647.597431] [<ffffffff81558818>] page_fault+0x28/0x30
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Implement the FS-Cache interface to check the consistency of a cache object in
CacheFiles.
Original-author: Hongyi Jia <jiayisuse@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hongyi Jia <jiayisuse@gmail.com>
cc: Milosz Tanski <milosz@adfin.com>
|
|
Extend the fscache netfs API so that the netfs can ask as to whether a cache
object is up to date with respect to its corresponding netfs object:
int fscache_check_consistency(struct fscache_cookie *cookie)
This will call back to the netfs to check whether the auxiliary data associated
with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
may also return -ENOMEM and -ERESTARTSYS.
The backends now have to implement a mandatory operation pointer:
int (*check_consistency)(struct fscache_object *object)
that corresponds to the above API call. FS-Cache takes care of pinning the
object and the cookie in memory and managing this call with respect to the
object state.
Original-author: Hongyi Jia <jiayisuse@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hongyi Jia <jiayisuse@gmail.com>
cc: Milosz Tanski <milosz@adfin.com>
|
|
Pull sparc changes from David Miller:
"Several bug fixes (from Kirill Tkhai, Geery Uytterhoeven, and Alexey
Dobriyan) and some support for Fujitsu sparc64x chips (from Allen
Pais)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Export flush_ptrace_access() (needed by lustre)
sparc: fix PCI device proc file mmap(2)
sparc64: Remove RWSEM export leftovers
sparc64: Fix off by one in trampoline TLB mapping installation loop.
sparc64: Fix ITLB handler of null page
esp_scsi: Fix tag state corruption when autosensing.
sparc64: Fix not SRA'ed %o5 in 32-bit traced syscall
sparc64: cleanup: Rename ret_from_syscall to ret_from_fork
sparc32: Fix exit flag passed from traced sys_sigreturn
sparc64: Fix wrong syscall return value passed to trace_sys_exit()
support sparc64x chip type in cpumap.c
cpu hw caps support for sparc64x
|
|
Drop a subtree when we find that it has moved or been delated. This can be
done as long as there are no submounts under this location.
If the directory was moved and we come across the same directory in a
future lookup it will be reconnected by d_materialise_unique().
Signed-off-by: Anand Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
On errors unrelated to the filesystem's state (ENOMEM, ENOTCONN) return the
error itself from ->d_revalidate() insted of returning zero (invalid).
Also make a common label for invalidating the dentry. This will be used by
the next patch.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Use d_materialise_unique() instead of d_splice_alias(). This allows dentry
subtrees to be moved to a new place if there moved, even if something is
referencing a dentry in the subtree (open fd, cwd, etc..).
This will also allow us to drop a subtree if it is found to be replaced by
something else. In this case the disconnected subtree can later be
reconnected to its new location.
d_materialise_unique() ensures that a directory entry only ever has one
alias. We keep fc->inst_mutex around the calls for d_materialise_unique()
on directories to prevent a race with mkdir "stealing" the inode.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Do have_submounts(), shrink_dcache_parent() and d_drop() atomically.
check_submounts_and_drop() can deal with negative dentries and
non-directories as well.
Non-directories can also be mounted on. And just like directories we don't
want these to disappear with invalidation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Do have_submounts(), shrink_dcache_parent() and d_drop() atomically.
check_submounts_and_drop() can deal with negative dentries and
non-directories as well.
Non-directories can also be mounted on. And just like directories we don't
want these to disappear with invalidation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Do have_submounts(), shrink_dcache_parent() and d_drop() atomically.
check_submounts_and_drop() can deal with negative dentries and
non-directories as well.
Non-directories can also be mounted on. And just like directories we don't
want these to disappear with invalidation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Do have_submounts(), shrink_dcache_parent() and d_drop() atomically.
check_submounts_and_drop() can deal with negative dentries as well.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
prevent mounts to be added to the disconnected subtree using relative paths
after the d_drop().
This patch fixes these issues by checking for unlinked (unhashed, non-root)
ancestors before proceeding with the mount. This is done with rename
seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
including our dentry itself. This ensures that the only one of
check_submounts_and_drop() or has_unlinked_ancestor() can succeed.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount.
Process A: have_submounts() -> returns false
Process B: mount() -> success
Process A: d_drop()
This patch prepares the ground for the fix by doing the following
operations all under the same rename lock:
have_submounts()
shrink_dcache_parent()
d_drop()
This is actually an optimization since have_submounts() and
shrink_dcache_parent() both traverse the same dentry tree separately.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: David Howells <dhowells@redhat.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This one replaces three instances open coded tree walking (have_submounts,
select_parent, d_genocide) with a common helper.
In addition to slightly reducing the kernel size, this simplifies the
callers and makes them less bug prone.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
It shouldn't matter when we decrement the refcount during the walk as long
as we do it exactly once.
Restructure d_genocide() to do the killing on entering the dentry instead
of when leaving it. This helps creating a common helper for tree walking.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba "Fix rmmod/read/write races in /proc entries"
must have broken mmapping of PCI device proc files on Sparc.
Notice how it adds wrapper around ->mmap but doesn't do it around ->get_unmapped_area.
Add wrapper around ->get_unmapped_area.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have a be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/
This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.
There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
|
|
GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
it actually increased the file size. If gfs2_fsync was called without
I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
returning, so any metadata or journaled data changes were not getting
fsynced. This meant that writes to the middle of files were not always
getting fsynced properly.
This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
updated during a write. It also make gfs2_sync flush the incore log
if I_DIRTY_PAGES is set, and the file is using data journalling. This
will make sure that all incore logged data gets written to disk before
returning from a fsync.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch checks for the first mounter being a specator. If so, it
makes sure all the journals are clean. If there's a dirty journal,
the mount fails.
Testing results:
# insmod gfs2.ko
# mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2
mount: permission denied
# dmesg | tail -2
[ 3390.655996] GFS2: fsid=MUSKETEER:home: Now mounting FS...
[ 3390.841336] GFS2: fsid=MUSKETEER:home.s: jid=0: Journal is dirty, so the first mounter must not be a spectator.
# mount -tgfs2 /dev/sasdrives/scratch /mnt/gfs2
# umount /mnt/gfs2
# mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2
# ls /mnt/gfs2|wc -l
352
# umount /mnt/gfs2
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch improves the gc efficiency by optimizing the victim
selection policy. With this optimization, the random re-write
performance could increase up to 20%.
For f2fs, when disk is in shortage of free spaces, gc will selects
dirty segments and moves valid blocks around for making more space
available. The gc cost of a segment is determined by the valid blocks
in the segment. The less the valid blocks, the higher the efficiency.
The ideal victim segment is the one that has the most garbage blocks.
Currently, it searches up to 20 dirty segments for a victim segment.
The selected victim is not likely the best victim for gc when there
are much more dirty segments. Why not searching more dirty segments
for a better victim? The cost of searching dirty segments is
negligible in comparison to moving blocks.
In this patch, it enlarges the MAX_VICTIM_SEARCH to 4096 to make
the search more aggressively for a possible better victim. Since
it also applies to victim selection for SSR, it will likely improve
the SSR efficiency as well.
The test case is simple. It creates as many files until the disk full.
The size for each file is 32KB. Then it writes as many as 100000
records of 4KB size to random offsets of random files in sync mode.
The testing was done on a 2GB partition of a SDHC card. Let's see the
test result of f2fs without and with the patch.
---------------------------------------
2GB partition, SDHC
create 52023 files of size 32768 bytes
random re-write 100000 records of 4KB
---------------------------------------
| file creation (s) | rewrite time (s) | gc count | gc garbage blocks |
[no patch] 341 4227 1174 174840
[patched] 324 2958 645 106682
It's obvious that, with the patch, f2fs finishes the test in 20+% less
time than without the patch. And internally it does much less gc with
higher efficiency than before.
Since the performance improvement is related to gc, it might not be so
obvious for other tests that do not trigger gc as often as this one (
This is because f2fs selects dirty segments for SSR use most of the
time when free space is in shortage). The well-known iozone test tool
was not used for benchmarking the patch becuase it seems do not have
a test case that performs random re-write on a full disk.
This patch is the revised version based on the suggestion from
Jaegeuk Kim.
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: suggested simpler solution]
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Previously, we experience bio traces as follows when running simple sequential
write test.
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500104928, size = 4K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499922208, size = 368K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499914752, size = 140K
-> total 512K
The first one is to write an indirect node block, and the others are to write
direct node blocks.
The reason why there are two separate bios for direct node blocks is:
0. initial state
------------------ ------------------
| | |xxxxxxxx |
------------------ ------------------
1. write 368K
------------------ ------------------
| | |xxxxxxxxWWWWWWWW|
------------------ ------------------
2. write 140K
------------------ ------------------
|WWWWWWW | |xxxxxxxxWWWWWWWW|
------------------ ------------------
This is because f2fs_write_node_pages tries to write just 512K totally, so that
we can lose the chance to merge more bios nicely.
After this patch is applied, we can get the following bio traces.
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500103168, size = 8K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500111368, size = 4K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500107272, size = 512K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500108296, size = 512K
f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500109320, size = 500K
And finally, we can improve the sequential write performance,
from 458.775 MB/s to 479.945 MB/s on SSD.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a workqueue cleanup and the removal of incorrect and
unneeded signal blocking"
* tag 'dlm-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: remove signal blocking
dlm: WQ_NON_REENTRANT is meaningless and going away
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features for 3.12:
- Added aggressive extent caching using the extent status tree. This
can actually decrease memory usage in read-mostly workloads since
the information is much more compactly stored in the extent status
tree than if we had to keep the extent tree metadata blocks in the
buffer cache. This also improves Asynchronous I/O since it is it
makes much less likely that we need to do metadata I/O to lookup
the extent tree information.
- Improve the recovery after corrupted allocation bitmaps are found
when running in errors=ignore mode.
Also fixed some writeback vs truncate races when using a blocksize
less than the page size"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
ext4: allow specifying external journal by pathname mount option
ext4: mark group corrupt on group descriptor checksum
ext4: mark block group as corrupt on inode bitmap error
ext4: mark block group as corrupt on block bitmap error
ext4: fix type declaration of ext4_validate_block_bitmap
ext4: error out if verifying the block bitmap fails
jbd2: Fix endian mixing problems in the checksumming code
ext4: isolate ext4_extents.h file
ext4: Fix misspellings using 'codespell' tool
ext4: convert write_begin methods to stable_page_writes semantics
ext4: fix use of potentially uninitialized variables in debugging code
ext4: fix lost truncate due to race with writeback
ext4: simplify truncation code in ext4_setattr()
ext4: fix ext4_writepages() in presence of truncate
ext4: move test whether extent to map can be extended to one place
ext4: fix warning in ext4_da_update_reserve_space()
quota: provide interface for readding allocated space into reserved space
ext4: avoid reusing recently deleted inodes in no journal mode
ext4: allocate delayed allocation blocks before rename
ext4: start handle at least possible moment when renaming files
...
|
|
Function test_and_clear_bit implies a memory barrier, so subsequent
memory barriers are unnecessary.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue. This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.
The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.
Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara. I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.
JK: Fixed ext4 part, dynamic allocation of the workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore changes from Tony Luck:
"A big part of this is the addition of compression to the generic
pstore layer so that all backends can use the pitiful amounts of
storage they control more effectively. Three other small
fixes/cleanups too.
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore/ram: (really) fix undefined usage of rounddown_pow_of_two
pstore/ram: Read and write to the 'compressed' flag of pstore
efi-pstore: Read and write to the 'compressed' flag of pstore
erst: Read and write to the 'compressed' flag of pstore
powerpc/pseries: Read and write to the 'compressed' flag of pstore
pstore: Add file extension to pstore file if compressed
pstore: Add decompression support to pstore
pstore: Introduce new argument 'compressed' in the read callback
pstore: Add compression support to pstore
pstore/Kconfig: Select ZLIB_DEFLATE and ZLIB_INFLATE when PSTORE is selected
pstore: Add new argument 'compressed' in pstore write callback
powerpc/pseries: Remove (de)compression in nvram with pstore enabled
pstore: d_alloc_name() doesn't return an ERR_PTR
acpi/apei/erst: Add missing iounmap() on error in erst_exec_move_data()
|