Age | Commit message (Collapse) | Author |
|
The goal of this patchset is to support multiple hugetlb page sizes. This
is achieved by introducing a new struct hstate structure, which
encapsulates the important hugetlb state and constants (eg. huge page
size, number of huge pages currently allocated, etc).
The hstate structure is then passed around the code which requires these
fields, they will do the right thing regardless of the exact hstate they
are operating on.
This patch adds the hstate structure, with a single global instance of it
(default_hstate), and does the basic work of converting hugetlb to use the
hstate.
Future patches will add more hstate structures to allow for different
hugetlbfs mounts to have different page sizes.
[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Christoph recently added /proc/vmallocinfo file to get information about
vmalloc allocations.
This patch adds NUMA specific information, giving number of pages
allocated on each memory node.
This should help to check that vmalloc() is able to respect NUMA policies.
Example of output on a four nodes machine (one cpu per node)
1) network hash tables are evenly spreaded on four nodes (OK) (Same
point for inodes and dentries hash tables)
2) iptables tables (x_tables) are correctly allocated on each cpu node
(OK).
3) sys_swapon() allocates its memory from one node only.
4) each loaded module is using memory on one node.
Sysadmins could tune their setup to change points 3) and 4) if necessary.
grep "pages=" /proc/vmallocinfo
0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc2000031a000-0xffffc2000031d000 12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
0xffffc2000033e000-0xffffc20000341000 12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
0xffffc20000341000-0xffffc20000344000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
0xffffc20000344000-0xffffc20000347000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
0xffffc20000347000-0xffffc2000034a000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
0xffffc2000034a000-0xffffc2000034d000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
0xffffc20004381000-0xffffc20004402000 528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
0xffffffffa0022000-0xffffffffa0028000 24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
0xffffffffa0028000-0xffffffffa0050000 163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
0xffffffffa0050000-0xffffffffa0052000 8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
0xffffffffa0052000-0xffffffffa0056000 16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
0xffffffffa0056000-0xffffffffa0081000 176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
0xffffffffa0081000-0xffffffffa00ae000 184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
0xffffffffa00ae000-0xffffffffa00b1000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00b1000-0xffffffffa00b9000 32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
0xffffffffa00b9000-0xffffffffa00c4000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
0xffffffffa00c6000-0xffffffffa00e0000 106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
0xffffffffa00e0000-0xffffffffa00f1000 69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
0xffffffffa00f1000-0xffffffffa00f4000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00f4000-0xffffffffa00f7000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
[akpm@linux-foundation.org: fix comment text]
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mmap(MAP_PRIVATE) on hugetlbfs will succeed
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork(). At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.
We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork(). Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediatly
after. A failure here would be very undesirable.
Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad. The following situation is allowed to occur today.
1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
had taken care to ensure the pages existed
This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap(). When the parent performs COW, it
will try to satisfy the allocation without using reserves. If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure. If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.
To summarise the new behaviour:
1. If the original mapper performs COW on a private mapping with multiple
references, it will attempt to allocate a hugepage from the pool or
the buddy allocator without using the existing reserves. On fail, VMAs
mapping the same area are traversed and the page being COW'd is unmapped
where found. It will then steal the original page as the last mapper in
the normal way.
2. The VMAs the pages were unmapped from are flagged to note that pages
with data no longer exist. Future no-page faults on those VMAs will
terminate the process as otherwise it would appear that data was corrupted.
A warning is printed to the console that this situation occured.
2. If the child performs COW first, it will attempt to satisfy the COW
from the pool if there are enough pages or via the buddy allocator if
overcommit is allowed and the buddy allocator can satisfy the request. If
it fails, the child will be killed.
If the pool is large enough, existing applications will not notice that
the reserves were a factor. Existing applications depending on the
no-reserves been set are unlikely to exist as for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().
[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
until fork()
This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
a similar manner to the reservations taken for MAP_SHARED mappings. The
reserve count is accounted both globally and on a per-VMA basis for
private mappings. This guarantees that a process that successfully calls
mmap() will successfully fault all pages in the future unless fork() is
called.
The characteristics of private mappings of hugetlbfs files behaviour after
this patch are;
1. The process calling mmap() is guaranteed to succeed all future faults until
it forks().
2. On fork(), the parent may die due to SIGKILL on writes to the private
mapping if enough pages are not available for the COW. For reasonably
reliable behaviour in the face of a small huge page pool, children of
hugepage-aware processes should not reference the mappings; such as
might occur when fork()ing to exec().
3. On fork(), the child VMAs inherit no reserves. Reads on pages already
faulted by the parent will succeed. Successful writes will depend on enough
huge pages being free in the pool.
4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
and at fault time otherwise.
Before this patch, all reads or writes in the child potentially needs page
allocations that can later lead to the death of the parent. This applies
to reads and writes of uninstantiated pages as well as COW. After the
patch it is only a write to an instantiated page that causes problems.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
[Summary]
Split LRU-list of unused dentries to one per superblock to avoid soft
lock up during NFS mounts and remounting of any filesystem.
Previously I posted here:
http://lkml.org/lkml/2008/3/5/590
[Descriptions]
- background
dentry_unused is a list of dentries which are not referenced.
dentry_unused grows up when references on directories or files are
released. This list can be very long if there is huge free memory.
- the problem
When shrink_dcache_sb() is called, it scans all dentry_unused linearly
under spin_lock(), and if dentry->d_sb is differnt from given
superblock, scan next dentry. This scan costs very much if there are
many entries, and very ineffective if there are many superblocks.
IOW, When we need to shrink unused dentries on one dentry, but scans
unused dentries on all superblocks in the system. For example, we scan
500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
dentries on other superblocks.
In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
unused dentries on NFS, but scans 100,000,000 unused dentries on
superblocks in the system such as local ext3 filesystems. I hear NFS
mounting took 1 min on some system in use.
* : NFS uses virtual filesystem in rpc layer, so NFS is affected by
this problem.
100,000,000 is possible number on large systems.
Per-superblock LRU of unused dentried can reduce the cost in
reasonable manner.
- How to fix
I found this problem is solved by David Chinner's "Per-superblock
unused dentry LRU lists V3"(1), so I rebase it and add some fix to
reclaim with fairness, which is in Andrew Morton's comments(2).
1) http://lkml.org/lkml/2006/5/25/318
2) http://lkml.org/lkml/2006/5/25/320
Split LRU-list of unused dentries to each superblocks. Then, NFS
mounting will check dentries under a superblock instead of all. But
this spliting will break LRU of dentry-unused. So, I've attempted to
make reclaim unused dentrins with fairness by calculate number of
dentries to scan on this sb based on following way
number of dentries to scan on this sb =
count * (number of dentries on this sb / number of dentries in the machine)
- ToDo
- I have to measuring performance number and do stress tests.
- When unmount occurs during prune_dcache(), scanning on same
superblock, It is unable to reach next superblock because it is gone
away. We restart scannig superblock from first one, it causes
unfairness of reclaim unused dentries on first superblock. But I think
this happens very rarely.
- Test Results
Result on 6GB boxes with excessive unused dentries.
Without patch:
$ cat /proc/sys/fs/dentry-state
10181835 10180203 45 0 0 0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real 0m1.830s
user 0m0.001s
sys 0m1.653s
With this patch:
$ cat /proc/sys/fs/dentry-state
10236610 10234751 45 0 0 0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real 0m0.106s
user 0m0.002s
sys 0m0.032s
[akpm@linux-foundation.org: fix comments]
Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Chinner <dgc@sgi.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The double indirection here is not needed anywhere and hence (at least)
confusing.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch adds proper extern declarations for five variables in
include/linux/vmstat.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
- use kzalloc() instead of kmalloc() + memset()
- improve a printk info
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
In mode_to_acl when converting a Unix mode to a Windows ACL
the subauth fields of the SID in the ACL were translated
incorrectly on bigendian architectures
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
fs/cifs/cifssmb.c:3917:13: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:3917:13: expected bool [unsigned] [usertype] is_unicode
fs/cifs/cifssmb.c:3917:13: got restricted __le16
The comment explains why __force is used here.
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Move the code that handles ATTR_SIZE changes to its own function. This
makes for a smaller function and reduces the level of indentation.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
Put CIFS sockets in their own class to avoid some lockdep warnings. CIFS
sockets are not exposed to user-space, and so are not subject to the
same deadlock scenarios.
A similar change was made a couple of years ago for RPC sockets in commit
ed07536ed6731775219c1df7fa26a7588753e693.
This patch should prevent lockdep false-positives like this one:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.18-98.el5.jtltest.38.bz456320.1debug #1
-------------------------------------------------------
test5/2483 is trying to acquire lock:
(sk_lock-AF_INET){--..}, at: [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
but task is already holding lock:
(&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&inode->i_alloc_sem){--..}:
[<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf
[<ffffffff800a8a72>] lock_acquire+0x55/0x70
[<ffffffff8002e454>] notify_change+0xf5/0x2e0
[<ffffffff800a4e36>] down_write+0x3c/0x68
[<ffffffff8002e454>] notify_change+0xf5/0x2e0
[<ffffffff800e358d>] do_truncate+0x50/0x6b
[<ffffffff8005197c>] get_write_access+0x40/0x46
[<ffffffff80012cf1>] may_open+0x1d3/0x22e
[<ffffffff8001bc81>] open_namei+0x2c6/0x6dd
[<ffffffff800289c6>] do_filp_open+0x1c/0x38
[<ffffffff800683ef>] _spin_unlock+0x17/0x20
[<ffffffff800167a7>] get_unused_fd+0xf9/0x107
[<ffffffff8001a704>] do_sys_open+0x44/0xbe
[<ffffffff80060116>] system_call+0x7e/0x83
[<ffffffffffffffff>] 0xffffffffffffffff
-> #2 (&sysfs_inode_imutex_key){--..}:
[<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf
[<ffffffff8010f6df>] create_dir+0x26/0x1d7
[<ffffffff800a8a72>] lock_acquire+0x55/0x70
[<ffffffff8010f6df>] create_dir+0x26/0x1d7
[<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c
[<ffffffff800a819d>] __lock_acquire+0x9ca/0xadf
[<ffffffff8010f6df>] create_dir+0x26/0x1d7
[<ffffffff8010fc67>] sysfs_create_dir+0x58/0x76
[<ffffffff8015144c>] kobject_add+0xdb/0x198
[<ffffffff801be765>] class_device_add+0xb2/0x465
[<ffffffff8005a6ff>] kobject_get+0x12/0x17
[<ffffffff80225265>] register_netdevice+0x270/0x33e
[<ffffffff8022538c>] register_netdev+0x59/0x67
[<ffffffff80464d40>] net_olddevs_init+0xb/0xac
[<ffffffff80448a79>] init+0x1f9/0x2fc
[<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
[<ffffffff80067f86>] trace_hardirqs_on_thunk+0x35/0x37
[<ffffffff80061079>] child_rip+0xa/0x11
[<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
[<ffffffff800606a8>] restore_args+0x0/0x30
[<ffffffff80179a59>] acpi_ds_init_one_object+0x0/0x80
[<ffffffff80448880>] init+0x0/0x2fc
[<ffffffff8006106f>] child_rip+0x0/0x11
[<ffffffffffffffff>] 0xffffffffffffffff
-> #1 (rtnl_mutex){--..}:
[<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf
[<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7
[<ffffffff800a8a72>] lock_acquire+0x55/0x70
[<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7
[<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c
[<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7
[<ffffffff802451b0>] do_ip_setsockopt+0x6d1/0x9bf
[<ffffffff800a575e>] lock_release_holdtime+0x27/0x48
[<ffffffff800a575e>] lock_release_holdtime+0x27/0x48
[<ffffffff8006a85e>] do_page_fault+0x503/0x835
[<ffffffff8012cbf6>] socket_has_perm+0x5b/0x68
[<ffffffff80245556>] ip_setsockopt+0x22/0x78
[<ffffffff8021c973>] sys_setsockopt+0x91/0xb7
[<ffffffff800602a6>] tracesys+0xd5/0xdf
[<ffffffffffffffff>] 0xffffffffffffffff
-> #0 (sk_lock-AF_INET){--..}:
[<ffffffff800a5037>] print_stack_trace+0x59/0x68
[<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf
[<ffffffff800a8a72>] lock_acquire+0x55/0x70
[<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
[<ffffffff80035466>] lock_sock+0xd4/0xe4
[<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0
[<ffffffff800606a8>] restore_args+0x0/0x30
[<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
[<ffffffff80057540>] sock_sendmsg+0xf3/0x110
[<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a10e4>] kernel_text_address+0x1a/0x26
[<ffffffff8006f4e2>] dump_trace+0x211/0x23a
[<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88
[<ffffffff8840221a>] MD5Final+0xaf/0xc2 [cifs]
[<ffffffff884032ec>] cifs_calculate_signature+0x55/0x69 [cifs]
[<ffffffff8021d891>] kernel_sendmsg+0x35/0x47
[<ffffffff883ff38e>] smb_send+0xa3/0x151 [cifs]
[<ffffffff883ff5de>] SendReceive+0x1a2/0x448 [cifs]
[<ffffffff800a812f>] __lock_acquire+0x95c/0xadf
[<ffffffff883e758a>] CIFSSMBSetEOF+0x20d/0x25b [cifs]
[<ffffffff883fa430>] cifs_set_file_size+0x110/0x3b7 [cifs]
[<ffffffff883faa89>] cifs_setattr+0x3b2/0x6f6 [cifs]
[<ffffffff8002e454>] notify_change+0xf5/0x2e0
[<ffffffff8002e4a4>] notify_change+0x145/0x2e0
[<ffffffff800e358d>] do_truncate+0x50/0x6b
[<ffffffff8005197c>] get_write_access+0x40/0x46
[<ffffffff80012cf1>] may_open+0x1d3/0x22e
[<ffffffff8001bc81>] open_namei+0x2c6/0x6dd
[<ffffffff800289c6>] do_filp_open+0x1c/0x38
[<ffffffff800683ef>] _spin_unlock+0x17/0x20
[<ffffffff800167a7>] get_unused_fd+0xf9/0x107
[<ffffffff8001a704>] do_sys_open+0x44/0xbe
[<ffffffff800602a6>] tracesys+0xd5/0xdf
[<ffffffffffffffff>] 0xffffffffffffffff
other info that might help us debug this:
2 locks held by test5/2483:
#0: (&inode->i_mutex){--..}, at: [<ffffffff800e3582>] do_truncate+0x45/0x6b
#1: (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0
stack backtrace:
Call Trace:
[<ffffffff800a6a7b>] print_circular_bug_tail+0x65/0x6e
[<ffffffff800a5037>] print_stack_trace+0x59/0x68
[<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf
[<ffffffff800a8a72>] lock_acquire+0x55/0x70
[<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
[<ffffffff80035466>] lock_sock+0xd4/0xe4
[<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0
[<ffffffff800606a8>] restore_args+0x0/0x30
[<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
[<ffffffff80057540>] sock_sendmsg+0xf3/0x110
[<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e
[<ffffffff800a10e4>] kernel_text_address+0x1a/0x26
[<ffffffff8006f4e2>] dump_trace+0x211/0x23a
[<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88
[<ffffffff8840221a>] :cifs:MD5Final+0xaf/0xc2
[<ffffffff884032ec>] :cifs:cifs_calculate_signature+0x55/0x69
[<ffffffff8021d891>] kernel_sendmsg+0x35/0x47
[<ffffffff883ff38e>] :cifs:smb_send+0xa3/0x151
[<ffffffff883ff5de>] :cifs:SendReceive+0x1a2/0x448
[<ffffffff800a812f>] __lock_acquire+0x95c/0xadf
[<ffffffff883e758a>] :cifs:CIFSSMBSetEOF+0x20d/0x25b
[<ffffffff883fa430>] :cifs:cifs_set_file_size+0x110/0x3b7
[<ffffffff883faa89>] :cifs:cifs_setattr+0x3b2/0x6f6
[<ffffffff8002e454>] notify_change+0xf5/0x2e0
[<ffffffff8002e4a4>] notify_change+0x145/0x2e0
[<ffffffff800e358d>] do_truncate+0x50/0x6b
[<ffffffff8005197c>] get_write_access+0x40/0x46
[<ffffffff80012cf1>] may_open+0x1d3/0x22e
[<ffffffff8001bc81>] open_namei+0x2c6/0x6dd
[<ffffffff800289c6>] do_filp_open+0x1c/0x38
[<ffffffff800683ef>] _spin_unlock+0x17/0x20
[<ffffffff800167a7>] get_unused_fd+0xf9/0x107
[<ffffffff8001a704>] do_sys_open+0x44/0xbe
[<ffffffff800602a6>] tracesys+0xd5/0xdf
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
fs/lockd/svcproc.c:115:11: warning: incorrect type in initializer (different base types)
fs/lockd/svcproc.c:115:11: expected int [signed] rc
fs/lockd/svcproc.c:115:11: got restricted __be32 [usertype] <noident>
... and so on...
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
ipw2200: Call netif_*_queue() interfaces properly.
netxen: Needs to include linux/vmalloc.h
[netdrvr] atl1d: fix !CONFIG_PM build
r6040: rework init_one error handling
r6040: bump release number to 0.18
r6040: handle RX fifo full and no descriptor interrupts
r6040: change the default waiting time
r6040: use definitions for magic values in descriptor status
r6040: completely rework the RX path
r6040: call napi_disable when puting down the interface and set lp->dev accordingly.
mv643xx_eth: fix NETPOLL build
r6040: rework the RX buffers allocation routine
r6040: fix scheduling while atomic in r6040_tx_timeout
r6040: fix null pointer access and tx timeouts
r6040: prefix all functions with r6040
rndis_host: support WM6 devices as modems
at91_ether: use netstats in net_device structure
sfc: Create one RX queue and interrupt per CPU package by default
sfc: Use a separate workqueue for resets
sfc: I2C adapter initialisation fixes
...
|
|
get_proc_net() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits)
arm: bus_id -> dev_name() and dev_set_name() conversions
sparc64: fix up bus_id changes in sparc core code
3c59x: handle pci_name() being const
MTD: handle pci_name() being const
HP iLO driver
sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute
sysdev: Add utility functions for simple int/ulong variable sysdev attributes
sysdev: Pass the attribute to the low level sysdev show/store function
driver core: Suppress sysfs warnings for device_rename().
kobject: Transmit return value of call_usermodehelper() to caller
sysfs-rules.txt: reword API stability statement
debugfs: Implement debugfs_remove_recursive()
HOWTO: change email addresses of James in HOWTO
always enable FW_LOADER unless EMBEDDED=y
uio-howto.tmpl: use unique output names
uio-howto.tmpl: use standard copyright/legal markings
sysfs: don't call notify_change
sysdev: fix debugging statements in registration code.
kobject: should use kobject_put() in kset-example
kobject: reorder kobject to save space on 64 bit builds
...
|
|
struct pagemap_walk was placed on stack, some hooks are initialized, the
rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The Linux kernel puts the filename argument of execve() into the new
address space. Many developers are surprised to learn this. Those who
know and could use it, object "But it's not documented."
Those who want to use it dislike the expression
(char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
because it requires locating the last original environment variable,
and assumes that the filename follows the characters.
This patch documents the insertion of the filename, and makes it easier
to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.
In many cases readlink("/proc/self/exe",) gives the same answer. But if
all the original pages get unmapped, then the kernel erases the symlink
for /proc/self/exe. This can happen when a program decompressor does a
good job of cleaning up after uncompressing directly to memory, so that
the address space of the target program looks the same as if compression
had never happened. One example is http://upx.sourceforge.net .
One notable use of the underlying concept (what path containED the
executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for
the near term, it may be a good idea for user-mode code to use both
/proc/self/exe and AT_EXECFN as fall-back methods for each other.
/proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val
also can get overwritten, although in nearly all cases this would be the
result of a bug.
The runtime cost is one NEW_AUX_ENT using two words of stack space. The
underlying value is maintained already as bprm->exec; setup_arg_pages()
in fs/exec.c slides it for stack_shift, etc.
Signed-off-by: John Reiser <jreiser@BitWagon.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
driver core: Suppress sysfs warnings for device_rename().
Renaming network devices to an already existing name is not
something we want sysfs to print a scary warning for, since the
callers can deal with this correctly. So let's introduce
sysfs_create_link_nowarn() which gets rid of the common warning.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
debugfs_remove_recursive() will remove a dentry and all its children.
Drivers can use this to zap their whole debugfs tree so that they don't
need to keep track of every single debugfs dentry they created.
It may fail to remove the whole tree in certain cases:
sh-3.2# rmmod atmel-mci < /sys/kernel/debug/mmc0/ios/clock
mmc0: card b368 removed
atmel_mci atmel_mci.0: Lost dma0chan1, falling back to PIO
sh-3.2# ls /sys/kernel/debug/mmc0/
ios
But I'm not sure if that case can be handled in any sane manner.
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>
Cc: Pierre Ossman <drzeus-list@drzeus.cx>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
sysfs_chmod_file() calls notify_change() to change the permission bits
on a sysfs file. Replace with explicit call to sysfs_setattr() and
fsnotify_change().
This is equivalent, except that security_inode_setattr() is not
called. This function is called by drivers, so the security checks do
not make any sense.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Kobjects do not have a limit in name size since a while, so stop
pretending that they do.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
device_create() is race-prone, so use the race-free
device_create_drvdata() instead as device_create() is going away.
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
* 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits)
nfsd: nfs4xdr.c do-while is not a compound statement
nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
lockd: Pass "struct sockaddr *" to new failover-by-IP function
lockd: get host reference in nlmsvc_create_block() instead of callers
lockd: minor svclock.c style fixes
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
lockd: nlm_release_host() checks for NULL, caller needn't
file lock: reorder struct file_lock to save space on 64 bit builds
nfsd: take file and mnt write in nfs4_upgrade_open
nfsd: document open share bit tracking
nfsd: tabulate nfs4 xdr encoding functions
nfsd: dprint operation names
svcrdma: Change WR context get/put to use the kmem cache
svcrdma: Create a kmem cache for the WR contexts
svcrdma: Add flush_scheduled_work to module exit function
svcrdma: Limit ORD based on client's advertised IRD
svcrdma: Remove unused wait q from svcrdma_xprt structure
svcrdma: Remove unneeded spin locks from __svc_rdma_free
svcrdma: Add dma map count and WARN_ON
...
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits)
iucv: Fix bad merging.
net_sched: Add size table for qdiscs
net_sched: Add accessor function for packet length for qdiscs
net_sched: Add qdisc_enqueue wrapper
highmem: Export totalhigh_pages.
ipv6 mcast: Omit redundant address family checks in ip6_mc_source().
net: Use standard structures for generic socket address structures.
ipv6 netns: Make several "global" sysctl variables namespace aware.
netns: Use net_eq() to compare net-namespaces for optimization.
ipv6: remove unused macros from net/ipv6.h
ipv6: remove unused parameter from ip6_ra_control
tcp: fix kernel panic with listening_get_next
tcp: Remove redundant checks when setting eff_sacks
tcp: options clean up
tcp: Fix MD5 signatures for non-linear skbs
sctp: Update sctp global memory limit allocations.
sctp: remove unnecessary byteshifting, calculate directly in big-endian
sctp: Allow only 1 listening socket with SO_REUSEADDR
sctp: Do not leak memory on multiple listen() calls
sctp: Support ipv6only AF_INET6 sockets.
...
|
|
git://oss.oracle.com/git/jlbec/linux-2.6
* 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6:
configfs: Allow ->make_item() and ->make_group() to return detailed errors.
Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors."
|
|
Move the line disciplines towards a conventional ->ops arrangement. For
the moment the actual 'tty_ldisc' struct in the tty is kept as part of
the tty struct but this can then be changed if it turns out that when it
all settles down we want to refcount ldiscs separately to the tty.
Pull the ldisc code out of /proc and put it with our ldisc code.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6
|
|
The WRITEMEM macro produces sparse warnings of the form:
fs/nfsd/nfs4xdr.c:2668:2: warning: do-while statement is not a compound statement
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
|
|
Thanks to problem report and original patch from Harvey Harrison.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benny Halevy <bhalevy@panasas.com>
|
|
They are symmetrical to single_open ones :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are already 7 of them - time to kill some duplicate code.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:
Documentation/powerpc/booting-without-of.txt
drivers/atm/Makefile
drivers/net/fs_enet/fs_enet-main.c
drivers/pci/pci-acpi.c
net/8021q/vlan.c
net/iucv/iucv.c
|
|
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return ERR_PTR() values. These errors are
bubbled up appropriately. NULL returns are changed to -ENOMEM for
compatibility.
Also updated are the in-kernel users of configfs.
This is a rework of reverted commit 11c3b79218390a139f2d474ee1e983a672d5839a.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
errors."
This reverts commit 11c3b79218390a139f2d474ee1e983a672d5839a. The code
will move to PTR_ERR().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The truncate patch should not use the i_allocated_meta_blocks
value. So add seperate functions to be used in the truncate
and alloc path. We also need to release the meta-data block
that we reserved for the blocks that we are truncating.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
With the FLEX_BG layout, there is no reason why extents can't cross
block groups, so make the truncate code reserve enough credits so we
don't BUG if we come across such an extent.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The extents codepath for ext4_truncate() requests journal transaction
credits in very small chunks, requesting only what is needed. This
means there may not be enough credits left on the transaction handle
after ext4_truncate() returns and then when ext4_delete_inode() tries
finish up its work, it may not have enough transaction credits,
causing a BUG() oops in the jbd2 core.
Also, reserve an extra 2 blocks when starting an ext4_delete_inode()
since we need to update the inode bitmap, as well as update the
orphaned inode linked list.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The ext4_ext_journal_restart() is a convenience function which checks
to see if the requested number of credits is present, and if so it
closes the current transaction and attaches the current handle to the
new transaction. Unfortunately, it wasn't proprely checking the
return value from ext4_journal_extend(), so it was starting a new
transaction when one was not necessary, and returning an error when
all that was necessary was to restart the handle with a new
transaction.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
ext4_da_write_begin needs to call journal_stop before returning,
if the page allocation fails.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
In ordered mode, the current jbd2 aborts the journal if a file data buffer
has an error. But this behavior is unintended, and we found that it has
been adopted accidentally.
This patch undoes it and just calls printk() instead of aborting the
journal. Unlike a similar patch for ext3/jbd, file data buffers are
written via generic_writepages(). But we also need to set AS_EIO
into their mappings because wait_on_page_writeback_range() clears
AS_EIO before a user process sees it.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
A transient I/O error can corrupt inode data. Here is the scenario:
(1) update inode_A at the block_B
(2) pdflush writes out new inode_A to the filesystem, but it results
in write I/O error, at this point, BH_Uptodate flag of the buffer
for block_B is cleared and BH_Write_EIO is set
(3) create new inode_C which located at block_B, and
__ext4_get_inode_loc() tries to read on-disk block_B because the
buffer is not uptodate
(4) if it can read on-disk block_B successfully, inode_A is
overwritten by old data
This patch makes __ext4_get_inode_loc() not read the inode block if the
buffer has BH_Write_EIO flag. In this case, the buffer should have the
latest information, so setting the uptodate flag to the buffer (this
avoids WARN_ON_ONCE() in mark_buffer_dirty().)
According to this change, we would need to test BH_Write_EIO flag for the
error checking. Currently nobody checks write I/O errors on metadata
buffers, but it will be done in other patches I'm working on.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, the locality group prealloc list is freed only when there
is a block allocation failure. This can result in large number of
entries in the preallocation list making ext4_mb_use_preallocated()
expensive.
To fix this, we convert the locality group prealloc list to a hash
list. The hash index is the order of number of blocks in the prealloc
space with a max order of 9. When adding prealloc space to the list we
make sure total entries for each order does not exceed 8. If it is
more than 8 we discard few entries and make sure the we have only <= 5
entries.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
NR_CPUS can be really large. We should be using nr_cpu_ids instead.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|