Age | Commit message (Collapse) | Author |
|
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity. A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.
The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.
And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
Instead of sending ->older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure. There are other fields of a work item that are
useful in these routines and in tracepoints.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
Useful for analyzing the dynamics of the throttling algorithms and
debugging user reported problems.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
It helps understand how various throttle bandwidths are updated.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.
RATIONALS
=========
- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled start foreground
writeback, it leads to N IO submitters from at least N different
inodes at the same time, end up with N different sets of IO being
issued with potentially zero locality to each other, resulting in
much lower elevator sort/merge efficiency and hence we seek the disk
all over the place to service the different sets of IO.
OTOH, if there is only one submission thread, it doesn't jump between
inodes in the same way when congestion clears - it keeps writing to
the same inode, resulting in large related chunks of sequential IOs
being issued to the disk. This is more efficient than the above
foreground writeback because the elevator works better and the disk
seeks less.
- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
* "CPU usage has dropped by ~55%", "it certainly appears that most of
the CPU time saving comes from the removal of contention on the
inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
cacheline bouncing, because the new code is able to call much less
frequently into balance_dirty_pages() and hence access the global
page states)
* the user space "App overhead" is reduced by 20%, by avoiding the
cacheline pollution by the complex writeback code path
* "for a ~5% throughput reduction", "the number of write IOs have
dropped by ~25%", and the elapsed time reduced from 41:42.17 to
40:53.23.
* On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
and improves IO throughput from 38MB/s to 42MB/s.
- IO size too small for fast arrays and too large for slow USB sticks
The write_chunk used by current balance_dirty_pages() cannot be
directly set to some large value (eg. 128MB) for better IO efficiency.
Because it could lead to more than 1 second user perceivable stalls.
Even the current 4MB write size may be too large for slow USB sticks.
The fact that balance_dirty_pages() starts IO on itself couples the
IO size to wait time, which makes it hard to do suitable IO size while
keeping the wait time under control.
Now it's possible to increase writeback chunk size proportional to the
disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
the larger writeback size dramatically reduces the seek count to 1/10
(far beyond my expectation) and improves the write throughput by 24%.
- long block time in balance_dirty_pages() hurts desktop responsiveness
Many of us may have the experience: it often takes a couple of seconds
or even long time to stop a heavy writing dd/cp/tar command with
Ctrl-C or "kill -9".
- IO pipeline broken by bumpy write() progress
There are a broad class of "loop {read(buf); write(buf);}" applications
whose read() pipeline will be under-utilized or even come to a stop if
the write()s have long latencies _or_ don't progress in a constant rate.
The current threshold based throttling inherently transfers the large
low level IO completion fluctuations to bumpy application write()s,
and further deteriorates with increasing number of dirtiers and/or bdi's.
For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
the rsync progresses very bumpy in legacy kernel, and throughput is
improved by 67% by this patchset. (plus the larger write chunk size,
it will be 93% speedup).
The new rate based throttling can support 1000+ dd's with excellent
smoothness, low latency and low overheads.
For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().
Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:
- in large NUMA systems, the per-cpu counters may have big accounting
errors, leading to big throttle wait time and jitters.
- NFS may kill large amount of unstable pages with one single COMMIT.
Because NFS server serves COMMIT with expensive fsync() IOs, it is
desirable to delay and reduce the number of COMMITs. So it's not
likely to optimize away such kind of bursty IO completions, and the
resulted large (and tiny) stall times in IO completion based throttling.
So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:
- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times
It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.
BEHAVIOR CHANGE
===============
(1) dirty threshold
Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.
Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.
(2) smoothness/responsiveness
Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
Add two fields to task_struct.
1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility
The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.
The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so that the dirty threshold will be exceeded
long before each task reached its nr_dirtied_pause and hence call
balance_dirty_pages().
The solution is to watch for the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force call balance_dirty_pages() for a chance to
set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.
On the sqrt in dirty_poll_interval():
It will serve as an initial guess when dirty pages are still in the
freerun area.
When dirty pages are floating inside the dirty control scope [freerun,
limit], a followup patch will use some refined dirty poll interval to
get the desired pause time.
thresh-dirty (MB) sqrt
1 16
2 22
4 32
8 45
16 64
32 90
64 128
128 181
256 256
512 362
1024 512
The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
be exceeded as long as there are less than 16 (or 512) concurrent dd's.
So sqrt naturally leads to less overheads and more safe concurrent tasks
for large memory servers, which have large (thresh-freerun) gaps.
peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
There are some imperfections in balanced_dirty_ratelimit.
1) large fluctuations
The dirty_rate used for computing balanced_dirty_ratelimit is merely
averaged in the past 200ms (very small comparing to the 3s estimation
period for write_bw), which makes rather dispersed distribution of
balanced_dirty_ratelimit.
It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)
The more practical way is filtering -- most singular
balanced_dirty_ratelimit points can be filtered out by remembering some
prev_balanced_rate and prev_prev_balanced_rate. However the more
reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_dirty_ratelimit. The truncates, due to its possibly
bumpy nature, can hardly be compensated smoothly. So let's face it. When
some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
high, dirty pages will go higher than the setpoint. task_ratelimit will
in turn become lower than dirty_ratelimit. So if we consider both
balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
only when they are on the same side of dirty_ratelimit, the systematical
errors in balanced_dirty_ratelimit won't be able to bring
dirty_ratelimit far away.
The balanced_dirty_ratelimit estimation may also be inaccurate near
@limit or @freerun, however is less an issue.
3) since we ultimately want to
- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible
the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
there is no point to bring up dirty_ratelimit in a hurry only to hurt
both the above two goals.
So, we make use of task_ratelimit to limit the update of dirty_ratelimit
in two ways:
1) avoid changing dirty rate when it's against the position control target
(the adjusted rate will slow down the progress of dirty pages going
back to setpoint).
2) limit the step size. task_ratelimit is changing values step by step,
leaving a consistent trace comparing to the randomly jumping
balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
errors in stable state and typically larger errors when there are big
errors in rate. So it's a pretty good limiting factor for the step
size of dirty_ratelimit.
Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
task_ratelimit is merely used as a limiting factor.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.
On write() syscall, use bdi->dirty_ratelimit
============================================
balance_dirty_pages(pages_dirtied)
{
task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
pause = pages_dirtied / task_ratelimit;
sleep(pause);
}
On every 200ms, update bdi->dirty_ratelimit
===========================================
bdi_update_dirty_ratelimit()
{
task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
bdi->dirty_ratelimit = balanced_dirty_ratelimit
}
Estimation of balanced bdi->dirty_ratelimit
===========================================
balanced task_ratelimit
-----------------------
balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.
IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:
dirty_rate == write_bw (1)
The fairness requirement gives us:
task_ratelimit = balanced_dirty_ratelimit
== write_bw / N (2)
where N is the number of dd tasks. We don't know N beforehand, but
still can estimate balanced_dirty_ratelimit within 200ms.
Start by throttling each dd task at rate
task_ratelimit = task_ratelimit_0 (3)
(any non-zero initial value is OK)
After 200ms, we measured
dirty_rate = # of pages dirtied by all dd's / 200ms
write_bw = # of pages written to the disk / 200ms
For the aggressive dd dirtiers, the equality holds
dirty_rate == N * task_rate
== N * task_ratelimit_0 (4)
Or
task_ratelimit_0 == dirty_rate / N (5)
Now we conclude that the balanced task ratelimit can be estimated by
write_bw
balanced_dirty_ratelimit = task_ratelimit_0 * ---------- (6)
dirty_rate
Because with (4) and (5) we can get the desired equality (1):
write_bw
balanced_dirty_ratelimit == (dirty_rate / N) * ----------
dirty_rate
== write_bw / N
Then using the balanced task ratelimit we can compute task pause times like:
task_pause = task->nr_dirtied / task_ratelimit
task_ratelimit with position control
------------------------------------
However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.
The dirty position control works by extending (2) to
task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7)
where pos_ratio is a negative feedback function that subjects to
1) f(setpoint) = 1.0
2) df/dx < 0
That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
pages are created less fast than they are cleaned, thus DROP to the
setpoints (and the reverse).
Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
remains CONSTANT for the past 200ms, we get
task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8)
Putting (8) into (6), we get the formula used in
bdi_update_dirty_ratelimit():
write_bw
balanced_dirty_ratelimit *= pos_ratio * ---------- (9)
dirty_rate
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
No behavior change.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.
CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip
* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
irq: Fix check for already initialized irq_domain in irq_domain_add
irq: Add declaration of irq_domain_simple_ops to irqdomain.h
* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86/rtc: Don't recursively acquire rtc_lock
* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
posix-cpu-timers: Cure SMP wobbles
sched: Fix up wchan borkage
sched/rt: Migrate equal priority tasks to available CPUs
|
|
David reported:
Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.
Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference. This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.
I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).
For example:
[davem@boricha build-x86_64-linux]$ ./test
process: before(0.001221967) after(0.498624371) diff(497402404)
thread: before(0.000081692) after(0.498316431) diff(498234739)
self: before(0.001223521) after(0.001240219) diff(16698)
[davem@boricha build-x86_64-linux]$
The diff of 'process' should always be >= the diff of 'thread'.
I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.
---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>
static pthread_barrier_t barrier;
static void *chew_cpu(void *arg)
{
pthread_barrier_wait(&barrier);
while (1)
__asm__ __volatile__("" : : : "memory");
return NULL;
}
int main(void)
{
clockid_t process_clock, my_thread_clock, th_clock;
struct timespec process_before, process_after;
struct timespec me_before, me_after;
struct timespec th_before, th_after;
struct timespec sleeptime;
unsigned long diff;
pthread_t th;
int err;
err = clock_getcpuclockid(0, &process_clock);
if (err)
return 1;
err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
if (err)
return 1;
pthread_barrier_init(&barrier, NULL, 2);
err = pthread_create(&th, NULL, chew_cpu, NULL);
if (err)
return 1;
err = pthread_getcpuclockid(th, &th_clock);
if (err)
return 1;
pthread_barrier_wait(&barrier);
err = clock_gettime(process_clock, &process_before);
if (err)
return 1;
err = clock_gettime(my_thread_clock, &me_before);
if (err)
return 1;
err = clock_gettime(th_clock, &th_before);
if (err)
return 1;
sleeptime.tv_sec = 0;
sleeptime.tv_nsec = 500000000;
nanosleep(&sleeptime, NULL);
err = clock_gettime(th_clock, &th_after);
if (err)
return 1;
err = clock_gettime(my_thread_clock, &me_after);
if (err)
return 1;
err = clock_gettime(process_clock, &process_after);
if (err)
return 1;
diff = process_after.tv_nsec - process_before.tv_nsec;
printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
process_before.tv_sec, process_before.tv_nsec,
process_after.tv_sec, process_after.tv_nsec, diff);
diff = th_after.tv_nsec - th_before.tv_nsec;
printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
th_before.tv_sec, th_before.tv_nsec,
th_after.tv_sec, th_after.tv_nsec, diff);
diff = me_after.tv_nsec - me_before.tv_nsec;
printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
me_before.tv_sec, me_before.tv_nsec,
me_after.tv_sec, me_after.tv_nsec, diff);
return 0;
}
This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.
This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().
Aside of that it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
* 'writeback-for-linus' of git://github.com/fengguang/linux:
writeback: show raw dirtied_when in trace writeback_single_inode
|
|
That flag no longer makes sense, since we don't look up automount points
as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT
handling was buggy to begin with: it would avoid automounting even for
cases where we really *needed* to do the automount handling, and could
return ENOENT for autofs entries that hadn't been instantiated yet.
With our new non-eager automount semantics, one discussion has been
about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
newfstatat() and fstatat64() system calls), but it's probably not worth
it: you can always force at least directory automounting by simply
adding the final '/' to the filename, which works for *all* of the stat
family system calls, old and new.
So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
result of our bad default behavior.
Acked-by: Ian Kent <raven@themaw.net>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since we've now turned around and made LOOKUP_FOLLOW *not* force an
automount, we want to add the ability to force an automount event on
lookup even if we don't happen to have one of the other flags that force
it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..)
Most cases will never want to use this, since you'd normally want to
delay automounting as long as possible, which usually implies
LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
the automount any more).
But Trond argued sufficiently forcefully that at a minimum bind mounting
a file and quotactl will want to force the automount lookup. Some other
cases (like nfs_follow_remote_path()) could use it too, although
LOOKUP_DIRECTORY would work there as well.
This commit just adds the flag and logic, no users yet, though. It also
doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
was made irrelevant by the same change that made us not follow on
LOOKUP_FOLLOW.
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] kvm: extension capability for new address space layout
[S390] kvm: fix address mode switching
|
|
* 'for-linus' of git://git.kernel.dk/linux-block:
floppy: use del_timer_sync() in init cleanup
blk-cgroup: be able to remove the record of unplugged device
block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
mm: Add comment explaining task state setting in bdi_forker_thread()
mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
block: simplify force plug flush code a little bit
block: change force plug flush call order
block: Fix queue_flag update when rq_affinity goes from 2 to 1
block: separate priority boosting from REQ_META
block: remove READ_META and WRITE_META
xen-blkback: fixed indentation and comments
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
|
|
598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address
spaces for kvm guest images) changed kvm on s390 to use a separate
address space for kvm guests. We can now put KVM guests anywhere
in the user address mode with a size up to 8PB - as long as the
memory is 1MB-aligned. This change was done without KVM extension
capability bit.
The change was added after 3.0, but we still have a chance to add
a feature bit before 3.1 (keeping the releases in a sane state).
We use number 71 to avoid collisions with other pending kvm patches
as requested by Alexander Graf.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: Alexander Graf <agraf@suse.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
irq_domain_simple_ops is exported, but is not declared in irqdomain.h,
so add it.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: marc.zyngier@arm.com
Cc: thomas.abraham@linaro.org
Cc: jamie@jamieiles.com
Cc: b-cousson@ti.com
Cc: shawn.guo@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: devicetree-discuss@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1316017900-19918-2-git-send-email-robherring2@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
* git://github.com/davem330/net:
tcp: fix validation of D-SACK
tcp: fix build error if !CONFIG_SYN_COOKIES
|
|
commit 946cedccbd7387 (tcp: Change possible SYN flooding messages)
added a build error if CONFIG_SYN_COOKIES=n
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6:
mfd: Fix omap-usb-host build failure
mfd: Make omap-usb-host TLL mode work again
mfd: Set MAX8997 irq pointer
mfd: Fix initialisation of tps65910 interrupts
mfd: Check for twl4030-madc NULL pointer
mfd: Copy the device pointer to the twl4030-madc structure
mfd: Rename wm8350 static gpio_set_debounce()
mfd: Fix value of WM8994_CONFIGURE_GPIO
|
|
* git://github.com/davem330/net: (62 commits)
ipv6: don't use inetpeer to store metrics for routes.
can: ti_hecc: include linux/io.h
IRDA: Fix global type conflicts in net/irda/irsysctl.c v2
net: Handle different key sizes between address families in flow cache
net: Align AF-specific flowi structs to long
ipv4: Fix fib_info->fib_metrics leak
caif: fix a potential NULL dereference
sctp: deal with multiple COOKIE_ECHO chunks
ibmveth: Fix checksum offload failure handling
ibmveth: Checksum offload is always disabled
ibmveth: Fix issue with DMA mapping failure
ibmveth: Fix DMA unmap error
pch_gbe: support ML7831 IOH
pch_gbe: added the process of FIFO over run error
pch_gbe: fixed the issue which receives an unnecessary packet.
sfc: Use 64-bit writes for TX push where possible
Revert "sfc: Use write-combining to reduce TX latency" and follow-ups
bnx2x: Fix ethtool advertisement
bnx2x: Fix 578xx link LED
bnx2x: Fix XMAC loopback test
...
|
|
With the conversion of struct flowi to a union of AF-specific structs, some
operations on the flow cache need to account for the exact size of the key.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
AF-specific flowi structs are now passed to flow_key_compare, which must
also be aligned to a long.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Attempt to reduce the number of IP packets emitted in response to single
SCTP packet (2e3216cd) introduced a complication - if a packet contains
two COOKIE_ECHO chunks and nothing else then SCTP state machine corks the
socket while processing first COOKIE_ECHO and then loses the association
and forgets to uncork the socket. To deal with the issue add new SCTP
command which can be used to set association explictly. Use this new
command when processing second COOKIE_ECHO chunk to restore the context
for SCTP state machine.
Signed-off-by: Max Matveev <makc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
dev_forward_skb loops an skb back into host networking
stack which might hang on the memory indefinitely.
In particular, this can happen in macvtap in bridged mode.
Copy the userspace fragments to avoid blocking the
sender in that case.
As this patch makes skb_copy_ubufs extern now,
I also added some documentation and made it clear
the SKBTX_DEV_ZEROCOPY flag automatically instead
of doing it in all callers. This can be made into a separate
patch if people feel it's worth it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
"Possible SYN flooding on port xxxx " messages can fill logs on servers.
Change logic to log the message only once per listener, and add two new
SNMP counters to track :
TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client
TCPReqQFullDrop : number of times a SYN request was dropped because
syncookies were not enabled.
Based on a prior patch from Tom Herbert, and suggestions from David.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Building a kernel with hotplug disabled results in a link failure:
`bgpio_remove' referenced in section `___ksymtab_gpl+bgpio_remove' of drivers/built-in.o: defined in discarded section `.devexit.text' of drivers/built-in.o
This is because of bgpio_remove() is exported. It is illegal to export
symbols which are discarded either at link time or as part of an
init/exit section.
Fix this by dropping the __devexit attributation from bgpio_remove().
Also drop the __devinit attributation from bgpio_init().
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
memory.vmscan_stat").
The implementation of per-memcg reclaim statistics violates how memcg
hierarchies usually behave: hierarchically.
The reclaim statistics are accounted to child memcgs and the parent
hitting the limit, but not to hierarchy levels in between. Usually,
hierarchical statistics are perfectly recursive, with each level
representing the sum of itself and all its children.
Since this exports statistics to userspace, this may lead to confusion
and problems with changing things after the release, so revert it now,
we can try again later.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix kernel-doc warning about internal/private data by marking it
as "private:" so that kernel-doc will ignore it.
Warning(include/linux/regulator/consumer.h:128): No description found for parameter 'ret'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix kernel-doc warning in net/cfg80211.h:
Warning(include/net/cfg80211.h:1884): No description found for parameter 'registered'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'perf-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86, perf: Check that current->mm is alive before getting user callchain
perf_event: Fix broken calc_timer_values()
perf events: Fix slow and broken cgroup context switch code
|
|
This needs to be an out of band value for the register and on this device
registers are 16 bit so we must shift left one to the 17th bit.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: stable@kernel.org
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
Some of the flags are OS/arch dependent we add a 9p
protocol value which maps to asm-generic/fcntl.h values in Linux
Based on the original patch from Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
Save inode->dirtied_when in the raw trace output for reliable scripting,
and to also show in formatted output the relative age in seconds for
easy human reading.
CC: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
|
Allow transparent sockets to be less restrictive about
the source ip of ipv6 udp packets being sent.
Google-Bug-Id: 5018138
Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: "Erik Kline" <ek@google.com>
CC: "Lorenzo Colitti" <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (42 commits)
netpoll: fix incorrect access to skb data in __netpoll_rx
cassini: init before use in cas_interruptN.
can: ti_hecc: Fix uninitialized spinlock in probe
can: ti_hecc: Fix unintialized variable
net: sh_eth: fix the compile error
net/phy: fix DP83865 phy interrupt handler
sendmmsg/sendmsg: fix unsafe user pointer access
ibmveth: Fix leak when recycling skb and hypervisor returns error
arp: fix rcu lockdep splat in arp_process()
bridge: fix a possible use after free
bridge: Pseudo-header required for the checksum of ICMPv6
mcast: Fix source address selection for multicast listener report
MAINTAINERS: Update GIT trees for network development
ath9k: Fix PS wrappers in ath9k_set_coverage_class
carl9170: Fix mismatch in carl9170_op_set_key mutex lock-unlock
wl12xx: add max_sched_scan_ssids value to the hw description
wl12xx: Fix validation of pm_runtime_get_sync return value
wl12xx: Remove obsolete testmode NVS push command
bcma: add uevent to the bus, to autoload drivers
ath9k_hw: Fix STA (AR9485) bringup issue due to incorrect MAC address
...
|
|
The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:
$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/s
Now start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/s
That's a 37% penalty.
Note that pong is not even in the monitored cgroup.
The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
Performance counter stats for 'sleep 100':
CPU1 <not counted> cycles test
CPU1 16,984,189,138 cycles # 0.000 GHz
The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.
The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.
With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/s
Start perf stat with cgroup:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Run pong outside the cgroup:
$ /pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/s
The penalty is now less than 2%.
And the results for perf stat are correct:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 <not counted> cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
Now perf stat reports the correct counts for
for the non cgroup event.
If we run pong inside the cgroup, then we also get the
correct counts:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
10.001457237 seconds time elapsed
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
The nfsservctl system call is now gone, so we should remove all
linkage for it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6:
omap-serial: Allow IXON and IXOFF to be disabled.
TTY: serial, document ignoring of uart->ops->startup error
TTY: pty, fix pty counting
8250: Fix race condition in serial8250_backup_timeout().
serial/8250_pci: delete duplicate data definition
8250_pci: add support for Rosewill RC-305 4x serial port card
tty: Add "spi:" prefix for spi modalias
atmel_serial: fix atmel_default_console_device
serial: 8250_pnp: add Intermec CV60 touchscreen device
drivers/serial/ucc_uart.c: Fix compiler warning
pch_uart: Set PCIe bus number using probe parameter
serial: samsung: Fix build error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* 'driver-core-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
drivers:misc: ti-st: fix unexpected UART close
drivers:misc: ti-st: free skb on firmware download
drivers:misc: ti-st: wait for completion at fail
drivers:misc: ti-st: reinit completion before send
drivers:misc: ti-st: fail-safe on wrong pkt type
drivers:misc: ti-st: reinit completion on ver read
drivers:misc:ti-st: platform hooks for chip states
drivers:misc: ti-st: avoid a misleading dbg msg
base/devres.c: quiet sparse noise about context imbalance
pti: add missing CONFIG_PCI dependency
drivers/base/devtmpfs.c: correct annotation of `setup_done'
driver core: fix kernel-doc warning in platform.c
firmware: fix google/gsmi.c build warning
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
|
|
We need a callback to do some things after pwm_enable, pwm_disable
and pwm_config.
Signed-off-by: Dilan Lee <dilee@nvidia.com>
Reviewed-by: Robert Morell <rmorell@nvidia.com>
Reviewed-by: Arun Murthy <arun.murthy@stericsson.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Replace/remove use of RIO v.1.2 registers/bits that are not
forward-compatible with newer versions of RapidIO specification.
RapidIO specification v.1.3 removed Write Port CSR, Doorbell CSR,
Mailbox CSR and Mailbox and Doorbell bits of the PEF CAR.
Use of removed (since RIO v.1.3) register bits affects users of
currently available 1.3 and 2.x compliant devices who may use not so
recent kernel versions.
Removing checks for unsupported bits makes corresponding routines
compatible with all versions of RapidIO specification. Therefore,
backporting makes stable kernel versions compliant with RIO v.1.3 and
later as well.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Chul Kim <chul.kim@idt.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Purely in-memory filesystems do not use the inode hash as the dcache
tells us if an entry already exists. As a result, they do not call
unlock_new_inode, and thus directory inodes do not get put into a
different lockdep class for i_sem.
We need the different lockdep classes, because the locking order for
i_mutex is different for directory inodes and regular inodes. Directory
inodes can do "readdir()", which takes i_mutex *before* possibly taking
mm->mmap_sem (due to a page fault while copying the directory entry to
user space).
In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
before accessing i_mutex.
The two cases can never happen for the same inode, so no real deadlock
can occur, but without the different lockdep classes, lockdep cannot
understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
can lead to false positives from lockdep like below:
find/645 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
but task is already holding lock:
(&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
vfs_readdir+0x5b/0xb4
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
[<ffffffff8108ac26>] lock_acquire+0xbf/0x103
[<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
[<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
[<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
[<ffffffff81111557>] mmap_region+0x258/0x432
[<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
[<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
[<ffffffff8100c858>] sys_mmap+0x22/0x24
[<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
-> #0 (&mm->mmap_sem){++++++}:
[<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
[<ffffffff8108ac26>] lock_acquire+0xbf/0x103
[<ffffffff81109541>] might_fault+0x89/0xac
[<ffffffff81149cff>] filldir+0x6f/0xc7
[<ffffffff811586ea>] dcache_readdir+0x67/0x205
[<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
[<ffffffff8114a073>] sys_getdents+0x7e/0xd1
[<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
This patch moves the directory vs file lockdep annotation into a helper
function that can be called by in-memory filesystems and has hugetlbfs
call it.
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback
* 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback:
squeeze max-pause area and drop pass-good area
|