Age | Commit message (Collapse) | Author |
|
warp_clock() currently accesses timekeeping internal state directly, which
is unnecessary. Convert it to use the proper timekeeping interfaces.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Spread p->sighand->siglock locking scope to make sure that
fastpath_timer_check() never iterates over all threads. Without
locking there is small possibility that signal->cputimer will stop
running while we write values to signal->cputime_expires.
Calling thread_group_cputime() from fastpath_timer_check() is not only
bad because it is slow, also it is racy with __exit_signal() which can
lead to invalid signal->{s,u}time values.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
When user sets up a timer without associated signal and process does
not use any other cpu timers and does not exit, tsk->signal->cputimer
is enabled and running forever.
Avoid running the timer for no reason.
I used below program to check patch does not break current user space
visible behavior.
#include <sys/time.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <assert.h>
void consume_cpu(void)
{
int i = 0;
int count = 0;
for(i=0; i<100000000; i++)
count++;
}
int main(void)
{
int i;
struct sigaction act;
struct sigevent evt = { };
timer_t tid;
struct itimerspec spec = { };
evt.sigev_notify = SIGEV_NONE;
assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt, &tid) == 0);
spec.it_value.tv_sec = 10;
assert(timer_settime(tid, 0, &spec, NULL) == 0);
for (i = 0; i < 30; i++) {
consume_cpu();
memset(&spec, 0, sizeof(spec));
assert(timer_gettime(tid, &spec) == 0);
printf("%lu.%09lu\n",
(unsigned long) spec.it_value.tv_sec,
(unsigned long) spec.it_value.tv_nsec);
}
assert(timer_delete(tid) == 0);
return 0;
}
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
According POSIX we need to correctly set old timer it_interval value when
user request that in timer_settime(). Tested using below program.
#include <sys/time.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <assert.h>
int main(void)
{
struct sigaction act;
struct sigevent evt = { };
timer_t tid;
struct itimerspec spec, u_spec, k_spec;
evt.sigev_notify = SIGEV_SIGNAL;
evt.sigev_signo = SIGPROF;
assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt, &tid) == 0);
spec.it_value.tv_sec = 1;
spec.it_value.tv_nsec = 2;
spec.it_interval.tv_sec = 3;
spec.it_interval.tv_nsec = 4;
u_spec = spec;
assert(timer_settime(tid, 0, &spec, NULL) == 0);
spec.it_value.tv_sec = 5;
spec.it_value.tv_nsec = 6;
spec.it_interval.tv_sec = 7;
spec.it_interval.tv_nsec = 8;
assert(timer_settime(tid, 0, &spec, &k_spec) == 0);
#define PRT(val) printf(#val ":\t%d/%d\n", (int) u_spec.val, (int) k_spec.val)
PRT(it_value.tv_sec);
PRT(it_value.tv_nsec);
PRT(it_interval.tv_sec);
PRT(it_interval.tv_nsec);
return 0;
}
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Let always set signal->cputime_expires expiration cache when setting
new itimer, POSIX 1.b timer, and RLIMIT_CPU. Since we are
initializing prof_exp expiration cache during fork(), this allows to
remove "RLIMIT_CPU != inf" check from fastpath_timer_check() and do
some other cleanups.
Checked against regression using test cases from:
http://marc.info/?l=linux-kernel&m=123749066504641&w=4
http://marc.info/?l=linux-kernel&m=123811277916642&w=2
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
When a process deletes cpu timer or a timer expires we do not clear
the expiration cache sig->cputimer_expires.
As a result the fastpath_timer_check() which prevents us to loop over
all threads in case no timer is active is not working and we run the
slow path needlessly on every tick.
Zero sig->cputimer_expires in stop_process_timers().
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Spencer Candland <spencer@bluehost.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
These functions forgot to run timer_stats_timer_clear_start_info(). It's
unobvious what effect this has and whether it matters much - we won't be
printing it out anyway if the timer's detached.
Untested, just an Ingo trollpatch.
[ Nevertheless correct - tglx ]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The current logic which handles clock events programming failures can
increase min_delta_ns unlimited and even can cause overflows.
Sanitize it by:
- prevent zero increase when min_delta_ns == 1
- limiting min_delta_ns to a jiffie
- bail out if the jiffie limit is hit
- add retries stats for /proc/timer_list so we can gather data
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Merge reason: We want to queue up a dependent patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Disabling affine wakeups is too horrible to contemplate. Remove the feature flag.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301890.6785.50.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
This features has been enabled for quite a while, after testing showed that
easing preemption for light tasks was harmful to high priority threads.
Remove the feature flag.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301675.6785.44.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Sync wakeups are critical functionality with a long history. Remove it, we don't
need the branch or icache footprint.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301817.6785.47.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
This feature never earned its keep, remove it.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301591.6785.42.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Our preemption model relies too heavily on sleeper fairness to disable it
without dire consequences. Remove the feature, and save a branch or two.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301520.6785.40.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
This feature hasn't been enabled in a long time, remove effectively dead code.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301447.6785.38.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Don't bother with selection when the current cpu is idle. Recent load
balancing changes also make it no longer necessary to check wake_affine()
success before returning the selected sibling, so we now always use it.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301369.6785.36.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Allow LAST_BUDDY to kick in sooner, improving cache utilization as soon as
a second buddy pair arrives on scene. The cost is latency starting to climb
sooner, the tbenefit for tbench 8 on my Q6600 box is ~2%. No detrimental
effects noted in normal idesktop usage.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301285.6785.34.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Now that we no longer depend on the clock being updated prior to enqueueing
on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock()
exactly where they are needed, ie on enqueue, dequeue and schedule events.
In the case of a freshly enqueued task immediately preempting, we can skip the
update during preemption, as the clock was just updated by the enqueue event.
We also save an unneeded call during a migratory wakeup by not updating the
previous runqueue, where update_curr() won't be invoked.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301199.6785.32.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
was detrimentally affected by cross-cpu wakeups, this because we are missing
the necessary call to update_curr(). This can't be fixed without increasing
overhead in our already too fat fastpath.
Additionally, with recent load balancing changes making us prefer to place tasks
in an idle cache domain (which is good for compute bound loads), communicating
tasks suffer when a sync wakeup, which would enable affine placement, is turned
into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine()
rejects the affine wakeup request, leaving the unfortunate where placed, taking
frequent cache misses.
Remove it, and recover some fastpath cycles.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301121.6785.30.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
outlived it's usefullness. With intervening load balancing changes, I cannot
see any difference with/without, so recover there fastpath cycles.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301062.6785.29.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Entering nohz code on every micro-idle is costing ~10% throughput for netperf
TCP_RR when scheduling cross-cpu. Rate limiting entry fixes this, but raises
ticks a bit. On my Q6600, an idle box goes from ~85 interrupts/sec to 128.
The higher the context switch rate, the more nohz entry costs. With this patch
and some cycle recovery patches in my tree, max cross cpu context switch rate is
improved by ~16%, a large portion of which of which is this ratelimiting.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301003.6785.28.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
This patch is an optimization in perf_event_task_sched_in() to avoid
scheduling the events twice in a row.
Without it, the perf_disable()/perf_enable() pair is invoked twice,
thereby pinned events counts while scheduling flexible events and we go
throuh hw_perf_enable() twice.
By encapsulating, the whole sequence into perf_disable()/perf_enable() we
ensure, hw_perf_enable() is going to be invoked only once because of the
refcount protection.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268288765-5326-1-git-send-email-eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Put all statistic fields of sched_entity in one struct, sched_statistics,
and embed it into sched_entity.
This change allows to memset the sched_statistics to 0 when needed (for
instance when forking), avoiding bugs of non initialized fields.
Signed-off-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268275065-18542-1-git-send-email-lucas.de.marchi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Since pick_next_highest_task_rt() already iterates all the cgroups and
is really only interested in tasks, skip over the !task entries.
Reported-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Dhaval Giani <dhaval.giani@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Export perf_trace_regs and perf_arch_fetch_caller_regs since module will
use these.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
[ use EXPORT_PER_CPU_SYMBOL_GPL() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4B989C1B.2090407@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
execution slot
From : Ananth N Mavinakayanahalli <ananth@in.ibm.com>
When freeing the instruction slot, the arithmetic to calculate
the index of the slot in the page needs to account for the total
size of the instruction on the various architectures.
Calculate the index correctly when freeing the out-of-line
execution slot.
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <4B9667AB.9050507@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
We haven't used the "orig_rq" variable since
055a00865d "Fix/add missing update_rq_clock() calls"
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: efault@gmx.de
LKML-Reference: <20100306111752.GL4958@bicker>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent
|
|
CONFIG_PROVE_RCU imposes additional overhead on the kernel, so
increase the RCU CPU stall timeouts in an attempt to allow for
this effect.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-2-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Replace the calls to read_barrier_depends() in
ftrace_list_func() with rcu_dereference_raw() to improve
readability. The reason that we use rcu_dereference_raw() here
is that removed entries are never freed, instead they are simply
leaked. This is one of a very few cases where use of
rcu_dereference_raw() is the long-term right answer. And I
don't yet know of any others. ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Anton Blanchard found that he could reliably make the kernel hit a
BUG_ON in the slab allocator by taking a cpu offline and then online
while a system-wide perf record session was running.
The reason is that when the cpu comes up, we completely reinitialize
the ctx field of the struct perf_cpu_context for the cpu. If there is
a system-wide perf record session running, then there will be a struct
perf_event that has a reference to the context, so its refcount will
be 2. (The perf_event has been removed from the context's group_entry
and event_entry lists by perf_event_exit_cpu(), but that doesn't
remove the perf_event's reference to the context and doesn't decrement
the context's refcount.)
When the cpu comes up, perf_event_init_cpu() gets called, and it calls
__perf_event_init_context() on the cpu's context. That resets the
refcount to 1. Then when the perf record session finishes and the
perf_event is closed, the refcount gets decremented to 0 and the
context gets kfreed after an RCU grace period. Since the context
wasn't kmalloced -- it's part of a per-cpu variable -- bad things
happen.
In fact we don't need to completely reinitialize the context when the
cpu comes up. It's sufficient to initialize the context once at boot,
but we need to do it for all possible cpus.
This moves the context initialization to happen at boot time. With
this, we don't trash the refcount and the context never gets kfreed,
and we don't hit the BUG_ON.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Anton Blanchard <anton@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Lars-Peter pointed out that the oneshot threaded interrupt handler
code has the following race:
CPU0 CPU1
hande_level_irq(irq X)
mask_ack_irq(irq X)
handle_IRQ_event(irq X)
wake_up(thread_handler)
thread handler(irq X) runs
finalize_oneshot(irq X)
does not unmask due to
!(desc->status & IRQ_MASKED)
return from irq
does not unmask due to
(desc->status & IRQ_ONESHOT)
This leaves the interrupt line masked forever.
The reason for this is the inconsistent handling of the IRQ_MASKED
flag. Instead of setting it in the mask function the oneshot support
sets the flag after waking up the irq thread.
The solution for this is to set/clear the IRQ_MASKED status whenever
we mask/unmask an interrupt line. That's the easy part, but that
cleanup opens another race:
CPU0 CPU1
hande_level_irq(irq)
mask_ack_irq(irq)
handle_IRQ_event(irq)
wake_up(thread_handler)
thread handler(irq) runs
finalize_oneshot_irq(irq)
unmask(irq)
irq triggers again
handle_level_irq(irq)
mask_ack_irq(irq)
return from irq due to IRQ_INPROGRESS
return from irq
does not unmask due to
(desc->status & IRQ_ONESHOT)
This requires that we synchronize finalize_oneshot_irq() with the
primary handler. If IRQ_INPROGESS is set we wait until the primary
handler on the other CPU has returned before unmasking the interrupt
line again.
We probably have never seen that problem because it does not happen on
UP and on SMP the irqbalancer protects us by pinning the primary
handler and the thread to the same CPU.
Reported-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
|
|
Drop the obsolete "profile" naming used by perf for trace events.
Perf can now do more than simple events counting, so generalize
the API naming.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
|
|
We are taking a wrong regs snapshot when a trace event triggers.
Either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs()
which gives us the state before we entered the kernel, assuming
we are lucky enough to be no kernel thread, in which case
task_pt_regs() returns the initial set of regs when the kernel
thread was started.
What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.
Let's use the new perf_fetch_caller_regs() for that.
Comparison with perf record -e lock: -R -a -f -g
Before:
perf [kernel] [k] __do_softirq
|
--- __do_softirq
|
|--55.16%-- __open
|
--44.84%-- __write_nocancel
After:
perf [kernel] [k] perf_tp_event
|
--- perf_tp_event
|
|--41.07%-- lock_acquire
| |
| |--39.36%-- _raw_spin_lock
| | |
| | |--7.81%-- hrtimer_interrupt
| | | smp_apic_timer_interrupt
| | | apic_timer_interrupt
The old case was producing unreliable callchains. Now having
right frame and instruction pointers, we have the trace we
want.
Also syscalls and kprobe events already have the right regs,
let's use them instead of wasting a retrieval.
v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs()
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
|
|
Events that trigger overflows by interrupting a context can
use get_irq_regs() or task_pt_regs() to retrieve the state
when the event triggered. But this is not the case for some
other class of events like trace events as tracepoints are
executed in the same context than the code that triggered
the event.
It means we need a different api to capture the regs there,
namely we need a hot snapshot to get the most important
informations for perf: the instruction pointer to get the
event origin, the frame pointer for the callchain, the code
segment for user_mode() tests (we always use __KERNEL_CS as
trace events always occur from the kernel) and the eflags
for further purposes.
v2: rename perf_save_regs to perf_fetch_caller_regs as per
Masami's suggestion.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
|
|
There are rcu locked read side areas in the path where we submit
a trace event. And these rcu_read_(un)lock() trigger lock events,
which create recursive events.
One pair in do_perf_sw_event:
__lock_acquire
|
|--96.11%-- lock_acquire
| |
| |--27.21%-- do_perf_sw_event
| | perf_tp_event
| | |
| | |--49.62%-- ftrace_profile_lock_release
| | | lock_release
| | | |
| | | |--33.85%-- _raw_spin_unlock
Another pair in perf_output_begin/end:
__lock_acquire
|--23.40%-- perf_output_begin
| | __perf_event_overflow
| | perf_swevent_overflow
| | perf_swevent_add
| | perf_swevent_ctx_event
| | do_perf_sw_event
| | perf_tp_event
| | |
| | |--55.37%-- ftrace_profile_lock_acquire
| | | lock_acquire
| | | |
| | | |--37.31%-- _raw_spin_lock
The problem is not that much the trace recursion itself, as we have a
recursion protection already (though it's always wasteful to recurse).
But the trace events are outside the lockdep recursion protection, then
each lockdep event triggers a lock trace, which will trigger two
other lockdep events. Here the recursive lock trace event won't
be taken because of the trace recursion, so the recursion stops there
but lockdep will still analyse these new events:
To sum up, for each lockdep events we have:
lock_*()
|
trace lock_acquire
|
----- rcu_read_lock()
| |
| lock_acquire()
| |
| trace_lock_acquire() (stopped)
| |
| lockdep analyze
|
----- rcu_read_unlock()
|
lock_release
|
trace_lock_release() (stopped)
|
lockdep analyze
And you can repeat the above two times as we have two rcu read side
sections when we submit an event.
This is fixed in this patch by moving the lock trace event under
the lockdep recursion protection.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
|
|
Try to avoid useless rotation and PMU disables.
[ Could be improved by keeping a nr_runnable count to better account
for the < PERF_STAT_INACTIVE counters ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: paulus@samba.org
Cc: eranian@google.com
Cc: robert.richter@amd.com
Cc: fweisbec@gmail.com
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Currently we always call hw_perf_disable(), even if its already disabled,
this seems superflous, esp. since it cannot be made NMI safe (see further
patches).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: eranian@google.com
Cc: robert.richter@amd.com
Cc: fweisbec@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Remove the hw_perf_event_*() hotplug hooks in favour of per PMU hotplug
notifiers. This has the advantage of reducing the static weak interface
as well as exposing all hotplug actions to the PMU.
Use this to fix x86 hotplug usage where we did things in ONLINE which
should have been done in UP_PREPARE or STARTING.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: paulus@samba.org
Cc: eranian@google.com
Cc: robert.richter@amd.com
Cc: fweisbec@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
LKML-Reference: <20100305154128.736225361@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
This makes it easier to extend perf_sample_data and fixes a bug on arm
and sparc, which failed to set ->raw to NULL, which can cause crashes
when combined with PERF_SAMPLE_RAW.
It also optimizes PowerPC and tracepoint, because the struct
initialization is forced to zero out the whole structure.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jean Pihet <jpihet@mvista.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Jamie Iles <jamie.iles@picochip.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: stable@kernel.org
LKML-Reference: <20100304140100.315416040@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Conflicts:
tools/perf/util/probe-event.c
Merge reason: Pick up -rc1 and resolve the conflict as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Conflicts:
Documentation/filesystems/proc.txt
arch/arm/mach-u300/include/mach/debug-macro.S
drivers/net/qlge/qlge_ethtool.c
drivers/net/qlge/qlge_main.c
drivers/net/typhoon.c
|
|
A little more whack-a-mole annotating the dynamic sysfs attributes. I
had everything built into my earlier test kernel, and so I missed
these.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
These are the non-static sysfs attributes that exist on
my test machine. Fix them to use sysfs_attr_init or
sysfs_bin_attr_init as appropriate. It simply requires
making a sysfs attribute present to see this. So this
is a little bit tedious but otherwise not too bad.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Constify struct sysfs_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Constify struct kset_uevent_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Passing the attribute to the low level IO functions allows all kinds
of cleanups, by sharing low level IO code without requiring
an own function for every piece of data.
Also drivers can extend the attributes with own data fields
and use that in the low level function.
Similar to sysdev_attributes and normal attributes.
This is a tree-wide sweep, converting everything in one go.
No functional changes in this patch other than passing the new
argument everywhere.
Tested on x86, the non x86 parts are uncompiled.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
The current ELF dumper implementation can produce broken corefiles if
program headers exceed 65535. This number is determined by the number of
vmas which the process have. In particular, some extreme programs may use
more than 65535 vmas. (If you google max_map_count, you can find some
users facing this problem.) This kind of program never be able to generate
correct coredumps.
This patch implements ``extended numbering'' that uses sh_info field of
the first section header instead of e_phnum field in order to represent
upto 4294967295 vmas.
This is supported by
AMD64-ABI(http://www.x86-64.org/documentation.html) and
Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
Of course, we are preparing patches for gdb and binutils.
Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
macro for hiding _multiline_ logics in functions. This patch removes
#ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For
architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in
order to reduce a range of modification.
This cleanup is for my next patches, but I think this cleanup itself is
worth doing regardless of my firnal purpose.
Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|