summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2010-12-08nohz: Fix get_next_timer_interrupt() vs cpu hotplugHeiko Carstens
This fixes a bug as seen on 2.6.32 based kernels where timers got enqueued on offline cpus. If a cpu goes offline it might still have pending timers. These will be migrated during CPU_DEAD handling after the cpu is offline. However while the cpu is going offline it will schedule the idle task which will then call tick_nohz_stop_sched_tick(). That function in turn will call get_next_timer_intterupt() to figure out if the tick of the cpu can be stopped or not. If it turns out that the next tick is just one jiffy off (delta_jiffies == 1) tick_nohz_stop_sched_tick() incorrectly assumes that the tick should not stop and takes an early exit and thus it won't update the load balancer cpu. Just afterwards the cpu will be killed and the load balancer cpu could be the offline cpu. On 2.6.32 based kernel get_nohz_load_balancer() gets called to decide on which cpu a timer should be enqueued (see __mod_timer()). Which leads to the possibility that timers get enqueued on an offline cpu. These will never expire and can cause a system hang. This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses get_nohz_timer_target() which doesn't have that problem. However there might be other problems because of the too early exit tick_nohz_stop_sched_tick() in case a cpu goes offline. The easiest and probably safest fix seems to be to let get_next_timer_interrupt() just lie and let it say there isn't any pending timer if the current cpu is offline. I also thought of moving migrate_[hr]timers() from CPU_DEAD to CPU_DYING, but seeing that there already have been fixes at least in the hrtimer code in this area I'm afraid that this could add new subtle bugs. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08Sched: fix skip_clock_update optimizationMike Galbraith
idle_balance() drops/retakes rq->lock, leaving the previous task vulnerable to set_tsk_need_resched(). Clear it after we return from balancing instead, and in setup_thread_stack() as well, so no successfully descheduled or never scheduled task has it set. Need resched confused the skip_clock_update logic, which assumes that the next call to update_rq_clock() will come nearly immediately after being set. Make the optimization robust against the waking a sleeper before it sucessfully deschedules case by checking that the current task has not been dequeued before setting the flag, since it is that useless clock update we're trying to save, and clear unconditionally in schedule() proper instead of conditionally in put_prev_task(). Signed-off-by: Mike Galbraith <efault@gmx.de> Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org LKML-Reference: <1291802742.1417.9.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08sched: Cure more NO_HZ load average woesPeter Zijlstra
There's a long-running regression that proved difficult to fix and which is hitting certain people and is rather annoying in its effects. Damien reported that after 74f5187ac8 (sched: Cure load average vs NO_HZ woes) his load average is unnaturally high, he also noted that even with that patch reverted the load avgerage numbers are not correct. The problem is that the previous patch only solved half the NO_HZ problem, it addressed the part of going into NO_HZ mode, not of comming out of NO_HZ mode. This patch implements that missing half. When comming out of NO_HZ mode there are two important things to take care of: - Folding the pending idle delta into the global active count. - Correctly aging the averages for the idle-duration. So with this patch the NO_HZ interaction should be complete and behaviour between CONFIG_NO_HZ=[yn] should be equivalent. Furthermore, this patch slightly changes the load average computation by adding a rounding term to the fixed point multiplication. Reported-by: Damien Wyart <damien.wyart@free.fr> Reported-by: Tim McGrath <tmhikaru@gmail.com> Tested-by: Damien Wyart <damien.wyart@free.fr> Tested-by: Orion Poplawski <orion@cora.nwra.com> Tested-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Cc: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1291129145.32004.874.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08perf: Fix duplicate events with multiple-pmu vs software eventsPeter Zijlstra
Because the multi-pmu bits can share contexts between struct pmu instances we could get duplicate events by iterating the pmu list. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08Merge branches 'x86-fixes-for-linus', 'perf-fixes-for-linus' and ↵Linus Torvalds
'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86/pvclock: Zero last_value on resume * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf record: Fix eternal wait for stillborn child perf header: Don't assume there's no attr info if no sample ids is provided perf symbols: Figure out start address of kernel map from kallsyms perf symbols: Fix kallsyms kernel/module map splitting * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: nohz: Fix printk_needs_cpu() return value on offline cpus printk: Fix wake_up_klogd() vs cpu hotplug
2010-12-07Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix incorrect proc spurious output
2010-12-07Merge branch 'perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core
2010-12-07Merge commit 'v2.6.37-rc5' into perf/coreIngo Molnar
Merge reason: Pick up the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06Merge branch 'pm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix memory corruption related to swap PM / Hibernate: Use async I/O when reading compressed hibernation image
2010-12-06PM / Hibernate: Fix memory corruption related to swapRafael J. Wysocki
There is a problem that swap pages allocated before the creation of a hibernation image can be released and used for storing the contents of different memory pages while the image is being saved. Since the kernel stored in the image doesn't know of that, it causes memory corruption to occur after resume from hibernation, especially on systems with relatively small RAM that need to swap often. This issue can be addressed by keeping the GFP_IOFS bits clear in gfp_allowed_mask during the entire hibernation, including the saving of the image, until the system is finally turned off or the hibernation is aborted. Unfortunately, for this purpose it's necessary to rework the way in which the hibernate and suspend code manipulates gfp_allowed_mask. This change is based on an earlier patch from Hugh Dickins. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondrej Zary <linux@rainbow-software.org> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: stable@kernel.org
2010-12-06PM / Hibernate: Use async I/O when reading compressed hibernation imageBojan Smojver
This is a fix for reading LZO compressed image using async I/O. Essentially, instead of having just one page into which we keep reading blocks from swap, we allocate enough of them to cover the largest compressed size and then let block I/O pick them all up. Once we have them all (and here we wait), we decompress them, as usual. Obviously, the very first block we still pick up synchronously, because we need to know the size of the lot before we pick up the rest. Also fixed the copyright line, which I've forgotten before. Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-12-06kprobes: Use text_poke_smp_batch for unoptimizingMasami Hiramatsu
Use text_poke_smp_batch() on unoptimization path for reducing the number of stop_machine() issues. If the number of unoptimizing probes is more than MAX_OPTIMIZE_PROBES(=256), kprobes unoptimizes first MAX_OPTIMIZE_PROBES probes and kicks optimizer for remaining probes. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20101203095434.2961.22657.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06kprobes: Use text_poke_smp_batch for optimizingMasami Hiramatsu
Use text_poke_smp_batch() in optimization path for reducing the number of stop_machine() issues. If the number of optimizing probes is more than MAX_OPTIMIZE_PROBES(=256), kprobes optimizes first MAX_OPTIMIZE_PROBES probes and kicks optimizer for remaining probes. Changes in v5: - Use kick_kprobe_optimizer() instead of directly calling schedule_delayed_work(). - Rescheduling optimizer outside of kprobe mutex lock. Changes in v2: - Allocate code buffer and parameters in arch_init_kprobes() instead of using static arraies. - Merge previous max optimization limit patch into this patch. So, this patch introduces upper limit of optimization at once. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20101203095428.2961.8994.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06kprobes: Reuse unused kprobeMasami Hiramatsu
Reuse unused (waiting for unoptimizing and no user handler) kprobe on given address instead of returning -EBUSY for registering a new kprobe. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20101203095416.2961.39080.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06kprobes: Support delayed unoptimizingMasami Hiramatsu
Unoptimization occurs when a probe is unregistered or disabled, and is heavy because it recovers instructions by using stop_machine(). This patch delays unoptimization operations and unoptimize several probes at once by using text_poke_smp_batch(). This can avoid unexpected system slowdown coming from stop_machine(). Changes in v5: - Split this patch into several cleanup patches and this patch. - Fix some text_mutex lock miss. - Use bool instead of int for behavior flags. - Add additional comment for (un)optimizing path. Changes in v2: - Use dynamic allocated buffers and params. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20101203095409.2961.82733.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06kprobes: Separate kprobe optimizing code from optimizerMasami Hiramatsu
Separate kprobe optimizing code from optimizer, this will make easy to introducing unoptimizing code in optimizer. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20101203095403.2961.91201.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06kprobes: Cleanup disabling and unregistering pathMasami Hiramatsu
Merge disabling kprobe to unregistering kprobe function and add comments for disabing/unregistring process. Current unregistering code disables(disarms) kprobes after checking target kprobe status. This patch changes it to disabling kprobe first after that it changing the kprobe's state. This allows to share probe disabling code between disable_kprobe() and unregister_kprobe(). Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20101203095356.2961.30152.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-06kprobes: Rename old_p to more appropriate nameMasami Hiramatsu
Rename irrelevant uses of "old_p" to more appropriate names. Originally, "old_p" just meant "the old kprobe on given address" but current code uses that name as "just another kprobe" or something like that. This patch renames those pointer names to more appropriate one for maintainability. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20101203095350.2961.48110.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-04perf events: Make sample_type identity fields available in all PERF_RECORD_ ↵Arnaldo Carvalho de Melo
events If perf_event_attr.sample_id_all is set it will add the PERF_SAMPLE_ identity info: TID, TIME, ID, CPU, STREAM_ID As a trailer, so that older perf tools can process new files, just ignoring the extra payload. With this its possible to do further analysis on problems in the event stream, like detecting reordering of MMAP and FORK events, etc. V2: Fixup header size in comm, mmap and task processing, as we have to take into account different sample_types for each matching event, noticed by Thomas Gleixner. Thomas also noticed a problem in v2 where if we didn't had space in the buffer we wouldn't restore the header size. Tested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Ian Munsie <imunsie@au1.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2010-12-04perf events: Separate the routines handling the PERF_SAMPLE_ identity fieldsArnaldo Carvalho de Melo
Those will be made available in sample like events like MMAP, EXEC, etc in a followup patch. So precalculate the extra id header space and have a separate routine to fill them up. V2: Thomas noticed that the id header needs to be precalculated at inherit_events too: LKML-Reference: <alpine.LFD.2.00.1012031245220.2653@localhost6.localdomain6> Tested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Ian Munsie <imunsie@au1.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1291318772-30880-2-git-send-email-acme@infradead.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2010-12-04perf events: Fix event inherit fallout of precalculated headersThomas Gleixner
The precalculated header size is not updated when an event is inherited. That results in bogus sample entries for all child events. Bug introduced in c320c7b. Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ian Munsie <imunsie@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> LKML-Reference: <alpine.LFD.2.00.1012031245220.2653@localhost6.localdomain6> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2010-12-02do_exit(): make sure that we run with get_fs() == USER_DSNelson Elhage
If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c) do a put_user to a user-controlled address, potentially allowing a user to leverage an oops into a controlled write into kernel memory. This is only triggerable in the presence of another bug, but this potentially turns a lot of DoS bugs into privilege escalations, so it's worth fixing. I have proof-of-concept code which uses this bug along with CVE-2010-3849 to write a zero to an arbitrary kernel address, so I've tested that this is not theoretical. A more logical place to put this fix might be when we know an oops has occurred, before we call do_exit(), but that would involve changing every architecture, in multiple places. Let's just stick it in do_exit instead. [akpm@linux-foundation.org: update code comment] Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-01genirq: Fix incorrect proc spurious outputKenji Kaneshige
Since commit a1afb637(switch /proc/irq/*/spurious to seq_file) all /proc/irq/XX/spurious files show the information of irq 0. Current irq_spurious_proc_open() passes on NULL as the 3rd argument, which is used as an IRQ number in irq_spurious_proc_show(), to the single_open(). Because of this, all the /proc/irq/XX/spurious file shows IRQ 0 information regardless of the IRQ number. To fix the problem, irq_spurious_proc_open() must pass on the appropreate data (IRQ number) to single_open(). Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <4CF4B778.90604@jp.fujitsu.com> Cc: stable@kernel.org [2.6.33+] Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-11-30perf events: Precalculate the header space for PERF_SAMPLE_ fieldsArnaldo Carvalho de Melo
PERF_SAMPLE_{CALLCHAIN,RAW} have variable lenghts per sample, but the others can be precalculated, reducing a bit the per sample cost. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Ian Munsie <imunsie@au1.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> LKML-Reference: <new-submission> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2010-11-30tracing: Fix panic when lseek() called on "trace" opened for writingSlava Pestov
The file_ops struct for the "trace" special file defined llseek as seq_lseek(). However, if the file was opened for writing only, seq_open() was not called, and the seek would dereference a null pointer, file->private_data. This patch introduces a new wrapper for seq_lseek() which checks if the file descriptor is opened for reading first. If not, it does nothing. Cc: <stable@kernel.org> Signed-off-by: Slava Pestov <slavapestov@google.com> LKML-Reference: <1290640396-24179-1-git-send-email-slavapestov@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-11-30sched: Add 'autogroup' scheduling feature: automated per session task groupsMike Galbraith
A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem: cat /proc/<pid>/autogroup Displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via: echo [01] > /proc/sys/kernel/sched_autogroup_enabled ... which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30sched: Fix unregister_fair_sched_group()Paul Turner
In the flipping and flopping between calling unregister_fair_sched_group() on a per-cpu versus per-group basis we ended up in a bad state. Remove from the list for the passed cpu as opposed to some arbitrary index. ( This fixes explosions w/ autogroup as well as a group creation/destruction stress test. ) Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20101130005740.080828123@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-29rcu: Make synchronize_srcu_expedited() fast if running readersPaul E. McKenney
The synchronize_srcu_expedited() function is currently quick if there are no active readers, but will delay a full jiffy if there are any. If these readers leave their SRCU read-side critical sections quickly, this is way too long to wait. So this commit first waits ten microseconds, and only then falls back to jiffy-at-a-time waiting. Reported-by: Avi Kivity <avi@redhat.com> Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Tested-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: fix race condition in synchronize_sched_expedited()Paul E. McKenney
The new (early 2010) implementation of synchronize_sched_expedited() uses try_stop_cpu() to force a context switch on every CPU. It also permits concurrent calls to synchronize_sched_expedited() to share a single call to try_stop_cpu() through use of an atomically incremented synchronize_sched_expedited_count variable. Unfortunately, this is subject to failure as follows: o Task A invokes synchronize_sched_expedited(), try_stop_cpus() succeeds, but Task A is preempted before getting to the atomic increment of synchronize_sched_expedited_count. o Task B also invokes synchronize_sched_expedited(), with exactly the same outcome as Task A. o Task C also invokes synchronize_sched_expedited(), again with exactly the same outcome as Tasks A and B. o Task D also invokes synchronize_sched_expedited(), but only gets as far as acquiring the mutex within try_stop_cpus() before being preempted, interrupted, or otherwise delayed. o Task E also invokes synchronize_sched_expedited(), but only gets to the snapshotting of synchronize_sched_expedited_count. o Tasks A, B, and C all increment synchronize_sched_expedited_count. o Task E fails to get the mutex, so checks the new value of synchronize_sched_expedited_count. It finds that the value has increased, so (wrongly) assumes that its work has been done, returning despite there having been no expedited grace period since it began. The solution is to have the lowest-numbered CPU atomically increment the synchronize_sched_expedited_count variable within the synchronize_sched_expedited_cpu_stop() function, which is under the protection of the mutex acquired by try_stop_cpus(). However, this also requires that piggybacking tasks wait for three rather than two instances of try_stop_cpu(), because we cannot control the order in which the per-CPU callback function occur. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: update documentation/comments for Lai's adoption patchPaul E. McKenney
Lai's RCU-callback immediate-adoption patch changes the RCU tracing output, so update tracing.txt. Also update a few comments to clarify the synchronization design. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu,cleanup: simplify the code when cpu is dyingLai Jiangshan
When we handle the CPU_DYING notifier, the whole system is stopped except for the current CPU. We therefore need no synchronization with the other CPUs. This allows us to move any orphaned RCU callbacks directly to the list of any online CPU without needing to run them through the global orphan lists. These global orphan lists can therefore be dispensed with. This commit makes thes changes, though currently victimizes CPU 0 @@@. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu,cleanup: move synchronize_sched_expedited() out of sched.cLai Jiangshan
The first version of synchronize_sched_expedited() used the migration code in the scheduler, and was therefore implemented in kernel/sched.c. However, the more recent version of this code no longer uses the migration code, so this commit moves it to the main RCU source files. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: get rid of obsolete "classic" names in TREE_RCU tracingPaul E. McKenney
The TREE_RCU tracing had obsolete rcuclassic_trace_init() and rcuclassic_trace_cleanup() function names. This commit brings them up to date: rcutree_trace_init() and rcutree_trace_cleanup(), respectively. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: Distinguish between boosting and boostedPaul E. McKenney
RCU priority boosting's tracing did not distinguish between ongoing boosting and completion of boosting. This commit therefore adds this capability. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCUPaul E. McKenney
Add tracing for the tiny RCU implementations, including statistics on boosting in the case of TINY_PREEMPT_RCU and RCU_BOOST. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: priority boosting for TINY_PREEMPT_RCUPaul E. McKenney
Add priority boosting, but only for TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST kernel parameter. The priority to which to boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter (defaulting to real-time priority 1) and the time to wait before boosting the readers blocking a given grace period is controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-28Kill off a bunch of warning: ‘inline’ is not at beginning of declarationJesper Juhl
These warnings are spewed during a build of a 'allnoconfig' kernel (especially the ones from u64_stats_sync.h show up a lot) when building with -Wextra (which I often do).. They are a) annoying b) easy to get rid of. This patch kills them off. include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-29security: Define CAP_SYSLOGSerge E. Hallyn
Privileged syslog operations currently require CAP_SYS_ADMIN. Split this off into a new CAP_SYSLOG privilege which we can sanely take away from a container through the capability bounding set. With this patch, an lxc container can be prevented from messing with the host's syslog (i.e. dmesg -c). Changelog: mar 12 2010: add selinux capability2:cap_syslog perm Changelog: nov 22 2010: . port to new kernel . add a WARN_ONCE if userspace isn't using CAP_SYSLOG Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-By: Kees Cook <kees.cook@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: "Christopher J. PeBenito" <cpebenito@tresys.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: James Morris <jmorris@namei.org>
2010-11-28Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix the software context switch counter perf, x86: Fixup Kconfig deps x86, perf, nmi: Disable perf if counters are not accessible perf: Fix inherit vs. context rotation bug
2010-11-27Merge branch 'cleanup-bd_claim' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core
2010-11-27Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call
2010-11-27Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf symbols: Remove incorrect open-coded container_of() perf record: Handle restrictive permissions in /proc/{kallsyms,modules} x86/kprobes: Prevent kprobes to probe on save_args() irq_work: Drop cmpxchg() result perf: Fix owner-list vs exit x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG tracing: Fix recursive user stack trace perf,hw_breakpoint: Initialize hardware api earlier x86: Ignore trap bits on single step exceptions tracing: Force arch_local_irq_* notrace for paravirt tracing: Fix module use of trace_bprintk()
2010-11-27Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix idle balancing sched: Fix volanomark performance regression
2010-11-26perf, arch: Cleanup perf-pmu init vs lockup-detectorPeter Zijlstra
The perf hardware pmu got initialized at various points in the boot, some before early_initcall() some after (notably arch_initcall). The problem is that the NMI lockup detector is ran from early_initcall() and expects the hardware pmu to be present. Sanitize this by moving all architecture hardware pmu implementations to initialize at early_initcall() and move the lockup detector to an explicit initcall right after that. Cc: paulus <paulus@samba.org> Cc: davem <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290707759.2145.119.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Ignore non-sampling overflowsPeter Zijlstra
Some arch implementations call perf_event_overflow() by 'accident', ignore this. Reported-by: Francis Moreau <francis.moro@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Don't bother to init the hrtimer for no SW sampling countersFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290525705-6265-3-git-send-email-fbuihuu@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Limit event refresh to sampling eventFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290525705-6265-2-git-send-email-fbuihuu@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Introduce is_sampling_event()Franck Bui-Huu
and use it when appropriate. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290525705-6265-1-git-send-email-fbuihuu@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: arch/x86/kernel/apic/hw_nmi.c Merge reason: Resolve conflict, queue up dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26sched: Remove unused argument dest_cpu to migrate_task()Nikanth Karthikesan
Remove unused argument, 'dest_cpu' of migrate_task(), and pass runqueue, as it is always known at the call site. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <201011261237.09187.knikanth@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>