path: root/kernel
2011-12-21  ftrace: Fix unregister ftrace_ops accounting  (Jiri Olsa)
Multiple users of the function tracer can register their functions with the ftrace_ops structure. The accounting within ftrace will update the counter on each function record that is being traced. When the ftrace_ops filtering adds or removes functions, the function records will be updated accordingly if the ftrace_ops is still registered.

When a ftrace_ops is removed, the counters of the function records that the ftrace_ops traces are decremented. When they reach zero, the functions that they represent are modified to stop calling the mcount code.

When changes are made, the code is updated via stop_machine() with a command passed to the function to tell it what to do. There is an ENABLE and a DISABLE command that tells the called function to enable or disable the functions. But ENABLE is really a misnomer, as it should just update the records: records that have been enabled and now have a count of zero should be disabled. The DISABLE command is used to disable all functions regardless of their counter values. This is the big off switch and is not the complement of the ENABLE command.

To make matters worse, when a ftrace_ops is unregistered while another ftrace_ops is still registered, neither the DISABLE nor the ENABLE command is set when calling into the stop_machine() function, so the records will not be updated to match their counters. A command is passed to that function that will update the mcount code to call the registered callback directly if it is the only one left. This means that the ftrace_ops that is still registered will have its callback called by all functions that have been set for it as well as by the ftrace_ops that was just unregistered.

Here's a way to trigger this bug. Compile the kernel with CONFIG_FUNCTION_PROFILER set and with CONFIG_FUNCTION_GRAPH not set:

  CONFIG_FUNCTION_PROFILER=y
  # CONFIG_FUNCTION_GRAPH is not set

This will force the function profiler to use the function tracer instead of the function graph tracer.

  # cd /sys/kernel/debug/tracing
  # echo schedule > set_ftrace_filter
  # echo function > current_tracer
  # cat set_ftrace_filter
  schedule
  # cat trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 692/68108025   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
       kworker/0:2-909   [000] ....   531.235574: schedule <-worker_thread
            <idle>-0     [001] .N..   531.235575: schedule <-cpu_idle
       kworker/0:2-909   [000] ....   531.235597: schedule <-worker_thread
              sshd-2563  [001] ....   531.235647: schedule <-schedule_hrtimeout_range_clock

  # echo 1 > function_profile_enabled
  # echo 0 > function_profile_enabled
  # cat set_ftrace_filter
  schedule
  # cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 159701/118821262   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            <idle>-0     [002] ...1   604.870655: local_touch_nmi <-cpu_idle
            <idle>-0     [002] d..1   604.870655: enter_idle <-cpu_idle
            <idle>-0     [002] d..1   604.870656: atomic_notifier_call_chain <-enter_idle
            <idle>-0     [002] d..1   604.870656: __atomic_notifier_call_chain <-atomic_notifier_call_chain

The filter still lists only schedule, yet every function is now being traced: the functions enabled for the unregistered profiler were left enabled and now call the still-registered function tracer's callback.

The same problem could have happened with the trace_probe_ops, but they are modified with the set_ftrace_filter file, which does the update at the closure of the file.

The simple solution is to change ENABLE to UPDATE and call it every time an ftrace_ops is unregistered.
Link: http://lkml.kernel.org/r/1323105776-26961-3-git-send-email-jolsa@redhat.com
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
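[Editor's note: the record-counter bookkeeping described above can be modeled in ordinary user space. The sketch below is not the kernel's ftrace code; struct func_record, register_ops(), unregister_ops() and update_records() are invented names. It only illustrates why an unregister path that skips the update pass leaves zero-count records still enabled, which is the accounting bug fixed here.]

  /* Illustrative user-space model of ftrace-style record accounting. */
  #include <stdio.h>
  #include <stdbool.h>

  struct func_record {
          const char *name;
          int count;      /* how many registered ops trace this function */
          bool enabled;   /* is the mcount call site patched in?         */
  };

  static struct func_record records[] = {
          { "schedule", 0, false },
          { "do_exit",  0, false },
  };
  #define NR_RECORDS (sizeof(records) / sizeof(records[0]))

  /* The "UPDATE" pass: patch each call site to match its counter. */
  static void update_records(void)
  {
          for (unsigned int i = 0; i < NR_RECORDS; i++)
                  records[i].enabled = records[i].count > 0;
  }

  static void register_ops(const bool *filter)
  {
          for (unsigned int i = 0; i < NR_RECORDS; i++)
                  if (filter[i])
                          records[i].count++;
          update_records();
  }

  static void unregister_ops(const bool *filter, bool run_update)
  {
          for (unsigned int i = 0; i < NR_RECORDS; i++)
                  if (filter[i])
                          records[i].count--;
          if (run_update)         /* the fix: always run the update pass */
                  update_records();
  }

  int main(void)
  {
          bool profiler[] = { true, false };  /* traces schedule only */
          bool tracer[]   = { false, true };  /* traces do_exit only  */

          register_ops(profiler);
          register_ops(tracer);
          unregister_ops(profiler, false);    /* buggy path: no update pass */

          for (unsigned int i = 0; i < NR_RECORDS; i++)
                  printf("%-10s count=%d enabled=%d\n",
                         records[i].name, records[i].count, records[i].enabled);
          return 0;
  }

With run_update set to false, schedule ends up with count=0 but enabled=1, i.e. its call site still fires and now lands in the remaining ops' callback; passing true (the UPDATE behaviour) brings the records back in line with their counters.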
2011-12-21  sched: Fix cgroup movement of waking process  (Daisuke Nishimura)
There is a small race between try_to_wake_up() and sched_move_task(), which is trying to move the process being woken up.

    try_to_wake_up() on CPU0          sched_move_task() on CPU1
    ------------------------------+---------------------------------
    raw_spin_lock_irqsave(p->pi_lock)
    task_waking_fair()
      ->p.se.vruntime -= cfs_rq->min_vruntime
    ttwu_queue()
      ->send reschedule IPI to CPU1
    raw_spin_unlock_irqsave(p->pi_lock)
                                      task_rq_lock()
                                        -> trying to acquire both p->pi_lock
                                           and rq->lock with IRQ disabled
                                      task_move_group_fair()
                                        -> p.se.vruntime
                                             -= (old)cfs_rq->min_vruntime
                                             += (new)cfs_rq->min_vruntime
                                      task_rq_unlock()

                                      (via IPI)
                                      sched_ttwu_pending()
                                        raw_spin_lock(rq->lock)
                                        ttwu_do_activate()
                                          ...
                                          enqueue_entity()
                                            child.se->vruntime += cfs_rq->min_vruntime
                                        raw_spin_unlock(rq->lock)

As a result, the vruntime of the process becomes far bigger than min_vruntime, if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes the problem by simply ignoring such a process in task_move_group_fair(), because its vruntime has already been normalized in task_waking_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
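[Editor's note: the arithmetic behind this and the two following cgroup-movement fixes can be shown with a tiny stand-alone model. The sketch below is not scheduler code; the values are plain doubles and the numbers are invented. It shows how re-normalizing a vruntime that task_waking_fair() (or task_fork_fair()) has already made relative inflates it by roughly the destination group's min_vruntime.]

  /* User-space model of the vruntime renormalization race; illustrative only. */
  #include <stdio.h>

  int main(void)
  {
          double old_min_vruntime = 1000.0;    /* source cfs_rq->min_vruntime      */
          double new_min_vruntime = 900000.0;  /* destination cfs_rq, much larger  */

          /* Task is waking: task_waking_fair() already made vruntime relative. */
          double vruntime = 1005.0;            /* absolute value on the old rq  */
          vruntime -= old_min_vruntime;        /* now relative: 5.0             */

          /* Racing cgroup move wrongly "re-normalizes" the already-relative value. */
          double buggy = vruntime - old_min_vruntime + new_min_vruntime;

          /* Enqueue on the destination rq adds its min_vruntime back. */
          printf("expected vruntime after enqueue: %.1f\n",
                 vruntime + new_min_vruntime);   /* ~900005                       */
          printf("buggy vruntime after enqueue:    %.1f\n",
                 buggy + new_min_vruntime);      /* ~1799005, far ahead of peers  */
          return 0;
  }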
2011-12-21  sched: Fix cgroup movement of newly created process  (Daisuke Nishimura)
There is a small race between do_fork() and sched_move_task(), which is trying to move the child.

    do_fork()                         sched_move_task()
    ------------------------------+---------------------------------
    copy_process()
      sched_fork()
        task_fork_fair()
          -> vruntime of the child is initialized
             based on that of the parent.
    -> we can see the child in the "tasks" file now.
                                      task_rq_lock()
                                      task_move_group_fair()
                                        -> child.se.vruntime
                                             -= (old)cfs_rq->min_vruntime
                                             += (new)cfs_rq->min_vruntime
                                      task_rq_unlock()
    wake_up_new_task()
      ...
      enqueue_entity()
        child.se.vruntime += cfs_rq->min_vruntime

As a result, the vruntime of the child becomes far bigger than min_vruntime, if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes the problem by simply ignoring such a process in task_move_group_fair(), because its vruntime has already been normalized in task_fork_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21  sched: Fix cgroup movement of forking process  (Daisuke Nishimura)
There is a small race between task_fork_fair() and sched_move_task(), which is trying to move the parent.

    task_fork_fair()                  sched_move_task()
    ------------------------------+---------------------------------
    cfs_rq = task_cfs_rq(current)
      -> cfs_rq is the "old" one.
    curr = cfs_rq->curr
      -> curr is set to the parent.
                                      task_rq_lock()
                                      dequeue_task()
                                        ->parent.se.vruntime -=
                                              (old)cfs_rq->min_vruntime
                                      enqueue_task()
                                        ->parent.se.vruntime +=
                                              (new)cfs_rq->min_vruntime
                                      task_rq_unlock()
    raw_spin_lock_irqsave(rq->lock)
    se->vruntime = curr->vruntime
      -> vruntime of the child is set to that of the parent,
         which has already been updated by sched_move_task().
    se->vruntime -= (old)cfs_rq->min_vruntime
    raw_spin_unlock_irqrestore(rq->lock)

As a result, the vruntime of the child becomes far bigger than expected, if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by setting "cfs_rq" and "curr" after holding the rq->lock.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21  sched: Remove cfs bandwidth period check in tg_set_cfs_period()  (Kamalesh Babulal)
Remove the cfs bandwidth period check from tg_set_cfs_period(). The bandwidth period's valid lower and upper limits are denoted by min_cfs_quota_period and max_cfs_quota_period respectively, and the period is checked against them in tg_set_cfs_bandwidth(). As pjt pointed out, negative input will result in very large unsigned numbers and will be caught by the max allowed period test.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Paul Turner <pjt@google.com>
[amended changelog to mention negative values]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111210135925.GA14593@linux.vnet.ibm.com
--
 kernel/sched/core.c | 3 ---
 1 file changed, 3 deletions(-)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21  sched: Fix load-balance lock-breaking  (Peter Zijlstra)
The current lock break relies on contention on the rq locks, something which might never come because we've got IRQs disabled. Or contention will be very likely, because on anything with more than 2 CPUs a synchronized load-balance pass will very likely cause contention on the rq locks.

Also, the sched_nr_migrate thing fails when it gets trapped in the loops of either the cgroup muck in load_balance_fair() or the move_tasks() load condition.

Instead, use the new lb_flags field to propagate break/abort conditions for all these loops and create a new loop outside the IRQ-disabled section for when a break is required.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-tsceb6w61q0gakmsccix6xxi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21  sched: Replace all_pinned with a generic flags field  (Peter Zijlstra)
Replace the all_pinned argument with a flags field so that we can add some extra controls throughout that entire call chain. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-33kevm71m924ok1gpxd720v3@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21  sched: Only queue remote wakeups when crossing cache boundaries  (Peter Zijlstra)
Mike reported a 13% drop in netperf TCP_RR performance due to the new remote wakeup code. Suresh too noticed some performance issues with it. Reducing the IPIs to only cross cache domains solves the observed performance issues. Reported-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
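[Editor's note: a toy model of the decision being changed, namely queueing an IPI-based remote wakeup only when the waking and target CPUs do not share a last-level cache. The topology table and helpers below are invented for illustration; the kernel derives this from its scheduler domains.]

  /* Toy model: remote wakeups are queued only across cache boundaries.
   * llc_id[] is a made-up topology: CPUs 0-1 share an LLC, CPUs 2-3 share another. */
  #include <stdio.h>
  #include <stdbool.h>

  static const int llc_id[] = { 0, 0, 1, 1 };

  static bool cpus_share_cache(int a, int b)
  {
          return llc_id[a] == llc_id[b];
  }

  /* Mirrors the idea of the fix: send the reschedule IPI only across the LLC. */
  static bool queue_remote_wakeup(int waker_cpu, int target_cpu)
  {
          return !cpus_share_cache(waker_cpu, target_cpu);
  }

  int main(void)
  {
          printf("wake 0 -> 1: queue remotely? %d\n", queue_remote_wakeup(0, 1)); /* 0 */
          printf("wake 0 -> 3: queue remotely? %d\n", queue_remote_wakeup(0, 3)); /* 1 */
          return 0;
  }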
2011-12-21  lockdep/waitqueues: Add better annotation  (Peter Zijlstra)
 -> #2 (&tty->write_wait){-.-...}:

is a lot more informative than:

 -> #2 (key#19){-.....}:

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-8zpopbny51023rdb0qq67eye@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21  Merge commit 'v3.2-rc6' into core/locking  (Ingo Molnar)
Merge reason: Pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-20  Merge branch 'for-3.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)
* 'for-3.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroups: fix a css_set not found bug in cgroup_attach_proc
2011-12-20  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/clocksource: Fix kernel-doc warnings
  rtc: m41t80: Workaround broken alarm functionality
  rtc: Expire alarms after the time is set.
2011-12-20  Merge commit 'v3.2-rc6' into perf/core  (Ingo Molnar)
Merge reason: Update with the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-20  binary_sysctl(): fix memory leak  (Michel Lespinasse)
binary_sysctl() calls sysctl_getname(), which allocates from the names_cache slab using __getname(). The matching function to free the name is __putname(), not putname(), which should be used only to match getname() allocations.

This is because when auditing is enabled, putname() calls audit_putname() *instead* of (not in addition to) __putname(). Then, if a syscall is in progress, audit_putname() does not release the name - instead, it expects the name to be released when the syscall completes, but that will happen only if audit_getname() was called previously, i.e. if the name was allocated with getname() rather than the naked __getname(). So, __getname() followed by putname() ends up leaking memory.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
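[Editor's note: the pairing rule can be illustrated with a small user-space model. Nothing below is the kernel's getname/putname code; the counters and the audit flag are invented to show why __getname() must be matched by __putname(): with auditing on, putname() defers the release to a syscall-exit step that only knows about names taken through getname().]

  /* Illustrative model of the getname()/putname() pairing rule. */
  #include <stdio.h>
  #include <stdbool.h>

  static int  names_outstanding;   /* allocations not yet freed              */
  static int  audit_deferred;      /* names audit will free at syscall exit  */
  static bool audit_enabled = true;

  static void __getname(void) { names_outstanding++; }
  static void __putname(void) { names_outstanding--; }

  /* getname() registers the name with audit; putname() then defers the free. */
  static void getname(void) { __getname(); if (audit_enabled) audit_deferred++; }
  static void putname(void)
  {
          if (audit_enabled)
                  return;          /* audit_putname(): release at syscall exit */
          __putname();
  }

  static void audit_syscall_exit(void)
  {
          while (audit_deferred > 0) {   /* frees only audit-tracked names */
                  __putname();
                  audit_deferred--;
          }
  }

  int main(void)
  {
          /* Correct pairing: getname()/putname() plus audit cleanup. */
          getname(); putname(); audit_syscall_exit();
          printf("after getname/putname:   outstanding=%d\n", names_outstanding);

          /* The binary_sysctl() bug: __getname() paired with putname(). */
          __getname(); putname(); audit_syscall_exit();
          printf("after __getname/putname: outstanding=%d (leak)\n", names_outstanding);
          return 0;
  }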
2011-12-20  cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask  (David Rientjes)
Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a new set of allowed cpuset nodes where the two nodemasks, as a result of the remap, are now disjoint.

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") adds get_mems_allowed() to prevent the set of allowed nodes from changing for a thread. This causes any update to a set of allowed nodes to stall until put_mems_allowed() is called.

This stall is unnecessary, however, if at least one node remains unchanged in the update to the set of allowed nodes. This was addressed by 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one node remains set"), but it's still possible that an empty nodemask may be read from a mempolicy because the old nodemask may be remapped to the new nodemask during rebind. To prevent this, only avoid the stall if there is no mempolicy for the thread being changed. This is a temporary solution until all reads from mempolicy nodemasks can be guaranteed not to be empty without the get_mems_allowed() synchronization.

Also moves the check for nodemask intersection inside task_lock() so that tsk->mems_allowed cannot change. This ensures that nothing can set this tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-20  Merge branch 'memblock-kill-early_node_map' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/memblock  (Ingo Molnar)
2011-12-19  Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into cputime-tip  (Martin Schwidefsky)
Conflicts:
	drivers/cpufreq/cpufreq_conservative.c
	drivers/cpufreq/cpufreq_ondemand.c
	drivers/macintosh/rack-meter.c
	fs/proc/stat.c
	fs/proc/uptime.c
	kernel/sched/core.c
2011-12-19  cgroups: remove redundant get/put of css_set from css_set_check_fetched()  (Mandeep Singh Baines)
We already have a reference to all elements in newcg_list. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: containers@lists.linux-foundation.org Cc: cgroups@vger.kernel.org Cc: Paul Menage <paul@paulmenage.org>
2011-12-19  cgroups: fix a css_set not found bug in cgroup_attach_proc  (Mandeep Singh Baines)
There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch() is not called for the PF_EXITING case, find_existing_css_set() will return NULL inside cgroup_task_migrate(), causing a BUG.

This bug is easy to reproduce. Create a zombie and echo its pid to cgroup.procs.

  $ cat zombie.c
  #include <unistd.h>

  int main()
  {
          if (fork())
                  pause();
          return 0;
  }
  $

We are hitting this bug pretty regularly on ChromeOS.

This bug is already fixed by Tejun Heo's cgroup patchset, which is targeted for the next merge window:

  https://lkml.org/lkml/2011/11/1/356

I've created a smaller patch here which just fixes this bug so that a fix can be merged into the current release and stable.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Downstream-Bug-Report: http://crosbug.com/23953
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: stable@kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Olof Johansson <olofj@chromium.org>
2011-12-19  time/clocksource: Fix kernel-doc warnings  (Kusanagi Kouichi)
Fix various KernelDoc build warnings. Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Cc: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20111219091320.0D5AF6FC03D@msa105.auone-net.jp Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-18  writeback: dirty ratelimit - think time compensation  (Wu Fengguang)
Compensate the task's think time when computing the final pause time, so that ->dirty_ratelimit can be executed accurately.

	think time := time spent outside of balance_dirty_pages()

In the rare case that the task slept longer than the 200ms period time (resulting in a negative pause time), the sleep time will be compensated in the following periods, too, if it's less than 1 second.

Accumulated errors are carefully avoided as long as the max pause area is not hit.

Pseudo code:

	period = pages_dirtied / task_ratelimit;
	think = jiffies - dirty_paused_when;
	pause = period - think;

1) normal case: period > think

	pause = period - think
	dirty_paused_when = jiffies + pause
	nr_dirtied = 0

	                    period time
	      |===============================>|
	          think time      pause time
	      |===============>|==============>|
	------|----------------|---------------|------------------------
	dirty_paused_when   jiffies

2) no pause case: period <= think

	don't pause; reduce future pause time by:
	dirty_paused_when += period
	nr_dirtied = 0

	                    period time
	      |===============================>|
	                  think time
	      |===================================================>|
	------|--------------------------------+-------------------|----
	dirty_paused_when                                       jiffies

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
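[Editor's note: a direct, runnable rendering of the pseudo code above, with plain longs standing in for jiffies. The struct and function names are made up, not the kernel's.]

  /* Illustrative rendering of the think-time-compensated pause computation. */
  #include <stdio.h>

  struct dirtier {
          long dirty_paused_when;   /* "jiffies" when the last pause ended */
          long nr_dirtied;
  };

  /* Returns the pause in ticks; 0 means "don't pause, credit the surplus". */
  static long compute_pause(struct dirtier *t, long now, long pages_dirtied,
                            long task_ratelimit)
  {
          long period = pages_dirtied / task_ratelimit;
          long think  = now - t->dirty_paused_when;   /* time spent outside */
          long pause  = period - think;

          if (pause <= 0) {                 /* case 2: think time covers the period */
                  t->dirty_paused_when += period;
                  t->nr_dirtied = 0;
                  return 0;
          }
          t->dirty_paused_when = now + pause;         /* case 1: normal pause */
          t->nr_dirtied = 0;
          return pause;
  }

  int main(void)
  {
          struct dirtier t = { .dirty_paused_when = 100, .nr_dirtied = 0 };

          /* 400 pages dirtied at a ratelimit of 10 pages/tick => 40-tick period */
          printf("pause with little think time: %ld\n",
                 compute_pause(&t, 110, 400, 10));   /* think=10 -> pause=30 */
          printf("pause after a long sleep:     %ld\n",
                 compute_pause(&t, 300, 400, 10));   /* think covers period -> 0 */
          return 0;
  }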
2011-12-18  writeback: charge leaked page dirties to active tasks  (Wu Fengguang)
It's a years-long problem that a large number of short-lived dirtiers (eg. gcc instances in a fast kernel build) may starve long-run dirtiers (eg. dd) as well as push the dirty pages to the global hard limit.

The solution is to charge the pages dirtied by the exited gcc to the other random dirtying tasks. It's not perfect, but it should behave well enough in practice, seeing as throttled tasks aren't actually running, so the tasks that are running are more likely to pick up the charge and get throttled, thereby promoting an equal spread.

Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
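[Editor's note: one way to picture the hand-off described above is that an exiting task parks its unconsumed dirty count in a shared leak counter, and the next task entering the throttle path adds that leak to its own count. The sketch below is an invented model, not the kernel code; only the name dirty_throttle_leaks is taken from the changelog.]

  /* Toy model of charging an exiting task's dirtied pages to live dirtiers. */
  #include <stdio.h>

  static long dirty_throttle_leaks;     /* pages dirtied by tasks that exited */

  struct task { long nr_dirtied; };

  static void task_exit(struct task *t)
  {
          /* The short-lived dirtier leaves; park its unthrottled pages. */
          dirty_throttle_leaks += t->nr_dirtied;
          t->nr_dirtied = 0;
  }

  static void balance_dirty_pages(struct task *t)
  {
          /* A running dirtier picks up the leaked charge and is throttled for it. */
          t->nr_dirtied += dirty_throttle_leaks;
          dirty_throttle_leaks = 0;
  }

  int main(void)
  {
          struct task gcc = { .nr_dirtied = 128 };   /* exits before being throttled */
          struct task dd  = { .nr_dirtied = 512 };   /* long-running dirtier         */

          task_exit(&gcc);
          balance_dirty_pages(&dd);
          printf("dd now charged with %ld pages\n", dd.nr_dirtied);  /* 640 */
          return 0;
  }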
2011-12-17  Merge branches 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf events: Fix ring_buffer_wakeup() brown paperbag bug
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix select_idle_sibling() regression in selecting an idle SMT sibling
  MAINTAINERS: Update tip.git related git trees
2011-12-16  sched: Fix select_idle_sibling() regression in selecting an idle SMT sibling  (Peter Zijlstra)
Mike Galbraith reported that this recent commit:

  commit 4dcfe1025b513c2c1da5bf5586adb0e80148f612
  Author: Peter Zijlstra <peterz@infradead.org>
  Date:   Thu Nov 10 13:01:10 2011 +0100

      sched: Avoid SMT siblings in select_idle_sibling() if possible

stopped selecting an idle SMT sibling when there are no idle cores in a single-socket system.

The intent of select_idle_sibling() was to fall back to an idle SMT sibling if it fails to identify an idle core. But this fallback was not happening on systems where all the scheduler domains had the SD_SHARE_PKG_RESOURCES flag set. Fix it. A slightly bigger patch cleaning up all these gotos etc. is queued up for the next release.

Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1323978421.1984.244.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-16  sched: Add missing rcu_dereference() around ->real_parent usage  (Kees Cook)
Wrap another ->real_parent dereference while under rcu_read_lock. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Glauber Costa <glommer@parallels.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20111215164918.GA13003@www.outflux.net [ tidied up the changelog ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-15  [S390] cputime: add sparse checking and cleanup  (Martin Schwidefsky)
Make cputime_t and cputime64_t nocast to enable sparse checking to detect incorrect use of cputime. Drop the cputime macros for simple scalar operations. The conversion macros are still needed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-12-15  Merge commit 'v3.2-rc5' into sched/core  (Ingo Molnar)
Merge reason: Pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-14  clocksource: convert sysdev_class to a regular subsystem  (Kay Sievers)
After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-14  rtmutex-tester: convert sysdev_class to a regular subsystem  (Kay Sievers)
After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-14  perf events: Fix ring_buffer_wakeup() brown paperbag bug  (Will Deacon)
Commit 10c6db11 ("perf: Fix loss of notification with multi-event") seems to unconditionally dereference event->rb in the wakeup handler. This is wrong; there might not be a buffer attached.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111213152651.GP20297@mudshark.cambridge.arm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-14  block, cfq: unlink cfq_io_context's immediately  (Tejun Heo)
cic is the association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other.

Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock, but the radix tree and cic's are also protected by RCU, allowing either side to walk their lists without grabbing the lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. the following is from cfqd going away too soon after ioc and q exits raced.

  general protection fault: 0000 [#1] PREEMPT SMP
  CPU 2
  Modules linked in:
  [   88.503444]
  Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
  RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
  ...
  Call Trace:
   [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
   [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
   [<ffffffff81389130>] exit_io_context+0x100/0x140
   [<ffffffff81098a29>] do_exit+0x579/0x850
   [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
   [<ffffffff81098de7>] sys_exit_group+0x17/0x20
   [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b

The only real hot path here is cic lookup during request initialization, and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately.

* From the q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it.

* From the ioc side, it's a bit more difficult because of the inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing.

  Unlinking is now wholly done from put_io_context() and the fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In the unlikely case of fast path failure, it falls back to a full double-locking dance from a workqueue.

Double-locking isn't the prettiest thing in the world, but it's *far* simpler and more understandable than the RCU trick without adding any meaningful overhead.

This still leaves a lot of now unnecessary RCU logic. Future patches will trim it.

-v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
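[Editor's note: the "unlock-relock dancing" mentioned above is a general pattern: when you hold lock A but the object also needs lock B, and the established order is B before A, you either trylock B or drop A, take both in order, and revalidate. The pthreads sketch below illustrates only that generic pattern with invented names; it is not the cfq/ioc code.]

  /* Generic reverse-lock-order pattern: hold A, need B, ordering is B -> A.
   * Illustrative only; compile with -pthread. */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* e.g. queue_lock */
  static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* e.g. ioc->lock  */
  static int shared_item = 42;

  /* Caller holds lock_a and wants to unlink shared_item, which needs lock_b too. */
  static void unlink_item_locked(void)
  {
          if (pthread_mutex_trylock(&lock_b) == 0) {
                  /* Fast path: got the out-of-order lock without blocking. */
                  shared_item = 0;
                  pthread_mutex_unlock(&lock_b);
                  return;
          }

          /* Slow path: drop A, take both in the proper B -> A order, revalidate. */
          pthread_mutex_unlock(&lock_a);
          pthread_mutex_lock(&lock_b);
          pthread_mutex_lock(&lock_a);
          if (shared_item)        /* someone may have unlinked it meanwhile */
                  shared_item = 0;
          pthread_mutex_unlock(&lock_b);
          /* lock_a is held again on return, as the caller expects. */
  }

  int main(void)
  {
          pthread_mutex_lock(&lock_a);
          unlink_item_locked();
          pthread_mutex_unlock(&lock_a);
          printf("item after unlink: %d\n", shared_item);
          return 0;
  }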
2011-12-14  block: make ioc get/put interface more conventional and fix race on allocation  (Tejun Heo)
Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from the local task while the latter can be called from a different task. The synchronization between them is peculiar and dubious.

* current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment.

* set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho), but as the code currently stands, the dependent read barrier is a noop.

As only one of the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about the race between the two. Nothing prevents current_io_context() and set_task_ioprio() from allocating their own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with an exiting task and create a new ioc after exit_io_context() is finished.

ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything.

This patch updates ioc get/put so that it becomes more conventional.

* alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to the ioc of another task. On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards.

* The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock().

* get_io_context() now means incrementing the ref on an ioc which the caller already has access to (be that an explicit refcnt or the implicit %current one).

* PF_EXITING inhibits creation of a new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL.

* All users are updated. Most are trivial but the smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates a new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have a wmb between init and install. There are other problems with this which will be fixed in another patch.

* While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification.

-v2: Vivek spotted contamination from a debug patch. Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13  resource cgroups: remove bogus cast  (Davidlohr Bueso)
The memparse() function already accepts const char * as the parsing string. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-12-12  cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()  (Tejun Heo)
These three methods are no longer used. Kill them. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com>
2011-12-12  cgroup, cpuset: don't use ss->pre_attach()  (Tejun Heo)
->pre_attach() is supposed to be called before migration, which is observed during process migration, but task migration does it the other way around. The only ->pre_attach() user is cpuset, which can do the same operations in ->can_attach(). Collapse cpuset_pre_attach() into cpuset_can_attach().

-v2: Patch contamination from a later patch removed. Spotted by Paul Menage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
2011-12-12  cgroup: don't use subsys->can_attach_task() or ->attach_task()  (Tejun Heo)
Now that subsys->can_attach() and attach() take @tset instead of @task, they can handle per-task operations. Convert ->can_attach_task() and ->attach_task() users to use ->can_attach() and attach() instead. Most conversions are straightforward. Noteworthy changes are:

* In cgroup_freezer, remove unnecessary NULL assignments to unused methods. It's useless and very prone to get out of sync, which already happened.

* In cpuset, the PF_THREAD_BOUND test is checked for each task. This doesn't make any practical difference but is conceptually cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2011-12-12  cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()  (Tejun Heo)
Currently, there's no way to pass multiple tasks to cgroup_subsys methods, necessitating separate per-process and per-task methods. This patch introduces cgroup_taskset, which can be used to pass multiple tasks and their associated cgroups to cgroup_subsys methods.

Three methods - can_attach(), cancel_attach() and attach() - are converted to use cgroup_taskset. This unifies the passed parameters so that all methods have access to all information. Conversions in this patchset are identical and don't introduce any behavior change.

-v2: documentation updated as per Paul Menage's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
2011-12-12  cgroup: improve old cgroup handling in cgroup_attach_proc()  (Tejun Heo)
cgroup_attach_proc() behaves differently from cgroup_attach_task() in the following aspects.

* All hooks are invoked even if no task is actually being moved.

* ->can_attach_task() is called for all tasks in the group whether the new cgrp is different from the current cgrp or not; however, ->attach_task() is skipped if the new cgrp equals the old one. This makes the calls asymmetric.

This patch improves old cgroup handling in cgroup_attach_proc() by looking up the current cgroup at the head, recording it in the flex array along with the task itself, and using it to remove the above two differences. This will also ease further changes.

-v2: nr_todo renamed to nr_migrating_tasks as per Paul Menage's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-12  cgroup: always lock threadgroup during migration  (Tejun Heo)
Update cgroup to take advantage of the fact that threadgroup_lock() guarantees a stable threadgroup.

* Lock the threadgroup even if the target is a single task. This guarantees that the target tasks stay stable during migration regardless of the target type.

* Remove the PF_EXITING early exit optimization from attach_task_by_pid() and check it in cgroup_task_migrate() instead. The optimization was for a rather cold path to begin with, and the PF_EXITING state can be trusted throughout migration by checking it after locking the threadgroup.

* Don't add PF_EXITING tasks to the target task array in cgroup_attach_proc(). This ensures that task migration is performed only for live tasks.

* Remove the -ESRCH failure path from cgroup_task_migrate(). With the above changes, it's guaranteed to be called only for live tasks.

After the changes, only live tasks are migrated and they're guaranteed to stay alive until migration is complete. This removes problems caused by exec and exit racing against cgroup migration, including symmetry among cgroup attach methods and different cgroup methods racing each other.

v2: Oleg pointed out that one more PF_EXITING check can be removed from cgroup_attach_proc(). Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2011-12-12  threadgroup: extend threadgroup_lock() to cover exit and exec  (Tejun Heo)
threadgroup_lock() only protected against new additions to the threadgroup, which was inherently somewhat incomplete and problematic for its only user, cgroup. On-going migration could race against exec and exit, leading to interesting problems - the symmetry between various attach methods, task exiting during method execution, ->exit() racing against attach methods, a migrating task switching basic properties during exec, and so on.

This patch extends threadgroup_lock() such that it protects against all three threadgroup-altering operations - fork, exit and exec. For exit, threadgroup_change_begin/end() calls are added to exit_signals() around the assertion of PF_EXITING. For exec, threadgroup_[un]lock() are updated to also grab and release cred_guard_mutex.

With this change, threadgroup_lock() guarantees that the target threadgroup will remain stable - no new task will be added, no new PF_EXITING will be set and exec won't happen.

The next patch will update cgroup so that it can take full advantage of this change.

-v2: beefed up comment as suggested by Frederic.
-v3: narrowed scope of protection in exit path as suggested by Frederic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-12  threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem  (Tejun Heo)
Make the following renames to prepare for extension of threadgroup locking.

* s/signal->threadgroup_fork_lock/signal->group_rwsem/
* s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
* s/threadgroup_fork_read_unlock()/threadgroup_change_end()/
* s/threadgroup_fork_write_lock()/threadgroup_lock()/
* s/threadgroup_fork_write_unlock()/threadgroup_unlock()/

This patch doesn't cause any behavior change.

-v2: Rename threadgroup_change_done() to threadgroup_change_end() per KAMEZAWA's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2011-12-12  cgroup: add cgroup_root_mutex  (Tejun Heo)
cgroup wants to make the threadgroup stable while modifying cgroup hierarchies, which will introduce a locking dependency on cred_guard_mutex from cgroup_mutex. This unfortunately completes a circular dependency:

 A. cgroup_mutex -> cred_guard_mutex -> s_type->i_mutex_key -> namespace_sem
 B. namespace_sem -> cgroup_mutex

B is from cgroup_show_options(), and this patch breaks it by introducing another mutex, cgroup_root_mutex, which nests inside cgroup_mutex and protects cgroupfs_root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
2011-12-12  cpu: Export cpu_up()  (Paul E. McKenney)
Building rcutorture as a module requires cpu_up() as well as cpu_down() exported, so apply EXPORT_SYMBOL_GPL(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11  rcu: Apply ACCESS_ONCE() to rcu_boost() return value  (Paul E. McKenney)
Both TINY_RCU's and TREE_RCU's implementations of rcu_boost() access the ->boost_tasks and ->exp_tasks fields without preventing concurrent changes to these fields. This commit therefore applies ACCESS_ONCE in order to prevent compiler mischief. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
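[Editor's note: for readers unfamiliar with the macro, ACCESS_ONCE() forces the compiler to emit exactly one load (or store) of a variable by going through a volatile-qualified lvalue, so a value snapshotted into a local cannot be silently re-fetched or torn by the optimizer. The macro definition below mirrors the one the kernel used at the time; the surrounding struct and function are invented for illustration and are not the rcu_boost() code.]

  /* Minimal ACCESS_ONCE() illustration: snapshot a shared pointer once. */
  #include <stdio.h>

  #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

  struct boost_state {
          int boost_tasks;
          int exp_tasks;
  };

  static struct boost_state state = { 1, 0 };
  static struct boost_state *shared = &state;   /* may be updated concurrently */

  static int needs_boost(void)
  {
          /* One load of the shared pointer; all later tests use the snapshot. */
          struct boost_state *snap = ACCESS_ONCE(shared);

          return snap != NULL && (snap->boost_tasks || snap->exp_tasks);
  }

  int main(void)
  {
          printf("needs_boost: %d\n", needs_boost());
          return 0;
  }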
2011-12-11  Revert "rcu: Permit rt_mutex_unlock() with irqs disabled"  (Paul E. McKenney)
This reverts commit 5342e269b2b58ee0b0b4168a94087faaa60d0567. The approach taken in this patch was deemed too abusive to mutexes, and thus too likely to result in maintenance problems in the future. Instead, we will disallow RCU read-side critical sections that partially overlap with interrupt-disabled code segments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11  rcu: Augment rcu_batch_end tracing for idle and callback state  (Paul E. McKenney)
The current rcu_batch_end event trace records only the name of the RCU flavor and the total number of callbacks that remain queued on the current CPU. This is insufficient for testing and tuning the new dyntick-idle RCU_FAST_NO_HZ code, so this commit adds idle state along with whether or not any of the callbacks that were ready to invoke at the beginning of rcu_do_batch() are still queued. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11  rcu: Add rcutorture tests for srcu_read_lock_raw()  (Paul E. McKenney)
This commit adds simple rcutorture tests for srcu_read_lock_raw() and srcu_read_unlock_raw(). It does not test doing srcu_read_lock_raw() in an exception handler and releasing it in the corresponding process context. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11  rcu: Make rcutorture test for hotpluggability before offlining CPUs  (Paul E. McKenney)
The rcutorture test now can automatically exercise CPU hotplug and collect success statistics, which can be correlated with other rcutorture activity. This permits rcutorture to completely exercise RCU regardless of what sort of userspace and filesystem layout is in use. Unfortunately, rcutorture is happy to attempt to offline CPUs that cannot be offlined, for example, CPU 0 in both the x86 and ARM architectures. Although this allows rcutorture testing to proceed normally, it confounds attempts at error analysis due to the resulting flood of spurious CPU-hotplug errors. Therefore, this commit uses the new cpu_is_hotpluggable() function to avoid attempting to offline CPUs that are not hotpluggable, which in turn avoids spurious CPU-hotplug errors. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11  rcu: Remove redundant rcu_cpu_stall_suppress declaration  (Paul E. McKenney)
No point in having two identical rcu_cpu_stall_suppress declarations, so remove the more obscure of the two. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11  rcu: Adaptive dyntick-idle preparation  (Paul E. McKenney)
If there are other CPUs active at a given point in time, then there is a limit to what a given CPU can do to advance the current RCU grace period. Beyond this limit, attempting to force the RCU grace period forward will do nothing but consume energy burning CPU cycles. Therefore, this commit takes an adaptive approach to RCU_FAST_NO_HZ preparations for idle. It pushes the RCU core state machine for two cycles unconditionally, and then it will push from zero to three additional cycles, but only as long as the RCU core has work for this CPU to do immediately. The rcu_pending() function is used to check whether the RCU core has such work. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
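[Editor's note: the "two unconditional cycles plus up to three more while there is work" policy lends itself to a short model. The sketch below is not RCU code; rcu_pending() and push_rcu_core() are stubbed with a simple countdown so that only the loop's shape is visible.]

  /* Toy model of the adaptive RCU_FAST_NO_HZ idle preparation loop.
   * Invented names and a stubbed rcu_pending(); illustrative only. */
  #include <stdio.h>

  static int pending_work = 4;     /* pretend the CPU has 4 units of RCU work */

  static int rcu_pending(void)     /* stub: does this CPU still have work? */
  {
          return pending_work > 0;
  }

  static void push_rcu_core(void)  /* stub: one pass of the RCU state machine */
  {
          if (pending_work > 0)
                  pending_work--;
  }

  static void rcu_prepare_for_idle(void)
  {
          int extra, cycles;

          /* Push the core state machine twice unconditionally... */
          push_rcu_core();
          push_rcu_core();
          cycles = 2;

          /* ...then zero to three more times, but only while work remains. */
          for (extra = 0; extra < 3 && rcu_pending(); extra++) {
                  push_rcu_core();
                  cycles++;
          }
          printf("pushed RCU core %d times, work left: %d\n", cycles, pending_work);
  }

  int main(void)
  {
          rcu_prepare_for_idle();
          return 0;
  }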