summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2012-08-21timer: Clean up timer initializersTejun Heo
Over time, timer initializers became messy with unnecessarily duplicated code which are inconsistently spread across timer.h and timer.c. This patch cleans up timer initializers. * timer.c::__init_timer() is renamed to do_init_timer(). * __TIMER_INITIALIZER() added. It takes @flags and all initializers are wrappers around it. * init_timer[_on_stack]_key() now take @flags. * __init_timer[_on_stack]() added. They take @flags and all init macros are wrappers around them. * __setup_timer[_on_stack]() added. It uses __init_timer() and takes @flags. All setup macros are wrappers around the two. Note that this patch doesn't add missing init/setup combinations - e.g. init_timer_deferrable_on_stack(). Adding missing ones is trivial. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: torvalds@linux-foundation.org Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1344449428-24962-4-git-send-email-tj@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21timer: Generalize timer->base flags handlingTejun Heo
To prepare for addition of another flag, generalize timer->base flags handling. * Rename from TBASE_*_FLAG to TIMER_* and make them LU constants. * Define and use TIMER_FLAG_MASK for flags masking so that multiple flags can be handled correctly. * Don't dereference timer->base directly even if !tbase_get_deferrable(). All two such places are already passed in @base, so use it instead. * Make sure tvec_base's alignment is large enough for timer->base flags using BUILD_BUG_ON(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: torvalds@linux-foundation.org Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1344449428-24962-2-git-send-email-tj@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21genirq: Export dummy_irq_chipKuninori Morimoto
Export dummy_irq_chip to modules to allow them to do things such as irq_set_chip_and_handler(virq, &dummy_irq_chip, handle_level_irq); This fixes ERROR: "dummy_irq_chip" [drivers/gpio/gpio-pcf857x.ko] undefined! when gpio-pcf857x.c is being built as a module. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Greg KH <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/871ujstrp6.wl%25kuninori.morimoto.gx@renesas.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21genirq: Export irq_set_chip_and_handler_name()Kuninori Morimoto
Export irq_set_chip_and_handler_name() to modules to allow them to do things such as irq_set_chip_and_handler(....); This fixes ERROR: "irq_set_chip_and_handler_name" \ [drivers/gpio/gpio-pcf857x.ko] undefined! when gpio-pcf857x.c is being built as a module. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Greg KH <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/873948trpk.wl%25kuninori.morimoto.gx@renesas.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21Merge branch 'tip/perf/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull syscall tracing fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() failsOleg Nesterov
This patch fixes: https://bugzilla.redhat.com/show_bug.cgi?id=843640 If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path does unmap_region() but does not remove the soon-to-be-freed vma from rb tree. Actually there are more problems but this is how William noticed this bug. Perhaps we could do do_munmap() + return in this case, but in fact it is simply wrong to abort if uprobe_mmap() fails. Until at least we move the !UPROBE_COPY_INSN code from install_breakpoint() to uprobe_register(). For example, uprobe_mmap()->install_breakpoint() can fail if the probed insn is not supported (remember, uprobe_register() succeeds if nobody mmaps inode/offset), mmap() should not fail in this case. dup_mmap()->uprobe_mmap() is wrong too by the same reason, fork() can race with uprobe_register() and fail for no reason if it wins the race and does install_breakpoint() first. And, if nothing else, both mmap_region() and dup_mmap() return success if uprobe_mmap() fails. Change them to ignore the error code from uprobe_mmap(). Reported-and-tested-by: William Cohen <wcohen@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> # v3.5 Cc: Anton Arapov <anton@redhat.com> Cc: William Cohen <wcohen@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21Merge branch 'tip/perf/core-2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull ftrace fixlets from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21Merge tag 'perf-core-for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: * Fix include order for bison/flex-generated C files, from Ben Hutchings * Build fixes and documentation corrections from David Ahern * Group parsing support, from Jiri Olsa * UI/gtk refactorings and improvements from Namhyung Kim * NULL deref fix for perf script, from Namhyung Kim * Assorted cleanups from Robert Richter * Let O= makes handle relative paths, from Steven Rostedt * perf script python fixes, from Feng Tang. * Improve 'perf lock' error message when the needed tracepoints are not present, from David Ahern. * Initial bash completion support, from Frederic Weisbecker * Allow building without libelf, from Namhyung Kim. * Support DWARF CFI based unwind to have callchains when %bp based unwinding is not possible, from Jiri Olsa. * Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF section was the end goal, several fixes for code that handles all architectures and cleanups are included, from Cody Schafer. * Add a description for the JIT interface, from Andi Kleen. * Assorted fixes for Documentation and build in 32 bit, from Robert Richter * Add support for non-tracepoint events in perf script python, from Feng Tang * Cache the libtraceevent event_format associated to each evsel early, so that we avoid relookups, i.e. calling pevent_find_event repeatedly when processing tracepoint events. [ This is to reduce the surface contact with libtraceevents and make clear what is that the perf tools needs from that lib: so far parsing the common and per event fields. ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21Merge branch 'tip/perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull ftrace updates from Steve Rostedt: " This patch series extends ftrace function tracing utility to be more dynamic for its users. It allows for data passing to the callback functions, as well as reading regs as if a breakpoint were to trigger at function entry. The main goal of this patch series was to allow kprobes to use ftrace as an optimized probe point when a probe is placed on an ftrace nop. With lots of help from Masami Hiramatsu, and going through lots of iterations, we finally came up with a good solution. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-20workqueue: deprecate system_nrt[_freezable]_wqTejun Heo
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and convert all users to system[_freezable]_wq. If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant, so there's no reason to use system_nrt[_freezable]_wq. Please use system[_freezable]_wq instead. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com>
2012-08-20workqueue: gut system_nrt[_freezable]_wq()Tejun Heo
Now that all workqueues are non-reentrant, system[_freezable]_wq() are equivalent to system_nrt[_freezable]_wq(). Replace the latter with wrappers around system[_freezable]_wq(). The wrapping goes through inline functions so that __deprecated can be added easily. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: gut flush[_delayed]_work_sync()Tejun Heo
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync() are equivalent to flush[_delayed]_work(). Drop the separate implementation and make them thin wrappers around flush[_delayed]_work(). * start_flush_work() no longer takes @wait_executing as the only left user - flush_work() - always sets it to %true. * __cancel_work_timer() uses flush_work() instead of wait_on_work(). Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: make all workqueues non-reentrantTejun Heo
By default, each per-cpu part of a bound workqueue operates separately and a work item may be executing concurrently on different CPUs. The behavior avoids some cross-cpu traffic but leads to subtle weirdities and not-so-subtle contortions in the API. * There's no sane usefulness in allowing a single work item to be executed concurrently on multiple CPUs. People just get the behavior unintentionally and get surprised after learning about it. Most either explicitly synchronize or use non-reentrant/ordered workqueue but this is error-prone. * flush_work() can't wait for multiple instances of the same work item on different CPUs. If a work item is executing on cpu0 and then queued on cpu1, flush_work() can only wait for the one on cpu1. Unfortunately, work items can easily cross CPU boundaries unintentionally when the queueing thread gets migrated. This means that if multiple queuers compete, flush_work() can't even guarantee that the instance queued right before it is finished before returning. * flush_work_sync() was added to work around some of the deficiencies of flush_work(). In addition to the usual flushing, it ensures that all currently executing instances are finished before returning. This operation is expensive as it has to walk all CPUs and at the same time fails to address competing queuer case. Incorrectly using flush_work() when flush_work_sync() is necessary is an easy error to make and can lead to bugs which are difficult to reproduce. * Similar problems exist for flush_delayed_work[_sync](). Other than the cross-cpu access concern, there's no benefit in allowing parallel execution and it's plain silly to have this level of contortion for workqueue which is widely used from core code to extremely obscure drivers. This patch makes all workqueues non-reentrant. If a work item is executing on a different CPU when queueing is requested, it is always queued to that CPU. This guarantees that any given work item can be executing on one CPU at maximum and if a work item is queued and executing, both are on the same CPU. The only behavior change which may affect workqueue users negatively is that non-reentrancy overrides the affinity specified by queue_work_on(). On a reentrant workqueue, the affinity specified by queue_work_on() is always followed. Now, if the work item is executing on one of the CPUs, the work item will be queued there regardless of the requested affinity. I've reviewed all workqueue users which request explicit affinity, and, fortunately, none seems to be crazy enough to exploit parallel execution of the same work item. This adds an additional busy_hash lookup if the work item was previously queued on a different CPU. This shouldn't be noticeable under any sane workload. Work item queueing isn't a very high-frequency operation and they don't jump across CPUs all the time. In a micro benchmark to exaggerate this difference - measuring the time it takes for two work items to repeatedly jump between two CPUs a number (10M) of times with busy_hash table densely populated, the difference was around 3%. While the overhead is measureable, it is only visible in pathological cases and the difference isn't huge. This change brings much needed sanity to workqueue and makes its behavior consistent with timer. I think this is the right tradeoff to make. This enables significant simplification of workqueue API. Simplification patches will follow. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20workqueue: fix checkpatch issuesValentin Ilie
Fixed some checkpatch warnings. tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit. Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
2012-08-20Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix migration thread runtime bogosity sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled sched,cgroup: Fix up task_groups list sched: fix divide by zero at {thread_group,task}_times sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
2012-08-20cputime: Consolidate vtime handling on context switchFrederic Weisbecker
The archs that implement virtual cputime accounting all flush the cputime of a task when it gets descheduled and sometimes set up some ground initialization for the next task to account its cputime. These archs all put their own hooks in their context switch callbacks and handle the off-case themselves. Consolidate this by creating a new account_switch_vtime() callback called in generic code right after a context switch and that these archs must implement to flush the prev task cputime and initialize the next task cputime related state. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2012-08-20sched: Move cputime code to its own fileFrederic Weisbecker
Extract cputime code from the giant sched/core.c and put it in its own file. This make it easier to deal with this particular area and de-bloat a bit more core.c Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2012-08-19Merge branch 'alpha' (alpha architecture patches)Linus Torvalds
Merge alpha architecture update from Michael Cree: "The Alpha Maintainer, Matt Turner, is currently unavailable, so I have collected up patches that have been posted to the linux-alpha mailing list over the last couple of months, and are forwarding them to you in the hope that you are prepared to accept them via me. The patches by Al Viro and myself I have been running against kernels for two months now so have had quite a bit of testing. All except one patch were intended for the 3.5 kernel but because of Matt's unavailability never got forwarded to you." * emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits) alpha: Fix fall-out from disintegrating asm/system.h Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts alpha: fix fpu.h usage in userspace alpha/mm/fault.c: Port OOM changes to do_page_fault alpha: take kernel_execve() out of entry.S alpha: take a bunch of syscalls into osf_sys.c alpha: Use new generic strncpy_from_user() and strnlen_user() alpha: Wire up cross memory attach syscalls alpha: Don't export SOCK_NONBLOCK to user space.
2012-08-19alpha: take a bunch of syscalls into osf_sys.cAl Viro
New helper: current_thread_info(). Allows to do a bunch of odd syscalls in C. While we are at it, there had never been a reason to do osf_getpriority() in assembler. We also get "namespace"-aware (read: consistent with getuid(2), etc.) behaviour from getx?id() syscalls now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Michael Cree <mcree@orcon.net.nz> Acked-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-17tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1Will Deacon
syscall_get_nr can return -1 in the case that the task is not executing a system call. This patch fixes perf_syscall_{enter,exit} to check that the syscall number is valid before using it as an index into a bitmap. Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com Cc: Jason Baron <jbaron@redhat.com> Cc: Wade Farnsworth <wade_farnsworth@mentor.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-08-17Merge tag 'v3.6-rc2' into nextJames Morris
Linux 3.6-rc2 Resync with Linus.
2012-08-16workqueue: use system_highpri_wq for unbind_workJoonsoo Kim
To speed cpu down processing up, use system_highpri_wq. As scheduling priority of workers on it is higher than system_wq and it is not contended by other normal works on this cpu, work on it is processed faster than system_wq. tj: CPU up/downs care quite a bit about latency these days. This shouldn't hurt anything and makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: use system_highpri_wq for highpri workers in rebind_workers()Joonsoo Kim
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers. Currently, in this case, we use only system_wq. This makes a possible error situation as there is mismatch between cwq->pool and worker->pool. To prevent this, we should use system_highpri_wq for highpri worker to match theses. This implements it. tj: Rephrased comment a bit. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: introduce system_highpri_wqJoonsoo Kim
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker or highpri worker. But, we don't consider this difference in rebind_workers(), we use just system_wq for highpri worker. It makes mismatch between cwq->pool and worker->pool. It doesn't make error in current implementation, but possible in the future. Now, we introduce system_highpri_wq to use proper cwq for highpri workers in rebind_workers(). Following patch fix this issue properly. tj: Even apart from rebinding, having system_highpri_wq generally makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: change value of lcpu in __queue_delayed_work_on()Joonsoo Kim
We assign cpu id into work struct's data field in __queue_delayed_work_on(). In current implementation, when work is come in first time, current running cpu id is assigned. If we do __queue_delayed_work_on() with CPU A on CPU B, __queue_work() invoked in delayed_work_timer_fn() go into the following sub-optimal path in case of WQ_NON_REENTRANT. gcwq = get_gcwq(cpu); if (wq->flags & WQ_NON_REENTRANT && (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) { Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND. It is sufficient to prevent to go into sub-optimal path. tj: Slightly rephrased the comment. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: correct req_cpu in trace_workqueue_queue_work()Joonsoo Kim
When we do tracing workqueue_queue_work(), it records requested cpu. But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, requested cpu is changed as local cpu. In case of @wq->flag & WQ_UNBOUND, above change is not occured, therefore it is reasonable to correct it. Use temporary local variable for storing requested cpu. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16workqueue: use enum value to set array size of pools in gcwqJoonsoo Kim
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent size of pools, definition of worker_pool in gcwq doesn't use it. Using it makes code robust and prevent future mistakes. So change code to use this enum value. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-15time: Improve sanity checking of timekeeping inputsJohn Stultz
Unexpected behavior could occur if the time is set to a value large enough to overflow a 64bit ktime_t (which is something larger then the year 2262). Also unexpected behavior could occur if large negative offsets are injected via adjtimex. So this patch improves the sanity check timekeeping inputs by improving the timespec_valid() check, and then makes better use of timespec_valid() to make sure we don't set the time to an invalid negative value or one that overflows ktime_t. Note: This does not protect from setting the time close to overflowing ktime_t and then letting natural accumulation cause the overflow. Reported-by: CAI Qian <caiqian@redhat.com> Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-15audit: clean up refcounting in audit-treeMiklos Szeredi
Drop the initial reference by fsnotify_init_mark early instead of audit_tree_freeing_mark() at destroy time. In the cases we destroy the mark before we drop the initial reference we need to get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-08-15audit: fix refcounting in audit-treeMiklos Szeredi
Refcounting of fsnotify_mark in audit tree is broken. E.g: refcount create_chunk alloc_chunk 1 fsnotify_add_mark 2 untag_chunk fsnotify_get_mark 3 fsnotify_destroy_mark audit_tree_freeing_mark 2 fsnotify_put_mark 1 fsnotify_put_mark 0 via destroy_list fsnotify_mark_destroy -1 This was reported by various people as triggering Oops when stopping auditd. We could just remove the put_mark from audit_tree_freeing_mark() but that would break freeing via inode destruction. So this patch simply omits a put_mark after calling destroy_mark or adds a get_mark before. The additional get_mark is necessary where there's no other put_mark after fsnotify_destroy_mark() since it assumes that the caller is holding a reference (or the inode is keeping the mark pinned, not the case here AFAICS). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Valentin Avram <aval13@gmail.com> Reported-by: Peter Moody <pmoody@google.com> Acked-by: Eric Paris <eparis@redhat.com> CC: stable@vger.kernel.org
2012-08-15audit: don't free_chunk() after fsnotify_add_mark()Miklos Szeredi
Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref via the destroy list and this results in use-after-free. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Eric Paris <eparis@redhat.com> CC: stable@vger.kernel.org
2012-08-14pidns: Export free_pid_nsEric W. Biederman
There is a least one modular user so export free_pid_ns so modules can capture and use the pid namespace on the very rare occasion when it makes sense. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-08-14net ip6 flowlabel: Make owner a union of struct pid * and kuid_tEric W. Biederman
Correct a long standing omission and use struct pid in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS. This guarantees we don't have issues when pid wraparound occurs. Use a kuid_t in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_USER to add user namespace support. In /proc/net/ip6_flowlabel capture the current pid namespace when opening the file and release the pid namespace when the file is closed ensuring we print the pid owner value that is meaning to the reader of the file. Similarly use from_kuid_munged to print uid values that are meaningful to the reader of the file. This requires exporting pid_nr_ns so that ipv6 can continue to built as a module. Yoiks what silliness Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-13workqueue: add missing wmb() in clear_work_data()Tejun Heo
Any operation which clears PENDING should be preceded by a wmb to guarantee that the next PENDING owner sees all the changes made before PENDING release. There are only two places where PENDING is cleared - set_work_cpu_and_clear_pending() and clear_work_data(). The caller of the former already does smp_wmb() but the latter doesn't have any. Move the wmb above set_work_cpu_and_clear_pending() into it and add one to clear_work_data(). There hasn't been any report related to this issue, and, given how clear_work_data() is used, it is extremely unlikely to have caused any actual problems on any architecture. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>
2012-08-13workqueue: fix CPU binding of flush_delayed_work[_sync]()Tejun Heo
delayed_work encodes the workqueue to use and the last CPU in delayed_work->work.data while it's on timer. The target CPU is implicitly recorded as the CPU the timer is queued on and delayed_work_timer_fn() queues delayed_work->work to the CPU it is running on. Unfortunately, this leaves flush_delayed_work[_sync]() no way to find out which CPU the delayed_work was queued for when they try to re-queue after killing the timer. Currently, it chooses the local CPU flush is running on. This can unexpectedly move a delayed_work queued on a specific CPU to another CPU and lead to subtle errors. There isn't much point in trying to save several bytes in struct delayed_work, which is already close to a hundred bytes on 64bit with all debug options turned off. This patch adds delayed_work->cpu to remember the CPU it's queued for. Note that if the timer is migrated during CPU down, the work item could be queued to the downed global_cwq after this change. As a detached global_cwq behaves like an unbound one, this doesn't change much for the delayed_work. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2012-08-13sched: recover SD_WAKE_AFFINE in select_task_rq_fair and code clean upAlex Shi
Since power saving code was removed from sched now, the implement code is out of service in this function, and even pollute other logical. like, 'want_sd' never has chance to be set '0', that remove the effect of SD_WAKE_AFFINE here. So, clean up the obsolete code, includes SD_PREFER_LOCAL. Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched: using dst_rq instead of this_rq during load balanceMichael Wang
As we already have dst_rq in lb_env, using or changing "this_rq" do not make sense. This patch will replace "this_rq" with dst_rq in load_balance, and we don't need to change "this_rq" while process LBF_SOME_PINNED any more. Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched: Document schedule() entry pointsPekka Enberg
This patch adds a comment on top of the schedule() function to explain to scheduler newbies how the main scheduler function is entered. Acked-by: Randy Dunlap <rdunlap@xenotime.net> Explained-by: Ingo Molnar <mingo@kernel.org> Explained-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched: Fix __sched_period commentBorislav Petkov
It should be sched_nr_latency so fix it before it annoys me more. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13Merge branch 'sched/urgent' into sched/coreThomas Gleixner
2012-08-13sched: Fix migration thread runtime bogosityMike Galbraith
Make stop scheduler class do the same accounting as other classes, Migration threads can be caught in the act while doing exec balancing, leading to the below due to use of unmaintained ->se.exec_start. The load that triggered this particular instance was an apparently out of control heavily threaded application that does system monitoring in what equated to an exec bomb, with one of the VERY frequently migrated tasks being ps. %CPU PID USER CMD 99.3 45 root [migration/10] 97.7 53 root [migration/12] 97.0 57 root [migration/13] 90.1 49 root [migration/11] 89.6 65 root [migration/15] 88.7 17 root [migration/3] 80.4 37 root [migration/8] 78.1 41 root [migration/9] 44.2 13 root [migration/2] Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttledMike Galbraith
Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched,cgroup: Fix up task_groups listMike Galbraith
With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: stable@kernel.org # v3.3+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched: fix divide by zero at {thread_group,task}_timesStanislaw Gruszka
On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Cc: stable@vger.kernel.org Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchiesPeter Zijlstra
Peter Portante reported that for large cgroup hierarchies (and or on large CPU counts) we get immense lock contention on rq->lock and stuff stops working properly. His workload was a ton of processes, each in their own cgroup, everybody idling except for a sporadic wakeup once every so often. It was found that: schedule() idle_balance() load_balance() local_irq_save() double_rq_lock() update_h_load() walk_tg_tree(tg_load_down) tg_load_down() Results in an entire cgroup hierarchy walk under rq->lock for every new-idle balance and since new-idle balance isn't throttled this results in a lot of work while holding the rq->lock. This patch does two things, it removes the work from under rq->lock based on the good principle of race and pray which is widely employed in the load-balancer as a whole. And secondly it throttles the update_h_load() calculation to max once per jiffy. I considered excluding update_h_load() for new-idle balance all-together, but purely relying on regular balance passes to update this data might not work out under some rare circumstances where the new-idle busiest isn't the regular busiest for a while (unlikely, but a nightmare to debug if someone hits it and suffers). Cc: pjt@google.com Cc: Larry Woodman <lwoodman@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Reported-by: Peter Portante <pportant@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13rcu: Use smp_hotplug_thread facility for RCUs per-CPU kthreadPaul E. McKenney
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU kthread code to use the new smp_hotplug_thread facility. [ tglx: Adapted it to use callbacks and to the simplified rcu yield ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13watchdog: Use hotplug thread infrastructureThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20120716103948.563736676@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13softirq: Use hotplug thread infrastructureThomas Gleixner
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20120716103948.456416747@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13hotplug: Fix UP bug in smpboot hotplug codePaul E. McKenney
Because kernel subsystems need their per-CPU kthreads on UP systems as well as on SMP systems, the smpboot hotplug kthread functions must be provided in UP builds as well as in SMP builds. This commit therefore adds smpboot.c to UP builds and excludes irrelevant code via #ifdef. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13smpboot: Provide infrastructure for percpu hotplug threadsThomas Gleixner
Provide a generic interface for setting up and tearing down percpu threads. On registration the threads for already online cpus are created and started. On deregistration (modules) the threads are stoppped. During hotplug operations the threads are created, started, parked and unparked. The datastructure for registration provides a pointer to percpu storage space and optional setup, cleanup, park, unpark functions. These functions are called when the thread state changes. Each implementation has to provide a function which is queried and returns whether the thread should run and the thread function itself. The core code handles all state transitions and avoids duplicated code in the call sites. [ paulmck: Preemption leak fix ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20120716103948.352501068@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>