summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2012-09-23rcu: Move TINY_RCU quiescent state out of extended quiescent stateLi Zhong
TINY_RCU's rcu_idle_enter_common() invokes rcu_sched_qs() in order to inform the RCU core of the quiescent state implied by idle entry. Of course, idle is also an extended quiescent state, so that the call to rcu_sched_qs() speeds up RCU's invoking of any callbacks that might be queued. This speed-up is important when entering into dyntick-idle mode -- if there are no further scheduling-clock interrupts, the callbacks might never be invoked, which could result in a system hang. However, processing callbacks does event tracing, which in turn implies RCU read-side critical sections, which are illegal in extended quiescent states. This patch therefore moves the call to rcu_sched_qs() so that it precedes the point at which we inform lockdep that RCU has entered an extended quiescent state. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23time: RCU permitted to stop idle entry via softirqPaul E. McKenney
The can_stop_idle_tick() function complains if a softirq vector is raised too late in the idle-entry process, presumably in order to prevent dangling softirq invocations from being delayed across the full idle period, which might be indefinitely long -- and if softirq was asserted any later than the call to this function, such a delay might well happen. However, RCU needs to be able to use softirq to stop idle entry in order to be able to drain RCU callbacks from the current CPU, which in turn enables faster entry into dyntick-idle mode, which in turn reduces power consumption. Because RCU takes this action at a well-defined point in the idle-entry path, it is safe for RCU to take this approach. This commit therefore silences the error message that is sometimes produced when the going-idle CPU suddenly finds that it has an RCU_SOFTIRQ to process. The error message will continue to be issued for other softirq vectors. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Move TINY_PREEMPT_RCU away from raw_local_irq_save()Paul E. McKenney
The use of raw_local_irq_save() is unnecessary, given that local_irq_save() really does disable interrupts. Also, it appears to interfere with lockdep. Therefore, this commit moves to local_irq_save(). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com>
2012-09-23rcu: Remove redundant memory barrier from __call_rcu()Paul E. McKenney
The first memory barrier in __call_rcu() is supposed to order any updates done beforehand by the caller against the actual queuing of the callback. However, the second memory barrier (which is intended to order incrementing the queue lengths before queuing the callback) is also between the caller's updates and the queuing of the callback. The second memory barrier can therefore serve both purposes. This commit therefore removes the first memory barrier. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Avoid spurious RCU CPU stall warningsPaul E. McKenney
If a given CPU avoids the idle loop but also avoids starting a new RCU grace period for a full minute, RCU can issue spurious RCU CPU stall warnings. This commit fixes this issue by adding a check for ongoing grace period to avoid these spurious stall warnings. Reported-by: Becky Bruce <bgillbruce@gmail.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Protect rcu_node accesses during CPU stall warningsPaul E. McKenney
The print_other_cpu_stall() function accesses a number of rcu_node fields without protection from the ->lock. In theory, this is not a problem because the fields accessed are all integers, but in practice the compiler can get nasty. Therefore, the commit extends the existing critical section to cover the entire loop body. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23rcu: Avoid rcu_print_detail_task_stall_rnp() segfaultPaul E. McKenney
The rcu_print_detail_task_stall_rnp() function invokes rcu_preempt_blocked_readers_cgp() to verify that there are some preempted RCU readers blocking the current grace period outside of the protection of the rcu_node structure's ->lock. This means that the last blocked reader might exit its RCU read-side critical section and remove itself from the ->blkd_tasks list before the ->lock is acquired, resulting in a segmentation fault when the subsequent code attempts to dereference the now-NULL gp_tasks pointer. This commit therefore moves the test under the lock. This will not have measurable effect on lock contention because this code is invoked only when printing RCU CPU stall warnings, in other words, in the common case, never. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23rcu: Apply for_each_rcu_flavor() to increment_cpu_stall_ticks()Paul E. McKenney
The increment_cpu_stall_ticks() function listed each RCU flavor explicitly, with an ifdef to handle preemptible RCU. This commit therefore applies for_each_rcu_flavor() to save a line of code. Because this commit switches from a code-based enumeration of the flavors of RCU to an rcu_state-list-based enumeration, it is no longer possible to apply __get_cpu_var() to the per-CPU rcu_data structures. We instead use __this_cpu_var() on the rcu_state structure's ->rda field that references the corresponding rcu_data structures. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23rcu: Fix obsolete rcu_initiate_boost() header commentPaul E. McKenney
Commit 1217ed1b (rcu: permit rcu_read_unlock() to be called while holding runqueue locks) made rcu_initiate_boost() restore irq state when releasing the rcu_node structure's ->lock, but failed to update the header comment accordingly. This commit therefore brings the header comment up to date. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Make offline-CPU checking allow for indefinite delaysPaul E. McKenney
The rcu_implicit_offline_qs() function implicitly assumed that execution would progress predictably when interrupts are disabled, which is of course not guaranteed when running on a hypervisor. Furthermore, this function is short, and is called from one place only in a short function. This commit therefore ensures that the timing is checked before checking the condition, which guarantees correct behavior even given indefinite delays. It also inlines rcu_implicit_offline_qs() into rcu_implicit_dynticks_qs(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Improve boost selection when moving tasks to root rcu_nodePaul E. McKenney
The rcu_preempt_offline_tasks() moves all tasks queued on a given leaf rcu_node structure to the root rcu_node, which is done when the last CPU corresponding the the leaf rcu_node structure goes offline. Now that RCU-preempt's synchronize_rcu_expedited() implementation blocks CPU-hotplug operations during the initialization of each rcu_node structure's ->boost_tasks pointer, rcu_preempt_offline_tasks() can do a better job of setting the root rcu_node's ->boost_tasks pointer. The key point is that rcu_preempt_offline_tasks() runs as part of the CPU-hotplug process, so that a concurrent synchronize_rcu_expedited() is guaranteed to either have not started on the one hand (in which case there is no boosting on behalf of the expedited grace period) or to be completely initialized on the other (in which case, in the absence of other priority boosting, all ->boost_tasks pointers will be initialized). Therefore, if rcu_preempt_offline_tasks() finds that the ->boost_tasks pointer is equal to the ->exp_tasks pointer, it can be sure that it is correctly placed. In the case where there was boosting ongoing at the time that the synchronize_rcu_expedited() function started, different nodes might start boosting the tasks blocking the expedited grace period at different times. In this mixed case, the root node will either be boosting tasks for the expedited grace period already, or it will start as soon as it gets done boosting for the normal grace period -- but in this latter case, the root node's tasks needed to be boosted in any case. This commit therefore adds a check of the ->boost_tasks pointer against the ->exp_tasks pointer to the list that prevents updating ->boost_tasks. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Permit RCU_NONIDLE() to be used from interrupt contextPaul E. McKenney
There is a need to use RCU from interrupt context, but either before rcu_irq_enter() is called or after rcu_irq_exit() is called. If the interrupt occurs from idle, then lockdep-RCU will complain about such uses, as they appear to be illegal uses of RCU from the idle loop. In other environments, RCU_NONIDLE() could be used to properly protect the use of RCU, but RCU_NONIDLE() currently cannot be invoked except from process context. This commit therefore modifies RCU_NONIDLE() to permit its use more globally. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23rcu: Properly initialize ->boost_tasks on CPU offlinePaul E. McKenney
When rcu_preempt_offline_tasks() clears tasks from a leaf rcu_node structure, it does not NULL out the structure's ->boost_tasks field. This commit therefore fixes this issue. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Pull TINY_RCU dyntick-idle tracing into non-idle regionPaul E. McKenney
Because TINY_RCU's idle detection keys directly off of the nesting level, rather than from a separate variable as in TREE_RCU, the TINY_RCU dyntick-idle tracing on transition to idle must happen before the change to the nesting level. This commit therefore makes this change by passing the desired new value (rather than the old value) of the nesting level in to rcu_idle_enter_common(). [ paulmck: Add fix for wrong-variable bug spotted by Michael Wang <wangyun@linux.vnet.ibm.com>. ] Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Add PROVE_RCU_DELAY to provoke difficult racesPaul E. McKenney
There have been some recent bugs that were triggered only when preemptible RCU's __rcu_read_unlock() was preempted just after setting ->rcu_read_lock_nesting to INT_MIN, which is a low-probability event. Therefore, reproducing those bugs (to say nothing of gaining confidence in alleged fixes) was quite difficult. This commit therefore creates a new debug-only RCU kernel config option that forces a short delay in __rcu_read_unlock() to increase the probability of those sorts of bugs occurring. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23rcu: Fix day-one dyntick-idle stall-warning bugPaul E. McKenney
Each grace period is supposed to have at least one callback waiting for that grace period to complete. However, if CONFIG_NO_HZ=n, an extra callback-free grace period is no big problem -- it will chew up a tiny bit of CPU time, but it will complete normally. In contrast, CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to sleep indefinitely, in turn indefinitely delaying completion of the callback-free grace period. Given that nothing is waiting on this grace period, this is also not a problem. That is, unless RCU CPU stall warnings are also enabled, as they are in recent kernels. In this case, if a CPU wakes up after at least one minute of inactivity, an RCU CPU stall warning will result. The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute. But there are some embedded systems with cut-down userspace configurations that consistently get into this situation. All this begs the question of exactly how a callback-free grace period gets started in the first place. This can happen due to the fact that CPUs do not necessarily agree on which grace period is in progress. If a CPU still believes that the grace period that just completed is still ongoing, it will believe that it has callbacks that need to wait for another grace period, never mind the fact that the grace period that they were waiting for just completed. This CPU can therefore erroneously decide to start a new grace period. Note that this can happen in TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system: Deadlock considerations mean that the CPU that detected the end of the grace period is not necessarily officially informed of this fact for some time. Once this CPU notices that the earlier grace period completed, it will invoke its callbacks. It then won't have any callbacks left. If no other CPU has any callbacks, we now have a callback-free grace period. This commit therefore makes CPUs check more carefully before starting a new grace period. This new check relies on an array of tail pointers into each CPU's list of callbacks. If the CPU is up to date on which grace periods have completed, it checks to see if any callbacks follow the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks follow the RCU_WAIT_TAIL segment. The reason that this works is that the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment as soon as the CPU is officially notified that the old grace period has ended. This change is to cpu_needs_another_gp(), which is called in a number of places. The only one that really matters is in rcu_start_gp(), where the root rcu_node structure's ->lock is held, which prevents any other CPU from starting or completing a grace period, so that the comparison that determines whether the CPU is missing the completion of a grace period is stable. Reported-by: Becky Bruce <bgillbruce@gmail.com> Reported-by: Subodh Nijsure <snijsure@grid-net.com> Reported-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paul Walmsley <paul@pwsan.com> # OMAP3730, OMAP4430 Cc: stable@vger.kernel.org
2012-09-01time: Move ktime_t overflow checking into timespec_valid_strictJohn Stultz
Andreas Bombe reported that the added ktime_t overflow checking added to timespec_valid in commit 4e8b14526ca7 ("time: Improve sanity checking of timekeeping inputs") was causing problems with X.org because it caused timeouts larger then KTIME_T to be invalid. Previously, these large timeouts would be clamped to KTIME_MAX and would never expire, which is valid. This patch splits the ktime_t overflow checking into a new timespec_valid_strict function, and converts the timekeeping codes internal checking to use this more strict function. Reported-and-tested-by: Andreas Bombe <aeb@debian.org> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-23Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This tree contains misc fixlets: a perf script python binding fix, a uprobes fix and a syscall tracing fix." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf tools: Add missing files to build the python binding uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
2012-08-23Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Mostly small fixes for the fallout of the timekeeping overhaul in 3.6 along with stable fixes to address an accumulation problem and missing sanity checks for RTC readouts and user space provided values." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Avoid making adjustments if we haven't accumulated anything time: Avoid potential shift overflow with large shift values time: Fix casting issue in timekeeping_forward_now time: Ensure we normalize the timekeeper in tk_xtime_add time: Improve sanity checking of timekeeping inputs
2012-08-22time: Avoid making adjustments if we haven't accumulated anythingJohn Stultz
If update_wall_time() is called and the current offset isn't large enough to accumulate, avoid re-calling timekeeping_adjust which may change the clock freq and can cause 1ns inconsistencies with CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22time: Avoid potential shift overflow with large shift valuesJohn Stultz
Andreas Schwab noticed that the 1 << tk->shift could overflow if the shift value was greater than 30, since 1 would be a 32bit long on 32bit architectures. This issue was introduced by 1e75fa8be (time: Condense timekeeper.xtime into xtime_sec) Use 1ULL instead to ensure we don't overflow on the shift. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22time: Fix casting issue in timekeeping_forward_nowAndreas Schwab
arch_gettimeoffset returns a u32 value which when shifted by tk->shift can overflow. This issue was introduced with 1e75fa8be (time: Condense timekeeper.xtime into xtime_sec) Cast it to u64 first. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22time: Ensure we normalize the timekeeper in tk_xtime_addJohn Stultz
Andreas noticed problems with resume on specific hardware after commit 1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined with commit b44d50dca (time: Fix casting issue in tk_set_xtime and tk_xtime_add) After some digging I realized we aren't normalizing the timekeeper after the add. Add the missing normalize call. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Tested-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21Merge branch 'audit-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull audit-tree fixes from Miklos Szeredi: "The audit subsystem maintainers (Al and Eric) are not responding to repeated resends. Eric did ack them a while ago, but no response since then. So I'm sending these directly to you." * 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: audit: clean up refcounting in audit-tree audit: fix refcounting in audit-tree audit: don't free_chunk() after fsnotify_add_mark()
2012-08-21task_work: add a scheduling point in task_work_run()Eric Dumazet
It seems commit 4a9d4b024a31 ("switch fput to task_work_add") re- introduced the problem addressed in 944be0b22472 ("close_files(): add scheduling point") If a server process with a lot of files (say 2 million tcp sockets) is killed, we can spend a lot of time in task_work_run() and trigger a soft lockup. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21Merge branch 'tip/perf/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull syscall tracing fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() failsOleg Nesterov
This patch fixes: https://bugzilla.redhat.com/show_bug.cgi?id=843640 If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path does unmap_region() but does not remove the soon-to-be-freed vma from rb tree. Actually there are more problems but this is how William noticed this bug. Perhaps we could do do_munmap() + return in this case, but in fact it is simply wrong to abort if uprobe_mmap() fails. Until at least we move the !UPROBE_COPY_INSN code from install_breakpoint() to uprobe_register(). For example, uprobe_mmap()->install_breakpoint() can fail if the probed insn is not supported (remember, uprobe_register() succeeds if nobody mmaps inode/offset), mmap() should not fail in this case. dup_mmap()->uprobe_mmap() is wrong too by the same reason, fork() can race with uprobe_register() and fail for no reason if it wins the race and does install_breakpoint() first. And, if nothing else, both mmap_region() and dup_mmap() return success if uprobe_mmap() fails. Change them to ignore the error code from uprobe_mmap(). Reported-and-tested-by: William Cohen <wcohen@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> # v3.5 Cc: Anton Arapov <anton@redhat.com> Cc: William Cohen <wcohen@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-20Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix migration thread runtime bogosity sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled sched,cgroup: Fix up task_groups list sched: fix divide by zero at {thread_group,task}_times sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
2012-08-19Merge branch 'alpha' (alpha architecture patches)Linus Torvalds
Merge alpha architecture update from Michael Cree: "The Alpha Maintainer, Matt Turner, is currently unavailable, so I have collected up patches that have been posted to the linux-alpha mailing list over the last couple of months, and are forwarding them to you in the hope that you are prepared to accept them via me. The patches by Al Viro and myself I have been running against kernels for two months now so have had quite a bit of testing. All except one patch were intended for the 3.5 kernel but because of Matt's unavailability never got forwarded to you." * emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits) alpha: Fix fall-out from disintegrating asm/system.h Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts alpha: fix fpu.h usage in userspace alpha/mm/fault.c: Port OOM changes to do_page_fault alpha: take kernel_execve() out of entry.S alpha: take a bunch of syscalls into osf_sys.c alpha: Use new generic strncpy_from_user() and strnlen_user() alpha: Wire up cross memory attach syscalls alpha: Don't export SOCK_NONBLOCK to user space.
2012-08-19alpha: take a bunch of syscalls into osf_sys.cAl Viro
New helper: current_thread_info(). Allows to do a bunch of odd syscalls in C. While we are at it, there had never been a reason to do osf_getpriority() in assembler. We also get "namespace"-aware (read: consistent with getuid(2), etc.) behaviour from getx?id() syscalls now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Michael Cree <mcree@orcon.net.nz> Acked-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-17tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1Will Deacon
syscall_get_nr can return -1 in the case that the task is not executing a system call. This patch fixes perf_syscall_{enter,exit} to check that the syscall number is valid before using it as an index into a bitmap. Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com Cc: Jason Baron <jbaron@redhat.com> Cc: Wade Farnsworth <wade_farnsworth@mentor.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-08-15time: Improve sanity checking of timekeeping inputsJohn Stultz
Unexpected behavior could occur if the time is set to a value large enough to overflow a 64bit ktime_t (which is something larger then the year 2262). Also unexpected behavior could occur if large negative offsets are injected via adjtimex. So this patch improves the sanity check timekeeping inputs by improving the timespec_valid() check, and then makes better use of timespec_valid() to make sure we don't set the time to an invalid negative value or one that overflows ktime_t. Note: This does not protect from setting the time close to overflowing ktime_t and then letting natural accumulation cause the overflow. Reported-by: CAI Qian <caiqian@redhat.com> Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-15audit: clean up refcounting in audit-treeMiklos Szeredi
Drop the initial reference by fsnotify_init_mark early instead of audit_tree_freeing_mark() at destroy time. In the cases we destroy the mark before we drop the initial reference we need to get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-08-15audit: fix refcounting in audit-treeMiklos Szeredi
Refcounting of fsnotify_mark in audit tree is broken. E.g: refcount create_chunk alloc_chunk 1 fsnotify_add_mark 2 untag_chunk fsnotify_get_mark 3 fsnotify_destroy_mark audit_tree_freeing_mark 2 fsnotify_put_mark 1 fsnotify_put_mark 0 via destroy_list fsnotify_mark_destroy -1 This was reported by various people as triggering Oops when stopping auditd. We could just remove the put_mark from audit_tree_freeing_mark() but that would break freeing via inode destruction. So this patch simply omits a put_mark after calling destroy_mark or adds a get_mark before. The additional get_mark is necessary where there's no other put_mark after fsnotify_destroy_mark() since it assumes that the caller is holding a reference (or the inode is keeping the mark pinned, not the case here AFAICS). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Valentin Avram <aval13@gmail.com> Reported-by: Peter Moody <pmoody@google.com> Acked-by: Eric Paris <eparis@redhat.com> CC: stable@vger.kernel.org
2012-08-15audit: don't free_chunk() after fsnotify_add_mark()Miklos Szeredi
Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref via the destroy list and this results in use-after-free. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Eric Paris <eparis@redhat.com> CC: stable@vger.kernel.org
2012-08-13sched: Fix migration thread runtime bogosityMike Galbraith
Make stop scheduler class do the same accounting as other classes, Migration threads can be caught in the act while doing exec balancing, leading to the below due to use of unmaintained ->se.exec_start. The load that triggered this particular instance was an apparently out of control heavily threaded application that does system monitoring in what equated to an exec bomb, with one of the VERY frequently migrated tasks being ps. %CPU PID USER CMD 99.3 45 root [migration/10] 97.7 53 root [migration/12] 97.0 57 root [migration/13] 90.1 49 root [migration/11] 89.6 65 root [migration/15] 88.7 17 root [migration/3] 80.4 37 root [migration/8] 78.1 41 root [migration/9] 44.2 13 root [migration/2] Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttledMike Galbraith
Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched,cgroup: Fix up task_groups listMike Galbraith
With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: stable@kernel.org # v3.3+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched: fix divide by zero at {thread_group,task}_timesStanislaw Gruszka
On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Cc: stable@vger.kernel.org Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchiesPeter Zijlstra
Peter Portante reported that for large cgroup hierarchies (and or on large CPU counts) we get immense lock contention on rq->lock and stuff stops working properly. His workload was a ton of processes, each in their own cgroup, everybody idling except for a sporadic wakeup once every so often. It was found that: schedule() idle_balance() load_balance() local_irq_save() double_rq_lock() update_h_load() walk_tg_tree(tg_load_down) tg_load_down() Results in an entire cgroup hierarchy walk under rq->lock for every new-idle balance and since new-idle balance isn't throttled this results in a lot of work while holding the rq->lock. This patch does two things, it removes the work from under rq->lock based on the good principle of race and pray which is widely employed in the load-balancer as a whole. And secondly it throttles the update_h_load() calculation to max once per jiffy. I considered excluding update_h_load() for new-idle balance all-together, but purely relying on regular balance passes to update this data might not work out under some rare circumstances where the new-idle busiest isn't the regular busiest for a while (unlikely, but a nightmare to debug if someone hits it and suffers). Cc: pjt@google.com Cc: Larry Woodman <lwoodman@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Reported-by: Peter Portante <pportant@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-12Merge tag 'pm-for-3.6-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael J. Wysocki: - Fix for two recent regressions in the generic PM domains framework. - Revert of a commit that introduced a resume regression and is conceptually incorrect in my opinion. - Fix for a return value in pcc-cpufreq.c from Julia Lawall. - RTC wakeup signaling fix from Neil Brown. - Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI, platform/x86 and TPM drivers. * tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: tpm_tis / PM: Fix unused function warning for CONFIG_PM_SLEEP platform / x86 / PM: Fix unused function warnings for CONFIG_PM_SLEEP ACPI / PM: Fix unused function warnings for CONFIG_PM_SLEEP Revert "NMI watchdog: fix for lockup detector breakage on resume" PM: Make dev_pm_get_subsys_data() always return 0 on success drivers/cpufreq/pcc-cpufreq.c: fix error return code RTC: Avoid races between RTC alarm wakeup and suspend.
2012-08-12printk: Fix calculation of length used to discard recordsJeff Mahoney
While tracking down a weird buffer overflow issue in a program that looked to be sane, I started double checking the length returned by syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing the buffer. Sure enough, it was. I saw this in strace: 11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279 It turns out that the loops that calculate how much space the entries will take when they're copied don't include the newlines and prefixes that will be included in the final output since prev flags is passed as zero. This patch properly accounts for it and fixes the overflow. CC: stable@kernel.org Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-08Revert "NMI watchdog: fix for lockup detector breakage on resume"Rafael J. Wysocki
Revert commit 45226e9 (NMI watchdog: fix for lockup detector breakage on resume) which breaks resume from system suspend on my SH7372 Mackerel board (by causing a NULL pointer dereference to happen) and is generally wrong, because it abuses the CPU hotplug functionality in a shamelessly blatant way. The original issue should be addressed through appropriate syscore resume callback instead. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-08-05time: Fix adjustment cleanup bug in timekeeping_adjust()Ingo Molnar
Tetsuo Handa reported that sporadically the system clock starts counting up too quickly which is enough to confuse the hangcheck timer to print a bogus stall warning. Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling timekeeping_adjust" overlooked this exit path: } else return; which should really be a proper exit sequence, fixing the bug as a side effect. Also make the flow more readable by properly balancing curly braces. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: john.stultz@linaro.org Cc: a.p.zijlstra@chello.nl Cc: richardcochran@gmail.com Cc: prarit@redhat.com Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-03Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fixes from Ingo Molnar: "A couple of futex fixes from Darren Hart: two bugs reported by Dave Jones (found with his trinity test) and Dan Carpenter through static analysis. The third found while debugging the first two." * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi() futex: Fix bug in WARN_ON for NULL q.pi_state futex: Test for pi_mutex on fault in futex_wait_requeue_pi()
2012-08-03Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "One regression fix, and a couple of cleanups that clean up the code flow in areas that had high-profile bugs recently." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Remove all direct references to timekeeper time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates time: Clean up stray newlines time/jiffies: Rename ACTHZ to SHIFTED_HZ time/jiffies: Allow CLOCK_TICK_RATE to be undefined time: Fix casting issue in tk_set_xtime and tk_xtime_add
2012-08-03Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fixes and two late cleanups" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cleanups: Add load balance cpumask pointer to 'struct lb_env' sched: Fix comment about PREEMPT_ACTIVE bit location sched: Fix minor code style issues sched: Use task_rq_unlock() in __sched_setscheduler() sched/numa: Add SD_PERFER_SIBLING to CPU domain
2012-08-03Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Fix merge window fallout and fix sleep profiling (this was always broken, so it's not a fix for the merge window - we can skip this one from the head of the tree)." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/trace: Add ability to set a target task for events perf/x86: Fix USER/KERNEL tagging of samples properly perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
2012-08-03Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar. * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Allow irq chips to mark themself oneshot safe
2012-08-03Merge tag 'for_linux-3.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb Pull KGDB/KDB/usb-dbgp fixes and cleanups from Jason Wessel: "There are no new features, those will be delayed to the 3.7 window. There are only fixes/cleanup against the usual kernel churn and we are removing more lines than we add: - usb-dbgp - increase the controller wait time to come out of halt. - kdb - Remove unused KDB_FLAG_ONLY_DO_DUMP code and cpu in more prompt - debug core - pass NMI type on archs that provide NMI types" * tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb: USB: echi-dbgp: increase the controller wait time to come out of halt. kernel/debug: Make use of KGDB_REASON_NMI kdb: Remove cpu from the more prompt kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP