summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2013-06-25futex: Take hugepages into account when generating futex_keyZhang Yi
The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn> Reviewed-by: 'Mel Gorman' <mgorman@suse.de> Acked-by: 'Darren Hart' <dvhart@linux.intel.com> Cc: 'Peter Zijlstra' <peterz@infradead.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-25cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchyTejun Heo
Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"), hierarchy IDs were allocated from 0. As the dummy hierarchy was always the one first initialized, it got assigned 0 and all other hierarchies from 1. The patch accidentally changed the minimum useable ID to 2. Let's restore ID 0 for dummy_root and while at it reserve 1 for unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2013-06-25cgroup: implement for_each_[builtin_]subsys()Tejun Heo
There are quite a few places where all loaded [builtin] subsys are iterated. Implement for_each_[builtin_]subsys() and replace manual iterations with those to simplify those places a bit. The new iterators automatically skip NULL subsystems. This shouldn't cause any functional difference. Iteration loops which scan all subsystems and then skipping modular ones explicitly are converted to use for_each_builtin_subsys(). While at it, reorder variable declarations and adjust whitespaces a bit in the affected functions. v2: Add lockdep_assert_held() in for_each_subsys() and add comments about synchronization as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25cgroup: move init_css_set initialization inside cgroup_mutexTejun Heo
cgroup_init() was doing init_css_set initialization outside cgroup_mutex, which is fine but we want to add lockdep annotation on subsystem iterations and cgroup_init() will trigger it spuriously. Move init_css_set initialization inside cgroup_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25irqdomain: Use irq_get_trigger_type() to get IRQ flagsJavier Martinez Canillas
Use irq_get_trigger_type() to get the IRQ trigger type flags instead calling irqd_get_trigger_type(irq_desc_get_irq_data(virq)) Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Acked-by: Grant Likely <grant.likely@linaro.org> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@linux-mips.org Link: http://lkml.kernel.org/r/1371228049-27080-8-git-send-email-javier.martinez@collabora.co.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-24cgroup: s/for_each_subsys()/for_each_root_subsys()/Tejun Heo
for_each_subsys() walks over subsystems attached to a hierarchy and we're gonna add iterators which walk over all available subsystems. Rename for_each_subsys() to for_each_root_subsys() so that it's more appropriately named and for_each_subsys() can be used to iterate all subsystems. While at it, remove unnecessary underbar prefix from macro arguments, put them inside parentheses, and adjust indentation for the two for_each_*() macros. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: clean up find_css_set() and friendsTejun Heo
find_css_set() passes uninitialized on-stack template[] array to find_existing_css_set() which sets the entries for all subsystems. Passing around an uninitialized array is a bit icky and we want to introduce an iterator which only iterates loaded subsystems. Let's initialize it on definition. While at it, also make the following cosmetic cleanups. * Convert to proper /** comments. * Reorder variable declarations. * Replace comment on synchronization with lockdep_assert_held(). This patch doesn't make any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: remove cgroup->actual_subsys_maskTejun Heo
cgroup curiously has two subsystem masks, ->subsys_mask and ->actual_subsys_mask. The latter only exists because the new target subsys_mask is passed into rebind_subsystems() via @root>subsys_mask. rebind_subsystems() needs to know what the current mask is to decide how to reach the target mask so ->actual_subsys_mask is used as the temp location to remember the current state. Adding a temporary field to a permanent data structure is rather silly and can be misleading. Update rebind_subsystems() to take @added_mask and @removed_mask instead and remove @root->actual_subsys_mask. This patch shouldn't introduce any behavior changes. v2: Comment and description updated as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24cgroup: prefix global variables with "cgroup_"Tejun Heo
Global variable names in kernel/cgroup.c are asking for trouble - subsys, roots, rootnode and so on. Rename them to have "cgroup_" prefix. * s/subsys/cgroup_subsys/ * s/rootnode/cgroup_dummy_root/ * s/dummytop/cgroup_cummy_top/ * s/roots/cgroup_roots/ * s/root_count/cgroup_root_count/ This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24Merge 3.10-rc7 into driver-core-nextGreg Kroah-Hartman
We want the firmware merge fixes, and other bits, in here now. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-24clockevents: Prefer CPU local devices over global devicesStephen Boyd
On an SMP system with only one global clockevent and a dummy clockevent per CPU we run into problems. We want the dummy clockevents to be registered as the per CPU tick devices, but we can only achieve that if we register the dummy clockevents before the global clockevent or if we artificially inflate the rating of the dummy clockevents to be higher than the rating of the global clockevent. Failure to do so leads to boot hangs when the dummy timers are registered on all other CPUs besides the CPU that accepted the global clockevent as its tick device and there is no broadcast timer to poke the dummy devices. If we're registering multiple clockevents and one clockevent is global and the other is local to a particular CPU we should choose to use the local clockevent regardless of the rating of the device. This way, if the clockevent is a dummy it will take the tick device duty as long as there isn't a higher rated tick device and any global clockevent will be bumped out into broadcast mode, fixing the problem described above. Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Tested-by: soren.brinkmann@xilinx.com Cc: John Stultz <john.stultz@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: linux-arm-kernel@lists.infradead.org Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/20130613183950.GA32061@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-24genirq: Irqchip: document gcflags arg of irq_alloc_domain_generic_chipsJames Hogan
Commit 088f40b7b027dad6519712ff224a5798dd62a204 ("genirq: Generic chip: Add linear irq domain support") missed kerneldoc for the gcflags argument of irq_alloc_domain_generic_chips(). Add it now. Signed-off-by: James Hogan <james.hogan@imgtec.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Link: http://lkml.kernel.org/r/1371564513-4327-1-git-send-email-james.hogan@imgtec.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-24irq: fix checkpatch errorKefeng Wang
ERROR: space required before the open parenthesis '(' WARNING: Prefer pr_warn(... to pr_warning(... Just fix above 2 issue. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Grant Likely <grant.likely@linaro.org>
2013-06-24irqdomain: Include hwirq number in /proc/interruptsGrant Likely
Add the hardware interrupt number to the output of /proc/interrupts. It is often important to have access to the hardware interrupt number because it identifies exactly how an interrupt signal is wired up to the interrupt controller. This is especially important when using irq_domains since irq numbers get dynamically allocated in that case, and have no relation to the actual hardware number. Note: This output is currently conditional on whether or not the irq_domain pointer is set; however hwirq could still be used without irq_domain. It may be worthwhile to always output the hwirq number regardless of the domain pointer. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Tested-by: Olof Johansson <olof@lixom.net> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-06-24irqdomain: make irq_linear_revmap() a fast path againGrant Likely
Over the years, irq_linear_revmap() gained tests and checks to make sure callers were using it safely, which while important, also make it less of a fast path. After the irqdomain refactoring done recently, it is now possible to make irq_linear_revmap() a fast path again. This patch moves irq_linear_revmap() to the header file and makes it a static inline so that interrupt controller drivers using a linear mapping can decode the virq from a hwirq in just a couple of instructions. Signed-off-by: Grant Likely <grant.likely@linaro.org>
2013-06-24irqdomain: remove irq_domain_generate_simple()Grant Likely
Nobody calls it; remove the function Signed-off-by: Grant Likely <grant.likely@linaro.org>
2013-06-24irqdomain: Refactor irq_domain_associate_many()Grant Likely
Originally, irq_domain_associate_many() was designed to unwind the mapped irqs on a failure of any individual association. However, that proved to be a problem with certain IRQ controllers. Some of them only support a subset of irqs, and will fail when attempting to map a reserved IRQ. In those cases we want to map as many IRQs as possible, so instead it is better for irq_domain_associate_many() to make a best-effort attempt to map irqs, but not fail if any or all of them don't succeed. If a caller really cares about how many irqs got associated, then it should instead go back and check that all of the irqs is cares about were mapped. The original design open-coded the individual association code into the body of irq_domain_associate_many(), but with no longer needing to unwind associations, the code becomes simpler to split out irq_domain_associate() to contain the bulk of the logic, and irq_domain_associate_many() to be a simple loop wrapper. This patch also adds a new error check to the associate path to make sure it isn't called for an irq larger than the controller can handle, and adds locking so that the irq_domain_mutex is held while setting up a new association. v3: Fixup missing change to irq_domain_add_tree() v2: Fixup x86 warning. irq_domain_associate_many() no longer returns an error code, but reports errors to the printk log directly. In the majority of cases we don't actually want to fail if there is a problem, but rather log it and still try to boot the system. Signed-off-by: Grant Likely <grant.likely@linaro.org> irqdomain: Fix flubbed irq_domain_associate_many refactoring commit d39046ec72, "irqdomain: Refactor irq_domain_associate_many()" was missing the following hunk which causes a boot failure on anything using irq_domain_add_tree() to allocate an irq domain. Signed-off-by: Grant Likely <grant.likely@linaro.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>, Cc: Thomas Gleixner <tglx@linutronix.de>, Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2013-06-24PM / QoS: Add pm_qos_request tracepointsSahara
Adds tracepoints to pm_qos_add_request, pm_qos_update_request, pm_qos_remove_request, and pm_qos_update_request_timeout. It's useful for checking pm_qos_class, value, and timeout_us. Signed-off-by: Sahara <keun-o.park@windriver.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-24PM / QoS: Add pm_qos_update_target/flags tracepointsSahara
This patch adds tracepoints to pm_qos_update_target and pm_qos_update_flags. It's useful for checking pm qos action, previous value and current value. Signed-off-by: Sahara <keun-o.park@windriver.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-23perf: Drop sample rate when sampling is too slowDave Hansen
This patch keeps track of how long perf's NMI handler is taking, and also calculates how many samples perf can take a second. If the sample length times the expected max number of samples exceeds a configurable threshold, it drops the sample rate. This way, we don't have a runaway sampling process eating up the CPU. This patch can tend to drop the sample rate down to level where perf doesn't work very well. *BUT* the alternative is that my system hangs because it spends all of its time handling NMIs. I'll take a busted performance tool over an entire system that's busted and undebuggable any day. BTW, my suspicion is that there's still an underlying bug here. Using the HPET instead of the TSC is definitely a contributing factor, but I suspect there are some other things going on. But, I can't go dig down on a bug like that with my machine hanging all the time. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: acme@ghostprotocols.net Cc: Dave Hansen <dave@sr71.net> [ Prettified it a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-21Merge branch 'x86/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: "This series fixes a couple of build failures, and fixes MTRR cleanup and memory setup on very specific memory maps. Finally, it fixes triggering backtraces on all CPUs, which was inadvertently disabled on x86." * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/efi: Fix dummy variable buffer allocation x86: Fix trigger_all_cpu_backtrace() implementation x86: Fix section mismatch on load_ucode_ap x86: fix build error and kconfig for ia32_emulation and binfmt range: Do not add new blank slot with add_range_with_merge x86, mtrr: Fix original mtrr range get for mtrr_cleanup
2013-06-21tick: Fix tick_broadcast_pending_mask not clearedDaniel Lezcano
The recent modification in the cpuidle framework consolidated the timer broadcast code across the different drivers by setting a new flag in the idle state. It tells the cpuidle core code to enter/exit the broadcast mode for the cpu when entering a deep idle state. The broadcast timer enter/exit is no longer handled by the back-end driver. This change made the local interrupt to be enabled *before* calling CLOCK_EVENT_NOTIFY_EXIT. On a tegra114, a four cores system, when the flag has been introduced in the driver, the following warning appeared: WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15 [<c00667f8>] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [<c0065cd0>] (tick_notify+0x240/0x40c) [<c0065cd0>] (tick_notify+0x240/0x40c) from [<c0044724>] (notifier_call_chain+0x44/0x84) [<c0044724>] (notifier_call_chain+0x44/0x84) from [<c0044828>] (raw_notifier_call_chain+0x18/0x20) [<c0044828>] (raw_notifier_call_chain+0x18/0x20) from [<c00650cc>] (clockevents_notify+0x28/0x170) [<c00650cc>] (clockevents_notify+0x28/0x170) from [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) from [<c000ea94>] (arch_cpu_idle+0x8/0x38) [<c000ea94>] (arch_cpu_idle+0x8/0x38) from [<c005ea80>] (cpu_startup_entry+0x60/0x134) [<c005ea80>] (cpu_startup_entry+0x60/0x134) from [<804fe9a4>] (0x804fe9a4) I don't have the hardware, so I wasn't able to reproduce the warning but after looking a while at the code, I deduced the following: 1. the CPU2 enters a deep idle state and sets the broadcast timer 2. the timer expires, the tick_handle_oneshot_broadcast function is called, setting the tick_broadcast_pending_mask and waking up the idle cpu CPU2 3. the CPU2 exits idle handles the interrupt and then invokes tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT which runs the following code: [...] if (dev->next_event.tv64 == KTIME_MAX) goto out; if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_pending_mask)) goto out; [...] So if there is no next event scheduled for CPU2, we fulfil the first condition and jump out without clearing the tick_broadcast_pending_mask. 4. CPU2 goes to deep idle again and calls tick_broadcast_oneshot_control with CLOCK_NOTIFY_EVENT_ENTER but with the tick_broadcast_pending_mask set for CPU2, triggering the warning. The issue only surfaced due to the modifications of the cpuidle framework, which resulted in interrupts being enabled before the call to the clockevents code. If the call happens before interrupts have been enabled, the warning cannot trigger, because there is still the event pending which caused the broadcast timer expiry. Move the check for the next event below the check for the pending bit, so the pending bit gets cleared whether an event is scheduled on the cpu or not. [ tglx: Massaged changelog ] Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reported-and-tested-by: Joseph Lo <josephl@nvidia.com> Cc: Stephen Warren <swarren@nvidia.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linaro-kernel@lists.linaro.org Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-21PM / Sleep: Print last wakeup source on failed wakeup_count writeJulius Werner
Commit a938da06 introduced a useful little log message to tell users/debuggers which wakeup source aborted a suspend. However, this message is only printed if the abort happens during the in-kernel suspend path (after writing /sys/power/state). The full specification of the /sys/power/wakeup_count facility allows user-space power managers to double-check if wakeups have already happened before it actually tries to suspend (e.g. while it was running user-space pre-suspend hooks), by writing the last known wakeup_count value to /sys/power/wakeup_count. This patch changes the sysfs handler for that node to also print said log message if that write fails, so that we can figure out the offending wakeup source for both kinds of suspend aborts. Signed-off-by: Julius Werner <jwerner@chromium.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-21PM / QoS: correct the valid range of pm_qos_classSahara
The valid start index for pm_qos_array is not 0, but PM_QOS_CPU_DMA_LATENCY. There is a null_pm_qos at index 0 of pm_qos_array. However, null_pm_qos is not created as misc device so that inclusion of 0 index for checking pm_qos_class especially for file operations is not proper here. [rjw: Changelog, a bit] Signed-off-by: Sahara <keun-o.park@windriver.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-20Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Two smaller fixes - plus a context tracking tracing fix that is a bit bigger" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing/context-tracking: Add preempt_schedule_context() for tracing sched: Fix clear NOHZ_BALANCE_KICK sched/x86: Construct all sibling maps if smt
2013-06-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Four fixes. The mmap ones are unfortunately larger than desired - fuzzing uncovered bugs that needed perf context life time management changes to fix properly" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix broken PEBS-LL support on SNB-EP/IVB-EP perf: Fix mmap() accounting hole perf: Fix perf mmap bugs kprobes: Fix to free gone and unused optprobes
2013-06-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu idle fixes from Thomas Gleixner: - Add a missing irq enable. Fallout of the idle conversion - Fix stackprotector wreckage caused by the idle conversion * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: idle: Enable interrupts in the weak arch_cpu_idle() implementation idle: Add the stack canary init to cpu_startup_entry()
2013-06-20Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: - Fix inconstinant clock usage in virtual time accounting - Fix a build error in KVM caused by the NOHZ work - Remove a pointless timekeeping duty assignment which breaks NOHZ - Use a proper notifier return value to avoid random behaviour * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick: Remove useless timekeeping duty attribution to broadcast source nohz: Fix notifier return val that enforce timekeeping kvm: Move guest entry/exit APIs to context_tracking vtime: Use consistent clocks among nohz accounting
2013-06-20hw_breakpoint: Introduce "struct bp_cpuinfo"Oleg Nesterov
This patch simply moves all per-cpu variables into the new single per-cpu "struct bp_cpuinfo". To me this looks more logical and clean, but this can also simplify the further potential changes. In particular, I do not think this memory should be per-cpu, it is never used "locally". After this change it is trivial to turn it into, say, bootmem[nr_cpu_ids]. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20hw_breakpoint: Simplify *register_wide_hw_breakpoint()Oleg Nesterov
1. register_wide_hw_breakpoint() can use unregister_ if failure, no need to duplicate the code. 2. "struct perf_event **pevent" adds the unnecesary lever of indirection and complication, use per_cpu(*cpu_events, cpu). Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20hw_breakpoint: Introduce cpumask_of_bp()Oleg Nesterov
Add the trivial helper which simply returns cpumask_of() or cpu_possible_mask depending on bp->cpu. Change fetch_bp_busy_slots() and toggle_bp_slot() to always do for_each_cpu(cpumask_of_bp) to simplify the code and avoid the code duplication. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20hw_breakpoint: Simplify the "weight" usage in toggle_bp_slot() pathsOleg Nesterov
Change toggle_bp_slot() to make "weight" negative if !enable. This way we can always use "+ weight" without additional "if (enable)" check and toggle_bp_task_slot() no longer needs this arg. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20130620155013.GA6337@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20hw_breakpoint: Simplify list/idx mess in toggle_bp_slot() pathsOleg Nesterov
The enable/disable logic in toggle_bp_slot() is not symmetrical and imho very confusing. "old_count" in toggle_bp_task_slot() is actually new_count because this bp was already removed from the list. Change toggle_bp_slot() to always call list_add/list_del after toggle_bp_task_slot(). This way old_idx is task_bp_pinned() and this entry should be decremented, new_idx is +/-weight and we need to increment this element. The code/logic looks obvious. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20130620155011.GA6330@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20Merge branch 'perf/urgent' into perf/coreIngo Molnar
Merge in two hw_breakpoint fixes, before applying another 5. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()Oleg Nesterov
fetch_bp_busy_slots() and toggle_bp_slot() use for_each_online_cpu(), this is obviously wrong wrt cpu_up() or cpu_down(), we can over/under account the per-cpu numbers. For example: # echo 0 >> /sys/devices/system/cpu/cpu1/online # perf record -e mem:0x10 -p 1 & # echo 1 >> /sys/devices/system/cpu/cpu1/online # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a & # taskset -p 0x2 1 triggers the same WARN_ONCE("Can't find any breakpoint slot") in arch_install_hw_breakpoint(). Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)Oleg Nesterov
trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint slot") in arch_install_hw_breakpoint() but the problem is not arch-specific. The problem is, task_bp_pinned(cpu) checks "cpu == iter->cpu" but this doesn't account the "all cpus" events with iter->cpu < 0. This means that, say, register_user_hw_breakpoint(tsk) can happily create the arbitrary number > HBP_NUM of breakpoints which can not be activated. toggle_bp_task_slot() is equally wrong by the same reason and nr_task_bp_pinned[] can have negative entries. Simple test: # perl -e 'sleep 1 while 1' & # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl` Before this patch this triggers the same problem/WARN_ON(), after the patch it correctly fails with -ENOSPC. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20nohz: Remove obsolete check for full dynticks CPUs to be RCU nocbsFrederic Weisbecker
Building full dynticks now implies that all CPUs are forced into RCU nocb mode through CONFIG_RCU_NOCB_CPU_ALL. The dynamic check has become useless. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de>
2013-06-20watchdog: Boot-disable by default on full dynticksFrederic Weisbecker
When the watchdog runs, it prevents the full dynticks CPUs from stopping their tick because the hard lockup detector uses perf events internally, which in turn rely on the periodic tick. Since this is a rather confusing behaviour that is not easy to track down and identify for those who want to test CONFIG_NO_HZ_FULL, let's default disable the watchdog on boot time when full dynticks is enabled. The user can still enable it later on runtime using proc or sysctl. Reported-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Anish Singh <anish198519851985@gmail.com>
2013-06-20watchdog: Rename confusing state variableFrederic Weisbecker
We have two very conflicting state variable names in the watchdog: * watchdog_enabled: This one reflects the user interface. It's set to 1 by default and can be overriden with boot options or sysctl/procfs interface. * watchdog_disabled: This is the internal toggle state that tells if watchdog threads, timers and NMI events are currently running or not. This state mostly depends on the user settings. It's a convenient state latch. Now we really need to find clearer names because those are just too confusing to encourage deep review. watchdog_enabled now becomes watchdog_user_enabled to reflect its purpose as an interface. watchdog_disabled becomes watchdog_running to suggest its role as a pure internal state. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Anish Singh <anish198519851985@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com>
2013-06-19ftrace: Fix stddev calculation in function profilerJuri Lelli
When FUNCTION_GRAPH_TRACER is enabled, ftrace can profile kernel functions and print basic statistics about them. Unfortunately, running stddev calculation is wrong. This patch corrects it implementing Welford’s method: s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2) . Link: http://lkml.kernel.org/r/1371031398-24048-1-git-send-email-juri.lelli@gmail.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19tracing/kprobes: Remove unnecessary checking of trace_probe_is_enabledzhangwei(Jovi)
Since tp->flags assignment was moved into function enable_trace_probe(), there is no need to use trace_probe_is_enabled to check flags in the same function. Remove the unnecessary checking. Link: http://lkml.kernel.org/r/51BA7B9E.3040807@huawei.com Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19tracing: Disable tracing on warningSteven Rostedt (Red Hat)
Add a traceoff_on_warning option in both the kernel command line as well as a sysctl option. When set, any WARN*() function that is hit will cause the tracing_on variable to be cleared, which disables writing to the ring buffer. This is useful especially when tracing a bug with function tracing. When a warning is hit, the print caused by the warning can flood the trace with the functions that producing the output for the warning. This can make the resulting trace useless by either hiding where the bug happened, or worse, by overflowing the buffer and losing the trace of the bug totally. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19tracing: Add binary '&' filter for eventsSteven Rostedt
There are some cases when filtering on a set flag of a field of a tracepoint is useful. But currently the only filtering commands for numbered fields is ==, !=, <, <=, >, >=. This does not help when you just want to trace if a specific flag is set. For example: > # sudo trace-cmd record -e brcmfmac:brcmf_dbg -f 'level & 0x40000' > disable all > enable brcmfmac:brcmf_dbg > path = /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/enable > (level & 0x40000) > ^ > parse_error: Invalid operator > When trying to trace brcmf_dbg when level has its 1 << 18 bit set, the filter fails to perform. By allowing a binary '&' operation, this gives the user the ability to test a bit. Note, a binary '|' is not added, as it doesn't make sense as fields must be compared to constants (for now), and ORing a constant will always return true. Link: http://lkml.kernel.org/r/1371057385.9844.261.camel@gandalf.local.home Suggested-by: Arend van Spriel <arend@broadcom.com> Tested-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-20watchdog: Register / unregister watchdog kthreads on sysctl controlFrederic Weisbecker
The user activation/deactivation of the watchdog through boot parameters or systcl is currently implemented with a dance involving kthreads parking and unparking methods: the threads are unconditionally registered on boot and they park as soon as the user want the watchdog to be disabled. This method involves a few noisy details to handle though: the watchdog kthreads may be unparked anytime due to hotplug operations, after which the watchdog internals have to decide to park again if it is user-disabled. As a result the setup() and unpark() methods need to be able to request a reparking. This is not currently supported in the kthread infrastructure so this piece of the watchdog code only works halfway. Besides, unparking/reparking the watchdog kthreads consume unnecessary cputime on hotplug operations when those could be simply ignored in the first place. As suggested by Srivatsa, let's instead only register the watchdog threads when they are needed. This way we don't need to think about hotplug operations and we don't burden the CPU onlining when the watchdog is simply disabled. Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Anish Singh <anish198519851985@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com>
2013-06-20nohz: Warn if the machine can not perform nohz_fullSteven Rostedt
If the user configures NO_HZ_FULL and defines nohz_full=XXX on the kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails due to the machine having a unstable clock, warn about it. We do not want users thinking that they are getting the benefit of nohz when their machine can not support it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-06-19Merge tag 'please-pull-einj' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull miscellaneous fixes for ACPI EINJ (error injection) code, from Tony Luck. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: "The major changes for this series are: 1. Simplify RCU's grace-period and callback processing based on the new numbering for callbacks. These were posted to LKML at https://lkml.org/lkml/2013/5/20/330. 2. Documentation updates. These were posted to LKML at https://lkml.org/lkml/2013/5/20/348. 3. Miscellaneous fixes, including converting a few remaining printk() calls to pr_*(). These were posted to LKML at https://lkml.org/lkml/2013/5/20/324. 4. SRCU-related changes and fixes. These were posted to LKML at https://lkml.org/lkml/2013/5/20/425. 5. Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for single-CPU low-latency systems. These were posted to LKML at https://lkml.org/lkml/2013/5/20/427." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Don't mix use of typedef ctl_table and struct ctl_tableJoe Perches
Just use struct ctl_table. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371063336.2069.22.camel@joe-AO722 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Remove WARN_ON(!sd) from init_sched_groups_power()Viresh Kumar
sd can't be NULL in init_sched_groups_power() and so checking it for NULL isn't useful. In case it is required, then also we need to rearrange the code a bit as we already accessed invalid pointer sd to get sg: sg = sd->groups. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2bbe633cd74b431c05253a8ce61fdfd5066a531b.1370948150.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Fix memory leakage in build_sched_groups()Viresh Kumar
In build_sched_groups() we don't need to call get_group() for cpus which are already covered in previous iterations. Calling get_group() would mark the group used and eventually leak it since we wouldn't connect it and not find it again to free it. This will happen only in cases where sg->cpumask contained more than one cpu (For any topology level). This patch would free sg's memory for all cpus leaving the group leader as the group isn't marked used now. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/7a61e955abdcbb1dfa9fe493f11a5ec53a11ddd3.1370948150.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>