summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2011-07-14KVM: Steal time implementationGlauber Costa
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is a per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time This patch contains the hypervisor part of the steal time infrasructure, and can be backported independently of the guest portion. [avi, yongjie: export delayacct_on, to avoid build failures in some configs] Signed-off-by: Glauber Costa <glommer@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Rik van Riel <riel@redhat.com> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-13ftrace: Fix dynamic selftest failure on some archsSteven Rostedt
Archs that do not implement CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST, will fail the dynamic ftrace selftest. The function tracer has a quick 'off' variable that will prevent the call back functions from being called. This variable is called function_trace_stop. In x86, this is implemented directly in the mcount assembly, but for other archs, an intermediate function is used called ftrace_test_stop_func(). In dynamic ftrace, the function pointer variable ftrace_trace_function is used to update the caller code in the mcount caller. But for archs that do not have CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST set, it only calls ftrace_test_stop_func() instead, which in turn calls __ftrace_trace_function. When more than one ftrace_ops is registered, the function it calls is ftrace_ops_list_func(), which will iterate over all registered ftrace_ops and call the callbacks that have their hash matching. The issue happens when two ftrace_ops are registered for different functions and one is then unregistered. The __ftrace_trace_function is then pointed to the remaining ftrace_ops callback function directly. This mean it will be called for all functions that were registered to trace by both ftrace_ops that were registered. This is not an issue for archs with CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST, because the update of ftrace_trace_function doesn't happen until after all functions have been updated, and then the mcount caller is updated. But for those archs that do use the ftrace_test_stop_func(), the update is immediate. The dynamic selftest fails because it hits this situation, and the ftrace_ops that it registers fails to only trace what it was suppose to and instead traces all other functions. The solution is to delay the setting of __ftrace_trace_function until after all the functions have been updated according to the registered ftrace_ops. Also, function_trace_stop is set during the update to prevent function tracing from calling code that is caused by the function tracer itself. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-13ftrace: Update filter when tracing enabled in set_ftrace_filter()Steven Rostedt
Currently, if set_ftrace_filter() is called when the ftrace_ops is active, the function filters will not be updated. They will only be updated when tracing is disabled and re-enabled. Update the functions immediately during set_ftrace_filter(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-13ftrace: Balance records when updating the hashSteven Rostedt
Whenever the hash of the ftrace_ops is updated, the record counts must be balance. This requires disabling the records that are set in the original hash, and then enabling the records that are set in the updated hash. Moving the update into ftrace_hash_move() removes the bug where the hash was updated but the records were not, which results in ftrace triggering a warning and disabling itself because the ftrace_ops filter is updated while the ftrace_ops was registered, and then the failure happens when the ftrace_ops is unregistered. The current code will not trigger this bug, but new code will. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-13rcu: Prevent RCU callbacks from executing before scheduler initializedPaul E. McKenney
Under some rare but real combinations of configuration parameters, RCU callbacks are posted during early boot that use kernel facilities that are not yet initialized. Therefore, when these callbacks are invoked, hard hangs and crashes ensue. This commit therefore prevents RCU callbacks from being invoked until after the scheduler is fully up and running, as in after multiple tasks have been spawned. It might well turn out that a better approach is to identify the specific RCU callbacks that are causing this problem, but that discussion will wait until such time as someone really needs an RCU callback to be invoked (as opposed to merely registered) during early boot. Reported-by: julie Sullivan <kernelmail.jms@gmail.com> Reported-by: RKK <kulkarni.ravi4@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: julie Sullivan <kernelmail.jms@gmail.com> Tested-by: RKK <kulkarni.ravi4@gmail.com>
2011-07-12Merge branch 'fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: pcmcia: pxa2xx/vpac270: free gpios on exist rather than requesting ARM: pxa/raumfeld: fix device name for codec ak4104 ARM: pxa/raumfeld: display initialisation fixes ARM: pxa/raumfeld: adapt to upcoming hardware change ARM: pxa: fix gpio_to_chip() clash with gpiolib namespace genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd) arm: mach-vt8500: add forgotten irq_data conversion ARM: pxa168: correct nand pmu setting ARM: pxa910: correct nand pmu setting ARM: pxa: fix PGSR register address calculation
2011-07-12KVM: Add compat ioctl for KVM_SET_SIGNAL_MASKAlexander Graf
KVM has an ioctl to define which signal mask should be used while running inside VCPU_RUN. At least for big endian systems, this mask is different on 32-bit and 64-bit systems (though the size is identical). Add a compat wrapper that converts the mask to whatever the kernel accepts, allowing 32-bit kvm user space to set signal masks. This patch fixes qemu with --enable-io-thread on ppc64 hosts when running 32-bit user land. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12fixlet: Remove fs_excl from struct task.Justin TerAvest
fs_excl is a poor man's priority inheritance for filesystems to hint to the block layer that an operation is important. It was never clearly specified, not widely adopted, and will not prevent starvation in many cases (like across cgroups). fs_excl was introduced with the time sliced CFQ IO scheduler, to indicate when a process held FS exclusive resources and thus needed a boost. It doesn't cover all file systems, and it was never fully complete. Lets kill it. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-11doc: Konfig: Documentation/power/{pm => apm-acpi}.txtMichael Witten
Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-07-11Merge branch 'master' into for-nextJiri Kosina
Sync with Linus' tree to be able to apply pending patches that are based on newer code already present upstream.
2011-07-08rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_checkMichal Hocko
Since ca5ecddf (rcu: define __rcu address space modifier for sparse) rcu_dereference_check use rcu_read_lock_held as a part of condition automatically so callers do not have to do that as well. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-07-08plist: Remove the need to supply locks to plist headsDima Zavin
This was legacy code brought over from the RT tree and is no longer necessary. Signed-off-by: Dima Zavin <dima@android.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Daniel Walker <dwalker@codeaurora.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Link: http://lkml.kernel.org/r/1310084879-10351-2-git-send-email-dima@android.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-07ftrace: Do not disable interrupts for modules in mcount updateSteven Rostedt
When I mounted an NFS directory, it caused several modules to be loaded. At the time I was running the preemptirqsoff tracer, and it showed the following output: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 2.6.33.9-rt30-mrg-test # -------------------------------------------------------------------- # latency: 1177 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # | task: modprobe-19370 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: ftrace_module_notify # => ended at: ftrace_module_notify # # # _------=> CPU# # / _-----=> irqs-off # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| /_--=> lock-depth # |||||/ delay # cmd pid |||||| time | caller # \ / |||||| \ | / modprobe-19370 3d.... 0us!: ftrace_process_locs <-ftrace_module_notify modprobe-19370 3d.... 1176us : ftrace_process_locs <-ftrace_module_notify modprobe-19370 3d.... 1178us : trace_hardirqs_on <-ftrace_module_notify modprobe-19370 3d.... 1178us : <stack trace> => ftrace_process_locs => ftrace_module_notify => notifier_call_chain => __blocking_notifier_call_chain => blocking_notifier_call_chain => sys_init_module => system_call_fastpath That's over 1ms that interrupts are disabled on a Real-Time kernel! Looking at the cause (being the ftrace author helped), I found that the interrupts are disabled before the code modification of mcounts into nops. The interrupts only need to be disabled on start up around this code, not when modules are being loaded. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-07tracing: Still trace filtered irq functions when irq trace is disabledSteven Rostedt
If a function is set to be traced by the set_graph_function, but the option funcgraph-irqs is zero, and the traced function happens to be called from a interrupt, it will not be traced. The point of funcgraph-irqs is to not trace interrupts when we are preempted by an irq, not to not trace functions we want to trace that happen to be *in* a irq. Luckily the current->trace_recursion element is perfect to add a flag to help us be able to trace functions within an interrupt even when we are not tracing interrupts that preempt the trace. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-07Merge branch 'pm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix free_unnecessary_pages()
2011-07-07Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and ↵Linus Torvalds
'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: debugobjects: Fix boot crash when kmemleak and debugobjects enabled * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: jump_label: Fix jump_label update for modules oprofile, x86: Fix race in nmi handler while starting counters * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Disable (revert) SCHED_LOAD_SCALE increase sched, cgroups: Fix MIN_SHARES on 64-bit boxen
2011-07-07genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd)Simon Guinot
This fixes a regression introduced by e59347a "arm: orion: Use generic irq chip". Depending on the device, interrupts acknowledgement is done by setting or by clearing a dedicated register. Replace irq_gc_ack() with some {set,clr}_bit variants allows to handle both cases. Note that this patch affects the following SoCs: Davinci, Samsung and Orion. Except for this last, the change is minor: irq_gc_ack() is just renamed into irq_gc_ack_set_bit(). For the Orion SoCs, the edge GPIO interrupts support is currently broken. irq_gc_ack() try to acknowledge a such interrupt by setting the corresponding cause register bit. The Orion GPIO device expect the opposite. To fix this issue, the irq_gc_ack_clr_bit() variant is used. Tested on Network Space v2. Reported-by: Joey Oravec <joravec@drewtech.com> Signed-off-by: Simon Guinot <sguinot@lacie.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2011-07-07ftrace: Fix regression of :mod:module function enablingSteven Rostedt
The new code that allows different utilities to pick and choose what functions they trace broke the :mod: hook that allows users to trace only functions of a particular module. The reason is that the :mod: hook bypasses the hash that is setup to allow individual users to trace their own functions and uses the global hash directly. But if the global hash has not been set up, it will cause a bug: echo '*:mod:radeon' > /sys/kernel/debug/set_ftrace_filter produces: [drm:drm_mode_getfb] *ERROR* invalid framebuffer id [drm:radeon_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip BUG: unable to handle kernel paging request at ffffffff8160ec90 IP: [<ffffffff810d9136>] add_hash_entry+0x66/0xd0 PGD 1a05067 PUD 1a09063 PMD 80000000016001e1 Oops: 0003 [#1] SMP Jul 7 04:02:28 phyllis kernel: [55303.858604] CPU 1 Modules linked in: cryptd aes_x86_64 aes_generic binfmt_misc rfcomm bnep ip6table_filter hid radeon r8169 ahci libahci mii ttm drm_kms_helper drm video i2c_algo_bit intel_agp intel_gtt Pid: 10344, comm: bash Tainted: G WC 3.0.0-rc5 #1 Dell Inc. Inspiron N5010/0YXXJJ RIP: 0010:[<ffffffff810d9136>] [<ffffffff810d9136>] add_hash_entry+0x66/0xd0 RSP: 0018:ffff88003a96bda8 EFLAGS: 00010246 RAX: ffff8801301735c0 RBX: ffffffff8160ec80 RCX: 0000000000306ee0 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880137c92940 RBP: ffff88003a96bdb8 R08: ffff880137c95680 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81c9df78 R13: ffff8801153d1000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f329c18a700(0000) GS:ffff880137c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff8160ec90 CR3: 000000003002b000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process bash (pid: 10344, threadinfo ffff88003a96a000, task ffff88012fcfc470) Stack: 0000000000000fd0 00000000000000fc ffff88003a96be38 ffffffff810d92f5 ffff88011c4c4e00 ffff880000000000 000000000b69f4d0 ffffffff8160ec80 ffff8800300e6f06 0000000081130295 0000000000000282 ffff8800300e6f00 Call Trace: [<ffffffff810d92f5>] match_records+0x155/0x1b0 [<ffffffff810d940c>] ftrace_mod_callback+0xbc/0x100 [<ffffffff810dafdf>] ftrace_regex_write+0x16f/0x210 [<ffffffff810db09f>] ftrace_filter_write+0xf/0x20 [<ffffffff81166e48>] vfs_write+0xc8/0x190 [<ffffffff81167001>] sys_write+0x51/0x90 [<ffffffff815c7e02>] system_call_fastpath+0x16/0x1b Code: 48 8b 33 31 d2 48 85 f6 75 33 49 89 d4 4c 03 63 08 49 8b 14 24 48 85 d2 48 89 10 74 04 48 89 42 08 49 89 04 24 4c 89 60 08 31 d2 RIP [<ffffffff810d9136>] add_hash_entry+0x66/0xd0 RSP <ffff88003a96bda8> CR2: ffffffff8160ec90 ---[ end trace a5d031828efdd88e ]--- Reported-by: Brian Marete <marete@toshnix.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-07tracing: Have "enable" file use refcounts like the "filter" fileSteven Rostedt
The "enable" file for the event system can be removed when a module is unloaded and the event system only has events from that module. As the event system nr_events count goes to zero, it may be freed if its ref_count is also set to zero. Like the "filter" file, the "enable" file may be opened by a task and referenced later, after a module has been unloaded and the events for that event system have been removed. Although the "filter" file referenced the event system structure, the "enable" file only references a pointer to the event system name. Since the name is freed when the event system is removed, it is possible that an access to the "enable" file may reference a freed pointer. Update the "enable" file to use the subsystem_open() routine that the "filter" file uses, to keep a reference to the event system structure while the "enable" file is opened. Cc: <stable@kernel.org> Reported-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-07tracing: Fix bug when reading system filters on module removalSteven Rostedt
The event system is freed when its nr_events is set to zero. This happens when a module created an event system and then later the module is removed. Modules may share systems, so the system is allocated when it is created and freed when the modules are unloaded and all the events under the system are removed (nr_events set to zero). The problem arises when a task opened the "filter" file for the system. If the module is unloaded and it removed the last event for that system, the system structure is freed. If the task that opened the filter file accesses the "filter" file after the system has been freed, the system will access an invalid pointer. By adding a ref_count, and using it to keep track of what is using the event system, we can free it after all users are finished with the event system. Cc: <stable@kernel.org> Reported-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-06PM / Hibernate: Fix free_unnecessary_pages()Rafael J. Wysocki
There is a bug in free_unnecessary_pages() that causes it to attempt to free too many pages in some cases, which triggers the BUG_ON() in memory_bm_clear_bit() for copy_bm. Namely, if count_data_pages() is initially greater than alloc_normal, we get to_free_normal equal to 0 and "save" greater from 0. In that case, if the sum of "save" and count_highmem_pages() is greater than alloc_highmem, we subtract a positive number from to_free_normal. Hence, since to_free_normal was 0 before the subtraction and is an unsigned int, the result is converted to a huge positive number that is used as the number of pages to free. Fix this bug by checking if to_free_normal is actually greater than or equal to the number we're going to subtract from it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: Matthew Garrett <mjg@redhat.com> Cc: stable@kernel.org
2011-07-06resource: ability to resize an allocated resourceRam Pai
Provides the ability to resize a resource that is already allocated. This functionality is put in place to support reallocation needs of pci resources. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-05Merge branch 'tip/perf/core-2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
2011-07-02PM: Allow the clocks management code to be used during system suspendRafael J. Wysocki
The common clocks management code in drivers/base/power/clock_ops.c is going to be used during system-wide power transitions as well as for runtime PM, so it shouldn't depend on CONFIG_PM_RUNTIME. However, the suspend/resume functions provided by it for CONFIG_PM_RUNTIME unset, to be used during system-wide power transitions, should not behave in the same way as their counterparts defined for CONFIG_PM_RUNTIME set, because in that case the clocks are managed differently at run time. The names of the functions still contain the word "runtime" after this change, but that is going to be modified by a separate patch later. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reviewed-by: Kevin Hilman <khilman@ti.com>
2011-07-02PM / Domains: Support for generic I/O PM domains (v8)Rafael J. Wysocki
Introduce common headers, helper functions and callbacks allowing platforms to use simple generic power domains for runtime power management. Introduce struct generic_pm_domain to be used for representing power domains that each contain a number of devices and may be parent domains or subdomains with respect to other power domains. Among other things, this structure includes callbacks to be provided by platforms for performing specific tasks related to power management (i.e. ->stop_device() may disable a device's clocks, while ->start_device() may enable them, ->power_off() is supposed to remove power from the entire power domain and ->power_on() is supposed to restore it). Introduce functions that can be used as power domain runtime PM callbacks, pm_genpd_runtime_suspend() and pm_genpd_runtime_resume(), as well as helper functions for the initialization of a power domain represented by a struct generic_power_domain object, adding a device to or removing a device from it and adding or removing subdomains. Introduce configuration option CONFIG_PM_GENERIC_DOMAINS to be selected by the platforms that want to use the new code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Reviewed-by: Kevin Hilman <khilman@ti.com>
2011-07-01Merge branch 'sched/core-v2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into sched/core
2011-07-01perf: export perf_event_refresh() to modulesAvi Kivity
KVM needs one-shot samples, since a PMC programmed to -X will fire after X events and then again after 2^40 events (i.e. variable period). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309362157-6596-4-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01perf: Add context field to perf_eventAvi Kivity
The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event->overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01perf: Remove the perf_output_begin(.sample) argumentPeter Zijlstra
Since only samples call perf_output_sample() its much saner (and more correct) to put the sample logic in there than in the perf_output_begin()/perf_output_end() pair. Saves a useless argument, reduces conditionals and shrinks struct perf_output_handle, win! Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01perf: Remove the nmi parameter from the swevent and overflow interfacePeter Zijlstra
The nmi parameter indicated if we could do wakeups from the current context, if not, we would set some state and self-IPI and let the resulting interrupt do the wakeup. For the various event classes: - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from the PMI-tail (ARM etc.) - tracepoint: nmi=0; since tracepoint could be from NMI context. - software: nmi=[0,1]; some, like the schedule thing cannot perform wakeups, and hence need 0. As one can see, there is very little nmi=1 usage, and the down-side of not using it is that on some platforms some software events can have a jiffy delay in wakeup (when arch_irq_work_raise isn't implemented). The up-side however is that we can remove the nmi parameter and save a bunch of conditionals in fast paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Will Deacon <will.deacon@arm.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Eric B Munson <emunson@mgebm.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David S. Miller <davem@davemloft.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01perf, x86: Add hw_watchdog_set_attr() in a sake of nmi-watchdog on P4Cyrill Gorcunov
Due to restriction and specifics of Netburst PMU we need a separated event for NMI watchdog. In particular every Netburst event consumes not just a counter and a config register, but also an additional ESCR register. Since ESCR registers are grouped upon counters (i.e. if ESCR is occupied for some event there is no room for another event to enter until its released) we need to pick up the "least" used ESCR (or the most available one) for nmi-watchdog purposes -- so MSR_P4_CRU_ESCR2/3 was chosen. With this patch nmi-watchdog and perf top should be able to run simultaneously. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: Lin Ming <ming.m.lin@intel.com> CC: Arnaldo Carvalho de Melo <acme@redhat.com> CC: Frederic Weisbecker <fweisbec@gmail.com> Tested-and-reviewed-by: Don Zickus <dzickus@redhat.com> Tested-and-reviewed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110623124918.GC13050@sun Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01events: Ensure that timers are updated without requiring read() callEric B Munson
The event tracing infrastructure exposes two timers which should be updated each time the value of the counter is updated. Currently, these counters are only updated when userspace calls read() on the fd associated with an event. This means that counters which are read via the mmap'd page exclusively never have their timers updated. This patch adds ensures that the timers are updated each time the values in the mmap'd page are updated. Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1308932786-5111-1-git-send-email-emunson@mgebm.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01events: Move lockless timer calculation into helper functionEric B Munson
Take the timer calculation from perf_output_read and move it to a helper function for any place that needs timer values but cannot take the ctx->lock. Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1308861279-15216-2-git-send-email-emunson@mgebm.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01events: Add note to update_event_times comment about holding ctx->lockEric B Munson
Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1308861279-15216-1-git-send-email-emunson@mgebm.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01perf_events: Fix perf buffer watermark settingVince Weaver
Since 2.6.36 (specifically commit d57e34fdd60b ("perf: Simplify the ring-buffer logic: make perf_buffer_alloc() do everything needed"), the perf_buffer_init_code() has been mis-setting the buffer watermark if perf_event_attr.wakeup_events has a non-zero value. This is because perf_event_attr.wakeup_events is a union with perf_event_attr.wakeup_watermark. This commit re-enables the check for perf_event_attr.watermark being set before continuing with setting a non-default watermark. This bug is most noticable when you are trying to use PERF_IOC_REFRESH with a value larger than one and perf_event_attr.wakeup_events is set to one. In this case the buffer watermark will be set to 1 and you will get extraneous POLL_IN overflows rather than POLL_HUP as expected. [ avoid using attr.wakeup_events when attr.watermark is set ] Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106011506390.5384@cl320.eecs.utk.edu Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01sched: Skip autogroup when looking for all rt sched groupsYong Zhang
Since commit ec514c48 ("sched: Fix rt_rq runtime leakage bug") 'cat /proc/sched_debug' will print data of root_task_group.rt_rq multiple times. This is because autogroup does not have its own rt group, instead rt group of autogroup is linked to root_task_group. So skip it when we are looking for all rt sched groups, and it will also save some noop operation against root_task_group when __disable_runtime()/__enable_runtime(). -v2: Based on Cheng Xu's idea which uses less code. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Cheng Xu <chengxu@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTi=87P3RoTF_UEtamNfc_XGxQXE__Q@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01sched: Simplify mutex_spin_on_owner()Thomas Gleixner
It does not make sense to rcu_read_lock/unlock() in every loop iteration while spinning on the mutex. Move the rcu protection outside the loop. Also simplify the return path to always check for lock->owner == NULL which meets the requirements of both owner changed and need_resched() caused loop exits. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1106101458350.11814@ionos Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01sched: Remove rcu_read_lock() from wake_affine()Nikunj A. Dadhania
wake_affine() is only called from one path: select_task_rq_fair(), which already has the RCU read lock held. Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20110607101251.777.34547.stgit@IBM-009124035060.in.ibm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01Merge commit 'v3.0-rc5' into sched/coreIngo Molnar
Merge reason: Move to a (much) newer base. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01Merge commit 'v3.0-rc5' into perf/coreIngo Molnar
Merge reason: Pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-01sched, cgroups: Fix MIN_SHARES on 64-bit boxenMike Galbraith
Commit c8b28116 ("sched: Increase SCHED_LOAD_SCALE resolution") intended to have no user-visible effect, but allows setting cpu.shares to < MIN_SHARES, which the user then sees. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Link: http://lkml.kernel.org/r/1307192600.8618.3.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-06-30netfilter: add SELinux context support to AUDIT targetMr Dash Four
In this revision the conversion of secid to SELinux context and adding it to the audit log is moved from xt_AUDIT.c to audit.c with the aid of a separate helper function - audit_log_secctx - which does both the conversion and logging of SELinux context, thus also preventing internal secid number being leaked to userspace. If conversion is not successful an error is raised. With the introduction of this helper function the work done in xt_AUDIT.c is much more simplified. It also opens the possibility of this helper function being used by other modules (including auditd itself), if desired. With this addition, typical (raw auditd) output after applying the patch would be: type=NETFILTER_PKT msg=audit(1305852240.082:31012): action=0 hook=1 len=52 inif=? outif=eth0 saddr=10.1.1.7 daddr=10.1.2.1 ipid=16312 proto=6 sport=56150 dport=22 obj=system_u:object_r:ssh_client_packet_t:s0 type=NETFILTER_PKT msg=audit(1306772064.079:56): action=0 hook=3 len=48 inif=eth0 outif=? smac=00:05:5d:7c:27:0b dmac=00:02:b3:0a:7f:81 macproto=0x0800 saddr=10.1.2.1 daddr=10.1.1.7 ipid=462 proto=6 sport=22 dport=3561 obj=system_u:object_r:ssh_server_packet_t:s0 Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Mr Dash Four <mr.dash.four@googlemail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-06-30Merge branch 'linus' into nextJames Morris
2011-06-29jump_label: Fix jump_label update for modulesXiao Guangrong
The jump labels entries for modules do not stop at __stop__jump_table, but after mod->jump_entries + mod_num_jump_entries. By checking the wrong end point, module trace events never get enabled. Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Tested-by: Avi Kivity <avi@redhat.com> Tested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Link: http://lkml.kernel.org/r/4E00038B.2060404@cn.fujitsu.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-06-27taskstats: don't allow duplicate entries in listener modeVasiliy Kulikov
Currently a single process may register exit handlers unlimited times. It may lead to a bloated listeners chain and very slow process terminations. Eg after 10KK sent TASKSTATS_CMD_ATTR_REGISTER_CPUMASKs ~300 Mb of kernel memory is stolen for the handlers chain and "time id" shows 2-7 seconds instead of normal 0.003. It makes it possible to exhaust all kernel memory and to eat much of CPU time by triggerring numerous exits on a single CPU. The patch limits the number of times a single process may register itself on a single CPU to one. One little issue is kept unfixed - as taskstats_exit() is called before exit_files() in do_exit(), the orphaned listener entry (if it was not explicitly deregistered) is kept until the next someone's exit() and implicit deregistration in send_cpu_listeners(). So, if a process registered itself as a listener exits and the next spawned process gets the same pid, it would inherit taskstats attributes. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27x86, mtrr: use stop_machine APIs for doing MTRR rendezvousSuresh Siddha
MTRR rendezvous sequence is not implemened using stop_machine() before, as this gets called both from the process context aswell as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). Now that we have a new stop_machine_from_inactive_cpu() API, use it for rendezvous during mtrr init of a logical processor that is coming online. For the rest (runtime MTRR modification, system boot, resume paths), use stop_machine() to implement the rendezvous sequence. This will consolidate and cleanup the code. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182057.076997177@sbsiddha-MOBL3.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-06-27stop_machine: implement stop_machine_from_inactive_cpu()Tejun Heo
Currently, mtrr wants stop_machine functionality while a CPU is being brought up. As stop_machine() requires the calling CPU to be active, mtrr implements its own stop_machine using stop_one_cpu() on each online CPU. This doesn't only unnecessarily duplicate complex logic but also introduces a possibility of deadlock when it races against the generic stop_machine(). This patch implements stop_machine_from_inactive_cpu() to serve such use cases. Its functionality is basically the same as stop_machine(); however, it should be called from a CPU which isn't active and doesn't depend on working scheduling on the calling CPU. This is achieved by using busy loops for synchronization and open-coding stop_cpus queuing and waiting with direct invocation of fn() for local CPU inbetween. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110623182056.982526827@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-06-27stop_machine: reorganize stop_cpus() implementationTejun Heo
Refactor the queuing part of the stop cpus work from __stop_cpus() into queue_stop_cpus_work(). The reorganization is to help future improvements to stop_machine() and doesn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110623182056.897818337@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-06-27x86, mtrr: lock stop machine during MTRR rendezvous sequenceSuresh Siddha
MTRR rendezvous sequence using stop_one_cpu_nowait() can potentially happen in parallel with another system wide rendezvous using stop_machine(). This can lead to deadlock (The order in which works are queued can be different on different cpu's. Some cpu's will be running the first rendezvous handler and others will be running the second rendezvous handler. Each set waiting for the other set to join for the system wide rendezvous, leading to a deadlock). MTRR rendezvous sequence is not implemented using stop_machine() as this gets called both from the process context aswell as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). stop_machine() works with only online cpus. For now, take the stop_machine mutex in the MTRR rendezvous sequence that gets called from an online cpu (here we are in the process context and can potentially sleep while taking the mutex). And the MTRR rendezvous that gets triggered during cpu online doesn't need to take this stop_machine lock (as the stop_machine() already ensures that there is no cpu hotplug going on in parallel by doing get_online_cpus()) TBD: Pursue a cleaner solution of extending the stop_machine() infrastructure to handle the case where the calling cpu is still not online and use this for MTRR rendezvous sequence. fixes: https://bugzilla.novell.com/show_bug.cgi?id=672008 Reported-by: Vadim Kotelnikov <vadimuzzz@inbox.ru> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182056.807230326@sbsiddha-MOBL3.sc.intel.com Cc: stable@kernel.org # 2.6.35+, backport a week or two after this gets more testing in mainline Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-06-27ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/Oleg Nesterov
wait_consider_task() checks same_thread_group(parent, real_parent), this is the open-coded ptrace_reparented(). __ptrace_detach() remains the only function which has to check this by hand, although we could reorganize the code to delay __ptrace_unlink. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>