summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2011-12-06sched, nohz: Clean up the find_new_ilb() using sched groups nr_busy_cpusSuresh Siddha
nr_busy_cpus in the sched_group_power indicates whether the group is semi idle or not. This helps remove the is_semi_idle_group() and simplify the find_new_ilb() in the context of finding an optimal cpu that can do idle load balancing. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.656983582@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Implement sched group, domain aware nohz idle load balancingSuresh Siddha
When there are many logical cpu's that enter and exit idle often, members of the global nohz data structure are getting modified very frequently causing lot of cache-line contention. Make the nohz idle load balancing more scalabale by using the sched domain topology and 'nr_busy_cpu's in the struct sched_group_power. Idle load balance is kicked on one of the idle cpu's when there is atleast one idle cpu and: - a busy rq having more than one task or - a busy rq's scheduler group that share package resources (like HT/MC siblings) and has more than one member in that group busy or - for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that domain are idle compared to the busy ones. This will help in kicking the idle load balancing request only when there is a potential imbalance. And once it is mostly balanced, these kicks will be minimized. These changes helped improve the workload that is context switch intensive between number of task pairs by 2x on a 8 socket NHM-EX based system. Reported-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Track nr_busy_cpus in the sched_group_powerSuresh Siddha
Introduce nr_busy_cpus in the struct sched_group_power [Not in sched_group because sched groups are duplicated for the SD_OVERLAP scheduler domain] and for each cpu that enters and exits idle, this parameter will be updated in each scheduler group of the scheduler domain that this cpu belongs to. To avoid the frequent update of this state as the cpu enters and exits idle, the update of the stat during idle exit is delayed to the first timer tick that happens after the cpu becomes busy. This is done using NOHZ_IDLE flag in the struct rq's nohz_flags. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.555984323@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched, nohz: Introduce nohz_flags in 'struct rq'Suresh Siddha
Introduce nohz_flags in the struct rq, which will track these two flags for now. NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when the tick is stopped. It will be used to update the nohz idle load balancer data structures during the first busy tick after the tick is restarted. At this first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset. This will minimize the nohz idle load balancer status updates that currently happen for every tickless exit, making it more scalable when there are many logical cpu's that enter and exit idle often. NOHZ_BALANCE_KICK will track the need for nohz idle load balance on this rq. This will replace the nohz_balance_kick in the rq, which was not being updated atomically. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched/rt: Code cleanup, remove a redundant function callShan Hai
The second call to sched_rt_period() is redundant, because the value of the rt_runtime was already read and it was protected by the ->rt_runtime_lock. Signed-off-by: Shan Hai <haishan.bai@gmail.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1322535836-13590-2-git-send-email-haishan.bai@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Fix the sched group node allocation for SD_OVERLAP domainsSuresh Siddha
For the SD_OVERLAP domain, sched_groups for each CPU's sched_domain are privately allocated and not shared with any other cpu. So the sched group allocation should come from the cpu's node for which SD_OVERLAP sched domain is being setup. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111118230554.164910950@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Set skip_clock_update in yield_task_fair()Mike Galbraith
This is another case where we are on our way to schedule(), so can save a useless clock update and resulting microscopic vruntime update. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971686.6855.18.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Use rt.nr_cpus_allowed to recover select_task_rq() cyclesMike Galbraith
rt.nr_cpus_allowed is always available, use it to bail from select_task_rq() when only one cpu can be used, and saves some cycles for pinned tasks. See the line marked with '*' below: # taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3) ------------------------------------------------------------------------------------------------ Virgin Patched samples pcnt function samples pcnt function _______ _____ ___________________________ _______ _____ ___________________________ 2880.00 10.2% __schedule 3136.00 11.3% __schedule 1634.00 5.8% pipe_read 1615.00 5.8% pipe_read 1458.00 5.2% system_call 1534.00 5.5% system_call 1382.00 4.9% _raw_spin_lock_irqsave 1412.00 5.1% _raw_spin_lock_irqsave 1202.00 4.3% pipe_write 1255.00 4.5% copy_user_generic_string 1164.00 4.1% copy_user_generic_string 1241.00 4.5% __switch_to 1097.00 3.9% __switch_to 929.00 3.3% mutex_lock 872.00 3.1% mutex_lock 846.00 3.0% mutex_unlock 687.00 2.4% mutex_unlock 804.00 2.9% pipe_write 682.00 2.4% native_sched_clock 713.00 2.6% native_sched_clock 643.00 2.3% system_call_after_swapgs 653.00 2.3% _raw_spin_unlock_irqrestore 617.00 2.2% sched_clock_local 633.00 2.3% fsnotify 612.00 2.2% fsnotify 605.00 2.2% sched_clock_local 596.00 2.1% _raw_spin_unlock_irqrestore 593.00 2.1% system_call_after_swapgs 542.00 1.9% sysret_check 559.00 2.0% sysret_check 467.00 1.7% fget_light 472.00 1.7% fget_light 462.00 1.6% finish_task_switch 461.00 1.7% finish_task_switch 437.00 1.5% vfs_write 442.00 1.6% vfs_write 431.00 1.5% do_sync_write 428.00 1.5% do_sync_write * 413.00 1.5% select_task_rq_fair 404.00 1.5% _raw_spin_lock_irq 386.00 1.4% update_curr 402.00 1.4% update_curr 385.00 1.4% rw_verify_area 389.00 1.4% do_sync_read 377.00 1.3% _raw_spin_lock_irq 378.00 1.4% vfs_read 369.00 1.3% do_sync_read 340.00 1.2% pipe_iov_copy_from_user 360.00 1.3% vfs_read 316.00 1.1% __wake_up_sync_key 342.00 1.2% hrtick_start_fair 313.00 1.1% __wake_up_common Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06sched: Clean up domain traversal in select_idle_sibling()Suresh Siddha
Instead of going through the scheduler domain hierarchy multiple times (for giving priority to an idle core over an idle SMT sibling in a busy core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES flag and traverse the domain hierarchy down till we find an idle group. This cleanup also addresses an issue reported by Mike where the recent changes returned the busy thread even in the presence of an idle SMT sibling in single socket platforms. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321556904.15339.25.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06events, sched: Add tracepoint for accounting blocked timeAndrew Vagin
This tracepoint shows how long a task is sleeping in uninterruptible state. E.g. it may show how long and where a mutex is waited for. Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1322471015-107825-8-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-17sched: Move all scheduler bits into kernel/sched/Peter Zijlstra
There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-17sched: Make separate sched*.c translation unitsPeter Zijlstra
Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include "*.c" fest. This further modularizes the scheduler code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16sched: Fix comment for requeue_rt_entityRichard Weinberger
Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321117677-3282-1-git-send-email-richard@nod.at Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16sched: Don't call task_group() too many times in set_task_rq()Andrew Vagin
It improves perfomance, especially if autogroup is enabled. The size of set_task_rq() was 0x180 and now it is 0xa0. Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321020240-3874331-1-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16sched, trivial: Initialize root cgroup's sibling listGlauber Costa
Even though there are no siblings, the list should be initialized to not contain bogus values. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Paul Menage <paul@paulmenage.org> Acked-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1320182360-20043-2-git-send-email-glommer@parallels.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16sched: Use jump labels to reduce overhead when bandwidth control is inactivePaul Turner
Now that the linkage of jump-labels has been fixed they show a measurable improvement in overhead for the enabled-but-unused case. Workload is: 'taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"' There's a speedup for all situations: instructions cycles branches ------------------------------------------------------------------------- Intel Westmere base 806611770 745895590 146765378 +jumplabel 803090165 (-0.44%) 713381840 (-4.36%) 144561130 AMD Barcelona base 824657415 740055589 148855354 +jumplabel 821056910 (-0.44%) 737558389 (-0.34%) 146635229 Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111108042736.560831357@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16sched: Fix buglet in return_cfs_rq_runtime()Paul Turner
In return_cfs_rq_runtime() we want to return bandwidth when there are no remaining tasks, not "return" when this is the case. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111108042736.623812423@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16sched: Avoid SMT siblings in select_idle_sibling() if possiblePeter Zijlstra
Avoid select_idle_sibling() from picking a sibling thread if there's an idle core that shares cache. This fixes SMT balancing in the increasingly common case where there's a shared cache core available to balance to. Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1321350377.1421.55.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-14sched: Set the command name of the idle tasks in SMP kernelsCarsten Emde
In UP systems, the idle task is initialized using the init_task structure from which the command name is taken (currently "swapper"). In SMP systems, one idle task per CPU is forked by the worker thread from which the task structure is copied. The command name is, therefore, "kworker/0:0" or "kworker/0:1", if not updated. Since such update was lacking, all idle tasks in SMP systems were incorrectly named. This longtime bug was not discovered immediately, because there is no /proc/0 entry - the bug only becomes apparent when tracing is enabled. This patch sets the command name of the idle tasks in SMP systems to the name that is used in the INIT_TASK structure suffixed by a slash and the number of the CPU. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111026211708.768925506@osadl.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-14sched, rt: Provide means of disabling cross-cpu bandwidth sharingPeter Zijlstra
Normally the RT bandwidth scheme will share bandwidth across the entire root_domain. However sometimes its convenient to disable this sharing for debug purposes. Provide a simple feature switch to this end. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-14sched: Document wait_for_completion_*() return valuesJ. Bruce Fields
The return-value convention for these functions varies depending on whether they're interruptible or can timeout. It can be a little confusing--document it. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111006192246.GB28026@fieldses.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-14sched_fair: Fix a typo in the comment describing update_sd_lb_statsHui Kang
Signed-off-by: Hui Kang <hkang.sunysb@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1318388459-4427-1-git-send-email-hkang.sunysb@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-14sched: Add a comment to effective_load() since it's a painPeter Zijlstra
Every time I have to stare at this function I need to completely reverse engineer its workings, about time I write a comment explaining the thing. Collected bits and pieces from previous changelogs, mostly: 4be9daaa1b33701f011f4117f22dc1e45a3e6e34 83378269a5fad98f562ebc0f09c349575e6cbfe1 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1318518057.27731.2.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-07PM / QoS: Set cpu_dma_pm_qos->nameDominik Brodowski
Since commit 4a31a334, the name of this misc device is not initialized, which leads to a funny device named /dev/(null) being created and /proc/misc containing an entry with just a number but no name. The latter leads to complaints by cryptsetup, which caused me to investigate this matter. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-11-06Merge branch 'upstream/jump-label-noearly' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen * 'upstream/jump-label-noearly' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen: jump-label: initialize jump-label subsystem much earlier x86/jump_label: add arch_jump_label_transform_static() s390/jump-label: add arch_jump_label_transform_static() jump_label: add arch_jump_label_transform_static() to optimise non-live code updates sparc/jump_label: drop arch_jump_label_text_poke_early() x86/jump_label: drop arch_jump_label_text_poke_early() jump_label: if a key has already been initialized, don't nop it out stop_machine: make stop_machine safe and efficient to call early jump_label: use proper atomic_t initializer Conflicts: - arch/x86/kernel/jump_label.c Added __init_or_module to arch_jump_label_text_poke_early vs removal of that function entirely - kernel/stop_machine.c same patch ("stop_machine: make stop_machine safe and efficient to call early") merged twice, with whitespace fix in one version
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-06Merge branch 'writeback-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Add a 'reason' to wb_writeback_work writeback: send work item to queue_io, move_expired_inodes writeback: trace event balance_dirty_pages writeback: trace event bdi_dirty_ratelimit writeback: fix ppc compile warnings on do_div(long long, unsigned long) writeback: per-bdi background threshold writeback: dirty position control - bdi reserve area writeback: control dirty pause time writeback: limit max dirty pause time writeback: IO-less balance_dirty_pages() writeback: per task dirty rate limit writeback: stabilize bdi->dirty_ratelimit writeback: dirty rate control writeback: add bg_threshold parameter to __bdi_update_bandwidth() writeback: dirty position control writeback: account per-bdi accumulated dirtied pages
2011-11-06Merge git://github.com/rustyrussell/linuxLinus Torvalds
* git://github.com/rustyrussell/linux: module,bug: Add TAINT_OOT_MODULE flag for modules not built in-tree module: Enable dynamic debugging regardless of taint
2011-11-06Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits) powerpc/p3060qds: Add support for P3060QDS board powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX powerpc/85xx: Make kexec to interate over online cpus powerpc/fsl_booke: Fix comment in head_fsl_booke.S powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver powerpc/86xx: Correct Gianfar support for GE boards powerpc/cpm: Clear muram before it is in use. drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager powerpc/fsl_msi: add support for "msi-address-64" property powerpc/85xx: Setup secondary cores PIR with hard SMP id powerpc/fsl-booke: Fix settlbcam for 64-bit powerpc/85xx: Adding DCSR node to dtsi device trees powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards powerpc/85xx: fix PHYS_64BIT selection for P1022DS powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map powerpc: respect mem= setting for early memory limit setup powerpc: Update corenet64_smp_defconfig powerpc: Update mpc85xx/corenet 32-bit defconfigs ... Fix up trivial conflicts in: - arch/powerpc/configs/40x/hcu4_defconfig removed stale file, edited elsewhere - arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c: added opal and gelic drivers vs added ePAPR driver - drivers/tty/serial/8250.c moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support
2011-11-07module,bug: Add TAINT_OOT_MODULE flag for modules not built in-treeBen Hutchings
Use of the GPL or a compatible licence doesn't necessarily make the code any good. We already consider staging modules to be suspect, and this should also be true for out-of-tree modules which may receive very little review. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Reviewed-by: Dave Jones <davej@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (patched oops-tracing.txt)
2011-11-07module: Enable dynamic debugging regardless of taintBen Hutchings
Dynamic debugging is currently disabled for tainted modules, except for TAINT_CRAP. This prevents use of dynamic debugging for out-of-tree modules once the next patch is applied. This condition was apparently intended to avoid a crash if a force- loaded module has an incompatible definition of dynamic debug structures. However, a administrator that forces us to load a module is claiming that it *is* compatible even though it fails our version checks. If they are mistaken, there are any number of ways the module could crash the system. As a side-effect, proprietary and other tainted modules can now use dynamic_debug. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-11-04PM / Freezer: Revert 27920651fe "PM / Freezer: Make fake_signal_wake_up() ↵Tejun Heo
wake TASK_KILLABLE tasks too" Commit 27920651fe "PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too" updated fake_signal_wake_up() used by freezer to wake up KILLABLE tasks. Sending unsolicited wakeups to tasks in killable sleep is dangerous as there are code paths which depend on tasks not waking up spuriously from KILLABLE sleep. For example. sys_read() or page can sleep in TASK_KILLABLE assuming that wait/down/whatever _killable can only fail if we can not return to the usermode. TASK_TRACED is another obvious example. The previous patch updated wait_event_freezekillable() such that it doesn't depend on the spurious wakeup. This patch reverts the offending commit. Note that the spurious KILLABLE wakeup had other implicit effects in KILLABLE sleeps in nfs and cifs and those will need further updates to regain freezekillable behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-11-04PM / QoS: Remove redundant checkGuennadi Liakhovetski
Remove an "if" check, that repeats an equivalent one 6 lines above. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-11-04PM / Sleep: Fix race between CPU hotplug and freezerSrivatsa S. Bhat
The CPU hotplug notifications sent out by the _cpu_up() and _cpu_down() functions depend on the value of the 'tasks_frozen' argument passed to them (which indicates whether tasks have been frozen or not). (Examples for such CPU hotplug notifications: CPU_ONLINE, CPU_ONLINE_FROZEN, CPU_DEAD, CPU_DEAD_FROZEN). Thus, it is essential that while the callbacks for those notifications are running, the state of the system with respect to the tasks being frozen or not remains unchanged, *throughout that duration*. Hence there is a need for synchronizing the CPU hotplug code with the freezer subsystem. Since the freezer is involved only in the Suspend/Hibernate call paths, this patch hooks the CPU hotplug code to the suspend/hibernate notifiers PM_[SUSPEND|HIBERNATE]_PREPARE and PM_POST_[SUSPEND|HIBERNATE] to prevent the race between CPU hotplug and freezer, thus ensuring that CPU hotplug notifications will always be run with the state of the system really being what the notifications indicate, _throughout_ their execution time. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-11-03Revert "perf: Add PM notifiers to fix CPU hotplug races"Linus Torvalds
This reverts commit 144060fee07e9c22e179d00819c83c86fbcbf82c. It causes a resume regression for Andi on his Acer Aspire 1830T post 3.1. The screen just stays black after wakeup. Also, it really looks like the wrong way to suspend and resume perf events: I think they should be done as part of the CPU suspend and resume, rather than as a notifier that does smp_call_function(). Reported-by: Andi Kleen <andi@firstfloor.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02memcg: replace ss->id_lock with a rwlockAndrew Bresticker
While back-porting Johannes Weiner's patch "mm: memcg-aware global reclaim" for an internal effort, we noticed a significant performance regression during page-reclaim heavy workloads due to high contention of the ss->id_lock. This lock protects idr map, and serializes calls to idr_get_next() in css_get_next() (which is used during the memcg hierarchy walk). Since idr_get_next() is just doing a look up, we need only serialize it with respect to idr_remove()/idr_get_new(). By making the ss->id_lock a rwlock, contention is greatly reduced and performance improves. Tested: cat a 256m file from a ramdisk in a 128m container 50 times on each core (one file + container per core) in parallel on a NUMA machine. Result is the time for the test to complete in 1 of the containers. Both kernels included Johannes' memcg-aware global reclaim patches. Before rwlock patch: 1710.778s After rwlock patch: 152.227s Signed-off-by: Andrew Bresticker <abrestic@google.com> Cc: Paul Menage <menage@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02sysctl: add support for poll()Lucas De Marchi
Adding support for poll() in sysctl fs allows userspace to receive notifications of changes in sysctl entries. This adds a infrastructure to allow files in sysctl fs to be pollable and implements it for hostname and domainname. [akpm@linux-foundation.org: s/declare/define/ for definitions] Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02cpusets: avoid looping when storing to mems_allowed if one node remains setDavid Rientjes
{get,put}_mems_allowed() exist so that general kernel code may locklessly access a task's set of allowable nodes without having the chance that a concurrent write will cause the nodemask to be empty on configurations where MAX_NUMNODES > BITS_PER_LONG. This could incur a significant delay, however, especially in low memory conditions because the page allocator is blocking and reclaim requires get_mems_allowed() itself. It is not atypical to see writes to cpuset.mems take over 2 seconds to complete, for example. In low memory conditions, this is problematic because it's one of the most imporant times to change cpuset.mems in the first place! The only way a task's set of allowable nodes may change is through cpusets by writing to cpuset.mems and when attaching a task to a generic code is not reading the nodemask with get_mems_allowed() at the same time, and then clearing all the old nodes. This prevents the possibility that a reader will see an empty nodemask at the same time the writer is storing a new nodemask. If at least one node remains unchanged, though, it's possible to simply set all new nodes and then clear all the old nodes. Changing a task's nodemask is protected by cgroup_mutex so it's guaranteed that two threads are not changing the same task's nodemask at the same time, so the nodemask is guaranteed to be stored before another thread changes it and determines whether a node remains set or not. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Paul Menage <paul@paulmenage.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02cgroups: don't attach task to subsystem if migration failedBen Blum
If a task has exited to the point it has called cgroup_exit() already, then we can't migrate it to another cgroup anymore. This can happen when we are attaching a task to a new cgroup between the call to ->can_attach_task() on subsystems and the migration that is eventually tried in cgroup_task_migrate(). In this case cgroup_task_migrate() returns -ESRCH and we don't want to attach the task to the subsystems because the attachment to the new cgroup itself failed. Fix this by only calling ->attach_task() on the subsystems if the cgroup migration succeeded. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02cgroups: more safe tasklist locking in cgroup_attach_procBen Blum
Fix unstable tasklist locking in cgroup_attach_proc. According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is not sufficient to guarantee the tasklist is stable w.r.t. de_thread and exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures proper exclusion. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-01Merge branch 'next/dt' of git://git.linaro.org/people/arnd/arm-socLinus Torvalds
* 'next/dt' of git://git.linaro.org/people/arnd/arm-soc: ARM: gic: use module.h instead of export.h ARM: gic: fix irq_alloc_descs handling for sparse irq ARM: gic: add OF based initialization ARM: gic: add irq_domain support irq: support domains with non-zero hwirq base of/irq: introduce of_irq_init ARM: at91: add at91sam9g20 and Calao USB A9G20 DT support ARM: at91: dt: at91sam9g45 family and board device tree files arm/mx5: add device tree support for imx51 babbage arm/mx5: add device tree support for imx53 boards ARM: msm: Add devicetree support for msm8660-surf msm_serial: Add devicetree support msm_serial: Use relative resources for iomem Fix up conflicts in arch/arm/mach-at91/{at91sam9260.c,at91sam9g45.c}
2011-10-31Merge branch 'akpm' (Andrew's incoming)Linus Torvalds
Quoth Andrew: - Most of MM. Still waiting for the poweroc guys to get off their butts and review some threaded hugepages patches. - alpha - vfs bits - drivers/misc - a few core kerenl tweaks - printk() features - MAINTAINERS updates - backlight merge - leds merge - various lib/ updates - checkpatch updates * akpm: (127 commits) epoll: fix spurious lockdep warnings checkpatch: add a --strict check for utf-8 in commit logs kernel.h/checkpatch: mark strict_strto<foo> and simple_strto<foo> as obsolete llist-return-whether-list-is-empty-before-adding-in-llist_add-fix wireless: at76c50x: follow rename pack_hex_byte to hex_byte_pack fat: follow rename pack_hex_byte() to hex_byte_pack() security: follow rename pack_hex_byte() to hex_byte_pack() kgdb: follow rename pack_hex_byte() to hex_byte_pack() lib: rename pack_hex_byte() to hex_byte_pack() lib/string.c: fix strim() semantics for strings that have only blanks lib/idr.c: fix comment for ida_get_new_above() lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef lib/bitmap.c: quiet sparse noise about address space lib/spinlock_debug.c: print owner on spinlock lockup lib/kstrtox: common code between kstrto*() and simple_strto*() functions drivers/leds/leds-lp5521.c: check if reset is successful leds: turn the blink_timer off before starting to blink leds: save the delay values after a successful call to blink_set() drivers/leds/leds-gpio.c: use gpio_get_value_cansleep() when initializing drivers/leds/leds-lm3530.c: add __devexit_p where needed ...
2011-10-31kgdb: follow rename pack_hex_byte() to hex_byte_pack()Andy Shevchenko
There is no functional change. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31printk: remove bounds checking for log_prefixWilliam Douglas
Currently log_prefix is testing that the first character of the log level and facility is less than '0' and greater than '9' (which is always false). Since the code being updated works because strtoul bombs out (endp isn't updated) and 0 is returned anyway just remove the check and don't change the behavior of the function. Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31printk: fix bounds checking for log_prefixWilliam Douglas
Currently log_prefix is testing that the first character of the log level and facility is less than '0' and greater than '9' (which is always false). It should be testing to see if the character less than '0' or greater than '9' instead. This patch makes that change. The code being changed worked because strtoul bombs out (endp isn't updated) and 0 is returned anyway. Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31printk: add console_suspend module parameterYanmin Zhang
We are enabling some power features on medfield. To test suspend-2-RAM conveniently, we need turn on/off console_suspend_enabled frequently. Add a module parameter, so users could change it by: /sys/module/printk/parameters/console_suspend Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31printk: add module parameter ignore_loglevel to control ignore_loglevelYanmin Zhang
We are enabling some power features on medfield. To test suspend-2-RAM conveniently, we need turn on/off ignore_loglevel frequently without rebooting. Add a module parameter, so users can change it by: /sys/module/printk/parameters/ignore_loglevel Signed-off-by: Yanmin Zhang <yanmin.zhang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31kernel/sysctl.c: add cap_last_cap to /proc/sys/kernelDan Ballard
Userspace needs to know the highest valid capability of the running kernel, which right now cannot reliably be retrieved from the header files only. The fact that this value cannot be determined properly right now creates various problems for libraries compiled on newer header files which are run on older kernels. They assume capabilities are available which actually aren't. libcap-ng is one example. And we ran into the same problem with systemd too. Now the capability is exported in /proc/sys/kernel/cap_last_cap. [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich] Signed-off-by: Dan Ballard <dan@mindstab.net> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ulrich Drepper <drepper@akkadia.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31watchdog: move watchdog_*_all_cpus under CONFIG_SYSCTLVasily Averin
Fix compilation warnings for CONFIG_SYSCTL=n: fixed compilation warnings in case of disabled CONFIG_SYSCTL kernel/watchdog.c:483:13: warning: `watchdog_enable_all_cpus' defined but not used kernel/watchdog.c:500:13: warning: `watchdog_disable_all_cpus' defined but not used these functions are static and are used only in sysctl handler, so move them inside #ifdef CONFIG_SYSCTL too Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31stop_machine: make stop_machine safe and efficient to call earlyJeremy Fitzhardinge
Make stop_machine() safe to call early in boot, before SMP has been set up, by simply calling the callback function directly if there's only one CPU online. [ Fixes from AKPM: - add comment - local_irq_flags, not save_flags - also call hard_irq_disable() for systems which need it Tejun suggested using an explicit flag rather than just looking at the online cpu count. ] Cc: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Acked-by: Tejun Heo <htejun@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>