summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2010-11-29rcu: fix race condition in synchronize_sched_expedited()Paul E. McKenney
The new (early 2010) implementation of synchronize_sched_expedited() uses try_stop_cpu() to force a context switch on every CPU. It also permits concurrent calls to synchronize_sched_expedited() to share a single call to try_stop_cpu() through use of an atomically incremented synchronize_sched_expedited_count variable. Unfortunately, this is subject to failure as follows: o Task A invokes synchronize_sched_expedited(), try_stop_cpus() succeeds, but Task A is preempted before getting to the atomic increment of synchronize_sched_expedited_count. o Task B also invokes synchronize_sched_expedited(), with exactly the same outcome as Task A. o Task C also invokes synchronize_sched_expedited(), again with exactly the same outcome as Tasks A and B. o Task D also invokes synchronize_sched_expedited(), but only gets as far as acquiring the mutex within try_stop_cpus() before being preempted, interrupted, or otherwise delayed. o Task E also invokes synchronize_sched_expedited(), but only gets to the snapshotting of synchronize_sched_expedited_count. o Tasks A, B, and C all increment synchronize_sched_expedited_count. o Task E fails to get the mutex, so checks the new value of synchronize_sched_expedited_count. It finds that the value has increased, so (wrongly) assumes that its work has been done, returning despite there having been no expedited grace period since it began. The solution is to have the lowest-numbered CPU atomically increment the synchronize_sched_expedited_count variable within the synchronize_sched_expedited_cpu_stop() function, which is under the protection of the mutex acquired by try_stop_cpus(). However, this also requires that piggybacking tasks wait for three rather than two instances of try_stop_cpu(), because we cannot control the order in which the per-CPU callback function occur. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: update documentation/comments for Lai's adoption patchPaul E. McKenney
Lai's RCU-callback immediate-adoption patch changes the RCU tracing output, so update tracing.txt. Also update a few comments to clarify the synchronization design. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu,cleanup: simplify the code when cpu is dyingLai Jiangshan
When we handle the CPU_DYING notifier, the whole system is stopped except for the current CPU. We therefore need no synchronization with the other CPUs. This allows us to move any orphaned RCU callbacks directly to the list of any online CPU without needing to run them through the global orphan lists. These global orphan lists can therefore be dispensed with. This commit makes thes changes, though currently victimizes CPU 0 @@@. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu,cleanup: move synchronize_sched_expedited() out of sched.cLai Jiangshan
The first version of synchronize_sched_expedited() used the migration code in the scheduler, and was therefore implemented in kernel/sched.c. However, the more recent version of this code no longer uses the migration code, so this commit moves it to the main RCU source files. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: get rid of obsolete "classic" names in TREE_RCU tracingPaul E. McKenney
The TREE_RCU tracing had obsolete rcuclassic_trace_init() and rcuclassic_trace_cleanup() function names. This commit brings them up to date: rcutree_trace_init() and rcutree_trace_cleanup(), respectively. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: Distinguish between boosting and boostedPaul E. McKenney
RCU priority boosting's tracing did not distinguish between ongoing boosting and completion of boosting. This commit therefore adds this capability. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCUPaul E. McKenney
Add tracing for the tiny RCU implementations, including statistics on boosting in the case of TINY_PREEMPT_RCU and RCU_BOOST. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: priority boosting for TINY_PREEMPT_RCUPaul E. McKenney
Add priority boosting, but only for TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST kernel parameter. The priority to which to boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter (defaulting to real-time priority 1) and the time to wait before boosting the readers blocking a given grace period is controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-28Kill off a bunch of warning: ‘inline’ is not at beginning of declarationJesper Juhl
These warnings are spewed during a build of a 'allnoconfig' kernel (especially the ones from u64_stats_sync.h show up a lot) when building with -Wextra (which I often do).. They are a) annoying b) easy to get rid of. This patch kills them off. include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-29security: Define CAP_SYSLOGSerge E. Hallyn
Privileged syslog operations currently require CAP_SYS_ADMIN. Split this off into a new CAP_SYSLOG privilege which we can sanely take away from a container through the capability bounding set. With this patch, an lxc container can be prevented from messing with the host's syslog (i.e. dmesg -c). Changelog: mar 12 2010: add selinux capability2:cap_syslog perm Changelog: nov 22 2010: . port to new kernel . add a WARN_ONCE if userspace isn't using CAP_SYSLOG Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-By: Kees Cook <kees.cook@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: "Christopher J. PeBenito" <cpebenito@tresys.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: James Morris <jmorris@namei.org>
2010-11-28Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix the software context switch counter perf, x86: Fixup Kconfig deps x86, perf, nmi: Disable perf if counters are not accessible perf: Fix inherit vs. context rotation bug
2010-11-27Merge branch 'cleanup-bd_claim' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core
2010-11-27Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call
2010-11-27Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf symbols: Remove incorrect open-coded container_of() perf record: Handle restrictive permissions in /proc/{kallsyms,modules} x86/kprobes: Prevent kprobes to probe on save_args() irq_work: Drop cmpxchg() result perf: Fix owner-list vs exit x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG tracing: Fix recursive user stack trace perf,hw_breakpoint: Initialize hardware api earlier x86: Ignore trap bits on single step exceptions tracing: Force arch_local_irq_* notrace for paravirt tracing: Fix module use of trace_bprintk()
2010-11-27Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix idle balancing sched: Fix volanomark performance regression
2010-11-26perf, arch: Cleanup perf-pmu init vs lockup-detectorPeter Zijlstra
The perf hardware pmu got initialized at various points in the boot, some before early_initcall() some after (notably arch_initcall). The problem is that the NMI lockup detector is ran from early_initcall() and expects the hardware pmu to be present. Sanitize this by moving all architecture hardware pmu implementations to initialize at early_initcall() and move the lockup detector to an explicit initcall right after that. Cc: paulus <paulus@samba.org> Cc: davem <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290707759.2145.119.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Ignore non-sampling overflowsPeter Zijlstra
Some arch implementations call perf_event_overflow() by 'accident', ignore this. Reported-by: Francis Moreau <francis.moro@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Don't bother to init the hrtimer for no SW sampling countersFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290525705-6265-3-git-send-email-fbuihuu@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Limit event refresh to sampling eventFranck Bui-Huu
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290525705-6265-2-git-send-email-fbuihuu@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Introduce is_sampling_event()Franck Bui-Huu
and use it when appropriate. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290525705-6265-1-git-send-email-fbuihuu@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: arch/x86/kernel/apic/hw_nmi.c Merge reason: Resolve conflict, queue up dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26sched: Remove unused argument dest_cpu to migrate_task()Nikanth Karthikesan
Remove unused argument, 'dest_cpu' of migrate_task(), and pass runqueue, as it is always known at the call site. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <201011261237.09187.knikanth@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26mutexes, sched: Introduce arch_mutex_cpu_relax()Gerald Schaefer
The spinning mutex implementation uses cpu_relax() in busy loops as a compiler barrier. Depending on the architecture, cpu_relax() may do more than needed in this specific mutex spin loops. On System z we also give up the time slice of the virtual cpu in cpu_relax(), which prevents effective spinning on the mutex. This patch replaces cpu_relax() in the spinning mutex code with arch_mutex_cpu_relax(), which can be defined by each architecture that selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so this patch should not affect other architectures than System z for now. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290437256.7455.4.camel@thinkpad> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26Merge commit 'v2.6.37-rc3' into sched/coreIngo Molnar
Merge reason: Pick up latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26Merge commit 'v2.6.37-rc3' into perf/coreIngo Molnar
Merge reason: Pick up latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26nohz: Fix printk_needs_cpu() return value on offline cpusHeiko Carstens
This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued on offline cpus. printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets offlined it schedules the idle process which, before killing its own cpu, will call tick_nohz_stop_sched_tick(). That function in turn will call printk_needs_cpu() in order to check if the local tick can be disabled. On offline cpus this function should naturally return 0 since regardless if the tick gets disabled or not the cpu will be dead short after. That is besides the fact that __cpu_disable() should already have made sure that no interrupts on the offlined cpu will be delivered anyway. In this case it prevents tick_nohz_stop_sched_tick() to call select_nohz_load_balancer(). No idea if that really is a problem. However what made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is used within __mod_timer() to select a cpu on which a timer gets enqueued. If printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get updated when a cpu gets offlined. It may contain the cpu number of an offline cpu. In turn timers get enqueued on an offline cpu and not very surprisingly they never expire and cause system hangs. This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses get_nohz_timer_target() which doesn't have that problem. However there might be other problems because of the too early exit tick_nohz_stop_sched_tick() in case a cpu goes offline. Easiest way to fix this is just to test if the current cpu is offline and call printk_tick() directly which clears the condition. Alternatively I tried a cpu hotplug notifier which would clear the condition, however between calling the notifier function and printk_needs_cpu() something could have called printk() again and the problem is back again. This seems to be the safest fix. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org LKML-Reference: <20101126120235.406766476@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26printk: Fix wake_up_klogd() vs cpu hotplugHeiko Carstens
wake_up_klogd() may get called from preemptible context but uses __raw_get_cpu_var() to write to a per cpu variable. If it gets preempted between getting the address and writing to it, the cpu in question could be offline if the process gets scheduled back and hence writes to the per cpu data of an offline cpu. This buggy behaviour was introduced with fa33507a "printk: robustify printk, fix #2" which was supposed to fix a "using smp_processor_id() in preemptible" warning. Let's use this_cpu_write() instead which disables preemption and makes sure that the outlined scenario cannot happen. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101126124247.GC7023@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Fix the software context switch counterPeter Zijlstra
Stephane noticed that because the perf_sw_event() call is inside the perf_event_task_sched_out() call it won't get called unless we have a per-task counter. Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26perf: Fix inherit vs. context rotation bugThomas Gleixner
It was found that sometimes children of tasks with inherited events had one extra event. Eventually it turned out to be due to the list rotation no being exclusive with the list iteration in the inheritance code. Cure this by temporarily disabling the rotation while we inherit the events. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26workqueue: check the allocation of system_unbound_wqHitoshi Mitake
I found a trivial bug on initialization of workqueue. Current init_workqueues doesn't check the result of allocation of system_unbound_wq, this should be checked like other queues. Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2010-11-23sched: Add some clock info to sched_debugPeter Zijlstra
Add more clock information to /proc/sched_debug, Thomas wanted to see the sched_clock_stable state. Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23cpu: Remove incorrect BUG_ONPeter Zijlstra
Oleg mentioned that there is no actual guarantee the dying cpu's migration thread is actually finished running when we get there, so replace the BUG_ON() with a spinloop waiting for it. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23cpu: Remove unused variableDhaval Giani
GCC warns us about: kernel/cpu.c: In function ‘take_cpu_down’: kernel/cpu.c:200:15: warning: unused variable ‘cpu’ This variable is unused since param->hcpu is directly used later on in cpu_notify. Signed-off-by: Dhaval Giani <dhaval_giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290091494.1145.5.camel@gondor.retis> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23sched: Fix UP build breakagePeter Zijlstra
The recent cgroup-scheduling rework caused a UP build problem. Cc: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-23sched: Make task dump print all 15 chars of proc commErik Gilling
Signed-off-by: Erik Gilling <konkers@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290218934-8544-3-git-send-email-john.stultz@linaro.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-19Revert "kernel: make /proc/kallsyms mode 400 to reduce ease of attacking"Linus Torvalds
This reverts commit 59365d136d205cc20fe666ca7f89b1c5001b0d5a. It turns out that this can break certain existing user land setups. Quoth Sarah Sharp: "On Wednesday, I updated my branch to commit 460781b from linus' tree, and my box would not boot. klogd segfaulted, which stalled the whole system. At first I thought it actually hung the box, but it continued booting after 5 minutes, and I was able to log in. It dropped back to the text console instead of the graphical bootup display for that period of time. dmesg surprisingly still works. I've bisected the problem down to this commit (commit 59365d136d205cc20fe666ca7f89b1c5001b0d5a) The box is running klogd 1.5.5ubuntu3 (from Jaunty). Yes, I know that's old. I read the bit in the commit about changing the permissions of kallsyms after boot, but if I can't boot that doesn't help." So let's just keep the old default, and encourage distributions to do the "chmod -r /proc/kallsyms" in their bootup scripts. This is not worth a kernel option to change default behavior, since it's so easily done in user space. Reported-and-bisected-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Cc: Marcus Meissner <meissner@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-19tracing/events: Show real number in array fieldsSteven Rostedt
Currently we have in something like the sched_switch event: field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1; When a userspace tool such as perf tries to parse this, the TASK_COMM_LEN is meaningless. This is done because the TRACE_EVENT() macro simply uses a #len to show the string of the length. When the length is an enum, we get a string that means nothing for tools. By adding a static buffer and a mutex to protect it, we can store the string into that buffer with snprintf and show the actual number. Now we get: field:char prev_comm[16]; offset:12; size:16; signed:1; Something much more useful. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-11-18Merge branch 'perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core
2010-11-18Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb,ppc: Fix regression in evr register handling kgdb,x86: fix regression in detach handling kdb: fix crash when KDB_BASE_CMD_MAX is exceeded kdb: fix memory leak in kdb_main.c
2010-11-18tracing: New flag to allow non privileged users to use a trace eventFrederic Weisbecker
This adds a new trace event internal flag that allows them to be used in perf by non privileged users in case of task bound tracing. This is desired for syscalls tracepoint because they don't leak global system informations, like some other tracepoints. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Jason Baron <jbaron@redhat.com>
2010-11-18x86: Add RO/NX protection for loadable kernel modulesmatthieu castet
This patch is a logical extension of the protection provided by CONFIG_DEBUG_RODATA to LKMs. The protection is provided by splitting module_core and module_init into three logical parts each and setting appropriate page access permissions for each individual section: 1. Code: RO+X 2. RO data: RO+NX 3. RW data: RW+NX In order to achieve proper protection, layout_sections() have been modified to align each of the three parts mentioned above onto page boundary. Next, the corresponding page access permissions are set right before successful exit from load_module(). Further, free_module() and sys_init_module have been modified to set module_core and module_init as RW+NX right before calling module_free(). By default, the original section layout and access flags are preserved. When compiled with CONFIG_DEBUG_SET_MODULE_RONX=y, the patch will page-align each group of sections to ensure that each page contains only one type of content and will enforce RO/NX for each group of pages. -v1: Initial proof-of-concept patch. -v2: The patch have been re-written to reduce the number of #ifdefs and to make it architecture-agnostic. Code formatting has also been corrected. -v3: Opportunistic RO/NX protection is now unconditional. Section page-alignment is enabled when CONFIG_DEBUG_RODATA=y. -v4: Removed most macros and improved coding style. -v5: Changed page-alignment and RO/NX section size calculation -v6: Fixed comments. Restricted RO/NX enforcement to x86 only -v7: Introduced CONFIG_DEBUG_SET_MODULE_RONX, added calls to set_all_modules_text_rw() and set_all_modules_text_ro() in ftrace -v8: updated for compatibility with linux 2.6.33-rc5 -v9: coding style fixes -v10: more coding style fixes -v11: minor adjustments for -tip -v12: minor adjustments for v2.6.35-rc2-tip -v13: minor adjustments for v2.6.37-rc1-tip Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com> Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Andi Kleen <ak@muc.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dave Jones <davej@redhat.com> Cc: Kees Cook <kees.cook@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <4CE2F914.9070106@free.fr> [ minor cleanliness edits, -v14: build failure fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Update tg->shares after cpu.shares writePaul Turner
Formerly sched_group_set_shares would force a rebalance by overflowing domain share sums. Now that per-cpu averages are maintained we can set the true value by issuing an update_cfs_shares() following a tg->shares update. Also initialize tg se->load to 0 for consistency since we'll now set correct weights on enqueue. Signed-off-by: Paul Turner <pjt@google.com?> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.465521344@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Allow update_cfs_load() to update global loadPaul Turner
Refactor the global load updates from update_shares_cpu() so that update_cfs_load() can update global load when it is more than ~10% out of sync. The new global_load parameter allows us to force an update, regardless of the error factor so that we can synchronize w/ update_shares(). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.377473595@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Implement demand based update_cfs_load()Paul Turner
When the system is busy, dilation of rq->next_balance makes lb->update_shares() insufficiently frequent for threads which don't sleep (no dequeue/enqueue updates). Adjust for this by making demand based updates based on the accumulation of execution time sufficient to wrap our averaging window. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.291159744@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Update shares on idle_balancePaul Turner
Since shares updates are no longer expensive and effectively local, update them at idle_balance(). This allows us to more quickly redistribute shares to another cpu when our load becomes idle. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.204191702@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Add sysctl_sched_shares_windowPaul Turner
Introduce a new sysctl for the shares window and disambiguate it from sched_time_avg. A 10ms window appears to be a good compromise between accuracy and performance. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.112173964@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Introduce hierarchal order on shares update listPaul Turner
Avoid duplicate shares update calls by ensuring children always appear before parents in rq->leaf_cfs_rq_list. This allows us to do a single in-order traversal for update_shares(). Since we always enqueue in bottom-up order this reduces to 2 cases: 1) Our parent is already in the list, e.g. root \ b /\ c d* (root->b->c already enqueued) Since d's parent is enqueued we push it to the head of the list, implicitly ahead of b. 2) Our parent does not appear in the list (or we have no parent) In this case we enqueue to the tail of the list, if our parent is subsequently enqueued (bottom-up) it will appear to our right by the same rule. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.022488865@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Fix update_cfs_load() synchronizationPaul Turner
Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with the put path since nr_running accounting occurs at deactivation. It's also not safe to make the removal decision based on load_avg as this fails with both high periods and low shares. Resolve this by clipping history after 4 periods without activity. Note: the above will always occur from update_shares() since in the last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load is called. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.933428187@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Fix load corruption from update_cfs_shares()Paul Turner
As part of enqueue_entity both a new entity weight and its contribution to the queuing cfs_rq / rq are updated. Since update_cfs_shares will only update the queueing weights when the entity is on_rq (which in this case it is not yet), there's a dependency loop here: update_cfs_shares needs account_entity_enqueue to update cfs_rq->load.weight account_entity_enqueue needs the updated weight for the queuing cfs_rq load[*] Fix this and avoid spurious dequeue/enqueues by issuing update_cfs_shares as if we had accounted the enqueue already. This was also resulting in rq->load corruption previously. [*]: this dependency also exists when using the group cfs_rq w/ update_cfs_shares as the weight of the enqueued entity changes without the load being updated. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.844900206@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Make tg_shares_up() walk on-demandPeter Zijlstra
Make tg_shares_up() use the active cgroup list, this means we cannot do a strict bottom-up walk of the hierarchy, but assuming its a very wide tree with a small number of active groups it should be a win. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.754159484@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>