path: root/kernel/sched
Age    Commit message    Author
2014-03-11sched/idle: Add more comments to the codeDaniel Lezcano
The idle main function is a complex and critical function. Add more comments to the code. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: tglx@linutronix.de Cc: rjw@rjwysocki.net Cc: preeti@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1393832934-11625-5-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11sched/idle: Move idle conditions in cpuidle_idle main functionDaniel Lezcano
This patch moves the conditions checked before entering idle into the cpuidle main function located in idle.c. This simplifies the idle main-loop functions and increases the readability of the conditions for truly entering idle. This patch is a code reorganization and does not change the behavior of the function. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: tglx@linutronix.de Cc: rjw@rjwysocki.net Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1393832934-11625-4-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11sched/idle: Reorganize the idle loopDaniel Lezcano
Now that we have the main cpuidle function in idle.c, move some code from the idle main loop to this function for the sake of clarity. This removes if/then/else indentation that is difficult to follow when reading the code. This patch does not change the current behavior. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: tglx@linutronix.de Cc: rjw@rjwysocki.net Cc: preeti@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1393832934-11625-3-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11cpuidle/idle: Move the cpuidle_idle_call function to idle.cDaniel Lezcano
cpuidle_idle_call() does nothing more than call the three individual functions and is no longer used by any arch-specific code, only by the cpuidle framework code. We can move this function into the idle task code to ensure better proximity to the scheduler code. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: rjw@rjwysocki.net Cc: preeti@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11Merge branch 'sched/urgent' into sched/coreIngo Molnar
Pick up fixes before queueing up new changes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11sched/clock: Prevent tracing recursion in sched_clock_cpu()Fernando Luis Vazquez Cao
Prevent tracing of preempt_disable/enable() in sched_clock_cpu(). When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are traced and this causes trace_clock() users (and probably others) to go into an infinite recursion. Systems with a stable sched_clock() are not affected. This problem is similar to that fixed by upstream commit 95ef1e52922 ("KVM guest: prevent tracing recursion with kvmclock"). Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexus Signed-off-by: Ingo Molnar <mingo@kernel.org>
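The shape of the fix, sketched (function and helper names are those of kernel/sched/clock.c; the stable-clock fast path and remote-cpu case are elided, so treat the body as illustrative): switch to the notrace preemption helpers so the tracer cannot recurse through them:

    u64 sched_clock_cpu(int cpu)
    {
    	struct sched_clock_data *scd;
    	u64 clock;

    	/* notrace variants are not traced themselves, so trace_clock()
    	 * users can no longer recurse through preempt_disable() */
    	preempt_disable_notrace();
    	scd = cpu_sdc(cpu);
    	clock = sched_clock_local(scd);
    	preempt_enable_notrace();

    	return clock;
    }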
2014-03-11sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policyJuri Lelli
Deny the use of the SCHED_DEADLINE policy to unprivileged users. Even though root users can set the policy for normal users, we don't want the latter to be able to change their parameters (safest behavior). Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
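A sketch of where the check lands (the exact placement inside __sched_setscheduler()'s unprivileged path is inferred from the description above, so take the surrounding code as illustrative):

    	/* unprivileged path of __sched_setscheduler() */
    	if (user && !capable(CAP_SYS_NICE)) {
    		/* deny SCHED_DEADLINE outright: setting *and* changing
    		 * parameters stays a privileged operation for now */
    		if (dl_policy(policy))
    			return -EPERM;
    		/* existing RT priority / nice checks follow */
    	}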
2014-02-27sched: Guarantee task priority in pick_next_task()Peter Zijlstra
Michael spotted that the idle_balance() push down created a task priority problem. Previously, when we called idle_balance() before pick_next_task() it wasn't a problem when -- because of the rq->lock droppage -- an rt/dl task slipped in. Similarly for pre_schedule(), rt pre-schedule could have a dl task slip in. But by pulling it into the pick_next_task() loop, we'll not try a higher task priority again. Cure this by creating a re-start condition in pick_next_task(); and triggering this from pick_next_task_{rt,fair}(). It also fixes a live-lock where we get stuck in pick_next_task_fair() due to idle_balance() seeing !0 nr_running but there not actually being any fair tasks about. Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com> Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
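A sketch of the re-start condition (RETRY_TASK is the sentinel this fix introduces; the loop body is illustrative):

    again:
    	for_each_class(class) {
    		p = class->pick_next_task(rq, prev);
    		if (p) {
    			/* rq->lock may have been dropped inside
    			 * pick_next_task_{rt,fair}(); a higher-priority
    			 * task may have slipped in, so restart the
    			 * class walk from the top */
    			if (unlikely(p == RETRY_TASK))
    				goto again;
    			return p;
    		}
    	}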
2014-02-27sched/idle: Remove stale old filePeter Zijlstra
Commit cf37b6b48428d ("sched/idle: Move cpu/idle.c to sched/idle.c") was meant to simply move a file; somehow it got mangled, creating an old version of the file and forgetting to remove the original. Fix this fail; add the lost change and remove the now identical old file. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: rjw@rjwysocki.net Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Link: http://lkml.kernel.org/r/20140224172207.GC9987@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHEDDietmar Eggemann
The struct sched_avg of struct rq is only used when group scheduling is enabled, inside __update_tg_runnable_avg(), to update the per-cpu representation of a task group. I.e., there is no need to maintain the runnable avg of a rq in the !CONFIG_FAIR_GROUP_SCHED case. This patch guards struct sched_avg of struct rq and update_rq_runnable_avg() with CONFIG_FAIR_GROUP_SCHED. There is an extra empty definition of update_rq_runnable_avg() necessary for the !CONFIG_FAIR_GROUP_SCHED && CONFIG_SMP case. The function print_cfs_group_stats(), which prints out struct sched_avg of struct rq, is already guarded with CONFIG_FAIR_GROUP_SCHED. Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/530DCDC5.1060406@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27sched/deadline: Prevent rt_time growth to infinityJuri Lelli
Kirill Tkhai noted: Since deadline tasks share rt bandwidth, we must care about the bandwidth timer being set. Otherwise rt_time may grow up to infinity in update_curr_dl(), if there are no other available RT tasks on the top-level bandwidth. RT tasks were in fact throttled right after they got enqueued, and never executed again (rt_time never again went below rt_runtime). Peter then proposed to accrue DL execution on rt_time only when the rt timer is active, and proposed a patch (this patch is a slight modification of that) to implement that behavior. While this solves Kirill's problem, it has a drawback. Indeed, Kirill noted again: It looks like we may get into a situation when all CPU time is shared between RT and DL tasks:

    rt_runtime = n
    rt_period  = 2n

    |  RT working, DL sleeping  |  DL working, RT sleeping      |
    -----------------------------------------------------------
    |  (1) duration = n         |  (2) duration = n             | (repeat)
    |---------------------------|-------------------------------|
    | (rt_bw timer is running)  | (rt_bw timer is not running)  |

No time for fair tasks at all. While this can happen during the first period, if the rq is always backlogged, RT tasks won't have the opportunity to execute anymore: rt_time reached rt_runtime during (1); suppose after (2) RT is enqueued back, it gets throttled since the rt timer didn't fire, and replenishment is from now on eaten up by DL tasks that accrue their execution on rt_time (while the rt timer is active - we have an RT task waiting for replenishment). FAIR tasks are not touched after this first period. Ok, this is not ideal, and the situation is even worse! What above (the nice case) practically never happens in reality, where your rt timer is not aligned to task periods, tasks are in general not periodic, etc.. Long story short, you always risk overloading your system. This patch is based on Peter's idea, but exploits an additional fact: if you don't have RT tasks enqueued, it makes little sense to continue incrementing rt_time once you reach the upper limit (DL tasks have their own mechanism for throttling). This cures both problems:

 - no matter how many DL instances ran in the past, you'll have an rt_time slightly above rt_runtime when an RT task is enqueued, and from that point on (after the first replenishment) the task will execute normally;

 - you can still eat up all the bandwidth during the first period, but not anymore after that; remember that DL execution will increment rt_time only till the upper limit is reached.

The situation is still not perfect! But we have a simple solution for now, that limits how much you can jeopardize your system, as we keep working towards the right answer: RT groups scheduled using deadline servers. Reported-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
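The accrual condition, sketched (sched_rt_bandwidth_account() is the helper implementing "timer active || rt_time below the limit"; details are illustrative): DL execution is charged to rt_time only while an RT task awaits replenishment or the upper limit hasn't been reached:

    /* in update_curr_dl() */
    if (rt_bandwidth_enabled()) {
    	struct rt_rq *rt_rq = &rq->rt;

    	raw_spin_lock(&rt_rq->rt_runtime_lock);
    	/* accrue only while the rt_bw timer is active (an RT task
    	 * is waiting for replenishment) or the limit isn't reached */
    	if (sched_rt_bandwidth_account(rt_rq))
    		rt_rq->rt_time += delta_exec;
    	raw_spin_unlock(&rt_rq->rt_runtime_lock);
    }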
2014-02-27sched/deadline: Switch CPU's presence test orderJuri Lelli
Commit 82b9580 ("sched/deadline: Test for CPU's presence explicitly") changed how we check if a CPU returned by cpudeadline machinery is valid. But, we don't want to call cpu_present() if best_cpu is equal to -1. So, switch the order of tests inside WARN_ON(). Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: boris.ostrovsky@oracle.com Cc: konrad.wilk@oracle.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
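The reordered test, sketched; the && short-circuit guarantees cpu_present() is never called with an invalid index:

    	/* -1 means "no suitable CPU found": test it first */
    	WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));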
2014-02-27sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migrationKirill Tkhai
In deadline class we do not have group scheduling. So, let's remove unnecessary X = X; equations. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27sched: Fix double normalization of vruntimeGeorge McCollister
dequeue_entity() is called when p->on_rq is set and it sets se->on_rq = 0, which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule(), the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify that making this change corrects the problem on that platform and kernel version. Signed-off-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
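A sketch of the one-line fix in switched_from_fair() (surrounding code is illustrative):

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
    	struct sched_entity *se = &p->se;
    	struct cfs_rq *cfs_rq = cfs_rq_of(se);

    	/* was: if (!se->on_rq && ...) -- but dequeue_entity() already
    	 * cleared se->on_rq, so vruntime could be normalized twice */
    	if (!p->on_rq && p->state != TASK_RUNNING) {
    		place_entity(cfs_rq, se, 0);
    		se->vruntime -= cfs_rq->min_vruntime;
    	}
    }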
2014-02-24smp: Rename __smp_call_function_single() to smp_call_function_single_async()Frederic Weisbecker
The name __smp_call_function_single() doesn't tell much about the properties of this function, especially when compared to smp_call_function_single(). The comments above the implementation are also misleading. The main point of this function is actually not to be able to embed the csd in an object; that is actually a requirement that results from the purpose of this function, which is to raise an IPI asynchronously. As such it can be called with interrupts disabled. And this feature comes at the cost of the caller, who then needs to serialize the IPIs on this csd. Let's rename the function and enhance the comments so that they reflect these properties. Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-24smp: Remove wait argument from __smp_call_function_single()Frederic Weisbecker
The main point of calling __smp_call_function_single() is to send an IPI in a pure asynchronous way. By embedding a csd in an object, a caller can send the IPI without waiting for a previous one to complete, as is required by smp_call_function_single() for example. As such, sending this kind of IPI can be safe even when irqs are disabled. This flexibility comes at the expense of the caller, who then needs to synchronize the csd lifecycle by himself and make sure that IPIs on a single csd are serialized. This is how __smp_call_function_single() works when wait = 0, and this usecase is relevant. Now there doesn't seem to be any usecase with wait = 1 that can't be covered by smp_call_function_single() instead, which is safer. Let's look at the two possible scenarios:

 1) The user calls __smp_call_function_single(wait = 1) on a csd embedded in an object. It looks like a nice and convenient pattern at first sight because we can then retrieve the object from the IPI handler easily. But actually it is a waste of memory space in the object since the csd can be allocated from the stack by smp_call_function_single(wait = 1) and the object can be passed as the IPI argument. Besides that, embedding the csd in an object is more error prone because the caller must take care of the serialization of the IPIs for this csd.

 2) The user calls __smp_call_function_single(wait = 1) on a csd that is allocated on the stack. It's OK, but smp_call_function_single() can do it as well and it already takes care of the allocation on the stack. Again it's simpler and less error prone.

Therefore, using the underscore-prepended API version with wait = 1 is a bad pattern and a sign that the caller can do something safer and simpler. There was a single user of that, which has just been converted. So let's remove this option to discourage further users. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
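A usage sketch of the async-only API that remains after these two changes (my_obj, my_ipi_func and kick_cpu are hypothetical names):

    struct my_obj {
    	struct call_single_data csd;	/* caller must not reuse the csd
    					 * until the previous IPI is done */
    	int payload;
    };

    static void my_ipi_func(void *info)
    {
    	struct my_obj *obj = info;	/* runs on the target CPU */
    	/* consume obj->payload */
    }

    static void kick_cpu(struct my_obj *obj, int cpu)
    {
    	obj->csd.func = my_ipi_func;
    	obj->csd.info = obj;
    	/* purely asynchronous: no waiting, safe with irqs disabled */
    	smp_call_function_single_async(cpu, &obj->csd);
    }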
2014-02-22sched, nohz: Exclude isolated cores from load balancingMike Galbraith
The user explicitly disabled load balancing, else this core would not be disconnected. Don't add these to nohz.idle_cpus_mask. Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Lei Wen <leiwen@marvell.com> Link: http://lkml.kernel.org/n/tip-vmme4f49psirp966pklm5l9j@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
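A sketch of the exclusion (a disconnected core has no sched domains attached, which on_null_domain() reflects; details are illustrative):

    /* in nohz_balance_enter_idle() */
    	/* a completely isolated CPU doesn't take part in nohz idle
    	 * balancing: don't add it to nohz.idle_cpus_mask */
    	if (on_null_domain(cpu_rq(cpu)))
    		return;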
2014-02-22sched: Fix select_task_rq_fair() description commentsMorten Rasmussen
Brings select_task_rq_fair() description comments up-to-date. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392732864-10927-1-git-send-email-morten.rasmussen@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICEDongsheng Yang
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/bd80780f19b4f9b4a765acc353c8dbc130274dd6.1392103744.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched/rt: Make init_sched_rt_class() __initLi Zefan
It's a bootstrap function. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52F5CC09.1080502@huawei.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched/rt: Remove 'leaf_rt_rq_list' from 'struct rq'Li Zefan
This is a leftover from commit e23ee74777f389369431d77390c4b09332ce026a ("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list"). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52F5CBF6.4060901@huawei.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Consider pi boosting in setscheduler()Thomas Gleixner
If a PI-boosted task's policy/priority is modified by a setscheduler() call, we unconditionally dequeue and requeue the task if it is on the runqueue, even if the new priority is lower than the current effective boosted priority. This can result in undesired reordering of the priority bucket list. If the new priority is less than or equal to the current effective priority, we just store the new parameters in the task struct and leave the scheduler class and the runqueue untouched. This is handled when the task deboosts itself. Only if the new priority is higher than the effective boosted priority do we apply the change immediately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Rebase ontop of v3.14-rc1. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Dario Faggioli <raistlin@linux.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Queue RT tasks to head when prio dropsThomas Gleixner
The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks:

    T1: SCHED_FIFO, prio 80
    T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario):

    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
    ...
    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
    ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space, which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to the head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
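The resulting enqueue, sketched (flag semantics per the description above; note a numerically larger prio value means a lower priority):

    	if (on_rq)
    		/* a lowered (or deboosted) task goes back to the head
    		 * of its new prio bucket so it keeps the CPU, as POSIX
    		 * requires for SCHED_FIFO */
    		enqueue_task(rq, p, oldprio <= p->prio ? ENQUEUE_HEAD : 0);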
2014-02-22sched: Adjust p->sched_reset_on_fork when nothing else changesThomas Gleixner
If the policy and priority remain unchanged a possible modification of p->sched_reset_on_fork gets lost in the early exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Rebase ontop of v3.14-rc1. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Add better debug output for might_sleep()Thomas Gleixner
might_sleep() can tell us where interrupts have been disabled, but we have no idea what disabled preemption. Add some debug infrastructure. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Check for idle task in might_sleep()Thomas Gleixner
Idle is not allowed to call sleeping functions ever! Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Init idle->on_rq in init_idle()Thomas Gleixner
We stumbled in RT over an SMP bringup issue on ARM where idle->on_rq == 0 was causing try_to_wake_up() on the other cpu to run into nada land. After adding that idle->on_rq = 1; I was able to find the root cause of the lockup: the idle task on the newly woken up cpu was fiddling with a sleeping spinlock, which is a no-no. I kept the init of idle->on_rq to keep the state consistent and to avoid another long lasting debug session. As a side note, the whole debug mess could have been avoided if might_sleep() would have yelled when called from the idle task. That's fixed with patch 2/6 - and that one actually has a changelog :) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-21sched: Remove some #ifdefferyPeter Zijlstra
Remove a few gratuitous #ifdefs in pick_next_task*(). Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Fix hotplug task migrationPeter Zijlstra
Dan Carpenter reported:

    > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
    > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)

Kirill also spotted that migrate_tasks() will have an instant NULL deref because pick_next_task() will immediately deref prev. Instead of fixing all the corner cases because migrate_tasks() can pass in a NULL prev task in the unlikely case of hot-un-plug, provide a fake task such that we can remove all the NULL checks from the far more common paths. A further problem, not previously spotted, is that because we pushed pre_schedule() and idle_balance() into pick_next_task() we now need to avoid those getting called and pulling more tasks on our dying CPU. We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1. We also note that since we call pick_next_task() exactly the number of times we have runnable tasks present, we should never land in idle_balance(). Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by: Kirill Tkhai <tkhai@yandex.ru> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
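A sketch of the fake task (the prio and class choices follow the reasoning above; details are illustrative):

    static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
    {
    }

    static const struct sched_class fake_sched_class = {
    	.put_prev_task = put_prev_task_fake,
    };

    static struct task_struct fake_task = {
    	/* MAX_PRIO + 1 is lower than any real priority, so the
    	 * rt/dl classes never try to pull work on behalf of "prev" */
    	.prio		= MAX_PRIO + 1,
    	.sched_class	= &fake_sched_class,
    };

    	/* migrate_tasks() can then always pass a non-NULL prev: */
    	next = pick_next_task(rq, &fake_task);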
2014-02-21sched/fair: Remove idle_balance() declaration in sched.hPeter Zijlstra
Remove idle_balance() from the public life; also reduce some #ifdef clutter by folding the pick_next_task_fair() idle path into idle_balance(). Cc: mingo@kernel.org Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140211151148.GP27965@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/fair: Reset se-depth when task switched to FAIRMichael wang
Sasha reported:

    [  522.645288] BUG: unable to handle kernel NULL pointer dereference at ...
    [  522.646271] IP: [<ffffffff81186c6f>] check_preempt_wakeup+0x11f/0x210
    ...
    [  522.650021] Call Trace:
    [  522.650021] <IRQ>
    [  522.650021] [<ffffffff8117361d>] check_preempt_curr+0x3d/0xb0
    [  522.650021] [<ffffffff81175d88>] ttwu_do_wakeup+0x18/0x130
    ...

which was caused by se->depth being changed during the time when the task is not FAIR, so we use the wrong depth value after it is switched back to FAIR. This patch resets the depth when the task is switched to FAIR, making sure that we always have the correct value when the task is FAIR. Cc: Ingo Molnar <mingo@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5305732D.70001@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
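The reset, sketched (guarded by CONFIG_FAIR_GROUP_SCHED since only group scheduling maintains a depth; the surrounding body is illustrative):

    static void switched_to_fair(struct rq *rq, struct task_struct *p)
    {
    	struct sched_entity *se = &p->se;

    #ifdef CONFIG_FAIR_GROUP_SCHED
    	/* depth may be stale after time spent in another class */
    	se->depth = se->parent ? se->parent->depth + 1 : 0;
    #endif
    	/* ... */
    }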
2014-02-21sched/deadline: Remove useless dl_nr_totalKirill Tkhai
In deadline class we do not have group scheduling like in RT. dl_nr_total is the same as dl_nr_running. So, one of them should be removed. Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/368631392675853@web20h.yandex.ru Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/deadline: Test for CPU's presence explicitlyBoris Ostrovsky
A hot-removed CPU may have ID that is numerically larger than the number of existing CPUs in the system (e.g. we can unplug CPU 4 from a system that has CPUs 0, 1 and 4). Thus the WARN_ONs should check whether the CPU in question is currently present, not whether its ID value is less than num_present_cpus(). Cc: Ingo Molnar <mingo@kernel.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392646353-1874-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Add 'flags' argument to sched_{set,get}attr() syscallsPeter Zijlstra
Because of a recent syscall design debate, it is deemed appropriate for each syscall to have a flags argument for future extension, without immediately requiring new syscalls. Cc: juri.lelli@gmail.com Cc: Ingo Molnar <mingo@redhat.com> Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Fix information leak in sys_sched_getattr()Vegard Nossum
We're copying the on-stack structure to userspace, but forgot to give the right number of bytes to copy. This allows the calling process to obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent kernel memory). This fix copies only as much as we actually have on the stack (attr->size defaults to the size of the struct) and leaves the rest of the userspace-provided buffer untouched. Found using kmemcheck + trinity. Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Cc: Dario Faggioli <raistlin@linux.it> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
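The essence of the fix, sketched: bound the copy by what the kernel actually filled in, not by the user-supplied size:

    	/* attr lives on the kernel stack; attr->size defaults to
    	 * sizeof(*attr), so nothing beyond it is ever copied out */
    	if (copy_to_user(uattr, attr, attr->size))	/* was: usize */
    		return -EFAULT;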
2014-02-21sched,numa: add cond_resched to task_numa_workRik van Riel
Normally task_numa_work scans over a fairly small amount of memory, but it is possible to run into a large unpopulated part of virtual memory, with no pages mapped. In that case, task_numa_work can run for a while, and it may make sense to reschedule as required. Cc: akpm@linux-foundation.org Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Xing Gang <gang.xing@hp.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392761566-24834-2-git-send-email-riel@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
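The added reschedule point, sketched (inside task_numa_work()'s per-VMA scan loop; the loop details are illustrative):

    		do {
    			start = max(start, vma->vm_start);
    			end = min(ALIGN(start + (pages << PAGE_SHIFT),
    					HPAGE_SIZE), vma->vm_end);
    			nr_pte_updates += change_prot_numa(vma, start, end);
    			start = end;
    			/* the fix: yield here, so scanning a large
    			 * unpopulated VMA cannot hog the CPU */
    			cond_resched();
    		} while (end != vma->vm_end);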
2014-02-21sched/core: Make dl_b->lock IRQ safeJuri Lelli
Fix this lockdep warning:

    [   44.804600] =========================================================
    [   44.805746] [ INFO: possible irq lock inversion dependency detected ]
    [   44.805746] 3.14.0-rc2-test+ #14 Not tainted
    [   44.805746] ---------------------------------------------------------
    [   44.805746] bash/3674 just changed the state of lock:
    [   44.805746]  (&dl_b->lock){+.....}, at: [<ffffffff8106ad15>] sched_rt_handler+0x132/0x248
    [   44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [   44.805746]  (&rq->lock){-.-.-.}

    and interrupts could create inverse lock ordering between them.

    [   44.805746] other info that might help us debug this:
    [   44.805746]  Possible interrupt unsafe locking scenario:
    [   44.805746]        CPU0                    CPU1
    [   44.805746]        ----                    ----
    [   44.805746]   lock(&dl_b->lock);
    [   44.805746]                            local_irq_disable();
    [   44.805746]                            lock(&rq->lock);
    [   44.805746]                            lock(&dl_b->lock);
    [   44.805746]   <Interrupt>
    [   44.805746]     lock(&rq->lock);

by making the dl_b->lock acquisition always IRQ safe. Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
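The locking change, sketched: every dl_b->lock acquisition switches to the irqsave variant:

    	unsigned long flags;

    	/* was: raw_spin_lock(&dl_b->lock) -- unsafe, since rq->lock
    	 * (HARDIRQ-safe) can be held while taking dl_b->lock */
    	raw_spin_lock_irqsave(&dl_b->lock, flags);
    	/* update deadline bandwidth accounting */
    	raw_spin_unlock_irqrestore(&dl_b->lock, flags);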
2014-02-21sched/core: Fix sched_rt_global_validateJuri Lelli
Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth management (with CONFIG_RT_GROUP_SCHED=n) fails. Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
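The validation, sketched; RUNTIME_INF short-circuits the comparison:

    static int sched_rt_global_validate(void)
    {
    	if (sysctl_sched_rt_period <= 0)
    		return -EINVAL;

    	/* RUNTIME_INF means "rt throttling disabled": skip the
    	 * comparison, otherwise disabling -rt bandwidth management
    	 * would spuriously fail */
    	if ((sysctl_sched_rt_runtime != RUNTIME_INF) &&
    	    (sysctl_sched_rt_runtime > sysctl_sched_rt_period))
    		return -EINVAL;

    	return 0;
    }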
2014-02-21sched/deadline: Fix overflow to handle period==0 and deadline!=0Steven Rostedt
While debugging the crash with the bad nr_running accounting, I hit another bug where, after running my sched deadline test, I was getting failures to take a CPU offline. It was giving me a -EBUSY error. Adding a bunch of trace_printk()s around, I found that the cpu notifier that called sched_cpu_inactive() was returning a failure. The overflow value was coming up negative? Talking this over with Juri, the problem is that the total_bw update was supposed to be made by dl_overflow() which, during my tests, seemed to not be called. Adding more trace_printk()s, it wasn't that it wasn't called, but that it exited right away with the check of new_bw being equal to p->dl.dl_bw. new_bw is calculated as the ratio between runtime and period. The bug is that if you set a deadline, you do not need to set a period if you plan on the period being equal to the deadline. That is, if period is zero and deadline is not, then the system call should set the period to be equal to the deadline. This is done elsewhere in the code. The fix is easy: check if period is set, and if it is not, then use the deadline. Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
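The one-liner, sketched (in the bandwidth-overflow check; surrounding lines are illustrative):

    	/* a zero period means "period == deadline" */
    	u64 period = attr->sched_period ?: attr->sched_deadline;
    	u64 runtime = attr->sched_runtime;
    	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;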
2014-02-21sched/deadline: Fix bad accounting of nr_runningJuri Lelli
Rostedt writes: My test suite was locking up hard when enabling mmiotracer. This was due to the mmiotracer placing all but one CPU offline. I found this out when I was able to reproduce the bug with just my stress-cpu-hotplug test. This bug baffled me because it would not always trigger, and would only trigger on the first run after boot up. The stress-cpu-hotplug test would crash hard the first run, or never crash at all. But a new reboot may cause it to crash on the first run again. I spent all week bisecting this, as I couldn't find a consistent reproducer. I finally narrowed it down to the sched deadline patches, and even more peculiar, to the commit that added the sched deadline boot up self test to the latency tracer. Then it dawned on me what the bug was. All it took was to run a task under sched deadline to screw up the CPU hot plugging. This explained why it would lock up only on the first run of the stress-cpu-hotplug test. The bug happened when the boot up self test of the schedule latency tracer would test a deadline task. The deadline task would corrupt something that would cause CPU hotplug to fail. If it didn't corrupt it, the stress test would always work (there are no other sched deadline tasks that would run to cause problems). If it did corrupt on boot up, the first test would lockup hard. I proved this theory by running my deadline test program on another box, and then running the stress-cpu-hotplug test, and it would now consistently lock up. I could run stress-cpu-hotplug over and over with no problem, but once I ran the deadline test, the next run of the stress-cpu-hotplug would lock hard. After adding lots of tracing to the code, I found the cause. The function tracer showed that migrate_tasks() was stuck in an infinite loop, where rq->nr_running never equaled 1 to break out of it. When I added a trace_printk() to see what that number was, it was 335 and never decrementing! Looking at the deadline code I found:

    static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
    	dequeue_dl_entity(&p->dl);
    	dequeue_pushable_dl_task(rq, p);
    }

    static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
    	update_curr_dl(rq);
    	__dequeue_task_dl(rq, p, flags);
    	dec_nr_running(rq);
    }

And this:

    	if (dl_runtime_exceeded(rq, dl_se)) {
    		__dequeue_task_dl(rq, curr, 0);
    		if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
    			dl_se->dl_throttled = 1;
    		else
    			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);

    		if (!is_leftmost(curr, &rq->dl))
    			resched_task(curr);
    	}

Notice how we call __dequeue_task_dl() and in the else case we call enqueue_task_dl()? Also notice that dequeue_task_dl() has underscores where enqueue_task_dl() does not. The enqueue_task_dl() calls inc_nr_running(rq), but __dequeue_task_dl() does not. This is where we get nr_running out of sync. [snip] Another point where nr_running can get out of sync is when the dl_timer fires:

    	dl_se->dl_throttled = 0;
    	if (p->on_rq) {
    		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
    		if (task_has_dl_policy(rq->curr))
    			check_preempt_curr_dl(rq, p, 0);
    		else
    			resched_task(rq->curr);

This patch does two things:

 - correctly accounts for throttled tasks (that are now considered !running);

 - fixes the bug, updating nr_running from {inc,dec}_dl_tasks(), since we risk updating it twice in some situations (e.g., a task is dequeued while it has exceeded its budget).
Cc: mingo@redhat.com Cc: torvalds@linux-foundation.org Cc: akpm@linux-foundation.org Reported-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392884379-13744-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mmMartin Schwidefsky
finish_arch_post_lock_switch() is called at the end of the task switch, after all locks have been released. Conceptually it is paired with the switch_mm() function, but the current code only does the call in finish_task_switch(). Add the call to idle_task_exit() and use_mm(). One use case for the additional calls is s390, which will use finish_arch_post_lock_switch() to wait for the completion of TLB flush operations. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-02-13cgroup: drop @skip_css from cgroup_taskset_for_each()Tejun Heo
If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching css. The intention of the interface is to make it easy to skip css's (cgroup_subsys_states) which already match the migration target; however, this is entirely unnecessary as migration taskset doesn't include tasks which are already in the target cgroup. Drop @skip_css from cgroup_taskset_for_each(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com>
2014-02-12cgroup: remove cgroup->nameTejun Heo
cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-11sched/idle: Move cpu/idle.c to sched/idle.cNicolas Pitre
Integration of cpuidle with the scheduler requires that the idle loop be closely integrated with the scheduler proper. Moving cpu/idle.c into the sched directory will allow for a smoother integration, and eliminate a subdirectory which contained only one source file. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401301102210.1652@knanqh.ubzr Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Add statistic for newidle load balance costAlex Shi
Track rq->max_idle_balance_cost and sd->max_newidle_lb_cost; it's useful to know these values in debug mode. Signed-off-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52E0F3BF.5020904@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Delete is_same_group() outside CONFIG_FAIR_GROUP_SCHEDDietmar Eggemann
Since is_same_group() is only used in the group scheduling code, there is no need to define it outside CONFIG_FAIR_GROUP_SCHED. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391005773-29493-1-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Push down pre_schedule() and idle_balance()Peter Zijlstra
This patch merges idle_balance() and pre_schedule(), pushing both of them into pick_next_task(). Conceptually pre_schedule() and idle_balance() are rather similar; both are used to pull more work onto the current CPU. We cannot, however, first move idle_balance() into pre_schedule_fair(), since there is no guarantee the last runnable task is a fair task, and thus we would miss newidle balances. Similarly, the dl and rt pre_schedule calls must be run before idle_balance(), since their respective tasks have higher priority and it would not do to delay their execution searching for less important tasks first. However, by noticing that pick_next_task() already traverses the sched_class hierarchy in the right order, we can get the right behaviour and do away with both calls. We must however change the special case optimization to also require that prev is of sched_class_fair, otherwise we can miss doing a dl or rt pull where we needed one. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
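The resulting pick_next_task(), sketched (the fast path now also requires prev to be a fair task, per the note above; details are illustrative):

    static inline struct task_struct *
    pick_next_task(struct rq *rq, struct task_struct *prev)
    {
    	const struct sched_class *class = &fair_sched_class;
    	struct task_struct *p;

    	/* optimization: if everything runnable is CFS *and* prev is
    	 * too, ask the fair class directly; requiring a fair prev
    	 * makes sure we never skip a needed rt/dl pull */
    	if (likely(prev->sched_class == class &&
    		   rq->nr_running == rq->cfs.h_nr_running)) {
    		p = fair_sched_class.pick_next_task(rq, prev);
    		if (likely(p))
    			return p;
    	}

    	for_each_class(class) {
    		p = class->pick_next_task(rq, prev);
    		if (p)
    			return p;
    	}

    	BUG();	/* the idle class always has a runnable task */
    }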
2014-02-10sched: Clean up idle task SMP logicPeter Zijlstra
The idle post_schedule flag is just a vile waste of time; furthermore it appears unneeded. Move the idle_enter_fair() call into pick_next_task_idle(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: alex.shi@linaro.org Cc: mingo@kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/n/tip-aljykihtxJt3mkokxi0qZurb@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Optimize cgroup pick_next_task_fair()Peter Zijlstra
Since commit 2f36825b1 ("sched: Next buddy hint on sleep and preempt path") it is likely we pick a new task from the same cgroup, and doing a put and then set on all intermediate entities is a waste of time, so try to avoid this. Measured using:

    mount nodev /cgroup -t cgroup -o cpu
    cd /cgroup
    mkdir a; cd a
    mkdir b; cd b
    mkdir c; cd c
    echo $$ > tasks
    perf stat --repeat 10 -- taskset 1 perf bench sched pipe

    PRE : 4.542422684 seconds time elapsed ( +- 0.33% )
    POST: 4.389409991 seconds time elapsed ( +- 0.32% )

Which shows a significant improvement of ~3.5%. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Clean up the __clear_buddies_*() functionsPeter Zijlstra
Slightly easier code flow, no functional changes. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>