summaryrefslogtreecommitdiffstats
path: root/kernel/sched/rt.c
AgeCommit message (Collapse)Author
2014-06-04printk: Add printk_deferred_onceJohn Stultz
Two of the three prink_deferred uses are really printk_once style uses, so add a printk_deferred_once macro to simplify those call sites. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04printk: rename printk_sched to printk_deferredJohn Stultz
After learning we'll need some sort of deferred printk functionality in the timekeeping core, Peter suggested we rename the printk_sched function so it can be reused by needed subsystems. This only changes the function name. No logic changes. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-22sched, nohz: Change rq->nr_running to always use wrappersKirill Tkhai
Sometimes ->nr_running may cross 2 but interrupt is not being sent to rq's cpu. In this case we don't reenable the timer. Looks like this may be the reason for rare unexpected effects, if nohz is enabled. Patch replaces all places of direct changing of nr_running and makes add_nr_running() caring about crossing border. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18sched: Revert commit 4c6c4e38c4e9 ("sched/core: Fix endless loop in ↵Kirill Tkhai
pick_next_task()") This reverts commit 4c6c4e38c4e9 ("sched/core: Fix endless loop in pick_next_task()"), which is not necessary after ("sched/rt: Substract number of tasks of throttled queues from rq->nr_running"). Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> [conflict resolution with stop task checking patch] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394835307.18748.34.camel@HP-250-G1-Notebook-PC Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18sched/rt: Substract number of tasks of throttled queues from rq->nr_runningKirill Tkhai
Now rq->rt becomes to be able to be in dequeued or enqueued state. We add new member rt_rq->rt_queued, which is used to indicate this. The member is used only for top queue rq->rt_rq. The goal is to fit generic scheme which is used in deadline and fair classes, i.e. throttled rt_rq's rt_nr_running is beeing substracted from rq->nr_running. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394835300.18748.33.camel@HP-250-G1-Notebook-PC Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18sched/rt: Add accessors rq_of_rt_se()Kirill Tkhai
Two accessors for RT_GROUP_SCHED and !RT_GROUP_SCHED cases. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394835295.18748.32.camel@HP-250-G1-Notebook-PC Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18sched/rt: Sum number of all children tasks in hierarhy at ->rt_nr_runningKirill Tkhai
{inc,dec}_rt_tasks() used to count entities which are directly queued on the rt_rq. If an entity was not a task (i.e., it is some queue), its children were not counted. There is no problem here, but now we want to count number of all tasks which are actually queued under the rt_rq in all the hierarchy (except throttled rt queues). Empty queues are not able to be queued and all of the places, which use ->rt_nr_running, just compare it with zero, so we do not break anything here. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394835289.18748.31.camel@HP-250-G1-Notebook-PC Cc: linux-kernel@vger.kernel.org [ Twiddled the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18sched/rt: Do not try to push tasks if pinned task switches to RTKirill V Tkhai
Just switched pinned task is not able to be pushed. If the rq had had several RT tasks before they have already been considered as candidates to be pushed (or pulled). Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140312061833.3a43aa64@gandalf.local.home Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17sched: Check for stop task appearance when balancing happensKirill Tkhai
We need to do it like we do for the other higher priority classes.. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Michael wang <wangyun@linux.vnet.ibm.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/336561397137116@web27h.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11sched/core: Fix endless loop in pick_next_task()Kirill Tkhai
1) Single cpu machine case. When rq has only RT tasks, but no one of them can be picked because of throttling, we enter in endless loop. pick_next_task_{dl,rt} return NULL. In pick_next_task_fair() we permanently go to retry if (rq->nr_running != rq->cfs.h_nr_running) return RETRY_TASK; (rq->nr_running is not being decremented when rt_rq becomes throttled). No chances to unthrottle any rt_rq or to wake fair here, because of rq is locked permanently and interrupts are disabled. 2) In case of SMP this can cause a hang too. Although we unlock rq in idle_balance(), interrupts are still disabled. The solution is to check for available tasks in DL and RT classes instead of checking for sum. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394098321.19290.11.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11sched/rt: Fix picking RT and DL tasks from empty queueKirill Tkhai
The problems: 1) We check for rt_nr_running before call of put_prev_task(). If previous task is RT, its rt_rq may become throttled and dequeued after this call. In case of p is from rt->rq this just causes picking a task from throttled queue, but in case of its rt_rq is child we are guaranteed catch BUG_ON. 2) The same with deadline class. The only difference we operate on only dl_rq. This patch fixes all the above problems and it adds a small skip in the DL update like we've already done for RT class: if (unlikely((s64)delta_exec <= 0)) return; This will optimize sequential update_curr_dl() calls a little. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11Merge branch 'sched/urgent' into sched/coreIngo Molnar
Pick up fixes before queueing up new changes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27sched: Guarantee task priority in pick_next_task()Peter Zijlstra
Michael spotted that the idle_balance() push down created a task priority problem. Previously, when we called idle_balance() before pick_next_task() it wasn't a problem when -- because of the rq->lock droppage -- an rt/dl task slipped in. Similarly for pre_schedule(), rt pre-schedule could have a dl task slip in. But by pulling it into the pick_next_task() loop, we'll not try a higher task priority again. Cure this by creating a re-start condition in pick_next_task(); and triggering this from pick_next_task_{rt,fair}(). It also fixes a live-lock where we get stuck in pick_next_task_fair() due to idle_balance() seeing !0 nr_running but there not actually being any fair tasks about. Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com> Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27sched/deadline: Prevent rt_time growth to infinityJuri Lelli
Kirill Tkhai noted: Since deadline tasks share rt bandwidth, we must care about bandwidth timer set. Otherwise rt_time may grow up to infinity in update_curr_dl(), if there are no other available RT tasks on top level bandwidth. RT task were in fact throttled right after they got enqueued, and never executed again (rt_time never again went below rt_runtime). Peter then proposed to accrue DL execution on rt_time only when rt timer is active, and proposed a patch (this patch is a slight modification of that) to implement that behavior. While this solves Kirill problem, it has a drawback. Indeed, Kirill noted again: It looks we may get into a situation, when all CPU time is shared between RT and DL tasks: rt_runtime = n rt_period = 2n | RT working, DL sleeping | DL working, RT sleeping | ----------------------------------------------------------- | (1) duration = n | (2) duration = n | (repeat) |--------------------------|------------------------------| | (rt_bw timer is running) | (rt_bw timer is not running) | No time for fair tasks at all. While this can happen during the first period, if rq is always backlogged, RT tasks won't have the opportunity to execute anymore: rt_time reached rt_runtime during (1), suppose after (2) RT is enqueued back, it gets throttled since rt timer didn't fire, replenishment is from now on eaten up by DL tasks that accrue their execution on rt_time (while rt timer is active - we have an RT task waiting for replenishment). FAIR tasks are not touched after this first period. Ok, this is not ideal, and the situation is even worse! What above (the nice case), practically never happens in reality, where your rt timer is not aligned to tasks periods, tasks are in general not periodic, etc.. Long story short, you always risk to overload your system. This patch is based on Peter's idea, but exploits an additional fact: if you don't have RT tasks enqueued, it makes little sense to continue incrementing rt_time once you reached the upper limit (DL tasks have their own mechanism for throttling). This cures both problems: - no matter how many DL instances in the past, you'll have an rt_time slightly above rt_runtime when an RT task is enqueued, and from that point on (after the first replenishment), the task will normally execute; - you can still eat up all bandwidth during the first period, but not anymore after that, remember that DL execution will increment rt_time till the upper limit is reached. The situation is still not perfect! But, we have a simple solution for now, that limits how much you can jeopardize your system, as we keep working towards the right answer: RT groups scheduled using deadline servers. Reported-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched/rt: Make init_sched_rt_calss() __initLi Zefan
It's a bootstrap function. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52F5CC09.1080502@huawei.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-21sched: Remove some #ifdefferyPeter Zijlstra
Remove a few gratuitous #ifdefs in pick_next_task*(). Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Fix hotplug task migrationPeter Zijlstra
Dan Carpenter reported: > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338) > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005) Kirill also spotted that migrate_tasks() will have an instant NULL deref because pick_next_task() will immediately deref prev. Instead of fixing all the corner cases because migrate_tasks() can pass in a NULL prev task in the unlikely case of hot-un-plug, provide a fake task such that we can remove all the NULL checks from the far more common paths. A further problem; not previously spotted; is that because we pushed pre_schedule() and idle_balance() into pick_next_task() we now need to avoid those getting called and pulling more tasks on our dying CPU. We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1. We also note that since we call pick_next_task() exactly the amount of times we have runnable tasks present, we should never land in idle_balance(). Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by: Kirill Tkhai <tkhai@yandex.ru> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-11sched: Push down pre_schedule() and idle_balance()Peter Zijlstra
This patch both merged idle_balance() and pre_schedule() and pushes both of them into pick_next_task(). Conceptually pre_schedule() and idle_balance() are rather similar, both are used to pull more work onto the current CPU. We cannot however first move idle_balance() into pre_schedule_fair() since there is no guarantee the last runnable task is a fair task, and thus we would miss newidle balances. Similarly, the dl and rt pre_schedule calls must be ran before idle_balance() since their respective tasks have higher priority and it would not do to delay their execution searching for less important tasks first. However, by noticing that pick_next_tasks() already traverses the sched_class hierarchy in the right order, we can get the right behaviour and do away with both calls. We must however change the special case optimization to also require that prev is of sched_class_fair, otherwise we can miss doing a dl or rt pull where we needed one. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Push put_prev_task() into pick_next_task()Peter Zijlstra
In order to avoid having to do put/set on a whole cgroup hierarchy when we context switch, push the put into pick_next_task() so that both operations are in the same function. Further changes then allow us to possibly optimize away redundant work. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logicJuri Lelli
Introduces data structures relevant for implementing dynamic migration of -deadline tasks and the logic for checking if runqueues are overloaded with -deadline tasks and for choosing where a task should migrate, when it is the case. Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can be moved among CPUs when necessary. It is also possible to bind a task to a (set of) CPU(s), thus restricting its capability of migrating, or forbidding migrations at all. The very same approach used in sched_rt is utilised: - -deadline tasks are kept into CPU-specific runqueues, - -deadline tasks are migrated among runqueues to achieve the following: * on an M-CPU system the M earliest deadline ready tasks are always running; * affinity/cpusets settings of all the -deadline tasks is always respected. Therefore, this very special form of "load balancing" is done with an active method, i.e., the scheduler pushes or pulls tasks between runqueues when they are woken up and/or (de)scheduled. IOW, every time a preemption occurs, the descheduled task might be sent to some other CPU (depending on its deadline) to continue executing (push). On the other hand, every time a CPU becomes idle, it might pull the second earliest deadline ready task from some other CPU. To enforce this, a pull operation is always attempted before taking any scheduling decision (pre_schedule()), as well as a push one after each scheduling decision (post_schedule()). In addition, when a task arrives or wakes up, the best CPU where to resume it is selected taking into account its affinity mask, the system topology, but also its deadline. E.g., from the scheduling point of view, the best CPU where to wake up (and also where to push) a task is the one which is running the task with the latest deadline among the M executing ones. In order to facilitate these decisions, per-runqueue "caching" of the deadlines of the currently running and of the first ready task is used. Queued but not running tasks are also parked in another rb-tree to speed-up pushes. Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entitiesKirill Tkhai
This patch touches the RT group scheduling case. Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's priority, while rt_rq passed to them may be not the top-level rt_rq. This is wrong, because changing of priority on a child level does not guarantee that the priority is the highest all over the rq. So, this leak makes RT balancing unusable. The short example: the task having the highest priority among all rq's RT tasks (no one other task has the same priority) are waking on a throttle rt_rq. The rq's cpupri is set to the task's priority equivalent, but real rq->rt.highest_prio.curr is less. The patch below fixes the problem. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> CC: Steven Rostedt <rostedt@goodmis.org> CC: stable@vger.kernel.org Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-26sched/rt: Fix task_tick_rt() commentLi Bin
This issue was introduced by 454c79999f7e ("sched/rt: Fix SCHED_RR across cgroups") that missed the word 'not'. Fix it. Signed-off-by: Li Bin <huawei.libin@huawei.com> Cc: <guohanjun@huawei.com> Cc: <xiexiuqi@huawei.com> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1382357743-54136-1-git-send-email-huawei.libin@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-16sched/rt: Add missing rmb()Peter Zijlstra
While discussing the proposed SCHED_DEADLINE patches which in parts mimic the existing FIFO code it was noticed that the wmb in rt_set_overloaded() didn't have a matching barrier. The only site using rt_overloaded() to test the rto_count is pull_rt_task() and we should issue a matching rmb before then assuming there's an rto_mask bit set. Without that smp_rmb() in there we could actually miss seeing the rto_mask bit. Also, change to using smp_[wr]mb(), even though this is SMP only code; memory barriers without smp_ always make me think they're against hardware of some sort. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: luca.abeni@unitn.it Cc: bruce.ashfield@windriver.com Cc: dhaval.giani@gmail.com Cc: rostedt@goodmis.org Cc: hgu1972@gmail.com Cc: oleg@redhat.com Cc: fweisbec@gmail.com Cc: darren@dvhart.com Cc: johan.eker@ericsson.com Cc: p.faure@akatech.ch Cc: paulmck@linux.vnet.ibm.com Cc: raistlin@linux.it Cc: claudio@evidence.eu.com Cc: insop.song@gmail.com Cc: michael@amarulasolutions.com Cc: liming.wang@windriver.com Cc: fchecconi@gmail.com Cc: jkacur@redhat.com Cc: tommaso.cucinotta@sssup.it Cc: Juri Lelli <juri.lelli@gmail.com> Cc: harald.gustafsson@ericsson.com Cc: nicola.manica@disi.unitn.it Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Introduce migrate_swap()Peter Zijlstra
Use the new stop_two_cpus() to implement migrate_swap(), a function that flips two tasks between their respective cpus. I'm fairly sure there's a less crude way than employing the stop_two_cpus() method, but everything I tried either got horribly fragile and/or complex. So keep it simple for now. The notable detail is how we 'migrate' tasks that aren't runnable anymore. We'll make it appear like we migrated them before they went to sleep. The sole difference is the previous cpu in the wakeup path, so we override this. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-06sched/rt: Remove redundant nr_cpus_allowed testShawn Bohrer
In 76854c7e8f3f4172fef091e78d88b3b751463ac6 ("sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles") an optimization was added to select_task_rq_rt() that immediately returns when p->nr_cpus_allowed == 1 at the beginning of the function. This makes the latter p->nr_cpus_allowed > 1 check redundant, which can now be removed. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: tomk@rgmadvisors.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_listKirill Tkhai
[ Peter, this is based off of some of my work, I ran it though a few tests and it passed. I also reviewed it, and added my SOB as I am somewhat a co-author to it. ] Based on the patch by Steven Rostedt from previous year: https://lkml.org/lkml/2012/4/18/517 1)Simplify pull_rt_task() logic: search in pushable tasks of dest runqueue. The only pullable tasks are the tasks which are pushable in their local rq, and no others. 2)Remove .leaf_rt_rq_list member of struct rt_rq and functions connected with it: nobody uses it since now. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/287571370557898@web7d.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Use an accessor to read the rq clockFrederic Weisbecker
Read the runqueue clock through an accessor. This prepares for adding a debugging infrastructure to detect missing or redundant calls to update_rq_clock() between a scheduler's entry and exit point. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Remove redundant update_runtime notifierNeil Zhang
migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Signed-off-by: Neil Zhang <zhangwm@marvell.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-10sched: Use this_rq() helperNathan Zimmer
It is a few instructions more efficent to and slightly more readable to use this_rq()-> instead of cpu_rq(smp_processor_id())-> . Size comparison of kernel/sched/fair.o: text data bss dec hex filename 27972 122 26 28120 6dd8 fair.o.before 27956 122 26 28104 6dc8 fair.o.after Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1368116643-87971-1-git-send-email-nzimmer@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-19Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-07sched/rt: Add a tuning knob to allow changing SCHED_RR timesliceClark Williams
Add a /proc/sys/kernel scheduler knob named sched_rr_timeslice_ms that allows global changing of the SCHED_RR timeslice value. User visable value is in milliseconds but is stored as jiffies. Setting to 0 (zero) resets to the default (currently 100ms). Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-03sched/rt: Further simplify pick_rt_task()Kirill Tkhai
Function next_prio() has been removed and pull_rt_task() is the only user of pick_next_highest_task_rt() at the moment. pull_rt_task is not interested in p->nr_cpus_allowed, its only interest is the fact that cpu is allowed to execute p. If nr_cpus_allowed == 1, cpu != task_cpu(p) and cpu is allowed then it means that task p is in the middle of the migration techniques; the task waits until it is moved by migration thread. So, lets pull it earlier. Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> CC: linux-rt-users <linux-rt-users@vger.kernel.org> Link: http://lkml.kernel.org/r/70871359644177@web16d.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-31sched/rt: Do not account zero delta_exec in update_curr_rt()Kirill Tkhai
There are several places of consecutive calls of dequeue_task_rt() and put_prev_task_rt() in the scheduler. For example, function rt_mutex_setprio() does it. The both calls lead to update_curr_rt(), the second of it receives zeroed delta_exec. The only effective action in this case is call of sched_rt_avg_update(), which can change rq->age_stamp and rq->rt_avg. But it is possible in case of ""floating"" rq->clock. This fact is not reasonable to be accounted. Another actions do nothing. Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> CC: linux-rt-users <linux-rt-users@vger.kernel.org> Link: http://lkml.kernel.org/r/931541359550236@web1g.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-25sched/rt: Avoid updating RT entry timeout twice within one tick periodYing Xue
The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, it means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Fan Du <fan.du@windriver.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-25sched/rt: Use root_domain of rt_rq not current processorShawn Bohrer
When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Mike Galbraith <bitbucket@online.de> Cc: peterz@infradead.org Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-24sched/rt: Add reschedule check to switched_from_rt()Kirill Tkhai
Reschedule rq->curr if the first RT task has just been pulled to the rq. Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tkhai Kirill <tkhai@yandex.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/118761353614535@web28f.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-13sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSWPeter Zijlstra
Now that the last architecture to use this has stopped doing so (ARM, thanks Catalin!) we can remove this complexity from the scheduler core. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04sched: Unthrottle rt runqueues in __disable_runtime()Peter Boonstoppel
migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-13sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttledMike Galbraith
Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-06-06sched/rt: Fix lockdep annotation within find_lock_lowest_rq()Peter Zijlstra
Roland Dreier reported spurious, hard to trigger lockdep warnings within the scheduler - without any real lockup. This bit gives us the right clue: > [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90 > [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290 if you look at that code you'll find the double_lock_balance() in question is the one in find_lock_lowest_rq() [yay for inlining]. Now find_lock_lowest_rq() has a bug.. it fails to use double_unlock_balance() in one exit path, if this results in a retry in push_rt_task() we'll call double_lock_balance() again, at which point we'll run into said lockdep confusion. Reported-by: Roland Dreier <roland@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30sched/rt: Fix SCHED_RR across cgroupsColin Cross
task_tick_rt() has an optimization to only reschedule SCHED_RR tasks if they were the only element on their rq. However, with cgroups a SCHED_RR task could be the only element on its per-cgroup rq but still be competing with other SCHED_RR tasks in its parent's cgroup. In this case, the SCHED_RR task in the child cgroup would never yield at the end of its timeslice. If the child cgroup rt_runtime_us was the same as the parent cgroup rt_runtime_us, the task in the parent cgroup would starve completely. Modify task_tick_rt() to check that the task is the only task on its rq, and that the each of the scheduling entities of its ancestors is also the only entity on its rq. Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'Peter Zijlstra
Since nr_cpus_allowed is used outside of sched/rt.c and wants to be used outside of there more, move it to a more natural site. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-14Merge branch 'tip/sched/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into sched/core Pull a scheduler optimization commit from Steven Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-12sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in ↵Kirill Tkhai
set_cpus_allowed_rt() Migration status depends on a difference of weight from 0 and 1. If weight > 1 (<= 1) and old weight <= 1 (> 1) then task becomes pushable (or not pushable). We are not insterested in its exact values, is it 3 or 4, for example. Now if we are changing affinity from a set of 3 cpus to a set of 4, the- task will be dequeued and enqueued sequentially without important difference in comparison with initial state. The only difference is in internal representation of plist queue of pushable tasks and the fact that the task may won't be the first in a sequence of the same priority tasks. But it seems to me it gives nothing. Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tkhai Kirill <tkhai@yandex.ru> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-03-27sched/rt: Improve pick_next_highest_task_rt()Michael J Wang
Avoid extra work by continuing on to the next rt_rq if the highest prio task in current rt_rq is the same priority as our candidate task. More detailed explanation: if next is not NULL, then we have found a candidate task, and its priority is next->prio. Now we are looking for an even higher priority task in the other rt_rq's. idx is the highest priority in the current candidate rt_rq. In the current 3.3 code, if idx is equal to next->prio, we would start scanning the tasks in that rt_rq and replace the current candidate task with a task from that rt_rq. But the new task would only have a priority that is equal to our previous candidate task, so we have not advanced our goal of finding a higher prio task. So we should avoid the extra work by continuing on to the next rt_rq if idx is equal to next->prio. Signed-off-by: Michael J Wang <mjwang@broadcom.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-12printk/sched: Introduce special printk_sched() for those awkward momentsPeter Zijlstra
There's a few awkward printk()s inside of scheduler guts that people prefer to keep but really are rather deadlock prone. Fudge around it by storing the text in a per-cpu buffer and poll it using the existing printk_tick() handler. This will drop output when its more frequent than once a tick, however only the affinity thing could possible go that fast and for that just one should suffice to notify the admin he's done something silly.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Do not throttle when PI boostingPeter Zijlstra
When a runqueue has rt_runtime_us = 0 then the only way it can accumulate rt_time is via PI boosting. That causes the runqueue to be throttled and replenishing does not change anything due to rt_runtime_us = 0. So avoid that situation by clearing rt_time and skip the throttling alltogether. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [ Changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-7x70cypsotjb4jvcor3edctk@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Keep period timer ticking when rt throttling is activePeter Zijlstra
When a runqueue is throttled we cannot disable the period timer because that timer is the only way to undo the throttling. We got stale throttling entries when a rq was throttled and then the global sysctl was disabled, which stopped the timer. Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Added changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-nuj34q52p6ro7szapuz84i0v@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22sched: Make initial SCHED_RR timeslace DEF_TIMESLICEHiroshi Shimamoto
Current the initial SCHED_RR timeslice of init_task is HZ, which means 1s, and is not same as the default SCHED_RR timeslice DEF_TIMESLICE. Change that initial timeslice to the DEF_TIMESLICE. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> [ s/DEF_TIMESLICE/RR_TIMESLICE/g ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4F3C9995.3010800@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSWChanho Min
This issue happens under the following conditions: 1. preemption is off 2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined 3. RT scheduling class 4. SMP system Sequence is as follows: 1.suppose current task is A. start schedule() 2.task A is enqueued pushable task at the entry of schedule() __schedule prev = rq->curr; ... put_prev_task put_prev_task_rt enqueue_pushable_task 4.pick the task B as next task. next = pick_next_task(rq); 3.rq->curr set to task B and context_switch is started. rq->curr = next; 4.At the entry of context_swtich, release this cpu's rq->lock. context_switch prepare_task_switch prepare_lock_switch raw_spin_unlock_irq(&rq->lock); 5.Shortly after rq->lock is released, interrupt is occurred and start IRQ context 6.try_to_wake_up() which called by ISR acquires rq->lock try_to_wake_up ttwu_remote rq = __task_rq_lock(p) ttwu_do_wakeup(rq, p, wake_flags); task_woken_rt 7.push_rt_task picks the task A which is enqueued before. task_woken_rt push_rt_tasks(rq) next_task = pick_next_pushable_task(rq) 8.At find_lock_lowest_rq(), If double_lock_balance() returns 0, lowest_rq can be the remote rq. (But,If preemption is on, double_lock_balance always return 1 and it does't happen.) push_rt_task find_lock_lowest_rq if (double_lock_balance(rq, lowest_rq)).. 9.find_lock_lowest_rq return the available rq. task A is migrated to the remote cpu/rq. push_rt_task ... deactivate_task(rq, next_task, 0); set_task_cpu(next_task, lowest_rq->cpu); activate_task(lowest_rq, next_task, 0); 10. But, task A is on irq context at this cpu. So, task A is scheduled by two cpus at the same time until restore from IRQ. Task A's stack is corrupted. To fix it, don't migrate an RT task if it's still running. Signed-off-by: Chanho Min <chanho.min@lge.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>