summaryrefslogtreecommitdiffstats
path: root/kernel/sched
AgeCommit message (Collapse)Author
2013-10-09sched/numa: Select a preferred node with the most numa hinting faultsMel Gorman
This patch selects a preferred node for a task to run on based on the NUMA hinting faults. This information is later used to migrate tasks towards the node during balancing. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Track NUMA hinting faults on per-node basisMel Gorman
This patch tracks what nodes numa hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Slow scan rate if no NUMA hinting faults are being recordedMel Gorman
NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page was migrated. For long-lived but idle processes there may be no faults but the scan rate will be high and just waste CPU. This patch will slow the scan rate for processes that are not trapping faults. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Set the scan rate proportional to the memory usage of the task ↵Mel Gorman
being scanned The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size. This scan rate is independent of the size of the task and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods and reasoning about performance is complex when the time to complete a full scan is is partially a function of the tasks memory size. This patch alters the semantic of the min and max tunables to be about tuning the length time it takes to complete a scan of a tasks occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary but it is to have the maximum scan rate after the patch roughly match the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies to numa_scan_period means that the rate scanning slows depends on HZ which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Initialise numa_next_scan properlyMel Gorman
Scan delay logic and resets are currently initialised to start scanning immediately instead of delaying properly. Initialise them properly at fork time and catch when a new mm has been allocated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a ↵Mel Gorman
new node" PTE scanning and NUMA hinting fault handling is expensive so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case that this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here but for now, this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Continue PTE scanning even if migrate rate limitedPeter Zijlstra
Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate limited sees like a bad idea. Even if this node can't migrate anymore other nodes might and we want up-to-date information to do balance decisions. We already rate limit the actual migrations, this should leave enough bandwidth to allow the non-migrating scanning. I think its important we keep up-to-date information if we're going to do placement based on it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Mitigate chance that same task always updates PTEsPeter Zijlstra
With a trace_printk("working\n"); right after the cmpxchg in task_numa_work() we can see that of a 4 thread process, its always the same task winning the race and doing the protection change. This is a problem since the task doing the protection change has a penalty for taking faults -- it is busy when marking the PTEs. If its always the same task the ->numa_faults[] get severely skewed. Avoid this by delaying the task doing the protection change such that it is unlikely to win the privilege again. Before: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3232 [022] .... 212.787402: task_numa_work: working thread 0/0-3232 [022] .... 212.888473: task_numa_work: working thread 0/0-3232 [022] .... 212.989538: task_numa_work: working thread 0/0-3232 [022] .... 213.090602: task_numa_work: working thread 0/0-3232 [022] .... 213.191667: task_numa_work: working thread 0/0-3232 [022] .... 213.292734: task_numa_work: working thread 0/0-3232 [022] .... 213.393804: task_numa_work: working thread 0/0-3232 [022] .... 213.494869: task_numa_work: working thread 0/0-3232 [022] .... 213.596937: task_numa_work: working thread 0/0-3232 [022] .... 213.699000: task_numa_work: working thread 0/0-3232 [022] .... 213.801067: task_numa_work: working thread 0/0-3232 [022] .... 213.903155: task_numa_work: working thread 0/0-3232 [022] .... 214.005201: task_numa_work: working thread 0/0-3232 [022] .... 214.107266: task_numa_work: working thread 0/0-3232 [022] .... 214.209342: task_numa_work: working After: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3253 [005] .... 136.865051: task_numa_work: working thread 0/2-3255 [026] .... 136.965134: task_numa_work: working thread 0/3-3256 [024] .... 137.065217: task_numa_work: working thread 0/3-3256 [024] .... 137.165302: task_numa_work: working thread 0/3-3256 [024] .... 137.265382: task_numa_work: working thread 0/0-3253 [004] .... 137.366465: task_numa_work: working thread 0/2-3255 [026] .... 137.466549: task_numa_work: working thread 0/0-3253 [004] .... 137.566629: task_numa_work: working thread 0/0-3253 [004] .... 137.666711: task_numa_work: working thread 0/1-3254 [028] .... 137.766799: task_numa_work: working thread 0/0-3253 [004] .... 137.866876: task_numa_work: working thread 0/2-3255 [026] .... 137.966960: task_numa_work: working thread 0/1-3254 [028] .... 138.067041: task_numa_work: working thread 0/2-3255 [026] .... 138.167123: task_numa_work: working thread 0/3-3256 [024] .... 138.267207: task_numa_work: working Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09sched/numa: Fix commentsPeter Zijlstra
Fix a 80 column violation and a PTE vs PMD reference. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-06sched/rt: Remove redundant nr_cpus_allowed testShawn Bohrer
In 76854c7e8f3f4172fef091e78d88b3b751463ac6 ("sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles") an optimization was added to select_task_rq_rt() that immediately returns when p->nr_cpus_allowed == 1 at the beginning of the function. This makes the latter p->nr_cpus_allowed > 1 check redundant, which can now be removed. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: tomk@rgmadvisors.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched: Prepare for per-cpu preempt_countPeter Zijlstra
When using per-cpu preempt_count variables we need to save/restore the preempt_count on context switch (into per task storage; for instance the old thread_info::preempt_count variable) because of PREEMPT_ACTIVE. However, this means that on fork() the preempt_count value of the last context switch gets copied and if we had a PREEMPT_ACTIVE switch right before cloning a child task the child task will now too have PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE count. Therefore we need to make init_task_preempt_count() unconditional; this resets whatever preempt_count we inherited from our parent process. Doing so for !per-cpu implementations is harmless. For !PREEMPT_COUNT kernels we need to be careful not to start life with an increased preempt_count. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched: Extract the basic add/sub preempt_count modifiersPeter Zijlstra
Rewrite the preempt_count macros in order to extract the 3 basic preempt_count value modifiers: __preempt_count_add() __preempt_count_sub() and the new: __preempt_count_dec_and_test() And since we're at it anyway, replace the unconventional $op_preempt_count names with the more conventional preempt_count_$op. Since these basic operators are equivalent to the previous _notrace() variants, do away with the _notrace() versions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched: Create more preempt_count accessorsPeter Zijlstra
We need a few special preempt_count accessors: - task_preempt_count() for when we're interested in the preemption count of another (non-running) task. - init_task_preempt_count() for properly initializing the preemption count. - init_idle_preempt_count() a special case of the above for the idle threads. With these no generic code ever touches thread_info::preempt_count anymore and architectures could choose to remove it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-jf5swrio8l78j37d06fzmo4r@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched: Add NEED_RESCHED to the preempt_countPeter Zijlstra
In order to combine the preemption and need_resched test we need to fold the need_resched information into the preempt_count value. Since the NEED_RESCHED flag is set across CPUs this needs to be an atomic operation, however we very much want to avoid making preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED infrastructure in place but at 3 sites test it and fold its value into preempt_count; namely: - resched_task() when setting TIF_NEED_RESCHED on the current task - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a remote task it follows it up with a reschedule IPI and we can modify the cpu local preempt_count from there. - cpu_idle_loop() for when resched_task() found tsk_is_polling(). We use an inverted bitmask to indicate need_resched so that a 0 means both need_resched and !atomic. Also remove the barrier() in preempt_enable() between preempt_enable_no_resched() and preempt_check_resched() to avoid having to reload the preemption value and allow the compiler to use the flags of the previuos decrement. I couldn't come up with any sane reason for this barrier() to be there as preempt_enable_no_resched() already has a barrier() before doing the decrement. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched: Introduce preempt_count accessor functionsPeter Zijlstra
Replace the single preempt_count() 'function' that's an lvalue with two proper functions: preempt_count() - returns the preempt_count value as rvalue preempt_count_set() - Allows setting the preempt-count value Also provide preempt_count_ptr() as a convenience wrapper to implement all modifying operations. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org [ Fixed build failure. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched, rcu: Make RCU use resched_cpu()Peter Zijlstra
We're going to deprecate and remove set_need_resched() for it will do the wrong thing. Make an exception for RCU and allow it to use resched_cpu() which will do the right thing. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-2eywnacjl1nllctl1nszqa5w@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25sched: Micro-optimize by dropping unnecessary task_rq() callsMichael S. Tsirkin
We always know the rq used, let's just pass it around. This seems to cut the size of scheduler core down a tiny bit: Before: [linux]$ size kernel/sched/core.o.orig text data bss dec hex filename 62760 16130 3876 82766 1434e kernel/sched/core.o.orig After: [linux]$ size kernel/sched/core.o.patched text data bss dec hex filename 62566 16130 3876 82572 1428c kernel/sched/core.o.patched Probably speeds it up as well. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130922142054.GA11499@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched/balancing: Periodically decay max cost of idle balanceJason Low
This patch builds on patch 2 and periodically decays that max value to do idle balancing per sched domain by approximately 1% per second. Also decay the rq's max_idle_balance_cost value. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-4-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched/balancing: Consider max cost of idle balance per sched domainJason Low
In this patch, we keep track of the max cost we spend doing idle load balancing for each sched domain. If the avg time the CPU remains idle is less then the time we have already spent on idle balancing + the max cost of idle balancing in the sched domain, then we don't continue to attempt the balance. We also keep a per rq variable, max_idle_balance_cost, which keeps track of the max time spent on newidle load balances throughout all its domains so that we can determine the avg_idle's max value. By using the max, we avoid overrunning the average. This further reduces the chance we attempt balancing when the CPU is not idle for longer than the cost to balance. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched: Reduce overestimating rq->avg_idleJason Low
When updating avg_idle, if the delta exceeds some max value, then avg_idle gets set to the max, regardless of what the previous avg was. This can cause avg_idle to often be overestimated. This patch modifies the way we update avg_idle by always updating it with the function call to update_avg() first. Then, if avg_idle exceeds the max, we set it to the max. Signed-off-by: Jason Low <jason.low2@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched/balancing: Prevent the reselection of a previous env.dst_cpu if some ↵Vladimir Davydov
tasks are pinned Currently new_dst_cpu is prevented from being reselected actually, not dst_cpu. This can result in attempting to pull tasks to this_cpu twice. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/281f59b6e596c718dd565ad267fc38f5b8e5c995.1379265590.git.vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20Merge branch 'sched/urgent' into sched/coreIngo Molnar
Merge in the latest fixes before applying a dependent patch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched/balancing: Fix cfs_rq->task_h_load calculationVladimir Davydov
Patch a003a2 (sched: Consider runnable load average in move_tasks()) sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is always 0. This mistype leads to all tasks having weight 0 when load balancing in a cpu-cgroup enabled setup. There obviously should be sum of weights of all runnable tasks there instead. Fix it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in ↵Vladimir Davydov
fix_small_imbalance() In busiest->group_imb case we can come to fix_small_imbalance() with local->avg_load > busiest->avg_load. This can result in wrong imbalance fix-up, because there is the following check there where all the members are unsigned: if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >= (scaled_busy_load_per_task * imbn)) { env->imbalance = busiest->load_per_task; return; } As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix it by substituting the subtraction with an equivalent addition in the check. [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. ] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20sched/balancing: Fix 'local->avg_load > sds->avg_load' case in ↵Vladimir Davydov
calculate_imbalance() In busiest->group_imb case we can come to calculate_imbalance() with local->avg_load >= busiest->avg_load >= sds->avg_load. This can result in imbalance overflow, because it is calculated as follows env->imbalance = min( max_pull * busiest->group_power, (sds->avg_load - local->avg_load) * local->group_power) / SCHED_POWER_SCALE; As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix this by skipping the assignment and assuming imbalance=0 in case local->avg_load > sds->avg_load. [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. ] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-18Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix comment for sched_info_depart sched/Documentation: Update sched-design-CFS.txt documentation sched/debug: Take PID namespace into account sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones
2013-09-16sched: Fix comment for sched_info_departMichael S. Tsirkin
sched_info_depart seems to be only called from sched_info_switch(), so only on involuntary task switch. Fix the comment to match. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Performance regression fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix load balancing performance regression in should_we_balance()
2013-09-12sched/fair: Fix the group_capacity computationPeter Zijlstra
Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting the capacity to the actual number of cores. The assumption of 1 < smt_power < 2 is an actual requirement because of what SMT is so this should work regardless of the SMT implementation. It can still be defeated by creative use of cpu hotplug, but if you're one of those freaks, you get to live with it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guitto@linaro.org> Link: http://lkml.kernel.org/n/tip-dczmbi8tfgixacg1ji2av1un@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/fair: Rework and comment the group_capacity codePeter Zijlstra
Pull out the group_capacity computation so that we can more clearly comment its issues. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-az1hl1ya55k361nkeh9bj0yw@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/fair: Fix group power_orig computationPeter Zijlstra
When looking at the code I noticed we don't actually compute sgp->power_orig correctly for groups, fix that. Currently the only consumer of that value is fix_small_capacity() which is only used on POWER7+ and that code excludes this case by being limited to SD_SHARE_CPUPOWER which is only ever set on the SMT domain which must be the lowest domain and this has singleton groups. So nothing should be affected by this change. Cc: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-db2pe0vxwunv37plc7onnugj@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/fair: Reduce local_group logicPeter Zijlstra
Try and reduce the local_group logic by pulling most of it into update_sd_lb_stats. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-mgezl354xgyhiyrte78fdkpd@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/fair: Rewrite group_imb triggerPeter Zijlstra
Change the group_imb detection from the old 'load-spike' detector to an actual imbalance detector. We set it from the lower domain balance pass when it fails to create a balance in the presence of task affinities. The advantage is that this should no longer generate the false positive group_imb conditions generated by transient load spikes from the normal balancing/bulk-wakeup etc. behaviour. While I haven't actually observed those they could happen. I'm not entirely happy with this patch; it somehow feels a little fragile. Nor does it solve the biggest issue I have with the group_imb code; it it still a fragile construct in that once we 'fixed' the imbalance we'll not detect the group_imb again and could end up re-creating it. That said, this patch does seem to preserve behaviour for the described degenerate case. In particular on my 2*6*2 wsm-ep: taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done' ends up with 9 spinners, each on their own CPU; whereas if you disable the group_imb code that typically doesn't happen (you'll get one pair sharing a CPU most of the time). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-36fpbgl39dv4u51b6yz2ypz5@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/debug: Take PID namespace into accountPeter Zijlstra
Emmanuel reported that /proc/sched_debug didn't report the right PIDs when using namespaces, cure this. Reported-by: Emmanuel Deloget <emmanuel.deloget@efixo.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/fair: Fix small race where child->se.parent,cfs_rq might point to ↵Daisuke Nishimura
invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-10sched: Fix load balancing performance regression in should_we_balance()Joonsoo Kim
Commit 23f0d20 ("sched: Factor out code to should_we_balance()") introduces the should_we_balance() function. This function should return 1 if this cpu is appropriate for balancing. But the newly introduced code doesn't do so, it returns 0 instead of 1. This introduces performance regression, reported by Dave Chinner: v4 filesystem v5 filesystem 3.11+xfsdev: 220k files/s 225k files/s 3.12-git 180k files/s 185k files/s 3.12-git-revert 245k files/s 247k files/s You can find more detailed information at: https://lkml.org/lkml/2013/9/10/1 This patch corrects the return value of should_we_balance() function as orignally intended. With this patch, Dave Chinner reports that the regression is gone: v4 filesystem v5 filesystem 3.11+xfsdev: 220k files/s 225k files/s 3.12-git 180k files/s 185k files/s 3.12-git-revert 245k files/s 247k files/s 3.12-git-fix 249k files/s 248k files/s Reported-by: Dave Chinner <dchinner@redhat.com> Tested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Link: http://lkml.kernel.org/r/20130910065448.GA20368@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-05Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cputime fix from Ingo Molnar: "This fixes a longer-standing cputime accounting bug that Stanislaw Gruszka finally managed to track down" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime: Do not scale when utime == 0
2013-09-04Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Gleb Natapov: "The highlights of the release are nested EPT and pv-ticketlocks support (hypervisor part, guest part, which is most of the code, goes through tip tree). Apart of that there are many fixes for all arches" Fix up semantic conflicts as discussed in the pull request thread.. * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits) ARM: KVM: Add newlines to panic strings ARM: KVM: Work around older compiler bug ARM: KVM: Simplify tracepoint text ARM: KVM: Fix kvm_set_pte assignment ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256 ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs ARM: KVM: vgic: fix GICD_ICFGRn access ARM: KVM: vgic: simplify vgic_get_target_reg KVM: MMU: remove unused parameter KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate() KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX KVM: x86: update masterclock when kvmclock_offset is calculated (v2) KVM: PPC: Book3S: Fix compile error in XICS emulation KVM: PPC: Book3S PR: return appropriate error when allocation fails arch: powerpc: kvm: add signed type cast for comparation KVM: x86: add comments where MMIO does not return to the emulator KVM: vmx: count exits to userspace during invalid guest emulation KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp kvm: optimize away THP checks in kvm_is_mmio_pfn() ...
2013-09-04Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers/nohz changes from Ingo Molnar: "It mostly contains fixes and full dynticks off-case optimizations, by Frederic Weisbecker" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) nohz: Include local CPU in full dynticks global kick nohz: Optimize full dynticks's sched hooks with static keys nohz: Optimize full dynticks state checks with static keys nohz: Rename a few state variables vtime: Always debug check snapshot source _before_ updating it vtime: Always scale generic vtime accounting results vtime: Optimize full dynticks accounting off case with static keys vtime: Describe overriden functions in dedicated arch headers m68k: hardirq_count() only need preempt_mask.h hardirq: Split preempt count mask definitions context_tracking: Split low level state headers vtime: Fix racy cputime delta update vtime: Remove a few unneeded generic vtime state checks context_tracking: User/kernel broundary cross trace events context_tracking: Optimize context switch off case with static keys context_tracking: Optimize guest APIs off case with static key context_tracking: Optimize main APIs off case with static key context_tracking: Ground setup for static key use context_tracking: Remove full dynticks' hacky dependency on wide context tracking nohz: Only enable context tracking on full dynticks CPUs ...
2013-09-04Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Various optimizations, cleanups and smaller fixes - no major changes in scheduler behavior" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix the sd_parent_degenerate() code sched/fair: Rework and comment the group_imb code sched/fair: Optimize find_busiest_queue() sched/fair: Make group power more consistent sched/fair: Remove duplicate load_per_task computations sched/fair: Shrink sg_lb_stats and play memset games sched: Clean-up struct sd_lb_stat sched: Factor out code to should_we_balance() sched: Remove one division operation in find_busiest_queue() sched/cputime: Use this_cpu_add() in task_group_account_field() cpumask: Fix cpumask leak in partition_sched_domains() sched/x86: Optimize switch_mm() for multi-threaded workloads generic-ipi: Kill unnecessary variable - csd_flags numa: Mark __node_set() as __always_inline sched/fair: Cleanup: remove duplicate variable declaration sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
2013-09-04Merge branches 'perf-urgent-for-linus' and 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf changes from Ingo Molnar: "As a first remark I'd like to point out that the obsolete '-f' (--force) option, which has not done anything for several releases, has been removed from 'perf record' and related utilities. Everyone please update muscle memory accordingly! :-) Main changes on the perf kernel side: - Performance optimizations: . for trace events, by Steve Rostedt. . for time values, by Peter Zijlstra - New hardware support: . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan . for Intel SNB-EP uncore PMUs, by Zheng Yan - Enhanced hardware support: . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan - Core perf events code enhancements and fixes: . for full-nohz feature handling, by Frederic Weisbecker . for group events, by Jiri Olsa . for call chains, by Frederic Weisbecker . for event stream parsing, by Adrian Hunter - New ABI details: . Add attr->mmap2 attribute, by Stephane Eranian . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa . Export u64 time_zero on the mmap header page to allow TSC calculation, by Adrian Hunter . Add dummy software event, by Adrian Hunter. . Add a new PERF_SAMPLE_IDENTIFIER to make samples always parseable, by Adrian Hunter. . Make Power7 events available via sysfs, by Runzhen Wang. - Code cleanups and refactorings: . for nohz-full, by Frederic Weisbecker . for group events, by Jiri Olsa - Documentation updates: . for perf_event_type, by Peter Zijlstra Main changes on the perf tooling side (some of these tooling changes utilize the above kernel side changes): - Lots of 'perf trace' enhancements: . Make 'perf trace' command line arguments consistent with 'perf record', by David Ahern. . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo. . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo. . Support ! in -e expressions, to filter a list of syscalls, by Arnaldo Carvalho de Melo. . Arg formatting improvements to allow masking arguments in syscalls such as futex and open, where the some arguments are ignored and thus should not be printed depending on other args, by Arnaldo Carvalho de Melo. . Beautify futex open, openat, open_by_handle_at, lseek and futex syscalls, by Arnaldo Carvalho de Melo. . Add option to analyze events in a file versus live, so that one can do: [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1 [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ] [root@zoo ~]# perf trace -i perf.data -e futex --duration 1 17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua 113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967 133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496 [root@zoo ~]# By David Ahern. . Honor target pid / tid options when analyzing a file, by David Ahern. . Introduce better formatting of syscall arguments, including so far beautifiers for mmap, madvise, syscall return values, by Arnaldo Carvalho de Melo. . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern. - 'perf report/top' enhancements: . Do annotation using /proc/kcore and /proc/kallsyms when available, removing the forced need for a vmlinux file kernel assembly annotation. This also improves this use case because vmlinux has just the initial kernel image, not what is actually in use after various code patchings by things like alternatives. By Adrian Hunter. . Add --ignore-callees=<regex> option to collapse undesired parts of call graphs, by Greg Price. . Simplify symbol filtering by doing it at machine class level, by Adrian Hunter. . Add support for callchains in the gtk UI, by Namhyung Kim. . Add --objdump option to 'perf top', by Sukadev Bhattiprolu. - 'perf kvm' enhancements: . Add option to print only events that exceed a specified time duration, by David Ahern. . Improve stack trace printing, by David Ahern. . Update documentation of the live command, by David Ahern . Add perf kvm stat live mode that combines aspects of 'perf kvm stat' record and report, by David Ahern. . Add option to analyze specific VM in perf kvm stat report, by David Ahern. . Do not require /lib/modules/* on a guest, by Jason Wessel. - 'perf script' enhancements: . Fix symbol offset computation for some dsos, by David Ahern. . Fix named threads support, by David Ahern. . Don't install scripting files files when perl/python support is disabled, by Arnaldo Carvalho de Melo. - 'perf test' enhancements: . Add various improvements and fixes to the "vmlinux matches kallsyms" 'perf test' entry, related to the /proc/kcore annotation feature. By Adrian Hunter. . Add sample parsing test, by Adrian Hunter. . Add test for reading object code, by Adrian Hunter. . Add attr record group sampling test, by Jiri Olsa. . Misc testing infrastructure improvements and other details, by Jiri Olsa. - 'perf list' enhancements: . Skip unsupported hardware events, by Namhyung Kim. . List pmu events, by Andi Kleen. - 'perf diff' enhancements: . Add support for more than two files comparison, by Jiri Olsa. - 'perf sched' enhancements: . Various improvements, including removing reliance on some scheduler tracepoints that provide the same information as the PERF_RECORD_{FORK,EXIT} events. By David Ahern. . Remove odd build stall by moving a large struct initialization from a local variable to a global one, by Namhyung Kim. - 'perf stat' enhancements: . Add --initial-delay option to skip measuring for a defined startup phase, by Andi Kleen. - Generic perf tooling infrastructure/plumbing changes: . Tidy up sample parsing validation, by Adrian Hunter. . Fix up jobserver setup in libtraceevent Makefile. by Arnaldo Carvalho de Melo. . Debug improvements, by Adrian Hunter. . Fix correlation of samples coming after PERF_RECORD_EXIT event, by David Ahern. . Improve robustness of the topology parsing code, by Stephane Eranian. . Add group leader sampling, that allows just one event in a group to sample while the other events have just its values read, by Jiri Olsa. . Add support for a new modifier "D", which requests that the event, or group of events, be pinned to the PMU. By Michael Ellerman. . Support callchain sorting based on addresses, by Andi Kleen . Prep work for multi perf data file storage, by Jiri Olsa. . libtraceevent cleanups, by Namhyung Kim. And lots and lots of other fixes and code reorganizations that did not make it into the list, see the shortlog, diffstat and the Git log for details!" [ Also merge a leftover from the 3.11 cycle ] * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Prevent race in unthrottling code * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits) perf trace: Tell arg formatters the arg index perf trace: Add beautifier for open's flags arg perf trace: Add beautifier for lseek's whence arg perf tools: Fix symbol offset computation for some dsos perf list: Skip unsupported events perf tests: Add 'keep tracking' test perf tools: Add support for PERF_COUNT_SW_DUMMY perf: Add a dummy software event to keep tracking perf trace: Add beautifier for futex 'operation' parm perf trace: Allow syscall arg formatters to mask args perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node() perf: Export struct perf_branch_entry to userspace perf: Add attr->mmap2 attribute to an event perf/x86: Add Silvermont (22nm Atom) support perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X perf trace: Handle missing HUGEPAGE defines perf trace: Honor target pid / tid options when analyzing a file perf trace: Add option to analyze events in a file versus live perf evlist: Add tracepoint lookup by name perf tests: Add a sample parsing test ...
2013-09-04sched/cputime: Do not scale when utime == 0Stanislaw Gruszka
scale_stime() silently assumes that stime < rtime, otherwise when stime == rtime and both values are big enough (operations on them do not fit in 32 bits), the resulting scaling stime can be bigger than rtime. In consequence utime = rtime - stime results in negative value. User space visible symptoms of the bug are overflowed TIME values on ps/top, for example: $ ps aux | grep rcu root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0] root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0] root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] or overflowed utime values read directly from /proc/$PID/stat Reference: https://lkml.org/lkml/2013/8/20/259 Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: stable@vger.kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-03Merge branch 'for-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...
2013-09-02sched/fair: Fix the sd_parent_degenerate() codePeter Zijlstra
I found that on my WSM box I had a redundant domain: [ 0.949769] CPU0 attaching sched-domain: [ 0.953765] domain 0: span 0,12 level SIBLING [ 0.958335] groups: 0 (cpu_power = 587) 12 (cpu_power = 588) [ 0.964548] domain 1: span 0-5,12-17 level MC [ 0.969206] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176) [ 0.984993] domain 2: span 0-5,12-17 level CPU [ 0.989822] groups: 0-5,12-17 (cpu_power = 7055) [ 0.995049] domain 3: span 0-23 level NUMA [ 0.999620] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056) Note how domain 2 has only a single group and spans the same CPUs as domain 1. We should not keep such domains and do in fact have code to prune these. It turns out that the 'new' SD_PREFER_SIBLING flag causes this, it makes sd_parent_degenerate() fail on the CPU domain. We can easily fix this by 'ignoring' the SD_PREFER_SIBLING bit and transfering it to whatever domain ends up covering the span. With this patch the domains now look like this: [ 0.950419] CPU0 attaching sched-domain: [ 0.954454] domain 0: span 0,12 level SIBLING [ 0.959039] groups: 0 (cpu_power = 587) 12 (cpu_power = 588) [ 0.965271] domain 1: span 0-5,12-17 level MC [ 0.969936] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176) [ 0.985737] domain 2: span 0-23 level NUMA [ 0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056) Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Rework and comment the group_imb codePeter Zijlstra
Rik reported some weirdness due to the group_imb code. As a start to looking at it, clean it up a little and add a few explanatory comments. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-caeeqttnla4wrrmhp5uf89gp@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Optimize find_busiest_queue()Peter Zijlstra
Use for_each_cpu_and() and thereby avoid computing the capacity for CPUs we know we're not interested in. Reviewed-by: Paul Turner <pjt@google.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-lppceyv6kb3a19g8spmrn20b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Make group power more consistentPeter Zijlstra
For easier access, less dereferences and more consistent value, store the group power in update_sg_lb_stats() and use it thereafter. The actual value in sched_group::sched_group_power::power can change throughout the load-balance pass if we're unlucky. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-739xxqkyvftrhnh9ncudutc7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Remove duplicate load_per_task computationsPeter Zijlstra
Since we already compute (but don't store) the sgs load_per_task value in update_sg_lb_stats() we might as well store it and not re-compute it later on. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ym1vmljiwbzgdnnrwp9azftq@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Shrink sg_lb_stats and play memset gamesPeter Zijlstra
We can shrink sg_lb_stats because rq::nr_running is an unsigned int and cpu numbers are 'int' Before: sgs: /* size: 72, cachelines: 2, members: 10 */ sds: /* size: 184, cachelines: 3, members: 7 */ After: sgs: /* size: 56, cachelines: 1, members: 10 */ sds: /* size: 152, cachelines: 3, members: 7 */ Further we can avoid clearing all of sds since we do a total clear/assignment of sg_stats in update_sg_lb_stats() with exception of busiest_stat.avg_load which is referenced in update_sd_pick_busiest(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-0klzmz9okll8wc0nsudguc9p@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched: Clean-up struct sd_lb_statJoonsoo Kim
There is no reason to maintain separate variables for this_group and busiest_group in sd_lb_stat, except saving some space. But this structure is always allocated in stack, so this saving isn't really benificial [peterz: reducing stack space is good; in this case readability increases enough that I think its still beneficial] This patch unify these variables, so IMO, readability may be improved. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> [ Rename this to local -- avoids confusion between this_cpu and the C++ this pointer. ] Reviewed-by: Paul Turner <pjt@google.com> [ Lots of style edits, a few fixes and a rename. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>