summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2014-02-11sched: Delete is_same_group() outside CONFIG_FAIR_GROUP_SCHEDDietmar Eggemann
Since is_same_group() is only used in the group scheduling code, there is no need to define it outside CONFIG_FAIR_GROUP_SCHED. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391005773-29493-1-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Push down pre_schedule() and idle_balance()Peter Zijlstra
This patch both merged idle_balance() and pre_schedule() and pushes both of them into pick_next_task(). Conceptually pre_schedule() and idle_balance() are rather similar, both are used to pull more work onto the current CPU. We cannot however first move idle_balance() into pre_schedule_fair() since there is no guarantee the last runnable task is a fair task, and thus we would miss newidle balances. Similarly, the dl and rt pre_schedule calls must be ran before idle_balance() since their respective tasks have higher priority and it would not do to delay their execution searching for less important tasks first. However, by noticing that pick_next_tasks() already traverses the sched_class hierarchy in the right order, we can get the right behaviour and do away with both calls. We must however change the special case optimization to also require that prev is of sched_class_fair, otherwise we can miss doing a dl or rt pull where we needed one. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11PM / QoS: Introcuce latency tolerance device PM QoS typeRafael J. Wysocki
Add a new latency tolerance device PM QoS type to be use for specifying active state (RPM_ACTIVE) memory access (DMA) latency tolerance requirements for devices. It may be used to prevent hardware from choosing overly aggressive energy-saving operation modes (causing too much latency to appear) for the whole platform. This feature reqiures hardware support, so it only will be available for devices having a new .set_latency_tolerance() callback in struct dev_pm_info populated, in which case the routine pointed to by it should implement whatever is necessary to transfer the effective requirement value to the hardware. Whenever the effective latency tolerance changes for the device, its .set_latency_tolerance() callback will be executed and the effective value will be passed to it. If that value is negative, which means that the list of latency tolerance requirements for the device is empty, the callback is expected to switch the underlying hardware latency tolerance control mechanism to an autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in turn, and the hardware supports a special "no requirement" setting, the callback is expected to use it. That allows software to prevent the hardware from automatically updating the device's latency tolerance in response to its power state changes (e.g. during transitions from D3cold to D0), which generally may be done in the autonomous latency tolerance control mode. If .set_latency_tolerance() is present for the device, a new pm_qos_latency_tolerance_us attribute will be present in the devivce's power directory in sysfs. Then, user space can use that attribute to specify its latency tolerance requirement for the device, if any. Writing "any" to it means "no requirement, but do not let the hardware control latency tolerance" and writing "auto" to it allows the hardware to be switched to the autonomous mode if there are no other requirements from the kernel side in the device's list. This changeset includes a fix from Mika Westerberg. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-02-11PM / QoS: Add no_constraints_value field to struct pm_qos_constraintsRafael J. Wysocki
Add a new field, no_constraints_value, to struct pm_qos_constraints representing a list of PM QoS constraint requests to be returned by pm_qos_get_value() when that list of requests is empty. That field will be equal to default_value for all of the existing global PM QoS classes and for the resume latency device PM QoS type, but it will be different from default_value for the new latency tolerance device PM QoS type introduced by the next changeset. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-02-10sched: Clean up idle task SMP logicPeter Zijlstra
The idle post_schedule flag is just a vile waste of time, furthermore it appears unneeded, move the idle_enter_fair() call into pick_next_task_idle(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: alex.shi@linaro.org Cc: mingo@kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/n/tip-aljykihtxJt3mkokxi0qZurb@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Optimize cgroup pick_next_task_fair()Peter Zijlstra
Since commit 2f36825b1 ("sched: Next buddy hint on sleep and preempt path") it is likely we pick a new task from the same cgroup, doing a put and then set on all intermediate entities is a waste of time, so try to avoid this. Measured using: mount nodev /cgroup -t cgroup -o cpu cd /cgroup mkdir a; cd a mkdir b; cd b mkdir c; cd c echo $$ > tasks perf stat --repeat 10 -- taskset 1 perf bench sched pipe PRE : 4.542422684 seconds time elapsed ( +- 0.33% ) POST: 4.389409991 seconds time elapsed ( +- 0.32% ) Which shows a significant improvement of ~3.5% Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Clean up the __clear_buddies_*() functionsPeter Zijlstra
Slightly easier code flow, no functional changes. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Push put_prev_task() into pick_next_task()Peter Zijlstra
In order to avoid having to do put/set on a whole cgroup hierarchy when we context switch, push the put into pick_next_task() so that both operations are in the same function. Further changes then allow us to possibly optimize away redundant work. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Track cgroup depthPeter Zijlstra
Track depth in cgroup tree, this is useful for things like find_matching_se() where you need to get to a common parent of two sched entities. Keeping the depth avoids having to calculate it on the spot, which saves a number of possible cache-misses. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Move rq->idle_stamp up to the coreDaniel Lezcano
idle_balance() modifies the rq->idle_stamp field, making this information shared across core.c and fair.c. As we know if the cpu is going to idle or not with the previous patch, let's encapsulate the rq->idle_stamp information in core.c by moving it up to the caller. The idle_balance() function returns true in case a balancing occured and the cpu won't be idle, false if no balance happened and the cpu is going idle. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Fix race in idle_balance()Daniel Lezcano
The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Remove 'cpu' parameter from idle_balance()Daniel Lezcano
The cpu parameter passed to idle_balance() is not needed as it could be retrieved from 'struct rq.' Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-1-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09lockdep: Change mark_held_locks() to check hlock->check instead of ↵Oleg Nesterov
lockdep_no_validate The __lockdep_no_validate check in mark_held_locks() adds the subtle and (afaics) unnecessary difference between no-validate and check==0. And this looks even more inconsistent because __lock_acquire() skips mark_irqflags()->mark_lock() if !check. Change mark_held_locks() to check hlock->check instead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140120182013.GA26505@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09lockdep: Don't create the wrong dependency on hlock->check == 0Oleg Nesterov
Test-case: DEFINE_MUTEX(m1); DEFINE_MUTEX(m2); DEFINE_MUTEX(mx); void lockdep_should_complain(void) { lockdep_set_novalidate_class(&mx); // m1 -> mx -> m2 mutex_lock(&m1); mutex_lock(&mx); mutex_lock(&m2); mutex_unlock(&m2); mutex_unlock(&mx); mutex_unlock(&m1); // m2 -> m1 ; should trigger the warning mutex_lock(&m2); mutex_lock(&m1); mutex_unlock(&m1); mutex_unlock(&m2); } this doesn't trigger any warning, lockdep can't detect the trivial deadlock. This is because lock(&mx) correctly avoids m1 -> mx dependency, it skips validate_chain() due to mx->check == 0. But lock(&m2) wrongly adds mx -> m2 and thus m1 -> m2 is not created. rcu_lock_acquire()->lock_acquire(check => 0) is fine due to read == 2, so currently only __lockdep_no_validate__ can trigger this problem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140120182010.GA26498@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09lockdep: Make held_lock->check and "int check" argument boolOleg Nesterov
The "int check" argument of lock_acquire() and held_lock->check are misleading. This is actually a boolean: 2 means "true", everything else is "false". And there is no need to pass 1 or 0 to lock_acquire() depending on CONFIG_PROVE_LOCKING, __lock_acquire() checks prove_locking at the start and clears "check" if !CONFIG_PROVE_LOCKING. Note: probably we can simply kill this member/arg. The only explicit user of check => 0 is rcu_lock_acquire(), perhaps we can change it to use lock_acquire(trylock =>, read => 2). __lockdep_no_validate means check => 0 implicitly, but we can change validate_chain() to check hlock->instance->key instead. Not to mention it would be nice to get rid of lockdep_set_novalidate_class(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140120182006.GA26495@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09sched: Implement task_nice() as static inline functionDongsheng Yang
As patch "sched: Move the priority specific bits into a new header file" exposes the priority related macros in linux/sched/prio.h, we don't have to implement task_nice() in kernel/sched/core.c any more. This patch implements it in linux/sched/sched.h as static inline function, saving the kernel stack and enhancing performance a bit. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: clark.williams@gmail.com Cc: rostedt@goodmis.org Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09genirq: Add devm_request_any_context_irq()Stephen Boyd
Some drivers use request_any_context_irq() but there isn't a devm_* function for it. Add one so that these drivers don't need to explicitly free the irq on driver detach. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: linux-arm-kernel@lists.infradead.org Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Link: http://lkml.kernel.org/r/1388709460-19222-3-git-send-email-sboyd@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-09tick: Fixup more fallout from hrtimer broadcast modePreeti U Murthy
The hrtimer mode of broadcast is supported only when GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled. Hence compile in the functions for hrtimer mode of broadcast only when these options are selected. Also fix max_delta_ticks value for the pseudo clock device. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/52F719EE.9010304@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-09sched: Expose some macros related to priorityDongsheng Yang
Some macros in kernel/sched/sched.h about priority are private to kernel/sched. But they are useful to other parts of the core kernel. This patch moves these macros from kernel/sched/sched.h to include/linux/sched/prio.h so that they are available to other subsystems. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Cc: clark.williams@gmail.com Cc: rostedt@goodmis.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2b022810905b52d13238466807f4b2a691577180.1390859827.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09sched/deadline: Skip in switched_to_dl() if task is currentKirill Tkhai
When p is current and it's not of dl class, then there are no other dl taks in the rq. If we had had pushable tasks in some other rq, they would have been pushed earlier. So, skip "p == rq->curr" case. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Acked-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140128072421.32315.25300.stgit@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09perf/x86: Push the duration-logging printk() to IRQ contextPeter Zijlstra
Calling printk() from NMI context is bad (TM), so move it to IRQ context. This also avoids the problem where the printk() time is measured by the generic NMI duration goo and triggers a second warning. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Don Zickus <dzickus@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/n/tip-75dv35xf6dhhmeb7nq6fua31@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-08Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "Add a missing Kconfig dependency" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Generic irq chip requires IRQ_DOMAIN
2014-02-08Merge branch 'for-3.14-fixes' into for-3.15Tejun Heo
Pending kernfs conversion depends on fixes in for-3.14-fixes. Pull it into for-3.15. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-08cgroup: remove cgroup_root_mutexTejun Heo
cgroup_root_mutex was added to avoid deadlock involving namespace_sem via cgroup_show_options(). It added a lot of overhead for the small purpose of it and, because it's nested under cgroup_mutex, it has very limited usefulness. The previous patch made cgroup_show_options() not use cgroup_root_mutex, so nobody needs it anymore. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08cgroup: update locking in cgroup_show_options()Tejun Heo
cgroup_show_options() grabs cgroup_root_mutex to protect the options changing while printing; however, holding root_mutex or not doesn't really make much difference for the function. subsys_mask can be atomically tested and most of the options aren't allowed to change anyway once mounted. The only field which needs synchronization is ->release_agent_path. This patch introduces a dedicated spinlock to synchronize accesses to the field and drops cgroup_root_mutex locking from cgroup_show_options(). The next patch will remove cgroup_root_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08cgroup: rename cgroup_subsys->subsys_id to ->idTejun Heo
It's no longer referenced outside cgroup core, so renaming is easy. Let's rename it for consistency & brevity. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08cgroup: clean up cgroup_subsys names and initializationTejun Heo
cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>
2014-02-08cgroup: drop module supportTejun Heo
With module supported dropped from net_prio, no controller is using cgroup module support. None of actual resource controllers can be built as a module and we aren't gonna add new controllers which don't control resources. This patch drops module support from cgroup. * cgroup_[un]load_subsys() and cgroup_subsys->module removed. * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(), cgroup_subsys.h now uses IS_ENABLED() directly. * enum cgroup_subsys_id now exactly matches the list of enabled controllers as ordered in cgroup_subsys.h. * cgroup_subsys[] is now a contiguously occupied array. Size specification is no longer necessary and dropped. * for_each_builtin_subsys() is removed and for_each_subsys() is updated to not require any locking. * module ref handling is removed from rebind_subsystems(). * Module related comments dropped. v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs classid handling into core"). v3: Added {} around the if (need_forkexit_callback) block in cgroup_post_fork() for readability as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-08cgroup: fix locking in cgroup_cfts_commit()Tejun Heo
cgroup_cfts_commit() walks the cgroup hierarchy that the target subsystem is attached to and tries to apply the file changes. Due to the convolution with inode locking, it can't keep cgroup_mutex locked while iterating. It currently holds only RCU read lock around the actual iteration and then pins the found cgroup using dget(). Unfortunately, this is incorrect. Although the iteration does check cgroup_is_dead() before invoking dget(), there's nothing which prevents the dentry from going away inbetween. Note that this is different from the usual css iterations where css_tryget() is used to pin the css - css_tryget() tests whether the css can be pinned and fails if not. The problem can be solved by simply holding cgroup_mutex instead of RCU read lock around the iteration, which actually reduces LOC. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2014-02-08cgroup: fix error return from cgroup_create()Tejun Heo
cgroup_create() was returning 0 after allocation failures. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2014-02-08cgroup: fix error return value in cgroup_mount()Tejun Heo
When cgroup_mount() fails to allocate an id for the root, it didn't set ret before jumping to unlock_drop ending up returning 0 after a failure. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2014-02-07cgroup: use an ordered workqueue for cgroup destructionHugh Dickins
Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Just use an ordered workqueue for cgroup_destroy_wq. tj: Committing as the temporary fix until the reverse dependency can be removed from memcg. Comment updated accordingly. Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Suggested-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-07time: Fixup fallout from recent clockevent/tick changesThomas Gleixner
Make the stub function static inline instead of static and move the clockevents related function into the proper ifdeffed section. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Soren Brinkmann <soren.brinkmann@xilinx.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
2014-02-07tick: Introduce hrtimer based broadcastPreeti U Murthy
On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states. This patchset introduces a pseudo clock device which can be registered by the archs as tick_broadcast_device in the absence of a real external clock device. Once registered, the broadcast framework will work as is for these architectures as long as the archs take care of the BROADCAST_ENTER notification failing for one of the CPUs. This CPU is made the stand by CPU to handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*. The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the stand by CPU dynamically moves around and so does the hrtimer which is queued to trigger at the next earliest wakeup time. This is consistent with the case where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. This patchset handles the hotplug of the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD notification. Originally-from: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: deepthi@linux.vnet.ibm.com Cc: paulmck@linux.vnet.ibm.com Cc: fweisbec@gmail.com Cc: paulus@samba.org Cc: srivatsa.bhat@linux.vnet.ibm.com Cc: svaidy@linux.vnet.ibm.com Cc: peterz@infradead.org Cc: benh@kernel.crashing.org Cc: rafael.j.wysocki@intel.com Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20140207080632.17187.80532.stgit@preeti.in.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07time: Change the return type of clockevents_notify() to integerPreeti U Murthy
The broadcast framework can potentially be made use of by archs which do not have an external clock device as well. Then, it is required that one of the CPUs need to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a result its local timers should remain functional all the time. For such a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock device cannot be shutdown. To make way for this support, change the return type of tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate such scenarios. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: deepthi@linux.vnet.ibm.com Cc: paulmck@linux.vnet.ibm.com Cc: fweisbec@gmail.com Cc: paulus@samba.org Cc: srivatsa.bhat@linux.vnet.ibm.com Cc: svaidy@linux.vnet.ibm.com Cc: peterz@infradead.org Cc: benh@kernel.crashing.org Cc: rafael.j.wysocki@intel.com Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20140207080606.17187.78306.stgit@preeti.in.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07clockevents: Adjust timer interval when frequency changesSoren Brinkmann
clockevent devices in periodic mode are not updated when the frequency of the device changes. Issue a dev->set_mode() callback which forces the device to reevaluate the timer settings. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Michal Simek <michal.simek@xilinx.com> Link: http://lkml.kernel.org/r/1391466877-28908-3-git-send-email-soren.brinkmann@xilinx.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07clockevents: Serialize calls to clockevents_update_freq() in the coreThomas Gleixner
We can identify the broadcast device in the core and serialize all callers including interrupts on a different CPU against the update. Also, disabling interrupts is moved into the core allowing callers to leave interrutps enabled when calling clockevents_update_freq(). Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Soeren Brinkmann <soren.brinkmann@xilinx.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Michal Simek <michal.simek@xilinx.com> Link: http://lkml.kernel.org/r/1391466877-28908-2-git-send-email-soren.brinkmann@xilinx.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07timekeeping: Move clock sync work to power efficient workqueueShaibal Dutta
For better use of CPU idle time, allow the scheduler to select the CPU on which the CMOS clock sync work would be scheduled. This improves idle residency time and conserver power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com> [zoran.markovic@linaro.org: Added commit message. Aligned code.] Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/1391195904-12497-1-git-send-email-zoran.markovic@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-06time: Fix overflow when HZ is smaller than 60Mikulas Patocka
When compiling for the IA-64 ski emulator, HZ is set to 32 because the emulation is slow and we don't want to waste too many cycles processing timers. Alpha also has an option to set HZ to 32. This causes integer underflow in kernel/time/jiffies.c: kernel/time/jiffies.c:66:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] .mult = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */ ^ This patch reduces the JIFFIES_SHIFT value to avoid the overflow. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1401241639100.23871@file01.intranet.prod.int.rdu2.redhat.com Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-05execve: use 'struct filename *' for executable name passingLinus Torvalds
This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-05genirq: Generic irq chip requires IRQ_DOMAINNitin A Kamble
The generic_chip.c uses interfaces from irq_domain.c which is controlled by the IRQ_DOMAIN config option, but there is no Kconfig dependency so the build can fail: linux/kernel/irq/generic-chip.c:400:11: error: 'irq_domain_xlate_onetwocell' undeclared here (not in a function) Select IRQ_DOMAIN when GENERIC_IRQ_CHIP is selected. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Link: http://lkml.kernel.org/r/1391129410-54548-2-git-send-email-nitin.a.kamble@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # 3.11+
2014-02-03arm, pm, vmpressure: add missing slab.h includesTejun Heo
arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c were somehow getting slab.h indirectly through cgroup.h which in turn was getting it indirectly through xattr.h. A scheduled cgroup change drops xattr.h inclusion from cgroup.h and breaks compilation of these three files. Add explicit slab.h includes to the three files. A pending cgroup patch depends on this change and it'd be great if this can be routed through cgroup/for-3.14-fixes branch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Stephen Warren <swarren@wwwdotorg.org> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: linux-tegra@vger.kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: cgroups@vger.kernel.org
2014-02-02compat: Fix sparse address space warningsH. Peter Anvin
In compat_sys_old_getrlimit() we pass a kernel pointer to sys_old_getrlimit() inside a set_fs() bracket. This is okay, so we can safely cast the affected pointer to __user. In compat_clock_nanosleep_restart(), the variable "rmtp" holds a user pointer. Annotate it as such. Both of these warnings are ancient, but were reported by Fengguang Wu's test system due to other changes. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Toyo Abe <toyoa@mvista.com> Link: http://lkml.kernel.org/n/tip-507h7cq5e45eg6ygtykon3bf@git.kernel.org
2014-02-02compat: Get rid of (get|put)_compat_time(val|spec)H. Peter Anvin
We have two APIs for compatiblity timespec/val, with confusingly similar names. compat_(get|put)_time(val|spec) *do* handle the case where COMPAT_USE_64BIT_TIME is set, whereas (get|put)_compat_time(val|spec) do not. This is an accident waiting to happen. Clean it up by favoring the full-service version; the limited version is replaced with double-underscore versions static to kernel/compat.c. A common pattern is to convert a struct timespec to kernel format in an allocation on the user stack. Unfortunately it is open-coded in several places. Since this allocation isn't actually needed if COMPAT_USE_64BIT_TIME is true (since user format == kernel format) encapsulate that whole pattern into the function compat_convert_timespec(). An equivalent function should be written for struct timeval if it is needed in the future. Finally, get rid of compat_(get|put)_timeval_convert(): each was only used once, and the latter was not even doing what the function said (no conversion actually was being done.) Moving the conversion into compat_sys_settimeofday() itself makes the code much more similar to sys_settimeofday() itself. v3: Remove unused compat_convert_timeval(). v2: Drop bogus "const" in the destination argument for compat_convert_time*(). Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Hans Verkuil <hans.verkuil@cisco.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-02Merge branch 'linus' into sched/core, to resolve conflictsIngo Molnar
Conflicts: kernel/sysctl.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-02Merge branch 'linus' into core/lockingIngo Molnar
Refresh the topic. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-31Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer/dynticks updates from Ingo Molnar: "This tree contains misc dynticks updates: a fix and three cleanups" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/nohz: Fix overflow error in scheduler_tick_max_deferment() nohz_full: fix code style issue of tick_nohz_full_stop_tick nohz: Get timekeeping max deferment outside jiffies_lock tick: Rename tick_check_idle() to tick_irq_enter()
2014-01-31Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A crash fix and documentation updates" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Make sched_class::get_rr_interval() optional sched/deadline: Add sched_dl documentation sched: Fix docbook parameter annotation error in wait.h
2014-01-31Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core debug changes from Ingo Molnar: "This contains mostly kernel debugging related updates: - make hung_task detection more configurable to distros - add final bits for x86 UV NMI debugging, with related KGDB changes - update the mailing-list of MAINTAINERS entries I'm involved with" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hung_task: Display every hung task warning sysctl: Add neg_one as a standard constraint x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured x86/uv/nmi: Fix Sparse warnings kgdb/kdb: Fix no KDB config problem MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries
2014-01-30kernel/smp.c: remove cpumask_ipiRoman Gushchin
After commit 9a46ad6d6df3 ("smp: make smp_call_function_many() use logic similar to smp_call_function_single()"), cfd->cpumask is accessed only in smp_call_function_many(). So there is no more need to copy it into cfd->cpumask_ipi before putting csd into the list. The cpumask_ipi field is obsolete and can be removed. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Ingo Molnar <mingo@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Wang YanQing <udknight@gmail.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>