summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2014-02-08cgroup: fix error return from cgroup_create()Tejun Heo
cgroup_create() was returning 0 after allocation failures. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2014-02-08cgroup: fix error return value in cgroup_mount()Tejun Heo
When cgroup_mount() fails to allocate an id for the root, it didn't set ret before jumping to unlock_drop ending up returning 0 after a failure. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
2014-02-07cgroup: use an ordered workqueue for cgroup destructionHugh Dickins
Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Just use an ordered workqueue for cgroup_destroy_wq. tj: Committing as the temporary fix until the reverse dependency can be removed from memcg. Comment updated accordingly. Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Suggested-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-07time: Fixup fallout from recent clockevent/tick changesThomas Gleixner
Make the stub function static inline instead of static and move the clockevents related function into the proper ifdeffed section. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Soren Brinkmann <soren.brinkmann@xilinx.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
2014-02-07tick: Introduce hrtimer based broadcastPreeti U Murthy
On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states. This patchset introduces a pseudo clock device which can be registered by the archs as tick_broadcast_device in the absence of a real external clock device. Once registered, the broadcast framework will work as is for these architectures as long as the archs take care of the BROADCAST_ENTER notification failing for one of the CPUs. This CPU is made the stand by CPU to handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*. The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the stand by CPU dynamically moves around and so does the hrtimer which is queued to trigger at the next earliest wakeup time. This is consistent with the case where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. This patchset handles the hotplug of the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD notification. Originally-from: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: deepthi@linux.vnet.ibm.com Cc: paulmck@linux.vnet.ibm.com Cc: fweisbec@gmail.com Cc: paulus@samba.org Cc: srivatsa.bhat@linux.vnet.ibm.com Cc: svaidy@linux.vnet.ibm.com Cc: peterz@infradead.org Cc: benh@kernel.crashing.org Cc: rafael.j.wysocki@intel.com Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20140207080632.17187.80532.stgit@preeti.in.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07time: Change the return type of clockevents_notify() to integerPreeti U Murthy
The broadcast framework can potentially be made use of by archs which do not have an external clock device as well. Then, it is required that one of the CPUs need to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a result its local timers should remain functional all the time. For such a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock device cannot be shutdown. To make way for this support, change the return type of tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate such scenarios. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: deepthi@linux.vnet.ibm.com Cc: paulmck@linux.vnet.ibm.com Cc: fweisbec@gmail.com Cc: paulus@samba.org Cc: srivatsa.bhat@linux.vnet.ibm.com Cc: svaidy@linux.vnet.ibm.com Cc: peterz@infradead.org Cc: benh@kernel.crashing.org Cc: rafael.j.wysocki@intel.com Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20140207080606.17187.78306.stgit@preeti.in.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07clockevents: Adjust timer interval when frequency changesSoren Brinkmann
clockevent devices in periodic mode are not updated when the frequency of the device changes. Issue a dev->set_mode() callback which forces the device to reevaluate the timer settings. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Michal Simek <michal.simek@xilinx.com> Link: http://lkml.kernel.org/r/1391466877-28908-3-git-send-email-soren.brinkmann@xilinx.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07clockevents: Serialize calls to clockevents_update_freq() in the coreThomas Gleixner
We can identify the broadcast device in the core and serialize all callers including interrupts on a different CPU against the update. Also, disabling interrupts is moved into the core allowing callers to leave interrutps enabled when calling clockevents_update_freq(). Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Soeren Brinkmann <soren.brinkmann@xilinx.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Michal Simek <michal.simek@xilinx.com> Link: http://lkml.kernel.org/r/1391466877-28908-2-git-send-email-soren.brinkmann@xilinx.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-07timekeeping: Move clock sync work to power efficient workqueueShaibal Dutta
For better use of CPU idle time, allow the scheduler to select the CPU on which the CMOS clock sync work would be scheduled. This improves idle residency time and conserver power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com> [zoran.markovic@linaro.org: Added commit message. Aligned code.] Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/1391195904-12497-1-git-send-email-zoran.markovic@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-06time: Fix overflow when HZ is smaller than 60Mikulas Patocka
When compiling for the IA-64 ski emulator, HZ is set to 32 because the emulation is slow and we don't want to waste too many cycles processing timers. Alpha also has an option to set HZ to 32. This causes integer underflow in kernel/time/jiffies.c: kernel/time/jiffies.c:66:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] .mult = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */ ^ This patch reduces the JIFFIES_SHIFT value to avoid the overflow. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1401241639100.23871@file01.intranet.prod.int.rdu2.redhat.com Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-05execve: use 'struct filename *' for executable name passingLinus Torvalds
This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-05genirq: Generic irq chip requires IRQ_DOMAINNitin A Kamble
The generic_chip.c uses interfaces from irq_domain.c which is controlled by the IRQ_DOMAIN config option, but there is no Kconfig dependency so the build can fail: linux/kernel/irq/generic-chip.c:400:11: error: 'irq_domain_xlate_onetwocell' undeclared here (not in a function) Select IRQ_DOMAIN when GENERIC_IRQ_CHIP is selected. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Link: http://lkml.kernel.org/r/1391129410-54548-2-git-send-email-nitin.a.kamble@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # 3.11+
2014-02-03arm, pm, vmpressure: add missing slab.h includesTejun Heo
arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c were somehow getting slab.h indirectly through cgroup.h which in turn was getting it indirectly through xattr.h. A scheduled cgroup change drops xattr.h inclusion from cgroup.h and breaks compilation of these three files. Add explicit slab.h includes to the three files. A pending cgroup patch depends on this change and it'd be great if this can be routed through cgroup/for-3.14-fixes branch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Stephen Warren <swarren@wwwdotorg.org> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: linux-tegra@vger.kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: cgroups@vger.kernel.org
2014-02-02compat: Fix sparse address space warningsH. Peter Anvin
In compat_sys_old_getrlimit() we pass a kernel pointer to sys_old_getrlimit() inside a set_fs() bracket. This is okay, so we can safely cast the affected pointer to __user. In compat_clock_nanosleep_restart(), the variable "rmtp" holds a user pointer. Annotate it as such. Both of these warnings are ancient, but were reported by Fengguang Wu's test system due to other changes. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Toyo Abe <toyoa@mvista.com> Link: http://lkml.kernel.org/n/tip-507h7cq5e45eg6ygtykon3bf@git.kernel.org
2014-02-02compat: Get rid of (get|put)_compat_time(val|spec)H. Peter Anvin
We have two APIs for compatiblity timespec/val, with confusingly similar names. compat_(get|put)_time(val|spec) *do* handle the case where COMPAT_USE_64BIT_TIME is set, whereas (get|put)_compat_time(val|spec) do not. This is an accident waiting to happen. Clean it up by favoring the full-service version; the limited version is replaced with double-underscore versions static to kernel/compat.c. A common pattern is to convert a struct timespec to kernel format in an allocation on the user stack. Unfortunately it is open-coded in several places. Since this allocation isn't actually needed if COMPAT_USE_64BIT_TIME is true (since user format == kernel format) encapsulate that whole pattern into the function compat_convert_timespec(). An equivalent function should be written for struct timeval if it is needed in the future. Finally, get rid of compat_(get|put)_timeval_convert(): each was only used once, and the latter was not even doing what the function said (no conversion actually was being done.) Moving the conversion into compat_sys_settimeofday() itself makes the code much more similar to sys_settimeofday() itself. v3: Remove unused compat_convert_timeval(). v2: Drop bogus "const" in the destination argument for compat_convert_time*(). Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Hans Verkuil <hans.verkuil@cisco.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-02Merge branch 'linus' into sched/core, to resolve conflictsIngo Molnar
Conflicts: kernel/sysctl.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-02Merge branch 'linus' into core/lockingIngo Molnar
Refresh the topic. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-31Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer/dynticks updates from Ingo Molnar: "This tree contains misc dynticks updates: a fix and three cleanups" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/nohz: Fix overflow error in scheduler_tick_max_deferment() nohz_full: fix code style issue of tick_nohz_full_stop_tick nohz: Get timekeeping max deferment outside jiffies_lock tick: Rename tick_check_idle() to tick_irq_enter()
2014-01-31Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A crash fix and documentation updates" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Make sched_class::get_rr_interval() optional sched/deadline: Add sched_dl documentation sched: Fix docbook parameter annotation error in wait.h
2014-01-31Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core debug changes from Ingo Molnar: "This contains mostly kernel debugging related updates: - make hung_task detection more configurable to distros - add final bits for x86 UV NMI debugging, with related KGDB changes - update the mailing-list of MAINTAINERS entries I'm involved with" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hung_task: Display every hung task warning sysctl: Add neg_one as a standard constraint x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured x86/uv/nmi: Fix Sparse warnings kgdb/kdb: Fix no KDB config problem MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries
2014-01-30kernel/smp.c: remove cpumask_ipiRoman Gushchin
After commit 9a46ad6d6df3 ("smp: make smp_call_function_many() use logic similar to smp_call_function_single()"), cfd->cpumask is accessed only in smp_call_function_many(). So there is no more need to copy it into cfd->cpumask_ipi before putting csd into the list. The cpumask_ipi field is obsolete and can be removed. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Ingo Molnar <mingo@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Wang YanQing <udknight@gmail.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30kernel: use lockless list for smp_call_function_singleChristoph Hellwig
Make smp_call_function_single and friends more efficient by using a lockless list. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2014-01-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
2014-01-28sched/numa: Turn some magic numbers into #definesRik van Riel
Cleanup suggested by Mel Gorman. Now the code contains some more hints on what statistics go where. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-10-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Rename variables in task_numa_fault()Rik van Riel
We track both the node of the memory after a NUMA fault, and the node of the CPU on which the fault happened. Rename the local variables in task_numa_fault to make things more explicit. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-9-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Do statistics calculation using local variables onlyRik van Riel
The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other tasks temporarily see halved statistics could lead to unwanted numa migrations. This can be avoided by doing all the math in local variables. This change also simplifies the code a little. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-8-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Normalize faults_cpu stats and weigh by CPU useRik van Riel
Tracing the code that decides the active nodes has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitudes more memory than the threads that do all the active work. This resulted in the node with the garbage collector being marked the only active node in the group. This issue is avoided if we weigh the statistics by CPU use of each task in the numa group, instead of by how many faults each thread has occurred. To achieve this, we normalize the number of faults to the fraction of faults that occurred on each node, and then multiply that fraction by the fraction of CPU time the task has used since the last time task_numa_placement was invoked. This way the nodes in the active node mask will be the ones where the tasks from the numa group are most actively running, and the influence of eg. the garbage collector and other do-little threads is properly minimized. On a 4 node system, using CPU use statistics calculated over a longer interval results in about 1% fewer page migrations with two 32-warehouse specjbb runs on a 4 node system, and about 5% fewer page migrations, as well as 1% better throughput, with two 8-warehouse specjbb runs, as compared with the shorter term statistics kept by the scheduler. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-7-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa, mm: Use active_nodes nodemask to limit numa migrationsRik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration of pages 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload This patch accomplishes that by implementing the following policy for NUMA migrations: 1) always migrate on a private fault 2) never migrate to a node that is not in the set of active nodes for the numa_group 3) always migrate from a node outside of the set of active nodes, to a node that is in that set 4) within the set of active nodes in the numa_group, only migrate from a node with more NUMA page faults, to a node with fewer NUMA page faults, with a 25% margin to avoid ping-ponging This results in most pages of a workload ending up on the actively used nodes, with reduced ping-ponging of pages between those nodes. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Build per numa_group active node mask from numa_faults_cpu ↵Rik van Riel
statistics The numa_faults_cpu statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-5-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Track from which nodes NUMA faults are triggeredRik van Riel
Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is actively running on. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Rename p->numa_faults to numa_faults_memoryRik van Riel
In order to get a more consistent naming scheme, making it clear which fault statistics track memory locality, and which track CPU locality, rename the memory fault statistics. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa, mm: Remove p->numa_migrate_deferredRik van Riel
Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that the second stage of the automatic numa balancing code has stabilized, it is time to replace the simplistic migration deferral code with something smarter. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28locking/mutexes/mcs: Restructure the MCS lock defines and locking code into ↵Tim Chen
its own file We will need the MCS lock code for doing optimistic spinning for rwsem and queued rwlock. Extracting the MCS code from mutex.c and put into its own file allow us to reuse this code easily. We also inline mcs_spin_lock and mcs_spin_unlock functions for better efficiency. Note that using the smp_load_acquire/smp_store_release pair used in mcs_lock and mcs_unlock is not sufficient to form a full memory barrier across cpus for many architectures (except x86). For applications that absolutely need a full barrier across multiple cpus with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be used after mcs_lock. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1390347360.3138.63.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28locking/mutexes/mcs: Correct barrier usageWaiman Long
This patch corrects the way memory barriers are used in the MCS lock with smp_load_acquire and smp_store_release fucnctions. The previous barriers could leak critical sections if mcs lock is used by itself. It is not a problem when mcs lock is embedded in mutex but will be an issue when the mcs_lock is used elsewhere. The patch removes the incorrect barriers and put in correct barriers with the pair of functions smp_load_acquire and smp_store_release. Suggested-by: Michel Lespinasse <walken@google.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1390347353.3138.62.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched: Make sched_class::get_rr_interval() optionalPeter Zijlstra
Not all classes implement (or can implement) a useful get_rr_interval() function, default to a 0 time-slice for them. This fixes a crash reported by Tommi Rantala. Reported-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140127105413.GC11314@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/deadline: Add sched_dl documentationDario Faggioli
Add in Documentation/scheduler/ some hints about the design choices, the usage and the future possible developments of the sched_dl scheduling class and of the SCHED_DEADLINE policy. Reviewed-by: Henrik Austad <henrik@austad.us> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Re-wrote sections 2 and 3. ] Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390821615-23247-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-27softirq: use const char * const for softirq_to_name, whitespace neateningJoe Perches
Reduce data size a little. Reduce checkpatch noise. $ size kernel/softirq.o* text data bss dec hex filename 11554 6013 4008 21575 5447 kernel/softirq.o.new 11474 6093 4008 21575 5447 kernel/softirq.o.old Signed-off-by: Joe Perches <joe@perches.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27softirq: convert printks to pr_<level>Joe Perches
Use a more current logging style. Signed-off-by: Joe Perches <joe@perches.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27softirq: use ffs() in __do_softirq()Joe Perches
Possible speed improvement of __do_softirq() by using ffs() instead of using a while loop with an & 1 test then single bit shift. Signed-off-by: Joe Perches <joe@perches.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27kernel/kexec.c: use vscnprintf() instead of vsnprintf() in ↵Chen Gang
vmcoreinfo_append_str() vsnprintf() may let 'r' larger than sizeof(buf), in this case, if 'r' is also less than "vmcoreinfo_max_size - vmcoreinfo_size" (left size of destination buffer), next memcpy() will read the unexpected addresses. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27Merge tag 'trace-fixes-3.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "The first two patches fix the debugfs README file to reflect better the new features added to 3.14. The third patch is a minor bugfix to the trace_puts() functions that will crash the system if a developer adds one before the tracing system is setup. It also affects trace_printk() if it has no arguments, as the code will convert it to a trace_puts() as well. Note, this bug will not affect unmodified kernels, as trace_printk() and trace_puts() should only be used by developers for testing" * tag 'trace-fixes-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Check if tracing is enabled in trace_puts() tracing: Fix formatting of trace README file tracing/README: Add event file usage to tracing mini-HOWTO
2014-01-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A couple of regression fixes mostly hitting virtualized setups, but also some bare metal systems" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/x86/tsc: Initialize multiplier to 0 sched/clock: Fixup early initialization sched/preempt/x86: Fix voluntary preempt for x86 Revert "sched: Fix sleep time double accounting in enqueue entity"
2014-01-25hung_task: Display every hung task warningAaron Tomlin
When khungtaskd detects hung tasks, it prints out backtraces from a number of those tasks. Limiting the number of backtraces being printed out can result in the user not seeing the information necessary to debug the issue. The hung_task_warnings sysctl controls this feature. This patch makes it possible for hung_task_warnings to accept a special value to print an unlimited number of backtraces when khungtaskd detects hung tasks. The special value is -1. To use this value it is necessary to change types from ulong to int. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com [ Build warning fix. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25introduce __fcheck_files() to fix rcu_dereference_check_fdtable(), kill ↵Oleg Nesterov
rcu_my_thread_group_empty() rcu_dereference_check_fdtable() looks very wrong, 1. rcu_my_thread_group_empty() was added by 844b9a8707f1 "vfs: fix RCU-lockdep false positive due to /proc" but it doesn't really fix the problem. A CLONE_THREAD (without CLONE_FILES) task can hit the same race with get_files_struct(). And otoh rcu_my_thread_group_empty() can suppress the correct warning if the caller is the CLONE_FILES (without CLONE_THREAD) task. 2. files->count == 1 check is not really right too. Even if this files_struct is not shared it is not safe to access it lockless unless the caller is the owner. Otoh, this check is sub-optimal. files->count == 0 always means it is safe to use it lockless even if files != current->files, but put_files_struct() has to take rcu_read_lock(). See the next patch. This patch removes the buggy checks and turns fcheck_files() into __fcheck_files() which uses rcu_dereference_raw(), the "unshared" callers, fget_light() and fget_raw_light(), can use it to avoid the warning from RCU-lockdep. fcheck_files() is trivially reimplemented as rcu_lockdep_assert() plus __fcheck_files(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25sysctl: Add neg_one as a standard constraintAaron Tomlin
Add neg_one to the list of standard constraints - will be used by the next patch. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/1390239253-24030-2-git-send-email-atomlin@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25kgdb/kdb: Fix no KDB config problemMike Travis
Some code added to the debug_core module had KDB dependencies that it shouldn't have. Move the KDB dependent REASON back to the caller to remove the dependency in the debug core code. Update the call from the UV NMI handler to conform to the new interface. Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Hedi Berriche <hedi@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/20140114162551.318251993@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25Merge branch 'timers/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/urgent Pull dynticks cleanups from Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-24Merge tag 'pm+acpi-3.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "As far as the number of commits goes, the top spot belongs to ACPI this time with cpufreq in the second position and a handful of PM core, PNP and cpuidle updates. They are fixes and cleanups mostly, as usual, with a couple of new features in the mix. The most visible change is probably that we will create struct acpi_device objects (visible in sysfs) for all devices represented in the ACPI tables regardless of their status and there will be a new sysfs attribute under those objects allowing user space to check that status via _STA. Consequently, ACPI device eject or generally hot-removal will not delete those objects, unless the table containing the corresponding namespace nodes is unloaded, which is extremely rare. Also ACPI container hotplug will be handled quite a bit differently and cpufreq will support CPU boost ("turbo") generically and not only in the acpi-cpufreq driver. Specifics: - ACPI core changes to make it create a struct acpi_device object for every device represented in the ACPI tables during all namespace scans regardless of the current status of that device. In accordance with this, ACPI hotplug operations will not delete those objects, unless the underlying ACPI tables go away. - On top of the above, new sysfs attribute for ACPI device objects allowing user space to check device status by triggering the execution of _STA for its ACPI object. From Srinivas Pandruvada. - ACPI core hotplug changes reducing code duplication, integrating the PCI root hotplug with the core and reworking container hotplug. - ACPI core simplifications making it use ACPI_COMPANION() in the code "glueing" ACPI device objects to "physical" devices. - ACPICA update to upstream version 20131218. This adds support for the DBG2 and PCCT tables to ACPICA, fixes some bugs and improves debug facilities. From Bob Moore, Lv Zheng and Betty Dall. - Init code change to carry out the early ACPI initialization earlier. That should allow us to use ACPI during the timekeeping initialization and possibly to simplify the EFI initialization too. From Chun-Yi Lee. - Clenups of the inclusions of ACPI headers in many places all over from Lv Zheng and Rashika Kheria (work in progress). - New helper for ACPI _DSM execution and rework of the code in drivers that uses _DSM to execute it via the new helper. From Jiang Liu. - New Win8 OSI blacklist entries from Takashi Iwai. - Assorted ACPI fixes and cleanups from Al Stone, Emil Goode, Hanjun Guo, Lan Tianyu, Masanari Iida, Oliver Neukum, Prarit Bhargava, Rashika Kheria, Tang Chen, Zhang Rui. - intel_pstate driver updates, including proper Baytrail support, from Dirk Brandewie and intel_pstate documentation from Ramkumar Ramachandra. - Generic CPU boost ("turbo") support for cpufreq from Lukasz Majewski. - powernow-k6 cpufreq driver fixes from Mikulas Patocka. - cpufreq core fixes and cleanups from Viresh Kumar, Jane Li, Mark Brown. - Assorted cpufreq drivers fixes and cleanups from Anson Huang, John Tobias, Paul Bolle, Paul Walmsley, Sachin Kamat, Shawn Guo, Viresh Kumar. - cpuidle cleanups from Bartlomiej Zolnierkiewicz. - Support for hibernation APM events from Bin Shi. - Hibernation fix to avoid bringing up nonboot CPUs with ACPI EC disabled during thaw transitions from Bjørn Mork. - PM core fixes and cleanups from Ben Dooks, Leonardo Potenza, Ulf Hansson. - PNP subsystem fixes and cleanups from Dmitry Torokhov, Levente Kurusa, Rashika Kheria. - New tool for profiling system suspend from Todd E Brandt and a cpupower tool cleanup from One Thousand Gnomes" * tag 'pm+acpi-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (153 commits) thermal: exynos: boost: Automatic enable/disable of BOOST feature (at Exynos4412) cpufreq: exynos4x12: Change L0 driver data to CPUFREQ_BOOST_FREQ Documentation: cpufreq / boost: Update BOOST documentation cpufreq: exynos: Extend Exynos cpufreq driver to support boost cpufreq / boost: Kconfig: Support for software-managed BOOST acpi-cpufreq: Adjust the code to use the common boost attribute cpufreq: Add boost frequency support in core intel_pstate: Add trace point to report internal state. cpufreq: introduce cpufreq_generic_get() routine ARM: SA1100: Create dummy clk_get_rate() to avoid build failures cpufreq: stats: create sysfs entries when cpufreq_stats is a module cpufreq: stats: free table and remove sysfs entry in a single routine cpufreq: stats: remove hotplug notifiers cpufreq: stats: handle cpufreq_unregister_driver() and suspend/resume properly cpufreq: speedstep: remove unused speedstep_get_state platform: introduce OF style 'modalias' support for platform bus PM / tools: new tool for suspend/resume performance optimization ACPI: fix module autoloading for ACPI enumerated devices ACPI: add module autoloading support for ACPI enumerated devices ACPI: fix create_modalias() return value handling ...
2014-01-23Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge second patch-bomb from Andrew Morton: - various misc bits - the rest of MM - add generic fixmap.h, use it - backlight updates - dynamic_debug updates - printk() updates - checkpatch updates - binfmt_elf - ramfs - init/ - autofs4 - drivers/rtc - nilfs - hfsplus - Documentation/ - coredump - procfs - fork - exec - kexec - kdump - partitions - rapidio - rbtree - userns - memstick - w1 - decompressors * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (197 commits) lib/decompress_unlz4.c: always set an error return code on failures romfs: fix returm err while getting inode in fill_super drivers/w1/masters/w1-gpio.c: add strong pullup emulation drivers/memstick/host/rtsx_pci_ms.c: fix ms card data transfer bug userns: relax the posix_acl_valid() checks arch/sh/kernel/dwarf.c: use rbtree postorder iteration helper instead of solution using repeated rb_erase() fs-ext3-use-rbtree-postorder-iteration-helper-instead-of-opencoding-fix fs/ext3: use rbtree postorder iteration helper instead of opencoding fs/jffs2: use rbtree postorder iteration helper instead of opencoding fs/ext4: use rbtree postorder iteration helper instead of opencoding fs/ubifs: use rbtree postorder iteration helper instead of opencoding net/netfilter/ipset/ip_set_hash_netiface.c: use rbtree postorder iteration instead of opencoding rbtree/test: test rbtree_postorder_for_each_entry_safe() rbtree/test: move rb_node to the middle of the test struct rapidio: add modular rapidio core build into powerpc and mips branches partitions/efi: complete documentation of gpt kernel param purpose kdump: add /sys/kernel/vmcoreinfo ABI documentation kdump: fix exported size of vmcoreinfo note kexec: add sysctl to disable kexec_load fs/exec.c: call arch_pick_mmap_layout() only once ...