path: root/kernel
Age    Commit message    Author
2012-09-06pstore/ftrace: Convert to its own enable/disable debugfs knobAnton Vorontsov
With this patch we no longer reuse the function tracer infrastructure; instead we register our own tracer back-end via a debugfs knob. It's a bit more code, but that is the only downside. On the bright side we have:

- The ability to make the persistent_ram module removable (when needed, we can move the ftrace_ops struct into a module). Note that persistent_ram is still not removable for other reasons, but with this patch it's one thing less to worry about;

- The pstore part is more isolated from the generic function tracer. We already tried registering our own tracer in available_tracers, but that way we were losing the ability to see the traces while recording them to pstore. This solution is somewhere in the middle: we only register the "internal ftracer" back-end, but not the "front-end";

- When only pstore tracing is enabled, the kernel writes only to the pstore buffer, omitting the function tracer buffer (which, of course, can still be enabled via 'echo function > current_tracer').

Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
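A minimal sketch of this kind of back-end registration, assuming the 2012-era two-argument ftrace callback signature; the function and knob names here are illustrative, not the exact patch:

        /* sketch: a pstore-only function-trace back-end */
        static void notrace pstore_ftrace_call(unsigned long ip,
                                               unsigned long parent_ip)
        {
                /* write an { ip, parent_ip } record into the pstore buffer */
        }

        static struct ftrace_ops pstore_ftrace_ops = {
                .func = pstore_ftrace_call,
        };

        /* on 'echo 1 > <debugfs knob>' (hypothetical knob) */
        register_ftrace_function(&pstore_ftrace_ops);
        /* on 'echo 0 > <debugfs knob>' */
        unregister_ftrace_function(&pstore_ftrace_ops);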
2012-09-05workqueue: fix possible deadlock in idle worker rebindingTejun Heo
Currently, rebind_workers() and idle_worker_rebind() are two-way interlocked: rebind_workers() waits for idle workers to finish rebinding, and rebound idle workers wait for rebind_workers() to finish rebinding busy workers before proceeding. Unfortunately, this isn't enough. The second wait from idle workers is implemented as follows:

        wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));

rebind_workers() clears WORKER_REBIND, wakes up the idle workers and then returns. If a CPU hotplug cycle happens again before one of the idle workers finishes the above wait_event(), rebind_workers() will repeat the first part of the handshake - set WORKER_REBIND again and wait for the idle worker to finish rebinding - and this leads to deadlock because the idle worker would be waiting for WORKER_REBIND to clear.

This is fixed by adding another interlocking step at the end: rebind_workers() now waits for all the idle workers to finish the above WORKER_REBIND wait before returning. This ensures that all rebinding steps are complete on all idle workers before the next hotplug cycle can happen.

This problem was diagnosed by Lai Jiangshan, who also posted a patch to fix the issue, upon which this patch is based. This is the minimal fix; further patches are scheduled for the next merge window to simplify the CPU hotplug path.

Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
2012-09-05workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the functionTejun Heo
This doesn't make any functional difference and is purely to help the next patch to be simpler. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2012-09-04workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomicLai Jiangshan
The compiler may compile the following code into TWO write/modify instructions:

        worker->flags &= ~WORKER_UNBOUND;
        worker->flags |= WORKER_REBIND;

so the other CPU may temporarily see a worker->flags value that has neither WORKER_UNBOUND nor WORKER_REBIND set, and perform a local wakeup prematurely. Fix it by using a single explicit assignment via ACCESS_ONCE(). Because idle workers have another WORKER_NOT_RUNNING flag, this bug doesn't exist for them; however, update them to use the same pattern for consistency. tj: Applied the change to idle workers too and updated comments and patch description a bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
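A hedged sketch of the single-store fix, using only the two flags named above (surrounding context simplified):

        /* one explicit store: remote CPUs can never observe the
         * intermediate state with both flags clear */
        ACCESS_ONCE(worker->flags) =
                (worker->flags & ~WORKER_UNBOUND) | WORKER_REBIND;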
2012-09-04perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabledK.Prasad
While debugging a warning message on PowerPC while using hardware breakpoints, it was discovered that when perf_event_disable() is invoked through the hw_breakpoint_handler() function with interrupts disabled, a subsequent IPI in the code path would trigger a WARN_ON_ONCE message in the smp_call_function_single() function. This patch calls __perf_event_disable() when interrupts are already disabled, instead of perf_event_disable(). Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com> Signed-off-by: K.Prasad <Prasad.Krishnan@gmail.com> [naveen.n.rao@linux.vnet.ibm.com: v3: Check to make sure we target current task] Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120802081635.5811.17737.stgit@localhost.localdomain [ Fixed build error on MIPS. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
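A minimal sketch of the guard described above; the call-site shape is an assumption, only the function names come from the log:

        /* in the hw_breakpoint handler path */
        if (irqs_disabled())
                __perf_event_disable(bp);  /* runs locally, no cross-CPU IPI */
        else
                perf_event_disable(bp);    /* may IPI via smp_call_function_single() */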
2012-09-04perf_event: Switch to internal refcount, fix race with close()Al Viro
Don't mess with file refcounts (or keep a reference to the file, for that matter) in perf_event. Use an explicit refcount of its own instead. Deal with the race between the final reference to the event going away and new children getting created for it by using atomic_long_inc_not_zero() in inherit_event(); just have the latter free what it had allocated and return NULL. That works out just fine (children of siblings of something doomed are created as singletons, same as if the child of the leader had been created and immediately killed). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
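A hedged sketch of the inc_not_zero pattern in inherit_event(); the surrounding code is elided and the helper names are illustrative:

        /*
         * If the parent's refcount already hit zero, the event is being
         * torn down; don't create a child for it.
         */
        if (!atomic_long_inc_not_zero(&parent_event->refcount)) {
                free_event(child_event);  /* free what we had allocated */
                return NULL;              /* caller treats NULL as "no child" */
        }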
2012-09-04sched: Remove useless code in yield_to()Michael Wang
It's impossible to enter the else branch if we have set skip_clock_update in task_yield_fair(), as yield_to_task_fair() will directly return true after invoking task_yield_fair(). Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Acked-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04sched: Add time unit suffix to sched sysctl knobsNamhyung Kim
Unlike the others, sched_migration_cost, sched_time_avg and sched_shares_window don't have a time unit as a suffix. Add them. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345083330-19486-1-git-send-email-namhyung@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04sched/debug: Limit sd->*_idx range on sysctlNamhyung Kim
Various sd->*_idx values are used to refer to the rq's load average table when selecting a CPU to run on. However, they can be set to any number via sysctl knobs, and a bad value can crash the kernel. Fix it by limiting them to the table's actual range. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
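A hedged sketch of how such sysctl range clamping is typically expressed; the patch's actual table-construction helpers may differ:

        static int min_idx;                             /* 0 */
        static int max_idx = CPU_LOAD_IDX_MAX - 1;      /* upper bound of the load table */

        {
                .procname     = "busy_idx",
                .maxlen       = sizeof(int),
                .mode         = 0644,
                .proc_handler = proc_dointvec_minmax,   /* rejects out-of-range writes */
                .extra1       = &min_idx,
                .extra2       = &max_idx,
        },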
2012-09-04sched: Remove AFFINE_WAKEUPS feature flagNamhyung Kim
Commit beac4c7e4a1c ("sched: Remove AFFINE_WAKEUPS feature") removed use of the flag but left the definition. Get rid of it. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04Merge branch 'sched/urgent' into sched/coreIngo Molnar
Merge in the current fixes branch, we are going to apply dependent patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04sched: Fix kernel-doc warnings in kernel/sched/fair.cRandy Dunlap
Fix two kernel-doc warnings in kernel/sched/fair.c:

        Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
        Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04sched: Unthrottle rt runqueues in __disable_runtime()Peter Boonstoppel
migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When an rt_rq is throttled, _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead, unthrottle rt runqueues before migrating tasks. Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair(). Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
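A hedged sketch of the unthrottling step in __disable_runtime(); the comment paraphrases the rationale from the log:

        /*
         * We cannot be left throttled: an offlined CPU gets no runtime
         * replenishment, and migrate_tasks() must be able to pick every task.
         */
        rt_rq->rt_throttled = 0;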
2012-09-04sched: Add missing call to calc_load_exit_idle()Charles Wang
Azat Khuzhin reported high loadavg in Linux v3.6. After checking the upstream scheduler code, I found that Peter's commit 5167e8d5417b ("sched/nohz: Rewrite and fix load-avg computation -- again") was not fully applied: it was missing the call to calc_load_exit_idle(). Because of that, an idle exit within the sampling window is always accounted as non-idle, and the load ends up higher than normal. This patch adds the missing call to calc_load_exit_idle(). Signed-off-by: Charles Wang <muming.wq@taobao.com> Cc: stable@kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345449754-27130-1-git-send-email-muming.wq@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
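The fix is a one-liner; sketched below per the log's description, with the exact enclosing function being an assumption:

        /* in the NO_HZ idle-exit path (plausibly tick_nohz_idle_exit()) */
        calc_load_exit_idle();  /* close the idle period for load-avg sampling */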
2012-09-04sched: Fix load avg vs cpu-hotplugPeter Zijlstra
Rakib and Paul reported two different issues related to the same few lines of code.

Rakib's issue is that the nr_uninterruptible migration code is wrong, in that he sees artifacts due to it (Rakib, please do expand in more detail).

Paul's issue is that this code as it stands relies on us using stop_machine() for unplug; we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops; however, since nr_uninterruptible() is the only such loop and it's using the possible mask, let's not bother at all.

The problem Rakib sees is (probably) caused by the fact that by migrating nr_uninterruptible we corrupt rq->calc_load_active for both rqs involved.

So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away, to make sure we don't have any skewed nr_active accounting.

Reported-by: Rakib Mullick <rakib.mullick@gmail.com> Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
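A hedged sketch of the fold-after-migration idea; helper names follow the log's vocabulary and the exact code may differ:

        /* after migrate_tasks() has moved everything off the dead CPU */
        static void calc_load_migrate(struct rq *rq)
        {
                long delta = calc_load_fold_active(rq); /* leftover nr_active */

                if (delta)
                        atomic_long_add(delta, &calc_load_tasks); /* fold globally */
        }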
2012-09-04timekeeping: Add suspend and resume of clock event devicesRafael J. Wysocki
Some clock event devices, for example those that belong to PM domains, need to be handled in a special way during the timekeeping suspend and resume (which take place in the system core, or "syscore", stages of system power transitions), in analogy with clock sources. Introduce .suspend() and .resume() callbacks for clock event devices that will be executed by timekeeping_suspend/_resume(), respectively, next to the clock sources' .suspend() and .resume() callbacks. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
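A minimal sketch of the new callbacks as described, with the struct trimmed to the relevant fields:

        struct clock_event_device {
                /* ... existing fields ... */
                void (*suspend)(struct clock_event_device *dev);
                void (*resume)(struct clock_event_device *dev);
        };

        /* invoked from timekeeping_suspend()/timekeeping_resume(),
         * alongside the clocksource .suspend()/.resume() calls */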
2012-09-04PM / Domains: Add power off/on function for system core suspend stageRafael J. Wysocki
Introduce the function pm_genpd_syscore_switch() and two wrappers around it, pm_genpd_syscore_poweroff() and pm_genpd_syscore_poweron(), allowing callers to let the generic PM domains framework know that the given device is no longer necessary and its PM domain can be turned off (the former), or that the given device will be required immediately, so its PM domain has to be turned on (the latter), during the system core (syscore) stage of system suspend (or hibernation) and resume. These functions will be used for handling devices registered as clock sources and clock event devices that belong to PM domains. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
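A sketch of the two wrappers as the log describes them; the boolean-switch shape is an assumption:

        /* device is done for this transition: its domain may power off */
        static inline void pm_genpd_syscore_poweroff(struct device *dev)
        {
                pm_genpd_syscore_switch(dev, true);
        }

        /* device is needed right now: its domain must power on */
        static inline void pm_genpd_syscore_poweron(struct device *dev)
        {
                pm_genpd_syscore_switch(dev, false);
        }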
2012-09-01time: Move ktime_t overflow checking into timespec_valid_strictJohn Stultz
Andreas Bombe reported that the ktime_t overflow checking added to timespec_valid in commit 4e8b14526ca7 ("time: Improve sanity checking of timekeeping inputs") was causing problems with X.org because it caused timeouts larger than KTIME_MAX to be treated as invalid. Previously, these large timeouts would be clamped to KTIME_MAX and would never expire, which is valid. This patch splits the ktime_t overflow checking into a new timespec_valid_strict function, and converts the timekeeping code's internal checking to use this stricter function. Reported-and-tested-by: Andreas Bombe <aeb@debian.org> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
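A hedged sketch of the split, following the log's description; the exact bounds check in the real helper may differ slightly:

        static inline bool timespec_valid(const struct timespec *ts)
        {
                /* dates before 1970 are bogus */
                if (ts->tv_sec < 0)
                        return false;
                /* can't have more nanoseconds than a second */
                if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
                        return false;
                return true;
        }

        static inline bool timespec_valid_strict(const struct timespec *ts)
        {
                if (!timespec_valid(ts))
                        return false;
                /* additionally reject values that would overflow ktime_t */
                if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX)
                        return false;
                return true;
        }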
2012-08-28uprobes: Remove "verify" argument from set_orig_insn()Oleg Nesterov
Nobody does set_orig_insn(verify => false), and I think nobody will. Remove this argument. IIUC set_orig_insn(verify => false) was needed to single-step without xol area. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-08-28uprobes: Fold uprobe_reset_state() into uprobe_dup_mmap()Oleg Nesterov
Now that we have uprobe_dup_mmap() we can fold uprobe_reset_state() into the new hook and remove it. mmput()->uprobe_clear_state() can't be called before dup_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-08-28uprobes: Introduce MMF_HAS_UPROBESOleg Nesterov
Add the new MMF_HAS_UPROBES flag. It is set by install_breakpoint() and copied by dup_mmap(); uprobe_pre_sstep_notifier() checks it to avoid the slow path if the task was never probed. Perhaps it makes sense to check it in valid_vma(is_register => false) as well. This needs the new dup_mmap()->uprobe_dup_mmap() hook. We can't use uprobe_reset_state() or put MMF_HAS_UPROBES into MMF_INIT_MASK; we need oldmm->mmap_sem to avoid the race with uprobe_register() or mmap() from another thread. Currently we never clear this bit, so it can be a false positive after uprobe_unregister() or uprobe_munmap(), or if dup_mmap() hits the probed VM_DONTCOPY vma. But this is fine correctness-wise and has no effect unless the task hits a non-uprobe breakpoint. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
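A hedged sketch of the new hook as described; the locking comment paraphrases the log, and the body is a simplification:

        void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
        {
                /* the caller (dup_mmap) holds oldmm->mmap_sem, closing the
                 * race with uprobe_register()/mmap() from another thread */
                if (test_bit(MMF_HAS_UPROBES, &oldmm->flags))
                        set_bit(MMF_HAS_UPROBES, &newmm->flags);
        }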
2012-08-28uprobes: Do not use -EEXIST in install_breakpoint() pathsOleg Nesterov
-EEXIST from install_breakpoint() no longer makes sense; all callers should simply treat it as "success". Change the code to return zero and simplify register_for_each_vma(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-08-28uprobes: Change uprobe_mmap() to ignore the errors but check fatal_signal_pending()Oleg Nesterov
Once install_breakpoint() fails, uprobe_mmap() "ignores" all other uprobes and returns the error. It was never really needed to stop after the first error, and in fact it was always wrong, at least in the -ENOTSUPP case. Change uprobe_mmap() to ignore the errors and always return 0. This is not what we want in the long term, but until we teach the callers to handle the failure it is better to remove the pointless complications. And this doesn't look too bad: the only "reasonable" error is -ENOMEM, but in that case the caller is likely to be oom-killed, or the system has more serious problems. However, it makes sense to stop if fatal_signal_pending() == T. In particular this helps if the task was oom-killed. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
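A hedged sketch of the resulting loop shape; the list handling and address computation are simplified from the log:

        int uprobe_mmap(struct vm_area_struct *vma)
        {
                /* ... build tmp_list of uprobes covering this vma ... */
                list_for_each_entry(uprobe, &tmp_list, pending_list) {
                        if (!fatal_signal_pending(current)) {
                                /* vaddr computed per-uprobe; elided here.
                                 * Result deliberately ignored. */
                                install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
                        }
                        put_uprobe(uprobe);
                }
                return 0;       /* always "succeed", per this patch */
        }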
2012-08-28uprobes: Kill dup_mmap()->uprobe_mmap(), simplify uprobe_mmap/munmapOleg Nesterov
1. Kill dup_mmap()->uprobe_mmap(); it was only needed to calculate new_mm->uprobes_state.count, removed by the previous patch. If the forking process has a pending uprobe (int3) in a vma, it will be copied by copy_page_range(); note that it checks vma->anon_vma, so "Don't copy ptes" is not possible after install_breakpoint(), which does anon_vma_prepare().

2. Remove is_swbp_at_addr() and "int count" in uprobe_mmap(). Again, this was needed for uprobes_state.count. As a side effect this fixes the bug pointed out by Srikar: this code lacked the necessary put_uprobe().

3. uprobe_munmap() becomes a nop after the previous patch. Remove the meaningless code but do not remove the helper; we will need it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-08-28uprobes: Kill uprobes_state->countOleg Nesterov
uprobes_state->count is only needed to avoid the slow path in uprobe_pre_sstep_notifier(). It is also checked in uprobe_munmap(), but ironically its only goal there is to decrement this counter. However, it is very broken. Just some examples:

- uprobe_mmap() can race with uprobe_unregister() and wrongly increment the counter if it hits the non-uprobe "int3". Note that install_breakpoint() checks ->consumers first and returns -EEXIST if it is NULL. The "atomic_sub() if error" in uprobe_mmap() looks obviously wrong too.

- uprobe_munmap() can race with uprobe_register() and wrongly decrement the counter for the same reason.

- Suppose an application tries to increase the mmapped area via sys_mremap(). vma_adjust() does uprobe_munmap(whole_vma) first; this can nullify the counter temporarily and race with another thread which can hit the bp, and the application will be killed by SIGTRAP.

- Suppose an application mmaps 2 consecutive areas in the same file and one (or both) of these areas has uprobes. In the likely case mmap_region()->vma_merge() succeeds. Like above, this leads to uprobe_munmap/uprobe_mmap from vma_merge()->vma_adjust(), but then mmap_region() does another uprobe_mmap(resulting_vma) and doubles the counter.

This patch only removes this counter and fixes the compile errors; next we will try to clean up the changed code and add something else instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2012-08-28uprobes: Remove check for uprobe variable in handle_swbp()Sebastian Andrzej Siewior
By the time we get here (after we pass cleanup_ret), uprobe is always set; if it were NULL we would have left much earlier in the code. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2012-08-28uprobes: Remove redundant lock_page/unlock_pageSrikar Dronamraju
Since read_opcode() only reads from the referenced page and modifies neither the page contents nor the page attributes, there is no need to lock the page. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2012-08-28Merge branch 'perf/urgent' into perf/coreIngo Molnar
Pick up the latest fixes because upcoming uprobes changes will rely on them. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-26Merge remote-tracking branch 'upstream/master' into queueMarcelo Tosatti
Merging critical fixes from upstream required for development.

* upstream/master: (809 commits)
  libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
  Revert "powerpc: Update g5_defconfig"
  powerpc/perf: Use pmc_overflow() to detect rolled back events
  powerpc: Fix VMX in interrupt check in POWER7 copy loops
  powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
  powerpc: Fix personality handling in ppc64_personality()
  powerpc/dma-iommu: Fix IOMMU window check
  powerpc: Remove unnecessary ifdefs
  powerpc/kgdb: Restore current_thread_info properly
  powerpc/kgdb: Bail out of KGDB when we've been triggered
  powerpc/kgdb: Do not set kgdb_single_step on ppc
  powerpc/mpic_msgr: Add missing includes
  powerpc: Fix null pointer deref in perf hardware breakpoints
  powerpc: Fixup whitespace in xmon
  powerpc: Fix xmon dl command for new printk implementation
  xfs: check for possible overflow in xfs_ioc_trim
  xfs: unlock the AGI buffer when looping in xfs_dialloc
  xfs: fix uninitialised variable in xfs_rtbuf_get()
  powerpc/fsl: fix "Failed to mount /dev: No such device" errors
  powerpc/fsl: update defconfigs
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-24cgroup: rename subsys_bits to subsys_maskAristeu Rozanski
In a previous discussion, Tejun Heo suggested renaming references to subsys_bits (added_bits, removed_bits, etc.) to something more meaningful. Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lennart Poettering <lpoetter@redhat.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-24cgroup: add xattr supportAristeu Rozanski
This is one of the items in the plumber's wish list. For use cases:

>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart

v7: - no changes
v6: - remove user xattr namespace, only allow trusted and security
v5: - check for capabilities before setting/removing xattrs
v4: - no changes
v3: - instead of config option, use mount option to enable xattr support

Original-patch-by: Li Zefan <lizefan@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lennart Poettering <lpoetter@redhat.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-24cgroup: revise how we re-populate root directoryAristeu Rozanski
When remounting cgroupfs with some subsystems added and some removed, cgroup would remove all the files in the root directory and then re-populate it. What I'm doing here is: only remove files which belong to subsystems that are to be unbound, and only create files for newly added subsystems. The purpose is to leave all other files untouched. This is a preparation for cgroup xattr support.

v7: - checkpatch warnings fixed
v6: - no changes
v5: - no changes
v4: - refactored cgroup_clear_directory() to not use cgroup_rm_file()
    - instead of going thru the list of files, get the file list using the subsystems
    - use 'subsys_mask' instead of {added,removed}_bits and made cgroup_populate_dir() match the parameters of cgroup_clear_directory()
v3: - refresh patches after recent refactoring

Original-patch-by: Li Zefan <lizefan@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lennart Poettering <lpoetter@redhat.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-24Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceDavid S. Miller
This is an initial merge of Eric Biederman's work to start adding user namespace support to the networking. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23userns: Make credential debugging user namespace safe.Eric W. Biederman
Cc: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-23Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull perf fixes from Ingo Molnar: "This tree contains misc fixlets: a perf script python binding fix, a uprobes fix and a syscall tracing fix."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Add missing files to build the python binding
  uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails
  tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
2012-08-23Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull timer fixes from Thomas Gleixner: "Mostly small fixes for the fallout of the timekeeping overhaul in 3.6, along with stable fixes to address an accumulation problem and missing sanity checks for RTC readouts and user space provided values."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Avoid making adjustments if we haven't accumulated anything
  time: Avoid potential shift overflow with large shift values
  time: Fix casting issue in timekeeping_forward_now
  time: Ensure we normalize the timekeeper in tk_xtime_add
  time: Improve sanity checking of timekeeping inputs
2012-08-23ftrace: Do not test frame pointers if -mfentry is usedSteven Rostedt
The function graph tracer has a test to check whether the frame pointer is corrupted, which can happen with various gcc options when mcount is used. But this is not an issue with -mfentry, as -mfentry neither needs nor uses frame pointers for function graph tracing. Link: http://lkml.kernel.org/r/20120807194059.773895870@goodmis.org Acked-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-08-23ftrace: Add -mfentry to Makefile on function tracerSteven Rostedt
Thanks to Andi Kleen, gcc 4.6.0 now supports -mfentry for x86 (and hopefully soon for other archs). What this does is have the function profiler start at the beginning of the function instead of after the stack is set up. Plain -pg (mcount) is called after the stack is set up, which in some cases can cause issues with the function graph tracer; it also requires frame pointers to be enabled. With -mfentry, __fentry__ is called at the beginning of the function instead. This allows compiling without frame pointers and even makes it possible to access parameters if needed. If the architecture and the compiler both support -mfentry, then use it instead. Link: http://lkml.kernel.org/r/20120807194059.392617243@goodmis.org Acked-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-08-23x86/smp: Don't ever patch back to UP if we unplug cpusRusty Russell
We still patch SMP instructions to UP variants if we boot with a single CPU, but not at any other time. In particular, not if we unplug CPUs to return to a single CPU.

Paul McKenney points out: mean offline overhead is 6251/48=130.2 milliseconds. If I remove the alternatives_smp_switch() from the offline path [...] the mean offline overhead is 550/42=13.1 milliseconds.

Basically, we're never going to get those 120ms back, and the code is pretty messy. We get rid of:

1) The "smp-alt-once" boot option. It's actually "smp-alt-boot"; the documentation is wrong. It's now the default.

2) The skip_smp_alternatives flag used by suspend.

3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end(), which were only used to set this one flag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul McKenney <paul.mckenney@us.ibm.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-23PM / Freezer: Fix small typo "regrigerator"Sedat Dilek
Noticed when digging into a suspend issue in linux-next (next-20120821). For more details see <http://marc.info/?t=134554708000002&r=1&w=2>. Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-08-22time: Avoid making adjustments if we haven't accumulated anythingJohn Stultz
If update_wall_time() is called and the current offset isn't large enough to accumulate, avoid re-calling timekeeping_adjust(), which may change the clock frequency and can cause 1ns inconsistencies with CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
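A hedged sketch of the early-out this describes; field names follow the era's timekeeper and the surrounding function is elided:

        /* in update_wall_time(), before accumulating (sketch) */
        if (offset < timekeeper.cycle_interval)
                goto out;       /* nothing to accumulate; skip timekeeping_adjust() */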
2012-08-22time: Avoid potential shift overflow with large shift valuesJohn Stultz
Andreas Schwab noticed that 1 << tk->shift could overflow if the shift value was greater than 30, since the literal 1 is only 32 bits wide on 32-bit architectures. This issue was introduced by 1e75fa8be ("time: Condense timekeeper.xtime into xtime_sec"). Use 1ULL instead to ensure we don't overflow on the shift. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
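The one-character nature of the fix, sketched in an illustrative expression context:

        /* before: overflows for tk->shift > 30 on 32-bit */
        remainder = tk->xtime_nsec & ((1 << tk->shift) - 1);

        /* after: promote to 64-bit before shifting */
        remainder = tk->xtime_nsec & ((1ULL << tk->shift) - 1);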
2012-08-22time: Fix casting issue in timekeeping_forward_nowAndreas Schwab
arch_gettimeoffset() returns a u32 value which, when shifted by tk->shift, can overflow. This issue was introduced with 1e75fa8be ("time: Condense timekeeper.xtime into xtime_sec"). Cast it to u64 first. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
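Sketched in the same spirit; the exact statement in timekeeping_forward_now() may differ:

        /* cast before the shift so the arithmetic happens in 64 bits */
        tk->xtime_nsec += (u64)arch_gettimeoffset() << tk->shift;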
2012-08-22time: Ensure we normalize the timekeeper in tk_xtime_addJohn Stultz
Andreas noticed problems with resume on specific hardware after commit 1e75fa8b ("time: Condense timekeeper.xtime into xtime_sec") combined with commit b44d50dca ("time: Fix casting issue in tk_set_xtime and tk_xtime_add"). After some digging I realized we weren't normalizing the timekeeper after the add. Add the missing normalize call. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Tested-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
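A sketch of the fixed helper per the log, assuming a tk_normalize_xtime() that carries whole seconds out of xtime_nsec:

        static void tk_xtime_add(struct timekeeper *tk, const struct timespec *ts)
        {
                tk->xtime_sec += ts->tv_sec;
                tk->xtime_nsec += (u64)ts->tv_nsec << tk->shift;
                tk_normalize_xtime(tk);         /* the missing normalize call */
        }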
2012-08-21workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()Tejun Heo
cancel_delayed_work() can't be called from IRQ handlers due to its use of del_timer_sync(), and it can't cancel work items which have already been transferred from the timer to the worklist. Also, unlike other flush and cancel functions, a canceled delayed_work would still point to the last associated cpu_workqueue. If the workqueue is destroyed afterwards and the work item is re-used on a different workqueue, the queueing code can oops trying to dereference the already-freed cpu_workqueue. This patch reimplements cancel_delayed_work() using try_to_grab_pending() and set_work_cpu_and_clear_pending(). This allows the function to be called from IRQ handlers and makes its behavior consistent with the other flush/cancel functions. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>
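A hedged sketch of the reimplementation using the two helpers named in the log; the exact argument lists are assumptions:

        bool cancel_delayed_work(struct delayed_work *dwork)
        {
                unsigned long flags;
                int ret;

                /* -EAGAIN means the timer fired concurrently; retry */
                do {
                        ret = try_to_grab_pending(&dwork->work, true, &flags);
                } while (unlikely(ret == -EAGAIN));

                if (unlikely(ret < 0))
                        return false;   /* couldn't grab: nothing canceled */

                /* detach from the last cpu_workqueue so re-use is safe */
                set_work_cpu_and_clear_pending(&dwork->work,
                                               work_cpu(&dwork->work));
                local_irq_restore(flags);
                return ret;     /* true if a pending item was canceled */
        }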
2012-08-21workqueue: use irqsafe timer for delayed_workTejun Heo
Up to now, for delayed_works, try_to_grab_pending() couldn't be used from IRQ handlers because IRQs may happen while delayed_work_timer_fn() is in progress, leading to an indefinite -EAGAIN. This patch makes delayed_work use the new TIMER_IRQSAFE flag for delayed_work->timer. This makes try_to_grab_pending() and thus mod_delayed_work_on() safe to call from IRQ handlers. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-21Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-3.7Tejun Heo
2012-08-21Merge branch 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsLinus Torvalds
Pull audit-tree fixes from Miklos Szeredi: "The audit subsystem maintainers (Al and Eric) are not responding to repeated resends. Eric did ack them a while ago, but no response since then. So I'm sending these directly to you."

* 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  audit: clean up refcounting in audit-tree
  audit: fix refcounting in audit-tree
  audit: don't free_chunk() after fsnotify_add_mark()
2012-08-21task_work: add a scheduling point in task_work_run()Eric Dumazet
It seems commit 4a9d4b024a31 ("switch fput to task_work_add") re-introduced the problem addressed in 944be0b22472 ("close_files(): add scheduling point"). If a server process with a lot of files (say 2 million tcp sockets) is killed, we can spend a lot of time in task_work_run() and trigger a soft lockup. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
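A sketch of the fix implied by the subject - a cond_resched() in the work-running loop; the loop shape is simplified:

        void task_work_run(void)
        {
                struct callback_head *work, *next;

                /* ... detach the pending list into 'work' ... */
                while (work) {
                        next = work->next;
                        work->func(work);       /* may fput() and free a file */
                        work = next;
                        cond_resched();         /* the added scheduling point */
                }
        }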
2012-08-21timer: Implement TIMER_IRQSAFETejun Heo
Timer internals are protected with irq-safe locks, but timer execution isn't, so a timer being dequeued for execution and its execution aren't atomic against IRQs. This makes it impossible to wait for its completion from IRQ handlers and difficult to shoot down a timer from IRQ handlers.

This caused problems for the delayed_work interface. Because there's no way to reliably shoot down delayed_work->timer from IRQ handlers, __cancel_delayed_work() can't share the logic to steal the target delayed_work with cancel_delayed_work_sync(), and can only steal delayed_works which are still queued on the timer. Similarly, the pending mod_delayed_work() can't be used from IRQ handlers.

This patch adds a new timer flag, TIMER_IRQSAFE, which makes the timer execute without enabling IRQs after dequeueing, so that its dequeueing and execution are atomic against IRQ handlers. This makes it safe to wait for the timer's completion from IRQ handlers, for example using del_timer_sync(): the timer can never be executing on the local CPU, and if it is executing on other CPUs it won't be interrupted until done. This will enable simplifying the delayed_work cancel/mod interface.

Signed-off-by: Tejun Heo <tj@kernel.org> Cc: torvalds@linux-foundation.org Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1344449428-24962-5-git-send-email-tj@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
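A hedged sketch of the dispatch-time difference, based on the description above; the locking names approximate the timer-base internals:

        /* in the timer softirq, when expiring a timer (sketch) */
        if (irqsafe) {
                /* IRQs stay disabled across the handler: dequeue and
                 * execution are atomic w.r.t. IRQ handlers */
                spin_unlock(&base->lock);
                call_timer_fn(timer, fn, data);
                spin_lock(&base->lock);
        } else {
                spin_unlock_irq(&base->lock);
                call_timer_fn(timer, fn, data);
                spin_lock_irq(&base->lock);
        }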