summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2010-02-11ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSETSuresh Siddha
Generic support for PTRACE_GETREGSET/PTRACE_SETREGSET commands which export the regsets supported by each architecture using the correponding NT_* types. These NT_* types are already part of the userland ABI, used in representing the architecture specific register sets as different NOTES in an ELF core file. 'addr' parameter for the ptrace system call encode the REGSET type (using the corresppnding NT_* type) and the 'data' parameter points to the struct iovec having the user buffer and the length of that buffer. struct iovec iov = { buf, len}; ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, &iov); On successful completion, iov.len will be updated by the kernel specifying how much the kernel has written/read to/from the user's iov.buf. x86 extended state registers are primarily exported using this interface. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20100211195614.886724710@sbs-t61.sc.intel.com> Acked-by: Hongjiu Lu <hjl.tools@gmail.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-09Export the symbol of getboottime and mmonotonic_to_bootbasedJason Wang
Export getboottime and monotonic_to_bootbased in order to let them could be used by following patch. Cc: stable@kernel.org Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-02-04Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Handle futex value corruption gracefully futex: Handle user space corruption gracefully futex_lock_pi() key refcnt fix softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume
2010-02-03futex: Handle futex value corruption gracefullyThomas Gleixner
The WARN_ON in lookup_pi_state which complains about a mismatch between pi_state->owner->pid and the pid which we retrieved from the user space futex is completely bogus. The code just emits the warning and then continues despite the fact that it detected an inconsistent state of the futex. A conveniant way for user space to spam the syslog. Replace the WARN_ON by a consistency check. If the values do not match return -EINVAL and let user space deal with the mess it created. This also fixes the missing task_pid_vnr() when we compare the pi_state->owner pid with the futex value. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>
2010-02-03futex: Handle user space corruption gracefullyThomas Gleixner
If the owner of a PI futex dies we fix up the pi_state and set pi_state->owner to NULL. When a malicious or just sloppy programmed user space application sets the futex value to 0 e.g. by calling pthread_mutex_init(), then the futex can be acquired again. A new waiter manages to enqueue itself on the pi_state w/o damage, but on unlock the kernel dereferences pi_state->owner and oopses. Prevent this by checking pi_state->owner in the unlock path. If pi_state->owner is not current we know that user space manipulated the futex value. Ignore the mess and return -EINVAL. This catches the above case and also the case where a task hijacks the futex by setting the tid value and then tries to unlock it. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>
2010-02-03futex_lock_pi() key refcnt fixMikael Pettersson
This fixes a futex key reference count bug in futex_lock_pi(), where a key's reference count is incremented twice but decremented only once, causing the backing object to not be released. If the futex is created in a temporary file in an ext3 file system, this bug causes the file's inode to become an "undead" orphan, which causes an oops from a BUG_ON() in ext3_put_super() when the file system is unmounted. glibc's test suite is known to trigger this, see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>. The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's 38d47c1b7075bd7ec3881141bb3629da58f88dab "[PATCH] futex: rely on get_user_pages() for shared futexes". That commit made get_futex_key() also increment the reference count of the futex key, and updated its callers to decrement the key's reference count before returning. Unfortunately the normal exit path in futex_lock_pi() wasn't corrected: the reference count is incremented by get_futex_key() and queue_lock(), but the normal exit path only decrements once, via unqueue_me_pi(). The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31 this is easily done by 'goto out_put_key' rather than 'goto out'. Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org>
2010-02-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: kernel/cred.c: use kmem_cache_free
2010-02-02cgroups: fix to return errno in a failure pathLi Zefan
In cgroup_create(), if alloc_css_id() returns failure, the errno is not propagated to userspace, so mkdir will fail silently. To trigger this bug, we mount blkio (or memory subsystem), and create more then 65534 cgroups. (The number of cgroups is limited to 65535 if a subsystem has use_id == 1) # mount -t cgroup -o blkio xxx /mnt # for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done # mkdir /mnt/65534 (should return ENOSPC) # Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02kfifo: fix kernel-doc notationRandy Dunlap
Fix kfifo kernel-doc warnings: Warning(kernel/kfifo.c:361): No description found for parameter 'total' Warning(kernel/kfifo.c:402): bad line: @ @lenout: pointer to output variable with copied data Warning(kernel/kfifo.c:412): No description found for parameter 'lenout' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-03kernel/cred.c: use kmem_cache_freeJulia Lawall
Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather than kfree. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x,E,c; @@ x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...) ... when != x = E when != &x ?-kfree(x) +kmem_cache_free(c,x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Steve Dickson <steved@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>
2010-02-01Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Fix check_usage_backwards() error message
2010-02-01Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails. perf: Ignore perf.data.old perf report: Fix segmentation fault when running with '-g none'
2010-02-01Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Correct printk whitespace in warning from cpu down task check sched: Fix incorrect sanity check sched: Fix fork vs hotplug vs cpuset namespaces
2010-02-01Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Prevent potential kgdb dead lock
2010-02-01softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resumeJason Wessel
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets the time from hardware such as the TSC on x86. In this configuration kgdb will report a softlock warning message on resuming or detaching from a debug session. Sequence of events in the problem case: 1) "cpu sched clock" and "hardware time" are at 100 sec prior to a call to kgdb_handle_exception() 2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit the following is called ... touch_softlockup_watchdog() --> __raw_get_cpu_var(touch_timestamp) = 0; 3) "cpu sched clock" = 100s (it was not updated, because the interrupt was disabled in kgdb) but the "hardware time" = 180 sec 4) The first timer interrupt after resuming from kgdb_handle_exception updates the watchdog from the "cpu sched clock" update_process_times() { ... run_local_timers() --> softlockup_tick() --> check (touch_timestamp == 0) (it is "YES" here, we have set "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the "touch_timestamp" will still be set to 100s.) ... scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched clock" to "hardware time" = 180s) ... } 5) The Second timer interrupt handler appears to have a large jump and trips the softlockup warning. update_process_times() { ... run_local_timers() --> softlockup_tick() --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk "soft lockup error messages" ... } note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" Why is "touch_timestamp" 100 sec, instead of 180 sec? When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of get_timestamp() is: get_timestamp(this_cpu) -->cpu_clock(this_cpu) -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, now) The __update_sched_clock() function uses the GTOD tick value to create a window to normalize the "now" values. So if "now" value is too big for sched_clock_data, it will be ignored. The fix is to invoke sched_clock_tick() to update "cpu sched clock" in order to recover from this state. This is done by introducing the function touch_softlockup_watchdog_sync(). This allows kgdb to request that the sched clock is updated when the watchdog thread runs the first time after a resume from kgdb. [yong.zhang0@gmail.com: Use per cpu instead of an array] Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: peterz@infradead.org LKML-Reference: <1264631124-4837-2-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-30perf, hw_breakpoint, kgdb: Do not take mutex for kernel debuggerJason Wessel
This patch fixes the regression in functionality where the kernel debugger and the perf API do not nicely share hw breakpoint reservations. The kernel debugger cannot use any mutex_lock() calls because it can start the kernel running from an invalid context. A mutex free version of the reservation API needed to get created for the kernel debugger to safely update hw breakpoint reservations. The possibility for a breakpoint reservation to be concurrently processed at the time that kgdb interrupts the system is improbable. Should this corner case occur the end user is warned, and the kernel debugger will prohibit updating the hardware breakpoint reservations. Any time the kernel debugger reserves a hardware breakpoint it will be a system wide reservation. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: torvalds@linux-foundation.org LKML-Reference: <1264719883-7285-3-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-30x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint APIJason Wessel
In the 2.6.33 kernel, the hw_breakpoint API is now used for the performance event counters. The hw_breakpoint_handler() now consumes the hw breakpoints that were previously set by kgdb arch specific code. In order for kgdb to work in conjunction with this core API change, kgdb must use some of the low level functions of the hw_breakpoint API to install, uninstall, and deal with hw breakpoint reservations. The kgdb core required a change to call kgdb_disable_hw_debug anytime a slave cpu enters kgdb_wait() in order to keep all the hw breakpoints in sync as well as to prevent hitting a hw breakpoint while kgdb is active. During the architecture specific initialization of kgdb, it will pre-allocate 4 disabled (struct perf event **) structures. Kgdb will use these to manage the capabilities for the 4 hw breakpoint registers, per cpu. Right now the hw_breakpoint API does not have a way to ask how many breakpoints are available, on each CPU so it is possible that the install of a breakpoint might fail when kgdb restores the system to the run state. The intent of this patch is to first get the basic functionality of hw breakpoints working and leave it to the person debugging the kernel to understand what hw breakpoints are in use and what restrictions have been imposed as a result. Breakpoint constraints will be dealt with in a future patch. While atomic, the x86 specific kgdb code will call arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint() to manage the cpu specific hw breakpoints. The net result of these changes allow kgdb to use the same pool of hw_breakpoints that are used by the perf event API, but neither knows about future reservations for the available hw breakpoint slots. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: torvalds@linux-foundation.org LKML-Reference: <1264719883-7285-2-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-28hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails.Mahesh Salgaonkar
On a given architecture, when hardware breakpoint registration fails due to un-supported access type (read/write/execute), we lose the bp slot since register_perf_hw_breakpoint() does not release the bp slot on failure. Hence, any subsequent hardware breakpoint registration starts failing with 'no space left on device' error. This patch introduces error handling in register_perf_hw_breakpoint() function and releases bp slot on error. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> LKML-Reference: <20100121125516.GA32521@in.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-28sched: Correct printk whitespace in warning from cpu down task checkFrans Pop
Due to an incorrect line break the output currently contains tabs. Also remove trailing space. The actual output that logcheck sent me looked like this: Task events/1 (pid = 10) is on cpu 1^I^I^I^I(state = 1, flags = 84208040) After this patch it becomes: Task events/1 (pid = 10) is on cpu 1 (state = 1, flags = 84208040) Signed-off-by: Frans Pop <elendilplanet.nl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <201001251456.34996.elendil@planet.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-28sched: Fix incorrect sanity checkPeter Zijlstra
We moved to migrate on wakeup, which means that sleeping tasks could still be present on offline cpus. Amend the check to only test running tasks. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-27lockdep: Fix check_usage_backwards() error messageOleg Nesterov
Lockdep has found the real bug, but the output doesn't look right to me: > ========================================================= > [ INFO: possible irq lock inversion dependency detected ] > 2.6.33-rc5 #77 > --------------------------------------------------------- > emacs/1609 just changed the state of lock: > (&(&tty->ctrl_lock)->rlock){+.....}, at: [<ffffffff8127c648>] tty_fasync+0xe8/0x190 > but this lock took another, HARDIRQ-unsafe lock in the past: > (&(&sighand->siglock)->rlock){-.....} "HARDIRQ-unsafe" and "this lock took another" looks wrong, afaics. > ... key at: [<ffffffff81c054a4>] __key.46539+0x0/0x8 > ... acquired at: > [<ffffffff81089af6>] __lock_acquire+0x1056/0x15a0 > [<ffffffff8108a0df>] lock_acquire+0x9f/0x120 > [<ffffffff81423012>] _raw_spin_lock_irqsave+0x52/0x90 > [<ffffffff8127c1be>] __proc_set_tty+0x3e/0x150 > [<ffffffff8127e01d>] tty_open+0x51d/0x5e0 The stack-trace shows that this lock (ctrl_lock) was taken under ->siglock (which is hopefully irq-safe). This is a clear typo in check_usage_backwards() where we tell the print a fancy routine we're forwards. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100126181641.GA10460@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-26tracing/documentation: Cover new frame pointer semanticsMike Frysinger
Update the graph tracer examples to cover the new frame pointer semantics (in terms of passing it along). Move the HAVE_FUNCTION_GRAPH_FP_TEST docs out of the Kconfig, into the right place, and expand on the details. Signed-off-by: Mike Frysinger <vapier@gentoo.org> LKML-Reference: <1264165967-18938-1-git-send-email-vapier@gentoo.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-26ring-buffer: Check for end of page in iteratorSteven Rostedt
If the iterator comes to an empty page for some reason, or if the page is emptied by a consuming read. The iterator code currently does not check if the iterator is pass the contents, and may return a false entry. This patch adds a check to the ring buffer iterator to test if the current page has been completely read and sets the iterator to the next page if necessary. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-26ring-buffer: Check if ring buffer iterator has stale dataSteven Rostedt
Usually reads of the ring buffer is performed by a single task. There are two types of reads from the ring buffer. One is a consuming read which will consume the entry that was read and the next read will be the entry that follows. The other is an iterator that will let the user read the contents of the ring buffer without modifying it. When an iterator is allocated, writes to the ring buffer are disabled to protect the iterator. The problem exists when consuming reads happen while an iterator is allocated. Specifically, the kind of read that swaps out an entire page (used by splice) and replaces it with a new read. If the iterator is on the page that is swapped out, then the next read may read from this swapped out page and return garbage. This patch adds a check when reading the iterator to make sure that the iterator contents are still valid. If a consuming read has taken place, the iterator is reset. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-26clocksource: Prevent potential kgdb dead lockThomas Gleixner
commit 0f8e8ef7 (clocksource: Simplify clocksource watchdog resume logic) introduced a potential kgdb dead lock. When the kernel is stopped by kgdb inside code which holds watchdog_lock then kgdb dead locks in clocksource_resume_watchdog(). clocksource_resume_watchdog() is called from kbdg via clocksource_touch_watchdog() to avoid that the clock source watchdog marks TSC unstable after the kernel has been stopped. Solve this by replacing spin_lock with a spin_trylock and just return in case the lock is held. Not resetting the watchdog might result in TSC becoming marked unstable, but that's an acceptable penalty for using kgdb. The timekeeping is anyway easily screwed up by kgdb when the system uses either jiffies or a clock source which wraps in short intervals (e.g. pm_timer wraps about every 4.6s), so we really do not have to worry about that occasional TSC marked unstable side effect. The second caller of clocksource_resume_watchdog() is clocksource_resume(). The trylock is safe here as well because the system is UP at this point, interrupts are disabled and nothing else can hold watchdog_lock(). Reported-by: Jason Wessel <jason.wessel@windriver.com> LKML-Reference: <1264480000-6997-4-git-send-email-jason.wessel@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-01-25tracing: Prevent kernel oops with corrupted bufferSteven Rostedt
If the contents of the ftrace ring buffer gets corrupted and the trace file is read, it could create a kernel oops (usualy just killing the user task thread). This is caused by the checking of the pid in the buffer. If the pid is negative, it still references the cmdline cache array, which could point to an invalid address. The simple fix is to test for negative PIDs. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-24Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevent: Don't remove broadcast device when cpu is dead
2010-01-24Merge git://git.infradead.org/~dwmw2/mtd-2.6.33Linus Torvalds
* git://git.infradead.org/~dwmw2/mtd-2.6.33: mtd: tests: fix read, speed and stress tests on NOR flash mtd: Really add ARM pismo support kmsg_dump: Dump on crash_kexec as well
2010-01-21sched: Fix fork vs hotplug vs cpuset namespacesPeter Zijlstra
There are a number of issues: 1) TASK_WAKING vs cgroup_clone (cpusets) copy_process(): sched_fork() child->state = TASK_WAKING; /* waiting for wake_up_new_task() */ if (current->nsproxy != p->nsproxy) ns_cgroup_clone() cgroup_clone() mutex_lock(inode->i_mutex) mutex_lock(cgroup_mutex) cgroup_attach_task() ss->can_attach() ss->attach() [ -> cpuset_attach() ] cpuset_attach_task() set_cpus_allowed_ptr(); while (child->state == TASK_WAKING) cpu_relax(); will deadlock the system. 2) cgroup_clone (cpusets) vs copy_process So even if the above would work we still have: copy_process(): if (current->nsproxy != p->nsproxy) ns_cgroup_clone() cgroup_clone() mutex_lock(inode->i_mutex) mutex_lock(cgroup_mutex) cgroup_attach_task() ss->can_attach() ss->attach() [ -> cpuset_attach() ] cpuset_attach_task() set_cpus_allowed_ptr(); ... p->cpus_allowed = current->cpus_allowed over-writing the modified cpus_allowed. 3) fork() vs hotplug if we unplug the child's cpu after the sanity check when the child gets attached to the task_list but before wake_up_new_task() shit will meet with fan. Solve all these issues by moving fork cpu selection into wake_up_new_task(). Reported-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1264106190.4283.1314.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-01-21Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: x86: Add support for the ANY bit perf: Change the is_software_event() definition perf: Honour event state for aux stream data perf: Fix perf_event_do_pending() fallback callsite perf kmem: Print usage help for unknown commands perf kmem: Increase "Hit" column length hw-breakpoints, perf: Fix broken mmiotrace due to dr6 by reference change perf timechart: Use tid not pid for COMM change
2010-01-21perf: Honour event state for aux stream dataPeter Zijlstra
Anton reported that perf record kept receiving events even after calling ioctl(PERF_EVENT_IOC_DISABLE). It turns out that FORK,COMM and MMAP events didn't respect the disabled state and kept flowing in. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Anton Blanchard <anton@samba.org> LKML-Reference: <1263459187.4244.265.camel@laptop> CC: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21perf: Fix perf_event_do_pending() fallback callsitePeter Zijlstra
Paul questioned the context in which we should call perf_event_do_pending(). After looking at that I found that it should be called from IRQ context these days, however the fallback call-site is placed in softirq context. Ammend this by placing the callback in the IRQ timer path. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263374859.4244.192.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21sched: Reassign prev and switch_count when reacquire_kernel_lock() failYong Zhang
Assume A->B schedule is processing, if B have acquired BKL before and it need reschedule this time. Then on B's context, it will go to need_resched_nonpreemptible for reschedule. But at this time, prev and switch_count are related to A. It's wrong and will lead to incorrect scheduler statistics. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001102238w7b0ddcadref00d345e2181d11@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21sched: Fix vmark regression on big machinesMike Galbraith
SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled, leading to many cache misses on large machines as we traverse looking for an idle shared cache to wake to. Change the enabler of select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level. Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1262612696.15495.15.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-18clockevent: Don't remove broadcast device when cpu is deadXiaotian Feng
Marc reported that the BUG_ON in clockevents_notify() triggers on his system. This happens because the kernel tries to remove an active clock event device (used for broadcasting) from the device list. The handling of devices which can be used as per cpu device and as a global broadcast device is suboptimal. The simplest solution for now (and for stable) is to check whether the device is used as global broadcast device, but this needs to be revisited. [ tglx: restored the cpuweight check and massaged the changelog ] Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
2010-01-16Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futexes: Remove rw parameter from get_futex_key()
2010-01-16Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: Add comment for match callbacks tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching lib: Introduce strnstr() tracing/filters: Fix MATCH_END_ONLY filter matching tracing/filters: Fix MATCH_FRONT_ONLY filter matching ftrace: Fix MATCH_END_ONLY function filter tracing/x86: Derive arch from bits argument in recordmcount.pl ring-buffer: Add rb_list_head() wrapper around new reader page next field ring-buffer: Wrap a list.next reference with rb_list_head()
2010-01-16smp_call_function_any(): pass the node value to cpumask_of_node()David John
The change in acpi_cpufreq to use smp_call_function_any causes a warning when it is called since the function erroneously passes the cpu id to cpumask_of_node rather than the node that the cpu is on. Fix this. cpumask_of_node(3): node > nr_node_ids(1) Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f189 #223 Call Trace: [<ffffffff81028bb3>] cpumask_of_node+0x23/0x58 [<ffffffff81061f51>] smp_call_function_any+0x65/0xfa [<ffffffff810160d1>] ? do_drv_read+0x0/0x2f [<ffffffff81015fba>] get_cur_val+0xb0/0x102 [<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5 [<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515 [<ffffffff81562ce9>] ? __down_write+0xb/0xd [<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922 Signed-off-by: David John <davidjon@xenontk.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16kfifo: document everywhere that size has to be power of twoAndi Kleen
On my first try using them I missed that the fifos need to be power of two, resulting in a runtime bug. Document that requirement everywhere (and fix one grammar bug) Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16kfifo: add kfifo_out_peekAndi Kleen
In some upcoming code it's useful to peek into a FIFO without permanentely removing data. This patch implements a new kfifo_out_peek() to do this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16kfifo: sanitize *_user error handlingAndi Kleen
Right now for kfifo_*_user it's not easily possible to distingush between a user copy failing and the FIFO not containing enough data. The problem is that both conditions are multiplexed into the same return code. Avoid this by moving the "copy length" into a separate output parameter and only return 0/-EFAULT in the main return value. I didn't fully adapt the weird "record" variants, those seem to be unused anyways and were rather messy (should they be just removed?) I would appreciate some double checking if I did all the conversions correctly. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16kfifo: use void * pointers for user buffersAndi Kleen
The pointers to user buffers are currently unsigned char *, which requires a lot of casting in the caller for any non-char typed buffers. Use void * instead. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-14tracing/filters: Add comment for match callbacksLi Zefan
We should be clear on 2 things: - the length parameter of a match callback includes tailing '\0'. - the string to be searched might not be NULL-terminated. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8770.7000608@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14tracing/filters: Fix MATCH_FULL filter matching for PTR_STRINGLi Zefan
MATCH_FULL matching for PTR_STRING is not working correctly: # echo 'func == vt' > events/bkl/lock_kernel/filter # echo 1 > events/bkl/lock_kernel/enable ... # cat trace Xorg-1484 [000] 1973.392586: lock_kernel: ... func=vt_ioctl() gpm-1402 [001] 1974.027740: lock_kernel: ... func=vt_ioctl() We should pass to regex.match(..., len) the length (including '\0') of the source string instead of the length of the pattern string. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8763.5070707@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14tracing/filters: Fix MATCH_MIDDLE_ONLY filter matchingLi Zefan
The @str might not be NULL-terminated if it's of type DYN_STRING or STATIC_STRING, so we should use strnstr() instead of strstr(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8753.2000102@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14tracing/filters: Fix MATCH_END_ONLY filter matchingLi Zefan
For '*foo' pattern, we should allow any string ending with 'foo', but event filtering incorrectly disallows strings like bar_foo_foo: Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8735.6070604@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14tracing/filters: Fix MATCH_FRONT_ONLY filter matchingLi Zefan
MATCH_FRONT_ONLY actually is a full matching: # ./perf record -R -f -a -e lock:lock_acquire \ --filter 'name ~rcu_*' sleep 1 # ./perf trace (no output) We should pass the length of the pattern string to strncmp(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8721.5090301@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14ftrace: Fix MATCH_END_ONLY function filterLi Zefan
For '*foo' pattern, we should allow any string ending with 'foo', but ftrace filter incorrectly disallows strings like bar_foo_foo: # echo '*io' > set_ftrace_filter # cat set_ftrace_filter | grep 'req_bio_endio' # cat available_filter_functions | grep 'req_bio_endio' req_bio_endio Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E870E.6060607@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-13futexes: Remove rw parameter from get_futex_key()KOSAKI Motohiro
Currently, futexes have two problem: A) The current futex code doesn't handle private file mappings properly. get_futex_key() uses PageAnon() to distinguish file and anon, which can cause the following bad scenario: 1) thread-A call futex(private-mapping, FUTEX_WAIT), it sleeps on file mapping object. 2) thread-B writes a variable and it makes it cow. 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it wakes up blocked thread on the anonymous page. (but it's nothing) B) Current futex code doesn't handle zero page properly. Read mode get_user_pages() can return zero page, but current futex code doesn't handle it at all. Then, zero page makes infinite loop internally. The solution is to use write mode get_user_page() always for page lookup. It prevents the lookup of both file page of private mappings and zero page. Performance concerns: Probaly very little, because glibc always initialize variables for futex before to call futex(). It means glibc users never see the overhead of this patch. Compatibility concerns: This patch has few compatibility issues. After this patch, FUTEX_WAIT require writable access to futex variables (read-only mappings makes EFAULT). But practically it's not a problem, glibc always initalizes variables for futexes explicitly - nobody uses read-only mappings. Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Ulrich Drepper <drepper@gmail.com> LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-11kernel/signal.c: fix kernel information leak with print-fatal-signals=1Andi Kleen
When print-fatal-signals is enabled it's possible to dump any memory reachable by the kernel to the log by simply jumping to that address from user space. Or crash the system if there's some hardware with read side effects. The fatal signals handler will dump 16 bytes at the execution address, which is fully controlled by ring 3. In addition when something jumps to a unmapped address there will be up to 16 additional useless page faults, which might be potentially slow (and at least is not very efficient) Fortunately this option is off by default and only there on i386. But fix it by checking for kernel addresses and also stopping when there's a page fault. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>