summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2010-07-19padata: Added sysfs primitives to padata subsystemDan Kruchinin
Added sysfs primitives to padata subsystem. Now API user may embedded kobject each padata instance contains into any sysfs hierarchy. For now padata sysfs interface provides only two objects: serial_cpumask [RW] - cpumask for serial workers parallel_cpumask [RW] - cpumask for parallel workers Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-19padata: Make two separate cpumasksDan Kruchinin
The aim of this patch is to make two separate cpumasks for padata parallel and serial workers respectively. It allows user to make more thin and sophisticated configurations of padata framework. For example user may bind parallel and serial workers to non-intersecting CPU groups to gain better performance. Also each padata instance has notifiers chain for its cpumasks now. If either parallel or serial or both masks were changed all interested subsystems will get notification about that. It's especially useful if padata user uses algorithm for callback CPU selection according to serial cpumask. Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-19PM / Suspend: Fix ordering of calls in suspend error pathsRafael J. Wysocki
The ACPI suspend code calls suspend_nvs_free() at a wrong place, which may lead to a memory leak if there's an error executing acpi_pm_prepare(), because acpi_pm_finish() will not be called in that case. However, the root cause of this problem is the apparently confusing ordering of calls in suspend error paths that needs to be fixed. In addition to that, fix a typo in a label name in suspend.c. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>
2010-07-19PM / Hibernate: Fix snapshot error code pathRafael J. Wysocki
There is an inconsistency between hibernation_platform_enter() and hibernation_snapshot(), because the latter calls hibernation_ops->end() after failing hibernation_ops->begin(), while the former doesn't do that. Make hibernation_snapshot() behave in the same way as hibernation_platform_enter() in that respect. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>
2010-07-19PM / Hibernate: Fix hibernation_platform_enter()Rafael J. Wysocki
The hibernation_platform_enter() function calls dpm_suspend_noirq() instead of dpm_resume_noirq() by mistake. Fix this. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>
2010-07-19pm_qos: Get rid of the allocation in pm_qos_add_request()James Bottomley
All current users of pm_qos_add_request() have the ability to supply the memory required by the pm_qos routines, so make them do this and eliminate the kmalloc() with pm_qos_add_request(). This has the double benefit of making the call never fail and allowing it to be called from atomic context. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-07-19pm_qos: Reimplement using plistsJames Bottomley
A lot of the pm_qos extremal value handling is really duplicating what a priority ordered list does, just in a less efficient fashion. Simply redoing the implementation in terms of a plist gets rid of a lot of this junk (although there are several other strange things that could do with tidying up, like pm_qos_request_list has to carry the pm_qos_class with every node, simply because it doesn't get passed in to pm_qos_update_request even though every caller knows full well what parameter it's updating). I think this redo is a win independent of android, so we should do something like this now. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-07-19PM: Make it possible to avoid races between wakeup and system sleepRafael J. Wysocki
One of the arguments during the suspend blockers discussion was that the mainline kernel didn't contain any mechanisms making it possible to avoid races between wakeup and system suspend. Generally, there are two problems in that area. First, if a wakeup event occurs exactly when /sys/power/state is being written to, it may be delivered to user space right before the freezer kicks in, so the user space consumer of the event may not be able to process it before the system is suspended. Second, if a wakeup event occurs after user space has been frozen, it is not generally guaranteed that the ongoing transition of the system into a sleep state will be aborted. To address these issues introduce a new global sysfs attribute, /sys/power/wakeup_count, associated with a running counter of wakeup events and three helper functions, pm_stay_awake(), pm_relax(), and pm_wakeup_event(), that may be used by kernel subsystems to control the behavior of this attribute and to request the PM core to abort system transitions into a sleep state already in progress. The /sys/power/wakeup_count file may be read from or written to by user space. Reads will always succeed (unless interrupted by a signal) and return the current value of the wakeup events counter. Writes, however, will only succeed if the written number is equal to the current value of the wakeup events counter. If a write is successful, it will cause the kernel to save the current value of the wakeup events counter and to abort the subsequent system transition into a sleep state if any wakeup events are reported after the write has returned. [The assumption is that before writing to /sys/power/state user space will first read from /sys/power/wakeup_count. Next, user space consumers of wakeup events will have a chance to acknowledge or veto the upcoming system transition to a sleep state. Finally, if the transition is allowed to proceed, /sys/power/wakeup_count will be written to and if that succeeds, /sys/power/state will be written to as well. Still, if any wakeup events are reported to the PM core by kernel subsystems after that point, the transition will be aborted.] Additionally, put a wakeup events counter into struct dev_pm_info and make these per-device wakeup event counters available via sysfs, so that it's possible to check the activity of various wakeup event sources within the kernel. To illustrate how subsystems can use pm_wakeup_event(), make the low-level PCI runtime PM wakeup-handling code use it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
2010-07-19PM / Hibernate: Fix typos in comments in kernel/power/swap.cCesar Eduardo Barros
There are a few typos in kernel/power/swap.c. Fix them. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-07-17sched: No need for bootmem special casesPekka Enberg
As of commit dcce284 ("mm: Extend gfp masking to the page allocator") and commit 7e85ee0 ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-17sched: Revert nohz_ratelimit() for nowPeter Zijlstra
Norbert reported that nohz_ratelimit() causes his laptop to burn about 4W (40%) extra. For now back out the change and see if we can adjust the power management code to make better decisions. Reported-by: Norbert Preining <preining@logic.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-17sched: Reduce update_group_power() callsPeter Zijlstra
Currently we update cpu_power() too often, update_group_power() only updates the local group's cpu_power but it gets called for all groups. Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus calling it, even though a slow update of cpu_power is sufficient. Therefore move the update under 'idle != CPU_NEWLY_IDLE && local_group' to reduce superfluous invocations. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1278612989.1900.176.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-17sched: Update rq->clock for nohz balanced cpusSuresh Siddha
Suresh spotted that we don't update the rq->clock in the nohz load-balancer path. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-16rlimits: implement prlimit64 syscallJiri Slaby
This patch adds the code to support the sys_prlimit64 syscall which modifies-and-returns the rlim values of a selected process atomically. The first parameter, pid, being 0 means current process. Unlike the current implementation, it is a generic interface, architecture indepentent so that we needn't handle compat stuff anymore. In the future, after glibc start to use this we can deprecate sys_setrlimit and sys_getrlimit in favor to clean up the code finally. It also adds a possibility of changing limits of other processes. We check the user's permissions to do that and if it succeeds, the new limits are propagated online. This is good for large scale applications such as SAP or databases where administrators need to change limits time by time (e.g. on crashes increase core size). And it is unacceptable to restart the service. For safety, all rlim users now either use accessors or doesn't need them due to - locking - the fact a process was just forked and nobody else knows about it yet (and nobody can't thus read/write limits) hence it is safe to modify limits now. The limitation is that we currently stay at ulong internal representation. So the rlim64_is_infinity check is used where value is compared against ULONG_MAX on 32-bit which is the maximum value there. And since internally the limits are held in struct rlimit, converters which are used before and after do_prlimit call in sys_prlimit64 are introduced. Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16rlimits: switch more rlimit syscalls to do_prlimitJiri Slaby
After we added more generic do_prlimit, switch sys_getrlimit to that. Also switch compat handling, so we can get rid of ugly __user casts and avoid setting process' address limit to kernel data and back. Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16rlimits: redo do_setrlimit to more generic do_prlimitJiri Slaby
It now allows also reading of limits. I.e. all read and writes will later use this function. It takes two parameters, new and old limits which can be both NULL. If new is non-NULL, the value in it is set to rlimits. If old is non-NULL, current rlimits are stored there. If both are non-NULL, old are stored prior to setting the new ones, atomically. (Similar to sigaction.) Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16rlimits: do security check under task_lockJiri Slaby
Do security_task_setrlimit under task_lock. Other tasks may change limits under our hands while we are checking limits inside the function. From now on, they can't. Note that all the security work is done under a spinlock here now. Security hooks count with that, they are called from interrupt context (like security_task_kill) and with spinlocks already held (e.g. capable->security_capable). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: James Morris <jmorris@namei.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2010-07-16rlimits: allow setrlimit to non-current tasksJiri Slaby
Add locking to allow setrlimit accept task parameter other than current. Namely, lock tasklist_lock for read and check whether the task structure has sighand non-null. Do all the signal processing under that lock still held. There are some points: 1) security_task_setrlimit is now called with that lock held. This is not new, many security_* functions are called with this lock held already so it doesn't harm (all this security_* stuff does almost the same). 2) task->sighand->siglock (in update_rlimit_cpu) is nested in tasklist_lock. This dependence is already existing. 3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already existing dependence. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com>
2010-07-16rlimits: split sys_setrlimitJiri Slaby
Create do_setrlimit from sys_setrlimit and declare do_setrlimit in the resource header. This is the first phase to have generic do_prlimit which allows to be called from read, write and compat rlimits code. The new do_setrlimit also accepts a task pointer to change the limits of. Currently, it cannot be other than current, but this will change with locking later. Also pass tsk->group_leader to security_task_setrlimit to check whether current is allowed to change rlimits of the process and not its arbitrary thread because it makes more sense given that rlimit are per process and not per-thread. Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16rlimits: make sure ->rlim_max never grows in sys_setrlimitOleg Nesterov
Mostly preparation for Jiri's changes, but probably makes sense anyway. sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when it takes task_lock() old_rlim->rlim_max can be already lowered. Move this check under task_lock(). Currently this is not important, we can only race with our sub-thread, this means the application is stupid. But when we change the code to allow the update of !current task's limits, it becomes important to make sure ->rlim_max can be lowered "reliably" even if we race with the application doing sys_setrlimit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16rlimits: add task_struct to update_rlimit_cpuJiri Slaby
Add task_struct as a parameter to update_rlimit_cpu to be able to set rlimit_cpu of different task than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: James Morris <jmorris@namei.org>
2010-07-16rlimits: security, add task_struct to setrlimitJiri Slaby
Add task_struct to task_setrlimit of security_operations to be able to set rlimit of task other than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Acked-by: James Morris <jmorris@namei.org>
2010-07-15tracing: Remove ksym tracerFrederic Weisbecker
The ksym (breakpoint) ftrace plugin has been superseded by perf tools that are much more poweful to use the cpu breakpoints. This tracer doesn't bring more feature. It has been deprecated for a while now, lets remove it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-07-14padata: simplify serialization mechanismSteffen Klassert
We count the number of processed objects on a percpu basis, so we need to go through all the percpu reorder queues to calculate the sequence number of the next object that needs serialization. This patch changes this to count the number of processed objects global. So we can calculate the sequence number and the percpu reorder queue of the next object that needs serialization without searching through the percpu reorder queues. This avoids some accesses to memory of foreign cpus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-14padata: make padata_do_parallel to return zero on successSteffen Klassert
To return -EINPROGRESS on success in padata_do_parallel was considered to be odd. This patch changes this to return zero on success. Also the only user of padata, pcrypt is adapted to convert a return of zero to -EINPROGRESS within the crypto layer. This also removes the pcrypt fallback if padata_do_parallel was called on a not running padata instance as we can't handle it anymore. This fallback was unused, so it's save to remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-14padata: Handle empty padata cpumasksSteffen Klassert
This patch fixes a bug when the padata cpumask does not intersect with the active cpumask. In this case we get a division by zero in padata_alloc_pd and we end up with a useless padata instance. Padata can end up with an empty cpumask for two reasons: 1. A user removed the last cpu that belongs to the padata cpumask and the active cpumask. 2. The last cpu that belongs to the padata cpumask and the active cpumask goes offline. We introduce a function padata_validate_cpumask to check if the padata cpumask does intersect with the active cpumask. If the cpumasks do not intersect we mark the instance as invalid, so it can't be used. We do not allocate the cpumask dependend recources in this case. This fixes the division by zero and keeps the padate instance in a consistent state. It's not possible to trigger this bug by now because the only padata user, pcrypt uses always the possible cpumask. Reported-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-14padata: Block until the instance is unused on stopSteffen Klassert
This patch makes padata_stop to block until the padata instance is unused. Also we split padata_stop to a locked and a unlocked version. This is in preparation to be able to change the cpumask after a call to patata stop. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-14padata: Check for valid padata instance on startSteffen Klassert
This patch introduces the PADATA_INVALID flag which is checked on padata start. This will be used to mark a padata instance as invalid, if the padata cpumask does not intersect with the active cpumask. we change padata_start to return an error if the PADATA_INVALID is set. Also we adapt the only padata user, pcrypt to this change. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-07-14workqueue: fix locking in retry path of maybe_create_worker()Tejun Heo
maybe_create_worker() mismanaged locking when worker creation fails and it has to retry. Fix locking and simplify lock manipulation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yong Zhang <yong.zhang@windriver.com>
2010-07-14async: use workqueue for worker poolTejun Heo
Replace private worker pool with system_unbound_wq. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Arjan van de Ven <arjan@infradead.org>
2010-07-11fix comment/printk typos concerning "already"Uwe Kleine-König
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-07-09init: Remove the BKL from startup codeArnd Bergmann
I have shown by code review that no driver takes the BKL at init time any more, so whatever the init code was locking against is no longer there and it is now safe to remove the BKL there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt <rostedt@goodmis> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-07-09Merge commit 'paulus-perf/master' into nextBenjamin Herrenschmidt
2010-07-07kernel/watchdog: Initialize 'result'Kulikov Vasiliy
Variable on the stack is not initialized to zero, do it explicitly. This bug was found by a compiler warning: kernel/watchdog.c:463: warning: 'result' may be used uninitialized in this function Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1278316854-28442-1-git-send-email-segooon@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-06Merge branch 'perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core
2010-07-05tracing/kprobes: Support "string" typeMasami Hiramatsu
Support string type tracing and printing in kprobe-tracer. This allows user to trace string data in kernel including __user data. Note that sometimes __user data may not be accessed if it is paged-out (sorry, but kprobes operation should be done in atomic, we can not wait for page-in). Commiter note: Fixed up conflicts with b7e2ece. Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2010-07-05Merge commit 'v2.6.35-rc4' into perf/coreIngo Molnar
Merge reason: Pick up the latest perf fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-04module: initialize module dynamic debug laterYehuda Sadeh
We should initialize the module dynamic debug datastructures only after determining that the module is not loaded yet. This fixes a bug that introduced in 2.6.35-rc2, where when a trying to load a module twice, we also load it's dynamic printing data twice which causes all sorts of nasty issues. Also handle the dynamic debug cleanup later on failure. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed a #ifdef) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-02Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Cure nr_iowait_cpu() users init: Fix comment init, sched: Fix race between init and kthreadd
2010-07-02workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND insteadTejun Heo
WQ_SINGLE_CPU combined with @max_active of 1 is used to achieve full ordering among works queued to a workqueue. The same can be achieved using WQ_UNBOUND as unbound workqueues always use the gcwq for WORK_CPU_UNBOUND. As @max_active is always one and benefits from cpu locality isn't accessible anyway, serving them with unbound workqueues should be fine. Drop WQ_SINGLE_CPU support and use WQ_UNBOUND instead. Note that most single thread workqueue users will be converted to use multithread or non-reentrant instead and only the ones which require strict ordering will keep using WQ_UNBOUND + @max_active of 1. Signed-off-by: Tejun Heo <tj@kernel.org>
2010-07-02workqueue: implement unbound workqueueTejun Heo
This patch implements unbound workqueue which can be specified with WQ_UNBOUND flag on creation. An unbound workqueue has the following properties. * It uses a dedicated gcwq with a pseudo CPU number WORK_CPU_UNBOUND. This gcwq is always online and disassociated. * Workers are not bound to any CPU and not concurrency managed. Works are dispatched to workers as soon as possible and the only applied limitation is @max_active. IOW, all unbound workqeueues are implicitly high priority. Unbound workqueues can be used as simple execution context provider. Contexts unbound to any cpu are served as soon as possible. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Howells <dhowells@redhat.com>
2010-07-02workqueue: prepare for WQ_UNBOUND implementationTejun Heo
In preparation of WQ_UNBOUND addition, make the following changes. * Add WORK_CPU_* constants for pseudo cpu id numbers used (currently only WORK_CPU_NONE) and use them instead of NR_CPUS. This is to allow another pseudo cpu id for unbound cpu. * Reorder WQ_* flags. * Make workqueue_struct->cpu_wq a union which contains a percpu pointer, regular pointer and an unsigned long value and use kzalloc/kfree() in UP allocation path. This will be used to implement unbound workqueues which will use only one cwq on SMPs. * Move alloc_cwqs() allocation after initialization of wq fields, so that alloc_cwqs() has access to wq->flags. * Trivial relocation of wq local variables in freeze functions. These changes don't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org>
2010-07-02workqueue: fix worker management invocation without pending worksTejun Heo
When there's no pending work to do, worker_thread() goes back to sleep after waking up without checking whether worker management is necessary. This means that idle worker exit requests can be ignored if the gcwq stays empty. Fix it by making worker_thread() always check whether worker management is necessary before going to sleep. Signed-off-by: Tejun Heo <tj@kernel.org>
2010-07-02workqueue: fix incorrect cpu number BUG_ON() in get_work_gcwq()Tejun Heo
get_work_gcwq() was incorrectly triggering BUG_ON() if cpu number is equal to or higher than num_possible_cpus() instead of nr_cpu_ids. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org>
2010-07-02workqueue: fix race condition in flush_workqueue()Tejun Heo
When one flusher is cascading to the next flusher, it first sets wq->first_flusher to the next one and sets up the next flush cycle. If there's nothing to do for the next cycle, it clears wq->flush_flusher and proceeds to the one after that. If the woken up flusher checks wq->first_flusher before it gets cleared, it will incorrectly assume the role of the first flusher, which triggers BUG_ON() sanity check. Fix it by checking wq->first_flusher again after grabbing the mutex. Signed-off-by: Tejun Heo <tj@kernel.org>
2010-07-02workqueue: use worker_set/clr_flags() only from worker itselfTejun Heo
worker_set/clr_flags() assume that if none of NOT_RUNNING flags is set the worker must be contributing to nr_running which is only true if the worker is actually running. As when called from self, it is guaranteed that the worker is running, those functions can be safely used from the worker itself and they aren't necessary from other places anyway. Make the following changes to fix the bug. * Make worker_set/clr_flags() whine if not called from self. * Convert all places which called those functions from other tasks to manipulate flags directly. * Make trustee_thread() directly clear nr_running after setting WORKER_ROGUE on all workers. This is the only place where nr_running manipulation is necessary outside of workers themselves. * While at it, add sanity check for nr_running in worker_enter_idle(). Signed-off-by: Tejun Heo <tj@kernel.org>
2010-07-01sched: Cure nr_iowait_cpu() usersPeter Zijlstra
Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us()) broke things by not making sure preemption was indeed disabled by the callers of nr_iowait_cpu() which took the iowait value of the current cpu. This resulted in a heap of preempt warnings. Cure this by making nr_iowait_cpu() take a cpu number and fix up the callers to pass in the right number. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: linux-pm@lists.linux-foundation.org LKML-Reference: <1277968037.1868.120.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-01Merge branch 'linus' into core/rcuIngo Molnar
Conflicts: fs/fs-writeback.c Merge reason: Resolve the conflict Note, i picked the version from Linus's tree, which effectively reverts the fs-writeback.c bits of: b97181f: fs: remove all rcu head initializations, except on_stack initializations As the upstream changes to this file changed this code heavily and the first attempt to resolve the conflict resulted in a non-booting kernel. It's safer to re-try this portion of the commit cleanly. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-06-30futex: futex_find_get_task remove credentails checkMichal Hocko
futex_find_get_task is currently used (through lookup_pi_state) from two contexts, futex_requeue and futex_lock_pi_atomic. None of the paths looks it needs the credentials check, though. Different (e)uids shouldn't matter at all because the only thing that is important for shared futex is the accessibility of the shared memory. The credentail check results in glibc assert failure or process hang (if glibc is compiled without assert support) for shared robust pthread mutex with priority inheritance if a process tries to lock already held lock owned by a process with a different euid: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed. The problem is that futex_lock_pi_atomic which is called when we try to lock already held lock checks the current holder (tid is stored in the futex value) to get the PI state. It uses lookup_pi_state which in turn gets task struct from futex_find_get_task. ESRCH is returned either when the task is not found or if credentials check fails. futex_lock_pi_atomic simply returns if it gets ESRCH. glibc code, however, doesn't expect that robust lock returns with ESRCH because it should get either success or owner died. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29kexec: fix Oops in crash_shrink_memory()Pavan Naregundi
When crashkernel is not enabled, "echo 0 > /sys/kernel/kexec_crash_size" OOPSes the kernel in crash_shrink_memory. This happens when crash_shrink_memory tries to release the 'crashk_res' resource which are not reserved. Also value of "/sys/kernel/kexec_crash_size" shows as 1, which should be 0. This patch fixes the OOPS in crash_shrink_memory and shows "/sys/kernel/kexec_crash_size" as 0 when crash kernel memory is not reserved. Signed-off-by: Pavan Naregundi <pavan@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>