asmadeus/linux.git - The linux kernel

Age	Commit message (Collapse)	Author
2008-06-04	Merge branch 'for_linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdbts: Use HW breakpoints with CONFIG_DEBUG_RODATA kgdb: use common ascii helpers and put_unaligned_be32 helper
2008-05-29	Merge branch 'sched-fixes-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: re-tune NUMA topologies sched: stop wake_affine from causing serious imbalance sched: fix sched_clock_cpu() revert ("sched: fair-group: SMP-nice for group scheduling") sched: cleanup show_schedstat(): fix memleak sched: unite unlikely pairs in rt_policy() and schedule_debug() revert ("sched: fair: weight calculations")
2008-05-29	Merge commit 'linus/master' into sched-fixes-for-linus	Ingo Molnar

2008-05-29	sched: stop wake_affine from causing serious imbalance	Mike Galbraith
	Prevent short-running wakers of short-running threads from overloading a single cpu via wakeup affinity, and wire up disconnected debug option. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29	sched: fix sched_clock_cpu()	Peter Zijlstra
	Make sched_clock_cpu() return 0 before it has been initialized and avoid corrupting its state due to doing so. This fixes the weird printk timestamp jump reported. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-05-29	revert ("sched: fair-group: SMP-nice for group scheduling")	Ingo Molnar
	Yanmin Zhang reported: Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1. It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. With bisect, I located the following patch: \| 18d95a2832c1392a2d63227a7a6d433cb9f2037e is first bad commit \| commit 18d95a2832c1392a2d63227a7a6d433cb9f2037e \| Author: Peter Zijlstra <a.p.zijlstra@chello.nl> \| Date: Sat Apr 19 19:45:00 2008 +0200 \| \| sched: fair-group: SMP-nice for group scheduling Revert it so that we get v2.6.25 behavior. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29	sched: cleanup	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29	show_schedstat(): fix memleak	Adrian Bunk
	The Coverity checker spotted a memleak introduced by commit 39106dcf85285e78f3b290022122c76f851379b8 (cpumask: use new cpus_scnprintf function). It seems the kfree() got lost between v2 and v3 of this patch... Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29	sched: unite unlikely pairs in rt_policy() and schedule_debug()	Roel Kluin
	Removes obfuscation and may improve assembly. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29	revert ("sched: fair: weight calculations")	Ingo Molnar
	Yanmin Zhang reported: Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many regressions with 2.6.26-rc1: 1) 8-core stoakley: 28%; 2) 16-core tigerton: 20%; 3) Itanium Montvale: 50%. Bisect located this patch: \| 8f1bc385cfbab474db6c27b5af1e439614f3025c is first bad commit \| commit 8f1bc385cfbab474db6c27b5af1e439614f3025c \| Author: Peter Zijlstra <a.p.zijlstra@chello.nl> \| Date: Sat Apr 19 19:45:00 2008 +0200 \| \| sched: fair: weight calculations Revert it to the 2.6.25 state. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-28	kgdb: use common ascii helpers and put_unaligned_be32 helper	Harvey Harrison
	Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-05-28	splice: fix sendfile() issue with relay	Tom Zanussi
	Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-26	posix timers: discard SI_TIMER signals on exec	Oleg Nesterov
	Based on Roland's patch. This approach was suggested by Austin Clements from the very beginning, and then by Linus. As Austin pointed out, the execing task can be killed by SI_TIMER signal because exec flushes the signal handlers, but doesn't discard the pending signals generated by posix timers. Perhaps not a bug, but people find this surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26	posix timers: sigqueue_free: don't free sigqueue if it is queued	Oleg Nesterov
	Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the pending signal. This is not consistent, the task should either receive the "full" signal along with siginfo_t, or it shouldn't receive the signal at all. Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list if it is queued. This is a user-visible change. If the signal is blocked, it stays queued after sys_timer_delete() until unblocked with the "stale" si_code/si_value, and of course it is still counted wrt RLIMIT_SIGPENDING which also limits the number of posix timers. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24	cgroups: remove node_ prefix_from ns subsystem	Cedric Le Goater
	This is a slight change in the namespace cgroup subsystem api. The change is that previously when cgroup_clone() was called (currently only from the unshare path in ns_proxy cgroup, you'd get a new group named "node_$pid" whereas now you'll get a group named after just your pid.) The only users who would notice it are those who are using the ns_proxy cgroup subsystem to auto-create cgroups when namespaces are unshared - something of an experimental feature, which I think really needs more complete container/namespace support in order to be useful. I suspect the only users are Cedric and Serge, or maybe a few others on containers@lists.linux-foundation.org. And in fact it would only be noticed by the users who make the assumption about how the name is generated, rather than getting it from the /proc/<pid>/cgroups file for the process in question. Whether the change is actually needed or not I'm fairly agnostic on, but I guess it is more elegant to just use the pid as the new group name rather than adding a fairly arbitrary "node_" prefix on the front. [menage@google.com: provided changelog] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: "Paul Menage" <menage@google.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24	sys_prctl(): fix return of uninitialized value	Shi Weihua
	If none of the switch cases match, the PR_SET_PDEATHSIG and PR_SET_DUMPABLE cases of the switch statement will never write to local variable `error'. Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com> Cc: Andrew G. Morgan <morgan@kernel.org> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24	signals: fix sigqueue_free() vs __exit_signal() race	Oleg Nesterov
	__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock. This can race with another thread doing sigqueue_free(), we can free the same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list. Note that even sys_exit_group() can trigger this race, not only sys_timer_delete(). Move the callsite of flush_sigqueue(tsk->pending) under ->siglock. This patch doesn't touch flush_sigqueue(->shared_pending) below, it is called when there are no other threads which can play with signals, and sigqueue_free() can't be used outside of our thread group. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-23	stop_machine: make stop_machine_run more virtualization friendly	Christian Borntraeger
	On kvm I have seen some rare hangs in stop_machine when I used more guest cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the hang quite often. I could also reproduce the problem on a 4 way z/VM host with a 64 way guest. It turned out that the guest was consuming all available cpus mostly for spinning on scheduler locks like rq->lock. This is expected as the threads are calling yield all the time. The problem is now, that the host scheduling decisings together with the guest scheduling decisions and spinlocks not being fair managed to create an interesting scenario similar to a live lock. (Sometimes the hang resolved itself after some minutes) Changing stop_machine to yield the cpu to the hypervisor when yielding inside the guest fixed the problem for me. While I am not completely happy with this patch, I think it causes no harm and it really improves the situation for me. I used cpu_relax for yielding to the hypervisor, does that work on all architectures? p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use stop_machine_run and both triggered the problem after some retries. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23	modules: proper cleanup of kobject without CONFIG_SYSFS	Denis V. Lunev
	kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called. ------------[ cut here ]------------ WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55() Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen] comm: rmmod Tainted: G W 2.6.26-rc3 #585 Call Trace: [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a [<ffffffff80236aca>] ? printk+0x67/0x69 [<ffffffff80236aca>] ? printk+0x67/0x69 [<ffffffff80324289>] kobject_put+0x53/0x55 [<ffffffff8025e2ee>] free_module+0x87/0xfa [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1 [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67 [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80 ---[ end trace 8f5aafa7f6406cf8 ]--- mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call kobject_put in this case. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23	module loading ELF handling: use SELFMAG instead of numeric constant	Cyrill Gorcunov
	Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-19	Merge branch 'audit.b51' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] list_for_each_rcu must die: audit [patch 1/1] audit_send_reply(): fix error-path memory leak [PATCH] open sessionid permissions
2008-05-17	[PATCH] list_for_each_rcu must die: audit	Paul E. McKenney
	All uses of list_for_each_rcu() can be profitably replaced by the easier-to-use list_for_each_entry_rcu(). This patch makes this change for the Audit system, in preparation for removing the list_for_each_rcu() API entirely. This time with well-formed SOB. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17	[patch 1/1] audit_send_reply(): fix error-path memory leak	Andrew Morton
	Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663 Reporter: Daniel Marjamki <danielm77@spray.se> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16	[PATCH] avoid multiplication overflows and signedness issues for max_fds	Al Viro
	Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and we don't want size calculation for ->fd[] to overflow. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16	[PATCH] dup_fd() fixes, part 1	Al Viro
	Move the sucker to fs/file.c in preparation to the rest Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-14	lib: create common ascii hex array	Harvey Harrison
	Add a common hex array in hexdump.c so everyone can use it. Add a common hi/lo helper to avoid the shifting masking that is done to get the upper and lower nibbles of a byte value. Pull the pack_hex_byte helper from kgdb as it is opencoded many places in the tree that will be consolidated. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14	cgroups: fix compile warning	Mirco Tischler
	Return type of cpu_rt_runtime_write() should be int instead of ssize_t. Signed-off-by: Mirco Tischler <mt-ml@gmx.de> Acked-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-11	Add new 'cond_resched_bkl()' helper function	Linus Torvalds
	It acts exactly like a regular 'cond_resched()', but will not get optimized away when CONFIG_PREEMPT is set. Normal kernel code is already preemptable in the presense of CONFIG_PREEMPT, so cond_resched() is optimized away (see commit 02b67cc3ba36bdba351d6c3a00593f4ec550d9d3 "sched: do not do cond_resched() when CONFIG_PREEMPT"). But when wanting to conditionally reschedule while holding a lock, you need to use "cond_sched_lock(lock)", and the new function is the BKL equivalent of that. Also make fs/locks.c use it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10	BKL: revert back to the old spinlock implementation	Linus Torvalds
	The generic semaphore rewrite had a huge performance regression on AIM7 (and potentially other BKL-heavy benchmarks) because the generic semaphores had been rewritten to be simple to understand and fair. The latter, in particular, turns a semaphore-based BKL implementation into a mess of scheduling. The attempt to fix the performance regression failed miserably (see the previous commit 00b41ec2611dc98f87f30753ee00a53db648d662 'Revert "semaphore: fix"'), and so for now the simple and sane approach is to instead just go back to the old spinlock-based BKL implementation that never had any issues like this. This patch also has the advantage of being reported to fix the regression completely according to Yanmin Zhang, unlike the semaphore hack which still left a couple percentage point regression. As a spinlock, the BKL obviously has the potential to be a latency issue, but it's not really any different from any other spinlock in that respect. We do want to get rid of the BKL asap, but that has been the plan for several years. These days, the biggest users are in the tty layer (open/release in particular) and Alan holds out some hope: "tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked." so while we're not there yet, we do have a plan of action. Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Alexander Viro <viro@ftp.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10	Revert "semaphore: fix"	Linus Torvalds
	This reverts commit bf726eab3711cf192405d21688a4b21e07b6188a, as it has been reported to cause a regression with processes stuck in __down(), apparently because some missing wakeup. Quoth Sven Wegener: "I'm currently investigating a regression that has showed up with my last git pull yesterday. Bisecting the commits showed bf726e "semaphore: fix" to be the culprit, reverting it fixed the issue. Symptoms: During heavy filesystem usage (e.g. a kernel compile) I get several compiler processes in uninterruptible sleep, blocking all i/o on the filesystem. System is an Intel Core 2 Quad running a 64bit kernel and userspace. Filesystem is xfs on top of lvm. See below for the output of sysrq-w." See http://lkml.org/lkml/2008/5/10/45 for full report. In the meantime, we can just fix the BKL performance regression by reverting back to the good old BKL spinlock implementation instead, since any sleeping lock will generally perform badly, especially if it tries to be fair. Reported-by: Sven Wegener <sven.wegener@stealer.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09	module: don't ignore vermagic string if module doesn't have modversions	Rusty Russell
	Linus found a logic bug: we ignore the version number in a module's vermagic string if we have CONFIG_MODVERSIONS set, but modversions also lets through a module with no __versions section for modprobe --force (with tainting, but still). We should only ignore the start of the vermagic string if the module actually has crcs to check. Rather than (say) having an entertaining hissy fit and creating a config option to work around the buggy code. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09	module: be more picky about allowing missing module versions	Rusty Russell
	We allow missing __versions sections, because modprobe --force strips it. It makes less sense to allow sections where there's no version for a specific symbol the module uses, so disallow that. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes: sched: fix weight calculations semaphore: fix
2008-05-08	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block	Linus Torvalds
	* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Revert "relay: fix splice problem" docbook: fix bio missing parameter block: use unitialized_var() in bio_alloc_bioset() block: avoid duplicate calls to get_part() in disk stat code cfq-iosched: make io priorities inherit CPU scheduling class as well as nice block: optimize generic_unplug_device() block: get rid of likely/unlikely predictions in merge logic vfs: splice remove_suid() cleanup cfq-iosched: fix RCU race in the cfq io_context destructor handling block: adjust tagging function queue bit locking block: sysfs store function needs to grab queue_lock and use queue_flag_*()
2008-05-08	Fix cpuset sched_relax_domain_level control file	Paul Menage
	Due to a merge conflict, the sched_relax_domain_level control file was marked as being handled by cpuset_read/write_u64, but the code to handle it was actually in cpuset_common_file_read/write. Since the value being written/read is in fact a signed integer, it should be treated as such; this patch adds cpuset_read/write_s64 functions, and uses them to handle the sched_relax_domain_level file. With this patch, the sched_relax_domain_level can be read and written, and the correct contents seen/updated. Signed-off-by: Paul Menage <menage@google.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08	sched: fix weight calculations	Mike Galbraith
	The conversion between virtual and real time is as follows: dvt = rw/w * dt <=> dt = w/rw * dvt Since we want the fair sleeper granularity to be in real time, we actually need to do: dvt = - rw/w * l This bug could be related to the regression reported by Yanmin Zhang: \| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots \| of regressions with 2.6.26-rc1: \| \| 1) 8-core stoakley: 28%; \| 2) 16-core tigerton: 20%; \| 3) Itanium Montvale: 50%. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08	semaphore: fix	Ingo Molnar
	Yanmin Zhang reported: \| Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th \| regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, \| and Itanium Montecito. Bisect located the patch below: \| \| 64ac24e738823161693bf791f87adc802cf529ff is first bad commit \| commit 64ac24e738823161693bf791f87adc802cf529ff \| Author: Matthew Wilcox <matthew@wil.cx> \| Date: Fri Mar 7 21:55:58 2008 -0500 \| \| Generic semaphore implementation \| \| After I manually reverted the patch against 2.6.26-rc1 while fixing \| lots of conflicts/errors, aim7 regression became less than 2%. i reproduced the AIM7 workload and can confirm Yanmin's findings that -.26-rc1 regresses over .25 - by over 67% here. Looking at the workload i found and fixed what i believe to be the real bug causing the AIM7 regression: it was inefficient wakeup / scheduling / locking behavior of the new generic semaphore code, causing suboptimal performance. The problem comes from the following code. The new semaphore code does this on down(): spin_lock_irqsave(&sem->lock, flags); if (likely(sem->count > 0)) sem->count--; else __down(sem); spin_unlock_irqrestore(&sem->lock, flags); and this on up(): spin_lock_irqsave(&sem->lock, flags); if (likely(list_empty(&sem->wait_list))) sem->count++; else __up(sem); spin_unlock_irqrestore(&sem->lock, flags); where __up() does: list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); and where __down() does this in essence: list_add_tail(&waiter.list, &sem->wait_list); waiter.task = task; waiter.up = 0; for (;;) { [...] spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); if (waiter.up) return 0; } the fastpath looks good and obvious, but note the following property of the contended path: if there's a task on the ->wait_list, the up() of the current owner will "pass over" ownership to that waiting task, in a wake-one manner, via the waiter->up flag and by removing the waiter from the wait list. That is all and fine in principle, but as implemented in kernel/semaphore.c it also creates a nasty, hidden source of contention! The contention comes from the following property of the new semaphore code: the new owner owns the semaphore exclusively, even if it is not running yet. So if the old owner, even if just a few instructions later, does a down() [lock_kernel()] again, it will be blocked and will have to wait on the new owner to eventually be scheduled (possibly on another CPU)! Or if another task gets to lock_kernel() sooner than the "new owner" scheduled, it will be blocked unnecessarily and for a very long time when there are 2000 tasks running. I.e. the implementation of the new semaphores code does wake-one and lock ownership in a very restrictive way - it does not allow opportunistic re-locking of the lock at all and keeps the scheduler from picking task order intelligently. This kind of scheduling, with 2000 AIM7 processes running, creates awful cross-scheduling between those 2000 tasks, causes reduced parallelism, a throttled runqueue length and a lot of idle time. With increasing number of CPUs it causes an exponentially worse behavior in AIM7, as the chance for a newly woken new-owner task to actually run anytime soon is less and less likely. Note that it takes just a tiny bit of contention for the 'new-semaphore catastrophy' to happen: the wakeup latencies get added to whatever small contention there is, and quickly snowball out of control! I believe Yanmin's findings and numbers support this analysis too. The best fix for this problem is to use the same scheduling logic that the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and wanted because we do not want to over-schedule), but also allow opportunistic locking of the lock even if a wakee is already "in flight". The patch below implements this new logic. With this patch applied the AIM7 regression is largely fixed on my quad testbox: # v2.6.25 vanilla: .................. Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 56096.4 91 207.5 789.7 0.4675 2000 55894.4 94 208.2 792.7 0.4658 # v2.6.26-rc1-166-gc0a1811 vanilla: ................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 33230.6 83 350.3 784.5 0.2769 2000 31778.1 86 366.3 783.6 0.2648 # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup: ............................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55707.1 92 209.0 795.6 0.4642 2000 55704.4 96 209.0 796.0 0.4642 i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 performance levels and have zero idle time during the test, as expected. Btw., interactivity also improved dramatically with the fix - for example console-switching became almost instantaneous during this workload (which after all is running 2000 tasks at once!), without the patch it was stuck for a minute at times. There's another nice side-effect of this speedup patch, the new generic semaphore code got even smaller: text data bss dec hex filename 1241 0 0 1241 4d9 semaphore.o.before 1207 0 0 1207 4b7 semaphore.o.after (because the waiter.up complication got removed.) Longer-term we should look into using the mutex code for the generic semaphore code as well - but i's not easy due to legacies and it's outside of the scope of v2.6.26 and outside the scope of this patch as well. Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-08	Revert "relay: fix splice problem"	Jens Axboe
	This reverts commit c3270e577c18b3d0e984c3371493205a4807db9d.
2008-05-05	sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK	Peter Zijlstra
	this replaces the rq->clock stuff (and possibly cpu_clock()). - architectures that have an 'imperfect' hardware clock can set CONFIG_HAVE_UNSTABLE_SCHED_CLOCK - the 'jiffie' window might be superfulous when we update tick_gtod before the __update_sched_clock() call in sched_clock_tick() - cpu_clock() might be implemented as: sched_clock_cpu(smp_processor_id()) if the accuracy proves good enough - how far can TSC drift in a single jiffie when considering the filtering and idle hooks? [ mingo@elte.hu: various fixes and cleanups ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix cpu clock	Ingo Molnar
	David Miller pointed it out that nothing in cpu_clock() sets prev_cpu_time. This caused __sync_cpu_clock() to be called all the time - against the intention of this code. The result was that in practice we hit a global spinlock every time cpu_clock() is called - which - even though cpu_clock() is used for tracing and debugging, is suboptimal. While at it, also: - move the irq disabling to the outest layer, this should make cpu_clock() warp-free when called with irqs enabled. - use long long instead of cycles_t - for platforms where cycles_t is 32-bit. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fair-group: fix a Div0 error of the fair group scheduler	Miao Xie
	When I echoed 0 into the "cpu.shares" file, a Div0 error occured. We found it is caused by the following calling. sched_group_set_shares(tg, shares) set_se_shares(tg->se[i], shares/nr_cpu_ids) __set_se_shares(se, shares) div64_64((1ULL<<32), shares) When the echoed value was less than the number of processores, the result of the sentence "shares/nr_cpu_ids" was 0, and then the system called div64() to divide the result, the Div0 error occured. It is unnecessary that the shares value is divided by nr_cpu_ids, I think. Because in the function __update_group_shares_cpu() and init_tg_cfs_entry(), the shares value isn't divided by nr_cpu_ids when setting shares of the sched entity. This patch fixes this bug. And echoing ULONG_MAX value into cpu.shares also causes Div0 error, so we set a macro MAX_SHARES to limit the max value of shares. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix missing locking in sched_domains code	Heiko Carstens
	Concurrent calls to detach_destroy_domains and arch_init_sched_domains were prevented by the old scheduler subsystem cpu hotplug mutex. When this got converted to get_online_cpus() the locking got broken. Unlike before now several processes can concurrently enter the critical sections that were protected by the old lock. So use the already present doms_cur_mutex to protect these sections again. Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: make clock sync tunable by architecture code	Ingo Molnar
	make time_sync_thresh tunable to architecture code. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix debugging	Mike Galbraith
	Revert debugging commit 7ba2e74ab5a0518bc953042952dd165724bc70c9. print_cfs_rq_tasks() can induce live-lock if a task is dequeued during list traversal. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix sched_info_switch not being called according to documentation	David Simner
	http://bugzilla.kernel.org/show_bug.cgi?id=10545 sched_stats.h says that __sched_info_switch is "called when prev != next" in the comment. sched.c should therefore do that. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix hrtick_start_fair and CPU-Hotplug	Peter Zijlstra
	Gautham R Shenoy reported: > While running the usual CPU-Hotplug stress tests on linux-2.6.25, > I noticed the following in the console logs. > > This is a wee bit difficult to reproduce. In the past 10 runs I hit this > only once. > > ------------[ cut here ]------------ > > WARNING: at kernel/sched.c:962 hrtick+0x2e/0x65() > > Just wondering if we are doing a good job at handling the cancellation > of any per-cpu scheduler timers during CPU-Hotplug. This looks like its indeed not cancelled at all and migrates the it to another cpu. Fix it via a proper hotplug notifier mechanism. Reported-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix SCHED_FAIR wake-idle logic error	Gregory Haskins
	We currently use an optimization to skip the overhead of wake-idle processing if more than one task is assigned to a run-queue. The assumption is that the system must already be load-balanced or we wouldnt be overloaded to begin with. The problem is that we are looking at rq->nr_running, which may include RT tasks in addition to CFS tasks. Since the presence of RT tasks really has no bearing on the balance status of CFS tasks, this throws the calculation off. This patch changes the logic to only consider the number of CFS tasks when making the decision to optimze the wake-idle. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: fix RT task-wakeup logic	Gregory Haskins
	Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where we will always evaluate to "true". You can find the thread here: http://lkml.org/lkml/2008/4/22/296 In reality, we only want to try to push tasks away when a wake up request is not going to preempt the current task. So lets fix it. Note: We introduce test_tsk_need_resched() instead of open-coding the flag check so that the merge-conflict with -rt should help remind us that we may need to support NEEDS_RESCHED_DELAYED in the future, too. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: add statics, don't return void expressions	Harvey Harrison
	Noticed by sparse: kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static? kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static? kernel/sched_fair.c:845:3: warning: returning void-valued expression kernel/sched.c:4386:3: warning: returning void-valued expression Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-05	sched: add debug checks to idle functions	Andrew Morton
	Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: "Justin Mattock" <justinmattock@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>