asmadeus/linux.git - The linux kernel

Age	Commit message (Collapse)	Author
2005-11-07	[PATCH] Kprobes: Track kprobe on a per_cpu basis - sparc64 changes	Ananth N Mavinakayanahalli
	Sparc64 changes to track kprobe execution on a per-cpu basis. We now track the kprobe state machine independently on each cpu using an arch specific kprobe control block. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07	[PATCH] consolidate sys_ptrace()	Christoph Hellwig
	The sys_ptrace boilerplate code (everything outside the big switch statement for the arch-specific requests) is shared by most architectures. This patch moves it to kernel/ptrace.c and leaves the arch-specific code as arch_ptrace. Some architectures have a too different ptrace so we have to exclude them. They continue to keep their implementations. For sh64 I had to add a sh64_ptrace wrapper because it does some initialization on the first call. For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul Mackerras <paulus@samba.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-By: David Howells <dhowells@redhat.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] semaphore: Remove __MUTEX_INITIALIZER()	Arthur Othieno
	__MUTEX_INITIALIZER() has no users, and equates to the more commonly used DECLARE_MUTEX(), thus making it pretty much redundant. Remove it for good. Signed-off-by: Arthur Othieno <a.othieno@bluewin.ch> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] vm: remove unused/broken page_pte[_prot] macros	Tejun Heo
	This patch removes page_pte_prot and page_pte macros from all architectures. Some architectures define both, some only page_pte (broken) and others none. These macros are not used anywhere. page_pte_prot(page, prot) is identical to mk_pte(page, prot) and page_pte(page) is identical to page_pte_prot(page, __pgprot(0)). * The following architectures define both page_pte_prot and page_pte arm, arm26, ia64, sh64, sparc, sparc64 * The following architectures define only page_pte (broken) frv, i386, m32r, mips, sh, x86-64 * All other architectures define neither Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: tlb_finish_mmu forget rss	Hugh Dickins
	zap_pte_range has been counting the pages it frees in tlb->freed, then tlb_finish_mmu has used that to update the mm's rss. That got stranger when I added anon_rss, yet updated it by a different route; and stranger when rss and anon_rss became mm_counters with special access macros. And it would no longer be viable if we're relying on page_table_lock to stabilize the mm_counter, but calling tlb_finish_mmu outside that lock. Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own business, just decrement the rss mm_counter in zap_pte_range (yes, there was some point to batching the update, and a subsequent patch restores that). And forget the anal paranoia of first reading the counter to avoid going negative - if rss does go negative, just fix that bug. Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use was being made of them. But arm26 alone was actually using the freed, in the way some others use need_flush: give it a need_flush. arm26 seems to prefer spaces to tabs here: respect that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: tlb_is_full_mm was obscure	Hugh Dickins
	tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the mm's last user has gone and the whole mm is being torn down. And it's an inline function because sparc64 uses a different (slightly better) "tlb_frozen" name for the flag others call "fullmm". And now the ptep_get_and_clear_full macro used in zap_pte_range refers directly to tlb->fullmm, which would be wrong for sparc64. Rather than correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change sparc64 to just use the same poor name as everyone else - is that okay? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: tlb_gather_mmu get_cpu_var	Hugh Dickins
	tlb_gather_mmu dates from before kernel preemption was allowed, and uses smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That works because it's currently only called after getting page_table_lock, which is not dropped until after the matching tlb_finish_mmu. But don't rely on that, it will soon change: now disable preemption internally by proper get_cpu_var in tlb_gather_mmu, put_cpu_var in tlb_finish_mmu. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] add sem_is_read/write_locked()	Rik Van Riel
	Add sem_is_read/write_locked functions to the read/write semaphores, along the same lines of the *_is_locked spinlock functions. The swap token tuning patch uses sem_is_read_locked; sem_is_write_locked is added for completeness. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28	[PATCH] gfp_t: dma-mapping (simple cases)	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-13	[SPARC64]: Eliminate PCI IOMMU dma mapping size limit.	David S. Miller
	The hairy fast allocator in the sparc64 PCI IOMMU code has a hard limit of 256 pages. Certain devices can exceed this when performing very large I/Os. So replace with a more simple allocator, based largely upon the arch/ppc64/kernel/iommu.c code. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-13	[SPARC64]: Consolidate common PCI IOMMU init code.	David S. Miller
	All the PCI controller drivers were doing the same thing setting up the IOMMU software state, put it all in one spot. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-29	[SPARC64]: Kill arch/sparc64/prom/memory.c	David S. Miller
	No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-29	[SPARC64]: Rewrite convoluted physical memory probing.	David S. Miller
	Delete all of the code working with sp_banks[] and replace with clean acquisition and sorting of physical memory parameters from the firmware. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28	[SPARC64]: Kill all external references to sp_banks[]	David S. Miller
	Thus, we can mark sp_banks[] static in arch/sparc64/mm/init.c Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28	[SPARC]: Declare paging_init() in asm/pgtable.h	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28	[SPARC64]: Simplify user fault fixup handling.	David S. Miller
	Instead of doing byte-at-a-time user accesses to figure out where the fault occurred, read the saved fault_address from the current thread structure. For the sake of defensive programming, if the fault_address does not fall into the user buffer range, simply assume the whole area faulted. This will cause the fixup for copy_from_user() to clear the entire kernel side buffer. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28	[SPARC64]: Fix fault handling in unaligned trap handler.	David S. Miller
	We were not calling kernel_mna_trap_fault() correctly. Instead of being fancy, just return 0 vs. -EFAULT from the assembler stubs, and handle that return value as appropriate. Create an "__retl_efault" stub for assembler exception table entries and use it where possible. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28	[SPARC64]: Convert to use generic exception table support.	David S. Miller
	The funny "range" exception table entries we had were only used by the compat layer socketcall assembly, and it wasn't even needed there. For free we now get proper exception table sorting and fast binary searching. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-27	[SPARC64]: Add missing IDs for newer cpus.	David S. Miller
	Also, the us3_cpufreq driver can work on Ultra-IV and IV+. They use the SAFARI bus register to control the clock divider just like Ultra-III and III+ do. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-27	[SPARC64]: Add defines for 32MB/256MB PTE page size on Ultra-IV+.	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-26	[SPARC64]: Probe D/I/E-cache config and use.	David S. Miller
	At boot time, determine the D-cache, I-cache and E-cache size and line-size. Use them in cache flushes when appropriate. This change was motivated by discovering that the D-cache on UltraSparc-IIIi and later are 64K not 32K, and the flushes done by the Cheetah error handlers were assuming a 32K size. There are still some pieces of code that are hard coding things and will need to be fixed up at some point. While we're here, fix the D-cache and I-cache parity error handlers to run with interrupts disabled, and when the trap occurs at trap level > 1 log the event via a counter displayed in /proc/cpuinfo. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-25	[SPARC64]: Add CONFIG_DEBUG_PAGEALLOC support.	David S. Miller
	The trick is that we do the kernel linear mapping TLB miss starting with an instruction sequence like this: ba,pt %xcc, kvmap_load xor %g2, %g4, %g5 succeeded by an instruction sequence which performs a full page table walk starting at swapper_pg_dir. We first take over the trap table from the firmware. Then, using this constant PTE generation for the linear mapping area above, we build the kernel page tables for the linear mapping. After this is setup, we patch that branch above into a "nop", which will cause TLB misses to fall through to the full page table walk. With this, the page unmapping for CONFIG_DEBUG_PAGEALLOC is trivial. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-22	[SPARC64]: Rewrite bootup sequence.	David S. Miller
	Instead of all of this cpu-specific code to remap the kernel to the correct location, use portable firmware calls to do this instead. What we do now is the following in position independant assembler: chosen_node = prom_finddevice("/chosen"); prom_mmu_ihandle_cache = prom_getint(chosen_node, "mmu"); vaddr = 4MB_ALIGN(current_text_addr()); prom_translate(vaddr, &paddr_high, &paddr_low, &mode); prom_boot_mapping_mode = mode; prom_boot_mapping_phys_high = paddr_high; prom_boot_mapping_phys_low = paddr_low; prom_map(-1, 8 * 1024 * 1024, KERNBASE, paddr_low); and that replaces the massive amount of by-hand TLB probing and programming we used to do here. The new code should also handle properly the case where the kernel is mapped at the correct address already (think: future kexec support). Consequently, the bulk of remap_kernel() dies as does the entirety of arch/sparc64/prom/map.S We try to share some strings in the PROM library with the ones used at bootup, and while we're here mark input strings to oplib.h routines with "const" when appropriate. There are many more simplifications now possible. For one thing, we can consolidate the two copies we now have of a lot of cpu setup code sitting in head.S and trampoline.S. This is a significant step towards CONFIG_DEBUG_PAGEALLOC support. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-21	[PATCH] Remove unused var from asm/futex.h	Paolo 'Blaisorblade' Giarrusso
	As recently done by Russell King for ARM, commit 4732efbeb997189d9f9b04708dc26bf8613ed721 introduces a generic asm/futex.h copied along most arches, which includes a "-ENOSYS support" to be changed if needed. However, it includes an unused var (taken from the "real" version) which GCC warns about. Remove it from all arches having that file version (i.e. same GIT id). $ git-diff-tree -r HEAD and $ git-ls-tree -r HEAD include/\|grep 9feff4ce1424bc390608326240be369eb13aa648 may be more interesting than looking at the patch itself, to make sure I've just copied the arm header to all other archs having the original dummy version of this file. Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-20	[SPARC64]: Verify vmalloc TLB misses more strictly.	David S. Miller
	Arrange the modules, OBP, and vmalloc areas such that a range verification can be done quite minimally. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19	[SPARC64]: Move DCACHE_ALIASING_POSSIBLE define to asm/page.h	David S. Miller
	This showed that arch/sparc64/kernel/ptrace.c was not getting the define properly, and thus the code protected by this ifdef was never actually compiled before. So fix that too. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-10	[PATCH] spinlock consolidation	Ingo Molnar
	This patch (written by me and also containing many suggestions of Arjan van de Ven) does a major cleanup of the spinlock code. It does the following things: - consolidates and enhances the spinlock/rwlock debugging code - simplifies the asm/spinlock.h files - encapsulates the raw spinlock type and moves generic spinlock features (such as ->break_lock) into the generic code. - cleans up the spinlock code hierarchy to get rid of the spaghetti. Most notably there's now only a single variant of the debugging code, located in lib/spinlock_debug.c. (previously we had one SMP debugging variant per architecture, plus a separate generic one for UP builds) Also, i've enhanced the rwlock debugging facility, it will now track write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too. All locks have lockup detection now, which will work for both soft and hard spin/rwlock lockups. The arch-level include files now only contain the minimally necessary subset of the spinlock code - all the rest that can be generalized now lives in the generic headers: include/asm-i386/spinlock_types.h \| 16 include/asm-x86_64/spinlock_types.h \| 16 I have also split up the various spinlock variants into separate files, making it easier to see which does what. The new layout is: SMP \| UP ----------------------------\|----------------------------------- asm/spinlock_types_smp.h \| linux/spinlock_types_up.h linux/spinlock_types.h \| linux/spinlock_types.h asm/spinlock_smp.h \| linux/spinlock_up.h linux/spinlock_api_smp.h \| linux/spinlock_api_up.h linux/spinlock.h \| linux/spinlock.h /* * here's the role of the various spinlock/rwlock related include files: * * on SMP builds: * * asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the * initializers * * linux/spinlock_types.h: * defines the generic type and initializers * * asm/spinlock.h: contains the __raw_spin_()/etc. lowlevel implementations, mostly inline assembly code * * (also included on UP-debug builds:) * * linux/spinlock_api_smp.h: * contains the prototypes for the _spin_() APIs. * linux/spinlock.h: builds the final spin_() APIs. * on UP builds: * * linux/spinlock_type_up.h: * contains the generic, simplified UP spinlock type. * (which is an empty structure on non-debug builds) * * linux/spinlock_types.h: * defines the generic type and initializers * * linux/spinlock_up.h: * contains the __raw_spin_()/etc. version of UP builds. (which are NOPs on non-debug, non-preempt * builds) * * (included on UP-non-debug builds:) * * linux/spinlock_api_up.h: * builds the _spin_() APIs. * linux/spinlock.h: builds the final spin_() APIs. / All SMP and UP architectures are converted by this patch. arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via crosscompilers. m32r, mips, sh, sparc, have not been tested yet, but should be mostly fine. From: Grant Grundler <grundler@parisc-linux.org> Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU). Builds 32-bit SMP kernel (not booted or tested). I did not try to build non-SMP kernels. That should be trivial to fix up later if necessary. I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids some ugly nesting of linux/.h and asm/.h files. Those particular locks are well tested and contained entirely inside arch specific code. I do NOT expect any new issues to arise with them. If someone does ever need to use debug/metrics with them, then they will need to unravel this hairball between spinlocks, atomic ops, and bit ops that exist only because parisc has exactly one atomic instruction: LDCW (load and clear word). From: "Luck, Tony" <tony.luck@intel.com> ia64 fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjanv@infradead.org> Signed-off-by: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <willy@debian.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se> Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-08	Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6	Linus Torvalds

2005-09-08	[PATCH] Make sparc64 use setup-res.c	David S. Miller
	There were three changes necessary in order to allow sparc64 to use setup-res.c: 1) Sparc64 roots the PCI I/O and MEM address space using parent resources contained in the PCI controller structure. I'm actually surprised no other platforms do this, especially ones like Alpha and PPC{,64}. These resources get linked into the iomem/ioport tree when PCI controllers are probed. So the hierarchy looks like this: iomem --\| PCI controller 1 MEM space --\| device 1 device 2 etc. PCI controller 2 MEM space --\| ... ioport --\| PCI controller 1 IO space --\| ... PCI controller 2 IO space --\| ... You get the idea. The drivers/pci/setup-res.c code allocates using plain iomem_space and ioport_space as the root, so that wouldn't work with the above setup. So I added a pcibios_select_root() that is used to handle this. It uses the PCI controller struct's io_space and mem_space on sparc64, and io{port,mem}_resource on every other platform to keep current behavior. 2) quirk_io_region() is buggy. It takes in raw BUS view addresses and tries to use them as a PCI resource. pci_claim_resource() expects the resource to be fully formed when it gets called. The sparc64 implementation would do the translation but that's absolutely wrong, because if the same resource gets released then re-claimed we'll adjust things twice. So I fixed up quirk_io_region() to do the proper pcibios_bus_to_resource() conversion before passing it on to pci_claim_resource(). 3) I was mistakedly __init'ing the function methods the PCI controller drivers provide on sparc64 to implement some parts of these routines. This was, of course, easy to fix. So we end up with the following, and that nasty SPARC64 makefile ifdef in drivers/pci/Makefile is finally zapped. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2005-09-08	[SPARC64]: Inline membar()'s again.	David S. Miller
	Since GCC has to emit a call and a delay slot to the out-of-line "membar" routines in arch/sparc64/lib/mb.S it is much better to just do the necessary predicted branch inline instead as: ba,pt %xcc, 1f membar #whatever 1: instead of the current: call membar_foo dslot because this way GCC is not required to allocate a stack frame if the function can be a leaf function. This also makes this bug fix easier to backport to 2.4.x Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-07	[PATCH] Clean up struct flock definitions	Stephen Rothwell
	This patch just gathers together all the struct flock definitions except xtensa into asm-generic/fcntl.h. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] Clean up the fcntl operations	Stephen Rothwell
	This patch puts the most popular of each fcntl operation/flag into asm-generic/fcntl.h and cleans up the arch files. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] Clean up the open flags	Stephen Rothwell
	This patch puts the most popular of each open flag into asm-generic/fcntl.h and cleans up the arch files. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] Create asm-generic/fcntl.h	Stephen Rothwell
	This set of patches creates asm-generic/fcntl.h and consolidates as much as possible from the asm-/fcntl.h files into it. This patch just gathers all the identical bits of the asm-/fcntl.h files into asm-generic/fcntl.h. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] remove verify_area(): remove verify_area() from various uaccess.h ↵	Jesper Juhl
	headers Remove the deprecated (and unused) verify_area() from various uaccess.h headers. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] remove asm-*/hdreg.h	Christoph Hellwig
	unused and useless.. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] auxiliary vector cleanups	H. J. Lu
	The size of auxiliary vector is fixed at 42 in linux/sched.h. But it isn't very obvious when looking at linux/elf.h. This patch adds AT_VECTOR_SIZE so that we can change it if necessary when a new vector is added. Because of include file ordering problems, doing this necessitated the extraction of the AT_* symbols into a standalone header file. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] compat: be more consistent about [ug]id_t	Stephen Rothwell
	When I first wrote the compat layer patches, I was somewhat cavalier about the definition of compat_uid_t and compat_gid_t (or maybe I just misunderstood :-)). This patch makes the compat types much more consistent with the types we are being compatible with and hopefully will fix a few bugs along the way. compat type type in compat arch __compat_[ug]id_t __kernel_[ug]id_t __compat_[ug]id32_t __kernel_[ug]id32_t compat_[ug]id_t [ug]id_t The difference is that compat_uid_t is always 32 bits (for the archs we care about) but __compat_uid_t may be 16 bits on some. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] FUTEX_WAKE_OP: pthread_cond_signal() speedup	Jakub Jelinek
	ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter (which at least on UP usually means an immediate context switch to one of the waiter threads). This waiter wakes up and after a few instructions it attempts to acquire the cv internal lock, but that lock is still held by the thread calling pthread_cond_signal. So it goes to sleep and eventually the signalling thread is scheduled in, unlocks the internal lock and wakes the waiter again. Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal to avoid this performance issue, but it was removed when locks were redesigned to the 3 state scheme (unlocked, locked uncontended, locked contended). Following scenario shows why simply using FUTEX_REQUEUE in pthread_cond_signal together with using lll_mutex_unlock_force in place of lll_mutex_unlock is not enough and probably why it has been disabled at that time: The number is value in cv->__data.__lock. thr1 thr2 thr3 0 pthread_cond_wait 1 lll_mutex_lock (cv->__data.__lock) 0 lll_mutex_unlock (cv->__data.__lock) 0 lll_futex_wait (&cv->__data.__futex, futexval) 0 pthread_cond_signal 1 lll_mutex_lock (cv->__data.__lock) 1 pthread_cond_signal 2 lll_mutex_lock (cv->__data.__lock) 2 lll_futex_wait (&cv->__data.__lock, 2) 2 lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock) # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE 2 lll_mutex_unlock_force (cv->__data.__lock) 0 cv->__data.__lock = 0 0 lll_futex_wake (&cv->__data.__lock, 1) 1 lll_mutex_lock (cv->__data.__lock) 0 lll_mutex_unlock (cv->__data.__lock) # Here, lll_mutex_unlock doesn't know there are threads waiting # on the internal cv's lock Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal, but it will cost us not one, but 2 extra syscalls and, what's worse, one of these extra syscalls will be done for every single waiting loop in pthread_cond_wait. We would need to use lll_mutex_unlock_force in pthread_cond_signal after requeue and lll_mutex_cond_lock in pthread_cond_wait after lll_futex_wait. Another alternative is to do the unlocking pthread_cond_signal needs to do (the lock can't be unlocked before lll_futex_wake, as that is racy) in the kernel. I have implemented both variants, futex-requeue-glibc.patch is the first one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel. The kernel interface allows userland to specify how exactly an unlocking operation should look like (some atomic arithmetic operation with optional constant argument and comparison of the previous futex value with another constant). It has been implemented just for ppc*, x86_64 and i?86, for other architectures I'm including just a stub header which can be used as a starting point by maintainers to write support for their arches and ATM will just return -ENOSYS for FUTEX_WAKE_OP. The requeue patch has been (lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running 32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL. With the following benchmark on UP x86-64 I get: for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \ for j in 1 2; do echo ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done time elf/ld.so --library-path .:nptl-orig /tmp/bench real 0m0.655s user 0m0.253s sys 0m0.403s real 0m0.657s user 0m0.269s sys 0m0.388s time elf/ld.so --library-path .:nptl-requeue /tmp/bench real 0m0.496s user 0m0.225s sys 0m0.271s real 0m0.531s user 0m0.242s sys 0m0.288s time elf/ld.so --library-path .:nptl-wake_op /tmp/bench real 0m0.380s user 0m0.176s sys 0m0.204s real 0m0.382s user 0m0.175s sys 0m0.207s The benchmark is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt Older futex-requeue-glibc.patch version is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt Older futex-wake_op-glibc.patch version is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt Will post a new version (just x86-64 fixes so that the patch applies against pthread_cond_signal.S) to libc-hacker ml soon. Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded testcase that will not test the atomicity of the operation, but at least check if the threads that should have been woken up are woken up and whether the arithmetic operation in the kernel gave the expected results. Acked-by: Ingo Molnar <mingo@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jamie Lokier <jamie@shareable.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05	Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6	Linus Torvalds

2005-09-05	[PATCH] sab: consolidate kmem_bufctl_t	Kyle Moffett
	This is used only in slab.c and each architecture gets to define whcih underlying type is to be used. Seems a bit silly - move it to slab.c and use the same type for all architectures: unsigned int. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05	[PATCH] mm: consolidate get_order	Stephen Rothwell
	Someone mentioned that almost all the architectures used basically the same implementation of get_order. This patch consolidates them into asm-generic/page.h and includes that in the appropriate places. The exceptions are ia64 and ppc which have their own (presumably optimised) versions. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-01	[SPARC]: Kill io_remap_page_range()	David S. Miller
	It's been deprecated long enough and there are no in-tree users any longer. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-31	[SPARC64]: Use 'unsigned long' for port argument to I/O string ops.	David S. Miller
	This kills warnings when building drivers/ide/ide-iops.c and puts us in-line with what other platforms do here. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29	[SPARC64]: Eliminate irq_cpustat_t.	David S. Miller
	We can put the __softirq_pending mask in the cpudata, no need for the silly NR_CPUS array in kernel/softirq.c Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29	[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options	Patrick McHardy
	Allows overriding of sysctl_{wmem,rmrm}_max Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29	[SPARC64]: More fully work around Spitfire Errata 51.	David S. Miller
	It appears that a memory barrier soon after a mispredicted branch, not just in the delay slot, can cause the hang condition of this cpu errata. So move them out-of-line, and explicitly put them into a "branch always, predict taken" delay slot which should fully kill this problem. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29	[SPARC64]: Make debugging spinlocks usable again.	David S. Miller
	When the spinlock routines were moved out of line into kernel/spinlock.c this made it so that the debugging spinlocks record lock acquisition program counts in the kernel/spinlock.c functions not in their callers. This makes the debugging info kind of useless. So record the correct caller's program counter and now this feature is useful once more. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29	[SPARC64]: remove use of asm/segment.h	Kumar Gala
	Removed sparc64 architecture specific users of asm/segment.h and asm-sparc64/segment.h itself Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29	[SPARC64]: Revamp Spitfire error trap handling.	David S. Miller
	Current uncorrectable error handling was poor enough that the processor could just loop taking the same trap over and over again. Fix things up so that we at least get a log message and perhaps even some register state. In the process, much consolidation became possible, particularly with the correctable error handler. Prefix assembler and C function names with "spitfire" to indicate that these are for Ultra-I/II/IIi/IIe only. More work is needed to make these routines robust and featureful to the level of the Ultra-III error handlers. Signed-off-by: David S. Miller <davem@davemloft.net>