summaryrefslogtreecommitdiffstats
path: root/include/asm-powerpc
AgeCommit message (Collapse)Author
2006-03-27[PATCH] lightweight robust futexes updatesIngo Molnar
- fix: initialize the robust list(s) to NULL in copy_process. - doc update - cleanup: rename _inuser to _inatomic - __user cleanups and other small cleanups Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] lightweight robust futexes: arch defaultsIngo Molnar
This patchset provides a new (written from scratch) implementation of robust futexes, called "lightweight robust futexes". We believe this new implementation is faster and simpler than the vma-based robust futex solutions presented before, and we'd like this patchset to be adopted in the upstream kernel. This is version 1 of the patchset. Background ---------- What are robust futexes? To answer that, we first need to understand what futexes are: normal futexes are special types of locks that in the noncontended case can be acquired/released from userspace without having to enter the kernel. A futex is in essence a user-space address, e.g. a 32-bit lock variable field. If userspace notices contention (the lock is already owned and someone else wants to grab it too) then the lock is marked with a value that says "there's a waiter pending", and the sys_futex(FUTEX_WAIT) syscall is used to wait for the other guy to release it. The kernel creates a 'futex queue' internally, so that it can later on match up the waiter with the waker - without them having to know about each other. When the owner thread releases the futex, it notices (via the variable value) that there were waiter(s) pending, and does the sys_futex(FUTEX_WAKE) syscall to wake them up. Once all waiters have taken and released the lock, the futex is again back to 'uncontended' state, and there's no in-kernel state associated with it. The kernel completely forgets that there ever was a futex at that address. This method makes futexes very lightweight and scalable. "Robustness" is about dealing with crashes while holding a lock: if a process exits prematurely while holding a pthread_mutex_t lock that is also shared with some other process (e.g. yum segfaults while holding a pthread_mutex_t, or yum is kill -9-ed), then waiters for that lock need to be notified that the last owner of the lock exited in some irregular way. To solve such types of problems, "robust mutex" userspace APIs were created: pthread_mutex_lock() returns an error value if the owner exits prematurely - and the new owner can decide whether the data protected by the lock can be recovered safely. There is a big conceptual problem with futex based mutexes though: it is the kernel that destroys the owner task (e.g. due to a SEGFAULT), but the kernel cannot help with the cleanup: if there is no 'futex queue' (and in most cases there is none, futexes being fast lightweight locks) then the kernel has no information to clean up after the held lock! Userspace has no chance to clean up after the lock either - userspace is the one that crashes, so it has no opportunity to clean up. Catch-22. In practice, when e.g. yum is kill -9-ed (or segfaults), a system reboot is needed to release that futex based lock. This is one of the leading bugreports against yum. To solve this problem, 'Robust Futex' patches were created and presented on lkml: the one written by Todd Kneisel and David Singleton is the most advanced at the moment. These patches all tried to extend the futex abstraction by registering futex-based locks in the kernel - and thus give the kernel a chance to clean up. E.g. in David Singleton's robust-futex-6.patch, there are 3 new syscall variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and FUTEX_RECOVER. The kernel attaches such robust futexes to vmas (via vma->vm_file->f_mapping->robust_head), and at do_exit() time, all vmas are searched to see whether they have a robust_head set. Lots of work went into the vma-based robust-futex patch, and recently it has improved significantly, but unfortunately it still has two fundamental problems left: - they have quite complex locking and race scenarios. The vma-based patches had been pending for years, but they are still not completely reliable. - they have to scan _every_ vma at sys_exit() time, per thread! The second disadvantage is a real killer: pthread_exit() takes around 1 microsecond on Linux, but with thousands (or tens of thousands) of vmas every pthread_exit() takes a millisecond or more, also totally destroying the CPU's L1 and L2 caches! This is very much noticeable even for normal process sys_exit_group() calls: the kernel has to do the vma scanning unconditionally! (this is because the kernel has no knowledge about how many robust futexes there are to be cleaned up, because a robust futex might have been registered in another task, and the futex variable might have been simply mmap()-ed into this process's address space). This huge overhead forced the creation of CONFIG_FUTEX_ROBUST, but worse than that: the overhead makes robust futexes impractical for any type of generic Linux distribution. So it became clear to us, something had to be done. Last week, when Thomas Gleixner tried to fix up the vma-based robust futex patch in the -rt tree, he found a handful of new races and we were talking about it and were analyzing the situation. At that point a fundamentally different solution occured to me. This patchset (written in the past couple of days) implements that new solution. Be warned though - the patchset does things we normally dont do in Linux, so some might find the approach disturbing. Parental advice recommended ;-) New approach to robust futexes ------------------------------ At the heart of this new approach there is a per-thread private list of robust locks that userspace is holding (maintained by glibc) - which userspace list is registered with the kernel via a new syscall [this registration happens at most once per thread lifetime]. At do_exit() time, the kernel checks this user-space list: are there any robust futex locks to be cleaned up? In the common case, at do_exit() time, there is no list registered, so the cost of robust futexes is just a simple current->robust_list != NULL comparison. If the thread has registered a list, then normally the list is empty. If the thread/process crashed or terminated in some incorrect way then the list might be non-empty: in this case the kernel carefully walks the list [not trusting it], and marks all locks that are owned by this thread with the FUTEX_OWNER_DEAD bit, and wakes up one waiter (if any). The list is guaranteed to be private and per-thread, so it's lockless. There is one race possible though: since adding to and removing from the list is done after the futex is acquired by glibc, there is a few instructions window for the thread (or process) to die there, leaving the futex hung. To protect against this possibility, userspace (glibc) also maintains a simple per-thread 'list_op_pending' field, to allow the kernel to clean up if the thread dies after acquiring the lock, but just before it could have added itself to the list. Glibc sets this list_op_pending field before it tries to acquire the futex, and clears it after the list-add (or list-remove) has finished. That's all that is needed - all the rest of robust-futex cleanup is done in userspace [just like with the previous patches]. Ulrich Drepper has implemented the necessary glibc support for this new mechanism, which fully enables robust mutexes. (Ulrich plans to commit these changes to glibc-HEAD later today.) Key differences of this userspace-list based approach, compared to the vma based method: - it's much, much faster: at thread exit time, there's no need to loop over every vma (!), which the VM-based method has to do. Only a very simple 'is the list empty' op is done. - no VM changes are needed - 'struct address_space' is left alone. - no registration of individual locks is needed: robust mutexes dont need any extra per-lock syscalls. Robust mutexes thus become a very lightweight primitive - so they dont force the application designer to do a hard choice between performance and robustness - robust mutexes are just as fast. - no per-lock kernel allocation happens. - no resource limits are needed. - no kernel-space recovery call (FUTEX_RECOVER) is needed. - the implementation and the locking is "obvious", and there are no interactions with the VM. Performance ----------- I have benchmarked the time needed for the kernel to process a list of 1 million (!) held locks, using the new method [on a 2GHz CPU]: - with FUTEX_WAIT set [contended mutex]: 130 msecs - without FUTEX_WAIT set [uncontended mutex]: 30 msecs I have also measured an approach where glibc does the lock notification [which it currently does for !pshared robust mutexes], and that took 256 msecs - clearly slower, due to the 1 million FUTEX_WAKE syscalls userspace had to do. (1 million held locks are unheard of - we expect at most a handful of locks to be held at a time. Nevertheless it's nice to know that this approach scales nicely.) Implementation details ---------------------- The patch adds two new syscalls: one to register the userspace list, and one to query the registered list pointer: asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); asmlinkage long sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr, size_t __user *len_ptr); List registration is very fast: the pointer is simply stored in current->robust_list. [Note that in the future, if robust futexes become widespread, we could extend sys_clone() to register a robust-list head for new threads, without the need of another syscall.] So there is virtually zero overhead for tasks not using robust futexes, and even for robust futex users, there is only one extra syscall per thread lifetime, and the cleanup operation, if it happens, is fast and straightforward. The kernel doesnt have any internal distinction between robust and normal futexes. If a futex is found to be held at exit time, the kernel sets the highest bit of the futex word: #define FUTEX_OWNER_DIED 0x40000000 and wakes up the next futex waiter (if any). User-space does the rest of the cleanup. Otherwise, robust futexes are acquired by glibc by putting the TID into the futex field atomically. Waiters set the FUTEX_WAITERS bit: #define FUTEX_WAITERS 0x80000000 and the remaining bits are for the TID. Testing, architecture support ----------------------------- I've tested the new syscalls on x86 and x86_64, and have made sure the parsing of the userspace list is robust [ ;-) ] even if the list is deliberately corrupted. i386 and x86_64 syscalls are wired up at the moment, and Ulrich has tested the new glibc code (on x86_64 and i386), and it works for his robust-mutex testcases. All other architectures should build just fine too - but they wont have the new syscalls yet. Architectures need to implement the new futex_atomic_cmpxchg_inuser() inline function before writing up the syscalls (that function returns -ENOSYS right now). This patch: Add placeholder futex_atomic_cmpxchg_inuser() implementations to every architecture that supports futexes. It returns -ENOSYS. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Acked-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] unify pfn_to_page: powerpc pfn_to_pageKAMEZAWA Hiroyuki
PowerPC can use generic ones. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] bitops: powerpc: use generic bitopsAkinobu Mita
- remove __{,test_and_}{set,clear,change}_bit() and test_bit() - remove generic_fls64() - remove generic_hweight{64,32,16,8}() - remove sched_find_first_bit() Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] 2TB files: add blkcnt_tTakashi Sato
Add blkcnt_t as the type of inode.i_blocks. This enables you to make the size of blkcnt_t either 4 bytes or 8 bytes on 32 bits architecture with CONFIG_LSF. - CONFIG_LSF Add new configuration parameter. - blkcnt_t On h8300, i386, mips, powerpc, s390 and sh that define sector_t, blkcnt_t is defined as u64 if CONFIG_LSF is enabled; otherwise it is defined as unsigned long. On other architectures, it is defined as unsigned long. - inode.i_blocks Change the type from sector_t to blkcnt_t. Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25powerpc: fix strncasecmp prototypeLinus Torvalds
It takes a size_t, not an int, as its third argument. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notificationsDavide Libenzi
Implement the half-closed devices notifiation, by adding a new POLLRDHUP (and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since the existing POLLHUP handling, that does not report correctly half-closed devices, was feared to be changed, this implementation leaves the current POLLHUP reporting unchanged and simply add a new bit that is set in the few places where it makes sense. The same thing was discussed and conceptually agreed quite some time ago: http://lkml.org/lkml/2003/7/12/116 Since this new event bit is added to the existing Linux poll infrastruture, even the existing poll/select system calls will be able to use it. As far as the existing POLLHUP handling, the patch leaves it as is. The pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing archs and sets the bit in the six relevant files. The other attached diff is the simple change required to sys/epoll.h to add the EPOLLRDHUP definition. There is "a stupid program" to test POLLRDHUP delivery here: http://www.xmailserver.org/pollrdhup-test.c It tests poll(2), but since the delivery is same epoll(2) will work equally. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] more for_each_cpu() conversionsAndrew Morton
When we stop allocating percpu memory for not-possible CPUs we must not touch the percpu data for not-possible CPUs at all. The correct way of doing this is to test cpu_possible() or to use for_each_cpu(). This patch is a kernel-wide sweep of all instances of NR_CPUS. I found very few instances of this bug, if any. But the patch converts lots of open-coded test to use the preferred helper macros. Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Acked-by: Kyle McMartin <kyle@parisc-linux.org> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Andi Kleen <ak@muc.de> Cc: Christian Zankel <chris@zankel.net> Cc: Philippe Elie <phil.el@wanadoo.fr> Cc: Nathan Scott <nathans@sgi.com> Cc: Jens Axboe <axboe@suse.de> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (78 commits) [PATCH] powerpc: Add FSL SEC node to documentation [PATCH] macintosh: tidy-up driver_register() return values [PATCH] powerpc: tidy-up of_register_driver()/driver_register() return values [PATCH] powerpc: via-pmu warning fix [PATCH] macintosh: cleanup the use of i2c headers [PATCH] powerpc: dont allow old RTC to be selected [PATCH] powerpc: make powerbook_sleep_grackle static [PATCH] powerpc: Fix warning in add_memory [PATCH] powerpc: update mailing list addresses [PATCH] powerpc: Remove calculation of io hole [PATCH] powerpc: iseries: Add bootargs to /chosen [PATCH] powerpc: iseries: Add /system-id, /model and /compatible [PATCH] powerpc: Add strne2a() to convert a string from EBCDIC to ASCII [PATCH] powerpc: iseries: Make more stuff static in platforms/iseries/mf.c [PATCH] powerpc: iseries: Remove pointless iSeries_(restart|power_off|halt) [PATCH] powerpc: iseries: mf related cleanups [PATCH] powerpc: Replace platform_is_lpar() with a firmware feature [PATCH] powerpc: trivial: Cleanup whitespace in cputable.h [PATCH] powerpc: Remove unused iommu_off logic from pSeries_init_early() [PATCH] powerpc: Unconfuse htab_bolt_mapping() callers ...
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables()David Gibson
free_pgtables() has special logic to call hugetlb_free_pgd_range() instead of the normal free_pgd_range() on hugepage VMAs. However, the test it uses to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized range at the start of the vma. is_hugepage_only_range() will return true if the given range has any intersection with a hugepage address region, and in this case the given region need not be hugepage aligned. So, for example, this test can return true if called on, say, a 4k VMA immediately preceding a (nicely aligned) hugepage VMA. At present we get away with this because the powerpc version of hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the only other arch with a non-trivial is_hugepage_only_range()) we get away with it for a different reason; the hugepage area is not contiguous with the rest of the user address space, and VMAs are not permitted in between, so the test can't return a false positive there. Nonetheless this should be fixed. We do that in the patch below by replacing the is_hugepage_only_range() test with an explicit test of the VMA using is_vm_hugetlb_page(). This in turn changes behaviour for platforms where is_hugepage_only_range() returns false always (everything except powerpc and ia64). We address this by ensuring that hugetlb_free_pgd_range() is defined to be identical to free_pgd_range() (instead of a no-op) on everything except ia64. Even so, it will prevent some otherwise possible coalescing of calls down to free_pgd_range(). Since this only happens for hugepage VMAs, removing this small optimization seems unlikely to cause any trouble. This patch causes no regressions on the libhugetlbfs testsuite - ppc64 POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] powerpc: Remove calculation of io holeMichael Ellerman
In mm_init_ppc64() we calculate the location of the "IO hole", but then no one ever looks at the value. So don't bother. That's actually all mm_init_ppc64() does, so get rid of it too. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-22[PATCH] powerpc: Add strne2a() to convert a string from EBCDIC to ASCIIMichael Ellerman
Add strne2a() which converts a string from EBCDIC to ASCII. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-22[PATCH] powerpc: iseries: Make more stuff static in platforms/iseries/mf.cMichael Ellerman
Make mf_get_rtc(), mf_get_boot_rtc() and mf_set_rtc() static, cause they can be. We need to move mf_set_rtc() to avoid a forward declaration. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-22[PATCH] powerpc: iseries: Remove pointless iSeries_(restart|power_off|halt)Michael Ellerman
These routines just call through to the mf routines, so point ppc_md straight at the mf routines. We need to pass the cmd through to mf_reboot to make it work, but that seems reasonable. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-22[PATCH] powerpc: iseries: mf related cleanupsMichael Ellerman
Some cleanups in the iSeries code. - Make mf_display_progress() check mf_initialized rather than the caller. - Set mf_initialized in mf_init() rather than in setup.c - Then move mf_initialized into mf.c, the only place it's used. - Move the mf related logic from iSeries_progress() to mf_display_progress() - Use a #define to size the pending_event_prealloc array - Use that define in the initialsation loop rather than sizeof jiggery pokery - Remove stupid comment(s) - Mark stuff static and/or __init Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-22[PATCH] powerpc: Replace platform_is_lpar() with a firmware featureMichael Ellerman
It has been decreed that platform numbers are evil, so as a step in that direction, replace platform_is_lpar() with a FW_FEATURE_LPAR bit. Currently FW_FEATURE_LPAR really means i/pSeries LPAR, in the future we might have to clean that up if we need to be more specific about what LPAR actually means. But that's another patch ... Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-22[PATCH] powerpc: trivial: Cleanup whitespace in cputable.hMichael Ellerman
Remove redundant whitespace in include/asm-powerpc/cputable.h Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-17[PATCH] powerpc: add for_each_node_by_foo helpersChristoph Hellwig
Typical use for of_find_node_by_name and of_find_node_by_type is to iterate over all nodes of a given type/name. Add a helper macro to do that (in spirit of the list_for_each* macros). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-17[PATCH] powerpc: Better pmd_bad() and pud_bad() checksDavid Gibson
At present, the powerpc pmd_bad() and pud_bad() macros return false unless the given pmd or pud is zero. This patch makes these tests more thorough, checking if the given pmd or pud looks like a plausible pte page or pmd page pointer respectively. This can result in helpful error messages when messing with the pagetable code. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-17Merge ../linux-2.6Paul Mackerras
2006-03-16[PATCH] powerpc: properly configure DDR/P5IOC children devsJohn Rose
The dynamic add path for PCI Host Bridges can fail to configure children adapters under P5IOC controllers. It fails to properly fixup bus/device resources, and it fails to properly enable EEH. Both of these steps need to occur before any children devices are enabled in pci_bus_add_devices(). Signed-off-by: John Rose <johnrose@austin.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-09Merge ../linux-2.6Paul Mackerras
2006-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc-mergeLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc-merge: powerpc: Fix various syscall/signal/swapcontext bugs [PATCH] powerpc: incorrect rmo_top handling in prom_init [PATCH] powerpc: Fix incorrect pud_ERROR() message [PATCH] powerpc: Expose SMT and L1 icache snoop userland features [PATCH] powerpc: Fix windfarm_pm112 not starting all control loops [PATCH] powerpc: Fix old g5 issues with windfarm powerpc32: Fix timebase synchronization on 32-bit powermacs powerpc: Turn off verbose debug output in powermac platform functions powerpc: Fix might-sleep warning in program check exception handler
2006-03-08[PATCH] fix kexec asmMichael Matz
While testing kexec and kdump we hit problems where the new kernel would freeze or instantly reboot. The easiest way to trigger it was to kexec a kernel compiled for CONFIG_M586 on an athlon cpu. Compiling for CONFIG_MK7 instead would work fine. The patch fixes a few problems with the kexec inline asm. Signed-off-by: Chris Mason <mason@suse.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] powerpc: restore eeh_add_device_late() prototype stubMark Fasheh
We fixed this: arch/powerpc/platforms/pseries/eeh.c: In function `eeh_add_device_tree_late': arch/powerpc/platforms/pseries/eeh.c:901: warning: implicit declaration of function `eeh_add_device_late' arch/powerpc/platforms/pseries/eeh.c: At top level: arch/powerpc/platforms/pseries/eeh.c:918: error: conflicting types for 'eeh_add_device_late' arch/powerpc/platforms/pseries/eeh.c:901: error: previous implicit declaration of 'eeh_add_device_late' was here make[2]: *** [arch/powerpc/platforms/pseries/eeh.o] Error 1 But we forgot the !CONFIG_EEH stub. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08powerpc: Fix various syscall/signal/swapcontext bugsPaul Mackerras
A careful reading of the recent changes to the system call entry/exit paths revealed several problems, plus some things that could be simplified and improved: * 32-bit wasn't testing the _TIF_NOERROR bit in the syscall fast exit path, so it was only doing anything with it once it saw some other bit being set. In other words, the noerror behaviour would apply to the next system call where we had to reschedule or deliver a signal, which is not necessarily the current system call. * 32-bit wasn't doing the call to ptrace_notify in the syscall exit path when the _TIF_SINGLESTEP bit was set. * _TIF_RESTOREALL was in both _TIF_USER_WORK_MASK and _TIF_PERSYSCALL_MASK, which is odd since _TIF_RESTOREALL is only set by system calls. I took it out of _TIF_USER_WORK_MASK. * On 64-bit, _TIF_RESTOREALL wasn't causing the non-volatile registers to be restored (unless perhaps a signal was delivered or the syscall was traced or single-stepped). Thus the non-volatile registers weren't restored on exit from a signal handler. We probably got away with it mostly because signal handlers written in C wouldn't alter the non-volatile registers. * On 32-bit I simplified the code and made it more like 64-bit by making the syscall exit path jump to ret_from_except to handle preemption and signal delivery. * 32-bit was calling do_signal unnecessarily when _TIF_RESTOREALL was set - but I think because of that 32-bit was actually restoring the non-volatile registers on exit from a signal handler. * I changed the order of enabling interrupts and saving the non-volatile registers before calling do_syscall_trace_leave; now we enable interrupts first. Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-03[PATCH] powerpc: Fix incorrect pud_ERROR() messageDavid Gibson
The powerpc pud_ERROR() function misleadingly prints a message indicating a pmd error. This patch fixes it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-03[PATCH] powerpc: Expose SMT and L1 icache snoop userland featuresBenjamin Herrenschmidt
This patch makes userland aware of the icache snoop capability of the POWER5 (and possibly others in the future) and of SMT capabilities. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-01[PATCH] fix build breakage in eeh.c in 2.6.16-rc5-git5Greg KH
This patch should fixe a problem with eeh_add_device_late() not being defined in the ppc64 build process, causing the build to break. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-28Merge ../powerpc-mergePaul Mackerras
2006-02-28[PATCH] powerpc: fix dynamic PCI probe regressionJohn Rose
Some hotplug driver functions were migrated to the kernel for use by EEH in commit 2bf6a8fa21570f37fd1789610da30f70a05ac5e3. Previously, the PCI Hotplug module had been changed to use the new OFDT-based PCI probe when appropriate: 5fa80fcdca9d20d30c9ecec30d4dbff4ed93a5c6 When rpaphp_pci_config_slot() was moved from the rpaphp driver to the new kernel function pcibios_add_pci_devices(), the OFDT-based probe stuff was dropped. This patch restores it. Signed-off-by: John Rose <johnrose@austin.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24[PATCH] powerpc: native atomic_add_unlessNick Piggin
Do atomic_add_unless natively instead of using cmpxchg. Improved register allocation idea from Joel Schopp. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24[PATCH] powerpc: newline for ISYNC_ON_SMPNick Piggin
Add a newline at the end of the ISYNC_ON_SMP string. Needed for a subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24[PATCH] powerpc: Fixup for STRICT_MM_TYPECHECKSDavid Gibson
Currently ARCH=powerpc will not compile when STRICT_MM_TYPECHECKS is turned on and CONFIG_64K_PAGES is turned off. This corrects the problem. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24powerpc: Implement accurate task and CPU time accountingPaul Mackerras
This implements accurate task and cpu time accounting for 64-bit powerpc kernels. Instead of accounting a whole jiffy of time to a task on a timer interrupt because that task happened to be running at the time, we now account time in units of timebase ticks according to the actual time spent by the task in user mode and kernel mode. We also count the time spent processing hardware and software interrupts accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If that is not set, we do tick-based approximate accounting as before. To get this accurate information, we read either the PURR (processor utilization of resources register) on POWER5 machines, or the timebase on other machines on * each entry to the kernel from usermode * each exit to usermode * transitions between process context, hard irq context and soft irq context in kernel mode * context switches. On POWER5 systems with shared-processor logical partitioning we also read both the PURR and the timebase at each timer interrupt and context switch in order to determine how much time has been taken by the hypervisor to run other partitions ("steal" time). Unfortunately, since we need values of the PURR on both threads at the same time to accurately calculate the steal time, and since we can only calculate steal time on a per-core basis, the apportioning of the steal time between idle time (time which we ceded to the hypervisor in the idle loop) and actual stolen time is somewhat approximate at the moment. This is all based quite heavily on what s390 does, and it uses the generic interfaces that were added by the s390 developers, i.e. account_system_time(), account_user_time(), etc. This patch doesn't add any new interfaces between the kernel and userspace, and doesn't change the units in which time is reported to userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(), times(), etc. Internally the various task and cpu times are stored in timebase units, but they are converted to USER_HZ units (1/100th of a second) when reported to userspace. Some precision is therefore lost but there should not be any accumulating error, since the internal accumulation is at full precision. Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24Merge ../powerpc-mergePaul Mackerras
2006-02-24[PATCH] powerpc: Fix runlatch performance issuesAnton Blanchard
The runlatch SPR can take a lot of time to write. My original runlatch code would set it on every exception entry even though most of the time this was not required. It would also continually set it in the idle loop, which is an issue on an SMT capable processor. Now we cache the runlatch value in a threadinfo bit, and only check for it in decrementer and hardware interrupt exceptions as well as the idle loop. Boot on POWER3, POWER5 and iseries, and compile tested on pmac32. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24[PATCH] powerpc: Enable coherency for all pages on 83xx to fix PCI data ↵Kumar Gala
corruption On the 83xx platform to ensure the PCI inbound memory is handled properly we have to turn on coherency for all pages in the MMU. Otherwise we see corruption if inbound "prefetching/streaming" is enabled on the PCI controller. Signed-off-by: Randy Vinson <rvinson@mvista.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24[PATCH] powerpc: Only calculate htab_size in one place for kexecMichael Ellerman
For kexec we need to know the size of the MMU hash table. Currently we calculate the size once in the htab code, and then twice more in the kexec code, once using htab_hash_mask and once using ppc64_pft_size. On some machines the ppc64_pft_size calculation is broken because ppc64_pft_size is not set. So we need to fix the second calculation, but better still we should just calculate the size once and use it everywhere else. Tested on Power5 LPAR, Power4 non-LPAR and Power3. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-17[PATCH] powerpc: Fix accidentally-working typo in __pud_free_tlbDavid Gibson
One of the parameters to the __pud_free_tlb() macro for powerpc is incorrect (see patch) . We get away with it by accident, because the one place the macro is called, the second parameter is a variable named "pud". Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] add asm-generic/mman.hMichael S. Tsirkin
Make new MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK consistent across all arches. The idea is to make it possible to use them portably even before distros include them in libc headers. Move common flags to asm-generic/mman.h Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Cc: Roland Dreier <rolandd@cisco.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] madvise MADV_DONTFORK/MADV_DOFORKMichael S. Tsirkin
Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages). This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMA's into this page after the COW. In case of mlock'd memory, the parent is not getting the realtime/security benefits of mlock. In particular, this affects the Infiniband modules which do DMA from and into user pages all the time. This patch adds madvise options to control whether memory range is inherited across fork. Useful e.g. for when hardware is doing DMA from/into these pages. Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10[PATCH] powerpc: trivial: modify comments to refer to new location of filesJon Mason
This patch removes all self references and fixes references to files in the now defunct arch/ppc64 tree. I think this accomplises everything wanted, though there might be a few references I missed. Signed-off-by: Jon Mason <jdmason@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-10[PATCH] powerpc: Move pSeries firmware feature setup into platforms/pseriesMichael Ellerman
Currently we have some stuff in firmware.h and kernel/firmware.c that is #ifdef CONFIG_PPC_PSERIES. Move it all into platforms/pseries. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-10Merge ../powerpc-mergePaul Mackerras
2006-02-10[PATCH] powerpc: unshare system call registrationJANAK DESAI
Registers system call for the powerpc architecture. Signed-off-by: Janak Desai <janak@us.ibm.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-08Merge branch 'for-linus2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/bird
2006-02-07[PATCH] powerpc: Thermal control for dual core G5sBenjamin Herrenschmidt
This patch adds a windfarm module, windfarm_pm112, for the dual core G5s (both 2 and 4 core models), keeping the machine from getting into vacuum-cleaner mode ;) For proper credits, the patch was initially written by Paul Mackerras, and slightly reworked by me to add overtemp handling among others. The patch also removes the sysfs attributes from windfarm_pm81 and windfarm_pm91 and instead adds code to the windfarm core to automagically expose attributes for sensor & controls. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-08[PATCH] __user annotations in powerpc thread_infoAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-08[PATCH] powerpc signal __user annotationsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>