summaryrefslogtreecommitdiffstats
path: root/arch/x86
AgeCommit message (Collapse)Author
2011-03-12x86: ioapic: Use new move_irq functionsThomas Gleixner
Use the functions which take irq_data. We already have a pointer to irq_data. That avoids a sparse irq lookup in move_*_irq. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-12x86: Use the proper accessors in fixup_irqs()Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-12x86: ioapic: Use irq_data->stateThomas Gleixner
Use the state information in irq_data. That avoids a radix-tree lookup from apic_ack_level() and simplifies setup_ioapic_dest(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-12x86: ioapic: Simplify irq chip and handler setupThomas Gleixner
Use pointers instead of ugly multiline if/else constructs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-12x86: Cleanup the genirq name spaceThomas Gleixner
genirq is switching to a consistent name space for the irq related functions. Convert x86. Conversion was done with coccinelle. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-12Merge branch 'x86/apic' into x86/irqThomas Gleixner
Reason: Update to latest genirq code conflicts with pending apic changes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-12x86-64, NUMA: Don't call numa_set_distanc() for all possible node ↵Tejun Heo
combinations during emulation The distance transforming in numa_emulation() used to call numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node combinations regardless of which are enabled. As numa_set_distance() ignores all out-of-bound distance settings, this doesn't cause any problem other than looping unnecessarily many times during boot. However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the code such that it iterates through only the enabled combinations. Yinghai Lu identified the issue and provided an initial patch to address the issue; however, the patch was incorrect in that it didn't build emulated distance table when there's no physical distance table and unnecessarily complex. http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Yinghai Lu <yinghai@kernel.org>
2011-03-12x86, binutils, xen: Fix another wrong size directiveAlexander van Heukelum
The latest binutils (2.21.0.20110302/Ubuntu) breaks the build yet another time, under CONFIG_XEN=y due to a .size directive that refers to a slightly differently named (hence, to the now very strict and unforgiving assembler, non-existent) symbol. [ mingo: This unnecessary build breakage caused by new binutils version 2.21 gets escallated back several kernel releases spanning several years of Linux history, affecting over 130,000 upstream kernel commits (!), on CONFIG_XEN=y 64-bit kernels (i.e. essentially affecting all major Linux distro kernel configs). Git annotate tells us that this slight debug symbol code mismatch bug has been introduced in 2008 in commit 3d75e1b8: 3d75e1b8 (Jeremy Fitzhardinge 2008-07-08 15:06:49 -0700 1231) ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs) The 'bug' is just a slight assymetry in ENTRY()/END() debug-symbols sequences, with lots of assembly code between the ENTRY() and the END(): ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs) ... END(do_hypervisor_callback) Human reviewers almost never catch such small mismatches, and binutils never even warned about it either. This new binutils version thus breaks the Xen build on all upstream kernels since v2.6.27, out of the blue. This makes a straightforward Git bisection of all 64-bit Xen-enabled kernels impossible on such binutils, for a bisection window of over hundred thousand historic commits. (!) This is a major fail on the side of binutils and binutils needs to turn this show-stopper build failure into a warning ASAP. ] Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jan Beulich <jbeulich@novell.com> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <kees.cook@canonical.com> LKML-Reference: <1299877178-26063-1-git-send-email-heukelum@fastmail.fm> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-11xen/e820: Don't mark balloon memory as E820_UNUSABLE when running as guest ↵Konrad Rzeszutek Wilk
and fix overflow. If we have a guest that asked for: memory=1024 maxmem=2048 Which means we want 1GB now, and create pagetables so that we can expand up to 2GB, we would have this E820 layout: [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) [ 0.000000] Xen: 0000000000100000 - 0000000080800000 (usable) Due to patch: "xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps." we would mark the memory past the 1GB mark as unusuable resulting in: [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable) [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved) [ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable) [ 0.000000] Xen: 0000000040000000 - 0000000080800000 (unusable) which meant that we could not balloon up anymore. We could balloon the guest down. The fix is to run the code introduced by the above mentioned patch only for the initial domain. We will have to revisit this once we start introducing a modified E820 for PCI passthrough so that we can utilize the P2M identity code. We also fix an overflow by having UL instead of ULL on 32-bit machines. [v2: Ian pointed to the overflow issue] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-11futex: Sanitize futex ops argument typesMichel Lespinasse
Change futex_atomic_op_inuser and futex_atomic_cmpxchg_inatomic prototypes to use u32 types for the futex as this is the data type the futex core code uses all over the place. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Darren Hart <darren@dvhart.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110311025058.GD26122@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-11futex: Sanitize cmpxchg_futex_value_locked APIMichel Lespinasse
The cmpxchg_futex_value_locked API was funny in that it returned either the original, user-exposed futex value OR an error code such as -EFAULT. This was confusing at best, and could be a source of livelocks in places that retry the cmpxchg_futex_value_locked after trying to fix the issue by running fault_in_user_writeable(). This change makes the cmpxchg_futex_value_locked API more similar to the get_futex_value_locked one, returning an error code and updating the original value through a reference argument. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile] Acked-by: Tony Luck <tony.luck@intel.com> [ia64] Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Simek <monstr@monstr.eu> [microblaze] Acked-by: David Howells <dhowells@redhat.com> [frv] Cc: Darren Hart <darren@dvhart.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110311024851.GC26122@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-11x86: Clean up apic.c and apic.hHenrik Kretzschmar
This patch moves some functions and variables into init sections, makes a function static and removes some lines of cruft. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <1299826956-8607-2-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-11x86: Remove superflous goal definition of tsc_syncHenrik Kretzschmar
The extra tsc_sync.o goal definition is superflous. CONFIG_X86_64_SMP depends on CONFIG_SMP and tsc_sync.o is already in the definition of CONFIG_SMP. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <1299826956-8607-1-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-10Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, UV: Initialize the broadcast assist unit base destination node id properly x86, numa: Fix numa_emulation code with memory-less node0 x86, build: Make sure mkpiggy fails on read error
2011-03-10xen: events: refactor GSI pirq bindings functionsIan Campbell
Following the example set by xen_allocate_pirq_msi and xen_bind_pirq_msi_to_irq: xen_allocate_pirq becomes xen_allocate_pirq_gsi and now only allocates a pirq number and does not bind it. xen_map_pirq_gsi becomes xen_bind_pirq_gsi_to_irq and binds an existing pirq. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: events: remove dom0 specific xen_create_msi_irqIan Campbell
The function name does not distinguish it from xen_allocate_pirq_msi (which operates on domU and pvhvm domains rather than dom0). Hoist domain 0 specific functionality up into the only caller leaving functionality common to all guest types in xen_bind_pirq_msi_to_irq. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: events: use xen_bind_pirq_msi_to_irq from xen_create_msi_irqIan Campbell
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: events: push set_irq_msi down into xen_create_msi_irqIan Campbell
Makes the tail end of this function look even more like xen_bind_pirq_msi_to_irq. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: events: separate MSI PIRQ allocation from PIRQ binding to IRQIan Campbell
Split the binding aspect of xen_allocate_pirq_msi out into a new xen_bind_pirq_to_irq function. In xen_hvm_setup_msi_irq when allocating a pirq write the MSI message to signal the PIRQ as soon as the pirq is obtained. There is no way to free the pirq back so if the subsequent binding to an IRQ fails we want to ensure that we will reuse the PIRQ next time rather than leak it. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: pci: collapse apic_register_gsi_xen_hvm and xen_hvm_register_pirqIan Campbell
apic_register_gsi_xen_hvm is a tiny wrapper around xen_hvm_register_pirq. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: events: return irq from xen_allocate_pirq_msiIan Campbell
consistent with other similar functions. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: events: drop XEN_ALLOC_IRQ flag to xen_allocate_pirq_msiIan Campbell
All callers pass this flag so it is pointless. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10xen: pci: only define xen_initdom_setup_msi_irqs if CONFIG_XEN_DOM0Ian Campbell
Fixes: CC arch/x86/pci/xen.o arch/x86/pci/xen.c:183: warning: 'xen_initdom_setup_msi_irqs' defined but not used Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-03-10Merge branch 'stable/pcifront-fixes' into stable/irq.cleanupKonrad Rzeszutek Wilk
* stable/pcifront-fixes: pci/xen: When free-ing MSI-X/MSI irq->desc also use generic code. pci/xen: Cleanup: convert int** to int[] pci/xen: Use xen_allocate_pirq_msi instead of xen_allocate_pirq xen-pcifront: Sanity check the MSI/MSI-X values xen-pcifront: don't use flush_scheduled_work()
2011-03-10Merge branch 'stable/irq.rework' into stable/irq.cleanupKonrad Rzeszutek Wilk
* stable/irq.rework: xen/irq: Cleanup up the pirq_to_irq for DomU PV PCI passthrough guests as well. xen: Use IRQF_FORCE_RESUME xen/timer: Missing IRQF_NO_SUSPEND in timer code broke suspend. xen: Fix compile error introduced by "switch to new irq_chip functions" xen: Switch to new irq_chip functions xen: Remove stale irq_chip.end xen: events: do not free legacy IRQs xen: events: allocate GSIs and dynamic IRQs from separate IRQ ranges. xen: events: add xen_allocate_irq_{dynamic, gsi} and xen_free_irq xen:events: move find_unbound_irq inside CONFIG_PCI_MSI xen: handled remapped IRQs when enabling a pcifront PCI device. genirq: Add IRQF_FORCE_RESUME
2011-03-10ftrace/graph: Trace function entry before updating indexSteven Rostedt
Currently the index to the ret_stack is updated and the real return address is saved in the ret_stack. Then we call the trace function. The trace function could decide that it doesn't want to trace this function (ex. set_graph_function does not match) and it will return 0 which means not to trace this call. The normal function graph tracer has this code: if (!(trace->depth || ftrace_graph_addr(trace->func)) || ftrace_graph_ignore_irqs()) return 0; What this states is, if the trace depth (which is curr_ret_stack) is zero (top of nested functions) then test if we want to trace this function. If this function is not to be traced, then return 0 and the rest of the function graph tracer logic will not trace this function. The problem arises when an interrupt comes in after we updated the curr_ret_stack. The next function that gets called will have a trace->depth of 1. Which fools this trace code into thinking that we are in a nested function, and that we should trace. This causes interrupts to be traced when they should not be. The solution is to trace the function first and then update the ret_stack. Reported-by: zhiping zhong <xzhong86@163.com> Reported-by: wu zhangjin <wuzhangjin@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-03-10tracing: Fix event alignment: kvm:kvm_hv_hypercallDavid Sharp
Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: David Sharp <dhsharp@google.com> LKML-Reference: <1291421609-14665-8-git-send-email-dhsharp@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-03-10x86/mm: Fix pgd_lock deadlockAndrea Arcangeli
It's forbidden to take the page_table_lock with the irq disabled or if there's contention the IPIs (for tlb flushes) sent with the page_table_lock held will never run leading to a deadlock. Nobody takes the pgd_lock from irq context so the _irqsave can be removed. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-10x86/mm: Handle mm_fault_error() in kernel spaceAndrey Vagin
mm_fault_error() should not execute oom-killer, if page fault occurs in kernel space. E.g. in copy_from_user()/copy_to_user(). This would happen if we find ourselves in OOM on a copy_to_user(), or a copy_from_user() which faults. Without this patch, the kernels hangs up in copy_from_user(), because OOM killer sends SIG_KILL to current process, but it can't handle a signal while in syscall, then the kernel returns to copy_from_user(), reexcute current command and provokes page_fault again. With this patch the kernel return -EFAULT from copy_from_user(). The code, which checks that page fault occurred in kernel space, has been copied from do_sigbus(). This situation is handled by the same way on powerpc, xtensa, tile, ... Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-09[CPUFREQ] pcc-cpufreq: don't load driver if get_freq fails during init.Naga Chumbalkar
Return 0 on failure. This will cause the initialization of the driver to fail and prevent the driver from loading if the BIOS cannot handle the PCC interface command to "get frequency". Otherwise, the driver will load and display a very high value like "4294967274" (which is actually -EINVAL) for frequency: # cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq 4294967274 Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> CC: stable@kernel.org Signed-off-by: Dave Jones <davej@redhat.com>
2011-03-09x86: Don't check for BIOS corruption in first 64K when there's no need toNaga Chumbalkar
Due to commit 781c5a67f152c17c3e4a9ed9647f8c0be6ea5ae9 it is likely that the number of areas to scan for BIOS corruption is 0 -- especially when the first 64K is already reserved (X86_RESERVE_LOW is 64K by default). If that's the case then don't set up the scan. Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Cc: <stable@kernel.org> LKML-Reference: <20110225202838.2229.71011.sendpatchset@nchumbalkar.americas.hpqcorp.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-09x86, UV: Initialize the broadcast assist unit base destination node id properlyCliff Wickman
The BAU's initialization of the broadcast description header is lacking the coherence domain (high bits) in the nasid. This causes a catastrophic system failure when running on a system with multiple coherence domains. Signed-off-by: Cliff Wickman <cpw@sgi.com> LKML-Reference: <E1PxKBB-0005F0-3U@eag09.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-09x86: Remove dead config option X86_CPUJan Beulich
This isn't being referenced anywhere, and the selects done from it can be easily done together with all the other X86 ones. v2: Also adjust UML's Kconfig.x86. Signed-off-by: Jan Beulich <jbeulich@novell.com> LKML-Reference: <4D7603DA02000078000351C1@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-09Merge commit 'v2.6.38-rc8' into x86/asmIngo Molnar
Merge reason: Update with the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-09x86: Fix binutils-2.21 symbol related build failuresSedat Dilek
New binutils version 2.21.0.20110302-1 started checking that the symbol parameter to the .size directive matches the entry name's symbol parameter, unearthing two mismatches: AS arch/x86/kernel/acpi/wakeup_rm.o arch/x86/kernel/acpi/wakeup_rm.S: Assembler messages: arch/x86/kernel/acpi/wakeup_rm.S:12: Error: .size expression with symbol `wakeup_code_start' does not evaluate to a constant arch/x86/kernel/entry_32.S: Assembler messages: arch/x86/kernel/entry_32.S:1421: Error: .size expression with symbol `apf_page_fault' does not evaluate to a constant The problem was discovered while using Debian's binutils (2.21.0.20110302-1) and experimenting with binutils from upstream. Thanks Alexander and H.J. for the vital help. Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Alexander van Heukelum <heukelum@fastmail.fm> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rjw@sisk.pl> LKML-Reference: <1299620364-21644-1-git-send-email-sedat.dilek@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-08kprobes: Disabling optimized kprobes for entry text sectionJiri Olsa
You can crash the kernel (with root/admin privileges) using kprobe tracer by running: echo "p system_call_after_swapgs" > ./kprobe_events echo 1 > ./events/kprobes/enable The reason is that at the system_call_after_swapgs label, the kernel stack is not set up. If optimized kprobes are enabled, the user space stack is being used in this case (see optimized kprobe template) and this might result in a crash. There are several places like this over the entry code (entry_$BIT). As it seems there's no any reasonable/maintainable way to disable only those places where the stack is not ready, I switched off the whole entry code from kprobe optimizing. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: acme@redhat.com Cc: fweisbec@gmail.com Cc: ananth@in.ibm.com Cc: davem@davemloft.net Cc: a.p.zijlstra@chello.nl Cc: eric.dumazet@gmail.com Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <1298298313-5980-3-git-send-email-jolsa@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-08x86: Separate out entry text sectionJiri Olsa
Put x86 entry code into a separate link section: .entry.text. Separating the entry text section seems to have performance benefits - caused by more efficient instruction cache usage. Running hackbench with perf stat --repeat showed that the change compresses the icache footprint. The icache load miss rate went down by about 15%: before patch: 19417627 L1-icache-load-misses ( +- 0.147% ) after patch: 16490788 L1-icache-load-misses ( +- 0.180% ) The motivation of the patch was to fix a particular kprobes bug that relates to the entry text section, the performance advantage was discovered accidentally. Whole perf output follows: - results for current tip tree: Performance counter stats for './hackbench/hackbench 10' (500 runs): 19417627 L1-icache-load-misses ( +- 0.147% ) 2676914223 instructions # 0.497 IPC ( +- 0.079% ) 5389516026 cycles ( +- 0.144% ) 0.206267711 seconds time elapsed ( +- 0.138% ) - results for current tip tree with the patch applied: Performance counter stats for './hackbench/hackbench 10' (500 runs): 16490788 L1-icache-load-misses ( +- 0.180% ) 2717734941 instructions # 0.502 IPC ( +- 0.079% ) 5414756975 cycles ( +- 0.148% ) 0.206747566 seconds time elapsed ( +- 0.137% ) Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: masami.hiramatsu.pt@hitachi.com Cc: ananth@in.ibm.com Cc: davem@davemloft.net Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-08Merge commit 'v2.6.38-rc8' into perf/coreIngo Molnar
Merge reason: Merge latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-08KEYS: Add an iovec version of KEYCTL_INSTANTIATEDavid Howells
Add a keyctl op (KEYCTL_INSTANTIATE_IOV) that is like KEYCTL_INSTANTIATE, but takes an iovec array and concatenates the data in-kernel into one buffer. Since the KEYCTL_INSTANTIATE copies the data anyway, this isn't too much of a problem. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-03-05x86: Really print supported CPUs if PROCESSOR_SELECT=yJan Beulich
I'm sure it was a mere oversight that the CONFIG_ prefixes are missing. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Dave Jones <davej@redhat.com> LKML-Reference: <4D7118D30200007800034F79@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-05Merge branch 'x86-mm' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into x86/mm
2011-03-05perf: Avoid the percore allocations if the CPU is not HT capableLin Ming
Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1299119690-13991-5-git-send-email-ming.m.lin@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()Tejun Heo
Undetermined entries in emu_nid_to_phys[] are filled with zero assuming that physical node 0 is always online; however, this might not be true depending on hardware configuration. Find a physical node which is actually online and use it instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
2011-03-04x86, numa: Fix numa_emulation code with memory-less node0Yinghai Lu
This crash happens on a system that does not have RAM on node0. When numa_emulation is compiled in, and: 1. we boot the system without numa=fake... 2. or we boot the system with numa=fake=128 to make emulation fail we will get: [ 0.076025] ------------[ cut here ]------------ [ 0.080004] kernel BUG at arch/x86/mm/numa_64.c:788! [ 0.080004] invalid opcode: 0000 [#1] SMP [...] need to use early_cpu_to_node() directly, because cpu_to_apicid and apicid_to_node will return node0 that is not onlined. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> LKML-Reference: <4D6ECF72.5010308@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04x86-64, NUMA: Clean up initmem_init()David Rientjes
This patch cleans initmem_init() so that it is more readable and doesn't use an unnecessary array of function pointers to convolute the flow of the code. It also makes it obvious that dummy_numa_init() will always succeed (and documents that requirement) so that the existing BUG() is never actually reached. No functional change. -tj: Updated comment for dummy_numa_init() slightly. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04x86-64, NUMA: Fix numa_emulation code with node0 without RAMYinghai Lu
On one system that does not have RAM on node0. When numa_emulation is compiled in, and 1. boot system without numa=fake... 2. or boot system with numa=fake=128 to make emulation fail will get: [ 0.092026] ------------[ cut here ]------------ [ 0.096005] kernel BUG at arch/x86/mm/numa_emulation.c:439! [ 0.096005] invalid opcode: 0000 [#1] SMP [ 0.096005] last sysfs file: [ 0.096005] CPU 0 [ 0.096005] Modules linked in: [ 0.096005] [ 0.096005] Pid: 0, comm: swapper Not tainted 2.6.38-rc6-tip-yh-03869-gcb0491d-dirty #684 Sun Microsystems Sun Fire X4240/Sun Fire X4240 [ 0.096005] RIP: 0010:[<ffffffff81cdc65b>] [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf [ 0.096005] RSP: 0000:ffffffff82437ed8 EFLAGS: 00010246 ... [ 0.096005] Call Trace: [ 0.096005] [<ffffffff81cd7931>] identify_cpu+0x2d7/0x2df [ 0.096005] [<ffffffff827e54fa>] identify_boot_cpu+0x10/0x30 [ 0.096005] [<ffffffff827e5704>] check_bugs+0x9/0x2d [ 0.096005] [<ffffffff827dceda>] start_kernel+0x3d7/0x3f1 [ 0.096005] [<ffffffff827dc2cc>] x86_64_start_reservations+0x9c/0xa0 [ 0.096005] [<ffffffff827dc4ad>] x86_64_start_kernel+0x1dd/0x1e8 [ 0.096005] Code: 74 06 48 8d 04 90 eb 0f 48 c7 c0 30 d9 00 00 48 03 04 d5 90 0f 60 82 8b 00 83 f8 ff 74 0d 0f a3 05 8b 7e 92 00 19 d2 85 d2 75 02 <0f> 0b 48 98 be 00 01 00 00 48 c7 c7 e0 44 60 82 44 8b 2c 85 e0 [ 0.096005] RIP [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf [ 0.096005] RSP <ffffffff82437ed8> [ 0.096026] ---[ end trace a7919e7f17c0a725 ]--- We need to use early_cpu_to_node() directly, because numa_cpu_node() will return node0 that is not onlined. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04perf: Fix LLC-* events on Intel Nehalem/WestmereAndi Kleen
On Intel Nehalem and Westmere CPUs the generic perf LLC-* events count the L2 caches, not the real L3 LLC - this was inconsistent with behavior on other CPUs. Fixing this requires the use of the special OFFCORE_RESPONSE events which need a separate mask register. This has been implemented by the previous patch, now use this infrastructure to set correct events for the LLC-* on Nehalem and Westmere. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1299119690-13991-3-git-send-email-ming.m.lin@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04perf: Add support for supplementary event registersAndi Kleen
Change logs against Andi's original version: - Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra) - Fixed a major event scheduling issue. There cannot be a ref++ on an event that has already done ref++ once and without calling put_constraint() in between. (Stephane Eranian) - Use thread_cpumask for percore allocation. (Lin Ming) - Use MSR names in the extra reg lists. (Lin Ming) - Remove redundant "c = NULL" in intel_percore_constraints - Fix comment of perf_event_attr::config1 Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event that can be used to monitor any offcore accesses from a core. This is a very useful event for various tunings, and it's also needed to implement the generic LLC-* events correctly. Unfortunately this event requires programming a mask in a separate register. And worse this separate register is per core, not per CPU thread. This patch: - Teaches perf_events that OFFCORE_RESPONSE needs extra parameters. The extra parameters are passed by user space in the perf_event_attr::config1 field. - Adds support to the Intel perf_event core to schedule per core resources. This adds fairly generic infrastructure that can be also used for other per core resources. The basic code has is patterned after the similar AMD northbridge constraints code. Thanks to Stephane Eranian who pointed out some problems in the original version and suggested improvements. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04perf_events: Update PEBS event constraintsStephane Eranian
This patch updates PEBS event constraints for Intel Atom, Nehalem, Westmere. This patch also reorganizes the PEBS format/constraint detection code. It is now based on processor model and not PEBS format. Two processors may use the same PEBS format without have the same list of PEBS events. In this second version, we simplified the initialization of the PEBS constraints by leveraging the existing switch() statement in perf_event_intel.c. We also renamed the constraint tables to be more consistent with regular constraints. In this 3rd version, we drop BR_INST_RETIRED.MISPRED from Intel Atom as it does not seem to work. Use MISPREDICTED_BRANCH_RETIRED instead. Also add FP_ASSIST.* o both Intel Nehalem and Westmere. I misssed those in the earlier patches. Events were tested using libpfm4 perf_examples. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d6e6b02.815bdf0a.637b.07a7@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04Merge branch 'perf/urgent' into perf/coreIngo Molnar
Merge reason: Pick up updates before queueing up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>