summaryrefslogtreecommitdiffstats
path: root/arch
AgeCommit message (Collapse)Author
2014-09-24powerpc: Set CONFIG_NET=y in defconfigsMichal Marek
Commit 5d6be6a5 ("scsi_netlink : Make SCSI_NETLINK dependent on NET instead of selecting NET") removed what happened to be the only instance of 'select NET'. Defconfigs that were relying on the select now lack networking support. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-24parisc: Set CONFIG_NET=y in defconfigsMichal Marek
Commit 5d6be6a5 ("scsi_netlink : Make SCSI_NETLINK dependent on NET instead of selecting NET") removed what happened to be the only instance of 'select NET'. Defconfigs that were relying on the select now lack networking support. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: linux-parisc@vger.kernel.org Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-24mips: Set CONFIG_NET=y in defconfigsMichal Marek
Commit 5d6be6a5 ("scsi_netlink : Make SCSI_NETLINK dependent on NET instead of selecting NET") removed what happened to be the only instance of 'select NET'. Defconfigs that were relying on the select now lack networking support. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: linux-mips@linux-mips.org Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-24Merge tag 'mvebu-soc-3.18' of git://git.infradead.org/linux-mvebu into ↵Olof Johansson
next/cleanup Merge "ARM: mvebu: SoC changes for v3.18" from Jason Cooper: mvebu SoC changes for v3.18 - orion5x - remove pr_warning(), use pr_warn() * tag 'mvebu-soc-3.18' of git://git.infradead.org/linux-mvebu: ARM: orion5x: Convert pr_warning to pr_warn Signed-off-by: Olof Johansson <olof@lixom.net>
2014-09-24ARM: hisi: Fix platmcpm compilation when ARMv6 is selectedWei Xu
When compiling with "ARCH=arm" and "allmodconfig", with commit: 9cdc99919a95e8b54c1998b65bb1bfdabd47d27b [2/7] ARM: hisi: enable MCPM implementation we will get: /tmp/cc6DjYjT.s: Assembler messages: /tmp/cc6DjYjT.s:63: Error: selected processor does not support ARM mode `ubfx r1,r0,#8,#8' /tmp/cc6DjYjT.s:761: Error: selected processor does not support ARM mode `isb ' /tmp/cc6DjYjT.s:762: Error: selected processor does not support ARM mode `dsb ' /tmp/cc6DjYjT.s:769: Error: selected processor does not support ARM mode `isb ' /tmp/cc6DjYjT.s:775: Error: selected processor does not support ARM mode `isb ' /tmp/cc6DjYjT.s:776: Error: selected processor does not support ARM mode `dsb ' /tmp/cc6DjYjT.s:795: Error: selected processor does not support ARM mode `isb ' /tmp/cc6DjYjT.s:801: Error: selected processor does not support ARM mode `isb ' /tmp/cc6DjYjT.s:802: Error: selected processor does not support ARM mode `dsb ' Fix platmcpm compilation when ARMv6 is selected. Signed-off-by: Wei Xu <xuwei5@hisilicon.com> Signed-off-by: Olof Johansson <olof@lixom.net>
2014-09-24Merge branch 'for-linus' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
2014-09-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds
Pull crypto fixes from Herbert Xu: "This fixes three issues: - if ccp is loaded on a machine without ccp, it will incorrectly activate causing all requests to fail. Fixed by preventing ccp from loading if hardware isn't available. - not all IRQs were enabled for the qat driver, leading to potential stalls when it is used - disabled buggy AVX CTR implementation in aesni" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: aesni - disable "by8" AVX CTR optimization crypto: ccp - Check for CCP before registering crypto algs crypto: qat - Enable all 32 IRQs
2014-09-24ARM: at91/PMC: don't forget to write PMC_PCDR register to disable clocksLudovic Desroches
When introducing support for sama5d3, the write to PMC_PCDR register has been accidentally removed. Reported-by: Nathalie Cyrille <nathalie.cyrille@atmel.com> Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: <stable@vger.kernel.org> # 3.10.x and later
2014-09-24ARM: at91: fix at91sam9263ek DT mmc pinmuxing settingsAndreas Henriksson
As discovered on a custom board similar to at91sam9263ek and basing its devicetree on that one apparently the pin muxing doesn't get set up properly. This was discovered since the custom boards u-boot does funky stuff with the pin muxing and leaved it set to SPI which made the MMC driver not work under Linux. The fix is simply to define the given configuration as the default. This probably worked by pure luck before, but it's better to make the muxing explicitly set. Signed-off-by: Andreas Henriksson <andreas.henriksson@endian.se> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: <stable@vger.kernel.org> # 3.11+
2014-09-24x86/relocs: Make per_cpu_load_addr staticBen Hutchings
per_cpu_load_addr is only used for 64-bit relocations, but is declared in both configurations of relocs.c - with different types. This has undefined behaviour in general. GNU ld is documented to use the larger size in this case, but other tools may differ and some warn about this. References: https://bugs.debian.org/748577 Reported-by: Michael Tautschnig <mt@debian.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: 748577@bugs.debian.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411561812.3659.23.camel@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/lib/Makefile: Remove the unnecessary "+= thunk_64.o"Oleg Nesterov
Trivial. We have "lib-y += thunk_$(BITS).o" at the start, no need to add thunk_64.o if !CONFIG_X86_32. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921184232.GB23727@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86: Speed up ___preempt_schedule*() by using THUNK helpersOleg Nesterov
___preempt_schedule() does SAVE_ALL/RESTORE_ALL but this is suboptimal, we do not need to save/restore the callee-saved register. And we already have arch/x86/lib/thunk_*.S which implements the similar asm wrappers, so it makes sense to redefine ___preempt_schedule() as "THUNK ..." and remove preempt.S altogether. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921184153.GA23727@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24crypto: aesni - disable "by8" AVX CTR optimizationMathias Krause
The "by8" implementation introduced in commit 22cddcc7df8f ("crypto: aes - AES CTR x86_64 "by8" AVX optimization") is failing crypto tests as it handles counter block overflows differently. It only accounts the right most 32 bit as a counter -- not the whole block as all other implementations do. This makes it fail the cryptomgr test #4 that specifically tests this corner case. As we're quite late in the release cycle, just disable the "by8" variant for now. Reported-by: Romain Francoise <romain@orebokech.com> Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Chandramouli Narayanan <mouli@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-09-24sched: Fix unreleased llc_shared_mask bit during CPU hotplugWanpeng Li
The following bug can be triggered by hot adding and removing a large number of xen domain0's vcpus repeatedly: BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [..] find_busiest_group PGD 5a9d5067 PUD 13067 PMD 0 Oops: 0000 [#3] SMP [...] Call Trace: load_balance ? _raw_spin_unlock_irqrestore idle_balance __schedule schedule schedule_timeout ? lock_timer_base schedule_timeout_uninterruptible msleep lock_device_hotplug_sysfs online_store dev_attr_store sysfs_write_file vfs_write SyS_write system_call_fastpath Last level cache shared mask is built during CPU up and the build_sched_domain() routine takes advantage of it to setup the sched domain CPU topology. However, llc_shared_mask is not released during CPU disable, which leads to an invalid sched domainCPU topology. This patch fix it by releasing the llc_shared_mask correctly during CPU disable. Yasuaki also reported that this can happen on real hardware: https://lkml.org/lkml/2014/7/22/1018 His case is here: == Here is an example on my system. My system has 4 sockets and each socket has 15 cores and HT is enabled. In this case, each core of sockes is numbered as follows: | CPU# Socket#0 | 0-14 , 60-74 Socket#1 | 15-29, 75-89 Socket#2 | 30-44, 90-104 Socket#3 | 45-59, 105-119 Then llc_shared_mask of CPU#30 has 0x3fff80000001fffc0000000. It means that last level cache of Socket#2 is shared with CPU#30-44 and 90-104. When hot-removing socket#2 and #3, each core of sockets is numbered as follows: | CPU# Socket#0 | 0-14 , 60-74 Socket#1 | 15-29, 75-89 But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30 remains having 0x3fff80000001fffc0000000. After that, when hot-adding socket#2 and #3, each core of sockets is numbered as follows: | CPU# Socket#0 | 0-14 , 60-74 Socket#1 | 15-29, 75-89 Socket#2 | 30-59 Socket#3 | 90-119 Then llc_shared_mask of CPU#30 becomes 0x3fff8000fffffffc0000000. It means that last level cache of Socket#2 is shared with CPU#30-59 and 90-104. So the mask has the wrong value. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Tested-by: Linn Crosetto <linn@hp.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Toshi Kani <toshi.kani@hp.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 insteadBryan O'Donoghue
Quark x1000 advertises PGE via the standard CPUID method PGE bits exist in Quark X1000's PTEs. In order to flush an individual PTE it is necessary to reload CR3 irrespective of the PTE.PGE bit. See Quark Core_DevMan_001.pdf section 6.4.11 This bug was fixed in Galileo kernels, unfixed vanilla kernels are expected to crash and burn on this platform. Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: Borislav Petkov <bp@alien8.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1411514784-14885-1-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline ↵Lan Tianyu
during S3 With certain kernel configurations, CPU offline consumes more than 100ms during S3. It's a timing related issue: native_cpu_die() would occasionally fall into a 100ms sleep when the CPU idle loop thread marked the CPU state to DEAD too slowly. What native_cpu_die() does is that it polls the CPU state and waits for 100ms if CPU state hasn't been marked to DEAD. The 100ms sleep doesn't make sense and is purely historic. To avoid such long sleeping, this patch adds a 'struct completion' to each CPU, waits for the completion in native_cpu_die() and wakes up the completion when the CPU state is marked to DEAD. Tested on an Intel Xeon server with 48 cores, Ivybridge and on Haswell laptops. The CPU offlining cost on these machines is reduced from more than 100ms to less than 5ms. The system suspend time is reduced by 2.3s on the servers. Borislav and Prarit also helped to test the patch on an AMD machine and a few systems of various sizes and configurations (multi-socket, single-socket, no hyper threading, etc.). No issues were seen. Tested-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: srostedt@redhat.com Cc: toshi.kani@hp.com Cc: imammedo@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409039025-32310-1-git-send-email-tianyu.lan@intel.com [ Improved a few minor details in the code, cleaned up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel/uncore: Update support for client uncore IMC PMUStephane Eranian
This patch restructures the memory controller (IMC) uncore PMU support for client SNB/IVB/HSW processors. The main change is that it can now cope with more than one PCI device ID per processor model. There are many flavors of memory controllers for each processor. They have different PCI device ID, yet they behave the same w.r.t. the memory controller PMU that we are interested in. The patch now supports two distinct memory controllers for IVB processors: one for mobile, one for desktop. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140917090616.GA11281@quad Cc: ak@linux.intel.com Cc: kan.liang@intel.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel/uncore: Fix PCU filter setup for Sandy/Ivy/Haswell EPAndi Kleen
The PCU frequency band filters use 8 bit each in a register. When setting up the value the shift value was not correctly scaled, which resulted in all filters except for band 0 to be zero. Fix the scaling. This allows to correctly monitor multiple uncore frequency bands. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1409872109-31645-5-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel/uncore: Add missing cbox filter flags on IvyBridge-EP uncore ↵Andi Kleen
driver The IvyBridge-EP uncore driver was missing three filter flags: NC, ISOC, C6 which are useful in some cases. Support them in the same way as the Haswell EP driver, by allowing to set them and exposing them in the sysfs formats. Also fix a typo in a define. Relies on the Haswell EP driver to be applied earlier. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1409872109-31645-4-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel/uncore: Register the PMU only if the uncore pci device existsYan, Zheng
Current code registers PMUs for all possible uncore pci devices. This is not good because, on some machines, one or more uncore pci devices can be missing. The missing pci device make corresponding PMU unusable. Register the PMU only if the uncore device exists. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1409872109-31645-3-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel/uncore: Add Haswell-EP uncore supportYan, Zheng
The uncore subsystem in Haswell-EP is similar to Sandy/Ivy Bridge-EP. There are some differences in config register encoding and pci device IDs. The Haswell-EP uncore also supports a few new events. Add the Haswell-EP driver to the snbep split driver. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> [ Add missing break. Add imc events. Add cbox nc/isoc/c6. ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1409872109-31645-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel: Use Broadwell cache event list for HaswellAndi Kleen
Use the newly added Broadwell cache event list for Haswell too. All Haswell and Broadwell events and offcore masks used in these lists are identical. However Haswell is very different from the Sandy Bridge list that was used previously. That fixes a wide range of mis-counting cache events. The node events are now only for retired memory events, so prefetching and speculative memory accesses are not included. They are PEBS capable now, which makes it much easier to sample for them, plus it's possible to create address maps with -d. The prefetch events are gone now. They way the hardware counts them is very misleading (some prefetches included, others not), so it seemed best to leave them out. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1409683455-29168-5-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86: Add INST_RETIRED.ALL workaroundsAndi Kleen
On Broadwell INST_RETIRED.ALL cannot be used with any period that doesn't have the lowest 6 bits cleared. And the period should not be smaller than 128. Add a new callback to enforce this, and set it for Broadwell. This is erratum BDM57 and BDM11. How does this handle the case when an app requests a specific period with some of the bottom bits set The apps thinks it is sampling at X occurences per sample, when it is in fact at X - 63 (worst case). Short answer: Any useful instruction sampling period needs to be 4-6 orders of magnitude larger than 128, as an PMI every 128 instructions would instantly overwhelm the system and be throttled. So the +-64 error from this is really small compared to the period, much smaller than normal system jitter. Long answer: <write up by Peter:> IFF we guarantee perf_event_attr::sample_period >= 128. Suppose we start out with sample_period=192; then we'll set period_left to 192, we'll end up with left = 128 (we truncate the lower bits). We get an interrupt, find that period_left = 64 (>0 so we return 0 and don't get an overflow handler), up that to 128. Then we trigger again, at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get an overflow). We increment with sample_period so we get left = 128. We fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an overflow). And on and on. So while the individual interrupts are 'wrong' we get then with interval=256,128 in exactly the right ratio to average out at 192. And this works for everything >=128. So the num_samples*fixed_period thing is still entirely correct +- 127, which is good enough I'd say, as you already have that error anyhow. So no need to 'fix' the tools, al we need to do is refuse to create INST_RETIRED:ALL events with sample_period < 128. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Mark Davies <junk@eslaf.co.uk> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1409683455-29168-4-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel: Add Broadwell core supportAndi Kleen
Add Broadwell support for Broadwell Client to perf. This is very similar to Haswell. It uses a new cache event table, because there were various changes there. The constraint list has one new event that needs to be handled over Haswell. The PEBS event list is the same, so we reuse Haswell's. [fengguang.wu: make intel_bdw_event_constraints[] static] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1409683455-29168-3-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel: Document all Haswell modelsAndi Kleen
Add names for each Haswell model as requested by Peter. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1409683455-29168-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24perf/x86/intel: Remove incorrect model number from Haswell perfAndi Kleen
71 is a Broadwell, not a Haswell. The model number was added by mistake earlier. Remove it for now, until it can be re-added later with real Broadwell support. In practice it does not cause a lot of issues because the Broadwell PMU is very similar to Haswell, but some details were wrong, and it's better to handle it correctly. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Link: http://lkml.kernel.org/r/1409683455-29168-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86, sched: Add new topology for multi-NUMA-node CPUsDave Hansen
I'm getting the spew below when booting with Haswell (Xeon E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled in the BIOS. It seems similar to the issue that some folks from AMD ran in to on their systems and addressed in this commit: 161270fc1f9d ("x86/smp: Fix topology checks on AMD MCM CPUs") Both these Intel and AMD systems break an assumption which is being enforced by topology_sane(): a socket may not contain more than one NUMA node. AMD special-cased their system by looking for a cpuid flag. The Intel mode is dependent on BIOS options and I do not know of a way which it is enumerated other than the tables being parsed during the CPU bringup process. In other words, we have to trust the ACPI tables <shudder>. This detects the situation where a NUMA node occurs at a place in the middle of the "CPU" sched domains. It replaces the default topology with one that relies on the NUMA information from the firmware (SRAT table) for all levels of sched domains above the hyperthreads. This also fixes a sysfs bug. We used to freak out when we saw the "mc" group cross a node boundary, so we stopped building the MC group. MC gets exported as the 'core_siblings_list' in /sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with the same 'physical_package_id' to not be listed together in 'core_siblings_list'. This violates a statement from Documentation/ABI/testing/sysfs-devices-system-cpu: core_siblings: internal kernel map of cpu#'s hardware threads within the same physical_package_id. core_siblings_list: human-readable list of the logical CPU numbers within the same physical_package_id as cpu#. The sysfs effects here cause an issue with the hwloc tool where it gets confused and thinks there are more sockets than are physically present. Before this patch, there are two packages: # cd /sys/devices/system/cpu/ # cat cpu*/topology/physical_package_id | sort | uniq -c 18 0 18 1 But 4 _sets_ of core siblings: # cat cpu*/topology/core_siblings_list | sort | uniq -c 9 0-8 9 18-26 9 27-35 9 9-17 After this set, there are only 2 sets of core siblings, which is what we expect for a 2-socket system. # cat cpu*/topology/physical_package_id | sort | uniq -c 18 0 18 1 # cat cpu*/topology/core_siblings_list | sort | uniq -c 18 0-17 18 18-35 Example spew: ... NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. #2 #3 #4 #5 #6 #7 #8 .... node #1, CPUs: #9 ------------[ cut here ]------------ WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90() sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. Modules linked in: CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631 Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014 0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48 ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009 ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98 Call Trace: [<ffffffff8172e485>] dump_stack+0x45/0x56 [<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0 [<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50 [<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90 [<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0 [<ffffffff8107568d>] start_secondary+0x1ad/0x240 ---[ end trace 3fe5f587a9fcde61 ]--- #10 #11 #12 #13 #14 #15 #16 #17 .... node #2, CPUs: #18 #19 #20 #21 #22 #23 #24 #25 #26 .... node #3, CPUs: #27 #28 #29 #30 #31 #32 #33 #34 #35 Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> [ Added LLC domain and s/match_mc/match_die/ ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: brice.goglin@gmail.com Cc: "H. Peter Anvin" <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSWPeter Zijlstra
Kirill found that there's a subtle race in the __ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the entire exception because neither arch that uses it seems to actually still require it. Boot tested on mips64el (qemu) only. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Qais Yousef <qais.yousef@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Tony Luck <tony.luck@intel.com> Cc: oleg@redhat.com Cc: linux@roeck-us.net Cc: linux-ia64@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/PCI: Mark PCI BIOS initialization code as suchMathias Krause
The pci_find_bios() function is only ever called from initialization code, therefore can be marked as such, too. This, in turn, allows marking other functions called only in this context as well. The bios32_indirect variable can be marked as __initdata as it is only referenced from __init functions now. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/PCI: Constify pci_mmcfg_probes[] arrayMathias Krause
The pci_mmcfg_probes[] array is only ever read, therefore make it const. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/PCI: Mark constants of pci_mmcfg_nvidia_mcp55() as __initconstMathias Krause
The constants in pci_mmcfg_nvidia_mcp55() need to be marked as __initconst or they will remain in memory after init memory was released. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24x86/PCI: Move __init annotation to the correct placeMathias Krause
According to include/linux/init.h, the __init annotation should be added immediately before the function name. However, for quite a few functions in mmconfig-shared.c this is not the case. It's either before the return type or even in the middle of it. Beside gcc still getting it right, we should change them to comply to the rules of include/linux/init.h. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2014-09-24kvm: x86: Unpin and remove kvm_arch->apic_access_pageTang Chen
In order to make the APIC access page migratable, stop pinning it in memory. And because the APIC access page is not pinned in memory, we can remove kvm_arch->apic_access_page. When we need to write its physical address into vmcs, we use gfn_to_page() to get its page struct, which is needed to call page_to_phys(); the page is then immediately unpinned. Suggested-by: Gleb Natapov <gleb@kernel.org> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: vmx: Implement set_apic_access_page_addrTang Chen
Currently, the APIC access page is pinned by KVM for the entire life of the guest. We want to make it migratable in order to make memory hot-unplug available for machines that run KVM. This patch prepares to handle this for the case where there is no nested virtualization, or where the nested guest does not have an APIC page of its own. All accesses to kvm->arch.apic_access_page are changed to go through kvm_vcpu_reload_apic_access_page. If the APIC access page is invalidated when the host is running, we update the VMCS in the next guest entry. If it is invalidated when the guest is running, the MMU notifier will force an exit, after which we will handle everything as in the previous case. If it is invalidated when a nested guest is running, the request will update either the VMCS01 or the VMCS02. Updating the VMCS01 is done at the next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: x86: Add request bit to reload APIC access page addressTang Chen
Currently, the APIC access page is pinned by KVM for the entire life of the guest. We want to make it migratable in order to make memory hot-unplug available for machines that run KVM. This patch prepares to handle this in generic code, through a new request bit (that will be set by the MMU notifier) and a new hook that is called whenever the request bit is processed. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: Add arch specific mmu notifier for page invalidationTang Chen
This will be used to let the guest run while the APIC access page is not pinned. Because subsequent patches will fill in the function for x86, place the (still empty) x86 implementation in the x86.c file instead of adding an inline function in kvm_host.h. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: Fix page ageing bugsAndres Lagar-Cavilla
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young. 2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched. 3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm/x86/mmu: Pass gfn and level to rmapp callback.Andres Lagar-Cavilla
Callbacks don't have to do extra computation to learn what the caller (lvm_handle_hva_range()) knows very well. Useful for debugging/tracing/printk/future. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-onlyPaolo Bonzini
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL instructions to VMMCALL as required on AMD processors. The failure mode is currently a divide-by-zero exception, which obviously is a KVM bug that has to be fixed. However, picking the right instruction between VMCALL and VMMCALL will be faster and will help if you cannot upgrade the hypervisor. Reported-by: Chris Webb <chris@arachsys.com> Tested-by: Chris Webb <chris@arachsys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: x86: use macros to compute bank MSRsChen Yucong
Avoid open coded calculations for bank MSRs by using well-defined macros that hide the index of higher bank MSRs. No semantic changes. Signed-off-by: Chen Yucong <slaoub@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: x86: Remove debug assertion of non-PAE reserved bitsNadav Amit
Commit 346874c9507a ("KVM: x86: Fix CR3 reserved bits") removed non-PAE reserved bits which were not according to Intel SDM. However, residue was left in a debug assertion (CR3_NONPAE_RESERVED_BITS). Remove it. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: x86: fix two typos in commentTiejun Chen
s/drity/dirty and s/vmsc01/vmcs01 Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: vmx: Inject #GP on invalid PAT CRNadav Amit
Guest which sets the PAT CR to invalid value should get a #GP. Currently, if vmx supports loading PAT CR during entry, then the value is not checked. This patch makes the required check in that case. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: x86: emulating descriptor load misses long-mode caseNadav Amit
In 64-bit mode a #GP should be delivered to the guest "if the code segment descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit set and the D-bit clear." - Intel SDM "Interrupt 13—General Protection Exception (#GP)". This patch fixes the behavior of CS loading emulation code. Although the comment says that segment loading is not supported in long mode, this function is executed in long mode, so the fix is necassary. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: x86: directly use kvm_make_request againLiang Chen
A one-line wrapper around kvm_make_request is not particularly useful. Replace kvm_mmu_flush_tlb() with kvm_make_request(). Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: x86: count actual tlb flushesRadim Krčmář
- we count KVM_REQ_TLB_FLUSH requests, not actual flushes (KVM can have multiple requests for one flush) - flushes from kvm_flush_remote_tlbs aren't counted - it's easy to make a direct request by mistake Solve these by postponing the counting to kvm_check_request(). Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: nested VMX: disable perf cpuid reportingMarcelo Tosatti
Initilization of L2 guest with -cpu host, on L1 guest with -cpu host triggers: (qemu) KVM: entry failed, hardware error 0x7 ... nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported Nested VMX MSR load/store support is not sufficient to allow perf for L2 guest. Until properly fixed, trap CPUID and disable function 0xA. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: x86: Don't report guest userspace emulation error to userspaceNadav Amit
Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to user-space") disabled the reporting of L2 (nested guest) emulation failures to userspace due to race-condition between a vmexit and the instruction emulator. The same rational applies also to userspace applications that are permitted by the guest OS to access MMIO area or perform PIO. This patch extends the current behavior - of injecting a #UD instead of reporting it to userspace - also for guest userspace code. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24kvm: Make init_rmode_tss() return 0 on success.Paolo Bonzini
In init_rmode_tss(), there two variables indicating the return value, r and ret, and it return 0 on error, 1 on success. The function is only called by vmx_set_tss_addr(), and ret is redundant. This patch removes the redundant variable, by making init_rmode_tss() return 0 on success, -errno on failure. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24KVM: x86: Warn if guest virtual address space is not 48-bitsNadav Amit
The KVM emulator code assumes that the guest virtual address space (in 64-bit) is 48-bits wide. Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctl if userspace tries to create a guest that does not obey this restriction. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>