summaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2012-03-08KVM: Ignore the writes to MSR_K7_HWCR(3)Nicolae Mogoreanu
When CPUID Fn8000_0001_EAX reports 0x00100f22 Windows 7 x64 guest tries to set bit 3 in MSRC001_0015 in nt!KiDisableCacheErrataSource and fails. This patch will ignore this step and allow things to move on without having to fake CPUID value. Signed-off-by: Nicolae Mogoreanu <mogoreanu@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: MMU: make use of ->root_level in reset_rsvds_bits_maskDavidlohr Bueso
The reset_rsvds_bits_mask() function can use the guest walker's root level number instead of using a separate 'level' variable. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: PMU: add proper support for fixed counter 2Gleb Natapov
Currently pmu emulation emulates fixed counter 2 as bus cycles architectural counter, but since commit 9c1497ea591b25d perf has pseudo encoding for it. Use it. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: PMU: Fix raw event checkGleb Natapov
If eventsel has EDGE, INV or CMASK set we should create raw counter for it, but the check is done on a wrong variable. Fix it. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: PMU: warn when pin control is set in eventsel msrGleb Natapov
Print warning once if pin control bit is set in eventsel msr since emulation does not support it yet. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: VMX: Fix delayed load of shared MSRsAvi Kivity
Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs and in the CPU registers, but vmx_set_msr() only updated memory. Prior to 46199f33c2953, this didn't matter, since we called vmx_load_host_state(), which scheduled a vmx_save_host_state(), which re-synchronized the CPU state, but now we don't, so the CPU state will not be synchronized until the next exit to host userspace. This mostly affects nested vmx workloads, which play with these MSRs a lot. Fix by loading the MSR eagerly. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow host IRQ sharing for assigned PCI 2.3 devicesJan Kiszka
PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Ensure all vcpus are consistent with in-kernel irqchip settingsAvi Kivity
If some vcpus are created before KVM_CREATE_IRQCHIP, then irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading to potential NULL pointer dereferences. Fix by: - ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called - ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP This is somewhat long winded because vcpu->arch.apic is created without kvm->lock held. Based on earlier patch by Michael Ellerman. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: Allow PM/VM86 switch during task switchKevin Wolf
Task switches can switch between Protected Mode and VM86. The current mode must be updated during the task switch emulation so that the new segment selectors are interpreted correctly. In order to let privilege checks succeed, rflags needs to be updated in the vcpu struct as this causes a CPL update. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: SVM: Fix CPL updatesKevin Wolf
Keep CPL at 0 in real mode and at 3 in VM86. In protected/long mode, use RPL rather than DPL of the code segment. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: VM86 segments must have DPL 3Kevin Wolf
Setting the segment DPL to 0 for at least the VM86 code segment makes the VM entry fail on VMX. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: Fix task switch privilege checksKevin Wolf
Currently, all task switches check privileges against the DPL of the TSS. This is only correct for jmp/call to a TSS. If a task gate is used, the DPL of this take gate is used for the check instead. Exceptions, external interrupts and iret shouldn't perform any check. [avi: kill kvm-kmod remnants] Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulationGleb Natapov
Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Introduce kvm_memory_slot::arch and move lpage_info into itTakuya Yoshikawa
Some members of kvm_memory_slot are not used by every architecture. This patch is the first step to make this difference clear by introducing kvm_memory_slot::arch; lpage_info is moved into it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Introduce gfn_to_index() which returns the index for a given levelTakuya Yoshikawa
This patch cleans up the code and removes the "(void)level;" warning suppressor. Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every level uniformly later. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix write protection race during dirty loggingTakuya Yoshikawa
This patch fixes a race introduced by: commit 95d4c16ce78cb6b7549a09159c409d52ddd18dae KVM: Optimize dirty logging by rmap_write_protect() During protecting pages for dirty logging, other threads may also try to protect a page in mmu_sync_children() or kvm_mmu_get_page(). In such a case, because get_dirty_log releases mmu_lock before flushing TLB's, the following race condition can happen: A (get_dirty_log) B (another thread) lock(mmu_lock) clear pte.w unlock(mmu_lock) lock(mmu_lock) pte.w is already cleared unlock(mmu_lock) skip TLB flush return ... TLB flush Though thread B assumes the page has already been protected when it returns, the remaining TLB entry will break that assumption. This patch fixes this problem by making get_dirty_log hold the mmu_lock until it flushes the TLB's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: VMX: remove yield_on_hltRaghavendra K T
yield_on_hlt was introduced for CPU bandwidth capping. Now it is redundant with CFS hardlimit. yield_on_hlt also complicates the scenario in paravirtual environment, that needs to trap halt. for e.g. paravirtualized ticket spinlocks. Acked-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Track TSC synchronization in generationsZachary Amsden
This allows us to track the original nanosecond and counter values at each phase of TSC writing by the guest. This gets us perfect offset matching for stable TSC systems, and perfect software computed TSC matching for machines with unstable TSC. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Dont mark TSC unstable due to S4 suspendZachary Amsden
During a host suspend, TSC may go backwards, which KVM interprets as an unstable TSC. Technically, KVM should not be marking the TSC unstable, which causes the TSC clocksource to go bad, but we need to be adjusting the TSC offsets in such a case. Dealing with this issue is a little tricky as the only place we can reliably do it is before much of the timekeeping infrastructure is up and running. On top of this, we are not in a KVM thread context, so we may not be able to safely access VCPU fields. Instead, we compute our best known hardware offset at power-up and stash it to be applied to all VCPUs when they actually start running. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow adjust_tsc_offset to be in host or guest cyclesMarcelo Tosatti
Redefine the API to take a parameter indicating whether an adjustment is in host or guest cycles. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Add last_host_tsc tracking back to KVMZachary Amsden
The variable last_host_tsc was removed from upstream code. I am adding it back for two reasons. First, it is unnecessary to use guest TSC computation to conclude information about the host TSC. The guest may set the TSC backwards (this case handled by the previous patch), but the computation of guest TSC (and fetching an MSR) is significanlty more work and complexity than simply reading the hardware counter. In addition, we don't actually need the guest TSC for any part of the computation, by always recomputing the offset, we can eliminate the need to deal with the current offset and any scaling factors that may apply. The second reason is that later on, we are going to be using the host TSC value to restore TSC offsets after a host S4 suspend, so we need to be reading the host values, not the guest values here. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Fix last_guest_tsc / tsc_offset semanticsZachary Amsden
The variable last_guest_tsc was being used as an ad-hoc indicator that guest TSC has been initialized and recorded correctly. However, it may not have been, it could be that guest TSC has been set to some large value, the back to a small value (by, say, a software reboot). This defeats the logic and causes KVM to falsely assume that the guest TSC has gone backwards, marking the host TSC unstable, which is undesirable behavior. In addition, rather than try to compute an offset adjustment for the TSC on unstable platforms, just recompute the whole offset. This allows us to get rid of one callsite for adjust_tsc_offset, which is problematic because the units it takes are in guest units, but here, the computation was originally being done in host units. Doing this, and also recording last_guest_tsc when the TSC is written allow us to remove the tricky logic which depended on last_guest_tsc being zero to indicate a reset of uninitialized value. Instead, we now have the guarantee that the guest TSC offset is always at least something which will get us last_guest_tsc. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Leave TSC synchronization window open with each new syncZachary Amsden
Currently, when the TSC is written by the guest, the variable ns is updated to force the current write to appear to have taken place at the time of the first write in this sync phase. This leaves a cliff at the end of the match window where updates will fall of the end. There are two scenarios where this can be a problem in practe - first, on a system with a large number of VCPUs, the sync period may last for an extended period of time. The second way this can happen is if the VM reboots very rapidly and we catch a VCPU TSC synchronization just around the edge. We may be unaware of the reboot, and thus the first VCPU might synchronize with an old set of the timer (at, say 0.97 seconds ago, when first powered on). The second VCPU can come in 0.04 seconds later to try to synchronize, but it misses the window because it is just over the threshold. Instead, stop doing this artificial setback of the ns variable and just update it with every write of the TSC. It may be observed that doing so causes values computed by compute_guest_tsc to diverge slightly across CPUs - note that the last_tsc_ns and last_tsc_write variable are used here, and now they last_tsc_ns will be different for each VCPU, reflecting the actual time of the update. However, compute_guest_tsc is used only for guests which already have TSC stability issues, and further, note that the previous patch has caused last_tsc_write to be incremented by the difference in nanoseconds, converted back into guest cycles. As such, only boundary rounding errors should be visible, which given the resolution in nanoseconds, is going to only be a few cycles and only visible in cross-CPU consistency tests. The problem can be fixed by adding a new set of variables to track the start offset and start write value for the current sync cycle. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Improve TSC offset matchingZachary Amsden
There are a few improvements that can be made to the TSC offset matching code. First, we don't need to call the 128-bit multiply (especially on a constant number), the code works much nicer to do computation in nanosecond units. Second, the way everything is setup with software TSC rate scaling, we currently have per-cpu rates. Obviously this isn't too desirable to use in practice, but if for some reason we do change the rate of all VCPUs at runtime, then reset the TSCs, we will only want to match offsets for VCPUs running at the same rate. Finally, for the case where we have an unstable host TSC, but rate scaling is being done in hardware, we should call the platform code to compute the TSC offset, so the math is reorganized to recompute the base instead, then transform the base into an offset using the existing API. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> KVM: Fix 64-bit division in kvm_write_tsc() Breaks i386 build. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Infrastructure for software and hardware based TSC rate scalingZachary Amsden
This requires some restructuring; rather than use 'virtual_tsc_khz' to indicate whether hardware rate scaling is in effect, we consider each VCPU to always have a virtual TSC rate. Instead, there is new logic above the vendor-specific hardware scaling that decides whether it is even necessary to use and updates all rate variables used by common code. This means we can simply query the virtual rate at any point, which is needed for software rate scaling. There is also now a threshold added to the TSC rate scaling; minor differences and variations of measured TSC rate can accidentally provoke rate scaling to be used when it is not needed. Instead, we have a tolerance variable called tsc_tolerance_ppm, which is the maximum variation from user requested rate at which scaling will be used. The default is 250ppm, which is the half the threshold for NTP adjustment, allowing for some hardware variation. In the event that hardware rate scaling is not available, we can kludge a bit by forcing TSC catchup to turn on when a faster than hardware speed has been requested, but there is nothing available yet for the reverse case; this requires a trap and emulate software implementation for RDTSC, which is still forthcoming. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86: reset edge sense circuit of i8259 on initGleb Natapov
The spec says that during initialization "The edge sense circuit is reset which means that following initialization an interrupt request (IR) input must make a low-to-high transition to generate an interrupt", but currently if edge triggered interrupt is in IRR it is delivered after i8259 initialization. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86 emulator: reject SYSENTER in compatibility mode on AMD guestsAvi Kivity
If the guest thinks it's an AMD, it will not have prepared the SYSENTER MSRs, and if the guest executes SYSENTER in compatibility mode, it will fails. Detect this condition and #UD instead, like the spec says. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)Julian Stecklina
If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1." Signed-off-by: Julian Stecklina <js@alien8.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: SVM: comment nested paging and virtualization module parametersDavidlohr Bueso
Also use true instead of 1 for enabling by default. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm parameter from rmap_next()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove unused kvm parameter from __gfn_to_rmap()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hackAvi Kivity
Currently we treat MOVSX/MOVZX with a byte source as a byte instruction, and change the destination operand size with a hack. Change it to be a word instruction, so the destination receives its natural size, and change the source to be SrcMem8. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05KVM: x86 emulator: add 8-bit memory operandsAvi Kivity
Useful for MOVSX/MOVZX. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05KVM: SVM: Add support for AMD's OSVW feature in guestsBoris Ostrovsky
In some cases guests should not provide workarounds for errata even when the physical processor is affected. For example, because of erratum 400 on family 10h processors a Linux guest will read an MSR (resulting in VMEXIT) before going to idle in order to avoid getting stuck in a non-C0 state. This is not necessary: HLT and IO instructions are intercepted and therefore there is no reason for erratum 400 workaround in the guest. This patch allows us to present a guest with certain errata as fixed, regardless of the state of actual hardware. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: unnecessary NX state assignmentDavidlohr Bueso
We can remove the first ->nx state assignment since it is assigned afterwards anyways. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: s390: ucontrol: export SIE control block to userCarsten Otte
This patch exports the s390 SIE hardware control block to userspace via the mapping of the vcpu file descriptor. In order to do so, a new arch callback named kvm_arch_vcpu_fault is introduced for all architectures. It allows to map architecture specific pages. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: s390: add parameter for KVM_CREATE_VMCarsten Otte
This patch introduces a new config option for user controlled kernel virtual machines. It introduces a parameter to KVM_CREATE_VM that allows to set bits that alter the capabilities of the newly created virtual machine. The parameter is passed to kvm_arch_init_vm for all architectures. The only valid modifier bit for now is KVM_VM_S390_UCONTROL. This requires CAP_SYS_ADMIN privileges and creates a user controlled virtual machine on s390 architectures. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: remove the redundant get_written_sptesXiao Guangrong
get_written_sptes is called twice in kvm_mmu_pte_write, one of them can be removed Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Add missing large page accounting to drop_large_spte()Takuya Yoshikawa
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05KVM: MMU: Remove for_each_unsync_children() macroTakuya Yoshikawa
There is only one user of it and for_each_set_bit() does the same. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/perf.h tools/perf/util/top.h Merge reason: resolve these cherry-picking conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabledJoerg Roedel
It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Avi Kivity <avi@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Robert Richter <robert.richter@amd.com> Cc: stable@vger.kernel.org # v3.2 Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24static keys: Introduce 'struct static_key', static_key_true()/false() and ↵Ingo Molnar
static_key_slow_[inc|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-21i387: Split up <asm/i387.h> into exported and internal interfacesLinus Torvalds
While various modules include <asm/i387.h> to get access to things we actually *intend* for them to use, most of that header file was really pretty low-level internal stuff that we really don't want to expose to others. So split the header file into two: the small exported interfaces remain in <asm/i387.h>, while the internal definitions that are only used by core architecture code are now in <asm/fpu-internal.h>. The guiding principle for this was to expose functions that we export to modules, and leave them in <asm/i387.h>, while stuff that is used by task switching or was marked GPL-only is in <asm/fpu-internal.h>. The fpu-internal.h file could be further split up too, especially since arch/x86/kvm/ uses some of the remaining stuff for its module. But that kvm usage should probably be abstracted out a bit, and at least now the internal FPU accessor functions are much more contained. Even if it isn't perhaps as contained as it _could_ be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-18i387: move TS_USEDFPU flag from thread_info to task_structLinus Torvalds
This moves the bit that indicates whether a thread has ownership of the FPU from the TS_USEDFPU bit in thread_info->status to a word of its own (called 'has_fpu') in task_struct->thread.has_fpu. This fixes two independent bugs at the same time: - changing 'thread_info->status' from the scheduler causes nasty problems for the other users of that variable, since it is defined to be thread-synchronous (that's what the "TS_" part of the naming was supposed to indicate). So perfectly valid code could (and did) do ti->status |= TS_RESTORE_SIGMASK; and the compiler was free to do that as separate load, or and store instructions. Which can cause problems with preemption, since a task switch could happen in between, and change the TS_USEDFPU bit. The change to TS_USEDFPU would be overwritten by the final store. In practice, this seldom happened, though, because the 'status' field was seldom used more than once, so gcc would generally tend to generate code that used a read-modify-write instruction and thus happened to avoid this problem - RMW instructions are naturally low fat and preemption-safe. - On x86-32, the current_thread_info() pointer would, during interrupts and softirqs, point to a *copy* of the real thread_info, because x86-32 uses %esp to calculate the thread_info address, and thus the separate irq (and softirq) stacks would cause these kinds of odd thread_info copy aliases. This is normally not a problem, since interrupts aren't supposed to look at thread information anyway (what thread is running at interrupt time really isn't very well-defined), but it confused the heck out of irq_fpu_usable() and the code that tried to squirrel away the FPU state. (It also caused untold confusion for us poor kernel developers). It also turns out that using 'task_struct' is actually much more natural for most of the call sites that care about the FPU state, since they tend to work with the task struct for other reasons anyway (ie scheduling). And the FPU data that we are going to save/restore is found there too. Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to the %esp issue. Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-and-tested-by: Raphael Prevost <raphael@buro.asia> Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16i387: don't ever touch TS_USEDFPU directly, use helper functionsLinus Torvalds
This creates three helper functions that do the TS_USEDFPU accesses, and makes everybody that used to do it by hand use those helpers instead. In addition, there's a couple of helper functions for the "change both CR0.TS and TS_USEDFPU at the same time" case, and the places that do that together have been changed to use those. That means that we have fewer random places that open-code this situation. The intent is partly to clarify the code without actually changing any semantics yet (since we clearly still have some hard to reproduce bug in this area), but also to make it much easier to use another approach entirely to caching the CR0.TS bit for software accesses. Right now we use a bit in the thread-info 'status' variable (this patch does not change that), but we might want to make it a full field of its own or even make it a per-cpu variable. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-01KVM: do not #GP on perf MSR writes when vPMU is disabledGleb Natapov
Return to behaviour perf MSR had before introducing vPMU in case vPMU is disabled. Some guests access those registers unconditionally and do not expect it to fail. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01KVM: x86: fix missing checks in syscall emulationStephan Bärwolf
On hosts without this patch, 32bit guests will crash (and 64bit guests may behave in a wrong way) for example by simply executing following nasm-demo-application: [bits 32] global _start SECTION .text _start: syscall (I tested it with winxp and linux - both always crashed) Disassembly of section .text: 00000000 <_start>: 0: 0f 05 syscall The reason seems a missing "invalid opcode"-trap (int6) for the syscall opcode "0f05", which is not available on Intel CPUs within non-longmodes, as also on some AMD CPUs within legacy-mode. (depending on CPU vendor, MSR_EFER and cpuid) Because previous mentioned OSs may not engage corresponding syscall target-registers (STAR, LSTAR, CSTAR), they remain NULL and (non trapping) syscalls are leading to multiple faults and finally crashs. Depending on the architecture (AMD or Intel) pretended by guests, various checks according to vendor's documentation are implemented to overcome the current issue and behave like the CPUs physical counterparts. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"Stephan Bärwolf
In order to be able to proceed checks on CPU-specific properties within the emulator, function "get_cpuid" is introduced. With "get_cpuid" it is possible to virtually call the guests "cpuid"-opcode without changing the VM's context. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-01-13module_param: make bool parameters really bool (arch)Rusty Russell
module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>