summaryrefslogtreecommitdiffstats
path: root/arch/x86
AgeCommit message (Collapse)Author
2014-11-08KVM: x86: Emulator mis-decodes VEX instructions on real-modeNadav Amit
Commit 7fe864dc942c (KVM: x86: Mark VEX-prefix instructions emulation as unimplemented, 2014-06-02) marked VEX instructions as such in protected mode. VEX-prefix instructions are not supported relevant on real-mode and VM86, but should cause #UD instead of being decoded as LES/LDS. Fix this behaviour to be consistent with real hardware. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> [Check for mod == 3, rather than 2 or 3. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07cpumask: factor out show_cpumap into separate helper functionSudeep Holla
Many sysfs *_show function use cpu{list,mask}_scnprintf to copy cpumap to the buffer aligned to PAGE_SIZE, append '\n' and '\0' to return null terminated buffer with newline. This patch creates a new helper function cpumap_print_to_pagebuf in cpumask.h using newly added bitmap_print_to_pagebuf and consolidates most of those sysfs functions using the new helper function. Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Suggested-by: Stephen Boyd <sboyd@codeaurora.org> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: x86@kernel.org Cc: linux-acpi@vger.kernel.org Cc: linux-pci@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-07KVM: x86: Remove redundant and incorrect cpl check on task-switchNadav Amit
Task-switch emulation checks the privilege level prior to performing the task-switch. This check is incorrect in the case of task-gates, in which the tss.dpl is ignored, and can cause superfluous exceptions. Moreover this check is unnecassary, since the CPU checks the privilege levels prior to exiting. Intel SDM 25.4.2 says "If CALL or JMP accesses a TSS descriptor directly outside IA-32e mode, privilege levels are checked on the TSS descriptor" prior to exiting. AMD 15.14.1 says "The intercept is checked before the task switch takes place but after the incoming TSS and task gate (if one was involved) have been checked for correctness." This patch removes the CPL checks for CALL and JMP. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Inject #GP when loading system segments with non-canonical baseNadav Amit
When emulating LTR/LDTR/LGDT/LIDT, #GP should be injected if the base is non-canonical. Otherwise, VM-entry will fail. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Combine the lgdt and lidt emulation logicNadav Amit
LGDT and LIDT emulation logic is almost identical. Merge the logic into a single point to avoid redundancy. This will be used by the next patch that will ensure the bases of the loaded GDTR and IDTR are canonical. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Do not update EFLAGS on faulting emulationNadav Amit
If the emulation ends in fault, eflags should not be updated. However, several instruction emulations (actually all the fastops) currently update eflags, if the fault was detected afterwards (e.g., #PF during writeback). Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: MOV to CR3 can set bit 63Nadav Amit
Although Intel SDM mentions bit 63 is reserved, MOV to CR3 can have bit 63 set. As Intel SDM states in section 4.10.4 "Invalidation of TLBs and Paging-Structure Caches": " MOV to CR3. ... If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0 ..." In other words, bit 63 is not reserved. KVM emulator currently consider bit 63 as reserved. Fix it. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Emulate push sreg as done in CoreNadav Amit
According to Intel SDM push of segment selectors is done in the following manner: "if the operand size is 32-bits, either a zero-extended value is pushed on the stack or the segment selector is written on the stack using a 16-bit move. For the last case, all recent Core and Atom processors perform a 16-bit move, leaving the upper portion of the stack location unmodified." This patch modifies the behavior to match the core behavior. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Wrong flags on CMPS and SCAS emulationNadav Amit
CMPS and SCAS instructions are evaluated in the wrong order. For reference (of CMPS), see http://www.fermimn.gov.it/linux/quarta/x86/cmps.htm : "Note that the direction of subtraction for CMPS is [SI] - [DI] or [ESI] - [EDI]. The left operand (SI or ESI) is the source and the right operand (DI or EDI) is the destination. This is the reverse of the usual Intel convention in which the left operand is the destination and the right operand is the source." Introducing em_cmp_r for this matter that performs comparison in reverse order using fastop infrastructure to avoid a wrapper function. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: SYSCALL cannot clear eflags[1]Nadav Amit
SYSCALL emulation currently clears in 64-bit mode eflags according to MSR_SYSCALL_MASK. However, on bare-metal eflags[1] which is fixed to one cannot be cleared, even if MSR_SYSCALL_MASK masks the bit. This wrong behavior may result in failed VM-entry, as VT disallows entry with eflags[1] cleared. This patch sets the bit after masking eflags on syscall. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Emulation of MOV-sreg to memory uses incorrect sizeNadav Amit
In x86, you can only MOV-sreg to memory with either 16-bits or 64-bits size. In contrast, KVM may write to 32-bits memory on MOV-sreg. This patch fixes KVM behavior, and sets the destination operand size to two, if the destination is memory. When destination is registers, and the operand size is 32-bits, the high 16-bits in modern CPUs is filled with zero. This is handled correctly. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Breakpoints do not consider CS.baseNadav Amit
x86 debug registers hold a linear address. Therefore, breakpoints detection should consider CS.base, and check whether instruction linear address equals (CS.base + RIP). This patch introduces a function to evaluate RIP linear address and uses it for breakpoints detection. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Clear DR6[0:3] on #DB during handle_drNadav Amit
DR6[0:3] (previous breakpoint indications) are cleared when #DB is injected during handle_exception, just as real hardware does. Similarily, handle_dr should clear DR6[0:3]. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: Emulator should set DR6 upon GD like real CPUNadav Amit
It should clear B0-B3 and set BD. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: No error-code on real-mode exceptionsNadav Amit
Real-mode exceptions do not deliver error code. As can be seen in Intel SDM volume 2, real-mode exceptions do not have parentheses, which indicate error-code. To avoid significant changes of the code, the error code is "removed" during exception queueing. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: decode_modrm does not regard modrm correctlyNadav Amit
In one occassion, decode_modrm uses the rm field after it is extended with REX.B to determine the addressing mode. Doing so causes it not to read the offset for rip-relative addressing with REX.B=1. This patch moves the fetch where we already mask REX.B away instead. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07KVM: x86: reset RVI upon system resetWei Wang
A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm, sometimes the guests run into blue screen during reboot. The problem was that a guest's RVI was not cleared when it rebooted. This patch has fixed the problem. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com> Tested-by: Rongrong Liu <rongrongx.liu@intel.com>, Da Chun <ngugc@qq.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07kvm: x86: vmx: avoid returning bool to distinguish success from errorPaolo Bonzini
Return a negative error code instead, and WARN() when we should be covering the entire 2-bit space of vmcs_field_type's return value. For increased robustness, add a BUILD_BUG_ON checking the range of vmcs_field_to_offset. Suggested-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()Tiejun Chen
Instead of vmx_init(), actually it would make reasonable sense to do anything specific to vmx hardware setting in vmx_x86_ops->hardware_setup(). Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07kvm: x86: vmx: move down hardware_setup() and hardware_unsetup()Tiejun Chen
Just move this pair of functions down to make sure later we can add something dependent on others. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-06PCI/MSI: Add pci_msi_ignore_mask to prevent writes to MSI/MSI-X Mask BitsYijing Wang
MSI-X vector Mask Bits are in MSI-X Tables in PCI memory space. Xen PV guests can't write to those tables. MSI vector Mask Bits are in PCI configuration space. Xen PV guests can write to config space, but those writes are ignored. Commit 0e4ccb1505a9 ("PCI: Add x86_msi.msi_mask_irq() and msix_mask_irq()") added a way to override default_mask_msi_irqs() and default_mask_msix_irqs() so they can be no-ops in Xen guests, but this is more complicated than necessary. Add "pci_msi_ignore_mask" in the core PCI MSI code. If set, default_mask_msi_irqs() and default_mask_msix_irqs() return without doing anything. This is less flexible, but much simpler. [bhelgaas: changelog] Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: xen-devel@lists.xenproject.org
2014-11-06crypto: aesni - remove unnecessary #defineValentin Rothberg
The CPP identifier 'HAS_PCBC' is defined when the Kconfig option CRYPTO_PCBC is set as 'y' or 'm', and is further used in two ifdef blocks to conditionally compile source code. This indirection hides the actual Kconfig dependency and complicates readability. Moreover, it's inconsistent with the rest of the ifdef blocks in the file, which directly reference Kconfig options. This patch removes 'HAS_PCBC' and replaces its occurrences with the actual dependency on 'CRYPTO_PCBC' being set as 'y' or 'm'. Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-11-05x86, microcode: Fix accessing dis_ucode_ldr on 32-bitBorislav Petkov
We should be accessing it through a pointer, like on the BSP. Tested-by: Richard Hendershot <rshendershot@mchsi.com> Fixes: 65cef1311d5d ("x86, microcode: Add a disable chicken bit") Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Borislav Petkov <bp@suse.de>
2014-11-05KVM: x86: Fix uninitialized op->type for some immediate valuesNadav Amit
The emulator could reuse an op->type from a previous instruction for some immediate values. If it mistakenly considers the operands as memory operands, it will performs a memory read and overwrite op->val. Consider for instance the ROR instruction - src2 (the number of times) would be read from memory instead of being used as immediate. Mark every immediate operand as such to avoid this problem. Cc: stable@vger.kernel.org Fixes: c44b4c6ab80eef3a9c52c7b3f0c632942e6489aa Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-04x86-64: Use RIP-relative addressing for most per-CPU accessesJan Beulich
Observing that per-CPU data (in the SMP case) is reachable by exploiting 64-bit address wraparound (building on the default kernel load address being at 16Mb), the one byte shorter RIP-relative addressing form can be used for most per-CPU accesses. The one exception are the "stable" reads, where the use of the "P" operand modifier prevents the compiler from using RIP-relative addressing, but is unavoidable due to the use of the "p" constraint (side note: with gcc 4.9.x the intended effect of this isn't being achieved anymore, see gcc bug 63637). With the dependency on the minimum kernel load address, arbitrarily low values for CONFIG_PHYSICAL_START are now no longer possible. A link time assertion is being added, directing to the need to increase that value when it triggers. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/5458A1780200007800044A9D@mail.emea.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04x86-64: Handle PC-relative relocations on per-CPU dataJan Beulich
This is in preparation of using RIP-relative addressing in many of the per-CPU accesses. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/5458A15A0200007800044A9A@mail.emea.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04x86: Convert a few more per-CPU items to read-mostly onesJan Beulich
Both this_cpu_off and cpu_info aren't getting modified post boot, yet are being accessed on enough code paths that grouping them with other frequently read items seems desirable. For cpu_info this at the same time implies removing the cache line alignment (which afaict became pointless when it got converted to per-CPU data years ago). Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/54589BD20200007800044A84@mail.emea.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04x86: mm: Use 2GB memory block size on large-memory x86-64 systemsDaniel J Blueman
On large-memory x86-64 systems of 64GB or more with memory hot-plug enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces the number of directories in /sys/devices/system/memory from 512 to 32, making it more manageable, and reducing the creation time accordingly. This caveat is that the memory can't be offlined (for hotplug or otherwise) with the finer default 128MB granularity, but this is unimportant due to the high memory densities generally used with such large-memory systems, where eg a single DIMM is the order of 16GB. Signed-off-by: Daniel J Blueman <daniel@numascale.com> Cc: Steffen Persvold <sp@numascale.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1415089784-28779-4-git-send-email-daniel@numascale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04x86: numachip: APIC driver cleanupsDaniel J Blueman
Drop printing that serves no purpose, as it's printing fixed or known values, and mark constant structure appropriately. Signed-off-by: Daniel J Blueman <daniel@numascale.com> Cc: Steffen Persvold <sp@numascale.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1415089784-28779-3-git-send-email-daniel@numascale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04x86: numachip: Elide self-IPI ICR pollingDaniel J Blueman
The default self-IPI path polls the ICR to delay sending the IPI until there is no IPI in progress. This is redundant on x86-86 APICs, since IPIs are queued. See the AMD64 Architecture Programmer's Manual, vol 2, p525. Signed-off-by: Daniel J Blueman <daniel@numascale.com> Cc: Steffen Persvold <sp@numascale.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1415089784-28779-2-git-send-email-daniel@numascale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04x86: numachip: Fix 16-bit APIC ID truncationDaniel J Blueman
Prevent 16-bit APIC IDs being truncated by using correct mask. This fixes booting large systems, where the wrong core would receive the startup and init IPIs, causing hanging. Signed-off-by: Daniel J Blueman <daniel@numascale.com> Cc: Steffen Persvold <sp@numascale.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1415089784-28779-1-git-send-email-daniel@numascale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-04perf/x86: Fix embarrasing typoPeter Zijlstra (Intel)
Because we're all human and typing sucks.. Fixes: 7fb0f1de49fc ("perf/x86: Fix compile warnings for intel_uncore") Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: x86@kernel.org Link: http://lkml.kernel.org/n/tip-be0bftjh8yfm4uvmvtf3yi87@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-03Merge branch 'platform/remove_owner' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux into driver-core-next Remove all .owner fields from platform drivers
2014-11-03x86_64,vsyscall: Make vsyscall emulation configurableAndy Lutomirski
This adds CONFIG_X86_VSYSCALL_EMULATION, guarded by CONFIG_EXPERT. Turning it off completely disables vsyscall emulation, saving ~3.5k for vsyscall_64.c, 4k for vsyscall_emu_64.S (the fake vsyscall page), some tiny amount of core mm code that supports a gate area, and possibly 4k for a wasted pagetable. The latter is because the vsyscall addresses are misaligned and fit poorly in the fixmap. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/406db88b8dd5f0cbbf38216d11be34bbb43c7eae.1414618407.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03x86_64, vsyscall: Rewrite comment and clean up headers in vsyscall codeAndy Lutomirski
vsyscall_64.c is just vsyscall emulation. Tidy it up accordingly. [ tglx: Preserved the original copyright notices ] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/9c448d5643d0fdb618f8cde9a54c21d2bcd486ce.1414618407.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03x86_64, vsyscall: Turn vsyscalls all the way off when vsyscall==noneAndy Lutomirski
I see no point in having an unusable read-only page sitting at 0xffffffffff600000 when vsyscall=none. Instead, skip mapping it and remove it from /proc/PID/maps. I kept the ratelimited warning when programs try to use a vsyscall in this mode, since it may help admins avoid confusion. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/0dddbadc1d4e3bfbaf887938ff42afc97a7cc1f2.1414618407.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03x86: UV BAU: Increase maximum CPUs per socket/hubJames Custer
We have encountered hardware with 18 cores/socket that gives 36 CPUs/socket with hyperthreading enabled. This exceeds the current MAX_CPUS_PER_SOCKET causing a failure in get_cpu_topology. Increase MAX_CPUS_PER_SOCKET to 64 and MAX_CPUS_PER_UVHUB to 128. Signed-off-by: James Custer <jcuster@sgi.com> Cc: Russ Anderson <rja@sgi.com> Link: http://lkml.kernel.org/r/1414952199-185319-1-git-send-email-jcuster@sgi.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03x86: UV BAU: Avoid NULL pointer reference in ptc_seq_showJames Custer
In init_per_cpu(), when get_cpu_topology() fails, init_per_cpu_tunables() is not called afterwards. This means that bau_control->statp is NULL. If a user then reads /proc/sgi_uv/ptc_statistics ptc_seq_show() references a NULL pointer. Therefore, since uv_bau_init calls set_bau_off when init_per_cpu() fails, we add code that detects when the bau is off in ptc_seq_show() to avoid referencing a NULL pointer. Signed-off-by: James Custer <jcuster@sgi.com> Cc: Russ Anderson <rja@sgi.com> Link: http://lkml.kernel.org/r/1414952199-185319-2-git-send-email-jcuster@sgi.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03x86,vdso: Use LSL unconditionally for vgetcpuAndy Lutomirski
LSL is faster than RDTSCP and works everywhere; there's no need to switch between them depending on CPU. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Andi Kleen <andi@firstfloor.org> Link: http://lkml.kernel.org/r/72f73d5ec4514e02bba345b9759177ef03742efb.1414706021.git.luto@amacapital.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03x86: mm: Re-use the early_ioremap fixed areaMinfei Huang
The temp fixed area is only used during boot for early_ioremap(), and it is unused when ioremap() is functional. vmalloc/pkmap area become available after early boot so the temp fixed area is available for re-use. The virtual address is more precious on i386, especially turning on high memory. So we can re-use the virtual address space. Remove the now unused defines FIXADDR_BOOT_START and FIXADDR_BOOT_SIZE. Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Cc: pbonzini@redhat.com Cc: bp@suse.de Link: http://lkml.kernel.org/r/1414582717-32729-1-git-send-email-mnfhuang@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03kvm: kvmclock: use get_cpu() and put_cpu()Tiejun Chen
We can use get_cpu() and put_cpu() to replace preempt_disable()/cpu = smp_processor_id() and preempt_enable() for slightly better code. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: optimize some accesses to LVTT and SPIVRadim Krčmář
We mirror a subset of these registers in separate variables. Using them directly should be faster. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: detect LVTT changes under APICvRadim Krčmář
APIC-write VM exits are "trap-like": they save CS:RIP values for the instruction after the write, and more importantly, the handler will already see the new value in the virtual-APIC page. This means that apic_reg_write cannot use kvm_apic_get_reg to omit timer cancelation when mode changes. timer_mode_mask shouldn't be changing as it depends on cpuid. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: detect SPIV changes under APICvRadim Krčmář
APIC-write VM exits are "trap-like": they save CS:RIP values for the instruction after the write, and more importantly, the handler will already see the new value in the virtual-APIC page. This caused a bug if you used KVM_SET_IRQCHIP to set the SW-enabled bit in the SPIV register. The chain of events is as follows: * When the irqchip is added to the destination VM, the apic_sw_disabled static key is incremented (1) * When the KVM_SET_IRQCHIP ioctl is invoked, it is decremented (0) * When the guest disables the bit in the SPIV register, e.g. as part of shutdown, apic_set_spiv does not notice the change and the static key is _not_ incremented. * When the guest is destroyed, the static key is decremented (-1), resulting in this trace: WARNING: at kernel/jump_label.c:81 __static_key_slow_dec+0xa6/0xb0() jump label: negative count! [<ffffffff816bf898>] dump_stack+0x19/0x1b [<ffffffff8107c6f1>] warn_slowpath_common+0x61/0x80 [<ffffffff8107c76c>] warn_slowpath_fmt+0x5c/0x80 [<ffffffff811931e6>] __static_key_slow_dec+0xa6/0xb0 [<ffffffff81193226>] static_key_slow_dec_deferred+0x16/0x20 [<ffffffffa0637698>] kvm_free_lapic+0x88/0xa0 [kvm] [<ffffffffa061c63e>] kvm_arch_vcpu_uninit+0x2e/0xe0 [kvm] [<ffffffffa05ff301>] kvm_vcpu_uninit+0x21/0x40 [kvm] [<ffffffffa067cec7>] vmx_free_vcpu+0x47/0x70 [kvm_intel] [<ffffffffa061bc50>] kvm_arch_vcpu_free+0x50/0x60 [kvm] [<ffffffffa061ca22>] kvm_arch_destroy_vm+0x102/0x260 [kvm] [<ffffffff810b68fd>] ? synchronize_srcu+0x1d/0x20 [<ffffffffa06030d1>] kvm_put_kvm+0xe1/0x1c0 [kvm] [<ffffffffa06036f8>] kvm_vcpu_release+0x18/0x20 [kvm] [<ffffffff81215c62>] __fput+0x102/0x310 [<ffffffff81215f4e>] ____fput+0xe/0x10 [<ffffffff810ab664>] task_work_run+0xb4/0xe0 [<ffffffff81083944>] do_exit+0x304/0xc60 [<ffffffff816c8dfc>] ? _raw_spin_unlock_irq+0x2c/0x50 [<ffffffff810fd22d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff8108432c>] do_group_exit+0x4c/0xc0 [<ffffffff810843b4>] SyS_exit_group+0x14/0x20 [<ffffffff816d33a9>] system_call_fastpath+0x16/0x1b Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: Enable Intel AVX-512 for guestChao Peng
Expose Intel AVX-512 feature bits to guest. Also add checks for xcr0 AVX512 related bits according to spec: http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: fix deadline tsc interrupt injectionRadim Krčmář
The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a situation where we lose a pending deadline timer in a MSR write. Losing it is fine, because it effectively occurs before the timer fired, so we should be able to cancel or postpone it. Another problem comes from interaction with QEMU, or other userspace that can set deadline MSR without a good reason, when timer is already pending: one guest's deadline request results in more than one interrupt because one is injected immediately on MSR write from userspace and one through hrtimer later. The solution is to remove the injection when replacing a pending timer and to improve the usual QEMU path, we inject without a hrtimer when the deadline has already passed. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reported-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: add apic_timer_expired()Radim Krčmář
Make the code reusable. If the timer was already pending, we shouldn't be waiting in a queue, so wake_up can be skipped, simplifying the path. There is no 'reinject' case => the comment is removed. Current race behaves correctly. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: vmx: Unavailable DR4/5 is checked before CPLNadav Amit
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD should be generated even if CPL>0. This is according to Intel SDM Table 6-2: "Priority Among Simultaneous Exceptions and Interrupts". Note, that this may happen on the first DR access, even if the host does not sets debug breakpoints. Obviously, it occurs when the host debugs the guest. This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr. The emulator already checks DR4/5 availability in check_dr_read. Nested virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject exceptions to the guest. As for SVM, the patch follows the previous logic as much as possible. Anyhow, it appears the DR interception code might be buggy - even if the DR access may cause an exception, the instruction is skipped. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: Emulator performs code segment checks on read accessNadav Amit
When read access is performed using a readable code segment, the "conforming" and "non-conforming" checks should not be done. As a result, read using non-conforming readable code segment fails. This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments"). The fix is not to perform the "non-conforming" checks if the access is not a fetch; the relevant checks are already done when loading the segment. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: x86: Clear DR7.LE during task-switchNadav Amit
DR7.LE should be cleared during task-switch. This feature is poorly documented. For reference, see: http://pdos.csail.mit.edu/6.828/2005/readings/i386/s12_02.htm SDM [17.2.4]: This feature is not supported in the P6 family processors, later IA-32 processors, and Intel 64 processors. AMD [2:13.1.1.4]: This bit is ignored by implementations of the AMD64 architecture. Intel's formulation could mean that it isn't even zeroed, but current hardware indeed does not behave like that. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>