Age | Commit message (Collapse) | Author |
|
When we invalidate a THP page, we call the handler with the same
rmap_pde argument 512 times in the following loop:
for each guest page in the range
for each level
unmap using rmap
This patch avoids these extra handler calls by changing the loop order
like this:
for each level
for each rmap in the range
unmap using rmap
With the preceding patches in the patch series, this made THP page
invalidation more than 5 times faster on our x86 host: the host became
more responsive during swapping the guest's memory as a result.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
This restricts the tracing to page aging and makes it possible to
optimize kvm_handle_hva_range() further in the following patch.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
This is needed to push trace_kvm_age_page() into kvm_age_rmapp() in the
following patch.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
This makes it possible to loop over rmap_pde arrays in the same way as
we do over rmap so that we can optimize kvm_handle_hva_range() easily in
the following patch.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
kvm_mmu_notifier_invalidate_range_start()
When we tested KVM under memory pressure, with THP enabled on the host,
we noticed that MMU notifier took a long time to invalidate huge pages.
Since the invalidation was done with mmu_lock held, it not only wasted
the CPU but also made the host harder to respond.
This patch mitigates this by using kvm_handle_hva_range().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
When guest's memory is backed by THP pages, MMU notifier needs to call
kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to
invalidate a range of pages which constitute one huge page:
for each page
for each memslot
if page is in memslot
unmap using rmap
This means although every page in that range is expected to be found in
the same memslot, we are forced to check unrelated memslots many times.
If the guest has more memslots, the situation will become worse.
Furthermore, if the range does not include any pages in the guest's
memory, the loop over the pages will just consume extra time.
This patch, together with the following patches, solves this problem by
introducing kvm_handle_hva_range() which makes the loop look like this:
for each memslot
for each page in memslot
unmap using rmap
In this new processing, the actual work is converted to a loop over rmap
which is much more cache friendly than before.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
This restricts hva handling in mmu code and makes it easier to extend
kvm_handle_hva() so that it can treat a range of addresses later in this
patch series.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
We can treat every level uniformly.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Pick up the latest ring-buffer fixes, before applying a new fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This patch handles PCID/INVPCID for guests.
Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.
For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The P bit of page fault error code is missed in this tracepoint, fix it by
passing the full error code
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
To see what happen on this path and help us to optimize it
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
If the the present bit of page fault error code is set, it indicates
the shadow page is populated on all levels, it means what we do is
only modify the access bit which can be done out of mmu-lock
Currently, in order to simplify the code, we only fix the page fault
caused by write-protect on the fast path
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
This bit indicates whether the spte can be writable on MMU, that means
the corresponding gpte is writable and the corresponding gfn is not
protected by shadow page protection
In the later path, SPTE_MMU_WRITEABLE will indicates whether the spte
can be locklessly updated
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
mmu_spte_update() is the common function, we can easily audit the path
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Export the present bit of page fault error code, the later patch
will use it
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Use __drop_large_spte to cleanup this function and comment spte_write_protect
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Introduce a common function to abstract spte write-protect to
cleanup the code
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The reture value of __rmap_write_protect is either 1 or 0, use
true/false instead of these
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcode 0F 00 /3. Encountered during Windows XP secondary processor bringup.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Guest software doesn't actually depend on it, but vmx will refuse us
entry if we don't. Set the bit in both the cached segment and memory,
just to be nice.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Some operations want to modify the descriptor later on, so save the
address for future use.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcode 0F 00 /2. Used by isolinux durign the protected mode transition.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcodes 0F C8 - 0F CF.
Used by the SeaBIOS cdrom code (though not in big real mode).
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
If instruction emulation fails, report it properly to userspace.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Process the event, possibly injecting an interrupt, before continuing.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcode C8.
Only ENTER with lexical nesting depth 0 is implemented, since others are
very rare. We'll fail emulation if nonzero lexical depth is used so data
is not corrupted.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
This allows us to reuse the code without populating ctxt->src and
overriding ctxt->op_bytes.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Commit 2adb5ad9fe1 removed ByteOp from MOVZX/MOVSX, replacing them by
SrcMem8, but neglected to fix the dependency in the emulation code
on ByteOp. This caused the instruction not to have any effect in
some circumstances.
Fix by replacing the check for ByteOp with the equivalent src.op_bytes == 1.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcode 9F.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
If we return early from an invalid guest state emulation loop, make
sure we return to it later if the guest state is still invalid.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Checking EFLAGS.IF is incorrect as we might be in interrupt shadow. If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.
Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcodes 0F 01 /0 and 0F 01 /1
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
We correctly default to SS when BP is used as a base in 16-bit address mode,
but we don't do that for 32-bit mode.
Fix by adjusting the default to SS when either ESP or EBP is used as the base
register.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcode c9; used by some variants of Windows during boot, in big real mode.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Otherwise, if the guest ends up looping, we never exit the srcu critical
section, which causes synchronize_srcu() to hang.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1. Relax the check by testing for
segment not present (a non-present segment cannot be usable).
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The operand size for these instructions is 8 bytes in long mode, even without
a REX prefix. Set it explicitly.
Triggered while booting Linux with emulate_invalid_guest_state=1.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Null SS is valid in long mode; allow loading it.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Opcode 0F A2.
Used by Linux during the mode change trampoline while in a state that is
not virtualizable on vmx without unrestricted_guest, so we need to emulate
it is emulate_invalid_guest_state=1.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Instead of getting an exact leaf, follow the spec and fall back to the last
main leaf instead. This lets us easily emulate the cpuid instruction in the
emulator.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Introduce kvm_cpuid() to perform the leaf limit check and calculate
register values, and let kvm_emulate_cpuid() just handle reading and
writing the registers from/to the vcpu. This allows us to reuse
kvm_cpuid() in a context where directly reading and writing registers
is not desired.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump. But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).
Fix by reading CPL from the cache during the transition. This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Currently the MMU's ->new_cr3() callback does nothing when guest paging
is disabled or when two-dimentional paging (e.g. EPT on Intel) is active.
This means that an emulated write to cr3 can be lost; kvm_set_cr3() will
write vcpu-arch.cr3, but the GUEST_CR3 field in the VMCS will retain its
old value and this is what the guest sees.
This bug did not have any effect until now because:
- with unrestricted guest, or with svm, we never emulate a mov cr3 instruction
- without unrestricted guest, and with paging enabled, we also never emulate a
mov cr3 instruction
- without unrestricted guest, but with paging disabled, the guest's cr3 is
ignored until the guest enables paging; at this point the value from arch.cr3
is loaded correctly my the mov cr0 instruction which turns on paging
However, the patchset that enables big real mode causes us to emulate mov cr3
instructions in protected mode sometimes (when guest state is not virtualizable
by vmx); this mov cr3 is effectively ignored and will crash the guest.
The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3
reload. This is awkward because now all the new_cr3 callbacks to the same
thing, and because mmu_free_roots() is somewhat of an overkill; but fixing
that is more complicated and will be done after this minimal fix.
Observed in the Window XP 32-bit installer while bringing up secondary vcpus.
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull tracing updates from Steve Rostedt.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There are macros that are Intel specific and not x86 generic. Rename
them into INTEL_*.
This patch removes X86_PMC_IDX_GENERIC and does:
$ sed -i -e 's/X86_PMC_MAX_/INTEL_PMC_MAX_/g' \
arch/x86/include/asm/kvm_host.h \
arch/x86/include/asm/perf_event.h \
arch/x86/kernel/cpu/perf_event.c \
arch/x86/kernel/cpu/perf_event_p4.c \
arch/x86/kvm/pmu.c
$ sed -i -e 's/X86_PMC_IDX_FIXED/INTEL_PMC_IDX_FIXED/g' \
arch/x86/include/asm/perf_event.h \
arch/x86/kernel/cpu/perf_event.c \
arch/x86/kernel/cpu/perf_event_intel.c \
arch/x86/kernel/cpu/perf_event_intel_ds.c \
arch/x86/kvm/pmu.c
$ sed -i -e 's/X86_PMC_MSK_/INTEL_PMC_MSK_/g' \
arch/x86/include/asm/perf_event.h \
arch/x86/kernel/cpu/perf_event.c
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Fix:
[ 3190.059226] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 3190.062224] IP: [<ffffffffa02aac66>] mmu_page_zap_pte+0x10/0xa7 [kvm]
[ 3190.063760] PGD 104f50067 PUD 112bea067 PMD 0
[ 3190.065309] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 3190.066860] CPU 1
[ ...... ]
[ 3190.109629] Call Trace:
[ 3190.111342] [<ffffffffa02aada6>] kvm_mmu_prepare_zap_page+0xa9/0x1fc [kvm]
[ 3190.113091] [<ffffffffa02ab2f5>] mmu_shrink+0x11f/0x1f3 [kvm]
[ 3190.114844] [<ffffffffa02ab25d>] ? mmu_shrink+0x87/0x1f3 [kvm]
[ 3190.116598] [<ffffffff81150c9d>] ? prune_super+0x142/0x154
[ 3190.118333] [<ffffffff8110a4f4>] ? shrink_slab+0x39/0x31e
[ 3190.120043] [<ffffffff8110a687>] shrink_slab+0x1cc/0x31e
[ 3190.121718] [<ffffffff8110ca1d>] do_try_to_free_pages
This is caused by shrinking page from the empty mmu, although we have
checked n_used_mmu_pages, it is useless since the check is out of mmu-lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
The kvm_emulate_insn tracepoint used __print_insn()
for printing its instructions. However it makes the
format of the event hard to parse as it reveals TP
internals.
Fortunately, kernel provides __print_hex for almost
same purpose, we can use it instead of open coding
it. The user-space can be changed to parse it later.
That means raw kernel tracing will not be affected
by this change:
# cd /sys/kernel/debug/tracing/
# cat events/kvm/kvm_emulate_insn/format
name: kvm_emulate_insn
ID: 29
format:
...
print fmt: "%x:%llx:%s (%s)%s", REC->csbase, REC->rip, __print_hex(REC->insn, REC->len), \
__print_symbolic(REC->flags, { 0, "real" }, { (1 << 0) | (1 << 1), "vm16" }, \
{ (1 << 0), "prot16" }, { (1 << 0) | (1 << 2), "prot32" }, { (1 << 0) | (1 << 3), "prot64" }), \
REC->failed ? " failed" : ""
# echo 1 > events/kvm/kvm_emulate_insn/enable
# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2183/2183 #P:12
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
qemu-kvm-1782 [002] ...1 140.931636: kvm_emulate_insn: 0:c102fa25:89 10 (prot32)
qemu-kvm-1781 [004] ...1 140.931637: kvm_emulate_insn: 0:c102fa25:89 10 (prot32)
Link: http://lkml.kernel.org/n/tip-wfw6y3b9ugtey8snaow9nmg5@git.kernel.org
Link: http://lkml.kernel.org/r/1340757701-10711-2-git-send-email-namhyung@kernel.org
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: kvm@vger.kernel.org
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|