Age | Commit message (Collapse) | Author |
|
vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs
does not exist on any vcpu
If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu
list. The list can be corrupted if the cpu prefetch the vmcs's list before
reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before
making vmcs->vcpu == -1 be visible
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to
smp_call_function_single, if the target cpu is downing (doing cpu hot remove),
loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single
It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called
in the preemptionable context
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
As requested by Glauber, do not update kvmclock area on vcpu->pcpu
migration, in case the host has stable TSC.
This is to reduce cacheline bouncing.
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
With master clock, a pvclock clock read calculates:
ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]
Where 'rdtsc' is the host TSC.
system_timestamp and tsc_timestamp are unique, one tuple
per VM: the "master clock".
Given a host with synchronized TSCs, its obvious that
guest TSC must be matched for the above to guarantee monotonicity.
Allow master clock usage only if guest TSCs are synchronized.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
TSC initialization will soon make use of online_vcpus.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between
1. ktime_get_ts(×pec);
2. rdtscll(tsc);
Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.
The time required to execute 2. is not the same on those two instances
executing in different VCPUS (cache misses, interrupts...).
If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.
Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs.
Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Register a notifier for clocksource change event. In case
the host switches to clock other than TSC, disable master
clock usage.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
migration) to clear the bit.
Noticed by Paolo Bonzini.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
In __emulate_1op_rax_rdx, we use "+a" and "+d" which are input/output
constraints, and *then* use "a" and "d" as input constraints. This is
incorrect, but happens to work on some versions of gcc.
However, it breaks gcc with -O0 and icc, and may break on future
versions of gcc.
Reported-and-tested-by: Melanie Blower <melanie.blower@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/B3584E72CFEBED439A3ECA9BCE67A4EF1B17AF90@FMSMSX107.amr.corp.intel.com
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
The commit [ad756a16: KVM: VMX: Implement PCID/INVPCID for guests with
EPT] introduced the unconditional access to SECONDARY_VM_EXEC_CONTROL,
and this triggers kernel warnings like below on old CPUs:
vmwrite error: reg 401e value a0568000 (err 12)
Pid: 13649, comm: qemu-kvm Not tainted 3.7.0-rc4-test2+ #154
Call Trace:
[<ffffffffa0558d86>] vmwrite_error+0x27/0x29 [kvm_intel]
[<ffffffffa054e8cb>] vmcs_writel+0x1b/0x20 [kvm_intel]
[<ffffffffa054f114>] vmx_cpuid_update+0x74/0x170 [kvm_intel]
[<ffffffffa03629b6>] kvm_vcpu_ioctl_set_cpuid2+0x76/0x90 [kvm]
[<ffffffffa0341c67>] kvm_arch_vcpu_ioctl+0xc37/0xed0 [kvm]
[<ffffffff81143f7c>] ? __vunmap+0x9c/0x110
[<ffffffffa0551489>] ? vmx_vcpu_load+0x39/0x1a0 [kvm_intel]
[<ffffffffa0340ee2>] ? kvm_arch_vcpu_load+0x52/0x1a0 [kvm]
[<ffffffffa032dcd4>] ? vcpu_load+0x74/0xd0 [kvm]
[<ffffffffa032deb0>] kvm_vcpu_ioctl+0x110/0x5e0 [kvm]
[<ffffffffa032e93d>] ? kvm_dev_ioctl+0x4d/0x4a0 [kvm]
[<ffffffff8117dc6f>] do_vfs_ioctl+0x8f/0x530
[<ffffffff81139d76>] ? remove_vma+0x56/0x60
[<ffffffff8113b708>] ? do_munmap+0x328/0x400
[<ffffffff81187c8c>] ? fget_light+0x4c/0x100
[<ffffffff8117e1a1>] sys_ioctl+0x91/0xb0
[<ffffffff815a942d>] system_call_fastpath+0x1a/0x1f
This patch adds a check for the availability of secondary exec
control to avoid these warnings.
Cc: <stable@vger.kernel.org> [v3.6+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
No need to check return value before breaking switch.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Return value of this function will be that of ioctl().
#include <stdio.h>
#include <linux/kvm.h>
int main () {
int fd;
fd = open ("/dev/kvm", 0);
fd = ioctl (fd, KVM_CREATE_VM, 0);
ioctl (fd, KVM_SET_TSS_ADDR, 0xfffff000);
perror ("");
return 0;
}
Output is "Operation not permitted". That's not what
we want.
Return -EINVAL in this case.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
We should avoid kfree()ing error pointer in kvm_vcpu_ioctl() and
kvm_arch_vcpu_ioctl().
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
On hosts without the XSAVE support unprivileged local user can trigger
oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest
cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN
ioctl.
invalid opcode: 0000 [#2] SMP
Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables
...
Pid: 24935, comm: zoog_kvm_monito Tainted: G D 3.2.0-3-686-pae
EIP: 0060:[<f8b9550c>] EFLAGS: 00210246 CPU: 0
EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm]
EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0
task.ti=d7c62000)
Stack:
00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000
ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0
c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80
Call Trace:
[<f8b940a9>] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm]
...
[<c12bfb44>] ? syscall_call+0x7/0xb
Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74
1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01
d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89
EIP: [<f8b9550c>] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP
0068:d7c63e70
QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID
and then sets them later. So guest's X86_FEATURE_XSAVE should be masked
out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with
X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with
X86_FEATURE_XSAVE even on hosts that do not support it, might be
susceptible to this attack from inside the guest as well.
Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support.
Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
After commit b3356bf0dbb349 (KVM: emulator: optimize "rep ins" handling),
the pieces of io data can be collected and write them to the guest memory
or MMIO together
Unfortunately, kvm splits the mmio access into 8 bytes and store them to
vcpu->mmio_fragments. If the guest uses "rep ins" to move large data, it
will cause vcpu->mmio_fragments overflow
The bug can be exposed by isapc (-M isapc):
[23154.818733] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ ......]
[23154.858083] Call Trace:
[23154.859874] [<ffffffffa04f0e17>] kvm_get_cr8+0x1d/0x28 [kvm]
[23154.861677] [<ffffffffa04fa6d4>] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm]
[23154.863604] [<ffffffffa04f5a1a>] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm]
Actually, we can use one mmio_fragment to store a large mmio access then
split it when we pass the mmio-exit-info to userspace. After that, we only
need two entries to store mmio info for the cross-mmio pages access
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
This patch filters noslot pfn out from error pfns based on Marcelo comment:
noslot pfn is not a error pfn
After this patch,
- is_noslot_pfn indicates that the gfn is not in slot
- is_error_pfn indicates that the gfn is in slot but the error is occurred
when translate the gfn to pfn
- is_error_noslot_pfn indicates that the pfn either it is error pfns or it
is noslot pfn
And is_invalid_pfn can be removed, it makes the code more clean
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Merge reason: development work has dependency on kvm patches merged
upstream.
Conflicts:
arch/powerpc/include/asm/Kbuild
arch/powerpc/include/asm/kvm_para.h
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
The mmu_notifier_retry is not specific to any vcpu (and never will be)
so only take struct kvm as a parameter.
The motivation is the ARM mmu code that needs to call this from
somewhere where we long let go of the vcpu pointer.
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
We can not directly call kvm_release_pfn_clean to release the pfn
since we can meet noslot pfn which is used to cache mmio info into
spte
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Use __func__ instead of the function name in svm_hardware_enable since
those things tend to get out of sync. This also slims down printk line
length in conjunction with using pr_err.
No functionality change.
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we
can not fix it if it is caused by delivery event. Reporting internal error
for this case
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
VM exits during Event Delivery is really unexpected if it is not caused
by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal
and freeze the guest, the VMM has the chance to check the guest
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The only difference between FNAME(update_pte) and FNAME(pte_prefetch)
is that the former is allowed to prefetch gfn from dirty logged slot,
so introduce a common function to prefetch spte
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The function does not depend on guest mmu mode, move it out from
paging_tmpl.h
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Let it return emulate state instead of spte like __direct_map
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Remove mmu_is_invalid and use is_invalid_pfn instead
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
No users outside of kvm/x86.c.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
There are no external callers of this function as there is no concept of
resetting a vcpu from generic code.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
|
Pull KVM updates from Avi Kivity:
"Highlights of the changes for this release include support for vfio
level triggered interrupts, improved big real mode support on older
Intels, a streamlines guest page table walker, guest APIC speedups,
PIO optimizations, better overcommit handling, and read-only memory."
* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: s390: Fix vcpu_load handling in interrupt code
KVM: x86: Fix guest debug across vcpu INIT reset
KVM: Add resampling irqfds for level triggered interrupts
KVM: optimize apic interrupt delivery
KVM: MMU: Eliminate pointless temporary 'ac'
KVM: MMU: Avoid access/dirty update loop if all is well
KVM: MMU: Eliminate eperm temporary
KVM: MMU: Optimize is_last_gpte()
KVM: MMU: Simplify walk_addr_generic() loop
KVM: MMU: Optimize pte permission checks
KVM: MMU: Update accessed and dirty bits after guest pagetable walk
KVM: MMU: Move gpte_access() out of paging_tmpl.h
KVM: MMU: Optimize gpte_access() slightly
KVM: MMU: Push clean gpte write protection out of gpte_access()
KVM: clarify kvmclock documentation
KVM: make processes waiting on vcpu mutex killable
KVM: SVM: Make use of asm.h
KVM: VMX: Make use of asm.h
KVM: VMX: Make lto-friendly
KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
...
Conflicts:
arch/s390/include/asm/processor.h
arch/x86/kvm/i8259.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu update from Ingo Molnar:
"The biggest change is the addition of the non-lazy (eager) FPU saving
support model and enabling it on CPUs with optimized xsaveopt/xrstor
FPU state saving instructions.
There are also various Sparse fixes"
Fix up trivial add-add conflict in arch/x86/kernel/traps.c
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
x86, fpu: remove cpu_has_xmm check in the fx_finit()
x86, fpu: make eagerfpu= boot param tri-state
x86, fpu: enable eagerfpu by default for xsaveopt
x86, fpu: decouple non-lazy/eager fpu restore from xsave
x86, fpu: use non-lazy fpu restore for processors supporting xsave
lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
x86, fpu: drop_fpu() before restoring new state from sigframe
x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
x86, signal: Cleanup ifdefs and is_ia32, is_x32
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/cleanups from Ingo Molnar:
"Smaller cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arch/x86: Remove unecessary semicolons
x86, boot: Remove obsolete and unused constant RAMDISK
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf update from Ingo Molnar:
"Lots of changes in this cycle as well, with hundreds of commits from
over 30 contributors. Most of the activity was on the tooling side.
Higher level changes:
- New 'perf kvm' analysis tool, from Xiao Guangrong.
- New 'perf trace' system-wide tracing tool
- uprobes fixes + cleanups from Oleg Nesterov.
- Lots of patches to make perf build on Android out of box, from
Irina Tirdea
- Extend ftrace function tracing utility to be more dynamic for its
users. It allows for data passing to the callback functions, as
well as reading regs as if a breakpoint were to trigger at function
entry.
The main goal of this patch series was to allow kprobes to use
ftrace as an optimized probe point when a probe is placed on an
ftrace nop. With lots of help from Masami Hiramatsu, and going
through lots of iterations, we finally came up with a good
solution.
- Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.
- Various tracing updates from Steve Rostedt
- Clean up and improve 'perf sched' performance by elliminating lots
of needless calls to libtraceevent.
- Event group parsing support, from Jiri Olsa
- UI/gtk refactorings and improvements from Namhyung Kim
- Add support for non-tracepoint events in perf script python, from
Feng Tang
- Add --symbols to 'script', similar to the one in 'report', from
Feng Tang.
Infrastructure enhancements and fixes:
- Convert the trace builtins to use the growing evsel/evlist
tracepoint infrastructure, removing several open coded constructs
like switch like series of strcmp to dispatch events, etc.
Basically what had already been showcased in 'perf sched'.
- Add evsel constructor for tracepoints, that uses libtraceevent just
to parse the /format events file, use it in a new 'perf test' to
make sure the libtraceevent format parsing regressions can be more
readily caught.
- Some strange errors were happening in some builds, but not on the
next, reported by several people, problem was some parser related
files, generated during the build, didn't had proper make deps, fix
from Eric Sandeen.
- Introduce struct and cache information about the environment where
a perf.data file was captured, from Namhyung Kim.
- Fix handling of unresolved samples when --symbols is used in
'report', from Feng Tang.
- Add union member access support to 'probe', from Hyeoncheol Lee.
- Fixups to die() removal, from Namhyung Kim.
- Render fixes for the TUI, from Namhyung Kim.
- Don't enable annotation in non symbolic view, from Namhyung Kim.
- Fix pipe mode in 'report', from Namhyung Kim.
- Move related stats code from stat to util/, will be used by the
'stat' kvm tool, from Xiao Guangrong.
- Remove die()/exit() calls from several tools.
- Resolve vdso callchains, from Jiri Olsa
- Don't pass const char pointers to basename, so that we can
unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
from David Ahern
- Refactor hist formatting so that it can be reused with the GTK
browser, From Namhyung Kim
- Fix build for another rbtree.c change, from Adrian Hunter.
- Make 'perf diff' command work with evsel hists, from Jiri Olsa.
- Use the only field_sep var that is set up: symbol_conf.field_sep,
fix from Jiri Olsa.
- .gitignore compiled python binaries, from Namhyung Kim.
- Get rid of die() in more libtraceevent places, from Namhyung Kim.
- Rename libtraceevent 'private' struct member to 'priv' so that it
works in C++, from Steven Rostedt
- Remove lots of exit()/die() calls from tools so that the main perf
exit routine can take place, from David Ahern
- Fix x86 build on x86-64, from David Ahern.
- {int,str,rb}list fixes from Suzuki K Poulose
- perf.data header fixes from Namhyung Kim
- Allow user to indicate objdump path, needed in cross environments,
from Maciek Borzecki
- Fix hardware cache event name generation, fix from Jiri Olsa
- Add round trip test for sw, hw and cache event names, catching the
problem Jiri fixed, after Jiri's patch, the test passes
successfully.
- Clean target should do clean for lib/traceevent too, fix from David
Ahern
- Check the right variable for allocation failure, fix from Namhyung
Kim
- Set up evsel->tp_format regardless of evsel->name being set
already, fix from Namhyung Kim
- Oprofile fixes from Robert Richter.
- Remove perf_event_attr needless version inflation, from Jiri Olsa
- Introduce libtraceevent strerror like error reporting facility,
from Namhyung Kim
- Add pmu mappings to perf.data header and use event names from cmd
line, from Robert Richter
- Fix include order for bison/flex-generated C files, from Ben
Hutchings
- Build fixes and documentation corrections from David Ahern
- Assorted cleanups from Robert Richter
- Let O= makes handle relative paths, from Steven Rostedt
- perf script python fixes, from Feng Tang.
- Initial bash completion support, from Frederic Weisbecker
- Allow building without libelf, from Namhyung Kim.
- Support DWARF CFI based unwind to have callchains when %bp based
unwinding is not possible, from Jiri Olsa.
- Symbol resolution fixes, while fixing support PPC64 files with an
.opt ELF section was the end goal, several fixes for code that
handles all architectures and cleanups are included, from Cody
Schafer.
- Assorted fixes for Documentation and build in 32 bit, from Robert
Richter
- Cache the libtraceevent event_format associated to each evsel
early, so that we avoid relookups, i.e. calling pevent_find_event
repeatedly when processing tracepoint events.
[ This is to reduce the surface contact with libtraceevents and
make clear what is that the perf tools needs from that lib: so
far parsing the common and per event fields. ]
- Don't stop the build if the audit libraries are not installed, fix
from Namhyung Kim.
- Fix bfd.h/libbfd detection with recent binutils, from Markus
Trippelsdorf.
- Improve warning message when libunwind devel packages not present,
from Jiri Olsa"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
perf trace: Add aliases for some syscalls
perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
perf tools: Check libaudit availability for perf-trace builtin
perf hists: Add missing period_* fields when collapsing a hist entry
perf trace: New tool
perf evsel: Export the event_format constructor
perf evsel: Introduce rawptr() method
perf tools: Use perf_evsel__newtp in the event parser
perf evsel: The tracepoint constructor should store sys:name
perf evlist: Introduce set_filter() method
perf evlist: Renane set_filters method to apply_filters
perf test: Add test to check we correctly parse and match syscall open parms
perf evsel: Handle endianity in intval method
perf evsel: Know if byte swap is needed
perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
perf evsel: Improve tracepoint constructor setup
tools lib traceevent: Fix error path on pevent_parse_event
perf test: Fix build failure
trace: Move trace event enable from fs_initcall to core_initcall
tracing: Add an option for disabling markers
...
|
|
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.
Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.
Found while trying to stop on start_secondary.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
To emulate level triggered interrupts, add a resample option to
KVM_IRQFD. When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM. This may, for
instance, indicate an EOI. Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt. On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.
All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.
kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().
So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.
Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Exporting KVM exit information to userspace to be consumed by perf.
Signed-off-by: Dong Hao <haodong@linux.vnet.ibm.com>
[ Dong Hao <haodong@linux.vnet.ibm.com>: rebase it on acme's git tree ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Runzhen Wang <runzhen@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1347870675-31495-2-git-send-email-haodong@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
'ac' essentially reconstructs the 'access' variable we already
have, except for the PFERR_PRESENT_MASK and PFERR_RSVD_MASK. As
these are not used by callees, just use 'access' directly.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Keep track of accessed/dirty bits; if they are all set, do not
enter the accessed/dirty update loop.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
'eperm' is no longer used in the walker loop, so we can eliminate it.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
The page table walk is coded as an infinite loop, with a special
case on the last pte.
Code it as an ordinary loop with a termination condition on the last
pte (large page or walk length exhausted), and put the last pte handling
code after the loop where it belongs.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup. It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.
Optimize this away by precalculating all variants and storing them in a
bitmap. The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).
The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.
The result is short, branch-free code.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
While unspecified, the behaviour of Intel processors is to first
perform the page table walk, then, if the walk was successful, to
atomically update the accessed and dirty bits of walked paging elements.
While we are not required to follow this exactly, doing so will allow us
to perform the access permissions check after the walk is complete, rather
than after each walk step.
(the tricky case is SMEP: a zero in any pte's U bit makes the referenced
page a supervisor page, so we can't fault on a one bit during the walk
itself).
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
We no longer rely on paging_tmpl.h defines; so we can move the function
to mmu.c.
Rely on zero extension to 64 bits to get the correct nx behaviour.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
If nx is disabled, then is gpte[63] is set we will hit a reserved
bit set fault before checking permissions; so we can ignore the
setting of efer.nxe.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
gpte_access() computes the access permissions of a guest pte and also
write-protects clean gptes. This is wrong when we are servicing a
write fault (since we'll be setting the dirty bit momentarily) but
correct when instantiating a speculative spte, or when servicing a
read fault (since we'll want to trap a following write in order to
set the dirty bit).
It doesn't seem to hurt in practice, but in order to make the code
readable, push the write protection out of gpte_access() and into
a new protect_clean_gpte() which is called explicitly when needed.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|