Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cdrom/ide-cd | 39
-rw-r--r-- | Documentation/feature-removal-schedule.txt | 44
-rw-r--r-- | Documentation/filesystems/Locking | 18
-rw-r--r-- | Documentation/filesystems/sharedsubtree.txt | 16
-rw-r--r-- | Documentation/kprobes.txt | 207
-rw-r--r-- | Documentation/kvm/api.txt | 12
-rw-r--r-- | Documentation/powerpc/dts-bindings/fsl/dma.txt | 8
7 files changed, 267 insertions, 77 deletions
diff --git a/Documentation/cdrom/ide-cd b/Documentation/cdrom/ide-cd
index 2c558cd6c1e..f4dc9de2694 100644
--- a/Documentation/cdrom/ide-cd
+++ b/Documentation/cdrom/ide-cd
@@ -159,42 +159,7 @@
 two arguments: the CDROM device, and the slot number to which you wish
 to change. If the slot number is -1, the drive is unloaded.
 
-4. Compilation options
-----------------------
-
-There are a few additional options which can be set when compiling the
-driver. Most people should not need to mess with any of these; they
-are listed here simply for completeness. A compilation option can be
-enabled by adding a line of the form `#define <option> 1' to the top
-of ide-cd.c. All these options are disabled by default.
-
-VERBOSE_IDE_CD_ERRORS
-  If this is set, ATAPI error codes will be translated into textual
-  descriptions. In addition, a dump is made of the command which
-  provoked the error. This is off by default to save the memory used
-  by the (somewhat long) table of error descriptions.
-
-STANDARD_ATAPI
-  If this is set, the code needed to deal with certain drives which do
-  not properly implement the ATAPI spec will be disabled. If you know
-  your drive implements ATAPI properly, you can turn this on to get a
-  slightly smaller kernel.
-
-NO_DOOR_LOCKING
-  If this is set, the driver will never attempt to lock the door of
-  the drive.
-
-CDROM_NBLOCKS_BUFFER
-  This sets the size of the buffer to be used for a CDROMREADAUDIO
-  ioctl. The default is 8.
-
-TEST
-  This currently enables an additional ioctl which enables a user-mode
-  program to execute an arbitrary packet command. See the source for
-  details. This should be left off unless you know what you're doing.
-
-
-5. Common problems
+4. Common problems
 ------------------
 
 This section discusses some common problems encountered when trying to
@@ -371,7 +336,7 @@ f. Data corruption.
    expense of low system performance.
 
 
-6. cdchange.c
+5. cdchange.c
 -------------
 
 /*
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 73ef30dbe61..31575e220f3 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -117,19 +117,25 @@ Who: Mauro Carvalho Chehab <mchehab@infradead.org>
 ---------------------------
 
 What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
-When: November 2005
+When: 2.6.35/2.6.36
 Files: drivers/pcmcia/: pcmcia_ioctl.c
 Why: With the 16-bit PCMCIA subsystem now behaving (almost) like a
 	normal hotpluggable bus, and with it using the default kernel
 	infrastructure (hotplug, driver core, sysfs) keeping the PCMCIA
 	control ioctl needed by cardmgr and cardctl from pcmcia-cs is
-	unnecessary, and makes further cleanups and integration of the
+	unnecessary and potentially harmful (it does not provide for
+	proper locking), and makes further cleanups and integration of the
 	PCMCIA subsystem into the Linux kernel device driver model more
 	difficult. The features provided by cardmgr and cardctl are either
 	handled by the kernel itself now or are available in the new
 	pcmciautils package available at
 	http://kernel.org/pub/linux/utils/kernel/pcmcia/
-Who: Dominik Brodowski <linux@brodo.de>
+
+	For all architectures except ARM, the associated config symbol
+	has been removed from kernel 2.6.34; for ARM, it will be likely
+	be removed from kernel 2.6.35. The actual code will then likely
+	be removed from kernel 2.6.36.
+Who: Dominik Brodowski <linux@dominikbrodowski.net>
 
 ---------------------------
 
@@ -550,3 +556,35 @@ Why: udev fully replaces this special file system that only contains CAPI
 	NCCI TTY device nodes. User space (pppdcapiplugin) works without
 	noticing the difference.
 Who: Jan Kiszka <jan.kiszka@web.de>
+
+----------------------------
+
+What: KVM memory aliases support
+When: July 2010
+Why: Memory aliasing support is used for speeding up guest vga access
+	through the vga windows.
+
+	Modern userspace no longer uses this feature, so it's just bitrotted
+	code and can be removed with no impact.
+Who: Avi Kivity <avi@redhat.com>
+
+----------------------------
+
+What: KVM kernel-allocated memory slots
+When: July 2010
+Why: Since 2.6.25, kvm supports user-allocated memory slots, which are
+	much more flexible than kernel-allocated slots. All current userspace
+	supports the newer interface and this code can be removed with no
+	impact.
+Who: Avi Kivity <avi@redhat.com>
+
+----------------------------
+
+What: KVM paravirt mmu host support
+When: January 2011
+Why: The paravirt mmu host support is slower than non-paravirt mmu, both
+	on newer and older hardware. It is already not exposed to the guest,
+	and kept only for live migration purposes.
+Who: Avi Kivity <avi@redhat.com>
+
+----------------------------
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 18b9d0ca063..06bbbed7120 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -460,13 +460,6 @@ in sys_read() and friends.
 
 --------------------------- dquot_operations -------------------------------
 prototypes:
-	int (*initialize) (struct inode *, int);
-	int (*drop) (struct inode *);
-	int (*alloc_space) (struct inode *, qsize_t, int);
-	int (*alloc_inode) (const struct inode *, unsigned long);
-	int (*free_space) (struct inode *, qsize_t);
-	int (*free_inode) (const struct inode *, unsigned long);
-	int (*transfer) (struct inode *, struct iattr *);
 	int (*write_dquot) (struct dquot *);
 	int (*acquire_dquot) (struct dquot *);
 	int (*release_dquot) (struct dquot *);
@@ -479,13 +472,6 @@ a proper locking wrt the filesystem and call the generic quota operations.
 What filesystem should expect from the generic quota functions:
 
 		FS recursion	Held locks when called
-initialize:	yes		maybe dqonoff_sem
-drop:		yes		-
-alloc_space:	->mark_dirty()	-
-alloc_inode:	->mark_dirty()	-
-free_space:	->mark_dirty()	-
-free_inode:	->mark_dirty()	-
-transfer:	yes		-
 write_dquot:	yes		dqonoff_sem or dqptr_sem
 acquire_dquot:	yes		dqonoff_sem or dqptr_sem
 release_dquot:	yes		dqonoff_sem or dqptr_sem
@@ -495,10 +481,6 @@ write_info: yes dqonoff_sem
 FS recursion means calling ->quota_read() and ->quota_write() from superblock
 operations.
 
-->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
-only directly by the filesystem and do not call any fs functions only
-the ->mark_dirty() operation.
-
 More details about quota locking can be found in fs/dquot.c.
 
 --------------------------- vm_operations_struct -----------------------------
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt
index 23a181074f9..fc0e39af43c 100644
--- a/Documentation/filesystems/sharedsubtree.txt
+++ b/Documentation/filesystems/sharedsubtree.txt
@@ -837,6 +837,9 @@ replicas continue to be exactly same.
 	individual lists does not affect propagation or the way propagation
 	tree is modified by operations.
 
+	All vfsmounts in a peer group have the same ->mnt_master. If it is
+	non-NULL, they form a contiguous (ordered) segment of slave list.
+
 	A example propagation tree looks as shown in the figure below.
 	[ NOTE: Though it looks like a forest, if we consider all the shared
 	mounts as a conceptual entity called 'pnode', it becomes a tree]
@@ -874,8 +877,19 @@ replicas continue to be exactly same.
 
 	NOTE: The propagation tree is orthogonal to the mount tree.
 
+8B Locking:
+
+	->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
+	by namespace_sem (exclusive for modifications, shared for reading).
+
+	Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
+	There are two exceptions: do_add_mount() and clone_mnt().
+	The former modifies a vfsmount that has not been visible in any shared
+	data structures yet.
+	The latter holds namespace_sem and the only references to vfsmount
+	are in lists that can't be traversed without namespace_sem.
 
-8B Algorithm:
+8C Algorithm:
 
 	The crux of the implementation resides in rbind/move operation.
 
diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 053037a1fe6..2f9115c0ae6 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -1,6 +1,7 @@
 Title	: Kernel Probes (Kprobes)
 Authors	: Jim Keniston <jkenisto@us.ibm.com>
-	: Prasanna S Panchamukhi <prasanna@in.ibm.com>
+	: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
+	: Masami Hiramatsu <mhiramat@redhat.com>
 
 CONTENTS
 
@@ -15,6 +16,7 @@ CONTENTS
 9. Jprobes Example
 10. Kretprobes Example
 Appendix A: The kprobes debugfs interface
+Appendix B: The kprobes sysctl interface
 
 1. Concepts: Kprobes, Jprobes, Return Probes
 
@@ -42,13 +44,13 @@ registration/unregistration of a group of *probes. These functions
 can speed up unregistration process when you have to unregister
 a lot of probes at once.
 
-The next three subsections explain how the different types of
-probes work. They explain certain things that you'll need to
-know in order to make the best use of Kprobes -- e.g., the
-difference between a pre_handler and a post_handler, and how
-to use the maxactive and nmissed fields of a kretprobe. But
-if you're in a hurry to start using Kprobes, you can skip ahead
-to section 2.
+The next four subsections explain how the different types of
+probes work and how jump optimization works. They explain certain
+things that you'll need to know in order to make the best use of
+Kprobes -- e.g., the difference between a pre_handler and
+a post_handler, and how to use the maxactive and nmissed fields of
+a kretprobe. But if you're in a hurry to start using Kprobes, you
+can skip ahead to section 2.
 
 1.1 How Does a Kprobe Work?
 
@@ -161,13 +163,125 @@ In case probed function is entered but there is no kretprobe_instance
 object available, then in addition to incrementing the nmissed count,
 the user entry_handler invocation is also skipped.
 
+1.4 How Does Jump Optimization Work?
+
+If you configured your kernel with CONFIG_OPTPROBES=y (currently
+this option is supported on x86/x86-64, non-preemptive kernel) and
+the "debug.kprobes_optimization" kernel parameter is set to 1 (see
+sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
+instruction instead of a breakpoint instruction at each probepoint.
+
+1.4.1 Init a Kprobe
+
+When a probe is registered, before attempting this optimization,
+Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
+address. So, even if it's not possible to optimize this particular
+probepoint, there'll be a probe there.
+
+1.4.2 Safety Check
+
+Before optimizing a probe, Kprobes performs the following safety checks:
+
+- Kprobes verifies that the region that will be replaced by the jump
+instruction (the "optimized region") lies entirely within one function.
+(A jump instruction is multiple bytes, and so may overlay multiple
+instructions.)
+
+- Kprobes analyzes the entire function and verifies that there is no
+jump into the optimized region. Specifically:
+  - the function contains no indirect jump;
+  - the function contains no instruction that causes an exception (since
+  the fixup code triggered by the exception could jump back into the
+  optimized region -- Kprobes checks the exception tables to verify this);
+  and
+  - there is no near jump to the optimized region (other than to the first
+  byte).
+
+- For each instruction in the optimized region, Kprobes verifies that
+the instruction can be executed out of line.
+
+1.4.3 Preparing Detour Buffer
+
+Next, Kprobes prepares a "detour" buffer, which contains the following
+instruction sequence:
+- code to push the CPU's registers (emulating a breakpoint trap)
+- a call to the trampoline code which calls user's probe handlers.
+- code to restore registers
+- the instructions from the optimized region
+- a jump back to the original execution path.
+
+1.4.4 Pre-optimization
+
+After preparing the detour buffer, Kprobes verifies that none of the
+following situations exist:
+- The probe has either a break_handler (i.e., it's a jprobe) or a
+post_handler.
+- Other instructions in the optimized region are probed.
+- The probe is disabled.
+In any of the above cases, Kprobes won't start optimizing the probe.
+Since these are temporary situations, Kprobes tries to start
+optimizing it again if the situation is changed.
+
+If the kprobe can be optimized, Kprobes enqueues the kprobe to an
+optimizing list, and kicks the kprobe-optimizer workqueue to optimize
+it. If the to-be-optimized probepoint is hit before being optimized,
+Kprobes returns control to the original instruction path by setting
+the CPU's instruction pointer to the copied code in the detour buffer
+-- thus at least avoiding the single-step.
+
+1.4.5 Optimization
+
+The Kprobe-optimizer doesn't insert the jump instruction immediately;
+rather, it calls synchronize_sched() for safety first, because it's
+possible for a CPU to be interrupted in the middle of executing the
+optimized region(*). As you know, synchronize_sched() can ensure
+that all interruptions that were active when synchronize_sched()
+was called are done, but only if CONFIG_PREEMPT=n. So, this version
+of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**)
+
+After that, the Kprobe-optimizer calls stop_machine() to replace
+the optimized region with a jump instruction to the detour buffer,
+using text_poke_smp().
+
+1.4.6 Unoptimization
+
+When an optimized kprobe is unregistered, disabled, or blocked by
+another kprobe, it will be unoptimized. If this happens before
+the optimization is complete, the kprobe is just dequeued from the
+optimized list. If the optimization has been done, the jump is
+replaced with the original code (except for an int3 breakpoint in
+the first byte) by using text_poke_smp().
+
+(*)Please imagine that the 2nd instruction is interrupted and then
+the optimizer replaces the 2nd instruction with the jump *address*
+while the interrupt handler is running. When the interrupt
+returns to original address, there is no valid instruction,
+and it causes an unexpected result.
+
+(**)This optimization-safety checking may be replaced with the
+stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
+kernel.
+
+NOTE for geeks:
+The jump optimization changes the kprobe's pre_handler behavior.
+Without optimization, the pre_handler can change the kernel's execution
+path by changing regs->ip and returning 1. However, when the probe
+is optimized, that modification is ignored. Thus, if you want to
+tweak the kernel's execution path, you need to suppress optimization,
+using one of the following techniques:
+- Specify an empty function for the kprobe's post_handler or break_handler.
+ or
+- Config CONFIG_OPTPROBES=n.
+ or
+- Execute 'sysctl -w debug.kprobes_optimization=n'
+
 2. Architectures Supported
 
 Kprobes, jprobes, and return probes are implemented on the following
 architectures:
 
-- i386
-- x86_64 (AMD-64, EM64T)
+- i386 (Supports jump optimization)
+- x86_64 (AMD-64, EM64T) (Supports jump optimization)
 - ppc64
 - ia64 (Does not support probes on instruction slot1.)
 - sparc64 (Return probes not yet implemented.)
@@ -193,6 +307,10 @@ it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
 so you can use "objdump -d -l vmlinux" to see the source-to-object
 code mapping.
 
+If you want to reduce probing overhead, set "Kprobes jump optimization
+support" (CONFIG_OPTPROBES) to "y". You can find this option under the
+"Kprobes" line.
+
 4. API Reference
 
 The Kprobes API includes a "register" function and an "unregister"
@@ -389,7 +507,10 @@ the probe which has been registered.
 
 Kprobes allows multiple probes at the same address. Currently,
 however, there cannot be multiple jprobes on the same function at
-the same time.
+the same time. Also, a probepoint for which there is a jprobe or
+a post_handler cannot be optimized. So if you install a jprobe,
+or a kprobe with a post_handler, at an optimized probepoint, the
+probepoint will be unoptimized automatically.
 
 In general, you can install a probe anywhere in the kernel.
 In particular, you can probe interrupt handlers. Known exceptions
@@ -453,6 +574,38 @@ reason, Kprobes doesn't support return probes (or kprobes or jprobes)
 on the x86_64 version of __switch_to(); the registration functions
 return -EINVAL.
 
+On x86/x86-64, since the Jump Optimization of Kprobes modifies
+instructions widely, there are some limitations to optimization. To
+explain it, we introduce some terminology. Imagine a 3-instruction
+sequence consisting of a two 2-byte instructions and one 3-byte
+instruction.
+
+        IA
+         |
+[-2][-1][0][1][2][3][4][5][6][7]
+        [ins1][ins2][  ins3 ]
+        [<-     DCR       ->]
+           [<- JTPR ->]
+
+ins1: 1st Instruction
+ins2: 2nd Instruction
+ins3: 3rd Instruction
+IA: Insertion Address
+JTPR: Jump Target Prohibition Region
+DCR: Detoured Code Region
+
+The instructions in DCR are copied to the out-of-line buffer
+of the kprobe, because the bytes in DCR are replaced by
+a 5-byte jump instruction. So there are several limitations.
+
+a) The instructions in DCR must be relocatable.
+b) The instructions in DCR must not include a call instruction.
+c) JTPR must not be targeted by any jump or call instruction.
+d) DCR must not straddle the border betweeen functions.
+
+Anyway, these limitations are checked by the in-kernel instruction
+decoder, so you don't need to worry about that.
+
 6. Probe Overhead
 
 On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
@@ -476,6 +629,19 @@ k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07
 ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
 k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99
 
+6.1 Optimized Probe Overhead
+
+Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
+process. Here are sample overhead figures (in usec) for x86 architectures.
+k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
+r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
+
+i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
+
+x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
+
 7. TODO
 
 a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
@@ -523,7 +689,8 @@ is also specified. Following columns show probe status. If the probe is on
 a virtual address that is no longer valid (module init sections, module
 virtual addresses that correspond to modules that've been unloaded),
 such probes are marked with [GONE]. If the probe is temporarily disabled,
-such probes are marked with [DISABLED].
+such probes are marked with [DISABLED]. If the probe is optimized, it is
+marked with [OPTIMIZED].
 
 /sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
 
@@ -533,3 +700,19 @@ registered probes will be disarmed, till such time a "1" is echoed to this
 file. Note that this knob just disarms and arms all kprobes and doesn't
 change each probe's disabling state. This means that disabled kprobes (marked
 [DISABLED]) will be not enabled if you turn ON all kprobes by this knob.
+
+
+Appendix B: The kprobes sysctl interface
+
+/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
+
+When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
+a knob to globally and forcibly turn jump optimization (see section
+1.4) ON or OFF. By default, jump optimization is allowed (ON).
+If you echo "0" to this file or set "debug.kprobes_optimization" to
+0 via sysctl, all optimized probes will be unoptimized, and any new
+probes registered after that will not be optimized. Note that this
+knob *changes* the optimized state. This means that optimized probes
+(marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
+removed). If the knob is turned on, they will be optimized again.
+
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 2811e452f75..c6416a39816 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -23,12 +23,12 @@ of a virtual machine. The ioctls belong to three classes
 Only run vcpu ioctls from the same thread that was used to create the
 vcpu.
 
-2. File descritpors
+2. File descriptors
 
 The kvm API is centered around file descriptors. An initial
 open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
 can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
-handle will create a VM file descripror which can be used to issue VM
+handle will create a VM file descriptor which can be used to issue VM
 ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
 and return a file descriptor pointing to it. Finally, ioctls on a vcpu
 fd can be used to control the vcpu, including the important task of
@@ -643,7 +643,7 @@ Type: vm ioctl
 Parameters: struct kvm_clock_data (in)
 Returns: 0 on success, -1 on error
 
-Sets the current timestamp of kvmclock to the valued specific in its parameter.
+Sets the current timestamp of kvmclock to the value specified in its parameter.
 In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
 such as migration.
 
@@ -795,11 +795,11 @@ Unused.
 			__u64 data_offset; /* relative to kvm_run start */
 		} io;
 
-If exit_reason is KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT, then the vcpu has
+If exit_reason is KVM_EXIT_IO, then the vcpu has
 executed a port I/O instruction which could not be satisfied by kvm.
 data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
 where kvm expects application code to place the data for the next
-KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a patcked array.
+KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array.
 
 		struct {
 			struct kvm_debug_exit_arch arch;
@@ -815,7 +815,7 @@ Unused.
 			__u8 is_write;
 		} mmio;
 
-If exit_reason is KVM_EXIT_MMIO or KVM_EXIT_IO_OUT, then the vcpu has
+If exit_reason is KVM_EXIT_MMIO, then the vcpu has
 executed a memory-mapped I/O instruction which could not be satisfied
 by kvm. The 'data' member contains the written data if 'is_write'
 is true, and should be filled by application code otherwise.
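Note on the api.txt hunks: the corrected text describes a port I/O exit as a single exit reason (KVM_EXIT_IO), with KVM_EXIT_IO_IN and KVM_EXIT_IO_OUT selecting the direction and data_offset locating a packed array inside the kvm_run mapping. A minimal sketch of that handling follows, in Python and against a plain bytearray rather than a real mmap of a vcpu fd. The two direction values are the ones recalled from <linux/kvm.h>; the IoExit class, handle_io_exit(), the port callbacks, and the little-endian byte order (an x86 assumption) are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Direction values as recalled from <linux/kvm.h>; treat them as assumptions.
KVM_EXIT_IO_IN = 0
KVM_EXIT_IO_OUT = 1


@dataclass
class IoExit:
    """Hypothetical mirror of the 'io' member of struct kvm_run."""
    direction: int    # KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT
    size: int         # bytes per access
    port: int
    count: int        # number of elements in the packed array
    data_offset: int  # relative to the start of kvm_run


def handle_io_exit(run: bytearray, io: IoExit,
                   port_read: Callable[[int, int], int],
                   port_write: Callable[[int, int, int], None]) -> None:
    """Service one port I/O exit on a byte-level model of the kvm_run area."""
    mask = (1 << (8 * io.size)) - 1
    for i in range(io.count):
        # Elements are packed back to back, starting at data_offset.
        start = io.data_offset + i * io.size
        if io.direction == KVM_EXIT_IO_OUT:
            # The guest wrote: the data is already in the buffer.
            value = int.from_bytes(run[start:start + io.size], "little")
            port_write(io.port, io.size, value)
        elif io.direction == KVM_EXIT_IO_IN:
            # The guest reads: place the data for the next KVM_RUN.
            value = port_read(io.port, io.size) & mask
            run[start:start + io.size] = value.to_bytes(io.size, "little")
        else:
            raise ValueError(f"unknown direction {io.direction}")
```

In a real VMM the buffer would be the mmap'ed kvm_run region of the vcpu fd and the callbacks would be device models; the sketch only shows the offset arithmetic and the direction dispatch.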
diff --git a/Documentation/powerpc/dts-bindings/fsl/dma.txt b/Documentation/powerpc/dts-bindings/fsl/dma.txt
index 0732cdd05ba..2a4b4bce611 100644
--- a/Documentation/powerpc/dts-bindings/fsl/dma.txt
+++ b/Documentation/powerpc/dts-bindings/fsl/dma.txt
@@ -44,21 +44,29 @@ Example:
 			compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
 			cell-index = <0>;
 			reg = <0 0x80>;
+			interrupt-parent = <&ipic>;
+			interrupts = <71 8>;
 		};
 		dma-channel@80 {
 			compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
 			cell-index = <1>;
 			reg = <0x80 0x80>;
+			interrupt-parent = <&ipic>;
+			interrupts = <71 8>;
 		};
 		dma-channel@100 {
 			compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
 			cell-index = <2>;
 			reg = <0x100 0x80>;
+			interrupt-parent = <&ipic>;
+			interrupts = <71 8>;
 		};
 		dma-channel@180 {
 			compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
 			cell-index = <3>;
 			reg = <0x180 0x80>;
+			interrupt-parent = <&ipic>;
+			interrupts = <71 8>;
 		};
 	};
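Note on the kprobes.txt hunk that introduces DCR and JTPR: the two regions can be worked out from the instruction lengths at the insertion address. The sketch below is a userspace model in Python, not the in-kernel instruction decoder. It assumes DCR is the run of whole instructions covered by the 5-byte jump, and JTPR is the jump's bytes other than the first, which is one reading of the diagram and of the "other than to the first byte" safety check in section 1.4.2. The function names are hypothetical.

```python
from typing import List, Tuple

JUMP_SIZE = 5  # length of the jump instruction named in the text


def optimized_regions(insn_lengths: List[int]) -> Tuple[range, range]:
    """Return (DCR, JTPR) as byte offsets relative to the insertion address.

    insn_lengths holds the lengths of the consecutive instructions that
    start at the insertion address (offset 0).
    """
    end = 0
    for length in insn_lengths:
        if end >= JUMP_SIZE:
            break
        end += length  # an instruction touched by the jump is detoured whole
    if end < JUMP_SIZE:
        raise ValueError("not enough instruction bytes to hold the jump")
    dcr = range(0, end)          # Detoured Code Region
    jtpr = range(1, JUMP_SIZE)   # Jump Target Prohibition Region
    return dcr, jtpr


def jump_target_allowed(target_offset: int, insn_lengths: List[int]) -> bool:
    """True if a jump landing at target_offset stays out of the JTPR."""
    _, jtpr = optimized_regions(insn_lengths)
    return target_offset not in jtpr
```

For the document's example of two 2-byte instructions followed by one 3-byte instruction, `optimized_regions([2, 2, 3])` gives a DCR of offsets 0 to 6 and a JTPR of offsets 1 to 4, matching the diagram; a jump to the start of ins2 (offset 2) would be rejected, while a jump to the insertion address itself is acceptable.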