diff options
Diffstat (limited to 'Documentation')
55 files changed, 3348 insertions, 710 deletions
diff --git a/Documentation/ABI/stable/sysfs-driver-usb-usbtmc b/Documentation/ABI/stable/sysfs-driver-usb-usbtmc new file mode 100644 index 00000000000..9a75fb22187 --- /dev/null +++ b/Documentation/ABI/stable/sysfs-driver-usb-usbtmc @@ -0,0 +1,62 @@ +What: /sys/bus/usb/drivers/usbtmc/devices/*/interface_capabilities +What: /sys/bus/usb/drivers/usbtmc/devices/*/device_capabilities +Date: August 2008 +Contact: Greg Kroah-Hartman <gregkh@suse.de> +Description: + These files show the various USB TMC capabilities as described + by the device itself. The full description of the bitfields + can be found in the USB TMC documents from the USB-IF entitled + "Universal Serial Bus Test and Measurement Class Specification + (USBTMC) Revision 1.0" section 4.2.1.8. + + The files are read only. + + +What: /sys/bus/usb/drivers/usbtmc/devices/*/usb488_interface_capabilities +What: /sys/bus/usb/drivers/usbtmc/devices/*/usb488_device_capabilities +Date: August 2008 +Contact: Greg Kroah-Hartman <gregkh@suse.de> +Description: + These files show the various USB TMC capabilities as described + by the device itself. The full description of the bitfields + can be found in the USB TMC documents from the USB-IF entitled + "Universal Serial Bus Test and Measurement Class, Subclass + USB488 Specification (USBTMC-USB488) Revision 1.0" section + 4.2.2. + + The files are read only. + + +What: /sys/bus/usb/drivers/usbtmc/devices/*/TermChar +Date: August 2008 +Contact: Greg Kroah-Hartman <gregkh@suse.de> +Description: + This file is the TermChar value to be sent to the USB TMC + device as described by the document, "Universal Serial Bus Test + and Measurement Class Specification + (USBTMC) Revision 1.0" as published by the USB-IF. + + Note that the TermCharEnabled file determines if this value is + sent to the device or not. + + +What: /sys/bus/usb/drivers/usbtmc/devices/*/TermCharEnabled +Date: August 2008 +Contact: Greg Kroah-Hartman <gregkh@suse.de> +Description: + This file determines if the TermChar is to be sent to the + device on every transaction or not. For more details about + this, please see the document, "Universal Serial Bus Test and + Measurement Class Specification (USBTMC) Revision 1.0" as + published by the USB-IF. + + +What: /sys/bus/usb/drivers/usbtmc/devices/*/auto_abort +Date: August 2008 +Contact: Greg Kroah-Hartman <gregkh@suse.de> +Description: + This file determines if the the transaction of the USB TMC + device is to be automatically aborted if there is any error. + For more details about this, please see the document, + "Universal Serial Bus Test and Measurement Class Specification + (USBTMC) Revision 1.0" as published by the USB-IF. diff --git a/Documentation/ABI/testing/sysfs-bus-umc b/Documentation/ABI/testing/sysfs-bus-umc new file mode 100644 index 00000000000..948fec41244 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-umc @@ -0,0 +1,28 @@ +What: /sys/bus/umc/ +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The Wireless Host Controller Interface (WHCI) + specification describes a PCI-based device with + multiple capabilities; the UWB Multi-interface + Controller (UMC). + + The umc bus presents each of the individual + capabilties as a device. + +What: /sys/bus/umc/devices/.../capability_id +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The ID of this capability, with 0 being the radio + controller capability. + +What: /sys/bus/umc/devices/.../version +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The specification version this capability's hardware + interface complies with. diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb index 11a3c1682ce..7772928ee48 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb +++ b/Documentation/ABI/testing/sysfs-bus-usb @@ -85,3 +85,62 @@ Description: Users: PowerTOP <power@bughost.org> http://www.lesswatts.org/projects/powertop/ + +What: /sys/bus/usb/device/<busnum>-<devnum>...:<config num>-<interface num>/supports_autosuspend +Date: January 2008 +KernelVersion: 2.6.27 +Contact: Sarah Sharp <sarah.a.sharp@intel.com> +Description: + When read, this file returns 1 if the interface driver + for this interface supports autosuspend. It also + returns 1 if no driver has claimed this interface, as an + unclaimed interface will not stop the device from being + autosuspended if all other interface drivers are idle. + The file returns 0 if autosuspend support has not been + added to the driver. +Users: + USB PM tool + git://git.moblin.org/users/sarah/usb-pm-tool/ + +What: /sys/bus/usb/device/.../authorized +Date: July 2008 +KernelVersion: 2.6.26 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + Authorized devices are available for use by device + drivers, non-authorized one are not. By default, wired + USB devices are authorized. + + Certified Wireless USB devices are not authorized + initially and should be (by writing 1) after the + device has been authenticated. + +What: /sys/bus/usb/device/.../wusb_cdid +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + For Certified Wireless USB devices only. + + A devices's CDID, as 16 space-separated hex octets. + +What: /sys/bus/usb/device/.../wusb_ck +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + For Certified Wireless USB devices only. + + Write the device's connection key (CK) to start the + authentication of the device. The CK is 16 + space-separated hex octets. + +What: /sys/bus/usb/device/.../wusb_disconnect +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + For Certified Wireless USB devices only. + + Write a 1 to force the device to disconnect + (equivalent to unplugging a wired USB device). diff --git a/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg b/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg new file mode 100644 index 00000000000..cb830df8777 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg @@ -0,0 +1,43 @@ +Where: /sys/bus/usb/.../powered +Date: August 2008 +Kernel Version: 2.6.26 +Contact: Harrison Metzger <harrisonmetz@gmail.com> +Description: Controls whether the device's display will powered. + A value of 0 is off and a non-zero value is on. + +Where: /sys/bus/usb/.../mode_msb +Where: /sys/bus/usb/.../mode_lsb +Date: August 2008 +Kernel Version: 2.6.26 +Contact: Harrison Metzger <harrisonmetz@gmail.com> +Description: Controls the devices display mode. + For a 6 character display the values are + MSB 0x06; LSB 0x3F, and + for an 8 character display the values are + MSB 0x08; LSB 0xFF. + +Where: /sys/bus/usb/.../textmode +Date: August 2008 +Kernel Version: 2.6.26 +Contact: Harrison Metzger <harrisonmetz@gmail.com> +Description: Controls the way the device interprets its text buffer. + raw: each character controls its segment manually + hex: each character is between 0-15 + ascii: each character is between '0'-'9' and 'A'-'F'. + +Where: /sys/bus/usb/.../text +Date: August 2008 +Kernel Version: 2.6.26 +Contact: Harrison Metzger <harrisonmetz@gmail.com> +Description: The text (or data) for the device to display + +Where: /sys/bus/usb/.../decimals +Date: August 2008 +Kernel Version: 2.6.26 +Contact: Harrison Metzger <harrisonmetz@gmail.com> +Description: Controls the decimal places on the device. + To set the nth decimal place, give this field + the value of 10 ** n. Assume this field has + the value k and has 1 or more decimal places set, + to set the mth place (where m is not already set), + change this fields value to k + 10 ** m.
\ No newline at end of file diff --git a/Documentation/ABI/testing/sysfs-class-usb_host b/Documentation/ABI/testing/sysfs-class-usb_host new file mode 100644 index 00000000000..46b66ad1f1b --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-usb_host @@ -0,0 +1,25 @@ +What: /sys/class/usb_host/usb_hostN/wusb_chid +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + Write the CHID (16 space-separated hex octets) for this host controller. + This starts the host controller, allowing it to accept connection from + WUSB devices. + + Set an all zero CHID to stop the host controller. + +What: /sys/class/usb_host/usb_hostN/wusb_trust_timeout +Date: July 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + Devices that haven't sent a WUSB packet to the host + within 'wusb_trust_timeout' ms are considered to have + disconnected and are removed. The default value of + 4000 ms is the value required by the WUSB + specification. + + Since this relates to security (specifically, the + lifetime of PTKs and GTKs) it should not be changed + from the default. diff --git a/Documentation/ABI/testing/sysfs-class-uwb_rc b/Documentation/ABI/testing/sysfs-class-uwb_rc new file mode 100644 index 00000000000..a0d18dbeb7a --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-uwb_rc @@ -0,0 +1,144 @@ +What: /sys/class/uwb_rc +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + Interfaces for WiMedia Ultra Wideband Common Radio + Platform (UWB) radio controllers. + + Familiarity with the ECMA-368 'High Rate Ultra + Wideband MAC and PHY Specification' is assumed. + +What: /sys/class/uwb_rc/beacon_timeout_ms +Date: July 2008 +KernelVersion: 2.6.27 +Description: + If no beacons are received from a device for at least + this time, the device will be considered to have gone + and it will be removed. The default is 3 superframes + (~197 ms) as required by the specification. + +What: /sys/class/uwb_rc/uwbN/ +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + An individual UWB radio controller. + +What: /sys/class/uwb_rc/uwbN/beacon +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + Write: + + <channel> [<bpst offset>] + + to start beaconing on a specific channel, or stop + beaconing if <channel> is -1. Valid channels depends + on the radio controller's supported band groups. + + <bpst offset> may be used to try and join a specific + beacon group if more than one was found during a scan. + +What: /sys/class/uwb_rc/uwbN/scan +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + Write: + + <channel> <type> [<bpst offset>] + + to start (or stop) scanning on a channel. <type> is one of: + 0 - scan + 1 - scan outside BP + 2 - scan while inactive + 3 - scanning disabled + 4 - scan (with start time of <bpst offset>) + +What: /sys/class/uwb_rc/uwbN/mac_address +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + The EUI-48, in colon-separated hex octets, for this + radio controller. A write will change the radio + controller's EUI-48 but only do so while the device is + not beaconing or scanning. + +What: /sys/class/uwb_rc/uwbN/wusbhc +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + A symlink to the device (if any) of the WUSB Host + Controller PAL using this radio controller. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/ +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + A neighbour UWB device that has either been detected + as part of a scan or is a member of the radio + controllers beacon group. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/BPST +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + The time (using the radio controllers internal 1 ms + interval superframe timer) of the last beacon from + this device was received. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/DevAddr +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + The current DevAddr of this device in colon separated + hex octets. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/EUI_48 +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + + The EUI-48 of this device in colon separated hex + octets. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/BPST +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/IEs +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + The latest IEs included in this device's beacon, in + space separated hex octets with one IE per line. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/LQE +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + Link Quality Estimate - the Signal to Noise Ratio + (SNR) of all packets received from this device in dB. + This gives an estimate on a suitable PHY rate. Refer + to [ECMA-368] section 13.3 for more details. + +What: /sys/class/uwb_rc/uwbN/<EUI-48>/RSSI +Date: July 2008 +KernelVersion: 2.6.27 +Contact: linux-usb@vger.kernel.org +Description: + Received Signal Strength Indication - the strength of + the received signal in dB. LQE is a more useful + measure of the radio link quality. diff --git a/Documentation/ABI/testing/sysfs-wusb_cbaf b/Documentation/ABI/testing/sysfs-wusb_cbaf new file mode 100644 index 00000000000..a99c5f86a37 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-wusb_cbaf @@ -0,0 +1,100 @@ +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_* +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + Various files for managing Cable Based Association of + (wireless) USB devices. + + The sequence of operations should be: + + 1. Device is plugged in. + + 2. The connection manager (CM) sees a device with CBA capability. + (the wusb_chid etc. files in /sys/devices/blah/OURDEVICE). + + 3. The CM writes the host name, supported band groups, + and the CHID (host ID) into the wusb_host_name, + wusb_host_band_groups and wusb_chid files. These + get sent to the device and the CDID (if any) for + this host is requested. + + 4. The CM can verify that the device's supported band + groups (wusb_device_band_groups) are compatible + with the host. + + 5. The CM reads the wusb_cdid file. + + 6. The CM looks it up its database. + + - If it has a matching CHID,CDID entry, the device + has been authorized before and nothing further + needs to be done. + + - If the CDID is zero (or the CM doesn't find a + matching CDID in its database), the device is + assumed to be not known. The CM may associate + the host with device by: writing a randomly + generated CDID to wusb_cdid and then a random CK + to wusb_ck (this uploads the new CC to the + device). + + CMD may choose to prompt the user before + associating with a new device. + + 7. Device is unplugged. + + References: + [WUSB-AM] Association Models Supplement to the + Certified Wireless Universal Serial Bus + Specification, version 1.0. + +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_chid +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The CHID of the host formatted as 16 space-separated + hex octets. + + Writes fetches device's supported band groups and the + the CDID for any existing association with this host. + +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_host_name +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + A friendly name for the host as a UTF-8 encoded string. + +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_host_band_groups +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The band groups supported by the host, in the format + defined in [WUSB-AM]. + +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_device_band_groups +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The band groups supported by the device, in the format + defined in [WUSB-AM]. + +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_cdid +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + The device's CDID formatted as 16 space-separated hex + octets. + +What: /sys/bus/usb/drivers/wusb_cbaf/.../wusb_ck +Date: August 2008 +KernelVersion: 2.6.27 +Contact: David Vrabel <david.vrabel@csr.com> +Description: + Write 16 space-separated random, hex octets to + associate with the device. diff --git a/Documentation/DocBook/gadget.tmpl b/Documentation/DocBook/gadget.tmpl index ea3bc9565e6..6ef2f0073e5 100644 --- a/Documentation/DocBook/gadget.tmpl +++ b/Documentation/DocBook/gadget.tmpl @@ -557,6 +557,9 @@ Near-term plans include converting all of them, except for "gadgetfs". </para> !Edrivers/usb/gadget/f_acm.c +!Edrivers/usb/gadget/f_ecm.c +!Edrivers/usb/gadget/f_subset.c +!Edrivers/usb/gadget/f_obex.c !Edrivers/usb/gadget/f_serial.c </sect1> diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl index 4c63e586416..ae15d55350e 100644 --- a/Documentation/DocBook/kernel-hacking.tmpl +++ b/Documentation/DocBook/kernel-hacking.tmpl @@ -1105,7 +1105,7 @@ static struct block_device_operations opt_fops = { </listitem> <listitem> <para> - Function names as strings (__FUNCTION__). + Function names as strings (__func__). </para> </listitem> <listitem> diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt index a51f693c154..256defd7e17 100644 --- a/Documentation/MSI-HOWTO.txt +++ b/Documentation/MSI-HOWTO.txt @@ -236,10 +236,8 @@ software system can set different pages for controlling accesses to the MSI-X structure. The implementation of MSI support requires the PCI subsystem, not a device driver, to maintain full control of the MSI-X table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X -table/MSI-X PBA. A device driver is prohibited from requesting the MMIO -address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem -will fail enabling MSI-X on its hardware device when it calls the function -pci_enable_msix(). +table/MSI-X PBA. A device driver should not access the MMIO address +space of the MSI-X table/MSI-X PBA. 5.3.2 API pci_enable_msix diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt index 8d4dc6250c5..fd4907a2968 100644 --- a/Documentation/PCI/pci.txt +++ b/Documentation/PCI/pci.txt @@ -163,6 +163,10 @@ need pass only as many optional fields as necessary: o class and classmask fields default to 0 o driver_data defaults to 0UL. +Note that driver_data must match the value used by any of the pci_device_id +entries defined in the driver. This makes the driver_data field mandatory +if all the pci_device_id entries have a non-zero driver_data value. + Once added, the driver probe routine will be invoked for any unclaimed PCI devices listed in its (newly updated) pci_ids list. diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt index 16c251230c8..ddeb14beacc 100644 --- a/Documentation/PCI/pcieaer-howto.txt +++ b/Documentation/PCI/pcieaer-howto.txt @@ -203,22 +203,17 @@ to mmio_enabled. 3.3 helper functions -3.3.1 int pci_find_aer_capability(struct pci_dev *dev); -pci_find_aer_capability locates the PCI Express AER capability -in the device configuration space. If the device doesn't support -PCI-Express AER, the function returns 0. - -3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); +3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev); pci_enable_pcie_error_reporting enables the device to send error messages to root port when an error is detected. Note that devices don't enable the error reporting by default, so device drivers need call this function to enable it. -3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); +3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev); pci_disable_pcie_error_reporting disables the device to send error messages to root port when an error is detected. -3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); +3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable error status register. diff --git a/Documentation/cgroups.txt b/Documentation/cgroups/cgroups.txt index d9014aa0eb6..d9014aa0eb6 100644 --- a/Documentation/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt new file mode 100644 index 00000000000..c50ab58b72e --- /dev/null +++ b/Documentation/cgroups/freezer-subsystem.txt @@ -0,0 +1,99 @@ + The cgroup freezer is useful to batch job management system which start +and stop sets of tasks in order to schedule the resources of a machine +according to the desires of a system administrator. This sort of program +is often used on HPC clusters to schedule access to the cluster as a +whole. The cgroup freezer uses cgroups to describe the set of tasks to +be started/stopped by the batch job management system. It also provides +a means to start and stop the tasks composing the job. + + The cgroup freezer will also be useful for checkpointing running groups +of tasks. The freezer allows the checkpoint code to obtain a consistent +image of the tasks by attempting to force the tasks in a cgroup into a +quiescent state. Once the tasks are quiescent another task can +walk /proc or invoke a kernel interface to gather information about the +quiesced tasks. Checkpointed tasks can be restarted later should a +recoverable error occur. This also allows the checkpointed tasks to be +migrated between nodes in a cluster by copying the gathered information +to another node and restarting the tasks there. + + Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping +and resuming tasks in userspace. Both of these signals are observable +from within the tasks we wish to freeze. While SIGSTOP cannot be caught, +blocked, or ignored it can be seen by waiting or ptracing parent tasks. +SIGCONT is especially unsuitable since it can be caught by the task. Any +programs designed to watch for SIGSTOP and SIGCONT could be broken by +attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can +demonstrate this problem using nested bash shells: + + $ echo $$ + 16644 + $ bash + $ echo $$ + 16690 + + From a second, unrelated bash shell: + $ kill -SIGSTOP 16690 + $ kill -SIGCONT 16990 + + <at this point 16990 exits and causes 16644 to exit too> + + This happens because bash can observe both signals and choose how it +responds to them. + + Another example of a program which catches and responds to these +signals is gdb. In fact any program designed to use ptrace is likely to +have a problem with this method of stopping and resuming tasks. + + In contrast, the cgroup freezer uses the kernel freezer code to +prevent the freeze/unfreeze cycle from becoming visible to the tasks +being frozen. This allows the bash example above and gdb to run as +expected. + + The freezer subsystem in the container filesystem defines a file named +freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the +cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup. +Reading will return the current state. + +* Examples of usage : + + # mkdir /containers/freezer + # mount -t cgroup -ofreezer freezer /containers + # mkdir /containers/0 + # echo $some_pid > /containers/0/tasks + +to get status of the freezer subsystem : + + # cat /containers/0/freezer.state + THAWED + +to freeze all tasks in the container : + + # echo FROZEN > /containers/0/freezer.state + # cat /containers/0/freezer.state + FREEZING + # cat /containers/0/freezer.state + FROZEN + +to unfreeze all tasks in the container : + + # echo THAWED > /containers/0/freezer.state + # cat /containers/0/freezer.state + THAWED + +This is the basic mechanism which should do the right thing for user space task +in a simple scenario. + +It's important to note that freezing can be incomplete. In that case we return +EBUSY. This means that some tasks in the cgroup are busy doing something that +prevents us from completely freezing the cgroup at this time. After EBUSY, +the cgroup will remain partially frozen -- reflected by freezer.state reporting +"FREEZING" when read. The state will remain "FREEZING" until one of these +things happens: + + 1) Userspace cancels the freezing operation by writing "THAWED" to + the freezer.state file + 2) Userspace retries the freezing operation by writing "FROZEN" to + the freezer.state file (writing "FREEZING" is not legal + and returns EIO) + 3) The tasks that blocked the cgroup from entering the "FROZEN" + state disappear from the cgroup's set of tasks. diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt index 9b53d582736..1c07547d3f8 100644 --- a/Documentation/controllers/memory.txt +++ b/Documentation/controllers/memory.txt @@ -112,14 +112,22 @@ the per cgroup LRU. 2.2.1 Accounting details -All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. -RSS pages are accounted at the time of page_add_*_rmap() unless they've already -been accounted for earlier. A file page will be accounted for as Page Cache; -it's mapped into the page tables of a process, duplicate accounting is carefully -avoided. Page Cache pages are accounted at the time of add_to_page_cache(). -The corresponding routines that remove a page from the page tables or removes -a page from Page Cache is used to decrement the accounting counters of the -cgroup. +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. +(some pages which never be reclaimable and will not be on global LRU + are not accounted. we just accounts pages under usual vm management.) + +RSS pages are accounted at page_fault unless they've already been accounted +for earlier. A file page will be accounted for as Page Cache when it's +inserted into inode (radix-tree). While it's mapped into the page tables of +processes, duplicate accounting is carefully avoided. + +A RSS page is unaccounted when it's fully unmapped. A PageCache page is +unaccounted when it's removed from radix-tree. + +At page migration, accounting information is kept. + +Note: we just account pages-on-lru because our purpose is to control amount +of used pages. not-on-lru pages are tend to be out-of-control from vm view. 2.3 Shared Page Accounting diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 47e568a9370..5c86c258c79 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -48,7 +48,7 @@ hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets use the generic cgroup subsystem described in -Documentation/cgroup.txt. +Documentation/cgroups/cgroups.txt. Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and diff --git a/Documentation/devices.txt b/Documentation/devices.txt index 05c80645e4e..2be08240ee8 100644 --- a/Documentation/devices.txt +++ b/Documentation/devices.txt @@ -2571,6 +2571,9 @@ Your cooperation is appreciated. 160 = /dev/usb/legousbtower0 1st USB Legotower device ... 175 = /dev/usb/legousbtower15 16th USB Legotower device + 176 = /dev/usb/usbtmc1 First USB TMC device + ... + 192 = /dev/usb/usbtmc16 16th USB TMC device 240 = /dev/usb/dabusb0 First daubusb device ... 243 = /dev/usb/dabusb3 Fourth dabusb device diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index f5f812daf9f..05d71b4b943 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -359,3 +359,11 @@ Why: The 2.6 kernel supports direct writing to ide CD drives, which eliminates the need for ide-scsi. The new method is more efficient in every way. Who: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> + +--------------------------- + +What: i2c_attach_client(), i2c_detach_client(), i2c_driver->detach_client() +When: 2.6.29 (ideally) or 2.6.30 (more likely) +Why: Deprecated by the new (standard) device driver binding model. Use + i2c_driver->probe() and ->remove() instead. +Who: Jean Delvare <khali@linux-fr.org> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 295f26cd895..9dd2a3bb2ac 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -96,6 +96,11 @@ errors=remount-ro(*) Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. + grpid Give objects the same group ID as their creator. bsdgroups diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index eb154ef36c2..174eaff7ded 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -2,19 +2,24 @@ Ext4 Filesystem =============== -This is a development version of the ext4 filesystem, an advanced level -of the ext3 filesystem which incorporates scalability and reliability -enhancements for supporting large filesystems (64 bit) in keeping with -increasing disk capacities and state-of-the-art feature requirements. +Ext4 is an an advanced level of the ext3 filesystem which incorporates +scalability and reliability enhancements for supporting large filesystems +(64 bit) in keeping with increasing disk capacities and state-of-the-art +feature requirements. -Mailing list: linux-ext4@vger.kernel.org +Mailing list: linux-ext4@vger.kernel.org +Web site: http://ext4.wiki.kernel.org 1. Quick usage instructions: =========================== +Note: More extensive information for getting started with ext4 can be + found at the ext4 wiki site at the URL: + http://ext4.wiki.kernel.org/index.php/Ext4_Howto + - Compile and install the latest version of e2fsprogs (as of this - writing version 1.41) from: + writing version 1.41.3) from: http://sourceforge.net/project/showfiles.php?group_id=2406 @@ -36,11 +41,9 @@ Mailing list: linux-ext4@vger.kernel.org # mke2fs -t ext4 /dev/hda1 - Or configure an existing ext3 filesystem to support extents and set - the test_fs flag to indicate that it's ok for an in-development - filesystem to touch this filesystem: + Or to configure an existing ext3 filesystem to support extents: - # tune2fs -O extents -E test_fs /dev/hda1 + # tune2fs -O extents /dev/hda1 If the filesystem was created with 128 byte inodes, it can be converted to use 256 byte for greater efficiency via: @@ -104,8 +107,8 @@ exist yet so I'm not sure they're in the near-term roadmap. The big performance win will come with mballoc, delalloc and flex_bg grouping of bitmaps and inode tables. Some test results available here: - - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html - - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html 3. Options ========== @@ -214,9 +217,6 @@ noreservation bsddf (*) Make 'df' act like BSD. minixdf Make 'df' act like Minix. -check=none Don't do extra checking of bitmaps on mount. -nocheck - debug Extra debugging information is sent to syslog. errors=remount-ro(*) Remount the filesystem read-only on an error. @@ -253,8 +253,6 @@ nobh (a) cache disk block mapping information "nobh" option tries to avoid associating buffer heads (supported only for "writeback" mode). -mballoc (*) Use the multiple block allocator for block allocation -nomballoc disabled multiple block allocator for block allocation. stripe=n Number of filesystem blocks that mballoc will try to use for allocation size and alignment. For RAID5/6 systems this should be the number of data diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index c032bf39e8b..bcceb99b81d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1384,15 +1384,18 @@ causes the kernel to prefer to reclaim dentries and inodes. dirty_background_ratio ---------------------- -Contains, as a percentage of total system memory, the number of pages at which -the pdflush background writeback daemon will start writing out dirty data. +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which the pdflush background writeback daemon will start writing out +dirty data. dirty_ratio ----------------- -Contains, as a percentage of total system memory, the number of pages at which -a process which is generating disk writes will itself start writing out dirty -data. +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which a process which is generating disk writes will itself start +writing out dirty data. dirty_writeback_centisecs ------------------------- @@ -2412,24 +2415,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. -The following 4 memory types are supported: +The following 7 memory types are supported: - (bit 0) anonymous private memory - (bit 1) anonymous shared memory - (bit 2) file-backed private memory - (bit 3) file-backed shared memory - (bit 4) ELF header pages in file-backed private memory areas (it is effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory Note that MMIO pages such as frame buffer are never dumped and vDSO pages are always dumped regardless of the bitmask status. -Default value of coredump_filter is 0x3; this means all anonymous memory -segments are dumped. + Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only + effected by bit 5-6. + +Default value of coredump_filter is 0x23; this means all anonymous memory +segments and hugetlb private memory are dumped. If you don't want to dump all shared memory segments attached to pid 1234, -write 1 to the process's proc file. +write 0x21 to the process's proc file. - $ echo 0x1 > /proc/1234/coredump_filter + $ echo 0x21 > /proc/1234/coredump_filter When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs. diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt index 6a0d70a22f0..dd84ea3c10d 100644 --- a/Documentation/filesystems/ubifs.txt +++ b/Documentation/filesystems/ubifs.txt @@ -86,6 +86,15 @@ norm_unmount (*) commit on unmount; the journal is committed fast_unmount do not commit on unmount; this option makes unmount faster, but the next mount slower because of the need to replay the journal. +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc (*) do not skip checking CRCs on data nodes Quick usage instructions diff --git a/Documentation/hwmon/adt7470 b/Documentation/hwmon/adt7470 new file mode 100644 index 00000000000..75d13ca147c --- /dev/null +++ b/Documentation/hwmon/adt7470 @@ -0,0 +1,76 @@ +Kernel driver adt7470 +===================== + +Supported chips: + * Analog Devices ADT7470 + Prefix: 'adt7470' + Addresses scanned: I2C 0x2C, 0x2E, 0x2F + Datasheet: Publicly available at the Analog Devices website + +Author: Darrick J. Wong + +Description +----------- + +This driver implements support for the Analog Devices ADT7470 chip. There may +be other chips that implement this interface. + +The ADT7470 uses the 2-wire interface compatible with the SMBus 2.0 +specification. Using an analog to digital converter it measures up to ten (10) +external temperatures. It has four (4) 16-bit counters for measuring fan speed. +There are four (4) PWM outputs that can be used to control fan speed. + +A sophisticated control system for the PWM outputs is designed into the ADT7470 +that allows fan speed to be adjusted automatically based on any of the ten +temperature sensors. Each PWM output is individually adjustable and +programmable. Once configured, the ADT7470 will adjust the PWM outputs in +response to the measured temperatures with further host intervention. This +feature can also be disabled for manual control of the PWM's. + +Each of the measured inputs (temperature, fan speed) has corresponding high/low +limit values. The ADT7470 will signal an ALARM if any measured value exceeds +either limit. + +The ADT7470 DOES NOT sample all inputs continuously. A single pin on the +ADT7470 is connected to a multitude of thermal diodes, but the chip must be +instructed explicitly to read the multitude of diodes. If you want to use +automatic fan control mode, you must manually read any of the temperature +sensors or the fan control algorithm will not run. The chip WILL NOT DO THIS +AUTOMATICALLY; this must be done from userspace. This may be a bug in the chip +design, given that many other AD chips take care of this. The driver will not +read the registers more often than once every 5 seconds. Further, +configuration data is only read once per minute. + +Special Features +---------------- + +The ADT7470 has a 8-bit ADC and is capable of measuring temperatures with 1 +degC resolution. + +The Analog Devices datasheet is very detailed and describes a procedure for +determining an optimal configuration for the automatic PWM control. + +Configuration Notes +------------------- + +Besides standard interfaces driver adds the following: + +* PWM Control + +* pwm#_auto_point1_pwm and pwm#_auto_point1_temp and +* pwm#_auto_point2_pwm and pwm#_auto_point2_temp - + +point1: Set the pwm speed at a lower temperature bound. +point2: Set the pwm speed at a higher temperature bound. + +The ADT7470 will scale the pwm between the lower and higher pwm speed when +the temperature is between the two temperature boundaries. PWM values range +from 0 (off) to 255 (full speed). Fan speed will be set to maximum when the +temperature sensor associated with the PWM control exceeds +pwm#_auto_point2_temp. + +Notes +----- + +As stated above, the temperature inputs must be read periodically from +userspace in order for the automatic pwm algorithm to run. diff --git a/Documentation/hwmon/it87 b/Documentation/hwmon/it87 index 3496b7020e7..042c0415140 100644 --- a/Documentation/hwmon/it87 +++ b/Documentation/hwmon/it87 @@ -136,10 +136,10 @@ once-only alarms. The IT87xx only updates its values each 1.5 seconds; reading it more often will do no harm, but will return 'old' values. -To change sensor N to a thermistor, 'echo 2 > tempN_type' where N is 1, 2, +To change sensor N to a thermistor, 'echo 4 > tempN_type' where N is 1, 2, or 3. To change sensor N to a thermal diode, 'echo 3 > tempN_type'. Give 0 for unused sensor. Any other value is invalid. To configure this at -startup, consult lm_sensors's /etc/sensors.conf. (2 = thermistor; +startup, consult lm_sensors's /etc/sensors.conf. (4 = thermistor; 3 = thermal diode) diff --git a/Documentation/hwmon/lm85 b/Documentation/hwmon/lm85 index 6d41db7f17f..40062074129 100644 --- a/Documentation/hwmon/lm85 +++ b/Documentation/hwmon/lm85 @@ -163,16 +163,6 @@ configured individually according to the following options. * pwm#_auto_pwm_min - this specifies the PWM value for temp#_auto_temp_off temperature. (PWM value from 0 to 255) -* pwm#_auto_pwm_freq - select base frequency of PWM output. You can select - in range of 10.0 to 94.0 Hz in .1 Hz units. - (Values 100 to 940). - -The pwm#_auto_pwm_freq can be set to one of the following 8 values. Setting the -frequency to a value not on this list, will result in the next higher frequency -being selected. The actual device frequency may vary slightly from this -specification as designed by the manufacturer. Consult the datasheet for more -details. (PWM Frequency values: 100, 150, 230, 300, 380, 470, 620, 940) - * pwm#_auto_pwm_minctl - this flags selects for temp#_auto_temp_off temperature the bahaviour of fans. Write 1 to let fans spinning at pwm#_auto_pwm_min or write 0 to let them off. diff --git a/Documentation/hwmon/lm87 b/Documentation/hwmon/lm87 index ec27aa1b94c..6b47b67fd96 100644 --- a/Documentation/hwmon/lm87 +++ b/Documentation/hwmon/lm87 @@ -65,11 +65,10 @@ The LM87 has four pins which can serve one of two possible functions, depending on the hardware configuration. Some functions share pins, so not all functions are available at the same -time. Which are depends on the hardware setup. This driver assumes that -the BIOS configured the chip correctly. In that respect, it differs from -the original driver (from lm_sensors for Linux 2.4), which would force the -LM87 to an arbitrary, compile-time chosen mode, regardless of the actual -chipset wiring. +time. Which are depends on the hardware setup. This driver normally +assumes that firmware configured the chip correctly. Where this is not +the case, platform code must set the I2C client's platform_data to point +to a u8 value to be written to the channel register. For reference, here is the list of exclusive functions: - in0+in5 (default) or temp3 diff --git a/Documentation/hwmon/lm90 b/Documentation/hwmon/lm90 index aa4a0ec2008..e0d5206d1de 100644 --- a/Documentation/hwmon/lm90 +++ b/Documentation/hwmon/lm90 @@ -11,7 +11,7 @@ Supported chips: Prefix: 'lm99' Addresses scanned: I2C 0x4c and 0x4d Datasheet: Publicly available at the National Semiconductor website - http://www.national.com/pf/LM/LM89.html + http://www.national.com/mpf/LM/LM89.html * National Semiconductor LM99 Prefix: 'lm99' Addresses scanned: I2C 0x4c and 0x4d @@ -21,18 +21,32 @@ Supported chips: Prefix: 'lm86' Addresses scanned: I2C 0x4c Datasheet: Publicly available at the National Semiconductor website - http://www.national.com/pf/LM/LM86.html + http://www.national.com/mpf/LM/LM86.html * Analog Devices ADM1032 Prefix: 'adm1032' Addresses scanned: I2C 0x4c and 0x4d - Datasheet: Publicly available at the Analog Devices website - http://www.analog.com/en/prod/0,2877,ADM1032,00.html + Datasheet: Publicly available at the ON Semiconductor website + http://www.onsemi.com/PowerSolutions/product.do?id=ADM1032 * Analog Devices ADT7461 Prefix: 'adt7461' Addresses scanned: I2C 0x4c and 0x4d - Datasheet: Publicly available at the Analog Devices website - http://www.analog.com/en/prod/0,2877,ADT7461,00.html - Note: Only if in ADM1032 compatibility mode + Datasheet: Publicly available at the ON Semiconductor website + http://www.onsemi.com/PowerSolutions/product.do?id=ADT7461 + * Maxim MAX6646 + Prefix: 'max6646' + Addresses scanned: I2C 0x4d + Datasheet: Publicly available at the Maxim website + http://www.maxim-ic.com/quick_view2.cfm/qv_pk/3497 + * Maxim MAX6647 + Prefix: 'max6646' + Addresses scanned: I2C 0x4e + Datasheet: Publicly available at the Maxim website + http://www.maxim-ic.com/quick_view2.cfm/qv_pk/3497 + * Maxim MAX6649 + Prefix: 'max6646' + Addresses scanned: I2C 0x4c + Datasheet: Publicly available at the Maxim website + http://www.maxim-ic.com/quick_view2.cfm/qv_pk/3497 * Maxim MAX6657 Prefix: 'max6657' Addresses scanned: I2C 0x4c @@ -70,25 +84,21 @@ Description The LM90 is a digital temperature sensor. It senses its own temperature as well as the temperature of up to one external diode. It is compatible -with many other devices such as the LM86, the LM89, the LM99, the ADM1032, -the MAX6657, MAX6658, MAX6659, MAX6680 and the MAX6681 all of which are -supported by this driver. +with many other devices, many of which are supported by this driver. Note that there is no easy way to differentiate between the MAX6657, MAX6658 and MAX6659 variants. The extra address and features of the MAX6659 are not supported by this driver. The MAX6680 and MAX6681 only differ in their pinout, therefore they obviously can't (and don't need to) -be distinguished. Additionally, the ADT7461 is supported if found in -ADM1032 compatibility mode. +be distinguished. The specificity of this family of chipsets over the ADM1021/LM84 family is that it features critical limits with hysteresis, and an increased resolution of the remote temperature measurement. The different chipsets of the family are not strictly identical, although -very similar. This driver doesn't handle any specific feature for now, -with the exception of SMBus PEC. For reference, here comes a non-exhaustive -list of specific features: +very similar. For reference, here comes a non-exhaustive list of specific +features: LM90: * Filter and alert configuration register at 0xBF. @@ -114,9 +124,11 @@ ADT7461: * Lower resolution for remote temperature MAX6657 and MAX6658: + * Better local resolution * Remote sensor type selection MAX6659: + * Better local resolution * Selectable address * Second critical temperature limit * Remote sensor type selection @@ -127,7 +139,8 @@ MAX6680 and MAX6681: All temperature values are given in degrees Celsius. Resolution is 1.0 degree for the local temperature, 0.125 degree for the remote -temperature. +temperature, except for the MAX6657, MAX6658 and MAX6659 which have a +resolution of 0.125 degree for both temperatures. Each sensor has its own high and low limits, plus a critical limit. Additionally, there is a relative hysteresis value common to both critical diff --git a/Documentation/hwmon/pc87360 b/Documentation/hwmon/pc87360 index 89a8fcfa78d..cbac32b59c8 100644 --- a/Documentation/hwmon/pc87360 +++ b/Documentation/hwmon/pc87360 @@ -5,12 +5,7 @@ Supported chips: * National Semiconductor PC87360, PC87363, PC87364, PC87365 and PC87366 Prefixes: 'pc87360', 'pc87363', 'pc87364', 'pc87365', 'pc87366' Addresses scanned: none, address read from Super I/O config space - Datasheets: - http://www.national.com/pf/PC/PC87360.html - http://www.national.com/pf/PC/PC87363.html - http://www.national.com/pf/PC/PC87364.html - http://www.national.com/pf/PC/PC87365.html - http://www.national.com/pf/PC/PC87366.html + Datasheets: No longer available Authors: Jean Delvare <khali@linux-fr.org> diff --git a/Documentation/hwmon/pc87427 b/Documentation/hwmon/pc87427 index 9a0708f9f49..d1ebbe510f3 100644 --- a/Documentation/hwmon/pc87427 +++ b/Documentation/hwmon/pc87427 @@ -5,7 +5,7 @@ Supported chips: * National Semiconductor PC87427 Prefix: 'pc87427' Addresses scanned: none, address read from Super I/O config space - Datasheet: http://www.winbond.com.tw/E-WINBONDHTM/partner/apc_007.html + Datasheet: No longer available Author: Jean Delvare <khali@linux-fr.org> diff --git a/Documentation/hwmon/w83781d b/Documentation/hwmon/w83781d index 6f800a0283e..c91e0b63ea1 100644 --- a/Documentation/hwmon/w83781d +++ b/Documentation/hwmon/w83781d @@ -353,7 +353,7 @@ in6=255 # PWM -Additional info about PWM on the AS99127F (may apply to other Asus +* Additional info about PWM on the AS99127F (may apply to other Asus chips as well) by Jean Delvare as of 2004-04-09: AS99127F revision 2 seems to have two PWM registers at 0x59 and 0x5A, @@ -396,7 +396,7 @@ Please contact us if you can figure out how it is supposed to work. As long as we don't know more, the w83781d driver doesn't handle PWM on AS99127F chips at all. -Additional info about PWM on the AS99127F rev.1 by Hector Martin: +* Additional info about PWM on the AS99127F rev.1 by Hector Martin: I've been fiddling around with the (in)famous 0x59 register and found out the following values do work as a form of coarse pwm: @@ -418,3 +418,36 @@ change. My mobo is an ASUS A7V266-E. This behavior is similar to what I got with speedfan under Windows, where 0-15% would be off, 15-2x% (can't remember the exact value) would be 70% and higher would be full on. + +* Additional info about PWM on the AS99127F rev.1 from lm-sensors + ticket #2350: + +I conducted some experiment on Asus P3B-F motherboard with AS99127F +(Ver. 1). + +I confirm that 0x59 register control the CPU_Fan Header on this +motherboard, and 0x5a register control PWR_Fan. + +In order to reduce the dependency of specific fan, the measurement is +conducted with a digital scope without fan connected. I found out that +P3B-F actually output variable DC voltage on fan header center pin, +looks like PWM is filtered on this motherboard. + +Here are some of measurements: + +0x80 20 mV +0x81 20 mV +0x82 232 mV +0x83 1.2 V +0x84 2.31 V +0x85 3.44 V +0x86 4.62 V +0x87 5.81 V +0x88 7.01 V +9x89 8.22 V +0x8a 9.42 V +0x8b 10.6 V +0x8c 11.9 V +0x8d 12.4 V +0x8e 12.4 V +0x8f 12.4 V diff --git a/Documentation/hwmon/w83791d b/Documentation/hwmon/w83791d index a67d3b7a709..5663e491655 100644 --- a/Documentation/hwmon/w83791d +++ b/Documentation/hwmon/w83791d @@ -58,29 +58,35 @@ internal state that allows no clean access (Bank with ID register is not currently selected). If you know the address of the chip, use a 'force' parameter; this will put it into a more well-behaved state first. -The driver implements three temperature sensors, five fan rotation speed -sensors, and ten voltage sensors. +The driver implements three temperature sensors, ten voltage sensors, +five fan rotation speed sensors and manual PWM control of each fan. Temperatures are measured in degrees Celsius and measurement resolution is 1 degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when the temperature gets higher than the Overtemperature Shutdown value; it stays on until the temperature falls below the Hysteresis value. +Voltage sensors (also known as IN sensors) report their values in millivolts. +An alarm is triggered if the voltage has crossed a programmable minimum +or maximum limit. + Fan rotation speeds are reported in RPM (rotations per minute). An alarm is triggered if the rotation speed has dropped below a programmable limit. Fan readings can be divided by a programmable divider (1, 2, 4, 8, 16, 32, 64 or 128 for all fans) to give the readings more range or accuracy. -Voltage sensors (also known as IN sensors) report their values in millivolts. -An alarm is triggered if the voltage has crossed a programmable minimum -or maximum limit. +Each fan controlled is controlled by PWM. The PWM duty cycle can be read and +set for each fan separately. Valid values range from 0 (stop) to 255 (full). +PWM 1-3 support Thermal Cruise mode, in which the PWMs are automatically +regulated to keep respectively temp 1-3 at a certain target temperature. +See below for the description of the sysfs-interface. The w83791d has a global bit used to enable beeping from the speaker when an alarm is triggered as well as a bitmask to enable or disable the beep for specific alarms. You need both the global beep enable bit and the corresponding beep bit to be on for a triggered alarm to sound a beep. -The sysfs interface to the gloabal enable is via the sysfs beep_enable file. +The sysfs interface to the global enable is via the sysfs beep_enable file. This file is used for both legacy and new code. The sysfs interface to the beep bitmask has migrated from the original legacy @@ -105,6 +111,27 @@ going forward. The driver reads the hardware chip values at most once every three seconds. User mode code requesting values more often will receive cached values. +/sys files +---------- +The sysfs-interface is documented in the 'sysfs-interface' file. Only +chip-specific options are documented here. + +pwm[1-3]_enable - this file controls mode of fan/temperature control for + fan 1-3. Fan/PWM 4-5 only support manual mode. + * 1 Manual mode + * 2 Thermal Cruise mode + * 3 Fan Speed Cruise mode (no further support) + +temp[1-3]_target - defines the target temperature for Thermal Cruise mode. + Unit: millidegree Celsius + RW + +temp[1-3]_tolerance - temperature tolerance for Thermal Cruise mode. + Specifies an interval around the target temperature + in which the fan speed is not changed. + Unit: millidegree Celsius + RW + Alarms bitmap vs. beep_mask bitmask ------------------------------------ For legacy code using the alarms and beep_mask files: @@ -132,7 +159,3 @@ tart2 : alarms: 0x020000 beep_mask: 0x080000 <== mismatch tart3 : alarms: 0x040000 beep_mask: 0x100000 <== mismatch case_open : alarms: 0x001000 beep_mask: 0x001000 global_enable: alarms: -------- beep_mask: 0x800000 (modified via beep_enable) - -W83791D TODO: ---------------- -Provide a patch for smart-fan control (still need appropriate motherboard/fans) diff --git a/Documentation/i2c/busses/i2c-i801 b/Documentation/i2c/busses/i2c-i801 index c31e0291e16..81c0c59a60e 100644 --- a/Documentation/i2c/busses/i2c-i801 +++ b/Documentation/i2c/busses/i2c-i801 @@ -13,8 +13,9 @@ Supported adapters: * Intel 631xESB/632xESB (ESB2) * Intel 82801H (ICH8) * Intel 82801I (ICH9) - * Intel Tolapai - * Intel ICH10 + * Intel EP80579 (Tolapai) + * Intel 82801JI (ICH10) + * Intel PCH Datasheets: Publicly available at the Intel website Authors: @@ -32,7 +33,7 @@ Description ----------- The ICH (properly known as the 82801AA), ICH0 (82801AB), ICH2 (82801BA), -ICH3 (82801CA/CAM) and later devices are Intel chips that are a part of +ICH3 (82801CA/CAM) and later devices (PCH) are Intel chips that are a part of Intel's '810' chipset for Celeron-based PCs, '810E' chipset for Pentium-based PCs, '815E' chipset, and others. diff --git a/Documentation/i2c/porting-clients b/Documentation/i2c/porting-clients deleted file mode 100644 index 7bf82c08f6c..00000000000 --- a/Documentation/i2c/porting-clients +++ /dev/null @@ -1,160 +0,0 @@ -Revision 7, 2007-04-19 -Jean Delvare <khali@linux-fr.org> -Greg KH <greg@kroah.com> - -This is a guide on how to convert I2C chip drivers from Linux 2.4 to -Linux 2.6. I have been using existing drivers (lm75, lm78) as examples. -Then I converted a driver myself (lm83) and updated this document. -Note that this guide is strongly oriented towards hardware monitoring -drivers. Many points are still valid for other type of drivers, but -others may be irrelevant. - -There are two sets of points below. The first set concerns technical -changes. The second set concerns coding policy. Both are mandatory. - -Although reading this guide will help you porting drivers, I suggest -you keep an eye on an already ported driver while porting your own -driver. This will help you a lot understanding what this guide -exactly means. Choose the chip driver that is the more similar to -yours for best results. - -Technical changes: - -* [Driver type] Any driver that was relying on i2c-isa has to be - converted to a proper isa, platform or pci driver. This is not - covered by this guide. - -* [Includes] Get rid of "version.h" and <linux/i2c-proc.h>. - Includes typically look like that: - #include <linux/module.h> - #include <linux/init.h> - #include <linux/slab.h> - #include <linux/jiffies.h> - #include <linux/i2c.h> - #include <linux/hwmon.h> /* for hardware monitoring drivers */ - #include <linux/hwmon-sysfs.h> - #include <linux/hwmon-vid.h> /* if you need VRM support */ - #include <linux/err.h> /* for class registration */ - Please respect this inclusion order. Some extra headers may be - required for a given driver (e.g. "lm75.h"). - -* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, ISA addresses - are no more handled by the i2c core. Address ranges are no more - supported either, define each individual address separately. - SENSORS_INSMOD_<n> becomes I2C_CLIENT_INSMOD_<n>. - -* [Client data] Get rid of sysctl_id. Try using standard names for - register values (for example, temp_os becomes temp_max). You're - still relatively free here, but you *have* to follow the standard - names for sysfs files (see the Sysctl section below). - -* [Function prototypes] The detect functions loses its flags - parameter. Sysctl (e.g. lm75_temp) and miscellaneous functions - are off the list of prototypes. This usually leaves five - prototypes: - static int lm75_attach_adapter(struct i2c_adapter *adapter); - static int lm75_detect(struct i2c_adapter *adapter, int address, - int kind); - static void lm75_init_client(struct i2c_client *client); - static int lm75_detach_client(struct i2c_client *client); - static struct lm75_data lm75_update_device(struct device *dev); - -* [Sysctl] All sysctl stuff is of course gone (defines, ctl_table - and functions). Instead, you have to define show and set functions for - each sysfs file. Only define set for writable values. Take a look at an - existing 2.6 driver for details (it87 for example). Don't forget - to define the attributes for each file (this is that step that - links callback functions). Use the file names specified in - Documentation/hwmon/sysfs-interface for the individual files. Also - convert the units these files read and write to the specified ones. - If you need to add a new type of file, please discuss it on the - sensors mailing list <lm-sensors@lm-sensors.org> by providing a - patch to the Documentation/hwmon/sysfs-interface file. - -* [Attach] The attach function should make sure that the adapter's - class has I2C_CLASS_HWMON (or whatever class is suitable for your - driver), using the following construct: - if (!(adapter->class & I2C_CLASS_HWMON)) - return 0; - Call i2c_probe() instead of i2c_detect(). - -* [Detect] As mentioned earlier, the flags parameter is gone. - The type_name and client_name strings are replaced by a single - name string, which will be filled with a lowercase, short string. - The labels used for error paths are reduced to the number needed. - It is advised that the labels are given descriptive names such as - exit and exit_free. Don't forget to properly set err before - jumping to error labels. By the way, labels should be left-aligned. - Use kzalloc instead of kmalloc. - Use i2c_set_clientdata to set the client data (as opposed to - a direct access to client->data). - Use strlcpy instead of strcpy or snprintf to copy the client name. - Replace the sysctl directory registration by calls to - device_create_file. Move the driver initialization before any - sysfs file creation. - Register the client with the hwmon class (using hwmon_device_register) - if applicable. - Drop client->id. - Drop any 24RF08 corruption prevention you find, as this is now done - at the i2c-core level, and doing it twice voids it. - Don't add I2C_CLIENT_ALLOW_USE to client->flags, it's the default now. - -* [Init] Limits must not be set by the driver (can be done later in - user-space). Chip should not be reset default (although a module - parameter may be used to force it), and initialization should be - limited to the strictly necessary steps. - -* [Detach] Remove the call to i2c_deregister_entry. Do not log an - error message if i2c_detach_client fails, as i2c-core will now do - it for you. - Unregister from the hwmon class if applicable. - -* [Update] The function prototype changed, it is now - passed a device structure, which you have to convert to a client - using to_i2c_client(dev). The update function should return a - pointer to the client data. - Don't access client->data directly, use i2c_get_clientdata(client) - instead. - Use time_after() instead of direct jiffies comparison. - -* [Interface] Make sure there is a MODULE_LICENSE() line, at the bottom - of the file (after MODULE_AUTHOR() and MODULE_DESCRIPTION(), in this - order). - -* [Driver] The flags field of the i2c_driver structure is gone. - I2C_DF_NOTIFY is now the default behavior. - The i2c_driver structure has a driver member, which is itself a - structure, those name member should be initialized to a driver name - string. i2c_driver itself has no name member anymore. - -* [Driver model] Instead of shutdown or reboot notifiers, provide a - shutdown() method in your driver. - -* [Power management] Use the driver model suspend() and resume() - callbacks instead of the obsolete pm_register() calls. - -Coding policy: - -* [Copyright] Use (C), not (c), for copyright. - -* [Debug/log] Get rid of #ifdef DEBUG/#endif constructs whenever you - can. Calls to printk for debugging purposes are replaced by calls to - dev_dbg where possible, else to pr_debug. Here is an example of how - to call it (taken from lm75_detect): - dev_dbg(&client->dev, "Starting lm75 update\n"); - Replace other printk calls with the dev_info, dev_err or dev_warn - function, as appropriate. - -* [Constants] Constants defines (registers, conversions) should be - aligned. This greatly improves readability. - Alignments are achieved by the means of tabs, not spaces. Remember - that tabs are set to 8 in the Linux kernel code. - -* [Layout] Avoid extra empty lines between comments and what they - comment. Respect the coding style (see Documentation/CodingStyle), - in particular when it comes to placing curly braces. - -* [Comments] Make sure that no comment refers to a file that isn't - part of the Linux source tree (typically doc/chips/<chip name>), - and that remaining comments still match the code. Merging comment - lines when possible is encouraged. diff --git a/Documentation/i2c/writing-clients b/Documentation/i2c/writing-clients index d73ee117a8c..6b9af7d479c 100644 --- a/Documentation/i2c/writing-clients +++ b/Documentation/i2c/writing-clients @@ -10,23 +10,21 @@ General remarks =============== Try to keep the kernel namespace as clean as possible. The best way to -do this is to use a unique prefix for all global symbols. This is +do this is to use a unique prefix for all global symbols. This is especially important for exported symbols, but it is a good idea to do it for non-exported symbols too. We will use the prefix `foo_' in this -tutorial, and `FOO_' for preprocessor variables. +tutorial. The driver structure ==================== Usually, you will implement a single driver structure, and instantiate -all clients from it. Remember, a driver structure contains general access +all clients from it. Remember, a driver structure contains general access routines, and should be zero-initialized except for fields with data you provide. A client structure holds device-specific information like the driver model device node, and its I2C address. -/* iff driver uses driver model ("new style") binding model: */ - static struct i2c_device_id foo_idtable[] = { { "foo", my_id_for_foo }, { "bar", my_id_for_bar }, @@ -40,7 +38,6 @@ static struct i2c_driver foo_driver = { .name = "foo", }, - /* iff driver uses driver model ("new style") binding model: */ .id_table = foo_ids, .probe = foo_probe, .remove = foo_remove, @@ -49,24 +46,19 @@ static struct i2c_driver foo_driver = { .detect = foo_detect, .address_data = &addr_data, - /* else, driver uses "legacy" binding model: */ - .attach_adapter = foo_attach_adapter, - .detach_client = foo_detach_client, - - /* these may be used regardless of the driver binding model */ .shutdown = foo_shutdown, /* optional */ .suspend = foo_suspend, /* optional */ .resume = foo_resume, /* optional */ - .command = foo_command, /* optional */ + .command = foo_command, /* optional, deprecated */ } - + The name field is the driver name, and must not contain spaces. It should match the module name (if the driver can be compiled as a module), although you can use MODULE_ALIAS (passing "foo" in this example) to add another name for the module. If the driver name doesn't match the module name, the module won't be automatically loaded (hotplug/coldplug). -All other fields are for call-back functions which will be explained +All other fields are for call-back functions which will be explained below. @@ -74,34 +66,13 @@ Extra client data ================= Each client structure has a special `data' field that can point to any -structure at all. You should use this to keep device-specific data, -especially in drivers that handle multiple I2C or SMBUS devices. You -do not always need this, but especially for `sensors' drivers, it can -be very useful. +structure at all. You should use this to keep device-specific data. /* store the value */ void i2c_set_clientdata(struct i2c_client *client, void *data); /* retrieve the value */ - void *i2c_get_clientdata(struct i2c_client *client); - -An example structure is below. - - struct foo_data { - struct i2c_client client; - enum chips type; /* To keep the chips type for `sensors' drivers. */ - - /* Because the i2c bus is slow, it is often useful to cache the read - information of a chip for some time (for example, 1 or 2 seconds). - It depends of course on the device whether this is really worthwhile - or even sensible. */ - struct mutex update_lock; /* When we are reading lots of information, - another process should not update the - below information */ - char valid; /* != 0 if the following fields are valid. */ - unsigned long last_updated; /* In jiffies */ - /* Add the read information here too */ - }; + void *i2c_get_clientdata(const struct i2c_client *client); Accessing the client @@ -109,11 +80,9 @@ Accessing the client Let's say we have a valid client structure. At some time, we will need to gather information from the client, or write new information to the -client. How we will export this information to user-space is less -important at this moment (perhaps we do not need to do this at all for -some obscure clients). But we need generic reading and writing routines. +client. -I have found it useful to define foo_read and foo_write function for this. +I have found it useful to define foo_read and foo_write functions for this. For some cases, it will be easier to call the i2c functions directly, but many chips have some kind of register-value idea that can easily be encapsulated. @@ -121,33 +90,33 @@ be encapsulated. The below functions are simple examples, and should not be copied literally. - int foo_read_value(struct i2c_client *client, u8 reg) - { - if (reg < 0x10) /* byte-sized register */ - return i2c_smbus_read_byte_data(client,reg); - else /* word-sized register */ - return i2c_smbus_read_word_data(client,reg); - } - - int foo_write_value(struct i2c_client *client, u8 reg, u16 value) - { - if (reg == 0x10) /* Impossible to write - driver error! */ { - return -1; - else if (reg < 0x10) /* byte-sized register */ - return i2c_smbus_write_byte_data(client,reg,value); - else /* word-sized register */ - return i2c_smbus_write_word_data(client,reg,value); - } +int foo_read_value(struct i2c_client *client, u8 reg) +{ + if (reg < 0x10) /* byte-sized register */ + return i2c_smbus_read_byte_data(client, reg); + else /* word-sized register */ + return i2c_smbus_read_word_data(client, reg); +} + +int foo_write_value(struct i2c_client *client, u8 reg, u16 value) +{ + if (reg == 0x10) /* Impossible to write - driver error! */ + return -EINVAL; + else if (reg < 0x10) /* byte-sized register */ + return i2c_smbus_write_byte_data(client, reg, value); + else /* word-sized register */ + return i2c_smbus_write_word_data(client, reg, value); +} Probing and attaching ===================== The Linux I2C stack was originally written to support access to hardware -monitoring chips on PC motherboards, and thus it embeds some assumptions -that are more appropriate to SMBus (and PCs) than to I2C. One of these -assumptions is that most adapters and devices drivers support the SMBUS_QUICK -protocol to probe device presence. Another is that devices and their drivers +monitoring chips on PC motherboards, and thus used to embed some assumptions +that were more appropriate to SMBus (and PCs) than to I2C. One of these +assumptions was that most adapters and devices drivers support the SMBUS_QUICK +protocol to probe device presence. Another was that devices and their drivers can be sufficiently configured using only such probe primitives. As Linux and its I2C stack became more widely used in embedded systems @@ -164,6 +133,9 @@ since the "legacy" model requires drivers to create "i2c_client" device objects after SMBus style probing, while the Linux driver model expects drivers to be given such device objects in their probe() routines. +The legacy model is deprecated now and will soon be removed, so we no +longer document it here. + Standard Driver Model Binding ("New Style") ------------------------------------------- @@ -193,8 +165,8 @@ matches the device's name. It is passed the entry that was matched so the driver knows which one in the table matched. -Device Creation (Standard driver model) ---------------------------------------- +Device Creation +--------------- If you know for a fact that an I2C device is connected to a given I2C bus, you can instantiate that device by simply filling an i2c_board_info @@ -221,8 +193,8 @@ in the I2C bus driver. You may want to save the returned i2c_client reference for later use. -Device Detection (Standard driver model) ----------------------------------------- +Device Detection +---------------- Sometimes you do not know in advance which I2C devices are connected to a given I2C bus. This is for example the case of hardware monitoring @@ -246,8 +218,8 @@ otherwise misdetections are likely to occur and things can get wrong quickly. -Device Deletion (Standard driver model) ---------------------------------------- +Device Deletion +--------------- Each I2C device which has been created using i2c_new_device() or i2c_new_probed_device() can be unregistered by calling @@ -256,264 +228,37 @@ called automatically before the underlying I2C bus itself is removed, as a device can't survive its parent in the device driver model. -Legacy Driver Binding Model ---------------------------- +Initializing the driver +======================= + +When the kernel is booted, or when your foo driver module is inserted, +you have to do some initializing. Fortunately, just registering the +driver module is usually enough. -Most i2c devices can be present on several i2c addresses; for some this -is determined in hardware (by soldering some chip pins to Vcc or Ground), -for others this can be changed in software (by writing to specific client -registers). Some devices are usually on a specific address, but not always; -and some are even more tricky. So you will probably need to scan several -i2c addresses for your clients, and do some sort of detection to see -whether it is actually a device supported by your driver. +static int __init foo_init(void) +{ + return i2c_add_driver(&foo_driver); +} + +static void __exit foo_cleanup(void) +{ + i2c_del_driver(&foo_driver); +} + +/* Substitute your own name and email address */ +MODULE_AUTHOR("Frodo Looijaard <frodol@dds.nl>" +MODULE_DESCRIPTION("Driver for Barf Inc. Foo I2C devices"); -To give the user a maximum of possibilities, some default module parameters -are defined to help determine what addresses are scanned. Several macros -are defined in i2c.h to help you support them, as well as a generic -detection algorithm. - -You do not have to use this parameter interface; but don't try to use -function i2c_probe() if you don't. - - -Probing classes (Legacy model) ------------------------------- - -All parameters are given as lists of unsigned 16-bit integers. Lists are -terminated by I2C_CLIENT_END. -The following lists are used internally: - - normal_i2c: filled in by the module writer. - A list of I2C addresses which should normally be examined. - probe: insmod parameter. - A list of pairs. The first value is a bus number (-1 for any I2C bus), - the second is the address. These addresses are also probed, as if they - were in the 'normal' list. - ignore: insmod parameter. - A list of pairs. The first value is a bus number (-1 for any I2C bus), - the second is the I2C address. These addresses are never probed. - This parameter overrules the 'normal_i2c' list only. - force: insmod parameter. - A list of pairs. The first value is a bus number (-1 for any I2C bus), - the second is the I2C address. A device is blindly assumed to be on - the given address, no probing is done. - -Additionally, kind-specific force lists may optionally be defined if -the driver supports several chip kinds. They are grouped in a -NULL-terminated list of pointers named forces, those first element if the -generic force list mentioned above. Each additional list correspond to an -insmod parameter of the form force_<kind>. - -Fortunately, as a module writer, you just have to define the `normal_i2c' -parameter. The complete declaration could look like this: - - /* Scan 0x4c to 0x4f */ - static const unsigned short normal_i2c[] = { 0x4c, 0x4d, 0x4e, 0x4f, - I2C_CLIENT_END }; - - /* Magic definition of all other variables and things */ - I2C_CLIENT_INSMOD; - /* Or, if your driver supports, say, 2 kind of devices: */ - I2C_CLIENT_INSMOD_2(foo, bar); - -If you use the multi-kind form, an enum will be defined for you: - enum chips { any_chip, foo, bar, ... } -You can then (and certainly should) use it in the driver code. - -Note that you *have* to call the defined variable `normal_i2c', -without any prefix! - - -Attaching to an adapter (Legacy model) --------------------------------------- - -Whenever a new adapter is inserted, or for all adapters if the driver is -being registered, the callback attach_adapter() is called. Now is the -time to determine what devices are present on the adapter, and to register -a client for each of them. - -The attach_adapter callback is really easy: we just call the generic -detection function. This function will scan the bus for us, using the -information as defined in the lists explained above. If a device is -detected at a specific address, another callback is called. - - int foo_attach_adapter(struct i2c_adapter *adapter) - { - return i2c_probe(adapter,&addr_data,&foo_detect_client); - } - -Remember, structure `addr_data' is defined by the macros explained above, -so you do not have to define it yourself. - -The i2c_probe function will call the foo_detect_client -function only for those i2c addresses that actually have a device on -them (unless a `force' parameter was used). In addition, addresses that -are already in use (by some other registered client) are skipped. - - -The detect client function (Legacy model) ------------------------------------------ - -The detect client function is called by i2c_probe. The `kind' parameter -contains -1 for a probed detection, 0 for a forced detection, or a positive -number for a forced detection with a chip type forced. - -Returning an error different from -ENODEV in a detect function will cause -the detection to stop: other addresses and adapters won't be scanned. -This should only be done on fatal or internal errors, such as a memory -shortage or i2c_attach_client failing. - -For now, you can ignore the `flags' parameter. It is there for future use. - - int foo_detect_client(struct i2c_adapter *adapter, int address, - int kind) - { - int err = 0; - int i; - struct i2c_client *client; - struct foo_data *data; - const char *name = ""; - - /* Let's see whether this adapter can support what we need. - Please substitute the things you need here! */ - if (!i2c_check_functionality(adapter,I2C_FUNC_SMBUS_WORD_DATA | - I2C_FUNC_SMBUS_WRITE_BYTE)) - goto ERROR0; - - /* OK. For now, we presume we have a valid client. We now create the - client structure, even though we cannot fill it completely yet. - But it allows us to access several i2c functions safely */ - - if (!(data = kzalloc(sizeof(struct foo_data), GFP_KERNEL))) { - err = -ENOMEM; - goto ERROR0; - } - - client = &data->client; - i2c_set_clientdata(client, data); - - client->addr = address; - client->adapter = adapter; - client->driver = &foo_driver; - - /* Now, we do the remaining detection. If no `force' parameter is used. */ - - /* First, the generic detection (if any), that is skipped if any force - parameter was used. */ - if (kind < 0) { - /* The below is of course bogus */ - if (foo_read(client, FOO_REG_GENERIC) != FOO_GENERIC_VALUE) - goto ERROR1; - } - - /* Next, specific detection. This is especially important for `sensors' - devices. */ - - /* Determine the chip type. Not needed if a `force_CHIPTYPE' parameter - was used. */ - if (kind <= 0) { - i = foo_read(client, FOO_REG_CHIPTYPE); - if (i == FOO_TYPE_1) - kind = chip1; /* As defined in the enum */ - else if (i == FOO_TYPE_2) - kind = chip2; - else { - printk("foo: Ignoring 'force' parameter for unknown chip at " - "adapter %d, address 0x%02x\n",i2c_adapter_id(adapter),address); - goto ERROR1; - } - } - - /* Now set the type and chip names */ - if (kind == chip1) { - name = "chip1"; - } else if (kind == chip2) { - name = "chip2"; - } - - /* Fill in the remaining client fields. */ - strlcpy(client->name, name, I2C_NAME_SIZE); - data->type = kind; - mutex_init(&data->update_lock); /* Only if you use this field */ - - /* Any other initializations in data must be done here too. */ - - /* This function can write default values to the client registers, if - needed. */ - foo_init_client(client); - - /* Tell the i2c layer a new client has arrived */ - if ((err = i2c_attach_client(client))) - goto ERROR1; - - return 0; - - /* OK, this is not exactly good programming practice, usually. But it is - very code-efficient in this case. */ - - ERROR1: - kfree(data); - ERROR0: - return err; - } - - -Removing the client (Legacy model) -================================== - -The detach_client call back function is called when a client should be -removed. It may actually fail, but only when panicking. This code is -much simpler than the attachment code, fortunately! - - int foo_detach_client(struct i2c_client *client) - { - int err; - - /* Try to detach the client from i2c space */ - if ((err = i2c_detach_client(client))) - return err; - - kfree(i2c_get_clientdata(client)); - return 0; - } - - -Initializing the module or kernel -================================= - -When the kernel is booted, or when your foo driver module is inserted, -you have to do some initializing. Fortunately, just attaching (registering) -the driver module is usually enough. - - static int __init foo_init(void) - { - int res; - - if ((res = i2c_add_driver(&foo_driver))) { - printk("foo: Driver registration failed, module not inserted.\n"); - return res; - } - return 0; - } - - static void __exit foo_cleanup(void) - { - i2c_del_driver(&foo_driver); - } - - /* Substitute your own name and email address */ - MODULE_AUTHOR("Frodo Looijaard <frodol@dds.nl>" - MODULE_DESCRIPTION("Driver for Barf Inc. Foo I2C devices"); - - /* a few non-GPL license types are also allowed */ - MODULE_LICENSE("GPL"); - - module_init(foo_init); - module_exit(foo_cleanup); - -Note that some functions are marked by `__init', and some data structures -by `__initdata'. These functions and structures can be removed after -kernel booting (or module loading) is completed. +/* a few non-GPL license types are also allowed */ +MODULE_LICENSE("GPL"); + +module_init(foo_init); +module_exit(foo_cleanup); + +Note that some functions are marked by `__init'. These functions can +be removed after kernel booting (or module loading) is completed. +Likewise, functions marked by `__exit' are dropped by the compiler when +the code is built into the kernel, as they would never be called. Power Management @@ -548,33 +293,35 @@ Command function A generic ioctl-like function call back is supported. You will seldom need this, and its use is deprecated anyway, so newer design should not -use it. Set it to NULL. +use it. Sending and receiving ===================== If you want to communicate with your device, there are several functions -to do this. You can find all of them in i2c.h. +to do this. You can find all of them in <linux/i2c.h>. -If you can choose between plain i2c communication and SMBus level -communication, please use the last. All adapters understand SMBus level -commands, but only some of them understand plain i2c! +If you can choose between plain I2C communication and SMBus level +communication, please use the latter. All adapters understand SMBus level +commands, but only some of them understand plain I2C! -Plain i2c communication +Plain I2C communication ----------------------- - extern int i2c_master_send(struct i2c_client *,const char* ,int); - extern int i2c_master_recv(struct i2c_client *,char* ,int); + int i2c_master_send(struct i2c_client *client, const char *buf, + int count); + int i2c_master_recv(struct i2c_client *client, char *buf, int count); These routines read and write some bytes from/to a client. The client contains the i2c address, so you do not have to include it. The second -parameter contains the bytes the read/write, the third the length of the -buffer. Returned is the actual number of bytes read/written. - - extern int i2c_transfer(struct i2c_adapter *adap, struct i2c_msg *msg, - int num); +parameter contains the bytes to read/write, the third the number of bytes +to read/write (must be less than the length of the buffer.) Returned is +the actual number of bytes read/written. + + int i2c_transfer(struct i2c_adapter *adap, struct i2c_msg *msg, + int num); This sends a series of messages. Each message can be a read or write, and they can be mixed in any way. The transactions are combined: no @@ -583,49 +330,45 @@ for each message the client address, the number of bytes of the message and the message data itself. You can read the file `i2c-protocol' for more information about the -actual i2c protocol. +actual I2C protocol. SMBus communication ------------------- - extern s32 i2c_smbus_xfer (struct i2c_adapter * adapter, u16 addr, - unsigned short flags, - char read_write, u8 command, int size, - union i2c_smbus_data * data); - - This is the generic SMBus function. All functions below are implemented - in terms of it. Never use this function directly! - - - extern s32 i2c_smbus_read_byte(struct i2c_client * client); - extern s32 i2c_smbus_write_byte(struct i2c_client * client, u8 value); - extern s32 i2c_smbus_read_byte_data(struct i2c_client * client, u8 command); - extern s32 i2c_smbus_write_byte_data(struct i2c_client * client, - u8 command, u8 value); - extern s32 i2c_smbus_read_word_data(struct i2c_client * client, u8 command); - extern s32 i2c_smbus_write_word_data(struct i2c_client * client, - u8 command, u16 value); - extern s32 i2c_smbus_process_call(struct i2c_client *client, - u8 command, u16 value); - extern s32 i2c_smbus_read_block_data(struct i2c_client * client, - u8 command, u8 *values); - extern s32 i2c_smbus_write_block_data(struct i2c_client * client, - u8 command, u8 length, - u8 *values); - extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client, - u8 command, u8 length, u8 *values); - extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client, - u8 command, u8 length, - u8 *values); + s32 i2c_smbus_xfer(struct i2c_adapter *adapter, u16 addr, + unsigned short flags, char read_write, u8 command, + int size, union i2c_smbus_data *data); + +This is the generic SMBus function. All functions below are implemented +in terms of it. Never use this function directly! + + s32 i2c_smbus_read_byte(struct i2c_client *client); + s32 i2c_smbus_write_byte(struct i2c_client *client, u8 value); + s32 i2c_smbus_read_byte_data(struct i2c_client *client, u8 command); + s32 i2c_smbus_write_byte_data(struct i2c_client *client, + u8 command, u8 value); + s32 i2c_smbus_read_word_data(struct i2c_client *client, u8 command); + s32 i2c_smbus_write_word_data(struct i2c_client *client, + u8 command, u16 value); + s32 i2c_smbus_process_call(struct i2c_client *client, + u8 command, u16 value); + s32 i2c_smbus_read_block_data(struct i2c_client *client, + u8 command, u8 *values); + s32 i2c_smbus_write_block_data(struct i2c_client *client, + u8 command, u8 length, const u8 *values); + s32 i2c_smbus_read_i2c_block_data(struct i2c_client *client, + u8 command, u8 length, u8 *values); + s32 i2c_smbus_write_i2c_block_data(struct i2c_client *client, + u8 command, u8 length, + const u8 *values); These ones were removed from i2c-core because they had no users, but could be added back later if needed: - extern s32 i2c_smbus_write_quick(struct i2c_client * client, u8 value); - extern s32 i2c_smbus_block_process_call(struct i2c_client *client, - u8 command, u8 length, - u8 *values) + s32 i2c_smbus_write_quick(struct i2c_client *client, u8 value); + s32 i2c_smbus_block_process_call(struct i2c_client *client, + u8 command, u8 length, u8 *values); All these transactions return a negative errno value on failure. The 'write' transactions return 0 on success; the 'read' transactions return the read @@ -642,7 +385,5 @@ General purpose routines Below all general purpose routines are listed, that were not mentioned before. - /* This call returns a unique low identifier for each registered adapter. - */ - extern int i2c_adapter_id(struct i2c_adapter *adap); - + /* Return the adapter number for a specific adapter */ + int i2c_adapter_id(struct i2c_adapter *adap); diff --git a/Documentation/ia64/xen.txt b/Documentation/ia64/xen.txt new file mode 100644 index 00000000000..c61a99f7c8b --- /dev/null +++ b/Documentation/ia64/xen.txt @@ -0,0 +1,183 @@ + Recipe for getting/building/running Xen/ia64 with pv_ops + -------------------------------------------------------- + +This recipe describes how to get xen-ia64 source and build it, +and run domU with pv_ops. + +============ +Requirements +============ + + - python + - mercurial + it (aka "hg") is an open-source source code + management software. See the below. + http://www.selenic.com/mercurial/wiki/ + - git + - bridge-utils + +================================= +Getting and Building Xen and Dom0 +================================= + + My environment is; + Machine : Tiger4 + Domain0 OS : RHEL5 + DomainU OS : RHEL5 + + 1. Download source + # hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg + # cd xen-unstable.hg + # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg + + 2. # make world + + 3. # make install-tools + + 4. copy kernels and xen + # cp xen/xen.gz /boot/efi/efi/redhat/ + # cp build-linux-2.6.18-xen_ia64/vmlinux.gz \ + /boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen + + 5. make initrd for Dom0/DomU + # make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \ + O=$(/bin/pwd)/build-linux-2.6.18-xen_ia64 + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \ + 2.6.18.8-xen --builtin mptspi --builtin mptbase \ + --builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \ + --builtin ehci-hcd + +================================ +Making a disk image for guest OS +================================ + + 1. make file + # dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0 + # mke2fs -F -j /root/rhel5.img + # mount -o loop /root/rhel5.img /mnt + # cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt + # mkdir /mnt/{root,proc,sys,home,tmp} + + Note: You may miss some device files. If so, please create them + with mknod. Or you can use tar instead of cp. + + 2. modify DomU's fstab + # vi /mnt/etc/fstab + /dev/xvda1 / ext3 defaults 1 1 + none /dev/pts devpts gid=5,mode=620 0 0 + none /dev/shm tmpfs defaults 0 0 + none /proc proc defaults 0 0 + none /sys sysfs defaults 0 0 + + 3. modify inittab + set runlevel to 3 to avoid X trying to start + # vi /mnt/etc/inittab + id:3:initdefault: + Start a getty on the hvc0 console + X0:2345:respawn:/sbin/mingetty hvc0 + tty1-6 mingetty can be commented out + + 4. add hvc0 into /etc/securetty + # vi /mnt/etc/securetty (add hvc0) + + 5. umount + # umount /mnt + +FYI, virt-manager can also make a disk image for guest OS. +It's GUI tools and easy to make it. + +================== +Boot Xen & Domain0 +================== + + 1. replace elilo + elilo of RHEL5 can boot Xen and Dom0. + If you use old elilo (e.g RHEL4), please download from the below + http://elilo.sourceforge.net/cgi-bin/blosxom + and copy into /boot/efi/efi/redhat/ + # cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi + + 2. modify elilo.conf (like the below) + # vi /boot/efi/efi/redhat/elilo.conf + prompt + timeout=20 + default=xen + relocatable + + image=vmlinuz-2.6.18.8-xen + label=xen + vmm=xen.gz + initrd=initrd-2.6.18.8-xen.img + read-only + append=" -- rhgb root=/dev/sda2" + +The append options before "--" are for xen hypervisor, +the options after "--" are for dom0. + +FYI, your machine may need console options like +"com1=19200,8n1 console=vga,com1". For example, +append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \ +console=ttyS0 root=/dev/sda2" + +===================================== +Getting and Building domU with pv_ops +===================================== + + 1. get pv_ops tree + # git clone http://people.valinux.co.jp/~yamahata/xen-ia64/linux-2.6-xen-ia64.git/ + + 2. git branch (if necessary) + # cd linux-2.6-xen-ia64/ + # git checkout -b your_branch origin/xen-ia64-domu-minimal-2008may19 + (Note: The current branch is xen-ia64-domu-minimal-2008may19. + But you would find the new branch. You can see with + "git branch -r" to get the branch lists. + http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ + is also available. The tree is based on + git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 test) + + + 3. copy .config for pv_ops of domU + # cp arch/ia64/configs/xen_domu_wip_defconfig .config + + 4. make kernel with pv_ops + # make oldconfig + # make + + 5. install the kernel and initrd + # cp vmlinux.gz /boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU + # make modules_install + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img \ + 2.6.26-rc3xen-ia64-08941-g1b12161 --builtin mptspi \ + --builtin mptbase --builtin mptscsih --builtin uhci-hcd \ + --builtin ohci-hcd --builtin ehci-hcd + +======================== +Boot DomainU with pv_ops +======================== + + 1. make config of DomU + # vi /etc/xen/rhel5 + kernel = "/boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU" + ramdisk = "/boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img" + vcpus = 1 + memory = 512 + name = "rhel5" + disk = [ 'file:/root/rhel5.img,xvda1,w' ] + root = "/dev/xvda1 ro" + extra= "rhgb console=hvc0" + + 2. After boot xen and dom0, start xend + # /etc/init.d/xend start + ( In the debugging case, # XEND_DEBUG=1 xend trace_start ) + + 3. start domU + # xm create -c rhel5 + +========= +Reference +========= +- Wiki of Xen/IA64 upstream merge + http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge + +Written by Akio Takebe <takebe_akio@jp.fujitsu.com> on 28 May 2008 diff --git a/Documentation/ioctl-number.txt b/Documentation/ioctl-number.txt index 1c6b545635a..b880ce5dbd3 100644 --- a/Documentation/ioctl-number.txt +++ b/Documentation/ioctl-number.txt @@ -92,6 +92,7 @@ Code Seq# Include File Comments 'J' 00-1F drivers/scsi/gdth_ioctl.h 'K' all linux/kd.h 'L' 00-1F linux/loop.h +'L' 20-2F driver/usb/misc/vstusb.h 'L' E0-FF linux/ppdd.h encrypted disk device driver <http://linux01.gwdg.de/~alatham/ppdd.html> 'M' all linux/soundcard.h @@ -110,6 +111,8 @@ Code Seq# Include File Comments 'W' 00-1F linux/wanrouter.h conflict! 'X' all linux/xfs_fs.h 'Y' all linux/cyclades.h +'[' 00-07 linux/usb/usbtmc.h USB Test and Measurement Devices + <mailto:gregkh@suse.de> 'a' all ATM on linux <http://lrcwww.epfl.ch/linux-atm/magic.html> 'b' 00-FF bit3 vme host bridge diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 0705040531a..3f4bc840da8 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -109,7 +109,8 @@ There are two possible methods of using Kdump. 2) Or use the system kernel binary itself as dump-capture kernel and there is no need to build a separate dump-capture kernel. This is possible only with the architecutres which support a relocatable kernel. As - of today, i386, x86_64 and ia64 architectures support relocatable kernel. + of today, i386, x86_64, ppc64 and ia64 architectures support relocatable + kernel. Building a relocatable kernel is advantageous from the point of view that one does not have to build a second kernel for capturing the dump. But @@ -207,8 +208,15 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64) Dump-capture kernel config options (Arch Dependent, ppc64) ---------------------------------------------------------- -* Make and install the kernel and its modules. DO NOT add this kernel - to the boot loader configuration files. +1) Enable "Build a kdump crash kernel" support under "Kernel" options: + + CONFIG_CRASH_DUMP=y + +2) Enable "Build a relocatable kernel" support + + CONFIG_RELOCATABLE=y + + Make and install the kernel and its modules. Dump-capture kernel config options (Arch Dependent, ia64) ---------------------------------------------------------- diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index dd28a0d5698..343e0f0f84b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -101,6 +101,7 @@ parameter is applicable: X86-64 X86-64 architecture is enabled. More X86-64 boot options can be found in Documentation/x86_64/boot-options.txt . + X86 Either 32bit or 64bit x86 (same as X86-32+X86-64) In addition, the following text indicates that the option: @@ -217,20 +218,47 @@ and is between 256 and 4096 characters. It is defined in the file acpi.debug_level= [HW,ACPI] Format: <int> Each bit of the <int> indicates an ACPI debug level, - 1: enable, 0: disable. It is useful for boot time - debugging. After system has booted up, it can be set - via /sys/module/acpi/parameters/debug_level. - CONFIG_ACPI_DEBUG must be enabled for this to produce any output. - Available bits (add the numbers together) to enable different - debug output levels of the ACPI subsystem: - 0x01 error 0x02 warn 0x04 init 0x08 debug object - 0x10 info 0x20 init names 0x40 parse 0x80 load - 0x100 dispatch 0x200 execute 0x400 names 0x800 operation region - 0x1000 bfield 0x2000 tables 0x4000 values 0x8000 objects - 0x10000 resources 0x20000 user requests 0x40000 package. - The number can be in decimal or prefixed with 0x in hex. - Warning: Many of these options can produce a lot of - output and make your system unusable. Be very careful. + which corresponds to the level in an ACPI_DEBUG_PRINT + statement. After system has booted up, this mask + can be set via /sys/module/acpi/parameters/debug_level. + + CONFIG_ACPI_DEBUG must be enabled for this to produce + any output. The number can be in decimal or prefixed + with 0x in hex. Some of these options produce so much + output that the system is unusable. + + The following global components are defined by the + ACPI CA: + 0x01 error + 0x02 warn + 0x04 init + 0x08 debug object + 0x10 info + 0x20 init names + 0x40 parse + 0x80 load + 0x100 dispatch + 0x200 execute + 0x400 names + 0x800 operation region + 0x1000 bfield + 0x2000 tables + 0x4000 values + 0x8000 objects + 0x10000 resources + 0x20000 user requests + 0x40000 package + The number can be in decimal or prefixed with 0x in hex. + Warning: Many of these options can produce a lot of + output and make your system unusable. Be very careful. + + acpi.power_nocheck= [HW,ACPI] + Format: 1/0 enable/disable the check of power state. + On some bogus BIOS the _PSC object/_STA object of + power resource can't return the correct device power + state. In such case it is unneccessary to check its + power state again in power transition. + 1 : disable the power state check acpi_pm_good [X86-32,X86-64] Override the pmtimer bug detection: force the kernel @@ -690,7 +718,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/block/as-iosched.txt and Documentation/block/deadline-iosched.txt for details. - elfcorehdr= [X86-32, X86_64] + elfcorehdr= [IA64,PPC,SH,X86-32,X86_64] Specifies physical address of start of kernel core image elf header. Generally kexec loader will pass this option to capture kernel. @@ -796,6 +824,8 @@ and is between 256 and 4096 characters. It is defined in the file Defaults to the default architecture's huge page size if not specified. + hlt [BUGS=ARM,SH] + i8042.debug [HW] Toggle i8042 debug mode i8042.direct [HW] Put keyboard port into non-translated mode i8042.dumbkbd [HW] Pretend that controller can only read data from @@ -1211,6 +1241,10 @@ and is between 256 and 4096 characters. It is defined in the file mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel memory. + memchunk=nn[KMG] + [KNL,SH] Allow user to override the default size for + per-device physically contiguous DMA buffers. + memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact E820 memory map, as specified by the user. Such memmap=exactmap lines can be constructed based on @@ -1393,6 +1427,8 @@ and is between 256 and 4096 characters. It is defined in the file nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. + nodsp [SH] Disable hardware DSP at boot time. + noefi [X86-32,X86-64] Disable EFI runtime services support. noexec [IA-64] @@ -1409,13 +1445,15 @@ and is between 256 and 4096 characters. It is defined in the file noexec32=off: disable non-executable mappings read implies executable mappings + nofpu [SH] Disable hardware FPU at boot time. + nofxsr [BUGS=X86-32] Disables x86 floating point extended register save and restore. The kernel will only save legacy floating-point registers on task switch. noclflush [BUGS=X86] Don't use the CLFLUSH instruction - nohlt [BUGS=ARM] + nohlt [BUGS=ARM,SH] no-hlt [BUGS=X86-32] Tells the kernel that the hlt instruction doesn't work correctly and not to @@ -1578,7 +1616,7 @@ and is between 256 and 4096 characters. It is defined in the file See also Documentation/paride.txt. pci=option[,option...] [PCI] various PCI subsystem options: - off [X86-32] don't probe for the PCI bus + off [X86] don't probe for the PCI bus bios [X86-32] force use of PCI BIOS, don't access the hardware directly. Use this if your machine has a non-standard PCI host bridge. @@ -1586,9 +1624,9 @@ and is between 256 and 4096 characters. It is defined in the file hardware access methods are allowed. Use this if you experience crashes upon bootup and you suspect they are caused by the BIOS. - conf1 [X86-32] Force use of PCI Configuration + conf1 [X86] Force use of PCI Configuration Mechanism 1. - conf2 [X86-32] Force use of PCI Configuration + conf2 [X86] Force use of PCI Configuration Mechanism 2. noaer [PCIE] If the PCIEAER kernel config parameter is enabled, this kernel boot option can be used to @@ -1608,37 +1646,37 @@ and is between 256 and 4096 characters. It is defined in the file this option if the kernel is unable to allocate IRQs or discover secondary PCI buses on your motherboard. - rom [X86-32] Assign address space to expansion ROMs. + rom [X86] Assign address space to expansion ROMs. Use with caution as certain devices share address decoders between ROMs and other resources. - norom [X86-32,X86_64] Do not assign address space to + norom [X86] Do not assign address space to expansion ROMs that do not already have BIOS assigned address ranges. - irqmask=0xMMMM [X86-32] Set a bit mask of IRQs allowed to be + irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be assigned automatically to PCI devices. You can make the kernel exclude IRQs of your ISA cards this way. - pirqaddr=0xAAAAA [X86-32] Specify the physical address + pirqaddr=0xAAAAA [X86] Specify the physical address of the PIRQ table (normally generated by the BIOS) if it is outside the F0000h-100000h range. - lastbus=N [X86-32] Scan all buses thru bus #N. Can be + lastbus=N [X86] Scan all buses thru bus #N. Can be useful if the kernel is unable to find your secondary buses and you want to tell it explicitly which ones they are. - assign-busses [X86-32] Always assign all PCI bus + assign-busses [X86] Always assign all PCI bus numbers ourselves, overriding whatever the firmware may have done. - usepirqmask [X86-32] Honor the possible IRQ mask stored + usepirqmask [X86] Honor the possible IRQ mask stored in the BIOS $PIR table. This is needed on some systems with broken BIOSes, notably some HP Pavilion N5400 and Omnibook XE3 notebooks. This will have no effect if ACPI IRQ routing is enabled. - noacpi [X86-32] Do not use ACPI for IRQ routing + noacpi [X86] Do not use ACPI for IRQ routing or for PCI scanning. - use_crs [X86-32] Use _CRS for PCI resource + use_crs [X86] Use _CRS for PCI resource allocation. routeirq Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), @@ -1667,6 +1705,12 @@ and is between 256 and 4096 characters. It is defined in the file reserved for the CardBus bridge's memory window. The default value is 64 megabytes. + pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power + Management. + off Disable ASPM. + force Enable ASPM even on devices that claim not to support it. + WARNING: Forcing ASPM on may cause system lockups. + pcmv= [HW,PCMCIA] BadgePAD 4 pd. [PARIDE] @@ -1694,6 +1738,10 @@ and is between 256 and 4096 characters. It is defined in the file Override pmtimer IOPort with a hex value. e.g. pmtmr=0x508 + pnp.debug [PNP] + Enable PNP debug messages. This depends on the + CONFIG_PNP_DEBUG_MESSAGES option. + pnpacpi= [ACPI] { off } @@ -2191,7 +2239,7 @@ and is between 256 and 4096 characters. It is defined in the file thermal.crt= [HW,ACPI] -1: disable all critical trip points in all thermal zones - <degrees C>: lower all critical trip points + <degrees C>: override all critical trip points thermal.nocrt= [HW,ACPI] Set to disable actions on ACPI thermal zone @@ -2253,6 +2301,25 @@ and is between 256 and 4096 characters. It is defined in the file autosuspended. Devices for which the delay is set to a negative value won't be autosuspended at all. + usbcore.usbfs_snoop= + [USB] Set to log all usbfs traffic (default 0 = off). + + usbcore.blinkenlights= + [USB] Set to cycle leds on hubs (default 0 = off). + + usbcore.old_scheme_first= + [USB] Start with the old device initialization + scheme (default 0 = off). + + usbcore.use_both_schemes= + [USB] Try the other device initialization scheme + if the first one fails (default 1 = enabled). + + usbcore.initial_descriptor_timeout= + [USB] Specifies timeout for the initial 64-byte + USB_REQ_GET_DESCRIPTOR request in milliseconds + (default 5000 = 5.0 seconds). + usbhid.mousepoll= [USBHID] The interval which mice are to be polled at. diff --git a/Documentation/laptops/acer-wmi.txt b/Documentation/laptops/acer-wmi.txt index 69b5dd4e5a5..2b3a6b5260b 100644 --- a/Documentation/laptops/acer-wmi.txt +++ b/Documentation/laptops/acer-wmi.txt @@ -1,7 +1,7 @@ Acer Laptop WMI Extras Driver http://code.google.com/p/aceracpi -Version 0.1 -9th February 2008 +Version 0.2 +18th August 2008 Copyright 2007-2008 Carlos Corbacho <carlos@strangeworlds.co.uk> @@ -87,17 +87,7 @@ acer-wmi come with built-in wireless. However, should you feel so inclined to ever wish to remove the card, or swap it out at some point, please get in touch with me, as we may well be able to gain some data on wireless card detection. -To read the status of the wireless radio (0=off, 1=on): -cat /sys/devices/platform/acer-wmi/wireless - -To enable the wireless radio: -echo 1 > /sys/devices/platform/acer-wmi/wireless - -To disable the wireless radio: -echo 0 > /sys/devices/platform/acer-wmi/wireless - -To set the state of the wireless radio when loading acer-wmi, pass: -wireless=X (where X is 0 or 1) +The wireless radio is exposed through rfkill. Bluetooth ********* @@ -117,17 +107,7 @@ For the adventurously minded - if you want to buy an internal bluetooth module off the internet that is compatible with your laptop and fit it, then it will work just fine with acer-wmi. -To read the status of the bluetooth module (0=off, 1=on): -cat /sys/devices/platform/acer-wmi/wireless - -To enable the bluetooth module: -echo 1 > /sys/devices/platform/acer-wmi/bluetooth - -To disable the bluetooth module: -echo 0 > /sys/devices/platform/acer-wmi/bluetooth - -To set the state of the bluetooth module when loading acer-wmi, pass: -bluetooth=X (where X is 0 or 1) +Bluetooth is exposed through rfkill. 3G ** diff --git a/Documentation/markers.txt b/Documentation/markers.txt index d9f50a19fa0..089f6138fcd 100644 --- a/Documentation/markers.txt +++ b/Documentation/markers.txt @@ -50,10 +50,12 @@ Connecting a function (probe) to a marker is done by providing a probe (function to call) for the specific marker through marker_probe_register() and can be activated by calling marker_arm(). Marker deactivation can be done by calling marker_disarm() as many times as marker_arm() has been called. Removing a probe -is done through marker_probe_unregister(); it will disarm the probe and make -sure there is no caller left using the probe when it returns. Probe removal is -preempt-safe because preemption is disabled around the probe call. See the -"Probe example" section below for a sample probe module. +is done through marker_probe_unregister(); it will disarm the probe. +marker_synchronize_unregister() must be called before the end of the module exit +function to make sure there is no caller left using the probe. This, and the +fact that preemption is disabled around the probe call, make sure that probe +removal and module unload are safe. See the "Probe example" section below for a +sample probe module. The marker mechanism supports inserting multiple instances of the same marker. Markers can be put in inline functions, inlined static functions, and diff --git a/Documentation/mtd/nand_ecc.txt b/Documentation/mtd/nand_ecc.txt new file mode 100644 index 00000000000..bdf93b7f0f2 --- /dev/null +++ b/Documentation/mtd/nand_ecc.txt @@ -0,0 +1,714 @@ +Introduction +============ + +Having looked at the linux mtd/nand driver and more specific at nand_ecc.c +I felt there was room for optimisation. I bashed the code for a few hours +performing tricks like table lookup removing superfluous code etc. +After that the speed was increased by 35-40%. +Still I was not too happy as I felt there was additional room for improvement. + +Bad! I was hooked. +I decided to annotate my steps in this file. Perhaps it is useful to someone +or someone learns something from it. + + +The problem +=========== + +NAND flash (at least SLC one) typically has sectors of 256 bytes. +However NAND flash is not extremely reliable so some error detection +(and sometimes correction) is needed. + +This is done by means of a Hamming code. I'll try to explain it in +laymans terms (and apologies to all the pro's in the field in case I do +not use the right terminology, my coding theory class was almost 30 +years ago, and I must admit it was not one of my favourites). + +As I said before the ecc calculation is performed on sectors of 256 +bytes. This is done by calculating several parity bits over the rows and +columns. The parity used is even parity which means that the parity bit = 1 +if the data over which the parity is calculated is 1 and the parity bit = 0 +if the data over which the parity is calculated is 0. So the total +number of bits over the data over which the parity is calculated + the +parity bit is even. (see wikipedia if you can't follow this). +Parity is often calculated by means of an exclusive or operation, +sometimes also referred to as xor. In C the operator for xor is ^ + +Back to ecc. +Let's give a small figure: + +byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 +byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 +byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 +byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 +byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 +.... +byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 +byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 + cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 + cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 + cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 + +This figure represents a sector of 256 bytes. +cp is my abbreviaton for column parity, rp for row parity. + +Let's start to explain column parity. +cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. +so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. +Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. +cp2 is the parity over bit0, bit1, bit4 and bit5 +cp3 is the parity over bit2, bit3, bit6 and bit7. +cp4 is the parity over bit0, bit1, bit2 and bit3. +cp5 is the parity over bit4, bit5, bit6 and bit7. +Note that each of cp0 .. cp5 is exactly one bit. + +Row parity actually works almost the same. +rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) +rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) +rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... +(so handle two bytes, then skip 2 bytes). +rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) +for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. +so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) +and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. +The story now becomes quite boring. I guess you get the idea. +rp6 covers 8 bytes then skips 8 etc +rp7 skips 8 bytes then covers 8 etc +rp8 covers 16 bytes then skips 16 etc +rp9 skips 16 bytes then covers 16 etc +rp10 covers 32 bytes then skips 32 etc +rp11 skips 32 bytes then covers 32 etc +rp12 covers 64 bytes then skips 64 etc +rp13 skips 64 bytes then covers 64 etc +rp14 covers 128 bytes then skips 128 +rp15 skips 128 bytes then covers 128 + +In the end the parity bits are grouped together in three bytes as +follows: +ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 +ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 +ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 +ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 + +I detected after writing this that ST application note AN1823 +(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much +nicer picture.(but they use line parity as term where I use row parity) +Oh well, I'm graphically challenged, so suffer with me for a moment :-) +And I could not reuse the ST picture anyway for copyright reasons. + + +Attempt 0 +========= + +Implementing the parity calculation is pretty simple. +In C pseudocode: +for (i = 0; i < 256; i++) +{ + if (i & 0x01) + rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + else + rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + if (i & 0x02) + rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; + else + rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; + if (i & 0x04) + rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; + else + rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; + if (i & 0x08) + rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; + else + rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; + if (i & 0x10) + rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; + else + rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; + if (i & 0x20) + rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; + else + rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; + if (i & 0x40) + rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; + else + rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; + if (i & 0x80) + rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; + else + rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; + cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; + cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; + cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; + cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 + cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 + cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 +} + + +Analysis 0 +========== + +C does have bitwise operators but not really operators to do the above +efficiently (and most hardware has no such instructions either). +Therefore without implementing this it was clear that the code above was +not going to bring me a Nobel prize :-) + +Fortunately the exclusive or operation is commutative, so we can combine +the values in any order. So instead of calculating all the bits +individually, let us try to rearrange things. +For the column parity this is easy. We can just xor the bytes and in the +end filter out the relevant bits. This is pretty nice as it will bring +all cp calculation out of the if loop. + +Similarly we can first xor the bytes for the various rows. +This leads to: + + +Attempt 1 +========= + +const char parity[256] = { + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 +}; + +void ecc1(const unsigned char *buf, unsigned char *code) +{ + int i; + const unsigned char *bp = buf; + unsigned char cur; + unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned char par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 256; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; + if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; + if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; + } + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; +} + +Still pretty straightforward. The last three invert statements are there to +give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash +all data is 0xff, so the checksum then matches. + +I also introduced the parity lookup. I expected this to be the fastest +way to calculate the parity, but I will investigate alternatives later +on. + + +Analysis 1 +========== + +The code works, but is not terribly efficient. On my system it took +almost 4 times as much time as the linux driver code. But hey, if it was +*that* easy this would have been done long before. +No pain. no gain. + +Fortunately there is plenty of room for improvement. + +In step 1 we moved from bit-wise calculation to byte-wise calculation. +However in C we can also use the unsigned long data type and virtually +every modern microprocessor supports 32 bit operations, so why not try +to write our code in such a way that we process data in 32 bit chunks. + +Of course this means some modification as the row parity is byte by +byte. A quick analysis: +for the column parity we use the par variable. When extending to 32 bits +we can in the end easily calculate p0 and p1 from it. +(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 +respectively) +also rp2 and rp3 can be easily retrieved from par as rp3 covers the +first two bytes and rp2 the last two bytes. + +Note that of course now the loop is executed only 64 times (256/4). +And note that care must taken wrt byte ordering. The way bytes are +ordered in a long is machine dependent, and might affect us. +Anyway, if there is an issue: this code is developed on x86 (to be +precise: a DELL PC with a D920 Intel CPU) + +And of course the performance might depend on alignment, but I expect +that the I/O buffers in the nand driver are aligned properly (and +otherwise that should be fixed to get maximum performance). + +Let's give it a try... + + +Attempt 2 +========= + +extern const char parity[256]; + +void ecc2(const unsigned char *buf, unsigned char *code) +{ + int i; + const unsigned long *bp = (unsigned long *)buf; + unsigned long cur; + unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned long par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 64; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; + } + /* + we need to adapt the code generation for the fact that rp vars are now + long; also the column parity calculation needs to be changed. + we'll bring rp4 to 15 back to single byte entities by shifting and + xoring + */ + rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; + rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; + rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; + rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; + rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; + rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; + rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; + rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; + rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; + rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; + rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; + rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; + rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; + rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; + par ^= (par >> 16); + rp1 = (par >> 8); rp1 &= 0xff; + rp0 = (par & 0xff); + par ^= (par >> 8); par &= 0xff; + + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; +} + +The parity array is not shown any more. Note also that for these +examples I kinda deviated from my regular programming style by allowing +multiple statements on a line, not using { } in then and else blocks +with only a single statement and by using operators like ^= + + +Analysis 2 +========== + +The code (of course) works, and hurray: we are a little bit faster than +the linux driver code (about 15%). But wait, don't cheer too quickly. +THere is more to be gained. +If we look at e.g. rp14 and rp15 we see that we either xor our data with +rp14 or with rp15. However we also have par which goes over all data. +This means there is no need to calculate rp14 as it can be calculated from +rp15 through rp14 = par ^ rp15; +(or if desired we can avoid calculating rp15 and calculate it from +rp14). That is why some places refer to inverse parity. +Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. +Effectively this means we can eliminate the else clause from the if +statements. Also we can optimise the calculation in the end a little bit +by going from long to byte first. Actually we can even avoid the table +lookups + +Attempt 3 +========= + +Odd replaced: + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; +with + if (i & 0x01) rp5 ^= cur; + if (i & 0x02) rp7 ^= cur; + if (i & 0x04) rp9 ^= cur; + if (i & 0x08) rp11 ^= cur; + if (i & 0x10) rp13 ^= cur; + if (i & 0x20) rp15 ^= cur; + + and outside the loop added: + rp4 = par ^ rp5; + rp6 = par ^ rp7; + rp8 = par ^ rp9; + rp10 = par ^ rp11; + rp12 = par ^ rp13; + rp14 = par ^ rp15; + +And after that the code takes about 30% more time, although the number of +statements is reduced. This is also reflected in the assembly code. + + +Analysis 3 +========== + +Very weird. Guess it has to do with caching or instruction parallellism +or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting +observation was that this one is only 30% slower (according to time) +executing the code as my 3Ghz D920 processor. + +Well, it was expected not to be easy so maybe instead move to a +different track: let's move back to the code from attempt2 and do some +loop unrolling. This will eliminate a few if statements. I'll try +different amounts of unrolling to see what works best. + + +Attempt 4 +========= + +Unrolled the loop 1, 2, 3 and 4 times. +For 4 the code starts with: + + for (i = 0; i < 4; i++) + { + cur = *bp++; + par ^= cur; + rp4 ^= cur; + rp6 ^= cur; + rp8 ^= cur; + rp10 ^= cur; + if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; + cur = *bp++; + par ^= cur; + rp5 ^= cur; + rp6 ^= cur; + ... + + +Analysis 4 +========== + +Unrolling once gains about 15% +Unrolling twice keeps the gain at about 15% +Unrolling three times gives a gain of 30% compared to attempt 2. +Unrolling four times gives a marginal improvement compared to unrolling +three times. + +I decided to proceed with a four time unrolled loop anyway. It was my gut +feeling that in the next steps I would obtain additional gain from it. + +The next step was triggered by the fact that par contains the xor of all +bytes and rp4 and rp5 each contain the xor of half of the bytes. +So in effect par = rp4 ^ rp5. But as xor is commutative we can also say +that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can +eliminate rp5 (or rp4, but I already foresaw another optimisation). +The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. + + +Attempt 5 +========= + +Effectively so all odd digit rp assignments in the loop were removed. +This included the else clause of the if statements. +Of course after the loop we need to correct things by adding code like: + rp5 = par ^ rp4; +Also the initial assignments (rp5 = 0; etc) could be removed. +Along the line I also removed the initialisation of rp0/1/2/3. + + +Analysis 5 +========== + +Measurements showed this was a good move. The run-time roughly halved +compared with attempt 4 with 4 times unrolled, and we only require 1/3rd +of the processor time compared to the current code in the linux kernel. + +However, still I thought there was more. I didn't like all the if +statements. Why not keep a running parity and only keep the last if +statement. Time for yet another version! + + +Attempt 6 +========= + +THe code within the for loop was changed to: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= cur; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + + par ^= tmppar; + if ((i & 0x1) == 0) rp12 ^= tmppar; + if ((i & 0x2) == 0) rp14 ^= tmppar; + } + +As you can see tmppar is used to accumulate the parity within a for +iteration. In the last 3 statements is is added to par and, if needed, +to rp12 and rp14. + +While making the changes I also found that I could exploit that tmppar +contains the running parity for this iteration. So instead of having: +rp4 ^= cur; rp6 = cur; +I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next +statement. A similar change was done for rp8 and rp10 + + +Analysis 6 +========== + +Measuring this code again showed big gain. When executing the original +linux code 1 million times, this took about 1 second on my system. +(using time to measure the performance). After this iteration I was back +to 0.075 sec. Actually I had to decide to start measuring over 10 +million interations in order not to loose too much accuracy. This one +definitely seemed to be the jackpot! + +There is a little bit more room for improvement though. There are three +places with statements: +rp4 ^= cur; rp6 ^= cur; +It seems more efficient to also maintain a variable rp4_6 in the while +loop; This eliminates 3 statements per loop. Of course after the loop we +need to correct by adding: + rp4 ^= rp4_6; + rp6 ^= rp4_6 +Furthermore there are 4 sequential assingments to rp8. This can be +encoded slightly more efficient by saving tmppar before those 4 lines +and later do rp8 = rp8 ^ tmppar ^ notrp8; +(where notrp8 is the value of rp8 before those 4 lines). +Again a use of the commutative property of xor. +Time for a new test! + + +Attempt 7 +========= + +The new code now looks like: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + notrp8 = tmppar; + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + rp8 = rp8 ^ tmppar ^ notrp8; + + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + + par ^= tmppar; + if ((i & 0x1) == 0) rp12 ^= tmppar; + if ((i & 0x2) == 0) rp14 ^= tmppar; + } + rp4 ^= rp4_6; + rp6 ^= rp4_6; + + +Not a big change, but every penny counts :-) + + +Analysis 7 +========== + +Acutally this made things worse. Not very much, but I don't want to move +into the wrong direction. Maybe something to investigate later. Could +have to do with caching again. + +Guess that is what there is to win within the loop. Maybe unrolling one +more time will help. I'll keep the optimisations from 7 for now. + + +Attempt 8 +========= + +Unrolled the loop one more time. + + +Analysis 8 +========== + +This makes things worse. Let's stick with attempt 6 and continue from there. +Although it seems that the code within the loop cannot be optimised +further there is still room to optimize the generation of the ecc codes. +We can simply calcualate the total parity. If this is 0 then rp4 = rp5 +etc. If the parity is 1, then rp4 = !rp5; +But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits +in the result byte and then do something like + code[0] |= (code[0] << 1); +Lets test this. + + +Attempt 9 +========= + +Changed the code but again this slightly degrades performance. Tried all +kind of other things, like having dedicated parity arrays to avoid the +shift after parity[rp7] << 7; No gain. +Change the lookup using the parity array by using shift operators (e.g. +replace parity[rp7] << 7 with: +rp7 ^= (rp7 << 4); +rp7 ^= (rp7 << 2); +rp7 ^= (rp7 << 1); +rp7 &= 0x80; +No gain. + +The only marginal change was inverting the parity bits, so we can remove +the last three invert statements. + +Ah well, pity this does not deliver more. Then again 10 million +iterations using the linux driver code takes between 13 and 13.5 +seconds, whereas my code now takes about 0.73 seconds for those 10 +million iterations. So basically I've improved the performance by a +factor 18 on my system. Not that bad. Of course on different hardware +you will get different results. No warranties! + +But of course there is no such thing as a free lunch. The codesize almost +tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. + + +Correcting errors +================= + +For correcting errors I again used the ST application note as a starter, +but I also peeked at the existing code. +The algorithm itself is pretty straightforward. Just xor the given and +the calculated ecc. If all bytes are 0 there is no problem. If 11 bits +are 1 we have one correctable bit error. If there is 1 bit 1, we have an +error in the given ecc code. +It proved to be fastest to do some table lookups. Performance gain +introduced by this is about a factor 2 on my system when a repair had to +be done, and 1% or so if no repair had to be done. +Code size increased from 330 bytes to 686 bytes for this function. +(gcc 4.2, -O3) + + +Conclusion +========== + +The gain when calculating the ecc is tremendous. Om my development hardware +a speedup of a factor of 18 for ecc calculation was achieved. On a test on an +embedded system with a MIPS core a factor 7 was obtained. +On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor +5 (big endian mode, gcc 4.1.2, -O3) +For correction not much gain could be obtained (as bitflips are rare). Then +again there are also much less cycles spent there. + +It seems there is not much more gain possible in this, at least when +programmed in C. Of course it might be possible to squeeze something more +out of it with an assembler program, but due to pipeline behaviour etc +this is very tricky (at least for intel hw). + +Author: Frans Meulenbroeks +Copyright (C) 2008 Koninklijke Philips Electronics NV. diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt index de4063cb4fd..02ea9a971b8 100644 --- a/Documentation/powerpc/booting-without-of.txt +++ b/Documentation/powerpc/booting-without-of.txt @@ -1917,6 +1917,8 @@ platforms are moved over to use the flattened-device-tree model. inverse clock polarity (CPOL) mode - spi-cpha - (optional) Empty property indicating device requires shifted clock phase (CPHA) mode + - spi-cs-high - (optional) Empty property indicating device requires + chip select active high SPI example for an MPC5200 SPI bus: spi@f00 { diff --git a/Documentation/powerpc/dts-bindings/fsl/board.txt b/Documentation/powerpc/dts-bindings/fsl/board.txt index 74ae6f1cd2d..81a917ef96e 100644 --- a/Documentation/powerpc/dts-bindings/fsl/board.txt +++ b/Documentation/powerpc/dts-bindings/fsl/board.txt @@ -2,13 +2,13 @@ Required properties: - - device_type : Should be "board-control" + - compatible : Should be "fsl,<board>-bcsr" - reg : Offset and length of the register set for the device Example: bcsr@f8000000 { - device_type = "board-control"; + compatible = "fsl,mpc8360mds-bcsr"; reg = <f8000000 8000>; }; diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index e1ff0d920a5..bde799e0659 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -369,4 +369,5 @@ can be ORed together: 2 - A module was force loaded by insmod -f. Set by modutils >= 2.4.9 and module-init-tools. 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 64 - A module from drivers/staging was loaded. diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 5ce0952aa06..10a0263ebb3 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt @@ -95,7 +95,9 @@ On all - write a character to /proc/sysrq-trigger. e.g.: 'p' - Will dump the current registers and flags to your console. -'q' - Will dump a list of all running timers. +'q' - Will dump per CPU lists of all armed hrtimers (but NOT regular + timer_list timers) and detailed information about all + clockevent devices. 'r' - Turns off keyboard raw mode and sets it to XLATE. diff --git a/Documentation/tracepoints.txt b/Documentation/tracepoints.txt new file mode 100644 index 00000000000..5d354e16749 --- /dev/null +++ b/Documentation/tracepoints.txt @@ -0,0 +1,101 @@ + Using the Linux Kernel Tracepoints + + Mathieu Desnoyers + + +This document introduces Linux Kernel Tracepoints and their use. It provides +examples of how to insert tracepoints in the kernel and connect probe functions +to them and provides some examples of probe functions. + + +* Purpose of tracepoints + +A tracepoint placed in code provides a hook to call a function (probe) that you +can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or +"off" (no probe is attached). When a tracepoint is "off" it has no effect, +except for adding a tiny time penalty (checking a condition for a branch) and +space penalty (adding a few bytes for the function call at the end of the +instrumented function and adds a data structure in a separate section). When a +tracepoint is "on", the function you provide is called each time the tracepoint +is executed, in the execution context of the caller. When the function provided +ends its execution, it returns to the caller (continuing from the tracepoint +site). + +You can put tracepoints at important locations in the code. They are +lightweight hooks that can pass an arbitrary number of parameters, +which prototypes are described in a tracepoint declaration placed in a header +file. + +They can be used for tracing and performance accounting. + + +* Usage + +Two elements are required for tracepoints : + +- A tracepoint definition, placed in a header file. +- The tracepoint statement, in C code. + +In order to use tracepoints, you should include linux/tracepoint.h. + +In include/trace/subsys.h : + +#include <linux/tracepoint.h> + +DEFINE_TRACE(subsys_eventname, + TPPTOTO(int firstarg, struct task_struct *p), + TPARGS(firstarg, p)); + +In subsys/file.c (where the tracing statement must be added) : + +#include <trace/subsys.h> + +void somefct(void) +{ + ... + trace_subsys_eventname(arg, task); + ... +} + +Where : +- subsys_eventname is an identifier unique to your event + - subsys is the name of your subsystem. + - eventname is the name of the event to trace. +- TPPTOTO(int firstarg, struct task_struct *p) is the prototype of the function + called by this tracepoint. +- TPARGS(firstarg, p) are the parameters names, same as found in the prototype. + +Connecting a function (probe) to a tracepoint is done by providing a probe +(function to call) for the specific tracepoint through +register_trace_subsys_eventname(). Removing a probe is done through +unregister_trace_subsys_eventname(); it will remove the probe sure there is no +caller left using the probe when it returns. Probe removal is preempt-safe +because preemption is disabled around the probe call. See the "Probe example" +section below for a sample probe module. + +The tracepoint mechanism supports inserting multiple instances of the same +tracepoint, but a single definition must be made of a given tracepoint name over +all the kernel to make sure no type conflict will occur. Name mangling of the +tracepoints is done using the prototypes to make sure typing is correct. +Verification of probe type correctness is done at the registration site by the +compiler. Tracepoints can be put in inline functions, inlined static functions, +and unrolled loops as well as regular functions. + +The naming scheme "subsys_event" is suggested here as a convention intended +to limit collisions. Tracepoint names are global to the kernel: they are +considered as being the same whether they are in the core kernel image or in +modules. + + +* Probe / tracepoint example + +See the example provided in samples/tracepoints/src + +Compile them with your kernel. + +Run, as root : +modprobe tracepoint-example (insmod order is not important) +modprobe tracepoint-probe-example +cat /proc/tracepoint-example (returns an expected error) +rmmod tracepoint-example tracepoint-probe-example +dmesg diff --git a/Documentation/tracers/mmiotrace.txt b/Documentation/tracers/mmiotrace.txt index a4afb560a45..5bbbe209622 100644 --- a/Documentation/tracers/mmiotrace.txt +++ b/Documentation/tracers/mmiotrace.txt @@ -36,7 +36,7 @@ $ mount -t debugfs debugfs /debug $ echo mmiotrace > /debug/tracing/current_tracer $ cat /debug/tracing/trace_pipe > mydump.txt & Start X or whatever. -$ echo "X is up" > /debug/tracing/marker +$ echo "X is up" > /debug/tracing/trace_marker $ echo none > /debug/tracing/current_tracer Check for lost events. @@ -59,9 +59,8 @@ The 'cat' process should stay running (sleeping) in the background. Load the driver you want to trace and use it. Mmiotrace will only catch MMIO accesses to areas that are ioremapped while mmiotrace is active. -[Unimplemented feature:] During tracing you can place comments (markers) into the trace by -$ echo "X is up" > /debug/tracing/marker +$ echo "X is up" > /debug/tracing/trace_marker This makes it easier to see which part of the (huge) trace corresponds to which action. It is recommended to place descriptive markers about what you do. diff --git a/Documentation/usb/WUSB-Design-overview.txt b/Documentation/usb/WUSB-Design-overview.txt new file mode 100644 index 00000000000..4c3d62c7843 --- /dev/null +++ b/Documentation/usb/WUSB-Design-overview.txt @@ -0,0 +1,448 @@ + +Linux UWB + Wireless USB + WiNET + + (C) 2005-2006 Intel Corporation + Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com> + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License version + 2 as published by the Free Software Foundation. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + 02110-1301, USA. + + +Please visit http://bughost.org/thewiki/Design-overview.txt-1.8 for +updated content. + + * Design-overview.txt-1.8 + +This code implements a Ultra Wide Band stack for Linux, as well as +drivers for the the USB based UWB radio controllers defined in the +Wireless USB 1.0 specification (including Wireless USB host controller +and an Intel WiNET controller). + + 1. Introduction + 1. HWA: Host Wire adapters, your Wireless USB dongle + + 2. DWA: Device Wired Adaptor, a Wireless USB hub for wired + devices + 3. WHCI: Wireless Host Controller Interface, the PCI WUSB host + adapter + 2. The UWB stack + 1. Devices and hosts: the basic structure + + 2. Host Controller life cycle + + 3. On the air: beacons and enumerating the radio neighborhood + + 4. Device lists + 5. Bandwidth allocation + + 3. Wireless USB Host Controller drivers + + 4. Glossary + + + Introduction + +UWB is a wide-band communication protocol that is to serve also as the +low-level protocol for others (much like TCP sits on IP). Currently +these others are Wireless USB and TCP/IP, but seems Bluetooth and +Firewire/1394 are coming along. + +UWB uses a band from roughly 3 to 10 GHz, transmitting at a max of +~-41dB (or 0.074 uW/MHz--geography specific data is still being +negotiated w/ regulators, so watch for changes). That band is divided in +a bunch of ~1.5 GHz wide channels (or band groups) composed of three +subbands/subchannels (528 MHz each). Each channel is independent of each +other, so you could consider them different "busses". Initially this +driver considers them all a single one. + +Radio time is divided in 65536 us long /superframes/, each one divided +in 256 256us long /MASs/ (Media Allocation Slots), which are the basic +time/media allocation units for transferring data. At the beginning of +each superframe there is a Beacon Period (BP), where every device +transmit its beacon on a single MAS. The length of the BP depends on how +many devices are present and the length of their beacons. + +Devices have a MAC (fixed, 48 bit address) and a device (changeable, 16 +bit address) and send periodic beacons to advertise themselves and pass +info on what they are and do. They advertise their capabilities and a +bunch of other stuff. + +The different logical parts of this driver are: + + * + + *UWB*: the Ultra-Wide-Band stack -- manages the radio and + associated spectrum to allow for devices sharing it. Allows to + control bandwidth assingment, beaconing, scanning, etc + + * + + *WUSB*: the layer that sits on top of UWB to provide Wireless USB. + The Wireless USB spec defines means to control a UWB radio and to + do the actual WUSB. + + + HWA: Host Wire adapters, your Wireless USB dongle + +WUSB also defines a device called a Host Wire Adaptor (HWA), which in +mere terms is a USB dongle that enables your PC to have UWB and Wireless +USB. The Wireless USB Host Controller in a HWA looks to the host like a +[Wireless] USB controller connected via USB (!) + +The HWA itself is broken in two or three main interfaces: + + * + + *RC*: Radio control -- this implements an interface to the + Ultra-Wide-Band radio controller. The driver for this implements a + USB-based UWB Radio Controller to the UWB stack. + + * + + *HC*: the wireless USB host controller. It looks like a USB host + whose root port is the radio and the WUSB devices connect to it. + To the system it looks like a separate USB host. The driver (will) + implement a USB host controller (similar to UHCI, OHCI or EHCI) + for which the root hub is the radio...To reiterate: it is a USB + controller that is connected via USB instead of PCI. + + * + + *WINET*: some HW provide a WiNET interface (IP over UWB). This + package provides a driver for it (it looks like a network + interface, winetX). The driver detects when there is a link up for + their type and kick into gear. + + + DWA: Device Wired Adaptor, a Wireless USB hub for wired devices + +These are the complement to HWAs. They are a USB host for connecting +wired devices, but it is connected to your PC connected via Wireless +USB. To the system it looks like yet another USB host. To the untrained +eye, it looks like a hub that connects upstream wirelessly. + +We still offer no support for this; however, it should share a lot of +code with the HWA-RC driver; there is a bunch of factorization work that +has been done to support that in upcoming releases. + + + WHCI: Wireless Host Controller Interface, the PCI WUSB host adapter + +This is your usual PCI device that implements WHCI. Similar in concept +to EHCI, it allows your wireless USB devices (including DWAs) to connect +to your host via a PCI interface. As in the case of the HWA, it has a +Radio Control interface and the WUSB Host Controller interface per se. + +There is still no driver support for this, but will be in upcoming +releases. + + + The UWB stack + +The main mission of the UWB stack is to keep a tally of which devices +are in radio proximity to allow drivers to connect to them. As well, it +provides an API for controlling the local radio controllers (RCs from +now on), such as to start/stop beaconing, scan, allocate bandwidth, etc. + + + Devices and hosts: the basic structure + +The main building block here is the UWB device (struct uwb_dev). For +each device that pops up in radio presence (ie: the UWB host receives a +beacon from it) you get a struct uwb_dev that will show up in +/sys/class/uwb and in /sys/bus/uwb/devices. + +For each RC that is detected, a new struct uwb_rc is created. In turn, a +RC is also a device, so they also show in /sys/class/uwb and +/sys/bus/uwb/devices, but at the same time, only radio controllers show +up in /sys/class/uwb_rc. + + * + + [*] The reason for RCs being also devices is that not only we can + see them while enumerating the system device tree, but also on the + radio (their beacons and stuff), so the handling has to be + likewise to that of a device. + +Each RC driver is implemented by a separate driver that plugs into the +interface that the UWB stack provides through a struct uwb_rc_ops. The +spec creators have been nice enough to make the message format the same +for HWA and WHCI RCs, so the driver is really a very thin transport that +moves the requests from the UWB API to the device [/uwb_rc_ops->cmd()/] +and sends the replies and notifications back to the API +[/uwb_rc_neh_grok()/]. Notifications are handled to the UWB daemon, that +is chartered, among other things, to keep the tab of how the UWB radio +neighborhood looks, creating and destroying devices as they show up or +dissapear. + +Command execution is very simple: a command block is sent and a event +block or reply is expected back. For sending/receiving command/events, a +handle called /neh/ (Notification/Event Handle) is opened with +/uwb_rc_neh_open()/. + +The HWA-RC (USB dongle) driver (drivers/uwb/hwa-rc.c) does this job for +the USB connected HWA. Eventually, drivers/whci-rc.c will do the same +for the PCI connected WHCI controller. + + + Host Controller life cycle + +So let's say we connect a dongle to the system: it is detected and +firmware uploaded if needed [for Intel's i1480 +/drivers/uwb/ptc/usb.c:ptc_usb_probe()/] and then it is reenumerated. +Now we have a real HWA device connected and +/drivers/uwb/hwa-rc.c:hwarc_probe()/ picks it up, that will set up the +Wire-Adaptor environment and then suck it into the UWB stack's vision of +the world [/drivers/uwb/lc-rc.c:uwb_rc_add()/]. + + * + + [*] The stack should put a new RC to scan for devices + [/uwb_rc_scan()/] so it finds what's available around and tries to + connect to them, but this is policy stuff and should be driven + from user space. As of now, the operator is expected to do it + manually; see the release notes for documentation on the procedure. + +When a dongle is disconnected, /drivers/uwb/hwa-rc.c:hwarc_disconnect()/ +takes time of tearing everything down safely (or not...). + + + On the air: beacons and enumerating the radio neighborhood + +So assuming we have devices and we have agreed for a channel to connect +on (let's say 9), we put the new RC to beacon: + + * + + $ echo 9 0 > /sys/class/uwb_rc/uwb0/beacon + +Now it is visible. If there were other devices in the same radio channel +and beacon group (that's what the zero is for), the dongle's radio +control interface will send beacon notifications on its +notification/event endpoint (NEEP). The beacon notifications are part of +the event stream that is funneled into the API with +/drivers/uwb/neh.c:uwb_rc_neh_grok()/ and delivered to the UWBD, the UWB +daemon through a notification list. + +UWBD wakes up and scans the event list; finds a beacon and adds it to +the BEACON CACHE (/uwb_beca/). If he receives a number of beacons from +the same device, he considers it to be 'onair' and creates a new device +[/drivers/uwb/lc-dev.c:uwbd_dev_onair()/]. Similarly, when no beacons +are received in some time, the device is considered gone and wiped out +[uwbd calls periodically /uwb/beacon.c:uwb_beca_purge()/ that will purge +the beacon cache of dead devices]. + + + Device lists + +All UWB devices are kept in the list of the struct bus_type uwb_bus. + + + Bandwidth allocation + +The UWB stack maintains a local copy of DRP availability through +processing of incoming *DRP Availability Change* notifications. This +local copy is currently used to present the current bandwidth +availability to the user through the sysfs file +/sys/class/uwb_rc/uwbx/bw_avail. In the future the bandwidth +availability information will be used by the bandwidth reservation +routines. + +The bandwidth reservation routines are in progress and are thus not +present in the current release. When completed they will enable a user +to initiate DRP reservation requests through interaction with sysfs. DRP +reservation requests from remote UWB devices will also be handled. The +bandwidth management done by the UWB stack will include callbacks to the +higher layers will enable the higher layers to use the reservations upon +completion. [Note: The bandwidth reservation work is in progress and +subject to change.] + + + Wireless USB Host Controller drivers + +*WARNING* This section needs a lot of work! + +As explained above, there are three different types of HCs in the WUSB +world: HWA-HC, DWA-HC and WHCI-HC. + +HWA-HC and DWA-HC share that they are Wire-Adapters (USB or WUSB +connected controllers), and their transfer management system is almost +identical. So is their notification delivery system. + +HWA-HC and WHCI-HC share that they are both WUSB host controllers, so +they have to deal with WUSB device life cycle and maintenance, wireless +root-hub + +HWA exposes a Host Controller interface (HWA-HC 0xe0/02/02). This has +three endpoints (Notifications, Data Transfer In and Data Transfer +Out--known as NEP, DTI and DTO in the code). + +We reserve UWB bandwidth for our Wireless USB Cluster, create a Cluster +ID and tell the HC to use all that. Then we start it. This means the HC +starts sending MMCs. + + * + + The MMCs are blocks of data defined somewhere in the WUSB1.0 spec + that define a stream in the UWB channel time allocated for sending + WUSB IEs (host to device commands/notifications) and Device + Notifications (device initiated to host). Each host defines a + unique Wireless USB cluster through MMCs. Devices can connect to a + single cluster at the time. The IEs are Information Elements, and + among them are the bandwidth allocations that tell each device + when can they transmit or receive. + +Now it all depends on external stimuli. + +*New device connection* + +A new device pops up, it scans the radio looking for MMCs that give out +the existence of Wireless USB channels. Once one (or more) are found, +selects which one to connect to. Sends a /DN_Connect/ (device +notification connect) during the DNTS (Device Notification Time +Slot--announced in the MMCs + +HC picks the /DN_Connect/ out (nep module sends to notif.c for delivery +into /devconnect/). This process starts the authentication process for +the device. First we allocate a /fake port/ and assign an +unauthenticated address (128 to 255--what we really do is +0x80 | fake_port_idx). We fiddle with the fake port status and /khubd/ +sees a new connection, so he moves on to enable the fake port with a reset. + +So now we are in the reset path -- we know we have a non-yet enumerated +device with an unauthorized address; we ask user space to authenticate +(FIXME: not yet done, similar to bluetooth pairing), then we do the key +exchange (FIXME: not yet done) and issue a /set address 0/ to bring the +device to the default state. Device is authenticated. + +From here, the USB stack takes control through the usb_hcd ops. khubd +has seen the port status changes, as we have been toggling them. It will +start enumerating and doing transfers through usb_hcd->urb_enqueue() to +read descriptors and move our data. + +*Device life cycle and keep alives* + +Everytime there is a succesful transfer to/from a device, we update a +per-device activity timestamp. If not, every now and then we check and +if the activity timestamp gets old, we ping the device by sending it a +Keep Alive IE; it responds with a /DN_Alive/ pong during the DNTS (this +arrives to us as a notification through +devconnect.c:wusb_handle_dn_alive(). If a device times out, we +disconnect it from the system (cleaning up internal information and +toggling the bits in the fake hub port, which kicks khubd into removing +the rest of the stuff). + +This is done through devconnect:__wusb_check_devs(), which will scan the +device list looking for whom needs refreshing. + +If the device wants to disconnect, it will either die (ugly) or send a +/DN_Disconnect/ that will prompt a disconnection from the system. + +*Sending and receiving data* + +Data is sent and received through /Remote Pipes/ (rpipes). An rpipe is +/aimed/ at an endpoint in a WUSB device. This is the same for HWAs and +DWAs. + +Each HC has a number of rpipes and buffers that can be assigned to them; +when doing a data transfer (xfer), first the rpipe has to be aimed and +prepared (buffers assigned), then we can start queueing requests for +data in or out. + +Data buffers have to be segmented out before sending--so we send first a +header (segment request) and then if there is any data, a data buffer +immediately after to the DTI interface (yep, even the request). If our +buffer is bigger than the max segment size, then we just do multiple +requests. + +[This sucks, because doing USB scatter gatter in Linux is resource +intensive, if any...not that the current approach is not. It just has to +be cleaned up a lot :)]. + +If reading, we don't send data buffers, just the segment headers saying +we want to read segments. + +When the xfer is executed, we receive a notification that says data is +ready in the DTI endpoint (handled through +xfer.c:wa_handle_notif_xfer()). In there we read from the DTI endpoint a +descriptor that gives us the status of the transfer, its identification +(given when we issued it) and the segment number. If it was a data read, +we issue another URB to read into the destination buffer the chunk of +data coming out of the remote endpoint. Done, wait for the next guy. The +callbacks for the URBs issued from here are the ones that will declare +the xfer complete at some point and call it's callback. + +Seems simple, but the implementation is not trivial. + + * + + *WARNING* Old!! + +The main xfer descriptor, wa_xfer (equivalent to a URB) contains an +array of segments, tallys on segments and buffers and callback +information. Buried in there is a lot of URBs for executing the segments +and buffer transfers. + +For OUT xfers, there is an array of segments, one URB for each, another +one of buffer URB. When submitting, we submit URBs for segment request +1, buffer 1, segment 2, buffer 2...etc. Then we wait on the DTI for xfer +result data; when all the segments are complete, we call the callback to +finalize the transfer. + +For IN xfers, we only issue URBs for the segments we want to read and +then wait for the xfer result data. + +*URB mapping into xfers* + +This is done by hwahc_op_urb_[en|de]queue(). In enqueue() we aim an +rpipe to the endpoint where we have to transmit, create a transfer +context (wa_xfer) and submit it. When the xfer is done, our callback is +called and we assign the status bits and release the xfer resources. + +In dequeue() we are basically cancelling/aborting the transfer. We issue +a xfer abort request to the HC, cancell all the URBs we had submitted +and not yet done and when all that is done, the xfer callback will be +called--this will call the URB callback. + + + Glossary + +*DWA* -- Device Wire Adapter + +USB host, wired for downstream devices, upstream connects wirelessly +with Wireless USB. + +*EVENT* -- Response to a command on the NEEP + +*HWA* -- Host Wire Adapter / USB dongle for UWB and Wireless USB + +*NEH* -- Notification/Event Handle + +Handle/file descriptor for receiving notifications or events. The WA +code requires you to get one of this to listen for notifications or +events on the NEEP. + +*NEEP* -- Notification/Event EndPoint + +Stuff related to the management of the first endpoint of a HWA USB +dongle that is used to deliver an stream of events and notifications to +the host. + +*NOTIFICATION* -- Message coming in the NEEP as response to something. + +*RC* -- Radio Control + +Design-overview.txt-1.8 (last edited 2006-11-04 12:22:24 by +InakyPerezGonzalez) + diff --git a/Documentation/usb/anchors.txt b/Documentation/usb/anchors.txt index 5e6b64c20d2..6f24f566955 100644 --- a/Documentation/usb/anchors.txt +++ b/Documentation/usb/anchors.txt @@ -52,6 +52,11 @@ Therefore no guarantee is made that the URBs have been unlinked when the call returns. They may be unlinked later but will be unlinked in finite time. +usb_scuttle_anchored_urbs() +--------------------------- + +All URBs of an anchor are unanchored en masse. + usb_wait_anchor_empty_timeout() ------------------------------- @@ -59,4 +64,16 @@ This function waits for all URBs associated with an anchor to finish or a timeout, whichever comes first. Its return value will tell you whether the timeout was reached. +usb_anchor_empty() +------------------ + +Returns true if no URBs are associated with an anchor. Locking +is the caller's responsibility. + +usb_get_from_anchor() +--------------------- +Returns the oldest anchored URB of an anchor. The URB is unanchored +and returned with a reference. As you may mix URBs to several +destinations in one anchor you have no guarantee the chronologically +first submitted URB is returned.
\ No newline at end of file diff --git a/Documentation/usb/misc_usbsevseg.txt b/Documentation/usb/misc_usbsevseg.txt new file mode 100644 index 00000000000..0f6be4f9930 --- /dev/null +++ b/Documentation/usb/misc_usbsevseg.txt @@ -0,0 +1,46 @@ +USB 7-Segment Numeric Display +Manufactured by Delcom Engineering + +Device Information +------------------ +USB VENDOR_ID 0x0fc5 +USB PRODUCT_ID 0x1227 +Both the 6 character and 8 character displays have PRODUCT_ID, +and according to Delcom Engineering no queryable information +can be obtained from the device to tell them apart. + +Device Modes +------------ +By default, the driver assumes the display is only 6 characters +The mode for 6 characters is: + MSB 0x06; LSB 0x3f +For the 8 character display: + MSB 0x08; LSB 0xff +The device can accept "text" either in raw, hex, or ascii textmode. +raw controls each segment manually, +hex expects a value between 0-15 per character, +ascii expects a value between '0'-'9' and 'A'-'F'. +The default is ascii. + +Device Operation +---------------- +1. Turn on the device: + echo 1 > /sys/bus/usb/.../powered +2. Set the device's mode: + echo $mode_msb > /sys/bus/usb/.../mode_msb + echo $mode_lsb > /sys/bus/usb/.../mode_lsb +3. Set the textmode: + echo $textmode > /sys/bus/usb/.../textmode +4. set the text (for example): + echo "123ABC" > /sys/bus/usb/.../text (ascii) + echo "A1B2" > /sys/bus/usb/.../text (ascii) + echo -ne "\x01\x02\x03" > /sys/bus/usb/.../text (hex) +5. Set the decimal places. + The device has either 6 or 8 decimal points. + to set the nth decimal place calculate 10 ** n + and echo it in to /sys/bus/usb/.../decimals + To set multiple decimals points sum up each power. + For example, to set the 0th and 3rd decimal place + echo 1001 > /sys/bus/usb/.../decimals + + diff --git a/Documentation/usb/power-management.txt b/Documentation/usb/power-management.txt index 9d31140e3f5..e48ea1d5101 100644 --- a/Documentation/usb/power-management.txt +++ b/Documentation/usb/power-management.txt @@ -350,12 +350,12 @@ without holding the mutex. There also are a couple of utility routines drivers can use: - usb_autopm_enable() sets pm_usage_cnt to 1 and then calls - usb_autopm_set_interface(), which will attempt an autoresume. - - usb_autopm_disable() sets pm_usage_cnt to 0 and then calls + usb_autopm_enable() sets pm_usage_cnt to 0 and then calls usb_autopm_set_interface(), which will attempt an autosuspend. + usb_autopm_disable() sets pm_usage_cnt to 1 and then calls + usb_autopm_set_interface(), which will attempt an autoresume. + The conventional usage pattern is that a driver calls usb_autopm_get_interface() in its open routine and usb_autopm_put_interface() in its close or release routine. But diff --git a/Documentation/usb/wusb-cbaf b/Documentation/usb/wusb-cbaf new file mode 100644 index 00000000000..2e78b70f3ad --- /dev/null +++ b/Documentation/usb/wusb-cbaf @@ -0,0 +1,139 @@ +#! /bin/bash +# + +set -e + +progname=$(basename $0) +function help +{ + cat <<EOF +Usage: $progname COMMAND DEVICEs [ARGS] + +Command for manipulating the pairing/authentication credentials of a +Wireless USB device that supports wired-mode Cable-Based-Association. + +Works in conjunction with the wusb-cba.ko driver from http://linuxuwb.org. + + +DEVICE + + sysfs path to the device to authenticate; for example, both this + guys are the same: + + /sys/devices/pci0000:00/0000:00:1d.7/usb1/1-4/1-4.4/1-4.4:1.1 + /sys/bus/usb/drivers/wusb-cbaf/1-4.4:1.1 + +COMMAND/ARGS are + + start + + Start a WUSB host controller (by setting up a CHID) + + set-chid DEVICE HOST-CHID HOST-BANDGROUP HOST-NAME + + Sets host information in the device; after this you can call the + get-cdid to see how does this device report itself to us. + + get-cdid DEVICE + + Get the device ID associated to the HOST-CHDI we sent with + 'set-chid'. We might not know about it. + + set-cc DEVICE + + If we allow the device to connect, set a random new CDID and CK + (connection key). Device saves them for the next time it wants to + connect wireless. We save them for that next time also so we can + authenticate the device (when we see the CDID he uses to id + itself) and the CK to crypto talk to it. + +CHID is always 16 hex bytes in 'XX YY ZZ...' form +BANDGROUP is almost always 0001 + +Examples: + + You can default most arguments to '' to get a sane value: + + $ $progname set-chid '' '' '' "My host name" + + A full sequence: + + $ $progname set-chid '' '' '' "My host name" + $ $progname get-cdid '' + $ $progname set-cc '' + +EOF +} + + +# Defaults +# FIXME: CHID should come from a database :), band group from the host +host_CHID="00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff" +host_band_group="0001" +host_name=$(hostname) + +devs="$(echo /sys/bus/usb/drivers/wusb-cbaf/[0-9]*)" +hdevs="$(for h in /sys/class/uwb_rc/*/wusbhc; do readlink -f $h; done)" + +result=0 +case $1 in + start) + for dev in ${2:-$hdevs} + do + uwb_rc=$(readlink -f $dev/uwb_rc) + if cat $uwb_rc/beacon | grep -q -- "-1" + then + echo 13 0 > $uwb_rc/beacon + echo I: started beaconing on ch 13 on $(basename $uwb_rc) >&2 + fi + echo $host_CHID > $dev/wusb_chid + echo I: started host $(basename $dev) >&2 + done + ;; + stop) + for dev in ${2:-$hdevs} + do + echo 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > $dev/wusb_chid + echo I: stopped host $(basename $dev) >&2 + uwb_rc=$(readlink -f $dev/uwb_rc) + echo -1 | cat > $uwb_rc/beacon + echo I: stopped beaconing on $(basename $uwb_rc) >&2 + done + ;; + set-chid) + shift + for dev in ${2:-$devs}; do + echo "${4:-$host_name}" > $dev/wusb_host_name + echo "${3:-$host_band_group}" > $dev/wusb_host_band_groups + echo ${2:-$host_CHID} > $dev/wusb_chid + done + ;; + get-cdid) + for dev in ${2:-$devs} + do + cat $dev/wusb_cdid + done + ;; + set-cc) + for dev in ${2:-$devs}; do + shift + CDID="$(head --bytes=16 /dev/urandom | od -tx1 -An)" + CK="$(head --bytes=16 /dev/urandom | od -tx1 -An)" + echo "$CDID" > $dev/wusb_cdid + echo "$CK" > $dev/wusb_ck + + echo I: CC set >&2 + echo "CHID: $(cat $dev/wusb_chid)" + echo "CDID:$CDID" + echo "CK: $CK" + done + ;; + help|h|--help|-h) + help + ;; + *) + echo "E: Unknown usage" 1>&2 + help 1>&2 + result=1 +esac +exit $result diff --git a/Documentation/video4linux/CARDLIST.au0828 b/Documentation/video4linux/CARDLIST.au0828 index aa05e5bb22f..d5cb4ea287b 100644 --- a/Documentation/video4linux/CARDLIST.au0828 +++ b/Documentation/video4linux/CARDLIST.au0828 @@ -1,5 +1,5 @@ 0 -> Unknown board (au0828) - 1 -> Hauppauge HVR950Q (au0828) [2040:7200,2040:7210,2040:7217,2040:721b,2040:721f,2040:7280,0fd9:0008] + 1 -> Hauppauge HVR950Q (au0828) [2040:7200,2040:7210,2040:7217,2040:721b,2040:721e,2040:721f,2040:7280,0fd9:0008] 2 -> Hauppauge HVR850 (au0828) [2040:7240] 3 -> DViCO FusionHDTV USB (au0828) [0fe9:d620] 4 -> Hauppauge HVR950Q rev xxF8 (au0828) [2040:7201,2040:7211,2040:7281] diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index 30bbdda68d0..691d2f37dc5 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner @@ -75,3 +75,4 @@ tuner=73 - Samsung TCPG 6121P30A tuner=75 - Philips TEA5761 FM Radio tuner=76 - Xceive 5000 tuner tuner=77 - TCL tuner MF02GIP-5N-E +tuner=78 - Philips FMD1216MEX MK3 Hybrid Tuner diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt new file mode 100644 index 00000000000..125eed560e5 --- /dev/null +++ b/Documentation/vm/unevictable-lru.txt @@ -0,0 +1,615 @@ + +This document describes the Linux memory management "Unevictable LRU" +infrastructure and the use of this infrastructure to manage several types +of "unevictable" pages. The document attempts to provide the overall +rationale behind this mechanism and the rationale for some of the design +decisions that drove the implementation. The latter design rationale is +discussed in the context of an implementation description. Admittedly, one +can obtain the implementation details--the "what does it do?"--by reading the +code. One hopes that the descriptions below add value by provide the answer +to "why does it do that?". + +Unevictable LRU Infrastructure: + +The Unevictable LRU adds an additional LRU list to track unevictable pages +and to hide these pages from vmscan. This mechanism is based on a patch by +Larry Woodman of Red Hat to address several scalability problems with page +reclaim in Linux. The problems have been observed at customer sites on large +memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB +of main memory will have over 32 million 4k pages in a single zone. When a +large fraction of these pages are not evictable for any reason [see below], +vmscan will spend a lot of time scanning the LRU lists looking for the small +fraction of pages that are evictable. This can result in a situation where +all cpus are spending 100% of their time in vmscan for hours or days on end, +with the system completely unresponsive. + +The Unevictable LRU infrastructure addresses the following classes of +unevictable pages: + ++ page owned by ramfs ++ page mapped into SHM_LOCKed shared memory regions ++ page mapped into VM_LOCKED [mlock()ed] vmas + +The infrastructure might be able to handle other conditions that make pages +unevictable, either by definition or by circumstance, in the future. + + +The Unevictable LRU List + +The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list +called the "unevictable" list and an associated page flag, PG_unevictable, to +indicate that the page is being managed on the unevictable list. The +PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active +flag in that it indicates on which LRU list a page resides when PG_lru is set. +The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU +Kconfig option. + +The Unevictable LRU infrastructure maintains unevictable pages on an additional +LRU list for a few reasons: + +1) We get to "treat unevictable pages just like we treat other pages in the + system, which means we get to use the same code to manipulate them, the + same code to isolate them (for migrate, etc.), the same code to keep track + of the statistics, etc..." [Rik van Riel] + +2) We want to be able to migrate unevictable pages between nodes--for memory + defragmentation, workload management and memory hotplug. The linux kernel + can only migrate pages that it can successfully isolate from the lru lists. + If we were to maintain pages elsewise than on an lru-like list, where they + can be found by isolate_lru_page(), we would prevent their migration, unless + we reworked migration code to find the unevictable pages. + + +The unevictable LRU list does not differentiate between file backed and swap +backed [anon] pages. This differentiation is only important while the pages +are, in fact, evictable. + +The unevictable LRU list benefits from the "arrayification" of the per-zone +LRU lists and statistics originally proposed and posted by Christoph Lameter. + +The unevictable list does not use the lru pagevec mechanism. Rather, +unevictable pages are placed directly on the page's zone's unevictable +list under the zone lru_lock. The reason for this is to prevent stranding +of pages on the unevictable list when one task has the page isolated from the +lru and other tasks are changing the "evictability" state of the page. + + +Unevictable LRU and Memory Controller Interaction + +The memory controller data structure automatically gets a per zone unevictable +lru list as a result of the "arrayification" of the per-zone LRU lists. The +memory controller tracks the movement of pages to and from the unevictable list. +When a memory control group comes under memory pressure, the controller will +not attempt to reclaim pages on the unevictable list. This has a couple of +effects. Because the pages are "hidden" from reclaim on the unevictable list, +the reclaim process can be more efficient, dealing only with pages that have +a chance of being reclaimed. On the other hand, if too many of the pages +charged to the control group are unevictable, the evictable portion of the +working set of the tasks in the control group may not fit into the available +memory. This can cause the control group to thrash or to oom-kill tasks. + + +Unevictable LRU: Detecting Unevictable Pages + +The function page_evictable(page, vma) in vmscan.c determines whether a +page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, +page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's +address space using a wrapper function. Wrapper functions are used to set, +clear and test the flag to reduce the requirement for #ifdef's throughout the +source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. +This flag remains for the life of the inode. + +For shared memory regions, AS_UNEVICTABLE is set when an application +successfully SHM_LOCKs the region and is removed when the region is +SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page +tables for the region as does, for example, mlock(). So, we make no special +effort to push any pages in the SHM_LOCKed region to the unevictable list. +Vmscan will do this when/if it encounters the pages during reclaim. On +SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the +unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed +region is destroyed, the pages are also "rescued" from the unevictable list in +the process of freeing them. + +page_evictable() detects mlock()ed pages by testing an additional page flag, +PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a +non-NULL vma is supplied, page_evictable() will check whether the vma is +VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and +update the appropriate statistics if the vma is VM_LOCKED. This method allows +efficient "culling" of pages in the fault path that are being faulted in to +VM_LOCKED vmas. + + +Unevictable Pages and Vmscan [shrink_*_list()] + +If unevictable pages are culled in the fault path, or moved to the unevictable +list at mlock() or mmap() time, vmscan will never encounter the pages until +they have become evictable again, for example, via munlock() and have been +"rescued" from the unevictable list. However, there may be situations where we +decide, for the sake of expediency, to leave a unevictable page on one of the +regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for +such pages in all of the shrink_{active|inactive|page}_list() functions and +will "cull" such pages that it encounters--that is, it diverts those pages to +the unevictable list for the zone being scanned. + +There may be situations where a page is mapped into a VM_LOCKED vma, but the +page is not marked as PageMlocked. Such pages will make it all the way to +shrink_page_list() where they will be detected when vmscan walks the reverse +map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() +will cull the page at that point. + +Note that for anonymous pages, shrink_page_list() attempts to add the page to +the swap cache before it tries to unmap the page. To avoid this unnecessary +consumption of swap space, shrink_page_list() calls try_to_munlock() to check +whether any VM_LOCKED vmas map the page without attempting to unmap the page. +If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page +without consuming swap space. try_to_munlock() will be described below. + +To "cull" an unevictable page, vmscan simply puts the page back on the lru +list using putback_lru_page()--the inverse operation to isolate_lru_page()-- +after dropping the page lock. Because the condition which makes the page +unevictable may change once the page is unlocked, putback_lru_page() will +recheck the unevictable state of a page that it places on the unevictable lru +list. If the page has become unevictable, putback_lru_page() removes it from +the list and retries, including the page_unevictable() test. Because such a +race is a rare event and movement of pages onto the unevictable list should be +rare, these extra evictabilty checks should not occur in the majority of calls +to putback_lru_page(). + + +Mlocked Page: Prior Work + +The "Unevictable Mlocked Pages" infrastructure is based on work originally +posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". +Nick posted his patch as an alternative to a patch posted by Christoph +Lameter to achieve the same objective--hiding mlocked pages from vmscan. +In Nick's patch, he used one of the struct page lru list link fields as a count +of VM_LOCKED vmas that map the page. This use of the link field for a count +prevented the management of the pages on an LRU list. Thus, mlocked pages were +not migratable as isolate_lru_page() could not find them and the lru list link +field was not available to the migration subsystem. Nick resolved this by +putting mlocked pages back on the lru list before attempting to isolate them, +thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated +with the Unevictable LRU work, the count was replaced by walking the reverse +map to determine whether any VM_LOCKED vmas mapped the page. More on this +below. + + +Mlocked Pages: Basic Management + +Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of +unevictable pages. When such a page has been "noticed" by the memory +management subsystem, the page is marked with the PG_mlocked [PageMlocked()] +flag. A PageMlocked() page will be placed on the unevictable LRU list when +it is added to the LRU. Pages can be "noticed" by memory management in +several places: + +1) in the mlock()/mlockall() system call handlers. +2) in the mmap() system call handler when mmap()ing a region with the + MAP_LOCKED flag, or mmap()ing a region in a task that has called + mlockall() with the MCL_FUTURE flag. Both of these conditions result + in the VM_LOCKED flag being set for the vma. +3) in the fault path, if mlocked pages are "culled" in the fault path, + and when a VM_LOCKED stack segment is expanded. +4) as mentioned above, in vmscan:shrink_page_list() with attempting to + reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock(). + +Mlocked pages become unlocked and rescued from the unevictable list when: + +1) mapped in a range unlocked via the munlock()/munlockall() system calls. +2) munmapped() out of the last VM_LOCKED vma that maps the page, including + unmapping at task exit. +3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. +4) before a page is COWed in a VM_LOCKED vma. + + +Mlocked Pages: mlock()/mlockall() System Call Handling + +Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() +for each vma in the range specified by the call. In the case of mlockall(), +this is the entire active address space of the task. Note that mlock_fixup() +is used for both mlock()ing and munlock()ing a range of memory. A call to +mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED +is treated as a no-op--mlock_fixup() simply returns. + +If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas" +below, mlock_fixup() will attempt to merge the vma with its neighbors or split +off a subset of the vma if the range does not cover the entire vma. Once the +vma has been merged or split or neither, mlock_fixup() will call +__mlock_vma_pages_range() to fault in the pages via get_user_pages() and +to mark the pages as mlocked via mlock_vma_page(). + +Note that the vma being mlocked might be mapped with PROT_NONE. In this case, +get_user_pages() will be unable to fault in the pages. That's OK. If pages +do end up getting faulted into this VM_LOCKED vma, we'll handle them in the +fault path or in vmscan. + +Also note that a page returned by get_user_pages() could be truncated or +migrated out from under us, while we're trying to mlock it. To detect +this, __mlock_vma_pages_range() tests the page_mapping after acquiring +the page lock. If the page is still associated with its mapping, we'll +go ahead and call mlock_vma_page(). If the mapping is gone, we just +unlock the page and move on. Worse case, this results in page mapped +in a VM_LOCKED vma remaining on a normal LRU list without being +PageMlocked(). Again, vmscan will detect and cull such pages. + +mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will +TestSetPageMlocked() for each page returned by get_user_pages(). We use +TestSetPageMlocked() because the page might already be mlocked by another +task/vma and we don't want to do extra work. We especially do not want to +count an mlocked page more than once in the statistics. If the page was +already mlocked, mlock_vma_page() is done. + +If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the +page from the LRU, as it is likely on the appropriate active or inactive list +at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will +putback the page--putback_lru_page()--which will notice that the page is now +mlocked and divert the page to the zone's unevictable LRU list. If +mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle +it later if/when it attempts to reclaim the page. + + +Mlocked Pages: Filtering Special Vmas + +mlock_fixup() filters several classes of "special" vmas: + +1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind + these mappings are inherently pinned, so we don't need to mark them as + mlocked. In any case, most of the pages have no struct page in which to + so mark the page. Because of this, get_user_pages() will fail for these + vmas, so there is no sense in attempting to visit them. + +2) vmas mapping hugetlbfs page are already effectively pinned into memory. + We don't need nor want to mlock() these pages. However, to preserve the + prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup() + will call make_pages_present() in the hugetlbfs vma range to allocate the + huge pages and populate the ptes. + +3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of + kernel pages, such as the vdso page, relay channel pages, etc. These pages + are inherently unevictable and are not managed on the LRU lists. + mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls + make_pages_present() to populate the ptes. + +Note that for all of these special vmas, mlock_fixup() does not set the +VM_LOCKED flag. Therefore, we won't have to deal with them later during +munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() +account these vmas against the task's "locked_vm". + +Mlocked Pages: Downgrading the Mmap Semaphore. + +mlock_fixup() must be called with the mmap semaphore held for write, because +it may have to merge or split vmas. However, mlocking a large region of +memory can take a long time--especially if vmscan must reclaim pages to +satisfy the regions requirements. Faulting in a large region with the mmap +semaphore held for write can hold off other faults on the address space, in +the case of a multi-threaded task. It can also hold off scans of the task's +address space via /proc. While testing under heavy load, it was observed that +the ps(1) command could be held off for many minutes while a large segment was +mlock()ed down. + +To address this issue, and to make the system more responsive during mlock()ing +of large segments, mlock_fixup() downgrades the mmap semaphore to read mode +during the call to __mlock_vma_pages_range(). This works fine. However, the +callers of mlock_fixup() expect the semaphore to be returned in write mode. +So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not +support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore +and reacquire it in write mode. In a multi-threaded task, it is possible for +the task memory map to change while the semaphore is dropped. Therefore, +mlock_fixup() looks up the vma at the range start address after reacquiring +the semaphore in write mode and verifies that it still covers the original +range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of +mlock_fixup() have been changed to deal with this new error condition. + +Note: when munlocking a region, all of the pages should already be resident-- +unless we have racing threads mlocking() and munlocking() regions. So, +unlocking should not have to wait for page allocations nor faults of any kind. +Therefore mlock_fixup() does not downgrade the semaphore for munlock(). + + +Mlocked Pages: munlock()/munlockall() System Call Handling + +The munlock() and munlockall() system calls are handled by the same functions-- +do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock +vs lock operation indicated by an argument. So, these system calls are also +handled by mlock_fixup(). Again, if called for an already munlock()ed vma, +mlock_fixup() simply returns. Because of the vma filtering discussed above, +VM_LOCKED will not be set in any "special" vmas. So, these vmas will be +ignored for munlock. + +If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off +the specified range. The range is then munlocked via the function +__mlock_vma_pages_range()--the same function used to mlock a vma range-- +passing a flag to indicate that munlock() is being performed. + +Because the vma access protections could have been changed to PROT_NONE after +faulting in and mlocking some pages, get_user_pages() was unreliable for visiting +these pages for munlocking. Because we don't want to leave pages mlocked(), +get_user_pages() was enhanced to accept a flag to ignore the permissions when +fetching the pages--all of which should be resident as a result of previous +mlock()ing. + +For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling +munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked +flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() +use the Test*PageMlocked() function to handle the case where the page might +have already been unlocked by another task. If the page was mlocked, +munlock_vma_page() updates that zone statistics for the number of mlocked +pages. Note, however, that at this point we haven't checked whether the page +is mapped by other VM_LOCKED vmas. + +We can't call try_to_munlock(), the function that walks the reverse map to check +for other VM_LOCKED vmas, without first isolating the page from the LRU. +try_to_munlock() is a variant of try_to_unmap() and thus requires that the page +not be on an lru list. [More on these below.] However, the call to +isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). +So, we go ahead and clear PG_mlocked up front, as this might be the only chance +we have. If we can successfully isolate the page, we go ahead and +try_to_munlock(), which will restore the PG_mlocked flag and update the zone +page statistics if it finds another vma holding the page mlocked. If we fail +to isolate the page, we'll have left a potentially mlocked page on the LRU. +This is fine, because we'll catch it later when/if vmscan tries to reclaim the +page. This should be relatively rare. + +Mlocked Pages: Migrating Them... + +A page that is being migrated has been isolated from the lru lists and is +held locked across unmapping of the page, updating the page's mapping +[address_space] entry and copying the contents and state, until the +page table entry has been replaced with an entry that refers to the new +page. Linux supports migration of mlocked pages and other unevictable +pages. This involves simply moving the PageMlocked and PageUnevictable states +from the old page to the new page. + +Note that page migration can race with mlocking or munlocking of the same +page. This has been discussed from the mlock/munlock perspective in the +respective sections above. Both processes [migration, m[un]locking], hold +the page locked. This provides the first level of synchronization. Page +migration zeros out the page_mapping of the old page before unlocking it, +so m[un]lock can skip these pages by testing the page mapping under page +lock. + +When completing page migration, we place the new and old pages back onto the +lru after dropping the page lock. The "unneeded" page--old page on success, +new page on failure--will be freed when the reference count held by the +migration process is released. To ensure that we don't strand pages on the +unevictable list because of a race between munlock and migration, page +migration uses the putback_lru_page() function to add migrated pages back to +the lru. + + +Mlocked Pages: mmap(MAP_LOCKED) System Call Handling + +In addition the the mlock()/mlockall() system calls, an application can request +that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() +call. Furthermore, any mmap() call or brk() call that expands the heap by a +task that has previously called mlockall() with the MCL_FUTURE flag will result +in the newly mapped memory being mlocked. Before the unevictable/mlock changes, +the kernel simply called make_pages_present() to allocate pages and populate +the page table. + +To mlock a range of memory under the unevictable/mlock infrastructure, the +mmap() handler and task address space expansion functions call +mlock_vma_pages_range() specifying the vma and the address range to mlock. +mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in +"Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will +have already been set by the caller, in filtered vmas. Thus these vma's need +not be visited for munlock when the region is unmapped. + +For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to +fault/allocate the pages and mlock them. Again, like mlock_fixup(), +mlock_vma_pages_range() downgrades the mmap semaphore to read mode before +attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore +back to write mode before returning. + +The callers of mlock_vma_pages_range() will have already added the memory +range to be mlocked to the task's "locked_vm". To account for filtered vmas, +mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the +callers then subtract a non-negative return value from the task's locked_vm. +A negative return value represent an error--for example, from get_user_pages() +attempting to fault in a vma with PROT_NONE access. In this case, we leave +the memory range accounted as locked_vm, as the protections could be changed +later and pages allocated into that region. + + +Mlocked Pages: munmap()/exit()/exec() System Call Handling + +When unmapping an mlocked region of memory, whether by an explicit call to +munmap() or via an internal unmap from exit() or exec() processing, we must +munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. +Before the unevictable/mlock changes, mlocking did not mark the pages in any way, +so unmapping them required no processing. + +To munlock a range of memory under the unevictable/mlock infrastructure, the +munmap() hander and task address space tear down function call +munlock_vma_pages_all(). The name reflects the observation that one always +specifies the entire vma range when munlock()ing during unmap of a region. +Because of the vma filtering when mlocking() regions, only "normal" vmas that +actually contain mlocked pages will be passed to munlock_vma_pages_all(). + +munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() +for the munlock case, calls __munlock_vma_pages_range() to walk the page table +for the vma's memory range and munlock_vma_page() each resident page mapped by +the vma. This effectively munlocks the page, only if this is the last +VM_LOCKED vma that maps the page. + + +Mlocked Page: try_to_unmap() + +[Note: the code changes represented by this section are really quite small +compared to the text to describe what happening and why, and to discuss the +implications.] + +Pages can, of course, be mapped into multiple vmas. Some of these vmas may +have VM_LOCKED flag set. It is possible for a page mapped into one or more +VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one +of the active or inactive LRU lists. This could happen if, for example, a +task in the process of munlock()ing the page could not isolate the page from +the LRU. As a result, vmscan/shrink_page_list() might encounter such a page +as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To +handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED +vmas while it is walking a page's reverse map. + +try_to_unmap() is always called, by either vmscan for reclaim or for page +migration, with the argument page locked and isolated from the LRU. BUG_ON() +assertions enforce this requirement. Separate functions handle anonymous and +mapped file pages, as these types of pages have different reverse map +mechanisms. + + try_to_unmap_anon() + +To unmap anonymous pages, each vma in the list anchored in the anon_vma must be +visited--at least until a VM_LOCKED vma is encountered. If the page is being +unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked +pages are migratable. However, for reclaim, if the page is mapped into a +VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap +semphore of the mm_struct to which the vma belongs in read mode. If this is +successful, try_to_unmap() will mlock the page via mlock_vma_page()--we +wouldn't have gotten to try_to_unmap() if the page were already mlocked--and +will return SWAP_MLOCK, indicating that the page is unevictable. If the +mmap semaphore cannot be acquired, we are not sure whether the page is really +unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. + + try_to_unmap_file() -- linear mappings + +Unmapping of a mapped file page works the same, except that the scan visits +all vmas that maps the page's index/page offset in the page's mapping's +reverse map priority search tree. It must also visit each vma in the page's +mapping's non-linear list, if the list is non-empty. As for anonymous pages, +on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will +attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, +returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. + + try_to_unmap_file() -- non-linear mappings + +If a page's mapping contains a non-empty non-linear mapping vma list, then +try_to_un{map|lock}() must also visit each vma in that list to determine +whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit +all vmas in the non-linear list to ensure that the pages is not/should not be +mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. +However, there is no easy way to determine whether the page is actually mapped +in a given vma--either for unmapping or testing whether the VM_LOCKED vma +actually pins the page. + +So, try_to_unmap_file() handles non-linear mappings by scanning a certain +number of pages--a "cluster"--in each non-linear vma associated with the page's +mapping, for each file mapped page that vmscan tries to unmap. If this happens +to unmap the page we're trying to unmap, try_to_unmap() will notice this on +return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it +will return SWAP_AGAIN, causing vmscan to recirculate this page. We take +advantage of the cluster scan in try_to_unmap_cluster() as follows: + +For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap +semaphore of the associated mm_struct for read without blocking. If this +attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will +retain the mmap semaphore for the scan; otherwise it drops it here. Then, +for each page in the cluster, if we're holding the mmap semaphore for a locked +vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This +call is a no-op if the page is already locked, but will mlock any pages in +the non-linear mapping that happen to be unlocked. If one of the pages so +mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will +return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan +to cull the page, rather than recirculating it on the inactive list. Again, +if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns +SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but +couldn't be mlocked. + + +Mlocked pages: try_to_munlock() Reverse Map Scan + +TODO/FIXME: a better name might be page_mlocked()--analogous to the +page_referenced() reverse map walker--especially if we continue to call this +from shrink_page_list(). See related TODO/FIXME below. + +When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System +Call Handling" above--tries to munlock a page, or when shrink_page_list() +encounters an anonymous page that is not yet in the swap cache, they need to +determine whether or not the page is mapped by any VM_LOCKED vma, without +actually attempting to unmap all ptes from the page. For this purpose, the +unevictable/mlock infrastructure introduced a variant of try_to_unmap() called +try_to_munlock(). + +try_to_munlock() calls the same functions as try_to_unmap() for anonymous and +mapped file pages with an additional argument specifing unlock versus unmap +processing. Again, these functions walk the respective reverse maps looking +for VM_LOCKED vmas. When such a vma is found for anonymous pages and file +pages mapped in linear VMAs, as in the try_to_unmap() case, the functions +attempt to acquire the associated mmap semphore, mlock the page via +mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the +pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs +shrink_page_list() that the anonymous page should be culled rather than added +to the swap cache in preparation for a try_to_unmap() that will almost +certainly fail. + +If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap +semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() +to recycle the page on the inactive list and hope that it has better luck +with the page next time. + +For file pages mapped into non-linear vmas, the try_to_munlock() logic works +slightly differently. On encountering a VM_LOCKED non-linear vma that might +map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking +the page. munlock_vma_page() will just leave the page unlocked and let +vmscan deal with it--the usual fallback position. + +Note that try_to_munlock()'s reverse map walk must visit every vma in a pages' +reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. +However, the scan can terminate when it encounters a VM_LOCKED vma and can +successfully acquire the vma's mmap semphore for read and mlock the page. +Although try_to_munlock() can be called many [very many!] times when +munlock()ing a large region or tearing down a large address space that has been +mlocked via mlockall(), overall this is a fairly rare event. In addition, +although shrink_page_list() calls try_to_munlock() for every anonymous page that +it handles that is not yet in the swap cache, on average anonymous pages will +have very short reverse map lists. + +Mlocked Page: Page Reclaim in shrink_*_list() + +shrink_active_list() culls any obviously unevictable pages--i.e., +!page_evictable(page, NULL)--diverting these to the unevictable lru +list. However, shrink_active_list() only sees unevictable pages that +made it onto the active/inactive lru lists. Note that these pages do not +have PageUnevictable set--otherwise, they would be on the unevictable list and +shrink_active_list would never see them. + +Some examples of these unevictable pages on the LRU lists are: + +1) ramfs pages that have been placed on the lru lists when first allocated. + +2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to + allocate or fault in the pages in the shared memory region. This happens + when an application accesses the page the first time after SHM_LOCKing + the segment. + +3) Mlocked pages that could not be isolated from the lru and moved to the + unevictable list in mlock_vma_page(). + +3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't + acquire the vma's mmap semaphore to test the flags and set PageMlocked. + munlock_vma_page() was forced to let the page back on to the normal + LRU list for vmscan to handle. + +shrink_inactive_list() also culls any unevictable pages that it finds +on the inactive lists, again diverting them to the appropriate zone's unevictable +lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became +SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or +pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from +the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice +the latter, but will pass on to shrink_page_list(). + +shrink_page_list() again culls obviously unevictable pages that it could +encounter for similar reason to shrink_inactive_list(). As already discussed, +shrink_page_list() proactively looks for anonymous pages that should have +PG_mlocked set but don't--these would not be detected by page_evictable()--to +avoid adding them to the swap cache unnecessarily. File pages mapped into +VM_LOCKED vmas but without PG_mlocked set will make it all the way to +try_to_unmap(). shrink_page_list() will divert them to the unevictable list when +try_to_unmap() returns SWAP_MLOCK, as discussed above. + +TODO/FIXME: If we can enhance the swap cache to reliably remove entries +with page_count(page) > 2, as long as all ptes are mapped to the page and +not the swap entry, we can probably remove the call to try_to_munlock() in +shrink_page_list() and just remove the page from the swap cache when +try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page() +doesn't seem to allow that. + + |