diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/Changes | 15 | ||||
-rw-r--r-- | Documentation/SubmittingPatches | 76 | ||||
-rw-r--r-- | Documentation/arm/Samsung-S3C24XX/GPIO.txt | 10 | ||||
-rw-r--r-- | Documentation/development-process/5.Posting | 31 | ||||
-rw-r--r-- | Documentation/feature-removal-schedule.txt | 10 | ||||
-rw-r--r-- | Documentation/filesystems/debugfs.txt | 158 | ||||
-rw-r--r-- | Documentation/i2c/busses/i2c-ocores | 17 | ||||
-rw-r--r-- | Documentation/ide/ide.txt | 2 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 7 | ||||
-rw-r--r-- | Documentation/lguest/Makefile | 3 | ||||
-rw-r--r-- | Documentation/lguest/lguest.c | 1008 | ||||
-rw-r--r-- | Documentation/lguest/lguest.txt | 1 | ||||
-rw-r--r-- | Documentation/power/devices.txt | 34 | ||||
-rw-r--r-- | Documentation/sound/alsa/ALSA-Configuration.txt | 36 | ||||
-rw-r--r-- | Documentation/sound/alsa/HD-Audio-Models.txt | 18 | ||||
-rw-r--r-- | Documentation/sound/alsa/Procfile.txt | 36 | ||||
-rw-r--r-- | Documentation/sound/alsa/README.maya44 | 163 | ||||
-rw-r--r-- | Documentation/sound/alsa/soc/dapm.txt | 1 | ||||
-rw-r--r-- | Documentation/x86/x86_64/boot-options.txt | 44 | ||||
-rw-r--r-- | Documentation/x86/x86_64/machinecheck | 8 |
20 files changed, 955 insertions, 723 deletions
diff --git a/Documentation/Changes b/Documentation/Changes index b95082be4d5..d21b3b5aa54 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -48,6 +48,7 @@ o procps 3.2.0 # ps --version o oprofile 0.9 # oprofiled --version o udev 081 # udevinfo -V o grub 0.93 # grub --version +o mcelog 0.6 Kernel compilation ================== @@ -276,6 +277,16 @@ before running exportfs or mountd. It is recommended that all NFS services be protected from the internet-at-large by a firewall where that is possible. +mcelog +------ + +In Linux 2.6.31+ the i386 kernel needs to run the mcelog utility +as a regular cronjob similar to the x86-64 kernel to process and log +machine check events when CONFIG_X86_NEW_MCE is enabled. Machine check +events are errors reported by the CPU. Processing them is strongly encouraged. +All x86-64 kernels since 2.6.4 require the mcelog utility to +process machine checks. + Getting updated software ======================== @@ -365,6 +376,10 @@ FUSE ---- o <http://sourceforge.net/projects/fuse> +mcelog +------ +o <ftp://ftp.kernel.org/pub/linux/utils/cpu/mce/mcelog/> + Networking ********** diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index f309d3c6221..6c456835c1f 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -91,6 +91,10 @@ Be as specific as possible. The WORST descriptions possible include things like "update driver X", "bug fix for driver X", or "this patch includes updates for subsystem X. Please apply." +The maintainer will thank you if you write your patch description in a +form which can be easily pulled into Linux's source code management +system, git, as a "commit log". See #15, below. + If your description starts to get long, that's a sign that you probably need to split up your patch. See #3, next. @@ -405,7 +409,14 @@ person it names. This tag documents that potentially interested parties have been included in the discussion -14) Using Tested-by: and Reviewed-by: +14) Using Reported-by:, Tested-by: and Reviewed-by: + +If this patch fixes a problem reported by somebody else, consider adding a +Reported-by: tag to credit the reporter for their contribution. Please +note that this tag should not be added without the reporter's permission, +especially if the problem was not reported in a public forum. That said, +if we diligently credit our bug reporters, they will, hopefully, be +inspired to help us again in the future. A Tested-by: tag indicates that the patch has been successfully tested (in some environment) by the person named. This tag informs maintainers that @@ -444,7 +455,7 @@ offer a Reviewed-by tag for a patch. This tag serves to give credit to reviewers and to inform maintainers of the degree of review which has been done on the patch. Reviewed-by: tags, when supplied by reviewers known to understand the subject area and to perform thorough reviews, will normally -increase the liklihood of your patch getting into the kernel. +increase the likelihood of your patch getting into the kernel. 15) The canonical patch format @@ -485,12 +496,33 @@ phrase" should not be a filename. Do not use the same "summary phrase" for every patch in a whole patch series (where a "patch series" is an ordered sequence of multiple, related patches). -Bear in mind that the "summary phrase" of your email becomes -a globally-unique identifier for that patch. It propagates -all the way into the git changelog. The "summary phrase" may -later be used in developer discussions which refer to the patch. -People will want to google for the "summary phrase" to read -discussion regarding that patch. +Bear in mind that the "summary phrase" of your email becomes a +globally-unique identifier for that patch. It propagates all the way +into the git changelog. The "summary phrase" may later be used in +developer discussions which refer to the patch. People will want to +google for the "summary phrase" to read discussion regarding that +patch. It will also be the only thing that people may quickly see +when, two or three months later, they are going through perhaps +thousands of patches using tools such as "gitk" or "git log +--oneline". + +For these reasons, the "summary" must be no more than 70-75 +characters, and it must describe both what the patch changes, as well +as why the patch might be necessary. It is challenging to be both +succinct and descriptive, but that is what a well-written summary +should do. + +The "summary phrase" may be prefixed by tags enclosed in square +brackets: "Subject: [PATCH tag] <summary phrase>". The tags are not +considered part of the summary phrase, but describe how the patch +should be treated. Common tags might include a version descriptor if +the multiple versions of the patch have been sent out in response to +comments (i.e., "v1, v2, v3"), or "RFC" to indicate a request for +comments. If there are four patches in a patch series the individual +patches may be numbered like this: 1/4, 2/4, 3/4, 4/4. This assures +that developers understand the order in which the patches should be +applied and that they have reviewed or applied all of the patches in +the patch series. A couple of example Subjects: @@ -510,19 +542,31 @@ the patch author in the changelog. The explanation body will be committed to the permanent source changelog, so should make sense to a competent reader who has long since forgotten the immediate details of the discussion that might -have led to this patch. +have led to this patch. Including symptoms of the failure which the +patch addresses (kernel log messages, oops messages, etc.) is +especially useful for people who might be searching the commit logs +looking for the applicable patch. If a patch fixes a compile failure, +it may not be necessary to include _all_ of the compile failures; just +enough that it is likely that someone searching for the patch can find +it. As in the "summary phrase", it is important to be both succinct as +well as descriptive. The "---" marker line serves the essential purpose of marking for patch handling tools where the changelog message ends. One good use for the additional comments after the "---" marker is for -a diffstat, to show what files have changed, and the number of inserted -and deleted lines per file. A diffstat is especially useful on bigger -patches. Other comments relevant only to the moment or the maintainer, -not suitable for the permanent changelog, should also go here. -Use diffstat options "-p 1 -w 70" so that filenames are listed from the -top of the kernel source tree and don't use too much horizontal space -(easily fit in 80 columns, maybe with some indentation). +a diffstat, to show what files have changed, and the number of +inserted and deleted lines per file. A diffstat is especially useful +on bigger patches. Other comments relevant only to the moment or the +maintainer, not suitable for the permanent changelog, should also go +here. A good example of such comments might be "patch changelogs" +which describe what has changed between the v1 and v2 version of the +patch. + +If you are going to include a diffstat after the "---" marker, please +use diffstat options "-p 1 -w 70" so that filenames are listed from +the top of the kernel source tree and don't use too much horizontal +space (easily fit in 80 columns, maybe with some indentation). See more details on the proper patch format in the following references. diff --git a/Documentation/arm/Samsung-S3C24XX/GPIO.txt b/Documentation/arm/Samsung-S3C24XX/GPIO.txt index ea7ccfc4b27..948c8718d96 100644 --- a/Documentation/arm/Samsung-S3C24XX/GPIO.txt +++ b/Documentation/arm/Samsung-S3C24XX/GPIO.txt @@ -51,7 +51,7 @@ PIN Numbers ----------- Each pin has an unique number associated with it in regs-gpio.h, - eg S3C2410_GPA0 or S3C2410_GPF1. These defines are used to tell + eg S3C2410_GPA(0) or S3C2410_GPF(1). These defines are used to tell the GPIO functions which pin is to be used. @@ -65,11 +65,11 @@ Configuring a pin Eg: - s3c2410_gpio_cfgpin(S3C2410_GPA0, S3C2410_GPA0_ADDR0); - s3c2410_gpio_cfgpin(S3C2410_GPE8, S3C2410_GPE8_SDDAT1); + s3c2410_gpio_cfgpin(S3C2410_GPA(0), S3C2410_GPA0_ADDR0); + s3c2410_gpio_cfgpin(S3C2410_GPE(8), S3C2410_GPE8_SDDAT1); - which would turn GPA0 into the lowest Address line A0, and set - GPE8 to be connected to the SDIO/MMC controller's SDDAT1 line. + which would turn GPA(0) into the lowest Address line A0, and set + GPE(8) to be connected to the SDIO/MMC controller's SDDAT1 line. Reading the current configuration diff --git a/Documentation/development-process/5.Posting b/Documentation/development-process/5.Posting index dd48132a74d..f622c1e9f0f 100644 --- a/Documentation/development-process/5.Posting +++ b/Documentation/development-process/5.Posting @@ -119,7 +119,7 @@ which takes quite a bit of time and thought after the "real work" has been done. When done properly, though, it is time well spent. -5.4: PATCH FORMATTING +5.4: PATCH FORMATTING AND CHANGELOGS So now you have a perfect series of patches for posting, but the work is not done quite yet. Each patch needs to be formatted into a message which @@ -146,8 +146,33 @@ that end, each patch will be composed of the following: - One or more tag lines, with, at a minimum, one Signed-off-by: line from the author of the patch. Tags will be described in more detail below. -The above three items should, normally, be the text used when committing -the change to a revision control system. They are followed by: +The items above, together, form the changelog for the patch. Writing good +changelogs is a crucial but often-neglected art; it's worth spending +another moment discussing this issue. When writing a changelog, you should +bear in mind that a number of different people will be reading your words. +These include subsystem maintainers and reviewers who need to decide +whether the patch should be included, distributors and other maintainers +trying to decide whether a patch should be backported to other kernels, bug +hunters wondering whether the patch is responsible for a problem they are +chasing, users who want to know how the kernel has changed, and more. A +good changelog conveys the needed information to all of these people in the +most direct and concise way possible. + +To that end, the summary line should describe the effects of and motivation +for the change as well as possible given the one-line constraint. The +detailed description can then amplify on those topics and provide any +needed additional information. If the patch fixes a bug, cite the commit +which introduced the bug if possible. If a problem is associated with +specific log or compiler output, include that output to help others +searching for a solution to the same problem. If the change is meant to +support other changes coming in later patch, say so. If internal APIs are +changed, detail those changes and how other developers should respond. In +general, the more you can put yourself into the shoes of everybody who will +be reading your changelog, the better that changelog (and the kernel as a +whole) will be. + +Needless to say, the changelog should be the text used when committing the +change to a revision control system. It will be followed by: - The patch itself, in the unified ("-u") patch format. Using the "-p" option to diff will associate function names with changes, making the diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index de491a3e231..ec9ef5d0d7b 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -437,3 +437,13 @@ Why: Superseded by tdfxfb. I2C/DDC support used to live in a separate driver but this caused driver conflicts. Who: Jean Delvare <khali@linux-fr.org> Krzysztof Helt <krzysztof.h1@wp.pl> + +---------------------------- + +What: CONFIG_X86_OLD_MCE +When: 2.6.32 +Why: Remove the old legacy 32bit machine check code. This has been + superseded by the newer machine check code from the 64bit port, + but the old version has been kept around for easier testing. Note this + doesn't impact the old P5 and WinChip machine check handlers. +Who: Andi Kleen <andi@firstfloor.org> diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt new file mode 100644 index 00000000000..ed52af60c2d --- /dev/null +++ b/Documentation/filesystems/debugfs.txt @@ -0,0 +1,158 @@ +Copyright 2009 Jonathan Corbet <corbet@lwn.net> + +Debugfs exists as a simple way for kernel developers to make information +available to user space. Unlike /proc, which is only meant for information +about a process, or sysfs, which has strict one-value-per-file rules, +debugfs has no rules at all. Developers can put any information they want +there. The debugfs filesystem is also intended to not serve as a stable +ABI to user space; in theory, there are no stability constraints placed on +files exported there. The real world is not always so simple, though [1]; +even debugfs interfaces are best designed with the idea that they will need +to be maintained forever. + +Debugfs is typically mounted with a command like: + + mount -t debugfs none /sys/kernel/debug + +(Or an equivalent /etc/fstab line). + +Note that the debugfs API is exported GPL-only to modules. + +Code using debugfs should include <linux/debugfs.h>. Then, the first order +of business will be to create at least one directory to hold a set of +debugfs files: + + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); + +This call, if successful, will make a directory called name underneath the +indicated parent directory. If parent is NULL, the directory will be +created in the debugfs root. On success, the return value is a struct +dentry pointer which can be used to create files in the directory (and to +clean it up at the end). A NULL return value indicates that something went +wrong. If ERR_PTR(-ENODEV) is returned, that is an indication that the +kernel has been built without debugfs support and none of the functions +described below will work. + +The most general way to create a file within a debugfs directory is with: + + struct dentry *debugfs_create_file(const char *name, mode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops); + +Here, name is the name of the file to create, mode describes the access +permissions the file should have, parent indicates the directory which +should hold the file, data will be stored in the i_private field of the +resulting inode structure, and fops is a set of file operations which +implement the file's behavior. At a minimum, the read() and/or write() +operations should be provided; others can be included as needed. Again, +the return value will be a dentry pointer to the created file, NULL for +error, or ERR_PTR(-ENODEV) if debugfs support is missing. + +In a number of cases, the creation of a set of file operations is not +actually necessary; the debugfs code provides a number of helper functions +for simple situations. Files containing a single integer value can be +created with any of: + + struct dentry *debugfs_create_u8(const char *name, mode_t mode, + struct dentry *parent, u8 *value); + struct dentry *debugfs_create_u16(const char *name, mode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_u32(const char *name, mode_t mode, + struct dentry *parent, u32 *value); + struct dentry *debugfs_create_u64(const char *name, mode_t mode, + struct dentry *parent, u64 *value); + +These files support both reading and writing the given value; if a specific +file should not be written to, simply set the mode bits accordingly. The +values in these files are in decimal; if hexadecimal is more appropriate, +the following functions can be used instead: + + struct dentry *debugfs_create_x8(const char *name, mode_t mode, + struct dentry *parent, u8 *value); + struct dentry *debugfs_create_x16(const char *name, mode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_x32(const char *name, mode_t mode, + struct dentry *parent, u32 *value); + +Note that there is no debugfs_create_x64(). + +These functions are useful as long as the developer knows the size of the +value to be exported. Some types can have different widths on different +architectures, though, complicating the situation somewhat. There is a +function meant to help out in one special case: + + struct dentry *debugfs_create_size_t(const char *name, mode_t mode, + struct dentry *parent, + size_t *value); + +As might be expected, this function will create a debugfs file to represent +a variable of type size_t. + +Boolean values can be placed in debugfs with: + + struct dentry *debugfs_create_bool(const char *name, mode_t mode, + struct dentry *parent, u32 *value); + +A read on the resulting file will yield either Y (for non-zero values) or +N, followed by a newline. If written to, it will accept either upper- or +lower-case values, or 1 or 0. Any other input will be silently ignored. + +Finally, a block of arbitrary binary data can be exported with: + + struct debugfs_blob_wrapper { + void *data; + unsigned long size; + }; + + struct dentry *debugfs_create_blob(const char *name, mode_t mode, + struct dentry *parent, + struct debugfs_blob_wrapper *blob); + +A read of this file will return the data pointed to by the +debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way +to return several lines of (static) formatted text output. This function +can be used to export binary information, but there does not appear to be +any code which does so in the mainline. Note that all files created with +debugfs_create_blob() are read-only. + +There are a couple of other directory-oriented helper functions: + + struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *old_dentry, + struct dentry *new_dir, + const char *new_name); + + struct dentry *debugfs_create_symlink(const char *name, + struct dentry *parent, + const char *target); + +A call to debugfs_rename() will give a new name to an existing debugfs +file, possibly in a different directory. The new_name must not exist prior +to the call; the return value is old_dentry with updated information. +Symbolic links can be created with debugfs_create_symlink(). + +There is one important thing that all debugfs users must take into account: +there is no automatic cleanup of any directories created in debugfs. If a +module is unloaded without explicitly removing debugfs entries, the result +will be a lot of stale pointers and no end of highly antisocial behavior. +So all debugfs users - at least those which can be built as modules - must +be prepared to remove all files and directories they create there. A file +can be removed with: + + void debugfs_remove(struct dentry *dentry); + +The dentry value can be NULL, in which case nothing will be removed. + +Once upon a time, debugfs users were required to remember the dentry +pointer for every debugfs file they created so that all files could be +cleaned up. We live in more civilized times now, though, and debugfs users +can call: + + void debugfs_remove_recursive(struct dentry *dentry); + +If this function is passed a pointer for the dentry corresponding to the +top-level directory, the entire hierarchy below that directory will be +removed. + +Notes: + [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/i2c/busses/i2c-ocores b/Documentation/i2c/busses/i2c-ocores index cfcebb10d14..c269aaa2f26 100644 --- a/Documentation/i2c/busses/i2c-ocores +++ b/Documentation/i2c/busses/i2c-ocores @@ -20,6 +20,8 @@ platform_device with the base address and interrupt number. The dev.platform_data of the device should also point to a struct ocores_i2c_platform_data (see linux/i2c-ocores.h) describing the distance between registers and the input clock speed. +There is also a possibility to attach a list of i2c_board_info which +the i2c-ocores driver will add to the bus upon creation. E.G. something like: @@ -36,9 +38,24 @@ static struct resource ocores_resources[] = { }, }; +/* optional board info */ +struct i2c_board_info ocores_i2c_board_info[] = { + { + I2C_BOARD_INFO("tsc2003", 0x48), + .platform_data = &tsc2003_platform_data, + .irq = TSC_IRQ + }, + { + I2C_BOARD_INFO("adv7180", 0x42 >> 1), + .irq = ADV_IRQ + } +}; + static struct ocores_i2c_platform_data myi2c_data = { .regstep = 2, /* two bytes between registers */ .clock_khz = 50000, /* input clock of 50MHz */ + .devices = ocores_i2c_board_info, /* optional table of devices */ + .num_devices = ARRAY_SIZE(ocores_i2c_board_info), /* table size */ }; static struct platform_device myi2c = { diff --git a/Documentation/ide/ide.txt b/Documentation/ide/ide.txt index 0c78f4b1d9d..e77bebfa7b0 100644 --- a/Documentation/ide/ide.txt +++ b/Documentation/ide/ide.txt @@ -216,6 +216,8 @@ Other kernel parameters for ide_core are: * "noflush=[interface_number.device_number]" to disable flush requests +* "nohpa=[interface_number.device_number]" to disable Host Protected Area + * "noprobe=[interface_number.device_number]" to skip probing * "nowerr=[interface_number.device_number]" to ignore the WRERR_STAT bit diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 7bcdebffdab..0bf8a882ee9 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -887,11 +887,8 @@ and is between 256 and 4096 characters. It is defined in the file ide-core.nodma= [HW] (E)IDE subsystem Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc - .vlb_clock .pci_clock .noflush .noprobe .nowerr .cdrom - .chs .ignore_cable are additional options - See Documentation/ide/ide.txt. - - idebus= [HW] (E)IDE subsystem - VLB/PCI bus speed + .vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr + .cdrom .chs .ignore_cable are additional options See Documentation/ide/ide.txt. ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem diff --git a/Documentation/lguest/Makefile b/Documentation/lguest/Makefile index 1f4f9e888bd..28c8cdfcafd 100644 --- a/Documentation/lguest/Makefile +++ b/Documentation/lguest/Makefile @@ -1,6 +1,5 @@ # This creates the demonstration utility "lguest" which runs a Linux guest. -CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include -I../../arch/x86/include -U_FORTIFY_SOURCE -LDLIBS:=-lz +CFLAGS:=-m32 -Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include -I../../arch/x86/include -U_FORTIFY_SOURCE all: lguest diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index d36fcc0f271..9ebcd6ef361 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -16,6 +16,7 @@ #include <sys/types.h> #include <sys/stat.h> #include <sys/wait.h> +#include <sys/eventfd.h> #include <fcntl.h> #include <stdbool.h> #include <errno.h> @@ -59,7 +60,6 @@ typedef uint8_t u8; /*:*/ #define PAGE_PRESENT 0x7 /* Present, RW, Execute */ -#define NET_PEERNUM 1 #define BRIDGE_PFX "bridge:" #ifndef SIOCBRADDIF #define SIOCBRADDIF 0x89a2 /* add interface to bridge */ @@ -76,19 +76,12 @@ static bool verbose; do { if (verbose) printf(args); } while(0) /*:*/ -/* File descriptors for the Waker. */ -struct { - int pipe[2]; - int lguest_fd; -} waker_fds; - /* The pointer to the start of guest memory. */ static void *guest_base; /* The maximum guest physical address allowed, and maximum possible. */ static unsigned long guest_limit, guest_max; -/* The pipe for signal hander to write to. */ -static int timeoutpipe[2]; -static unsigned int timeout_usec = 500; +/* The /dev/lguest file descriptor. */ +static int lguest_fd; /* a per-cpu variable indicating whose vcpu is currently running */ static unsigned int __thread cpu_id; @@ -96,11 +89,6 @@ static unsigned int __thread cpu_id; /* This is our list of devices. */ struct device_list { - /* Summary information about the devices in our list: ready to pass to - * select() to ask which need servicing.*/ - fd_set infds; - int max_infd; - /* Counter to assign interrupt numbers. */ unsigned int next_irq; @@ -126,22 +114,21 @@ struct device /* The linked-list pointer. */ struct device *next; - /* The this device's descriptor, as mapped into the Guest. */ + /* The device's descriptor, as mapped into the Guest. */ struct lguest_device_desc *desc; + /* We can't trust desc values once Guest has booted: we use these. */ + unsigned int feature_len; + unsigned int num_vq; + /* The name of this device, for --verbose. */ const char *name; - /* If handle_input is set, it wants to be called when this file - * descriptor is ready. */ - int fd; - bool (*handle_input)(int fd, struct device *me); - /* Any queues attached to this device */ struct virtqueue *vq; - /* Handle status being finalized (ie. feature bits stable). */ - void (*ready)(struct device *me); + /* Is it operational */ + bool running; /* Device-specific data. */ void *priv; @@ -164,22 +151,28 @@ struct virtqueue /* Last available index we saw. */ u16 last_avail_idx; - /* The routine to call when the Guest pings us, or timeout. */ - void (*handle_output)(int fd, struct virtqueue *me, bool timeout); + /* How many are used since we sent last irq? */ + unsigned int pending_used; - /* Outstanding buffers */ - unsigned int inflight; + /* Eventfd where Guest notifications arrive. */ + int eventfd; - /* Is this blocked awaiting a timer? */ - bool blocked; + /* Function for the thread which is servicing this virtqueue. */ + void (*service)(struct virtqueue *vq); + pid_t thread; }; /* Remember the arguments to the program so we can "reboot" */ static char **main_args; -/* Since guest is UP and we don't run at the same time, we don't need barriers. - * But I include them in the code in case others copy it. */ -#define wmb() +/* The original tty settings to restore on exit. */ +static struct termios orig_term; + +/* We have to be careful with barriers: our devices are all run in separate + * threads and so we need to make sure that changes visible to the Guest happen + * in precise order. */ +#define wmb() __asm__ __volatile__("" : : : "memory") +#define mb() __asm__ __volatile__("" : : : "memory") /* Convert an iovec element to the given type. * @@ -245,7 +238,7 @@ static void iov_consume(struct iovec iov[], unsigned num_iov, unsigned len) static u8 *get_feature_bits(struct device *dev) { return (u8 *)(dev->desc + 1) - + dev->desc->num_vq * sizeof(struct lguest_vqconfig); + + dev->num_vq * sizeof(struct lguest_vqconfig); } /*L:100 The Launcher code itself takes us out into userspace, that scary place @@ -505,99 +498,19 @@ static void concat(char *dst, char *args[]) * saw the arguments it expects when we looked at initialize() in lguest_user.c: * the base of Guest "physical" memory, the top physical page to allow and the * entry point for the Guest. */ -static int tell_kernel(unsigned long start) +static void tell_kernel(unsigned long start) { unsigned long args[] = { LHREQ_INITIALIZE, (unsigned long)guest_base, guest_limit / getpagesize(), start }; - int fd; - verbose("Guest: %p - %p (%#lx)\n", guest_base, guest_base + guest_limit, guest_limit); - fd = open_or_die("/dev/lguest", O_RDWR); - if (write(fd, args, sizeof(args)) < 0) + lguest_fd = open_or_die("/dev/lguest", O_RDWR); + if (write(lguest_fd, args, sizeof(args)) < 0) err(1, "Writing to /dev/lguest"); - - /* We return the /dev/lguest file descriptor to control this Guest */ - return fd; } /*:*/ -static void add_device_fd(int fd) -{ - FD_SET(fd, &devices.infds); - if (fd > devices.max_infd) - devices.max_infd = fd; -} - -/*L:200 - * The Waker. - * - * With console, block and network devices, we can have lots of input which we - * need to process. We could try to tell the kernel what file descriptors to - * watch, but handing a file descriptor mask through to the kernel is fairly - * icky. - * - * Instead, we clone off a thread which watches the file descriptors and writes - * the LHREQ_BREAK command to the /dev/lguest file descriptor to tell the Host - * stop running the Guest. This causes the Launcher to return from the - * /dev/lguest read with -EAGAIN, where it will write to /dev/lguest to reset - * the LHREQ_BREAK and wake us up again. - * - * This, of course, is merely a different *kind* of icky. - * - * Given my well-known antipathy to threads, I'd prefer to use processes. But - * it's easier to share Guest memory with threads, and trivial to share the - * devices.infds as the Launcher changes it. - */ -static int waker(void *unused) -{ - /* Close the write end of the pipe: only the Launcher has it open. */ - close(waker_fds.pipe[1]); - - for (;;) { - fd_set rfds = devices.infds; - unsigned long args[] = { LHREQ_BREAK, 1 }; - unsigned int maxfd = devices.max_infd; - - /* We also listen to the pipe from the Launcher. */ - FD_SET(waker_fds.pipe[0], &rfds); - if (waker_fds.pipe[0] > maxfd) - maxfd = waker_fds.pipe[0]; - - /* Wait until input is ready from one of the devices. */ - select(maxfd+1, &rfds, NULL, NULL, NULL); - - /* Message from Launcher? */ - if (FD_ISSET(waker_fds.pipe[0], &rfds)) { - char c; - /* If this fails, then assume Launcher has exited. - * Don't do anything on exit: we're just a thread! */ - if (read(waker_fds.pipe[0], &c, 1) != 1) - _exit(0); - continue; - } - - /* Send LHREQ_BREAK command to snap the Launcher out of it. */ - pwrite(waker_fds.lguest_fd, args, sizeof(args), cpu_id); - } - return 0; -} - -/* This routine just sets up a pipe to the Waker process. */ -static void setup_waker(int lguest_fd) -{ - /* This pipe is closed when Launcher dies, telling Waker. */ - if (pipe(waker_fds.pipe) != 0) - err(1, "Creating pipe for Waker"); - - /* Waker also needs to know the lguest fd */ - waker_fds.lguest_fd = lguest_fd; - - if (clone(waker, malloc(4096) + 4096, CLONE_VM | SIGCHLD, NULL) == -1) - err(1, "Creating Waker"); -} - /* * Device Handling. * @@ -623,49 +536,90 @@ static void *_check_pointer(unsigned long addr, unsigned int size, /* Each buffer in the virtqueues is actually a chain of descriptors. This * function returns the next descriptor in the chain, or vq->vring.num if we're * at the end. */ -static unsigned next_desc(struct virtqueue *vq, unsigned int i) +static unsigned next_desc(struct vring_desc *desc, + unsigned int i, unsigned int max) { unsigned int next; /* If this descriptor says it doesn't chain, we're done. */ - if (!(vq->vring.desc[i].flags & VRING_DESC_F_NEXT)) - return vq->vring.num; + if (!(desc[i].flags & VRING_DESC_F_NEXT)) + return max; /* Check they're not leading us off end of descriptors. */ - next = vq->vring.desc[i].next; + next = desc[i].next; /* Make sure compiler knows to grab that: we don't want it changing! */ wmb(); - if (next >= vq->vring.num) + if (next >= max) errx(1, "Desc next is %u", next); return next; } +/* This actually sends the interrupt for this virtqueue */ +static void trigger_irq(struct virtqueue *vq) +{ + unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; + + /* Don't inform them if nothing used. */ + if (!vq->pending_used) + return; + vq->pending_used = 0; + + /* If they don't want an interrupt, don't send one, unless empty. */ + if ((vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) + && lg_last_avail(vq) != vq->vring.avail->idx) + return; + + /* Send the Guest an interrupt tell them we used something up. */ + if (write(lguest_fd, buf, sizeof(buf)) != 0) + err(1, "Triggering irq %i", vq->config.irq); +} + /* This looks in the virtqueue and for the first available buffer, and converts * it to an iovec for convenient access. Since descriptors consist of some * number of output then some number of input descriptors, it's actually two * iovecs, but we pack them into one and note how many of each there were. * - * This function returns the descriptor number found, or vq->vring.num (which - * is never a valid descriptor number) if none was found. */ -static unsigned get_vq_desc(struct virtqueue *vq, - struct iovec iov[], - unsigned int *out_num, unsigned int *in_num) + * This function returns the descriptor number found. */ +static unsigned wait_for_vq_desc(struct virtqueue *vq, + struct iovec iov[], + unsigned int *out_num, unsigned int *in_num) { - unsigned int i, head; - u16 last_avail; + unsigned int i, head, max; + struct vring_desc *desc; + u16 last_avail = lg_last_avail(vq); + + while (last_avail == vq->vring.avail->idx) { + u64 event; + + /* OK, tell Guest about progress up to now. */ + trigger_irq(vq); + + /* OK, now we need to know about added descriptors. */ + vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; + + /* They could have slipped one in as we were doing that: make + * sure it's written, then check again. */ + mb(); + if (last_avail != vq->vring.avail->idx) { + vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; + break; + } + + /* Nothing new? Wait for eventfd to tell us they refilled. */ + if (read(vq->eventfd, &event, sizeof(event)) != sizeof(event)) + errx(1, "Event read failed?"); + + /* We don't need to be notified again. */ + vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; + } /* Check it isn't doing very strange things with descriptor numbers. */ - last_avail = lg_last_avail(vq); if ((u16)(vq->vring.avail->idx - last_avail) > vq->vring.num) errx(1, "Guest moved used index from %u to %u", last_avail, vq->vring.avail->idx); - /* If there's nothing new since last we looked, return invalid. */ - if (vq->vring.avail->idx == last_avail) - return vq->vring.num; - /* Grab the next descriptor number they're advertising, and increment * the index we've seen. */ head = vq->vring.avail->ring[last_avail % vq->vring.num]; @@ -678,15 +632,28 @@ static unsigned get_vq_desc(struct virtqueue *vq, /* When we start there are none of either input nor output. */ *out_num = *in_num = 0; + max = vq->vring.num; + desc = vq->vring.desc; i = head; + + /* If this is an indirect entry, then this buffer contains a descriptor + * table which we handle as if it's any normal descriptor chain. */ + if (desc[i].flags & VRING_DESC_F_INDIRECT) { + if (desc[i].len % sizeof(struct vring_desc)) + errx(1, "Invalid size for indirect buffer table"); + + max = desc[i].len / sizeof(struct vring_desc); + desc = check_pointer(desc[i].addr, desc[i].len); + i = 0; + } + do { /* Grab the first descriptor, and check it's OK. */ - iov[*out_num + *in_num].iov_len = vq->vring.desc[i].len; + iov[*out_num + *in_num].iov_len = desc[i].len; iov[*out_num + *in_num].iov_base - = check_pointer(vq->vring.desc[i].addr, - vq->vring.desc[i].len); + = check_pointer(desc[i].addr, desc[i].len); /* If this is an input descriptor, increment that count. */ - if (vq->vring.desc[i].flags & VRING_DESC_F_WRITE) + if (desc[i].flags & VRING_DESC_F_WRITE) (*in_num)++; else { /* If it's an output descriptor, they're all supposed @@ -697,11 +664,10 @@ static unsigned get_vq_desc(struct virtqueue *vq, } /* If we've got too many, that implies a descriptor loop. */ - if (*out_num + *in_num > vq->vring.num) + if (*out_num + *in_num > max) errx(1, "Looped descriptor"); - } while ((i = next_desc(vq, i)) != vq->vring.num); + } while ((i = next_desc(desc, i, max)) != max); - vq->inflight++; return head; } @@ -719,44 +685,20 @@ static void add_used(struct virtqueue *vq, unsigned int head, int len) /* Make sure buffer is written before we update index. */ wmb(); vq->vring.used->idx++; - vq->inflight--; -} - -/* This actually sends the interrupt for this virtqueue */ -static void trigger_irq(int fd, struct virtqueue *vq) -{ - unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; - - /* If they don't want an interrupt, don't send one, unless empty. */ - if ((vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) - && vq->inflight) - return; - - /* Send the Guest an interrupt tell them we used something up. */ - if (write(fd, buf, sizeof(buf)) != 0) - err(1, "Triggering irq %i", vq->config.irq); + vq->pending_used++; } /* And here's the combo meal deal. Supersize me! */ -static void add_used_and_trigger(int fd, struct virtqueue *vq, - unsigned int head, int len) +static void add_used_and_trigger(struct virtqueue *vq, unsigned head, int len) { add_used(vq, head, len); - trigger_irq(fd, vq); + trigger_irq(vq); } /* * The Console * - * Here is the input terminal setting we save, and the routine to restore them - * on exit so the user gets their terminal back. */ -static struct termios orig_term; -static void restore_term(void) -{ - tcsetattr(STDIN_FILENO, TCSANOW, &orig_term); -} - -/* We associate some data with the console for our exit hack. */ + * We associate some data with the console for our exit hack. */ struct console_abort { /* How many times have they hit ^C? */ @@ -766,276 +708,275 @@ struct console_abort }; /* This is the routine which handles console input (ie. stdin). */ -static bool handle_console_input(int fd, struct device *dev) +static void console_input(struct virtqueue *vq) { int len; unsigned int head, in_num, out_num; - struct iovec iov[dev->vq->vring.num]; - struct console_abort *abort = dev->priv; - - /* First we need a console buffer from the Guests's input virtqueue. */ - head = get_vq_desc(dev->vq, iov, &out_num, &in_num); - - /* If they're not ready for input, stop listening to this file - * descriptor. We'll start again once they add an input buffer. */ - if (head == dev->vq->vring.num) - return false; + struct console_abort *abort = vq->dev->priv; + struct iovec iov[vq->vring.num]; + /* Make sure there's a descriptor waiting. */ + head = wait_for_vq_desc(vq, iov, &out_num, &in_num); if (out_num) errx(1, "Output buffers in console in queue?"); - /* This is why we convert to iovecs: the readv() call uses them, and so - * it reads straight into the Guest's buffer. */ - len = readv(dev->fd, iov, in_num); + /* Read it in. */ + len = readv(STDIN_FILENO, iov, in_num); if (len <= 0) { - /* This implies that the console is closed, is /dev/null, or - * something went terribly wrong. */ + /* Ran out of input? */ warnx("Failed to get console input, ignoring console."); - /* Put the input terminal back. */ - restore_term(); - /* Remove callback from input vq, so it doesn't restart us. */ - dev->vq->handle_output = NULL; - /* Stop listening to this fd: don't call us again. */ - return false; + /* For simplicity, dying threads kill the whole Launcher. So + * just nap here. */ + for (;;) + pause(); } - /* Tell the Guest about the new input. */ - add_used_and_trigger(fd, dev->vq, head, len); + add_used_and_trigger(vq, head, len); /* Three ^C within one second? Exit. * - * This is such a hack, but works surprisingly well. Each ^C has to be - * in a buffer by itself, so they can't be too fast. But we check that - * we get three within about a second, so they can't be too slow. */ - if (len == 1 && ((char *)iov[0].iov_base)[0] == 3) { - if (!abort->count++) - gettimeofday(&abort->start, NULL); - else if (abort->count == 3) { - struct timeval now; - gettimeofday(&now, NULL); - if (now.tv_sec <= abort->start.tv_sec+1) { - unsigned long args[] = { LHREQ_BREAK, 0 }; - /* Close the fd so Waker will know it has to - * exit. */ - close(waker_fds.pipe[1]); - /* Just in case Waker is blocked in BREAK, send - * unbreak now. */ - write(fd, args, sizeof(args)); - exit(2); - } - abort->count = 0; - } - } else - /* Any other key resets the abort counter. */ + * This is such a hack, but works surprisingly well. Each ^C has to + * be in a buffer by itself, so they can't be too fast. But we check + * that we get three within about a second, so they can't be too + * slow. */ + if (len != 1 || ((char *)iov[0].iov_base)[0] != 3) { abort->count = 0; + return; + } - /* Everything went OK! */ - return true; + abort->count++; + if (abort->count == 1) + gettimeofday(&abort->start, NULL); + else if (abort->count == 3) { + struct timeval now; + gettimeofday(&now, NULL); + /* Kill all Launcher processes with SIGINT, like normal ^C */ + if (now.tv_sec <= abort->start.tv_sec+1) + kill(0, SIGINT); + abort->count = 0; + } } -/* Handling output for console is simple: we just get all the output buffers - * and write them to stdout. */ -static void handle_console_output(int fd, struct virtqueue *vq, bool timeout) +/* This is the routine which handles console output (ie. stdout). */ +static void console_output(struct virtqueue *vq) { unsigned int head, out, in; - int len; struct iovec iov[vq->vring.num]; - /* Keep getting output buffers from the Guest until we run out. */ - while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) { - if (in) - errx(1, "Input buffers in output queue?"); - len = writev(STDOUT_FILENO, iov, out); - add_used_and_trigger(fd, vq, head, len); + head = wait_for_vq_desc(vq, iov, &out, &in); + if (in) + errx(1, "Input buffers in console output queue?"); + while (!iov_empty(iov, out)) { + int len = writev(STDOUT_FILENO, iov, out); + if (len <= 0) + err(1, "Write to stdout gave %i", len); + iov_consume(iov, out, len); } -} - -/* This is called when we no longer want to hear about Guest changes to a - * virtqueue. This is more efficient in high-traffic cases, but it means we - * have to set a timer to check if any more changes have occurred. */ -static void block_vq(struct virtqueue *vq) -{ - struct itimerval itm; - - vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; - vq->blocked = true; - - itm.it_interval.tv_sec = 0; - itm.it_interval.tv_usec = 0; - itm.it_value.tv_sec = 0; - itm.it_value.tv_usec = timeout_usec; - - setitimer(ITIMER_REAL, &itm, NULL); + add_used(vq, head, 0); } /* * The Network * * Handling output for network is also simple: we get all the output buffers - * and write them (ignoring the first element) to this device's file descriptor - * (/dev/net/tun). + * and write them to /dev/net/tun. */ -static void handle_net_output(int fd, struct virtqueue *vq, bool timeout) +struct net_info { + int tunfd; +}; + +static void net_output(struct virtqueue *vq) { - unsigned int head, out, in, num = 0; - int len; + struct net_info *net_info = vq->dev->priv; + unsigned int head, out, in; struct iovec iov[vq->vring.num]; - static int last_timeout_num; - - /* Keep getting output buffers from the Guest until we run out. */ - while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) { - if (in) - errx(1, "Input buffers in output queue?"); - len = writev(vq->dev->fd, iov, out); - if (len < 0) - err(1, "Writing network packet to tun"); - add_used_and_trigger(fd, vq, head, len); - num++; - } - /* Block further kicks and set up a timer if we saw anything. */ - if (!timeout && num) - block_vq(vq); - - /* We never quite know how long should we wait before we check the - * queue again for more packets. We start at 500 microseconds, and if - * we get fewer packets than last time, we assume we made the timeout - * too small and increase it by 10 microseconds. Otherwise, we drop it - * by one microsecond every time. It seems to work well enough. */ - if (timeout) { - if (num < last_timeout_num) - timeout_usec += 10; - else if (timeout_usec > 1) - timeout_usec--; - last_timeout_num = num; - } + head = wait_for_vq_desc(vq, iov, &out, &in); + if (in) + errx(1, "Input buffers in net output queue?"); + if (writev(net_info->tunfd, iov, out) < 0) + errx(1, "Write to tun failed?"); + add_used(vq, head, 0); +} + +/* Will reading from this file descriptor block? */ +static bool will_block(int fd) +{ + fd_set fdset; + struct timeval zero = { 0, 0 }; + FD_ZERO(&fdset); + FD_SET(fd, &fdset); + return select(fd+1, &fdset, NULL, NULL, &zero) != 1; } -/* This is where we handle a packet coming in from the tun device to our +/* This is where we handle packets coming in from the tun device to our * Guest. */ -static bool handle_tun_input(int fd, struct device *dev) +static void net_input(struct virtqueue *vq) { - unsigned int head, in_num, out_num; int len; - struct iovec iov[dev->vq->vring.num]; - - /* First we need a network buffer from the Guests's recv virtqueue. */ - head = get_vq_desc(dev->vq, iov, &out_num, &in_num); - if (head == dev->vq->vring.num) { - /* Now, it's expected that if we try to send a packet too - * early, the Guest won't be ready yet. Wait until the device - * status says it's ready. */ - /* FIXME: Actually want DRIVER_ACTIVE here. */ - - /* Now tell it we want to know if new things appear. */ - dev->vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; - wmb(); - - /* We'll turn this back on if input buffers are registered. */ - return false; - } else if (out_num) - errx(1, "Output buffers in network recv queue?"); - - /* Read the packet from the device directly into the Guest's buffer. */ - len = readv(dev->fd, iov, in_num); - if (len <= 0) - err(1, "reading network"); + unsigned int head, out, in; + struct iovec iov[vq->vring.num]; + struct net_info *net_info = vq->dev->priv; - /* Tell the Guest about the new packet. */ - add_used_and_trigger(fd, dev->vq, head, len); + head = wait_for_vq_desc(vq, iov, &out, &in); + if (out) + errx(1, "Output buffers in net input queue?"); - verbose("tun input packet len %i [%02x %02x] (%s)\n", len, - ((u8 *)iov[1].iov_base)[0], ((u8 *)iov[1].iov_base)[1], - head != dev->vq->vring.num ? "sent" : "discarded"); + /* Deliver interrupt now, since we're about to sleep. */ + if (vq->pending_used && will_block(net_info->tunfd)) + trigger_irq(vq); - /* All good. */ - return true; + len = readv(net_info->tunfd, iov, in); + if (len <= 0) + err(1, "Failed to read from tun."); + add_used(vq, head, len); } -/*L:215 This is the callback attached to the network and console input - * virtqueues: it ensures we try again, in case we stopped console or net - * delivery because Guest didn't have any buffers. */ -static void enable_fd(int fd, struct virtqueue *vq, bool timeout) +/* This is the helper to create threads. */ +static int do_thread(void *_vq) { - add_device_fd(vq->dev->fd); - /* Snap the Waker out of its select loop. */ - write(waker_fds.pipe[1], "", 1); + struct virtqueue *vq = _vq; + + for (;;) + vq->service(vq); + return 0; } -static void net_enable_fd(int fd, struct virtqueue *vq, bool timeout) +/* When a child dies, we kill our entire process group with SIGTERM. This + * also has the side effect that the shell restores the console for us! */ +static void kill_launcher(int signal) { - /* We don't need to know again when Guest refills receive buffer. */ - vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY; - enable_fd(fd, vq, timeout); + kill(0, SIGTERM); } -/* When the Guest tells us they updated the status field, we handle it. */ -static void update_device_status(struct device *dev) +static void reset_device(struct device *dev) { struct virtqueue *vq; - /* This is a reset. */ - if (dev->desc->status == 0) { - verbose("Resetting device %s\n", dev->name); + verbose("Resetting device %s\n", dev->name); - /* Clear any features they've acked. */ - memset(get_feature_bits(dev) + dev->desc->feature_len, 0, - dev->desc->feature_len); + /* Clear any features they've acked. */ + memset(get_feature_bits(dev) + dev->feature_len, 0, dev->feature_len); - /* Zero out the virtqueues. */ - for (vq = dev->vq; vq; vq = vq->next) { - memset(vq->vring.desc, 0, - vring_size(vq->config.num, LGUEST_VRING_ALIGN)); - lg_last_avail(vq) = 0; + /* We're going to be explicitly killing threads, so ignore them. */ + signal(SIGCHLD, SIG_IGN); + + /* Zero out the virtqueues, get rid of their threads */ + for (vq = dev->vq; vq; vq = vq->next) { + if (vq->thread != (pid_t)-1) { + kill(vq->thread, SIGTERM); + waitpid(vq->thread, NULL, 0); + vq->thread = (pid_t)-1; } - } else if (dev->desc->status & VIRTIO_CONFIG_S_FAILED) { + memset(vq->vring.desc, 0, + vring_size(vq->config.num, LGUEST_VRING_ALIGN)); + lg_last_avail(vq) = 0; + } + dev->running = false; + + /* Now we care if threads die. */ + signal(SIGCHLD, (void *)kill_launcher); +} + +static void create_thread(struct virtqueue *vq) +{ + /* Create stack for thread and run it. Since stack grows + * upwards, we point the stack pointer to the end of this + * region. */ + char *stack = malloc(32768); + unsigned long args[] = { LHREQ_EVENTFD, + vq->config.pfn*getpagesize(), 0 }; + + /* Create a zero-initialized eventfd. */ + vq->eventfd = eventfd(0, 0); + if (vq->eventfd < 0) + err(1, "Creating eventfd"); + args[2] = vq->eventfd; + + /* Attach an eventfd to this virtqueue: it will go off + * when the Guest does an LHCALL_NOTIFY for this vq. */ + if (write(lguest_fd, &args, sizeof(args)) != 0) + err(1, "Attaching eventfd"); + + /* CLONE_VM: because it has to access the Guest memory, and + * SIGCHLD so we get a signal if it dies. */ + vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq); + if (vq->thread == (pid_t)-1) + err(1, "Creating clone"); + /* We close our local copy, now the child has it. */ + close(vq->eventfd); +} + +static void start_device(struct device *dev) +{ + unsigned int i; + struct virtqueue *vq; + + verbose("Device %s OK: offered", dev->name); + for (i = 0; i < dev->feature_len; i++) + verbose(" %02x", get_feature_bits(dev)[i]); + verbose(", accepted"); + for (i = 0; i < dev->feature_len; i++) + verbose(" %02x", get_feature_bits(dev) + [dev->feature_len+i]); + + for (vq = dev->vq; vq; vq = vq->next) { + if (vq->service) + create_thread(vq); + } + dev->running = true; +} + +static void cleanup_devices(void) +{ + struct device *dev; + + for (dev = devices.dev; dev; dev = dev->next) + reset_device(dev); + + /* If we saved off the original terminal settings, restore them now. */ + if (orig_term.c_lflag & (ISIG|ICANON|ECHO)) + tcsetattr(STDIN_FILENO, TCSANOW, &orig_term); +} + +/* When the Guest tells us they updated the status field, we handle it. */ +static void update_device_status(struct device *dev) +{ + /* A zero status is a reset, otherwise it's a set of flags. */ + if (dev->desc->status == 0) + reset_device(dev); + else if (dev->desc->status & VIRTIO_CONFIG_S_FAILED) { warnx("Device %s configuration FAILED", dev->name); + if (dev->running) + reset_device(dev); } else if (dev->desc->status & VIRTIO_CONFIG_S_DRIVER_OK) { - unsigned int i; - - verbose("Device %s OK: offered", dev->name); - for (i = 0; i < dev->desc->feature_len; i++) - verbose(" %02x", get_feature_bits(dev)[i]); - verbose(", accepted"); - for (i = 0; i < dev->desc->feature_len; i++) - verbose(" %02x", get_feature_bits(dev) - [dev->desc->feature_len+i]); - - if (dev->ready) - dev->ready(dev); + if (!dev->running) + start_device(dev); } } /* This is the generic routine we call when the Guest uses LHCALL_NOTIFY. */ -static void handle_output(int fd, unsigned long addr) +static void handle_output(unsigned long addr) { struct device *i; - struct virtqueue *vq; - /* Check each device and virtqueue. */ + /* Check each device. */ for (i = devices.dev; i; i = i->next) { + struct virtqueue *vq; + /* Notifications to device descriptors update device status. */ if (from_guest_phys(addr) == i->desc) { update_device_status(i); return; } - /* Notifications to virtqueues mean output has occurred. */ + /* Devices *can* be used before status is set to DRIVER_OK. */ for (vq = i->vq; vq; vq = vq->next) { - if (vq->config.pfn != addr/getpagesize()) + if (addr != vq->config.pfn*getpagesize()) continue; - - /* Guest should acknowledge (and set features!) before - * using the device. */ - if (i->desc->status == 0) { - warnx("%s gave early output", i->name); - return; - } - - if (strcmp(vq->dev->name, "console") != 0) - verbose("Output to %s\n", vq->dev->name); - if (vq->handle_output) - vq->handle_output(fd, vq, false); + if (i->running) + errx(1, "Notification on running %s", i->name); + start_device(i); return; } } @@ -1049,71 +990,6 @@ static void handle_output(int fd, unsigned long addr) strnlen(from_guest_phys(addr), guest_limit - addr)); } -static void handle_timeout(int fd) -{ - char buf[32]; - struct device *i; - struct virtqueue *vq; - - /* Clear the pipe */ - read(timeoutpipe[0], buf, sizeof(buf)); - - /* Check each device and virtqueue: flush blocked ones. */ - for (i = devices.dev; i; i = i->next) { - for (vq = i->vq; vq; vq = vq->next) { - if (!vq->blocked) - continue; - - vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; - vq->blocked = false; - if (vq->handle_output) - vq->handle_output(fd, vq, true); - } - } -} - -/* This is called when the Waker wakes us up: check for incoming file - * descriptors. */ -static void handle_input(int fd) -{ - /* select() wants a zeroed timeval to mean "don't wait". */ - struct timeval poll = { .tv_sec = 0, .tv_usec = 0 }; - - for (;;) { - struct device *i; - fd_set fds = devices.infds; - int num; - - num = select(devices.max_infd+1, &fds, NULL, NULL, &poll); - /* Could get interrupted */ - if (num < 0) - continue; - /* If nothing is ready, we're done. */ - if (num == 0) - break; - - /* Otherwise, call the device(s) which have readable file - * descriptors and a method of handling them. */ - for (i = devices.dev; i; i = i->next) { - if (i->handle_input && FD_ISSET(i->fd, &fds)) { - if (i->handle_input(fd, i)) - continue; - - /* If handle_input() returns false, it means we - * should no longer service it. Networking and - * console do this when there's no input - * buffers to deliver into. Console also uses - * it when it discovers that stdin is closed. */ - FD_CLR(i->fd, &devices.infds); - } - } - - /* Is this the timeout fd? */ - if (FD_ISSET(timeoutpipe[0], &fds)) - handle_timeout(fd); - } -} - /*L:190 * Device Setup * @@ -1129,8 +1005,8 @@ static void handle_input(int fd) static u8 *device_config(const struct device *dev) { return (void *)(dev->desc + 1) - + dev->desc->num_vq * sizeof(struct lguest_vqconfig) - + dev->desc->feature_len * 2; + + dev->num_vq * sizeof(struct lguest_vqconfig) + + dev->feature_len * 2; } /* This routine allocates a new "struct lguest_device_desc" from descriptor @@ -1159,7 +1035,7 @@ static struct lguest_device_desc *new_dev_desc(u16 type) /* Each device descriptor is followed by the description of its virtqueues. We * specify how many descriptors the virtqueue is to have. */ static void add_virtqueue(struct device *dev, unsigned int num_descs, - void (*handle_output)(int, struct virtqueue *, bool)) + void (*service)(struct virtqueue *)) { unsigned int pages; struct virtqueue **i, *vq = malloc(sizeof(*vq)); @@ -1174,8 +1050,8 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs, vq->next = NULL; vq->last_avail_idx = 0; vq->dev = dev; - vq->inflight = 0; - vq->blocked = false; + vq->service = service; + vq->thread = (pid_t)-1; /* Initialize the configuration. */ vq->config.num = num_descs; @@ -1191,6 +1067,7 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs, * yet, otherwise we'd be overwriting them. */ assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0); memcpy(device_config(dev), &vq->config, sizeof(vq->config)); + dev->num_vq++; dev->desc->num_vq++; verbose("Virtqueue page %#lx\n", to_guest_phys(p)); @@ -1199,15 +1076,6 @@ static void add_virtqueue(struct device *dev, unsigned int num_descs, * second. */ for (i = &dev->vq; *i; i = &(*i)->next); *i = vq; - - /* Set the routine to call when the Guest does something to this - * virtqueue. */ - vq->handle_output = handle_output; - - /* As an optimization, set the advisory "Don't Notify Me" flag if we - * don't have a handler */ - if (!handle_output) - vq->vring.used->flags = VRING_USED_F_NO_NOTIFY; } /* The first half of the feature bitmask is for us to advertise features. The @@ -1219,7 +1087,7 @@ static void add_feature(struct device *dev, unsigned bit) /* We can't extend the feature bits once we've added config bytes */ if (dev->desc->feature_len <= bit / CHAR_BIT) { assert(dev->desc->config_len == 0); - dev->desc->feature_len = (bit / CHAR_BIT) + 1; + dev->feature_len = dev->desc->feature_len = (bit/CHAR_BIT) + 1; } features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT)); @@ -1243,22 +1111,17 @@ static void set_config(struct device *dev, unsigned len, const void *conf) * calling new_dev_desc() to allocate the descriptor and device memory. * * See what I mean about userspace being boring? */ -static struct device *new_device(const char *name, u16 type, int fd, - bool (*handle_input)(int, struct device *)) +static struct device *new_device(const char *name, u16 type) { struct device *dev = malloc(sizeof(*dev)); /* Now we populate the fields one at a time. */ - dev->fd = fd; - /* If we have an input handler for this file descriptor, then we add it - * to the device_list's fdset and maxfd. */ - if (handle_input) - add_device_fd(dev->fd); dev->desc = new_dev_desc(type); - dev->handle_input = handle_input; dev->name = name; dev->vq = NULL; - dev->ready = NULL; + dev->feature_len = 0; + dev->num_vq = 0; + dev->running = false; /* Append to device list. Prepending to a single-linked list is * easier, but the user expects the devices to be arranged on the bus @@ -1286,13 +1149,10 @@ static void setup_console(void) * raw input stream to the Guest. */ term.c_lflag &= ~(ISIG|ICANON|ECHO); tcsetattr(STDIN_FILENO, TCSANOW, &term); - /* If we exit gracefully, the original settings will be - * restored so the user can see what they're typing. */ - atexit(restore_term); } - dev = new_device("console", VIRTIO_ID_CONSOLE, - STDIN_FILENO, handle_console_input); + dev = new_device("console", VIRTIO_ID_CONSOLE); + /* We store the console state in dev->priv, and initialize it. */ dev->priv = malloc(sizeof(struct console_abort)); ((struct console_abort *)dev->priv)->count = 0; @@ -1301,31 +1161,13 @@ static void setup_console(void) * they put something the input queue, we make sure we're listening to * stdin. When they put something in the output queue, we write it to * stdout. */ - add_virtqueue(dev, VIRTQUEUE_NUM, enable_fd); - add_virtqueue(dev, VIRTQUEUE_NUM, handle_console_output); + add_virtqueue(dev, VIRTQUEUE_NUM, console_input); + add_virtqueue(dev, VIRTQUEUE_NUM, console_output); - verbose("device %u: console\n", devices.device_num++); + verbose("device %u: console\n", ++devices.device_num); } /*:*/ -static void timeout_alarm(int sig) -{ - write(timeoutpipe[1], "", 1); -} - -static void setup_timeout(void) -{ - if (pipe(timeoutpipe) != 0) - err(1, "Creating timeout pipe"); - - if (fcntl(timeoutpipe[1], F_SETFL, - fcntl(timeoutpipe[1], F_GETFL) | O_NONBLOCK) != 0) - err(1, "Making timeout pipe nonblocking"); - - add_device_fd(timeoutpipe[0]); - signal(SIGALRM, timeout_alarm); -} - /*M:010 Inter-guest networking is an interesting area. Simplest is to have a * --sharenet=<name> option which opens or creates a named pipe. This can be * used to send packets to another guest in a 1:1 manner. @@ -1447,21 +1289,23 @@ static int get_tun_device(char tapif[IFNAMSIZ]) static void setup_tun_net(char *arg) { struct device *dev; - int netfd, ipfd; + struct net_info *net_info = malloc(sizeof(*net_info)); + int ipfd; u32 ip = INADDR_ANY; bool bridging = false; char tapif[IFNAMSIZ], *p; struct virtio_net_config conf; - netfd = get_tun_device(tapif); + net_info->tunfd = get_tun_device(tapif); /* First we create a new network device. */ - dev = new_device("net", VIRTIO_ID_NET, netfd, handle_tun_input); + dev = new_device("net", VIRTIO_ID_NET); + dev->priv = net_info; /* Network devices need a receive and a send queue, just like * console. */ - add_virtqueue(dev, VIRTQUEUE_NUM, net_enable_fd); - add_virtqueue(dev, VIRTQUEUE_NUM, handle_net_output); + add_virtqueue(dev, VIRTQUEUE_NUM, net_input); + add_virtqueue(dev, VIRTQUEUE_NUM, net_output); /* We need a socket to perform the magic network ioctls to bring up the * tap interface, connect to the bridge etc. Any socket will do! */ @@ -1502,6 +1346,8 @@ static void setup_tun_net(char *arg) add_feature(dev, VIRTIO_NET_F_HOST_TSO4); add_feature(dev, VIRTIO_NET_F_HOST_TSO6); add_feature(dev, VIRTIO_NET_F_HOST_ECN); + /* We handle indirect ring entries */ + add_feature(dev, VIRTIO_RING_F_INDIRECT_DESC); set_config(dev, sizeof(conf), &conf); /* We don't need the socket any more; setup is done. */ @@ -1550,20 +1396,18 @@ struct vblk_info * Remember that the block device is handled by a separate I/O thread. We head * straight into the core of that thread here: */ -static bool service_io(struct device *dev) +static void blk_request(struct virtqueue *vq) { - struct vblk_info *vblk = dev->priv; + struct vblk_info *vblk = vq->dev->priv; unsigned int head, out_num, in_num, wlen; int ret; u8 *in; struct virtio_blk_outhdr *out; - struct iovec iov[dev->vq->vring.num]; + struct iovec iov[vq->vring.num]; off64_t off; - /* See if there's a request waiting. If not, nothing to do. */ - head = get_vq_desc(dev->vq, iov, &out_num, &in_num); - if (head == dev->vq->vring.num) - return false; + /* Get the next request. */ + head = wait_for_vq_desc(vq, iov, &out_num, &in_num); /* Every block request should contain at least one output buffer * (detailing the location on disk and the type of request) and one @@ -1637,83 +1481,21 @@ static bool service_io(struct device *dev) if (out->type & VIRTIO_BLK_T_BARRIER) fdatasync(vblk->fd); - /* We can't trigger an IRQ, because we're not the Launcher. It does - * that when we tell it we're done. */ - add_used(dev->vq, head, wlen); - return true; -} - -/* This is the thread which actually services the I/O. */ -static int io_thread(void *_dev) -{ - struct device *dev = _dev; - struct vblk_info *vblk = dev->priv; - char c; - - /* Close other side of workpipe so we get 0 read when main dies. */ - close(vblk->workpipe[1]); - /* Close the other side of the done_fd pipe. */ - close(dev->fd); - - /* When this read fails, it means Launcher died, so we follow. */ - while (read(vblk->workpipe[0], &c, 1) == 1) { - /* We acknowledge each request immediately to reduce latency, - * rather than waiting until we've done them all. I haven't - * measured to see if it makes any difference. - * - * That would be an interesting test, wouldn't it? You could - * also try having more than one I/O thread. */ - while (service_io(dev)) - write(vblk->done_fd, &c, 1); - } - return 0; -} - -/* Now we've seen the I/O thread, we return to the Launcher to see what happens - * when that thread tells us it's completed some I/O. */ -static bool handle_io_finish(int fd, struct device *dev) -{ - char c; - - /* If the I/O thread died, presumably it printed the error, so we - * simply exit. */ - if (read(dev->fd, &c, 1) != 1) - exit(1); - - /* It did some work, so trigger the irq. */ - trigger_irq(fd, dev->vq); - return true; -} - -/* When the Guest submits some I/O, we just need to wake the I/O thread. */ -static void handle_virtblk_output(int fd, struct virtqueue *vq, bool timeout) -{ - struct vblk_info *vblk = vq->dev->priv; - char c = 0; - - /* Wake up I/O thread and tell it to go to work! */ - if (write(vblk->workpipe[1], &c, 1) != 1) - /* Presumably it indicated why it died. */ - exit(1); + add_used(vq, head, wlen); } /*L:198 This actually sets up a virtual block device. */ static void setup_block_file(const char *filename) { - int p[2]; struct device *dev; struct vblk_info *vblk; - void *stack; struct virtio_blk_config conf; - /* This is the pipe the I/O thread will use to tell us I/O is done. */ - pipe(p); - /* The device responds to return from I/O thread. */ - dev = new_device("block", VIRTIO_ID_BLOCK, p[0], handle_io_finish); + dev = new_device("block", VIRTIO_ID_BLOCK); /* The device has one virtqueue, where the Guest places requests. */ - add_virtqueue(dev, VIRTQUEUE_NUM, handle_virtblk_output); + add_virtqueue(dev, VIRTQUEUE_NUM, blk_request); /* Allocate the room for our own bookkeeping */ vblk = dev->priv = malloc(sizeof(*vblk)); @@ -1735,49 +1517,29 @@ static void setup_block_file(const char *filename) set_config(dev, sizeof(conf), &conf); - /* The I/O thread writes to this end of the pipe when done. */ - vblk->done_fd = p[1]; - - /* This is the second pipe, which is how we tell the I/O thread about - * more work. */ - pipe(vblk->workpipe); - - /* Create stack for thread and run it. Since stack grows upwards, we - * point the stack pointer to the end of this region. */ - stack = malloc(32768); - /* SIGCHLD - We dont "wait" for our cloned thread, so prevent it from - * becoming a zombie. */ - if (clone(io_thread, stack + 32768, CLONE_VM | SIGCHLD, dev) == -1) - err(1, "Creating clone"); - - /* We don't need to keep the I/O thread's end of the pipes open. */ - close(vblk->done_fd); - close(vblk->workpipe[0]); - verbose("device %u: virtblock %llu sectors\n", - devices.device_num, le64_to_cpu(conf.capacity)); + ++devices.device_num, le64_to_cpu(conf.capacity)); } +struct rng_info { + int rfd; +}; + /* Our random number generator device reads from /dev/random into the Guest's * input buffers. The usual case is that the Guest doesn't want random numbers * and so has no buffers although /dev/random is still readable, whereas * console is the reverse. * * The same logic applies, however. */ -static bool handle_rng_input(int fd, struct device *dev) +static void rng_input(struct virtqueue *vq) { int len; unsigned int head, in_num, out_num, totlen = 0; - struct iovec iov[dev->vq->vring.num]; + struct rng_info *rng_info = vq->dev->priv; + struct iovec iov[vq->vring.num]; /* First we need a buffer from the Guests's virtqueue. */ - head = get_vq_desc(dev->vq, iov, &out_num, &in_num); - - /* If they're not ready for input, stop listening to this file - * descriptor. We'll start again once they add an input buffer. */ - if (head == dev->vq->vring.num) - return false; - + head = wait_for_vq_desc(vq, iov, &out_num, &in_num); if (out_num) errx(1, "Output buffers in rng?"); @@ -1785,7 +1547,7 @@ static bool handle_rng_input(int fd, struct device *dev) * it reads straight into the Guest's buffer. We loop to make sure we * fill it. */ while (!iov_empty(iov, in_num)) { - len = readv(dev->fd, iov, in_num); + len = readv(rng_info->rfd, iov, in_num); if (len <= 0) err(1, "Read from /dev/random gave %i", len); iov_consume(iov, in_num, len); @@ -1793,25 +1555,23 @@ static bool handle_rng_input(int fd, struct device *dev) } /* Tell the Guest about the new input. */ - add_used_and_trigger(fd, dev->vq, head, totlen); - - /* Everything went OK! */ - return true; + add_used(vq, head, totlen); } /* And this creates a "hardware" random number device for the Guest. */ static void setup_rng(void) { struct device *dev; - int fd; + struct rng_info *rng_info = malloc(sizeof(*rng_info)); - fd = open_or_die("/dev/random", O_RDONLY); + rng_info->rfd = open_or_die("/dev/random", O_RDONLY); /* The device responds to return from I/O thread. */ - dev = new_device("rng", VIRTIO_ID_RNG, fd, handle_rng_input); + dev = new_device("rng", VIRTIO_ID_RNG); + dev->priv = rng_info; /* The device has one virtqueue, where the Guest places inbufs. */ - add_virtqueue(dev, VIRTQUEUE_NUM, enable_fd); + add_virtqueue(dev, VIRTQUEUE_NUM, rng_input); verbose("device %u: rng\n", devices.device_num++); } @@ -1827,17 +1587,18 @@ static void __attribute__((noreturn)) restart_guest(void) for (i = 3; i < FD_SETSIZE; i++) close(i); - /* The exec automatically gets rid of the I/O and Waker threads. */ + /* Reset all the devices (kills all threads). */ + cleanup_devices(); + execv(main_args[0], main_args); err(1, "Could not exec %s", main_args[0]); } /*L:220 Finally we reach the core of the Launcher which runs the Guest, serves * its input and output, and finally, lays it to rest. */ -static void __attribute__((noreturn)) run_guest(int lguest_fd) +static void __attribute__((noreturn)) run_guest(void) { for (;;) { - unsigned long args[] = { LHREQ_BREAK, 0 }; unsigned long notify_addr; int readval; @@ -1848,8 +1609,7 @@ static void __attribute__((noreturn)) run_guest(int lguest_fd) /* One unsigned long means the Guest did HCALL_NOTIFY */ if (readval == sizeof(notify_addr)) { verbose("Notify on address %#lx\n", notify_addr); - handle_output(lguest_fd, notify_addr); - continue; + handle_output(notify_addr); /* ENOENT means the Guest died. Reading tells us why. */ } else if (errno == ENOENT) { char reason[1024] = { 0 }; @@ -1858,19 +1618,9 @@ static void __attribute__((noreturn)) run_guest(int lguest_fd) /* ERESTART means that we need to reboot the guest */ } else if (errno == ERESTART) { restart_guest(); - /* EAGAIN means a signal (timeout). - * Anything else means a bug or incompatible change. */ - } else if (errno != EAGAIN) + /* Anything else means a bug or incompatible change. */ + } else err(1, "Running guest failed"); - - /* Only service input on thread for CPU 0. */ - if (cpu_id != 0) - continue; - - /* Service input, then unset the BREAK to release the Waker. */ - handle_input(lguest_fd); - if (pwrite(lguest_fd, args, sizeof(args), cpu_id) < 0) - err(1, "Resetting break"); } } /*L:240 @@ -1904,8 +1654,8 @@ int main(int argc, char *argv[]) /* Memory, top-level pagetable, code startpoint and size of the * (optional) initrd. */ unsigned long mem = 0, start, initrd_size = 0; - /* Two temporaries and the /dev/lguest file descriptor. */ - int i, c, lguest_fd; + /* Two temporaries. */ + int i, c; /* The boot information for the Guest. */ struct boot_params *boot; /* If they specify an initrd file to load. */ @@ -1913,18 +1663,10 @@ int main(int argc, char *argv[]) /* Save the args: we "reboot" by execing ourselves again. */ main_args = argv; - /* We don't "wait" for the children, so prevent them from becoming - * zombies. */ - signal(SIGCHLD, SIG_IGN); - /* First we initialize the device list. Since console and network - * device receive input from a file descriptor, we keep an fdset - * (infds) and the maximum fd number (max_infd) with the head of the - * list. We also keep a pointer to the last device. Finally, we keep - * the next interrupt number to use for devices (1: remember that 0 is - * used by the timer). */ - FD_ZERO(&devices.infds); - devices.max_infd = -1; + /* First we initialize the device list. We keep a pointer to the last + * device, and the next interrupt number to use for devices (1: + * remember that 0 is used by the timer). */ devices.lastdev = NULL; devices.next_irq = 1; @@ -1982,9 +1724,6 @@ int main(int argc, char *argv[]) /* We always have a console device */ setup_console(); - /* We can timeout waiting for Guest network transmit. */ - setup_timeout(); - /* Now we load the kernel */ start = load_kernel(open_or_die(argv[optind+1], O_RDONLY)); @@ -2023,15 +1762,16 @@ int main(int argc, char *argv[]) /* We tell the kernel to initialize the Guest: this returns the open * /dev/lguest file descriptor. */ - lguest_fd = tell_kernel(start); + tell_kernel(start); + + /* Ensure that we terminate if a child dies. */ + signal(SIGCHLD, kill_launcher); - /* We clone off a thread, which wakes the Launcher whenever one of the - * input file descriptors needs attention. We call this the Waker, and - * we'll cover it in a moment. */ - setup_waker(lguest_fd); + /* If we exit via err(), this kills all the threads, restores tty. */ + atexit(cleanup_devices); /* Finally, run the Guest. This doesn't return. */ - run_guest(lguest_fd); + run_guest(); } /*:*/ diff --git a/Documentation/lguest/lguest.txt b/Documentation/lguest/lguest.txt index 28c747362f9..efb3a6a045a 100644 --- a/Documentation/lguest/lguest.txt +++ b/Documentation/lguest/lguest.txt @@ -37,7 +37,6 @@ Running Lguest: "Paravirtualized guest support" = Y "Lguest guest support" = Y "High Memory Support" = off/4GB - "PAE (Physical Address Extension) Support" = N "Alignment value to which kernel should be aligned" = 0x100000 (CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and CONFIG_PHYSICAL_ALIGN=0x100000) diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt index 421e7d00ffd..c9abbd86bc1 100644 --- a/Documentation/power/devices.txt +++ b/Documentation/power/devices.txt @@ -75,9 +75,6 @@ may need to apply in domain-specific ways to their devices: struct bus_type { ... int (*suspend)(struct device *dev, pm_message_t state); - int (*suspend_late)(struct device *dev, pm_message_t state); - - int (*resume_early)(struct device *dev); int (*resume)(struct device *dev); }; @@ -226,20 +223,7 @@ The phases are seen by driver notifications issued in this order: This call should handle parts of device suspend logic that require sleeping. It probably does work to quiesce the device which hasn't - been abstracted into class.suspend() or bus.suspend_late(). - - 3 bus.suspend_late(dev, message) is called with IRQs disabled, and - with only one CPU active. Until the bus.resume_early() phase - completes (see later), IRQs are not enabled again. This method - won't be exposed by all busses; for message based busses like USB, - I2C, or SPI, device interactions normally require IRQs. This bus - call may be morphed into a driver call with bus-specific parameters. - - This call might save low level hardware state that might otherwise - be lost in the upcoming low power state, and actually put the - device into a low power state ... so that in some cases the device - may stay partly usable until this late. This "late" call may also - help when coping with hardware that behaves badly. + been abstracted into class.suspend(). The pm_message_t parameter is currently used to refine those semantics (described later). @@ -351,19 +335,11 @@ devices processing each phase's calls before the next phase begins. The phases are seen by driver notifications issued in this order: - 1 bus.resume_early(dev) is called with IRQs disabled, and with - only one CPU active. As with bus.suspend_late(), this method - won't be supported on busses that require IRQs in order to - interact with devices. - - This reverses the effects of bus.suspend_late(). - - 2 bus.resume(dev) is called next. This may be morphed into a device - driver call with bus-specific parameters; implementations may sleep. - - This reverses the effects of bus.suspend(). + 1 bus.resume(dev) reverses the effects of bus.suspend(). This may + be morphed into a device driver call with bus-specific parameters; + implementations may sleep. - 3 class.resume(dev) is called for devices associated with a class + 2 class.resume(dev) is called for devices associated with a class that has such a method. Implementations may sleep. This reverses the effects of class.suspend(), and would usually diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 012858d2b11..5c08d96f407 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -460,6 +460,25 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The power-management is supported. + Module snd-ctxfi + ---------------- + + Module for Creative Sound Blaster X-Fi boards (20k1 / 20k2 chips) + * Creative Sound Blaster X-Fi Titanium Fatal1ty Champion Series + * Creative Sound Blaster X-Fi Titanium Fatal1ty Professional Series + * Creative Sound Blaster X-Fi Titanium Professional Audio + * Creative Sound Blaster X-Fi Titanium + * Creative Sound Blaster X-Fi Elite Pro + * Creative Sound Blaster X-Fi Platinum + * Creative Sound Blaster X-Fi Fatal1ty + * Creative Sound Blaster X-Fi XtremeGamer + * Creative Sound Blaster X-Fi XtremeMusic + + reference_rate - reference sample rate, 44100 or 48000 (default) + multiple - multiple to ref. sample rate, 1 or 2 (default) + + This module supports multiple cards. + Module snd-darla20 ------------------ @@ -925,6 +944,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. * Onkyo SE-90PCI * Onkyo SE-200PCI * ESI Juli@ + * ESI Maya44 * Hercules Fortissimo IV * EGO-SYS WaveTerminal 192M @@ -933,7 +953,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. prodigy71xt, prodigy71hifi, prodigyhd2, prodigy192, juli, aureon51, aureon71, universe, ap192, k8x800, phase22, phase28, ms300, av710, se200pci, se90pci, - fortissimo4, sn25p, WT192M + fortissimo4, sn25p, WT192M, maya44 This module supports multiple cards and autoprobe. @@ -1093,6 +1113,13 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. This module supports multiple cards. The driver requires the firmware loader support on kernel. + Module snd-lx6464es + ------------------- + + Module for Digigram LX6464ES boards + + This module supports multiple cards. + Module snd-maestro3 ------------------- @@ -1543,13 +1570,15 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module snd-sc6000 ----------------- - Module for Gallant SC-6000 soundcard. + Module for Gallant SC-6000 soundcard and later models: SC-6600 + and SC-7000. port - Port # (0x220 or 0x240) mss_port - MSS Port # (0x530 or 0xe80) irq - IRQ # (5,7,9,10,11) mpu_irq - MPU-401 IRQ # (5,7,9,10) ,0 - no MPU-401 irq dma - DMA # (1,3,0) + joystick - Enable gameport - 0 = disable (default), 1 = enable This module supports multiple cards. @@ -1859,7 +1888,8 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ------------------- Module for sound cards based on the Asus AV100/AV200 chips, - i.e., Xonar D1, DX, D2, D2X, HDAV1.3 (Deluxe), and Essence STX. + i.e., Xonar D1, DX, D2, D2X, HDAV1.3 (Deluxe), Essence ST + (Deluxe) and Essence STX. This module supports autoprobe and multiple cards. diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index 322869fc8a9..de8e10a9410 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -36,6 +36,7 @@ ALC260 acer Acer TravelMate will Will laptops (PB V7900) replacer Replacer 672V + favorit100 Maxdata Favorit 100XS basic fixed pin assignment (old default model) test for testing/debugging purpose, almost all controls can adjusted. Appearing only when compiled with @@ -85,10 +86,11 @@ ALC269 eeepc-p703 ASUS Eeepc P703 P900A eeepc-p901 ASUS Eeepc P901 S101 fujitsu FSC Amilo + lifebook Fujitsu Lifebook S6420 auto auto-config reading BIOS (default) -ALC662/663 -========== +ALC662/663/272 +============== 3stack-dig 3-stack (2-channel) with SPDIF 3stack-6ch 3-stack (6-channel) 3stack-6ch-dig 3-stack (6-channel) with SPDIF @@ -107,6 +109,9 @@ ALC662/663 asus-mode4 ASUS asus-mode5 ASUS asus-mode6 ASUS + dell Dell with ALC272 + dell-zm1 Dell ZM1 with ALC272 + samsung-nc10 Samsung NC10 mini notebook auto auto-config reading BIOS (default) ALC882/885 @@ -118,6 +123,7 @@ ALC882/885 asus-a7j ASUS A7J asus-a7m ASUS A7M macpro MacPro support + mb5 Macbook 5,1 mbp3 Macbook Pro rev3 imac24 iMac 24'' with jack detection w2jc ASUS W2JC @@ -133,10 +139,12 @@ ALC883/888 acer Acer laptops (Travelmate 3012WTMi, Aspire 5600, etc) acer-aspire Acer Aspire 9810 acer-aspire-4930g Acer Aspire 4930G + acer-aspire-8930g Acer Aspire 8930G medion Medion Laptops medion-md2 Medion MD2 targa-dig Targa/MSI - targa-2ch-dig Targs/MSI with 2-channel + targa-2ch-dig Targa/MSI with 2-channel + targa-8ch-dig Targa/MSI with 8-channel (MSI GX620) laptop-eapd 3-jack with SPDIF I/O and EAPD (Clevo M540JE, M550JE) lenovo-101e Lenovo 101E lenovo-nb0763 Lenovo NB0763 @@ -150,6 +158,9 @@ ALC883/888 fujitsu-pi2515 Fujitsu AMILO Pi2515 fujitsu-xa3530 Fujitsu AMILO XA3530 3stack-6ch-intel Intel DG33* boards + asus-p5q ASUS P5Q-EM boards + mb31 MacBook 3,1 + sony-vaio-tt Sony VAIO TT auto auto-config reading BIOS (default) ALC861/660 @@ -348,6 +359,7 @@ STAC92HD71B* hp-m4 HP mini 1000 hp-dv5 HP dv series hp-hdx HP HDX series + hp-dv4-1222nr HP dv4-1222nr (with LED support) auto BIOS setup (default) STAC92HD73* diff --git a/Documentation/sound/alsa/Procfile.txt b/Documentation/sound/alsa/Procfile.txt index cfac20cf9e3..381908d8ca4 100644 --- a/Documentation/sound/alsa/Procfile.txt +++ b/Documentation/sound/alsa/Procfile.txt @@ -88,26 +88,34 @@ card*/pcm*/info substreams, etc. card*/pcm*/xrun_debug - This file appears when CONFIG_SND_DEBUG=y. - This shows the status of xrun (= buffer overrun/xrun) debug of - ALSA PCM middle layer, as an integer from 0 to 2. The value - can be changed by writing to this file, such as - - # cat 2 > /proc/asound/card0/pcm0p/xrun_debug - - When this value is greater than 0, the driver will show the - messages to kernel log when an xrun is detected. The debug - message is shown also when the invalid H/W pointer is detected - at the update of periods (usually called from the interrupt + This file appears when CONFIG_SND_DEBUG=y and + CONFIG_PCM_XRUN_DEBUG=y. + This shows the status of xrun (= buffer overrun/xrun) and + invalid PCM position debug/check of ALSA PCM middle layer. + It takes an integer value, can be changed by writing to this + file, such as + + # cat 5 > /proc/asound/card0/pcm0p/xrun_debug + + The value consists of the following bit flags: + bit 0 = Enable XRUN/jiffies debug messages + bit 1 = Show stack trace at XRUN / jiffies check + bit 2 = Enable additional jiffies check + + When the bit 0 is set, the driver will show the messages to + kernel log when an xrun is detected. The debug message is + shown also when the invalid H/W pointer is detected at the + update of periods (usually called from the interrupt handler). - When this value is greater than 1, the driver will show the - stack trace additionally. This may help the debugging. + When the bit 1 is set, the driver will show the stack trace + additionally. This may help the debugging. - Since 2.6.30, this option also enables the hwptr check using + Since 2.6.30, this option can enable the hwptr check using jiffies. This detects spontaneous invalid pointer callback values, but can be lead to too much corrections for a (mostly buggy) hardware that doesn't give smooth pointer updates. + This feature is enabled via the bit 2. card*/pcm*/sub*/info The general information of this PCM sub-stream. diff --git a/Documentation/sound/alsa/README.maya44 b/Documentation/sound/alsa/README.maya44 new file mode 100644 index 00000000000..0e41576fa13 --- /dev/null +++ b/Documentation/sound/alsa/README.maya44 @@ -0,0 +1,163 @@ +NOTE: The following is the original document of Rainer's patch that the +current maya44 code based on. Some contents might be obsoleted, but I +keep here as reference -- tiwai + +---------------------------------------------------------------- + +STATE OF DEVELOPMENT: + +This driver is being developed on the initiative of Piotr Makowski (oponek@gmail.com) and financed by Lars Bergmann. +Development is carried out by Rainer Zimmermann (mail@lightshed.de). + +ESI provided a sample Maya44 card for the development work. + +However, unfortunately it has turned out difficult to get detailed programming information, so I (Rainer Zimmermann) had to find out some card-specific information by experiment and conjecture. Some information (in particular, several GPIO bits) is still missing. + +This is the first testing version of the Maya44 driver released to the alsa-devel mailing list (Feb 5, 2008). + + +The following functions work, as tested by Rainer Zimmermann and Piotr Makowski: + +- playback and capture at all sampling rates +- input/output level +- crossmixing +- line/mic switch +- phantom power switch +- analogue monitor a.k.a bypass + + +The following functions *should* work, but are not fully tested: + +- Channel 3+4 analogue - S/PDIF input switching +- S/PDIF output +- all inputs/outputs on the M/IO/DIO extension card +- internal/external clock selection + + +*In particular, we would appreciate testing of these functions by anyone who has access to an M/IO/DIO extension card.* + + +Things that do not seem to work: + +- The level meters ("multi track") in 'alsamixer' do not seem to react to signals in (if this is a bug, it would probably be in the existing ICE1724 code). + +- Ardour 2.1 seems to work only via JACK, not using ALSA directly or via OSS. This still needs to be tracked down. + + +DRIVER DETAILS: + +the following files were added: + +pci/ice1724/maya44.c - Maya44 specific code +pci/ice1724/maya44.h +pci/ice1724/ice1724.patch +pci/ice1724/ice1724.h.patch - PROPOSED patch to ice1724.h (see SAMPLING RATES) +i2c/other/wm8776.c - low-level access routines for Wolfson WM8776 codecs +include/wm8776.h + + +Note that the wm8776.c code is meant to be card-independent and does not actually register the codec with the ALSA infrastructure. +This is done in maya44.c, mainly because some of the WM8776 controls are used in Maya44-specific ways, and should be named appropriately. + + +the following files were created in pci/ice1724, simply #including the corresponding file from the alsa-kernel tree: + +wtm.h +vt1720_mobo.h +revo.h +prodigy192.h +pontis.h +phase.h +maya44.h +juli.h +aureon.h +amp.h +envy24ht.h +se.h +prodigy_hifi.h + + +*I hope this is the correct way to do things.* + + +SAMPLING RATES: + +The Maya44 card (or more exactly, the Wolfson WM8776 codecs) allow a maximum sampling rate of 192 kHz for playback and 92 kHz for capture. + +As the ICE1724 chip only allows one global sampling rate, this is handled as follows: + +* setting the sampling rate on any open PCM device on the maya44 card will always set the *global* sampling rate for all playback and capture channels. + +* In the current state of the driver, setting rates of up to 192 kHz is permitted even for capture devices. + +*AVOID CAPTURING AT RATES ABOVE 96kHz*, even though it may appear to work. The codec cannot actually capture at such rates, meaning poor quality. + + +I propose some additional code for limiting the sampling rate when setting on a capture pcm device. However because of the global sampling rate, this logic would be somewhat problematic. + +The proposed code (currently deactivated) is in ice1712.h.patch, ice1724.c and maya44.c (in pci/ice1712). + + +SOUND DEVICES: + +PCM devices correspond to inputs/outputs as follows (assuming Maya44 is card #0): + +hw:0,0 input - stereo, analog input 1+2 +hw:0,0 output - stereo, analog output 1+2 +hw:0,1 input - stereo, analog input 3+4 OR S/PDIF input +hw:0,1 output - stereo, analog output 3+4 (and SPDIF out) + + +NAMING OF MIXER CONTROLS: + +(for more information about the signal flow, please refer to the block diagram on p.24 of the ESI Maya44 manual, or in the ESI windows software). + + +PCM: (digital) output level for channel 1+2 +PCM 1: same for channel 3+4 + +Mic Phantom+48V: switch for +48V phantom power for electrostatic microphones on input 1/2. + Make sure this is not turned on while any other source is connected to input 1/2. + It might damage the source and/or the maya44 card. + +Mic/Line input: if switch is is on, input jack 1/2 is microphone input (mono), otherwise line input (stereo). + +Bypass: analogue bypass from ADC input to output for channel 1+2. Same as "Monitor" in the windows driver. +Bypass 1: same for channel 3+4. + +Crossmix: cross-mixer from channels 1+2 to channels 3+4 +Crossmix 1: cross-mixer from channels 3+4 to channels 1+2 + +IEC958 Output: switch for S/PDIF output. + This is not supported by the ESI windows driver. + S/PDIF should output the same signal as channel 3+4. [untested!] + + +Digitial output selectors: + + These switches allow a direct digital routing from the ADCs to the DACs. + Each switch determines where the digital input data to one of the DACs comes from. + They are not supported by the ESI windows driver. + For normal operation, they should all be set to "PCM out". + +H/W: Output source channel 1 +H/W 1: Output source channel 2 +H/W 2: Output source channel 3 +H/W 3: Output source channel 4 + +H/W 4 ... H/W 9: unknown function, left in to enable testing. + Possibly some of these control S/PDIF output(s). + If these turn out to be unused, they will go away in later driver versions. + +Selectable values for each of the digital output selectors are: + "PCM out" -> DAC output of the corresponding channel (default setting) + "Input 1"... + "Input 4" -> direct routing from ADC output of the selected input channel + + +-------- + +Feb 14, 2008 +Rainer Zimmermann +mail@lightshed.de + diff --git a/Documentation/sound/alsa/soc/dapm.txt b/Documentation/sound/alsa/soc/dapm.txt index 9e6763264a2..9ac842be9b4 100644 --- a/Documentation/sound/alsa/soc/dapm.txt +++ b/Documentation/sound/alsa/soc/dapm.txt @@ -62,6 +62,7 @@ Audio DAPM widgets fall into a number of types:- o Mic - Mic (and optional Jack) o Line - Line Input/Output (and optional Jack) o Speaker - Speaker + o Supply - Power or clock supply widget used by other widgets. o Pre - Special PRE widget (exec before all others) o Post - Special POST widget (exec after all others) diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt index 2db5893d6c9..29a6ff8bc7d 100644 --- a/Documentation/x86/x86_64/boot-options.txt +++ b/Documentation/x86/x86_64/boot-options.txt @@ -5,21 +5,51 @@ only the AMD64 specific ones are listed here. Machine check - mce=off disable machine check - mce=bootlog Enable logging of machine checks left over from booting. - Disabled by default on AMD because some BIOS leave bogus ones. - If your BIOS doesn't do that it's a good idea to enable though - to make sure you log even machine check events that result - in a reboot. On Intel systems it is enabled by default. + Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables. + + mce=off + Disable machine check + mce=no_cmci + Disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your hardware + is misbehaving. + Note that you'll get more problems without CMCI than with + due to the shared banks, i.e. you might get duplicated + error logs. + mce=dont_log_ce + Don't make logs for corrected errors. All events reported + as corrected are silently cleared by OS. + This option will be useful if you have no interest in any + of corrected errors. + mce=ignore_ce + Disable features for corrected errors, e.g. polling timer + and CMCI. All events reported as corrected are not cleared + by OS and remained in its error banks. + Usually this disablement is not recommended, however if + there is an agent checking/clearing corrected errors + (e.g. BIOS or hardware monitoring applications), conflicting + with OS's error handling, and you cannot deactivate the agent, + then this option will be a help. + mce=bootlog + Enable logging of machine checks left over from booting. + Disabled by default on AMD because some BIOS leave bogus ones. + If your BIOS doesn't do that it's a good idea to enable though + to make sure you log even machine check events that result + in a reboot. On Intel systems it is enabled by default. mce=nobootlog Disable boot machine check logging. - mce=tolerancelevel (number) + mce=tolerancelevel[,monarchtimeout] (number,number) + tolerance levels: 0: always panic on uncorrected errors, log corrected errors 1: panic or SIGBUS on uncorrected errors, log corrected errors 2: SIGBUS or log uncorrected errors, log corrected errors 3: never panic or SIGBUS, log all errors (for testing only) Default is 1 Can be also set using sysfs which is preferable. + monarchtimeout: + Sets the time in us to wait for other CPUs on machine checks. 0 + to disable. nomce (for compatibility with i386): same as mce=off diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck index a05e58e7b15..b1fb3027328 100644 --- a/Documentation/x86/x86_64/machinecheck +++ b/Documentation/x86/x86_64/machinecheck @@ -41,7 +41,9 @@ check_interval the polling interval. When the poller stops finding MCEs, it triggers an exponential backoff (poll less often) on the polling interval. The check_interval variable is both the initial and - maximum polling interval. + maximum polling interval. 0 means no polling for corrected machine + check errors (but some corrected errors might be still reported + in other ways) tolerant Tolerance level. When a machine check exception occurs for a non @@ -67,6 +69,10 @@ trigger Program to run when a machine check event is detected. This is an alternative to running mcelog regularly from cron and allows to detect events faster. +monarch_timeout + How long to wait for the other CPUs to machine check too on a + exception. 0 to disable waiting for other CPUs. + Unit: us TBD document entries for AMD threshold interrupt configuration |