author | Grant Likely <grant.likely@secretlab.ca> | 2010-05-25 00:38:26 -0600 |
---|---|---|
committer | Grant Likely <grant.likely@secretlab.ca> | 2010-05-25 00:38:26 -0600 |
commit | b1e50ebcf24668e57f058deb48b0704b5391ed0f (patch) | |
tree | 17e1b69b249d0738317b732186340c9dd053f1a1 /Documentation/cgroups | |
parent | 0c2a2ae32793e3500a15a449612485f5d17dd431 (diff) | |
parent | 7e125f7b9cbfce4101191b8076d606c517a73066 (diff) |
Merge remote branch 'origin' into secretlab/next-spi
Diffstat (limited to 'Documentation/cgroups')
-rw-r--r-- | Documentation/cgroups/blkio-controller.txt | 151 |
-rw-r--r-- | Documentation/cgroups/cgroups.txt | 2 |
-rw-r--r-- | Documentation/cgroups/cpusets.txt | 38 |
-rw-r--r-- | Documentation/cgroups/memcg_test.txt | 2 |
-rw-r--r-- | Documentation/cgroups/memory.txt | 2 |
5 files changed, 155 insertions, 40 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt index 630879cd9a4..48e0b21b005 100644 --- a/Documentation/cgroups/blkio-controller.txt +++ b/Documentation/cgroups/blkio-controller.txt @@ -17,6 +17,9 @@ HOWTO You can do a very simple testing of running two dd threads in two different cgroups. Here is what you can do. +- Enable Block IO controller + CONFIG_BLK_CGROUP=y + - Enable group scheduling in CFQ CONFIG_CFQ_GROUP_IOSCHED=y @@ -54,32 +57,52 @@ cgroups. Here is what you can do. Various user visible config options =================================== -CONFIG_CFQ_GROUP_IOSCHED - - Enables group scheduling in CFQ. Currently only 1 level of group - creation is allowed. - -CONFIG_DEBUG_CFQ_IOSCHED - - Enables some debugging messages in blktrace. Also creates extra - cgroup file blkio.dequeue. - -Config options selected automatically -===================================== -These config options are not user visible and are selected/deselected -automatically based on IO scheduler configuration. - CONFIG_BLK_CGROUP - - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED. + - Block IO controller. CONFIG_DEBUG_BLK_CGROUP - - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED. + - Debug help. Right now some additional stats file show up in cgroup + if this option is enabled. + +CONFIG_CFQ_GROUP_IOSCHED + - Enables group scheduling in CFQ. Currently only 1 level of group + creation is allowed. Details of cgroup files ======================= - blkio.weight - - Specifies per cgroup weight. - + - Specifies per cgroup weight. This is default weight of the group + on all the devices until and unless overridden by per device rule. + (See blkio.weight_device). Currently allowed range of weights is from 100 to 1000. +- blkio.weight_device + - One can specify per cgroup per device rules using this interface. + These rules override the default value of group weight as specified + by blkio.weight. + + Following is the format. 
+ + #echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device + Configure weight=300 on /dev/sdb (8:16) in this cgroup + # echo 8:16 300 > blkio.weight_device + # cat blkio.weight_device + dev weight + 8:16 300 + + Configure weight=500 on /dev/sda (8:0) in this cgroup + # echo 8:0 500 > blkio.weight_device + # cat blkio.weight_device + dev weight + 8:0 500 + 8:16 300 + + Remove specific weight for /dev/sda in this cgroup + # echo 8:0 0 > blkio.weight_device + # cat blkio.weight_device + dev weight + 8:16 300 + - blkio.time - disk time allocated to cgroup per device in milliseconds. First two fields specify the major and minor number of the device and @@ -92,13 +115,105 @@ Details of cgroup files third field specifies the number of sectors transferred by the group to/from the device. +- blkio.io_service_bytes + - Number of bytes transferred to/from the disk by the group. These + are further divided by the type of operation - read or write, sync + or async. First two fields specify the major and minor number of the + device, third field specifies the operation type and the fourth field + specifies the number of bytes. + +- blkio.io_serviced + - Number of IOs completed to/from the disk by the group. These + are further divided by the type of operation - read or write, sync + or async. First two fields specify the major and minor number of the + device, third field specifies the operation type and the fourth field + specifies the number of IOs. + +- blkio.io_service_time + - Total amount of time between request dispatch and request completion + for the IOs done by this cgroup. This is in nanoseconds to make it + meaningful for flash devices too. For devices with queue depth of 1, + this time represents the actual service time. When queue_depth > 1, + that is no longer true as requests may be served out of order. 
This + may cause the service time for a given IO to include the service time + of multiple IOs when served out of order which may result in total + io_service_time > actual time elapsed. This time is further divided by + the type of operation - read or write, sync or async. First two fields + specify the major and minor number of the device, third field + specifies the operation type and the fourth field specifies the + io_service_time in ns. + +- blkio.io_wait_time + - Total amount of time the IOs for this cgroup spent waiting in the + scheduler queues for service. This can be greater than the total time + elapsed since it is cumulative io_wait_time for all IOs. It is not a + measure of total time the cgroup spent waiting but rather a measure of + the wait_time for its individual IOs. For devices with queue_depth > 1 + this metric does not include the time spent waiting for service once + the IO is dispatched to the device but till it actually gets serviced + (there might be a time lag here due to re-ordering of requests by the + device). This is in nanoseconds to make it meaningful for flash + devices too. This time is further divided by the type of operation - + read or write, sync or async. First two fields specify the major and + minor number of the device, third field specifies the operation type + and the fourth field specifies the io_wait_time in ns. + +- blkio.io_merged + - Total number of bios/requests merged into requests belonging to this + cgroup. This is further divided by the type of operation - read or + write, sync or async. + +- blkio.io_queued + - Total number of requests queued up at any given instant for this + cgroup. This is further divided by the type of operation - read or + write, sync or async. + +- blkio.avg_queue_size + - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. + The average queue size for this cgroup over the entire time of this + cgroup's existence. 
Queue size samples are taken each time one of the + queues of this cgroup gets a timeslice. + +- blkio.group_wait_time + - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. + This is the amount of time the cgroup had to wait since it became busy + (i.e., went from 0 to 1 request queued) to get a timeslice for one of + its queues. This is different from the io_wait_time which is the + cumulative total of the amount of time spent by each IO in that cgroup + waiting in the scheduler queue. This is in nanoseconds. If this is + read when the cgroup is in a waiting (for timeslice) state, the stat + will only report the group_wait_time accumulated till the last time it + got a timeslice and will not include the current delta. + +- blkio.empty_time + - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. + This is the amount of time a cgroup spends without any pending + requests when not being served, i.e., it does not include any time + spent idling for one of the queues of the cgroup. This is in + nanoseconds. If this is read when the cgroup is in an empty state, + the stat will only report the empty_time accumulated till the last + time it had a pending request and will not include the current delta. + +- blkio.idle_time + - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. + This is the amount of time spent by the IO scheduler idling for a + given cgroup in anticipation of a better request than the exising ones + from other queues/cgroups. This is in nanoseconds. If this is read + when the cgroup is in an idling state, the stat will only report the + idle_time accumulated till the last idle period and will not include + the current delta. + - blkio.dequeue - - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This + - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This gives the statistics about how many a times a group was dequeued from service tree of the device. 
First two fields specify the major and minor number of the device and third field specifies the number of times a group was dequeued from a particular device. +- blkio.reset_stats + - Writing an int to this file will result in resetting all the stats + for that cgroup. + CFQ sysfs tunable ================= /sys/block/<disk>/queue/iosched/group_isolation diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index a1ca5924faf..57444c2609f 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -572,7 +572,7 @@ void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, Called when a task attach operation has failed after can_attach() has succeeded. A subsystem whose can_attach() has some side-effects should provide this -function, so that the subsytem can implement a rollback. If not, not necessary. +function, so that the subsystem can implement a rollback. If not, not necessary. This will be called only about subsystems whose can_attach() operation have succeeded. diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 4160df82b3f..51682ab2dd1 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt @@ -42,7 +42,7 @@ Nodes to a set of tasks. In this document "Memory Node" refers to an on-line node that contains memory. Cpusets constrain the CPU and Memory placement of tasks to only -the resources within a tasks current cpuset. They form a nested +the resources within a task's current cpuset. They form a nested hierarchy visible in a virtual file system. These are the essential hooks, beyond what is already present, required to manage dynamic job placement on large systems. @@ -53,11 +53,11 @@ Documentation/cgroups/cgroups.txt. 
Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and set_mempolicy(2) system calls to include Memory Nodes in its memory -policy, are both filtered through that tasks cpuset, filtering out any +policy, are both filtered through that task's cpuset, filtering out any CPUs or Memory Nodes not in that cpuset. The scheduler will not schedule a task on a CPU that is not allowed in its cpus_allowed vector, and the kernel page allocator will not allocate a page on a -node that is not allowed in the requesting tasks mems_allowed vector. +node that is not allowed in the requesting task's mems_allowed vector. User level code may create and destroy cpusets by name in the cgroup virtual file system, manage the attributes and permissions of these @@ -121,9 +121,9 @@ Cpusets extends these two mechanisms as follows: - Each task in the system is attached to a cpuset, via a pointer in the task structure to a reference counted cgroup structure. - Calls to sched_setaffinity are filtered to just those CPUs - allowed in that tasks cpuset. + allowed in that task's cpuset. - Calls to mbind and set_mempolicy are filtered to just - those Memory Nodes allowed in that tasks cpuset. + those Memory Nodes allowed in that task's cpuset. - The root cpuset contains all the systems CPUs and Memory Nodes. - For any cpuset, one can define child cpusets containing a subset @@ -141,11 +141,11 @@ into the rest of the kernel, none in performance critical paths: - in init/main.c, to initialize the root cpuset at system boot. - in fork and exit, to attach and detach a task from its cpuset. - in sched_setaffinity, to mask the requested CPUs by what's - allowed in that tasks cpuset. + allowed in that task's cpuset. - in sched.c migrate_live_tasks(), to keep migrating tasks within the CPUs allowed by their cpuset, if possible. 
- in the mbind and set_mempolicy system calls, to mask the requested - Memory Nodes by what's allowed in that tasks cpuset. + Memory Nodes by what's allowed in that task's cpuset. - in page_alloc.c, to restrict memory to allowed nodes. - in vmscan.c, to restrict page recovery to the current cpuset. @@ -155,7 +155,7 @@ new system calls are added for cpusets - all support for querying and modifying cpusets is via this cpuset file system. The /proc/<pid>/status file for each task has four added lines, -displaying the tasks cpus_allowed (on which CPUs it may be scheduled) +displaying the task's cpus_allowed (on which CPUs it may be scheduled) and mems_allowed (on which Memory Nodes it may obtain memory), in the two formats seen in the following example: @@ -323,17 +323,17 @@ stack segment pages of a task. By default, both kinds of memory spreading are off, and memory pages are allocated on the node local to where the task is running, -except perhaps as modified by the tasks NUMA mempolicy or cpuset +except perhaps as modified by the task's NUMA mempolicy or cpuset configuration, so long as sufficient free memory pages are available. When new cpusets are created, they inherit the memory spread settings of their parent. Setting memory spreading causes allocations for the affected page -or slab caches to ignore the tasks NUMA mempolicy and be spread +or slab caches to ignore the task's NUMA mempolicy and be spread instead. Tasks using mbind() or set_mempolicy() calls to set NUMA mempolicies will not notice any change in these calls as a result of -their containing tasks memory spread settings. If memory spreading +their containing task's memory spread settings. If memory spreading is turned off, then the currently specified NUMA mempolicy once again applies to memory page allocations. @@ -357,7 +357,7 @@ pages from the node returned by cpuset_mem_spread_node(). The cpuset_mem_spread_node() routine is also simple. 
It uses the value of a per-task rotor cpuset_mem_spread_rotor to select the next -node in the current tasks mems_allowed to prefer for the allocation. +node in the current task's mems_allowed to prefer for the allocation. This memory placement policy is also known (in other contexts) as round-robin or interleave. @@ -594,7 +594,7 @@ is attached, is subtle. If a cpuset has its Memory Nodes modified, then for each task attached to that cpuset, the next time that the kernel attempts to allocate a page of memory for that task, the kernel will notice the change -in the tasks cpuset, and update its per-task memory placement to +in the task's cpuset, and update its per-task memory placement to remain within the new cpusets memory placement. If the task was using mempolicy MPOL_BIND, and the nodes to which it was bound overlap with its new cpuset, then the task will continue to use whatever subset @@ -603,13 +603,13 @@ was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed in the new cpuset, then the task will be essentially treated as if it was MPOL_BIND bound to the new cpuset (even though its NUMA placement, as queried by get_mempolicy(), doesn't change). If a task is moved -from one cpuset to another, then the kernel will adjust the tasks +from one cpuset to another, then the kernel will adjust the task's memory placement, as above, the next time that the kernel attempts to allocate a page of memory for that task. If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset will have its allowed CPU placement changed immediately. Similarly, -if a tasks pid is written to another cpusets 'cpuset.tasks' file, then its +if a task's pid is written to another cpusets 'cpuset.tasks' file, then its allowed CPU placement is changed immediately. 
If such a task had been bound to some subset of its cpuset using the sched_setaffinity() call, the task will be allowed to run on any CPU allowed in its new cpuset, @@ -626,16 +626,16 @@ cpusets memory placement policy 'cpuset.mems' subsequently changes. If the cpuset flag file 'cpuset.memory_migrate' is set true, then when tasks are attached to that cpuset, any pages that task had allocated to it on nodes in its previous cpuset are migrated -to the tasks new cpuset. The relative placement of the page within +to the task's new cpuset. The relative placement of the page within the cpuset is preserved during these migration operations if possible. For example if the page was on the second valid node of the prior cpuset then the page will be placed on the second valid node of the new cpuset. -Also if 'cpuset.memory_migrate' is set true, then if that cpusets +Also if 'cpuset.memory_migrate' is set true, then if that cpuset's 'cpuset.mems' file is modified, pages allocated to tasks in that cpuset, that were on nodes in the previous setting of 'cpuset.mems', will be moved to nodes in the new setting of 'mems.' -Pages that were not in the tasks prior cpuset, or in the cpusets +Pages that were not in the task's prior cpuset, or in the cpuset's prior 'cpuset.mems' setting, will not be moved. There is an exception to the above. If hotplug functionality is used @@ -655,7 +655,7 @@ There is a second exception to the above. GFP_ATOMIC requests are kernel internal allocations that must be satisfied, immediately. The kernel may drop some request, in rare cases even panic, if a GFP_ATOMIC alloc fails. If the request cannot be satisfied within -the current tasks cpuset, then we relax the cpuset, and look for +the current task's cpuset, then we relax the cpuset, and look for memory anywhere we can find it. It's better to violate the cpuset than stress the kernel. 
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index f7f68b2ac19..b7eececfb19 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -244,7 +244,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
         we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
 
 8. LRU
-        Each memcg has its own private LRU. Now, it's handling is under global
+        Each memcg has its own private LRU. Now, its handling is under global
         VM's control (means that it's handled under global zone->lru_lock).
         Almost all routines around memcg's LRU is called by global LRU's
         list management functions under zone->lru_lock().
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 3a6aecd078b..6cab1f29da4 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -263,7 +263,7 @@ some of the pages cached in the cgroup (page cache pages).
 
 4.2 Task migration
 
-When a task migrates from one cgroup to another, it's charge is not
+When a task migrates from one cgroup to another, its charge is not
 carried forward by default. The pages allocated from the original cgroup still
 remain charged to it, the charge is dropped when the page is freed or
 reclaimed.
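Note on the blkio.weight_device format added in the blkio-controller.txt hunk above: rules are written as "dev_maj:dev_minor weight", reading the file back lists them under a "dev weight" header, and a weight of 0 removes the rule. The sketch below only builds and parses those text lines; it does not touch a real cgroup. The helper names format_weight_rule and parse_weight_device are hypothetical, and applying the 100 to 1000 range (stated in the diff for blkio.weight) to per-device rules is an assumption, not something the commit confirms.

```python
def format_weight_rule(major, minor, weight):
    """Build one rule line for blkio.weight_device: 'major:minor weight'.

    Hypothetical helper. Assumption: per-device weights use the same
    100..1000 range the documentation gives for blkio.weight. A weight
    of 0 is accepted because the documentation uses it to remove a rule.
    """
    if weight != 0 and not 100 <= weight <= 1000:
        raise ValueError("weight must be 0 (remove) or within 100..1000")
    return f"{major}:{minor} {weight}"


def parse_weight_device(text):
    """Parse the text read from blkio.weight_device.

    Returns a dict mapping (major, minor) to weight. The 'dev weight'
    header line and blank lines are skipped.
    """
    rules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or ":" not in fields[0]:
            continue  # header or blank line, not a rule
        major, minor = fields[0].split(":")
        rules[(int(major), int(minor))] = int(fields[1])
    return rules


if __name__ == "__main__":
    # Mirrors the documented example: weight 300 on /dev/sdb (8:16).
    print(format_weight_rule(8, 16, 300))
    # Mirrors the documented listing after adding /dev/sda (8:0).
    print(parse_weight_device("dev weight\n8:0 500\n8:16 300\n"))
```

On a system with the blkio controller mounted, the string returned by format_weight_rule would be written to the cgroup's blkio.weight_device file, as the echo commands in the diff do.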