8 files changed, 143 insertions, 47 deletions
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
new file mode 100644
index 00000000000..3f58fa3d6d0
--- /dev/null
+++ b/Documentation/cgroups/00-INDEX
@@ -0,0 +1,18 @@
+00-INDEX
+	- this file
+cgroups.txt
+	- Control Groups definition, implementation details, examples and API.
+cpuacct.txt
+	- CPU Accounting Controller; account CPU usage for groups of tasks.
+cpusets.txt
+	- documents the cpusets feature; assign CPUs and Mem to a set of tasks.
+devices.txt
+	- Device Whitelist Controller; description, interface and security.
+freezer-subsystem.txt
+	- checkpointing; rationale to not use signals, interface.
+memcg_test.txt
+	- Memory Resource Controller; implementation details.
+memory.txt
+	- Memory Resource Controller; design, accounting, interface, testing.
+resource_counter.txt
+	- Resource Counter API.
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 93feb844448..6eb1a97e88c 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -56,7 +56,7 @@ hierarchy, and a set of subsystems; each subsystem has system-specific
 state attached to each cgroup in the hierarchy.  Each hierarchy has
 an instance of the cgroup virtual filesystem associated with it.
 
-At any one time there may be multiple active hierachies of task
+At any one time there may be multiple active hierarchies of task
 cgroups. Each hierarchy is a partition of all tasks in the system.
 
 User level code may create and destroy cgroups by name in an
@@ -124,10 +124,10 @@ following lines:
                                / \
                        Prof (15%) students (5%)
 
-Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
 into NFS network class.
 
-At the same time firefox/lynx will share an appropriate CPU/Memory class
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
 depending on who launched it (prof/student).
 
 With the ability to classify tasks differently for different resources
@@ -325,7 +325,7 @@ and then start a subshell 'sh' in that cgroup:
 Creating, modifying, using the cgroups can be done through the cgroup
 virtual filesystem.
 
-To mount a cgroup hierarchy will all available subsystems, type:
+To mount a cgroup hierarchy with all available subsystems, type:
 # mount -t cgroup xxx /dev/cgroup
 
 The "xxx" is not interpreted by the cgroup code, but will appear in
@@ -333,12 +333,23 @@ The "xxx" is not interpreted by the cgroup code, but will appear in
 
 To mount a cgroup hierarchy with just the cpuset and numtasks
 subsystems, type:
-# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
+# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
 
 To change the set of subsystems bound to a mounted hierarchy, just
 remount with different options:
+# mount -o remount,cpuset,ns hier1 /dev/cgroup
 
-# mount -o remount,cpuset,ns  /dev/cgroup
+Now memory is removed from the hierarchy and ns is added.
+
+Note this will add ns to the hierarchy but won't remove memory or
+cpuset, because the new options are appended to the old ones:
+# mount -o remount,ns /dev/cgroup
+
+To Specify a hierarchy's release_agent:
+# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+  xxx /dev/cgroup
+
+Note that specifying 'release_agent' more than once will return failure.
 
 Note that changing the set of subsystems is currently only supported
 when the hierarchy consists of a single (root) cgroup. Supporting
@@ -349,6 +360,11 @@ Then under /dev/cgroup you can find a tree that corresponds to the
 tree of the cgroups in the system. For instance, /dev/cgroup
 is the cgroup that holds the whole system.
 
+If you want to change the value of release_agent:
+# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
+
+It can also be changed via remount.
+
 If you want to create a new cgroup under /dev/cgroup:
 # cd /dev/cgroup
 # mkdir my_cgroup
@@ -476,11 +492,13 @@ cgroup->parent is still valid. (Note - can also be called for a
 newly-created cgroup if an error occurs after this subsystem's
 create() method has been called for the new cgroup).
 
-void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
 
 Called before checking the reference count on each subsystem. This may
 be useful for subsystems which have some extra references even if
-there are not tasks in the cgroup.
+there are not tasks in the cgroup. If pre_destroy() returns error code,
+rmdir() will fail with it. From this behavior, pre_destroy() can be
+called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task)
@@ -521,7 +539,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)
 
-Called at the end of cgroup_clone() to do any paramater
+Called at the end of cgroup_clone() to do any parameter
 initialization which might be required before a task could attach.  For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index bb775fbe43d..8b930946c52 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -30,3 +30,21 @@ The above steps create a new group g1 and move the current shell
 process (bash) into it. CPU time consumed by this bash and its children
 can be obtained from g1/cpuacct.usage and the same is accumulated in
 /cgroups/cpuacct.usage also.
+
+cpuacct.stat file lists a few statistics which further divide the
+CPU time obtained by the cgroup into user and system times. Currently
+the following statistics are supported:
+
+user: Time spent by tasks of the cgroup in user mode.
+system: Time spent by tasks of the cgroup in kernel mode.
+
+user and system are in USER_HZ unit.
+
+cpuacct controller uses percpu_counter interface to collect user and
+system times. This has two side effects:
+
+- It is theoretically possible to see wrong values for user and system times.
+  This is because percpu_counter_read() on 32bit systems isn't safe
+  against concurrent writes.
+- It is possible to see slightly outdated values for user and system times
+  due to the batch processing nature of percpu_counter.
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 0611e9528c7..f9ca389dddf 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -131,7 +131,7 @@ Cpusets extends these two mechanisms as follows:
  - The hierarchy of cpusets can be mounted at /dev/cpuset, for
    browsing and manipulation from user space.
  - A cpuset may be marked exclusive, which ensures that no other
-   cpuset (except direct ancestors and descendents) may contain
+   cpuset (except direct ancestors and descendants) may contain
    any overlapping CPUs or Memory Nodes.
  - You can list all the tasks (by pid) attached to any cpuset.
 
@@ -226,7 +226,7 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
 --------------------------------
 
 If a cpuset is cpu or mem exclusive, no other cpuset, other than
-a direct ancestor or descendent, may share any of the same CPUs or
+a direct ancestor or descendant, may share any of the same CPUs or
 Memory Nodes.
 
 A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
@@ -427,7 +427,7 @@ child cpusets have this flag enabled.
 When doing this, you don't usually want to leave any unpinned tasks in
 the top cpuset that might use non-trivial amounts of CPU, as such tasks
 may be artificially constrained to some subset of CPUs, depending on
-the particulars of this flag setting in descendent cpusets.  Even if
+the particulars of this flag setting in descendant cpusets.  Even if
 such a task could use spare CPU cycles in some other CPUs, the kernel
 scheduler might not consider the possibility of load balancing that
 task to that underused CPU.
@@ -531,9 +531,9 @@ be idle.
 
 Of course it takes some searching cost to find movable tasks and/or
 idle CPUs, the scheduler might not search all CPUs in the domain
-everytime.  In fact, in some architectures, the searching ranges on
+every time.  In fact, in some architectures, the searching ranges on
 events are limited in the same socket or node where the CPU locates,
-while the load balance on tick searchs all.
+while the load balance on tick searches all.
 
 For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
 is idle while CPU X and the siblings are busy, scheduler can't migrate
@@ -601,7 +601,7 @@ its new cpuset, then the task will continue to use whatever subset
 of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
 was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
 in the new cpuset, then the task will be essentially treated as if it
-was MPOL_BIND bound to the new cpuset (even though its numa placement,
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
 as queried by get_mempolicy(), doesn't change).  If a task is moved
 from one cpuset to another, then the kernel will adjust the tasks
 memory placement, as above, the next time that the kernel attempts
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 7cc6e6a6067..57ca4c89fe5 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -42,7 +42,7 @@ suffice, but we can decide the best way to adequately restrict
 movement as people get some experience with this.  We may just want
 to require CAP_SYS_ADMIN, which at least is a separate bit from
 CAP_MKNOD.  We may want to just refuse moving to a cgroup which
-isn't a descendent of the current one.  Or we may want to use
+isn't a descendant of the current one.  Or we may want to use
 CAP_MAC_ADMIN, since we really are trying to lock down root.
 
 CAP_SYS_ADMIN is needed to modify the whitelist or move another
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 523a9c16c40..72db89ed060 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -1,5 +1,5 @@
 Memory Resource Controller(Memcg)  Implementation Memo.
-Last Updated: 2009/1/19
+Last Updated: 2009/1/20
 Base Kernel Version: based on 2.6.29-rc2.
 
 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
@@ -356,7 +356,25 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(Shell-B)
 	# move all tasks in /cgroup/test to /cgroup
 	# /sbin/swapoff -a
-	# rmdir /test/cgroup
+	# rmdir /cgroup/test
 	# kill malloc task.
 
 	Of course, tmpfs v.s. swapoff test should be tested, too.
+
+ 9.8 OOM-Killer
+	Out-of-memory caused by memcg's limit will kill tasks under
+	the memcg. When hierarchy is used, a task under hierarchy
+	will be killed by the kernel.
+	In this case, panic_on_oom shouldn't be invoked and tasks
+	in other groups shouldn't be killed.
+
+	It's not difficult to cause OOM under memcg as following.
+	Case A) when you can swapoff
+	#swapoff -a
+	#echo 50M > /memory.limit_in_bytes
+	run 51M of malloc
+
+	Case B) when you use mem+swap limitation.
+	#echo 50M > memory.limit_in_bytes
+	#echo 50M > memory.memsw.limit_in_bytes
+	run 51M of malloc
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index e1501964df1..1a608877b14 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -6,15 +6,14 @@ used here with the memory controller that is used in hardware.
 
 Salient features
 
-a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
+a. Enable control of Anonymous, Page Cache (mapped and unmapped) and
+   Swap Cache memory pages.
 b. The infrastructure allows easy addition of other types of memory to control
 c. Provides *zero overhead* for non memory controller users
 d. Provides a double LRU: global memory pressure causes reclaim from the
    global LRU; a cgroup on hitting a limit, reclaims from the per
    cgroup LRU
 
-NOTE: Swap Cache (unmapped) is not accounted now.
-
 Benefits and Purpose of the memory controller
 
 The memory controller isolates the memory behaviour of a group of tasks
@@ -290,34 +289,44 @@ will be charged as a new owner of it.
   moved to the parent. If you want to avoid that, force_empty will be useful.
 
 5.2 stat file
-  memory.stat file includes following statistics (now)
-	cache			- # of pages from page-cache and shmem.
-	rss			- # of pages from anonymous memory.
-	pgpgin			- # of event of charging
-	pgpgout			- # of event of uncharging
-	active_anon		- # of pages on active lru of anon, shmem.
-	inactive_anon 		- # of pages on active lru of anon, shmem
-	active_file		- # of pages on active lru of file-cache
-	inactive_file		- # of pages on inactive lru of file cache
-	unevictable		- # of pages cannot be reclaimed.(mlocked etc)
-
-	Below is depend on CONFIG_DEBUG_VM.
-	inactive_ratio		- VM inernal parameter. (see mm/page_alloc.c)
-	recent_rotated_anon	- VM internal parameter. (see mm/vmscan.c)
-	recent_rotated_file	- VM internal parameter. (see mm/vmscan.c)
-	recent_scanned_anon 	- VM internal parameter. (see mm/vmscan.c)
-	recent_scanned_file 	- VM internal parameter. (see mm/vmscan.c)
-
-  Memo:
+
+memory.stat file includes following statistics
+
+cache		- # of bytes of page cache memory.
+rss		- # of bytes of anonymous and swap cache memory.
+pgpgin		- # of pages paged in (equivalent to # of charging events).
+pgpgout		- # of pages paged out (equivalent to # of uncharging events).
+active_anon	- # of bytes of anonymous and  swap cache memory on active
+		  lru list.
+inactive_anon	- # of bytes of anonymous memory and swap cache memory on
+		  inactive lru list.
+active_file	- # of bytes of file-backed memory on active lru list.
+inactive_file	- # of bytes of file-backed memory on inactive lru list.
+unevictable	- # of bytes of memory that cannot be reclaimed (mlocked etc).
+
+The following additional stats are dependent on CONFIG_DEBUG_VM.
+
+inactive_ratio		- VM internal parameter. (see mm/page_alloc.c)
+recent_rotated_anon	- VM internal parameter. (see mm/vmscan.c)
+recent_rotated_file	- VM internal parameter. (see mm/vmscan.c)
+recent_scanned_anon	- VM internal parameter. (see mm/vmscan.c)
+recent_scanned_file	- VM internal parameter. (see mm/vmscan.c)
+
+Memo:
 	recent_rotated means recent frequency of lru rotation.
 	recent_scanned means recent # of scans to lru.
 	showing for better debug please see the code for meanings.
 
+Note:
+	Only anonymous and swap cache memory is listed as part of 'rss' stat.
+	This should not be confused with the true 'resident set size' or the
+	amount of physical memory used by the cgroup. Per-cgroup rss
+	accounting is not done yet.
 
 5.3 swappiness
   Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
 
-  Following cgroup's swapiness can't be changed.
+  Following cgroups' swapiness can't be changed.
   - root cgroup (uses /proc/sys/vm/swappiness).
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index f196ac1d7d2..95b24d766ea 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -47,13 +47,18 @@ to work with it.
 
 2. Basic accounting routines
 
- a. void res_counter_init(struct res_counter *rc)
+ a. void res_counter_init(struct res_counter *rc,
+				struct res_counter *rc_parent)
 
  	Initializes the resource counter. As usual, should be the first
 	routine called for a new counter.
 
- b. int res_counter_charge[_locked]
-			(struct res_counter *rc, unsigned long val)
+	The struct res_counter *parent can be used to define a hierarchical
+	child -> parent relationship directly in the res_counter structure,
+	NULL can be used to define no relationship.
+
+ c. int res_counter_charge(struct res_counter *rc, unsigned long val,
+				struct res_counter **limit_fail_at)
 
 	When a resource is about to be allocated it has to be accounted
 	with the appropriate resource counter (controller should determine
@@ -67,15 +72,25 @@ to work with it.
 	  * if the charging is performed first, then it should be uncharged
 	    on error path (if the one is called).
 
- c. void res_counter_uncharge[_locked]
+	If the charging fails and a hierarchical dependency exists, the
+	limit_fail_at parameter is set to the particular res_counter element
+	where the charging failed.
+
+ d. int res_counter_charge_locked
+			(struct res_counter *rc, unsigned long val)
+
+	The same as res_counter_charge(), but it must not acquire/release the
+	res_counter->lock internally (it must be called with res_counter->lock
+	held).
+
+ e. void res_counter_uncharge[_locked]
 			(struct res_counter *rc, unsigned long val)
 
 	When a resource is released (freed) it should be de-accounted
 	from the resource counter it was accounted to.  This is called
 	"uncharging".
 
-    The _locked routines imply that the res_counter->lock is taken.
-
+	The _locked routines imply that the res_counter->lock is taken.
 
  2.1 Other accounting routines