diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2013-04-29 17:29:08 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2013-04-29 17:29:08 -0700 |
commit | 73154383f02998fdd6a1f26c7ef33bfc3785a101 (patch) | |
tree | 85a4c10cf32172b99aed01e95ded7269afcc9d7d /Documentation | |
parent | 362ed48dee509abe24cf84b7e137c7a29a8f4d2d (diff) | |
parent | ca0dde97178e75ed1370b8616326f5496a803d65 (diff) |
Merge branch 'akpm' (incoming from Andrew)
Merge first batch of fixes from Andrew Morton:
- A couple of kthread changes
- A few minor audit patches
- A number of fbdev patches. Florian remains AWOL so I'm picking up
some of these.
- A few kbuild things
- ocfs2 updates
- Almost all of the MM queue
(And in the meantime, I already have the second big batch from Andrew
pending in my mailbox ;^)
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (149 commits)
memcg: take reference before releasing rcu_read_lock
mem hotunplug: fix kfree() of bootmem memory
mmKconfig: add an option to disable bounce
mm, nobootmem: do memset() after memblock_reserve()
mm, nobootmem: clean-up of free_low_memory_core_early()
fs/buffer.c: remove unnecessary init operation after allocating buffer_head.
numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU
mm: fix memory_hotplug.c printk format warning
mm: swap: mark swap pages writeback before queueing for direct IO
swap: redirty page if page write fails on swap file
mm, memcg: give exiting processes access to memory reserves
thp: fix huge zero page logic for page with pfn == 0
memcg: avoid accessing memcg after releasing reference
fs: fix fsync() error reporting
memblock: fix missing comment of memblock_insert_region()
mm: Remove unused parameter of pages_correctly_reserved()
firmware, memmap: fix firmware_map_entry leak
mm/vmstat: add note on safety of drain_zonestat
mm: thp: add split tail pages to shrink page list in page reclaim
mm: allow for outstanding swap writeback accounting
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cgroups/memory.txt | 70 | ||||
-rw-r--r-- | Documentation/sysctl/vm.txt | 50 | ||||
-rw-r--r-- | Documentation/vm/overcommit-accounting | 8 |
3 files changed, 126 insertions, 2 deletions
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 8b8c28b9864..f336ede58e6 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -40,6 +40,7 @@ Features: - soft limit - moving (recharging) account at moving a task is selectable. - usage threshold notifier + - memory pressure notifier - oom-killer disable knob and oom-notifier - Root cgroup has no limit controls. @@ -65,6 +66,7 @@ Brief summary of control files. memory.stat # show various statistics memory.use_hierarchy # set/show hierarchical account enabled memory.force_empty # trigger forced move charge to parent + memory.pressure_level # set memory pressure notifications memory.swappiness # set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges @@ -762,7 +764,73 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) -11. TODO +11. Memory Pressure + +The pressure level notifications can be used to monitor the memory +allocation cost; based on the pressure, applications can implement +different strategies of managing their memory resources. The pressure +levels are defined as following: + +The "low" level means that the system is reclaiming memory for new +allocations. Monitoring this reclaiming activity might be useful for +maintaining cache level. Upon notification, the program (typically +"Activity Manager") might analyze vmstat and act in advance (i.e. +prematurely shutdown unimportant services). + +The "medium" level means that the system is experiencing medium memory +pressure, the system might be making swap, paging out active file caches, +etc. Upon this event applications may decide to further analyze +vmstat/zoneinfo/memcg or internal memory usage statistics and free any +resources that can be easily reconstructed or re-read from a disk. + +The "critical" level means that the system is actively thrashing, it is +about to out of memory (OOM) or even the in-kernel OOM killer is on its +way to trigger. Applications should do whatever they can to help the +system. It might be too late to consult with vmstat or any other +statistics, so it's advisable to take an immediate action. + +The events are propagated upward until the event is handled, i.e. the +events are not pass-through. Here is what this means: for example you have +three cgroups: A->B->C. Now you set up an event listener on cgroups A, B +and C, and suppose group C experiences some pressure. In this situation, +only group C will receive the notification, i.e. groups A and B will not +receive it. This is done to avoid excessive "broadcasting" of messages, +which disturbs the system and which is especially bad if we are low on +memory or thrashing. So, organize the cgroups wisely, or propagate the +events manually (or, ask us to implement the pass-through events, +explaining why would you need them.) + +The file memory.pressure_level is only used to setup an eventfd. To +register a notification, an application must: + +- create an eventfd using eventfd(2); +- open memory.pressure_level; +- write string like "<event_fd> <fd of memory.pressure_level> <level>" + to cgroup.event_control. + +Application will be notified through eventfd when memory pressure is at +the specific level (or higher). Read/write operations to +memory.pressure_level are no implemented. + +Test: + + Here is a small script example that makes a new cgroup, sets up a + memory limit, sets up a notification in the cgroup and then makes child + cgroup experience a critical pressure: + + # cd /sys/fs/cgroup/memory/ + # mkdir foo + # cd foo + # cgroup_event_listener memory.pressure_level low & + # echo 8000000 > memory.limit_in_bytes + # echo 8000000 > memory.memsw.limit_in_bytes + # echo $$ > tasks + # dd if=/dev/zero | read x + + (Expect a bunch of notifications, and eventually, the oom-killer will + trigger.) + +12. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 078701fdbd4..dcc75a9ed91 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -18,6 +18,7 @@ files can be found in mm/swap.c. Currently, these files are in /proc/sys/vm: +- admin_reserve_kbytes - block_dump - compact_memory - dirty_background_bytes @@ -53,11 +54,41 @@ Currently, these files are in /proc/sys/vm: - percpu_pagelist_fraction - stat_interval - swappiness +- user_reserve_kbytes - vfs_cache_pressure - zone_reclaim_mode ============================================================== +admin_reserve_kbytes + +The amount of free memory in the system that should be reserved for users +with the capability cap_sys_admin. + +admin_reserve_kbytes defaults to min(3% of free pages, 8MB) + +That should provide enough for the admin to log in and kill a process, +if necessary, under the default overcommit 'guess' mode. + +Systems running under overcommit 'never' should increase this to account +for the full Virtual Memory Size of programs used to recover. Otherwise, +root may not be able to log in to recover the system. + +How do you calculate a minimum useful reserve? + +sshd or login + bash (or some other shell) + top (or ps, kill, etc.) + +For overcommit 'guess', we can sum resident set sizes (RSS). +On x86_64 this is about 8MB. + +For overcommit 'never', we can take the max of their virtual sizes (VSZ) +and add the sum of their RSS. +On x86_64 this is about 128MB. + +Changing this takes effect whenever an application requests memory. + +============================================================== + block_dump block_dump enables block I/O debugging when set to a nonzero value. More @@ -542,6 +573,7 @@ memory until it actually runs out. When this flag is 2, the kernel uses a "never overcommit" policy that attempts to prevent any overcommit of memory. +Note that user_reserve_kbytes affects this policy. This feature can be very useful because there are a lot of programs that malloc() huge amounts of memory "just-in-case" @@ -645,6 +677,24 @@ The default value is 60. ============================================================== +- user_reserve_kbytes + +When overcommit_memory is set to 2, "never overommit" mode, reserve +min(3% of current process size, user_reserve_kbytes) of free memory. +This is intended to prevent a user from starting a single memory hogging +process, such that they cannot recover (kill the hog). + +user_reserve_kbytes defaults to min(3% of the current process size, 128MB). + +If this is reduced to zero, then the user will be allowed to allocate +all free memory with a single process, minus admin_reserve_kbytes. +Any subsequent attempts to execute a command will result in +"fork: Cannot allocate memory". + +Changing this takes effect whenever an application requests memory. + +============================================================== + vfs_cache_pressure ------------------ diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting index 706d7ed9d8d..8eaa2fc4b8f 100644 --- a/Documentation/vm/overcommit-accounting +++ b/Documentation/vm/overcommit-accounting @@ -8,7 +8,9 @@ The Linux kernel supports the following overcommit handling modes default. 1 - Always overcommit. Appropriate for some scientific - applications. + applications. Classic example is code using sparse arrays + and just relying on the virtual memory consisting almost + entirely of zero pages. 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a @@ -18,6 +20,10 @@ The Linux kernel supports the following overcommit handling modes pages but will receive errors on memory allocation as appropriate. + Useful for applications that want to guarantee their + memory allocations will be available in the future + without having to initialize every page. + The overcommit policy is set via the sysctl `vm.overcommit_memory'. The overcommit percentage is set via `vm.overcommit_ratio'. |