From f606b77f1a9e362451aca8f81d8f36a3a112139e Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 9 Oct 2014 15:27:37 -0700 Subject: prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation During development of c/r we've noticed that in case if we need to support user namespaces we face a problem with capabilities in prctl(PR_SET_MM, ...) call, in particular once new user namespace is created capable(CAP_SYS_RESOURCE) no longer passes. A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make more intensive test for sanity of values and same time allow us to support checkpoint/restore of user namespaces. Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of prctl_mm_map structure which carries all the members to be updated. prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size) struct prctl_mm_map { __u64 start_code; __u64 end_code; __u64 start_data; __u64 end_data; __u64 start_brk; __u64 brk; __u64 start_stack; __u64 arg_start; __u64 arg_end; __u64 env_start; __u64 env_end; __u64 *auxv; __u32 auxv_size; __u32 exe_fd; }; All members except @exe_fd correspond ones of struct mm_struct. To figure out which available values these members may take here are meanings of the members. - start_code, end_code: represent bounds of executable code area - start_data, end_data: represent bounds of data area - start_brk, brk: used to calculate bounds for brk() syscall - start_stack: used when accounting space needed for command line arguments, environment and shmat() syscall - arg_start, arg_end, env_start, env_end: represent memory area supplied for command line arguments and environment variables - auxv, auxv_size: carries auxiliary vector, Elf format specifics - exe_fd: file descriptor number for executable link (/proc/self/exe) Thus we apply the following requirements to the values 1) Any member except @auxv, @auxv_size, @exe_fd is rather an address in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr) interval. 2) While @[start|end]_code and @[start|end]_data may point to an nonexisting VMAs (say a program maps own new .text and .data segments during execution) the rest of members should belong to VMA which must exist. 3) Addresses must be ordered, ie @start_ member must not be greater or equal to appropriate @end_ member. 4) As in regular Elf loading procedure we require that @start_brk and @brk be greater than @end_data. 5) If RLIMIT_DATA rlimit is set to non-infinity new values should not exceed existing limit. Same applies to RLIMIT_STACK. 6) Auxiliary vector size must not exceed existing one (which is predefined as AT_VECTOR_SIZE and depends on architecture). 7) File descriptor passed in @exe_file should be pointing to executable file (because we use existing prctl_set_mm_exe_file_locked helper it ensures that the file we are going to use as exe link has all required permission granted). Now about where these members are involved inside kernel code: - @start_code and @end_code are used in /proc/$pid/[stat|statm] output; - @start_data and @end_data are used in /proc/$pid/[stat|statm] output, also they are considered if there enough space for brk() syscall result if RLIMIT_DATA is set; - @start_brk shown in /proc/$pid/stat output and accounted in brk() syscall if RLIMIT_DATA is set; also this member is tested to find a symbolic name of mmap event for perf system (we choose if event is generated for "heap" area); one more aplication is selinux -- we test if a process has PROCESS__EXECHEAP permission if trying to make heap area being executable with mprotect() syscall; - @brk is a current value for brk() syscall which lays inside heap area, it's shown in /proc/$pid/stat. When syscall brk() succesfully provides new memory area to a user space upon brk() completion the mm::brk is updated to carry new value; Both @start_brk and @brk are actively used in /proc/$pid/maps and /proc/$pid/smaps output to find a symbolic name "heap" for VMA being scanned; - @start_stack is printed out in /proc/$pid/stat and used to find a symbolic name "stack" for task and threads in /proc/$pid/maps and /proc/$pid/smaps output, and as the same as with @start_brk -- perf system uses it for event naming. Also kernel treat this member as a start address of where to map vDSO pages and to check if there is enough space for shmat() syscall; - @arg_start, @arg_end, @env_start and @env_end are printed out in /proc/$pid/stat. Another access to the data these members represent is to read /proc/$pid/environ or /proc/$pid/cmdline. Any attempt to read these areas kernel tests with access_process_vm helper so a user must have enough rights for this action; - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly speaking kernel doesn't care much about which exactly data is sitting there because it is solely for userspace; - @exe_fd is referred from /proc/$pid/exe and when generating coredump. We uses prctl_set_mm_exe_file_locked helper to update this member, so exe-file link modification remains one-shot action. Still note that updating exe-file link now doesn't require sys-resource capability anymore, after all there is no much profit in preventing setup own file link (there are a number of ways to execute own code -- ptrace, ld-preload, so that the only reliable way to find which exactly code is executed is to inspect running program memory). Still we require the caller to be at least user-namespace root user. I believe the old interface should be deprecated and ripped off in a couple of kernel releases if no one against. To test if new interface is implemented in the kernel one can pass PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently supported struct prctl_mm_map. [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions] Signed-off-by: Cyrill Gorcunov Cc: Kees Cook Cc: Tejun Heo Acked-by: Andrew Vagin Tested-by: Andrew Vagin Cc: Eric W. Biederman Cc: H. Peter Anvin Acked-by: Serge Hallyn Cc: Pavel Emelyanov Cc: Vasiliy Kulikov Cc: KAMEZAWA Hiroyuki Cc: Michael Kerrisk Cc: Julien Tinnes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/prctl.h | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 58afc04c107..513df75d0fc 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -1,6 +1,8 @@ #ifndef _LINUX_PRCTL_H #define _LINUX_PRCTL_H +#include + /* Values to pass as first argument to prctl() */ #define PR_SET_PDEATHSIG 1 /* Second arg is a signal */ @@ -119,6 +121,31 @@ # define PR_SET_MM_ENV_END 11 # define PR_SET_MM_AUXV 12 # define PR_SET_MM_EXE_FILE 13 +# define PR_SET_MM_MAP 14 +# define PR_SET_MM_MAP_SIZE 15 + +/* + * This structure provides new memory descriptor + * map which mostly modifies /proc/pid/stat[m] + * output for a task. This mostly done in a + * sake of checkpoint/restore functionality. + */ +struct prctl_mm_map { + __u64 start_code; /* code section bounds */ + __u64 end_code; + __u64 start_data; /* data section bounds */ + __u64 end_data; + __u64 start_brk; /* heap for brk() syscall */ + __u64 brk; + __u64 start_stack; /* stack starts at */ + __u64 arg_start; /* command line arguments bounds */ + __u64 arg_end; + __u64 env_start; /* environment variables bounds */ + __u64 env_end; + __u64 *auxv; /* auxiliary vector */ + __u32 auxv_size; /* vector size */ + __u32 exe_fd; /* /proc/$pid/exe link file */ +}; /* * Set specific pid that is allowed to ptrace the current task. -- cgit v1.2.3-70-g09d2 From 09316c09dde33aae14f34489d9e3d243ec0d5938 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 9 Oct 2014 15:29:32 -0700 Subject: mm/balloon_compaction: add vmstat counters and kpageflags bit Always mark pages with PageBalloon even if balloon compaction is disabled and expose this mark in /proc/kpageflags as KPF_BALLOON. Also this patch adds three counters into /proc/vmstat: "balloon_inflate", "balloon_deflate" and "balloon_migrate". They accumulate balloon activity. Current size of balloon is (balloon_inflate - balloon_deflate) pages. All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON. It should be selected by ballooning driver which wants use this feature. Currently virtio-balloon is the only user. Signed-off-by: Konstantin Khlebnikov Cc: Rafael Aquini Cc: Andrey Ryabinin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/virtio/Kconfig | 1 + drivers/virtio/virtio_balloon.c | 1 + fs/proc/page.c | 3 +++ include/linux/balloon_compaction.h | 2 ++ include/linux/vm_event_item.h | 7 +++++++ include/uapi/linux/kernel-page-flags.h | 1 + mm/Kconfig | 7 ++++++- mm/Makefile | 3 ++- mm/balloon_compaction.c | 2 ++ mm/vmstat.c | 12 +++++++++++- tools/vm/page-types.c | 1 + 11 files changed, 37 insertions(+), 3 deletions(-) (limited to 'include/uapi') diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index c6683f2e396..00b22863827 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -25,6 +25,7 @@ config VIRTIO_PCI config VIRTIO_BALLOON tristate "Virtio balloon driver" depends on VIRTIO + select MEMORY_BALLOON ---help--- This driver supports increasing and decreasing the amount of memory within a KVM guest. diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 2bad7f9dd2a..f893148a107 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -396,6 +396,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, spin_lock_irqsave(&vb_dev_info->pages_lock, flags); balloon_page_insert(vb_dev_info, newpage); vb_dev_info->isolated_pages--; + __count_vm_event(BALLOON_MIGRATE); spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; set_page_pfns(vb->pfns, newpage); diff --git a/fs/proc/page.c b/fs/proc/page.c index e647c55275d..1e3187da1fe 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -133,6 +133,9 @@ u64 stable_page_flags(struct page *page) if (PageBuddy(page)) u |= 1 << KPF_BUDDY; + if (PageBalloon(page)) + u |= 1 << KPF_BALLOON; + u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked); u |= kpf_copy_bit(k, KPF_SLAB, PG_slab); diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h index bc3d2985cc9..9b0a15d06a4 100644 --- a/include/linux/balloon_compaction.h +++ b/include/linux/balloon_compaction.h @@ -166,11 +166,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void) static inline void balloon_page_insert(struct balloon_dev_info *balloon, struct page *page) { + __SetPageBalloon(page); list_add(&page->lru, &balloon->pages); } static inline void balloon_page_delete(struct page *page) { + __ClearPageBalloon(page); list_del(&page->lru); } diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index ced92345c96..730334cdf03 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -72,6 +72,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, #endif +#ifdef CONFIG_MEMORY_BALLOON + BALLOON_INFLATE, + BALLOON_DEFLATE, +#ifdef CONFIG_BALLOON_COMPACTION + BALLOON_MIGRATE, +#endif +#endif #ifdef CONFIG_DEBUG_TLBFLUSH #ifdef CONFIG_SMP NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */ diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h index 5116a0e4817..2f96d233c98 100644 --- a/include/uapi/linux/kernel-page-flags.h +++ b/include/uapi/linux/kernel-page-flags.h @@ -31,6 +31,7 @@ #define KPF_KSM 21 #define KPF_THP 22 +#define KPF_BALLOON 23 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 0ceb8a567da..1d1ae6b078f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -230,12 +230,17 @@ config SPLIT_PTLOCK_CPUS config ARCH_ENABLE_SPLIT_PMD_PTLOCK boolean +# +# support for memory balloon +config MEMORY_BALLOON + boolean + # # support for memory balloon compaction config BALLOON_COMPACTION bool "Allow for balloon memory compaction/migration" def_bool y - depends on COMPACTION && VIRTIO_BALLOON + depends on COMPACTION && MEMORY_BALLOON help Memory fragmentation introduced by ballooning might reduce significantly the number of 2MB contiguous memory blocks that can be diff --git a/mm/Makefile b/mm/Makefile index f8ed7ab417b..1f534a7f0a7 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -16,7 +16,7 @@ obj-y := filemap.o mempool.o oom_kill.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ util.o mmzone.o vmstat.o backing-dev.o \ mm_init.o mmu_context.o percpu.o slab_common.o \ - compaction.o balloon_compaction.o vmacache.o \ + compaction.o vmacache.o \ interval_tree.o list_lru.o workingset.o \ iov_iter.o debug.o $(mmu-y) @@ -67,3 +67,4 @@ obj-$(CONFIG_ZBUD) += zbud.o obj-$(CONFIG_ZSMALLOC) += zsmalloc.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o +obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 3afdabdbc0a..b3cbe19f71b 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -36,6 +36,7 @@ struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info) BUG_ON(!trylock_page(page)); spin_lock_irqsave(&b_dev_info->pages_lock, flags); balloon_page_insert(b_dev_info, page); + __count_vm_event(BALLOON_INFLATE); spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); unlock_page(page); return page; @@ -74,6 +75,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info) } spin_lock_irqsave(&b_dev_info->pages_lock, flags); balloon_page_delete(page); + __count_vm_event(BALLOON_DEFLATE); spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); unlock_page(page); dequeued_page = true; diff --git a/mm/vmstat.c b/mm/vmstat.c index e9ab104b956..cce7c766da7 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -735,7 +735,7 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat, TEXT_FOR_HIGHMEM(xx) xx "_movable", const char * const vmstat_text[] = { - /* Zoned VM counters */ + /* enum zone_stat_item countes */ "nr_free_pages", "nr_alloc_batch", "nr_inactive_anon", @@ -778,10 +778,13 @@ const char * const vmstat_text[] = { "workingset_nodereclaim", "nr_anon_transparent_hugepages", "nr_free_cma", + + /* enum writeback_stat_item counters */ "nr_dirty_threshold", "nr_dirty_background_threshold", #ifdef CONFIG_VM_EVENT_COUNTERS + /* enum vm_event_item counters */ "pgpgin", "pgpgout", "pswpin", @@ -860,6 +863,13 @@ const char * const vmstat_text[] = { "thp_zero_page_alloc", "thp_zero_page_alloc_failed", #endif +#ifdef CONFIG_MEMORY_BALLOON + "balloon_inflate", + "balloon_deflate", +#ifdef CONFIG_BALLOON_COMPACTION + "balloon_migrate", +#endif +#endif /* CONFIG_MEMORY_BALLOON */ #ifdef CONFIG_DEBUG_TLBFLUSH #ifdef CONFIG_SMP "nr_tlb_remote_flush", diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c index c4d6d2e20e0..264fbc297e0 100644 --- a/tools/vm/page-types.c +++ b/tools/vm/page-types.c @@ -132,6 +132,7 @@ static const char * const page_flag_names[] = { [KPF_NOPAGE] = "n:nopage", [KPF_KSM] = "x:ksm", [KPF_THP] = "t:thp", + [KPF_BALLOON] = "o:balloon", [KPF_RESERVED] = "r:reserved", [KPF_MLOCKED] = "m:mlocked", -- cgit v1.2.3-70-g09d2