summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2014-11-19tracing: Fix return value of ftrace_raw_output_prep()Steven Rostedt (Red Hat)
If the trace_seq of ftrace_raw_output_prep() is full this function returns TRACE_TYPE_PARTIAL_LINE, otherwise it returns zero. The problem is that TRACE_TYPE_PARTIAL_LINE happens to be zero! The thing is, the caller of ftrace_raw_output_prep() expects a success to be zero. Change that to expect it to be TRACE_TYPE_HANDLED. Link: http://lkml.kernel.org/r/20141114112522.GA2988@dhcp128.suse.cz Reminded-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Remove return values of most trace_seq_*() functionsSteven Rostedt (Red Hat)
The trace_seq_printf() and friends are used to store strings into a buffer that can be passed around from function to function. If the trace_seq buffer fills up, it will not print any more. The return values were somewhat inconsistant and using trace_seq_has_overflowed() was a better way to know if the write to the trace_seq buffer succeeded or not. Now that all users have removed reading the return value of the printf() type functions, they can safely return void and keep future users of them from reading the inconsistent values as well. Link: http://lkml.kernel.org/r/20141114011411.992510720@goodmis.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Do not use return values of trace_seq_printf() in syscall tracingSteven Rostedt (Red Hat)
The functions trace_seq_printf() and friends will not be returning values soon and will be void functions. To know if they succeeded or not, the functions trace_seq_has_overflowed() and trace_handle_return() should be used instead. Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing/uprobes: Do not use return values of trace_seq_printf()Steven Rostedt (Red Hat)
The functions trace_seq_printf() and friends will soon no longer have return values. Using trace_seq_has_overflowed() and trace_handle_return() should be used instead. Link: http://lkml.kernel.org/r/20141114011411.693008134@goodmis.org Link: http://lkml.kernel.org/r/20141115050602.333705855@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatu.pt@hitachi.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing/probes: Do not use return value of trace_seq_printf()Steven Rostedt (Red Hat)
The functions trace_seq_printf() and friends will soon not have a return value and will only be a void function. Use trace_seq_has_overflowed() instead to know if the trace_seq operations succeeded or not. Link: http://lkml.kernel.org/r/20141114011411.530216306@goodmis.org Reviewed-by: Petr Mladek <pmladek@suse.cz> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Do not check return values of trace_seq_p*() for mmio tracerSteven Rostedt (Red Hat)
The return values for trace_seq_printf() and friends are going to be removed and they will become void functions. The mmio tracer checked their return and even did so incorrectly. Some of the funtions which returned the values were never checked themselves. Removing all the checks simplifies the code. Use trace_seq_has_overflowed() and trace_handle_return() where necessary instead. Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19kprobes/tracing: Use trace_seq_has_overflowed() for overflow checksSteven Rostedt (Red Hat)
Instead of checking the return value of trace_seq_printf() and friends for overflowing of the buffer, use the trace_seq_has_overflowed() helper function. This cleans up the code quite a bit and also takes us a step closer to changing the return values of trace_seq_printf() and friends to void. Link: http://lkml.kernel.org/r/20141114011411.181812785@goodmis.org Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Petr Mladek <pmladek@suse.cz> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Have function_graph use trace_seq_has_overflowed()Steven Rostedt (Red Hat)
Instead of doing individual checks all over the place that makes the code very messy. Just check trace_seq_has_overflowed() at the end or in strategic places. This makes the code much cleaner and also helps with getting closer to removing the return values of trace_seq_printf() and friends. Link: http://lkml.kernel.org/r/20141114011410.987913836@goodmis.org Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Have branch tracer use trace_handle_return() helper functionSteven Rostedt (Red Hat)
The branch tracer should not be checking the trace_seq_printf() return value as that will soon be void. There's a new trace_handle_return() helper function that will return TRACE_TYPE_PARTIAL_LINE if the trace_seq overflowed and TRACE_TYPE_HANDLED otherwise. Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19ring-buffer: Remove check of trace_seq_{puts,printf}() return valuesSteven Rostedt (Red Hat)
Remove checking the return value of all trace_seq_puts(). It was wrong anyway as only the last return value mattered. But as the trace_seq_puts() is going to be a void function in the future, we should not be checking the return value of it anyway. Just return !trace_seq_has_overflowed() instead. Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19blktrace/tracing: Use trace_seq_has_overflowed() helper functionSteven Rostedt (Red Hat)
Checking the return code of every trace_seq_printf() operation and having to return early if it overflowed makes the code messy. Using the new trace_seq_has_overflowed() and trace_handle_return() functions allows us to clean up the code. In the future, trace_seq_printf() and friends will be turning into void functions and not returning a value. The trace_seq_has_overflowed() is to be used instead. This cleanup allows that change to take place. Cc: Jens Axboe <axboe@fb.com> Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Add trace_seq_has_overflowed() and trace_handle_return()Steven Rostedt (Red Hat)
Adding a trace_seq_has_overflowed() which returns true if the trace_seq had too much written into it allows us to simplify the code. Instead of checking the return value of every call to trace_seq_printf() and friends, they can all be called normally, and at the end we can return !trace_seq_has_overflowed() instead. Several functions also return TRACE_TYPE_PARTIAL_LINE when the trace_seq overflowed and TRACE_TYPE_HANDLED otherwise. Another helper function was created called trace_handle_return() which takes a trace_seq and returns these enums. Using this helper function also simplifies the code. This change also makes it possible to remove the return values of trace_seq_printf() and friends. They should instead just be void functions. Link: http://lkml.kernel.org/r/20141114011410.365183157@goodmis.org Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19tracing: Fix trace_seq_bitmask() to start at current positionSteven Rostedt (Red Hat)
In trace_seq_bitmask() it calls bitmap_scnprintf() not from the current position of the trace_seq buffer (s->buffer + s->len), but instead from the beginning of the buffer (s->buffer). Luckily, the only user of this "ipi_raise tracepoint" uses it as the first parameter, and as such, the start of the temp buffer in include/trace/ftrace.h (see __get_bitmask()). Reported-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19ftrace/x86/extable: Add is_ftrace_trampoline() functionSteven Rostedt (Red Hat)
Stack traces that happen from function tracing check if the address on the stack is a __kernel_text_address(). That is, is the address kernel code. This calls core_kernel_text() which returns true if the address is part of the builtin kernel code. It also calls is_module_text_address() which returns true if the address belongs to module code. But what is missing is ftrace dynamically allocated trampolines. These trampolines are allocated for individual ftrace_ops that call the ftrace_ops callback functions directly. But if they do a stack trace, the code checking the stack wont detect them as they are neither core kernel code nor module address space. Adding another field to ftrace_ops that also stores the size of the trampoline assigned to it we can create a new function called is_ftrace_trampoline() that returns true if the address is a dynamically allocate ftrace trampoline. Note, it ignores trampolines that are not dynamically allocated as they will return true with the core_kernel_text() function. Link: http://lkml.kernel.org/r/20141119034829.497125839@goodmis.org Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19new helper: audit_file()Al Viro
... for situations when we don't have any candidate in pathnames - basically, in descriptor-based syscalls. [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19kill f_dentry usesAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19s390/kernel: add system calls for PCI memory accessAlexey Ishchuk
Add the new __NR_s390_pci_mmio_write and __NR_s390_pci_mmio_read system calls to allow user space applications to access device PCI I/O memory pages on s390x platform. [ Martin Schwidefsky: some code beautification ] Signed-off-by: Alexey Ishchuk <aishchuk@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-11-18tracing: Fix race of function probes countingSteven Rostedt (Red Hat)
The function probe counting for traceon and traceoff suffered a race condition where if the probe was executing on two or more CPUs at the same time, it could decrement the counter by more than one when disabling (or enabling) the tracer only once. The way the traceon and traceoff probes are suppose to work is that they disable (or enable) tracing once per count. If a user were to echo 'schedule:traceoff:3' into set_ftrace_filter, then when the schedule function was called, it would disable tracing. But the count should only be decremented once (to 2). Then if the user enabled tracing again (via tracing_on file), the next call to schedule would disable tracing again and the count would be decremented to 1. But if multiple CPUS called schedule at the same time, it is possible that the count would be decremented more than once because of the simple "count--" used. By reading the count into a local variable and using memory barriers we can guarantee that the count would only be decremented once per disable (or enable). The stack trace probe had a similar race, but here the stack trace will decrement for each time it is called. But this had the read-modify- write race, where it could stack trace more than the number of times that was specified. This case we use a cmpxchg to stack trace only the number of times specified. The dump probes can still use the old "update_count()" function as they only run once, and that is controlled by the dump logic itself. Link: http://lkml.kernel.org/r/20141118134643.4b550ee4@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-18PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selectedRafael J. Wysocki
The number of and dependencies between high-level power management Kconfig options make life much harder than necessary. Several conbinations of them have to be tested and supported, even though some of those combinations are very rarely used in practice (if they are used in practice at all). Moreover, the fact that we have separate independent Kconfig options for runtime PM and system suspend is a serious obstacle for integration between the two frameworks. To overcome these difficulties, always select PM_RUNTIME if PM_SLEEP is set. Among other things, this will allow system suspend callbacks provided by bus types and device drivers to rely on the runtime PM framework regardless of the kernel configuration. Enthusiastically-acked-by: Kevin Hilman <khilman@linaro.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-18bpf: remove test map scaffolding and user proper typesAlexei Starovoitov
proper types and function helpers are ready. Use them in verifier testsuite. Remove temporary stubs Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18bpf: allow eBPF programs to use mapsAlexei Starovoitov
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem() map accessors to eBPF programs Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18bpf: fix BPF_MAP_LOOKUP_ELEM command return codeAlexei Starovoitov
fix errno of BPF_MAP_LOOKUP_ELEM command as bpf manpage described it in commit b4fc1a460f30("Merge branch 'bpf-next'"): ----- BPF_MAP_LOOKUP_ELEM int bpf_lookup_elem(int fd, void *key, void *value) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .value = ptr_to_u64(value), }; return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); } bpf() syscall looks up an element with given key in a map fd. If element is found it returns zero and stores element's value into value. If element is not found it returns -1 and sets errno to ENOENT. and further down in manpage: ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that element with given key was not found. ----- In general all BPF commands return ENOENT when map element is not found (including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with flags == BPF_MAP_UPDATE_ONLY) Subsequent patch adds a testsuite to check return values for all of these combinations. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18bpf: add array type of eBPF mapsAlexei Starovoitov
add new map type BPF_MAP_TYPE_ARRAY and its implementation - optimized for fastest possible lookup() . in the future verifier/JIT may recognize lookup() with constant key and optimize it into constant pointer. Can optimize non-constant key into direct pointer arithmetic as well, since pointers and value_size are constant for the life of the eBPF program. In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT while preserving concurrent access to this map from user space - two main use cases for array type: . 'global' eBPF variables: array of 1 element with key=0 and value is a collection of 'global' variables which programs can use to keep the state between events . aggregation of tracing events into fixed set of buckets - all array elements pre-allocated and zero initialized at init time - key as an index in array and can only be 4 byte - map_delete_elem() returns EINVAL, since elements cannot be deleted - map_update_elem() replaces elements in an non-atomic way (for atomic updates hashtable type should be used instead) Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18bpf: add hashtable type of eBPF mapsAlexei Starovoitov
add new map type BPF_MAP_TYPE_HASH and its implementation - maps are created/destroyed by userspace. Both userspace and eBPF programs can lookup/update/delete elements from the map - eBPF programs can be called in_irq(), so use spin_lock_irqsave() mechanism for concurrent updates - key/value are opaque range of bytes (aligned to 8 bytes) - user space provides 3 configuration attributes via BPF syscall: key_size, value_size, max_entries - map takes care of allocating/freeing key/value pairs - map_update_elem() must fail to insert new element when max_entries limit is reached to make sure that eBPF programs cannot exhaust memory - map_update_elem() replaces elements in an atomic way - optimized for speed of lookup() which can be called multiple times from eBPF program which itself is triggered by high volume of events . in the future JIT compiler may recognize lookup() call and optimize it further, since key_size is constant for life of eBPF program Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM commandAlexei Starovoitov
the current meaning of BPF_MAP_UPDATE_ELEM syscall command is: either update existing map element or create a new one. Initially the plan was to add a new command to handle the case of 'create new element if it didn't exist', but 'flags' style looks cleaner and overall diff is much smaller (more code reused), so add 'flags' attribute to BPF_MAP_UPDATE_ELEM command with the following meaning: #define BPF_ANY 0 /* create new element or update existing */ #define BPF_NOEXIST 1 /* create new element if it didn't exist */ #define BPF_EXIST 2 /* update existing element */ bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST if element already exists. bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT if element doesn't exist. Userspace will call it as: int bpf_update_elem(int fd, void *key, void *value, __u64 flags) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .value = ptr_to_u64(value), .flags = flags; }; return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); } First two bits of 'flags' are used to encode style of bpf_update_elem() command. Bits 2-63 are reserved for future use. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18cgroup: implement cgroup_get_e_css()Tejun Heo
Implement cgroup_get_e_css() which finds and gets the effective css for the specified cgroup and subsystem combination. This function always returns a valid pinned css. This will be used by cgroup writeback support. While at it, add comment to cgroup_e_css() to explain why that function is different from cgroup_get_e_css() and has to test cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18cgroup: add cgroup_subsys->css_e_css_changed()Tejun Heo
Add a new cgroup_subsys operatoin ->css_e_css_changed(). This is invoked if any of the effective csses seen from the css's cgroup may have changed. This will be used to implement cgroup writeback support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18cgroup: add cgroup_subsys->css_released()Tejun Heo
Add a new cgroup subsys callback css_released(). This is called when the reference count of the css (cgroup_subsys_state) reaches zero before RCU scheduling free. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()Tejun Heo
When a subsystem is offlined, its entry on @cgrp->subsys[] is cleared asynchronously. If cgroup_subtree_control_write() is requested to enable the subsystem again before the entry is cleared, it has to wait for the previous offlining to finish and clear the @cgrp->subsys[] entry before trying to enable the subsystem again. This is currently done while verifying the input enable / disable parameters. This used to be correct but f63070d350e3 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control") breaks it. The commit is one of the commits implementing subsystem dependency. Through subsystem dependency, some subsystems may be enabled and disabled implicitly in addition to the explicitly requested ones. The actual subsystems to be enabled and disabled are determined during @css_enable/disable calculation. The current offline wait logic skips the ones which are already implicitly enabled and then waits for subsystems in @enable; however, this misses the subsystems which may be implicitly enabled through dependency from @enable. If such implicitly subsystem hasn't yet finished offlining yet, the function ends up trying to create a css when its @cgrp->subsys[] slot is already occupied triggering BUG_ON() in init_and_link_css(). Fix it by moving the wait logic after @css_enable is calculated and waiting for all the subsystems in @css_enable. This fixes the above bug as the mask contains all subsystems which are to be enabled including the ones enabled through dependencies. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: f63070d350e3 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control") Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()Tejun Heo
Make cgroup_subtree_control_write() first calculate new subtree_control (new_sc), child_subsys_mask (new_ss) and css_enable/disable masks before applying them to the cgroup. Also, store the original subtree_control (old_sc) and child_subsys_mask (old_ss) and use them to restore the orignal state after failure. This patch shouldn't cause any behavior changes. This prepares for a fix for a bug in the async css offline wait logic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18cgroup: separate out cgroup_calc_child_subsys_mask() from ↵Tejun Heo
cgroup_refresh_child_subsys_mask() cgroup_refresh_child_subsys_mask() calculates and updates the effective @cgrp->child_subsys_maks according to the current @cgrp->subtree_control. Separate out the calculation part into cgroup_calc_child_subsys_mask(). This will be used to fix a bug in the async css offline wait logic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18PM: Kconfig: fix unmet dependency for CPU_PMPankaj Dubey
If BL_SWITCHER is enabled but SUSPEND and CPU_IDLE is not enabled we are getting following config warning. warning: (BL_SWITCHER) selects CPU_PM which has unmet direct dependencies (SUSPEND || CPU_IDLE) It has been noticed that CPU_PM dependencies in this file are not really required so let's remove these dependencies from CPU_PM. Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-18PM / hibernate: Deletion of an unnecessary check before the function call ↵Markus Elfring
"vfree" The vfree() function performs also input parameter validation. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-18Merge back 'pm-sleep' material for 3.19-rc1.Rafael J. Wysocki
2014-11-18x86, mpx: On-demand kernel allocation of bounds tablesDave Hansen
This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18mpx: Extend siginfo structure to include bound violation informationQiaowei Ren
This patch adds new fields about bound violation into siginfo structure. si_lower and si_upper are respectively lower bound and upper bound when bound violation is caused. Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151819.1908C900@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-17audit: convert status version to a feature bitmapRichard Guy Briggs
The version field defined in the audit status structure was found to have limitations in terms of its expressibility of features supported. This is distict from the get/set features call to be able to command those features that are present. Converting this field from a version number to a feature bitmap will allow distributions to selectively backport and support certain features and will allow upstream to be able to deprecate features in the future. It will allow userspace clients to first query the kernel for which features are actually present and supported. Currently, EINVAL is returned rather than EOPNOTSUP, which isn't helpful in determining if there was an error in the command, or if it simply isn't supported yet. Past features are not represented by this bitmap, but their use may be converted to EOPNOTSUP if needed in the future. Since "version" is too generic to convert with a #define, use a union in the struct status, introducing the member "feature_bitmap" unionized with "version". Convert existing AUDIT_VERSION_* macros over to AUDIT_FEATURE_BITMAP* counterparts, leaving the former for backwards compatibility. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: minor whitespace tweaks] Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-11-16perf: Improve the perf_sample_data struct layoutPeter Zijlstra
This patch reorders fields in the perf_sample_data struct in order to minimize the number of cachelines touched in perf_sample_data_init(). It also removes some intializations which are redundant with the code in kernel/events/core.c Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1411559322-16548-7-git-send-email-eranian@google.com Cc: cebbert.lkml@gmail.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: jolsa@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16perf: Add ability to sample machine state on interruptStephane Eranian
Enable capture of interrupted machine state for each sample. Registers to sample are passed per event in the sample_regs_intr bitmask. To sample interrupt machine state, the PERF_SAMPLE_INTR_REGS must be passed in sample_type. The list of available registers is arch dependent and provided by asm/perf_regs.h Registers are laid out as u64 in the order of the bit order of sample_intr_regs. This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend the perf_event_attr struct with a new u64 field. Reviewed-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: cebbert.lkml@gmail.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-api@vger.kernel.org Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICKWanpeng Li
Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with the fair class. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/deadline: Remove unnecessary definitions in cpudeadline.hpang.xunlei
Actually, cpudl_set() and cpudl_init() can never be used without CONFIG_SMP. Signed-off-by: pang.xunlei <pang.xunlei@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415260327-30465-4-git-send-email-pang.xunlei@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/cpupri: Remove unnecessary definitions in cpupri.hpang.xunlei
Actually, cpupri_set() and cpupri_init() can never be used without CONFIG_SMP. Signed-off-by: pang.xunlei <pang.xunlei@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: "pang.xunlei" <pang.xunlei@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()Wanpeng Li
Do not call dequeue_pushable_dl_task() when failing to push an eligible task, as it remains pushable, merely not at this particular moment. Actually the patch is the same behavior as commit 311e800e16f6 ("sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task()" in -rt side. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/fair: Fix stale overloaded status in the busiest group finding logicWanpeng Li
Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd") changes groups to be ranked in the order of overloaded > imbalance > other, and busiest group is picked according to this order. sgs->group_capacity_factor is used to check if the group is overloaded. When the child domain prefers tasks to go to siblings first, the sgs->group_capacity_factor will be set lower than one in order to move all the excess tasks away. However, group overloaded status is not updated when sgs->group_capacity_factor is set to lower than one, which leads to us missing to find the busiest group. This patch fixes it by updating group overloaded status when sg capacity factor is set to one, in order to find the busiest group accurately. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com [ Fixed the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched: Move p->nr_cpus_allowed check to select_task_rq()Wanpeng Li
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq(). This change will make fair.c, rt.c, and deadline.c all start with the same logic. Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "pang.xunlei" <pang.xunlei@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/completion: Document when to use wait_for_completion_io_*()Wolfram Sang
As discussed in [1], accounting IO is meant for blkio only. Document that so driver authors won't use them for device io. [1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470 Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/fair: Kill task_struct::numa_entry and numa_group::task_listKirill Tkhai
Nobody iterates over numa_group::task_list, this just confuses the readers. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar
applying more changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistencyStanislaw Gruszka
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc test case in cost of breaking another one. After that commit, calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result of Y time being smaller than X time. Reproducer/tester can be found further below, it can be compiled and ran by: gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread while ./tst-cpuclock2 ; do : ; done This reproducer, when running on a buggy kernel, will complain about "clock_gettime difference too small". Issue happens because on start in thread_group_cputimer() we initialize sum_exec_runtime of cputimer with threads runtime not yet accounted and then add the threads runtime to running cputimer again on scheduler tick, making it's sum_exec_runtime bigger than actual threads runtime. KOSAKI Motohiro posted a fix for this problem, but that patch was never applied: https://lkml.org/lkml/2013/5/26/191 . This patch takes different approach to cure the problem. It calls update_curr() when cputimer starts, that assure we will have updated stats of running threads and on the next schedule tick we will account only the runtime that elapsed from cputimer start. That also assure we have consistent state between cpu times of individual threads and cpu time of the process consisted by those threads. Full reproducer (tst-cpuclock2.c): #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <stdio.h> #include <time.h> #include <pthread.h> #include <stdint.h> #include <inttypes.h> /* Parameters for the Linux kernel ABI for CPU clocks. */ #define CPUCLOCK_SCHED 2 #define MAKE_PROCESS_CPUCLOCK(pid, clock) \ ((~(clockid_t) (pid) << 3) | (clockid_t) (clock)) static pthread_barrier_t barrier; /* Help advance the clock. */ static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) ; return NULL; } /* Don't use the glibc wrapper. */ static int do_nanosleep(int flags, const struct timespec *req) { clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED); return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL); } static int64_t tsdiff(const struct timespec *before, const struct timespec *after) { int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec; int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec; return after_i - before_i; } int main(void) { int result = 0; pthread_t th; pthread_barrier_init(&barrier, NULL, 2); if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) { perror("pthread_create"); return 1; } pthread_barrier_wait(&barrier); /* The test. */ struct timespec before, after, sleeptimeabs; int64_t sleepdiff, diffabs; const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 }; /* The relative nanosleep. Not sure why this is needed, but its presence seems to make it easier to reproduce the problem. */ if (do_nanosleep(0, &sleeptime) != 0) { perror("clock_nanosleep"); return 1; } /* Get the current time. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) { perror("clock_gettime[2]"); return 1; } /* Compute the absolute sleep time based on the current time. */ uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec; sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000; sleeptimeabs.tv_nsec = nsec % 1000000000; /* Sleep for the computed time. */ if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) { perror("absolute clock_nanosleep"); return 1; } /* Get the time after the sleep. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) { perror("clock_gettime[3]"); return 1; } /* The time after sleep should always be equal to or after the absolute sleep time passed to clock_nanosleep. */ sleepdiff = tsdiff(&sleeptimeabs, &after); if (sleepdiff < 0) { printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff); result = 1; printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec); printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec); printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec); } /* The difference between the timestamps taken before and after the clock_nanosleep call should be equal to or more than the duration of the sleep. */ diffabs = tsdiff(&before, &after); if (diffabs < sleeptime.tv_nsec) { printf("clock_gettime difference too small: %" PRId64 "\n", diffabs); result = 1; } pthread_cancel(th); return result; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16sched/cputime: Fix cpu_timer_sample_group() double accountingPeter Zijlstra
While looking over the cpu-timer code I found that we appear to add the delta for the calling task twice, through: cpu_timer_sample_group() thread_group_cputimer() thread_group_cputime() times->sum_exec_runtime += task_sched_runtime(); *sample = cputime.sum_exec_runtime + task_delta_exec(); Which would make the sample run ahead, making the sleep short. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>