summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2014-01-13audit: fix incorrect set of audit_sockGao feng
NETLINK_CB(skb).sk is the socket of user space process, netlink_unicast in kauditd_send_skb wants the kernel side socket. Since the sk_state of audit netlink socket is not NETLINK_CONNECTED, so the netlink_getsockbyportid doesn't return -ECONNREFUSED. And the socket of userspace process can be released anytime, so the audit_sock may point to invalid socket. this patch sets the audit_sock to the kernel side audit netlink socket. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: print error message when fail to create audit socketGao feng
print the error message and then return -ENOMEM. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: fix dangling keywords in audit_log_set_loginuid() outputRichard Guy Briggs
Remove spaces between "new", "old" label modifiers and "auid", "ses" labels in log output since userspace tools can't parse orphaned keywords. Make variable names more consistent and intuitive. Make audit_log_format() argument code easier to read. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: log on errors from filter user rulesRichard Guy Briggs
An error on an AUDIT_NEVER rule disabled logging on that rule. On error on AUDIT_NEVER rules, log. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: audit_log_start running on auditd should not stopToshiyuki Okajima
The backlog cannot be consumed when audit_log_start is running on auditd even if audit_log_start calls wait_for_auditd to consume it. The situation is the deadlock because only auditd can consume the backlog. If the other process needs to send the backlog, it can be also stopped by the deadlock. So, audit_log_start running on auditd should not stop. You can see the deadlock with the following reproducer: # auditctl -a exit,always -S all # reboot Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Reviewed-by: gaofeng@cn.fujitsu.com Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: drop audit_cmd_lock in AUDIT_USER family of casesRichard Guy Briggs
We do not need to hold the audit_cmd_mutex for this family of cases. The possible exception to this is the call to audit_filter_user(), so drop the lock immediately after. To help in fixing the race we are trying to avoid, make sure that nothing called by audit_filter_user() calls audit_log_start(). In particular, watch out for *_audit_rule_match(). This fix will take care of systemd and anything USING audit. It still means that we could race with something configuring audit and auditd shutting down. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reported-by: toshi.okajima@jp.fujitsu.com Tested-by: toshi.okajima@jp.fujitsu.com Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: convert all sessionid declaration to unsigned intEric Paris
Right now the sessionid value in the kernel is a combination of u32, int, and unsigned int. Just use unsigned int throughout. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: Added exe field to audit core dump signal logPaul Davies C
Currently when the coredump signals are logged by the audit system, the actual path to the executable is not logged. Without details of exe, the system admin may not have an exact idea on what program failed. This patch changes the audit_log_task() so that the path to the exe is also logged. This was copied from audit_log_task_info() and the latter enhanced to avoid disappearing text fields. Signed-off-by: Paul Davies C <pauldaviesc@gmail.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: prevent an older auditd shutdown from orphaning a newer auditd startupRichard Guy Briggs
There have been reports of auditd restarts resulting in kaudit not being able to find a newly registered auditd. It results in reports such as: kernel: [ 2077.233573] audit: *NO* daemon at audit_pid=1614 kernel: [ 2077.234712] audit: audit_lost=97 audit_rate_limit=0 audit_backlog_limit=320 kernel: [ 2077.234718] audit: auditd disappeared (previously mis-spelled "dissapeared") One possible cause is a race between the shutdown of an older auditd and a newer one. If the newer one sets the daemon pid to itself in kauditd before the older one has cleared the daemon pid, the newer daemon pid will be erased. This could be caused by an automated system, or by manual intervention, but in either case, there is no use in having the older daemon clear the daemon pid reference since its old pid is no longer being referenced. This patch will prevent that specific case, returning an error of EACCES. The case for preventing a newer auditd from registering itself if there is an existing auditd is a more difficult case that is beyond the scope of this patch. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: refactor audit_receive_msg() to clarify AUDIT_*_RULE* casesRichard Guy Briggs
audit_receive_msg() needlessly contained a fallthrough case that called audit_receive_filter(), containing no common code between the cases. Separate them to make the logic clearer. Refactor AUDIT_LIST_RULES, AUDIT_ADD_RULE, AUDIT_DEL_RULE cases to create audit_rule_change(), audit_list_rules_send() functions. This should not functionally change the logic. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: log AUDIT_TTY_SET config changesRichard Guy Briggs
Log transition of config changes when AUDIT_TTY_SET is called, including both enabled and log_passwd values now in the struct. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: get rid of *NO* daemon at audit_pid=0 messageRichard Guy Briggs
kauditd_send_skb is called after audit_pid was checked to be non-zero. However, it can be set to 0 due to auditd exiting while kauditd_send_skb is still executed and this can result in a spurious warning about missing auditd. Re-check audit_pid before printing the message. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: linux-kernel@vger.kernel.org Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: drop audit_log_abend()Paul Davies C
The audit_log_abend() is used only by the audit_core_dumps(). Thus there is no need of maintaining the audit_log_abend() as a separate function. This patch drops the audit_log_abend() and pushes its functionalities back to the audit_core_dumps(). Apart from that the "reason" field is also dropped from being logged since the reason can be deduced from the signal number. Signed-off-by: Paul Davies C <pauldaviesc@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: allow unlimited backlog queueRichard Guy Briggs
Since audit can already be disabled by "audit=0" on the kernel boot line, or by the command "auditctl -e 0", it would be more useful to have the audit_backlog_limit set to zero mean effectively unlimited (limited only by system RAM). Acked-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: don't generate loginuid log when audit disabledGao feng
If audit is disabled, we shouldn't generate loginuid audit log. Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: use old_lock in audit_set_featureGao feng
we already have old_lock, no need to calculate it again. Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: don't generate audit feature changed log when audit disabledGao feng
If audit is disabled,we shouldn't generate the audit log. Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: fix incorrect order of log new and old featureGao feng
The order of new feature and old feature is incorrect, this patch fix it. Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: remove useless code in audit_enableGao feng
Since kernel parameter is operated before initcall, so the audit_initialized must be AUDIT_UNINITIALIZED or DISABLED in audit_enable. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: add audit_backlog_wait_time configuration optionRichard Guy Briggs
reaahead-collector abuses the audit logging facility to discover which files are accessed at boot time to make a pre-load list Add a tuning option to audit_backlog_wait_time so that if auditd can't keep up, or gets blocked, the callers won't be blocked. Bump audit_status API version to "2". Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: clean up AUDIT_GET/SET local variables and future-proof APIRichard Guy Briggs
Re-named confusing local variable names (status_set and status_get didn't agree with their command type name) and reduced their scope. Future-proof API changes by not depending on the exact size of the audit_status struct and by adding an API version field. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: add kernel set-up parameter to override default backlog limitRichard Guy Briggs
The default audit_backlog_limit is 64. This was a reasonable limit at one time. systemd causes so much audit queue activity on startup that auditd doesn't start before the backlog queue has already overflowed by more than a factor of 2. On a system with audit= not set on the kernel command line, this isn't an issue since that history isn't kept for auditd when it is available. On a system with audit=1 set on the kernel command line, kaudit tries to keep that history until auditd is able to drain the queue. This default can be changed by the "-b" option in audit.rules once the system has booted, but won't help with lost messages on boot. One way to solve this would be to increase the default backlog queue size to avoid losing any messages before auditd is able to consume them. This would be overkill to the embedded community and insufficient for some servers. Another way to solve it might be to add a kconfig option to set the default based on the system type. An embedded system would get the current (or smaller) default, while Workstations might get more than now and servers might get more. None of these solutions helps if a system's compiled default is too small to see the lost messages without compiling a new kernel. This patch adds a kernel set-up parameter (audit already has one to enable/disable it) "audit_backlog_limit=<n>" that overrides the default to allow the system administrator to set the backlog limit. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: efficiency fix 2: request exclusive wait since all need same resourceDan Duval
These and similar errors were seen on a patched 3.8 kernel when the audit subsystem was overrun during boot: udevd[876]: worker [887] unexpectedly returned with status 0x0100 udevd[876]: worker [887] failed while handling '/devices/pci0000:00/0000:00:03.0/0000:40:00.0' udevd[876]: worker [880] unexpectedly returned with status 0x0100 udevd[876]: worker [880] failed while handling '/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1' udevadm settle - timeout of 180 seconds reached, the event queue contains: /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1 (3995) /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/INT3F0D:00 (4034) audit: audit_backlog=258 > audit_backlog_limit=256 audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=256 The change below increases the efficiency of the audit code and prevents it from being overrun: Use add_wait_queue_exclusive() in wait_for_auditd() to put the thread on the wait queue. When kauditd dequeues an skb, all of the waiting threads are waiting for the same resource, but only one is going to get it, so there's no need to wake up more than one waiter. See: https://lkml.org/lkml/2013/9/2/479 Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: efficiency fix 1: only wake up if queue shorter than backlog limitDan Duval
These and similar errors were seen on a patched 3.8 kernel when the audit subsystem was overrun during boot: udevd[876]: worker [887] unexpectedly returned with status 0x0100 udevd[876]: worker [887] failed while handling '/devices/pci0000:00/0000:00:03.0/0000:40:00.0' udevd[876]: worker [880] unexpectedly returned with status 0x0100 udevd[876]: worker [880] failed while handling '/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1' udevadm settle - timeout of 180 seconds reached, the event queue contains: /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1 (3995) /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/INT3F0D:00 (4034) audit: audit_backlog=258 > audit_backlog_limit=256 audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=256 The change below increases the efficiency of the audit code and prevents it from being overrun: Only issue a wake_up in kauditd if the length of the skb queue is less than the backlog limit. Otherwise, threads waiting in wait_for_auditd() will simply wake up, discover that the queue is still too long for them to proceed, and go back to sleep. This results in wasted context switches and machine cycles. kauditd_thread() is the only function that removes buffers from audit_skb_queue so we can't race. If we did, the timeout in wait_for_auditd() would expire and the waiting thread would continue. See: https://lkml.org/lkml/2013/9/2/479 Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: make use of remaining sleep time from wait_for_auditdRichard Guy Briggs
If wait_for_auditd() times out, go immediately to the error function rather than retesting the loop conditions. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: reset audit backlog wait time after error recoveryRichard Guy Briggs
When the audit queue overflows and times out (audit_backlog_wait_time), the audit queue overflow timeout is set to zero. Once the audit queue overflow timeout condition recovers, the timeout should be reset to the original value. See also: https://lkml.org/lkml/2013/9/2/473 Cc: stable@vger.kernel.org # v3.8-rc4+ Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: listen in all network namespacesRichard Guy Briggs
Convert audit from only listening in init_net to use register_pernet_subsys() to dynamically manage the netlink socket list. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: restore order of tty and ses fields in log outputRichard Guy Briggs
When being refactored from audit_log_start() to audit_log_task_info(), in commit e23eb920 the tty and ses fields in the log output got transposed. Restore to original order to avoid breaking search tools. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: fix netlink portid naming and typesRichard Guy Briggs
Normally, netlink ports use the PID of the userspace process as the port ID. If the PID is already in use by a port, the kernel will allocate another port ID to avoid conflict. Re-name all references to netlink ports from pid to portid to reflect this reality and avoid confusion with actual PIDs. Ports use the __u32 type, so re-type all portids accordingly. (This patch is very similar to ebiederman's 5deadd69) Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13audit: Simplify and correct audit_log_capsetEric W. Biederman
- Always report the current process as capset now always only works on the current process. This prevents reporting 0 or a random pid in a random pid namespace. - Don't bother to pass the pid as is available. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> (cherry picked from commit bcc85f0af31af123e32858069eb2ad8f39f90e67) (cherry picked from commit f911cac4556a7a23e0b3ea850233d13b32328692) Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [eparis: fix build error when audit disabled] Signed-off-by: Eric Paris <eparis@redhat.com>
2014-01-13ftrace: Fix synchronization location disabling and freeing ftrace_opsSteven Rostedt (Red Hat)
The synchronization needed after ftrace_ops are unregistered must happen after the callback is disabled from becing called by functions. The current location happens after the function is being removed from the internal lists, but not after the function callbacks were disabled, leaving the functions susceptible of being called after their callbacks are freed. This affects perf and any externel users of function tracing (LTTng and SystemTap). Cc: stable@vger.kernel.org # 3.0+ Fixes: cdbe61bfe704 "ftrace: Allow dynamically allocated function tracers" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-01-13sched/preempt: Fix up missed PREEMPT_NEED_RESCHED foldingPeter Zijlstra
With various drivers wanting to inject idle time; we get people calling idle routines outside of the idle loop proper. Therefore we need to be extra careful about not missing TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations. While looking at this, I also realized there's a small window in the existing idle loop where we can miss TIF_NEED_RESCHED; when it hits right after the tif_need_resched() test at the end of the loop but right before the need_resched() test at the start of the loop. So move preempt_fold_need_resched() out of the loop where we're guaranteed to have TIF_NEED_RESCHED set. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-x9jgh45oeayzajz2mjt0y7d6@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/preempt, locking: Rework local_bh_{dis,en}able()Peter Zijlstra
Currently local_bh_disable() is out-of-line for no apparent reason. So inline it to save a few cycles on call/return nonsense, the function body is a single add on x86 (a few loads and store extra on load/store archs). Also expose two new local_bh functions: __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt); Which implement the actual local_bh_{dis,en}able() behaviour. The next patch uses the exposed @cnt argument to optimize bh lock functions. With build fixes from Jacob Pan. Cc: rjw@rjwysocki.net Cc: rui.zhang@intel.com Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13ftrace: Have function graph only trace based on global_ops filtersSteven Rostedt (Red Hat)
Doing some different tests, I discovered that function graph tracing, when filtered via the set_ftrace_filter and set_ftrace_notrace files, does not always keep with them if another function ftrace_ops is registered to trace functions. The reason is that function graph just happens to trace all functions that the function tracer enables. When there was only one user of function tracing, the function graph tracer did not need to worry about being called by functions that it did not want to trace. But now that there are other users, this becomes a problem. For example, one just needs to do the following: # cd /sys/kernel/debug/tracing # echo schedule > set_ftrace_filter # echo function_graph > current_tracer # cat trace [..] 0) | schedule() { ------------------------------------------ 0) <idle>-0 => rcu_pre-7 ------------------------------------------ 0) ! 2980.314 us | } 0) | schedule() { ------------------------------------------ 0) rcu_pre-7 => <idle>-0 ------------------------------------------ 0) + 20.701 us | } # echo 1 > /proc/sys/kernel/stack_tracer_enabled # cat trace [..] 1) + 20.825 us | } 1) + 21.651 us | } 1) + 30.924 us | } /* SyS_ioctl */ 1) | do_page_fault() { 1) | __do_page_fault() { 1) 0.274 us | down_read_trylock(); 1) 0.098 us | find_vma(); 1) | handle_mm_fault() { 1) | _raw_spin_lock() { 1) 0.102 us | preempt_count_add(); 1) 0.097 us | do_raw_spin_lock(); 1) 2.173 us | } 1) | do_wp_page() { 1) 0.079 us | vm_normal_page(); 1) 0.086 us | reuse_swap_page(); 1) 0.076 us | page_move_anon_rmap(); 1) | unlock_page() { 1) 0.082 us | page_waitqueue(); 1) 0.086 us | __wake_up_bit(); 1) 1.801 us | } 1) 0.075 us | ptep_set_access_flags(); 1) | _raw_spin_unlock() { 1) 0.098 us | do_raw_spin_unlock(); 1) 0.105 us | preempt_count_sub(); 1) 1.884 us | } 1) 9.149 us | } 1) + 13.083 us | } 1) 0.146 us | up_read(); When the stack tracer was enabled, it enabled all functions to be traced, which now the function graph tracer also traces. This is a side effect that should not occur. To fix this a test is added when the function tracing is changed, as well as when the graph tracer is enabled, to see if anything other than the ftrace global_ops function tracer is enabled. If so, then the graph tracer calls a test trampoline that will look at the function that is being traced and compare it with the filters defined by the global_ops. As an optimization, if there's no other function tracers registered, or if the only registered function tracers also use the global ops, the function graph infrastructure will call the registered function graph callback directly and not go through the test trampoline. Cc: stable@vger.kernel.org # 3.3+ Fixes: d2d45c7a03a2 "tracing: Have stack_tracer use a separate list of functions" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-01-13sched/clock: Fix up clear_sched_clock_stable()Peter Zijlstra
The below tells us the static_key conversion has a problem; since the exact point of clearing that flag isn't too important, delay the flip and use a workqueue to process it. [ ] TSC synchronization [CPU#0 -> CPU#22]: [ ] Measured 8 cycles TSC warp between CPUs, turning off TSC clock. [ ] [ ] ====================================================== [ ] [ INFO: possible circular locking dependency detected ] [ ] 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 Not tainted [ ] ------------------------------------------------------- [ ] swapper/0/1 is trying to acquire lock: [ ] (jump_label_mutex){+.+...}, at: [<ffffffff8115a637>] jump_label_lock+0x17/0x20 [ ] [ ] but task is already holding lock: [ ] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60 [ ] [ ] which lock already depends on the new lock. [ ] [ ] [ ] the existing dependency chain (in reverse order) is: [ ] [ ] -> #1 (cpu_hotplug.lock){+.+.+.}: [ ] [<ffffffff810def00>] lock_acquire+0x90/0x130 [ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0 [ ] [<ffffffff81093fdc>] get_online_cpus+0x3c/0x60 [ ] [<ffffffff8104cc67>] arch_jump_label_transform+0x37/0x130 [ ] [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80 [ ] [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0 [ ] [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0 [ ] [<ffffffff810c0f65>] sched_feat_set+0xf5/0x100 [ ] [<ffffffff810c5bdc>] set_numabalancing_state+0x2c/0x30 [ ] [<ffffffff81d12f3d>] numa_policy_init+0x1af/0x1b7 [ ] [<ffffffff81cebdf4>] start_kernel+0x35d/0x41f [ ] [<ffffffff81ceb5a5>] x86_64_start_reservations+0x2a/0x2c [ ] [<ffffffff81ceb6a2>] x86_64_start_kernel+0xfb/0xfe [ ] [ ] -> #0 (jump_label_mutex){+.+...}: [ ] [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0 [ ] [<ffffffff810def00>] lock_acquire+0x90/0x130 [ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0 [ ] [<ffffffff8115a637>] jump_label_lock+0x17/0x20 [ ] [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0 [ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20 [ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70 [ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150 [ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890 [ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160 [ ] [<ffffffff810942c9>] cpu_up+0x79/0x90 [ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c [ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197 [ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130 [ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0 [ ] [ ] other info that might help us debug this: [ ] [ ] Possible unsafe locking scenario: [ ] [ ] CPU0 CPU1 [ ] ---- ---- [ ] lock(cpu_hotplug.lock); [ ] lock(jump_label_mutex); [ ] lock(cpu_hotplug.lock); [ ] lock(jump_label_mutex); [ ] [ ] *** DEADLOCK *** [ ] [ ] 2 locks held by swapper/0/1: [ ] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81094037>] cpu_maps_update_begin+0x17/0x20 [ ] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60 [ ] [ ] stack backtrace: [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 [ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010 [ ] ffffffff82c9c270 ffff880236843bb8 ffffffff8165c5f5 ffffffff82c9c270 [ ] ffff880236843bf8 ffffffff81658c02 ffff880236843c80 ffff8802368586a0 [ ] ffff880236858678 0000000000000001 0000000000000002 ffff880236858000 [ ] Call Trace: [ ] [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a [ ] [<ffffffff81658c02>] print_circular_bug+0x1f9/0x207 [ ] [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0 [ ] [<ffffffff816680ff>] ? __atomic_notifier_call_chain+0x8f/0xb0 [ ] [<ffffffff810def00>] lock_acquire+0x90/0x130 [ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20 [ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20 [ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0 [ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20 [ ] [<ffffffff8115a637>] jump_label_lock+0x17/0x20 [ ] [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0 [ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20 [ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70 [ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150 [ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890 [ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160 [ ] [<ffffffff810942c9>] cpu_up+0x79/0x90 [ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c [ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130 [ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] ------------[ cut here ]------------ [ ] WARNING: CPU: 0 PID: 1 at /usr/src/linux-2.6/kernel/smp.c:374 smp_call_function_many+0xad/0x300() [ ] Modules linked in: [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 [ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010 [ ] 0000000000000009 ffff880236843be0 ffffffff8165c5f5 0000000000000000 [ ] ffff880236843c18 ffffffff81093d8c 0000000000000000 0000000000000000 [ ] ffffffff81ccd1a0 ffffffff810ca951 0000000000000000 ffff880236843c28 [ ] Call Trace: [ ] [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a [ ] [<ffffffff81093d8c>] warn_slowpath_common+0x8c/0xc0 [ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0 [ ] [<ffffffff81093dda>] warn_slowpath_null+0x1a/0x20 [ ] [<ffffffff8110b72d>] smp_call_function_many+0xad/0x300 [ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30 [ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30 [ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0 [ ] [<ffffffff8110ba96>] smp_call_function+0x46/0x80 [ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30 [ ] [<ffffffff8110bb3c>] on_each_cpu+0x3c/0xa0 [ ] [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20 [ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0 [ ] [<ffffffff8104f964>] text_poke_bp+0x64/0xd0 [ ] [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20 [ ] [<ffffffff8104ccde>] arch_jump_label_transform+0xae/0x130 [ ] [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80 [ ] [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0 [ ] [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0 [ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20 [ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70 [ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150 [ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890 [ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160 [ ] [<ffffffff810942c9>] cpu_up+0x79/0x90 [ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c [ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130 [ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0 [ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0 [ ] ---[ end trace 6ff1df5620c49d26 ]--- [ ] tsc: Marking TSC unstable due to check_tsc_sync_source failed Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-v55fgqj3nnyqnngmvuu8ep6h@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/clock, x86: Use a static_key for sched_clock_stablePeter Zijlstra
In order to avoid the runtime condition and variable load turn sched_clock_stable into a static_key. Also provide a shorter implementation of local_clock() and cpu_clock(int) when sched_clock_stable==1. MAINLINE PRE POST sched_clock_stable: 1 1 1 (cold) sched_clock: 329841 221876 215295 (cold) local_clock: 301773 234692 220773 (warm) sched_clock: 38375 25602 25659 (warm) local_clock: 100371 33265 27242 (warm) rdtsc: 27340 24214 24208 sched_clock_stable: 0 0 0 (cold) sched_clock: 382634 235941 237019 (cold) local_clock: 396890 297017 294819 (warm) sched_clock: 38194 25233 25609 (warm) local_clock: 143452 71234 71232 (warm) rdtsc: 27345 24245 24243 Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/clock: Remove local_irq_disable() from the clocksPeter Zijlstra
Now that x86 no longer requires IRQs disabled for sched_clock() and ia64 never had this requirement (it doesn't seem to do cpufreq at all), we can remove the requirement of disabling IRQs. MAINLINE PRE POST sched_clock_stable: 1 1 1 (cold) sched_clock: 329841 257223 221876 (cold) local_clock: 301773 309889 234692 (warm) sched_clock: 38375 25280 25602 (warm) local_clock: 100371 85268 33265 (warm) rdtsc: 27340 24247 24214 sched_clock_stable: 0 0 0 (cold) sched_clock: 382634 301224 235941 (cold) local_clock: 396890 399870 297017 (warm) sched_clock: 38194 25630 25233 (warm) local_clock: 143452 129629 71234 (warm) rdtsc: 27345 24307 24245 Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-36e5kohiasnr106d077mgubp@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13locking: Optimize lock_bh functionsPeter Zijlstra
Currently all _bh_ lock functions do two preempt_count operations: local_bh_disable(); preempt_disable(); and for the unlock: preempt_enable_no_resched(); local_bh_enable(); Since its a waste of perfectly good cycles to modify the same variable twice when you can do it in one go; use the new __local_bh_{dis,en}able_ip() functions that allow us to provide a preempt_count value to add/sub. So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to add/sub to be done in one go. As a bonus it gets rid of the preempt_enable_no_resched() usage. This reduces a 1000 loops of: spin_lock_bh(&bh_lock); spin_unlock_bh(&bh_lock); from 53596 cycles to 51995 cycles. I didn't do enough measurements to say for absolute sure that the result is significant but the the few runs I did for each suggest it is so. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Cc: rjw@rjwysocki.net Cc: rui.zhang@intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Factor out the on_null_domain() checks in trigger_load_balance()Daniel Lezcano
The test on_null_domain is done twice in the trigger_load_balance function. Move the test at the begin of the function, so there is only one check. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-9-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Pass 'struct rq' to nohz_idle_balance()Daniel Lezcano
The cpu information is stored in the struct rq. Pass the struct rq to nohz_idle_balance, so all the functions called in run_rebalance_domains have the same parameters and the 'this_cpu' variable becomes pointless. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ Added !SMP build fix. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-8-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Pass 'struct rq' to rebalance_domains()Daniel Lezcano
The cpu information is stored in the struct rq and the caller of the rebalance_domains function pass the cpu to retrieve the struct rq but it already has the struct rq info. Replace the cpu parameter with the struct rq. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-7-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Remove unused parameter from nohz_balancer_kick()Daniel Lezcano
The cpu parameter is no longer needed in nohz_balancer_kick, let's remove the parameter. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-6-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Remove unused parameter from find_new_ilb()Daniel Lezcano
The 'call_cpu' is never used in the function. Remove it. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-5-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Pass 'struct rq' to on_null_domain()Daniel Lezcano
The on_null_domain() function is getting the cpu to retrieve the struct rq associated with it. Pass 'struct rq' directly to the function as the caller already has the info. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-4-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Reduce nohz_kick_needed() parametersDaniel Lezcano
The cpu information is already stored in the struct rq, so no need to pass it as parameter to the nohz_kick_needed function. The caller of this function just called idle_cpu() before to fill the rq->idle_balance field. Use rq->cpu and rq->idle_balance. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-3-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Reduce trigger_load_balance() parametersDaniel Lezcano
The cpu information is already stored in the struct rq, so no need to pass it as parameter to the trigger_load_balance function. Cc: linaro-kernel@lists.linaro.org Cc: preeti.lkml@gmail.com Cc: mingo@redhat.com Cc: peterz@infradead.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Fix hotplug admission controlPeter Zijlstra
The current hotplug admission control is broken because: CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task() cannot fail and hard assumes it _will_ move all tasks off of the dying cpu, failing this will break hotplug. The much simpler solution is a DOWN_PREPARE handler that fails when removing one CPU gets us below the total allocated bandwidth. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Remove the sysctl_sched_dl knobsPeter Zijlstra
Remove the deadline specific sysctls for now. The problem with them is that the interaction with the exisiting rt knobs is nearly impossible to get right. The current (as per before this patch) situation is that the rt and dl bandwidth is completely separate and we enforce rt+dl < 100%. This is undesirable because this means that the rt default of 95% leaves us hardly any room, even though dl tasks are saver than rt tasks. Another proposed solution was (a discarted patch) to have the dl bandwidth be a fraction of the rt bandwidth. This is highly confusing imo. Furthermore neither proposal is consistent with the situation we actually want; which is rt tasks ran from a dl server. In which case the rt bandwidth is a direct subset of dl. So whichever way we go, the introduction of dl controls at this point is painful. Therefore remove them and instead share the rt budget. This means that for now the rt knobs are used for dl admission control and the dl runtime is accounted against the rt runtime. I realise that this isn't entirely desirable either; but whatever we do we appear to need to change the interface later, so better have a small interface for now. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Fix up the smp-affinity mask testsPeter Zijlstra
For now deadline tasks are not allowed to set smp affinity; however the current tests are wrong, cure this. The test in __sched_setscheduler() also uses an on-stack cpumask_t which is a no-no. Change both tests to use cpumask_subset() such that we test the root domain span to be a subset of the cpus_allowed mask. This way we're sure the tasks can always run on all CPUs they can be balanced over, and have no effective affinity constraints. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-fyqtb1lapxca3lhsxv9cumdc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: speed up SCHED_DEADLINE pushes with a push-heapJuri Lelli
Data from tests confirmed that the original active load balancing logic didn't scale neither in the number of CPU nor in the number of tasks (as sched_rt does). Here we provide a global data structure to keep track of deadlines of the running tasks in the system. The structure is composed by a bitmask showing the free CPUs and a max-heap, needed when the system is heavily loaded. The implementation and concurrent access scheme are kept simple by design. However, our measurements show that we can compete with sched_rt on large multi-CPUs machines [1]. Only the push path is addressed, the extension to use this structure also for pull decisions is straightforward. However, we are currently evaluating different (in order to decrease/avoid contention) data structures to solve possibly both problems. We are also going to re-run tests considering recent changes inside cpupri [2]. [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>