summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2013-12-09posix-timers: Convert abuses of BUG_ON to WARN_ONFrederic Weisbecker
The posix cpu timers code makes a heavy use of BUG_ON() but none of these concern fatal issues that require to stop the machine. So let's just warn the user when some internal state slips out of our hands. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove remaining uses of tasklist_lockFrederic Weisbecker
The remaining uses of tasklist_lock were mostly about synchronizing against sighand modifications, getting coherent and safe group samples and also thread/process wide timers list handling. All of this is already safely synchronizable with the target's sighand lock. Let's use it on these places instead. Also update the comments about locking. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Use sighand lock instead of tasklist_lock on timer deletionFrederic Weisbecker
Timer deletion doesn't need the tasklist lock. We need to protect against: * concurrent access to the lists p->cputime_expires and p->sighand->cputime_expires * task reaping that may also delete the timer list entry * timer firing We already hold the timer lock which protects us against concurrent timer firing. The rest only need the targets sighand to be locked. So hold it and drop the use of tasklist_lock there. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Use sighand lock instead of tasklist_lock for task clock sampleFrederic Weisbecker
There is no need for the tasklist_lock just to take a process wide clock sample. All we need is to get a coherent sample that doesn't race with exit() and exec(): * exit() may be concurrently reaping a task and flushing its time * sighand is unstable under exit() and exec(), and the latter also result in group leader that can change To protect against these, locking the target's sighand is enough. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Consolidate posix_cpu_clock_get()Frederic Weisbecker
Consolidate the clock sampling common code used for both local and remote targets. Note that this introduces a tiny user ABI change: if a PID is passed to clock_gettime() along the clockid, we used to forbid a process wide clock sample when that PID doesn't belong to a group leader. Now after this patch we allow process wide clock samples if that PID belongs to the current task, even if the current task is not the group leader. But local process wide clock samples are allowed if PID == 0 (current task) even if the current task is not the group leader. So in the end this should be no big deal as this actually harmonize the behaviour when the remote sample is actually a local one. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove useless clock sample on timers cleanupFrederic Weisbecker
a0b2062b0904ef07944c4a6e4d0f88ee44f1e9f2 ("posix_timers: fix racy timer delta caching on task exit") forgot to remove the arguments used for timer caching. Fix this leftover. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove dead task special caseFrederic Weisbecker
Now that we've removed all the optimizations that could result in NULL timer's targets, we can remove all the associated special case handling. Also add some warnings on NULL targets to spot any possible leftover. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Cleanup reaped target handlingFrederic Weisbecker
When a timer's target is seen to be buried, for example on calls to timer_gettime(), the posix cpu timers code behaves a bit like a garbage collector and releases early the reference to the task. Then again, this optimization complicates the code for no much value: it's up to the user to release the timer and its associated ressources by calling timer_delete() after it buries the target tasks. Remove this to simplify the code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove dead process posix cpu timers cachingFrederic Weisbecker
Now that we removed dead thread posix cpu timers caching, lets remove the dead process wide version. This caching is similar to the per thread version but it should be even more rare: * If the process id dead, we are not reading its timers status from a thread belonging to its group since they are all dead. So this caching only concern remote process timers reads. Now posix cpu timers using itimers or timer_settime() can't do remote process timers anyway so it's not even clear if there is actually a user for this caching. * Unlike per thread timers caching, this only applies to zombies targets. Buried targets' process wide timers return 0 values. But then again, timer_gettime() can't read remote process timers, so if the process is dead, there can't be any reader left anyway. Then again this caching seem to complicate the code for corner cases that are probably not worth it. So lets get rid of it. Also remove the sample snapshot on dying process timer that is now useless, as suggested by Kosaki. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-09posix-timers: Remove dead thread posix cpu timers cachingFrederic Weisbecker
When a task is exiting or has exited, its posix cpu timers don't tick anymore and won't elapse further. It's too late for them to expire. So any further call to timer_gettime() on these timers will return the same remaining expiry time. The current code optimize this by caching the remaining delta and storing it where we use to save the absolute expiration time. This way, the future calls to timer_gettime() won't need to compute the difference between the absolute expiration time and the current time anymore. Now this optimization doesn't seem to bring much value. Computing the timer remaining delta is not very costly. Fetching the timer value OTOH can be costly in two ways: * CPUCLOCK_SCHED read requires to lock the target's rq. But some optimizations are on the way to make task_sched_runtime() not holding the rq lock of a non-running target. * CPUCLOCK_VIRT/CPUCLOCK_PROF read simply consist in fetching current->utime/current->stime except when the system uses full dynticks cputime accounting. The latter requires a per task lock in order to correctly compute user and system time. But once the target is dead, this lock shouldn't be contended anyway. All in one this caching doesn't seem to be justified. Given that it complicates the code significantly for few wins, let's remove it on single thread timers. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-12-04Merge branch 'timers/core-v2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core Pull dynticks updates from Frederic Weisbecker: * Fix a bug where posix cpu timers requeued due to interval got ignored on full dynticks CPUs (not a regression though as it only impacts full dynticks and the bug is there since we merged full dynticks). * Optimizations and cleanups on the use of per CPU APIs to improve code readability, performance and debuggability in the nohz subsystem; * Optimize posix cpu timer by sparing stub workqueue queue with full dynticks off case * Rename some functions to extend with *_this_cpu() suffix for clarity * Refine the naming of some context tracking subsystem state accessors * Trivial spelling fix by Paul Gortmaker Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-02posix-timers: Fix full dynticks CPUs kick on timer reschedulingFrederic Weisbecker
A posix CPU timer can be rearmed while it is firing or after it is notified with a signal. This can happen for example with timers that were set with a non zero interval in timer_settime(). This rearming can happen in two places: 1) On timer firing time, which happens on the target's tick. If the timer can't trigger a signal because it is ignored, it reschedules itself to honour the timer interval. 2) On signal handling from the timer's notification target. This one can be a different task than the timer's target itself. Once the signal is notified, the notification target rearms the timer, again to honour the timer interval. When a timer is rearmed, we need to notify the full dynticks CPUs such that they restart their tick in case they are running tasks that may have a share in elapsing this timer. Now the 1st case above handles full dynticks CPUs with a call to posix_cpu_timer_kick_nohz() from the posix cpu timer firing code. But the second case ignores the fact that some CPUs may run non-idle tasks with their tick off. As a result, when a timer is resheduled after its signal notification, the full dynticks CPUs may completely ignore it and not tick on the timer as expected This patch fixes this bug by handling both cases in one. All we need is to move the kick to the rearming common code in posix_cpu_timer_schedule(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Olivier Langlois <olivier@olivierlanglois.net>
2013-12-02posix-timers: Spare workqueue if there is no full dynticks CPU to kickFrederic Weisbecker
After a posix cpu timer is set, a workqueue is scheduled in order to kick the full dynticks CPUs and let them restart their tick if necessary in case the task they are running is concerned by the new timer. This kick is implemented by way of IPIs, which require interrupts to be enabled, hence the need for a workqueue to raise them because the posix cpu timer set path has interrupts disabled. Now if there is no full dynticks CPU on the system, the workqueue is still scheduled but it simply won't send any IPI and return immediately. So lets spare that worqueue when it is not needed. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-12-02context_tracking: Wrap static key check into more intuitive function nameFrederic Weisbecker
Use a function with a meaningful name to check the global context tracking state. static_key_false() is a bit confusing for reviewers. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-12-02nohz: Convert a few places to use local per cpu accessesFrederic Weisbecker
A few functions use remote per CPU access APIs when they deal with local values. Just do the right conversion to improve performance, code readability and debug checks. While at it, lets extend some of these function names with *_this_cpu() suffix in order to display their purpose more clearly. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-12-02Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: - Correction of fuzzy and fragile IRQ_RETVAL macro - IRQ related resume fix affecting only XEN - ARM/GIC fix for chained GIC controllers * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: Gic: fix boot for chained gics irq: Enable all irqs unconditionally in irq_resume genirq: Correct fuzzy and fragile IRQ_RETVAL() definition
2013-12-02Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various smaller fixlets, all over the place" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/doc: Fix generation of device-drivers sched: Expose preempt_schedule_irq() sched: Fix a trivial typo in comments sched: Remove unused variable in 'struct sched_domain' sched: Avoid NULL dereference on sd_busy sched: Check sched_domain before computing group power MAINTAINERS: Update file patterns in the lockdep and scheduler entries
2013-12-02Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc kernel and tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tools lib traceevent: Fix conversion of pointer to integer of different size perf/trace: Properly use u64 to hold event_id perf: Remove fragile swevent hlist optimization ftrace, perf: Avoid infinite event generation loop tools lib traceevent: Fix use of multiple options in processing field perf header: Fix possible memory leaks in process_group_desc() perf header: Fix bogus group name perf tools: Tag thread comm as overriden
2013-11-29Merge branch 'for-3.13-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "This contains one important fix. The NUMA support added a while back broke ordering guarantees on ordered workqueues. It was enforced by having single frontend interface with @max_active == 1 but the NUMA support puts multiple interfaces on unbound workqueues on NUMA machines thus breaking the ordered guarantee. This is fixed by disabling NUMA support on ordered workqueues. The above and a couple other patches were sitting in for-3.12-fixes but I forgot to push that out, so they ended up waiting a bit too long. My aplogies. Other fixes are minor" * 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues workqueue: fix comment typo for __queue_work() workqueue: fix ordered workqueues in NUMA setups workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY
2013-11-29Merge branch 'for-3.13-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Fixes for three issues. - cgroup destruction path could swamp system_wq possibly leading to deadlock. This actually seems to happen in the wild with memcg because memcg destruction path adds nested dependency on system_wq. Resolved by isolating cgroup destruction work items on its dedicated workqueue. - Possible locking context deadlock through seqcount reported by lockdep - Memory leak under certain conditions" * 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix cgroup_subsys_state leak for seq_files cpuset: Fix memory allocator deadlock cgroup: use a dedicated workqueue for cgroup destruction
2013-11-28kernel/extable: fix address-checks for core_kernel and init areasHelge Deller
The init_kernel_text() and core_kernel_text() functions should not include the labels _einittext and _etext when checking if an address is inside the .text or .init sections. Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-27cgroup: fix cgroup_subsys_state leak for seq_filesTejun Heo
If a cgroup file implements either read_map() or read_seq_string(), such file is served using seq_file by overriding file->f_op to cgroup_seqfile_operations, which also overrides the release method to single_release() from cgroup_file_release(). Because cgroup_file_open() didn't use to acquire any resources, this used to be fine, but since f7d58818ba42 ("cgroup: pin cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open() pins the css (cgroup_subsys_state) which is put by cgroup_file_release(). The patch forgot to update the release path for seq_files and each open/release cycle leaks a css reference. Fix it by updating cgroup_file_release() to also handle seq_files and using it for seq_file release path too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v3.12
2013-11-27cpuset: Fix memory allocator deadlockPeter Zijlstra
Juri hit the below lockdep report: [ 4.303391] ====================================================== [ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ] [ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted [ 4.303395] ------------------------------------------------------ [ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290 [ 4.303417] [ 4.303417] and this task is already holding: [ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100 [ 4.303431] which would create a new lock dependency: [ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...} [ 4.303436] [ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: [ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 { [ 4.303922] HARDIRQ-ON-W at: [ 4.303923] [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0 [ 4.303926] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303929] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303931] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303933] SOFTIRQ-ON-W at: [ 4.303933] [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0 [ 4.303935] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303940] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303955] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303959] INITIAL USE at: [ 4.303960] [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0 [ 4.303963] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303966] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303969] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303972] } Which reports that we take mems_allowed_seq with interrupts enabled. A little digging found that this can only be from cpuset_change_task_nodemask(). This is an actual deadlock because an interrupt doing an allocation will hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin forever waiting for the write side to complete. Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Reported-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2013-11-27sched: Expose preempt_schedule_irq()Thomas Gleixner
Tony reported that aa0d53260596 ("ia64: Use preempt_schedule_irq") broke PREEMPT=n builds on ia64. Ok, wrapped my brain around it. I tripped over the magic asm foo which has a single need_resched check and schedule point for both sys call return and interrupt return. So you need the schedule_preempt_irq() for kernel preemption from interrupt return while on a normal syscall preemption a schedule would be sufficient. But using schedule_preempt_irq() is not harmful here in any way. It just sets the preempt_active bit also in cases where it would not be required. Even on preempt=n kernels adding the preempt_active bit is completely harmless. So instead of having an extra function, moving the existing one out of the ifdef PREEMPT looks like the sanest thing to do. It would also allow getting rid of various other sti/schedule/cli asm magic in other archs. Reported-and-Tested-by: Tony Luck <tony.luck@gmail.com> Fixes: aa0d53260596 ("ia64: Use preempt_schedule_irq") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [slightly edited Changelog] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311211230030.30673@ionos.tec.linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-26Merge tag 'trace-fixes-v3.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This includes two fixes. 1) is a bug fix that happens when root does the following: echo function_graph > current_tracer modprobe foo echo nop > current_tracer This causes the ftrace internal accounting to get screwed up and crashes ftrace, preventing the user from using the function tracer after that. 2) if a TRACE_EVENT has a string field, and NULL is given for it. The internal trace event code does a strlen() and strcpy() on the source of field. If it is NULL it causes the system to oops. This bug has been there since 2.6.31, but no TRACE_EVENT ever passed in a NULL to the string field, until now" * tag 'trace-fixes-v3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix function graph with loading of modules tracing: Allow events to have NULL strings
2013-11-26ftrace: Fix function graph with loading of modulesSteven Rostedt (Red Hat)
Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload" fixed module loading and unloading with respect to function tracing, but it missed the function graph tracer. If you perform the following # cd /sys/kernel/debug/tracing # echo function_graph > current_tracer # modprobe nfsd # echo nop > current_tracer You'll get the following oops message: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9() Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000 0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668 ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000 Call Trace: [<ffffffff814fe193>] dump_stack+0x4f/0x7c [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9 [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9 [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364 [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78 [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10 [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e [<ffffffff81122aed>] vfs_write+0xab/0xf6 [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85 [<ffffffff81122dbd>] SyS_write+0x59/0x82 [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85 [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b ---[ end trace 940358030751eafb ]--- The above mentioned commit didn't go far enough. Well, it covered the function tracer by adding checks in __register_ftrace_function(). The problem is that the function graph tracer circumvents that (for a slight efficiency gain when function graph trace is running with a function tracer. The gain was not worth this). The problem came with ftrace_startup() which should always be called after __register_ftrace_function(), if you want this bug to be completely fixed. Anyway, this solution moves __register_ftrace_function() inside of ftrace_startup() and removes the need to call them both. Reported-by: Dave Wysochanski <dwysocha@redhat.com> Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace") Cc: stable@vger.kernel.org # 3.0+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-11-25irq: Enable all irqs unconditionally in irq_resumeLaxman Dewangan
When the system enters suspend, it disables all interrupts in suspend_device_irqs(), including the interrupts marked EARLY_RESUME. On the resume side things are different. The EARLY_RESUME interrupts are reenabled in sys_core_ops->resume and the non EARLY_RESUME interrupts are reenabled in the normal system resume path. When suspend_noirq() failed or suspend is aborted for any other reason, we might omit the resume side call to sys_core_ops->resume() and therefor the interrupts marked EARLY_RESUME are not reenabled and stay disabled forever. To solve this, enable all irqs unconditionally in irq_resume() regardless whether interrupts marked EARLY_RESUMEhave been already enabled or not. This might try to reenable already enabled interrupts in the non failure case, but the only affected platform is XEN and it has been confirmed that it does not cause any side effects. [ tglx: Massaged changelog. ] Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com> Acked-by-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Heiko Stuebner <heiko@sntech.de> Reviewed-by: Pavel Machek <pavel@ucw.cz> Cc: <ian.campbell@citrix.com> Cc: <rjw@rjwysocki.net> Cc: <len.brown@intel.com> Cc: <gregkh@linuxfoundation.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-11-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds
Pull crypto update from Herbert Xu: - Made x86 ablk_helper generic for ARM - Phase out chainiv in favour of eseqiv (affects IPsec) - Fixed aes-cbc IV corruption on s390 - Added constant-time crypto_memneq which replaces memcmp - Fixed aes-ctr in omap-aes - Added OMAP3 ROM RNG support - Add PRNG support for MSM SoC's - Add and use Job Ring API in caam - Misc fixes [ NOTE! This pull request was sent within the merge window, but Herbert has some questionable email sending setup that makes him public enemy #1 as far as gmail is concerned. So most of his emails seem to be trapped by gmail as spam, resulting in me not seeing them. - Linus ] * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (49 commits) crypto: s390 - Fix aes-cbc IV corruption crypto: omap-aes - Fix CTR mode counter length crypto: omap-sham - Add missing modalias padata: make the sequence counter an atomic_t crypto: caam - Modify the interface layers to use JR API's crypto: caam - Add API's to allocate/free Job Rings crypto: caam - Add Platform driver for Job Ring hwrng: msm - Add PRNG support for MSM SoC's ARM: DT: msm: Add Qualcomm's PRNG driver binding document crypto: skcipher - Use eseqiv even on UP machines crypto: talitos - Simplify key parsing crypto: picoxcell - Simplify and harden key parsing crypto: ixp4xx - Simplify and harden key parsing crypto: authencesn - Simplify key parsing crypto: authenc - Export key parsing helper function crypto: mv_cesa: remove deprecated IRQF_DISABLED hwrng: OMAP3 ROM Random Number Generator support crypto: sha256_ssse3 - also test for BMI2 crypto: mv_cesa - Remove redundant of_match_ptr crypto: sahara - Remove redundant of_match_ptr ...
2013-11-22workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in ↵Li Bin
init_workqueues When one work starts execution, the high bits of work's data contain pool ID. It can represent a maximum of WORK_OFFQ_POOL_NONE. Pool ID is assigned WORK_OFFQ_POOL_NONE when the work being initialized indicating that no pool is associated and get_work_pool() uses it to check the associated pool. So if worker_pool_assign_id() assigns a ID greater than or equal WORK_OFFQ_POOL_NONE to a pool, it triggers leakage, and it may break the non-reentrance guarantee. This patch fix this issue by modifying the worker_pool_assign_id() function calling idr_alloc() by setting @end param WORK_OFFQ_POOL_NONE. Furthermore, in the current implementation, the BUILD_BUG_ON() in init_workqueues makes no sense. The number of worker pools needed cannot be determined at compile time, because the number of backing pools for UNBOUND workqueues is dynamic based on the assigned custom attributes. So remove it. tj: Minor comment and indentation updates. Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-11-22workqueue: fix comment typo for __queue_work()Li Bin
It seems the "dying" should be "draining" here. Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-11-22workqueue: fix ordered workqueues in NUMA setupsTejun Heo
An ordered workqueue implements execution ordering by using single pool_workqueue with max_active == 1. On a given pool_workqueue, work items are processed in FIFO order and limiting max_active to 1 enforces the queued work items to be processed one by one. Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues") accidentally broke this guarantee by applying NUMA affinity to ordered workqueues too. On NUMA setups, an ordered workqueue would end up with separate pool_workqueues for different nodes. Each pool_workqueue still limits max_active to 1 but multiple work items may be executed concurrently and out of order depending on which node they are queued to. Fix it by using dedicated ordered_wq_attrs[] when creating ordered workqueues. The new attrs match the unbound ones except that no_numa is always set thus forcing all NUMA nodes to share the default pool_workqueue. While at it, add sanity check in workqueue creation path which verifies that an ordered workqueues has only the default pool_workqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Libin <huawei.libin@huawei.com> Cc: stable@vger.kernel.org Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-11-22workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITYOleg Nesterov
Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed() in create_worker(). Otherwise userland can change ->cpus_allowed in between. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-11-22cgroup: use a dedicated workqueue for cgroup destructionTejun Heo
Since be44562613851 ("cgroup: remove synchronize_rcu() from cgroup_diput()"), cgroup destruction path makes use of workqueue. css freeing is performed from a work item from that point on and a later commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two steps"), moves css offlining to workqueue too. As cgroup destruction isn't depended upon for memory reclaim, the destruction work items were put on the system_wq; unfortunately, some controller may block in the destruction path for considerable duration while holding cgroup_mutex. As large part of destruction path is synchronized through cgroup_mutex, when combined with high rate of cgroup removals, this has potential to fill up system_wq's max_active of 256. Also, it turns out that memcg's css destruction path ends up queueing and waiting for work items on system_wq through work_on_cpu(). If such operation happens while system_wq is fully occupied by cgroup destruction work items, work_on_cpu() can't make forward progress because system_wq is full and other destruction work items on system_wq can't make forward progress because the work item waiting for work_on_cpu() is holding cgroup_mutex, leading to deadlock. This can be fixed by queueing destruction work items on a separate workqueue. This patch creates a dedicated workqueue - cgroup_destroy_wq - for this purpose. As these work items shouldn't have inter-dependencies and mostly serialized by cgroup_mutex anyway, giving high concurrency level doesn't buy anything and the workqueue's @max_active is set to 1 so that destruction work items are executed one by one on each CPU. Hugh Dickins: Because cgroup_init() is run before init_workqueues(), cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a separate core_initcall(). In the future, we probably want to reorder so that workqueue init happens before cgroup_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com> Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils Cc: stable@vger.kernel.org # v3.9+
2013-11-21Merge branch 'for-linus2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "In this patchset, we finally get an SELinux update, with Paul Moore taking over as maintainer of that code. Also a significant update for the Keys subsystem, as well as maintenance updates to Smack, IMA, TPM, and Apparmor" and since I wanted to know more about the updates to key handling, here's the explanation from David Howells on that: "Okay. There are a number of separate bits. I'll go over the big bits and the odd important other bit, most of the smaller bits are just fixes and cleanups. If you want the small bits accounting for, I can do that too. (1) Keyring capacity expansion. KEYS: Consolidate the concept of an 'index key' for key access KEYS: Introduce a search context structure KEYS: Search for auth-key by name rather than target key ID Add a generic associative array implementation. KEYS: Expand the capacity of a keyring Several of the patches are providing an expansion of the capacity of a keyring. Currently, the maximum size of a keyring payload is one page. Subtract a small header and then divide up into pointers, that only gives you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses a keyring to store ID mapping data, that has proven to be insufficient to the cause. Whatever data structure I use to handle the keyring payload, it can only store pointers to keys, not the keys themselves because several keyrings may point to a single key. This precludes inserting, say, and rb_node struct into the key struct for this purpose. I could make an rbtree of records such that each record has an rb_node and a key pointer, but that would use four words of space per key stored in the keyring. It would, however, be able to use much existing code. I selected instead a non-rebalancing radix-tree type approach as that could have a better space-used/key-pointer ratio. I could have used the radix tree implementation that we already have and insert keys into it by their serial numbers, but that means any sort of search must iterate over the whole radix tree. Further, its nodes are a bit on the capacious side for what I want - especially given that key serial numbers are randomly allocated, thus leaving a lot of empty space in the tree. So what I have is an associative array that internally is a radix-tree with 16 pointers per node where the index key is constructed from the key type pointer and the key description. This means that an exact lookup by type+description is very fast as this tells us how to navigate directly to the target key. I made the data structure general in lib/assoc_array.c as far as it is concerned, its index key is just a sequence of bits that leads to a pointer. It's possible that someone else will be able to make use of it also. FS-Cache might, for example. (2) Mark keys as 'trusted' and keyrings as 'trusted only'. KEYS: verify a certificate is signed by a 'trusted' key KEYS: Make the system 'trusted' keyring viewable by userspace KEYS: Add a 'trusted' flag and a 'trusted only' flag KEYS: Separate the kernel signature checking keyring from module signing These patches allow keys carrying asymmetric public keys to be marked as being 'trusted' and allow keyrings to be marked as only permitting the addition or linkage of trusted keys. Keys loaded from hardware during kernel boot or compiled into the kernel during build are marked as being trusted automatically. New keys can be loaded at runtime with add_key(). They are checked against the system keyring contents and if their signatures can be validated with keys that are already marked trusted, then they are marked trusted also and can thus be added into the master keyring. Patches from Mimi Zohar make this usable with the IMA keyrings also. (3) Remove the date checks on the key used to validate a module signature. X.509: Remove certificate date checks It's not reasonable to reject a signature just because the key that it was generated with is no longer valid datewise - especially if the kernel hasn't yet managed to set the system clock when the first module is loaded - so just remove those checks. (4) Make it simpler to deal with additional X.509 being loaded into the kernel. KEYS: Load *.x509 files into kernel keyring KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate The builder of the kernel now just places files with the extension ".x509" into the kernel source or build trees and they're concatenated by the kernel build and stuffed into the appropriate section. (5) Add support for userspace kerberos to use keyrings. KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches KEYS: Implement a big key type that can save to tmpfs Fedora went to, by default, storing kerberos tickets and tokens in tmpfs. We looked at storing it in keyrings instead as that confers certain advantages such as tickets being automatically deleted after a certain amount of time and the ability for the kernel to get at these tokens more easily. To make this work, two things were needed: (a) A way for the tickets to persist beyond the lifetime of all a user's sessions so that cron-driven processes can still use them. The problem is that a user's session keyrings are deleted when the session that spawned them logs out and the user's user keyring is deleted when the UID is deleted (typically when the last log out happens), so neither of these places is suitable. I've added a system keyring into which a 'persistent' keyring is created for each UID on request. Each time a user requests their persistent keyring, the expiry time on it is set anew. If the user doesn't ask for it for, say, three days, the keyring is automatically expired and garbage collected using the existing gc. All the kerberos tokens it held are then also gc'd. (b) A key type that can hold really big tickets (up to 1MB in size). The problem is that Active Directory can return huge tickets with lots of auxiliary data attached. We don't, however, want to eat up huge tracts of unswappable kernel space for this, so if the ticket is greater than a certain size, we create a swappable shmem file and dump the contents in there and just live with the fact we then have an inode and a dentry overhead. If the ticket is smaller than that, we slap it in a kmalloc()'d buffer" * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits) KEYS: Fix keyring content gc scanner KEYS: Fix error handling in big_key instantiation KEYS: Fix UID check in keyctl_get_persistent() KEYS: The RSA public key algorithm needs to select MPILIB ima: define '_ima' as a builtin 'trusted' keyring ima: extend the measurement list to include the file signature kernel/system_certificate.S: use real contents instead of macro GLOBAL() KEYS: fix error return code in big_key_instantiate() KEYS: Fix keyring quota misaccounting on key replacement and unlink KEYS: Fix a race between negating a key and reading the error set KEYS: Make BIG_KEYS boolean apparmor: remove the "task" arg from may_change_ptraced_domain() apparmor: remove parent task info from audit logging apparmor: remove tsk field from the apparmor_audit_struct apparmor: fix capability to not use the current task, during reporting Smack: Ptrace access check mode ima: provide hash algo info in the xattr ima: enable support for larger default filedata hash algorithms ima: define kernel parameter 'ima_template=' to change configured default ima: add Kconfig default measurement list template ...
2013-11-21Merge git://git.infradead.org/users/eparis/auditLinus Torvalds
Pull audit updates from Eric Paris: "Nothing amazing. Formatting, small bug fixes, couple of fixes where we didn't get records due to some old VFS changes, and a change to how we collect execve info..." Fixed conflict in fs/exec.c as per Eric and linux-next. * git://git.infradead.org/users/eparis/audit: (28 commits) audit: fix type of sessionid in audit_set_loginuid() audit: call audit_bprm() only once to add AUDIT_EXECVE information audit: move audit_aux_data_execve contents into audit_context union audit: remove unused envc member of audit_aux_data_execve audit: Kill the unused struct audit_aux_data_capset audit: do not reject all AUDIT_INODE filter types audit: suppress stock memalloc failure warnings since already managed audit: log the audit_names record type audit: add child record before the create to handle case where create fails audit: use given values in tty_audit enable api audit: use nlmsg_len() to get message payload length audit: use memset instead of trying to initialize field by field audit: fix info leak in AUDIT_GET requests audit: update AUDIT_INODE filter rule to comparator function audit: audit feature to set loginuid immutable audit: audit feature to only allow unsetting the loginuid audit: allow unsetting the loginuid (with priv) audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE audit: loginuid functions coding style selinux: apply selinux checks on new audit message types ...
2013-11-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs bits and pieces from Al Viro: "Assorted bits that got missed in the first pull request + fixes for a couple of coredump regressions" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fold try_to_ascend() into the sole remaining caller dcache.c: get rid of pointless macros take read_seqbegin_or_lock() and friends to seqlock.h consolidate simple ->d_delete() instances gfs2: endianness misannotations dump_emit(): use __kernel_write(), not vfs_write() dump_align(): fix the dumb braino
2013-11-20Merge tag 'pm+acpi-2-3.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI and power management updates from Rafael Wysocki: - ACPI-based device hotplug fixes for issues introduced recently and a fix for an older error code path bug in the ACPI PCI host bridge driver - Fix for recently broken OMAP cpufreq build from Viresh Kumar - Fix for a recent hibernation regression related to s2disk - Fix for a locking-related regression in the ACPI EC driver from Puneet Kumar - System suspend error code path fix related to runtime PM and runtime PM documentation update from Ulf Hansson - cpufreq's conservative governor fix from Xiaoguang Chen - New processor IDs for intel_idle and turbostat and removal of an obsolete Kconfig option from Len Brown - New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg - Removal of several ACPI video DMI blacklist entries that are not necessary any more from Aaron Lu - Rework of the ACPI companion representation in struct device and code cleanup related to that change from Rafael J Wysocki, Lan Tianyu and Jarkko Nikula - Fixes for assigning names to ACPI-enumerated I2C and SPI devices from Jarkko Nikula * tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (24 commits) PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed() ACPI / PCI root: Clear driver_data before failing enumeration ACPI / hotplug: Fix PCI host bridge hot removal ACPI / hotplug: Fix acpi_bus_get_device() return value check cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs() ACPI / video: clean up DMI table for initial black screen problem ACPI / EC: Ensure lock is acquired before accessing ec struct members PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps() ACPI / AC: Remove struct acpi_device pointer from struct acpi_ac spi: Use stable dev_name for ACPI enumerated SPI slaves i2c: Use stable dev_name for ACPI enumerated I2C slaves ACPI: Provide acpi_dev_name accessor for struct acpi_device device name ACPI / bind: Use (put|get)_device() on ACPI device objects too ACPI: Eliminate the DEVICE_ACPI_HANDLE() macro ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node cpufreq: OMAP: Fix compilation error 'r & ret undeclared' PM / Runtime: Fix error path for prepare PM / Runtime: Update documentation around probe|remove|suspend cpufreq: conservative: set requested_freq to policy max when it is over policy max ...
2013-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Mostly these are fixes for fallout due to merge window changes, as well as cures for problems that have been with us for a much longer period of time" 1) Johannes Berg noticed two major deficiencies in our genetlink registration. Some genetlink protocols we passing in constant counts for their ops array rather than something like ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were using fixed IDs for their multicast groups. We have to retain these fixed IDs to keep existing userland tools working, but reserve them so that other multicast groups used by other protocols can not possibly conflict. In dealing with these two problems, we actually now use less state management for genetlink operations and multicast groups. 2) When configuring interface hardware timestamping, fix several drivers that simply do not validate that the hwtstamp_config value is one the driver actually supports. From Ben Hutchings. 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar. 4) In dev_forward_skb(), set the skb->protocol in the right order relative to skb_scrub_packet(). From Alexei Starovoitov. 5) Bridge erroneously fails to use the proper wrapper functions to make calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki Makita. 6) When detaching a bridge port, make sure to flush all VLAN IDs to prevent them from leaking, also from Toshiaki Makita. 7) Put in a compromise for TCP Small Queues so that deep queued devices that delay TX reclaim non-trivially don't have such a performance decrease. One particularly problematic area is 802.11 AMPDU in wireless. From Eric Dumazet. 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts here. Fix from Eric Dumzaet, reported by Dave Jones. 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn. 10) When computing mergeable buffer sizes, virtio-net fails to take the virtio-net header into account. From Michael Dalton. 11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic bumping, this one has been with us for a while. From Eric Dumazet. 12) Fix NULL deref in the new TIPC fragmentation handling, from Erik Hugne. 13) 6lowpan bit used for traffic classification was wrong, from Jukka Rissanen. 14) macvlan has the same issue as normal vlans did wrt. propagating LRO disabling down to the real device, fix it the same way. From Michal Kubecek. 15) CPSW driver needs to soft reset all slaves during suspend, from Daniel Mack. 16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet. 17) The xen-netfront RX buffer refill timer isn't properly scheduled on partial RX allocation success, from Ma JieYue. 18) When ipv6 ping protocol support was added, the AF_INET6 protocol initialization cleanup path on failure was borked a little. Fix from Vlad Yasevich. 19) If a socket disconnects during a read/recvmsg/recvfrom/etc that blocks we can do the wrong thing with the msg_name we write back to userspace. From Hannes Frederic Sowa. There is another fix in the works from Hannes which will prevent future problems of this nature. 20) Fix route leak in VTI tunnel transmit, from Fan Du. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) genetlink: make multicast groups const, prevent abuse genetlink: pass family to functions using groups genetlink: add and use genl_set_err() genetlink: remove family pointer from genl_multicast_group genetlink: remove genl_unregister_mc_group() hsr: don't call genl_unregister_mc_group() quota/genetlink: use proper genetlink multicast APIs drop_monitor/genetlink: use proper genetlink multicast APIs genetlink: only pass array to genl_register_family_with_ops() tcp: don't update snd_nxt, when a socket is switched from repair mode atm: idt77252: fix dev refcnt leak xfrm: Release dst if this dst is improper for vti tunnel netlink: fix documentation typo in netlink_set_err() be2net: Delete secondary unicast MAC addresses during be_close be2net: Fix unconditional enabling of Rx interface options net, virtio_net: replace the magic value ping: prevent NULL pointer dereference on write to msg_name bnx2x: Prevent "timeout waiting for state X" bnx2x: prevent CFC attention bnx2x: Prevent panic during DMAE timeout ...
2013-11-19kernel/bounds: avoid circular dependencies in generated headersKirill A. Shutemov
<linux/spinlock.h> has heavy dependencies on other header files. It triggers circular dependencies in generated headers on IA64, at least: CC kernel/bounds.s In file included from /home/space/kas/git/public/linux/arch/ia64/include/asm/thread_info.h:9:0, from include/linux/thread_info.h:54, from include/asm-generic/preempt.h:4, from arch/ia64/include/generated/asm/preempt.h:1, from include/linux/preempt.h:18, from include/linux/spinlock.h:50, from kernel/bounds.c:14: /home/space/kas/git/public/linux/arch/ia64/include/asm/asm-offsets.h:1:35: fatal error: generated/asm-offsets.h: No such file or directory compilation terminated. Let's replace <linux/spinlock.h> with <linux/spinlock_types.h>, it's enough to find out size of spinlock_t. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-and-Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-11-19genetlink: only pass array to genl_register_family_with_ops()Johannes Berg
As suggested by David Miller, make genl_register_family_with_ops() a macro and pass only the array, evaluating ARRAY_SIZE() in the macro, this is a little safer. The openvswitch has some indirection, assing ops/n_ops directly in that code. This might ultimately just assign the pointers in the family initializations, saving the struct genl_family_and_ops and code (once mcast groups are handled differently.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq cleanups from Ingo Molnar: "This is a multi-arch cleanup series from Thomas Gleixner, which we kept to near the end of the merge window, to not interfere with architecture updates. This series (motivated by the -rt kernel) unifies more aspects of IRQ handling and generalizes PREEMPT_ACTIVE" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: preempt: Make PREEMPT_ACTIVE generic sparc: Use preempt_schedule_irq ia64: Use preempt_schedule_irq m32r: Use preempt_schedule_irq hardirq: Make hardirq bits generic m68k: Simplify low level interrupt handling code genirq: Prevent spurious detection for unconditionally polled interrupts
2013-11-19sched: Fix a trivial typo in commentsShigeru Yoshida
Fix a trivial typo in rq_attach_root(). Signed-off-by: Shigeru Yoshida <shigeru.yoshida@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131117.121236.1990617639803941055.shigeru.yoshida@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19sched: Avoid NULL dereference on sd_busyPeter Zijlstra
Commit 37dc6b50cee9 ("sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some conditions leading to a possible NULL deref in set_cpu_sd_state_idle(). Reported-by: Anton Blanchard <anton@samba.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19sched: Check sched_domain before computing group powerSrikar Dronamraju
After commit 863bffc80898 ("sched/fair: Fix group power_orig computation"), we can dereference rq->sd before it is set. Fix this by falling back to power_of() in this case and add a comment explaining things. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Added comment and tweaked patch. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mikey@neuling.org Link: http://lkml.kernel.org/r/20131113151718.GN21461@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19perf/trace: Properly use u64 to hold event_idVince Weaver
The 64-bit attr.config value for perf trace events was being copied into an "int" before doing a comparison, meaning the top 32 bits were being truncated. As far as I can tell this didn't cause any errors, but it did mean it was possible to create valid aliases for all the tracepoint ids which I don't think was intended. (For example, 0xffffffff00000018 and 0x18 both enable the same tracepoint). Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1311151236100.11932@vincent-weaver-1.um.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19perf: Remove fragile swevent hlist optimizationPeter Zijlstra
Currently we only allocate a single cpu hashtable for per-cpu swevents; do away with this optimization for it is fragile in the face of things like perf_pmu_migrate_context(). The easiest thing is to make sure all CPUs are consistent wrt state. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130913111447.GN31370@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19ftrace, perf: Avoid infinite event generation loopPeter Zijlstra
Vince's perf-trinity fuzzer found yet another 'interesting' problem. When we sample the irq_work_exit tracepoint with period==1 (or PERF_SAMPLE_PERIOD) and we add an fasync SIGNAL handler we create an infinite event generation loop: ,-> <IPI> | irq_work_exit() -> | trace_irq_work_exit() -> | ... | __perf_event_overflow() -> (due to fasync) | irq_work_queue() -> (irq_work_list must be empty) '--------- arch_irq_work_raise() Similar things can happen due to regular poll() wakeups if we exceed the ring-buffer wakeup watermark, or have an event_limit. To avoid this, dis-allow sampling this particular tracepoint. In order to achieve this, create a special perf_perm function pointer for each event and call this (when set) on trying to create a tracepoint perf event. [ roasted: use expr... to allow for ',' in your expression ] Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Jones <davej@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20131114152304.GC5364@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19Merge branch 'pm-sleep'Rafael J. Wysocki
* pm-sleep: PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
2013-11-16Merge tag 'trace-3.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing update from Steven Rostedt: "This batch of changes is mostly clean ups and small bug fixes. The only real feature that was added this release is from Namhyung Kim, who introduced "set_graph_notrace" filter that lets you run the function graph tracer and not trace particular functions and their call chain. Tom Zanussi added some updates to the ftrace multibuffer tracing that made it more consistent with the top level tracing. One of the fixes for perf function tracing required an API change in RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing that change in this release too, he gave me a branch that included all the changes to get that working, and I pulled that into my tree in order to complete the perf function tracing fix" * tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add rcu annotation for syscall trace descriptors tracing: Do not use signed enums with unsigned long long in fgragh output tracing: Remove unused function ftrace_off_permanent() tracing: Do not assign filp->private_data to freed memory tracing: Add helper function tracing_is_disabled() tracing: Open tracer when ftrace_dump_on_oops is used tracing: Add support for SOFT_DISABLE to syscall events tracing: Make register/unregister_ftrace_command __init tracing: Update event filters for multibuffer recordmcount.pl: Add support for __fentry__ ftrace: Have control op function callback only trace when RCU is watching rcu: Do not trace rcu_is_watching() functions ftrace/x86: skip over the breakpoint for ftrace caller trace/trace_stat: use rbtree postorder iteration helper instead of opencoding ftrace: Add set_graph_notrace filter ftrace: Narrow down the protected area of graph_lock ftrace: Introduce struct ftrace_graph_data ftrace: Get rid of ftrace_graph_filter_enabled tracing: Fix potential out-of-bounds in trace_get_user() tracing: Show more exact help information about snapshot
2013-11-15consolidate simple ->d_delete() instancesAl Viro
Rename simple_delete_dentry() to always_delete_dentry() and export it. Export simple_dentry_operations, while we are at it, and get rid of their duplicates Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>