summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2012-07-30uprobes: Fix register_for_each_vma()->vma_address() checkOleg Nesterov
1. register_for_each_vma() checks that vma_address() == vaddr, but this is not enough. We should also ensure that vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only. 2. After the prevous changes, register_for_each_vma() is the only reason why vma_address() has to return loff_t, all other users know that we have the valid mapping at this offset and thus the overflow is not possible. Change the code to use vaddr_to_offset() instead, imho this looks more clean/understandable and now we can change vma_address(). 3. While at it, remove the unnecessary type-cast. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182244.GA20362@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Introduce vaddr_to_offset(vma, vaddr)Oleg Nesterov
Add the new helper, vaddr_to_offset(vma, vaddr) which returns the offset in vma->vm_file this vaddr is mapped at. Change build_probe_list() and find_active_uprobe() to use the new helper, the next patch adds another user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182242.GA20355@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Teach build_probe_list() to consider the rangeOleg Nesterov
Currently build_probe_list() builds the list of all uprobes attached to the given inode, and the caller should filter out those who don't fall into the [start,end) range, this is sub-optimal. This patch turns find_least_offset_node() into find_node_in_range() which returns the first node inside the [min,max] range, and changes build_probe_list() to use this node as a starting point for rb_prev() and rb_next() to find all other nodes the caller needs. The resulting list is no longer sorted but we do not care. This can speed up both build_probe_list() and the callers, but there is another reason to introduce find_node_in_range(). It can be used to figure out whether the given vma has uprobes or not, this will be needed soon. While at it, shift INIT_LIST_HEAD(tmp_list) into build_probe_list(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182240.GA20352@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Fix overflow in vma_address()/find_active_uprobe()Oleg Nesterov
vma->vm_pgoff is "unsigned long", it should be promoted to loff_t before the multiplication to avoid the overflow. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182233.GA20339@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Suppress uprobe_munmap() from mmput()Oleg Nesterov
uprobe_munmap() does get_user_pages() and it is also called from the final mmput()->exit_mmap() path. This slows down exit/mmput() for no reason, and I think it is simply dangerous/wrong to try to fault-in a page into the dying mm. If nothing else, this happens after the last sync_mm_rss(), afaics handle_mm_fault() can change the task->rss_stat and make the subsequent check_mm() unhappy. Change uprobe_munmap() to check mm->mm_users != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182231.GA20336@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()Oleg Nesterov
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the usage of uprobe->pending_list"). Yes, we do not care about uprobe->pending_list after return and nobody can remove the current list entry, but put_uprobe(uprobe) can actually free it and thus we need list_for_each_safe(). Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Link: http://lkml.kernel.org/r/20120729182229.GA20329@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Clean up and document write_opcode()->lock_page(old_page)Oleg Nesterov
The comment above write_opcode()->lock_page(old_page) tells about the race with do_wp_page(). I don't really understand which exactly race it means, but afaics this lock_page() was not enough to close all races with do_wp_page(). Anyway, since: 77fc4af1b59d uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing this code is always called with ->mmap_sem held for writing, so we can forget about do_wp_page(). However, we can't simply remove this lock_page(), and the only (afaics) reason is __replace_page()->try_to_free_swap(). Nothing in write_opcode() needs it, move it into __replace_page() and fix the comment. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182220.GA20322@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Kill write_opcode()->lock_page(new_page)Oleg Nesterov
write_opcode() does lock_page(new_page) for no reason. Nobody can see this page until __replace_page() exposes it under ptl lock, and we do nothing with this page after pte_unmap_unlock(). If nothing else, the similar code in do_wp_page() doesn't lock the new page for page_add_new_anon_rmap/set_pte_at_notify. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182218.GA20315@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: __replace_page() should not use page_address_in_vma()Oleg Nesterov
page_address_in_vma(old_page) in __replace_page() is ugly and wrong. The caller already knows the correct virtual address, this page was found by get_user_pages(vaddr). However, page_address_in_vma() can actually fail if page->mapping was cleared by __delete_from_page_cache() after get_user_pages() returns. But this means the race with page reclaim, write_opcode() should not fail, it should retry and read this page again. Probably the race with remove_mapping() is not possible due to page_freeze_refs() logic, but afaics at least shmem_writepage()->shmem_delete_from_page_cache() can clear ->mapping. We could change __replace_page() to return -EAGAIN in this case, but it would be better to simply use the caller's vaddr and rely on page_check_address(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182216.GA20311@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Don't recheck vma/f_mapping in write_opcode()Oleg Nesterov
write_opcode() rechecks valid_vma() and ->f_mapping, this is pointless. The caller, register_for_each_vma() or uprobe_mmap(), has already done these checks under mmap_sem. To clarify, uprobe_mmap() checks valid_vma() only, but we can rely on build_probe_list(vm_file->f_mapping->host). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182212.GA20304@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-26sched: Deliver sched_switch events to the current taskAndrew Vagin
Otherwise they can't be filtered for a defined task: perf record -e sched:sched_switch ./foo This command doesn't report any events without this patch. I think it isn't a security concern if someone knows who will be executed next - this can already be observed by polling /proc state. By default perf is disabled for non-root users in any case. I need these events for profiling sleep times. sched_switch is used for getting callchains and sched_stat_* is used for getting time periods. These events are combined in user space, then it can be analyzed by perf tools. Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arun Sharma <asharma@fb.com> Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-22Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core changes from Ingo Molnar: "Continued cleanups of the core time and NTP code, plus more nohz work preparing for tick-less userspace execution." * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Rework timekeeping functions to take timekeeper ptr as argument time: Move xtime_nsec adjustment underflow handling timekeeping_adjust time: Move arch_gettimeoffset() usage into timekeeping_get_ns() time: Refactor accumulation of nsecs to secs time: Condense timekeeper.xtime into xtime_sec time: Explicitly use u32 instead of int for shift values time: Whitespace cleanups per Ingo%27s requests nohz: Move next idle expiry time record into idle logic area nohz: Move ts->idle_calls incrementation into strict idle logic nohz: Rename ts->idle_tick to ts->last_tick nohz: Make nohz API agnostic against idle ticks cputime accounting nohz: Separate idle sleeping time accounting from nohz logic timers: Improve get_next_timer_interrupt() timers: Add accounting of non deferrable timers timers: Consolidate base->next_timer update timers: Create detach_if_pending() and use it
2012-07-22Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp/hotplug changes from Ingo Molnar: "Various cleanups to the SMP hotplug code - a continuing effort of Thomas et al" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smpboot: Remove leftover declaration smp: Remove num_booting_cpus() smp: Remove ipi_call_lock[_irq]()/ipi_call_unlock[_irq]() POWERPC: Smp: remove call to ipi_call_lock()/ipi_call_unlock() SPARC: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq() ia64: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq() x86-smp-remove-call-to-ipi_call_lock-ipi_call_unlock tile: SMP: Remove call to ipi_call_lock()/ipi_call_unlock() S390: Smp: remove call to ipi_call_lock()/ipi_call_unlock() parisc: Smp: remove call to ipi_call_lock()/ipi_call_unlock() mn10300: SMP: Remove call to ipi_call_lock()/ipi_call_unlock() hexagon: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
2012-07-22Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events changes from Ingo Molnar: "- kernel side: - Intel uncore PMU support for Nehalem and Sandy Bridge CPUs, we support both the events available via the MSR and via the PCI access space. - various uprobes cleanups and restructurings - PMU driver quirks by microcode version and required x86 microcode loader cleanups/robustization - various tracing robustness updates - static keys: remove obsolete static_branch() - tooling side: - GTK browser improvements - perf report browser: support screenshots to file - more automated tests - perf kvm improvements - perf bench refinements - build environment improvements - pipe mode improvements - libtraceevent updates, we have now hopefully merged most bits with the out of tree forked code base ... and many other goodies." * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (138 commits) tracing: Check for allocation failure in __tracing_open() perf/x86: Fix intel_perfmon_event_mapformatting jump label: Remove static_branch() tracepoint: Use static_key_false(), since static_branch() is deprecated perf/x86: Uncore filter support for SandyBridge-EP perf/x86: Detect number of instances of uncore CBox perf/x86: Fix event constraint for SandyBridge-EP C-Box perf/x86: Use 0xff as pseudo code for fixed uncore event perf/x86: Save a few bytes in 'struct x86_pmu' perf/x86: Add a microcode revision check for SNB-PEBS perf/x86: Improve debug output in check_hw_exists() perf/x86/amd: Unify AMD's generic and family 15h pmus perf/x86: Move Intel specific code to intel_pmu_init() perf/x86: Rename Intel specific macros perf/x86: Fix USER/KERNEL tagging of samples perf tools: Split event symbols arrays to hw and sw parts perf tools: Split out PE_VALUE_SYM parsing token to SW and HW tokens perf tools: Add empty rule for new line in event syntax parsing perf test: Use ARRAY_SIZE in parse events tests tools lib traceevent: Cleanup realloc use ...
2012-07-22Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "Quoting from Paul, the major features of this series are: 1. Preventing latency spikes of more than 200 microseconds for kernels built with NR_CPUS=4096, which is reportedly becoming the default for some distros. This is a first step, as it does not help with systems that actually -have- 4096 CPUs (work on this case is in progress, but is not yet ready for mainline). This category also includes improving concurrency of rcu_barrier(), placed here due to conflicts. Posted to LKML at: https://lkml.org/lkml/2012/6/22/381 Note that patches 18-22 of that series have been defered to 3.7, as they have not yet proven themselves to be mainline-ready (and yes, these are the ones intended to get rid of RCU's latency spikes for systems that actually have 4096 CPUs). 2. Updates to documentation and rcutorture fixes, the latter category including improvements to rcu_barrier() testing. Posted to LKML at http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/04094.html. 3. Miscellaneous fixes posted to LKML at: https://lkml.org/lkml/2012/6/22/500 with the exception of the last commit, which was posted here: http://www.gossamer-threads.com/lists/linux/kernel/1561830 4. RCU_FAST_NO_HZ fixes and improvements. Posted to LKML at: http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/00006.html http://www.gossamer-threads.com/lists/linux/kernel/1561833 The first four patches of the first series went into 3.5 to fix a regression. 5. Code-style fixes. These were posted to LKML at http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01180.html http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01181.html" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) rcu: Fix broken strings in RCU's source code. rcu: Fix code-style issues involving "else" rcu: Introduce check for callback list/count mismatch rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter rcu: Fix qlen_lazy breakage rcu: Round FAST_NO_HZ lazy timeout to nearest second rcu: The rcu_needs_cpu() function is not a quiescent state rcu: Dump only the current CPU's buffers for idle-entry/exit warnings rcu: Add check for CPUs going offline with callbacks queued rcu: Disable preemption in rcu_blocking_is_gp() rcu: Prevent uninitialized string in RCU CPU stall info rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU rcu: Split RCU core processing out of __call_rcu() rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs rcu: Make __call_rcu() handle invocation from idle rcu: Remove function versions of __kfree_rcu and __is_kfree_rcu_offset rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations rcu: Remove return value from rcu_assign_pointer() key: Remove extraneous parentheses from rcu_assign_keypointer() rcu: Remove return value from RCU_INIT_POINTER() ...
2012-07-21Merge branch 'anton-kgdb' (kgdb dmesg fixups)Linus Torvalds
Merge emailed kgdb dmesg fixups patches from Anton Vorontsov: "The dmesg command appears to be broken after the printk rework. The old logic in the kdb code makes no sense in terms of current printk/logging storage format, and KDB simply hangs forever upon entering 'dmesg' command. The first patch revives the command by switching to kmsg_dumper iterator. As a side-effect, the code is now much more simpler. A few changes were needed in the printk.c: we needed unlocked variant of the kmsg_dumper iterator, but these can surely wait for 3.6. It's probably too late even for the first patch to go to 3.5, but I'll try to convince otherwise. :-) Here we go: - The current code is broken for sure, and has no hope to work at all. It is a regression - The new code works for me, and probably works for everyone else; - If it compiles (and I urge everyone to compile-test it on your setup), it hardly can make things worse." * Merge emailed patches from Anton Vorontsov: (4 commits) kdb: Switch to nolock variants of kmsg_dump functions printk: Implement some unlocked kmsg_dump functions printk: Remove kdb_syslog_data kdb: Revive dmesg command
2012-07-21kdb: Switch to nolock variants of kmsg_dump functionsAnton Vorontsov
The locked variants are prone to deadlocks (suppose we got to the debugger w/ the logbuf lock held), so let's switch to nolock variants. Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21printk: Implement some unlocked kmsg_dump functionsAnton Vorontsov
If used from KDB, the locked variants are prone to deadlocks (suppose we got to the debugger w/ the logbuf lock held). So, we have to implement a few routines that grab no logbuf lock. Yet we don't need these functions in modules, so we don't export them. Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21printk: Remove kdb_syslog_dataAnton Vorontsov
The function is no longer needed, so remove it. Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21kdb: Revive dmesg commandAnton Vorontsov
The kgdb dmesg command is broken after the printk rework. The old logic in kdb code makes no sense in terms of current printk/logging storage format, and KDB simply hangs forever. This patch revives the command by switching to kmsg_dumper iterator. The code is now much more simpler and shorter. Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-18Make wait_for_device_probe() also do scsi_complete_async_scans()Linus Torvalds
Commit a7a20d103994 ("sd: limit the scope of the async probe domain") make the SCSI device probing run device discovery in it's own async domain. However, as a result, the partition detection was no longer synchronized by async_synchronize_full() (which, despite the name, only synchronizes the global async space, not all of them). Which in turn meant that "wait_for_device_probe()" would not wait for the SCSI partitions to be parsed. And "wait_for_device_probe()" was what the boot time init code relied on for mounting the root filesystem. Now, most people never noticed this, because not only is it timing-dependent, but modern distributions all use initrd. So the root filesystem isn't actually on a disk at all. And then before they actually mount the final disk filesystem, they will have loaded the scsi-wait-scan module, which not only does the expected wait_for_device_probe(), but also does scsi_complete_async_scans(). [ Side note: scsi_complete_async_scans() had also been partially broken, but that was fixed in commit 43a8d39d0137 ("fix async probe regression"), so that same commit a7a20d103994 had actually broken setups even if you used scsi-wait-scan explicitly ] Solve this problem by just moving the scsi_complete_async_scans() call into wait_for_device_probe(). Everybody who wants to wait for device probing to finish really wants the SCSI probing to complete, so there's no reason not to do this. So now "wait_for_device_probe()" really does what the name implies, and properly waits for device probing to finish. This also removes the now unnecessary extra calls to scsi_complete_async_scans(). Reported-and-tested-by: Artem S. Tashkinov <t.artem@mailcity.com> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: James Bottomley <jbottomley@parallels.com> Cc: Borislav Petkov <bp@amd64.org> Cc: linux-scsi <linux-scsi@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-18Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip One more time/ntp fix pulled from Ingo Molnar. * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ntp: Fix STA_INS/DEL clearing bug
2012-07-18Merge branch 'linus' into timers/coreIngo Molnar
Resolve semantic conflict in kernel/time/timekeeping.c. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-18Merge branch 'tip/perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull tracing fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-18Merge branch 'linus' into perf/coreIngo Molnar
Pick up the latest ring-buffer fixes, before applying a new fix. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-16timekeeping: Add missing update call in timekeeping_resume()Thomas Gleixner
The leap second rework unearthed another issue of inconsistent data. On timekeeping_resume() the timekeeper data is updated, but nothing calls timekeeping_update(), so now the update code in the timer interrupt sees stale values. This has been the case before those changes, but then the timer interrupt was using stale data as well so this went unnoticed for quite some time. Add the missing update call, so all the data is consistent everywhere. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de> Cc: LKML <linux-kernel@vger.kernel.org> Cc: Linux PM list <linux-pm@vger.kernel.org> Cc: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>, Cc: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-15time: Rework timekeeping functions to take timekeeper ptr as argumentJohn Stultz
As part of cleaning up the timekeeping code, this patch converts a number of internal functions to takei a timekeeper ptr as an argument, so that the internal functions don't access the global timekeeper structure directly. This allows for further optimizations to reduce lock hold time later. This patch has been updated to include more consistent usage of the timekeeper value, by making sure it is always passed as a argument to non top-level functions. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-9-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15time: Move xtime_nsec adjustment underflow handling timekeeping_adjustJohn Stultz
When we make adjustments speeding up the clock, its possible for xtime_nsec to underflow. We already handle this properly, but we do so from update_wall_time() instead of the more logical timekeeping_adjust(), where the possible underflow actually occurs. Thus, move the correction logic to the timekeeping_adjust, which is the function that causes the issue. Making update_wall_time() more readable. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-8-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15time: Move arch_gettimeoffset() usage into timekeeping_get_ns()John Stultz
Since we call arch_gettimeoffset() in all the accessor functions, move arch_gettimeoffset() calls into timekeeping_get_ns() and timekeeping_get_ns_raw() to simplify the code. This also makes the code easier to maintain as we don't have to worry about forgetting the arch_gettimeoffset() as has happened in the past. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-7-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15time: Refactor accumulation of nsecs to secsJohn Stultz
We do the exact same logic moving nsecs to secs in the timekeeper in multiple places, so condense this into a single function. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-6-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15time: Condense timekeeper.xtime into xtime_secJohn Stultz
The timekeeper struct has a xtime_nsec, which keeps the sub-nanosecond remainder. This ends up being somewhat duplicative of the timekeeper.xtime.tv_nsec value, and we have to do extra work to keep them apart, copying the full nsec portion out and back in over and over. This patch simplifies some of the logic by taking the timekeeper xtime value and splitting it into timekeeper.xtime_sec and reuses the timekeeper.xtime_nsec for the sub-second portion (stored in higher res shifted nanoseconds). This simplifies some of the accumulation logic. And will allow for more accurate timekeeping once the vsyscall code is updated to use the shifted nanosecond remainder. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-5-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15time: Explicitly use u32 instead of int for shift valuesJohn Stultz
Ingo noted that using a u32 instead of int for shift values would be better to make sure the compiler doesn't unnecessarily use complex signed arithmetic. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-4-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15time: Whitespace cleanups per Ingo%27s requestsJohn Stultz
Ingo noted a number of places where there is inconsistent use of whitespace. This patch tries to address the main culprits. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15Merge branch 'timers/urgent' into timers/coreThomas Gleixner
Reason: Update to upstream changes to avoid further conflicts. Fixup a trivial merge conflict in kernel/time/tick-sched.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15ntp: Fix STA_INS/DEL clearing bugJohn Stultz
In commit 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d, I introduced a bug that kept the STA_INS or STA_DEL bit from being cleared from time_status via adjtimex() without forcing STA_PLL first. Usually once the STA_INS is set, it isn't cleared until the leap second is applied, so its unlikely this affected anyone. However during testing I noticed it took some effort to cancel a leap second once STA_INS was set. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> CC: stable@vger.kernel.org # 3.4 Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-14Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and ↵Linus Torvalds
'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU, perf, and scheduler fixes from Ingo Molnar. The RCU fix is a revert for an optimization that could cause deadlocks. One of the scheduler commits (164c33c6adee "sched: Fix fork() error path to not crash") is correct but not complete (some architectures like Tile are not covered yet) - the resulting additional fixes are still WIP and Ingo did not want to delay these pending fixes. See this thread on lkml: [PATCH] fork: fix error handling in dup_task() The perf fixes are just trivial oneliners. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf kvm: Fix segfault with report and mixed guestmount use perf kvm: Fix regression with guest machine creation perf script: Fix format regression due to libtraceevent merge ring-buffer: Fix accounting of entries when removing pages ring-buffer: Fix crash due to uninitialized new_pages list head * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS/sched: Update scheduler file pattern sched/nohz: Rewrite and fix load-avg computation -- again sched: Fix fork() error path to not crash
2012-07-13Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull the leap second fixes from Thomas Gleixner: "It's a rather large series, but well discussed, refined and reviewed. It got a massive testing by John, Prarit and tip. In theory we could split it into two parts. The first two patches f55a6faa3843: hrtimer: Provide clock_was_set_delayed() 4873fa070ae8: timekeeping: Fix leapsecond triggered load spike issue are merely preventing the stuff loops forever issues, which people have observed. But there is no point in delaying the other 4 commits which achieve full correctness into 3.6 as they are tagged for stable anyway. And I rather prefer to have the full fixes merged in bulk than a "prevent the observable wreckage and deal with the hidden fallout later" approach." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Update hrtimer base offsets each hrtimer_interrupt timekeeping: Provide hrtimer update function hrtimers: Move lock held region in hrtimer_interrupt() timekeeping: Maintain ktime_t based offsets for hrtimers timekeeping: Fix leapsecond triggered load spike issue hrtimer: Provide clock_was_set_delayed()
2012-07-11tracing: Check for allocation failure in __tracing_open()Dan Carpenter
Clean up and return -ENOMEM on if the kzalloc() fails. This also prevents a potential crash, as the pointer that failed to allocate would be later used. Link: http://lkml.kernel.org/r/20120711063507.GF11812@elgon.mountain Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-11Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge random patches from Andrew Morton. * Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (32 commits) memblock: free allocated memblock_reserved_regions later mm: sparse: fix usemap allocation above node descriptor section mm: sparse: fix section usemap placement calculation xtensa: fix incorrect memset shmem: cleanup shmem_add_to_page_cache shmem: fix negative rss in memcg memory.stat tmpfs: revert SEEK_DATA and SEEK_HOLE drivers/rtc/rtc-twl.c: fix threaded IRQ to use IRQF_ONESHOT fat: fix non-atomic NFS i_pos read MAINTAINERS: add OMAP CPUfreq driver to OMAP Power Management section sgi-xp: nested calls to spin_lock_irqsave() fs: ramfs: file-nommu: add SetPageUptodate() drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() fails h8300/uaccess: add mising __clear_user() h8300/uaccess: remove assignment to __gu_val in unhandled case of get_user() h8300/time: add missing #include <asm/irq_regs.h> h8300/signal: fix typo "statis" h8300/pgtable: add missing #include <asm-generic/pgtable.h> drivers/rtc/rtc-ab8500.c: ensure correct probing of the AB8500 RTC when Device Tree is enabled ...
2012-07-11c/r: prctl: less paranoid prctl_set_mm_exe_file()Konstantin Khlebnikov
"no other files mapped" requirement from my previous patch (c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too paranoid, it forbids operation even if there mapped one shared-anon vma. Let's check that current mm->exe_file already unmapped, in this case exe_file symlink already outdated and its changing is reasonable. Plus, this patch fixes exit code in case operation success. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11hrtimer: Update hrtimer base offsets each hrtimer_interruptJohn Stultz
The update of the hrtimer base offsets on all cpus cannot be made atomically from the timekeeper.lock held and interrupt disabled region as smp function calls are not allowed there. clock_was_set(), which enforces the update on all cpus, is called either from preemptible process context in case of do_settimeofday() or from the softirq context when the offset modification happened in the timer interrupt itself due to a leap second. In both cases there is a race window for an hrtimer interrupt between dropping timekeeper lock, enabling interrupts and clock_was_set() issuing the updates. Any interrupt which arrives in that window will see the new time but operate on stale offsets. So we need to make sure that an hrtimer interrupt always sees a consistent state of time and offsets. ktime_get_update_offsets() allows us to get the current monotonic time and update the per cpu hrtimer base offsets from hrtimer_interrupt() to capture a consistent state of monotonic time and the offsets. The function replaces the existing ktime_get() calls in hrtimer_interrupt(). The overhead of the new function vs. ktime_get() is minimal as it just adds two store operations. This ensures that any changes to realtime or boottime offsets are noticed and stored into the per-cpu hrtimer base structures, prior to any hrtimer expiration and guarantees that timers are not expired early. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11timekeeping: Provide hrtimer update functionThomas Gleixner
To finally fix the infamous leap second issue and other race windows caused by functions which change the offsets between the various time bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a function which atomically gets the current monotonic time and updates the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic overhead. The previous patch which provides ktime_t offsets allows us to make this function almost as cheap as ktime_get() which is going to be replaced in hrtimer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11hrtimers: Move lock held region in hrtimer_interrupt()Thomas Gleixner
We need to update the base offsets from this code and we need to do that under base->lock. Move the lock held region around the ktime_get() calls. The ktime_get() calls are going to be replaced with a function which gets the time and the offsets atomically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11timekeeping: Maintain ktime_t based offsets for hrtimersThomas Gleixner
We need to update the hrtimer clock offsets from the hrtimer interrupt context. To avoid conversions from timespec to ktime_t maintain a ktime_t based representation of those offsets in the timekeeper. This puts the conversion overhead into the code which updates the underlying offsets and provides fast accessible values in the hrtimer interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11timekeeping: Fix leapsecond triggered load spike issueJohn Stultz
The timekeeping code misses an update of the hrtimer subsystem after a leap second happened. Due to that timers based on CLOCK_REALTIME are either expiring a second early or late depending on whether a leap second has been inserted or deleted until an operation is initiated which causes that update. Unless the update happens by some other means this discrepancy between the timekeeping and the hrtimer data stays forever and timers are expired either early or late. The reported immediate workaround - $ data -s "`date`" - is causing a call to clock_was_set() which updates the hrtimer data structures. See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix Add the missing clock_was_set() call to update_wall_time() in case of a leap second event. The actual update is deferred to softirq context as the necessary smp function call cannot be invoked from hard interrupt context. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11hrtimer: Provide clock_was_set_delayed()John Stultz
clock_was_set() cannot be called from hard interrupt context because it calls on_each_cpu(). For fixing the widely reported leap seconds issue it is necessary to call it from hard interrupt context, i.e. the timer tick code, which does the timekeeping updates. Provide a new function which denotes it in the hrtimer cpu base structure of the cpu on which it is called and raise the hrtimer softirq. We then execute the clock_was_set() notificiation from softirq context in run_hrtimer_softirq(). The hrtimer softirq is rarely used, so polling the flag there is not a performance issue. [ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get rid of all this ifdeffery ASAP ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11Merge tag 'driver-core-3.5-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull printk fixes from Greg Kroah-Hartman: "Here are some more printk fixes for 3.5-rc6. They resolve all known outstanding issues with the printk changes that have been happening. They have been tested by the people reporting the problems. This hopefully should be it for the printk stuff for 3.5-final. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kmsg: merge continuation records while printing kmsg: /proc/kmsg - support reading of partial log records kmsg: make sure all messages reach a newly registered boot console kmsg: properly handle concurrent non-blocking read() from /proc/kmsg kmsg: add the facility number to the syslog prefix kmsg: escape the backslash character while exporting data printk: replacing the raw_spin_lock/unlock with raw_spin_lock/unlock_irq
2012-07-09kmsg: merge continuation records while printingKay Sievers
In (the unlikely) case our continuation merge buffer is busy, we unfortunately can not merge further continuation printk()s into a single record and have to store them separately, which leads to split-up output of these lines when they are printed. Add some flags about newlines and prefix existence to these records and try to reconstruct the full line again, when the separated records are printed. Reported-By: Michael Neuling <mikey@neuling.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Tested-By: Michael Neuling <mikey@neuling.org> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-09kmsg: /proc/kmsg - support reading of partial log recordsKay Sievers
Restore support for partial reads of any size on /proc/kmsg, in case the supplied read buffer is smaller than the record size. Some people seem to think is is ia good idea to run: $ dd if=/proc/kmsg bs=1 of=... as a klog bridge. Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44211 Reported-by: Jukka Ollila <jiiksteri@gmail.com> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-07cgroup: fix cgroup hierarchy umount raceTejun Heo
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal optional" allowed a css to linger after the associated cgroup is removed. As a css holds a reference on the cgroup's dentry, it means that cgroup dentries may linger for a while. Destroying a superblock which has dentries with positive refcnts is a critical bug and triggers BUG() in vfs code. As each cgroup dentry holds an s_active reference, any lingering cgroup has both its dentry and the superblock pinned and thus preventing premature release of superblock. Unfortunately, after 48ddbe1946, there's a small window while releasing a cgroup which is directly under the root of the hierarchy. When a cgroup directory is released, vfs layer first deletes the corresponding dentry and then invokes dput() on the parent, which may recurse further, so when a cgroup directly below root cgroup is released, the cgroup is first destroyed - which releases the s_active it was holding - and then the dentry for the root cgroup is dput(). This creates a window where the root dentry's refcnt isn't zero but superblock's s_active is. If umount happens before or during this window, vfs will see the root dentry with non-zero refcnt and trigger BUG(). Before 48ddbe1946, this problem didn't exist because the last dentry reference was guaranteed to be put synchronously from rmdir(2) invocation which holds s_active around the whole process. Fix it by holding an extra superblock->s_active reference across dput() from css release, which is the dput() path added by 48ddbe1946 and the only one which doesn't hold an extra s_active ref across the final cgroup dput(). Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4FEEA5CB.8070809@huawei.com> Reported-by: shyju pv <shyju.pv@huawei.com> Tested-by: shyju pv <shyju.pv@huawei.com> Cc: Sasha Levin <levinsasha928@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com>