summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2010-10-26oom: add per-mm oom disable countYing Han
It's pointless to kill a task if another thread sharing its mm cannot be killed to allow future memory freeing. A subsequent patch will prevent kills in such cases, but first it's necessary to have a way to flag a task that shares memory with an OOM_DISABLE task that doesn't incur an additional tasklist scan, which would make select_bad_process() an O(n^2) function. This patch adds an atomic counter to struct mm_struct that follows how many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN. They cannot be killed by the kernel, so their memory cannot be freed in oom conditions. This only requires task_lock() on the task that we're operating on, it does not require mm->mmap_sem since task_lock() pins the mm and the operation is atomic. [rientjes@google.com: changelog and sys_unshare() code] [rientjes@google.com: protect oom_disable_count with task_lock in fork] [rientjes@google.com: use old_mm for oom_disable_count in exec] Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-25MN10300: Fix the PERCPU() alignment to allow for workqueuesDavid Howells
In the MN10300 arch, we occasionally see an assertion being tripped in alloc_cwqs() at the following line: /* just in case, make sure it's actually aligned */ ---> BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align)); return wq->cpu_wq.v ? 0 : -ENOMEM; The values are: wa->cpu_wq.v => 0x902776e0 align => 0x100 and align is calculated by the following: const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS, __alignof__(unsigned long long)); This is because the pointer in question (wq->cpu_wq.v) loses some of its lower bits to control flags, and so the object it points to must be sufficiently aligned to avoid the need to use those bits for pointing to things. Currently, 4 control bits and 4 colour bits are used in normal circumstances, plus a debugging bit if debugging is set. This requires the cpu_workqueue_struct struct to be at least 256 bytes aligned (or 512 bytes aligned with debugging). PERCPU() alignment on MN13000, however, is only 32 bytes as set in vmlinux.lds.S. So we set this to PAGE_SIZE (4096) to match most other arches and stick a comment in alloc_cwqs() for anyone else who triggers the assertion. Reported-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mark Salter <msalter@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-24Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Update broken web addresses in arch directory. Update broken web addresses in the kernel. Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget Revert "Fix typo: configuation => configuration" partially ida: document IDA_BITMAP_LONGS calculation ext2: fix a typo on comment in ext2/inode.c drivers/scsi: Remove unnecessary casts of private_data drivers/s390: Remove unnecessary casts of private_data net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data drivers/infiniband: Remove unnecessary casts of private_data drivers/gpu/drm: Remove unnecessary casts of private_data kernel/pm_qos_params.c: Remove unnecessary casts of private_data fs/ecryptfs: Remove unnecessary casts of private_data fs/seq_file.c: Remove unnecessary casts of private_data arm: uengine.c: remove C99 comments arm: scoop.c: remove C99 comments Fix typo configue => configure in comments Fix typo: configuation => configuration Fix typo interrest[ing|ed] => interest[ing|ed] Fix various typos of valid in comments ... Fix up trivial conflicts in: drivers/char/ipmi/ipmi_si_intf.c drivers/usb/gadget/rndis.c net/irda/irnet/irnet_ppp.c
2010-10-23Merge branches 'softirq-for-linus', 'x86-debug-for-linus', ↵Linus Torvalds
'x86-numa-for-linus', 'x86-quirks-for-linus', 'x86-setup-for-linus', 'x86-uv-for-linus' and 'x86-vm86-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'softirq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softirqs: Make wakeup_softirqd static * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, asm: Restore parentheses around one pushl_cfi argument x86, asm: Fix ancient-GAS workaround x86, asm: Fix CFI macro invocations to deal with shortcomings in gas * 'x86-numa-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, numa: Assign CPUs to nodes in round-robin manner on fake NUMA * 'x86-quirks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: HPET force enable for CX700 / VIA Epia LT * 'x86-setup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, setup: Use string copy operation to optimze copy in kernel compression * 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, UV: Use allocated buffer in tlb_uv.c:tunables_read() * 'x86-vm86-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, vm86: Fix preemption bug for int1 debug and int3 breakpoint handlers.
2010-10-22Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kdb,debug_core: adjust master cpu switch logic against new debug_core locking debug_core: refactor locking for master/slave cpus x86,kgdb: remove unnecessary call to kgdb_correct_hw_break() debug_core: disable hw_breakpoints on all cores in kgdb_cpu_enter() kdb,kgdb: fix sparse fixups kdb: Fix oops in kdb_unregister kdb,ftdump: Remove reference to internal kdb include kdb: Allow kernel loadable modules to add kdb shell functions debug_core: stop rcu warnings on kernel resume debug_core: move all watch dog syncs to a single function x86,kgdb: fix debugger hw breakpoint test regression in 2.6.35
2010-10-22Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: remove in_workqueue_context() workqueue: Clarify that schedule_on_each_cpu is synchronous memory_hotplug: drop spurious calls to flush_scheduled_work() shpchp: update workqueue usage pciehp: update workqueue usage isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr() workqueue: add and use WQ_MEM_RECLAIM flag workqueue: fix HIGHPRI handling in keep_working() workqueue: add queue_work and activate_work trace points workqueue: prepare for more tracepoints workqueue: implement flush[_delayed]_work_sync() workqueue: factor out start_flush_work() workqueue: cleanup flush/cancel functions workqueue: implement alloc_ordered_workqueue() Fix up trivial conflict in fs/gfs2/main.c as per Tejun
2010-10-22kdb,debug_core: adjust master cpu switch logic against new debug_core lockingJason Wessel
The kdb shell needs to enforce switching back to the original CPU that took the exception before restoring normal kernel execution. Resuming from a different CPU than what took the original exception will cause problems with spin locks that are freed from the a different processor than had taken the lock. The special logic in dbg_cpu_switch() can go away entirely with because the state of what cpus want to be masters or slaves will remain unchanged between entry and exit of the debug_core exception context. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22debug_core: refactor locking for master/slave cpusJason Wessel
For quite some time there have been problems with memory barriers and various races with NMI on multi processor systems using the kernel debugger. The algorithm for entering the kernel debug core and resuming kernel execution was racy and had several known edge case problems with attempting to debug something on a heavily loaded system using breakpoints that are hit repeatedly and quickly. The prior "locking" design entry worked as follows: * The atomic counter kgdb_active was used with atomic exchange in order to elect a master cpu out of all the cpus that may have taken a debug exception. * The master cpu increments all elements of passive_cpu_wait[]. * The master cpu issues the round up cpus message. * Each "slave cpu" that enters the debug core increments its own element in cpu_in_kgdb[]. * Each "slave cpu" spins on passive_cpu_wait[] until it becomes 0. * The master cpu debugs the system. The new scheme removes the two arrays of atomic counters and replaces them with 2 single counters. One counter is used to count the number of cpus waiting to become a master cpu (because one or more hit an exception). The second counter is use to indicate how many cpus have entered as slave cpus. The new entry logic works as follows: * One or more cpus enters via kgdb_handle_exception() and increments the masters_in_kgdb. Each cpu attempts to get the spin lock called dbg_master_lock. * The master cpu sets kgdb_active to the current cpu. * The master cpu takes the spinlock dbg_slave_lock. * The master cpu asks to round up all the other cpus. * Each slave cpu that is not already in kgdb_handle_exception() will enter and increment slaves_in_kgdb. Each slave will now spin try_locking on dbg_slave_lock. * The master cpu waits for the sum of masters_in_kgdb and slaves_in_kgdb to be equal to the sum of the online cpus. * The master cpu debugs the system. In the new design the kgdb_active can only be changed while holding dbg_master_lock. Stress testing has not turned up any further entry/exit races that existed in the prior locking design. The prior locking design suffered from atomic variables not being truly atomic (in the capacity as used by kgdb) along with memory barrier races. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Dongdong Deng <dongdong.deng@windriver.com>
2010-10-22debug_core: disable hw_breakpoints on all cores in kgdb_cpu_enter()Dongdong Deng
The slave cpus do not have the hw breakpoints disabled upon entry to the debug_core and as a result could cause unrecoverable recursive faults on badly placed breakpoints, or get out of sync with the arch specific hw breakpoint operations. This patch addresses the problem by invoking kgdb_disable_hw_debug() earlier in kgdb_enter_cpu for each cpu that enters the debug core. The hw breakpoint dis/enable flow should be: master_debug_cpu slave_debug_cpu \ / kgdb_cpu_enter | kgdb_disable_hw_debug --> uninstall pre-enabled hw_breakpoint | do add/rm dis/enable operates to hw_breakpoints on master_debug_cpu.. | correct_hw_break --> correct/install the enabled hw_breakpoint | leave_kgdb Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22kdb,kgdb: fix sparse fixupsJason Wessel
Fix the following sparse warnings: kdb_main.c:328:5: warning: symbol 'kdbgetu64arg' was not declared. Should it be static? kgdboc.c:246:12: warning: symbol 'kgdboc_early_init' was not declared. Should it be static? kgdb.c:652:26: warning: incorrect type in argument 1 (different address spaces) kgdb.c:652:26: expected void const *ptr kgdb.c:652:26: got struct perf_event *[noderef] <asn:3>*pev The one in kgdb.c required the (void * __force) because of the return code from register_wide_hw_breakpoint looking like: return (void __percpu __force *)ERR_PTR(err); Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22kdb: Fix oops in kdb_unregisterJason Wessel
Nothing should try to use kdb_commands directly as sometimes it is null. Instead, use the for_each_kdbcmd() iterator. This particular problem dates back to the initial kdb merge (2.6.35), but at that point nothing was dynamically unregistering commands from the kdb shell. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22kdb,ftdump: Remove reference to internal kdb includeJason Wessel
Now that include/linux/kdb.h properly exports all the functions required to dynamically add a kdb shell command, the reference to the private kdb header can be removed. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22kdb: Allow kernel loadable modules to add kdb shell functionsJason Wessel
In order to allow kernel modules to dynamically add a command to the kdb shell the kdb_register, kdb_register_repeat, kdb_unregister, and kdb_printf need to be exported as GPL symbols. Any kernel module that adds a dynamic kdb shell function should only need to include linux/kdb.h. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22debug_core: stop rcu warnings on kernel resumeJason Wessel
When returning from the kernel debugger reset the rcu jiffies_stall value to prevent the rcu stall detector from sending NMI events which invoke a stack dump for each cpu in the system. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22debug_core: move all watch dog syncs to a single functionJason Wessel
Move the various clock and watch dog syncs to a single function in advance of adding another sync for the rcu stall detector. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-10-22Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: vfs: make no_llseek the default vfs: don't use BKL in default_llseek llseek: automatically add .llseek fop libfs: use generic_file_llseek for simple_attr mac80211: disallow seeks in minstrel debug code lirc: make chardev nonseekable viotape: use noop_llseek raw: use explicit llseek file operations ibmasmfs: use generic_file_llseek spufs: use llseek in all file operations arm/omap: use generic_file_llseek in iommu_debug lkdtm: use generic_file_llseek in debugfs net/wireless: use generic_file_llseek in debugfs drm: use noop_llseek
2010-10-22Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits) BKL: remove BKL from freevxfs BKL: remove BKL from qnx4 autofs4: Only declare function when CONFIG_COMPAT is defined autofs: Only declare function when CONFIG_COMPAT is defined ncpfs: Lock socket in ncpfs while setting its callbacks fs/locks.c: prepare for BKL removal BKL: Remove BKL from ncpfs BKL: Remove BKL from OCFS2 BKL: Remove BKL from squashfs BKL: Remove BKL from jffs2 BKL: Remove BKL from ecryptfs BKL: Remove BKL from afs BKL: Remove BKL from USB gadgetfs BKL: Remove BKL from autofs4 BKL: Remove BKL from isofs BKL: Remove BKL from fat BKL: Remove BKL from ext2 filesystem BKL: Remove BKL from do_new_mount() BKL: Remove BKL from cgroup BKL: Remove BKL from NTFS ...
2010-10-22Merge branch 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: BKL: introduce CONFIG_BKL. dabusb: remove the BKL sunrpc: remove the big kernel lock init/main.c: remove BKL notations blktrace: remove the big kernel lock rtmutex-tester: make it build without BKL dvb-core: kill the big kernel lock dvb/bt8xx: kill the big kernel lock tlclk: remove big kernel lock fix rawctl compat ioctls breakage on amd64 and itanic uml: kill big kernel lock parisc: remove big kernel lock cris: autoconvert trivial BKL users alpha: kill big kernel lock isapnp: BKL removal s390/block: kill the big kernel lock hpet: kill BKL, add compat_ioctl
2010-10-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (71 commits) powerpc/44x: Update ppc44x_defconfig powerpc/watchdog: Make default timeout for Book-E watchdog a Kconfig option fsl_rio: Add comments for sRIO registers. powerpc/fsl-booke: Add e55xx (64-bit) smp defconfig powerpc/fsl-booke: Add p5020 DS board support powerpc/fsl-booke64: Use TLB CAMs to cover linear mapping on FSL 64-bit chips powerpc/fsl-booke: Add support for FSL Arch v1.0 MMU in setup_page_sizes powerpc/fsl-booke: Add support for FSL 64-bit e5500 core powerpc/85xx: add cache-sram support powerpc/85xx: add ngPIXIS FPGA device tree node to the P1022DS board powerpc: Fix compile error with paca code on ppc64e powerpc/fsl-booke: Add p3041 DS board support oprofile/fsl emb: Don't set MSR[PMM] until after clearing the interrupt. powerpc/fsl-booke: Add PCI device ids for P2040/P3041/P5010/P5020 QoirQ chips powerpc/mpc8xxx_gpio: Add support for 'qoriq-gpio' controllers powerpc/fsl_booke: Add support to boot from core other than 0 powerpc/p1022: Add probing for individual DMA channels powerpc/fsl_soc: Search all global-utilities nodes for rstccr powerpc: Fix invalid page flags in create TLB CAM path for PTE_64BIT powerpc/mpc83xx: Support for MPC8308 P1M board ... Fix up conflict with the generic irq_work changes in arch/powerpc/kernel/time.c
2010-10-21Merge branch 'core-memblock-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits) x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S xen: Cope with unmapped pages when initializing kernel pagetable memblock, bootmem: Round pfn properly for memory and reserved regions memblock: Annotate memblock functions with __init_memblock memblock: Allow memblock_init to be called early memblock/arm: Fix memblock_region_is_memory() typo x86, memblock: Remove __memblock_x86_find_in_range_size() memblock: Fix wraparound in find_region() x86-32, memblock: Make add_highpages honor early reserved ranges x86, memblock: Fix crashkernel allocation arm, memblock: Fix the sparsemem build memblock: Fix section mismatch warnings powerpc, memblock: Fix memblock API change fallout memblock, microblaze: Fix memblock API change fallout x86: Remove old bootmem code x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve x86: Remove not used early_res code x86, memblock: Replace e820_/_early string with memblock_ x86: Use memblock to replace early_res x86, memblock: Use memblock_debug to control debug message print out ... Fix up trivial conflicts in arch/x86/kernel/setup.c and kernel/Makefile
2010-10-21Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (278 commits) arm: remove machine_desc.io_pg_offst and .phys_io arm: use addruart macro to establish debug mappings arm: return both physical and virtual addresses from addruart arm/debug: consolidate addruart macros for CONFIG_DEBUG_ICEDCC ARM: make struct machine_desc definition coherent with its comment eukrea_mbimxsd-baseboard: Pass the correct GPIO to gpio_free cpuimx27: fix compile when ULPI is selected mach-pcm037_eet: fix compile errors Fixing ethernet driver compilation error for i.MX31 ADS board cpuimx51: update board support mx5: add cpuimx51sd module and its baseboard iomux-mx51: fix GPIO_1_xx 's IOMUX configuration imx-esdhc: update devices registration mx51: add resources for SD/MMC on i.MX51 iomux-mx51: fix SD1 and SD2's iomux configuration clock-mx51: rename CLOCK1 to CLOCK_CCGR for better readability clock-mx51: factorize clk_set_parent and clk_get_rate eukrea_mbimxsd: add support for DVI displays cpuimx25 & cpuimx35: fix OTG port registration in host mode i.MX31 and i.MX35 : fix errate TLSbo65953 and ENGcm09472 ...
2010-10-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: (26 commits) PM / Wakeup: Show wakeup sources statistics in debugfs PM: Introduce library for device-specific OPPs (v7) PM: Add sysfs attr for rechecking dev hash from PM trace PM: Lock PM device list mutex in show_dev_hash() PM / Runtime: Remove idle notification after failing suspend PM / Hibernate: Modify signature used to mark swap PM / Runtime: Reduce code duplication in core helper functions PM: Allow wakeup events to abort freezing of tasks PM: runtime: add missed pm_request_autosuspend PM / Hibernate: Make some boot messages look less scary PM / Runtime: Implement autosuspend support PM / Runtime: Add no_callbacks flag PM / Runtime: Combine runtime PM entry points PM / Runtime: Merge synchronous and async runtime routines PM / Runtime: Replace boolean arguments with bitflags PM / Runtime: Move code in drivers/base/power/runtime.c sysfs: Add sysfs_merge_group() and sysfs_unmerge_group() PM: Fix potential issue with failing asynchronous suspend PM / Wakeup: Introduce wakeup source objects and event statistics (v3) PM: Fix signed/unsigned warning in dpm_show_time() ...
2010-10-21Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (96 commits) apic, x86: Use BIOS settings for IBS and MCE threshold interrupt LVT offsets apic, x86: Check if EILVT APIC registers are available (AMD only) x86: ioapic: Call free_irte only if interrupt remapping enabled arm: Use ARCH_IRQ_INIT_FLAGS genirq, ARM: Fix boot on ARM platforms genirq: Fix CONFIG_GENIRQ_NO_DEPRECATED=y build x86: Switch sparse_irq allocations to GFP_KERNEL genirq: Switch sparse_irq allocator to GFP_KERNEL genirq: Make sparse_lock a mutex x86: lguest: Use new irq allocator genirq: Remove the now unused sparse irq leftovers genirq: Sanitize dynamic irq handling genirq: Remove arch_init_chip_data() x86: xen: Sanitise sparse_irq handling x86: Use sane enumeration x86: uv: Clean up the direct access to irq_desc x86: Make io_apic.c local functions static genirq: Remove irq_2_iommu x86: Speed up the irq_remapped check in hot pathes intr_remap: Simplify the code further ... Fix up trivial conflicts in arch/x86/Kconfig
2010-10-21Merge branch 'timers-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ntp: Clamp PLL update interval
2010-10-21Merge branch 'futexes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix kernel-doc notation & typos futex: Add lock context annotations futex: Mark restart_block.futex.uaddr[2] __user futex: Change 3rd arg of fetch_robust_entry() to unsigned int*
2010-10-21Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix errors in nested key ref-counting
2010-10-21Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits) sched: Export account_system_vtime() sched: Call tick_check_idle before __irq_enter sched: Remove irq time from available CPU power sched: Do not account irq time to current task x86: Add IRQ_TIME_ACCOUNTING sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time sched: Add a PF flag for ksoftirqd identification sched: Consolidate account_system_vtime extern declaration sched: Fix softirq time accounting sched: Drop group_capacity to 1 only if local group has extra capacity sched: Force balancing on newidle balance if local group has capacity sched: Set group_imb only a task can be pulled from the busiest cpu sched: Do not consider SCHED_IDLE tasks to be cache hot sched: Drop all load weight manipulation for RT tasks sched: Create special class for stop/migrate work sched: Unindent labels sched: Comment updates: fix default latency and granularity numbers tracing/sched: Add sched_pi_setprio tracepoint sched: Give CPU bound RT tasks preference sched: Try not to migrate higher priority RT tasks ...
2010-10-21Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (163 commits) tracing: Fix compile issue for trace_sched_wakeup.c [S390] hardirq: remove pointless header file includes [IA64] Move local_softirq_pending() definition perf, powerpc: Fix power_pmu_event_init to not use event->ctx ftrace: Remove recursion between recordmcount and scripts/mod/empty jump_label: Add COND_STMT(), reducer wrappery perf: Optimize sw events perf: Use jump_labels to optimize the scheduler hooks jump_label: Add atomic_t interface jump_label: Use more consistent naming perf, hw_breakpoint: Fix crash in hw_breakpoint creation perf: Find task before event alloc perf: Fix task refcount bugs perf: Fix group moving irq_work: Add generic hardirq context callbacks perf_events: Fix transaction recovery in group_sched_in() perf_events: Fix bogus AMD64 generic TLB events perf_events: Fix bogus context time tracking tracing: Remove parent recording in latency tracer graph options tracing: Use one prologue for the preempt irqs off tracer function tracers ...
2010-10-21Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits) sched: fix RCU lockdep splat from task_group() rcu: using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value sched: suppress RCU lockdep splat in task_fork_fair net: suppress RCU lockdep false positive in sock_update_classid rcu: move check from rcu_dereference_bh to rcu_read_lock_bh_held rcu: Add advice to PROVE_RCU_REPEATEDLY kernel config parameter rcu: Add tracing data to support queueing models rcu: fix sparse errors in rcutorture.c rcu: only one evaluation of arg in rcu_dereference_check() unless sparse kernel: Remove undead ifdef CONFIG_DEBUG_LOCK_ALLOC rcu: fix _oddness handling of verbose stall warnings rcu: performance fixes to TINY_PREEMPT_RCU callback checking rcu: upgrade stallwarn.txt documentation for CPU-bound RT processes vhost: add __rcu annotations rcu: add comment stating that list_empty() applies to RCU-protected lists rcu: apply TINY_PREEMPT_RCU read-side speedup to TREE_PREEMPT_RCU rcu: combine duplicate code, courtesy of CONFIG_PREEMPT_RCU rcu: Upgrade srcu_read_lock() docbook about SRCU grace periods rcu: document ways of stalling updates in low-memory situations rcu: repair code-duplication FIXMEs ...
2010-10-21Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Check the depth of subclass lockdep: Add improved subclass caching affs: Use sema_init instead of init_MUTEX hfs: Convert tree_lock to mutex arm: Bcmring: semaphore cleanup printk: Make console_sem a semaphore not a pseudo mutex drivers/macintosh/adb: Do not claim that the semaphore is a mutex parport: Semaphore cleanup irda: Semaphore cleanup net: Wan/cosa.c: Convert "mutex" to semaphore net: Ppp_async: semaphore cleanup hamradio: Mkiss: semaphore cleanup hamradio: 6pack: semaphore cleanup net: 3c527: semaphore cleanup input: Serio/hp_sdc: semaphore cleanup input: Serio/hil_mlc: semaphore cleanup input: Misc/hp_sdc_rtc: semaphore cleanup lockup_detector: Make callback function static lockup detector: Fix grammar by adding a missing "to" in the comments lockdep: Remove __debug_show_held_locks
2010-10-21security: remove unused parameter from security_task_setscheduler()KOSAKI Motohiro
All security modules shouldn't change sched_param parameter of security_task_setscheduler(). This is not only meaningless, but also make a harmful result if caller pass a static variable. This patch remove policy and sched_param parameter from security_task_setscheduler() becuase none of security module is using it. Cc: James Morris <jmorris@namei.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-10-20Merge branch 'linus' into irq/coreIngo Molnar
Merge reason: update to almost-final-.36 Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-19Merge branch 'devel-stable' into develRussell King
2010-10-19Merge branch 'tip/perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
2010-10-19tracing: Fix compile issue for trace_sched_wakeup.cSteven Rostedt
The function start_func_tracer() was incorrectly added in the #ifdef CONFIG_FUNCTION_TRACER condition, but is still used even when function tracing is not enabled. The calls to register_ftrace_function() and register_ftrace_graph() become nops (and their arguments are even ignored), thus there is no reason to hide start_func_tracer() when function tracing is not enabled. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-10-19futex: Fix errors in nested key ref-countingDarren Hart
futex_wait() is leaking key references due to futex_wait_setup() acquiring an additional reference via the queue_lock() routine. The nested key ref-counting has been masking bugs and complicating code analysis. queue_lock() is only called with a previously ref-counted key, so remove the additional ref-counting from the queue_(un)lock() functions. Also futex_wait_requeue_pi() drops one key reference too many in unqueue_me_pi(). Remove the key reference handling from unqueue_me_pi(). This was paired with a queue_lock() in futex_lock_pi(), so the count remains unchanged. Document remaining nested key ref-counting sites. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Reported-and-tested-by: Matthieu Fertré<matthieu.fertre@kerlabs.com> Reported-by: Louis Rilling<louis.rilling@kerlabs.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <4CBB17A8.70401@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
2010-10-19blktrace: remove the big kernel lockArnd Bergmann
According to Jens, this code does not need the BKL at all, it is sufficiently serialized by bd_mutex. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2010-10-19rtmutex-tester: make it build without BKLArnd Bergmann
The big kernel lock is going away, so make sure that if it is disabled by Kconfig, we do not try to validate it, which would result in compile errors. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2010-10-19workqueue: remove in_workqueue_context()Tejun Heo
Commit a25909a4 (lockdep: Add an in_workqueue_context() lockdep-based test function) added in_workqueue_context() but there hasn't been any in-kernel user and the lockdep annotation in workqueue is scheduled to change. Remove the unused function. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-10-19workqueue: Clarify that schedule_on_each_cpu is synchronousTejun Heo
The documentation for schedule_on_each_cpu() states that it calls a function on each online CPU from keventd. This can easily be interpreted as an asyncronous call because the description does not mention that flush_work is called. Clarify that it is synchronous. tj: rephrased a bit Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2010-10-18Merge branches 'at91', 'dcache', 'ftrace', 'hwbpt', 'misc', 'mmci', 's3c', ↵Russell King
'st-ux' and 'unwind' into devel
2010-10-18sched: Export account_system_vtime()Ingo Molnar
KVM uses it for example: ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined! Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Call tick_check_idle before __irq_enterVenkatesh Pallipadi
When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle. Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing. But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a: irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems. Fixed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Remove irq time from available CPU powerVenkatesh Pallipadi
The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and remaining 7 tasks are spread around among other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Do not account irq time to current taskVenkatesh Pallipadi
Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise. Change sched task accounting to acoount only actual task time from currently running task. Now update_curr(), modifies the delta_exec to depend on rq->clock_task. Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats for later. This change will impact scheduling behavior in interrupt heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing. CPU 3 spending around 35% time running nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task as remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq timeVenkatesh Pallipadi
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Add a PF flag for ksoftirqd identificationVenkatesh Pallipadi
To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Fix softirq time accountingVenkatesh Pallipadi
Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well. As account_system_time() in normal tick based accouting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds API in_serving_softirq() which returns whether we are currently processing softirq or not. Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. Looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Drop group_capacity to 1 only if local group has extra capacityNikhil Rao
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible with a large weight task outweighs the tasks on the system). For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain: - find_busiest_group() will always pick the sched group containing the niced task to be the busiest group. - find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance). - The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed. - It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu. The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task. When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight. In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity. This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways: - In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings. - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity. - When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18sched: Force balancing on newidle balance if local group has capacityNikhil Rao
This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>