summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2014-12-11drm: sti: remove event lock while disabling vblankBenjamin Gaignard
Stop use event_lock in vblank disable function. This was creating a dead lock. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
2014-12-11drm: sti: simplify gdp codeBenjamin Gaignard
Store the physical address at node creation time to avoid use of virt_to_dma and dma_to_virt everywhere Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
2014-12-11drm: sti: clear all mixer controlBenjamin Gaignard
Make sure that mixer control register is correctly reset before use it. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
2014-12-11drm: sti: remove gpio for HDMI hot plug detectionBenjamin Gaignard
gpio used for HDMI hot plug detection is useless, HDMI_STI register contains an hot plug detection status bit. Fix binding documentation. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
2014-12-11drm: sti: allow to change hdmi ddc i2c adapterBenjamin Gaignard
Depending of the board configuration i2c for ddc could change, this patch allow to use a phandle to specify which i2c controller to use. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
2014-12-11arm64: mm: dump: don't skip final regionMark Rutland
If the final page table entry we walk is a valid mapping, the page table dumping code will not log the region this entry is part of, as the final note_page call in ptdump_show will trigger an early return. Luckily this isn't seen on contemporary systems as they typically don't have enough RAM to extend the linear mapping right to the end of the address space. In note_page, we log a region when we reach its end (i.e. we hit an entry immediately afterwards which has different prot bits or is invalid). The final entry has no subsequent entry, so we will not log this immediately. We try to cater for this with a subsequent call to note_page in ptdump_show, but this returns early as 0 < LOWEST_ADDR, and hence we will skip a valid mapping if it spans to the final entry we note. Unlike 32-bit ARM, the pgd with the kernel mapping is never shared with user mappings, so we do not need the check to ensure we don't log user page tables. Due to the way addr is constructed in the walk_* functions, it can never be less than LOWEST_ADDR when walking the page tables, so it is not necessary to avoid dereferencing invalid table addresses. The existing checks for st->current_prot and st->marker[1].start_address are sufficient to ensure we will not print and/or dereference garbage when trying to log information. This patch removes the unnecessary check against LOWEST_ADDR, ensuring we log all regions in the kernel page table, including those which span right to the end of the address space. Cc: Kees Cook <keescook@chromium.org> Acked-by: Laura Abbott <lauraa@codeaurora.org> Acked-by: Steve Capper <steve.capper@linaro.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2014-12-11arm64: mm: dump: fix shift warningMark Rutland
When building with 48-bit VAs, it's possible to get the following warning when building the arm64 page table dumping code: arch/arm64/mm/dump.c: In function ‘walk_pgd’: arch/arm64/mm/dump.c:266:2: warning: right shift count >= width of type pgd_t *pgd = pgd_offset(mm, 0); ^ As pgd_offset is a macro and the second argument is not cast to any particular type, the zero will be given integer type by the compiler. As pgd_offset passes the pargument to pgd_index, we then try to shift the 32-bit integer by at least 39 bits (for 4k pages). Elsewhere the pgd_offset is passed a second argument of unsigned long type, so let's do the same here by passing '0UL' rather than '0'. Cc: Kees Cook <keescook@chromium.org> Acked-by: Laura Abbott <lauraa@codeaurora.org> Acked-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2014-12-11arm64: psci: Fix build breakage without PM_SLEEPKrzysztof Kozlowski
Fix build failure of defconfig when PM_SLEEP is disabled (e.g. by disabling SUSPEND) and CPU_IDLE enabled: arch/arm64/kernel/psci.c:543:2: error: unknown field ‘cpu_suspend’ specified in initializer .cpu_suspend = cpu_psci_cpu_suspend, ^ arch/arm64/kernel/psci.c:543:2: warning: initialization from incompatible pointer type [enabled by default] arch/arm64/kernel/psci.c:543:2: warning: (near initialization for ‘cpu_psci_ops.cpu_prepare’) [enabled by default] make[1]: *** [arch/arm64/kernel/psci.o] Error 1 The cpu_operations.cpu_suspend field exists only if ARM64_CPU_SUSPEND is defined, not CPU_IDLE. Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2014-12-11xen: switch to post-init routines in xen mmu.c earlierJuergen Gross
With the virtual mapped linear p2m list the post-init mmu operations must be used for setting up the p2m mappings, as in case of CONFIG_FLATMEM the init routines may trigger BUGs. paging_init() sets up all infrastructure needed to switch to the post-init mmu ops done by xen_post_allocator_init(). With the virtual mapped linear p2m list we need some mmu ops during setup of this list, so we have to switch to the correct mmu ops as soon as possible. The p2m list is usable from the beginning, just expansion requires to have established the new linear mapping. So the call of xen_remap_memory() had to be introduced, but this is not due to the mmu ops requiring this. Summing it up: calling xen_post_allocator_init() not directly after paging_init() was conceptually wrong in the beginning, it just didn't matter up to now as no functions used between the two calls needed some critical mmu ops (e.g. alloc_pte). This has changed now, so I corrected it. Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-11Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single"David Vrabel
This reverts commit 2c3fc8d26dd09b9d7069687eead849ee81c78e46. This commit broke on x86 PV because entries in the generic SWIOTLB are indexed using (pseudo-)physical address not DMA address and these are not the same in a x86 PV guest. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-12-11KVM: x86: em_ret_far overrides cplNadav Amit
commit d50eaa18039b ("KVM: x86: Perform limit checks when assigning EIP") mistakenly used zero as cpl on em_ret_far. Use the actual one. Fixes: d50eaa18039b8b848c2285478d0775335ad5e930 Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-11KVM: nVMX: Disable unrestricted mode if ept=0Bandan Das
If L0 has disabled EPT, don't advertise unrestricted mode at all since it depends on EPT to run real mode code. Fixes: 92fbc7b195b824e201d9f06f2b93105f72384d65 Cc: stable@vger.kernel.org Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-11x86/asm: Unify segment selector definesBorislav Petkov
Those are identical on 32- and 64-bit, unify them. No functional change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418127959-29902-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11x86/asm: Guard against building the 32/64-bit versions of the asm-offsets*.c ↵Borislav Petkov
file directly Sometimes it is helpful to build a kernel compilation unit directly, i.e.: make .../<filename>.i in order to look at compiler output. Since asm-offsets_{32,64}.c are included by asm-offsets.c and building them directly doesn't evaluate the macros used (thus making the preprocessor output not very useful), error out when an attempt is made to build them. Issue a hint for the user to build asm-offsets.c instead. Suggested-by: Michael Matz <matz@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Michal Marek <mmarek@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418139917-12722-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11x86_64, switch_to(): Load TLS descriptors before switching DS and ESAndy Lutomirski
Otherwise, if buggy user code points DS or ES into the TLS array, they would be corrupted after a context switch. This also significantly improves the comments and documents some gotchas in the code. Before this patch, the both tests below failed. With this patch, the es test passes, although the gsbase test still fails. ----- begin es test ----- /* * Copyright (c) 2014 Andy Lutomirski * GPL v2 */ static unsigned short GDT3(int idx) { return (idx << 3) | 3; } static int create_tls(int idx, unsigned int base) { struct user_desc desc = { .entry_number = idx, .base_addr = base, .limit = 0xfffff, .seg_32bit = 1, .contents = 0, /* Data, grow-up */ .read_exec_only = 0, .limit_in_pages = 1, .seg_not_present = 0, .useable = 0, }; if (syscall(SYS_set_thread_area, &desc) != 0) err(1, "set_thread_area"); return desc.entry_number; } int main() { int idx = create_tls(-1, 0); printf("Allocated GDT index %d\n", idx); unsigned short orig_es; asm volatile ("mov %%es,%0" : "=rm" (orig_es)); int errors = 0; int total = 1000; for (int i = 0; i < total; i++) { asm volatile ("mov %0,%%es" : : "rm" (GDT3(idx))); usleep(100); unsigned short es; asm volatile ("mov %%es,%0" : "=rm" (es)); asm volatile ("mov %0,%%es" : : "rm" (orig_es)); if (es != GDT3(idx)) { if (errors == 0) printf("[FAIL]\tES changed from 0x%hx to 0x%hx\n", GDT3(idx), es); errors++; } } if (errors) { printf("[FAIL]\tES was corrupted %d/%d times\n", errors, total); return 1; } else { printf("[OK]\tES was preserved\n"); return 0; } } ----- end es test ----- ----- begin gsbase test ----- /* * gsbase.c, a gsbase test * Copyright (c) 2014 Andy Lutomirski * GPL v2 */ static unsigned char *testptr, *testptr2; static unsigned char read_gs_testvals(void) { unsigned char ret; asm volatile ("movb %%gs:%1, %0" : "=r" (ret) : "m" (*testptr)); return ret; } int main() { int errors = 0; testptr = mmap((void *)0x200000000UL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0); if (testptr == MAP_FAILED) err(1, "mmap"); testptr2 = mmap((void *)0x300000000UL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0); if (testptr2 == MAP_FAILED) err(1, "mmap"); *testptr = 0; *testptr2 = 1; if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)testptr2 - (unsigned long)testptr) != 0) err(1, "ARCH_SET_GS"); usleep(100); if (read_gs_testvals() == 1) { printf("[OK]\tARCH_SET_GS worked\n"); } else { printf("[FAIL]\tARCH_SET_GS failed\n"); errors++; } asm volatile ("mov %0,%%gs" : : "r" (0)); if (read_gs_testvals() == 0) { printf("[OK]\tWriting 0 to gs worked\n"); } else { printf("[FAIL]\tWriting 0 to gs failed\n"); errors++; } usleep(100); if (read_gs_testvals() == 0) { printf("[OK]\tgsbase is still zero\n"); } else { printf("[FAIL]\tgsbase was corrupted\n"); errors++; } return errors == 0 ? 0 : 1; } ----- end gsbase test ----- Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: <stable@vger.kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/509d27c9fec78217691c3dad91cec87e1006b34a.1418075657.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11x86/mm: Use min() instead of min_t() in the e820 printout codeXishi Qiu
The type of "MAX_DMA_PFN" and "xXx_pfn" are both unsigned long now, so use min() instead of min_t(). Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Linux MM <linux-mm@kvack.org> Cc: <dave@sr71.net> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/5487AB3F.7050807@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11x86/mm: Fix zone ranges boot printoutXishi Qiu
This is the usual physical memory layout boot printout: ... [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal [mem 0x100000000-0xc3fffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x00099fff] [ 0.000000] node 0: [mem 0x00100000-0xbf78ffff] [ 0.000000] node 0: [mem 0x100000000-0x63fffffff] [ 0.000000] node 1: [mem 0x640000000-0xc3fffffff] ... This is the log when we set "mem=2G" on the boot cmdline: ... [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] // should be 0x7fffffff, right? [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x00099fff] [ 0.000000] node 0: [mem 0x00100000-0x7fffffff] ... This patch fixes the printout, the following log shows the right ranges: ... [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0x7fffffff] [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x00099fff] [ 0.000000] node 0: [mem 0x00100000-0x7fffffff] ... Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Linux MM <linux-mm@kvack.org> Cc: <dave@sr71.net> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/5487AB3D.6070306@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11x86/doc: Update documentation after file shufflingLuis R. Rodriguez
While at it, also refer to the 32 bit entry file. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: linux-doc@vger.kernel.org Cc: bpoirier@suse.de Link: http://lkml.kernel.org/r/1418165684-6226-1-git-send-email-mcgrof@do-not-panic.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11perf: Fix events installation during moving groupJiri Olsa
We allow PMU driver to change the cpu on which the event should be installed to. This happened in patch: e2d37cd213dc ("perf: Allow the PMU driver to choose the CPU on which to install events") This patch also forces all the group members to follow the currently opened events cpu if the group happened to be moved. This and the change of event->cpu in perf_install_in_context() function introduced in: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()") forces group members to change their event->cpu, if the currently-opened-event's PMU changed the cpu and there is a group move. Above behaviour causes problem for breakpoint events, which uses event->cpu to touch cpu specific data for breakpoints accounting. By changing event->cpu, some breakpoints slots were wrongly accounted for given cpu. Vinces's perf fuzzer hit this issue and caused following WARN on my setup: WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150() Can't find any breakpoint slot [...] This patch changes the group moving code to keep the event's original cpu. Reported-by: Vince Weaver <vince@deater.net> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vince@deater.net> Cc: Yan, Zheng <zheng.z.yan@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11perf/x86/intel/uncore: Make sure only uncore events are collectedJiri Olsa
The uncore_collect_events functions assumes that event group might contain only uncore events which is wrong, because it might contain any type of events. This bug leads to uncore framework touching 'not' uncore events, which could end up all sorts of bugs. One was triggered by Vince's perf fuzzer, when the uncore code touched breakpoint event private event space as if it was uncore event and caused BUG: BUG: unable to handle kernel paging request at ffffffff82822068 IP: [<ffffffff81020338>] uncore_assign_events+0x188/0x250 ... The code in uncore_assign_events() function was looking for event->hw.idx data while the event was initialized as a breakpoint with different members in event->hw union. This patch forces uncore_collect_events() to collect only uncore events. Reported-by: Vince Weaver <vince@deater.net> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1418243031-20367-2-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-10Merge tag 'pm+acpi-3.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "This time we have some more new material than we used to have during the last couple of development cycles. The most important part of it to me is the introduction of a unified interface for accessing device properties provided by platform firmware. It works with Device Trees and ACPI in a uniform way and drivers using it need not worry about where the properties come from as long as the platform firmware (either DT or ACPI) makes them available. It covers both devices and "bare" device node objects without struct device representation as that turns out to be necessary in some cases. This has been in the works for quite a few months (and development cycles) and has been approved by all of the relevant maintainers. On top of that, some drivers are switched over to the new interface (at25, leds-gpio, gpio_keys_polled) and some additional changes are made to the core GPIO subsystem to allow device drivers to manipulate GPIOs in the "canonical" way on platforms that provide GPIO information in their ACPI tables, but don't assign names to GPIO lines (in which case the driver needs to do that on the basis of what it knows about the device in question). That also has been approved by the GPIO core maintainers and the rfkill driver is now going to use it. Second is support for hardware P-states in the intel_pstate driver. It uses CPUID to detect whether or not the feature is supported by the processor in which case it will be enabled by default. However, it can be disabled entirely from the kernel command line if necessary. Next is support for a platform firmware interface based on ACPI operation regions used by the PMIC (Power Management Integrated Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms. That interface is used for manipulating power resources and for thermal management: sensor temperature reporting, trip point setting and so on. Also the ACPI core is now going to support the _DEP configuration information in a limited way. Basically, _DEP it supposed to reflect off-the-hierarchy dependencies between devices which may be very indirect, like when AML for one device accesses locations in an operation region handled by another device's driver (usually, the device depended on this way is a serial bus or GPIO controller). The support added this time is sufficient to make the ACPI battery driver work on Asus T100A, but it is general enough to be able to cover some other use cases in the future. Finally, we have a new cpufreq driver for the Loongson1B processor. In addition to the above, there are fixes and cleanups all over the place as usual and a traditional ACPICA update to a recent upstream release. As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for Intel platforms should be able to handle power management of the DMA engine correctly, the cpufreq-dt driver should interact with the thermal subsystem in a better way and the ACPI backlight driver should handle some more corner cases, among other things. On top of the ACPICA update there are fixes for race conditions in the ACPICA's interrupt handling code which might lead to some random and strange looking failures on some systems. In the cleanups department the most visible part is the series of commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration option. That was triggered by a discussion regarding the generic power domains code during which we realized that trying to support certain combinations of PM config options was painful and not really worth it, because nobody would use them in production anyway. For this reason, we decided to make CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the conclusion that the latter became redundant and CONFIG_PM could be used instead of it. The material here makes that replacement in a major part of the tree, but there will be at least one more batch of that in the second part of the merge window. Specifics: - Support for retrieving device properties information from ACPI _DSD device configuration objects and a unified device properties interface for device drivers (and subsystems) on top of that. As stated above, this works with Device Trees and ACPI and allows device drivers to be written in a platform firmware (DT or ACPI) agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are now going to use this new interface and the GPIO subsystem is additionally modified to allow device drivers to assign names to GPIO resources returned by ACPI _CRS objects (in case _DSD is not present or does not provide the expected data). The changes in this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam, Geert Uytterhoeven). - Support for Hardware Managed Performance States (HWP) as described in Volume 3, section 14.4, of the Intel SDM in the intel_pstate driver. CPUID is used to detect whether or not the feature is supported by the processor. If supported, it will be enabled automatically unless the intel_pstate=no_hwp switch is present in the kernel command line. From Dirk Brandewie. - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie). - Support for firmware interface based on ACPI operation regions used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR platforms for power resource control and thermal management (Aaron Lu). - Limited support for retrieving off-the-hierarchy dependencies between devices from ACPI _DEP device configuration objects and deferred probing support for the ACPI battery driver based on the _DEP information to make that driver work on Asus T100A (Lan Tianyu). - New cpufreq driver for the Loongson1B processor (Kelvin Cheung). - ACPICA update to upstream revision 20141107 which only affects tools (Bob Moore). - Fixes for race conditions in the ACPICA's interrupt handling code and in the ACPI code related to system suspend and resume (Lv Zheng and Rafael J Wysocki). - ACPI core fix for an RCU-related issue in the ioremap() regions management code that slowed down significantly after CPUs had been allowed to enter idle states even if they'd had RCU callbakcs queued and triggered some problems in certain proprietary graphics driver (and elsewhere). The fix replaces synchronize_rcu() in that code with synchronize_rcu_expedited() which makes the issue go away. From Konstantin Khlebnikov. - ACPI LPSS (Low-Power Subsystem) driver fix to handle power management of the DMA engine included into the LPSS correctly. The problem is that the DMA engine doesn't have ACPI PM support of its own and it simply is turned off when the last LPSS device having ACPI PM support goes into D3cold. To work around that, the PM domain used by the ACPI LPSS driver is redesigned so at least one device with ACPI PM support will be on as long as the DMA engine is in use. From Andy Shevchenko. - ACPI backlight driver fix to avoid using it on "Win8-compatible" systems where it doesn't work and where it was used by default by mistake (Aaron Lu). - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki, Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin Chaugule (mostly related to the upcoming ARM64 support). - Intel RAPL (Running Average Power Limit) power capping driver fixes and improvements including new processor IDs (Jacob Pan). - Generic power domains modification to power up domains after attaching devices to them to meet the expectations of device drivers and bus types assuming devices to be accessible at probe time (Ulf Hansson). - Preliminary support for controlling device clocks from the generic power domains core code and modifications of the ARM/shmobile platform to use that feature (Ulf Hansson). - Assorted minor fixes and cleanups of the generic power domains core code (Ulf Hansson, Geert Uytterhoeven). - Assorted minor fixes and cleanups of the device clocks control code in the PM core (Geert Uytterhoeven, Grygorii Strashko). - Consolidation of device power management Kconfig options by making CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter which is now redundant (Rafael J Wysocki and Kevin Hilman). That is the first batch of the changes needed for this purpose. - Core device runtime power management support code cleanup related to the execution of callbacks (Andrzej Hajda). - cpuidle ARM support improvements (Lorenzo Pieralisi). - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and Bartlomiej Zolnierkiewicz). - New cpufreq driver callback (->ready) to be executed when the cpufreq core is ready to use a given policy object and cpufreq-dt driver modification to use that callback for cooling device registration (Viresh Kumar). - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James Geboski, Tomeu Vizoso). - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate, cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao, Stefan Wahren, Petr Cvek). - OPP (Operating Performance Points) framework modification to allow OPPs to be removed too and update of a few cpufreq drivers (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added during initialization) on driver removal (Viresh Kumar). - Hibernation core fixes and cleanups (Tina Ruchandani and Markus Elfring). - PM Kconfig fix related to CPU power management (Pankaj Dubey). - cpupower tool fix (Prarit Bhargava)" * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits) i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM tools: cpupower: fix return checks for sysfs_get_idlestate_count() drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM leds: leds-gpio: Fix multiple instances registration without 'label' property iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hwrandom / exynos / PM: Use CONFIG_PM in #ifdef block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM USB / PM: Drop CONFIG_PM_RUNTIME from the USB core PM: Merge the SET*_RUNTIME_PM_OPS() macros ...
2014-12-10Documentation: Add entry for dell-laptop sysfs interfaceGabriele Mazzotta
Add the documentation for the new sysfs interface of dell-laptop that allows to configure the keyboard illumination on Dell systems. Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Darren Hart <dvhart@linux.intel.com>
2014-12-10Merge tag 'pci-v3.19-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI changes from Bjorn Helgaas: "Here are the PCI changes intended for v3.19. I don't think there's anything very exciting here, but there was a lot of MSI-related stuff coming via Thomas. Details: NUMA - Allow numa_node override via sysfs (Prarit Bhargava) Resource management - Restore detection of read-only BARs (Myron Stowe) - Shrink decoding-disabled window while sizing BARs (Myron Stowe) - Add informational printk for invalid BARs (Myron Stowe) - Remove fixed parameter in pci_iov_resource_bar() (Myron Stowe) MSI - Add pci_msi_ignore_mask to prevent writes to MSI/MSI-X Mask Bits (Yijing Wang) - Revert "PCI: Add x86_msi.msi_mask_irq() and msix_mask_irq()" (Yijing Wang) - s390/MSI: Use __msi_mask_irq() instead of default_msi_mask_irq() (Yijing Wang) Virtualization - xen: Process failure for pcifront_(re)scan_root() (Chen Gang) - Make FLR and AF FLR reset warning messages different (Gavin Shan) Generic host bridge driver - Allocate config space windows after limiting bus number range (Lorenzo Pieralisi) - Convert to DT resource parsing API (Lorenzo Pieralisi) Freescale Layerscape - Add Freescale Layerscape PCIe driver (Minghuan Lian) NVIDIA Tegra - Do not build on 64-bit ARM (Thierry Reding) - Add Kconfig help text (Thierry Reding) Renesas R-Car - Make rcar_pci static (Jingoo Han) Samsung Exynos - Add exynos prefix to add_pcie_port(), pcie_init() (Jingoo Han) ST Microelectronics SPEAr13xx - Add spear prefix to add_pcie_port(), pcie_init() (Jingoo Han) - Make spear13xx_add_pcie_port() __init (Jingoo Han) - Remove unnecessary OOM message (Jingoo Han) TI DRA7xx - Add dra7xx prefix to add_pcie_port() (Jingoo Han) - Make dra7xx_add_pcie_port() __init (Jingoo Han) TI Keystone - Make ks_dw_pcie_msi_domain_ops static (Jingoo Han) - Remove unnecessary OOM message (Jingoo Han) Miscellaneous - Delete unnecessary NULL pointer checks (Markus Elfring) - Remove unused to_hotplug_slot() (Gavin Shan) - Whitespace cleanup (Jingoo Han) - Simplify if-return sequences (Quentin Lambert)" * tag 'pci-v3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (28 commits) PCI: Remove fixed parameter in pci_iov_resource_bar() PCI: Add informational printk for invalid BARs PCI: tegra: Add Kconfig help text PCI: tegra: Do not build on 64-bit ARM PCI: spear: Remove unnecessary OOM message PCI: mvebu: Add a blank line after declarations PCI: designware: Add a blank line after declarations PCI: exynos: Remove unnecessary return statement PCI: imx6: Use tabs for indentation PCI: keystone: Remove unnecessary OOM message PCI: Remove unused and broken to_hotplug_slot() PCI: Make FLR and AF FLR reset warning messages different PCI: dra7xx: Add __init annotation to dra7xx_add_pcie_port() PCI: spear: Add __init annotation to spear13xx_add_pcie_port() PCI: spear: Rename add_pcie_port(), pcie_init() to spear13xx_add_pcie_port(), etc. PCI: dra7xx: Rename add_pcie_port() to dra7xx_add_pcie_port() PCI: layerscape: Add Freescale Layerscape PCIe driver PCI: Simplify if-return sequences PCI: Delete unnecessary NULL pointer checks PCI: Shrink decoding-disabled window while sizing BARs ...
2014-12-10Merge branch 'thinkpad-acpi' into for-nextDarren Hart
thinkpad-acpi: acpi: Remove _OSI(Linux) for ThinkPads thinkpad-acpi: Try to use full software mute control Signed-off-by: Darren Hart <dvhart@linux.intel.com>
2014-12-10Merge tag 'ktest-v3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest Pull ktest changes from Steven Rostedt: "The following ktest updates were done: - Fix handling the make kernelrelease change - Fix make_min_config that was broken by new bisect_config changes - Allow tests to undefine default options (not just being able to override them) - Print name of test (if defined) to start of test output" * tag 'ktest-v3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest: ktest: Add back "tail -1" to kernelrelease make ktest: Add name to running title ktest: Allow tests to undefine default options ktest: Fix make_min_config to handle new assign_configs call ktest: Use make -s kernelrelease
2014-12-10Merge branch 'fec-next'David S. Miller
Fugang Duan says: ==================== net: fec: driver code clean and bug fix The patch serial include code clean and bug fix: Patch#1: avoid dummy operation during suspend/resume test. Patch#2: bug fix for i.MX6SX SOC that clean all interrupt events during MAC initial process. Patch#3: before phy device link status is up, only enable MDIO bus interrupt. V2: - Modify the comment form from David's suggestion. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: fec: only enable mdio interrupt before phy device link upNimrod Andy
Before phy device link up, we only enable FEC mdio interrupt, which is more reasonable. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: fec: clear all interrupt events to support i.MX6SXNimrod Andy
For i.MX6SX FEC controller, there have interrupt mask and event field extension. To support all SOCs FEC, we clear all interrupt events during MAVC initial process. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: fec: reset fep link status in suspend functionNimrod Andy
On some i.MX6 serial boards, phy power and refrence clock are supplied or controlled by SOC. When do suspend/resume test, the power and clock are disabled, so phy device link down. For current driver, fep->link is still up status, which cause extra operation like below code. To avoid the dumy operation, we set fep->link to down when phy device is real down. ... if (fep->link) { napi_disable(&fep->napi); netif_tx_lock_bh(ndev); fec_stop(ndev); netif_tx_unlock_bh(ndev); napi_enable(&fep->napi); fep->link = phy_dev->link; status_change = 1; } ... Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10Merge tag 'trace-seq-buf-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull nmi-safe seq_buf printk update from Steven Rostedt: "This code is a fork from the trace-3.19 pull as it needed the trace_seq clean ups from that branch. This code solves the issue of performing stack dumps from NMI context. The issue is that printk() is not safe from NMI context as if the NMI were to trigger when a printk() was being performed, the NMI could deadlock from the printk() internal locks. This has been seen in practice. With lots of review from Petr Mladek, this code went through several iterations, and we feel that it is now at a point of quality to be accepted into mainline. Here's what is contained in this patch set: - Creates a "seq_buf" generic buffer utility that allows a descriptor to be passed around where functions can write their own "printk()" formatted strings into it. The generic version was pulled out of the trace_seq() code that was made specifically for tracing. - The seq_buf code was change to model the seq_file code. I have a patch (not included for 3.19) that converts the seq_file.c code over to use seq_buf.c like the trace_seq.c code does. This was done to make sure that seq_buf.c is compatible with seq_file.c. I may try to get that patch in for 3.20. - The seq_buf.c file was moved to lib/ to remove it from being dependent on CONFIG_TRACING. - The printk() was updated to allow for a per_cpu "override" of the internal calls. That is, instead of writing to the console, a call to printk() may do something else. This made it easier to allow the NMI to change what printk() does in order to call dump_stack() without needing to update that code as well. - Finally, the dump_stack from all CPUs via NMI code was converted to use the seq_buf code. The caller to trigger the NMI code would wait till all the NMIs finished, and then it would print the seq_buf data to the console safely from a non NMI context One added bonus is that this code also makes the NMI dump stack work on PREEMPT_RT kernels. As printk() includes sleeping locks on PREEMPT_RT, printk() only writes to console if the console does not use any rt_mutex converted spin locks. Which a lot do" * tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: x86/nmi: Fix use of unallocated cpumask_var_t printk/percpu: Define printk_func when printk is not defined x86/nmi: Perform a safe NMI stack trace on all CPUs printk: Add per_cpu printk func to allow printk to be diverted seq_buf: Move the seq_buf code to lib/ seq-buf: Make seq_buf_bprintf() conditional on CONFIG_BINARY_PRINTF tracing: Add seq_buf_get_buf() and seq_buf_commit() helper functions tracing: Have seq_buf use full buffer seq_buf: Add seq_buf_can_fit() helper function tracing: Add paranoid size check in trace_printk_seq() tracing: Use trace_seq_used() and seq_buf_used() instead of len tracing: Clean up tracing_fill_pipe_page() seq_buf: Create seq_buf_used() to find out how much was written tracing: Add a seq_buf_clear() helper and clear len and readpos in init tracing: Convert seq_buf fields to be like seq_file fields tracing: Convert seq_buf_path() to be like seq_path() tracing: Create seq_buf layer in trace_seq
2014-12-10net: sock: fix access via invalid file descriptorAlexei Starovoitov
0day robot reported the following crash: [ 21.233581] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007 [ 21.234709] IP: [<ffffffff8156ebda>] sk_attach_bpf+0x39/0xc2 It's due to bpf_prog_get() returning ERR_PTR. Check it properly. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10Merge tag 'ftracetest-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull ftrace self-test updates from Steven Rostedt: "Updates for the ftrace self tests: - Added kprobes on ftrace testcase - Sort test cases - Add file to hold helper functions - Use logfile name supported by busybox's mktemp - Clear trace buffer after running kprobe test - Fix show descriptions when run on dash shell - Add --verbose option for showing echo output" * tag 'ftracetest-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftracetest: Add --verbose option for showing echo output ftracetest: Fix to show descriptions on dash ftracetest: Add basic event tracing test cases ftracetest: Clear trace buffer after running kprobe testcases ftracetest: Use logfile name supported by busybox's mktemp ftracetest: Add a couple of ftrace test cases ftracetest: Add functions file that holds helper functions ftracetest: Sort testcases ftracetest: Add kprobes on ftrace testcase
2014-12-10Merge tag 'trace-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "There was a lot of clean ups and minor fixes. One of those clean ups was to the trace_seq code. It also removed the return values to the trace_seq_*() functions and use trace_seq_has_overflowed() to see if the buffer filled up or not. This is similar to work being done to the seq_file code as well in another tree. Some of the other goodies include: - Added some "!" (NOT) logic to the tracing filter. - Fixed the frame pointer logic to the x86_64 mcount trampolines - Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems. That is, the ftrace trampoline can be dynamically allocated and be called directly by functions that only have a single hook to them" * tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits) tracing: Truncated output is better than nothing tracing: Add additional marks to signal very large time deltas Documentation: describe trace_buf_size parameter more accurately tracing: Allow NOT to filter AND and OR clauses tracing: Add NOT to filtering logic ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter ftrace/x86: Get rid of ftrace_caller_setup ftrace/x86: Have save_mcount_regs macro also save stack frames if needed ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs ftrace/x86: Simplify save_mcount_regs on getting RIP ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file ftrace/x86: Have static tracing also use ftrace_caller_setup ftrace/x86: Have static function tracing always test for function graph kprobes: Add IPMODIFY flag to kprobe_ftrace_ops ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict kprobes/ftrace: Recover original IP if pre_handler doesn't change it tracing/trivial: Fix typos and make an int into a bool tracing: Deletion of an unnecessary check before iput() ...
2014-12-10net: introduce helper macro for_each_cmsghdrGu Zheng
Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating cmsghdr from msghdr, just cleanup. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10arm: omap3: twl: remove usb phy init dataHeikki Krogerus
commit dbc98635e0d4 ("phy: remove the old lookup method") removes struct phy_consumer but twl-common.c still uses the "phy_consumer" structure resulting in the following compilation warning. arch/arm/mach-omap2/twl-common.c:94:21: error: array type has incomplete element type struct phy_consumer consumers[] = { Removed using phy_consumer since twl4030 uses the new lookup method. Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2014-12-10Merge branch 'akpm' (patchbomb from Andrew)Linus Torvalds
Merge first patchbomb from Andrew Morton: - a few minor cifs fixes - dma-debug upadtes - ocfs2 - slab - about half of MM - procfs - kernel/exit.c - panic.c tweaks - printk upates - lib/ updates - checkpatch updates - fs/binfmt updates - the drivers/rtc tree - nilfs - kmod fixes - more kernel/exit.c - various other misc tweaks and fixes * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) exit: pidns: fix/update the comments in zap_pid_ns_processes() exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting exit: exit_notify: re-use "dead" list to autoreap current exit: reparent: call forget_original_parent() under tasklist_lock exit: reparent: avoid find_new_reaper() if no children exit: reparent: introduce find_alive_thread() exit: reparent: introduce find_child_reaper() exit: reparent: document the ->has_child_subreaper checks exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting exit: proc: don't try to flush /proc/tgid/task/tgid exit: release_task: fix the comment about group leader accounting exit: wait: drop tasklist_lock before psig->c* accounting exit: wait: don't use zombie->real_parent exit: wait: cleanup the ptrace_reparented() checks usermodehelper: kill the kmod_thread_locker logic usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper() fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races ...
2014-12-10make default ->i_fop have ->open() fail with ENXIOAl Viro
As it is, default ->i_fop has NULL ->open() (along with all other methods). The only case where it matters is reopening (via procfs symlink) a file that didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned to something sane (default would fail on read/write/ioctl/etc.). Unfortunately, such case exists - alloc_file() users, especially anon_get_file() ones. There we have tons of opened files of very different kinds sharing the same inode. As the result, attempt to reopen those via procfs succeeds and you get a descriptor you can't do anything with. Moreover, in case of sockets we set ->i_fop that will only be used on such reopen attempts - and put a failing ->open() into it to make sure those do not succeed. It would be simpler to put such ->open() into default ->i_fop and leave it unchanged both for anon inode (as we do anyway) and for socket ones. Result: * everything going through do_dentry_open() works as it used to * sock_no_open() kludge is gone * attempts to reopen anon-inode files fail as they really ought to * ditto for aio_private_file() * ditto for perfmon - this one actually tried to imitate sock_no_open() trick, but failed to set ->i_fop, so in the current tree reopens succeed and yield completely useless descriptor. Intent clearly had been to fail with -ENXIO on such reopens; now it actually does. * everything else that used alloc_file() keeps working - it has ->i_fop set for its inodes anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10make nameidata completely opaque outside of fs/namei.cAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10Merge branch 'nsfs' into for-nextAl Viro
2014-12-10kill proc_ns completelyAl Viro
procfs inodes need only the ns_ops part; nsfs inodes don't need it at all Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10take the targets of /proc/*/ns/* symlinks to separate fsAl Viro
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now. It's not mountable (not even registered, so it's not in /proc/filesystems, etc.). Files on it *are* bindable - we explicitly permit that in do_loopback(). This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well. get_proc_ns() is a macro now (it's simply returning ->i_private; would have been an inline, if not for header ordering headache). proc_ns_inode() is an ex-parrot. The interface used in procfs is ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops). Dentries and inodes are never hashed; a non-counting reference to dentry is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path() if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details of that mechanism. As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt; it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets from ns_get_path(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10exit: pidns: fix/update the comments in zap_pid_ns_processes()Oleg Nesterov
The comments in zap_pid_ns_processes() are not clear, we need to explain how this code actually works. 1. "Ignore SIGCHLD" looks like optimization but it is not, we also need this for correctness. 2. The comment above sys_wait4() could tell more. EXIT_ZOMBIE child is only possible if it has exited before we ignored SIGCHLD. Or if it is traced from the parent namespace, but in this case it will be reaped by debugger after detach, sys_wait4() acts as a synchronization point. 3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is outdated. Contrary to what it says we do not need to make sure they all go away after 0a01f2cc390e "pidns: Make the pidns proc mount/umount logic obvious". At the same time, we do need to wait for nr_hashed==init_pids, but the reasons are quite different and not obvious: setns(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exitingOleg Nesterov
alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it fails because disable_pid_allocation() was called by the exiting child_reaper. We could simply move get_pid_ns() down to successful return, but this fix tries to be as trivial as possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Sterling Alexander <stalexan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: exit_notify: re-use "dead" list to autoreap currentOleg Nesterov
After the previous change we can add just the exiting EXIT_DEAD task to the "dead" list and remove another release_task(tsk). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: call forget_original_parent() under tasklist_lockOleg Nesterov
Shift "release dead children" loop from forget_original_parent() to its caller, exit_notify(). It is safe to reap them even if our parent reaps us right after we drop tasklist_lock, those children no longer have any connection to the exiting task. And this allows us to avoid write_lock_irq(tasklist_lock) right after it was released by forget_original_parent(), we can simply call it with tasklist_lock held. While at it, move the comment about forget_original_parent() up to this function. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: avoid find_new_reaper() if no childrenOleg Nesterov
Now that pid_ns logic was isolated we can change forget_original_parent() to return right after find_child_reaper() when father->children is empty, there is nothing to reparent in this case. In particular this avoids find_alive_thread() and this can help if the whole process exits and it has a lot of PF_EXITING threads at the start of the thread list, this can easily lead to O(nr_threads ** 2) iterations. Trivial test case (tested under KVM, 2 CPUs): static void *tfunc(void *arg) { pause(); return NULL; } static int child(unsigned int nt) { pthread_t pt; while (nt--) assert(pthread_create(&pt, NULL, tfunc, NULL) == 0); pthread_kill(pt, SIGTRAP); pause(); return 0; } int main(int argc, const char *argv[]) { int stat; unsigned int nf = atoi(argv[1]); unsigned int nt = atoi(argv[2]); while (nf--) { if (!fork()) return child(nt); wait(&stat); assert(stat == SIGTRAP); } return 0; } $ time ./test 16 16536 shows: real user sys - 5m37.628s 0m4.437s 8m5.560s + 0m50.032s 0m7.130s 1m4.927s Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: introduce find_alive_thread()Oleg Nesterov
Add the new simple helper to factor out the for_each_thread() code in find_child_reaper() and find_new_reaper(). It can also simplify the potential PF_EXITING -> exit_state change, plus perhaps we can change this code to take SIGNAL_GROUP_EXIT into account. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: introduce find_child_reaper()Oleg Nesterov
find_new_reaper() does 2 completely different things. Not only it finds a reaper, it also updates pid_ns->child_reaper or kills the whole namespace if the caller is ->child_reaper. Now that has_child_subreaper logic doesn't depend on child_reaper check we can move that pid_ns code into a separate helper. IMHO this makes the code more clean, and this allows the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: document the ->has_child_subreaper checksOleg Nesterov
Swap the "init_task" and same_thread_group() checks. This way it is more simple to document these checks and we can remove the link to the previous discussion on lkml. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()Oleg Nesterov
Change find_new_reaper() to use for_each_thread() instead of deprecated while_each_thread(). We do not bother to check "thread != father" in the 1st loop, we can rely on PF_EXITING check. Note: this means the minor behavioural change: for_each_thread() starts from the group leader. But this should be fine, nobody should make any assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks. And this can avoid the pointless reparenting to a short-living thread While zombie leaders are not that common. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>