summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2012-12-11mm: numa: Create basic numa page hinting infrastructureMel Gorman
Note: This patch started as "mm/mpol: Create special PROT_NONE infrastructure" and preserves the basic idea but steals *very* heavily from "autonuma: numa hinting page faults entry points" for the actual fault handlers without the migration parts. The end result is barely recognisable as either patch so all Signed-off and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with this version, I will re-add the signed-offs-by to reflect the history. In order to facilitate a lazy -- fault driven -- migration of pages, create a special transient PAGE_NUMA variant, we can then use the 'spurious' protection faults to drive our migrations from. The meaning of PAGE_NUMA depends on the architecture but on x86 it is effectively PROT_NONE. Actual PROT_NONE mappings will not generate these NUMA faults for the reason that the page fault code checks the permission on the VMA (and will throw a segmentation fault on actual PROT_NONE mappings), before it ever calls handle_mm_fault. [dhillf@gmail.com: Fix typo] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pteAndrea Arcangeli
When we split a transparent hugepage, transfer the NUMA type from the pmd to the pte if needed. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: numa: Support NUMA hinting page faults from gup/gup_fastAndrea Arcangeli
Introduce FOLL_NUMA to tell follow_page to check pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do so because it always invokes handle_mm_fault and retries the follow_page later. KVM secondary MMU page faults will trigger the NUMA hinting page faults through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault. Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages. [ This patch was picked up from the AutoNUMA tree. ] Originally-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ ported to this tree. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: compaction: Add scanned and isolated counters for compactionMel Gorman
Compaction already has tracepoints to count scanned and isolated pages but it requires that ftrace be enabled and if that information has to be written to disk then it can be disruptive. This patch adds vmstat counters for compaction called compact_migrate_scanned, compact_free_scanned and compact_isolated. With these counters, it is possible to define a basic cost model for compaction. This approximates of how much work compaction is doing and can be compared that with an oprofile showing TLB misses and see if the cost of compaction is being offset by THP for example. Minimally a compaction patch can be evaluated in terms of whether it increases or decreases cost. The basic cost model looks like this Fundamental unit u: a word sizeof(void *) Ca = cost of struct page access = sizeof(struct page) / u Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2 Cmf = Cost migrate failure = Ca * 2 Ci = Cost page isolation = (Ca + Wi) where Wi is a constant that should reflect the approximate cost of the locking operation. Csm = Cost migrate scanning = Ca Csf = Cost free scanning = Ca Overall cost = (Csm * compact_migrate_scanned) + (Csf * compact_free_scanned) + (Ci * compact_isolated) + (Cmc * pgmigrate_success) + (Cmf * pgmigrate_failed) Where the values are read from /proc/vmstat. This is very basic and ignores certain costs such as the allocation cost to do a migrate page copy but any improvement to the model would still use the same vmstat counters. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: migrate: Add a tracepoint for migrate_pagesMel Gorman
The pgmigrate_success and pgmigrate_fail vmstat counters tells the user about migration activity but not the type or the reason. This patch adds a tracepoint to identify the type of page migration and why the page is being migrated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: compaction: Move migration fail/success stats to migrate.cMel Gorman
The compact_pages_moved and compact_pagemigrate_failed events are convenient for determining if compaction is active and to what degree migration is succeeding but it's at the wrong level. Other users of migration may also want to know if migration is working properly and this will be particularly true for any automated NUMA migration. This patch moves the counters down to migration with the new events called pgmigrate_success and pgmigrate_fail. The compact_blocks_moved counter is removed because while it was useful for debugging initially, it's worthless now as no meaningful conclusions can be drawn from its value. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11mm: Optimize the TLB flush of sys_mprotect() and change_protection() usersIngo Molnar
Reuse the NUMA code's 'modified page protections' count that change_protection() computes and skip the TLB flush if there's no changes to a range that sys_mprotect() modifies. Given that mprotect() already optimizes the same-flags case I expected this optimization to dominantly trigger on CONFIG_NUMA_BALANCING=y kernels - but even with that feature disabled it triggers rather often. There's two reasons for that: 1) While sys_mprotect() already optimizes the same-flag case: if (newflags == oldflags) { *pprev = vma; return 0; } and this test works in many cases, but it is too sharp in some others, where it differentiates between protection values that the underlying PTE format makes no distinction about, such as PROT_EXEC == PROT_READ on x86. 2) Even where the pte format over vma flag changes necessiates a modification of the pagetables, there might be no pagetables yet to modify: they might not be instantiated yet. During a regular desktop bootup this optimization hits a couple of hundred times. During a Java test I measured thousands of hits. So this optimization improves sys_mprotect() in general, not just CONFIG_NUMA_BALANCING=y kernels. [ We could further increase the efficiency of this optimization if change_pte_range() and change_huge_pmd() was a bit smarter about recognizing exact-same-value protection masks - when the hardware can do that safely. This would probably further speed up mprotect(). ] Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11mm: Count the number of pages affected in change_protection()Peter Zijlstra
This will be used for three kinds of purposes: - to optimize mprotect() - to speed up working set scanning for working set areas that have not been touched - to more accurately scan per real working set No change in functionality from this patch. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11mm: Check if PTE is already allocated during page faultMel Gorman
With transparent hugepage support, handle_mm_fault() has to be careful that a normal PMD has been established before handling a PTE fault. To achieve this, it used __pte_alloc() directly instead of pte_alloc_map as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is called once it is known the PMD is safe. pte_alloc_map() is smart enough to check if a PTE is already present before calling __pte_alloc but this check was lost. As a consequence, PTEs may be allocated unnecessarily and the page table lock taken. Thi useless PTE does get cleaned up but it's a performance hit which is visible in page_test from aim9. This patch simply re-adds the check normally done by pte_alloc_map to check if the PTE needs to be allocated before taking the page table lock. The effect is noticable in page_test from aim9. AIM9 2.6.38-vanilla 2.6.38-checkptenone creat-clo 446.10 ( 0.00%) 424.47 (-5.10%) page_test 38.10 ( 0.00%) 42.04 ( 9.37%) brk_test 52.45 ( 0.00%) 51.57 (-1.71%) exec_test 382.00 ( 0.00%) 456.90 (16.39%) fork_test 60.11 ( 0.00%) 67.79 (11.34%) MMTests Statistics: duration Total Elapsed Time (seconds) 611.90 612.22 (While this affects 2.6.38, it is a performance rather than a functional bug and normally outside the rules -stable. While the big performance differences are to a microbench, the difference in fork and exec performance may be significant enough that -stable wants to consider the patch) Reported-by: Raz Ben Yehuda <raziebe@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> [ Picked this up from the AutoNUMA tree to help it upstream and to allow apples-to-apples performance comparisons. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11mm: Only flush the TLB when clearing an accessible pteRik van Riel
If ptep_clear_flush() is called to clear a page table entry that is accessible anyway by the CPU, eg. a _PAGE_PROTNONE page table entry, there is no need to flush the TLB on remote CPUs. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11mm,generic: only flush the local TLB in ptep_set_access_flagsRik van Riel
The function ptep_set_access_flags is only ever used to upgrade access permissions to a page. That means the only negative side effect of not flushing remote TLBs is that other CPUs may incur spurious page faults, if they happen to access the same address, and still have a PTE with the old permissions cached in their TLB. Having another CPU maybe incur a spurious page fault is faster than always incurring the cost of a remote TLB flush, so replace the remote TLB flush with a purely local one. This should be safe on every architecture that correctly implements flush_tlb_fix_spurious_fault() to actually invalidate the local TLB entry that caused a page fault, as well as on architectures where the hardware invalidates TLB entries that cause page faults. In the unlikely event that you are hitting what appears to be an infinite loop of page faults, and 'git bisect' took you to this changeset, your architecture needs to implement flush_tlb_fix_spurious_fault to actually flush the TLB entry. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Michel Lespinasse <walken@google.com> Cc: Ingo Molnar <mingo@kernel.org>
2012-12-11mm/sl[aou]b: Common alignment codeChristoph Lameter
Extract the code to do object alignment from the allocators. Do the alignment calculations in slab_common so that the __kmem_cache_create functions of the allocators do not have to deal with alignment. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-12-11slab: Use the new create_boot_cache function to simplify bootstrapChristoph Lameter
Simplify setup and reduce code in kmem_cache_init(). This allows us to get rid of initarray_cache as well as the manual setup code for the kmem_cache and kmem_cache_node arrays during bootstrap. We introduce a new bootstrap state "PARTIAL" for slab that signals the creation of a kmem_cache boot cache. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-12-11slub: Use statically allocated kmem_cache boot structure for bootstrapChristoph Lameter
Simplify bootstrap by statically allocated two kmem_cache structures. These are freed after bootup is complete. Allows us to no longer worry about calculations of sizes of kmem_cache structures during bootstrap. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-12-11mm, sl[au]b: create common functions for boot slab creationChristoph Lameter
Use a special function to create kmalloc caches and use that function in SLAB and SLUB. Acked-by: Joonsoo Kim <js1304@gmail.com> Reviewed-by: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-12-11slab: Simplify bootstrapChristoph Lameter
The nodelists field in kmem_cache is pointing to the first unused object in the array field when bootstrap is complete. A problem with the current approach is that the statically sized kmem_cache structure use on boot can only contain NR_CPUS entries. If the number of nodes plus the number of cpus is greater then we would overwrite memory following the kmem_cache_boot definition. Increase the size of the array field to ensure that also the node pointers fit into the array field. Once we do that we no longer need the kmem_cache_nodelists array and we can then also use that structure elsewhere. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-12-11slub: Use correct cpu_slab on dead cpuChristoph Lameter
Pass a kmem_cache_cpu pointer into unfreeze partials so that a different kmem_cache_cpu structure than the local one can be specified. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-12-11mm: dmapool: use provided gfp flags for all dma_alloc_coherent() callsMarek Szyprowski
dmapool always calls dma_alloc_coherent() with GFP_ATOMIC flag, regardless the flags provided by the caller. This causes excessive pruning of emergency memory pools without any good reason. Additionaly, on ARM architecture any driver which is using dmapools will sooner or later trigger the following error: "ERROR: 256 KiB atomic DMA coherent pool is too small! Please increase it with coherent_pool= kernel parameter!". Increasing the coherent pool size usually doesn't help much and only delays such error, because all GFP_ATOMIC DMA allocations are always served from the special, very limited memory pool. This patch changes the dmapool code to correctly use gfp flags provided by the dmapool caller. Reported-by: Soeren Moch <smoch@web.de> Reported-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Soeren Moch <smoch@web.de> Cc: stable@vger.kernel.org
2012-12-10Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damageLinus Torvalds
This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and d7c3b937bdf45f0b844400b7bf6fd3ed50bac604. This is a revert of a revert of a revert. In addition, it reverts the even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the original commits in linux-next. It turns out that the original patch really was bogus, and that the original revert was the correct thing to do after all. We thought we had fixed the problem, and then reverted the revert, but the problem really is fundamental: waking up kswapd simply isn't the right thing to do, and direct reclaim sometimes simply _is_ the right thing to do. When certain allocations fail, we simply should try some direct reclaim, and if that fails, fail the allocation. That's the right thing to do for THP allocations, which can easily fail, and the GPU allocations want to do that too. So starting kswapd is sometimes simply wrong, and removing the flag that said "don't start kswapd" was a mistake. Let's hope we never revisit this mistake again - and certainly not this many times ;) Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-10Revert "mm: avoid waking kswapd for THP allocations when compaction is ↵Linus Torvalds
deferred or contended" This reverts commit 782fd30406ecb9d9b082816abe0c6008fc72a7b0. We are going to reinstate the __GFP_NO_KSWAPD flag that has been removed, the removal reverted, and then removed again. Making this commit a pointless fixup for a problem that was caused by the removal of __GFP_NO_KSWAPD flag. The thing is, we really don't want to wake up kswapd for THP allocations (because they fail quite commonly under any kind of memory pressure, including when there is tons of memory free), and these patches were just trying to fix up the underlying bug: the original removal of __GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD") was simply bogus. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-08mm: vmscan: fix inappropriate zone congestion clearingJohannes Weiner
commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones") removed zone watermark checks from the compaction code in kswapd but left in the zone congestion clearing, which now happens unconditionally on higher order reclaim. This messes up the reclaim throttling logic for zones with dirty/writeback pages, where zones should only lose their congestion status when their watermarks have been restored. Remove the clearing from the zone compaction section entirely. The preliminary zone check and the reclaim loop in kswapd will clear it if the zone is considered balanced. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06tmpfs: fix shared mempolicy leakMel Gorman
This fixes a regression in 3.7-rc, which has since gone into stable. Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went on expecting alloc_page_vma() to drop the refcount it had acquired. This deserves a rework: but for now fix the leak in shmem_alloc_page(). Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use the same refcounting there as in shmem_alloc_page(), delete its onstack mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() - those were invented to let swapin_readahead() make an unknown number of calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e, alloc_pages_vma() has kept refcount in balance, so now no problem. Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06mm: vmscan: do not keep kswapd looping forever due to individual ↵Johannes Weiner
uncompactable zones When a zone meets its high watermark and is compactable in case of higher order allocations, it contributes to the percentage of the node's memory that is considered balanced. This requirement, that a node be only partially balanced, came about when kswapd was desparately trying to balance tiny zones when all bigger zones in the node had plenty of free memory. Arguably, the same should apply to compaction: if a significant part of the node is balanced enough to run compaction, do not get hung up on that tiny zone that might never get in shape. When the compaction logic in kswapd is reached, we know that at least 25% of the node's memory is balanced properly for compaction (see zone_balanced and pgdat_balanced). Remove the individual zone checks that restart the kswapd cycle. Otherwise, we may observe more endless looping in kswapd where the compaction code loops back to reclaim because of a single zone and reclaim does nothing because the node is considered balanced overall. See for example https://bugzilla.redhat.com/show_bug.cgi?id=866988 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-and-tested-by: Thorsten Leemhuis <fedora@leemhuis.info> Reported-by: Jiri Slaby <jslaby@suse.cz> Tested-by: John Ellson <john.ellson@comcast.net> Tested-by: Zdenek Kabelac <zkabelac@redhat.com> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06mm: compaction: validate pfn range passed to isolate_freepages_blockMel Gorman
Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration") added a check for pfn_valid() when isolating pages for migration as the scanner does not necessarily start pageblock-aligned. Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near where it left off"), the free scanner has the same problem. This patch makes sure that the pfn range passed to isolate_freepages_block() is within the same block so that pfn_valid() checks are unnecessary. In answer to Henrik's wondering why others have not reported this: reproducing this requires a large enough hole with the right aligment to have compaction walk into a PFN range with no memmap. Size and alignment depends in the memory model - 4M for FLATMEM and 128M for SPARSEMEM on x86. It needs a "lucky" machine. Reported-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06propagate name change to comments in kernel sourceNadia Yvette Chambers
I've legally changed my name with New York State, the US Social Security Administration, et al. This patch propagates the name change and change in initials and login to comments in the kernel source as well. Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-12-05bdi: add a user-tunable cpu_list for the bdi flusher threadsJeff Moyer
In realtime environments, it may be desirable to keep the per-bdi flusher threads from running on certain cpus. This patch adds a cpu_list file to /sys/class/bdi/* to enable this. The default is to tie the flusher threads to the same numa node as the backing device (though I could be convinced to make it a mask of all cpus to avoid a change in behaviour). Thanks to Jeremy Eder for the original idea. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-02mm, percpu: Make sure percpu_alloc early parameter has an argumentCyrill Gorcunov
Otherwise we are getting a nil dereference if percpu_alloc kernel boot argument is specified without value. | [ 0.000000] BUG: unable to handle kernel NULL pointer dereference at (null) | [ 0.000000] IP: [<ffffffff81391360>] strcmp+0x10/0x30 Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-30mm: soft offline: split thp at the beginning of soft_offline_page()Naoya Horiguchi
When we try to soft-offline a thp tail page, put_page() is called on the tail page unthinkingly and VM_BUG_ON is triggered in put_compound_page(). This patch splits thp before going into the main body of soft-offlining. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30mm: avoid waking kswapd for THP allocations when compaction is deferred or ↵Mel Gorman
contended With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" reverted, Zdenek Kabelac reported the following Hmm, so it's just took longer to hit the problem and observe kswapd0 spinning on my CPU again - it's not as endless like before - but still it easily eats minutes - it helps to turn off Firefox or TB (memory hungry apps) so kswapd0 stops soon - and restart those apps again. (And I still have like >1GB of cached memory) kswapd0 R running task 0 30 2 0x00000000 Call Trace: preempt_schedule+0x42/0x60 _raw_spin_unlock+0x55/0x60 put_super+0x31/0x40 drop_super+0x22/0x30 prune_super+0x149/0x1b0 shrink_slab+0xba/0x510 The sysrq+m indicates the system has no swap so it'll never reclaim anonymous pages as part of reclaim/compaction. That is one part of the problem but not the root cause as file-backed pages could also be reclaimed. The likely underlying problem is that kswapd is woken up or kept awake for each THP allocation request in the page allocator slow path. If compaction fails for the requesting process then compaction will be deferred for a time and direct reclaim is avoided. However, if there are a storm of THP requests that are simply rejected, it will still be the the case that kswapd is awake for a prolonged period of time as pgdat->kswapd_max_order is updated each time. This is noticed by the main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead it will loopp, shrinking a small number of pages and calling shrink_slab() on each iteration. This patch defers when kswapd gets woken up for THP allocations. For !THP allocations, kswapd is always woken up. For THP allocations, kswapd is woken up iff the process is willing to enter into direct reclaim/compaction. [akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30revert "Revert "mm: remove __GFP_NO_KSWAPD""Andrew Morton
It apepars that this patch was innocent, and we hope that "mm: avoid waking kswapd for THP allocations when compaction is deferred or contended" will fix the final kswapd-spinning cause. Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30mm: vmscan: fix endless loop in kswapd balancingJohannes Weiner
Kswapd does not in all places have the same criteria for a balanced zone. Zones are only being reclaimed when their high watermark is breached, but compaction checks loop over the zonelist again when the zone does not meet the low watermark plus two times the size of the allocation. This gets kswapd stuck in an endless loop over a small zone, like the DMA zone, where the high watermark is smaller than the compaction requirement. Add a function, zone_balanced(), that checks the watermark, and, for higher order allocations, if compaction has enough free memory. Then use it uniformly to check for balanced zones. This makes sure that when the compaction watermark is not met, at least reclaim happens and progress is made - or the zone is declared unreclaimable at some point and skipped entirely. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: George Spelvin <linux@horizon.com> Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Reported-by: Tomas Racek <tracek@redhat.com> Tested-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30mm/vmemmap: fix wrong use of virt_to_pageJianguo Wu
I enable CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP, when doing memory hotremove, there is a kernel BUG at arch/x86/mm/physaddr.c:20. It is caused by free_section_usemap()->virt_to_page(), virt_to_page() is only used for kernel direct mapping address, but sparse-vmemmap uses vmemmap address, so it is going wrong here. ------------[ cut here ]------------ kernel BUG at arch/x86/mm/physaddr.c:20! invalid opcode: 0000 [#1] SMP Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod CPU 39 Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R RIP: 0010:[<ffffffff8103c908>] [<ffffffff8103c908>] __phys_addr+0x88/0x90 RSP: 0018:ffff8804440d7c08 EFLAGS: 00010006 RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c ... Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reviewd-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-30mm: compaction: fix return value of capture_free_page()Mel Gorman
Commit ef6c5be658f6 ("fix incorrect NR_FREE_PAGES accounting (appears like memory leak)") fixes a NR_FREE_PAGE accounting leak but missed the return value which was also missed by this reviewer until today. That return value is used by compaction when adding pages to a list of isolated free pages and without this follow-up fix, there is a risk of free list corruption. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-26mm: vmscan: check for fatal signals iff the process was throttledMel Gorman
Commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") introduced a check for fatal signals after a process gets throttled for network storage. The intention was that if a process was throttled and got killed that it should not trigger the OOM killer. As pointed out by Minchan Kim and David Rientjes, this check is in the wrong place and too broad. If a system is in am OOM situation and a process is exiting, it can loop in __alloc_pages_slowpath() and calling direct reclaim in a loop. As the fatal signal is pending it returns 1 as if it is making forward progress and can effectively deadlock. This patch moves the fatal_signal_pending() check after throttling to throttle_direct_reclaim() where it belongs. If the process is killed while throttled, it will return immediately without direct reclaim except now it will have TIF_MEMDIE set and will use the PFMEMALLOC reserves. Minchan pointed out that it may be better to direct reclaim before returning to avoid using the reserves because there may be pages that can easily reclaim that would avoid using the reserves. However, we do no such targetted reclaim and there is no guarantee that suitable pages are available. As it is expected that this throttling happens when swap-over-NFS is used there is a possibility that the process will instead swap which may allocate network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs case where a process can be throtted and be killed it can use the reserves to exit or it can potentially use reserves to swap a few pages and then exit. This patch takes the option of using the reserves if necessary to allow the process exit quickly. If this patch passes review it should be considered a -stable candidate for 3.6. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-26Revert "mm: remove __GFP_NO_KSWAPD"Mel Gorman
With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" reverted, Zdenek Kabelac reported the following Hmm, so it's just took longer to hit the problem and observe kswapd0 spinning on my CPU again - it's not as endless like before - but still it easily eats minutes - it helps to turn off Firefox or TB (memory hungry apps) so kswapd0 stops soon - and restart those apps again. (And I still have like >1GB of cached memory) kswapd0 R running task 0 30 2 0x00000000 Call Trace: preempt_schedule+0x42/0x60 _raw_spin_unlock+0x55/0x60 put_super+0x31/0x40 drop_super+0x22/0x30 prune_super+0x149/0x1b0 shrink_slab+0xba/0x510 The sysrq+m indicates the system has no swap so it'll never reclaim anonymous pages as part of reclaim/compaction. That is one part of the problem but not the root cause as file-backed pages could also be reclaimed. The likely underlying problem is that kswapd is woken up or kept awake for each THP allocation request in the page allocator slow path. If compaction fails for the requesting process then compaction will be deferred for a time and direct reclaim is avoided. However, if there are a storm of THP requests that are simply rejected, it will still be the the case that kswapd is awake for a prolonged period of time as pgdat->kswapd_max_order is updated each time. This is noticed by the main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead it will loopp, shrinking a small number of pages and calling shrink_slab() on each iteration. The temptation is to supply a patch that checks if kswapd was woken for THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not backed up by proper testing. As 3.7 is very close to release and this is not a bug we should release with, a safer path is to revert "mm: remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the balance_pgdat() logic in general. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-21fix incorrect NR_FREE_PAGES accounting (appears like memory leak)Dave Hansen
There have been some 3.7-rc reports of vm issues, including some kswapd bugs and, more importantly, some memory "leaks": http://www.spinics.net/lists/linux-mm/msg46187.html https://bugzilla.kernel.org/show_bug.cgi?id=50181 Commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available") took split_free_page() and reused it for the compaction code. It does something curious with capture_free_page() (previously known as split_free_page()): int capture_free_page(struct page *page, int alloc_order, ... __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order)); - /* Split into individual pages */ - set_page_refcounted(page); - split_page(page, order); + if (alloc_order != order) + expand(zone, page, alloc_order, order, + &zone->free_area[order], migratetype); Note that expand() puts the pages _back_ in the allocator, but it does not bump NR_FREE_PAGES. We "return" 'alloc_order' worth of pages, but we accounted for removing 'order' in the __mod_zone_page_state() call. For the old split_page()-style use (order==alloc_order) the bug will not trigger. But, when called from the compaction code where we occasionally get a larger page out of the buddy allocator than we need, we will run in to this. This patch simply changes the NR_FREE_PAGES manipulation to the correct 'alloc_order' instead of 'order'. I've been able to repeatedly trigger this in my testing environment. The amount "leaked" very closely tracks the imbalance I see in buddy pages vs. NR_FREE_PAGES. I have confirmed that this patch fixes the imbalance Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-19cgroup: rename ->create/post_create/pre_destroy/destroy() to ↵Tejun Heo
->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19various: Fix spelling of "asynchronous" in comments.Adam Buchbinder
"Asynchronous" is misspelled in some comments. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-11-16Merge 3.7-rc6 into char-misc-nextGreg Kroah-Hartman
2012-11-16revert "mm: fix-up zone present pages"Andrew Morton
Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages") That patch tried to fix a issue when calculating zone->present_pages, but it caused a regression on 32bit systems with HIGHMEM. With that change, reset_zone_present_pages() resets all zone->present_pages to zero, and fixup_zone_present_pages() is called to recalculate zone->present_pages when the boot allocator frees core memory pages into buddy allocator. Because highmem pages are not freed by bootmem allocator, all highmem zones' present_pages becomes zero. Various options for improving the situation are being discussed but for now, let's return to the 3.6 code. Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16tmpfs: change final i_blocks BUG to WARNINGHugh Dickins
Under a particular load on one machine, I have hit shmem_evict_inode()'s BUG_ON(inode->i_blocks), enough times to narrow it down to a particular race between swapout and eviction. It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(), and the lack of coherent locking between mapping's nrpages and shmem's swapped count. There's a window in shmem_writepage(), between lowering nrpages in shmem_delete_from_page_cache() and then raising swapped count, when the freed count appears to be +1 when it should be 0, and then the asymmetry stops it from being corrected with -1 before hitting the BUG. One answer is coherent locking: using tree_lock throughout, without info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on used_blocks makes that messier than expected. Another answer may be a further effort to eliminate the weird shmem_recalc_inode() altogether, but previous attempts at that failed. So far undecided, but for now change the BUG_ON to WARN_ON: in usual circumstances it remains a useful consistency check. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16tmpfs: fix shmem_getpage_gfp() VM_BUG_ONHugh Dickins
Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora has converted to WARNING) in shmem_getpage_gfp(): WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 Call Trace: warn_slowpath_common+0x7f/0xc0 warn_slowpath_null+0x1a/0x20 shmem_getpage_gfp+0xa5c/0xa70 shmem_fault+0x4f/0xa0 __do_fault+0x71/0x5c0 handle_pte_fault+0x97/0xae0 handle_mm_fault+0x289/0x350 __do_page_fault+0x18e/0x530 do_page_fault+0x2b/0x50 page_fault+0x28/0x30 tracesys+0xe1/0xe6 Thanks to Johannes for pointing to truncation: free_swap_and_cache() only does a trylock on the page, so the page lock we've held since before confirming swap is not enough to protect against truncation. What cleanup is needed in this case? Just delete_from_swap_cache(), which takes care of the memcg uncharge. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem addressWill Deacon
kmap_to_page returns the corresponding struct page for a virtual address of an arbitrary mapping. This works by checking whether the address falls in the pkmap region and using the pkmap page tables instead of the linear mapping if appropriate. Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is incorrectly treated as a highmem address and we can end up walking off the end of pkmap_page_table and subsequently passing junk to pte_page. This patch fixes the bound check to stay within the pkmap tables. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16mm: revert "mm: vmscan: scale number of pages reclaimed by ↵Mel Gorman
reclaim/compaction based on failures" Jiri Slaby reported the following: (It's an effective revert of "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures".) Given kswapd had hours of runtime in ps/top output yesterday in the morning and after the revert it's now 2 minutes in sum for the last 24h, I would say, it's gone. The intention of the patch in question was to compensate for the loss of lumpy reclaim. Part of the reason lumpy reclaim worked is because it aggressively reclaimed pages and this patch was meant to be a sane compromise. When compaction fails, it gets deferred and both compaction and reclaim/compaction is deferred avoid excessive reclaim. However, since commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each time and continues reclaiming which was not taken into account when the patch was developed. Attempts to address the problem ended up just changing the shape of the problem instead of fixing it. The release window gets closer and while a THP allocation failing is not a major problem, kswapd chewing up a lot of CPU is. This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures") and will be revisited in the future. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16swapfile: fix name leak in swapoffXiaotian Feng
There's a name leak introduced by commit 91a27b2a7567 ("vfs: define struct filename and have getname() return it"). Add the missing putname. [akpm@linux-foundation.org: cleanup] Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16memcg: fix hotplugged memory zone oopsHugh Dickins
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exceptionr. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way. I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16memcg: oom: fix totalpages calculation for memory.swappiness==0Michal Hocko
oom_badness() takes a totalpages argument which says how many pages are available and it uses it as a base for the score calculation. The value is calculated by mem_cgroup_get_limit which considers both limit and total_swap_pages (resp. memsw portion of it). This is usually correct but since fe35004fbf9e ("mm: avoid swapping out with swappiness==0") we do not swap when swappiness is 0 which means that we cannot really use up all the totalpages pages. This in turn confuses oom score calculation if the memcg limit is much smaller than the available swap because the used memory (capped by the limit) is negligible comparing to totalpages so the resulting score is too small if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj). A wrong process might be selected as result. The problem can be worked around by checking mem_cgroup_swappiness==0 and not considering swap at all in such a case. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16mm: fix build warning for uninitialized valueDavid Rientjes
do_wp_page() sets mmun_called if mmun_start and mmun_end were initialized and, if so, may call mmu_notifier_invalidate_range_end() with these values. This doesn't prevent gcc from emitting a build warning though: mm/memory.c: In function `do_wp_page': mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function It's much easier to initialize the variables to impossible values and do a simple comparison to determine if they were initialized to remove the bool entirely. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16mm: add anon_vma_lock to validate_mm()Michel Lespinasse
Iterating over the vma->anon_vma_chain without anon_vma_lock may cause NULL ptr deref in anon_vma_interval_tree_verify(), because the node in the chain might have been removed. BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0 PGD 4e28067 PUD 4e29067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Pid: 9050, comm: trinity-child64 Tainted: G W 3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77 RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0 Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000) Call Trace: validate_mm+0x58/0x1e0 vma_adjust+0x635/0x6b0 __split_vma.isra.22+0x161/0x220 split_vma+0x24/0x30 sys_madvise+0x5da/0x7b0 tracesys+0xe1/0xe6 RIP anon_vma_interval_tree_verify+0xc/0xa0 CR2: fffffffffffffff0 Figured out by Bob Liu. Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-15mm: export a function to get vm committed memoryK. Y. Srinivasan
It will be useful to be able to access global memory commitment from device drivers. On the Hyper-V platform, the host has a policy engine to balance the available physical memory amongst all competing virtual machines hosted on a given node. This policy engine is driven by a number of metrics including the memory commitment reported by the guests. The balloon driver for Linux on Hyper-V will use this function to retrieve guest memory commitment. This function is also used in Xen self ballooning code. [akpm@linux-foundation.org: coding-style tweak] Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>