Age | Commit message (Collapse) | Author |
|
Some archs such as ARM want to avoid coalescing accross things such
as the lowmem/highmem boundary or similar. This provides the option
to control it via an arch callback for which a weak default is provided
which always allows coalescing.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When one of the array gets full, we resize it. After much thinking and
a few iterations of that code, I went back to on-demand resizing using
the (new) internal memblock_find_base() function, which is pretty much what
Yinghai initially proposed, though there some differences in the details.
To work this relies on the default alloc limit being set sensibly by
the architecture.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Some shuffling is needed for doing array resize so we may as well
put some sense into the ordering of the functions in the whole memblock.c
file. No code change. Added some comments.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This will be used by the array resize code and might prove useful
to some arch code as well at which point it can be made non-static.
Also add comment as to why aligning size is important
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2. Fix loss of size alignment
v3. Fix result code
|
|
It's a real PITA to have to search for it in the middle
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This function will be used to locate a free area to put the new memblock
arrays when attempting to resize them. memblock_alloc_region() is gone,
the two callsites now call memblock_add_region().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2. Fix membase_alloc_nid_region() conversion
|
|
Since we allocate one more than needed, why not do a bit of sanity checking
here to ensure we don't walk past the end of the array ?
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
their size a variable
This is in preparation for having resizable arrays.
Note that we still allocate one more than needed, this is unchanged from
the previous implementation.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Right now, both the "memory" and "reserved" memblock_type structures have
a "size" member. It represents the calculated memory size in the former
case and is unused in the latter.
This moves it out to the main memblock structure instead
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Let's not waste space and cycles on archs that don't support >32-bit
physical address space.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The RMA (RMO is a misnomer) is a concept specific to ppc64 (in fact
server ppc64 though I hijack it on embedded ppc64 for similar purposes)
and represents the area of memory that can be accessed in real mode
(aka with MMU off), or on embedded, from the exception vectors (which
is bolted in the TLB) which pretty much boils down to the same thing.
We take that out of the generic MEMBLOCK data structure and move it into
arch/powerpc where it belongs, renaming it to "RMA" while at it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This introduce memblock.current_limit which is used to limit allocations
from memblock_alloc() or memblock_alloc_base(..., MEMBLOCK_ALLOC_ACCESSIBLE).
The old MEMBLOCK_ALLOC_ANYWHERE changes value from 0 to ~(u64)0 and can still
be used with memblock_alloc_base() to allocate really anywhere.
It is -no-longer- cropped to MEMBLOCK_REAL_LIMIT which disappears.
Note to archs: I'm leaving the default limit to MEMBLOCK_ALLOC_ANYWHERE. I
strongly recommend that you ensure that you set an appropriate limit
during boot in order to guarantee that an memblock_alloc() at any time
results in something that is accessible with a simple __va().
The reason is that a subsequent patch will introduce the ability for
the array to resize itself by reallocating itself. The MEMBLOCK core will
honor the current limit when performing those allocations.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Nobody uses it anymore. It's semantics were ... weird
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: allow limited allocation before slab is online
percpu: make @dyn_size always mean min dyn_size in first chunk init functions
|
|
into for-linus
|
|
* 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (198 commits)
KVM: VMX: Fix host GDT.LIMIT corruption
KVM: MMU: using __xchg_spte more smarter
KVM: MMU: cleanup spte set and accssed/dirty tracking
KVM: MMU: don't atomicly set spte if it's not present
KVM: MMU: fix page dirty tracking lost while sync page
KVM: MMU: fix broken page accessed tracking with ept enabled
KVM: MMU: add missing reserved bits check in speculative path
KVM: MMU: fix mmu notifier invalidate handler for huge spte
KVM: x86 emulator: fix xchg instruction emulation
KVM: x86: Call mask notifiers from pic
KVM: x86: never re-execute instruction with enabled tdp
KVM: Document KVM_GET_SUPPORTED_CPUID2 ioctl
KVM: x86: emulator: inc/dec can have lock prefix
KVM: MMU: Eliminate redundant temporaries in FNAME(fetch)
KVM: MMU: Validate all gptes during fetch, not just those used for new pages
KVM: MMU: Simplify spte fetch() function
KVM: MMU: Add gpte_valid() helper
KVM: MMU: Add validate_direct_spte() helper
KVM: MMU: Add drop_large_spte() helper
KVM: MMU: Use __set_spte to link shadow pages
...
|
|
To make it fast, we steal ARM's binary search for memblock_is_memory()
and we use that to also the replace existing implementation of
memblock_is_reserved().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
memblock_region
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
All callers expect a boolean result which is true if the region
overlaps a reserved region. However, the implementation actually
returns -1 if there is no overlap, and a region index (0 based)
if there is.
Make it behave as callers (and common sense) expect.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Serialize kmem_cache_create and kmem_cache_destroy using the slub_lock. Only
possible after the use of the slub_lock during dynamic dma creation has been
removed.
Then make sure that the setup of the slab sysfs entries does not race
with kmem_cache_create and kmem_cache destroy.
If a slab cache is removed before we have setup sysfs then simply skip over
the sysfs handling.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
This reverts commit f5b801ac38a9612b380ee9a75ab1861f0594e79f.
|
|
Conflicts:
tools/perf/Makefile
tools/perf/util/hist.c
Merge reason: Resolve the conflicts and update to latest upstream.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
is_hwpoison_address accesses the page table, so the caller must hold
current->mm->mmap_sem in read mode. So fix its usage in hva_to_pfn of
kvm accordingly.
Comment is_hwpoison_address to remind other users.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Debian's ia64 autobuilders have been seeing kernel freeze or reboot
when running the gdb testsuite (Debian bug 588574): dannf bisected to
2.6.32 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1 "mm: ZERO_PAGE without
PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target.
I'd missed updating the gate_vma handling in __get_user_pages(): that
happens to use vm_normal_page() (nowadays failing on the zero page),
yet reported success even when it failed to get a page - boom when
access_process_vm() tried to copy that to its intermediate buffer.
Fix this, resisting cleanups: in particular, leave it for now reporting
success when not asked to get any pages - very probably safe to change,
but let's not risk it without testing exposure.
Why did ia64 crash with 16kB pages, but succeed with 64kB pages?
Because setup_gate() pads each 64kB of its gate area with zero pages.
Reported-by: Andreas Barth <aba@not.so.argh.org>
Bisected-by: dann frazier <dannf@debian.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: dann frazier <dannf@dannf.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The network developers have seen sporadic allocations resulting in objects
coming from unexpected NUMA nodes despite asking for objects from a
specific node.
This is due to get_partial() calling get_any_partial() if partial
slabs are exhausted for a node even if a node was specified and therefore
one would expect allocations only from the specified node.
get_any_partial() sporadically may return a slab from a foreign
node to gradually reduce the size of partial lists on remote nodes
and thereby reduce total memory use for a slab cache.
The behavior is controlled by the remote_defrag_ratio of each cache.
Strictly speaking this is permitted behavior since __GFP_THISNODE was
not specified for the allocation but it is certain surprising.
This patch makes sure that the remote defrag behavior only occurs
if no node was specified.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
Add a flag to force lazy_max_pages() to zero to prevent any outstanding
mapped pages. We'll need this for Xen.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nick Piggin <npiggin@suse.de>
|
|
Merge reason: Pick up the latest perf fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
numa is used
Borislav Petkov reported his 32bit numa system has problem:
[ 0.000000] Reserving total of 4c00 pages for numa KVA remap
[ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
[ 0.000000] max_pfn = 238000
[ 0.000000] 8202MB HIGHMEM available.
[ 0.000000] 885MB LOWMEM available.
[ 0.000000] mapped low ram: 0 - 375fe000
[ 0.000000] low ram: 0 - 375fe000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
[ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
[ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
[ 0.000000] BUG: unable to handle kernel paging request at 40000000
[ 0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
[ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
...
[ 0.000000] Call Trace:
[ 0.000000] [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
[ 0.000000] [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
[ 0.000000] [<c2c9149e>] ? sparse_init+0x1dc/0x499
[ 0.000000] [<c2c79118>] ? paging_init+0x168/0x1df
[ 0.000000] [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb
looks like it allocates too much high address for bootmem.
Try to cut limit with get_max_mapped()
Reported-by: Borislav Petkov <borislav.petkov@amd.com>
Tested-by: Conny Seidel <conny.seidel@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@kernel.org> [2.6.34.x]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We need lock_page_nosync() here because we have no reference to the
mapping when taking the page lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
slab has a "once every 2 second" timer for its housekeeping.
As the number of logical processors is growing, its more and more
common that this 2 second timer becomes the primary wakeup source.
This patch turns this housekeeping timer into a deferable timer,
which means that the timer does not interrupt idle, but just runs
at the next event that wakes the cpu up.
The impact is that the timer likely runs a bit later, but during the
delay no code is running so there's not all that much reason for
a difference in housekeeping to occur because of this delay.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
The description and parameters of the kmemleak API weren't obvious. This
patch adds comments clarifying the API usage.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
Introduce a new DEBUG_KMEMLEAK_DEFAULT_OFF config parameter that allows
kmemleak to be disabled by default, but enabled on the command line
via: kmemleak=on. Although a reboot is required to turn it on, its still
useful to not require a re-compile.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
There may be situations when an object is freed using a pointer inside
the memory block. Kmemleak should show more information to help with
debugging.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
friends use the early_res functions for memory management when
NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
corresponding code paths for bootmem allocations.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
|
|
The pointer to the page_cgroup table allocated in
init_section_page_cgroup() is stored in section->page_cgroup as (base -
pfn). Since this value does not point to the beginning or inside the
allocated memory block, kmemleak reports a false positive.
This was reported in bugzilla.kernel.org as #16297.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Adrien Dessemond <adrien.dessemond@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
|
|
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm
* 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
kmemleak: Add support for NO_BOOTMEM configurations
kmemleak: Annotate false positive in init_section_page_cgroup()
|
|
The cacheline with the flags is reachable from the hot paths after the
percpu allocator changes went in. So there is no need anymore to put a
flag into each slab page. Get rid of the SlubDebug flag and use
the flags in kmem_cache instead.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
If a slab cache is removed before we have setup sysfs then simply skip over
the sysfs handling.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rdreier@cisco.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
Small allocations may fail during slab bringup which is fatal. Add a BUG_ON()
so that we fail immediately rather than failing later during sysfs
processing.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
UL suffix is missing in some constants. Conform to how slab.h uses constants.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
kmalloc_node() and friends can be passed a constant -1 to indicate
that no choice was made for the node from which the object needs to
come.
Use NUMA_NO_NODE instead of -1.
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
SLOB has alloced smaller objects from their own list in reduce overall external
fragmentation and increase repeatability, free to their own list also.
This is /proc/meminfo result in my test machine:
without this patch:
===
MemTotal: 1030720 kB
MemFree: 750012 kB
Buffers: 15496 kB
Cached: 160396 kB
SwapCached: 0 kB
Active: 105024 kB
Inactive: 145604 kB
Active(anon): 74816 kB
Inactive(anon): 2180 kB
Active(file): 30208 kB
Inactive(file): 143424 kB
Unevictable: 16 kB
....
with this patch:
===
MemTotal: 1030720 kB
MemFree: 751908 kB
Buffers: 15492 kB
Cached: 160280 kB
SwapCached: 0 kB
Active: 102720 kB
Inactive: 146140 kB
Active(anon): 73168 kB
Inactive(anon): 2180 kB
Active(file): 29552 kB
Inactive(file): 143960 kB
Unevictable: 16 kB
...
The result shows an improvement of 1 MB!
And when I tested it on a embeded system with 64 MB, I found this path is never
called during kernel bootup.
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
|
|
via following scripts
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/lmb/memblock/g' \
-e 's/LMB/MEMBLOCK/g' \
$FILES
for N in $(find . -name lmb.[ch]); do
M=$(echo $N | sed 's/lmb/memblock/g')
mv $N $M
done
and remove some wrong change like lmbench and dlmb etc.
also move memblock.c from lib/ to mm/
Suggested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Current x86 ioremap() doesn't handle physical address higher than
32-bit properly in X86_32 PAE mode. When physical address higher than
32-bit is passed to ioremap(), higher 32-bits in physical address is
cleared wrongly. Due to this bug, ioremap() can map wrong address to
linear address space.
In my case, 64-bit MMIO region was assigned to a PCI device (ioat
device) on my system. Because of the ioremap()'s bug, wrong physical
address (instead of MMIO region) was mapped to linear address space.
Because of this, loading ioatdma driver caused unexpected behavior
(kernel panic, kernel hangup, ...).
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
LKML-Reference: <4C1AE680.7090408@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|