summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2011-10-28Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits) leases: fix write-open/read-lease race nfs: drop unnecessary locking in llseek ext4: replace cut'n'pasted llseek code with generic_file_llseek_size vfs: add generic_file_llseek_size vfs: do (nearly) lockless generic_file_llseek direct-io: merge direct_io_walker into __blockdev_direct_IO direct-io: inline the complete submission path direct-io: separate map_bh from dio direct-io: use a slab cache for struct dio direct-io: rearrange fields in dio/dio_submit to avoid holes direct-io: fix a wrong comment direct-io: separate fields only used in the submission path from struct dio vfs: fix spinning prevention in prune_icache_sb vfs: add a comment to inode_permission() vfs: pass all mask flags check_acl and posix_acl_permission vfs: add hex format for MAY_* flag values vfs: indicate that the permission functions take all the MAY_* flags compat: sync compat_stats with statfs. vfs: add "device" tag to /proc/self/mountstats cleanup: vfs: small comment fix for block_invalidatepage ... Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
2011-10-28vfs: iov_iter: have iov_iter_advance decrement nr_segs appropriatelyJeff Layton
Currently, when you call iov_iter_advance, then the pointer to the iovec array can be incremented, but it does not decrement the nr_segs value in the iov_iter struct. The result is a iov_iter struct with a nr_segs value that goes beyond the end of the array. While I'm not aware of anything that's specifically broken by this, it seems odd and a bit dangerous not to decrement that value. If someone were to trust the nr_segs value to be correct, then they could end up walking off the end of the array. Changing this might also provide some micro-optimization when dealing with the last iovec in an array. Many of the other routines that deal with iov_iter have optimized codepaths when nr_segs == 1. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-26Merge branches 'slab/next' and 'slub/partial' into slab/for-linusPekka Enberg
2011-10-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits) MAINTAINERS: linux-m32r is moderated for non-subscribers linux@lists.openrisc.net is moderated for non-subscribers Drop default from "DM365 codec select" choice parisc: Kconfig: cleanup Kernel page size default Kconfig: remove redundant CONFIG_ prefix on two symbols cris: remove arch/cris/arch-v32/lib/nand_init.S microblaze: add missing CONFIG_ prefixes h8300: drop puzzling Kconfig dependencies MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers tty: drop superfluous dependency in Kconfig ARM: mxc: fix Kconfig typo 'i.MX51' Fix file references in Kconfig files aic7xxx: fix Kconfig references to READMEs Fix file references in drivers/ide/ thinkpad_acpi: Fix printk typo 'bluestooth' bcmring: drop commented out line in Kconfig btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888' doc: raw1394: Trivial typo fix CIFS: Don't free volume_info->UNC until we are entirely done with it. treewide: Correct spelling of successfully in comments ...
2011-10-25Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-securityLinus Torvalds
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits) TOMOYO: Fix incomplete read after seek. Smack: allow to access /smack/access as normal user TOMOYO: Fix unused kernel config option. Smack: fix: invalid length set for the result of /smack/access Smack: compilation fix Smack: fix for /smack/access output, use string instead of byte Smack: domain transition protections (v3) Smack: Provide information for UDS getsockopt(SO_PEERCRED) Smack: Clean up comments Smack: Repair processing of fcntl Smack: Rule list lookup performance Smack: check permissions from user space (v2) TOMOYO: Fix quota and garbage collector. TOMOYO: Remove redundant tasklist_lock. TOMOYO: Fix domain transition failure warning. TOMOYO: Remove tomoyo_policy_memory_lock spinlock. TOMOYO: Simplify garbage collector. TOMOYO: Fix make namespacecheck warnings. target: check hex2bin result encrypted-keys: check hex2bin result ...
2011-10-19mm: fix race between mremap and removing migration entryHugh Dickins
I don't usually pay much attention to the stale "? " addresses in stack backtraces, but this lucky report from Pawel Sikora hints that mremap's move_ptes() has inadequate locking against page migration. 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page(): kernel BUG at include/linux/swapops.h:105! RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160 [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0 [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120 [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310 [<ffffffff81102a31>] handle_mm_fault+0x181/0x310 [<ffffffff81106097>] ? vma_adjust+0x537/0x570 [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0 [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570 [<ffffffff81421d5f>] page_fault+0x1f/0x30 mremap's down_write of mmap_sem, together with i_mmap_mutex or lock, and pagetable locks, were good enough before page migration (with its requirement that every migration entry be found) came in, and enough while migration always held mmap_sem; but not enough nowadays, when there's memory hotremove and compaction. The danger is that move_ptes() lets a migration entry dodge around behind remove_migration_pte()'s back, so it's in the old location when looking at the new, then in the new location when looking at the old. Either mremap's move_ptes() must additionally take anon_vma lock(), or migration's remove_migration_pte() must stop peeking for is_swap_entry() before it takes pagetable lock. Consensus chooses the latter: we prefer to add overhead to migration than to mremapping, which gets used by JVMs and by exec stack setup. Reported-and-tested-by: Paweł Sikora <pluto@agmk.net> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-27slub: Discard slab page when node partial > minimum partial numberAlex Shi
Discarding slab should be done when node partial > min_partial. Otherwise, node partial slab may eat up all memory. Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27slub: correct comments error for per cpu partialAlex Shi
Correct comment errors, that mistake cpu partial objects number as pages number, may make reader misunderstand. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27mm: restrict access to slab files under procfs and sysfsVasiliy Kulikov
Historically /proc/slabinfo and files under /sys/kernel/slab/* have world read permissions and are accessible to the world. slabinfo contains rather private information related both to the kernel and userspace tasks. Depending on the situation, it might reveal either private information per se or information useful to make another targeted attack. Some examples of what can be learned by reading/watching for /proc/slabinfo entries: 1) dentry (and different *inode*) number might reveal other processes fs activity. The number of dentry "active objects" doesn't strictly show file count opened/touched by a process, however, there is a good correlation between them. The patch "proc: force dcache drop on unauthorized access" relies on the privacy of dentry count. 2) different inode entries might reveal the same information as (1), but these are more fine granted counters. If a filesystem is mounted in a private mount point (or even a private namespace) and fs type differs from other mounted fs types, fs activity in this mount point/namespace is revealed. If there is a single ecryptfs mount point, the whole fs activity of a single user is revealed. Number of files in ecryptfs mount point is a private information per se. 3) fuse_* reveals number of files / fs activity of a user in a user private mount point. It is approx. the same severity as ecryptfs infoleak in (2). 4) sysfs_dir_cache similar to (2) reveals devices' addition/removal, which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo the precise number of sysfs files is known to the world. 5) buffer_head might reveal some kernel activity. With other information leaks an attacker might identify what specific kernel routines generate buffer_head activity. 6) *kmalloc* infoleaks are very situational. Attacker should watch for the specific kmalloc size entry and filter the noise related to the unrelated kernel activity. If an attacker has relatively silent victim system, he might get rather precise counters. Additional information sources might significantly increase the slabinfo infoleak benefits. E.g. if an attacker knows that the processes activity on the system is very low (only core daemons like syslog and cron), he may run setxid binaries / trigger local daemon activity / trigger network services activity / await sporadic cron jobs activity / etc. and get rather precise counters for fs and network activity of these privileged tasks, which is unknown otherwise. Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate exploitation of kernel heap overflows (and possibly, other bugs). The related discussion: http://thread.gmane.org/gmane.linux.kernel/1108378 To keep compatibility with old permission model where non-root monitoring daemon could watch for kernel memleaks though slabinfo one should do: groupadd slabinfo usermod -a -G slabinfo $MONITOR_USER And add the following commands to init scripts (to mountall.conf in Ubuntu's upstart case): chmod g+r /proc/slabinfo /sys/kernel/slab/*/* chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/* Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Reviewed-by: Kees Cook <kees@ubuntu.com> Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> CC: Valdis.Kletnieks@vt.edu CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Alan Cox <alan@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-21Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-block: floppy: use del_timer_sync() in init cleanup blk-cgroup: be able to remove the record of unplugged device block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request mm: Add comment explaining task state setting in bdi_forker_thread() mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread() block: simplify force plug flush code a little bit block: change force plug flush call order block: Fix queue_flag update when rq_affinity goes from 2 to 1 block: separate priority boosting from REQ_META block: remove READ_META and WRITE_META xen-blkback: fixed indentation and comments xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
2011-09-19Merge branch 'slab/urgent' of git://github.com/penberg/linuxLinus Torvalds
* 'slab/urgent' of git://github.com/penberg/linux: slub: add slab with one free object to partial list tail
2011-09-19Merge branch 'slab/urgent' into slab/nextPekka Enberg
2011-09-15Merge branch 'master' into for-nextJiri Kosina
Fast-forward merge with Linus to be able to merge patches based on more recent version of the tree.
2011-09-15mm: Convert vmalloc/memset to vzallocJoe Perches
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-14mm: account skipped entries to avoid looping in find_get_pagesShaohua Li
The found entries by find_get_pages() could be all swap entries. In this case we skip the entries, but make sure the skipped entries are accounted, so we don't keep looping. Using nr_found > nr_skip to simplify code as suggested by Eric. Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm: sync vmalloc address space page tables in alloc_vm_area()David Vrabel
Xen backend drivers (e.g., blkback and netback) would sometimes fail to map grant pages into the vmalloc address space allocated with alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could not find the page (in the L2 table) containing the PTEs it needed to update. (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000 netback and blkback were making the hypercall from a kernel thread where task->active_mm != &init_mm and alloc_vm_area() was only updating the page tables for init_mm. The usual method of deferring the update to the page tables of other processes (i.e., after taking a fault) doesn't work as a fault cannot occur during the hypercall. This would work on some systems depending on what else was using vmalloc. Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all() from alloc_vm_area()") and add a comment to explain why it's needed. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Keir Fraser <keir.xen@gmail.com> Cc: <stable@kernel.org> [3.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14memcg: Revert "memcg: add memory.vmscan_stat"Johannes Weiner
Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add memory.vmscan_stat"). The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and the parent hitting the limit, but not to hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children. Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm: vmscan: fix force-scanning small targets without swapJohannes Weiner
Without swap, anonymous pages are not scanned. As such, they should not count when considering force-scanning a small target if there is no swap. Otherwise, targets are not force-scanned even when their effective scan number is zero and the other conditions--kswapd/memcg--apply. This fixes 246e87a93934 ("memcg: fix get_scan_count() for small targets"). [akpm@linux-foundation.org: fix comment] Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14numa: fix NUMA compile error when sysfs and procfs are disabledDavid Rientjes
The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS, yet it is referenced for per-node vmstat with CONFIG_NUMA: drivers/built-in.o: In function `node_read_vmstat': node.c:(.text+0x1106df): undefined reference to `vmstat_text' Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper vmstats"). Define the array for CONFIG_NUMA as well. [akpm@linux-foundation.org: remove unneeded ifdefs] Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Cong Wang <amwang@redhat.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm/mempolicy.c: make copy_from_user() provably correctKAMEZAWA Hiroyuki
When compiling mm/mempolicy.c with struct user copy checks the following warning is shown: In file included from arch/x86/include/asm/uaccess.h:572, from include/linux/uaccess.h:5, from include/linux/highmem.h:7, from include/linux/pagemap.h:10, from include/linux/mempolicy.h:70, from mm/mempolicy.c:68: In function `copy_from_user', inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415: arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct LD mm/built-in.o Fix this by passing correct buffer size value. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm/mempolicy.c: fix pgoff in mbind vma mergeCaspar Zhang
commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't really fix the mbind vma merge problem due to wrong pgoff value passing to vma_merge(), which made vma_merge() always return NULL. Before the patch applied, we are getting a result like: addr = 0x7fa58f00c000 [snip] 7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0 7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0 7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0 here 7fa58f00c000->7fa58f00f000 we get 3 VMAs which are expected to be merged described as described in commit 9d8cebd. Re-testing the patched kernel with the reproducer provided in commit 9d8cebd, we get the correct result: addr = 0x7ffa5aaa2000 [snip] 7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0 7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack] Signed-off-by: Caspar Zhang <caspar@casparzhang.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-13slub: Code optimization in get_partial_node()Alex,Shi
I find a way to reduce a variable in get_partial_node(). That is also helpful for code understanding. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-02mm: Add comment explaining task state setting in bdi_forker_thread()Jan Kara
CC: Wu Fengguang <fengguang.wu@intel.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-02mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()Jan Kara
bdi_forker_thread() clears BDI_pending bit at the end of the main loop. However clearing of this bit must not be done in some cases which is handled by calling 'continue' from switch statement. That's kind of unusual construct and without a good reason so change the function into more intuitive code flow. CC: Wu Fengguang <fengguang.wu@intel.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-27slub: explicitly document position of inserting slab to partial listShaohua Li
Adding slab to partial list head/tail is sensitive to performance. So explicitly uses DEACTIVATE_TO_TAIL/DEACTIVATE_TO_HEAD to document it to avoid we get it wrong. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-27slub: add slab with one free object to partial list tailShaohua Li
The slab has just one free object, adding it to partial list head doesn't make sense. And it can cause lock contentation. For example, 1. CPU takes the slab from partial list 2. fetch an object 3. switch to another slab 4. free an object, then the slab is added to partial list again In this way n->list_lock will be heavily contended. In fact, Alex had a hackbench regression. 3.1-rc1 performance drops about 70% against 3.0. This patch fixes it. Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-25memcg: fix hierarchical oom lockingJohannes Weiner
Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than counter") tried to oom lock the hierarchy and roll back upon encountering an already locked memcg. The code is confused when it comes to detecting a locked memcg, though, so it would fail and rollback after locking one memcg and encountering an unlocked second one. The result is that oom-locking hierarchies fails unconditionally and that every oom killer invocation simply goes to sleep on the oom waitqueue forever. The tasks practically hang forever without anyone intervening, possibly holding locks that trip up unrelated tasks, too. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25vmscan: clear ZONE_CONGESTED for zone with good watermarkShaohua Li
ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any task. It's possible ZONE_CONGESTED isn't cleared in some cases: 1. the zone is already balanced just entering balance_pgdat() for order-0 because concurrent tasks free memory. In this case, later check will skip the zone as it's balanced so the flag isn't cleared. 2. high order balance fallbacks to order-0. quote from Mel: At the end of balance_pgdat(), kswapd uses the following logic; If reclaiming at high order { for each zone { if all_unreclaimable skip if watermark is not met order = 0 loop again /* watermark is met */ clear congested } } i.e. it clears ZONE_CONGESTED if it the zone is balanced. if not, it restarts balancing at order-0. However, if the higher zones are balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as that only happens after a zone is shrunk. This can mean that wait_iff_congested() stalls unnecessarily. This patch makes kswapd clear ZONE_CONGESTED during its initial highmem->dma scan for zones that are already balanced. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25mm: fix a vmscan warningShaohua Li
I get the below warning: BUG: using smp_processor_id() in preemptible [00000000] code: bash/746 caller is native_sched_clock+0x37/0x6e Pid: 746, comm: bash Tainted: G W 3.0.0+ #254 Call Trace: [<ffffffff813435c6>] debug_smp_processor_id+0xc2/0xdc [<ffffffff8104158d>] native_sched_clock+0x37/0x6e [<ffffffff81116219>] try_to_free_mem_cgroup_pages+0x7d/0x270 [<ffffffff8114f1f8>] mem_cgroup_force_empty+0x24b/0x27a [<ffffffff8114ff21>] ? sys_close+0x38/0x138 [<ffffffff8114ff21>] ? sys_close+0x38/0x138 [<ffffffff8114f257>] mem_cgroup_force_empty_write+0x17/0x19 [<ffffffff810c72fb>] cgroup_file_write+0xa8/0xba [<ffffffff811522d2>] vfs_write+0xb3/0x138 [<ffffffff8115241a>] sys_write+0x4a/0x71 [<ffffffff8114ffd9>] ? sys_close+0xf0/0x138 [<ffffffff8176deab>] system_call_fastpath+0x16/0x1b sched_clock() can't be used with preempt enabled. And we don't need fast approach to get clock here, so let's use ktime API. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25memcg: pin execution to current cpu while draining stockJohannes Weiner
Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without pages") added a drain_local_stock() call to a preemptible section. The draining task looks up the cpu-local stock twice to set the draining-flag, then to drain the stock and clear the flag again. If the task is migrated to a different CPU in between, noone will clear the flag on the first stock and it will be forever undrainable. Its charge can not be recovered and the cgroup can not be deleted anymore. Properly pin the task to the executing CPU while draining stocks. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25Merge branch 'urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: squeeze max-pause area and drop pass-good area
2011-08-24mm/vmscan.c: fix a typo in a comment "relaimed" to "reclaimed"Justin P. Mattock
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-08-19slub: per cpu cache for partial pagesChristoph Lameter
Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to partial pages. The partial page list is used in slab_free() to avoid per node lock taking. In __slab_alloc() we can then take multiple partial pages off the per node partial list in one go reducing node lock pressure. We can also use the per cpu partial list in slab_alloc() to avoid scanning partial lists for pages with free objects. The main effect of a per cpu partial list is that the per node list_lock is taken for batches of partial pages instead of individual ones. Potential future enhancements: 1. The pickup from the partial list could be perhaps be done without disabling interrupts with some work. The free path already puts the page into the per cpu partial list without disabling interrupts. 2. __slab_free() may have some code paths that could use optimization. Performance: Before After ./hackbench 100 process 200000 Time: 1953.047 1564.614 ./hackbench 100 process 20000 Time: 207.176 156.940 ./hackbench 100 process 20000 Time: 204.468 156.940 ./hackbench 100 process 20000 Time: 204.879 158.772 ./hackbench 10 process 20000 Time: 20.153 15.853 ./hackbench 10 process 20000 Time: 20.153 15.986 ./hackbench 10 process 20000 Time: 19.363 16.111 ./hackbench 1 process 20000 Time: 2.518 2.307 ./hackbench 1 process 20000 Time: 2.258 2.339 ./hackbench 1 process 20000 Time: 2.864 2.163 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19slub: return object pointer from get_partial() / new_slab().Christoph Lameter
There is no need anymore to return the pointer to a slab page from get_partial() since the page reference can be stored in the kmem_cache_cpu structures "page" field. Return an object pointer instead. That in turn allows a simplification of the spaghetti code in __slab_alloc(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19slub: pass kmem_cache_cpu pointer to get_partial()Christoph Lameter
Pass the kmem_cache_cpu pointer to get_partial(). That way we can avoid the this_cpu_write() statements. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19slub: Prepare inuse field in new_slab()Christoph Lameter
inuse will always be set to page->objects. There is no point in initializing the field to zero in new_slab() and then overwriting the value in __slab_alloc(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19slub: Remove useless statements in __slab_allocChristoph Lameter
Two statements in __slab_alloc() do not have any effect. 1. c->page is already set to NULL by deactivate_slab() called right before. 2. gfpflags are masked in new_slab() before being passed to the page allocator. There is no need to mask gfpflags in __slab_alloc in particular since most frequent processing in __slab_alloc does not require the use of a gfpmask. Cc: torvalds@linux-foundation.org Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19slub: free slabs without holding locksChristoph Lameter
There are two situations in which slub holds a lock while releasing pages: A. During kmem_cache_shrink() B. During kmem_cache_close() For A build a list while holding the lock and then release the pages later. In case of B we are the last remaining user of the slab so there is no need to take the listlock. After this patch all calls to the page allocator to free pages are done without holding any spinlocks. kmem_cache_destroy() will still hold the slub_lock semaphore. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19squeeze max-pause area and drop pass-good areaWu Fengguang
Revert the pass-good area introduced in ffd1f609ab10 ("writeback: introduce max-pause and pass-good dirty limits") and make the max-pause area smaller and safe. This fixes ~30% performance regression in the ext3 data=writeback fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes. Using deadline scheduler also has a regression, but not that big as CFQ, so this suggests we have some write starvation. The test logs show that - the disks are sometimes under utilized - global dirty pages sometimes rush high to the pass-good area for several hundred seconds, while in the mean time some bdi dirty pages drop to very low value (bdi_dirty << bdi_thresh). Then suddenly the global dirty pages dropped under global dirty threshold and bdi_dirty rush very high (for example, 2 times higher than bdi_thresh). During which time balance_dirty_pages() is not called at all. So the problems are 1) The random writes progress so slow that they break the assumption of the max-pause logic that "8 pages per 200ms is typically more than enough to curb heavy dirtiers". 2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others. 3) The higher max-pause/pass-good thresholds somehow leads to the bad swing of dirty pages. The fix is to allow the task to slightly dirty over task_bdi_thresh, but no way to exceed bdi_dirty and/or global dirty_thresh. Tests show that it fixed the JBOD regression completely (both behavior and performance), while still being able to cut down large pause times in balance_dirty_pages() for single-disk cases. Reported-by: Li Shaohua <shaohua.li@intel.com> Tested-by: Li Shaohua <shaohua.li@intel.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-08-17mm: make HASHED_PAGE_VIRTUAL page_address' struct page argument const.Ian Campbell
Followup to 33dd4e0ec911 "mm: make some struct page's const" which missed the HASHED_PAGE_VIRTUAL case. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-14mm: fix wrong vmap address calculations with odd NR_CPUS valuesClemens Ladisch
Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does address calculations under the assumption that VMAP_BLOCK_SIZE is a power of two. However, this might not be true if CONFIG_NR_CPUS is not set to a power of two. Wrong vmap_block index/offset values could lead to memory corruption. However, this has never been observed in practice (or never been diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that checks for inconsistent vmap_block indices. To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572 Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz> Reported-by: Matias A. Fonzo <selk@dragora.org> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: 2.6.28+ <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-09Revert "memcg: get rid of percpu_charge_mutex lock"Michal Hocko
This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc. The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE bit operations is sufficient but that is not true. Johannes Weiner has reported a crash during parallel memory cgroup removal: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70 Oops: 0000 [#1] PREEMPT SMP Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3 RIP: 0010:[<ffffffff81083b70>] css_is_ancestor+0x20/0x70 RSP: 0018:ffff880077b09c88 EFLAGS: 00010202 Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310) Call Trace: [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40 [<ffffffff810feccf>] drain_all_stock+0x11f/0x170 [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0 [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20 [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500 [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0 [<ffffffff81114e7b>] do_rmdir+0xfb/0x110 [<ffffffff81114ea6>] sys_rmdir+0x16/0x20 [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b We are crashing because we try to dereference cached memcg when we are checking whether we should wait for draining on the cache. The cache is already cleaned up, though. There is also a theoretical chance that the cached memcg gets freed between we test for the FLUSHING_CACHED_CHARGE and dereference it in mem_cgroup_same_or_subtree: CPU0 CPU1 CPU2 mem=stock->cached stock->cached=NULL clear_bit test_and_set_bit test_bit() ... <preempted> mem_cgroup_destroy use after free The percpu_charge_mutex protected from this race because sync draining is exclusive. It is safer to revert now and come up with a more parallel implementation later. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-09slub: Fix partial count comparison confusionChristoph Lameter
deactivate_slab() has the comparison if more than the minimum number of partial pages are in the partial list wrong. An effect of this may be that empty pages are not freed from deactivate_slab(). The result could be an OOM due to growth of the partial slabs per node. Frees mostly occur from __slab_free which is okay so this would only affect use cases where a lot of switching around of per cpu slabs occur. Switching per cpu slabs occurs with high frequency if debugging options are enabled. Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09slub: fix check_bytes() for slub debuggingAkinobu Mita
The check_bytes() function is used by slub debugging. It returns a pointer to the first unmatching byte for a character in the given memory area. If the character for matching byte is greater than 0x80, check_bytes() doesn't work. Becuase 64-bit pattern is generated as below. value64 = value | value << 8 | value << 16 | value << 24; value64 = value64 | value64 << 32; The integer promotions are performed and sign-extended as the type of value is u8. The upper 32 bits of value64 is 0xffffffff in the first line, and the second line has no effect. This fixes the 64-bit pattern generation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09slub: Fix full list corruption if debugging is onChristoph Lameter
When a slab is freed by __slab_free() and the slab can only contain a single object ever then it was full (and therefore not on the partial lists but on the full list in the debug case) before we reached slab_empty. This caused the following full list corruption when SLUB debugging was enabled: [ 5913.233035] ------------[ cut here ]------------ [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98() [ 5913.233101] Hardware name: Adamo 13 [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520 [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan] [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127 [ 5913.233213] Call Trace: [ 5913.233213] <IRQ> [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b [ 5913.233213] [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48 [ 5913.233213] [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98 [ 5913.233213] [<ffffffff8127e7da>] list_del+0xe/0x2d [ 5913.233213] [<ffffffff814e0430>] __slab_free+0x1db/0x235 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff81133085>] kmem_cache_free+0x88/0x102 [ 5913.233213] [<ffffffff811706ab>] bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706e1>] bio_free+0x34/0x64 [ 5913.233213] [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14 [ 5913.233213] [<ffffffff8116fef6>] bio_put+0x2b/0x2d [ 5913.233213] [<ffffffff813dccab>] clone_endio+0x9e/0xb4 [ 5913.233213] [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f [ 5913.233213] [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt] [ 5913.233213] [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt] [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ] Make sure that we remove such a slab also from the full lists. Reported-and-tested-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09Merge branch 'next-evm' of ↵James Morris
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next Conflicts: fs/attr.c Resolve conflict manually. Signed-off-by: James Morris <jmorris@namei.org>
2011-08-04Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: slab, lockdep: Annotate the locks before using them lockdep: Clear whole lockdep_map on initialization slab, lockdep: Annotate slab -> rcu -> debug_object -> slab lockdep: Fix up warning lockdep: Fix trace_hardirqs_on_caller() futex: Fix regression with read only mappings
2011-08-04slab, lockdep: Annotate the locks before using themPeter Zijlstra
Fernando found we hit the regular OFF_SLAB 'recursion' before we annotate the locks, cure this. The relevant portion of the stack-trace: > [ 0.000000] [<c085e24f>] rt_spin_lock+0x50/0x56 > [ 0.000000] [<c04fb406>] __cache_free+0x43/0xc3 > [ 0.000000] [<c04fb23f>] kmem_cache_free+0x6c/0xdc > [ 0.000000] [<c04fb2fe>] slab_destroy+0x4f/0x53 > [ 0.000000] [<c04fb396>] free_block+0x94/0xc1 > [ 0.000000] [<c04fc551>] do_tune_cpucache+0x10b/0x2bb > [ 0.000000] [<c04fc8dc>] enable_cpucache+0x7b/0xa7 > [ 0.000000] [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61 > [ 0.000000] [<c0bba687>] start_kernel+0x24c/0x363 > [ 0.000000] [<c0bba0ba>] i386_start_kernel+0xa9/0xaf Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04slab, lockdep: Annotate slab -> rcu -> debug_object -> slabPeter Zijlstra
Lockdep thinks there's lock recursion through: kmem_cache_free() cache_flusharray() spin_lock(&l3->list_lock) <----------------. free_block() | slab_destroy() | call_rcu() | debug_object_activate() | debug_object_init() | __debug_object_init() | kmem_cache_alloc() | cache_alloc_refill() | spin_lock(&l3->list_lock) --' Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no actual possibility of recursing. Luckily debug objects marks it slab with SLAB_DEBUG_OBJECTS so we can identify the thing. Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special lockdep key so that lockdep sees its a different cachep. Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble. Reported-and-tested-by: Sebastian Siewior <sebastian@breakpoint.cc> [ fixes to the initial patch ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-03Merge branch 'apei-release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: ACPI, APEI, EINJ Param support is disabled by default APEI GHES: 32-bit buildfix ACPI: APEI build fix ACPI, APEI, GHES: Add hardware memory error recovery support HWPoison: add memory_failure_queue() ACPI, APEI, GHES, Error records content based throttle ACPI, APEI, GHES, printk support for recoverable error via NMI lib, Make gen_pool memory allocator lockless lib, Add lock-less NULL terminated single list Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG ACPI, APEI, Add WHEA _OSC support ACPI, APEI, Add APEI bit support in generic _OSC call ACPI, APEI, GHES, Support disable GHES at boot time ACPI, APEI, GHES, Prevent GHES to be built as module ACPI, APEI, Use apei_exec_run_optional in APEI EINJ and ERST ACPI, APEI, Add apei_exec_run_optional ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic ACPI, APEI, ERST, Fix erst-dbg long record reading issue ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled