summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2009-04-02memcg: fix OOM killer under memcgKAMEZAWA Hiroyuki
This patch tries to fix OOM Killer problems caused by hierarchy. Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to kill a task in memcg. But, when hierarchy is used, it's broken and correct task cannot be killed. For example, in following cgroup /groupA/ hierarchy=1, limit=1G, 01 nolimit 02 nolimit All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to groupA's 1Gbytes but OOM Killer just kills tasks in groupA. This patch provides makes the bad process be selected from all tasks under hierarchy. BTW, currently, oom_jiffies is updated against groupA in above case. oom_jiffies of tree should be updated. To see how oom_jiffies is used, please check mem_cgroup_oom_called() callers. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memcg: fix shrinking memory to return -EBUSY by fixing retry algorithmKAMEZAWA Hiroyuki
As pointed out, shrinking memcg's limit should return -EBUSY after reasonable retries. This patch tries to fix the current behavior of shrink_usage. Before looking into "shrink should return -EBUSY" problem, we should fix hierarchical reclaim code. It compares current usage and current limit, but it only makes sense when the kernel reclaims memory because hit limits. This is also a problem. What this patch does are. 1. add new argument "shrink" to hierarchical reclaim. If "shrink==true", hierarchical reclaim returns immediately and the caller checks the kernel should shrink more or not. (At shrinking memory, usage is always smaller than limit. So check for usage < limit is useless.) 2. For adjusting to above change, 2 changes in "shrink"'s retry path. 2-a. retry_count depends on # of children because the kernel visits the children under hierarchy one by one. 2-b. rather than checking return value of hierarchical_reclaim's progress, compares usage-before-shrink and usage-after-shrink. If usage-before-shrink <= usage-after-shrink, retry_count is decremented. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memcg: hierarchical statKAMEZAWA Hiroyuki
Clean up memory.stat file routine and show "total" hierarchical stat. This patch does - renamed get_all_zonestat to be get_local_zonestat. - remove old mem_cgroup_stat_desc, which is only for per-cpu stat. - add mcs_stat to cover both of per-cpu/per-lru stat. - add "total" stat of hierarchy (*) - add a callback system to scan all memcg under a root. == "total" is added. [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat cache 0 rss 0 pgpgin 0 pgpgout 0 inactive_anon 0 active_anon 0 inactive_file 0 active_file 0 unevictable 0 hierarchical_memory_limit 50331648 hierarchical_memsw_limit 9223372036854775807 total_cache 65536 total_rss 192512 total_pgpgin 218 total_pgpgout 155 total_inactive_anon 0 total_active_anon 135168 total_inactive_file 61440 total_active_file 4096 total_unevictable 0 == (*) maybe the user can do calc hierarchical stat by his own program in userland but if it can be written in clean way, it's worth to be shown, I think. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memcg: use CSS IDKAMEZAWA Hiroyuki
Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy. Assume folloing tree. group_A (ID=3) /01 (ID=4) /0A (ID=7) /02 (ID=10) group_B (ID=5) and task in group_A/01/0A hits limit at group_A. reclaim will be done in following order (round-robin). group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10) -> group_A -> ..... Round robin by ID. The last visited cgroup is recorded and restart from it when it start reclaim again. (More smart algorithm can be implemented..) No cgroup_mutex or hierarchy_mutex is required. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02devcgroup: avoid using cgroup_lockLi Zefan
There is nothing special that has to be protected by cgroup_lock, so introduce devcgroup_mtuex for it's own use. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02debug cgroup: remove unneeded cgroup_lockLi Zefan
Since we are in cgroup write handler, so the cgrp is valid, so we don't have to hold cgroup_mutex when calling cgroup_task_count(). One similar example is in cgroup_tasks_open(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: don't change release_agent when remount failedLi Zefan
Remount can fail in either case: - wrong mount options is specified, or option 'noprefix' is changed. - a to-be-added subsys is already mounted/active. When using remount to change 'release_agent', for the above former failure case, remount will return errno with release_agent unchanged, but for the latter case, remount will return EBUSY with relase_agent changed, which is unexpected I think: # mount -t cgroup -o cpu xxx /cgrp1 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2 # cat /cgrp2/release_agent agent1 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 not mounted already, or bad option # cat /cgrp2/release_agent agent1 <-- ok # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 is busy # cat /cgrp2/release_agent agent2 <-- unexpected! Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: show correct file modeLi Zefan
We have some read-only files and write-only files, but currently they are all set to 0644, which is counter-intuitive and cause trouble for some cgroup tools like libcgroup. This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's own files' file mode, and for the most cases cft->mode can be default to 0 and cgroup will figure out proper mode. Acked-by: Paul Menage <menage@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: more documentation for remount and release_agentLi Zefan
This won't remove cpuacct from the mounted hierachy: # mount -t cgroup -o cpu,cpuacct xxx /mnt # mount -o remount,cpu /mnt Because for this usage mount(8) will append the new options to the original options. And this will get you right: # mount [-t cgroup] -o remount,cpu xxx /mnt Also document how to specify or change release_agent. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02kernel/cgroup.c: kfree(NULL) is legalJesper Juhl
Reduces object file size a bit: Before: $ size kernel/cgroup.o text data bss dec hex filename 21593 7804 4924 34321 8611 kernel/cgroup.o After: $ size kernel/cgroup.o text data bss dec hex filename 21537 7744 4924 34205 859d kernel/cgroup.o Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroup: fix frequent -EBUSY at rmdirKAMEZAWA Hiroyuki
In following situation, with memory subsystem, /groupA use_hierarchy==1 /01 some tasks /02 some tasks /03 some tasks /04 empty When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim is triggered and the kernel walks tree under groupA. In this case, rmdir /groupA/04 fails with -EBUSY frequently because of temporal refcnt from the kernel. In general. cgroup can be rmdir'd if there are no children groups and no tasks. Frequent fails of rmdir() is not useful to users. (And the reason for -EBUSY is unknown to users.....in most cases) This patch tries to modify above behavior, by - retries if css_refcnt is got by someone. - add "return value" to pre_destroy() and allows subsystem to say "we're really busy!" Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroup: CSS ID supportKAMEZAWA Hiroyuki
Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code. This patch attaches unique ID to each css and provides following. - css_lookup(subsys, id) returns pointer to struct cgroup_subysys_state of id. - css_get_next(subsys, id, rootid, depth, foundid) returns the next css under "root" by scanning When cgroup_subsys->use_id is set, an id for css is maintained. The cgroup framework only parepares - css_id of root css for subsys - id is automatically attached at creation of css. - id is *not* freed automatically. Because the cgroup framework don't know lifetime of cgroup_subsys_state. free_css_id() function is provided. This must be called by subsys. There are several reasons to develop this. - Saving space .... For example, memcg's swap_cgroup is array of pointers to cgroup. But it is not necessary to be very fast. By replacing pointers(8bytes per ent) to ID (2byes per ent), we can reduce much amount of memory usage. - Scanning without lock. CSS_ID provides "scan id under this ROOT" function. By this, scanning css under root can be written without locks. ex) do { rcu_read_lock(); next = cgroup_get_next(subsys, id, root, &found); /* check sanity of next here */ css_tryget(); rcu_read_unlock(); id = found + 1 } while(...) Characteristics: - Each css has unique ID under subsys. - Lifetime of ID is controlled by subsys. - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy - Allowed ID is 1-65535, ID 0 is UNUSED ID. Design Choices: - scan-by-ID v.s. scan-by-tree-walk. As /proc's pid scan does, scan-by-ID is robust when scanning is done by following kind of routine. scan -> rest a while(release a lock) -> conitunue from interrupted memcg's hierarchical reclaim does this. - When subsys->use_id is set, # of css in the system is limited to 65535. [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroupsGrzegorz Nosek
The ns_proxy cgroup allows moving processes to child cgroups only one level deep at a time. This commit relaxes this restriction and makes it possible to attach tasks directly to grandchild cgroups, e.g.: ($pid is in the root cgroup) echo $pid > /cgroup/CG1/CG2/tasks Previously this operation would fail with -EPERM and would have to be performed as two steps: echo $pid > /cgroup/CG1/tasks echo $pid > /cgroup/CG1/CG2/tasks Also, the target cgroup no longer needs to be empty to move a task there. Signed-off-by: Grzegorz Nosek <root@localdomain.pl> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02cgroups: fix cgroup.h commentsPaul Menage
Fix the style of some multi-line comments in cgroup.h to match Documentation/CodingStyle Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02documentation: fix unix_dgram_qlen descriptionLi Xiaodong
Previous description about system parameter in /proc/sys/net/unix/ is wrong (or missed). Simply add a new description about unix_dgram_qlen according to latest kernel. Signed-off-by: Li Xiaodong <lixd@cn.fujitsu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02documentation: update Documentation/filesystem/proc.txt and ↵Shen Feng
Documentation/sysctls Now /proc/sys is described in many places and much information is redundant. This patch updates the proc.txt and move the /proc/sys desciption out to the files in Documentation/sysctls. Details are: merge - 2.1 /proc/sys/fs - File system data - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem - 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface with Documentation/sysctls/fs.txt. remove - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats since it's not better then the Documentation/binfmt_misc.txt. merge - 2.3 /proc/sys/kernel - general kernel parameters with Documentation/sysctls/kernel.txt remove - 2.5 /proc/sys/dev - Device specific parameters since it's obsolete the sysfs is used now. remove - 2.6 /proc/sys/sunrpc - Remote procedure calls since it's not better then the Documentation/sysctls/sunrpc.txt move - 2.7 /proc/sys/net - Networking stuff - 2.9 Appletalk - 2.10 IPX to newly created Documentation/sysctls/net.txt. remove - 2.8 /proc/sys/net/ipv4 - IPV4 settings since it's not better then the Documentation/networking/ip-sysctl.txt. add - Chapter 3 Per-Process Parameters to descibe /proc/<pid>/xxx parameters. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02documentation: ignore byproducts from latexHenrik Austad
When using 'make pdfdocs', auto-generated files should be ignored Signed-off-by: Henrik Austad <henrik@austad.us> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02hppfs: hppfs_read_file() may return -ERRORRoel Kluin
hppfs_read_file() may return (ssize_t) -ENOMEM, or -EFAULT. When stored in size_t 'count', these errors will not be noticed, a large value will be added to *ppos. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ext3: avoid false EIO errorsJan Kara
Sometimes block_write_begin() can map buffers in a page but later we fail to copy data into those buffers (because the source page has been paged out in the mean time). We then end up with !uptodate mapped buffers. To add a bit more to the confusion, block_write_end() does not commit any data (and thus does not any mark buffers as uptodate) if we didn't succeed with copying all the data. Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops) missed these cases and thus we were inserting non-uptodate buffers to transaction's list which confuses JBD code and it reports IO errors, aborts a transaction and generally makes users afraid about their data ;-P. This patch fixes the problem by reorganizing ext3_..._write_end() code to first call block_write_end() to mark buffers with valid data uptodate and after that we file only uptodate buffers to transaction's lists. We also fix a problem where we could leave blocks allocated beyond i_size (i_disksize in fact) because of failed write. We now add inode to orphan list when write fails (to be safe in case we crash) and then truncate blocks beyond i_size in a separate transaction. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ext3: return -EIO not -ESTALE on directory traversal through deleted inodeBryan Donlan
ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to report errors to NFS properly. However, in ext[234]_lookup(), this -ESTALE can be propagated to userspace if the filesystem is corrupted such that a directory entry references a deleted inode. This leads to a misleading error message - "Stale NFS file handle" - and confusion on the part of the admin. The bug can be easily reproduced by creating a new filesystem, making a link to an unused inode using debugfs, then mounting and attempting to ls -l said link. This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE from ext3_iget(), as ext3 does for other filesystem metadata corruption; and also invokes the appropriate ext*_error functions when this case is detected. Signed-off-by: Bryan Donlan <bdonlan@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ext3: use unsigned instead of int for type of blocksize in fs/ext3/namei.cWei Yongjun
Use unsigned instead of int for the parameter which carries a blocksize. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02jbd: fix oops in jbd_journal_init_inode() on corrupted fsJan Kara
On 32-bit system with CONFIG_LBD getblk can fail because provided block number is too big. Make JBD gracefully handle that. Signed-off-by: Jan Kara <jack@suse.cz> Cc: <dmaciejak@fortinet.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02ext3: remove the BKL in ext3/ioctl.cCyrus Massoumi
Reformat ext3/ioctl.c to make it look more like ext4/ioctl.c and remove the BKL around ext3_ioctl(). Signed-off-by: Cyrus Massoumi <cyrusm@gmx.net> Cc: <linux-ext4@vger.kernel.org> Acked-by: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02pnpbios: propagate kthread_run() errorErik Ekman
- Error code from kthread_run() is now returned in pnpbios_thread_init() - Remove variable which always was 0. Signed-off-by: Erik Ekman <erik@kryo.se> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02pnpbios: fix warning if CONFIG_HOTPLUG=nErik Ekman
drivers/pnp/pnpbios/core.c: In function 'pnpbios_thread_init': drivers/pnp/pnpbios/core.c:578: warning: unused variable 'task' Signed-off-by: Erik Ekman <erik@kryo.se> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02spi-gpio: allow operation without CS signalMichael Buesch
Change spi-gpio so that it is possible to drive SPI communications over GPIO without the need for a chipselect signal. This is useful in very small setups where there's only one slave device on the bus. This patch does not affect existing setups. I use this for a tiny communication channel between an embedded device and a microcontroller. There are not enough GPIOs available for chipselect and it's not needed anyway in this case. Signed-off-by: Michael Buesch <mb@bu3sch.de> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02gpio: gpio_{request,free}() now required (feature removal)David Brownell
We want to phase out the GPIO "autorequest" mechanism in gpiolib and require all callers to use gpio_request(). - Update feature-removal-schedule - Update the documentation now - Convert the relevant pr_warning() in gpiolib to a WARN() so folk using this mechanism get a noisy stack dump Some drivers and board init code will probably need to change. Implementations not using gpiolib will still be fine; they are already required to implement gpio_{request,free}() stubs. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02gpiolib: allow GPIOs to be namedDaniel Silverstone
Allow GPIOs in GPIOLIB chips to be named. This name is then used when the GPIO is exported to sysfs, although it could be used elsewhere if deemed useful. Signed-off-by: Daniel Silverstone <dsilvers@simtec.co.uk> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02rtc: add m41t62 support to rtc-m41t80 driverDaniel Glockner
Compared to the other supported chips, the m41t62 uses a different register to set the square wave frequency. Signed-off-by: Daniel Glockner <dg@emlix.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Brownell <david-b@pacbell.net> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02rtc-v3020: add ability to access v3020 chip with GPIOsMike Rapoport
The v3020 RTC can be connected to GPIOs as well as to memory-like interface. Add ability to use GPIO bit-bang for v3020 read-write access. [akpm@linux-foundation.org: fix off-by-one in error path] Signed-off-by: Mike Rapoport <mike@compulab.co.il> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02initramfs: prevent initramfs printk message being split by messages from ↵Simon Kitching
other code. initramfs uses printk without a linefeed, then does some work, then uses printk to finish the message off. However if some other code does a printk in between, then the messages get mixed together. Better for each message to be an independent line... Example of problem that this fixes: checking if image is initramfs...<7>Switched to high resolution mode on CPU 1 Switched to high resolution mode on CPU 0 it is Signed-off-by: Simon Kitching <skitching@apache.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02Simplify copy_thread()Alexey Dobriyan
First argument unused since 2.3.11. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memory_accessor: implement the new memory_accessor interfaces for SPI EEPROMsDavid Brownell
- Define new setup() hook to export the accessor - Implement accessor methods Moves some error checking out of the sysfs interface code into the layer below it, which is now shared by both sysfs and memory access code. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memory_accessor: implement the new memory_accessor interface for I2C EEPROMKevin Hilman
In the case of at24, the platform code registers a 'setup' callback with the at24_platform_data. When the at24 driver detects an EEPROM, it fills out the read and write functions of the memory_accessor and calls the setup callback passing the memory_accessor struct. The platform code can then use the read/write functions in the memory_accessor struct for reading and writing the EEPROM. Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: David Brownell <dbrownell@users.sourceforge.net> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02memory_accessor: new interface for reading/writing persistent memoryKevin Hilman
Add an interface by which other kernel code can read/write persistent memory such as I2C or SPI EEPROMs, or devices which provide NVRAM. Use cases include storage of board-specific configuration data like Ethernet addresses and sensor calibrations. Original idea, review and improvement suggestions by David Brownell. Acked-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02workqueue: add to_delayed_work() helper functionJean Delvare
It is a fairly common operation to have a pointer to a work and to need a pointer to the delayed work it is contained in. In particular, all delayed works which want to rearm themselves will have to do that. So it would seem fair to offer a helper function for this operation. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jean Delvare <khali@linux-fr.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02uml: fix warnings in kernel_execveMiklos Szeredi
Fix the following warnings: arch/um/kernel/syscall.c: In function 'kernel_execve': arch/um/kernel/syscall.c:130: warning: passing argument 1 of 'um_execve' discards qualifiers from pointer target type arch/um/kernel/syscall.c:130: warning: passing argument 2 of 'um_execve' discards qualifiers from pointer target type arch/um/kernel/syscall.c:130: warning: passing argument 3 of 'um_execve' discards qualifiers from pointer target type Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Jeff Dike <jdike@addtoit.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02uml: fix link error from prefixing of i386 syscalls with ptregs_Miklos Szeredi
Fix the following link error: arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x11c): undefined reference to `ptregs_fork' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x140): undefined reference to `ptregs_execve' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x2cc): undefined reference to `ptregs_iopl' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x2d8): undefined reference to `ptregs_vm86old' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x2f0): undefined reference to `ptregs_sigreturn' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x2f4): undefined reference to `ptregs_clone' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x3ac): undefined reference to `ptregs_vm86' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x3c8): undefined reference to `ptregs_rt_sigreturn' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x3fc): undefined reference to `ptregs_sigaltstack' arch/um/sys-i386/built-in.o: In function `sys_call_table': (.rodata+0x40c): undefined reference to `ptregs_vfork' This was introduced by commit 253f29a4, "x86: pass in pt_regs pointer for syscalls that need it" Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Brian Gerst <brgerst@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jeff Dike <jdike@addtoit.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02uml: fix compile error from net_device_ops conversionMiklos Szeredi
Fix the following compile error: arch/um/drivers/net_kern.c: In function 'uml_inetaddr_event': arch/um/drivers/net_kern.c:760: error: 'struct net_device' has no member named 'open' This was introduced by commit 8bb95b39, "uml: convert network device to netdevice ops". Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02floppy: provide a PNP device table in the module.Scott James Remnant
The missing device table means that the floppy module is not auto-loaded, even when the appropriate PNP device (0700) is found. We don't actually use the table in the module, since the device doesn't have a struct pnp_driver, but it's sufficient to cause an alias in the module that udev/modprobe will use. Signed-off-by: Scott James Remnant <scott@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Philippe De Muyter <phdm@macqel.be> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02vfs: check bh->b_blocknr only if BH_Mapped is setNikanth Karthikesan
Check bh->b_blocknr only if BH_Mapped is set. akpm: I doubt if b_blocknr is ever uninitialised here, but it could conceivably cause a problem if we're doing a lookup for block zero. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02mm: define a UNIQUE value for AS_UNEVICTABLE flagLee Schermerhorn
A new "address_space flag"--AS_MM_ALL_LOCKS--was defined to use the next available AS flag while the Unevictable LRU was under development. The Unevictable LRU was using the same flag and "no one" noticed. Current mainline, since 2.6.28, has same value for two symbolic flag names. So, define a unique flag value for AS_UNEVICTABLE--up close to the other flags, [at the cost of an additional #ifdef] so we'll notice next time. Note that #ifdef is not actually required, if we don't mind having the unused flag value defined. Replace #defines with an enum. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: <stable@kernel.org> [2.6.28.x, 2.6.29.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02add fiemap.h to header-yEric Sandeen
Include fiemap.h in header-y; it defines the interface for the FS_IOC_FIEMAP file mapping ioctl. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02MAINTAINERS: add hvc_consoleMichael Ellerman
Add a MAINTAINERS entry for the hypervisor virtual console driver. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com> Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02mm: do_xip_mapping_read: fix length calculationMartin Schwidefsky
The calculation of the value nr in do_xip_mapping_read is incorrect. If the copy required more than one iteration in the do while loop the copies variable will be non-zero. The maximum length that may be passed to the call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied but the check only compares against (nr > len). This bug is the cause for the heap corruption Carsten has been chasing for so long: *** glibc detected *** /bin/bash: free(): invalid next size (normal): 0x00000000800e39f0 *** ======= Backtrace: ========= /lib64/libc.so.6[0x200000b9b44] /lib64/libc.so.6(cfree+0x8e)[0x200000bdade] /bin/bash(free_buffered_stream+0x32)[0x80050e4e] /bin/bash(close_buffered_stream+0x1c)[0x80050ea4] /bin/bash(unset_bash_input+0x2a)[0x8001c366] /bin/bash(make_child+0x1d4)[0x8004115c] /bin/bash[0x8002fc3c] /bin/bash(execute_command_internal+0x656)[0x8003048e] /bin/bash(execute_command+0x5e)[0x80031e1e] /bin/bash(execute_command_internal+0x79a)[0x800305d2] /bin/bash(execute_command+0x5e)[0x80031e1e] /bin/bash(reader_loop+0x270)[0x8001efe0] /bin/bash(main+0x1328)[0x8001e960] /lib64/libc.so.6(__libc_start_main+0x100)[0x200000592a8] /bin/bash(clearerr+0x5e)[0x8001c092] With this bug fix the commit 0e4a9b59282914fe057ab17027f55123964bc2e2 "ext2/xip: refuse to change xip flag during remount with busy inodes" can be removed again. Cc: Carsten Otte <cotte@de.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02random: align rekey_work's timerAnton Blanchard
Align rekey_work. Even though it's infrequent, we may as well line it up. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02mm: align vmstat_work's timerAnton Blanchard
Even though vmstat_work is marked deferrable, there are still benefits to aligning it. For certain applications we want to keep OS jitter as low as possible and aligning timers and work so they occur together can reduce their overall impact. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02writeback: guard against jiffies wraparound on inode->dirtied_when checks ↵Jeff Layton
(try #3) The dirtied_when value on an inode is supposed to represent the first time that an inode has one of its pages dirtied. This value is in units of jiffies. It's used in several places in the writeback code to determine when to write out an inode. The problem is that these checks assume that dirtied_when is updated periodically. If an inode is continuously being used for I/O it can be persistently marked as dirty and will continue to age. Once the time compared to is greater than or equal to half the maximum of the jiffies type, the logic of the time_*() macros inverts and the opposite of what is needed is returned. On 32-bit architectures that's just under 25 days (assuming HZ == 1000). As the least-recently dirtied inode, it'll end up being the first one that pdflush will try to write out. sync_sb_inodes does this check: /* Was this inode dirtied after sync_sb_inodes was called? */ if (time_after(inode->dirtied_when, start)) break; ...but now dirtied_when appears to be in the future. sync_sb_inodes bails out without attempting to write any dirty inodes. When this occurs, pdflush will stop writing out inodes for this superblock. Nothing can unwedge it until jiffies moves out of the problematic window. This patch fixes this problem by changing the checks against dirtied_when to also check whether it appears to be in the future. If it does, then we consider the value to be far in the past. This should shrink the problematic window of time to such a small period (30s) as not to matter. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02__tty_open(): use the correct type for saved_flagsAndrew Morton
filp->f_flags is unsigned, so use that type for the local copy. Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02vfs: skip I_CLEAR state inodesWu Fengguang
clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so _outside_ of inode_lock. So any I_FREEING testing is incomplete without a coupled testing of I_CLEAR. So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and add_dquot_ref(). Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara reminds fixing the other two cases. Masayoshi MIZUMA has a nice panic flow: ===================================================================== [process A] | [process B] | | | prune_icache() | drop_pagecache() | spin_lock(&inode_lock) | drop_pagecache_sb() | inode->i_state |= I_FREEING; | | | spin_unlock(&inode_lock) | V | | | spin_lock(&inode_lock) | V | | | dispose_list() | | | list_del() | | | clear_inode() | | | inode->i_state = I_CLEAR | | | | | V | | | if (inode->i_state & (I_FREEING|I_WILL_FREE)) | | | continue; <==== NOT MATCH | | | | | | (DANGER from here on! Accessing disposing inode!) | | | | | | __iget() | | | list_move() <===== PANIC on poisoned list !! V V | (time) ===================================================================== Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>