summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2009-06-10nilfs2: remove useless b_low and b_high fields from nilfs_bmap structRyusuke Konishi
This will cut off 16 bytes from the nilfs_bmap struct which is embedded in the on-memory inode of nilfs. The b_high field was never used, and the b_low field stores a constant value which can be determined by whether the inode uses btree for block mapping or not. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr functionRyusuke Konishi
This indirect function is set to NULL only for gc cache inodes, but the gc cache inodes never call this function. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: move get block functions in bmap.c into btree codesRyusuke Konishi
Two get block function for btree nodes, nilfs_bmap_get_block() and nilfs_bmap_get_new_block(), are called only from the btree codes. This relocation will increase opportunities of compiler optimization. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: remove nilfs_bmap_delete_blockRyusuke Konishi
nilfs_bmap_delete_block() is a wrapper function calling nilfs_btnode_delete(). This removes it for simplicity. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: remove nilfs_bmap_put_blockRyusuke Konishi
nilfs_bmap_put_block() is a wrapper function calling brelse(). This eliminates the wrapper for simplicity. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: remove header file for segment list operationsRyusuke Konishi
This will eliminate obsolete list operations of nilfs_segment_entry structure which has been used to handle mutiple segment numbers. The patch ("nilfs2: remove list of freeing segments") removed use of the structure from the segment constructor code, and this patch simplifies the remaining code by integrating it into recovery.c. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: eliminate removal list of segmentsRyusuke Konishi
This will clean up the removal list of segments and the related functions from segment.c and ioctl.c, which have hurt code readability. This elimination is applied by using nilfs_sufile_updatev() previously introduced in the patch ("nilfs2: add sufile function that can modify multiple segment usages"). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: add sufile function that can modify multiple segment usagesRyusuke Konishi
This is a preparation for the later cleanup patch ("nilfs2: remove list of freeing segments"). This adds nilfs_sufile_updatev() to sufile, which can modify multiple segment usages at a time. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: unify bmap operations starting use of indirect block addressRyusuke Konishi
This simplifies some low level functions of bmap. Three bmap pointer operations, nilfs_bmap_start_v(), nilfs_bmap_commit_v(), and nilfs_bmap_abort_v(), are unified into one nilfs_bmap_start_v() function. And the related indirect function calls are replaced with it. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10nilfs2: remove nilfs_dat_prepare_free functionRyusuke Konishi
This function is unused. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10[SCSI] libosd: Define an osd_dev wrapper to retrieve the request_queueBoaz Harrosh
libosd users that need to work with bios, must sometime use the request_queue associated with the osd_dev. Make a wrapper for that, and convert all in-tree users. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10[SCSI] libosd: osd_req_{read,write} takes a length parameterBoaz Harrosh
For supporting of chained-bios we can not inspect the first bio only, as before. Caller shall pass the total length of the request, ie. sum_bytes(bio-chain). Also since the bio might be a chain we don't set it's direction on behalf of it's callers. The bio direction should be properly set prior to this call. So fix a couple of write users that now need to set the bio direction properly [In this patch I change both library code and user sites at exofs, to make it easy on integration. It should be submitted via James's scsi-misc tree.] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10[SCSI] libosd: osd_req_{read,write}_kern new APIBoaz Harrosh
By popular demand, define usefull wrappers for osd_req_read/write that recieve kernel pointers. All users had their own. Also remove these from exofs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10GFS2: Merge gfs2_get_sb into gfs2_get_sb_metaSteven Whitehouse
These don't need to be separate functions. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-10GFS2: Fix cache coherency between truncate and O_DIRECT readSteven Whitehouse
If a page was partially zeroed as the result of a truncate, then it was not being correctly marked dirty. This resulted in the deleted data reappearing if the file was read back via direct I/O. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-09jbd: fix race in buffer processing in commit codeJan Kara
In commit code, we scan buffers attached to a transaction. During this scan, we sometimes have to drop j_list_lock and then we recheck whether the journal buffer head didn't get freed by journal_try_to_free_buffers(). But checking for buffer_jbd(bh) isn't enough because a new journal head could get attached to our buffer head. So add a check whether the journal head remained the same and whether it's still at the same transaction and list. This is a nasty bug and can cause problems like memory corruption (use after free) or trigger various assertions in JBD code (observed). Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09autofs4: remove hashed check in validate_wait()Ian Kent
The recent ->lookup() deadlock correction required the directory inode mutex to be dropped while waiting for expire completion. We were concerned about side effects from this change and one has been identified. I saw several error messages. They cause autofs to become quite confused and don't really point to the actual problem. Things like: handle_packet_missing_direct:1376: can't find map entry for (43,1827932) which is usually totally fatal (although in this case it wouldn't be except that I treat is as such because it normally is). do_mount_direct: direct trigger not valid or already mounted /test/nested/g3c/s1/ss1 which is recoverable, however if this problem is at play it can cause autofs to become quite confused as to the dependencies in the mount tree because mount triggers end up mounted multiple times. It's hard to accurately check for this over mounting case and automount shouldn't need to if the kernel module is doing its job. There was one other message, similar in consequence of this last one but I can't locate a log example just now. When checking if a mount has already completed prior to adding a new mount request to the wait queue we check if the dentry is hashed and, if so, if it is a mount point. But, if a mount successfully completed while we slept on the wait queue mutex the dentry must exist for the mount to have completed so the test is not really needed. Mounts can also be done on top of a global root dentry, so for the above case, where a mount request completes and the wait queue entry has already been removed, the hashed test returning false can cause an incorrect callback to the daemon. Also, d_mountpoint() is not sufficient to check if a mount has completed for the multi-mount case when we don't have a real mount at the base of the tree. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09ocfs2: fdatasync should skip unimportant metadata writeoutHisashi Hifumi
In ocfs2, fdatasync and fsync are identical. I think fdatasync should skip committing transaction when inode->i_state is set just I_DIRTY_SYNC and this indicates only atime or/and mtime updates. Following patch improves fdatasync throughput. #sysbench --num-threads=16 --max-requests=300000 --test=fileio --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr --file-fsync-mode=fdatasync run Results: -2.6.30-rc8 Test execution summary: total time: 107.1445s total number of events: 119559 total time taken by event execution: 116.1050 per-request statistics: min: 0.0000s avg: 0.0010s max: 0.1220s approx. 95 percentile: 0.0016s Threads fairness: events (avg/stddev): 7472.4375/303.60 execution time (avg/stddev): 7.2566/0.64 -2.6.30-rc8-patched Test execution summary: total time: 86.8529s total number of events: 300016 total time taken by event execution: 24.3077 per-request statistics: min: 0.0000s avg: 0.0001s max: 0.0336s approx. 95 percentile: 0.0001s Threads fairness: events (avg/stddev): 18751.0000/718.75 execution time (avg/stddev): 1.5192/0.05 Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-09tracing/events: convert block trace points to TRACE_EVENT()Li Zefan
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds these new capabilities to this tracepoint: - zero-copy and per-cpu splice() tracing - binary tracing without printf overhead - structured logging records exposed under /debug/tracing/events - trace events embedded in function tracer output and other plugins - user-defined, per tracepoint filter expressions ... Cons: - no dev_t info for the output of plug, unplug_timer and unplug_io events. no dev_t info for getrq and sleeprq events if bio == NULL. no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL. This is mainly because we can't get the deivce from a request queue. But this may change in the future. - A packet command is converted to a string in TP_assign, not TP_print. While blktrace do the convertion just before output. Since pc requests should be rather rare, this is not a big issue. - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT has a unique format, which means we have some unused data in a trace entry. The overhead is minimized by using __dynamic_array() instead of __array(). I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing: dd dd + ioctl blktrace dd + TRACE_EVENT (splice) 1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s 2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s 3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s So the overhead of tracing is very small, and no regression when using those trace events vs blktrace. And the binary output of TRACE_EVENT is much smaller than blktrace: # ls -l -h -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out Following are some comparisons between TRACE_EVENT and blktrace: plug: kjournald-480 [000] 303.084981: block_plug: [kjournald] kjournald-480 [000] 303.084981: 8,0 P N [kjournald] unplug_io: kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1 kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1 remap: kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384 kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384 bio_backmerge: kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald] kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald] getrq: kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald] kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald] bash-2066 [001] 1072.953770: 8,0 G N [bash] bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash] rq_complete: konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0] konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0] ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0] ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0] rq_insert: kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald] kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald] Changelog from v2 -> v3: - use the newly introduced __dynamic_array(). Changelog from v1 -> v2: - use __string() instead of __array() to minimize the memory required to store hex dump of rq->cmd(). - support large pc requests. - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT. - some cleanups. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09ext4: Don't treat a truncation of a zero-length file as replace-via-truncateTheodore Ts'o
If a non-existent file is opened via O_WRONLY|O_CREAT|O_TRUNC, there's no need to treat this as a true file truncation, so we shouldn't activate the replace-via-truncate hueristic. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09CUSE: implement CUSE - Character device in UserspaceTejun Heo
CUSE enables implementing character devices in userspace. With recent additions of ioctl and poll support, FUSE already has most of what's necessary to implement character devices. All CUSE has to do is bonding all those components - FUSE, chardev and the driver model - nicely. When client opens /dev/cuse, kernel starts conversation with CUSE_INIT. The client tells CUSE which device it wants to create. As the previous patch made fuse_file usable without associated fuse_inode, CUSE doesn't create super block or inodes. It attaches fuse_file to cdev file->private_data during open and set ff->fi to NULL. The rest of the operation is almost identical to FUSE direct IO case. Each CUSE device has a corresponding directory /sys/class/cuse/DEVNAME (which is symlink to /sys/devices/virtual/class/DEVNAME if SYSFS_DEPRECATED is turned off) which hosts "waiting" and "abort" among other things. Those two files have the same meaning as the FUSE control files. The only notable lacking feature compared to in-kernel implementation is mmap support. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-06-09Merge branch 'master' into nextJames Morris
2009-06-08ext4: fix dx_map_entry to support 256k directory blocksToshiyuki Okajima
The dx_map_entry structure doesn't support over 64KB block size by current usage of its member("offs"). Because "offs" treats an offset of copies of the ext4_dir_entry_2 structure as is. This member size is 16 bits. But real offset for over 64KB(256KB) block size needs 18 bits. However, real offset keeps 4 byte boundary, so lower 2 bits is not used. Therefore, we do the following to fix this limitation: For "store": we divide the real offset by 4 and then store this result to "offs" member. For "use": we multiply "offs" member by 4 and then use this result as real offset. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-08xfs: remove SYNC_BDFLUSHChristoph Hellwig
SYNC_BDFLUSH is a leftover from IRIX and rather misnamed for todays code. Make xfs_sync_fsdata and xfs_dq_sync use the SYNC_TRYLOCK flag for not blocking on logs just as the inode sync code already does. For xfs_sync_fsdata it's a trivial 1:1 replacement, but for xfs_qm_sync I use the opportunity to decouple the non-blocking lock case from the different flushing modes, similar to the inode sync code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: remove SYNC_IOWAITChristoph Hellwig
We want to wait for all I/O to finish when we do data integrity syncs. So there is no reason to keep SYNC_WAIT separate from SYNC_IOWAIT. This causes a little change in behaviour for the ENOSPC flushing code which now does a second submission and wait of buffered I/O, but that should finish ASAP as we already did an asynchronous writeout earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: split xfs_sync_inodesChristoph Hellwig
xfs_sync_inodes is used to write back either file data or inode metadata. In general we always do these separately, except for one fishy case in xfs_fs_put_super that does both. So separate xfs_sync_inodes into separate xfs_sync_data and xfs_sync_attr functions. In xfs_fs_put_super we first call the data sync and then the attr sync as that was the previous order. The moved log force in that path doesn't make a difference because we will force the log again as part of the real unmount process. The filesystem readonly checks are not performed by the new function but instead moved into the callers, given that most callers alredy have it further up in the stack. Also add debug checks that we do not pass in incorrect flags in the new xfs_sync_data and xfs_sync_attr function and fix the one place that did pass in a wrong flag. Also remove a comment mentioning xfs_sync_inodes that has been incorrect for a while because we always take either the iolock or ilock in the sync path these days. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: use generic inode iterator in xfs_qm_dqrele_all_inodesChristoph Hellwig
Use xfs_inode_ag_iterator instead of opencoding the inode walk in the quota code. Mark xfs_inode_ag_iterator and xfs_sync_inode_valid non-static to allow using them from the quota code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: introduce a per-ag inode iteratorDave Chinner
Given that we walk across the per-ag inode lists so often, it makes sense to introduce an iterator for this. Convert the sync and reclaim code to use this new iterator, quota code will follow in the next patch. Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode already under reclaim. This simplifies the AG iterator and doesn't matter for the only other caller. [hch: merged the lookup and execute callbacks back into one to get the pag_ici_lock locking correct and simplify the code flow] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: remove unused parameter from xfs_reclaim_inodesDave Chinner
The noblock parameter of xfs_reclaim_inodes is only ever set to zero. Remove it and all the conditional code that is never executed. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: factor out inode validation for syncDave Chinner
Separate the validation of inodes found by the radix tree walk from the radix tree lookup. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: split inode flushing from xfs_sync_inodes_agChristoph Hellwig
In many cases we only want to sync inode metadata. Split out the inode flushing into a separate helper to prepare factoring the inode sync code. Based on a patch from Dave Chinner, but redone to keep the current behaviour exactly and leave changes to the flushing logic to another patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: split inode data writeback from xfs_sync_inodes_agDave Chinner
In many cases we only want to sync inode data. Start spliting the inode sync into data sync and inode sync by factoring out the inode data flush. [hch: minor cleanups] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: kill xfs_qmopsChristoph Hellwig
Kill the quota ops function vector and replace it with direct calls or stubs in the CONFIG_XFS_QUOTA=n case. Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set otherwise. This brings us back closer to the way this code worked in IRIX and earlier Linux versions, but we keep a lot of the more useful factoring of common code. Eventually we should also kill xfs_qm_bhv.c, but that's left for a later patch. Reduces the size of the source code by about 250 lines and the size of XFS module by about 1.5 kilobytes with quotas enabled: text data bss dec hex filename 615957 2960 3848 622765 980ad fs/xfs/xfs.o 617231 3152 3848 624231 98667 fs/xfs/xfs.o.old Fallout: - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects the inode locked and xfs_qm_dqattach which does the locking around it, thus removing XFS_QMOPT_ILOCKED. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: validate quota log items during log recoveryChristoph Hellwig
Arkadiusz has seen really strange crashes in xfs_qm_dqcheck that I can only explain by a log item being too smal to actually fit the xfs_dqblk_t we're dereferencing all over xfs_qm_dqcheck. So add graceful checks for NULL or too small quota items to the log recovery code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08xfs: update max log sizeChristoph Hellwig
Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the maximum log size supported by mkfs. Merged back the changes to xfs_fs.h so the growfs enforced the same limit and the headers are in sync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08fat: split fat_generic_ioctlChristoph Hellwig
Split up fat_generic_ioctl and add separate functions for the two implemented ioctls. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-08Merge branch 'next-mtd' of git://aeryn.fluff.org.uk/bjdooks/linuxDavid Woodhouse
2009-06-08UBIFS: start using hrtimersArtem Bityutskiy
UBIFS uses timers for write-buffer write-back. It is not crucial for us to write-back exactly on time. We are fine to write-back a little earlier or later. And this means we may optimize UBIFS timer so that it could be groped with a close timer event, so that the CPU would not be waken up just to do the write back. This is optimization to lessen power consumption, which is important in embedded devices UBIFS is used for. hrtimers have a nice feature: they are effectively range timers, and we may defind the soft and hard limits for it. Standard timers do not have these feature. They may only be made deferrable, but this means there is effectively no hard limit. So, we will better use hrtimers. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08UBIFS: do not forget to register BDI deviceArtem Bityutskiy
Reviewed-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-07partitions: add ->set_capacity block device methodBartlomiej Zolnierkiewicz
* Add ->set_capacity block device method and use it in rescan_partitions() to attempt enabling native capacity of the device upon detecting the partition which exceeds device capacity. * Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling native capacity during partition scan. Together with the consecutive patch implementing ->set_capacity method in ide-gd device driver this allows automatic disabling of Host Protected Area (HPA) if any partitions overlapping HPA are detected. Cc: Robert Hancock <hancockrwd@gmail.com> Cc: Frans Pop <elendil@planet.nl> Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Emphatically-Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07partitions: warn about the partition exceeding device capacityBartlomiej Zolnierkiewicz
The current warning message says only about the kernel's action taken without mentioning the underlying reason behind it. Noticed-by: Robert Hancock <hancockrwd@gmail.com> Cc: Frans Pop <elendil@planet.nl> Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl> Cc: Al Viro <viro@zeniv.linux.org.uk> Emphatically-Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-06integrity: fix IMA inode leakHugh Dickins
CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects until the system runs out of memory. Nowhere is calling ima_inode_free() a.k.a. ima_iint_delete(). Fix that by calling it from destroy_inode(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-06[CIFS] Add mention of new mount parm (forceuid) to cifs readmeSteve French
Also update fs/cifs/CHANGES Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06cifs: make overriding of ownership conditional on new mount optionsJeff Layton
We have a bit of a problem with the uid= option. The basic issue is that it means too many things and has too many side-effects. It's possible to allow an unprivileged user to mount a filesystem if the user owns the mountpoint, /bin/mount is setuid root, and the mount is set up in /etc/fstab with the "user" option. When doing this though, /bin/mount automatically adds the "uid=" and "gid=" options to the share. This is fortunate since the correct uid= option is needed in order to tell the upcall what user's credcache to use when generating the SPNEGO blob. On a mount without unix extensions this is fine -- you generally will want the files to be owned by the "owner" of the mount. The problem comes in on a mount with unix extensions. With those enabled, the uid/gid options cause the ownership of files to be overriden even though the server is sending along the ownership info. This means that it's not possible to have a mount by an unprivileged user that shows the server's file ownership info. The result is also inode permissions that have no reflection at all on the server. You simply cannot separate ownership from the mode in this fashion. This behavior also makes MultiuserMount option less usable. Once you pass in the uid= option for a mount, then you can't use unix ownership info and allow someone to share the mount. While I'm not thrilled with it, the only solution I can see is to stop making uid=/gid= force the overriding of ownership on mounts, and to add new mount options that turn this behavior on. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06Merge branch 'linus' into perfcounters/coreIngo Molnar
Merge reason: Pick up the latest fixes before the -v8 perfcounters release. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-06ext3/4 with synchronous writes gets wedged by PostfixAl Viro
OK, that's probably the easiest way to do that, as much as I don't like it... Since iget() et.al. will not accept I_FREEING (will wait to go away and restart), and since we'd better have serialization between new/free on fs data structures anyway, we can afford simply skipping I_FREEING et.al. in insert_inode_locked(). We do that from new_inode, so it won't race with free_inode in any interesting ways and it won't race with iget (of any origin; nfsd or in case of fs corruption a lookup) since both still will wait for I_LOCK. Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Tested-by: David Watson <dbwatson@ukfsn.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06Fix nobh_truncate_page() to not pass stack garbage to get_block()Theodore Ts'o
The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of these three, only ext2 and jfs's get_block() function pays attention to bh->b_size --- which is normally always the filesystem blocksize except when the get_block() function is called by either mpage_readpage(), mpage_readpages(), or the direct I/O routines in fs/direct_io.c. Unfortunately, nobh_truncate_page() does not initialize map_bh before calling the filesystem-supplied get_block() function. So ext2 and jfs will try to calculate the number of blocks to map by taking stack garbage and shifting it left by inode->i_blkbits. This should be *mostly* harmless (except the filesystem will do some unnneeded work) unless the stack garbage is less than filesystem's blocksize, in which case maxblocks will be zero, and the attempt to find out whether or not the filesystem has a hole at a given logical block will fail, and the page cache entry might not get zero'ed out. Also if the stack garbage in in map_bh->state happens to have the BH_Mapped bit set, there could be an attempt to call readpage() on a non-existent page, which could cause nobh_truncate_page() to return an error when it should not. Fix this by initializing map_bh->state and map_bh->size. Fortunately, it's probably fairly unlikely that ext2 and jfs users mount with nobh these days. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: Fix oops and use after free during space balancing Btrfs: set device->total_disk_bytes when adding new device
2009-06-05GFS2: Fix locking issue mounting gfs2meta fsSteven Whitehouse
This patch uses sget() to get a reference to the existing gfs2 sb when mouting the gfs2meta filesystem (in fact thats just another mount of the gfs2 filesystem with a different root and this interface is for backward compatibility). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Tested-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: Christoph Hellwig <hch@infradead.org>
2009-06-05ext4: truncate the file properly if we fail to copy data from userspaceAneesh Kumar K.V
In generic_perform_write if we fail to copy the user data we don't update the inode->i_size. We should truncate the file in the above case so that we don't have blocks allocated outside inode->i_size. Add the inode to orphan list in the same transaction as block allocation This ensures that if we crash in between the recovery would do the truncate. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>