summaryrefslogtreecommitdiffstats
path: root/block/blk-lib.c
AgeCommit message (Collapse)Author
2013-02-15block: account iowait time when waiting for completion of IO requestVladimir Davydov
Using wait_for_completion() for waiting for a IO request to be executed results in wrong iowait time accounting. For example, a system having the only task doing write() and fdatasync() on a block device can be reported being idle instead of iowaiting as it should because blkdev_issue_flush() calls wait_for_completion() which in turn calls schedule() that does not increment the iowait proc counter and thus does not turn on iowait time accounting. The patch makes block layer use wait_for_completion_io() instead of wait_for_completion() where appropriate to account iowait time correctly. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14block: add plug for blkdev_issue_discardShaohua Li
Last post of this patch appears lost, so I resend this. Now discard merge works, add plug for blkdev_issue_discard. This will help discard request merge especially for raid0 case. In raid0, a big discard request is split to small requests, and if correct plug is added, such small requests can be merged in low layer. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14block: discard granularity might not be power of 2Shaohua Li
In MD raid case, discard granularity might not be power of 2, for example, a 4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for such cases. Reported-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: Make blkdev_issue_zeroout use WRITE SAMEMartin K. Petersen
If the device supports WRITE SAME, use that to optimize zeroing of blocks. If the device does not support WRITE SAME or if the operation fails, fall back to writing zeroes the old-fashioned way. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-20block: Implement support for WRITE SAMEMartin K. Petersen
The WRITE SAME command supported on some SCSI devices allows the same block to be efficiently replicated throughout a block range. Only a single logical block is transferred from the host and the storage device writes the same data to all blocks described by the I/O. This patch implements support for WRITE SAME in the block layer. The blkdev_issue_write_same() function can be used by filesystems and block drivers to replicate a buffer across a block range. This can be used to efficiently initialize software RAID devices, etc. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02block: split discard into aligned requestsPaolo Bonzini
When a disk has large discard_granularity and small max_discard_sectors, discards are not split with optimal alignment. In the limit case of discard_granularity == max_discard_sectors, no request could be aligned correctly, so in fact you might end up with no discarded logical blocks at all. Another example that helps showing the condition in the patch is with discard_granularity == 64, max_discard_sectors == 128. A request that is submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257. However, only 2 aligned blocks out of 3 are included in the request; 128..191 may be left intact and not discarded. With this patch, the first request will be truncated to ensure good alignment of what's left, and the split will be 2..127, 128..255, 256..257. The patch will also take into account the discard_alignment. At most one extra request will be introduced, because the first request will be reduced by at most granularity-1 sectors, and granularity must be less than max_discard_sectors. Subsequent requests will run on round_down(max_discard_sectors, granularity) sectors, as in the current code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02block: reorganize rounding of max_discard_sectorsPaolo Bonzini
Mostly a preparation for the next patch. In principle this fixes an infinite loop if max_discard_sectors < granularity, but that really shouldn't happen. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-07-23block: fix patch import error in max_discard_sectors checkJens Axboe
A '!' snuck in before the unlikely, rendering it useless. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-06block: eliminate potential for infinite loop in blkdev_issue_discardMike Snitzer
Due to the recently identified overflow in read_capacity_16() it was possible for max_discard_sectors to be zero but still have discards enabled on the associated device's queue. Eliminate the possibility for blkdev_issue_discard to infinitely loop. Interestingly this issue wasn't identified until a device, whose discard_granularity was 0 due to read_capacity_16 overflow, was consumed by blk_stack_limits() to construct limits for a higher-level DM multipath device. The multipath device's resulting limits never had the discard limits stacked because blk_stack_limits() will only do so if the bottom device's discard_granularity != 0. This resulted in the multipath device's limits.max_discard_sectors being 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06blkdev: Do not return -EOPNOTSUPP if discard is supportedLukas Czerner
Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the bio fails due to underlying device not supporting discard request. However, if the device is for example dm device composed of devices which some of them support discard and some of them does not, it is ok for some bios to fail with EOPNOTSUPP, but it does not mean that discard is not supported at all. This commit removes the check for bios failed with EOPNOTSUPP and change blkdev_issue_discard() to return operation not supported if and only if the device does not actually supports it, not just part of the device as some bios might indicate. This change also fixes problem with BLKDISCARD ioctl() which now works correctly on such dm devices. Signed-off-by: Lukas Czerner <lczerner@redhat.com> CC: Jens Axboe <jaxboe@fusionio.com> CC: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06blkdev: Simple cleanup in blkdev_issue_zeroout()Lukas Czerner
In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do not need to check for -EOPNOTSUPP specifically in case of error. Also there is no need to have label submit: because there is no way to jump out from the while cycle without an error and we really want to exit, rather than try again. And also remove the check for (sz == 0) since at that point sz can never be zero. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> CC: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06blkdev: Submit discard bio in batches in blkdev_issue_discard()Lukas Czerner
Currently we are waiting for every submitted REQ_DISCARD bio separately, but it can have unwanted consequences of repeatedly flushing the queue, so we rather submit bios in batches and wait for the entire batch, hence narrowing the window of other ios going in. Use bio_batch_end_io() and struct bio_batch for that purpose, the same is used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we always set !BIO_UPTODATE in the case of error and remove the check for bb, since we are the only user of this function and we always set this. Remove bio_get()/bio_put() from the blkdev_issue_discard() since bio_alloc() and bio_batch_end_io() is doing the same thing, hence it is not needed anymore. I have done simple dd testing with surprising results. The script I have used is: for i in $(seq 10); do echo $i dd if=/dev/sdb1 of=/dev/sdc1 bs=4k & sleep 5 done /usr/bin/time -f %e ./blkdiscard /dev/sdc1 Running time of BLKDISCARD on the whole device: with patch without patch 0.95 15.58 So we can see that in this artificial test the kernel with the patch applied is approx 16x faster in discarding the device. Signed-off-by: Lukas Czerner <lczerner@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> CC: Jens Axboe <jaxboe@fusionio.com> CC: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-11block: remove obsolete comments for blkdev_issue_zeroout.Tao Ma
barrier is already removed, so remove the obsolete comments in blkdev_issue_zeroout. Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-11block: fix mis-synchronisation in blkdev_issue_zeroout()Lukas Czerner
BZ29402 https://bugzilla.kernel.org/show_bug.cgi?id=29402 We can hit serious mis-synchronization in bio completion path of blkdev_issue_zeroout() leading to a panic. The problem is that when we are going to wait_for_completion() in blkdev_issue_zeroout() we check if the bb.done equals issued (number of submitted bios). If it does, we can skip the wait_for_completition() and just out of the function since there is nothing to wait for. However, there is a ordering problem because bio_batch_end_io() is calling atomic_inc(&bb->done) before complete(), hence it might seem to blkdev_issue_zeroout() that all bios has been completed and exit. At this point when bio_batch_end_io() is going to call complete(bb->wait), bb and wait does not longer exist since it was allocated on stack in blkdev_issue_zeroout() ==> panic! (thread 1) (thread 2) bio_batch_end_io() blkdev_issue_zeroout() if(bb) { ... if (bb->end_io) ... bb->end_io(bio, err); ... atomic_inc(&bb->done); ... ... while (issued != atomic_read(&bb.done)) ... (let issued == bb.done) ... (do the rest of the function) ... return ret; complete(bb->wait); ^^^^^^^^ panic We can fix this easily by simplifying bio_batch and completion counting. Also remove bio_end_io_t *end_io since it is not used. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Eric Whitney <eric.whitney@hp.com> Tested-by: Eric Whitney <eric.whitney@hp.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-01block: fix kernel-doc format for blkdev_issue_zerooutBen Hutchings
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16block: remove BLKDEV_IFL_WAITChristoph Hellwig
All the blkdev_issue_* helpers can only sanely be used for synchronous caller. To issue cache flushes or barriers asynchronously the caller needs to set up a bio by itself with a completion callback to move the asynchronous state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always specified when calling blkdev_issue_* and also remove the now unused flags argument to blkdev_issue_flush and blkdev_issue_zeroout. For blkdev_issue_discard we need to keep it for the secure discard flag, which gains a more descriptive name and loses the bitops vs flag confusion. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10block: remove the BLKDEV_IFL_BARRIER flagChristoph Hellwig
Remove support for barriers on discards, which is unused now. Also remove the DISCARD_NOBARRIER I/O type in favour of just setting the rw flags up locally in blkdev_issue_discard. tj: Also remove DISCARD_SECURE and use REQ_SECURE directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-12block: add secure discardAdrian Hunter
Secure discard is the same as discard except that all copies of the discarded sectors (perhaps created by garbage collection) must also be erased. Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Madhusudhan Chikkature <madhu.cr@ti.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Gardiner <bengardiner@nanometrics.ca> Cc: <linux-mmc@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-08blkdev: fix blkdev_issue_zeroout return valueDmitry Monakhov
- If function called without barrier option retvalue is incorrect Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07block: fix problem with sending down discard that isn't of correct granularityJens Axboe
If the queue doesn't have a limit set, or it just set UINT_MAX like we default to, we coud be sending down a discard request that isn't of the correct granularity if the block size is > 512b. Fix this by adjusting max_discard_sectors down to the proper alignment. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07block: don't allocate a payload for discard requestChristoph Hellwig
Allocating a fixed payload for discard requests always was a horrible hack, and it's not coming to byte us when adding support for discard in DM/MD. So change the code to leave the allocation of a payload to the lowlevel driver. Unfortunately that means we'll need another hack, which allows us to update the various block layer length fields indicating that we have a payload. Instead of hiding this in sd.c, which we already partially do for UNMAP support add a documented helper in the core block layer for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-04-29block: fix bad use of min() on different typesJens Axboe
Just cast the page size to sector_t, that will always fit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28blkdev: add blkdev_issue_zeroout helper functionDmitry Monakhov
- Add bio_batch helper primitive. This is rather generic primitive for submitting/waiting a complex request which consists of several bios. - blkdev_issue_zeroout() generate number of zero filed write bios. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28blkdev: move blkdev_issue helper functions to separate fileDmitry Monakhov
Move blkdev_issue_discard from blk-barrier.c because it is not barrier related. Later the file will be populated by other helpers. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>