summaryrefslogtreecommitdiffstats
path: root/drivers/block
AgeCommit message (Collapse)Author
2014-07-10drbd: stop the meta data sync timer before open coded meta data syncLars Ellenberg
If we re-write all meta data due to resize, we have open-coded write-out of our meta data super block. Stop the md_sync_timer, it would just trigger scary but in this case spurious "timer expired" messages. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: fix resync finished detectionLars Ellenberg
This fixes one recent regresion, and one long existing bug. The bug: drbd_try_clear_on_disk_bm() assumed that all "count" bits have to be accounted in the resync extent corresponding to the start sector. Since we allow application requests to cross our "extent" boundaries, this assumption is no longer true, resulting in possible misaccounting, scary messages ("BAD! sector=12345s enr=6 rs_left=-7 rs_failed=0 count=58 cstate=..."), and potentially, if the last bit to be cleared during resync would reside in previously misaccounted resync extent, the resync would never be recognized as finished, but would be "stalled" forever, even though all blocks are in sync again and all bits have been cleared... The regression was introduced by drbd: get rid of atomic update on disk bitmap works For an "empty" resync (rs_total == 0), we must not "finish" the resync on the SyncSource before the SyncTarget knows all relevant information (sync uuid). We need to wait for the full round-trip, the SyncTarget will then explicitly notify us. Also for normal, non-empty resyncs (rs_total > 0), the resync-finished condition needs to be tested before the schedule() in wait_for_work, or it is likely to be missed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: fix a race stopping the worker threadLars Ellenberg
We may implicitly call drbd_send() from inside wait_for_work(), via maybe_send_barrier(). If the "stop" signal was send just before that, drbd_send() would call flush_signals(), and we would run an unbounded schedule() afterwards. Fix: check for thread_state == RUNNING before we schedule() Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: get rid of atomic update on disk bitmap worksLars Ellenberg
Just trigger the occasional lazy bitmap write-out during resync from the central wait_for_work() helper. Previously, during resync, bitmap pages would be written out separately, synchronously, one at a time, at least 8 times each (every 512 bytes worth of bitmap cleared). Now we trigger "merge friendly" bulk write out of all cleared pages every two seconds during resync, and once the resync is finished. Most pages will be written out only once. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: allow write-ordering policy to be bumped up againLars Ellenberg
Previously, once you disabled flushes as a means of enforcing write-ordering, you'd need to detach/re-attach to enable them again. Allow drbdsetup disk-options to re-enable previously disabled write-ordering policy options at runtime. While at it fix RCU in drbd_bump_write_ordering() max_allowed_wo() uses rcu_dereference, therefore it must be called within rcu_read_lock()/rcu_read_unlock() Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: refactor use of first_peer_device()Lars Ellenberg
Reduce the number of calls to first_peer_device(). Instead, call first_peer_device() just once to assign a local variable peer_device. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: reduce number of spinlock drop/re-aquire cyclesLars Ellenberg
Instead of dropping and re-aquiring the spinlock around the submit, just remember that we want to submit, and do that only once we have dropped the spinlock for good. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: rename drbd_free_bc() to drbd_free_ldev()Philipp Reisner
Since the member of drbd_device is called ldev Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: device->ldev is not guaranteed on an D_ATTACHING diskPhilipp Reisner
Some parts of the code assumed that get_ldev_if_state(device, D_ATTACHING) is sufficient to access the ldev member of the device object. That was wrong. ldev may not be there or might be freed at any time if the device has a disk state of D_ATTACHING. bm_rw() Documented that drbd_bm_read() is only called from drbd_adm_attach. drbd_bm_write() is only called when a reference is held, and it is documented that a caller has to hold a reference before calling drbd_bm_write() drbd_bm_write_page() Use get_ldev() instead of get_ldev_if_state(device, D_ATTACHING) drbd_bmio_set_n_write() No longer use get_ldev_if_state(device, D_ATTACHING). All callers hold a reference to ldev now. drbd_bmio_clear_n_write() All callers where holding a reference of ldev anyways. Remove the misleading get_ldev_if_state(device, D_ATTACHING) drbd_reconsider_max_bio_size() Removed the get_ldev_if_state(device, D_ATTACHING). All callers now pass a struct drbd_backing_dev* when they have a proper reference, or a NULL pointer. Before this fix, the receiver could trigger a NULL pointer deref when in drbd_reconsider_max_bio_size() drbd_bump_write_ordering() Used get_ldev_if_state(device, D_ATTACHING) with the wrong assumption. Remove it, and allow the caller to pass in a struct drbd_backing_dev* when the caller knows that accessing this bdev is safe. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: Move write_ordering from connection to resourcePhilipp Reisner
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10drbd: fix regression 'out of mem, failed to invoke fence-peer helper'Lars Ellenberg
Since linux kernel 3.13, kthread_run() internally uses wait_for_completion_killable(). We sometimes may use kthread_run() while we still have a signal pending, which we used to kick our threads out of potentially blocking network functions, causing kthread_run() to mistake that as a new fatal signal and fail. Fix: flush_signals() before kthread_run(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-08rbd: do not leak image_id in rbd_dev_v2_parent_info()Ilya Dryomov
image_id is leaked if the parent happens to have been recorded already. Fix it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08rbd: use rbd_obj_watch_request_helper() helperIlya Dryomov
Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix rbd_dev_header_unwatch_sync() to destroy watch_request structures before queuing watch-remove message while at it. This mistake slipped into commit b30a01f2a307 ("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()") and could lead to "image still in use" errors on image removal. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08rbd: add rbd_obj_watch_request_helper() helperIlya Dryomov
In the past, rbd_dev_header_watch_sync() used to handle both watch and unwatch requests and was entangled and leaky. Commit b30a01f2a307 ("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()") split it into two separate functions. This commit cleanly abstracts the common bits, relying on the fixed rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08rbd: rbd_obj_request_wait() should cancel the request if interruptedIlya Dryomov
rbd_obj_request_wait() should cancel the underlying OSD request if interrupted. Otherwise libceph will hold onto it indefinitely, causing assert failures or leaking the original object request. This also adds an rbd wrapper around ceph_osdc_cancel_request() to match rbd_obj_request_submit() and rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-03zram: revalidate disk after capacity changeMinchan Kim
Alexander reported mkswap on /dev/zram0 is failed if other process is opening the block device file. Step is as follows, 0. Reset the unused zram device. 1. Use a program that opens /dev/zram0 with O_RDWR and sleeps until killed. 2. While that program sleeps, echo the correct value to /sys/block/zram0/disksize. 3. Verify (e.g. in /proc/partitions) that the disk size is applied correctly. It is. 4. While that program still sleeps, attempt to mkswap /dev/zram0. This fails: mkswap: error: swap area needs to be at least 40 KiB When I investigated, the size get by ioctl(fd, BLKGETSIZE64, xxx) on mkswap to get a size of blockdev was zero although zram0 has right size by 2. The reason is zram didn't revalidate disk after changing capacity so that size of blockdev's inode is not uptodate until all of file is close. This patch should fix the BUG. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Alexander E. Patrakov <patrakov@gmail.com> Tested-by: Alexander E. Patrakov <patrakov@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-01block: virtio-blk: support multi virt queues per virtio-blk deviceMing Lei
Firstly this patch supports more than one virtual queues for virtio-blk device. Secondly this patch maps the virtual queue to blk-mq's hardware queue. With this approach, both scalability and performance can be improved. Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-26Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A small collection of fixes/changes for the current series. This contains: - Removal of dead code from Gu Zheng. - Revert of two bad fixes that went in earlier in this round, marking things as __init that were not purely used from init. - A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(), which could place us wrongly. Make it use the non __ variant, which handles cases where we are called from the wrong CPU set. From me. - A fix for drbd, which allocates discard requests without room for the SCSI payload. From Lars Ellenberg. - A fix for user-after-free in the blkcg code from Tejun. - Addition of limiting gaps in SG lists, if the hardware needs it. This is the last pre-req patch for blk-mq to enable the full NVMe conversion. Could wait until 3.17, but it's simple enough so would be nice to have everything we need for the NVMe port in the 3.17 release. From me" * 'for-linus' of git://git.kernel.dk/linux-block: drbd: fix NULL pointer deref in blk_add_request_payload blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue() block: add support for limiting gaps in SG lists bio: remove unused macro bip_vec_idx() Revert "block: add __init to elv_register" Revert "block: add __init to blkcg_policy_register" blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t floppy: format block0 read error message properly
2014-06-25drbd: fix NULL pointer deref in blk_add_request_payloadLars Ellenberg
Discards don't have any payload. But the scsi layer still expects a bio_vec it can use internally, see sd_setup_discard_cmnd() and blk_add_request_payload(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-23rbd: handle parent_overlap on writes correctlyIlya Dryomov
The following check in rbd_img_obj_request_submit() rbd_dev->parent_overlap <= obj_request->img_offset allows the fall through to the non-layered write case even if both parent_overlap and obj_request->img_offset belong to the same RADOS object. This leads to data corruption, because the area to the left of parent_overlap ends up unconditionally zero-filled instead of being populated with parent data. Suppose we want to write 1M to offset 6M of image bar, which is a clone of foo@snap; object_size is 4M, parent_overlap is 5M: rbd_data.<id>.0000000000000001 ---------------------|----------------------|------------ | should be copyup'ed | should be zeroed out | write ... ---------------------|----------------------|------------ 4M 5M 6M parent_overlap obj_request->img_offset 4..5M should be copyup'ed from foo, yet it is zero-filled, just like 5..6M is. Given that the only striping mode kernel client currently supports is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round parent_overlap up to the next object boundary for the purposes of the overlap check. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-06-19Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A smaller collection of fixes for the block core that would be nice to have in -rc2. This pull request contains: - Fixes for races in the wait/wakeup logic used in blk-mq from Alexander. No issues have been observed, but it is definitely a bit flakey currently. Alternatively, we may drop the cyclic wakeups going forward, but that needs more testing. - Some cleanups from Christoph. - Fix for an oops in null_blk if queue_mode=1 and softirq completions are used. From me. - A fix for a regression caused by the chunk size setting. It inadvertently used max_hw_sectors instead of max_sectors, which is incorrect, and causes hangs on btrfs multi-disk setups (where hw sectors apparently isn't set). From me. - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a recent addition as well, but it actually breaks blk-mq which relies on strict scheduling. If the workqueue power_efficient mode is turned on, this breaks blk-mq. From Matias. - null_blk module parameter description fix from Mike" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: bitmap tag: fix races in bt_get() function blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt blk-mq: bitmap tag: fix races on shared ::wake_index fields block: blk_max_size_offset() should check ->max_sectors null_blk: fix softirq completions for queue_mode == 1 blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue blk-mq: properly drain stopped queues block: remove WQ_POWER_EFFICIENT from kblockd null_blk: fix name and description of 'queue_mode' module parameter block: remove elv_abort_queue and blk_abort_flushes
2014-06-18Merge branch 'for-jens' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus
2014-06-18floppy: format block0 read error message properlyJiri Kosina
In case reading of block 0 fails, line without trailing newline is printed causing dmesg to look horrible. Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-06-16null_blk: fix softirq completions for queue_mode == 1Jens Axboe
Only blk-mq completions have payload attached to the request, for request_fn mode we have stored it in req->special. This fixes an oops with queue_mode=1 and softirq completions. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-15Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds
Pull NVMe update from Matthew Wilcox: "Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks" [ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when mq conversion happens ] * git://git.infradead.org/users/willy/linux-nvme: (22 commits) NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. NVMe: Use Log Page constants in SCSI emulation NVMe: Define Log Page constants NVMe: Fix hot cpu notification dead lock NVMe: Rename io_timeout to nvme_io_timeout NVMe: Use last bytes of f/w rev SCSI Inquiry NVMe: Adhere to request queue block accounting enable/disable NVMe: Fix nvme get/put queue semantics NVMe: Delete NVME_GET_FEAT_TEMP_THRESH NVMe: Make admin timeout a module parameter NVMe: Make iod bio timeout a parameter NVMe: Prevent possible NULL pointer dereference NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) NVMe: Update data structures for NVMe 1.2 NVMe: Enable BUILD_BUG_ON checks NVMe: Update namespace and controller identify structures to the 1.1a spec NVMe: Flush with data support NVMe: Configure support for block flush NVMe: Add tracepoints NVMe: Protect against badly formatted CQEs ...
2014-06-13NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.Dan McLeran
This patch contains several fixes for Scsi START_STOP_UNIT. The previous code did not account for signed vs. unsigned arithmetic which resulted in an invalid lowest power state caculation when the device only supports 1 power state. The code for Power Condition == 2 (Idle) was not following the spec. The spec calls for setting the device to specific power states, depending upon Power Condition Modifier, without accounting for the number of power states supported by the device. The code for Power Condition == 3 (Standby) was using a hard-coded '0' which is replaced with the macro POWER_STATE_0. Signed-off-by: Dan McLeran <daniel.mcleran@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13NVMe: Use Log Page constants in SCSI emulationMatthew Wilcox
The nvme-scsi file defined its own Log Page constant. Use the newly-defined one from the header file instead. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13NVMe: Fix hot cpu notification dead lockKeith Busch
There is a potential dead lock if a cpu event occurs during nvme probe since it registered with hot cpu notification. This fixes the race by having the module register with notification outside of probe rather than have each device register. The actual work is done in a scheduled work queue instead of in the notifier since assigning IO queues has the potential to block if the driver creates additional queues. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This has a mix of bug fixes and cleanups. Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only. Naturally there are several other cleanups mixed in for good measure" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) rbd: only set disk to read-only once rbd: move calls that may sleep out of spin lock range rbd: add ioctl for rbd ceph: use truncate_pagecache() instead of truncate_inode_pages() ceph: include time stamp in every MDS request rbd: fix ida/idr memory leak rbd: use reference counts for image requests rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() rbd: make sure we have latest osdmap on 'rbd map' libceph: add ceph_monc_wait_osdmap() libceph: mon_get_version request infrastructure libceph: recognize poolop requests in debugfs ceph: refactor readpage_nounlock() to make the logic clearer mds: check cap ID when handling cap export message ceph: remember subtree root dirfrag's auth MDS ceph: introduce ceph_fill_fragtree() ceph: handle cap import atomically ceph: pre-allocate ceph_cap struct for ceph_add_cap() ceph: update inode fields according to issued caps rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO ...
2014-06-11null_blk: fix name and description of 'queue_mode' module parameterMike Snitzer
'use_mq' is not the name of the module parameter, 'queue_mode' is. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "Final small batch of fixes to be included before -rc1. Some general cleanups in here as well, but some of the blk-mq fixes we need for the NVMe conversion and/or scsi-mq. The pull request contains: - Support for not merging across a specified "chunk size", if set by the driver. Some NVMe devices perform poorly for IO that crosses such a chunk, so we need to support it generically as part of request merging avoid having to do complicated split logic. From me. - Bump max tag depth to 10Ki tags. Some scsi devices have a huge shared tag space. Before we failed with EINVAL if a too large tag depth was specified, now we truncate it and pass back the actual value. From me. - Various blk-mq rq init fixes from me and others. - A fix for enter on a dying queue for blk-mq from Keith. This is needed to prevent oopsing on hot device removal. - Fixup for blk-mq timer addition from Ming Lei. - Small round of performance fixes for mtip32xx from Sam Bradshaw. - Minor stack leak fix from Rickard Strandqvist. - Two __init annotations from Fabian Frederick" * 'for-linus' of git://git.kernel.dk/linux-block: block: add __init to blkcg_policy_register block: add __init to elv_register block: ensure that bio_add_page() always accepts a page for an empty bio blk-mq: add timer in blk_mq_start_request blk-mq: always initialize request->start_time block: blk-exec.c: Cleaning up local variable address returnd mtip32xx: minor performance enhancements blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() blk-mq: don't allow queue entering for a dying queue blk-mq: bump max tag depth to 10K tags block: add blk_rq_set_block_pc() block: add notion of a chunk size for request merging
2014-06-10rbd: only set disk to read-only onceJosh Durgin
rbd_open(), called every time the device is opened, calls set_device_ro(). There's no reason to set the device read-only or read-write every time it is opened. Just do this once during device setup, using set_disk_ro() instead because the struct block_device isn't available to us there. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-10rbd: move calls that may sleep out of spin lock rangeJosh Durgin
get_user() and set_disk_ro() may allocate memory, leading to a potential deadlock if theye are called while a spin lock is held. Move the acquisition and release of rbd_dev->lock from rbd_ioctl() into rbd_ioctl_set_ro(), so it can occur between get_user() and set_disk_ro(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-10rbd: add ioctl for rbdGuangliang Zhao
When running the following commands: [root@ceph0 mnt]# blockdev --setro /dev/rbd1 [root@ceph0 mnt]# blockdev --getro /dev/rbd1 0 The block setro didn't take effect, it is because the rbd doesn't support ioctl of block driver. This resolves: http://tracker.ceph.com/issues/6265 Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-06-06nbd: zero from and len fields in NBD_CMD_DISCONNECT.Hani Benhabiles
Len field is already set to zero, but not the from field which is sent as 0xfffffffffffffe00. This makes no sense, and may cause confuse server implementations doing sanity checks (qemu-nbd is an example.) Signed-off-by: Hani Benhabiles <hani@linux.com> Cc: Paul Clements <paul.clements@us.sios.com> Cc: Paul Clements <Paul.Clements@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mtip32xx: minor performance enhancementsSam Bradshaw
This patch adds the following: 1) Compiler hinting in the fast path. 2) A prefetch of port->flags to eliminate moderate cpu stalling later in mtip_hw_submit_io(). 3) Eliminate a redundant rq_data_dir(). 4) Reorder members of driver_data to eliminate false cacheline sharing between irq_workers_active and unal_qdepth. With some workload and topology configurations, I'm seeing ~1.5% throughput improvement in small block random read benchmarks as well as improved latency std. dev. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Add include of <linux/prefetch.h> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06block: add blk_rq_set_block_pc()Jens Axboe
With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06rbd: fix ida/idr memory leakIlya Dryomov
ida_destroy() needs to be called on module exit to release ida caches. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-06rbd: use reference counts for image requestsAlex Elder
Each image request contains a reference count, but to date it has not actually been used. (I think this was just an oversight.) A recent report involving rbd failing an assertion shed light on why and where we need to use these reference counts. Every OSD request associated with an object request uses rbd_osd_req_callback() as its callback function. That function will call a helper function (dependent on the type of OSD request) that will set the object request's "done" flag if the object request if appropriate. If that "done" flag is set, the object request is passed to rbd_obj_request_complete(). In rbd_obj_request_complete(), requests are processed in sequential order. So if an object request completes before one of its predecessors in the image request, the completion is deferred. Otherwise, if it's a completing object's "turn" to be completed, it is passed to rbd_img_obj_end_request(), which records the result of the operation, accumulates transferred bytes, and so on. Next, the successor to this request is checked and if it is marked "done", (deferred) completion processing is performed on that request, and so on. If the last object request in an image request is completed, rbd_img_request_complete() is called, which (typically) destroys the image request. There is a race here, however. The instant an object request is marked "done" it can be provided (by a thread handling completion of one of its predecessor operations) to rbd_img_obj_end_request(), which (for the last request) can then lead to the image request getting torn down. And this can happen *before* that object has itself entered rbd_img_obj_end_request(). As a result, once it *does* enter that function, the image request (and even the object request itself) may have been freed and become invalid. All that's necessary to avoid this is to properly count references to the image requests. We tear down an image request's object requests all at once--only when the entire image request has completed. So there's no need for an image request to count references for its object requests. However, we don't want an image request to go away until the last of its object requests has passed through rbd_img_obj_callback(). In other words, we don't want rbd_img_request_complete() to necessarily result in the image request being destroyed, because it may get called before we've finished processing on all of its object requests. So the fix is to add a reference to an image request for each of its object requests. The reference can be viewed as representing an object request that has not yet finished its call to rbd_img_obj_callback(). That is emphasized by getting the reference right after assigning that as the image object's callback function. The corresponding release of that reference is done at the end of rbd_img_obj_callback(), which every image object request passes through exactly once. Cc: stable@vger.kernel.org Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-06-06rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()Ilya Dryomov
osd_request, along with r_request and r_reply messages attached to it are leaked in __rbd_dev_header_watch_sync() if the requested image doesn't exist. This is because lingering requests are special and get an extra ref in the reply path. Fix it by unregistering linger request on the error path and split __rbd_dev_header_watch_sync() into two functions to make it maintainable. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-06-06rbd: make sure we have latest osdmap on 'rbd map'Ilya Dryomov
Given an existing idle mapping (img1), mapping an image (img2) in a newly created pool (pool2) fails: $ ceph osd pool create pool1 8 8 $ rbd create --size 1000 pool1/img1 $ sudo rbd map pool1/img1 $ ceph osd pool create pool2 8 8 $ rbd create --size 1000 pool2/img2 $ sudo rbd map pool2/img2 rbd: sysfs write failed rbd: map failed: (2) No such file or directory This is because client instances are shared by default and we don't request an osdmap update when bumping a ref on an existing client. The fix is to use the mon_get_version request to see if the osdmap we have is the latest, and block until the requested update is received if it's not. Fixes: http://tracker.ceph.com/issues/8184 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-06-06rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERODuan Jiong
This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-04Merge branch 'akpm' (patchbomb from Andrew) into nextLinus Torvalds
Merge misc updates from Andrew Morton: - a few fixes for 3.16. Cc'ed to stable so they'll get there somehow. - various misc fixes and cleanups - most of the ocfs2 queue. Review is slow... - most of MM. The MM queue is pretty huge this time, but not much in the way of feature work. - some tweaks under kernel/ - printk maintenance work - updates to lib/ - checkpatch updates - tweaks to init/ * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits) fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul init/main.c: remove an ifdef kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND init/main.c: add initcall_blacklist kernel parameter init/main.c: don't use pr_debug() fs/binfmt_flat.c: make old_reloc() static fs/binfmt_elf.c: fix bool assignements fs/efs: convert printk(KERN_DEBUG to pr_debug fs/efs: add pr_fmt / use __func__ fs/efs: convert printk to pr_foo() scripts/checkpatch.pl: device_initcall is not the only __initcall substitute checkpatch: check stable email address checkpatch: warn on unnecessary void function return statements checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar); checkpatch: add warning for kmalloc/kzalloc with multiply checkpatch: warn on #defines ending in semicolon checkpatch: make --strict a default for files in drivers/net and net/ checkpatch: always warn on missing blank line after variable declaration block checkpatch: fix wildcard DT compatible string checking ...
2014-06-04zram: correct offset usage in zram_bio_discardWeijie Yang
We want to skip the physical block(PAGE_SIZE) which is partially covered by the discard bio, so we check the remaining size and subtract it if there is a need to goto the next physical block. The current offset usage in zram_bio_discard is incorrect, it will cause its upper filesystem breakdown. Consider the following scenario: On some architecture or config, PAGE_SIZE is 64K for example, filesystem is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a offset = 4K and size=72K, normally, it should not really discard any physical block as it partially cover two physical blocks. However, with the current offset usage, it will discard the second physical block and free its memory, which will cause filesystem breakdown. This patch corrects the offset usage in zram_bio_discard. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04brd: return -ENOSPC rather than -ENOMEM on page allocation failureMatthew Wilcox
brd is effectively a thinly provisioned device. Thinly provisioned devices return -ENOSPC when they can't write a new block. -ENOMEM is an implementation detail that callers shouldn't know. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04brd: add support for rw_page()Matthew Wilcox
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameterJens Axboe
We currently pass in the hardware queue, and get the tags from there. But from scsi-mq, with a shared tag space, it's a lot more convenient to pass in the blk_mq_tags instead as the hardware queue isn't always directly available. So instead of having to re-map to a given hardware queue from rq->mq_ctx, just pass in the tags structure. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-03NVMe: Rename io_timeout to nvme_io_timeoutMatthew Wilcox
It's positively immoral to have a global variable called 'io_timeout'. Keep the module parameter called io_timeout, though. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03NVMe: Use last bytes of f/w rev SCSI InquiryKeith Busch
After skipping right-padded spaces, use the last four bytes of the firmware revision when reporting the Inquiry Product Revision. These are generally more indicative to what is running. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03NVMe: Adhere to request queue block accounting enable/disableSam Bradshaw
Recently, a new sysfs control "iostats" was added to selectively enable or disable io statistics collection for request queues. This patch hooks that control. IO statistics collection is rather expensive on large, multi-node machines with drives pushing millions of iops. Having the ability to disable collection if not needed can improve throughput significantly. As a data point, on a quad E5-4640, I see more than 50% throughput improvement when io statistics accounting is disabled during heavily multi-threaded small block random read benchmarks where device performance is in the million iops+ range. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>