summaryrefslogtreecommitdiffstats
path: root/drivers/md
AgeCommit message (Collapse)Author
2015-02-06md: remove need for mddev_lock() in md_seq_show()NeilBrown
The only access in md_seq_show that could suffer from races not protected by ->lock is walking the rdev list. This can receive sufficient protection from 'rcu'. So use rdev_for_each_rcu() and get rid of mddev_lock(). Now reading /proc/mdstat will never block in md_seq_show. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06md/bitmap: protect clearing of ->bitmap by mddev->lockNeilBrown
This makes it safe to inspect the struct while holding only the spinlock. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-03Merge tag 'md/3.19-fixes' of git://neil.brown.name/mdLinus Torvalds
Pull two fixes for md from Neil Brown: - Another live lock, needs backporting - work-around false positive with new warnings. * tag 'md/3.19-fixes' of git://neil.brown.name/md: md/bitmap: fix a might_sleep() warning. md/raid5: fix another livelock caused by non-aligned writes.
2015-02-04md: protect ->pers changes with mddev->lockNeilBrown
->pers is already protected by ->reconfig_mutex, and cannot possibly change when there are threads running or outstanding IO. However there are some places where we access ->pers not in a thread or IO context, and where ->reconfig_mutex is unnecessarily heavy-weight: level_show and md_seq_show(). So protect all changes, and those accesses, with ->lock. This is a step toward taking those accesses out from under reconfig_mutex. [Fixed missing "mddev->pers" -> "pers" conversion, thanks to Dan Carpenter <dan.carpenter@oracle.com>] Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: level_store: group all important changes into one place.NeilBrown
Gather all the changes that can happen atomically and might be relevant to other code into one place. This will make it easier to refine the locking. Note that this puts quite a few things between mddev_detach() and ->free(). Enabling this was the point of some recent patches. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: rename ->stop to ->freeNeilBrown
Now that the ->stop function only frees the private data, rename is accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: split detach operation out from ->stop.NeilBrown
Each md personality has a 'stop' operation which does two things: 1/ it finalizes some aspects of the array to ensure nothing is accessing the ->private data 2/ it frees the ->private data. All the steps in '1' can apply to all arrays and so can be performed in common code. This is useful as in the case where we change the personality which manages an array (in level_store()), it would be helpful to do step 1 early, and step 2 later. So split the 'step 1' functionality out into a new mddev_detach(). Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md/linear: remove rcu protections in favour of suspend/resumeNeilBrown
The use of 'rcu' to protect accesses to ->private_data so that the ->private_data could be updated predates the introduction of mddev_suspend/mddev_resume. These are a cleaner mechanism for providing stability while swapping in a new ->private data - it is used by level_store() to support changing of raid levels. So get rid of the RCU stuff and just use mddev_suspend, mddev_resume. As these function call ->quiesce(), we add an empty function for linear just like for raid0. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: make merge_bvec_fn more robust in face of personality changes.NeilBrown
There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: make ->congested robust against personality changes.NeilBrown
There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: rename mddev->write_lock to mddev->lockNeilBrown
This lock is used for (slightly) more than helping with writing superblocks, and it will soon be extended further. So the name is inappropriate. Also, the _irq variant hasn't been needed since 2.6.37 as it is never taking from interrupt or bh context. So: -rename write_lock to lock -document what it protects -remove _irq ... except in md_flush_request() as there is no wait_event_lock() (with no _irq). This can be cleaned up after appropriate changes to wait.h. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md/raid5: need_this_block: tidy/fix last condition.NeilBrown
That last condition is unclear and over cautious. There are two related issues here. If a partial write is destined for a missing device, then either RMW or RCW can work. We must read all the available block. Only then can the missing blocks be calculated, and then the parity update performed. If RMW is not an option, then there is a complication even without partial writes. If we would need to read a missing device to perform the reconstruction, then we must first read every block so the missing device data can be computed. This is the case for RAID6 (Which currently does not support RMW) and for times when we don't trust the parity (after a crash) and so are in the process of resyncing it. So make these two cases more clear and separate, and perform the relevant tests more thoroughly. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md/raid5: need_this_block: start simplifying the last two conditions.NeilBrown
Both the last two cases are only relevant if something has failed and something needs to be written (but not over-written), and if it is OK to pre-read blocks at this point. So factor out those tests and explain them. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md/raid5: separate out the easy conditions in need_this_block.NeilBrown
Some of the conditions in need_this_block have very straight forward motivation. Separate those out and document them. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md/raid5: separate large if clause out of fetch_block().NeilBrown
fetch_block() has a very large and hard to read 'if' condition. Separate it into its own function so that it can be made more readable. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: do_release_stripe(): No need to call md_wakeup_thread() twiceJes Sorensen
67f455486d2ea20b2d94d6adf5b9b783d079e321 introduced a call to md_wakeup_thread() when adding to the delayed_list. However the md thread is woken up unconditionally just below. Remove the unnecessary wakeup call. Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02md/bitmap: fix a might_sleep() warning.NeilBrown
commit 8eb23b9f35aae413140d3fda766a98092c21e9b0 sched: Debug nested sleeps causes false-positive warnings in RAID5 code. This annotation removes them and adds a comment explaining why there is no real problem. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02md/raid5: fix another livelock caused by non-aligned writes.NeilBrown
If a non-page-aligned write is destined for a device which is missing/faulty, we can deadlock. As the target device is missing, a read-modify-write cycle is not possible. As the write is not for a full-page, a recontruct-write cycle is not possible. This should be handled by logic in fetch_block() which notices there is a non-R5_OVERWRITE write to a missing device, and so loads all blocks. However since commit 67f455486d2ea2, that code requires STRIPE_PREREAD_ACTIVE before it will active, and those circumstances never set STRIPE_PREREAD_ACTIVE. So: in handle_stripe_dirtying, if neither rmw or rcw was possible, set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set after a suitable delay. Fixes: 67f455486d2ea20b2d94d6adf5b9b783d079e321 Cc: stable@vger.kernel.org (v3.16+) Reported-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-01-28block: require blk_rq_prep_clone() be given an initialized clone requestKeith Busch
Prepare to allow blk_rq_prep_clone() to accept clone requests that were allocated from blk-mq request queues. As such the blk_rq_prep_clone() caller must first initialize the clone request. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-28dm thin: don't allow messages to be sent to a pool target in READ_ONLY or ↵Joe Thornber
FAIL mode You can't modify the metadata in these modes. It's better to fail these messages immediately than let the block-manager deny write locks on metadata blocks. Otherwise these failed metadata changes will trigger 'needs_check' to get set in the metadata superblock -- requiring repair using the thin_check utility. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-01-28dm cache: fix missing ERR_PTR returns and handlingJoe Thornber
Commit 9b1cc9f251 ("dm cache: share cache-metadata object across inactive and active DM tables") mistakenly ignored the use of ERR_PTR returns. Restore missing IS_ERR checks and ERR_PTR returns where appropriate. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-01-24dm: fix handling of multiple internal suspendsMikulas Patocka
Commit ffcc393641 ("dm: enhance internal suspend and resume interface") attempted to handle multiple internal suspends on the same device, but it did that incorrectly. When these functions are called in this order on the same device the device is no longer suspended, but it should be: dm_internal_suspend_noflush dm_internal_suspend_noflush dm_internal_resume Fix this bug by maintaining an 'internal_suspend_count' and resuming the device when this count drops to zero. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-01-23dm cache: fix problematic dual use of a single migration count variableJoe Thornber
Introduce a new variable to count the number of allocated migration structures. The existing variable cache->nr_migrations became overloaded. It was used to: i) track of the number of migrations in flight for the purposes of quiescing during suspend. ii) to estimate the amount of background IO occuring. Recent discard changes meant that REQ_DISCARD bios are processed with a migration. Discards are not background IO so nr_migrations was not incremented. However this could cause quiescing to complete early. (i) is now handled with a new variable cache->nr_allocated_migrations. cache->nr_migrations has been renamed cache->nr_io_migrations. cleanup_migration() is now called free_io_migration(), since it decrements that variable. Also, remove the unused cache->next_migration variable that got replaced with with prealloc_structs a while ago. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-01-23dm cache: share cache-metadata object across inactive and active DM tablesJoe Thornber
If a DM table is reloaded with an inactive table when the device is not suspended (normal procedure for LVM2), then there will be two dm-bufio objects that can diverge. This can lead to a situation where the inactive table uses bufio to read metadata at the same time the active table writes metadata -- resulting in the inactive table having stale metadata buffers once it is promoted to the active table slot. Fix this by using reference counting and a global list of cache metadata objects to ensure there is only one metadata object per metadata device. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-01-21Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Preemptible-RCU fixes, including fixing an old bug in the interaction of RCU priority boosting and CPU hotplug. - SRCU updates. - RCU CPU stall-warning updates. - RCU torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-07kconfig: use bool instead of boolean for type definition attributesChristoph Jaeger
Support for keyword 'boolean' will be dropped later on. No functional change. Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com Signed-off-by: Christoph Jaeger <cj@linux.com> Signed-off-by: Michal Marek <mmarek@suse.cz>
2015-01-06rcu: Make SRCU optional by using CONFIG_SRCUPranith Kumar
SRCU is not necessary to be compiled by default in all cases. For tinification efforts not compiling SRCU unless necessary is desirable. The current patch tries to make compiling SRCU optional by introducing a new Kconfig option CONFIG_SRCU which is selected when any of the components making use of SRCU are selected. If we do not select CONFIG_SRCU, srcu.o will not be compiled at all. text data bss dec hex filename 2007 0 0 2007 7d7 kernel/rcu/srcu.o Size of arch/powerpc/boot/zImage changes from text data bss dec hex filename 831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before 829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after so the savings are about ~2000 bytes. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> CC: Josh Triplett <josh@joshtriplett.org> CC: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2014-12-17dm: fix missed error code if .end_io isn't implemented by target_typezhendong chen
In bio-based DM's clone_endio(), when target_type doesn't implement .end_io (e.g. linear) r will be always be initialized 0. So if a WRITE SAME bio fails WRITE SAME will not be disabled as intended. Fix this by initializing r to error, rather than 0, in clone_endio(). Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails") Cc: stable@vger.kernel.org
2014-12-17dm thin: fix crash by initializing thin device's refcount and completion earlierMarc Dionne
Commit 80e96c5484be ("dm thin: do not allow thin device activation while pool is suspended") delayed the initialization of a new thin device's refcount and completion until after this new thin was added to the pool's active_thins list and the pool lock is released. This opens a race with a worker thread that walks the list and calls thin_get/put, noticing that the refcount goes to 0 and calling complete, freezing up the system and giving the oops below: kernel: BUG: unable to handle kernel NULL pointer dereference at (null) kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90 kernel: Call Trace: kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool] kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool] kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 Set the thin device's initial refcount and initialize the completion before adding it to the pool's active_thins list in thin_ctr(). Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-17dm thin: fix missing out-of-data-space to write mode transition if blocks ↵Joe Thornber
are released Discard bios and thin device deletion have the potential to release data blocks. If the thin-pool is in out-of-data-space mode, and blocks were released, transition the thin-pool back to full write mode. The correct time to do this is just after the thin-pool metadata commit. It cannot be done before the commit because the space maps will not allow immediate reuse of the data blocks in case there's a rollback following power failure. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-17dm thin: fix inability to discard blocks when in out-of-data-space modeJoe Thornber
When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard function pointer was incorrectly being set to process_prepared_discard_passdown rather than process_prepared_discard. This incorrect function pointer meant the discard was being passed down, but not effecting the mapping. As such any discard that was issued, in an attempt to reclaim blocks, would not successfully free data space. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-14Merge tag 'md/3.19' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: "Three fixes for md. I did have a largish set of locking changes queued, but late testing showed they weren't quite as stable as I thought and while I fixed what I found, I decided it safer to delay them a release ... particularly as I'll be AFK for a few weeks. So expect a larger batch next time :-)" * tag 'md/3.19' of git://neil.brown.name/md: md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. md: fix semicolon.cocci warnings md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
2014-12-13Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer driver updates from Jens Axboe: - NVMe updates: - The blk-mq conversion from Matias (and others) - A stack of NVMe bug fixes from the nvme tree, mostly from Keith. - Various bug fixes from me, fixing issues in both the blk-mq conversion and generic bugs. - Abort and CPU online fix from Sam. - Hot add/remove fix from Indraneel. - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp) - With the generic IO stat accounting from 3.19/core, converting md, bcache, and rsxx to use those. From Gu Zheng. - Boundary check for queue/irq mode for null_blk from Matias. Fixes cases where invalid values could be given, causing the device to hang. - The xen blkfront pull request, with two bug fixes from Vitaly. * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits) NVMe: fix race condition in nvme_submit_sync_cmd() NVMe: fix retry/error logic in nvme_queue_rq() NVMe: Fix FS mount issue (hot-remove followed by hot-add) NVMe: fix error return checking from blk_mq_alloc_request() NVMe: fix freeing of wrong request in abort path xen/blkfront: remove redundant flush_op xen/blkfront: improve protection against issuing unsupported REQ_FUA NVMe: Fix command setup on IO retry null_blk: boundary check queue_mode and irqmode block/rsxx: use generic io stats accounting functions to simplify io stat accounting md: use generic io stats accounting functions to simplify io stat accounting drbd: use generic io stats accounting functions to simplify io stat accounting md/bcache: use generic io stats accounting functions to simplify io stat accounting NVMe: Update module version major number NVMe: fail pci initialization if the device doesn't have any BARs NVMe: add ->exit_hctx() hook NVMe: make setup work for devices that don't do INTx NVMe: enable IO stats by default NVMe: nvme_submit_async_admin_req() must use atomic rq allocation NVMe: replace blk_put_request() with blk_mq_free_request() ...
2014-12-11md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.NeilBrown
A recent change to md started the ->sync_thread from a asynchronously from a work_queue rather than synchronously. This means that there can be a small window between the time when MD_RECOVERY_RUNNING is set and when ->sync_thread is set. So code that checks ->sync_thread might now conclude that the thread has not been started and (because a lock is held) will not be started. That is no longer the case. Most of those places are best fixed by testing MD_RECOVERY_RUNNING as well. To make this completely reliable, we wake_up(&resync_wait) after clearing that flag as well as after clearing ->sync_thread. Other places are better served by flushing the relevant workqueue to ensure that that if the sync thread was starting, it has now started. This is particularly best if we are about to stop the sync thread. Fixes: ac05f256691fe427a3e84c19261adb0b67dd73c0 Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-08Merge tag 'dm-3.19-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Significant DM thin-provisioning performance improvements to meet performance requirements that were requested by the Gluster distributed filesystem. Specifically, dm-thinp now takes care to aggregate IO that will be issued to the same thinp block before issuing IO to the underlying devices. This really helps improve performance on HW RAID6 devices that have a writeback cache because it avoids RMW in the HW RAID controller. - Some stable fixes: fix leak in DM bufio if integrity profiles were enabled, use memzero_explicit in DM crypt to avoid any potential for information leak, and a DM cache fix to properly mark a cache block dirty if it was promoted to the cache via the overwrite optimization. - A few simple DM persistent data library fixes - DM cache multiqueue policy block promotion improvements. - DM cache discard improvements that take advantage of range (multiblock) discard support in the DM bio-prison. This allows for much more efficient bulk discard processing (e.g. when mkfs.xfs discards the entire device). - Some small optimizations in DM core and RCU deference cleanups - DM core changes to suspend/resume code to introduce the new internal suspend/resume interface that the DM thin-pool target now uses to suspend/resume active thin devices when the thin-pool must suspend/resume. This avoids forcing userspace to track all active thin volumes in a thin-pool when the thin-pool is suspended for the purposes of metadata or data space resize. * tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (49 commits) dm crypt: use memzero_explicit for on-stack buffer dm space map metadata: fix sm_bootstrap_get_count() dm space map metadata: fix sm_bootstrap_get_nr_blocks() dm bufio: fix memleak when using a dm_buffer's inline bio dm cache: fix spurious cell_defer when dealing with partial block at end of device dm cache: dirty flag was mistakenly being cleared when promoting via overwrite dm cache: only use overwrite optimisation for promotion when in writeback mode dm cache: discard block size must be a multiple of cache block size dm cache: fix a harmless race when working out if a block is discarded dm cache: when reloading a discard bitset allow for a different discard block size dm cache: fix some issues with the new discard range support dm array: if resizing the array is a noop set the new root to the old one dm: use rcu_dereference_protected instead of rcu_dereference dm thin: fix pool_io_hints to avoid looking at max_hw_sectors dm thin: suspend/resume active thin devices when reloading thin-pool dm: enhance internal suspend and resume interface dm thin: do not allow thin device activation while pool is suspended dm: add presuspend_undo hook to target_type dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl dm thin: remove stale 'trim' message in block comment above pool_message ...
2014-12-03md: fix semicolon.cocci warningskbuild test robot
drivers/md/md.c:7175:43-44: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-03md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.NeilBrown
It is critical that fetch_block() and handle_stripe_dirtying() are consistent in their analysis of what needs to be loaded. Otherwise raid5 can wait forever for a block that won't be loaded. Currently when writing to a RAID5 that is resyncing, to a location beyond the resync offset, handle_stripe_dirtying chooses a reconstruct-write cycle, but fetch_block() assumes a read-modify-write, and a lockup can happen. So treat that case just like RAID6, just as we do in handle_stripe_dirtying. RAID6 always does reconstruct-write. This bug was introduced when the behaviour of handle_stripe_dirtying was changed in 3.7, so the patch is suitable for any kernel since, though it will need careful merging for some versions. Cc: stable@vger.kernel.org (v3.7+) Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f Reported-by: Henry Cai <henryplusplus@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-02dm crypt: use memzero_explicit for on-stack bufferMilan Broz
Use memzero_explicit to cleanup sensitive data allocated on stack to prevent the compiler from optimizing and removing memset() calls. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-02dm space map metadata: fix sm_bootstrap_get_count()Joe Thornber
Must set 'result' accordingly rather than return it. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm space map metadata: fix sm_bootstrap_get_nr_blocks()Dan Carpenter
This function isn't right and it causes a static checker warning: drivers/md/dm-thin.c:3016 maybe_resize_data_dev() error: potentially using uninitialized 'sb_data_size'. It should set "*count" and return zero on success the same as the sm_metadata_get_nr_blocks() function does earlier. Fixes: 3241b1d3e0aa ('dm: add persistent data library') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm bufio: fix memleak when using a dm_buffer's inline bioDarrick J. Wong
When dm-bufio sets out to use the bio built into a struct dm_buffer to issue an IO, it needs to call bio_reset after it's done with the bio so that we can free things attached to the bio such as the integrity payload. Therefore, inject our own endio callback to take care of the bio_reset after calling submit_io's end_io callback. Test case: 1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300 2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device 3. Repeatedly read metadata and watch kmalloc-192 leak! Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: fix spurious cell_defer when dealing with partial block at end of ↵Joe Thornber
device We never bother caching a partial block that is at the back end of the origin device. No cell ever gets locked, but the calling code was assuming it was and trying to release it. Now the code only releases if the cell has been set to a non NULL value. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: dirty flag was mistakenly being cleared when promoting via overwriteJoe Thornber
If the incoming bio is a WRITE and completely covers a block then we don't bother to do any copying for a promotion operation. Once this is done the cache block and origin block will be different, so we need to set it to 'dirty'. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: only use overwrite optimisation for promotion when in writeback modeJoe Thornber
Overwrite causes the cache block and origin blocks to diverge, which is only allowed in writeback mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: discard block size must be a multiple of cache block sizeJoe Thornber
Otherwise the cache blocks may span two discard blocks, which we don't handle when doing the discard lookup. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm cache: fix a harmless race when working out if a block is discardedJoe Thornber
It is more correct to hold the cell before checking the discard state. These flags are only used as hints to the policy so this change will have negligable effect. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm cache: when reloading a discard bitset allow for a different discard ↵Joe Thornber
block size The discard block size can change if the origin changes size or if an old DM cache is upgraded from using a discard block size that was equal to cache block size. To fix this an extent of discarded blocks is established for the purpose of translating the old discard block size to the new in-core discard block size and set bits. The old (potentially huge) discard bitset is left ondisk until it is re-written using the new in-core information on the next successful DM cache shutdown. Fixes: 7ae34e777896 ("dm cache: improve discard support") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm cache: fix some issues with the new discard range supportJoe Thornber
Commit 7ae34e777 ("dm cache: improve discard support") needed to also: - discontinue having DM core split the discard bios on cache block boundaries - calculate the cache's discard_nr_blocks relative to the determined discard_block_size rather than using oblock_to_dblock() Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm array: if resizing the array is a noop set the new root to the old oneJoe Thornber
This could've been quite bad (to return success but not update the new root to point at the old) but in practice the only known consumer of the dm array code is the DM cache target. And the DM cache target passes in the same old root to array_resize() anyway. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-24md: use generic io stats accounting functions to simplify io stat accountingGu Zheng
Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>