summaryrefslogtreecommitdiffstats
path: root/drivers/md
AgeCommit message (Collapse)Author
2012-10-11md/raid5: protect debug message against NULL derefernce.NeilBrown
The pr_debug in add_stripe_bio could race with something changing *bip, so it is best to hold the lock until after the pr_debug. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md/raid5: add some missing locking in handle_failed_stripe.NeilBrown
We really should hold the stripe_lock while accessing 'toread' else we could race with add_stripe_bio and corrupt a list. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11MD: raid5 avoid unnecessary zero page for trimShaohua Li
We want to avoid zero discarded dev page, because it's useless for discard. But if we don't zero it, another read/write hit such page in the cache and will get inconsistent data. To avoid zero the page, we don't set R5_UPTODATE flag after construction is done. In this way, discard write request is still issued and finished, but read will not hit the page. If the stripe gets accessed soon, we need reread the stripe, but since the chance is low, the reread isn't a big deal. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11MD: raid5 trim supportShaohua Li
Discard for raid4/5/6 has limitation. If discard request size is small, we do discard for one disk, but we need calculate parity and write parity disk. To correctly calculate parity, zero_after_discard must be guaranteed. Even it's true, we need do discard for one disk but write another disks, which makes the parity disks wear out fast. This doesn't make sense. So an efficient discard for raid4/5/6 should discard all data disks and parity disks, which requires the write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size is smaller than chunk_size, such pattern is almost impossible in practice. So in this patch, I only handle the case that A's size equals to chunk_size. That is discard request should be aligned to stripe size and its size is multiple of stripe size. Since we can only handle request with specific alignment and size (or part of the request fitting stripes), we can't guarantee zero_after_discard even zero_after_discard is true in low level drives. The block layer doesn't send down correctly aligned requests even correct discard alignment is set, so I must filter out. For raid4/5/6 parity calculation, if data is 0, parity is 0. So if zero_after_discard is true for all disks, data is consistent after discard. Otherwise, data might be lost. Let's consider a scenario: discard a stripe, write data to one disk and write parity disk. The stripe could be still inconsistent till then depending on using data from other data disks or parity disks to calculate new parity. If the disk is broken, we can't restore it. So in this patch, we only enable discard support if all disks have zero_after_discard. If discard fails in one disk, we face the similar inconsistent issue above. The patch will make discard follow the same path as normal write request. If discard fails, a resync will be scheduled to make the data consistent. This isn't good to have extra writes, but data consistency is important. If a subsequent read/write request hits raid5 cache of a discarded stripe, the discarded dev page should have zero filled, so the data is consistent. This patch will always zero dev page for discarded request stripe. This isn't optimal because discard request doesn't need such payload. Next patch will avoid it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md/bitmap:Don't use IS_ERR to judge alloc_page().Jianpeng Ma
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md/raid1: Don't release reference to device while handling read error.NeilBrown
When we get a read error, we arrange for raid1d to handle it. Currently we release the reference on the device. This can result in conf->mirrors[read_disk].rdev being NULL in fix_read_error, if the device happens to get removed before the read error is handled. So instead keep the reference until the read error has been fully handled. Reported-by: hank <pyu@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11raid: replace list_for_each_continue_rcu with new interfaceMichael Wang
This patch replaces list_for_each_continue_rcu() with list_for_each_entry_continue_rcu() to save a few lines of code and allow removing list_for_each_continue_rcu(). Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11DM RAID: Fix for "sync" directive ineffectivenessJonathan Brassow
There are two table arguments that can be given to a DM RAID target that control whether the array is forced to (re)synchronize or skip initialization: "sync" and "nosync". When "sync" is given, we set mddev->recovery_cp to 0 in order to cause the device to resynchronize. This is insufficient if there is a bitmap in use, because the array will simply look at the bitmap and see that there is no recovery necessary. The fix is to skip over the loading of the superblocks when "sync" is given, causing new superblocks to be written that will force the array to go through initialization (i.e. synchronization). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11DM RAID: Fix comparison of index and quantity for "rebuild" parameterJonathan Brassow
DM RAID: Fix comparison of index and quantity for "rebuild" parameter The "rebuild" parameter takes an index argument that starts counting from zero. The conditional used to validate the index was using '>' rather than '>=', leaving the door open for an index value that would be 1 too large. Reported-by: Neil Brown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11DM RAID: Add rebuild capability for RAID10Jonathan Brassow
DM RAID: Add code to validate replacement slots for RAID10 arrays RAID10 can handle 'copies - 1' failures for each mirror group. This code ensures the user has provided a valid array - one whose devices specified for rebuild do not exceed the amount of redundancy available. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11DM RAID: Move 'rebuild' checking code to its own functionJonathan Brassow
DM RAID: Move chunk of code to it's own function The code that checks whether device replacements/rebuilds are possible given a specific RAID type is moved to it's own function. It will further expand when the code to check RAID10 is added. A separate function makes it easier to read. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11MD RAID10: Prep for DM RAID10 device replacement capabilityJonathan Brassow
MD RAID10: Fix a couple potential kernel panics if RAID10 is used by dm-raid When device-mapper uses the RAID10 personality through dm-raid.c, there is no 'gendisk' structure in mddev and some sysfs information is also not populated. This patch avoids touching those non-existent structures. Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md: avoid taking the mutex on some ioctls.NeilBrown
Some ioctls don't need to take the mutex and doing so can cause a delay as it is held during super-block update. So move those ioctls out of the mutex and rely on rcu locking to ensure we don't access stale data. Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11MD: change the parameter of md threadShaohua Li
Change the thread parameter, so the thread can carry extra info. Next patch will use it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md/raid10: submit IO from originating thread instead of md thread.NeilBrown
queuing writes to the md thread means that all requests go through the one processor which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. This is nearly identical to a recent patch which provided similar functionality to RAID1. Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md: raid 10 supports TRIMShaohua Li
This makes md raid 10 support TRIM. If one disk supports discard and another not, or one has discard_zero_data and another not, there could be inconsistent between data from such disks. But this should not matter, discarded data is useless. This will add extra copy in rebuild though. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md: raid 1 supports TRIMShaohua Li
This makes md raid 1 support TRIM. If one disk supports discard and another not, or one has discard_zero_data and another not, there could be inconsistent between data from such disks. But this should not matter, discarded data is useless. This will add extra copy in rebuild though. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md: raid 0 supports TRIMShaohua Li
This makes md raid 0 support TRIM. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md: linear supports TRIMShaohua Li
This makes md linear support TRIM. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md/linear: rcu_dereference outside read-lock sectionDenis Efremov
According to the comment in linear_stop function rcu_dereference in linear_start and linear_stop functions occurs under reconfig_mutex. The patch represents this agreement in code and prevents lockdep complaint. Found by Linux Driver Verification project (linuxtesting.org) Signed-off-by: Denis Efremov <yefremov.denis@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO update from Jens Axboe: "Core block IO bits for 3.7. Not a huge round this time, it contains: - First series from Kent cleaning up and generalizing bio allocation and freeing. - WRITE_SAME support from Martin. - Mikulas patches to prevent O_DIRECT crashes when someone changes the block size of a device. - Make bio_split() work on data-less bio's (like trim/discards). - A few other minor fixups." Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew Morton. It is due to the VM no longer using a prio-tree (see commit 6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree"). So make set_blocksize() use mapping_mapped() instead of open-coding the internal VM knowledge that has changed. * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits) block: makes bio_split support bio without data scatterlist: refactor the sg_nents scatterlist: add sg_nents fs: fix include/percpu-rwsem.h export error percpu-rw-semaphore: fix documentation typos fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blockdev: turn a rw semaphore into a percpu rw semaphore Fix a crash when block device is read and block size is changed at the same time block: fix request_queue->flags initialization block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() block: ioctl to zero block ranges block: Make blkdev_issue_zeroout use WRITE SAME block: Implement support for WRITE SAME block: Consolidate command flag and queue limit checks for merges block: Clean up special command handling logic block/blk-tag.c: Remove useless kfree block: remove the duplicated setting for congestion_threshold block: reject invalid queue attribute values block: Add bio_clone_bioset(), bio_clone_kmalloc() block: Consolidate bio_alloc_bioset(), bio_kmalloc() ...
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-09-28Merge tag 'dm-3.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull dm fixes from Alasdair G Kergon: "A few fixes for problems discovered during the 3.6 cycle. Of particular note, are fixes to the thin target's discard support, which I hope is finally working correctly; and fixes for multipath ioctls and device limits when there are no paths." * tag 'dm-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm verity: fix overflow check dm thin: fix discard support for data devices dm thin: tidy discard support dm: retain table limits when swapping to new table with no devices dm table: clear add_random unless all devices have it set dm: handle requests beyond end of device instead of using BUG_ON dm mpath: only retry ioctl when no paths if queue_if_no_path set dm thin: do not set discard_zeroes_data
2012-09-27md/raid10: fix "enough" function for detecting if array is failed.NeilBrown
The 'enough' function is written to work with 'near' arrays only in that is implicitly assumes that the offset from one 'group' of devices to the next is the same as the number of copies. In reality it is the number of 'near' copies. So change it to make this number explicit. This bug makes it possible to run arrays without enough drives present, which is dangerous. It is appropriate for an -stable kernel, but will almost certainly need to be modified for some of them. Cc: stable@vger.kernel.org Reported-by: Jakub Husák <jakub@gooseman.cz> Signed-off-by: NeilBrown <neilb@suse.de>
2012-09-26dm verity: fix overflow checkMikulas Patocka
This patch fixes sector_t overflow checking in dm-verity. Without this patch, the code checks for overflow only if sector_t is smaller than long long, not if sector_t and long long have the same size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm thin: fix discard support for data devicesMike Snitzer
The discard limits that get established for a thin-pool or thin device may be incompatible with the pool's data device. Avoid this by checking the discard limits of the pool's data device. If an incompatibility is found then the pool's 'discard passdown' feature is disabled. Change thin_io_hints to ensure that a thin device always uses the same queue limits as its pool device. Introduce requested_pf to track whether or not the table line originally contained the no_discard_passdown flag and use this directly for table output. We prepare the correct setting for discard_passdown directly in bind_control_target (called from pool_io_hints) and store it in adjusted_pf rather than waiting until we have access to pool->pf in pool_preresume. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm thin: tidy discard supportMike Snitzer
A little thin discard code refactoring to make the next patch (dm thin: fix discard support for data devices) more readable. Pull out a couple of functions (and uses bools instead of unsigned for features). No functional changes. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm: retain table limits when swapping to new table with no devicesMike Snitzer
Add a safety net that will re-use the DM device's existing limits in the event that DM device has a temporary table that doesn't have any component devices. This is to reduce the chance that requests not respecting the hardware limits will reach the device. DM recalculates queue limits based only on devices which currently exist in the table. This creates a problem in the event all devices are temporarily removed such as all paths being lost in multipath. DM will reset the limits to the maximum permissible, which can then assemble requests which exceed the limits of the paths when the paths are restored. The request will fail the blk_rq_check_limits() test when sent to a path with lower limits, and will be retried without end by multipath. This became a much bigger issue after v3.6 commit fe86cdcef ("block: do not artificially constrain max_sectors for stacking drivers"). Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm table: clear add_random unless all devices have it setMilan Broz
Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not have it set. Otherwise devices with predictable characteristics may contribute entropy. QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings contribute to the random pool. For bio-based targets this flag is always 0 because such devices have no real queue. For request-based devices this flag was always set to 1 by default. Now set it according to the flags on underlying devices. If there is at least one device which should not contribute, set the flag to zero: If a device, such as fast SSD storage, is not suitable for supplying entropy, a request-based queue stacked over it will not be either. Because the checking logic is exactly same as for the rotational flag, share the iteration function with device_is_nonrot(). Signed-off-by: Milan Broz <mbroz@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm: handle requests beyond end of device instead of using BUG_ONMike Snitzer
The access beyond the end of device BUG_ON that was introduced to dm_request_fn via commit 29e4013de7ad950280e4b2208 ("dm: implement REQ_FLUSH/FUA support for request-based dm") was an overly drastic (but simple) response to this situation. I have received a report that this BUG_ON was hit and now think it would be better to use dm_kill_unmapped_request() to fail the clone and original request with -EIO. map_request() will assign the valid target returned by dm_table_find_target to tio->ti. But when the target isn't valid tio->ti is never assigned (because map_request isn't called); so add a check for tio->ti != NULL to dm_done(). Reported-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@vger.kernel.org # v2.6.37+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm mpath: only retry ioctl when no paths if queue_if_no_path setMike Snitzer
When there are no paths and multipath receives an ioctl, it waits until a path becomes available. This behaviour is incorrect if the "queue_if_no_path" setting was not specified, as then the ioctl should be rejected immediately, which this patch now does. commit 35991652b ("dm mpath: allow ioctls to trigger pg init") should have checked if queue_if_no_path was configured before queueing IO. Checking for the queue_if_no_path feature, like is done in map_io(), allows the following table load to work without blocking in the multipath_ioctl retry loop: echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs Without this fix the multipath_ioctl will block with the following stack trace: blkid D 0000000000000002 0 23936 1 0x00000000 ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440 ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440 ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040 Call Trace: [<ffffffff814ce099>] schedule+0x29/0x70 [<ffffffff814cc312>] schedule_timeout+0x182/0x2e0 [<ffffffff8104dee0>] ? lock_timer_base+0x70/0x70 [<ffffffff814cc48e>] schedule_timeout_uninterruptible+0x1e/0x20 [<ffffffff8104f840>] msleep+0x20/0x30 [<ffffffffa0000839>] multipath_ioctl+0x109/0x170 [dm_multipath] [<ffffffffa06bfb9c>] dm_blk_ioctl+0xbc/0xd0 [dm_mod] [<ffffffff8122a408>] __blkdev_driver_ioctl+0x28/0x30 [<ffffffff8122a79e>] blkdev_ioctl+0xce/0x730 [<ffffffff811970ac>] block_ioctl+0x3c/0x40 [<ffffffff8117321c>] do_vfs_ioctl+0x8c/0x340 [<ffffffff81166293>] ? sys_newfstat+0x33/0x40 [<ffffffff81173571>] sys_ioctl+0xa1/0xb0 [<ffffffff814d70a9>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.5+ Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-26dm thin: do not set discard_zeroes_dataMike Snitzer
The dm thin pool target claims to support the zeroing of discarded data areas. This turns out to be incorrect when processing discards that do not exactly cover a complete number of blocks, so the target must always set discard_zeroes_data_unsupported. The thin pool target will zero blocks when they are allocated if the skip_block_zeroing feature is not specified. The block layer may send a discard that only partly covers a block. If a thin pool block is partially discarded then there is no guarantee that the discarded data will get zeroed before it is accessed again. Due to this, thin devices cannot claim discards will always zero data. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-09-24md/raid5: add missing spin_lock_init.NeilBrown
commit b17459c05000fdbe8d10946570a26510f86ec0f raid5: add a per-stripe lock added a spin_lock to the 'stripe_head' struct. Unfortunately there are two places where this struct is allocated but the spin lock was only initialised in one of them. So add the missing spin_lock_init. Signed-off-by: NeilBrown <neilb@suse.de>
2012-09-20block: Implement support for WRITE SAMEMartin K. Petersen
The WRITE SAME command supported on some SCSI devices allows the same block to be efficiently replicated throughout a block range. Only a single logical block is transferred from the host and the storage device writes the same data to all blocks described by the I/O. This patch implements support for WRITE SAME in the block layer. The blkdev_issue_write_same() function can be used by filesystems and block drivers to replicate a buffer across a block range. This can be used to efficiently initialize software RAID devices, etc. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-19md: make sure metadata is updated when spares are activated or removed.NeilBrown
It isn't always necessary to update the metadata when spares are removed as the presence-or-not of a spare isn't really important to the integrity of an array. Also activating a spare doesn't always require updating the metadata as the update on 'recovery-completed' is usually sufficient. However the introduction of 'replacement' devices have made these transitions sometimes more important. For example the 'Replacement' flag isn't cleared until the original device is removed, so we need to ensure a metadata update after that 'spare' is removed. So set MD_CHANGE_DEVS whenever a spare is activated or removed, to complement the current situation where it is set when a spare is added or a device is failed (or a number of other less common situations). This is suitable for -stable as out-of-data metadata could lead to data corruption. This is only relevant for 3.3 and later 9when 'replacement' as introduced. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-09-19md/raid5: fix calculate of 'degraded' when a replacement becomes active.NeilBrown
When a replacement device becomes active, we mark the device that it replaces as 'faulty' so that it can subsequently get removed. However 'calc_degraded' only pays attention to the primary device, not the replacement, so the array appears to become degraded, which is wrong. So teach 'calc_degraded' to consider any replacement if a primary device is faulty. This is suitable for -stable as an incorrect 'degraded' value can confuse md and could lead to data corruption. This is only relevant for 3.3 and later. Cc: stable@vger.kernel.org Reported-by: Robin Hill <robin@robinhill.me.uk> Reported-by: John Drescher <drescherjm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-09-19Revert "md/raid5: For odirect-write performance, do not set ↵NeilBrown
STRIPE_PREREAD_ACTIVE." This reverts commit 895e3c5c58a80bb9e4e05d9ac38b4f30e0f97d80. While this patch seemed like a good idea and did help some workloads, it hurts other workloads. Large sequential O_DIRECT writes were faster, Small random O_DIRECT writes were slower. Other changes (batching RAID5 writes) have improved the sequential writes using a different mechanism, so the net result of this patch is definitely negative. So revert it. Reported-by: Shaohua Li <shli@kernel.org> Tested-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-09-09block: Add bio_clone_bioset(), bio_clone_kmalloc()Kent Overstreet
Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch adedd. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09dm: Use bioset's front_pad for dm_rq_clone_bio_infoKent Overstreet
Previously, dm_rq_clone_bio_info needed to be freed by the bio's destructor to avoid a memory leak in the blk_rq_prep_clone() error path. This gets rid of a memory allocation and means we can kill dm_rq_bio_destructor. The _rq_bio_info_cache kmem cache is unused now and needs to be deleted, but due to the way io_pool is used and overloaded this looks not quite trivial so I'm leaving it for a later patch. v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Alasdair Kergon <agk@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: Ues bi_pool for bio_integrity_alloc()Kent Overstreet
Now that bios keep track of where they were allocated from, bio_integrity_alloc_bioset() becomes redundant. Remove bio_integrity_alloc_bioset() and drop bio_set argument from the related functions and make them use bio->bi_pool. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09block: Generalized bio pool freeingKent Overstreet
With the old code, when you allocate a bio from a bio pool you have to implement your own destructor that knows how to find the bio pool the bio was originally allocated from. This adds a new field to struct bio (bi_pool) and changes bio_alloc_bioset() to use it. This makes various bio destructors unnecessary, so they're then deleted. v6: Explain the temporary if statement in bio_put Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Nicholas Bellinger <nab@linux-iscsi.org> CC: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-20workqueue: deprecate flush[_delayed]_work_sync()Tejun Heo
flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>
2012-08-18md/raid10: fix problem with on-stack allocation of r10bio structure.NeilBrown
A 'struct r10bio' has an array of per-copy information at the end. This array is declared with size [0] and r10bio_pool_alloc allocates enough extra space to store the per-copy information depending on the number of copies needed. So declaring a 'struct r10bio on the stack isn't going to work. It won't allocate enough space, and memory corruption will ensue. So in the two places where this is done, declare a sufficiently large structure and use that instead. The two call-sites of this bug were introduced in 3.4 and 3.5 so this is suitable for both those kernels. The patch will have to be modified for 3.4 as it only has one bug. Cc: stable@vger.kernel.org Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-16md: Don't truncate size at 4TB for RAID0 and LinearNeilBrown
commit 27a7b260f71439c40546b43588448faac01adb93 md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. changed 0.90 metadata handling to truncated size to 4TB as that is all that 0.90 can record. However for RAID0 and Linear, 0.90 doesn't need to record the size, so this truncation is not needed and causes working arrays to become too small. So avoid the truncation for RAID0 and Linear This bug was introduced in 3.1 and is suitable for any stable kernels from then onwards. As the offending commit was tagged for 'stable', any stable kernel that it was applied to should also get this patch. That includes at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for providing that list). Cc: stable@vger.kernel.org Signed-off-by: Neil Brown <neilb@suse.de>
2012-08-02Merge tag 'md-3.6' of git://neil.brown.name/mdLinus Torvalds
Pull additional md update from NeilBrown: "This contains a few patches that depend on plugging changes in the block layer so needed to wait for those. It also contains a Kconfig fix for the new RAID10 support in dm-raid." * tag 'md-3.6' of git://neil.brown.name/md: md/dm-raid: DM_RAID should select MD_RAID10 md/raid1: submit IO from originating thread instead of md thread. raid5: raid5d handle stripe in batch way raid5: make_request use batch stripe release
2012-08-02md/dm-raid: DM_RAID should select MD_RAID10NeilBrown
Now that DM_RAID supports raid10, it needs to select that code to ensure it is included. Cc: Jonathan Brassow <jbrassow@redhat.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-02md/raid1: submit IO from originating thread instead of md thread.NeilBrown
queuing writes to the md thread means that all requests go through the one processor which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-02raid5: raid5d handle stripe in batch wayShaohua Li
Let raid5d handle stripe in batch way to reduce conf->device_lock locking. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-02raid5: make_request use batch stripe releaseShaohua Li
make_request() does stripe release for every stripe and the stripe usually has count 1, which makes previous release_stripe() optimization not work. In my test, this release_stripe() becomes the heaviest pleace to take conf->device_lock after previous patches applied. Below patch makes stripe release batch. All the stripes will be released in unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe lru. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-01Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous