summaryrefslogtreecommitdiffstats
path: root/drivers/md/raid1.c
AgeCommit message (Collapse)Author
2012-01-11md/raid1: perform bad-block tests for WriteMostly devices too.NeilBrown
We normally try to avoid reading from write-mostly devices, but when we do we really have to check for bad blocks and be sure not to try reading them. With the current code, best_good_sectors might not get set and that causes zero-length read requests to be send down which is very confusing. This bug was introduced in commit d2eb35acfdccbe2 and so the patch is suitable for 3.1.x and 3.2.x Reported-and-tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Reported-and-tested-by: Art -kwaak- van Breemen <ard@telegraafnet.nl> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org
2011-12-23md/raid1: Mark device want_replacement when we see a write error.NeilBrown
Now that WantReplacement drives are replaced cleanly, mark a drive as want_replacement when we see a write error. It might get failed soon so the WantReplacement flag is irrelevant, but if the write error is recorded in the bad block log, we still want to activate any spare that might be available. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md/raid1: If there is a spare and a want_replacement device, start replacement.NeilBrown
When attempting to add a spare to a RAID1 array, also consider adding it as a replacement for a want_replacement device. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md/raid1: recognise replacements when assembling arrays.NeilBrown
If a Replacement is seen, file it as such. If we see two replacements (or two normal devices) for the one slot, abort. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md/raid1: handle activation of replacement device when recovery completes.NeilBrown
When recovery completes ->spare_active is called. This checks if the replacement is ready and if so it fails the original. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md/raid1: Allow a failed replacement device to be removed.NeilBrown
Replacement devices are stored at a different offset, so look there too. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md/raid1: Allocate spare to store replacement devices and their bios.NeilBrown
In RAID1, a replacement is much like a normal device, so we just double the size of the relevant arrays and look at all possible devices for reads and writes. This means that the array looks like it is now double the size in some way - we need to be careful about that. In particular, we checking if the array is still degraded while creating a recovery request we need to only consider the first 'half' - i.e. the real (non-replacement) devices. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.NeilBrown
In general mddev->raid_disks can change unexpectedly while conf->raid_disks will only change in a very controlled way. So change some uses of one to the other. The use of mddev->raid_disks will not cause actually problems but this way is more consistent and safer in the long term. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md: change hot_remove_disk to take an rdev rather than a number.NeilBrown
Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-04Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
2011-10-31md: Add module.h to all files using it implicitlyPaul Gortmaker
A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-26md: Fix some bugs in recovery_disabled handling.NeilBrown
In 3.0 we changed the way recovery_disabled was handle so that instead of testing against zero, we test an mddev-> value against a conf-> value. Two problems: 1/ one place in raid1 was missed and still sets to '1'. 2/ We didn't explicitly set the conf-> value at array creation time. It defaulted to '0' just like the mddev value does so they could appear equal and thus disable recovery. This did not affect normal 'md' as it calls bind_rdev_to_array which changes the mddev value. However the dmraid interface doesn't call this and so doesn't change ->recovery_disabled; so at array start all recovery is incorrectly disabled. So initialise the 'conf' value to one less that the mddev value, so the will only be the same when explicitly set that way. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-24block: Remove the control of complete cpu from bio.Tao Ma
bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19Merge branch 'v3.1-rc10' into for-3.2/coreJens Axboe
Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-11md: add proper write-congestion reporting to RAID1 and RAID10.NeilBrown
RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent-bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all. However writeback request devices to appear to be congested after a while so it can make some guesstimate of throughput. The infinite queue defeats that (note that RAID5 has already has a finite queue so it doesn't suffer from this problem). So impose a limit on the number of pending write requests. By default it is 1024 which seems to be generally suitable. Make it configurable via module option just in case someone finds a regression. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11md: rename "mdk_personality" to "md_personality"NeilBrown
"mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11md/raid1: typedef removal: conf_t -> struct r1confNeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11md: remove typedefs: mirror_info_t -> struct mirror_infoNeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bioNeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11md: remove typedefs: mddev_t -> struct mddevNeilBrown
Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11md: removing typedefs: mdk_rdev_t -> struct md_rdevNeilBrown
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07md: remove PRINTK and dprintk debugging and use pr_debugNeilBrown
Being able to dynamically enable these make them much more useful. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07md/raid1/ avoid bio search in end_sync_read()NeilBrown
We know which device we just read from so we don't need to search the bios to find out. Just use ->read_disk. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07md/raid1: factor out common bio handling codeNamhyung Kim
When normal-write and sync-read/write bio completes, we should find out the disk number the bio belongs to. Factor those common code out to a separate function. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21md: Avoid waking up a thread after it has been freed.NeilBrown
Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappeared either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org
2011-09-12block: remove support for bio remapping from ->make_requestChristoph Hellwig
There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-10md/raid1,10: Remove use-after-free bug in make_request.NeilBrown
A single request to RAID1 or RAID10 might result in multiple requests if there are known bad blocks that need to be avoided. To detect if we need to submit another write request we test: if (sectors_handled < (bio->bi_size >> 9)) { However this is after we call **_write_done() so the 'bio' no longer belongs to us - the writes could have completed and the bio freed. So move the **_write_done call until after the test against bio->bi_size. This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862 Reported-by: Bruno Wolff III <bruno@wolff.to> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid1: factor several functions out or raid1d()NeilBrown
raid1d is too big with several deep branches. So separate them out into their own functions. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md/raid1: improve handling of read failure during recovery.NeilBrown
If we cannot read a block from anywhere during recovery, there is now a better approach than just giving up. We can record a bad block on each device and keep going - being careful not to clear the bad block when a write succeeds as it might - it will be a write of incorrect data. We have now reached the state where - for raid1 - we only call md_error if md_set_badblocks has failed. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md/raid1: record badblocks found during resync etc.NeilBrown
If we find a bad block while writing as part of resync/recovery we need to report that back to raid1d which must record the bad block, or fail the device. Similarly when fixing a read error, a further error should just record a bad block if possible rather than failing the device. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md/raid1: Handle write errors by updating badblock log.NeilBrown
When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well be many blocks, we trying writing each block individually and only log the ones which fail. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md/raid1: store behind-write pages in bi_vecs.NeilBrown
When performing write-behind we allocate pages to store the data during write. Previously we just keep a list of pages. Now we keep a list of bi_vec which includes offset and size. This means that the r1bio has complete information to create a new bio which will be needed for retrying after write errors. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md/raid1: clear bad-block record when write succeeds.NeilBrown
If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md/raid1: avoid writing to known-bad blocks on known-bad drives.NeilBrown
If we have seen any write error on a drive, then don't write to any known-bad blocks on that drive. If necessary, we divide the write request up into pieces just like we do for reads, so each piece is either all written or all not written to any given drive. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28md: make it easier to wait for bad blocks to be acknowledged.NeilBrown
It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested it. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" was also clear "BlockedBadBlocks" incase it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty *or* it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid1: avoid reading known bad blocks during resyncNeilBrown
When performing resync/etc, keep the size of the request small enough that it doesn't overlap any known bad blocks. Devices with badblocks at the start of the request are completely excluded. If there is nowhere to read from due to bad blocks, record a bad block on each target device. Now that we never read from known-bad-blocks we can allow devices with known-bad-blocks into a RAID1. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid1: avoid reading from known bad blocks.NeilBrown
Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md: don't allow arrays to contain devices with bad blocks.NeilBrown
As no personality understand bad block lists yet, we must reject any device that is known to contain bad blocks. As the personalities get taught, these tests can be removed. This only applies to raid1/raid5/raid10. For linear/raid0/multipath/faulty the whole concept of bad blocks doesn't mean anything so there is no point adding the checks. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safeJonathan Brassow
If device-mapper creates a RAID1 array that includes devices to be rebuilt, it will deref a NULL pointer when finished because sysfs is not used by device-mapper instantiated RAID devices. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md/raid1: move rdev->corrected_errors countingNamhyung Kim
Read errors are considered to corrected if write-back and re-read cycle is finished without further problems. Thus moving the rdev-> corrected_errors counting after the re-reading looks more reasonable IMHO. Also included a couple of whitespace fixes on sync_page_io(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md: change managed of recovery_disabled.NeilBrown
If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk this a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality want to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md: introduce link/unlink_rdev() helpersNamhyung Kim
There are places where sysfs links to rdev are handled in a same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md/raid: use printk_ratelimited instead of printk_ratelimitChristian Dietrich
As per printk_ratelimit comment, it should not be used. Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08MD: raid1 changes to allow use by device mapperJonathan Brassow
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c) Added the necessary congestion function and conditionalize calls requiring an array 'queue' or 'gendisk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11md: allow resync_start to be set while an array is active.NeilBrown
The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1,RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: write 'frozen' to 'sync_action' write new size to 'component_size' (this will set resync_start) write 'none' to 'resync_start' write 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value could be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11md/raid1: improve handling of pages allocated for write-behind.NeilBrown
The current handling and freeing of these pages is a bit fragile. We only keep the list of allocated pages in each bio, so we need to still have a valid bio when freeing the pages, which is a bit clumsy. So simply store the allocated page list in the r1_bio so it can easily be found and freed when we are finished with the r1_bio. Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11md/raid1: try fix_sync_read_error before process_checks.NeilBrown
If we get a read error during resync/recovery we current repeat with single-page reads to find out just where the error is, and possibly read each page from a different device. With check/repair we don't currently do that, we just fail. However it is possible that while all devices fail on the large 64K read, we might be able to satisfy each 4K from one device or another. So call fix_sync_read_error before process_checks to maximise the chance of finding good data and writing it out to the devices with read errors. For this to work, we need to set the 'uptodate' flags properly after fix_sync_read_error has succeeded. Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11md/raid1: tidy up new functions: process_checks and fix_sync_read_error.NeilBrown
These changes are mostly cosmetic: 1/ change mddev->raid_disks to conf->raid_disks because the later is technically safer, though in current practice it doesn't matter in this particular context. 2/ Rearrange two for / if loops to have an early 'continue' so the body of the 'if' doesn't need to be indented so much. Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11md/raid1: split out two sub-functions from sync_request_writeNeilBrown
sync_request_write is too big and too deep. So split out two self-contains bits of functionality into separate function. Signed-off-by: NeilBrown <neilb@suse.de>