Age | Commit message (Collapse) | Author |
|
If a RAID10 has an odd number of chunks - as might happen when there
are an odd number of devices - the last chunk has no pair and so is
not mirrored. We don't store data there, but when recovering the last
device in an array we retry to recover that last chunk from a
non-existent location. This results in an error, and the recovery
aborts.
When we get to that last chunk we should just stop - there is nothing
more to do anyway.
This bug has been present since the introduction of RAID10, so the
patch is appropriate for any -stable kernel.
Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Tested-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Pull two md fixes from NeilBrown:
"One sparse-warning fix, one bugfix for 3.4-stable"
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md: raid1/raid10: fix problem with merge_bvec_fn
lib/raid6: fix sparse warnings in recovery functions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper updates from Alasdair G Kergon:
"Improve multipath's retrying mechanism in some defined circumstances
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use."
* tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm thin: provide userspace access to pool metadata
dm thin: use slab mempools
dm mpath: allow ioctls to trigger pg init
dm mpath: delay retry of bypassed pg
dm mpath: reduce size of struct multipath
|
|
This patch implements two new messages that can be sent to the thin
pool target allowing it to take a snapshot of the _metadata_. This,
read-only snapshot can be accessed by userland, concurrently with the
live target.
Only one metadata snapshot can be held at a time. The pool's status
line will give the block location for the current msnap.
Since version 0.1.5 of the userland thin provisioning tools, the
thin_dump program displays the msnap as follows:
thin_dump -m <msnap root> <metadata dev>
Available here: https://github.com/jthornber/thin-provisioning-tools
Now that userland can access the metadata we can do various things
that have traditionally been kernel side tasks:
i) Incremental backups.
By using metadata snapshots we can work out what blocks have
changed over time. Combined with data snapshots we can ensure
the data doesn't change while we back it up.
A short proof of concept script can be found here:
https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
ii) Migration of thin devices from one pool to another.
iii) Merging snapshots back into an external origin.
iv) Asyncronous replication.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Use dedicated caches prefixed with a "dm_" name rather than relying on
kmalloc mempools backed by generic slab caches so the memory usage of
thin provisioning (and any leaks) can be accounted for independently.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
After the failure of a group of paths, any alternative paths that
need initialising do not become available until further I/O is sent to
the device. Until this has happened, ioctls return -EAGAIN.
With this patch, new paths are made available in response to an ioctl
too. The processing of the ioctl gets delayed until this has happened.
Instead of returning an error, we submit a work item to kmultipathd
(that will potentially activate the new path) and retry in ten
milliseconds.
Note that the patch doesn't retry an ioctl if the ioctl itself fails due
to a path failure. Such retries should be handled intelligently by the
code that generated the ioctl in the first place, noting that some SCSI
commands should not be retried because they are not idempotent (XOR write
commands). For commands that could be retried, there is a danger that
if the device rejected the SCSI command, the path could be errorneously
marked as failed, and the request would be retried on another path which
might fail too. It can be determined if the failure happens on the
device or on the SCSI controller, but there is no guarantee that all
SCSI drivers set these flags correctly.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
If I/O needs retrying and only bypassed priority groups are available,
set the pg_init_delay_retry flag to wait before retrying.
If, for example, the reason for the bypass is that the controller is
getting reset or there is a firmware upgrade happening, retrying right
away would cause a flood of log messages and retries for what could be a
few seconds or even several minutes.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Move multipath structure's 'lock' and 'queue_size' members to eliminate
two 4-byte holes. Also use a bit within a single unsigned int for each
existing flag (saves 8-bytes). This allows future flags to be added
without each consuming an unsigned int.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.
However were were only setting that when a device was hot-added,
not when a device was present from the start.
This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels. However that are conflicts in raid10.c so a separate
patch will be needed for 3.4.y.
Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Pull md updates from NeilBrown:
"It's been a busy cycle for md - lots of fun stuff here.. if you like
this kind of thing :-)
Main features:
- RAID10 arrays can be reshaped - adding and removing devices and
changing chunks (not 'far' array though)
- allow RAID5 arrays to be reshaped with a backup file (not tested
yet, but the priciple works fine for RAID10).
- arrays can be reshaped while a bitmap is present - you no longer
need to remove it first
- SSSE3 support for RAID6 syndrome calculations
and of course a number of minor fixes etc."
* tag 'md-3.5' of git://neil.brown.name/md: (56 commits)
md/bitmap: record the space available for the bitmap in the superblock.
md/raid10: Remove extras after reshape to smaller number of devices.
md/raid5: improve removal of extra devices after reshape.
md: check the return of mddev_find()
MD RAID1: Further conditionalize 'fullsync'
DM RAID: Use md_error() in place of simply setting Faulty bit
DM RAID: Record and handle missing devices
DM RAID: Set recovery flags on resume
md/raid5: Allow reshape while a bitmap is present.
md/raid10: resize bitmap when required during reshape.
md: allow array to be resized while bitmap is present.
md/bitmap: make sure reshape request are reflected in superblock.
md/bitmap: add bitmap_resize function to allow bitmap resizing.
md/bitmap: use DIV_ROUND_UP instead of open-code
md/bitmap: create a 'struct bitmap_counts' substructure of 'struct bitmap'
md/bitmap: make bitmap bitops atomic.
md/bitmap: make _page_attr bitops atomic.
md/bitmap: merge bitmap_file_unmap and bitmap_file_put.
md/bitmap: remove async freeing of bitmap file.
md/bitmap: convert some spin_lock_irqsave to spin_lock_irq
...
|
|
Now that bitmaps can grow and shrink it is best if we record
how much space is available. This means that when
we reduce the size of the bitmap we won't "lose" the space
for late when we might want to increase the size of the bitmap
again.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
When a reshape which reduced the number of devices finishes
we must remove the extra devices.
So ensure that raid10_remove_disk won't try to keep them, and
have raid10_finish_reshape clear the 'in_sync' flag. Then
remove_and_add_spares will be able to remove them.
Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
After a reshape which reduced the number of devices we need
to disconnect the extra devices.
The code for this doesn't currently handle 'replacement' devices.
It is very unlikely that such devices will be present, but it is
safest to handle them anyway.
So simplify the handling. Just clear In_sync and leave it
to remove_and_add_spaces (which will be called soon) to do
the real works.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Check the return of mddev_find(), since it may fail due to out of
memeory or out of usable minor number.
The reason I chose -ENODEV instead of -ENOMEM or something else is
md_alloc() function chose that ;)
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
A RAID1 device does not necessarily need a fullsync if the bitmap can be used instead.
Similar to commit d6b212f4b19da5301e6b6eca562e5c7a2a6e8c8d in raid5.c, if a raid1
device can be brought back (i.e. from a transient failure) it shouldn't need a
complete resync. Provided the bitmap is not to old, it will have recorded the areas
of the disk that need recovery.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
When encountering an error while reading the superblock, call md_error.
We are currently setting the 'Faulty' bit on one of the array devices when an
error is encountered while reading the superblock of a dm-raid array. We should
be calling md_error(), as it handles the error more completely.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Missing dm-raid devices should be recorded in the superblock
When specifying the devices that compose a DM RAID array, it is possible to denote
failed or missing devices with '-'s. When this occurs, we must record this in the
superblock. We do this by checking if the array position's data device is missing
and then forcing MD to record the superblock by setting 'MD_CHANGE_DEVS' in
'raid_resume'. If we do not cause the superblock to be rewritten by the resume
function, it is possible for a stale superblock to be written by an out-going
in-active table (during 'raid_dtr').
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Properly initialize MD recovery flags when resuming device-mapper devices.
When a device-mapper device is suspended, all I/O must stop. This is done by
calling 'md_stop_writes' and 'mddev_suspend'. These calls in-turn manipulate
the recovery flags - including setting 'MD_RECOVERY_FROZEN'. The DM device
may have been suspended while recovery was not yet complete, so the process
needs to pick-up where it left off. Since 'mddev_resume' does not unset
'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
(e.g. It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
setting 'MD_RECOVERY_FROZEN'. Clearing it in 'mddev_resume' would override the
desired behavior.)
Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.
Also clean up where level_store calls mddev_resume() - it current
duplicates some of the funcitons of that call. - NB
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
We always should have allowed this. A raid5 reshape doesn't change
the size of the bitmap, so not need to restrict it.
Also add a test to make sure we don't try to start a reshape on a
failed array.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
If a reshape changes the size of the array, then we can now
update the bitmap to suit - so do so.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.
This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
As a reshape may change the sync_size and/or chunk_size, we need
to update these whenever we write out the bitmap superblock.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This function will allocate the new data structures and copy
bits across from old to new, allowing for the possibility that the
chunksize has changed.
Use the same function for performing the initial allocation
of the structures. This improves test coverage.
When bitmap_resize is used to resize an existing bitmap, it
only copies '1' bits in, not '0' bits.
So when allocating the bitmap, ensure everything is initialised
to ZERO.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Also take the opportunity to simplify CHUNK_BLOCK_RATIO.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
The new "struct bitmap_counts" contains all the fields that are
related to counting the number of active writes in each bitmap chunk.
Having this separate will make it easier to change the chunksize
or overall size of a bitmap atomically.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This allows us to remove spinlock protection which is
more heavy-weight than simple atomics.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Using e.g. set_bit instead of __set_bit and using test_and_clear_bit
allow us to remove some locking and contract other locked ranges.
It is rare that we set or clear a lot of these bits, so gain should
outweigh any cost.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
There functions really do one thing together: release the
'bitmap_storage'. So make them just one function.
Since we removed the locking (previous patch), we don't need to zero
any fields before freeing them, so it all becomes a bit simpler.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
There is no real value in freeing things the moment there is an error.
It is just as good to free the bitmap file and pages when the bitmap
is explicitly removed (and replaced?) or at shutdown.
With this gone, the bitmap will only disappear when the array is
quiescent, so we can remove some locking.
As the 'filemap' doesn't disappear now, include extra checks before
trying to write any of it out.
Also remove the check for "has it disappeared" in
bitmap_daemon_write().
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
All of these sites can only be called from process context with
irqs enabled, so using irqsave/irqrestore just adds noise.
Remove it.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
We currently use '&' and '|' which isn't the norm in the kernel
and doesn't allow easy atomicity.
So change to bit numbers and {set,clear,test}_bit.
This allows us to remove a spinlock/unlock (which was dubious anyway)
and some other simplifications.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Just do single-bit manipulations on bitmap->flags and copy whole
value between that and sb->state.
This will allow next patch which changes how bit manipulations are
performed on bitmap->flags.
This does result in BITMAP_STALE not being set in sb by
bitmap_read_sb, however as the setting is determined by other
information in the 'sb' we do not lose information this way.
Normally, bitmap_load will be called shortly which will clear
BITMAP_STALE anyway.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This function isn't really needed. It sets or clears a flag in both
bitmap->flags and sb->state.
However both times it is called, bitmap_update_sb is called soon
afterwards which copies bitmap->flags to sb->state.
So just make changes to bitmap->flags, and open-code those rather than
hiding in a function.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
We should allocate memory for the storage-bitmap at create-time, not
load time.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This will allow allocation before swapping in a new bitmap.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This number is more generally useful, and bytes-in-last-page is
easily extracted from it.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Most often we have the page number, not the page. And that is what
the *_page_attr() functions really want. So change the arguments to
take that number.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Instead of allocating pages in read_sb_page, read_page and
bitmap_read_sb, allocate them all in bitmap_init_from disk.
Also replace the hack of calling "attach_page_buffers(page, NULL)" to
ensure that free_buffer() won't complain, by putting a test for
PagePrivate in free_buffer().
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
An md bitmap comprises two parts
- internal counting of active writes per 'chunk'.
- external storage of whether there are any active writes on
each chunk
The second requires the first, but the first doesn't require the
second.
Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.
So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.
A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.
We don't actually activate these bitmaps yet - that will come
in a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.
So create an attribute to store this information and set appropriate
defaults.
It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
There are two different 'pending' concepts in the handling of the
write intent bitmap.
Firstly, a 'page' from the bitmap (which container PAGE_SIZE*8 bits)
may have changes (bits cleared) that should be written in due course.
There is no hurry for these and the page will transition from
PENDING to NEEDWRITE and will then be written, though if it ever
becomes DIRTY it will be written much sooner and PENDING will be
cleared.
Secondly, a page of counters - which contains PAGE_SIZE/2 counters, one
for each bit, can usefully have a 'pending' flag which indicates if
any of the counters are low (2 or 1) and ready to be processed by
bitmap_daemon_work(). If this flag is clear we can skip the whole
page.
These two concepts are currently combined in the bitmap-file flag.
This causes a tighter connection between the counters and the bitmap
file than I would like - as I want to add some flexibility to the
bitmap file.
So introduce a new flag with the page-of-counters, and rewrite
bitmap_daemon_work() so that it handles the two different 'pending'
concepts separately.
This also allows us to clear BITMAP_PAGE_PENDING when we write out
a dirty page, which may occasionally reduce the number of times we
write a page.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
REQ_SYNC is ignored in current raid5 code. Block layer does use it to do
policy,
for example ioscheduler. This patch adds it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
The two variables are useless.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
If the allocation of rep1_bio fails, we currently don't free the 'bio'
of the same dev.
Reported by kmemleak.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
When attempting to fix a read error, it is acceptable to read from a
device that is recovering, provided the recovery has got past the
place we are reading from. This makes the test for "can we read from
here" the same as the test in read_balance.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
This ensures that it is always freed - there were case where
we failed to free the page.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
dm-raid currently open-codes the freeing of some members of
and rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.
So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
A 'near' or 'offset' lay RAID10 array can be reshaped to a different
'near' or 'offset' layout, a different chunk size, and a different
number of devices.
However the number of copies cannot change.
Unlike RAID5/6, we do not support having user-space backup data that
is being relocated during a 'critical section'. Rather, the
data_offset of each device must change so that when writing any block
to a new location, it will not over-write any data that is still
'live'.
This means that RAID10 reshape is not supportable on v0.90 metadata.
The different between the old data_offset and the new_offset must be
at least the larger of the chunksize multiplied by offset copies of
each of the old and new layout. (for 'near' mode, offset_copies == 1).
A larger difference of around 64M seems useful for in-place reshapes
as more data can be moved between metadata updates.
Very large differences (e.g. 512M) seem to slow the process down due
to lots of long seeks (on oldish consumer graded devices at least).
Metadata needs to be updated whenever the place we are about to write
to is considered - by the current metadata - to still contain data in
the old layout.
[unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
We will soon be interpreting the layout (and chunksize etc) from
multiple places to support reshape. So split it out into separate
function.
Signed-off-by: NeilBrown <neilb@suse.de>
|