Age | Commit message (Collapse) | Author |
|
Use __GFP_HIGHMEM in __vmalloc.
Pages allocated with __vmalloc can be allocated in high memory that is not
directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc
does. This patch reduces memory pressure slightly because pages can be
allocated in the high zone.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Fix a boundary condition that caused failure for certain device sizes.
The problem is reported at
http://code.google.com/p/cryptsetup/issues/detail?id=160
For certain device sizes the number of hashes at a specific level was
calculated incorrectly.
It happens for example for a device with data and metadata block size 4096
that has 16385 blocks and algorithm sha256.
The user can test if he is affected by this bug by running the
"veritysetup verify" command and also by activating the dm-verity kernel
driver and reading the whole block device. If it passes without an error,
then the user is not affected.
The condition for the bug is:
Split the total number of data blocks (data_block_bits) into bit strings,
each string has hash_per_block_bits bits. hash_per_block_bits is
rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
can say that you convert data_blocks_bits to 2^hash_per_block_bits base.
If there some zero bit string below the most significant bit string and at
least one bit below this zero bit string is set, then the bug happens.
The same bug exists in the userspace veritysetup tool, so you must use
fixed veritysetup too if you want to use devices that are affected by
this boundary condition.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.4+
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Set noio flag while calling __vmalloc() because it doesn't fully respect
gfp flags to avoid a possible deadlock (see commit
502624bdad3dba45dfaacaf36b7d83e39e74b2d2).
This should be backported to stable kernels 3.8 and newer. The kernel 3.8
doesn't have memalloc_noio_save(), so we should set and restore process
flag PF_MEMALLOC instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
When multipath needs to retry an ioctl the reference to the
current live table needs to be dropped. Otherwise a deadlock
occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table
and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process
hangs in dm_table_destroy() waiting for references to
drop to zero.
With this patch the reference to the old table is dropped
prior to retry, thereby avoiding the deadlock.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
|
|
The recent comment:
commit 7e83ccbecd608b971f340e951c9e84cd0343002f
md/raid10: Allow skipping recovery when clean arrays are assembled
Causes raid10 to skip a recovery in certain cases where it is safe to
do so. Unfortunately it also causes a reshape to be skipped which is
never safe. The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).
Bug was introduced in 3.10, so this is suitable for 3.10-stable.
Cc: stable@vger.kernel.org (3.10)
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
There is a bug in 'check_reshape' for raid5.c To checks
that the new minimum number of devices is large enough (which is
good), but it does so also after the reshape has started (bad).
This is bad because
- the calculation is now wrong as mddev->raid_disks has changed
already, and
- it is pointless because it is now too late to stop.
So only perform that test when reshape has not been committed to.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
1/ If a RAID10 is being reshaped to a fewer number of devices
and is stopped while this is ongoing, then when the array is
reassembled the 'mirrors' array will be allocated too small.
This will lead to an access error or memory corruption.
2/ A sanity test for a reshaping RAID10 array is restarted
is slightly incorrect.
Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.
Cc: stable@vger.kernel.org (v3.5+)
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Some of bcache's utility code has made it into the rest of the kernel,
so drop the bcache versions.
Bcache used to have a workaround for allocating from a bio set under
generic_make_request() (if you allocated more than once, the bios you
already allocated would get stuck on current->bio_list when you
submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
when allocating bios under generic_make_request() so that allocation
could fail and it could retry from workqueue. But bio_alloc_bioset() has
a workaround now, so we can drop this hack and the associated error
handling.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
This code has rotted and it hasn't been used in ages anyways.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
|
|
Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
|
|
Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:
* If a stripe is already dirty, force writes to that stripe to
writeback mode - to help build up full stripes of dirty data
* When flushing dirty data, preferentially write out full stripes first
if there are any.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
To make background writeback aware of raid5/6 stripes, we first need to
track the amount of dirty data within each stripe - we do this by
breaking up the existing sectors_dirty into per stripe atomic_ts
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Previously, dirty_data wouldn't get initialized until the first garbage
collection... which was a bit of a problem for background writeback (as
the PD controller keys off of it) and also confusing for users.
This is also prep work for making background writeback aware of raid5/6
stripes.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
The old lazy sorting code was kind of hacky - rewrite in a way that
mathematically makes more sense; the idea is that the size of the sets
of keys in a btree node should increase by a more or less fixed ratio
from smallest to biggest.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Old gcc doesnt like the struct hack, and it is kind of ugly. So finish
off the work to convert pr_debug() statements to tracepoints, and delete
pkey()/pbtree().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
The tracepoints were reworked to be more sensible, and fixed a null
pointer deref in one of the tracepoints.
Converted some of the pr_debug()s to tracepoints - this is partly a
performance optimization; it used to be that with DEBUG or
CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro; but at some point it
was changed to an empty inline function.
Some of the pr_debug() statements had rather expensive function calls as
part of the arguments, so this code was getting run unnecessarily even
on non debug kernels - in some fast paths, too.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
The most significant change is that btree reads are now done
synchronously, instead of asynchronously and doing the post read stuff
from a workqueue.
This was originally done because we can't block on IO under
generic_make_request(). But - we already have a mechanism to punt cache
lookups to workqueue if needed, so if we just use that we don't have to
deal with the complexity of doing things asynchronously.
The main benefit is this makes the locking situation saner; we can hold
our write lock on the btree node until we're finished reading it, and we
don't need that btree_node_read_done() flag anymore.
Also, for writes, btree_write() was broken out into btree_node_write()
and btree_leaf_dirty() - the old code with the boolean argument was dumb
and confusing.
The prio_blocked mechanism was improved a bit too, now the only counter
is in struct btree_write, we don't mess with transfering a count from
struct btree anymore.
This required changing garbage collection to block prios at the start
and unblock when it finishes, which is cleaner than what it was doing
anyways (the old code had mostly the same effect, but was doing it in a
convoluted way)
And the btree iter btree_node_read_done() uses was converted to a real
mempool.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Using a workqueue when we just want a single thread is a bit silly.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
An old version of gcc was complaining about using a const int as the
size of a stack allocated array. Which should be fine - but using
ARRAY_SIZE() is better, anyways.
Also, refactor the code to use scnprintf().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
bio_alloc_bioset returns NULL on failure. This fix adds a missing check
for potential NULL pointer dereferencing.
Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
MD: Remember the last sync operation that was performed
This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired. etc.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
RAID5 uses a 'per-array' value for the 'size' of each device.
RAID0 uses a 'per-device' value - it can be different for each device.
When converting a RAID5 to a RAID0 we must ensure that the per-device
size of each device matches the per-array size for the RAID5, else
the array will change size.
If the metadata cannot record a changed per-device size (as is the
case with v0.90 metadata) the array could get bigger on restart. This
does not cause data corruption, so it not a big issue and is mainly
yet another a reason to not use 0.90.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
The word 'arithmetic' was typed as 'arithmatic'
Signed-off-by: Phil Viana <phillip.l.viana@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
It isn't really enough to check that the rdev is present, we need to
also be sure that the device is still In_sync.
Doing this requires using rcu_dereference to access the rdev, and
holding the rcu_read_lock() to ensure the rdev doesn't disappear while
we look at it.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
As 'enough' accesses conf->prev and conf->geo, which can change
spontanously, it should guard against changes.
This can be done with device_lock as start_reshape holds device_lock
while updating 'geo' and end_reshape holds it while updating 'prev'.
So 'error' needs to hold 'device_lock'.
On the other hand, raid10_end_read_request knows which of the two it
really wants to access, and as it is an active request on that one,
the value cannot change underneath it.
So change _enough to take flag rather than a pointer, pass the
appropriate flag from raid10_end_read_request(), and remove the locking.
All other calls to 'enough' are made with reconfig_mutex held, so
neither 'prev' nor 'geo' can change.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.
So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run. This increases the chance that the ioctl will succeed.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Neil Brown <nfbrown@suse.de>
|
|
This doesn't really need to be initialised, but it doesn't hurt,
silences the compiler, and as it is a counter it makes sense for it to
start at zero.
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
DM RAID: Fix raid_resume not reviving failed devices in all cases
When a device fails in a RAID array, it is marked as Faulty. Later,
md_check_recovery is called which (through the call chain) calls
'hot_remove_disk' in order to have the personalities remove the device
from use in the array.
Sometimes, it is possible for the array to be suspended before the
personalities get their chance to perform 'hot_remove_disk'. This is
normally not an issue. If the array is deactivated, then the failed
device will be noticed when the array is reinstantiated. If the
array is resumed and the disk is still missing, md_check_recovery will
be called upon resume and 'hot_remove_disk' will be called at that
time. However, (for dm-raid) if the device has been restored,
a resume on the array would cause it to attempt to revive the device
by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called,
a situation is then created where the device is thought to concurrently
be the replacement and the device to be replaced. Thus, the device
is first sync'ed with the rest of the array (because it is the replacement
device) and then marked Faulty and removed from the array (because
it is also the device being replaced).
The solution is to check and see if the device had properly been removed
before the array was suspended. This is done by seeing whether the
device's 'raid_disk' field is -1 - a condition that implies that
'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
-> hot_remove_disk' has been called. If 'raid_disk' is not -1, then
'hot_remove_disk' must be called to complete the removal of the previously
faulty device before it can be revived via 'hot_add_disk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
DM RAID: Break-up untidy function
Clean-up excessive indentation by moving some code in raid_resume()
into its own function.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
DM RAID: Add ability to restore transiently failed devices on resume
This patch adds code to the resume function to check over the devices
in the RAID array. If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array. This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
|
|
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.
Using raise_barrier will sometimes deadlock. Using freeze_array
should not.
As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.
The deadlock was made particularly noticeable by commits
050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.
This patch probably won't apply directly to some early kernels and
will need to be applied by hand.
Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
non-rebuilding drive completed it.
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.
However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems or dm-raid.
So change __md_stop_writes() to always freeze recovery. This is safe
and more predicatable.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Pull block layer fixes from Jens Axboe:
"Outside of bcache (which really isn't super big), these are all
few-liners. There are a few important fixes in here:
- Fix blk pm sleeping when holding the queue lock
- A small collection of bcache fixes that have been done and tested
since bcache was included in this merge window.
- A fix for a raid5 regression introduced with the bio changes.
- Two important fixes for mtip32xx, fixing an oops and potential data
corruption (or hang) due to wrong bio iteration on stacked devices."
* 'for-linus' of git://git.kernel.dk/linux-block:
scatterlist: sg_set_buf() argument must be in linear mapping
raid5: Initialize bi_vcnt
pktcdvd: silence static checker warning
block: remove refs to XD disks from documentation
blkpm: avoid sleep when holding queue lock
mtip32xx: Correctly handle bio->bi_idx != 0 conditions
mtip32xx: Fix NULL pointer dereference during module unload
bcache: Fix error handling in init code
bcache: clarify free/available/unused space
bcache: drop "select CLOSURES"
bcache: Fix incompatible pointer type warning
|
|
The patch that converted raid5 to use bio_reset() forgot to initialize
bi_vcnt.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Tested-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fix detection of the need to resize the dm thin metadata device.
The code incorrectly tried to extend the metadata device when it
didn't need to due to a merging error with patch 24347e9 ("dm thin:
detect metadata device resizing").
device-mapper: transaction manager: couldn't open metadata space map
device-mapper: thin metadata: tm_open_with_sm failed
device-mapper: thin: aborting transaction failed
device-mapper: thin: switching pool to failure mode
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
git://evilpiepirate.org/~kent/linux-bcache into for-linus
Kent writes:
Jens - couple more bcache patches. Bug fixes and a doc update.
|
|
This code appears to have rotted... fix various bugs and do some
refactoring.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig
symbol CLOSURES. That symbol was used in development versions of bcache,
but was removed when the closures code was no longer provided as a
kernel library. It can safely be dropped.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
|
|
The function pointer release in struct block_device_operations
should point to functions declared as void.
Sparse warnings:
drivers/md/bcache/super.c:656:27: warning:
incorrect type in initializer (different base types)
drivers/md/bcache/super.c:656:27:
expected void ( *release )( ... )
drivers/md/bcache/super.c:656:27:
got int ( static [toplevel] *<noident> )( ... )
drivers/md/bcache/super.c:656:2: warning:
initialization from incompatible pointer type [enabled by default]
drivers/md/bcache/super.c:656:2: warning:
(near initialization for ‘bcache_ops.release’) [enabled by default]
Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
|
|
Share configuration option processing code between the dm cache
ctr and message functions.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Move process_config_option() in dm-cache-target.c to make the
next patch more readable.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|