[go: nahoru, domu]

History log of /drivers/md/dm-table.c
Revision Date Author Comments
86f1152b117a404229fd6f08ec3faca779f37b92 13-Aug-2014 Benjamin Marzinski <bmarzins@redhat.com> dm: allow active and inactive tables to share dm_devs

Until this change, when loading a new DM table, DM core would re-open
all of the devices in the DM table. Now, DM core will avoid redundant
device opens (and closes when destroying the old table) if the old
table already has a device open using the same mode. This is achieved
by managing reference counts on the table_devices that DM core now
stores in the mapped_device structure (rather than in the dm_table
structure). So a mapped_device's active and inactive dm_tables' dm_dev
lists now just point to the dm_devs stored in the mapped_device's
table_devices list.

This improvement in DM core's device reference counting has the
side-effect of fixing a long-standing limitation of the multipath
target: a DM multipath table couldn't include any paths that were unusable
(failed). For example: if all paths have failed and you add a new,
working, path to the table; you can't use it since the table load would
fail due to it still containing failed paths. Now a re-load of a
multipath table can include failed devices and when those devices become
active again they can be used instantly.

The device list code in dm.c isn't a straight copy/paste from the code in
dm-table.c, but it's very close (aside from some variable renames). One
subtle difference is that find_table_device for the tables_devices list
will only match devices with the same name and mode. This is because we
don't want to upgrade a device's mode in the active table when an
inactive table is loaded.

Access to the mapped_device structure's tables_devices list requires a
mutex (tables_devices_lock), so that tables cannot be created and
destroyed concurrently.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
200612ec33e555a356eebc717630b866ae2b694f 08-Aug-2014 Jeff Moyer <jmoyer@redhat.com> dm table: propagate QUEUE_FLAG_NO_SG_MERGE

Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by
default in blk_mq_init_queue for mq-enabled devices. The effect of
the flag is to bypass the SG segment merging. Instead, the
bio->bi_vcnt is used as the number of hardware segments.

With a device mapper target on top of a device with
QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
than a driver is prepared to handle. I ran into this when backporting
the virtio_blk mq support. It triggerred this BUG_ON, in
virtio_queue_rq:

BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

The queue's max is set here:
blk_queue_max_segments(q, vblk->sg_elems-2);

Basically, what happens is that a bio is built up for the dm device
(which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
bio_add_page. That path will call into __blk_recalc_rq_segments, so
what you end up with is bi_phys_segments being much smaller than bi_vcnt
(and bi_vcnt grows beyond the maximum sg elements). Then, when the bio
is submitted, it gets cloned. When the cloned bio is submitted, it will
end up in blk_recount_segments, here:

if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
bio->bi_phys_segments = bio->bi_vcnt;

and now we've set bio->bi_phys_segments to a number that is beyond what
was registered as queue_max_segments by the driver.

The right way to fix this is to propagate the queue flag up the stack.

The rules for propagating the flag are simple:
- if the flag is set for any underlying device, it must be set for the
upper device
- consequently, if the flag is not set for any underlying device, it
should not be set for the upper device.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.16+
a7ffb6a53391c2690263675f13c79a273301d2b3 10-Jul-2014 Mikulas Patocka <mpatocka@redhat.com> dm table: make dm_table_supports_discards static

The function dm_table_supports_discards is only called from
dm-table.c:dm_table_set_restrictions(). So move it above
dm_table_set_restrictions and make it static.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
11f0431be2f99c574a65c6dfc0ca205511500f29 03-Jun-2014 Mike Snitzer <snitzer@redhat.com> dm: remove symbol export for dm_set_device_limits

There is no need for code other than DM core to use dm_set_device_limits
so remove its EXPORT_SYMBOL_GPL. Also, cleanup a couple whitespace nits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
9974fa2c6a7d470ca3c201fe7dbac64bf4dd8d2a 28-Feb-2014 Mike Snitzer <snitzer@redhat.com> dm table: add dm_table_run_md_queue_async

Introduce dm_table_run_md_queue_async() to run the request_queue of the
mapped_device associated with a request-based DM table.

Also add dm_md_get_queue() wrapper to extract the request_queue from a
mapped_device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
473c36dfeecf4e49db928f3284b2fbe981f8c284 13-Feb-2014 Mikulas Patocka <mpatocka@redhat.com> dm: make dm_table_alloc_md_mempools static

Make the function dm_table_alloc_md_mempools static because it is not
called from another file.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
57a2f238564e0700c8648238d31f366246a5b963 23-Nov-2013 Mikulas Patocka <mpatocka@redhat.com> dm table: remove unused buggy code that extends the targets array

A device mapper table is allocated in the following way:
* The function dm_table_create is called, it gets the number of targets
as an argument -- it allocates a targets array accordingly.
* For each target, we call dm_table_add_target.

If we add more targets than were specified in dm_table_create, the
function dm_table_add_target reallocates the targets array. However,
this reallocation code is wrong - it moves the targets array to a new
location, while some target constructors hold pointers to the array in
the old location.

The following DM target drivers save the pointer to the target
structure, so they corrupt memory if the target array is moved:
multipath, raid, mirror, snapshot, stripe, switch, thin, verity.

Under normal circumstances, the reallocation function is not called
(because dm_table_create is called with the correct number of targets),
so the buggy reallocation code is not used.

Prior to the fix "dm table: fail dm_table_create on dm_round_up
overflow", the reallocation code could only be used in case the user
specifies too large a value in param->target_count, such as 0xffffffff.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
5b2d06576c5410c10d95adfd5c4d8b24de861d87 23-Nov-2013 Mikulas Patocka <mpatocka@redhat.com> dm table: fail dm_table_create on dm_round_up overflow

The dm_round_up function may overflow to zero. In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().

This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
7833b08e18241a1c35c09ef38be840cbf6c58acf 24-Oct-2013 Mike Snitzer <snitzer@redhat.com> dm table: print error on preresume failure

If preresume fails it is worth logging an error given that a device is
left suspended due to the failure.

This change was motivated by local preresume error logging that was
added to the cache target ("preresume failed"). Elevating this
target-agnostic context for the where the target-specific error occurred
relative to the DM core's callouts makes sense.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
f36afb3957353d2529cb2b00f78fdccd14fc5e9c 31-Oct-2013 Mikulas Patocka <mpatocka@redhat.com> dm: allocate buffer for messages with small number of arguments using GFP_NOIO

dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.

On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.

The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
00c4fc3b1f590288cb3c42f36da50f49a513cfcf 28-Aug-2013 Mike Snitzer <snitzer@redhat.com> dm ioctl: increase granularity of type_lock when loading table

Hold the mapped device's type_lock before calling populate_table() since
it is where the table's type is determined based on the specified
targets. There is no need to allow concurrent table loads to race to
establish the table's targets or type.

This eliminates the need to grab the lock in dm_table_set_type().

Also verify that the type_lock is held in both dm_set_md_type() and
dm_get_md_type().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
169e2cc279c443085f7e423561eb1fe6158ade44 23-Aug-2013 Mike Snitzer <snitzer@redhat.com> dm: allow error target to replace bio-based and request-based targets

It may be useful to switch a request-based table to the "error" target.
Enhance the DM core to allow a hybrid target_type which is capable of
handling either bios (via .map) or requests (via .map_rq).

Add a request-based map function (.map_rq) to the "error" target_type;
making it DM's first hybrid target. Train dm_table_set_type() to prefer
the mapped device's established type (request-based or bio-based). If
the mapped device doesn't have an established type default to making the
table with the hybrid target(s) bio-based.

Tested 'dmsetup wipe_table' to work on both bio-based and request-based
devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
83d5e5b0af907d46d241a86d9e44003b3f0accbd 11-Jul-2013 Mikulas Patocka <mpatocka@redhat.com> dm: optimize use SRCU and RCU

This patch removes "io_lock" and "map_lock" in struct mapped_device and
"holders" in struct dm_table and replaces these mechanisms with
sleepable-rcu.

Previously, the code would call "dm_get_live_table" and "dm_table_put" to
get and release table. Now, the code is changed to call "dm_get_live_table"
and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and
dm_put_live_table unlocks it.

dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
dm_get_live_table/dm_put_live_table. These *_fast functions use
non-sleepable RCU, so the caller must not block between them.

If the code changes active or inactive dm table, it must call
dm_sync_table before destroying the old table.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dc019b21fb92d620a3b52ccecc135ac968a7c7ec 10-May-2013 Mike Snitzer <snitzer@redhat.com> dm table: fix write same support

If device_not_write_same_capable() returns true then the iterate_devices
loop in dm_table_supports_write_same() should return false.

Reported-by: Bharata B Rao <bharata.rao@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
55a62eef8d1b50ceff3b7bf46851103bdcc7e5b0 01-Mar-2013 Alasdair G Kergon <agk@redhat.com> dm: rename request variables to bios

Use 'bio' in the name of variables and functions that deal with
bios rather than 'request' to avoid confusion with the normal
block layer use of 'request'.

No functional changes.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
d2ce70a119f844225c10f133f8b957d540027b0f 01-Mar-2013 Wang Sheng-Hui <shhuiw@gmail.com> dm table: remove superfluous variable reset

If allocation fails, the local var *t is not used any more after kfree.
Don't need to reset it to NULL. Remove the unnecesary NULL set here.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
c0820cf5ad09522bdd9ff68e84841a09c9f339d8 21-Dec-2012 Mikulas Patocka <mpatocka@redhat.com> dm: introduce per_bio_data

Introduce a field per_bio_data_size in struct dm_target.

Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.

Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.

If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
d54eaa5a0fde0a202e4e91f200f818edcef15bee 21-Dec-2012 Mike Snitzer <snitzer@redhat.com> dm: prepare to support WRITE SAME

Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.

A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
c1a94672a830e01d58c7c7e8de530c3f136d6ff2 21-Dec-2012 Mike Snitzer <snitzer@redhat.com> dm: disable WRITE SAME

WRITE SAME bios are not yet handled correctly by device-mapper so
disable their use on device-mapper devices by setting
max_write_same_sectors to zero.

As an example, a ciphertext device is incompatible because the data
gets changed according to the location at which it written and so the
dm crypt target cannot support it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
3ae706561637331aa578e52bb89ecbba5edcb7a9 27-Sep-2012 Mike Snitzer <snitzer@redhat.com> dm: retain table limits when swapping to new table with no devices

Add a safety net that will re-use the DM device's existing limits in the
event that DM device has a temporary table that doesn't have any
component devices. This is to reduce the chance that requests not
respecting the hardware limits will reach the device.

DM recalculates queue limits based only on devices which currently exist
in the table. This creates a problem in the event all devices are
temporarily removed such as all paths being lost in multipath. DM will
reset the limits to the maximum permissible, which can then assemble
requests which exceed the limits of the paths when the paths are
restored. The request will fail the blk_rq_check_limits() test when
sent to a path with lower limits, and will be retried without end by
multipath. This became a much bigger issue after v3.6 commit fe86cdcef
("block: do not artificially constrain max_sectors for stacking
drivers").

Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
c3c4555edd10dbc0b388a0125b9c50de5e79af05 27-Sep-2012 Milan Broz <mbroz@redhat.com> dm table: clear add_random unless all devices have it set

Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
have it set. Otherwise devices with predictable characteristics may
contribute entropy.

QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings
contribute to the random pool.

For bio-based targets this flag is always 0 because such devices have no
real queue.

For request-based devices this flag was always set to 1 by default.

Now set it according to the flags on underlying devices. If there is at
least one device which should not contribute, set the flag to zero: If a
device, such as fast SSD storage, is not suitable for supplying entropy,
a request-based queue stacked over it will not be either.

Because the checking logic is exactly same as for the rotational flag,
share the iteration function with device_is_nonrot().

Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
0e9c24ed7443d090e369a2eddfa13f7f0b5afbaf 27-Jul-2012 Joe Thornber <ejt@redhat.com> dm: allow targets to request flushes regardless of underlying device support

Allow targets to override the 'supports flush' calculation.

Set 'flush_supported' if a target needs to receive flushes regardless of
whether or not its underlying devices have support.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
31998ef19385c944600d9a981b96252f98204bee 28-Mar-2012 Mikulas Patocka <mpatocka@redhat.com> dm: reject trailing characters in sccanf input

Device mapper uses sscanf to convert arguments to numbers. The problem is that
the way we use it ignores additional unmatched characters in the scanned string.

For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
but also it will match number with some garbage appended, like "123abc".

As a result, device mapper accepts garbage after some numbers. For example
the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
will pass without an error.

This patch fixes all sscanf uses in device mapper. It appends "%c" with
a pointer to a dummy character variable to every sscanf statement.

The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
only if string is a null-terminated number (optionally preceded by some
whitespace characters). If there is some character appended after the number,
sscanf matches "%c", writes the character to the dummy variable and returns 2.
We check the return value for 1 and consequently reject numbers with some
garbage appended.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
574ce07eb0014069f1da763c219bb30ea4c266ec 28-Mar-2012 Hannes Reinecke <hare@suse.de> dm table: simplify call to free_devices

free_devices in dm_table.c already uses list_for_each(), so we don't
need to check if the list is empty.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
b1bd055d397e09f99dcef9b138ed104ff1812fcb 11-Jan-2012 Martin K. Petersen <martin.petersen@oracle.com> block: Introduce blk_set_stacking_limits function

Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
36a0456fbf2d9680bf9af81b39daf4a8e22cb1b8 31-Oct-2011 Alasdair G Kergon <agk@redhat.com> dm table: add immutable feature

Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
with any other target type, and once loaded into a device, it cannot be
replaced with a table containing a different type.

The thin provisioning pool device will use this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
cc6cbe141a20f6d876b161b60af38d93935bfa85 31-Oct-2011 Alasdair G Kergon <agk@redhat.com> dm table: add always writeable feature

Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
does not support read-only mode.

The initial implementation of the thin provisioning target uses this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
3791e2fc0e4b40d4188e79b0a99bfa6bce714a10 31-Oct-2011 Alasdair G Kergon <agk@redhat.com> dm table: add singleton feature

Introduce the concept of a singleton table which contains exactly one target.

If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper
will ensure that any table that includes that target contains no others.

The thin provisioning pool target uses this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
4693c9668fdcec229825b3763876b4744f9e6d5e 31-Oct-2011 Mandeep Singh Baines <msb@chromium.org> dm table: propagate non rotational flag

Allow QUEUE_FLAG_NONROT to propagate up the device stack if all
underlying devices are non-rotational. Tools like ureadahead will
schedule IOs differently based on the rotational flag.

With this patch, I see boot time go from 7.75 s to 7.46 s on my device.

Suggested-by: J. Richard Barnette <jrbarnette@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
983c7db347db8ce2d8453fd1d89b7a4bb6920d56 26-Sep-2011 Milan Broz <mbroz@redhat.com> dm crypt: always disable discard_zeroes_data

If optional discard support in dm-crypt is enabled, discards requests
bypass the crypt queue and blocks of the underlying device are discarded.
For the read path, discarded blocks are handled the same as normal
ciphertext blocks, thus decrypted.

So if the underlying device announces discarded regions return zeroes,
dm-crypt must disable this flag because after decryption there is just
random noise instead of zeroes.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
876fbba1db4a377f050a2bb49b474c7527b2995d 26-Sep-2011 Mike Snitzer <snitzer@redhat.com> dm table: avoid crash if integrity profile changes

Commit a63a5cf (dm: improve block integrity support) introduced a
two-phase initialization of a DM device's integrity profile. This
patch avoids dereferencing a NULL 'template_disk' pointer in
blk_integrity_register() if there is an integrity profile mismatch in
dm_table_set_integrity().

This can occur if the integrity profiles for stacked devices in a DM
table are changed between the call to dm_table_prealloc_integrity() and
dm_table_set_integrity().

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org # 2.6.39
ed8b752bccf2560e305e25125721d2f0ac759e88 02-Aug-2011 Mike Snitzer <snitzer@redhat.com> dm table: set flush capability based on underlying devices

DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
regardless of whether or not a given DM device's underlying devices
also advertised a need for them.

Block's flush-merge changes from 2.6.39 have proven to be more costly
for DM devices. Performance regressions have been reported even when
DM's underlying devices do not advertise that they have a write cache.

Fix the performance regressions by configuring a DM device's flushing
capabilities based on those of the underlying devices' capabilities.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
498f0103ea13123e007660def9072a0b7dd1c599 02-Aug-2011 Mike Snitzer <snitzer@redhat.com> dm table: share target argument parsing functions

Move multipath target argument parsing code into dm-table so other
targets can share it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
d5b9dd04bd74b774b8e8d93ced7a0d15ad403fa9 02-Aug-2011 Mikulas Patocka <mpatocka@redhat.com> dm: ignore merge_bvec for snapshots when safe

Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
whether the device can accept bios larger than the size its merge
function returns. When set, use this to send large bios to snapshots
which can split them if necessary. Snapshot I/O may be significantly
fragmented and this approach seems to improve peformance.

Before the patch, dm_set_device_limits restricted bio size to page size
if the underlying device had a merge function and the target didn't
provide a merge function. After the patch, dm_set_device_limits
restricts bio size to page size if the underlying device has a merge
function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
provide a merge function.

The snapshot target can't provide a merge function because when the merge
function is called, it is impossible to determine where the bio will be
remapped. Previously this led us to impose a 4k limit, which we can
now remove if the snapshot store is located on a device without a merge
function. Together with another patch for optimizing full chunk writes,
it improves performance from 29MB/s to 40MB/s when writing to the
filesystem on snapshot store.

If the snapshot store is placed on a non-dm device with a merge function
(such as md-raid), device mapper still limits all bios to page size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
08649012545cfb116798260352547cf4d47064ec 02-Aug-2011 Mike Snitzer <snitzer@redhat.com> dm table: clean dm_get_device and move exports

There is no need for __table_get_device to be factored out.
Also move the exports to the end of their respective functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
e29e65aacbd9e628378084905cbcf62a9fa4a8cc 02-Aug-2011 Joe Perches <joe@perches.com> dm: use vzalloc

Use vzalloc() instead of vmalloc()+memset().

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
936688d7eb0f39be96c5791be1a04994cc8d6aa0 02-Aug-2011 Mike Snitzer <snitzer@redhat.com> dm table: fix discard support

Remove 'discards_supported' from the dm_table structure. The same
information can be easily discovered from the table's target(s) in
dm_table_supports_discards().

Before this fix dm_table_supports_discards() would skip checking the
individual targets' 'discards_supported' flag if any one target in the
table didn't set num_discard_requests > 0. Now the per-target
'discards_supported' flag is effective at insuring the final DM device
advertises discard support. But, to be clear, targets that don't
support discards (!num_discard_requests) will not receive discard
requests.

Also DMWARN if a target sets 'discards_supported' override but forgets
to set 'num_discard_requests'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
60063497a95e716c9a689af3be2687d261f115b4 27-Jul-2011 Arun Sharma <asharma@fb.com> atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f4808ca99a203f20b4475601748e44b25a65bdec 29-May-2011 Milan Broz <mbroz@redhat.com> dm table: reject devices without request fns

This patch adds a check that a block device has a request function
defined before it is used. Otherwise, misconfiguration can cause an oops.

Because we are allowing devices with zero size e.g. an offline multipath
device as in commit 2cd54d9bedb79a97f014e86c0da393416b264eb3
("dm: allow offline devices") there needs to be an additional check
to ensure devices are initialised. Some block devices, like a loop
device without a backing file, exist but have no request function.

Reproducer is trivial: dm-mirror on unbound loop device
(no backing file on loop devices)

dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0"

and mirror resync will immediatelly cause OOps.

BUG: unable to handle kernel NULL pointer dereference at (null)
? generic_make_request+0x2bd/0x590
? kmem_cache_alloc+0xad/0x190
submit_bio+0x53/0xe0
? bio_add_page+0x3b/0x50
dispatch_io+0x1ca/0x210 [dm_mod]
? read_callback+0x0/0xd0 [dm_mirror]
dm_io+0xbb/0x290 [dm_mod]
do_mirror+0x1e0/0x748 [dm_mirror]

Signed-off-by: Milan Broz <mbroz@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
4c2593270133708698d4b8cea2dab469479ad13b 29-May-2011 Mike Snitzer <snitzer@redhat.com> dm table: allow targets to support discards internally

Permit a target to support discards regardless of whether or not all its
underlying devices do.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
a63a5cf84dac7a23a57c800eea5734701e7d3c04 01-Apr-2011 Mike Snitzer <snitzer@redhat.com> dm: improve block integrity support

The current block integrity (DIF/DIX) support in DM is verifying that
all devices' integrity profiles match during DM device resume (which
is past the point of no return). To some degree that is unavoidable
(stacked DM devices force this late checking). But for most DM
devices (which aren't stacking on other DM devices) the ideal time to
verify all integrity profiles match is during table load.

Introduce the notion of an "initialized" integrity profile: a profile
that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
template. Add blk_integrity_is_initialized() to allow checking if a
profile was initialized.

Update DM integrity support to:
- check all devices with _initialized_ integrity profiles match
during table load; uninitialized profiles (e.g. for underlying DM
device(s) of a stacked DM device) are ignored.
- disallow a table load that would result in an integrity profile that
conflicts with a DM device's existing (in-use) integrity profile
- avoid clearing an existing integrity profile
- validate all integrity profiles match during resume; but if they
don't all we can do is report the mismatch (during resume we're past
the point of no return)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
a91a2785b200864aef2270ed6a3babac7a253a20 17-Mar-2011 Martin K. Petersen <martin.petersen@oracle.com> block: Require subsystems to explicitly allocate bio_set integrity mempool

MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snizer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
7eaceaccab5f40bbfda044629a6298616aeaed50 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com> block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
49731baa41df404c2c3f44555869ab387363af43 14-Jan-2011 Tejun Heo <tj@kernel.org> block: restore multiple bd_link_disk_holder() support

Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum. dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.

Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking. The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.

While at it, note that this facility should not be used by anyone else
than the current ones. Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
99d03c141b40914b67d63c9d23b8da4386422ed7 13-Jan-2011 NeilBrown <neilb@suse.de> dm: per target unplug callback support

Add per-target unplug callback support.

Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
9d357b0787bb3c91835d5e658c3bda178f9ca419 13-Jan-2011 NeilBrown <neilb@suse.de> dm: introduce target callbacks and congestion callback

DM currently implements congestion checking by checking on congestion
in each component device. For raid456 we need to also check if the
stripe cache is congested.

Add per-target congestion checker callback support.

Extending the target_callbacks structure with additional callback
functions allows for establishing multiple callbacks per-target (a
callback is also needed for unplug).

Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
72d4cd9f38b5ed96b75df4c622be25e1c2648dd3 17-Dec-2010 Mike Snitzer <snitzer@redhat.com> block: max hardware sectors limit wrapper

Implement blk_limits_max_hw_sectors() and make
blk_queue_max_hw_sectors() a wrapper around it.

DM needs this to avoid setting queue_limits' max_hw_sectors and
max_sectors directly. dm_set_device_limits() now leverages
blk_limits_max_hw_sectors() logic to establish the appropriate
max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was
incorrectly setting max_sectors rather than max_hw_sectors (which
caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
e692cb668fdd5a712c6ed2a2d6f2a36ee83997b4 01-Dec-2010 Martin K. Petersen <martin.petersen@oracle.com> block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead

When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.

There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.

The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.

Reported-by: Ed Lin <ed.lin@promise.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
d4d77629953eabd3c14f6fa5746f6b28babfc55f 13-Nov-2010 Tejun Heo <tj@kernel.org> block: clean up blkdev_get() wrappers and their users

After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.

* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
e525fd89d380c4a94c0d63913a1dd1a593ed25e7 13-Nov-2010 Tejun Heo <tj@kernel.org> block: make blkdev_get/put() handle exclusive access

Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.

Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
e09b457bdb7e8d23fc54dcef0930ac697d8de895 13-Nov-2010 Tejun Heo <tj@kernel.org> block: simplify holder symlink handling

Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.

Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface. This is a step for cleaning up
bd_claim/release. This patch makes dm-table slightly more complex but
it will be simplified again with further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
c8bf1336824ebd698d37b71763e1c43190f2229a 10-Sep-2010 Martin K. Petersen <martin.petersen@oracle.com> Consolidate min_not_zero

We have several users of min_not_zero, each of them using their own
definition. Move the define to kernel.h.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
5ae89a8720c28caf35c4e53711d77df2856c404e 12-Aug-2010 Mike Snitzer <snitzer@redhat.com> dm: linear support discard

Allow discards to be passed through to linear mappings if at least one
underlying device supports it. Discards will be forwarded only to
devices that support them.

A target that supports discards should set num_discard_requests to
indicate how many times each discard request must be submitted to it.

Verify table's underlying devices support discards prior to setting the
associated DM device as capable of discards (via QUEUE_FLAG_DISCARD).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
26803b9f06d365122fae82e7554a66ef8278e0bb 12-Aug-2010 Will Drewry <wad@chromium.org> dm ioctl: refactor dm_table_complete

This change unifies the various checks and finalization that occurs on a
table prior to use. By doing so, it allows table construction without
traversing the dm-ioctl interface.

Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
8215d6ec5fee1e76545decea2cd73717efb5cb42 06-Mar-2010 Nikanth Karthikesan <knikanth@novell.com> dm table: remove unused dm_get_device range parameters

Remove unused parameters(start and len) of dm_get_device()
and fix the callers.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
ecdb2e257abc33ae6798d3ccba87bdafa40ef6b6 06-Mar-2010 Kiyoshi Ueda <k-ueda@ct.jp.nec.com> dm table: remove dm_get from dm_table_get_md

Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
be called from presuspend/postsuspend, which are called while
mapped_device is in DMF_FREEING state, where dm_get() is not allowed.

Justification for that is the lifetime of both objects: As far as the
current dm design/implementation, mapped_device is never freed while
targets are doing something, because dm core waits for targets to become
quiet in dm_put() using presuspend/postsuspend. So targets should be
able to touch mapped_device without holding reference count of the
mapped_device, and we should allow targets to touch mapped_device even
if it is in DMF_FREEING state.

Backgrounds:
I'm trying to remove the multipath internal queue, since dm core now has
a generic queue for request-based dm. In the patch-set, the multipath
target wants to request dm core to start/stop queue. One of such
start/stop requests can happen during postsuspend() while the target
waits for pg-init to complete, because the target stops queue when
starting pg-init and tries to restart it when completing pg-init. Since
queue belongs to mapped_device, it involves calling dm_table_get_md()
and dm_put(). On the other hand, postsuspend() is called in dm_put()
for mapped_device which is in DMF_FREEING state, and that triggers
BUG_ON(DMF_FREEING) in the 2nd dm_put().

I had tried to solve this problem by changing only multipath not to
touch mapped_device which is in DMF_FREEING state, but I couldn't and I
came up with a question why we need dm_get() in dm_table_get_md().

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
b27d7f16d3c6c27345d4280a739809c1c2c4c0b5 11-Jan-2010 Martin K. Petersen <martin.petersen@oracle.com> DM: Fix device mapper topology stacking

Make DM use bdev_stack_limits() function so that partition offsets get
taken into account when calculating alignment. Clarify stacking
warnings.

Also remove obsolete clearing of final alignment_offset and misalignment
flag.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair G. Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
e7d2860b690d4f3bed6824757c540579638e3d1e 15-Dec-2009 André Goddard Rosa <andre.goddard@gmail.com> tree-wide: convert open calls to remove spaces to skip_spaces() lib function

Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.

It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
text data bss dec hex filename
64688 584 592 65864 10148 (TOTALS-BEFORE)
64641 584 592 65817 10119 (TOTALS-AFTER)

Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".

Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
drivers/leds/led-class.c
drivers/leds/ledtrig-timer.c
drivers/video/output.c

@@
expression str;
@@

( // ignore skip_spaces cases
while (*str && isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a794015597a2d9b437470c7692aac77e5fc08cd2 11-Dec-2009 Alasdair G Kergon <agk@redhat.com> dm: bind new table before destroying old

When replacing a mapped device's table during a 'resume', delay the
destruction of the old table until the new one is successfully in place.

This will make it easier for a later patch to transfer internal state
information from the old table to the new one (something we do not currently
support) while giving us more options for reversion if a later part
of the operation fails.

Devices are always in the suspended state during dm_swap_table().
This patch reinforces the requirement that all I/O must have been
flushed from the table targets while in this state (including any in
workqueues). In the case of 'noflush' suspending, unprocessed
I/O should have been 'pushed back' to the dm core prior to this point,
for resubmission after the new table is in place.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
40bea431274c247425e7f5970d796ff7b37a2b22 04-Sep-2009 Mike Snitzer <snitzer@redhat.com> dm stripe: expose correct io hints

Set sensible I/O hints for striped DM devices in the topology
infrastructure added for 2.6.31 for userspace tools to
obtain via sysfs.

Add .io_hints to 'struct target_type' to allow the I/O hints portion
(io_min and io_opt) of the 'struct queue_limits' to be set by each
target and implement this for dm-stripe.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
a963a956225eb0f8c4d3537f428153c30adf54b8 04-Sep-2009 Mike Snitzer <snitzer@redhat.com> dm table: add more context to terse warning messages

A couple of recent warning messages make it difficult for the reader to
determine exactly what is wrong. This patch adds more information to
those messages.

The messages were added by these commits:
5dea271b6d87bd1d79a59c1d5baac2596a841c37 ("dm table: pass correct dev area size
to device_area_is_valid")
ea9df47cc92573b159ef3b4fda516c32cba9c4fd ("dm table: fix blk_stack_limits arg
to use bytes not sectors")

The patch also corrects references to logical_block_size in printk format
strings from %hu to %u.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
f6a1ed10864b7540fa758bbccf3433fe17070329 04-Sep-2009 Mikulas Patocka <mpatocka@redhat.com> dm table: fix queue_limit checking device iterator

The logic to check for valid device areas is inverted relative to proper
use with iterate_devices.

The iterate_devices method calls its callback for every underlying
device in the target. If any callback returns non-zero, iterate_devices
exits immediately. But the callback device_area_is_valid() returns 0 on
error and 1 on success. The overall effect without is that an error is
issued only if every device is invalid.

This patch renames device_area_is_valid to device_area_is_invalid and
inverts the logic so that one invalid device is sufficient to raise
an error.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
5dea271b6d87bd1d79a59c1d5baac2596a841c37 23-Jul-2009 Mike Snitzer <snitzer@redhat.com> dm table: pass correct dev area size to device_area_is_valid

Incorrect device area lengths are being passed to device_area_is_valid().

The regression appeared in 2.6.31-rc1 through commit
754c5fc7ebb417b23601a6222a6005cc2e7f2913.

With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes). An example of a
consequent incorrect error message is:

device-mapper: table: 254:0: sdb too small for target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
a732c207d19e899845ae47139708af898daaf9fd 23-Jul-2009 Mike Snitzer <snitzer@redhat.com> dm: remove queue next_ordered workaround for barriers

This patch removes DM's bio-based vs request-based conditional setting
of next_ordered. For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path). For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c18f7675b50d283301d46d755affe75fd

request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa2366ccb8257d103d0b43df855605c3c086

The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680ba4b5c56c4be59f0698000df89ff0281

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
ea9df47cc92573b159ef3b4fda516c32cba9c4fd 30-Jun-2009 Mike Snitzer <snitzer@redhat.com> dm table: fix blk_stack_limits arg to use bytes not sectors

The offset passed to blk_stack_limits() must be in bytes not sectors.
Fixes false warnings like the following:
device-mapper: table: 254:1: target device sda6 is misaligned

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Frans Pop <elendil@planet.nl>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
5d67aa2366ccb8257d103d0b43df855605c3c086 22-Jun-2009 Kiyoshi Ueda <k-ueda@ct.jp.nec.com> dm: do not set QUEUE_ORDERED_DRAIN if request based

Request-based dm doesn't have barrier support yet.
So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
Since the device type is decided at the first table loading time,
the flag set is deferred until then.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
e6ee8c0b767540f59e20da3ced282601db8aa502 22-Jun-2009 Kiyoshi Ueda <k-ueda@ct.jp.nec.com> dm: enable request based option

This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
some target drivers which are more fitting to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based.

type of underlying device
bio-based request-based
----------------------------------------------
bio-based OK OK
request-based -- OK

The device type is recognized by the queue flag in the kernel,
so dm follows that.

o The type of a dm device is decided at the first table binding time.
Once the type of a dm device is decided, the type can't be changed.

o Mempool allocations are deferred to at the table loading time, since
mempools for request-based dm are different from those for bio-based
dm and needed mempool type is fixed by the type of table.

o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
needs that all target drivers support merge() function.
Both will take a time.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
cec47e3d4a861e1d942b3a580d0bbef2700d2bb2 22-Jun-2009 Kiyoshi Ueda <k-ueda@ct.jp.nec.com> dm: prepare for request based option

This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
make_request_fn: dm_make_request()
pref_fn: dm_prep_fn()
request_fn: dm_request_fn()
softirq_done_fn: dm_softirq_done()
lld_busy_fn: dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).

Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md

bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.

Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.

dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.

Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)

After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using
blk_insert_cloned_request().

Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
bio->bi_end_io(): end_clone_bio()
rq->end_io(): end_clone_request()

Summary of the request completion flow is below:
blk_end_request() for a clone request
=> blk_update_request()
=> bio->bi_end_io() == end_clone_bio() for each clone bio
=> Free the clone bio
=> Success: Complete the original bio (blk_update_request())
Error: Don't complete the original bio
=> blk_finish_request()
=> rq->end_io() == end_clone_request()
=> blk_complete_request()
=> dm_softirq_done()
=> Free the clone request
=> Success: Complete the original request (blk_end_request())
Error: Requeue the original request

end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().

end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
- The submitted request may be mapped to the same device
- Request submission requires queue lock, but the queue lock
has been held by itself and it doesn't know that

The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it again even error cases.
Instead, they can ask dm core for requeueing and remapping
the original request in that cases.

suspend
=======
Request-based dm uses stopping md->queue as suspend of the md.
For noflush suspend, just stops md->queue.

For flush suspend, inserts a marker request to the tail of md->queue.
And dispatches all requests in md->queue until the marker comes to
the front of md->queue. Then, stops dispatching request and waits
for the all dispatched requests to complete.
After that, completes the marker request, stops md->queue and
wake up the waiter on the suspend queue, md->wait.

resume
======
Starts md->queue.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
754c5fc7ebb417b23601a6222a6005cc2e7f2913 22-Jun-2009 Mike Snitzer <snitzer@redhat.com> dm: calculate queue limits during resume not load

Currently, device-mapper maintains a separate instance of 'struct
queue_limits' for each table of each device. When the configuration of
a device is to be changed, first its table is loaded and this structure
is populated, then the device is 'resumed' and the calculated
queue_limits are applied.

This places restrictions on how userspace may process related devices,
where it is often advantageous to 'load' tables for several devices
at once before 'resuming' them together. As the new queue_limits
only take effect after the 'resume', if they are changing and one
device uses another, the latter must be 'resumed' before the former
may be 'loaded'.

This patch moves the calculation of these queue_limits out of
the 'load' operation into 'resume'. Since we are no longer
pre-calculating this struct, we no longer need to maintain copies
within our dm structs.

dm_set_device_limits() now passes the 'start' of the device's
data area (aka pe_start) as the 'offset' to blk_stack_limits().

init_valid_queue_limits() is replaced by blk_set_default_limits().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: martin.petersen@oracle.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
1197764e403d97231eb6da2b1e16f511a7fd3101 22-Jun-2009 Mike Snitzer <snitzer@redhat.com> dm table: establish queue limits by copying table limits

Copy the table's queue_limits to the DM device's request_queue. This
properly initializes the queue's topology limits and also avoids having
to track the evolution of 'struct queue_limits' in
dm_table_set_restrictions()

Also fixes a bug that was introduced in dm_table_set_restrictions() via
commit ae03bf639a5027d27270123f5f6e3ee6a412781d. In addition to
establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit()
also performs an allocation to setup the ISA DMA pool. This allocation
resulted in "sleeping function called from invalid context" when called
from dm_table_set_restrictions().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
5ab97588fb266187b88d1ad893251c94388f18ba 22-Jun-2009 Mike Snitzer <snitzer@redhat.com> dm table: replace struct io_restrictions with struct queue_limits

Use blk_stack_limits() to stack block limits (including topology) rather
than duplicate the equivalent within Device Mapper.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
be6d4305db093ad1cc623f7dd3d2470b7bd73fa4 22-Jun-2009 Mike Snitzer <snitzer@redhat.com> dm table: validate device logical_block_size

Impose necessary and sufficient conditions on a devices's table such
that any incoming bio which respects its logical_block_size can be
processed successfully.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
02acc3a4fa0a6c2a5ccc4fb722b55fb710265882 22-Jun-2009 Mike Snitzer <snitzer@redhat.com> dm table: ensure targets are aligned to logical_block_size

Ensure I/O is aligned to the logical block size of target devices.

Rename check_device_area() to device_area_is_valid() for clarity and
establish the device limits including the logical block size prior to
calling it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
1b6da754594e6e26c24e6fbc1a34f9c03e4617a3 22-Jun-2009 Jonthan Brassow <jbrassow@redhat.com> dm table: improve warning message when devices not freed before destruction

Report any devices forgotten to be freed before a table is destroyed.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
5657e8fa45cf230df278040c420fb80e06309d8f 22-Jun-2009 Mikulas Patocka <mpatocka@redhat.com> dm: use i_size_read

Use i_size_read() instead of reading i_size.

If someone changes the size of the device simultaneously, i_size_read
is guaranteed to return a valid value (either the old one or the new one).

i_size can return some intermediate invalid value (on 32-bit computers
with 64-bit i_size, the reads to both halves of i_size can be interleaved
with updates to i_size, resulting in garbage being returned).

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
9df1bb9b516daeece159ab7fb262d01a0359247c 09-Jun-2009 Jens Axboe <jens.axboe@oracle.com> Revert "block: Fix bounce limit setting in DM"

This reverts commit a05c0205ba031c01bba33a21bf0a35920eb64833.

DM doesn't need to access the bounce_pfn directly.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
a05c0205ba031c01bba33a21bf0a35920eb64833 03-Jun-2009 Martin K. Petersen <martin.petersen@oracle.com> block: Fix bounce limit setting in DM

blk_queue_bounce_limit() is more than a wrapper about the request queue
limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn() which can
be called by stacking drivers that wish to set the bounce limit
explicitly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ae03bf639a5027d27270123f5f6e3ee6a412781d 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com> block: Use accessor functions for queue limits

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com> block: Do away with the notion of hardsect_size

Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
692d0eb9e02cf81fb387ff891f53840db2f3110a 09-Apr-2009 Mikulas Patocka <mpatocka@redhat.com> dm: remove limited barrier support

Prepare for full barrier implementation: first remove the restricted support.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
9c47008d13add50ec4597a8b9eee200c515282c8 09-Apr-2009 Martin K. Petersen <martin.petersen@oracle.com> dm: add integrity support

This patch provides support for data integrity passthrough in the device
mapper.

- If one or more component devices support integrity an integrity
profile is preallocated for the DM device.

- If all component devices have compatible profiles the DM device is
flagged as capable.

- Handle integrity metadata when splitting and cloning bios.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
570b9d968bf9b16974252ef7cbce73fa6dac34f3 02-Apr-2009 Alasdair G Kergon <agk@redhat.com> dm table: fix upgrade mode race

upgrade_mode() sets bdev to NULL temporarily, and does not have any
locking to exclude anything from seeing that NULL.

In dm_table_any_congested() bdev_get_queue() can dereference that NULL and
cause a reported oops.

Fix this by not changing that field during the mode upgrade.

Cc: stable@kernel.org
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
d58168763f74d1edbc296d7038c60efe6493fdd4 06-Jan-2009 Mikulas Patocka <mpatocka@redhat.com> dm table: rework reference counting

Rework table reference counting.

The existing code uses a reference counter. When the last reference is
dropped and the counter reaches zero, the table destructor is called.
Table reference counters are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of the upcalls, the table
destructor is called from almost random kernel code.

This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
same process.
* stale reference from some other kernel code keeps the table
constructed, which keeps some devices open, even after successful
return from "dmsetup remove". This can confuse lvm and prevent closing
of underlying devices or reusing device minor numbers.

The patch changes reference counting so that the table destructor can be
called only at predetermined places.

The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. After this patch, this reference is not counted
in table->holders. A pair of dm_create_table/dm_destroy_table functions
is used for table creation/destruction.

Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.

When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor. We use active waiting with
msleep(1), because the situation happens rarely (to one user in 5 years)
and removing the device isn't performance-critical task: the user doesn't
care if it takes one tick more or not.

This way, the destructor is called only at specific points
(dm_table_destroy function) and the above problems associated with lazy
destruction can't happen.

Finally remove the temporary protection added to dm_any_congested().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
ab4c1424882be9cd70b89abf2b484add355712fa 06-Jan-2009 Andi Kleen <ak@suse.de> dm: support barriers on simple devices

Implement barrier support for single device DM devices

This patch implements barrier support in DM for the common case of dm linear
just remapping a single underlying device. In this case we can safely
pass the barrier through because there can be no reordering between
devices.

NB. Any DM device might cease to support barriers if it gets
reconfigured so code must continue to allow for a possible
-EOPNOTSUPP on every barrier bio submitted. - agk

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
0e435ac26e3f951d83338ed3d4ab7dc0fe0055bc 03-Dec-2008 Milan Broz <mbroz@redhat.com> block: fix setting of max_segment_size and seg_boundary mask

Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
devices.

When stacking devices (LVM over MD over SCSI) some of the request queue
parameters are not set up correctly in some cases by default, namely
max_segment_size and and seg_boundary mask.

If you create MD device over SCSI, these attributes are zeroed.

Problem become when there is over this mapping next device-mapper mapping
- queue attributes are set in DM this way:

request_queue max_segment_size seg_boundary_mask
SCSI 65536 0xffffffff
MD RAID1 0 0
LVM 65536 -1 (64bit)

Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of
physical segments according to these parameters.

During the generic_make_request() is segment cout recalculated and can
increase bio->bi_phys_segments count over the allowed limit. (After
bio_clone() in stack operation.)

Thi is specially problem in CCISS driver, where it produce OOPS here

BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

(MAXSEGENTRIES is 31 by default.)

Sometimes even this command is enough to cause oops:

dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

This command generates bios with 250 sectors, allocated in 32 4k-pages
(last page uses only 1024 bytes).

For LVM layer, it allocates bio with 31 segments (still OK for CCISS),
unfortunatelly on lower layer it is recalculated to 32 segments and this
violates CCISS restriction and triggers BUG_ON().

The patch tries to fix it by:

* initializing attributes above in queue request constructor
blk_queue_make_request()

* make sure that blk_queue_stack_limits() inherits setting

(DM uses its own function to set the limits because it
blk_queue_stack_limits() was introduced later. It should probably switch
to use generic stack limit function too.)

* sets the default seg_boundary value in one place (blkdev.h)

* use this mask as default in DM (instead of -1, which differs in 64bit)

Bugs related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=471639
http://bugzilla.kernel.org/show_bug.cgi?id=8672

Signed-off-by: Milan Broz <mbroz@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
72e8264eda02b6ba85a222b9620802ebf23c6a10 11-Aug-2008 Christoph Hellwig <hch@lst.de> [PATCH] dm: kill lookup_device wrapper

Now that lookup_bdev is exported and used by dm just use it directly
instead of through a trivial wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9a1c3542768b5a58e45a9216921cd10a3bae1205 23-Feb-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] pass fmode_t to blkdev_put()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
aeb5d727062a0238a2f96c9c380fbd2be4640c6f 02-Sep-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] introduce fmode_t, do annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
0c2322e4ce144e130c03d813fe92de3798662c5e 10-Oct-2008 Alasdair G Kergon <agk@redhat.com> dm: detect lost queue

Detect and report buggy drivers that destroy their request_queue.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Stefan Raspl <raspl@linux.vnet.ibm.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
82b1519b345d61dcfae526e3fcb08128f39f9bcc 10-Oct-2008 Mikulas Patocka <mpatocka@redhat.com> dm: export struct dm_dev

Split struct dm_dev in two and publish the part that other targets need in
include/linux/device-mapper.h.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
d5686b444ff3f72808d2b3fbd58672a86cdf38e7 01-Aug-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] switch mtd and dm-table to lookup_bdev()

No need to open-code it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9980c638a666ecd88acaf0a7ab91043d4a3f44d1 21-Jul-2008 Milan Broz <mbroz@redhat.com> dm table: remove merge_bvec sector restriction

Remove max_sector restriction - merge function replaced it.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
c9a3f6d6f541915bd7451fc7e9cb23a8b33a3ab8 29-Apr-2008 Jens Axboe <jens.axboe@oracle.com> dm: use unlocked variants of queue flag check/set

dm.c already provides mutual exclusion through ->map_lock.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
75ad23bc0fcb4f992a5d06982bf0857ab1738e9e 29-Apr-2008 Nick Piggin <npiggin@suse.de> block: make queue flags non-atomic

We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
4fdfe401e9d7e30029972d568c667234c0c1d828 24-Apr-2008 Adrian Bunk <bunk@kernel.org> dm table: remove unused dm_create_error_table

dm_create_error_table() was added in kernel 2.6.18 and never used...

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
e8488d08586e6df7fab3db7881631bb13619311b 24-Apr-2008 Adrian Bunk <bunk@kernel.org> dm table: drop void suspend_targets return

void returning functions returned the return value of another void
returning function...

Spotted by sparse.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
1d957f9bf87da74f420424d16ece005202bbebd3 15-Feb-2008 Jan Blunck <jblunck@suse.de> Introduce path_put()

* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()

[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4ac9137858e08a19f29feac4e1f4df7c268b0ba5 15-Feb-2008 Jan Blunck <jblunck@suse.de> Embed a struct path into struct nameidata instead of nd->{dentry,mnt}

This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.

Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:

without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux

with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux

This patch:

Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
69a2ce72a4efe0653479a5d69fc86b5726e83219 08-Feb-2008 Andrew Morton <akpm@linux-foundation.org> dm: table use uninitialized_var

drivers/md/dm-table.c: In function 'dm_get_device':
drivers/md/dm-table.c:478: warning: 'dev' may be used uninitialized in this function

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
82d601dc076deb5f348cc3a70f57248bc976ae0c 08-Feb-2008 Jun'ichi Nomura <j-nomura@ce.jp.nec.com> dm: table remove unused total

"total = 0" does nothing.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
afb24528f9012e5c6361ca9a9128c7c089c1cc7c 08-Feb-2008 Paul Jimenez <pj@place.org> dm: table use list_for_each

This patch is some minor janitorish cleanup, using some macros
from linux/list.h (already #included via dm.h) to improve
readability.

Signed-off-by: Paul Jimenez <pj@place.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
91212507f93778c09d4c1335207b6f4b995f5ad1 13-Dec-2007 Neil Brown <neilb@suse.de> dm: merge max_hw_sector

Make sure dm honours max_hw_sectors of underlying devices

We still have no firm testing evidence in support of this patch but
believe it may help to resolve some bug reports. - agk

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
512875bd9661368da6f993205a61213b79ba1df0 13-Dec-2007 Jun'ichi Nomura <j-nomura@ce.jp.nec.com> dm: table detect io beyond device

This patch fixes a panic on shrinking a DM device if there is
outstanding I/O to the part of the device that is being removed.
(Normally this doesn't happen - a filesystem would be resized first,
for example.)

The bug is that __clone_and_map() assumes dm_table_find_target()
always returns a valid pointer. It may fail if a bio arrives from the
block layer but its target sector is no longer included in the DM
btree.

This patch appends an empty entry to table->targets[] which will
be returned by a lookup beyond the end of the device.

After calling dm_table_find_target(), __clone_and_map() and target_message()
check for this condition using
dm_target_is_valid().

Sample test script to trigger oops:
2ad8b1ef11c98c5603580878aebf9f1bc74129e4 07-Nov-2007 Alan D. Brunelle <Alan.Brunelle@hp.com> Add UNPLUG traces to all appropriate places

Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
5ec140e600b7d6624c657f008833f0e71bd5ef48 31-Oct-2007 Vasily Averin <vvs@sw.ru> dm: bounce_pfn limit added

Device mapper uses its own bounce_pfn that may differ from one on underlying
device. In that way dm can build incorrect requests that contain sg elements
greater than underlying device is able to handle.

This is the cause of slab corruption in i2o layer, occurred on i386 arch when
very long direct IO requests are addressed to dm-over-i2o device.

Signed-off-by: Vasily Averin <vvs@sw.ru>
Cc: <stable@kernel.org>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
094262db9e4c615e0db7a7b924d244b7a6c186b0 19-Oct-2007 Dmitry Monakhov <dmonakhov@openvz.org> dm: use kzalloc

Convert kmalloc() + memset() to kzalloc().

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
fd5d806266935179deda1502101624832eacd01f 16-Oct-2007 Jens Axboe <jens.axboe@oracle.com> block: convert blkdev_issue_flush() to use empty barriers

Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
165125e1e480f9510a5ffcfbfee4e3ee38c05f23 24-Jul-2007 Jens Axboe <jens.axboe@oracle.com> [BLOCK] Get rid of request_queue_t typedef

Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2cd54d9bedb79a97f014e86c0da393416b264eb3 09-May-2007 Mike Anderson <andmike@us.ibm.com> dm: allow offline devices

Allow check_device_area to succeed if a device has an i_size of zero. This
addresses an issue seen on DASD devices setting up a multipath table for paths
in online and offline state.

Signed-off-by: Mike Anderson <andmike@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
999d816851c3e080412a19558f111d01852d2f04 03-Oct-2006 Bryn Reeves <breeves@redhat.com> [PATCH] dm table: add target flush

This patch adds support for a per-target dm_flush_fn method. This is needed
to allow dm-loop to invalidate page cache mappings in response to BLKFLSBUF
ioctl commands.

Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
3cb4021453a69585e458ec2177677c0c1300dccf 03-Oct-2006 Bryn Reeves <breeves@redhat.com> [PATCH] dm: extract device limit setting

Separate the setting of device I/O limits from dm_get_device(). dm-loop will
use this.

Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8757b7764f13e336f3c0eb1f634440d4ee4c3a67 03-Oct-2006 Milan Broz <mbroz@redhat.com> [PATCH] dm table: add target preresume

This patch adds a target preresume hook.

It is called before the targets are resumed and if it returns an error the
resume gets cancelled.

The crypt target will use this to indicate that it is unable to process I/O
because no encryption key has been supplied.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
72d9486169a2a8353e022813185ba2f32d7dde69 26-Jun-2006 Alasdair G Kergon <agk@redhat.com> [PATCH] dm: improve error message consistency

Tidy device-mapper error messages to include context information
automatically.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
c2ade42dd35466d90aa6fc7cc717f396e165492f 26-Jun-2006 David Teigland <teigland@redhat.com> [PATCH] dm: create error table

Add a library function dm_create_error_table() to create a table that rejects
any I/O sent to a device with EIO.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
814d68629b40e863997fa0eea459be4cc99a06cc 26-Jun-2006 David Teigland <teigland@redhat.com> [PATCH] dm table split_args: handle no input

Return sense if dm_split_args is called with a NULL input parameter.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
143535396c7ebd9395a931a000b3963f457712b8 26-Jun-2006 Milan Broz <mbroz@redhat.com> [PATCH] dm table: get_target: fix last index

The table is indexed from 0, so an index equal to t->num_targets should be
rejected.

(There is no code in the current tree that would exercise this bug.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
48c9c27b8bcd2a328a06151e2d5c1170db0b701b 27-Mar-2006 Arjan van de Ven <arjan@infradead.org> [PATCH] sem2mutex: drivers/md

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
f165921df46a977e3561f1bd9f13a348441486d1 27-Mar-2006 Jun'ichi Nomura <j-nomura@ce.jp.nec.com> [PATCH] dm/md dependency tree in sysfs: dm to use bd_claim_by_disk

Use bd_claim_by_disk.

Following symlinks are created if dm-0 maps to sda:
/sys/block/dm-0/slaves/sda --> /sys/block/sda
/sys/block/sda/holders/dm-0 --> /sys/block/dm-0

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1134e5ae79bab61c05657ca35a6297cf87202e35 27-Mar-2006 Mike Anderson <andmike@us.ibm.com> [PATCH] dm table: store md

Store an up-pointer to the owning struct mapped_device in every table when it
is created.

Access it with:
struct mapped_device *dm_table_get_md(struct dm_table *t)

Tables linked to md must be destroyed before the md itself.

Signed-off-by: Mike Anderson <andmike@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
969429b504ae866d3f8b1cafd68a2c099e305093 27-Mar-2006 NeilBrown <neilb@suse.de> [PATCH] dm: make sure QUEUE_FLAG_CLUSTER is set properly

This flag should be set for a virtual device iff it is set for all
underlying devices.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
547bc92649345af6014578a64b27cc5787617935 26-Mar-2006 Eric Sesterhenn <snakebyte@gmx.de> BUG_ON() Conversion in md/dm-table.c

this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
3ee247ebce93a526f482d6bc714ce796fa85a81a 01-Feb-2006 Alasdair G Kergon <agk@redhat.com> [PATCH] dm: dm-table warning fix

drivers/md/dm-table.c:500: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
defd94b75409b983f94548ea2f52ff5787ddb848 05-Dec-2005 Mike Christie <michaelc@cs.wisc.edu> [SCSI] seperate max_sectors from max_hw_sectors

- export __blk_put_request and blk_execute_rq_nowait
needed for async REQ_BLOCK_PC requests
- seperate max_hw_sectors and max_sectors for block/scsi_ioctl.c and
SG_IO bio.c helpers per Jens's last comments. Since block/scsi_ioctl.c SG_IO was
already testing against max_sectors and SCSI-ml was setting max_sectors and
max_hw_sectors to the same value this does not change any scsi SG_IO behavior. It only
prepares ll_rw_blk.c, scsi_ioctl.c and bio.c for when SCSI-ml begins to set
a valid max_hw_sectors for all LLDs. Today if a LLD does not set it
SCSI-ml sets it to a safe default and some LLDs set it to a artificial low
value to overcome memory and feedback issues.

Note: Since we now cap max_sectors to BLK_DEF_MAX_SECTORS, which is 1024,
drivers that used to call blk_queue_max_sectors with a large value of
max_sectors will now see the fs requests capped to BLK_DEF_MAX_SECTORS.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
cf222b3769c3759488579441ab724ed33a2da5f4 29-Jul-2005 Alasdair G Kergon <agk@redhat.com> [PATCH] device-mapper: fix deadlocks in core (prep)

Some code tidy-ups in preparation for the next patches. Change
dm_table_pre/postsuspend_targets to accept NULL. Use dm_suspended()
throughout.

Signed-Off-By: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d5e404c10a98fc2979643476851e9cbdb1944812 13-Jul-2005 Alasdair G Kergon <agk@redhat.com> [PATCH] device-mapper snapshots: Handle origin extension

Handle writes to a snapshot-origin device that has been extended since the
snapshot was taken.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
5e198d94dd0c3ec7f6138229e2e412c2c6268c38 06-May-2005 Alasdair G Kergon <agk@redhat.com> [PATCH] device-mapper: Some missing statics

This patch makes some needlessly global code static.

Signed-Off-By: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!