From 5151412dd4338b273afdb107c3772528e9e67d92 Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer@redhat.com>
Date: Wed, 23 Nov 2011 10:59:13 +0100
Subject: block: initialize request_queue's numa node during

struct request_queue is allocated with __GFP_ZERO so its "node" field is
zero before initialization.  This causes an oops if node 0 is offline in
the page allocator because its zonelists are not initialized.  From Dave
Young's dmesg:

	SRAT: Node 1 PXM 2 0-d0000000
	SRAT: Node 1 PXM 2 100000000-330000000
	SRAT: Node 0 PXM 1 330000000-630000000
	Initmem setup node 1 0000000000000000-000000000affb000
	...
	Built 1 zonelists in Node order, mobility grouping on.
	...
	BUG: unable to handle kernel paging request at 0000000000001c08
	IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870

and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
zonelist->_zonerefs.

The fix is to initialize q->node at the time of allocation so the correct
node is passed to the slab allocator later.

Since blk_init_allocated_queue_node() is no longer needed, merge it with
blk_init_allocated_queue().

[rientjes@google.com: changelog, initializing q->node]
Cc: stable@vger.kernel.org [2.6.37+]
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

(limited to 'block')
diff --git a/block/blk-core.c b/block/blk-core.c
index ea70e6c80cd..20d69f6beb6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -467,6 +467,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	q->backing_dev_info.name = "block";
+	q->node = node_id;
 
 	err = bdi_init(&q->backing_dev_info);
 	if (err) {
@@ -551,7 +552,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	if (!uninit_q)
 		return NULL;
 
-	q = blk_init_allocated_queue_node(uninit_q, rfn, lock, node_id);
+	q = blk_init_allocated_queue(uninit_q, rfn, lock);
 	if (!q)
 		blk_cleanup_queue(uninit_q);
 
@@ -562,19 +563,10 @@ EXPORT_SYMBOL(blk_init_queue_node);
 struct request_queue *
 blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 			 spinlock_t *lock)
-{
-	return blk_init_allocated_queue_node(q, rfn, lock, -1);
-}
-EXPORT_SYMBOL(blk_init_allocated_queue);
-
-struct request_queue *
-blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
-			      spinlock_t *lock, int node_id)
 {
 	if (!q)
 		return NULL;
 
-	q->node = node_id;
 	if (blk_init_free_list(q))
 		return NULL;
 
@@ -604,7 +596,7 @@ blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
 
 	return NULL;
 }
-EXPORT_SYMBOL(blk_init_allocated_queue_node);
+EXPORT_SYMBOL(blk_init_allocated_queue);
 
 int blk_get_queue(struct request_queue *q)
 {
-- 
cgit v1.2.3-70-g09d2


From 2984ff38ccf6cbc02a7a996a36c7d6f69f3c6146 Mon Sep 17 00:00:00 2001
From: majianpeng <majianpeng@gmail.com>
Date: Wed, 30 Nov 2011 15:47:48 +0100
Subject: cfq-iosched: free cic_index if blkio_alloc_blkg_stats fails

If we fail allocating the blkpg stats, we free cfqd and cfgq.
But we need to free the IDA cfqd->cic_index as well.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 16ace89613b..3beed83437a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4036,6 +4036,11 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
 		kfree(cfqg);
+
+		spin_lock(&cic_index_lock);
+		ida_remove(&cic_index_ida, cfqd->cic_index);
+		spin_unlock(&cic_index_lock);
+
 		kfree(cfqd);
 		return NULL;
 	}
-- 
cgit v1.2.3-70-g09d2


From 5eb46851de3904cd1be9192fdacb8d34deadc1fc Mon Sep 17 00:00:00 2001
From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Date: Fri, 2 Dec 2011 10:07:07 +0100
Subject: cfq-iosched: fix cfq_cic_link() race confition

cfq_cic_link() has race condition. When some processes which shared ioc
issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST
sometimes. The race condition might stop I/O by following steps:

step  1: Process A: Issue an I/O to /dev/sda
step  2: Process A: Get an ioc (iocA here) in get_io_context() which does not
		    linked with a cic for the device
step  3: Process A: Get a new cic for the device (cicA here) in
		    cfq_alloc_io_context()

step  4: Process B: Issue an I/O to /dev/sda
step  5: Process B: Get iocA in get_io_context() since process A and B share the
		    same ioc
step  6: Process B: Get a new cic for the device (cicB here) in
		    cfq_alloc_io_context() since iocA has not been linked with a
		    cic for the device yet

step  7: Process A: Link cicA to iocA in cfq_cic_link()
step  8: Process A: Dispatch I/O to driver and finish it

step  9: Process B: Try to link cicB to iocA in cfq_cic_link()
		    But it fails with showing "cfq: cic link failed!" kernel
		    message, since iocA has already linked with cicA at step 7.
step 10: Process B: Wait for finishig I/O in get_request_wait()
		    The function does not wake up, when there is no I/O to the
		    device.

When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic.
So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup().

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3beed83437a..4c12869fcf7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3184,7 +3184,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		}
 	}
 
-	if (ret)
+	if (ret && ret != -EEXIST)
 		printk(KERN_ERR "cfq: cic link failed!\n");
 
 	return ret;
@@ -3200,6 +3200,7 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	int ret;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -3207,6 +3208,7 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 	if (!ioc)
 		return NULL;
 
+retry:
 	cic = cfq_cic_lookup(cfqd, ioc);
 	if (cic)
 		goto out;
@@ -3215,7 +3217,12 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 	if (cic == NULL)
 		goto err;
 
-	if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
+	ret = cfq_cic_link(cfqd, ioc, cic, gfp_mask);
+	if (ret == -EEXIST) {
+		/* someone has linked cic to ioc already */
+		cfq_cic_free(cic);
+		goto retry;
+	} else if (ret)
 		goto err_free;
 
 out:
-- 
cgit v1.2.3-70-g09d2


From bb9d97b6dffa10cec5e1ce9adbce60f3c2b5eabc Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 12 Dec 2011 18:12:21 -0800
Subject: cgroup: don't use subsys->can_attach_task() or ->attach_task()

Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations.  Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead.  Most converions are straight-forward.
Noteworthy changes are,

* In cgroup_freezer, remove unnecessary NULL assignments to unused
  methods.  It's useless and very prone to get out of sync, which
  already happened.

* In cpuset, PF_THREAD_BOUND test is checked for each task.  This
  doesn't make any practical difference but is conceptually cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 block/blk-cgroup.c      | 45 +++++++++++++++++++------------
 kernel/cgroup_freezer.c | 14 +++-------
 kernel/cpuset.c         | 70 ++++++++++++++++++++++---------------------------
 kernel/events/core.c    | 13 +++++----
 kernel/sched.c          | 31 +++++++++++++---------
 5 files changed, 91 insertions(+), 82 deletions(-)

(limited to 'block')

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8f630cec906..b8c143d68ee 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
+			      struct cgroup_taskset *);
+static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
+			   struct cgroup_taskset *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -44,8 +46,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach_task = blkiocg_can_attach_task,
-	.attach_task = blkiocg_attach_task,
+	.can_attach = blkiocg_can_attach,
+	.attach = blkiocg_attach,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1626,30 +1628,39 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_taskset *tset)
 {
+	struct task_struct *task;
 	struct io_context *ioc;
 	int ret = 0;
 
 	/* task_lock() is needed to avoid races with exit_io_context() */
-	task_lock(tsk);
-	ioc = tsk->io_context;
-	if (ioc && atomic_read(&ioc->nr_tasks) > 1)
-		ret = -EINVAL;
-	task_unlock(tsk);
-
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+			ret = -EINVAL;
+		task_unlock(task);
+		if (ret)
+			break;
+	}
 	return ret;
 }
 
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			   struct cgroup_taskset *tset)
 {
+	struct task_struct *task;
 	struct io_context *ioc;
 
-	task_lock(tsk);
-	ioc = tsk->io_context;
-	if (ioc)
-		ioc->cgroup_changed = 1;
-	task_unlock(tsk);
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc)
+			ioc->cgroup_changed = 1;
+		task_unlock(task);
+	}
 }
 
 void blkio_policy_register(struct blkio_policy_type *blkiop)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e95c6fb65cc..0e748059ba8 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -162,10 +162,14 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup_taskset *tset)
 {
 	struct freezer *freezer;
+	struct task_struct *task;
 
 	/*
 	 * Anything frozen can't move or be moved to/from.
 	 */
+	cgroup_taskset_for_each(task, new_cgroup, tset)
+		if (cgroup_freezing(task))
+			return -EBUSY;
 
 	freezer = cgroup_freezer(new_cgroup);
 	if (freezer->state != CGROUP_THAWED)
@@ -174,11 +178,6 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
-{
-	return cgroup_freezing(tsk) ? -EBUSY : 0;
-}
-
 static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
@@ -374,10 +373,5 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
-	.can_attach_task = freezer_can_attach_task,
-	.pre_attach	= NULL,
-	.attach_task	= NULL,
-	.attach		= NULL,
 	.fork		= freezer_fork,
-	.exit		= NULL,
 };
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 512bd59e862..9a8a6130152 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1375,33 +1375,34 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			     struct cgroup_taskset *tset)
 {
 	struct cpuset *cs = cgroup_cs(cgrp);
+	struct task_struct *task;
+	int ret;
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
 		return -ENOSPC;
 
-	/*
-	 * Kthreads bound to specific cpus cannot be moved to a new cpuset; we
-	 * cannot change their cpu affinity and isolating such threads by their
-	 * set of allowed nodes is unnecessary.  Thus, cpusets are not
-	 * applicable for such threads.  This prevents checking for success of
-	 * set_cpus_allowed_ptr() on all attached tasks before cpus_allowed may
-	 * be changed.
-	 */
-	if (cgroup_taskset_first(tset)->flags & PF_THREAD_BOUND)
-		return -EINVAL;
-
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		/*
+		 * Kthreads bound to specific cpus cannot be moved to a new
+		 * cpuset; we cannot change their cpu affinity and
+		 * isolating such threads by their set of allowed nodes is
+		 * unnecessary.  Thus, cpusets are not applicable for such
+		 * threads.  This prevents checking for success of
+		 * set_cpus_allowed_ptr() on all attached tasks before
+		 * cpus_allowed may be changed.
+		 */
+		if (task->flags & PF_THREAD_BOUND)
+			return -EINVAL;
+		if ((ret = security_task_setscheduler(task)))
+			return ret;
+	}
 	return 0;
 }
 
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
-{
-	return security_task_setscheduler(task);
-}
-
 /*
  * Protected by cgroup_lock. The nodemasks must be stored globally because
  * dynamically allocating them is not allowed in pre_attach, and they must
- * persist among pre_attach, attach_task, and attach.
+ * persist among pre_attach, and attach.
  */
 static cpumask_var_t cpus_attach;
 static nodemask_t cpuset_attach_nodemask_from;
@@ -1420,39 +1421,34 @@ static void cpuset_pre_attach(struct cgroup *cont)
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 }
 
-/* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
-{
-	int err;
-	struct cpuset *cs = cgroup_cs(cont);
-
-	/*
-	 * can_attach beforehand should guarantee that this doesn't fail.
-	 * TODO: have a better way to handle failure here
-	 */
-	err = set_cpus_allowed_ptr(tsk, cpus_attach);
-	WARN_ON_ONCE(err);
-
-	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
-	cpuset_update_task_spread_flag(cs, tsk);
-}
-
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct cgroup_taskset *tset)
 {
 	struct mm_struct *mm;
-	struct task_struct *tsk = cgroup_taskset_first(tset);
+	struct task_struct *task;
+	struct task_struct *leader = cgroup_taskset_first(tset);
 	struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
 	struct cpuset *cs = cgroup_cs(cgrp);
 	struct cpuset *oldcs = cgroup_cs(oldcgrp);
 
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		/*
+		 * can_attach beforehand should guarantee that this doesn't
+		 * fail.  TODO: have a better way to handle failure here
+		 */
+		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+
+		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
+		cpuset_update_task_spread_flag(cs, task);
+	}
+
 	/*
 	 * Change mm, possibly for multiple threads in a threadgroup. This is
 	 * expensive and may sleep.
 	 */
 	cpuset_attach_nodemask_from = oldcs->mems_allowed;
 	cpuset_attach_nodemask_to = cs->mems_allowed;
-	mm = get_task_mm(tsk);
+	mm = get_task_mm(leader);
 	if (mm) {
 		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
@@ -1908,9 +1904,7 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
-	.can_attach_task = cpuset_can_attach_task,
 	.pre_attach = cpuset_pre_attach,
-	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0e8457da6f9..3b8e0edbe69 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7044,10 +7044,13 @@ static int __perf_cgroup_move(void *info)
 	return 0;
 }
 
-static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static void perf_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			       struct cgroup_taskset *tset)
 {
-	task_function_call(task, __perf_cgroup_move, task);
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, cgrp, tset)
+		task_function_call(task, __perf_cgroup_move, task);
 }
 
 static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -7061,7 +7064,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (!(task->flags & PF_EXITING))
 		return;
 
-	perf_cgroup_attach_task(cgrp, task);
+	task_function_call(task, __perf_cgroup_move, task);
 }
 
 struct cgroup_subsys perf_subsys = {
@@ -7070,6 +7073,6 @@ struct cgroup_subsys perf_subsys = {
 	.create		= perf_cgroup_create,
 	.destroy	= perf_cgroup_destroy,
 	.exit		= perf_cgroup_exit,
-	.attach_task	= perf_cgroup_attach_task,
+	.attach		= perf_cgroup_attach,
 };
 #endif /* CONFIG_CGROUP_PERF */
diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a71be..161184da7b8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9127,24 +9127,31 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	sched_destroy_group(tg);
 }
 
-static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+				 struct cgroup_taskset *tset)
 {
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, cgrp, tset) {
 #ifdef CONFIG_RT_GROUP_SCHED
-	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
-		return -EINVAL;
+		if (!sched_rt_can_attach(cgroup_tg(cgrp), task))
+			return -EINVAL;
 #else
-	/* We don't support RT-tasks being in separate groups */
-	if (tsk->sched_class != &fair_sched_class)
-		return -EINVAL;
+		/* We don't support RT-tasks being in separate groups */
+		if (task->sched_class != &fair_sched_class)
+			return -EINVAL;
 #endif
+	}
 	return 0;
 }
 
-static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_taskset *tset)
 {
-	sched_move_task(tsk);
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, cgrp, tset)
+		sched_move_task(task);
 }
 
 static void
@@ -9480,8 +9487,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach_task = cpu_cgroup_can_attach_task,
-	.attach_task	= cpu_cgroup_attach_task,
+	.can_attach	= cpu_cgroup_can_attach,
+	.attach		= cpu_cgroup_attach,
 	.exit		= cpu_cgroup_exit,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
-- 
cgit v1.2.3-70-g09d2


From 1ba64edef6051d2ec79bb2fbd3a0c8f0df00ab55 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:37 +0100
Subject: block, sx8: kill blk_insert_request()

The only user left for blk_insert_request() is sx8 and it can be
trivially switched to use blk_execute_rq_nowait() - special requests
aren't included in io stat and sx8 doesn't use block layer tagging.
Switch sx8 and kill blk_insert_requeset().

This patch doesn't introduce any functional difference.

Only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c       | 48 ------------------------------------------------
 drivers/block/sx8.c    | 12 ++++++++----
 include/linux/blkdev.h |  1 -
 3 files changed, 8 insertions(+), 53 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index ea70e6c80cd..435af237861 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1010,54 +1010,6 @@ static void add_acct_request(struct request_queue *q, struct request *rq,
 	__elv_add_request(q, rq, where);
 }
 
-/**
- * blk_insert_request - insert a special request into a request queue
- * @q:		request queue where request should be inserted
- * @rq:		request to be inserted
- * @at_head:	insert request at head or tail of queue
- * @data:	private data
- *
- * Description:
- *    Many block devices need to execute commands asynchronously, so they don't
- *    block the whole kernel from preemption during request execution.  This is
- *    accomplished normally by inserting aritficial requests tagged as
- *    REQ_TYPE_SPECIAL in to the corresponding request queue, and letting them
- *    be scheduled for actual execution by the request queue.
- *
- *    We have the option of inserting the head or the tail of the queue.
- *    Typically we use the tail for new ioctls and so forth.  We use the head
- *    of the queue for things like a QUEUE_FULL message from a device, or a
- *    host that is unable to accept a particular command.
- */
-void blk_insert_request(struct request_queue *q, struct request *rq,
-			int at_head, void *data)
-{
-	int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
-	unsigned long flags;
-
-	/*
-	 * tell I/O scheduler that this isn't a regular read/write (ie it
-	 * must not attempt merges on this) and that it acts as a soft
-	 * barrier
-	 */
-	rq->cmd_type = REQ_TYPE_SPECIAL;
-
-	rq->special = data;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-
-	/*
-	 * If command is tagged, release the tag
-	 */
-	if (blk_rq_tagged(rq))
-		blk_queue_end_tag(q, rq);
-
-	add_acct_request(q, rq, where);
-	__blk_run_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-}
-EXPORT_SYMBOL(blk_insert_request);
-
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
diff --git a/drivers/block/sx8.c b/drivers/block/sx8.c
index b70f0fca9a4..e7472f567c9 100644
--- a/drivers/block/sx8.c
+++ b/drivers/block/sx8.c
@@ -619,8 +619,10 @@ static int carm_array_info (struct carm_host *host, unsigned int array_idx)
 	       host->state == HST_DEV_SCAN);
 	spin_unlock_irq(&host->lock);
 
-	DPRINTK("blk_insert_request, tag == %u\n", idx);
-	blk_insert_request(host->oob_q, crq->rq, 1, crq);
+	DPRINTK("blk_execute_rq_nowait, tag == %u\n", idx);
+	crq->rq->cmd_type = REQ_TYPE_SPECIAL;
+	crq->rq->special = crq;
+	blk_execute_rq_nowait(host->oob_q, NULL, crq->rq, true, NULL);
 
 	return 0;
 
@@ -658,8 +660,10 @@ static int carm_send_special (struct carm_host *host, carm_sspc_t func)
 	BUG_ON(rc < 0);
 	crq->msg_bucket = (u32) rc;
 
-	DPRINTK("blk_insert_request, tag == %u\n", idx);
-	blk_insert_request(host->oob_q, crq->rq, 1, crq);
+	DPRINTK("blk_execute_rq_nowait, tag == %u\n", idx);
+	crq->rq->cmd_type = REQ_TYPE_SPECIAL;
+	crq->rq->special = crq;
+	blk_execute_rq_nowait(host->oob_q, NULL, crq->rq, true, NULL);
 
 	return 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c7a6d3b5bc7..8a6b51b13a1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -660,7 +660,6 @@ extern void __blk_put_request(struct request_queue *, struct request *);
 extern struct request *blk_get_request(struct request_queue *, int, gfp_t);
 extern struct request *blk_make_request(struct request_queue *, struct bio *,
 					gfp_t);
-extern void blk_insert_request(struct request_queue *, struct request *, int, void *);
 extern void blk_requeue_request(struct request_queue *, struct request *);
 extern void blk_add_request_payload(struct request *rq, struct page *page,
 		unsigned int len);
-- 
cgit v1.2.3-70-g09d2


From 34f6055c80285e4efb3f602a9119db75239744dc Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:37 +0100
Subject: block: add blk_queue_dead()

There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
macro and use it.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c       | 6 +++---
 block/blk-exec.c       | 2 +-
 block/blk-sysfs.c      | 4 ++--
 block/blk-throttle.c   | 4 ++--
 block/blk.h            | 2 +-
 include/linux/blkdev.h | 1 +
 6 files changed, 10 insertions(+), 9 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 435af237861..b5ed4f4a8d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -608,7 +608,7 @@ EXPORT_SYMBOL(blk_init_allocated_queue_node);
 
 int blk_get_queue(struct request_queue *q)
 {
-	if (likely(!test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+	if (likely(!blk_queue_dead(q))) {
 		kobject_get(&q->kobj);
 		return 0;
 	}
@@ -755,7 +755,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue;
 
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+	if (unlikely(blk_queue_dead(q)))
 		return NULL;
 
 	may_queue = elv_may_queue(q, rw_flags);
@@ -875,7 +875,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		struct io_context *ioc;
 		struct request_list *rl = &q->rq;
 
-		if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+		if (unlikely(blk_queue_dead(q)))
 			return NULL;
 
 		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
diff --git a/block/blk-exec.c b/block/blk-exec.c
index a1ebceb332f..60532852b3a 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -50,7 +50,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 {
 	int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
 
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+	if (unlikely(blk_queue_dead(q))) {
 		rq->errors = -ENXIO;
 		if (rq->end_io)
 			rq->end_io(rq, rq->errors);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e7f9f657f10..f0b2ca8f66d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -425,7 +425,7 @@ queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 	if (!entry->show)
 		return -EIO;
 	mutex_lock(&q->sysfs_lock);
-	if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)) {
+	if (blk_queue_dead(q)) {
 		mutex_unlock(&q->sysfs_lock);
 		return -ENOENT;
 	}
@@ -447,7 +447,7 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 
 	q = container_of(kobj, struct request_queue, kobj);
 	mutex_lock(&q->sysfs_lock);
-	if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)) {
+	if (blk_queue_dead(q)) {
 		mutex_unlock(&q->sysfs_lock);
 		return -ENOENT;
 	}
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4553245d931..5eed6a76721 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -310,7 +310,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	struct request_queue *q = td->queue;
 
 	/* no throttling for dead queue */
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+	if (unlikely(blk_queue_dead(q)))
 		return NULL;
 
 	rcu_read_lock();
@@ -335,7 +335,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	spin_lock_irq(q->queue_lock);
 
 	/* Make sure @q is still alive */
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+	if (unlikely(blk_queue_dead(q))) {
 		kfree(tg);
 		return NULL;
 	}
diff --git a/block/blk.h b/block/blk.h
index 3f6551b3c92..e38691dbb32 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -85,7 +85,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 			q->flush_queue_delayed = 1;
 			return NULL;
 		}
-		if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+		if (unlikely(blk_queue_dead(q)) ||
 		    !q->elevator->ops->elevator_dispatch_fn(q, 0))
 			return NULL;
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a6b51b13a1..783f97c14d0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -481,6 +481,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 
 #define blk_queue_tagged(q)	test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
+#define blk_queue_dead(q)	test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
 #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
 #define blk_queue_noxmerges(q)	\
 	test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
-- 
cgit v1.2.3-70-g09d2


From 481a7d64790cd7ca61a8bbcbd9d017ce58e6fe39 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:37 +0100
Subject: block: fix drain_all condition in blk_drain_queue()

When trying to drain all requests, blk_drain_queue() checked only
q->rq.count[]; however, this only tracks REQ_ALLOCED requests.  This
patch updates blk_drain_queue() such that it looks at all the counters
and queues so that request_queue is actually empty on completion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index b5ed4f4a8d9..c37e9e7c9d0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -358,7 +358,8 @@ EXPORT_SYMBOL(blk_put_queue);
 void blk_drain_queue(struct request_queue *q, bool drain_all)
 {
 	while (true) {
-		int nr_rqs;
+		bool drain = false;
+		int i;
 
 		spin_lock_irq(q->queue_lock);
 
@@ -368,14 +369,25 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 
 		__blk_run_queue(q);
 
-		if (drain_all)
-			nr_rqs = q->rq.count[0] + q->rq.count[1];
-		else
-			nr_rqs = q->rq.elvpriv;
+		drain |= q->rq.elvpriv;
+
+		/*
+		 * Unfortunately, requests are queued at and tracked from
+		 * multiple places and there's no single counter which can
+		 * be drained.  Check all the queues and counters.
+		 */
+		if (drain_all) {
+			drain |= !list_empty(&q->queue_head);
+			for (i = 0; i < 2; i++) {
+				drain |= q->rq.count[i];
+				drain |= q->in_flight[i];
+				drain |= !list_empty(&q->flush_queue[i]);
+			}
+		}
 
 		spin_unlock_irq(q->queue_lock);
 
-		if (!nr_rqs)
+		if (!drain)
 			break;
 		msleep(10);
 	}
-- 
cgit v1.2.3-70-g09d2


From 8ba61435d73f2274e12d4d823fde06735e8f6a54 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:37 +0100
Subject: block: add missing blk_queue_dead() checks

blk_insert_cloned_request(), blk_execute_rq_nowait() and
blk_flush_plug_list() either didn't check whether the queue was dead
or did it without holding queue_lock.  Update them so that dead state
is checked while holding queue_lock.

AFAICS, this plugs all holes (requeue doesn't matter as the request is
transitioning atomically from in_flight to queued).

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 21 +++++++++++++++++++++
 block/blk-exec.c |  6 ++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index c37e9e7c9d0..30add45a87e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1731,6 +1731,10 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 		return -EIO;
 
 	spin_lock_irqsave(q->queue_lock, flags);
+	if (unlikely(blk_queue_dead(q))) {
+		spin_unlock_irqrestore(q->queue_lock, flags);
+		return -ENODEV;
+	}
 
 	/*
 	 * Submitting request must be dequeued before calling this function
@@ -2704,6 +2708,14 @@ static void queue_unplugged(struct request_queue *q, unsigned int depth,
 {
 	trace_block_unplug(q, depth, !from_schedule);
 
+	/*
+	 * Don't mess with dead queue.
+	 */
+	if (unlikely(blk_queue_dead(q))) {
+		spin_unlock(q->queue_lock);
+		return;
+	}
+
 	/*
 	 * If we are punting this to kblockd, then we can safely drop
 	 * the queue_lock before waking kblockd (which needs to take
@@ -2780,6 +2792,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 			depth = 0;
 			spin_lock(q->queue_lock);
 		}
+
+		/*
+		 * Short-circuit if @q is dead
+		 */
+		if (unlikely(blk_queue_dead(q))) {
+			__blk_end_request_all(rq, -ENODEV);
+			continue;
+		}
+
 		/*
 		 * rq is already accounted, so use raw insert
 		 */
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 60532852b3a..fb2cbd55162 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -50,7 +50,11 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 {
 	int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
 
+	WARN_ON(irqs_disabled());
+	spin_lock_irq(q->queue_lock);
+
 	if (unlikely(blk_queue_dead(q))) {
+		spin_unlock_irq(q->queue_lock);
 		rq->errors = -ENXIO;
 		if (rq->end_io)
 			rq->end_io(rq, rq->errors);
@@ -59,8 +63,6 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 
 	rq->rq_disk = bd_disk;
 	rq->end_io = done;
-	WARN_ON(irqs_disabled());
-	spin_lock_irq(q->queue_lock);
 	__elv_add_request(q, rq, where);
 	__blk_run_queue(q);
 	/* the queue is stopped so it won't be run */
-- 
cgit v1.2.3-70-g09d2


From a73f730d013ff2788389fd0c46ad3e5510f124e6 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:37 +0100
Subject: block, cfq: move cfqd->cic_index to q->id

cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context.  Move it to q->id and allocate on queue init and
free on queue release.  This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c       | 24 +++++++++++++++--------
 block/blk-sysfs.c      |  2 ++
 block/blk.h            |  3 +++
 block/cfq-iosched.c    | 52 +++++---------------------------------------------
 include/linux/blkdev.h |  6 ++++++
 5 files changed, 32 insertions(+), 55 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 30add45a87e..af730158117 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
 
+DEFINE_IDA(blk_queue_ida);
+
 /*
  * For the allocated request tables
  */
@@ -474,6 +476,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (!q)
 		return NULL;
 
+	q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
+	if (q->id < 0)
+		goto fail_q;
+
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
@@ -481,15 +487,11 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->backing_dev_info.name = "block";
 
 	err = bdi_init(&q->backing_dev_info);
-	if (err) {
-		kmem_cache_free(blk_requestq_cachep, q);
-		return NULL;
-	}
+	if (err)
+		goto fail_id;
 
-	if (blk_throtl_init(q)) {
-		kmem_cache_free(blk_requestq_cachep, q);
-		return NULL;
-	}
+	if (blk_throtl_init(q))
+		goto fail_id;
 
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
@@ -512,6 +514,12 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->queue_lock = &q->__queue_lock;
 
 	return q;
+
+fail_id:
+	ida_simple_remove(&blk_queue_ida, q->id);
+fail_q:
+	kmem_cache_free(blk_requestq_cachep, q);
+	return NULL;
 }
 EXPORT_SYMBOL(blk_alloc_queue_node);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f0b2ca8f66d..5b4b4ab5e78 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -494,6 +494,8 @@ static void blk_release_queue(struct kobject *kobj)
 	blk_trace_shutdown(q);
 
 	bdi_destroy(&q->backing_dev_info);
+
+	ida_simple_remove(&blk_queue_ida, q->id);
 	kmem_cache_free(blk_requestq_cachep, q);
 }
 
diff --git a/block/blk.h b/block/blk.h
index e38691dbb32..aae4d88fc52 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -1,6 +1,8 @@
 #ifndef BLK_INTERNAL_H
 #define BLK_INTERNAL_H
 
+#include <linux/idr.h>
+
 /* Amount of time in which a process may batch requests */
 #define BLK_BATCH_TIME	(HZ/50UL)
 
@@ -9,6 +11,7 @@
 
 extern struct kmem_cache *blk_requestq_cachep;
 extern struct kobj_type blk_queue_ktype;
+extern struct ida blk_queue_ida;
 
 void init_request_from_bio(struct request *req, struct bio *bio);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 16ace89613b..ec3f5e8ba56 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -65,9 +65,6 @@ static DEFINE_PER_CPU(unsigned long, cfq_ioc_count);
 static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
-static DEFINE_SPINLOCK(cic_index_lock);
-static DEFINE_IDA(cic_index_ida);
-
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
 #define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
 #define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
@@ -290,7 +287,6 @@ struct cfq_data {
 	unsigned int cfq_group_idle;
 	unsigned int cfq_latency;
 
-	unsigned int cic_index;
 	struct list_head cic_list;
 
 	/*
@@ -484,7 +480,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
 
 static inline void *cfqd_dead_key(struct cfq_data *cfqd)
 {
-	return (void *)(cfqd->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
+	return (void *)(cfqd->queue->id << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
 }
 
 static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
@@ -3105,7 +3101,7 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
 	BUG_ON(rcu_dereference_check(ioc->ioc_data,
 		lockdep_is_held(&ioc->lock)) == cic);
 
-	radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
+	radix_tree_delete(&ioc->radix_root, cfqd->queue->id);
 	hlist_del_rcu(&cic->cic_list);
 	spin_unlock_irqrestore(&ioc->lock, flags);
 
@@ -3133,7 +3129,7 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 	}
 
 	do {
-		cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index);
+		cic = radix_tree_lookup(&ioc->radix_root, cfqd->queue->id);
 		rcu_read_unlock();
 		if (!cic)
 			break;
@@ -3169,8 +3165,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		cic->key = cfqd;
 
 		spin_lock_irqsave(&ioc->lock, flags);
-		ret = radix_tree_insert(&ioc->radix_root,
-						cfqd->cic_index, cic);
+		ret = radix_tree_insert(&ioc->radix_root, cfqd->queue->id, cic);
 		if (!ret)
 			hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
 		spin_unlock_irqrestore(&ioc->lock, flags);
@@ -3944,10 +3939,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
 
 	cfq_shutdown_timer_wq(cfqd);
 
-	spin_lock(&cic_index_lock);
-	ida_remove(&cic_index_ida, cfqd->cic_index);
-	spin_unlock(&cic_index_lock);
-
 	/*
 	 * Wait for cfqg->blkg->key accessors to exit their grace periods.
 	 * Do this wait only if there are other unlinked groups out
@@ -3969,24 +3960,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	kfree(cfqd);
 }
 
-static int cfq_alloc_cic_index(void)
-{
-	int index, error;
-
-	do {
-		if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
-			return -ENOMEM;
-
-		spin_lock(&cic_index_lock);
-		error = ida_get_new(&cic_index_ida, &index);
-		spin_unlock(&cic_index_lock);
-		if (error && error != -EAGAIN)
-			return error;
-	} while (error);
-
-	return index;
-}
-
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
@@ -3994,23 +3967,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	struct cfq_group *cfqg;
 	struct cfq_rb_root *st;
 
-	i = cfq_alloc_cic_index();
-	if (i < 0)
-		return NULL;
-
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (!cfqd) {
-		spin_lock(&cic_index_lock);
-		ida_remove(&cic_index_ida, i);
-		spin_unlock(&cic_index_lock);
+	if (!cfqd)
 		return NULL;
-	}
-
-	/*
-	 * Don't need take queue_lock in the routine, since we are
-	 * initializing the ioscheduler, and nobody is using cfqd
-	 */
-	cfqd->cic_index = i;
 
 	/* Init root service tree */
 	cfqd->grp_service_tree = CFQ_RB_ROOT;
@@ -4294,7 +4253,6 @@ static void __exit cfq_exit(void)
 	 */
 	if (elv_ioc_count_read(cfq_ioc_count))
 		wait_for_completion(&all_gone);
-	ida_destroy(&cic_index_ida);
 	cfq_slab_kill();
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 783f97c14d0..8c8dbc4738e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -310,6 +310,12 @@ struct request_queue {
 	 */
 	unsigned long		queue_flags;
 
+	/*
+	 * ida allocated id for this queue.  Used to index queues from
+	 * ioctx.
+	 */
+	int			id;
+
 	/*
 	 * queue needs bounce pages for pages above this limit
 	 */
-- 
cgit v1.2.3-70-g09d2


From 42ec57a8f68311bbbf4ff96a5d33c8a2e90b9d05 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:37 +0100
Subject: block: misc ioc cleanups

* int return from put_io_context() wasn't used by anybody.  Make it
  return void like other put functions and docbook-fy the function
  comment.

* Reorder dummy declarations for !CONFIG_BLOCK case a bit.

* Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of
  if block and drop 0'ing.

* Docbook-fy current_io_context() comment.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c           | 72 +++++++++++++++++++++++------------------------
 include/linux/iocontext.h | 12 ++------
 2 files changed, 39 insertions(+), 45 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 6f9bbd97865..8bebf06bac7 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -27,26 +27,28 @@ static void cfq_dtor(struct io_context *ioc)
 	}
 }
 
-/*
- * IO Context helper functions. put_io_context() returns 1 if there are no
- * more users of this io context, 0 otherwise.
+/**
+ * put_io_context - put a reference of io_context
+ * @ioc: io_context to put
+ *
+ * Decrement reference count of @ioc and release it if the count reaches
+ * zero.
  */
-int put_io_context(struct io_context *ioc)
+void put_io_context(struct io_context *ioc)
 {
 	if (ioc == NULL)
-		return 1;
+		return;
 
-	BUG_ON(atomic_long_read(&ioc->refcount) == 0);
+	BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
 
-	if (atomic_long_dec_and_test(&ioc->refcount)) {
-		rcu_read_lock();
-		cfq_dtor(ioc);
-		rcu_read_unlock();
+	if (!atomic_long_dec_and_test(&ioc->refcount))
+		return;
 
-		kmem_cache_free(iocontext_cachep, ioc);
-		return 1;
-	}
-	return 0;
+	rcu_read_lock();
+	cfq_dtor(ioc);
+	rcu_read_unlock();
+
+	kmem_cache_free(iocontext_cachep, ioc);
 }
 EXPORT_SYMBOL(put_io_context);
 
@@ -84,33 +86,31 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ioc;
 
-	ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ioc) {
-		atomic_long_set(&ioc->refcount, 1);
-		atomic_set(&ioc->nr_tasks, 1);
-		spin_lock_init(&ioc->lock);
-		ioc->ioprio_changed = 0;
-		ioc->ioprio = 0;
-		ioc->last_waited = 0; /* doesn't matter... */
-		ioc->nr_batch_requests = 0; /* because this is 0 */
-		INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ioc->cic_list);
-		ioc->ioc_data = NULL;
-#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
-		ioc->cgroup_changed = 0;
-#endif
-	}
+	ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags | __GFP_ZERO,
+				    node);
+	if (unlikely(!ioc))
+		return NULL;
+
+	/* initialize */
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
 
 	return ioc;
 }
 
-/*
- * If the current task has no IO context then create one and initialise it.
- * Otherwise, return its existing IO context.
+/**
+ * current_io_context - get io_context of %current
+ * @gfp_flags: allocation flags, used if allocation is necessary
+ * @node: allocation node, used if allocation is necessary
  *
- * This returned IO context doesn't have a specifically elevated refcount,
- * but since the current task itself holds a reference, the context can be
- * used in general code, so long as it stays within `current` context.
+ * Return io_context of %current.  If it doesn't exist, it is created with
+ * @gfp_flags and @node.  The returned io_context does NOT have its
+ * reference count incremented.  Because io_context is exited only on task
+ * exit, %current can be sure that the returned io_context is valid and
+ * alive as long as it is executing.
  */
 struct io_context *current_io_context(gfp_t gfp_flags, int node)
 {
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 5037a0ad231..8a6ecb66346 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -76,20 +76,14 @@ static inline struct io_context *ioc_task_link(struct io_context *ioc)
 
 struct task_struct;
 #ifdef CONFIG_BLOCK
-int put_io_context(struct io_context *ioc);
+void put_io_context(struct io_context *ioc);
 void exit_io_context(struct task_struct *task);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
 #else
-static inline void exit_io_context(struct task_struct *task)
-{
-}
-
 struct io_context;
-static inline int put_io_context(struct io_context *ioc)
-{
-	return 1;
-}
+static inline void put_io_context(struct io_context *ioc) { }
+static inline void exit_io_context(struct task_struct *task) { }
 #endif
 
 #endif
-- 
cgit v1.2.3-70-g09d2


From 6e736be7f282fff705db7c34a15313281b372a76 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:38 +0100
Subject: block: make ioc get/put interface more conventional and fix race on
 alloction

Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio().  The former is
always called from local task while the latter can be called from
different task.  The synchornization between them are peculiar and
dubious.

* current_io_context() doesn't grab task_lock() and assumes that if it
  saw %NULL ->io_context, it would stay that way until allocation and
  assignment is complete.  It has smp_wmb() between alloc/init and
  assignment.

* set_task_ioprio() grabs task_lock() for assignment and does
  smp_read_barrier_depends() between "ioc = task->io_context" and "if
  (ioc)".  Unfortunately, this doesn't achieve anything - the latter
  is not a dependent load of the former.  ie, if ioc itself were being
  dereferenced "ioc->xxx", it would mean something (not sure what tho)
  but as the code currently stands, the dependent read barrier is
  noop.

As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.

Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.

ioc get/put doesn't have any reason to be complex.  The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive.  All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.

This patch updates ioc get/put so that it becomes more conventional.

* alloc_io_context() is replaced with get_task_io_context().  This is
  the only interface which can acquire access to ioc of another task.
  On return, the caller has an explicit reference to the object which
  should be put using put_io_context() afterwards.

* The functionality of current_io_context() remains the same but when
  creating a new ioc, it shares the code path with
  get_task_io_context() and always goes through task_lock().

* get_io_context() now means incrementing ref on an ioc which the
  caller already has access to (be that an explicit refcnt or implicit
  %current one).

* PF_EXITING inhibits creation of new io_context and once
  exit_io_context() is finished, it's guaranteed that both ioc
  acquisition functions return %NULL.

* All users are updated.  Most are trivial but
  smp_read_barrier_depends() removal from cfq_get_io_context() needs a
  bit of explanation.  I suppose the original intention was to ensure
  ioc->ioprio is visible when set_task_ioprio() allocates new
  io_context and installs it; however, this wouldn't have worked
  because set_task_ioprio() doesn't have wmb between init and install.
  There are other problems with this which will be fixed in another
  patch.

* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
  specification.

-v2: Vivek spotted contamination from debug patch.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c        |  9 +++--
 block/blk-ioc.c           | 99 +++++++++++++++++++++++++++++++----------------
 block/blk.h               |  1 +
 block/cfq-iosched.c       | 18 ++++-----
 fs/ioprio.c               | 21 ++--------
 include/linux/iocontext.h |  4 +-
 kernel/fork.c             |  8 ++--
 7 files changed, 91 insertions(+), 69 deletions(-)

(limited to 'block')

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8f630cec906..4b001dcd85b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1645,11 +1645,12 @@ static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
-	task_lock(tsk);
-	ioc = tsk->io_context;
-	if (ioc)
+	/* we don't lose anything even if ioc allocation fails */
+	ioc = get_task_io_context(tsk, GFP_ATOMIC, NUMA_NO_NODE);
+	if (ioc) {
 		ioc->cgroup_changed = 1;
-	task_unlock(tsk);
+		put_io_context(ioc);
+	}
 }
 
 void blkio_policy_register(struct blkio_policy_type *blkiop)
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 8bebf06bac7..b13ed96776c 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -16,6 +16,19 @@
  */
 static struct kmem_cache *iocontext_cachep;
 
+/**
+ * get_io_context - increment reference count to io_context
+ * @ioc: io_context to get
+ *
+ * Increment reference count to @ioc.
+ */
+void get_io_context(struct io_context *ioc)
+{
+	BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
+	atomic_long_inc(&ioc->refcount);
+}
+EXPORT_SYMBOL(get_io_context);
+
 static void cfq_dtor(struct io_context *ioc)
 {
 	if (!hlist_empty(&ioc->cic_list)) {
@@ -71,6 +84,9 @@ void exit_io_context(struct task_struct *task)
 {
 	struct io_context *ioc;
 
+	/* PF_EXITING prevents new io_context from being attached to @task */
+	WARN_ON_ONCE(!(current->flags & PF_EXITING));
+
 	task_lock(task);
 	ioc = task->io_context;
 	task->io_context = NULL;
@@ -82,7 +98,9 @@ void exit_io_context(struct task_struct *task)
 	put_io_context(ioc);
 }
 
-struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
+static struct io_context *create_task_io_context(struct task_struct *task,
+						 gfp_t gfp_flags, int node,
+						 bool take_ref)
 {
 	struct io_context *ioc;
 
@@ -98,6 +116,20 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
 	INIT_HLIST_HEAD(&ioc->cic_list);
 
+	/* try to install, somebody might already have beaten us to it */
+	task_lock(task);
+
+	if (!task->io_context && !(task->flags & PF_EXITING)) {
+		task->io_context = ioc;
+	} else {
+		kmem_cache_free(iocontext_cachep, ioc);
+		ioc = task->io_context;
+	}
+
+	if (ioc && take_ref)
+		get_io_context(ioc);
+
+	task_unlock(task);
 	return ioc;
 }
 
@@ -114,46 +146,47 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
  */
 struct io_context *current_io_context(gfp_t gfp_flags, int node)
 {
-	struct task_struct *tsk = current;
-	struct io_context *ret;
-
-	ret = tsk->io_context;
-	if (likely(ret))
-		return ret;
-
-	ret = alloc_io_context(gfp_flags, node);
-	if (ret) {
-		/* make sure set_task_ioprio() sees the settings above */
-		smp_wmb();
-		tsk->io_context = ret;
-	}
+	might_sleep_if(gfp_flags & __GFP_WAIT);
 
-	return ret;
+	if (current->io_context)
+		return current->io_context;
+
+	return create_task_io_context(current, gfp_flags, node, false);
 }
+EXPORT_SYMBOL(current_io_context);
 
-/*
- * If the current task has no IO context then create one and initialise it.
- * If it does have a context, take a ref on it.
+/**
+ * get_task_io_context - get io_context of a task
+ * @task: task of interest
+ * @gfp_flags: allocation flags, used if allocation is necessary
+ * @node: allocation node, used if allocation is necessary
+ *
+ * Return io_context of @task.  If it doesn't exist, it is created with
+ * @gfp_flags and @node.  The returned io_context has its reference count
+ * incremented.
  *
- * This is always called in the context of the task which submitted the I/O.
+ * This function always goes through task_lock() and it's better to use
+ * current_io_context() + get_io_context() for %current.
  */
-struct io_context *get_io_context(gfp_t gfp_flags, int node)
+struct io_context *get_task_io_context(struct task_struct *task,
+				       gfp_t gfp_flags, int node)
 {
-	struct io_context *ioc = NULL;
-
-	/*
-	 * Check for unlikely race with exiting task. ioc ref count is
-	 * zero when ioc is being detached.
-	 */
-	do {
-		ioc = current_io_context(gfp_flags, node);
-		if (unlikely(!ioc))
-			break;
-	} while (!atomic_long_inc_not_zero(&ioc->refcount));
+	struct io_context *ioc;
 
-	return ioc;
+	might_sleep_if(gfp_flags & __GFP_WAIT);
+
+	task_lock(task);
+	ioc = task->io_context;
+	if (likely(ioc)) {
+		get_io_context(ioc);
+		task_unlock(task);
+		return ioc;
+	}
+	task_unlock(task);
+
+	return create_task_io_context(task, gfp_flags, node, true);
 }
-EXPORT_SYMBOL(get_io_context);
+EXPORT_SYMBOL(get_task_io_context);
 
 static int __init blk_ioc_init(void)
 {
diff --git a/block/blk.h b/block/blk.h
index aae4d88fc52..fc3c41b2fd2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -122,6 +122,7 @@ static inline int blk_should_fake_timeout(struct request_queue *q)
 }
 #endif
 
+void get_io_context(struct io_context *ioc);
 struct io_context *current_io_context(gfp_t gfp_flags, int node);
 
 int ll_back_merge_fn(struct request_queue *q, struct request *req,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ec3f5e8ba56..d42d89ccce1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,6 +14,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "blk.h"
 #include "cfq.h"
 
 /*
@@ -3194,13 +3195,13 @@ static struct cfq_io_context *
 cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
-	struct cfq_io_context *cic;
+	struct cfq_io_context *cic = NULL;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = current_io_context(gfp_mask, cfqd->queue->node);
 	if (!ioc)
-		return NULL;
+		goto err;
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	if (cic)
@@ -3211,10 +3212,10 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 		goto err;
 
 	if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
-		goto err_free;
-
+		goto err;
 out:
-	smp_read_barrier_depends();
+	get_io_context(ioc);
+
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
 
@@ -3223,10 +3224,9 @@ out:
 		cfq_ioc_set_cgroup(ioc);
 #endif
 	return cic;
-err_free:
-	cfq_cic_free(cic);
 err:
-	put_io_context(ioc);
+	if (cic)
+		cfq_cic_free(cic);
 	return NULL;
 }
 
diff --git a/fs/ioprio.c b/fs/ioprio.c
index f79dab83e17..998ec239d1e 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -48,28 +48,13 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 	if (err)
 		return err;
 
-	task_lock(task);
-	do {
-		ioc = task->io_context;
-		/* see wmb() in current_io_context() */
-		smp_read_barrier_depends();
-		if (ioc)
-			break;
-
-		ioc = alloc_io_context(GFP_ATOMIC, -1);
-		if (!ioc) {
-			err = -ENOMEM;
-			break;
-		}
-		task->io_context = ioc;
-	} while (1);
-
-	if (!err) {
+	ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
+	if (ioc) {
 		ioc->ioprio = ioprio;
 		ioc->ioprio_changed = 1;
+		put_io_context(ioc);
 	}
 
-	task_unlock(task);
 	return err;
 }
 EXPORT_SYMBOL_GPL(set_task_ioprio);
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 8a6ecb66346..28bb621ef5a 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -78,8 +78,8 @@ struct task_struct;
 #ifdef CONFIG_BLOCK
 void put_io_context(struct io_context *ioc);
 void exit_io_context(struct task_struct *task);
-struct io_context *get_io_context(gfp_t gfp_flags, int node);
-struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+struct io_context *get_task_io_context(struct task_struct *task,
+				       gfp_t gfp_flags, int node);
 #else
 struct io_context;
 static inline void put_io_context(struct io_context *ioc) { }
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a10d08..5bcfc739bb7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -870,6 +870,7 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 {
 #ifdef CONFIG_BLOCK
 	struct io_context *ioc = current->io_context;
+	struct io_context *new_ioc;
 
 	if (!ioc)
 		return 0;
@@ -881,11 +882,12 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 		if (unlikely(!tsk->io_context))
 			return -ENOMEM;
 	} else if (ioprio_valid(ioc->ioprio)) {
-		tsk->io_context = alloc_io_context(GFP_KERNEL, -1);
-		if (unlikely(!tsk->io_context))
+		new_ioc = get_task_io_context(tsk, GFP_KERNEL, NUMA_NO_NODE);
+		if (unlikely(!new_ioc))
 			return -ENOMEM;
 
-		tsk->io_context->ioprio = ioc->ioprio;
+		new_ioc->ioprio = ioc->ioprio;
+		put_io_context(new_ioc);
 	}
 #endif
 	return 0;
-- 
cgit v1.2.3-70-g09d2


From 09ac46c429464c919d04bb737b27edd84d944f02 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:38 +0100
Subject: block: misc updates to blk_get_queue()

* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
  failure instead of 0 / -errno or boolean.  Update it such that it
  returns %true on success and %false on failure.

* Make sure the caller checks for the return value.

* Separate out __blk_get_queue() which doesn't check whether @q is
  dead and put it in blk.h.  This will be used later.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c         | 8 ++++----
 block/blk.h              | 5 +++++
 block/bsg.c              | 4 +---
 block/genhd.c            | 2 +-
 drivers/scsi/scsi_scan.c | 2 +-
 include/linux/blkdev.h   | 2 +-
 6 files changed, 13 insertions(+), 10 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index af730158117..fd4749391e1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -626,14 +626,14 @@ blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
 }
 EXPORT_SYMBOL(blk_init_allocated_queue_node);
 
-int blk_get_queue(struct request_queue *q)
+bool blk_get_queue(struct request_queue *q)
 {
 	if (likely(!blk_queue_dead(q))) {
-		kobject_get(&q->kobj);
-		return 0;
+		__blk_get_queue(q);
+		return true;
 	}
 
-	return 1;
+	return false;
 }
 EXPORT_SYMBOL(blk_get_queue);
 
diff --git a/block/blk.h b/block/blk.h
index fc3c41b2fd2..8d421156fef 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -13,6 +13,11 @@ extern struct kmem_cache *blk_requestq_cachep;
 extern struct kobj_type blk_queue_ktype;
 extern struct ida blk_queue_ida;
 
+static inline void __blk_get_queue(struct request_queue *q)
+{
+	kobject_get(&q->kobj);
+}
+
 void init_request_from_bio(struct request *req, struct bio *bio);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
diff --git a/block/bsg.c b/block/bsg.c
index 702f1316bb8..167d586cece 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -769,12 +769,10 @@ static struct bsg_device *bsg_add_device(struct inode *inode,
 					 struct file *file)
 {
 	struct bsg_device *bd;
-	int ret;
 #ifdef BSG_DEBUG
 	unsigned char buf[32];
 #endif
-	ret = blk_get_queue(rq);
-	if (ret)
+	if (!blk_get_queue(rq))
 		return ERR_PTR(-ENXIO);
 
 	bd = bsg_alloc_device();
diff --git a/block/genhd.c b/block/genhd.c
index 02e9fca8082..c958169d24f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -615,7 +615,7 @@ void add_disk(struct gendisk *disk)
 	 * Take an extra ref on queue which will be put on disk_release()
 	 * so that it sticks around as long as @disk is there.
 	 */
-	WARN_ON_ONCE(blk_get_queue(disk->queue));
+	WARN_ON_ONCE(!blk_get_queue(disk->queue));
 
 	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
 				   "bdi");
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index b3c6d957fbd..89da43f73c0 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -297,7 +297,7 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 		kfree(sdev);
 		goto out;
 	}
-	blk_get_queue(sdev->request_queue);
+	WARN_ON_ONCE(!blk_get_queue(sdev->request_queue));
 	sdev->request_queue->queuedata = sdev;
 	scsi_adjust_queue_depth(sdev, 0, sdev->host->cmd_per_lun);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8c8dbc4738e..d1b6f4ed1f9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -865,7 +865,7 @@ extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatte
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 
-int blk_get_queue(struct request_queue *);
+bool __must_check blk_get_queue(struct request_queue *);
 struct request_queue *blk_alloc_queue(gfp_t);
 struct request_queue *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(struct request_queue *);
-- 
cgit v1.2.3-70-g09d2


From 283287a52e3c3f7f8f9da747f4b8c5202740d776 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:38 +0100
Subject: block, cfq: misc updates to cfq_io_context

Make the following changes to prepare for ioc/cic management cleanup.

* Add cic->q so that ioc can determine the associated queue without
  querying cfq.  This will eventually replace ->key.

* Factor out cfq_release_cic() from cic_free_func().  This function
  assumes that the caller handled locking.

* Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
  take only @cic.

* Restructure cfq_cic_link() for future updates.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c       | 58 ++++++++++++++++++++++++++---------------------
 include/linux/iocontext.h |  1 +
 2 files changed, 33 insertions(+), 26 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d42d89ccce1..a612ca65f37 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2709,21 +2709,26 @@ static void cfq_cic_free(struct cfq_io_context *cic)
 	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
 }
 
-static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
+static void cfq_release_cic(struct cfq_io_context *cic)
 {
-	unsigned long flags;
+	struct io_context *ioc = cic->ioc;
 	unsigned long dead_key = (unsigned long) cic->key;
 
 	BUG_ON(!(dead_key & CIC_DEAD_KEY));
-
-	spin_lock_irqsave(&ioc->lock, flags);
 	radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
 	hlist_del_rcu(&cic->cic_list);
-	spin_unlock_irqrestore(&ioc->lock, flags);
-
 	cfq_cic_free(cic);
 }
 
+static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ioc->lock, flags);
+	cfq_release_cic(cic);
+	spin_unlock_irqrestore(&ioc->lock, flags);
+}
+
 /*
  * Must be called with rcu_read_lock() held or preemption otherwise disabled.
  * Only two callers of this - ->dtor() which is called with the rcu_read_lock(),
@@ -2773,9 +2778,9 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_put_queue(cfqq);
 }
 
-static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
-					 struct cfq_io_context *cic)
+static void cfq_exit_cic(struct cfq_io_context *cic)
 {
+	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct io_context *ioc = cic->ioc;
 
 	list_del_init(&cic->queue_list);
@@ -2823,7 +2828,7 @@ static void cfq_exit_single_io_context(struct io_context *ioc,
 		 */
 		smp_read_barrier_depends();
 		if (cic->key == cfqd)
-			__cfq_exit_single_io_context(cfqd, cic);
+			cfq_exit_cic(cic);
 
 		spin_unlock_irqrestore(q->queue_lock, flags);
 	}
@@ -3161,28 +3166,29 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 	int ret;
 
 	ret = radix_tree_preload(gfp_mask);
-	if (!ret) {
-		cic->ioc = ioc;
-		cic->key = cfqd;
+	if (ret)
+		goto out;
 
-		spin_lock_irqsave(&ioc->lock, flags);
-		ret = radix_tree_insert(&ioc->radix_root, cfqd->queue->id, cic);
-		if (!ret)
-			hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
-		spin_unlock_irqrestore(&ioc->lock, flags);
+	cic->ioc = ioc;
+	cic->key = cfqd;
+	cic->q = cfqd->queue;
+
+	spin_lock_irqsave(&ioc->lock, flags);
+	ret = radix_tree_insert(&ioc->radix_root, cfqd->queue->id, cic);
+	if (!ret)
+		hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
+	spin_unlock_irqrestore(&ioc->lock, flags);
 
-		radix_tree_preload_end();
+	radix_tree_preload_end();
 
-		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-		}
+	if (!ret) {
+		spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+		list_add(&cic->queue_list, &cfqd->cic_list);
+		spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 	}
-
+out:
 	if (ret)
 		printk(KERN_ERR "cfq: cic link failed!\n");
-
 	return ret;
 }
 
@@ -3922,7 +3928,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 							struct cfq_io_context,
 							queue_list);
 
-		__cfq_exit_single_io_context(cfqd, cic);
+		cfq_exit_cic(cic);
 	}
 
 	cfq_put_async_queues(cfqd);
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 28bb621ef5a..079aea8fd8a 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -15,6 +15,7 @@ struct cfq_ttime {
 
 struct cfq_io_context {
 	void *key;
+	struct request_queue *q;
 
 	struct cfq_queue *cfqq[2];
 
-- 
cgit v1.2.3-70-g09d2


From dc86900e0a8f665122de6faadd27fb4c6d2b3e4d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:38 +0100
Subject: block, cfq: move ioc ioprio/cgroup changed handling to cic

ioprio/cgroup change was handled by marking the changed state in ioc
and, on the following access to the ioc, performing RCU-protected
iteration through all cic's grabbing the matching queue_lock.

This patch moves the changed state to each cic.  When ioprio or cgroup
changes, the respective bit is set on all cic's of the ioc and when
each of those cic (not ioc) is accessed, change is applied for that
specific ioc-queue pair.

This also fixes the following two race conditions between setting and
clearing of changed states.

* Missing barrier between assign/load of ioprio and ioprio_changed
  allowed applying old ioprio.

* Change requests could happen between application of change and
  clearing of changed variables.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c        |  2 +-
 block/blk-ioc.c           | 45 +++++++++++++++++++++++++++++++++++++++++++++
 block/cfq-iosched.c       | 28 +++++++++-------------------
 fs/ioprio.c               |  3 +--
 include/linux/iocontext.h | 14 +++++++++-----
 5 files changed, 65 insertions(+), 27 deletions(-)

(limited to 'block')

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4b001dcd85b..dc00835aab6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1648,7 +1648,7 @@ static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	/* we don't lose anything even if ioc allocation fails */
 	ioc = get_task_io_context(tsk, GFP_ATOMIC, NUMA_NO_NODE);
 	if (ioc) {
-		ioc->cgroup_changed = 1;
+		ioc_cgroup_changed(ioc);
 		put_io_context(ioc);
 	}
 }
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b13ed96776c..6f59fbad93d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -188,6 +188,51 @@ struct io_context *get_task_io_context(struct task_struct *task,
 }
 EXPORT_SYMBOL(get_task_io_context);
 
+void ioc_set_changed(struct io_context *ioc, int which)
+{
+	struct cfq_io_context *cic;
+	struct hlist_node *n;
+
+	hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
+		set_bit(which, &cic->changed);
+}
+
+/**
+ * ioc_ioprio_changed - notify ioprio change
+ * @ioc: io_context of interest
+ * @ioprio: new ioprio
+ *
+ * @ioc's ioprio has changed to @ioprio.  Set %CIC_IOPRIO_CHANGED for all
+ * cic's.  iosched is responsible for checking the bit and applying it on
+ * request issue path.
+ */
+void ioc_ioprio_changed(struct io_context *ioc, int ioprio)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ioc->lock, flags);
+	ioc->ioprio = ioprio;
+	ioc_set_changed(ioc, CIC_IOPRIO_CHANGED);
+	spin_unlock_irqrestore(&ioc->lock, flags);
+}
+
+/**
+ * ioc_cgroup_changed - notify cgroup change
+ * @ioc: io_context of interest
+ *
+ * @ioc's cgroup has changed.  Set %CIC_CGROUP_CHANGED for all cic's.
+ * iosched is responsible for checking the bit and applying it on request
+ * issue path.
+ */
+void ioc_cgroup_changed(struct io_context *ioc)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ioc->lock, flags);
+	ioc_set_changed(ioc, CIC_CGROUP_CHANGED);
+	spin_unlock_irqrestore(&ioc->lock, flags);
+}
+
 static int __init blk_ioc_init(void)
 {
 	iocontext_cachep = kmem_cache_create("blkdev_ioc",
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a612ca65f37..51aece2eea7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2904,7 +2904,7 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
-static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
+static void changed_ioprio(struct cfq_io_context *cic)
 {
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
@@ -2933,12 +2933,6 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 }
 
-static void cfq_ioc_set_ioprio(struct io_context *ioc)
-{
-	call_for_each_cic(ioc, changed_ioprio);
-	ioc->ioprio_changed = 0;
-}
-
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			  pid_t pid, bool is_sync)
 {
@@ -2960,7 +2954,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+static void changed_cgroup(struct cfq_io_context *cic)
 {
 	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
@@ -2986,12 +2980,6 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
-
-static void cfq_ioc_set_cgroup(struct io_context *ioc)
-{
-	call_for_each_cic(ioc, changed_cgroup);
-	ioc->cgroup_changed = 0;
-}
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
 static struct cfq_queue *
@@ -3222,13 +3210,15 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 out:
 	get_io_context(ioc);
 
-	if (unlikely(ioc->ioprio_changed))
-		cfq_ioc_set_ioprio(ioc);
-
+	if (unlikely(cic->changed)) {
+		if (test_and_clear_bit(CIC_IOPRIO_CHANGED, &cic->changed))
+			changed_ioprio(cic);
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-	if (unlikely(ioc->cgroup_changed))
-		cfq_ioc_set_cgroup(ioc);
+		if (test_and_clear_bit(CIC_CGROUP_CHANGED, &cic->changed))
+			changed_cgroup(cic);
 #endif
+	}
+
 	return cic;
 err:
 	if (cic)
diff --git a/fs/ioprio.c b/fs/ioprio.c
index 998ec239d1e..0f1b9515213 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -50,8 +50,7 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 
 	ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
 	if (ioc) {
-		ioc->ioprio = ioprio;
-		ioc->ioprio_changed = 1;
+		ioc_ioprio_changed(ioc, ioprio);
 		put_io_context(ioc);
 	}
 
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 079aea8fd8a..2c2b6da96b3 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -13,6 +13,11 @@ struct cfq_ttime {
 	unsigned long ttime_mean;
 };
 
+enum {
+	CIC_IOPRIO_CHANGED,
+	CIC_CGROUP_CHANGED,
+};
+
 struct cfq_io_context {
 	void *key;
 	struct request_queue *q;
@@ -26,6 +31,8 @@ struct cfq_io_context {
 	struct list_head queue_list;
 	struct hlist_node cic_list;
 
+	unsigned long changed;
+
 	void (*dtor)(struct io_context *); /* destructor */
 	void (*exit)(struct io_context *); /* called on task exit */
 
@@ -44,11 +51,6 @@ struct io_context {
 	spinlock_t lock;
 
 	unsigned short ioprio;
-	unsigned short ioprio_changed;
-
-#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
-	unsigned short cgroup_changed;
-#endif
 
 	/*
 	 * For request batching
@@ -81,6 +83,8 @@ void put_io_context(struct io_context *ioc);
 void exit_io_context(struct task_struct *task);
 struct io_context *get_task_io_context(struct task_struct *task,
 				       gfp_t gfp_flags, int node);
+void ioc_ioprio_changed(struct io_context *ioc, int ioprio);
+void ioc_cgroup_changed(struct io_context *ioc);
 #else
 struct io_context;
 static inline void put_io_context(struct io_context *ioc) { }
-- 
cgit v1.2.3-70-g09d2


From 216284c352a0061f5b20acff2c4e50fb43fea183 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:38 +0100
Subject: block, cfq: fix race condition in cic creation path and tighten
 locking

cfq_get_io_context() would fail if multiple tasks race to insert cic's
for the same association.  This patch restructures
cfq_get_io_context() such that slow path insertion race is handled
properly.

Note that the restructuring also makes cfq_get_io_context() called
under queue_lock and performs both ioc and cfqd insertions while
holding both ioc and queue locks.  This is part of on-going locking
tightening and will be used to simplify synchronization rules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 135 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 76 insertions(+), 59 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 51aece2eea7..181a63d3669 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2908,13 +2908,10 @@ static void changed_ioprio(struct cfq_io_context *cic)
 {
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
-	unsigned long flags;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
@@ -2929,8 +2926,6 @@ static void changed_ioprio(struct cfq_io_context *cic)
 	cfqq = cic->cfqq[BLK_RW_SYNC];
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
-
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 }
 
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2958,7 +2953,6 @@ static void changed_cgroup(struct cfq_io_context *cic)
 {
 	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
-	unsigned long flags;
 	struct request_queue *q;
 
 	if (unlikely(!cfqd))
@@ -2966,8 +2960,6 @@ static void changed_cgroup(struct cfq_io_context *cic)
 
 	q = cfqd->queue;
 
-	spin_lock_irqsave(q->queue_lock, flags);
-
 	if (sync_cfqq) {
 		/*
 		 * Drop reference to sync queue. A new sync queue will be
@@ -2977,8 +2969,6 @@ static void changed_cgroup(struct cfq_io_context *cic)
 		cic_set_cfqq(cic, NULL, 1);
 		cfq_put_queue(sync_cfqq);
 	}
-
-	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
@@ -3142,16 +3132,31 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 	return cic;
 }
 
-/*
- * Add cic into ioc, using cfqd as the search key. This enables us to lookup
- * the process specific cfq io context when entered from the block layer.
- * Also adds the cic to a per-cfqd list, used when this queue is removed.
+/**
+ * cfq_create_cic - create and link a cfq_io_context
+ * @cfqd: cfqd of interest
+ * @gfp_mask: allocation mask
+ *
+ * Make sure cfq_io_context linking %current->io_context and @cfqd exists.
+ * If ioc and/or cic doesn't exist, they will be created using @gfp_mask.
  */
-static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
-			struct cfq_io_context *cic, gfp_t gfp_mask)
+static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
-	unsigned long flags;
-	int ret;
+	struct request_queue *q = cfqd->queue;
+	struct cfq_io_context *cic = NULL;
+	struct io_context *ioc;
+	int ret = -ENOMEM;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	/* allocate stuff */
+	ioc = current_io_context(gfp_mask, q->node);
+	if (!ioc)
+		goto out;
+
+	cic = cfq_alloc_io_context(cfqd, gfp_mask);
+	if (!cic)
+		goto out;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (ret)
@@ -3161,53 +3166,72 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 	cic->key = cfqd;
 	cic->q = cfqd->queue;
 
-	spin_lock_irqsave(&ioc->lock, flags);
-	ret = radix_tree_insert(&ioc->radix_root, cfqd->queue->id, cic);
-	if (!ret)
-		hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
-	spin_unlock_irqrestore(&ioc->lock, flags);
-
-	radix_tree_preload_end();
+	/* lock both q and ioc and try to link @cic */
+	spin_lock_irq(q->queue_lock);
+	spin_lock(&ioc->lock);
 
-	if (!ret) {
-		spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	ret = radix_tree_insert(&ioc->radix_root, q->id, cic);
+	if (likely(!ret)) {
+		hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
 		list_add(&cic->queue_list, &cfqd->cic_list);
-		spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+		cic = NULL;
+	} else if (ret == -EEXIST) {
+		/* someone else already did it */
+		ret = 0;
 	}
+
+	spin_unlock(&ioc->lock);
+	spin_unlock_irq(q->queue_lock);
+
+	radix_tree_preload_end();
 out:
 	if (ret)
 		printk(KERN_ERR "cfq: cic link failed!\n");
+	if (cic)
+		cfq_cic_free(cic);
 	return ret;
 }
 
-/*
- * Setup general io context and cfq io context. There can be several cfq
- * io contexts per general io context, if this process is doing io to more
- * than one device managed by cfq.
+/**
+ * cfq_get_io_context - acquire cfq_io_context and bump refcnt on io_context
+ * @cfqd: cfqd to setup cic for
+ * @gfp_mask: allocation mask
+ *
+ * Return cfq_io_context associating @cfqd and %current->io_context and
+ * bump refcnt on io_context.  If ioc or cic doesn't exist, they're created
+ * using @gfp_mask.
+ *
+ * Must be called under queue_lock which may be released and re-acquired.
+ * This function also may sleep depending on @gfp_mask.
  */
 static struct cfq_io_context *
 cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
-	struct io_context *ioc = NULL;
+	struct request_queue *q = cfqd->queue;
 	struct cfq_io_context *cic = NULL;
+	struct io_context *ioc;
+	int err;
+
+	lockdep_assert_held(q->queue_lock);
+
+	while (true) {
+		/* fast path */
+		ioc = current->io_context;
+		if (likely(ioc)) {
+			cic = cfq_cic_lookup(cfqd, ioc);
+			if (likely(cic))
+				break;
+		}
 
-	might_sleep_if(gfp_mask & __GFP_WAIT);
-
-	ioc = current_io_context(gfp_mask, cfqd->queue->node);
-	if (!ioc)
-		goto err;
-
-	cic = cfq_cic_lookup(cfqd, ioc);
-	if (cic)
-		goto out;
-
-	cic = cfq_alloc_io_context(cfqd, gfp_mask);
-	if (cic == NULL)
-		goto err;
+		/* slow path - unlock, create missing ones and retry */
+		spin_unlock_irq(q->queue_lock);
+		err = cfq_create_cic(cfqd, gfp_mask);
+		spin_lock_irq(q->queue_lock);
+		if (err)
+			return NULL;
+	}
 
-	if (cfq_cic_link(cfqd, ioc, cic, gfp_mask))
-		goto err;
-out:
+	/* bump @ioc's refcnt and handle changed notifications */
 	get_io_context(ioc);
 
 	if (unlikely(cic->changed)) {
@@ -3220,10 +3244,6 @@ out:
 	}
 
 	return cic;
-err:
-	if (cic)
-		cfq_cic_free(cic);
-	return NULL;
 }
 
 static void
@@ -3759,14 +3779,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	const int rw = rq_data_dir(rq);
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
-	unsigned long flags;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
+	spin_lock_irq(q->queue_lock);
 	cic = cfq_get_io_context(cfqd, gfp_mask);
-
-	spin_lock_irqsave(q->queue_lock, flags);
-
 	if (!cic)
 		goto queue_fail;
 
@@ -3802,12 +3819,12 @@ new_queue:
 	rq->elevator_private[0] = cic;
 	rq->elevator_private[1] = cfqq;
 	rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	spin_unlock_irq(q->queue_lock);
 	return 0;
 
 queue_fail:
 	cfq_schedule_dispatch(cfqd);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	spin_unlock_irq(q->queue_lock);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
-- 
cgit v1.2.3-70-g09d2


From f1a4f4d35ff30a328d5ea28f6cc826b2083111d2 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:39 +0100
Subject: block, cfq: fix cic lookup locking

* cfq_cic_lookup() may be called without queue_lock and multiple tasks
  can execute it simultaneously for the same shared ioc.  Nothing
  prevents them racing each other and trying to drop the same dead cic
  entry multiple times.

* smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing
  prevents cfq_cic_lookup() seeing stale cic->key.  This usually
  doesn't blow up because by the time cic is exited, all requests have
  been drained and new requests are terminated before going through
  elevator.  However, it can still be triggered by plug merge path
  which doesn't grab queue_lock and thus can't check DEAD state
  reliably.

This patch updates lookup locking such that,

* Lookup is always performed under queue_lock.  This doesn't add any
  more locking.  The only issue is cfq_allow_merge() which can be
  called from plug merge path without holding any lock.  For now, this
  is worked around by using cic of the request to merge into, which is
  guaranteed to have the same ioc.  For longer term, I think it would
  be best to separate out plug merge method from regular one.

* Spurious ioc->lock locking around cic lookup hint assignment
  dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 67 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 35 insertions(+), 32 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 181a63d3669..e617b088c59 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1682,12 +1682,19 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 		return false;
 
 	/*
-	 * Lookup the cfqq that this bio will be queued with. Allow
-	 * merge only if rq is queued there.
+	 * Lookup the cfqq that this bio will be queued with and allow
+	 * merge only if rq is queued there.  This function can be called
+	 * from plug merge without queue_lock.  In such cases, ioc of @rq
+	 * and %current are guaranteed to be equal.  Avoid lookup which
+	 * requires queue_lock by using @rq's cic.
 	 */
-	cic = cfq_cic_lookup(cfqd, current->io_context);
-	if (!cic)
-		return false;
+	if (current->io_context == RQ_CIC(rq)->ioc) {
+		cic = RQ_CIC(rq);
+	} else {
+		cic = cfq_cic_lookup(cfqd, current->io_context);
+		if (!cic)
+			return false;
+	}
 
 	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
 	return cfqq == RQ_CFQQ(rq);
@@ -2784,21 +2791,15 @@ static void cfq_exit_cic(struct cfq_io_context *cic)
 	struct io_context *ioc = cic->ioc;
 
 	list_del_init(&cic->queue_list);
+	cic->key = cfqd_dead_key(cfqd);
 
 	/*
-	 * Make sure dead mark is seen for dead queues
+	 * Both setting lookup hint to and clearing it from @cic are done
+	 * under queue_lock.  If it's not pointing to @cic now, it never
+	 * will.  Hint assignment itself can race safely.
 	 */
-	smp_wmb();
-	cic->key = cfqd_dead_key(cfqd);
-
-	rcu_read_lock();
-	if (rcu_dereference(ioc->ioc_data) == cic) {
-		rcu_read_unlock();
-		spin_lock(&ioc->lock);
+	if (rcu_dereference_raw(ioc->ioc_data) == cic)
 		rcu_assign_pointer(ioc->ioc_data, NULL);
-		spin_unlock(&ioc->lock);
-	} else
-		rcu_read_unlock();
 
 	if (cic->cfqq[BLK_RW_ASYNC]) {
 		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]);
@@ -3092,12 +3093,20 @@ cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
 	cfq_cic_free(cic);
 }
 
+/**
+ * cfq_cic_lookup - lookup cfq_io_context
+ * @cfqd: the associated cfq_data
+ * @ioc: the associated io_context
+ *
+ * Look up cfq_io_context associated with @cfqd - @ioc pair.  Must be
+ * called with queue_lock held.
+ */
 static struct cfq_io_context *
 cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 {
 	struct cfq_io_context *cic;
-	unsigned long flags;
 
+	lockdep_assert_held(cfqd->queue->queue_lock);
 	if (unlikely(!ioc))
 		return NULL;
 
@@ -3107,28 +3116,22 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 	 * we maintain a last-hit cache, to avoid browsing over the tree
 	 */
 	cic = rcu_dereference(ioc->ioc_data);
-	if (cic && cic->key == cfqd) {
-		rcu_read_unlock();
-		return cic;
-	}
+	if (cic && cic->key == cfqd)
+		goto out;
 
 	do {
 		cic = radix_tree_lookup(&ioc->radix_root, cfqd->queue->id);
-		rcu_read_unlock();
 		if (!cic)
 			break;
-		if (unlikely(cic->key != cfqd)) {
-			cfq_drop_dead_cic(cfqd, ioc, cic);
-			rcu_read_lock();
-			continue;
+		if (likely(cic->key == cfqd)) {
+			/* hint assignment itself can race safely */
+			rcu_assign_pointer(ioc->ioc_data, cic);
+			break;
 		}
-
-		spin_lock_irqsave(&ioc->lock, flags);
-		rcu_assign_pointer(ioc->ioc_data, cic);
-		spin_unlock_irqrestore(&ioc->lock, flags);
-		break;
+		cfq_drop_dead_cic(cfqd, ioc, cic);
 	} while (1);
-
+out:
+	rcu_read_unlock();
 	return cic;
 }
 
-- 
cgit v1.2.3-70-g09d2


From b2efa05265d62bc29f3a64400fad4b44340eedb8 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:39 +0100
Subject: block, cfq: unlink cfq_io_context's immediately

cic is association between io_context and request_queue.  A cic is
linked from both ioc and q and should be destroyed when either one
goes away.  As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.

Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock but the radix tree and cic's are
also protected by RCU allowing either side to walk their lists without
grabbing lock.

This rather unconventional use of RCU quickly devolves into extremely
fragile convolution.  e.g. The following is from cfqd going away too
soon after ioc and q exits raced.

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:
 [   88.503444]
 Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
 RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
 ...
 Call Trace:
  [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
  [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
  [<ffffffff81389130>] exit_io_context+0x100/0x140
  [<ffffffff81098a29>] do_exit+0x579/0x850
  [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
  [<ffffffff81098de7>] sys_exit_group+0x17/0x20
  [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b

The only real hot path here is cic lookup during request
initialization and avoiding extra locking requires very confined use
of RCU.  This patch makes cic removal from both ioc and request_queue
perform double-locking and unlink immediately.

* From q side, the change is almost trivial as ioc->lock nests inside
  queue_lock.  It just needs to grab each ioc->lock as it walks
  cic_list and unlink it.

* From ioc side, it's a bit more difficult because of inversed lock
  order.  ioc needs its lock to walk its cic_list but can't grab the
  matching queue_lock and needs to perform unlock-relock dancing.

  Unlinking is now wholly done from put_io_context() and fast path is
  optimized by using the queue_lock the caller already holds, which is
  by far the most common case.  If the ioc accessed multiple devices,
  it tries with trylock.  In unlikely cases of fast path failure, it
  falls back to full double-locking dance from workqueue.

Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than RCU trick without adding any
meaningful overhead.

This still leaves a lot of now unnecessary RCU logics.  Future patches
will trim them.

-v2: Vivek pointed out that cic->q was being dereferenced after
     cic->release() was called.  Updated to use local variable @this_q
     instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c        |   2 +-
 block/blk-ioc.c           | 166 ++++++++++++++++++++++++++++++++++++++--------
 block/cfq-iosched.c       |  44 +++---------
 fs/ioprio.c               |   2 +-
 include/linux/blkdev.h    |   3 +
 include/linux/iocontext.h |  12 ++--
 kernel/fork.c             |   2 +-
 7 files changed, 159 insertions(+), 72 deletions(-)

(limited to 'block')

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index dc00835aab6..27886935804 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1649,7 +1649,7 @@ static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	ioc = get_task_io_context(tsk, GFP_ATOMIC, NUMA_NO_NODE);
 	if (ioc) {
 		ioc_cgroup_changed(ioc);
-		put_io_context(ioc);
+		put_io_context(ioc, NULL);
 	}
 }
 
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 6f59fbad93d..fb23965595d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -29,55 +29,164 @@ void get_io_context(struct io_context *ioc)
 }
 EXPORT_SYMBOL(get_io_context);
 
-static void cfq_dtor(struct io_context *ioc)
+/*
+ * Releasing ioc may nest into another put_io_context() leading to nested
+ * fast path release.  As the ioc's can't be the same, this is okay but
+ * makes lockdep whine.  Keep track of nesting and use it as subclass.
+ */
+#ifdef CONFIG_LOCKDEP
+#define ioc_release_depth(q)		((q) ? (q)->ioc_release_depth : 0)
+#define ioc_release_depth_inc(q)	(q)->ioc_release_depth++
+#define ioc_release_depth_dec(q)	(q)->ioc_release_depth--
+#else
+#define ioc_release_depth(q)		0
+#define ioc_release_depth_inc(q)	do { } while (0)
+#define ioc_release_depth_dec(q)	do { } while (0)
+#endif
+
+/*
+ * Slow path for ioc release in put_io_context().  Performs double-lock
+ * dancing to unlink all cic's and then frees ioc.
+ */
+static void ioc_release_fn(struct work_struct *work)
 {
-	if (!hlist_empty(&ioc->cic_list)) {
-		struct cfq_io_context *cic;
+	struct io_context *ioc = container_of(work, struct io_context,
+					      release_work);
+	struct request_queue *last_q = NULL;
+
+	spin_lock_irq(&ioc->lock);
+
+	while (!hlist_empty(&ioc->cic_list)) {
+		struct cfq_io_context *cic = hlist_entry(ioc->cic_list.first,
+							 struct cfq_io_context,
+							 cic_list);
+		struct request_queue *this_q = cic->q;
+
+		if (this_q != last_q) {
+			/*
+			 * Need to switch to @this_q.  Once we release
+			 * @ioc->lock, it can go away along with @cic.
+			 * Hold on to it.
+			 */
+			__blk_get_queue(this_q);
+
+			/*
+			 * blk_put_queue() might sleep thanks to kobject
+			 * idiocy.  Always release both locks, put and
+			 * restart.
+			 */
+			if (last_q) {
+				spin_unlock(last_q->queue_lock);
+				spin_unlock_irq(&ioc->lock);
+				blk_put_queue(last_q);
+			} else {
+				spin_unlock_irq(&ioc->lock);
+			}
+
+			last_q = this_q;
+			spin_lock_irq(this_q->queue_lock);
+			spin_lock(&ioc->lock);
+			continue;
+		}
+		ioc_release_depth_inc(this_q);
+		cic->exit(cic);
+		cic->release(cic);
+		ioc_release_depth_dec(this_q);
+	}
 
-		cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context,
-								cic_list);
-		cic->dtor(ioc);
+	if (last_q) {
+		spin_unlock(last_q->queue_lock);
+		spin_unlock_irq(&ioc->lock);
+		blk_put_queue(last_q);
+	} else {
+		spin_unlock_irq(&ioc->lock);
 	}
+
+	kmem_cache_free(iocontext_cachep, ioc);
 }
 
 /**
  * put_io_context - put a reference of io_context
  * @ioc: io_context to put
+ * @locked_q: request_queue the caller is holding queue_lock of (hint)
  *
  * Decrement reference count of @ioc and release it if the count reaches
- * zero.
+ * zero.  If the caller is holding queue_lock of a queue, it can indicate
+ * that with @locked_q.  This is an optimization hint and the caller is
+ * allowed to pass in %NULL even when it's holding a queue_lock.
  */
-void put_io_context(struct io_context *ioc)
+void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 {
+	struct request_queue *last_q = locked_q;
+	unsigned long flags;
+
 	if (ioc == NULL)
 		return;
 
 	BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
+	if (locked_q)
+		lockdep_assert_held(locked_q->queue_lock);
 
 	if (!atomic_long_dec_and_test(&ioc->refcount))
 		return;
 
-	rcu_read_lock();
-	cfq_dtor(ioc);
-	rcu_read_unlock();
-
-	kmem_cache_free(iocontext_cachep, ioc);
-}
-EXPORT_SYMBOL(put_io_context);
+	/*
+	 * Destroy @ioc.  This is a bit messy because cic's are chained
+	 * from both ioc and queue, and ioc->lock nests inside queue_lock.
+	 * The inner ioc->lock should be held to walk our cic_list and then
+	 * for each cic the outer matching queue_lock should be grabbed.
+	 * ie. We need to do reverse-order double lock dancing.
+	 *
+	 * Another twist is that we are often called with one of the
+	 * matching queue_locks held as indicated by @locked_q, which
+	 * prevents performing double-lock dance for other queues.
+	 *
+	 * So, we do it in two stages.  The fast path uses the queue_lock
+	 * the caller is holding and, if other queues need to be accessed,
+	 * uses trylock to avoid introducing locking dependency.  This can
+	 * handle most cases, especially if @ioc was performing IO on only
+	 * single device.
+	 *
+	 * If trylock doesn't cut it, we defer to @ioc->release_work which
+	 * can do all the double-locking dancing.
+	 */
+	spin_lock_irqsave_nested(&ioc->lock, flags,
+				 ioc_release_depth(locked_q));
+
+	while (!hlist_empty(&ioc->cic_list)) {
+		struct cfq_io_context *cic = hlist_entry(ioc->cic_list.first,
+							 struct cfq_io_context,
+							 cic_list);
+		struct request_queue *this_q = cic->q;
+
+		if (this_q != last_q) {
+			if (last_q && last_q != locked_q)
+				spin_unlock(last_q->queue_lock);
+			last_q = NULL;
+
+			if (!spin_trylock(this_q->queue_lock))
+				break;
+			last_q = this_q;
+			continue;
+		}
+		ioc_release_depth_inc(this_q);
+		cic->exit(cic);
+		cic->release(cic);
+		ioc_release_depth_dec(this_q);
+	}
 
-static void cfq_exit(struct io_context *ioc)
-{
-	rcu_read_lock();
+	if (last_q && last_q != locked_q)
+		spin_unlock(last_q->queue_lock);
 
-	if (!hlist_empty(&ioc->cic_list)) {
-		struct cfq_io_context *cic;
+	spin_unlock_irqrestore(&ioc->lock, flags);
 
-		cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context,
-								cic_list);
-		cic->exit(ioc);
-	}
-	rcu_read_unlock();
+	/* if no cic's left, we're done; otherwise, kick release_work */
+	if (hlist_empty(&ioc->cic_list))
+		kmem_cache_free(iocontext_cachep, ioc);
+	else
+		schedule_work(&ioc->release_work);
 }
+EXPORT_SYMBOL(put_io_context);
 
 /* Called by the exiting task */
 void exit_io_context(struct task_struct *task)
@@ -92,10 +201,8 @@ void exit_io_context(struct task_struct *task)
 	task->io_context = NULL;
 	task_unlock(task);
 
-	if (atomic_dec_and_test(&ioc->nr_tasks))
-		cfq_exit(ioc);
-
-	put_io_context(ioc);
+	atomic_dec(&ioc->nr_tasks);
+	put_io_context(ioc, NULL);
 }
 
 static struct io_context *create_task_io_context(struct task_struct *task,
@@ -115,6 +222,7 @@ static struct io_context *create_task_io_context(struct task_struct *task,
 	spin_lock_init(&ioc->lock);
 	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
 	INIT_HLIST_HEAD(&ioc->cic_list);
+	INIT_WORK(&ioc->release_work, ioc_release_fn);
 
 	/* try to install, somebody might already have beaten us to it */
 	task_lock(task);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e617b088c59..6cc60656040 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1778,7 +1778,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfqd->active_queue = NULL;
 
 	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
+		put_io_context(cfqd->active_cic->ioc, cfqd->queue);
 		cfqd->active_cic = NULL;
 	}
 }
@@ -2812,38 +2812,6 @@ static void cfq_exit_cic(struct cfq_io_context *cic)
 	}
 }
 
-static void cfq_exit_single_io_context(struct io_context *ioc,
-				       struct cfq_io_context *cic)
-{
-	struct cfq_data *cfqd = cic_to_cfqd(cic);
-
-	if (cfqd) {
-		struct request_queue *q = cfqd->queue;
-		unsigned long flags;
-
-		spin_lock_irqsave(q->queue_lock, flags);
-
-		/*
-		 * Ensure we get a fresh copy of the ->key to prevent
-		 * race between exiting task and queue
-		 */
-		smp_read_barrier_depends();
-		if (cic->key == cfqd)
-			cfq_exit_cic(cic);
-
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	}
-}
-
-/*
- * The process that ioc belongs to has exited, we need to clean up
- * and put the internal structures we have that belongs to that process.
- */
-static void cfq_exit_io_context(struct io_context *ioc)
-{
-	call_for_each_cic(ioc, cfq_exit_single_io_context);
-}
-
 static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
@@ -2855,8 +2823,8 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 		cic->ttime.last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
 		INIT_HLIST_NODE(&cic->cic_list);
-		cic->dtor = cfq_free_io_context;
-		cic->exit = cfq_exit_io_context;
+		cic->exit = cfq_exit_cic;
+		cic->release = cfq_release_cic;
 		elv_ioc_count_inc(cfq_ioc_count);
 	}
 
@@ -3726,7 +3694,7 @@ static void cfq_put_request(struct request *rq)
 		BUG_ON(!cfqq->allocated[rw]);
 		cfqq->allocated[rw]--;
 
-		put_io_context(RQ_CIC(rq)->ioc);
+		put_io_context(RQ_CIC(rq)->ioc, cfqq->cfqd->queue);
 
 		rq->elevator_private[0] = NULL;
 		rq->elevator_private[1] = NULL;
@@ -3937,8 +3905,12 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
 							queue_list);
+		struct io_context *ioc = cic->ioc;
 
+		spin_lock(&ioc->lock);
 		cfq_exit_cic(cic);
+		cfq_release_cic(cic);
+		spin_unlock(&ioc->lock);
 	}
 
 	cfq_put_async_queues(cfqd);
diff --git a/fs/ioprio.c b/fs/ioprio.c
index 0f1b9515213..f84b380d65e 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 	ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
 	if (ioc) {
 		ioc_ioprio_changed(ioc, ioprio);
-		put_io_context(ioc);
+		put_io_context(ioc, NULL);
 	}
 
 	return err;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d1b6f4ed1f9..65c2f8c7008 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,6 +393,9 @@ struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
+#ifdef CONFIG_LOCKDEP
+	int			ioc_release_depth;
+#endif
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 2c2b6da96b3..01e86312878 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -3,6 +3,7 @@
 
 #include <linux/radix-tree.h>
 #include <linux/rcupdate.h>
+#include <linux/workqueue.h>
 
 struct cfq_queue;
 struct cfq_ttime {
@@ -33,8 +34,8 @@ struct cfq_io_context {
 
 	unsigned long changed;
 
-	void (*dtor)(struct io_context *); /* destructor */
-	void (*exit)(struct io_context *); /* called on task exit */
+	void (*exit)(struct cfq_io_context *);
+	void (*release)(struct cfq_io_context *);
 
 	struct rcu_head rcu_head;
 };
@@ -61,6 +62,8 @@ struct io_context {
 	struct radix_tree_root radix_root;
 	struct hlist_head cic_list;
 	void __rcu *ioc_data;
+
+	struct work_struct release_work;
 };
 
 static inline struct io_context *ioc_task_link(struct io_context *ioc)
@@ -79,7 +82,7 @@ static inline struct io_context *ioc_task_link(struct io_context *ioc)
 
 struct task_struct;
 #ifdef CONFIG_BLOCK
-void put_io_context(struct io_context *ioc);
+void put_io_context(struct io_context *ioc, struct request_queue *locked_q);
 void exit_io_context(struct task_struct *task);
 struct io_context *get_task_io_context(struct task_struct *task,
 				       gfp_t gfp_flags, int node);
@@ -87,7 +90,8 @@ void ioc_ioprio_changed(struct io_context *ioc, int ioprio);
 void ioc_cgroup_changed(struct io_context *ioc);
 #else
 struct io_context;
-static inline void put_io_context(struct io_context *ioc) { }
+static inline void put_io_context(struct io_context *ioc,
+				  struct request_queue *locked_q) { }
 static inline void exit_io_context(struct task_struct *task) { }
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 5bcfc739bb7..2753449f203 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -887,7 +887,7 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 			return -ENOMEM;
 
 		new_ioc->ioprio = ioc->ioprio;
-		put_io_context(new_ioc);
+		put_io_context(new_ioc, NULL);
 	}
 #endif
 	return 0;
-- 
cgit v1.2.3-70-g09d2


From b9a1920837bc53430d339380e393a6e4c372939f Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:39 +0100
Subject: block, cfq: remove delayed unlink

Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary.  Kill them and remove now unused elevator_ops->trim().

This also leaves call_for_each_cic() without any user.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c      | 92 ++++++------------------------------------------
 block/elevator.c         | 16 ---------
 include/linux/elevator.h |  1 -
 3 files changed, 10 insertions(+), 99 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6cc60656040..ff44435fad5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2669,24 +2669,6 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	cfq_put_cfqg(cfqg);
 }
 
-/*
- * Call func for each cic attached to this ioc.
- */
-static void
-call_for_each_cic(struct io_context *ioc,
-		  void (*func)(struct io_context *, struct cfq_io_context *))
-{
-	struct cfq_io_context *cic;
-	struct hlist_node *n;
-
-	rcu_read_lock();
-
-	hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
-		func(ioc, cic);
-
-	rcu_read_unlock();
-}
-
 static void cfq_cic_free_rcu(struct rcu_head *head)
 {
 	struct cfq_io_context *cic;
@@ -2727,31 +2709,6 @@ static void cfq_release_cic(struct cfq_io_context *cic)
 	cfq_cic_free(cic);
 }
 
-static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&ioc->lock, flags);
-	cfq_release_cic(cic);
-	spin_unlock_irqrestore(&ioc->lock, flags);
-}
-
-/*
- * Must be called with rcu_read_lock() held or preemption otherwise disabled.
- * Only two callers of this - ->dtor() which is called with the rcu_read_lock(),
- * and ->trim() which is called with the task lock held
- */
-static void cfq_free_io_context(struct io_context *ioc)
-{
-	/*
-	 * ioc->refcount is zero here, or we are called from elv_unregister(),
-	 * so no more cic's are allowed to be linked into this ioc.  So it
-	 * should be ok to iterate over the known list, we will see all cic's
-	 * since no new ones are added.
-	 */
-	call_for_each_cic(ioc, cic_free_func);
-}
-
 static void cfq_put_cooperator(struct cfq_queue *cfqq)
 {
 	struct cfq_queue *__cfqq, *next;
@@ -3037,30 +2994,6 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 	return cfqq;
 }
 
-/*
- * We drop cfq io contexts lazily, so we may find a dead one.
- */
-static void
-cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
-		  struct cfq_io_context *cic)
-{
-	unsigned long flags;
-
-	WARN_ON(!list_empty(&cic->queue_list));
-	BUG_ON(cic->key != cfqd_dead_key(cfqd));
-
-	spin_lock_irqsave(&ioc->lock, flags);
-
-	BUG_ON(rcu_dereference_check(ioc->ioc_data,
-		lockdep_is_held(&ioc->lock)) == cic);
-
-	radix_tree_delete(&ioc->radix_root, cfqd->queue->id);
-	hlist_del_rcu(&cic->cic_list);
-	spin_unlock_irqrestore(&ioc->lock, flags);
-
-	cfq_cic_free(cic);
-}
-
 /**
  * cfq_cic_lookup - lookup cfq_io_context
  * @cfqd: the associated cfq_data
@@ -3078,26 +3011,22 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 	if (unlikely(!ioc))
 		return NULL;
 
-	rcu_read_lock();
-
 	/*
-	 * we maintain a last-hit cache, to avoid browsing over the tree
+	 * cic's are indexed from @ioc using radix tree and hint pointer,
+	 * both of which are protected with RCU.  All removals are done
+	 * holding both q and ioc locks, and we're holding q lock - if we
+	 * find a cic which points to us, it's guaranteed to be valid.
 	 */
+	rcu_read_lock();
 	cic = rcu_dereference(ioc->ioc_data);
 	if (cic && cic->key == cfqd)
 		goto out;
 
-	do {
-		cic = radix_tree_lookup(&ioc->radix_root, cfqd->queue->id);
-		if (!cic)
-			break;
-		if (likely(cic->key == cfqd)) {
-			/* hint assignment itself can race safely */
-			rcu_assign_pointer(ioc->ioc_data, cic);
-			break;
-		}
-		cfq_drop_dead_cic(cfqd, ioc, cic);
-	} while (1);
+	cic = radix_tree_lookup(&ioc->radix_root, cfqd->queue->id);
+	if (cic && cic->key == cfqd)
+		rcu_assign_pointer(ioc->ioc_data, cic);	/* allowed to race */
+	else
+		cic = NULL;
 out:
 	rcu_read_unlock();
 	return cic;
@@ -4182,7 +4111,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_may_queue_fn =	cfq_may_queue,
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
-		.trim =				cfq_free_io_context,
 	},
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
diff --git a/block/elevator.c b/block/elevator.c
index 66343d6917d..6a343e8f831 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -913,22 +913,6 @@ EXPORT_SYMBOL_GPL(elv_register);
 
 void elv_unregister(struct elevator_type *e)
 {
-	struct task_struct *g, *p;
-
-	/*
-	 * Iterate every thread in the process to remove the io contexts.
-	 */
-	if (e->ops.trim) {
-		read_lock(&tasklist_lock);
-		do_each_thread(g, p) {
-			task_lock(p);
-			if (p->io_context)
-				e->ops.trim(p->io_context);
-			task_unlock(p);
-		} while_each_thread(g, p);
-		read_unlock(&tasklist_lock);
-	}
-
 	spin_lock(&elv_list_lock);
 	list_del_init(&e->list);
 	spin_unlock(&elv_list_lock);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1d0f7a2ff73..581dd1bd3d3 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -63,7 +63,6 @@ struct elevator_ops
 
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
-	void (*trim)(struct io_context *);
 };
 
 #define ELV_NAME_MAX	(16)
-- 
cgit v1.2.3-70-g09d2


From b50b636bce6293fa858cc7ff6c3ffe4920d90006 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:39 +0100
Subject: block, cfq: kill ioc_gone

Now that cic's are immediately unlinked under both locks, there's no
need to count and drain cic's before module unload.  RCU callback
completion is waited with rcu_barrier().

While at it, remove residual RCU operations on cic_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c      | 43 +++++--------------------------------------
 include/linux/elevator.h | 17 -----------------
 2 files changed, 5 insertions(+), 55 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ff44435fad5..ae7791a8ded 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -62,10 +62,6 @@ static const int cfq_hist_divisor = 4;
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
 
-static DEFINE_PER_CPU(unsigned long, cfq_ioc_count);
-static struct completion *ioc_gone;
-static DEFINE_SPINLOCK(ioc_gone_lock);
-
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
 #define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
 #define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
@@ -2671,26 +2667,8 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 
 static void cfq_cic_free_rcu(struct rcu_head *head)
 {
-	struct cfq_io_context *cic;
-
-	cic = container_of(head, struct cfq_io_context, rcu_head);
-
-	kmem_cache_free(cfq_ioc_pool, cic);
-	elv_ioc_count_dec(cfq_ioc_count);
-
-	if (ioc_gone) {
-		/*
-		 * CFQ scheduler is exiting, grab exit lock and check
-		 * the pending io context count. If it hits zero,
-		 * complete ioc_gone and set it back to NULL
-		 */
-		spin_lock(&ioc_gone_lock);
-		if (ioc_gone && !elv_ioc_count_read(cfq_ioc_count)) {
-			complete(ioc_gone);
-			ioc_gone = NULL;
-		}
-		spin_unlock(&ioc_gone_lock);
-	}
+	kmem_cache_free(cfq_ioc_pool,
+			container_of(head, struct cfq_io_context, rcu_head));
 }
 
 static void cfq_cic_free(struct cfq_io_context *cic)
@@ -2705,7 +2683,7 @@ static void cfq_release_cic(struct cfq_io_context *cic)
 
 	BUG_ON(!(dead_key & CIC_DEAD_KEY));
 	radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
-	hlist_del_rcu(&cic->cic_list);
+	hlist_del(&cic->cic_list);
 	cfq_cic_free(cic);
 }
 
@@ -2782,7 +2760,6 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 		INIT_HLIST_NODE(&cic->cic_list);
 		cic->exit = cfq_exit_cic;
 		cic->release = cfq_release_cic;
-		elv_ioc_count_inc(cfq_ioc_count);
 	}
 
 	return cic;
@@ -3072,7 +3049,7 @@ static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 
 	ret = radix_tree_insert(&ioc->radix_root, q->id, cic);
 	if (likely(!ret)) {
-		hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
+		hlist_add_head(&cic->cic_list, &ioc->cic_list);
 		list_add(&cic->queue_list, &cfqd->cic_list);
 		cic = NULL;
 	} else if (ret == -EEXIST) {
@@ -4156,19 +4133,9 @@ static int __init cfq_init(void)
 
 static void __exit cfq_exit(void)
 {
-	DECLARE_COMPLETION_ONSTACK(all_gone);
 	blkio_policy_unregister(&blkio_policy_cfq);
 	elv_unregister(&iosched_cfq);
-	ioc_gone = &all_gone;
-	/* ioc_gone's update must be visible before reading ioc_count */
-	smp_wmb();
-
-	/*
-	 * this also protects us from entering cfq_slab_kill() with
-	 * pending RCU callbacks
-	 */
-	if (elv_ioc_count_read(cfq_ioc_count))
-		wait_for_completion(&all_gone);
+	rcu_barrier();	/* make sure all cic RCU frees are complete */
 	cfq_slab_kill();
 }
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 581dd1bd3d3..02604c89ddd 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -196,22 +196,5 @@ enum {
 	INIT_LIST_HEAD(&(rq)->csd.list);	\
 	} while (0)
 
-/*
- * io context count accounting
- */
-#define elv_ioc_count_mod(name, __val) this_cpu_add(name, __val)
-#define elv_ioc_count_inc(name)	this_cpu_inc(name)
-#define elv_ioc_count_dec(name)	this_cpu_dec(name)
-
-#define elv_ioc_count_read(name)				\
-({								\
-	unsigned long __val = 0;				\
-	int __cpu;						\
-	smp_wmb();						\
-	for_each_possible_cpu(__cpu)				\
-		__val += per_cpu(name, __cpu);			\
-	__val;							\
-})
-
 #endif /* CONFIG_BLOCK */
 #endif
-- 
cgit v1.2.3-70-g09d2


From 1238033c79e92e5c315af12e45396f1a78c73dec Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:40 +0100
Subject: block, cfq: kill cic->key

Now that lazy paths are removed, cfqd_dead_key() is meaningless and
cic->q can be used whereever cic->key is used.  Kill cic->key.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c       | 26 +++++---------------------
 include/linux/iocontext.h |  1 -
 2 files changed, 5 insertions(+), 22 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ae7791a8ded..3b07ce16878 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -472,22 +472,9 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
 	cic->cfqq[is_sync] = cfqq;
 }
 
-#define CIC_DEAD_KEY	1ul
-#define CIC_DEAD_INDEX_SHIFT	1
-
-static inline void *cfqd_dead_key(struct cfq_data *cfqd)
-{
-	return (void *)(cfqd->queue->id << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
-}
-
 static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
 {
-	struct cfq_data *cfqd = cic->key;
-
-	if (unlikely((unsigned long) cfqd & CIC_DEAD_KEY))
-		return NULL;
-
-	return cfqd;
+	return cic->q->elevator->elevator_data;
 }
 
 /*
@@ -2679,10 +2666,8 @@ static void cfq_cic_free(struct cfq_io_context *cic)
 static void cfq_release_cic(struct cfq_io_context *cic)
 {
 	struct io_context *ioc = cic->ioc;
-	unsigned long dead_key = (unsigned long) cic->key;
 
-	BUG_ON(!(dead_key & CIC_DEAD_KEY));
-	radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
+	radix_tree_delete(&ioc->radix_root, cic->q->id);
 	hlist_del(&cic->cic_list);
 	cfq_cic_free(cic);
 }
@@ -2726,7 +2711,6 @@ static void cfq_exit_cic(struct cfq_io_context *cic)
 	struct io_context *ioc = cic->ioc;
 
 	list_del_init(&cic->queue_list);
-	cic->key = cfqd_dead_key(cfqd);
 
 	/*
 	 * Both setting lookup hint to and clearing it from @cic are done
@@ -2982,6 +2966,7 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 static struct cfq_io_context *
 cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 {
+	struct request_queue *q = cfqd->queue;
 	struct cfq_io_context *cic;
 
 	lockdep_assert_held(cfqd->queue->queue_lock);
@@ -2996,11 +2981,11 @@ cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 	 */
 	rcu_read_lock();
 	cic = rcu_dereference(ioc->ioc_data);
-	if (cic && cic->key == cfqd)
+	if (cic && cic->q == q)
 		goto out;
 
 	cic = radix_tree_lookup(&ioc->radix_root, cfqd->queue->id);
-	if (cic && cic->key == cfqd)
+	if (cic && cic->q == q)
 		rcu_assign_pointer(ioc->ioc_data, cic);	/* allowed to race */
 	else
 		cic = NULL;
@@ -3040,7 +3025,6 @@ static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 		goto out;
 
 	cic->ioc = ioc;
-	cic->key = cfqd;
 	cic->q = cfqd->queue;
 
 	/* lock both q and ioc and try to link @cic */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 01e86312878..b2b75a54f25 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -20,7 +20,6 @@ enum {
 };
 
 struct cfq_io_context {
-	void *key;
 	struct request_queue *q;
 
 	struct cfq_queue *cfqq[2];
-- 
cgit v1.2.3-70-g09d2


From f2dbd76a0a994bc1d5a3d0e7c844cc373832e86c Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:40 +0100
Subject: block, cfq: replace current_io_context() with create_io_context()

When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path.  This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.

Given the restriction, accessor + creator rolled into one doesn't work
too well.  Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().

Future ioc updates will further consolidate ioc access and the create
interface will be unexported.

While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.

This patch does not introduce functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c    | 25 ++++++++++++++++-----
 block/blk-ioc.c     | 62 +++++++++++++++--------------------------------------
 block/blk.h         | 36 ++++++++++++++++++++++++++++---
 block/cfq-iosched.c |  2 +-
 4 files changed, 71 insertions(+), 54 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index fd4749391e1..6804fdf27ef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -771,9 +771,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
-	struct io_context *ioc = NULL;
+	struct io_context *ioc;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	bool retried = false;
 	int may_queue;
+retry:
+	ioc = current->io_context;
 
 	if (unlikely(blk_queue_dead(q)))
 		return NULL;
@@ -784,7 +787,20 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 
 	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
 		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
+			/*
+			 * We want ioc to record batching state.  If it's
+			 * not already there, creating a new one requires
+			 * dropping queue_lock, which in turn requires
+			 * retesting conditions to avoid queue hang.
+			 */
+			if (!ioc && !retried) {
+				spin_unlock_irq(q->queue_lock);
+				create_io_context(current, gfp_mask, q->node);
+				spin_lock_irq(q->queue_lock);
+				retried = true;
+				goto retry;
+			}
+
 			/*
 			 * The queue will fill after this allocation, so set
 			 * it as full, and mark this process as "batching".
@@ -892,7 +908,6 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 	rq = get_request(q, rw_flags, bio, GFP_NOIO);
 	while (!rq) {
 		DEFINE_WAIT(wait);
-		struct io_context *ioc;
 		struct request_list *rl = &q->rq;
 
 		if (unlikely(blk_queue_dead(q)))
@@ -912,8 +927,8 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		 * up to a big batch of them for a small period time.
 		 * See ioc_batching, ioc_set_batching
 		 */
-		ioc = current_io_context(GFP_NOIO, q->node);
-		ioc_set_batching(q, ioc);
+		create_io_context(current, GFP_NOIO, q->node);
+		ioc_set_batching(q, current->io_context);
 
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index fb23965595d..e23c797b468 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -205,16 +205,15 @@ void exit_io_context(struct task_struct *task)
 	put_io_context(ioc, NULL);
 }
 
-static struct io_context *create_task_io_context(struct task_struct *task,
-						 gfp_t gfp_flags, int node,
-						 bool take_ref)
+void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
+				int node)
 {
 	struct io_context *ioc;
 
 	ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags | __GFP_ZERO,
 				    node);
 	if (unlikely(!ioc))
-		return NULL;
+		return;
 
 	/* initialize */
 	atomic_long_set(&ioc->refcount, 1);
@@ -226,42 +225,13 @@ static struct io_context *create_task_io_context(struct task_struct *task,
 
 	/* try to install, somebody might already have beaten us to it */
 	task_lock(task);
-
-	if (!task->io_context && !(task->flags & PF_EXITING)) {
+	if (!task->io_context && !(task->flags & PF_EXITING))
 		task->io_context = ioc;
-	} else {
+	else
 		kmem_cache_free(iocontext_cachep, ioc);
-		ioc = task->io_context;
-	}
-
-	if (ioc && take_ref)
-		get_io_context(ioc);
-
 	task_unlock(task);
-	return ioc;
 }
-
-/**
- * current_io_context - get io_context of %current
- * @gfp_flags: allocation flags, used if allocation is necessary
- * @node: allocation node, used if allocation is necessary
- *
- * Return io_context of %current.  If it doesn't exist, it is created with
- * @gfp_flags and @node.  The returned io_context does NOT have its
- * reference count incremented.  Because io_context is exited only on task
- * exit, %current can be sure that the returned io_context is valid and
- * alive as long as it is executing.
- */
-struct io_context *current_io_context(gfp_t gfp_flags, int node)
-{
-	might_sleep_if(gfp_flags & __GFP_WAIT);
-
-	if (current->io_context)
-		return current->io_context;
-
-	return create_task_io_context(current, gfp_flags, node, false);
-}
-EXPORT_SYMBOL(current_io_context);
+EXPORT_SYMBOL(create_io_context_slowpath);
 
 /**
  * get_task_io_context - get io_context of a task
@@ -274,7 +244,7 @@ EXPORT_SYMBOL(current_io_context);
  * incremented.
  *
  * This function always goes through task_lock() and it's better to use
- * current_io_context() + get_io_context() for %current.
+ * %current->io_context + get_io_context() for %current.
  */
 struct io_context *get_task_io_context(struct task_struct *task,
 				       gfp_t gfp_flags, int node)
@@ -283,16 +253,18 @@ struct io_context *get_task_io_context(struct task_struct *task,
 
 	might_sleep_if(gfp_flags & __GFP_WAIT);
 
-	task_lock(task);
-	ioc = task->io_context;
-	if (likely(ioc)) {
-		get_io_context(ioc);
+	do {
+		task_lock(task);
+		ioc = task->io_context;
+		if (likely(ioc)) {
+			get_io_context(ioc);
+			task_unlock(task);
+			return ioc;
+		}
 		task_unlock(task);
-		return ioc;
-	}
-	task_unlock(task);
+	} while (create_io_context(task, gfp_flags, node));
 
-	return create_task_io_context(task, gfp_flags, node, true);
+	return NULL;
 }
 EXPORT_SYMBOL(get_task_io_context);
 
diff --git a/block/blk.h b/block/blk.h
index 8d421156fef..5bca2668e1b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -127,9 +127,6 @@ static inline int blk_should_fake_timeout(struct request_queue *q)
 }
 #endif
 
-void get_io_context(struct io_context *ioc);
-struct io_context *current_io_context(gfp_t gfp_flags, int node);
-
 int ll_back_merge_fn(struct request_queue *q, struct request *req,
 		     struct bio *bio);
 int ll_front_merge_fn(struct request_queue *q, struct request *req, 
@@ -198,6 +195,39 @@ static inline int blk_do_io_stat(struct request *rq)
 	        (rq->cmd_flags & REQ_DISCARD));
 }
 
+/*
+ * Internal io_context interface
+ */
+void get_io_context(struct io_context *ioc);
+
+void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_mask,
+				int node);
+
+/**
+ * create_io_context - try to create task->io_context
+ * @task: target task
+ * @gfp_mask: allocation mask
+ * @node: allocation node
+ *
+ * If @task->io_context is %NULL, allocate a new io_context and install it.
+ * Returns the current @task->io_context which may be %NULL if allocation
+ * failed.
+ *
+ * Note that this function can't be called with IRQ disabled because
+ * task_lock which protects @task->io_context is IRQ-unsafe.
+ */
+static inline struct io_context *create_io_context(struct task_struct *task,
+						   gfp_t gfp_mask, int node)
+{
+	WARN_ON_ONCE(irqs_disabled());
+	if (unlikely(!task->io_context))
+		create_io_context_slowpath(task, gfp_mask, node);
+	return task->io_context;
+}
+
+/*
+ * Internal throttling interface
+ */
 #ifdef CONFIG_BLK_DEV_THROTTLING
 extern bool blk_throtl_bio(struct request_queue *q, struct bio *bio);
 extern void blk_throtl_drain(struct request_queue *q);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3b07ce16878..5f7e4d16140 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3012,7 +3012,7 @@ static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	/* allocate stuff */
-	ioc = current_io_context(gfp_mask, q->node);
+	ioc = create_io_context(current, gfp_mask, q->node);
 	if (!ioc)
 		goto out;
 
-- 
cgit v1.2.3-70-g09d2


From f8fc877d3c1f10457d0d73d8540a0c51a1fa718a Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:40 +0100
Subject: block: reorder elevator switch sequence

Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator.  However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence.  It can just do "register,
detach, attach".

* elevator_init_queue() is updated to set ->elevator_data directly and
  return 0 / -errno.  This allows elevator_exit() on an unattached
  elevator.

* __elv_unregister_queue() which was necessary to unregister
  unattached q is removed in favor of __elv_register_queue() which can
  register unattached q.

* elevator_attach() becomes a single assignment and obscures more then
  it helps.  Dropped.

This will help cleaning up io_context handling across elevator switch.

This patch doesn't introduce visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/elevator.c | 91 +++++++++++++++++++++++---------------------------------
 1 file changed, 37 insertions(+), 54 deletions(-)

(limited to 'block')

diff --git a/block/elevator.c b/block/elevator.c
index 6a343e8f831..a16c2d1713e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -168,17 +168,13 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static int elevator_init_queue(struct request_queue *q,
+			       struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
-}
-
-static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
-{
-	q->elevator = eq;
-	eq->elevator_data = data;
+	eq->elevator_data = eq->ops->elevator_init_fn(q);
+	if (eq->elevator_data)
+		return 0;
+	return -ENOMEM;
 }
 
 static char chosen_elevator[ELV_NAME_MAX];
@@ -241,7 +237,7 @@ int elevator_init(struct request_queue *q, char *name)
 {
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
-	void *data;
+	int err;
 
 	if (unlikely(q->elevator))
 		return 0;
@@ -278,13 +274,13 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	err = elevator_init_queue(q, eq);
+	if (err) {
 		kobject_put(&eq->kobj);
-		return -ENOMEM;
+		return err;
 	}
 
-	elevator_attach(q, eq, data);
+	q->elevator = eq;
 	return 0;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -856,9 +852,8 @@ static struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-int elv_register_queue(struct request_queue *q)
+int __elv_register_queue(struct request_queue *q, struct elevator_queue *e)
 {
-	struct elevator_queue *e = q->elevator;
 	int error;
 
 	error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched");
@@ -876,19 +871,22 @@ int elv_register_queue(struct request_queue *q)
 	}
 	return error;
 }
-EXPORT_SYMBOL(elv_register_queue);
 
-static void __elv_unregister_queue(struct elevator_queue *e)
+int elv_register_queue(struct request_queue *q)
 {
-	kobject_uevent(&e->kobj, KOBJ_REMOVE);
-	kobject_del(&e->kobj);
-	e->registered = 0;
+	return __elv_register_queue(q, q->elevator);
 }
+EXPORT_SYMBOL(elv_register_queue);
 
 void elv_unregister_queue(struct request_queue *q)
 {
-	if (q)
-		__elv_unregister_queue(q->elevator);
+	if (q) {
+		struct elevator_queue *e = q->elevator;
+
+		kobject_uevent(&e->kobj, KOBJ_REMOVE);
+		kobject_del(&e->kobj);
+		e->registered = 0;
+	}
 }
 EXPORT_SYMBOL(elv_unregister_queue);
 
@@ -928,50 +926,36 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
 	int err;
 
-	/*
-	 * Allocate new elevator
-	 */
+	/* allocate new elevator */
 	e = elevator_alloc(q, new_e);
 	if (!e)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	err = elevator_init_queue(q, e);
+	if (err) {
 		kobject_put(&e->kobj);
-		return -ENOMEM;
+		return err;
 	}
 
-	/*
-	 * Turn on BYPASS and drain all requests w/ elevator private data
-	 */
+	/* turn on BYPASS and drain all requests w/ elevator private data */
 	elv_quiesce_start(q);
 
-	/*
-	 * Remember old elevator.
-	 */
-	old_elevator = q->elevator;
-
-	/*
-	 * attach and start new elevator
-	 */
-	spin_lock_irq(q->queue_lock);
-	elevator_attach(q, e, data);
-	spin_unlock_irq(q->queue_lock);
-
-	if (old_elevator->registered) {
-		__elv_unregister_queue(old_elevator);
-
-		err = elv_register_queue(q);
+	/* unregister old queue, register new one and kill old elevator */
+	if (q->elevator->registered) {
+		elv_unregister_queue(q);
+		err = __elv_register_queue(q, e);
 		if (err)
 			goto fail_register;
 	}
 
-	/*
-	 * finally exit old elevator and turn off BYPASS.
-	 */
+	/* done, replace the old one with new one and turn off BYPASS */
+	spin_lock_irq(q->queue_lock);
+	old_elevator = q->elevator;
+	q->elevator = e;
+	spin_unlock_irq(q->queue_lock);
+
 	elevator_exit(old_elevator);
 	elv_quiesce_end(q);
 
@@ -985,7 +969,6 @@ fail_register:
 	 * one again (along with re-adding the sysfs dir)
 	 */
 	elevator_exit(e);
-	q->elevator = old_elevator;
 	elv_register_queue(q);
 	elv_quiesce_end(q);
 
-- 
cgit v1.2.3-70-g09d2


From 22f746e235a5cbee2a6ca9887b1be2aa7d31fe71 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:41 +0100
Subject: block: remove elevator_queue->ops

elevator_queue->ops points to the same ops struct ->elevator_type.ops
is pointing to.  The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.

Relocate elevator_type->list which used only during module init/exit
to the end of the structure, rename elevator_queue->elevator_type to
->type, and replace elevator_queue->ops with elevator_queue->type.ops.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk.h              | 10 +++----
 block/elevator.c         | 74 +++++++++++++++++++++++-------------------------
 include/linux/elevator.h |  5 ++--
 3 files changed, 43 insertions(+), 46 deletions(-)

(limited to 'block')

diff --git a/block/blk.h b/block/blk.h
index 5bca2668e1b..4943770e079 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -94,7 +94,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 			return NULL;
 		}
 		if (unlikely(blk_queue_dead(q)) ||
-		    !q->elevator->ops->elevator_dispatch_fn(q, 0))
+		    !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
 			return NULL;
 	}
 }
@@ -103,16 +103,16 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_activate_req_fn)
-		e->ops->elevator_activate_req_fn(q, rq);
+	if (e->type->ops.elevator_activate_req_fn)
+		e->type->ops.elevator_activate_req_fn(q, rq);
 }
 
 static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_deactivate_req_fn)
-		e->ops->elevator_deactivate_req_fn(q, rq);
+	if (e->type->ops.elevator_deactivate_req_fn)
+		e->type->ops.elevator_deactivate_req_fn(q, rq);
 }
 
 #ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/elevator.c b/block/elevator.c
index a16c2d1713e..31ffe76aed3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -61,8 +61,8 @@ static int elv_iosched_allow_merge(struct request *rq, struct bio *bio)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_allow_merge_fn)
-		return e->ops->elevator_allow_merge_fn(q, rq, bio);
+	if (e->type->ops.elevator_allow_merge_fn)
+		return e->type->ops.elevator_allow_merge_fn(q, rq, bio);
 
 	return 1;
 }
@@ -171,7 +171,7 @@ static struct elevator_type *elevator_get(const char *name)
 static int elevator_init_queue(struct request_queue *q,
 			       struct elevator_queue *eq)
 {
-	eq->elevator_data = eq->ops->elevator_init_fn(q);
+	eq->elevator_data = eq->type->ops.elevator_init_fn(q);
 	if (eq->elevator_data)
 		return 0;
 	return -ENOMEM;
@@ -203,8 +203,7 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	if (unlikely(!eq))
 		goto err;
 
-	eq->ops = &e->ops;
-	eq->elevator_type = e;
+	eq->type = e;
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 
@@ -228,7 +227,7 @@ static void elevator_release(struct kobject *kobj)
 	struct elevator_queue *e;
 
 	e = container_of(kobj, struct elevator_queue, kobj);
-	elevator_put(e->elevator_type);
+	elevator_put(e->type);
 	kfree(e->hash);
 	kfree(e);
 }
@@ -288,9 +287,8 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
-	if (e->ops->elevator_exit_fn)
-		e->ops->elevator_exit_fn(e);
-	e->ops = NULL;
+	if (e->type->ops.elevator_exit_fn)
+		e->type->ops.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -500,8 +498,8 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 		return ELEVATOR_BACK_MERGE;
 	}
 
-	if (e->ops->elevator_merge_fn)
-		return e->ops->elevator_merge_fn(q, req, bio);
+	if (e->type->ops.elevator_merge_fn)
+		return e->type->ops.elevator_merge_fn(q, req, bio);
 
 	return ELEVATOR_NO_MERGE;
 }
@@ -544,8 +542,8 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_merged_fn)
-		e->ops->elevator_merged_fn(q, rq, type);
+	if (e->type->ops.elevator_merged_fn)
+		e->type->ops.elevator_merged_fn(q, rq, type);
 
 	if (type == ELEVATOR_BACK_MERGE)
 		elv_rqhash_reposition(q, rq);
@@ -559,8 +557,8 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	struct elevator_queue *e = q->elevator;
 	const int next_sorted = next->cmd_flags & REQ_SORTED;
 
-	if (next_sorted && e->ops->elevator_merge_req_fn)
-		e->ops->elevator_merge_req_fn(q, rq, next);
+	if (next_sorted && e->type->ops.elevator_merge_req_fn)
+		e->type->ops.elevator_merge_req_fn(q, rq, next);
 
 	elv_rqhash_reposition(q, rq);
 
@@ -577,8 +575,8 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_bio_merged_fn)
-		e->ops->elevator_bio_merged_fn(q, rq, bio);
+	if (e->type->ops.elevator_bio_merged_fn)
+		e->type->ops.elevator_bio_merged_fn(q, rq, bio);
 }
 
 void elv_requeue_request(struct request_queue *q, struct request *rq)
@@ -604,12 +602,12 @@ void elv_drain_elevator(struct request_queue *q)
 
 	lockdep_assert_held(q->queue_lock);
 
-	while (q->elevator->ops->elevator_dispatch_fn(q, 1))
+	while (q->elevator->type->ops.elevator_dispatch_fn(q, 1))
 		;
 	if (q->nr_sorted && printed++ < 10) {
 		printk(KERN_ERR "%s: forced dispatching is broken "
 		       "(nr_sorted=%u), please report this\n",
-		       q->elevator->elevator_type->elevator_name, q->nr_sorted);
+		       q->elevator->type->elevator_name, q->nr_sorted);
 	}
 }
 
@@ -698,7 +696,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 		 * rq cannot be accessed after calling
 		 * elevator_add_req_fn.
 		 */
-		q->elevator->ops->elevator_add_req_fn(q, rq);
+		q->elevator->type->ops.elevator_add_req_fn(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_FLUSH:
@@ -727,8 +725,8 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_latter_req_fn)
-		return e->ops->elevator_latter_req_fn(q, rq);
+	if (e->type->ops.elevator_latter_req_fn)
+		return e->type->ops.elevator_latter_req_fn(q, rq);
 	return NULL;
 }
 
@@ -736,8 +734,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_former_req_fn)
-		return e->ops->elevator_former_req_fn(q, rq);
+	if (e->type->ops.elevator_former_req_fn)
+		return e->type->ops.elevator_former_req_fn(q, rq);
 	return NULL;
 }
 
@@ -745,8 +743,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+	if (e->type->ops.elevator_set_req_fn)
+		return e->type->ops.elevator_set_req_fn(q, rq, gfp_mask);
 
 	rq->elevator_private[0] = NULL;
 	return 0;
@@ -756,16 +754,16 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_put_req_fn)
-		e->ops->elevator_put_req_fn(rq);
+	if (e->type->ops.elevator_put_req_fn)
+		e->type->ops.elevator_put_req_fn(rq);
 }
 
 int elv_may_queue(struct request_queue *q, int rw)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_may_queue_fn)
-		return e->ops->elevator_may_queue_fn(q, rw);
+	if (e->type->ops.elevator_may_queue_fn)
+		return e->type->ops.elevator_may_queue_fn(q, rw);
 
 	return ELV_MQUEUE_MAY;
 }
@@ -800,8 +798,8 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
 		if ((rq->cmd_flags & REQ_SORTED) &&
-		    e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		    e->type->ops.elevator_completed_req_fn)
+			e->type->ops.elevator_completed_req_fn(q, rq);
 	}
 }
 
@@ -819,7 +817,7 @@ elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = e->ops ? entry->show(e, page) : -ENOENT;
+	error = e->type ? entry->show(e, page) : -ENOENT;
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
@@ -837,7 +835,7 @@ elv_attr_store(struct kobject *kobj, struct attribute *attr,
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = e->ops ? entry->store(e, page, length) : -ENOENT;
+	error = e->type ? entry->store(e, page, length) : -ENOENT;
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
@@ -858,7 +856,7 @@ int __elv_register_queue(struct request_queue *q, struct elevator_queue *e)
 
 	error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched");
 	if (!error) {
-		struct elv_fs_entry *attr = e->elevator_type->elevator_attrs;
+		struct elv_fs_entry *attr = e->type->elevator_attrs;
 		if (attr) {
 			while (attr->attr.name) {
 				if (sysfs_create_file(&e->kobj, &attr->attr))
@@ -959,7 +957,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	elevator_exit(old_elevator);
 	elv_quiesce_end(q);
 
-	blk_add_trace_msg(q, "elv switch: %s", e->elevator_type->elevator_name);
+	blk_add_trace_msg(q, "elv switch: %s", e->type->elevator_name);
 
 	return 0;
 
@@ -993,7 +991,7 @@ int elevator_change(struct request_queue *q, const char *name)
 		return -EINVAL;
 	}
 
-	if (!strcmp(elevator_name, q->elevator->elevator_type->elevator_name)) {
+	if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
 		elevator_put(e);
 		return 0;
 	}
@@ -1028,7 +1026,7 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 	if (!q->elevator || !blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
-	elv = e->elevator_type;
+	elv = e->type;
 
 	spin_lock(&elv_list_lock);
 	list_for_each_entry(__e, &elv_list, list) {
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 02604c89ddd..04958ef53e6 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -78,11 +78,11 @@ struct elv_fs_entry {
  */
 struct elevator_type
 {
-	struct list_head list;
 	struct elevator_ops ops;
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+	struct list_head list;
 };
 
 /*
@@ -90,10 +90,9 @@ struct elevator_type
  */
 struct elevator_queue
 {
-	struct elevator_ops *ops;
+	struct elevator_type *type;
 	void *elevator_data;
 	struct kobject kobj;
-	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
 	unsigned int registered:1;
-- 
cgit v1.2.3-70-g09d2


From c58698073218f2c8f2fc5982fa3938c2d3803b9f Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:41 +0100
Subject: block, cfq: reorganize cfq_io_context into generic and cfq specific
 parts

Currently io_context and cfq logics are mixed without clear boundary.
Most of io_context is independent from cfq but cfq_io_context handling
logic is dispersed between generic ioc code and cfq.

cfq_io_context represents association between an io_context and a
request_queue, which is a concept useful outside of cfq, but it also
contains fields which are useful only to cfq.

This patch takes out generic part and put it into io_cq (io
context-queue) and the rest into cfq_io_cq (cic moniker remains the
same) which contains io_cq.  The following changes are made together.

* cfq_ttime and cfq_io_cq now live in cfq-iosched.c.

* All related fields, functions and constants are renamed accordingly.

* ioc->ioc_data is now "struct io_cq *" instead of "void *" and
  renamed to icq_hint.

This prepares for io_context API cleanup.  Documentation is currently
sparse.  It will be added later.

Changes in this patch are mechanical and don't cause functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c           |  58 ++++++-----
 block/cfq-iosched.c       | 248 +++++++++++++++++++++++++---------------------
 include/linux/iocontext.h |  43 +++-----
 3 files changed, 175 insertions(+), 174 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index e23c797b468..dc5e69d335a 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -46,7 +46,7 @@ EXPORT_SYMBOL(get_io_context);
 
 /*
  * Slow path for ioc release in put_io_context().  Performs double-lock
- * dancing to unlink all cic's and then frees ioc.
+ * dancing to unlink all icq's and then frees ioc.
  */
 static void ioc_release_fn(struct work_struct *work)
 {
@@ -56,11 +56,10 @@ static void ioc_release_fn(struct work_struct *work)
 
 	spin_lock_irq(&ioc->lock);
 
-	while (!hlist_empty(&ioc->cic_list)) {
-		struct cfq_io_context *cic = hlist_entry(ioc->cic_list.first,
-							 struct cfq_io_context,
-							 cic_list);
-		struct request_queue *this_q = cic->q;
+	while (!hlist_empty(&ioc->icq_list)) {
+		struct io_cq *icq = hlist_entry(ioc->icq_list.first,
+						struct io_cq, ioc_node);
+		struct request_queue *this_q = icq->q;
 
 		if (this_q != last_q) {
 			/*
@@ -89,8 +88,8 @@ static void ioc_release_fn(struct work_struct *work)
 			continue;
 		}
 		ioc_release_depth_inc(this_q);
-		cic->exit(cic);
-		cic->release(cic);
+		icq->exit(icq);
+		icq->release(icq);
 		ioc_release_depth_dec(this_q);
 	}
 
@@ -131,10 +130,10 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 		return;
 
 	/*
-	 * Destroy @ioc.  This is a bit messy because cic's are chained
+	 * Destroy @ioc.  This is a bit messy because icq's are chained
 	 * from both ioc and queue, and ioc->lock nests inside queue_lock.
-	 * The inner ioc->lock should be held to walk our cic_list and then
-	 * for each cic the outer matching queue_lock should be grabbed.
+	 * The inner ioc->lock should be held to walk our icq_list and then
+	 * for each icq the outer matching queue_lock should be grabbed.
 	 * ie. We need to do reverse-order double lock dancing.
 	 *
 	 * Another twist is that we are often called with one of the
@@ -153,11 +152,10 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 	spin_lock_irqsave_nested(&ioc->lock, flags,
 				 ioc_release_depth(locked_q));
 
-	while (!hlist_empty(&ioc->cic_list)) {
-		struct cfq_io_context *cic = hlist_entry(ioc->cic_list.first,
-							 struct cfq_io_context,
-							 cic_list);
-		struct request_queue *this_q = cic->q;
+	while (!hlist_empty(&ioc->icq_list)) {
+		struct io_cq *icq = hlist_entry(ioc->icq_list.first,
+						struct io_cq, ioc_node);
+		struct request_queue *this_q = icq->q;
 
 		if (this_q != last_q) {
 			if (last_q && last_q != locked_q)
@@ -170,8 +168,8 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 			continue;
 		}
 		ioc_release_depth_inc(this_q);
-		cic->exit(cic);
-		cic->release(cic);
+		icq->exit(icq);
+		icq->release(icq);
 		ioc_release_depth_dec(this_q);
 	}
 
@@ -180,8 +178,8 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 
 	spin_unlock_irqrestore(&ioc->lock, flags);
 
-	/* if no cic's left, we're done; otherwise, kick release_work */
-	if (hlist_empty(&ioc->cic_list))
+	/* if no icq is left, we're done; otherwise, kick release_work */
+	if (hlist_empty(&ioc->icq_list))
 		kmem_cache_free(iocontext_cachep, ioc);
 	else
 		schedule_work(&ioc->release_work);
@@ -219,8 +217,8 @@ void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
 	atomic_long_set(&ioc->refcount, 1);
 	atomic_set(&ioc->nr_tasks, 1);
 	spin_lock_init(&ioc->lock);
-	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
-	INIT_HLIST_HEAD(&ioc->cic_list);
+	INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->icq_list);
 	INIT_WORK(&ioc->release_work, ioc_release_fn);
 
 	/* try to install, somebody might already have beaten us to it */
@@ -270,11 +268,11 @@ EXPORT_SYMBOL(get_task_io_context);
 
 void ioc_set_changed(struct io_context *ioc, int which)
 {
-	struct cfq_io_context *cic;
+	struct io_cq *icq;
 	struct hlist_node *n;
 
-	hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
-		set_bit(which, &cic->changed);
+	hlist_for_each_entry(icq, n, &ioc->icq_list, ioc_node)
+		set_bit(which, &icq->changed);
 }
 
 /**
@@ -282,8 +280,8 @@ void ioc_set_changed(struct io_context *ioc, int which)
  * @ioc: io_context of interest
  * @ioprio: new ioprio
  *
- * @ioc's ioprio has changed to @ioprio.  Set %CIC_IOPRIO_CHANGED for all
- * cic's.  iosched is responsible for checking the bit and applying it on
+ * @ioc's ioprio has changed to @ioprio.  Set %ICQ_IOPRIO_CHANGED for all
+ * icq's.  iosched is responsible for checking the bit and applying it on
  * request issue path.
  */
 void ioc_ioprio_changed(struct io_context *ioc, int ioprio)
@@ -292,7 +290,7 @@ void ioc_ioprio_changed(struct io_context *ioc, int ioprio)
 
 	spin_lock_irqsave(&ioc->lock, flags);
 	ioc->ioprio = ioprio;
-	ioc_set_changed(ioc, CIC_IOPRIO_CHANGED);
+	ioc_set_changed(ioc, ICQ_IOPRIO_CHANGED);
 	spin_unlock_irqrestore(&ioc->lock, flags);
 }
 
@@ -300,7 +298,7 @@ void ioc_ioprio_changed(struct io_context *ioc, int ioprio)
  * ioc_cgroup_changed - notify cgroup change
  * @ioc: io_context of interest
  *
- * @ioc's cgroup has changed.  Set %CIC_CGROUP_CHANGED for all cic's.
+ * @ioc's cgroup has changed.  Set %ICQ_CGROUP_CHANGED for all icq's.
  * iosched is responsible for checking the bit and applying it on request
  * issue path.
  */
@@ -309,7 +307,7 @@ void ioc_cgroup_changed(struct io_context *ioc)
 	unsigned long flags;
 
 	spin_lock_irqsave(&ioc->lock, flags);
-	ioc_set_changed(ioc, CIC_CGROUP_CHANGED);
+	ioc_set_changed(ioc, ICQ_CGROUP_CHANGED);
 	spin_unlock_irqrestore(&ioc->lock, flags);
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5f7e4d16140..d2f16fcdec7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,13 +54,12 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
-#define RQ_CIC(rq)		\
-	((struct cfq_io_context *) (rq)->elevator_private[0])
+#define RQ_CIC(rq)		icq_to_cic((rq)->elevator_private[0])
 #define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
 #define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private[2])
 
 static struct kmem_cache *cfq_pool;
-static struct kmem_cache *cfq_ioc_pool;
+static struct kmem_cache *cfq_icq_pool;
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
 #define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
@@ -69,6 +68,14 @@ static struct kmem_cache *cfq_ioc_pool;
 #define sample_valid(samples)	((samples) > 80)
 #define rb_entry_cfqg(node)	rb_entry((node), struct cfq_group, rb_node)
 
+struct cfq_ttime {
+	unsigned long last_end_request;
+
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+};
+
 /*
  * Most of our rbtree usage is for sorting with min extraction, so
  * if we cache the leftmost node we don't have to walk down the tree
@@ -210,6 +217,12 @@ struct cfq_group {
 	struct cfq_ttime ttime;
 };
 
+struct cfq_io_cq {
+	struct io_cq		icq;		/* must be the first member */
+	struct cfq_queue	*cfqq[2];
+	struct cfq_ttime	ttime;
+};
+
 /*
  * Per block device queue structure
  */
@@ -261,7 +274,7 @@ struct cfq_data {
 	struct work_struct unplug_work;
 
 	struct cfq_queue *active_queue;
-	struct cfq_io_context *active_cic;
+	struct cfq_io_cq *active_cic;
 
 	/*
 	 * async queue for each priority case
@@ -284,7 +297,7 @@ struct cfq_data {
 	unsigned int cfq_group_idle;
 	unsigned int cfq_latency;
 
-	struct list_head cic_list;
+	struct list_head icq_list;
 
 	/*
 	 * Fallback dummy cfqq for extreme OOM conditions
@@ -457,24 +470,28 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
 				       struct io_context *, gfp_t);
-static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
-						struct io_context *);
+static struct cfq_io_cq *cfq_cic_lookup(struct cfq_data *, struct io_context *);
 
-static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
-					    bool is_sync)
+static inline struct cfq_io_cq *icq_to_cic(struct io_cq *icq)
+{
+	/* cic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct cfq_io_cq, icq);
+}
+
+static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_cq *cic, bool is_sync)
 {
 	return cic->cfqq[is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, bool is_sync)
+static inline void cic_set_cfqq(struct cfq_io_cq *cic, struct cfq_queue *cfqq,
+				bool is_sync)
 {
 	cic->cfqq[is_sync] = cfqq;
 }
 
-static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
+static inline struct cfq_data *cic_to_cfqd(struct cfq_io_cq *cic)
 {
-	return cic->q->elevator->elevator_data;
+	return cic->icq.q->elevator->elevator_data;
 }
 
 /*
@@ -1541,7 +1558,7 @@ static struct request *
 cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 {
 	struct task_struct *tsk = current;
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 	struct cfq_queue *cfqq;
 
 	cic = cfq_cic_lookup(cfqd, tsk->io_context);
@@ -1655,7 +1672,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 	struct cfq_queue *cfqq;
 
 	/*
@@ -1671,7 +1688,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	 * and %current are guaranteed to be equal.  Avoid lookup which
 	 * requires queue_lock by using @rq's cic.
 	 */
-	if (current->io_context == RQ_CIC(rq)->ioc) {
+	if (current->io_context == RQ_CIC(rq)->icq.ioc) {
 		cic = RQ_CIC(rq);
 	} else {
 		cic = cfq_cic_lookup(cfqd, current->io_context);
@@ -1761,7 +1778,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfqd->active_queue = NULL;
 
 	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc, cfqd->queue);
+		put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue);
 		cfqd->active_cic = NULL;
 	}
 }
@@ -1981,7 +1998,7 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 {
 	struct cfq_queue *cfqq = cfqd->active_queue;
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 	unsigned long sl, group_idle = 0;
 
 	/*
@@ -2016,7 +2033,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	 * task has exited, don't wait
 	 */
 	cic = cfqd->active_cic;
-	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
+	if (!cic || !atomic_read(&cic->icq.ioc->nr_tasks))
 		return;
 
 	/*
@@ -2567,9 +2584,9 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_dispatch_insert(cfqd->queue, rq);
 
 	if (!cfqd->active_cic) {
-		struct cfq_io_context *cic = RQ_CIC(rq);
+		struct cfq_io_cq *cic = RQ_CIC(rq);
 
-		atomic_long_inc(&cic->ioc->refcount);
+		atomic_long_inc(&cic->icq.ioc->refcount);
 		cfqd->active_cic = cic;
 	}
 
@@ -2652,24 +2669,24 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	cfq_put_cfqg(cfqg);
 }
 
-static void cfq_cic_free_rcu(struct rcu_head *head)
+static void cfq_icq_free_rcu(struct rcu_head *head)
 {
-	kmem_cache_free(cfq_ioc_pool,
-			container_of(head, struct cfq_io_context, rcu_head));
+	kmem_cache_free(cfq_icq_pool,
+			icq_to_cic(container_of(head, struct io_cq, rcu_head)));
 }
 
-static void cfq_cic_free(struct cfq_io_context *cic)
+static void cfq_icq_free(struct io_cq *icq)
 {
-	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
+	call_rcu(&icq->rcu_head, cfq_icq_free_rcu);
 }
 
-static void cfq_release_cic(struct cfq_io_context *cic)
+static void cfq_release_icq(struct io_cq *icq)
 {
-	struct io_context *ioc = cic->ioc;
+	struct io_context *ioc = icq->ioc;
 
-	radix_tree_delete(&ioc->radix_root, cic->q->id);
-	hlist_del(&cic->cic_list);
-	cfq_cic_free(cic);
+	radix_tree_delete(&ioc->icq_tree, icq->q->id);
+	hlist_del(&icq->ioc_node);
+	cfq_icq_free(icq);
 }
 
 static void cfq_put_cooperator(struct cfq_queue *cfqq)
@@ -2705,20 +2722,21 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_put_queue(cfqq);
 }
 
-static void cfq_exit_cic(struct cfq_io_context *cic)
+static void cfq_exit_icq(struct io_cq *icq)
 {
+	struct cfq_io_cq *cic = icq_to_cic(icq);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
-	struct io_context *ioc = cic->ioc;
+	struct io_context *ioc = icq->ioc;
 
-	list_del_init(&cic->queue_list);
+	list_del_init(&icq->q_node);
 
 	/*
-	 * Both setting lookup hint to and clearing it from @cic are done
-	 * under queue_lock.  If it's not pointing to @cic now, it never
+	 * Both setting lookup hint to and clearing it from @icq are done
+	 * under queue_lock.  If it's not pointing to @icq now, it never
 	 * will.  Hint assignment itself can race safely.
 	 */
-	if (rcu_dereference_raw(ioc->ioc_data) == cic)
-		rcu_assign_pointer(ioc->ioc_data, NULL);
+	if (rcu_dereference_raw(ioc->icq_hint) == icq)
+		rcu_assign_pointer(ioc->icq_hint, NULL);
 
 	if (cic->cfqq[BLK_RW_ASYNC]) {
 		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]);
@@ -2731,19 +2749,18 @@ static void cfq_exit_cic(struct cfq_io_context *cic)
 	}
 }
 
-static struct cfq_io_context *
-cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
+static struct cfq_io_cq *cfq_alloc_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 
-	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
+	cic = kmem_cache_alloc_node(cfq_icq_pool, gfp_mask | __GFP_ZERO,
 							cfqd->queue->node);
 	if (cic) {
 		cic->ttime.last_end_request = jiffies;
-		INIT_LIST_HEAD(&cic->queue_list);
-		INIT_HLIST_NODE(&cic->cic_list);
-		cic->exit = cfq_exit_cic;
-		cic->release = cfq_release_cic;
+		INIT_LIST_HEAD(&cic->icq.q_node);
+		INIT_HLIST_NODE(&cic->icq.ioc_node);
+		cic->icq.exit = cfq_exit_icq;
+		cic->icq.release = cfq_release_icq;
 	}
 
 	return cic;
@@ -2791,7 +2808,7 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
-static void changed_ioprio(struct cfq_io_context *cic)
+static void changed_ioprio(struct cfq_io_cq *cic)
 {
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
@@ -2802,7 +2819,7 @@ static void changed_ioprio(struct cfq_io_context *cic)
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->icq.ioc,
 						GFP_ATOMIC);
 		if (new_cfqq) {
 			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
@@ -2836,7 +2853,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-static void changed_cgroup(struct cfq_io_context *cic)
+static void changed_cgroup(struct cfq_io_cq *cic)
 {
 	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
@@ -2864,7 +2881,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 	struct cfq_group *cfqg;
 
 retry:
@@ -2956,56 +2973,57 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 }
 
 /**
- * cfq_cic_lookup - lookup cfq_io_context
+ * cfq_cic_lookup - lookup cfq_io_cq
  * @cfqd: the associated cfq_data
  * @ioc: the associated io_context
  *
- * Look up cfq_io_context associated with @cfqd - @ioc pair.  Must be
- * called with queue_lock held.
+ * Look up cfq_io_cq associated with @cfqd - @ioc pair.  Must be called
+ * with queue_lock held.
  */
-static struct cfq_io_context *
+static struct cfq_io_cq *
 cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 {
 	struct request_queue *q = cfqd->queue;
-	struct cfq_io_context *cic;
+	struct io_cq *icq;
 
 	lockdep_assert_held(cfqd->queue->queue_lock);
 	if (unlikely(!ioc))
 		return NULL;
 
 	/*
-	 * cic's are indexed from @ioc using radix tree and hint pointer,
+	 * icq's are indexed from @ioc using radix tree and hint pointer,
 	 * both of which are protected with RCU.  All removals are done
 	 * holding both q and ioc locks, and we're holding q lock - if we
-	 * find a cic which points to us, it's guaranteed to be valid.
+	 * find a icq which points to us, it's guaranteed to be valid.
 	 */
 	rcu_read_lock();
-	cic = rcu_dereference(ioc->ioc_data);
-	if (cic && cic->q == q)
+	icq = rcu_dereference(ioc->icq_hint);
+	if (icq && icq->q == q)
 		goto out;
 
-	cic = radix_tree_lookup(&ioc->radix_root, cfqd->queue->id);
-	if (cic && cic->q == q)
-		rcu_assign_pointer(ioc->ioc_data, cic);	/* allowed to race */
+	icq = radix_tree_lookup(&ioc->icq_tree, cfqd->queue->id);
+	if (icq && icq->q == q)
+		rcu_assign_pointer(ioc->icq_hint, icq);	/* allowed to race */
 	else
-		cic = NULL;
+		icq = NULL;
 out:
 	rcu_read_unlock();
-	return cic;
+	return icq_to_cic(icq);
 }
 
 /**
- * cfq_create_cic - create and link a cfq_io_context
+ * cfq_create_cic - create and link a cfq_io_cq
  * @cfqd: cfqd of interest
  * @gfp_mask: allocation mask
  *
- * Make sure cfq_io_context linking %current->io_context and @cfqd exists.
- * If ioc and/or cic doesn't exist, they will be created using @gfp_mask.
+ * Make sure cfq_io_cq linking %current->io_context and @cfqd exists.  If
+ * ioc and/or cic doesn't exist, they will be created using @gfp_mask.
  */
 static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct request_queue *q = cfqd->queue;
-	struct cfq_io_context *cic = NULL;
+	struct io_cq *icq = NULL;
+	struct cfq_io_cq *cic;
 	struct io_context *ioc;
 	int ret = -ENOMEM;
 
@@ -3016,26 +3034,27 @@ static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 	if (!ioc)
 		goto out;
 
-	cic = cfq_alloc_io_context(cfqd, gfp_mask);
+	cic = cfq_alloc_cic(cfqd, gfp_mask);
 	if (!cic)
 		goto out;
+	icq = &cic->icq;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (ret)
 		goto out;
 
-	cic->ioc = ioc;
-	cic->q = cfqd->queue;
+	icq->ioc = ioc;
+	icq->q = cfqd->queue;
 
-	/* lock both q and ioc and try to link @cic */
+	/* lock both q and ioc and try to link @icq */
 	spin_lock_irq(q->queue_lock);
 	spin_lock(&ioc->lock);
 
-	ret = radix_tree_insert(&ioc->radix_root, q->id, cic);
+	ret = radix_tree_insert(&ioc->icq_tree, q->id, icq);
 	if (likely(!ret)) {
-		hlist_add_head(&cic->cic_list, &ioc->cic_list);
-		list_add(&cic->queue_list, &cfqd->cic_list);
-		cic = NULL;
+		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
+		list_add(&icq->q_node, &cfqd->icq_list);
+		icq = NULL;
 	} else if (ret == -EEXIST) {
 		/* someone else already did it */
 		ret = 0;
@@ -3047,29 +3066,28 @@ static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 	radix_tree_preload_end();
 out:
 	if (ret)
-		printk(KERN_ERR "cfq: cic link failed!\n");
-	if (cic)
-		cfq_cic_free(cic);
+		printk(KERN_ERR "cfq: icq link failed!\n");
+	if (icq)
+		cfq_icq_free(icq);
 	return ret;
 }
 
 /**
- * cfq_get_io_context - acquire cfq_io_context and bump refcnt on io_context
+ * cfq_get_cic - acquire cfq_io_cq and bump refcnt on io_context
  * @cfqd: cfqd to setup cic for
  * @gfp_mask: allocation mask
  *
- * Return cfq_io_context associating @cfqd and %current->io_context and
+ * Return cfq_io_cq associating @cfqd and %current->io_context and
  * bump refcnt on io_context.  If ioc or cic doesn't exist, they're created
  * using @gfp_mask.
  *
  * Must be called under queue_lock which may be released and re-acquired.
  * This function also may sleep depending on @gfp_mask.
  */
-static struct cfq_io_context *
-cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
+static struct cfq_io_cq *cfq_get_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct request_queue *q = cfqd->queue;
-	struct cfq_io_context *cic = NULL;
+	struct cfq_io_cq *cic = NULL;
 	struct io_context *ioc;
 	int err;
 
@@ -3095,11 +3113,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 	/* bump @ioc's refcnt and handle changed notifications */
 	get_io_context(ioc);
 
-	if (unlikely(cic->changed)) {
-		if (test_and_clear_bit(CIC_IOPRIO_CHANGED, &cic->changed))
+	if (unlikely(cic->icq.changed)) {
+		if (test_and_clear_bit(ICQ_IOPRIO_CHANGED, &cic->icq.changed))
 			changed_ioprio(cic);
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-		if (test_and_clear_bit(CIC_CGROUP_CHANGED, &cic->changed))
+		if (test_and_clear_bit(ICQ_CGROUP_CHANGED, &cic->icq.changed))
 			changed_cgroup(cic);
 #endif
 	}
@@ -3120,7 +3138,7 @@ __cfq_update_io_thinktime(struct cfq_ttime *ttime, unsigned long slice_idle)
 
 static void
 cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-	struct cfq_io_context *cic)
+			struct cfq_io_cq *cic)
 {
 	if (cfq_cfqq_sync(cfqq)) {
 		__cfq_update_io_thinktime(&cic->ttime, cfqd->cfq_slice_idle);
@@ -3158,7 +3176,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  */
 static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+		       struct cfq_io_cq *cic)
 {
 	int old_idle, enable_idle;
 
@@ -3175,8 +3193,9 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
 		enable_idle = 0;
-	else if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
+	else if (!atomic_read(&cic->icq.ioc->nr_tasks) ||
+		 !cfqd->cfq_slice_idle ||
+		 (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime.ttime_samples)) {
 		if (cic->ttime.ttime_mean > cfqd->cfq_slice_idle)
@@ -3308,7 +3327,7 @@ static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		struct request *rq)
 {
-	struct cfq_io_context *cic = RQ_CIC(rq);
+	struct cfq_io_cq *cic = RQ_CIC(rq);
 
 	cfqd->rq_queued++;
 	if (rq->cmd_flags & REQ_PRIO)
@@ -3361,7 +3380,7 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
 	cfq_log_cfqq(cfqd, cfqq, "insert_request");
-	cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
+	cfq_init_prio_data(cfqq, RQ_CIC(rq)->icq.ioc);
 
 	rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
 	list_add_tail(&rq->queuelist, &cfqq->fifo);
@@ -3411,7 +3430,7 @@ static void cfq_update_hw_tag(struct cfq_data *cfqd)
 
 static bool cfq_should_wait_busy(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	struct cfq_io_context *cic = cfqd->active_cic;
+	struct cfq_io_cq *cic = cfqd->active_cic;
 
 	/* If the queue already has requests, don't wait */
 	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
@@ -3548,7 +3567,7 @@ static int cfq_may_queue(struct request_queue *q, int rw)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct task_struct *tsk = current;
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 	struct cfq_queue *cfqq;
 
 	/*
@@ -3563,7 +3582,7 @@ static int cfq_may_queue(struct request_queue *q, int rw)
 
 	cfqq = cic_to_cfqq(cic, rw_is_sync(rw));
 	if (cfqq) {
-		cfq_init_prio_data(cfqq, cic->ioc);
+		cfq_init_prio_data(cfqq, cic->icq.ioc);
 
 		return __cfq_may_queue(cfqq);
 	}
@@ -3584,7 +3603,7 @@ static void cfq_put_request(struct request *rq)
 		BUG_ON(!cfqq->allocated[rw]);
 		cfqq->allocated[rw]--;
 
-		put_io_context(RQ_CIC(rq)->ioc, cfqq->cfqd->queue);
+		put_io_context(RQ_CIC(rq)->icq.ioc, cfqq->cfqd->queue);
 
 		rq->elevator_private[0] = NULL;
 		rq->elevator_private[1] = NULL;
@@ -3598,7 +3617,7 @@ static void cfq_put_request(struct request *rq)
 }
 
 static struct cfq_queue *
-cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
+cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_cq *cic,
 		struct cfq_queue *cfqq)
 {
 	cfq_log_cfqq(cfqd, cfqq, "merging with queue %p", cfqq->new_cfqq);
@@ -3613,7 +3632,7 @@ cfq_merge_cfqqs(struct cfq_data *cfqd, struct cfq_io_context *cic,
  * was the last process referring to said cfqq.
  */
 static struct cfq_queue *
-split_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq)
+split_cfqq(struct cfq_io_cq *cic, struct cfq_queue *cfqq)
 {
 	if (cfqq_process_refs(cfqq) == 1) {
 		cfqq->pid = current->pid;
@@ -3636,7 +3655,7 @@ static int
 cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
-	struct cfq_io_context *cic;
+	struct cfq_io_cq *cic;
 	const int rw = rq_data_dir(rq);
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
@@ -3644,14 +3663,14 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	spin_lock_irq(q->queue_lock);
-	cic = cfq_get_io_context(cfqd, gfp_mask);
+	cic = cfq_get_cic(cfqd, gfp_mask);
 	if (!cic)
 		goto queue_fail;
 
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, is_sync, cic->icq.ioc, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	} else {
 		/*
@@ -3677,7 +3696,7 @@ new_queue:
 	cfqq->allocated[rw]++;
 
 	cfqq->ref++;
-	rq->elevator_private[0] = cic;
+	rq->elevator_private[0] = &cic->icq;
 	rq->elevator_private[1] = cfqq;
 	rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
 	spin_unlock_irq(q->queue_lock);
@@ -3791,15 +3810,14 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	if (cfqd->active_queue)
 		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
 
-	while (!list_empty(&cfqd->cic_list)) {
-		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
-							struct cfq_io_context,
-							queue_list);
-		struct io_context *ioc = cic->ioc;
+	while (!list_empty(&cfqd->icq_list)) {
+		struct io_cq *icq = list_entry(cfqd->icq_list.next,
+					       struct io_cq, q_node);
+		struct io_context *ioc = icq->ioc;
 
 		spin_lock(&ioc->lock);
-		cfq_exit_cic(cic);
-		cfq_release_cic(cic);
+		cfq_exit_icq(icq);
+		cfq_release_icq(icq);
 		spin_unlock(&ioc->lock);
 	}
 
@@ -3904,7 +3922,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->oom_cfqq.ref++;
 	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
 
-	INIT_LIST_HEAD(&cfqd->cic_list);
+	INIT_LIST_HEAD(&cfqd->icq_list);
 
 	cfqd->queue = q;
 
@@ -3942,8 +3960,8 @@ static void cfq_slab_kill(void)
 	 */
 	if (cfq_pool)
 		kmem_cache_destroy(cfq_pool);
-	if (cfq_ioc_pool)
-		kmem_cache_destroy(cfq_ioc_pool);
+	if (cfq_icq_pool)
+		kmem_cache_destroy(cfq_icq_pool);
 }
 
 static int __init cfq_slab_setup(void)
@@ -3952,8 +3970,8 @@ static int __init cfq_slab_setup(void)
 	if (!cfq_pool)
 		goto fail;
 
-	cfq_ioc_pool = KMEM_CACHE(cfq_io_context, 0);
-	if (!cfq_ioc_pool)
+	cfq_icq_pool = KMEM_CACHE(cfq_io_cq, 0);
+	if (!cfq_icq_pool)
 		goto fail;
 
 	return 0;
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b2b75a54f25..d15ca6591f9 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -5,38 +5,23 @@
 #include <linux/rcupdate.h>
 #include <linux/workqueue.h>
 
-struct cfq_queue;
-struct cfq_ttime {
-	unsigned long last_end_request;
-
-	unsigned long ttime_total;
-	unsigned long ttime_samples;
-	unsigned long ttime_mean;
-};
-
 enum {
-	CIC_IOPRIO_CHANGED,
-	CIC_CGROUP_CHANGED,
+	ICQ_IOPRIO_CHANGED,
+	ICQ_CGROUP_CHANGED,
 };
 
-struct cfq_io_context {
-	struct request_queue *q;
-
-	struct cfq_queue *cfqq[2];
-
-	struct io_context *ioc;
-
-	struct cfq_ttime ttime;
-
-	struct list_head queue_list;
-	struct hlist_node cic_list;
+struct io_cq {
+	struct request_queue	*q;
+	struct io_context	*ioc;
 
-	unsigned long changed;
+	struct list_head	q_node;
+	struct hlist_node	ioc_node;
 
-	void (*exit)(struct cfq_io_context *);
-	void (*release)(struct cfq_io_context *);
+	unsigned long		changed;
+	struct rcu_head		rcu_head;
 
-	struct rcu_head rcu_head;
+	void (*exit)(struct io_cq *);
+	void (*release)(struct io_cq *);
 };
 
 /*
@@ -58,9 +43,9 @@ struct io_context {
 	int nr_batch_requests;     /* Number of requests left in the batch */
 	unsigned long last_waited; /* Time last woken after wait for request */
 
-	struct radix_tree_root radix_root;
-	struct hlist_head cic_list;
-	void __rcu *ioc_data;
+	struct radix_tree_root	icq_tree;
+	struct io_cq __rcu	*icq_hint;
+	struct hlist_head	icq_list;
 
 	struct work_struct release_work;
 };
-- 
cgit v1.2.3-70-g09d2


From a612fddf0d8090f2877305c9168b6c1a34fb5d90 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:41 +0100
Subject: block, cfq: move cfqd->icq_list to request_queue and add
 request->elv.icq

Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.

* Move cfqd->icq_list to request_queue->icq_list

* Make request explicitly point to icq instead of through elevator
  private data.  ->elevator_private[3] is replaced with sub struct elv
  which contains icq pointer and priv[2].  cfq is updated accordingly.

* Meaningless clearing of ->elevator_private[0] removed from
  elv_set_request().  At that point in code, the field was guaranteed
  to be %NULL anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c       |  1 +
 block/cfq-iosched.c    | 28 +++++++++++-----------------
 block/elevator.c       |  2 --
 include/linux/blkdev.h | 10 ++++++++--
 4 files changed, 20 insertions(+), 21 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 6804fdf27ef..3c26c7f4870 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -497,6 +497,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 		    laptop_mode_timer_fn, (unsigned long) q);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
+	INIT_LIST_HEAD(&q->icq_list);
 	INIT_LIST_HEAD(&q->flush_queue[0]);
 	INIT_LIST_HEAD(&q->flush_queue[1]);
 	INIT_LIST_HEAD(&q->flush_data_in_flight);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d2f16fcdec7..9bc5ecc1b33 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
-#define RQ_CIC(rq)		icq_to_cic((rq)->elevator_private[0])
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
-#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private[2])
+#define RQ_CIC(rq)		icq_to_cic((rq)->elv.icq)
+#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elv.priv[0])
+#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elv.priv[1])
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_icq_pool;
@@ -297,8 +297,6 @@ struct cfq_data {
 	unsigned int cfq_group_idle;
 	unsigned int cfq_latency;
 
-	struct list_head icq_list;
-
 	/*
 	 * Fallback dummy cfqq for extreme OOM conditions
 	 */
@@ -3053,7 +3051,7 @@ static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 	ret = radix_tree_insert(&ioc->icq_tree, q->id, icq);
 	if (likely(!ret)) {
 		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
-		list_add(&icq->q_node, &cfqd->icq_list);
+		list_add(&icq->q_node, &q->icq_list);
 		icq = NULL;
 	} else if (ret == -EEXIST) {
 		/* someone else already did it */
@@ -3605,12 +3603,10 @@ static void cfq_put_request(struct request *rq)
 
 		put_io_context(RQ_CIC(rq)->icq.ioc, cfqq->cfqd->queue);
 
-		rq->elevator_private[0] = NULL;
-		rq->elevator_private[1] = NULL;
-
 		/* Put down rq reference on cfqg */
 		cfq_put_cfqg(RQ_CFQG(rq));
-		rq->elevator_private[2] = NULL;
+		rq->elv.priv[0] = NULL;
+		rq->elv.priv[1] = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -3696,9 +3692,9 @@ new_queue:
 	cfqq->allocated[rw]++;
 
 	cfqq->ref++;
-	rq->elevator_private[0] = &cic->icq;
-	rq->elevator_private[1] = cfqq;
-	rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
+	rq->elv.icq = &cic->icq;
+	rq->elv.priv[0] = cfqq;
+	rq->elv.priv[1] = cfq_ref_get_cfqg(cfqq->cfqg);
 	spin_unlock_irq(q->queue_lock);
 	return 0;
 
@@ -3810,8 +3806,8 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	if (cfqd->active_queue)
 		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
 
-	while (!list_empty(&cfqd->icq_list)) {
-		struct io_cq *icq = list_entry(cfqd->icq_list.next,
+	while (!list_empty(&q->icq_list)) {
+		struct io_cq *icq = list_entry(q->icq_list.next,
 					       struct io_cq, q_node);
 		struct io_context *ioc = icq->ioc;
 
@@ -3922,8 +3918,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->oom_cfqq.ref++;
 	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
 
-	INIT_LIST_HEAD(&cfqd->icq_list);
-
 	cfqd->queue = q;
 
 	init_timer(&cfqd->idle_slice_timer);
diff --git a/block/elevator.c b/block/elevator.c
index 31ffe76aed3..c5c6214829c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -745,8 +745,6 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	if (e->type->ops.elevator_set_req_fn)
 		return e->type->ops.elevator_set_req_fn(q, rq, gfp_mask);
-
-	rq->elevator_private[0] = NULL;
 	return 0;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 65c2f8c7008..8bca04873f5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -111,10 +111,14 @@ struct request {
 	 * Three pointers are available for the IO schedulers, if they need
 	 * more they have to dynamically allocate it.  Flush requests are
 	 * never put on the IO scheduler. So let the flush fields share
-	 * space with the three elevator_private pointers.
+	 * space with the elevator data.
 	 */
 	union {
-		void *elevator_private[3];
+		struct {
+			struct io_cq		*icq;
+			void			*priv[2];
+		} elv;
+
 		struct {
 			unsigned int		seq;
 			struct list_head	list;
@@ -357,6 +361,8 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct list_head	timeout_list;
 
+	struct list_head	icq_list;
+
 	struct queue_limits	limits;
 
 	/*
-- 
cgit v1.2.3-70-g09d2


From 47fdd4ca96bf4b28ac4d05d7a6e382df31d3d758 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:42 +0100
Subject: block, cfq: move io_cq lookup to blk-ioc.c

Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.

Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes thin wrapper around
cfq_cic_lookup().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c     | 36 ++++++++++++++++++++++++++++++++++++
 block/blk.h         |  1 +
 block/cfq-iosched.c | 48 ++++++++----------------------------------------
 3 files changed, 45 insertions(+), 40 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index dc5e69d335a..87ecc98b8ad 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -266,6 +266,42 @@ struct io_context *get_task_io_context(struct task_struct *task,
 }
 EXPORT_SYMBOL(get_task_io_context);
 
+/**
+ * ioc_lookup_icq - lookup io_cq from ioc
+ * @ioc: the associated io_context
+ * @q: the associated request_queue
+ *
+ * Look up io_cq associated with @ioc - @q pair from @ioc.  Must be called
+ * with @q->queue_lock held.
+ */
+struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q)
+{
+	struct io_cq *icq;
+
+	lockdep_assert_held(q->queue_lock);
+
+	/*
+	 * icq's are indexed from @ioc using radix tree and hint pointer,
+	 * both of which are protected with RCU.  All removals are done
+	 * holding both q and ioc locks, and we're holding q lock - if we
+	 * find a icq which points to us, it's guaranteed to be valid.
+	 */
+	rcu_read_lock();
+	icq = rcu_dereference(ioc->icq_hint);
+	if (icq && icq->q == q)
+		goto out;
+
+	icq = radix_tree_lookup(&ioc->icq_tree, q->id);
+	if (icq && icq->q == q)
+		rcu_assign_pointer(ioc->icq_hint, icq);	/* allowed to race */
+	else
+		icq = NULL;
+out:
+	rcu_read_unlock();
+	return icq;
+}
+EXPORT_SYMBOL(ioc_lookup_icq);
+
 void ioc_set_changed(struct io_context *ioc, int which)
 {
 	struct io_cq *icq;
diff --git a/block/blk.h b/block/blk.h
index 4943770e079..3c510a4b505 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -199,6 +199,7 @@ static inline int blk_do_io_stat(struct request *rq)
  * Internal io_context interface
  */
 void get_io_context(struct io_context *ioc);
+struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
 
 void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_mask,
 				int node);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9bc5ecc1b33..048fa699adf 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -468,7 +468,6 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
 				       struct io_context *, gfp_t);
-static struct cfq_io_cq *cfq_cic_lookup(struct cfq_data *, struct io_context *);
 
 static inline struct cfq_io_cq *icq_to_cic(struct io_cq *icq)
 {
@@ -476,6 +475,14 @@ static inline struct cfq_io_cq *icq_to_cic(struct io_cq *icq)
 	return container_of(icq, struct cfq_io_cq, icq);
 }
 
+static inline struct cfq_io_cq *cfq_cic_lookup(struct cfq_data *cfqd,
+					       struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_cic(ioc_lookup_icq(ioc, cfqd->queue));
+	return NULL;
+}
+
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_cq *cic, bool is_sync)
 {
 	return cic->cfqq[is_sync];
@@ -2970,45 +2977,6 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 	return cfqq;
 }
 
-/**
- * cfq_cic_lookup - lookup cfq_io_cq
- * @cfqd: the associated cfq_data
- * @ioc: the associated io_context
- *
- * Look up cfq_io_cq associated with @cfqd - @ioc pair.  Must be called
- * with queue_lock held.
- */
-static struct cfq_io_cq *
-cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
-{
-	struct request_queue *q = cfqd->queue;
-	struct io_cq *icq;
-
-	lockdep_assert_held(cfqd->queue->queue_lock);
-	if (unlikely(!ioc))
-		return NULL;
-
-	/*
-	 * icq's are indexed from @ioc using radix tree and hint pointer,
-	 * both of which are protected with RCU.  All removals are done
-	 * holding both q and ioc locks, and we're holding q lock - if we
-	 * find a icq which points to us, it's guaranteed to be valid.
-	 */
-	rcu_read_lock();
-	icq = rcu_dereference(ioc->icq_hint);
-	if (icq && icq->q == q)
-		goto out;
-
-	icq = radix_tree_lookup(&ioc->icq_tree, cfqd->queue->id);
-	if (icq && icq->q == q)
-		rcu_assign_pointer(ioc->icq_hint, icq);	/* allowed to race */
-	else
-		icq = NULL;
-out:
-	rcu_read_unlock();
-	return icq_to_cic(icq);
-}
-
 /**
  * cfq_create_cic - create and link a cfq_io_cq
  * @cfqd: cfqd of interest
-- 
cgit v1.2.3-70-g09d2


From 3d3c2379feb177a5fd55bb0ed76776dc9d4f3243 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:42 +0100
Subject: block, cfq: move icq cache management to block core

Let elevators set ->icq_size and ->icq_align in elevator_type and
elv_register() and elv_unregister() respectively create and destroy
kmem_cache for icq.

* elv_register() now can return failure.  All callers updated.

* icq caches are automatically named "ELVNAME_io_cq".

* cfq_slab_setup/kill() are collapsed into cfq_init/exit().

* While at it, minor indentation change for iosched_cfq.elevator_name
  for consistency.

This will help moving icq management to block core.  This doesn't
introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c      | 48 +++++++++++++++---------------------------------
 block/deadline-iosched.c |  4 +---
 block/elevator.c         | 37 +++++++++++++++++++++++++++++++++++--
 block/noop-iosched.c     |  4 +---
 include/linux/elevator.h | 11 ++++++++++-
 5 files changed, 62 insertions(+), 42 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 048fa699adf..06e59abcb57 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3914,34 +3914,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	return cfqd;
 }
 
-static void cfq_slab_kill(void)
-{
-	/*
-	 * Caller already ensured that pending RCU callbacks are completed,
-	 * so we should have no busy allocations at this point.
-	 */
-	if (cfq_pool)
-		kmem_cache_destroy(cfq_pool);
-	if (cfq_icq_pool)
-		kmem_cache_destroy(cfq_icq_pool);
-}
-
-static int __init cfq_slab_setup(void)
-{
-	cfq_pool = KMEM_CACHE(cfq_queue, 0);
-	if (!cfq_pool)
-		goto fail;
-
-	cfq_icq_pool = KMEM_CACHE(cfq_io_cq, 0);
-	if (!cfq_icq_pool)
-		goto fail;
-
-	return 0;
-fail:
-	cfq_slab_kill();
-	return -ENOMEM;
-}
-
 /*
  * sysfs parts below -->
  */
@@ -4053,8 +4025,10 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 	},
+	.icq_size	=	sizeof(struct cfq_io_cq),
+	.icq_align	=	__alignof__(struct cfq_io_cq),
 	.elevator_attrs =	cfq_attrs,
-	.elevator_name =	"cfq",
+	.elevator_name	=	"cfq",
 	.elevator_owner =	THIS_MODULE,
 };
 
@@ -4072,6 +4046,8 @@ static struct blkio_policy_type blkio_policy_cfq;
 
 static int __init cfq_init(void)
 {
+	int ret;
+
 	/*
 	 * could be 0 on HZ < 1000 setups
 	 */
@@ -4086,10 +4062,17 @@ static int __init cfq_init(void)
 #else
 		cfq_group_idle = 0;
 #endif
-	if (cfq_slab_setup())
+	cfq_pool = KMEM_CACHE(cfq_queue, 0);
+	if (!cfq_pool)
 		return -ENOMEM;
 
-	elv_register(&iosched_cfq);
+	ret = elv_register(&iosched_cfq);
+	if (ret) {
+		kmem_cache_destroy(cfq_pool);
+		return ret;
+	}
+	cfq_icq_pool = iosched_cfq.icq_cache;
+
 	blkio_policy_register(&blkio_policy_cfq);
 
 	return 0;
@@ -4099,8 +4082,7 @@ static void __exit cfq_exit(void)
 {
 	blkio_policy_unregister(&blkio_policy_cfq);
 	elv_unregister(&iosched_cfq);
-	rcu_barrier();	/* make sure all cic RCU frees are complete */
-	cfq_slab_kill();
+	kmem_cache_destroy(cfq_pool);
 }
 
 module_init(cfq_init);
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c644137d9cd..7bf12d793fc 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -448,9 +448,7 @@ static struct elevator_type iosched_deadline = {
 
 static int __init deadline_init(void)
 {
-	elv_register(&iosched_deadline);
-
-	return 0;
+	return elv_register(&iosched_deadline);
 }
 
 static void __exit deadline_exit(void)
diff --git a/block/elevator.c b/block/elevator.c
index c5c6214829c..cca049fb45c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -886,15 +886,36 @@ void elv_unregister_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(elv_unregister_queue);
 
-void elv_register(struct elevator_type *e)
+int elv_register(struct elevator_type *e)
 {
 	char *def = "";
 
+	/* create icq_cache if requested */
+	if (e->icq_size) {
+		if (WARN_ON(e->icq_size < sizeof(struct io_cq)) ||
+		    WARN_ON(e->icq_align < __alignof__(struct io_cq)))
+			return -EINVAL;
+
+		snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
+			 "%s_io_cq", e->elevator_name);
+		e->icq_cache = kmem_cache_create(e->icq_cache_name, e->icq_size,
+						 e->icq_align, 0, NULL);
+		if (!e->icq_cache)
+			return -ENOMEM;
+	}
+
+	/* register, don't allow duplicate names */
 	spin_lock(&elv_list_lock);
-	BUG_ON(elevator_find(e->elevator_name));
+	if (elevator_find(e->elevator_name)) {
+		spin_unlock(&elv_list_lock);
+		if (e->icq_cache)
+			kmem_cache_destroy(e->icq_cache);
+		return -EBUSY;
+	}
 	list_add_tail(&e->list, &elv_list);
 	spin_unlock(&elv_list_lock);
 
+	/* print pretty message */
 	if (!strcmp(e->elevator_name, chosen_elevator) ||
 			(!*chosen_elevator &&
 			 !strcmp(e->elevator_name, CONFIG_DEFAULT_IOSCHED)))
@@ -902,14 +923,26 @@ void elv_register(struct elevator_type *e)
 
 	printk(KERN_INFO "io scheduler %s registered%s\n", e->elevator_name,
 								def);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(elv_register);
 
 void elv_unregister(struct elevator_type *e)
 {
+	/* unregister */
 	spin_lock(&elv_list_lock);
 	list_del_init(&e->list);
 	spin_unlock(&elv_list_lock);
+
+	/*
+	 * Destroy icq_cache if it exists.  icq's are RCU managed.  Make
+	 * sure all RCU operations are complete before proceeding.
+	 */
+	if (e->icq_cache) {
+		rcu_barrier();
+		kmem_cache_destroy(e->icq_cache);
+		e->icq_cache = NULL;
+	}
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 06389e9ef96..413a0b1d788 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -94,9 +94,7 @@ static struct elevator_type elevator_noop = {
 
 static int __init noop_init(void)
 {
-	elv_register(&elevator_noop);
-
-	return 0;
+	return elv_register(&elevator_noop);
 }
 
 static void __exit noop_exit(void)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 04958ef53e6..d3d3e28cbfd 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -78,10 +78,19 @@ struct elv_fs_entry {
  */
 struct elevator_type
 {
+	/* managed by elevator core */
+	struct kmem_cache *icq_cache;
+
+	/* fields provided by elevator implementation */
 	struct elevator_ops ops;
+	size_t icq_size;
+	size_t icq_align;
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+
+	/* managed by elevator core */
+	char icq_cache_name[ELV_NAME_MAX + 5];	/* elvname + "_io_cq" */
 	struct list_head list;
 };
 
@@ -127,7 +136,7 @@ extern void elv_drain_elevator(struct request_queue *);
 /*
  * io scheduler registration
  */
-extern void elv_register(struct elevator_type *);
+extern int elv_register(struct elevator_type *);
 extern void elv_unregister(struct elevator_type *);
 
 /*
-- 
cgit v1.2.3-70-g09d2


From 7e5a8794492e43e9eebb68a98a23be055888ccd0 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:42 +0100
Subject: block, cfq: move io_cq exit/release to blk-ioc.c

With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc.  The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.

Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.

Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released.  New field
io_cq->__rcu_icq_cache is added for this purpose.  As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c           | 76 ++++++++++++++++++++++++++++++++++++++++++-----
 block/blk-sysfs.c         |  6 +++-
 block/blk.h               |  1 +
 block/cfq-iosched.c       | 47 ++---------------------------
 block/elevator.c          |  3 +-
 include/linux/elevator.h  |  5 ++++
 include/linux/iocontext.h | 20 +++++++++----
 7 files changed, 97 insertions(+), 61 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 87ecc98b8ad..0910a5584d3 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -44,6 +44,51 @@ EXPORT_SYMBOL(get_io_context);
 #define ioc_release_depth_dec(q)	do { } while (0)
 #endif
 
+static void icq_free_icq_rcu(struct rcu_head *head)
+{
+	struct io_cq *icq = container_of(head, struct io_cq, __rcu_head);
+
+	kmem_cache_free(icq->__rcu_icq_cache, icq);
+}
+
+/*
+ * Exit and free an icq.  Called with both ioc and q locked.
+ */
+static void ioc_exit_icq(struct io_cq *icq)
+{
+	struct io_context *ioc = icq->ioc;
+	struct request_queue *q = icq->q;
+	struct elevator_type *et = q->elevator->type;
+
+	lockdep_assert_held(&ioc->lock);
+	lockdep_assert_held(q->queue_lock);
+
+	radix_tree_delete(&ioc->icq_tree, icq->q->id);
+	hlist_del_init(&icq->ioc_node);
+	list_del_init(&icq->q_node);
+
+	/*
+	 * Both setting lookup hint to and clearing it from @icq are done
+	 * under queue_lock.  If it's not pointing to @icq now, it never
+	 * will.  Hint assignment itself can race safely.
+	 */
+	if (rcu_dereference_raw(ioc->icq_hint) == icq)
+		rcu_assign_pointer(ioc->icq_hint, NULL);
+
+	if (et->ops.elevator_exit_icq_fn) {
+		ioc_release_depth_inc(q);
+		et->ops.elevator_exit_icq_fn(icq);
+		ioc_release_depth_dec(q);
+	}
+
+	/*
+	 * @icq->q might have gone away by the time RCU callback runs
+	 * making it impossible to determine icq_cache.  Record it in @icq.
+	 */
+	icq->__rcu_icq_cache = et->icq_cache;
+	call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
+}
+
 /*
  * Slow path for ioc release in put_io_context().  Performs double-lock
  * dancing to unlink all icq's and then frees ioc.
@@ -87,10 +132,7 @@ static void ioc_release_fn(struct work_struct *work)
 			spin_lock(&ioc->lock);
 			continue;
 		}
-		ioc_release_depth_inc(this_q);
-		icq->exit(icq);
-		icq->release(icq);
-		ioc_release_depth_dec(this_q);
+		ioc_exit_icq(icq);
 	}
 
 	if (last_q) {
@@ -167,10 +209,7 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 			last_q = this_q;
 			continue;
 		}
-		ioc_release_depth_inc(this_q);
-		icq->exit(icq);
-		icq->release(icq);
-		ioc_release_depth_dec(this_q);
+		ioc_exit_icq(icq);
 	}
 
 	if (last_q && last_q != locked_q)
@@ -203,6 +242,27 @@ void exit_io_context(struct task_struct *task)
 	put_io_context(ioc, NULL);
 }
 
+/**
+ * ioc_clear_queue - break any ioc association with the specified queue
+ * @q: request_queue being cleared
+ *
+ * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
+ */
+void ioc_clear_queue(struct request_queue *q)
+{
+	lockdep_assert_held(q->queue_lock);
+
+	while (!list_empty(&q->icq_list)) {
+		struct io_cq *icq = list_entry(q->icq_list.next,
+					       struct io_cq, q_node);
+		struct io_context *ioc = icq->ioc;
+
+		spin_lock(&ioc->lock);
+		ioc_exit_icq(icq);
+		spin_unlock(&ioc->lock);
+	}
+}
+
 void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
 				int node)
 {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5b4b4ab5e78..cf150011d80 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -479,8 +479,12 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_sync_queue(q);
 
-	if (q->elevator)
+	if (q->elevator) {
+		spin_lock_irq(q->queue_lock);
+		ioc_clear_queue(q);
+		spin_unlock_irq(q->queue_lock);
 		elevator_exit(q->elevator);
+	}
 
 	blk_throtl_exit(q);
 
diff --git a/block/blk.h b/block/blk.h
index 3c510a4b505..ed4d9bf2ab1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -200,6 +200,7 @@ static inline int blk_do_io_stat(struct request *rq)
  */
 void get_io_context(struct io_context *ioc);
 struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
+void ioc_clear_queue(struct request_queue *q);
 
 void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_mask,
 				int node);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 06e59abcb57..f6d31555149 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2674,26 +2674,6 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	cfq_put_cfqg(cfqg);
 }
 
-static void cfq_icq_free_rcu(struct rcu_head *head)
-{
-	kmem_cache_free(cfq_icq_pool,
-			icq_to_cic(container_of(head, struct io_cq, rcu_head)));
-}
-
-static void cfq_icq_free(struct io_cq *icq)
-{
-	call_rcu(&icq->rcu_head, cfq_icq_free_rcu);
-}
-
-static void cfq_release_icq(struct io_cq *icq)
-{
-	struct io_context *ioc = icq->ioc;
-
-	radix_tree_delete(&ioc->icq_tree, icq->q->id);
-	hlist_del(&icq->ioc_node);
-	cfq_icq_free(icq);
-}
-
 static void cfq_put_cooperator(struct cfq_queue *cfqq)
 {
 	struct cfq_queue *__cfqq, *next;
@@ -2731,17 +2711,6 @@ static void cfq_exit_icq(struct io_cq *icq)
 {
 	struct cfq_io_cq *cic = icq_to_cic(icq);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
-	struct io_context *ioc = icq->ioc;
-
-	list_del_init(&icq->q_node);
-
-	/*
-	 * Both setting lookup hint to and clearing it from @icq are done
-	 * under queue_lock.  If it's not pointing to @icq now, it never
-	 * will.  Hint assignment itself can race safely.
-	 */
-	if (rcu_dereference_raw(ioc->icq_hint) == icq)
-		rcu_assign_pointer(ioc->icq_hint, NULL);
 
 	if (cic->cfqq[BLK_RW_ASYNC]) {
 		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]);
@@ -2764,8 +2733,6 @@ static struct cfq_io_cq *cfq_alloc_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 		cic->ttime.last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->icq.q_node);
 		INIT_HLIST_NODE(&cic->icq.ioc_node);
-		cic->icq.exit = cfq_exit_icq;
-		cic->icq.release = cfq_release_icq;
 	}
 
 	return cic;
@@ -3034,7 +3001,7 @@ out:
 	if (ret)
 		printk(KERN_ERR "cfq: icq link failed!\n");
 	if (icq)
-		cfq_icq_free(icq);
+		kmem_cache_free(cfq_icq_pool, icq);
 	return ret;
 }
 
@@ -3774,17 +3741,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	if (cfqd->active_queue)
 		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
 
-	while (!list_empty(&q->icq_list)) {
-		struct io_cq *icq = list_entry(q->icq_list.next,
-					       struct io_cq, q_node);
-		struct io_context *ioc = icq->ioc;
-
-		spin_lock(&ioc->lock);
-		cfq_exit_icq(icq);
-		cfq_release_icq(icq);
-		spin_unlock(&ioc->lock);
-	}
-
 	cfq_put_async_queues(cfqd);
 	cfq_release_cfq_groups(cfqd);
 
@@ -4019,6 +3975,7 @@ static struct elevator_type iosched_cfq = {
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_exit_icq_fn =		cfq_exit_icq,
 		.elevator_set_req_fn =		cfq_set_request,
 		.elevator_put_req_fn =		cfq_put_request,
 		.elevator_may_queue_fn =	cfq_may_queue,
diff --git a/block/elevator.c b/block/elevator.c
index cca049fb45c..91e18f8af9b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -979,8 +979,9 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 			goto fail_register;
 	}
 
-	/* done, replace the old one with new one and turn off BYPASS */
+	/* done, clear io_cq's, switch elevators and turn off BYPASS */
 	spin_lock_irq(q->queue_lock);
+	ioc_clear_queue(q);
 	old_elevator = q->elevator;
 	q->elevator = e;
 	spin_unlock_irq(q->queue_lock);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index d3d3e28cbfd..06e4dd56871 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -5,6 +5,8 @@
 
 #ifdef CONFIG_BLOCK
 
+struct io_cq;
+
 typedef int (elevator_merge_fn) (struct request_queue *, struct request **,
 				 struct bio *);
 
@@ -24,6 +26,7 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
+typedef void (elevator_exit_icq_fn) (struct io_cq *);
 typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
@@ -56,6 +59,8 @@ struct elevator_ops
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
 
+	elevator_exit_icq_fn *elevator_exit_icq_fn;
+
 	elevator_set_req_fn *elevator_set_req_fn;
 	elevator_put_req_fn *elevator_put_req_fn;
 
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index d15ca6591f9..ac390a34c0e 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -14,14 +14,22 @@ struct io_cq {
 	struct request_queue	*q;
 	struct io_context	*ioc;
 
-	struct list_head	q_node;
-	struct hlist_node	ioc_node;
+	/*
+	 * q_node and ioc_node link io_cq through icq_list of q and ioc
+	 * respectively.  Both fields are unused once ioc_exit_icq() is
+	 * called and shared with __rcu_icq_cache and __rcu_head which are
+	 * used for RCU free of io_cq.
+	 */
+	union {
+		struct list_head	q_node;
+		struct kmem_cache	*__rcu_icq_cache;
+	};
+	union {
+		struct hlist_node	ioc_node;
+		struct rcu_head		__rcu_head;
+	};
 
 	unsigned long		changed;
-	struct rcu_head		rcu_head;
-
-	void (*exit)(struct io_cq *);
-	void (*release)(struct io_cq *);
 };
 
 /*
-- 
cgit v1.2.3-70-g09d2


From 9b84cacd013996f244d85b3d873287c2a8f88658 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:42 +0100
Subject: block, cfq: restructure io_cq creation path for io_context interface
 cleanup

Add elevator_ops->elevator_init_icq_fn() and restructure
cfq_create_cic() and rename it to ioc_create_icq().

The new function expects its caller to pass in io_context, uses
elevator_type->icq_cache, handles generic init, calls the new elevator
operation for elevator specific initialization, and returns pointer to
created or looked up icq.  This leaves cfq_icq_pool variable without
any user.  Removed.

This prepares for io_context interface cleanup and doesn't introduce
any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c      | 94 +++++++++++++++++++++---------------------------
 include/linux/elevator.h |  2 ++
 2 files changed, 43 insertions(+), 53 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f6d31555149..11f49d03684 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -59,7 +59,6 @@ static const int cfq_hist_divisor = 4;
 #define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elv.priv[1])
 
 static struct kmem_cache *cfq_pool;
-static struct kmem_cache *cfq_icq_pool;
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
 #define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
@@ -2707,6 +2706,13 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_put_queue(cfqq);
 }
 
+static void cfq_init_icq(struct io_cq *icq)
+{
+	struct cfq_io_cq *cic = icq_to_cic(icq);
+
+	cic->ttime.last_end_request = jiffies;
+}
+
 static void cfq_exit_icq(struct io_cq *icq)
 {
 	struct cfq_io_cq *cic = icq_to_cic(icq);
@@ -2723,21 +2729,6 @@ static void cfq_exit_icq(struct io_cq *icq)
 	}
 }
 
-static struct cfq_io_cq *cfq_alloc_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
-{
-	struct cfq_io_cq *cic;
-
-	cic = kmem_cache_alloc_node(cfq_icq_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
-	if (cic) {
-		cic->ttime.last_end_request = jiffies;
-		INIT_LIST_HEAD(&cic->icq.q_node);
-		INIT_HLIST_NODE(&cic->icq.ioc_node);
-	}
-
-	return cic;
-}
-
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
@@ -2945,64 +2936,62 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 }
 
 /**
- * cfq_create_cic - create and link a cfq_io_cq
- * @cfqd: cfqd of interest
+ * ioc_create_icq - create and link io_cq
+ * @q: request_queue of interest
  * @gfp_mask: allocation mask
  *
- * Make sure cfq_io_cq linking %current->io_context and @cfqd exists.  If
- * ioc and/or cic doesn't exist, they will be created using @gfp_mask.
+ * Make sure io_cq linking %current->io_context and @q exists.  If either
+ * io_context and/or icq don't exist, they will be created using @gfp_mask.
+ *
+ * The caller is responsible for ensuring @ioc won't go away and @q is
+ * alive and will stay alive until this function returns.
  */
-static int cfq_create_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
+static struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask)
 {
-	struct request_queue *q = cfqd->queue;
-	struct io_cq *icq = NULL;
-	struct cfq_io_cq *cic;
+	struct elevator_type *et = q->elevator->type;
 	struct io_context *ioc;
-	int ret = -ENOMEM;
-
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	struct io_cq *icq;
 
 	/* allocate stuff */
 	ioc = create_io_context(current, gfp_mask, q->node);
 	if (!ioc)
-		goto out;
+		return NULL;
 
-	cic = cfq_alloc_cic(cfqd, gfp_mask);
-	if (!cic)
-		goto out;
-	icq = &cic->icq;
+	icq = kmem_cache_alloc_node(et->icq_cache, gfp_mask | __GFP_ZERO,
+				    q->node);
+	if (!icq)
+		return NULL;
 
-	ret = radix_tree_preload(gfp_mask);
-	if (ret)
-		goto out;
+	if (radix_tree_preload(gfp_mask) < 0) {
+		kmem_cache_free(et->icq_cache, icq);
+		return NULL;
+	}
 
 	icq->ioc = ioc;
-	icq->q = cfqd->queue;
+	icq->q = q;
+	INIT_LIST_HEAD(&icq->q_node);
+	INIT_HLIST_NODE(&icq->ioc_node);
 
 	/* lock both q and ioc and try to link @icq */
 	spin_lock_irq(q->queue_lock);
 	spin_lock(&ioc->lock);
 
-	ret = radix_tree_insert(&ioc->icq_tree, q->id, icq);
-	if (likely(!ret)) {
+	if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
 		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
 		list_add(&icq->q_node, &q->icq_list);
-		icq = NULL;
-	} else if (ret == -EEXIST) {
-		/* someone else already did it */
-		ret = 0;
+		if (et->ops.elevator_init_icq_fn)
+			et->ops.elevator_init_icq_fn(icq);
+	} else {
+		kmem_cache_free(et->icq_cache, icq);
+		icq = ioc_lookup_icq(ioc, q);
+		if (!icq)
+			printk(KERN_ERR "cfq: icq link failed!\n");
 	}
 
 	spin_unlock(&ioc->lock);
 	spin_unlock_irq(q->queue_lock);
-
 	radix_tree_preload_end();
-out:
-	if (ret)
-		printk(KERN_ERR "cfq: icq link failed!\n");
-	if (icq)
-		kmem_cache_free(cfq_icq_pool, icq);
-	return ret;
+	return icq;
 }
 
 /**
@@ -3022,7 +3011,6 @@ static struct cfq_io_cq *cfq_get_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 	struct request_queue *q = cfqd->queue;
 	struct cfq_io_cq *cic = NULL;
 	struct io_context *ioc;
-	int err;
 
 	lockdep_assert_held(q->queue_lock);
 
@@ -3037,9 +3025,9 @@ static struct cfq_io_cq *cfq_get_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
 
 		/* slow path - unlock, create missing ones and retry */
 		spin_unlock_irq(q->queue_lock);
-		err = cfq_create_cic(cfqd, gfp_mask);
+		cic = icq_to_cic(ioc_create_icq(q, gfp_mask));
 		spin_lock_irq(q->queue_lock);
-		if (err)
+		if (!cic)
 			return NULL;
 	}
 
@@ -3975,6 +3963,7 @@ static struct elevator_type iosched_cfq = {
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		cfq_init_icq,
 		.elevator_exit_icq_fn =		cfq_exit_icq,
 		.elevator_set_req_fn =		cfq_set_request,
 		.elevator_put_req_fn =		cfq_put_request,
@@ -4028,7 +4017,6 @@ static int __init cfq_init(void)
 		kmem_cache_destroy(cfq_pool);
 		return ret;
 	}
-	cfq_icq_pool = iosched_cfq.icq_cache;
 
 	blkio_policy_register(&blkio_policy_cfq);
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 06e4dd56871..c8f1e67a8eb 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -26,6 +26,7 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
+typedef void (elevator_init_icq_fn) (struct io_cq *);
 typedef void (elevator_exit_icq_fn) (struct io_cq *);
 typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
@@ -59,6 +60,7 @@ struct elevator_ops
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
 
+	elevator_init_icq_fn *elevator_init_icq_fn;
 	elevator_exit_icq_fn *elevator_exit_icq_fn;
 
 	elevator_set_req_fn *elevator_set_req_fn;
-- 
cgit v1.2.3-70-g09d2


From f1f8cc94651738b418ba54c039df536303b91704 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Dec 2011 00:33:42 +0100
Subject: block, cfq: move icq creation and rq->elv.icq association to block
 core

Now block layer knows everything necessary to create and associate
icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request().  io_context reference is also managed by block core
on request alloc/free.

* Only ioprio/cgroup changed handling remains from cfq_get_cic().
  Collapsed into cfq_set_request().

* This removes queue kicking on icq allocation failure (for now).  As
  icq allocation failure is rare and the only effect of queue kicking
  achieved was possibily accelerating queue processing, this change
  shouldn't be noticeable.

  There is a larger underlying problem.  Unlike request allocation,
  icq allocation is not guaranteed to succeed eventually after
  retries.  The number of icq is unbound and thus mempool can't be the
  solution either.  This effectively adds allocation dependency on
  memory free path and thus possibility of deadlock.

  This usually wouldn't happen because icq allocation is not a hot
  path and, even when the condition triggers, it's highly unlikely
  that none of the writeback workers already has icq.

  However, this is still possible especially if elevator is being
  switched under high memory pressure, so we better get it fixed.
  Probably the only solution is just bypassing elevator and appending
  to dispatch queue on any elevator allocation failure.

* Comment added to explain how icq's are managed and synchronized.

This completes cleanup of io_context interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c          |  46 +++++++++++++---
 block/blk-ioc.c           |  60 ++++++++++++++++++++-
 block/blk.h               |   1 +
 block/cfq-iosched.c       | 135 ++++------------------------------------------
 include/linux/elevator.h  |   8 +--
 include/linux/iocontext.h |  59 ++++++++++++++++++++
 6 files changed, 173 insertions(+), 136 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c26c7f4870..8fbdac7010b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -640,13 +640,18 @@ EXPORT_SYMBOL(blk_get_queue);
 
 static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
-	if (rq->cmd_flags & REQ_ELVPRIV)
+	if (rq->cmd_flags & REQ_ELVPRIV) {
 		elv_put_request(q, rq);
+		if (rq->elv.icq)
+			put_io_context(rq->elv.icq->ioc, q);
+	}
+
 	mempool_free(rq, q->rq.rq_pool);
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, unsigned int flags, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct io_cq *icq,
+		  unsigned int flags, gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -657,10 +662,15 @@ blk_alloc_request(struct request_queue *q, unsigned int flags, gfp_t gfp_mask)
 
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
-	if ((flags & REQ_ELVPRIV) &&
-	    unlikely(elv_set_request(q, rq, gfp_mask))) {
-		mempool_free(rq, q->rq.rq_pool);
-		return NULL;
+	if (flags & REQ_ELVPRIV) {
+		rq->elv.icq = icq;
+		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+			mempool_free(rq, q->rq.rq_pool);
+			return NULL;
+		}
+		/* @rq->elv.icq holds on to io_context until @rq is freed */
+		if (icq)
+			get_io_context(icq->ioc);
 	}
 
 	return rq;
@@ -772,11 +782,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
+	struct elevator_type *et;
 	struct io_context *ioc;
+	struct io_cq *icq = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	bool retried = false;
 	int may_queue;
 retry:
+	et = q->elevator->type;
 	ioc = current->io_context;
 
 	if (unlikely(blk_queue_dead(q)))
@@ -837,17 +850,36 @@ retry:
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	/*
+	 * Decide whether the new request will be managed by elevator.  If
+	 * so, mark @rw_flags and increment elvpriv.  Non-zero elvpriv will
+	 * prevent the current elevator from being destroyed until the new
+	 * request is freed.  This guarantees icq's won't be destroyed and
+	 * makes creating new ones safe.
+	 *
+	 * Also, lookup icq while holding queue_lock.  If it doesn't exist,
+	 * it will be created after releasing queue_lock.
+	 */
 	if (blk_rq_should_init_elevator(bio) &&
 	    !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)) {
 		rw_flags |= REQ_ELVPRIV;
 		rl->elvpriv++;
+		if (et->icq_cache && ioc)
+			icq = ioc_lookup_icq(ioc, q);
 	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, gfp_mask);
+	/* create icq if missing */
+	if (unlikely(et->icq_cache && !icq))
+		icq = ioc_create_icq(q, gfp_mask);
+
+	/* rqs are guaranteed to have icq on elv_set_request() if requested */
+	if (likely(!et->icq_cache || icq))
+		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
+
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 0910a5584d3..c04d16b0222 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -289,7 +289,6 @@ void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
 		kmem_cache_free(iocontext_cachep, ioc);
 	task_unlock(task);
 }
-EXPORT_SYMBOL(create_io_context_slowpath);
 
 /**
  * get_task_io_context - get io_context of a task
@@ -362,6 +361,65 @@ out:
 }
 EXPORT_SYMBOL(ioc_lookup_icq);
 
+/**
+ * ioc_create_icq - create and link io_cq
+ * @q: request_queue of interest
+ * @gfp_mask: allocation mask
+ *
+ * Make sure io_cq linking %current->io_context and @q exists.  If either
+ * io_context and/or icq don't exist, they will be created using @gfp_mask.
+ *
+ * The caller is responsible for ensuring @ioc won't go away and @q is
+ * alive and will stay alive until this function returns.
+ */
+struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct elevator_type *et = q->elevator->type;
+	struct io_context *ioc;
+	struct io_cq *icq;
+
+	/* allocate stuff */
+	ioc = create_io_context(current, gfp_mask, q->node);
+	if (!ioc)
+		return NULL;
+
+	icq = kmem_cache_alloc_node(et->icq_cache, gfp_mask | __GFP_ZERO,
+				    q->node);
+	if (!icq)
+		return NULL;
+
+	if (radix_tree_preload(gfp_mask) < 0) {
+		kmem_cache_free(et->icq_cache, icq);
+		return NULL;
+	}
+
+	icq->ioc = ioc;
+	icq->q = q;
+	INIT_LIST_HEAD(&icq->q_node);
+	INIT_HLIST_NODE(&icq->ioc_node);
+
+	/* lock both q and ioc and try to link @icq */
+	spin_lock_irq(q->queue_lock);
+	spin_lock(&ioc->lock);
+
+	if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
+		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
+		list_add(&icq->q_node, &q->icq_list);
+		if (et->ops.elevator_init_icq_fn)
+			et->ops.elevator_init_icq_fn(icq);
+	} else {
+		kmem_cache_free(et->icq_cache, icq);
+		icq = ioc_lookup_icq(ioc, q);
+		if (!icq)
+			printk(KERN_ERR "cfq: icq link failed!\n");
+	}
+
+	spin_unlock(&ioc->lock);
+	spin_unlock_irq(q->queue_lock);
+	radix_tree_preload_end();
+	return icq;
+}
+
 void ioc_set_changed(struct io_context *ioc, int which)
 {
 	struct io_cq *icq;
diff --git a/block/blk.h b/block/blk.h
index ed4d9bf2ab1..7efd772336d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -200,6 +200,7 @@ static inline int blk_do_io_stat(struct request *rq)
  */
 void get_io_context(struct io_context *ioc);
 struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
+struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask);
 void ioc_clear_queue(struct request_queue *q);
 
 void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_mask,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 11f49d03684..f3b44c394e6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2935,117 +2935,6 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 	return cfqq;
 }
 
-/**
- * ioc_create_icq - create and link io_cq
- * @q: request_queue of interest
- * @gfp_mask: allocation mask
- *
- * Make sure io_cq linking %current->io_context and @q exists.  If either
- * io_context and/or icq don't exist, they will be created using @gfp_mask.
- *
- * The caller is responsible for ensuring @ioc won't go away and @q is
- * alive and will stay alive until this function returns.
- */
-static struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask)
-{
-	struct elevator_type *et = q->elevator->type;
-	struct io_context *ioc;
-	struct io_cq *icq;
-
-	/* allocate stuff */
-	ioc = create_io_context(current, gfp_mask, q->node);
-	if (!ioc)
-		return NULL;
-
-	icq = kmem_cache_alloc_node(et->icq_cache, gfp_mask | __GFP_ZERO,
-				    q->node);
-	if (!icq)
-		return NULL;
-
-	if (radix_tree_preload(gfp_mask) < 0) {
-		kmem_cache_free(et->icq_cache, icq);
-		return NULL;
-	}
-
-	icq->ioc = ioc;
-	icq->q = q;
-	INIT_LIST_HEAD(&icq->q_node);
-	INIT_HLIST_NODE(&icq->ioc_node);
-
-	/* lock both q and ioc and try to link @icq */
-	spin_lock_irq(q->queue_lock);
-	spin_lock(&ioc->lock);
-
-	if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
-		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
-		list_add(&icq->q_node, &q->icq_list);
-		if (et->ops.elevator_init_icq_fn)
-			et->ops.elevator_init_icq_fn(icq);
-	} else {
-		kmem_cache_free(et->icq_cache, icq);
-		icq = ioc_lookup_icq(ioc, q);
-		if (!icq)
-			printk(KERN_ERR "cfq: icq link failed!\n");
-	}
-
-	spin_unlock(&ioc->lock);
-	spin_unlock_irq(q->queue_lock);
-	radix_tree_preload_end();
-	return icq;
-}
-
-/**
- * cfq_get_cic - acquire cfq_io_cq and bump refcnt on io_context
- * @cfqd: cfqd to setup cic for
- * @gfp_mask: allocation mask
- *
- * Return cfq_io_cq associating @cfqd and %current->io_context and
- * bump refcnt on io_context.  If ioc or cic doesn't exist, they're created
- * using @gfp_mask.
- *
- * Must be called under queue_lock which may be released and re-acquired.
- * This function also may sleep depending on @gfp_mask.
- */
-static struct cfq_io_cq *cfq_get_cic(struct cfq_data *cfqd, gfp_t gfp_mask)
-{
-	struct request_queue *q = cfqd->queue;
-	struct cfq_io_cq *cic = NULL;
-	struct io_context *ioc;
-
-	lockdep_assert_held(q->queue_lock);
-
-	while (true) {
-		/* fast path */
-		ioc = current->io_context;
-		if (likely(ioc)) {
-			cic = cfq_cic_lookup(cfqd, ioc);
-			if (likely(cic))
-				break;
-		}
-
-		/* slow path - unlock, create missing ones and retry */
-		spin_unlock_irq(q->queue_lock);
-		cic = icq_to_cic(ioc_create_icq(q, gfp_mask));
-		spin_lock_irq(q->queue_lock);
-		if (!cic)
-			return NULL;
-	}
-
-	/* bump @ioc's refcnt and handle changed notifications */
-	get_io_context(ioc);
-
-	if (unlikely(cic->icq.changed)) {
-		if (test_and_clear_bit(ICQ_IOPRIO_CHANGED, &cic->icq.changed))
-			changed_ioprio(cic);
-#ifdef CONFIG_CFQ_GROUP_IOSCHED
-		if (test_and_clear_bit(ICQ_CGROUP_CHANGED, &cic->icq.changed))
-			changed_cgroup(cic);
-#endif
-	}
-
-	return cic;
-}
-
 static void
 __cfq_update_io_thinktime(struct cfq_ttime *ttime, unsigned long slice_idle)
 {
@@ -3524,8 +3413,6 @@ static void cfq_put_request(struct request *rq)
 		BUG_ON(!cfqq->allocated[rw]);
 		cfqq->allocated[rw]--;
 
-		put_io_context(RQ_CIC(rq)->icq.ioc, cfqq->cfqd->queue);
-
 		/* Put down rq reference on cfqg */
 		cfq_put_cfqg(RQ_CFQG(rq));
 		rq->elv.priv[0] = NULL;
@@ -3574,7 +3461,7 @@ static int
 cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
-	struct cfq_io_cq *cic;
+	struct cfq_io_cq *cic = icq_to_cic(rq->elv.icq);
 	const int rw = rq_data_dir(rq);
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
@@ -3582,9 +3469,16 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	spin_lock_irq(q->queue_lock);
-	cic = cfq_get_cic(cfqd, gfp_mask);
-	if (!cic)
-		goto queue_fail;
+
+	/* handle changed notifications */
+	if (unlikely(cic->icq.changed)) {
+		if (test_and_clear_bit(ICQ_IOPRIO_CHANGED, &cic->icq.changed))
+			changed_ioprio(cic);
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+		if (test_and_clear_bit(ICQ_CGROUP_CHANGED, &cic->icq.changed))
+			changed_cgroup(cic);
+#endif
+	}
 
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -3615,17 +3509,10 @@ new_queue:
 	cfqq->allocated[rw]++;
 
 	cfqq->ref++;
-	rq->elv.icq = &cic->icq;
 	rq->elv.priv[0] = cfqq;
 	rq->elv.priv[1] = cfq_ref_get_cfqg(cfqq->cfqg);
 	spin_unlock_irq(q->queue_lock);
 	return 0;
-
-queue_fail:
-	cfq_schedule_dispatch(cfqd);
-	spin_unlock_irq(q->queue_lock);
-	cfq_log(cfqd, "set_request fail");
-	return 1;
 }
 
 static void cfq_kick_queue(struct work_struct *work)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c8f1e67a8eb..c24f3d7fbf1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -60,8 +60,8 @@ struct elevator_ops
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
 
-	elevator_init_icq_fn *elevator_init_icq_fn;
-	elevator_exit_icq_fn *elevator_exit_icq_fn;
+	elevator_init_icq_fn *elevator_init_icq_fn;	/* see iocontext.h */
+	elevator_exit_icq_fn *elevator_exit_icq_fn;	/* ditto */
 
 	elevator_set_req_fn *elevator_set_req_fn;
 	elevator_put_req_fn *elevator_put_req_fn;
@@ -90,8 +90,8 @@ struct elevator_type
 
 	/* fields provided by elevator implementation */
 	struct elevator_ops ops;
-	size_t icq_size;
-	size_t icq_align;
+	size_t icq_size;	/* see iocontext.h */
+	size_t icq_align;	/* ditto */
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ac390a34c0e..7e1371c4bcc 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -10,6 +10,65 @@ enum {
 	ICQ_CGROUP_CHANGED,
 };
 
+/*
+ * An io_cq (icq) is association between an io_context (ioc) and a
+ * request_queue (q).  This is used by elevators which need to track
+ * information per ioc - q pair.
+ *
+ * Elevator can request use of icq by setting elevator_type->icq_size and
+ * ->icq_align.  Both size and align must be larger than that of struct
+ * io_cq and elevator can use the tail area for private information.  The
+ * recommended way to do this is defining a struct which contains io_cq as
+ * the first member followed by private members and using its size and
+ * align.  For example,
+ *
+ *	struct snail_io_cq {
+ *		struct io_cq	icq;
+ *		int		poke_snail;
+ *		int		feed_snail;
+ *	};
+ *
+ *	struct elevator_type snail_elv_type {
+ *		.ops =		{ ... },
+ *		.icq_size =	sizeof(struct snail_io_cq),
+ *		.icq_align =	__alignof__(struct snail_io_cq),
+ *		...
+ *	};
+ *
+ * If icq_size is set, block core will manage icq's.  All requests will
+ * have its ->elv.icq field set before elevator_ops->elevator_set_req_fn()
+ * is called and be holding a reference to the associated io_context.
+ *
+ * Whenever a new icq is created, elevator_ops->elevator_init_icq_fn() is
+ * called and, on destruction, ->elevator_exit_icq_fn().  Both functions
+ * are called with both the associated io_context and queue locks held.
+ *
+ * Elevator is allowed to lookup icq using ioc_lookup_icq() while holding
+ * queue lock but the returned icq is valid only until the queue lock is
+ * released.  Elevators can not and should not try to create or destroy
+ * icq's.
+ *
+ * As icq's are linked from both ioc and q, the locking rules are a bit
+ * complex.
+ *
+ * - ioc lock nests inside q lock.
+ *
+ * - ioc->icq_list and icq->ioc_node are protected by ioc lock.
+ *   q->icq_list and icq->q_node by q lock.
+ *
+ * - ioc->icq_tree and ioc->icq_hint are protected by ioc lock, while icq
+ *   itself is protected by q lock.  However, both the indexes and icq
+ *   itself are also RCU managed and lookup can be performed holding only
+ *   the q lock.
+ *
+ * - icq's are not reference counted.  They are destroyed when either the
+ *   ioc or q goes away.  Each request with icq set holds an extra
+ *   reference to ioc to ensure it stays until the request is completed.
+ *
+ * - Linking and unlinking icq's are performed while holding both ioc and q
+ *   locks.  Due to the lock ordering, q exit is simple but ioc exit
+ *   requires reverse-order double lock dance.
+ */
 struct io_cq {
 	struct request_queue	*q;
 	struct io_context	*ioc;
-- 
cgit v1.2.3-70-g09d2


From 4eabc941259f9d8c8fb71746d3f30c87e1d9e49b Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 15 Dec 2011 20:03:04 +0100
Subject: block: don't kick empty queue in blk_drain_queue()

While probing, fd sets up queue, probes hardware and tears down the
queue if probing fails.  In the process, blk_drain_queue() kicks the
queue which failed to finish initialization and fd is unhappy about
that.

  floppy0: no floppy controllers found
  ------------[ cut here ]------------
  WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
  Hardware name: To Be Filled By O.E.M.
  VFS: do_fd_request called on non-open device
  Modules linked in:
  Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
  Call Trace:
   [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
   [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
   [<ffffffff813d657f>] do_fd_request+0xbf/0xd0
   [<ffffffff81322b95>] blk_drain_queue+0x65/0x80
   [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
   [<ffffffff818a809d>] floppy_init+0xdeb/0xe28
   [<ffffffff818a72b2>] ? daring+0x6b/0x6b
   [<ffffffff810002af>] do_one_initcall+0x3f/0x170
   [<ffffffff81884b34>] kernel_init+0x9d/0x11e
   [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
   [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
   [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
   [<ffffffff815dbb10>] ? gs_change+0xb/0xb

Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
something on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 20d69f6beb6..15de223c7f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -366,7 +366,14 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 		if (drain_all)
 			blk_throtl_drain(q);
 
-		__blk_run_queue(q);
+		/*
+		 * This function might be called on a queue which failed
+		 * driver init after queue creation.  Some drivers
+		 * (e.g. fd) get unhappy in such cases.  Kick queue iff
+		 * dispatch queue has something on it.
+		 */
+		if (!list_empty(&q->queue_head))
+			__blk_run_queue(q);
 
 		if (drain_all)
 			nr_rqs = q->rq.count[0] + q->rq.count[1];
-- 
cgit v1.2.3-70-g09d2


From 4a0b75c7d02c2bd46ed227d4ba5941ba8a0aba5d Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Fri, 16 Dec 2011 14:00:22 +0100
Subject: block, cfq: fix empty queue crash caused by request merge

All requests of a queue could be merged to other requests of other queue.
Such queue will not have request in it, but it's in service tree. This
will cause kernel oops.
I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
issue should exist without the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f3b44c394e6..163263ddd38 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1656,6 +1656,8 @@ cfq_merged_requests(struct request_queue *q, struct request *rq,
 		    struct request *next)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+
 	/*
 	 * reposition in fifo if next is older than rq
 	 */
@@ -1670,6 +1672,16 @@ cfq_merged_requests(struct request_queue *q, struct request *rq,
 	cfq_remove_request(next);
 	cfq_blkiocg_update_io_merged_stats(&(RQ_CFQG(rq))->blkg,
 					rq_data_dir(next), rq_is_sync(next));
+
+	cfqq = RQ_CFQQ(next);
+	/*
+	 * all requests of this queue are merged to other queues, delete it
+	 * from the service tree. If it's the active_queue,
+	 * cfq_dispatch_requests() will choose to expire it or do idle
+	 */
+	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list) &&
+	    cfqq != cfqd->active_queue)
+		cfq_del_cfqq_rr(cfqd, cfqq);
 }
 
 static int cfq_allow_merge(struct request_queue *q, struct request *rq,
-- 
cgit v1.2.3-70-g09d2


From 274193224cdabd687d804a26e0150bb20f2dd52c Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Fri, 16 Dec 2011 14:00:31 +0100
Subject: block: recursive merge requests

In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
a+3,.... When the requests are flushed to queue, a and a+1 are merged
to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
aren't merged.
With recursive merge below, the workload throughput gets improved 20%
and context switch drops 60%.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/elevator.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

(limited to 'block')

diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8af9b..99838f460b4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -515,6 +515,7 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
 				     struct request *rq)
 {
 	struct request *__rq;
+	bool ret;
 
 	if (blk_queue_nomerges(q))
 		return false;
@@ -528,14 +529,21 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
 	if (blk_queue_noxmerges(q))
 		return false;
 
+	ret = false;
 	/*
 	 * See if our hash lookup can find a potential backmerge.
 	 */
-	__rq = elv_rqhash_find(q, blk_rq_pos(rq));
-	if (__rq && blk_attempt_req_merge(q, __rq, rq))
-		return true;
+	while (1) {
+		__rq = elv_rqhash_find(q, blk_rq_pos(rq));
+		if (!__rq || !blk_attempt_req_merge(q, __rq, rq))
+			break;
 
-	return false;
+		/* The merged request could be merged with others, try again */
+		ret = true;
+		rq = __rq;
+	}
+
+	return ret;
 }
 
 void elv_merged_request(struct request_queue *q, struct request *rq, int type)
-- 
cgit v1.2.3-70-g09d2


From 6ae0516b8a50ece5d766be608a305707e0450060 Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Fri, 16 Dec 2011 14:04:23 +0100
Subject: block, cfq: fix empty queue crash caused by request merge

All requests of a queue could be merged to other requests of other queue.
Such queue will not have request in it, but it's in service tree. This
will cause kernel oops.
I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
issue should exist without the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4c12869fcf7..3548705b04e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1655,6 +1655,8 @@ cfq_merged_requests(struct request_queue *q, struct request *rq,
 		    struct request *next)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+
 	/*
 	 * reposition in fifo if next is older than rq
 	 */
@@ -1669,6 +1671,16 @@ cfq_merged_requests(struct request_queue *q, struct request *rq,
 	cfq_remove_request(next);
 	cfq_blkiocg_update_io_merged_stats(&(RQ_CFQG(rq))->blkg,
 					rq_data_dir(next), rq_is_sync(next));
+
+	cfqq = RQ_CFQQ(next);
+	/*
+	 * all requests of this queue are merged to other queues, delete it
+	 * from the service tree. If it's the active_queue,
+	 * cfq_dispatch_requests() will choose to expire it or do idle
+	 */
+	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list) &&
+	    cfqq != cfqd->active_queue)
+		cfq_del_cfqq_rr(cfqd, cfqq);
 }
 
 static int cfq_allow_merge(struct request_queue *q, struct request *rq,
-- 
cgit v1.2.3-70-g09d2


From 64c42998f14d5894ea3138625897d620b30c8e4e Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Mon, 19 Dec 2011 10:36:44 +0100
Subject: block: ioc_cgroup_changed() needs to be exported

With the ioc changed, ioc_cgroup_changed() can be used by modular
code. So ensure that it is exported.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c | 1 +
 1 file changed, 1 insertion(+)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index c04d16b0222..ce9b35a9468 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -464,6 +464,7 @@ void ioc_cgroup_changed(struct io_context *ioc)
 	ioc_set_changed(ioc, ICQ_CGROUP_CHANGED);
 	spin_unlock_irqrestore(&ioc->lock, flags);
 }
+EXPORT_SYMBOL(ioc_cgroup_changed);
 
 static int __init blk_ioc_init(void)
 {
-- 
cgit v1.2.3-70-g09d2


From 609f6ea1c9cdfe0c43a927e13205a57d0c266d5a Mon Sep 17 00:00:00 2001
From: majianpeng <majianpeng@gmail.com>
Date: Wed, 21 Dec 2011 15:27:24 +0100
Subject: block: re-use existing 'reading' variable instead of checking
 direction again

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-map.c b/block/blk-map.c
index 164cd005970..623e1cd4cff 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -311,7 +311,7 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
 	if (IS_ERR(bio))
 		return PTR_ERR(bio);
 
-	if (rq_data_dir(rq) == WRITE)
+	if (!reading)
 		bio->bi_rw |= REQ_WRITE;
 
 	if (do_copy)
-- 
cgit v1.2.3-70-g09d2


From fd63836811d6e5b5f5f608abf865bc9e91762c8c Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Sun, 25 Dec 2011 14:29:14 +0100
Subject: block: an exiting task should be allowed to create io_context

While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context.  This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.

Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.

Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index ce9b35a9468..33fae7df16a 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -281,9 +281,16 @@ void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
 	INIT_HLIST_HEAD(&ioc->icq_list);
 	INIT_WORK(&ioc->release_work, ioc_release_fn);
 
-	/* try to install, somebody might already have beaten us to it */
+	/*
+	 * Try to install.  ioc shouldn't be installed if someone else
+	 * already did or @task, which isn't %current, is exiting.  Note
+	 * that we need to allow ioc creation on exiting %current as exit
+	 * path may issue IOs from e.g. exit_files().  The exit path is
+	 * responsible for not issuing IO after exit_io_context().
+	 */
 	task_lock(task);
-	if (!task->io_context && !(task->flags & PF_EXITING))
+	if (!task->io_context &&
+	    (task == current || !(task->flags & PF_EXITING)))
 		task->io_context = ioc;
 	else
 		kmem_cache_free(iocontext_cachep, ioc);
-- 
cgit v1.2.3-70-g09d2


From c98b2cc29af8e84e7364b53e9bb4cc7cfaf62555 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 27 Dec 2011 18:52:16 +0100
Subject: block: remove WARN_ON_ONCE() in exit_io_context()

6e736be7 "block: make ioc get/put interface more conventional and fix
race on alloction" added WARN_ON_ONCE() in exit_io_context() which
triggers if !PF_EXITING.  All tasks hitting exit_io_context() from
task exit should have PF_EXITING set but task struct tearing down
after fork failure calls into the function without PF_EXITING,
triggering the condition.

  WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
  Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
  Call Trace:
   [<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2
   [<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a
   [<ffffffff8181a7a2>] exit_io_context+0x40/0x92
   [<ffffffff810b58c9>] copy_process+0x126f/0x1453
   [<ffffffff810b5c1b>] do_fork+0x120/0x3e9
   [<ffffffff8106242f>] sys_clone+0x26/0x28
   [<ffffffff82425803>] stub_clone+0x13/0x20
  ---[ end trace a2e4eb670b375238 ]---

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c | 3 ---
 1 file changed, 3 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 33fae7df16a..27a06e00eae 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -230,9 +230,6 @@ void exit_io_context(struct task_struct *task)
 {
 	struct io_context *ioc;
 
-	/* PF_EXITING prevents new io_context from being attached to @task */
-	WARN_ON_ONCE(!(current->flags & PF_EXITING));
-
 	task_lock(task);
 	ioc = task->io_context;
 	task->io_context = NULL;
-- 
cgit v1.2.3-70-g09d2


From f2b20d436534f22ccc3f5ad172499fcb013bb315 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 29 Dec 2011 09:16:28 +0100
Subject: block: fix blk_queue_end_tag()

Commit 5e081591 "block: warn if tag is greater than real_max_depth"
cleaned up blk_queue_end_tag() to warn when the tag is truly invalid
(greater than real_max_depth).  However, it changed behavior in the tag <
max_depth case to not end the request.  Leading to triggering of
BUG_ON(blk_queued_rq(rq)) in the request completion path:

  http://marc.info/?l=linux-kernel&m=132204370518629&w=2

In order to allow blk_queue_resize_tags() to shrink the tag space
blk_queue_end_tag() must always complete tags with a value less than
real_max_depth regardless of the current max_depth.  The comment about
"handling the shrink case" seems to be what prompted changes in this
space, so remove it and BUG on all invalid tags (made even simpler by
Matthew's suggestion to use an unsigned compare).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Reported-by: Meelis Roos <mroos@ut.ee>
Reported-by: Ed Nadolski <edmund.nadolski@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-tag.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

(limited to 'block')

diff --git a/block/blk-tag.c b/block/blk-tag.c
index e74d6d13838..4af6f5cc116 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -282,18 +282,9 @@ EXPORT_SYMBOL(blk_queue_resize_tags);
 void blk_queue_end_tag(struct request_queue *q, struct request *rq)
 {
 	struct blk_queue_tag *bqt = q->queue_tags;
-	int tag = rq->tag;
+	unsigned tag = rq->tag; /* negative tags invalid */
 
-	BUG_ON(tag == -1);
-
-	if (unlikely(tag >= bqt->max_depth)) {
-		/*
-		 * This can happen after tag depth has been reduced.
-		 * But tag shouldn't be larger than real_max_depth.
-		 */
-		WARN_ON(tag >= bqt->real_max_depth);
-		return;
-	}
+	BUG_ON(tag >= bqt->real_max_depth);
 
 	list_del_init(&rq->queuelist);
 	rq->cmd_flags &= ~REQ_QUEUED;
-- 
cgit v1.2.3-70-g09d2


From 4752bc309b7604d507c973c7b7678ac2ce10a058 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Fri, 16 Sep 2011 00:21:54 -0400
Subject: make register_disk() static

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/genhd.c b/block/genhd.c
index 02e9fca8082..bf443a71b93 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -507,7 +507,7 @@ static int exact_lock(dev_t devt, void *data)
 	return 0;
 }
 
-void register_disk(struct gendisk *disk)
+static void register_disk(struct gendisk *disk)
 {
 	struct device *ddev = disk_to_dev(disk);
 	struct block_device *bdev;
-- 
cgit v1.2.3-70-g09d2


From 9be96f3fd10187f185d84cf878cf032465bcced3 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Fri, 16 Sep 2011 00:25:05 -0400
Subject: move fs/partitions to block/

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/Kconfig             |    6 +
 block/Makefile            |    3 +-
 block/partitions/Kconfig  |  251 ++++++++
 block/partitions/Makefile |   20 +
 block/partitions/acorn.c  |  556 ++++++++++++++++
 block/partitions/acorn.h  |   14 +
 block/partitions/amiga.c  |  139 ++++
 block/partitions/amiga.h  |    6 +
 block/partitions/atari.c  |  149 +++++
 block/partitions/atari.h  |   34 +
 block/partitions/check.c  |  687 ++++++++++++++++++++
 block/partitions/check.h  |   49 ++
 block/partitions/efi.c    |  675 +++++++++++++++++++
 block/partitions/efi.h    |  134 ++++
 block/partitions/ibm.c    |  275 ++++++++
 block/partitions/ibm.h    |    1 +
 block/partitions/karma.c  |   57 ++
 block/partitions/karma.h  |    8 +
 block/partitions/ldm.c    | 1570 +++++++++++++++++++++++++++++++++++++++++++++
 block/partitions/ldm.h    |  215 +++++++
 block/partitions/mac.c    |  134 ++++
 block/partitions/mac.h    |   44 ++
 block/partitions/msdos.c  |  552 ++++++++++++++++
 block/partitions/msdos.h  |    8 +
 block/partitions/osf.c    |   86 +++
 block/partitions/osf.h    |    7 +
 block/partitions/sgi.c    |   82 +++
 block/partitions/sgi.h    |    8 +
 block/partitions/sun.c    |  122 ++++
 block/partitions/sun.h    |    8 +
 block/partitions/sysv68.c |   95 +++
 block/partitions/sysv68.h |    1 +
 block/partitions/ultrix.c |   48 ++
 block/partitions/ultrix.h |    5 +
 fs/Kconfig                |    8 -
 fs/Makefile               |    1 -
 fs/partitions/Kconfig     |  251 --------
 fs/partitions/Makefile    |   20 -
 fs/partitions/acorn.c     |  556 ----------------
 fs/partitions/acorn.h     |   14 -
 fs/partitions/amiga.c     |  139 ----
 fs/partitions/amiga.h     |    6 -
 fs/partitions/atari.c     |  149 -----
 fs/partitions/atari.h     |   34 -
 fs/partitions/check.c     |  687 --------------------
 fs/partitions/check.h     |   49 --
 fs/partitions/efi.c       |  675 -------------------
 fs/partitions/efi.h       |  134 ----
 fs/partitions/ibm.c       |  275 --------
 fs/partitions/ibm.h       |    1 -
 fs/partitions/karma.c     |   57 --
 fs/partitions/karma.h     |    8 -
 fs/partitions/ldm.c       | 1570 ---------------------------------------------
 fs/partitions/ldm.h       |  215 -------
 fs/partitions/mac.c       |  134 ----
 fs/partitions/mac.h       |   44 --
 fs/partitions/msdos.c     |  552 ----------------
 fs/partitions/msdos.h     |    8 -
 fs/partitions/osf.c       |   86 ---
 fs/partitions/osf.h       |    7 -
 fs/partitions/sgi.c       |   82 ---
 fs/partitions/sgi.h       |    8 -
 fs/partitions/sun.c       |  122 ----
 fs/partitions/sun.h       |    8 -
 fs/partitions/sysv68.c    |   95 ---
 fs/partitions/sysv68.h    |    1 -
 fs/partitions/ultrix.c    |   48 --
 fs/partitions/ultrix.h    |    5 -
 68 files changed, 6048 insertions(+), 6050 deletions(-)
 create mode 100644 block/partitions/Kconfig
 create mode 100644 block/partitions/Makefile
 create mode 100644 block/partitions/acorn.c
 create mode 100644 block/partitions/acorn.h
 create mode 100644 block/partitions/amiga.c
 create mode 100644 block/partitions/amiga.h
 create mode 100644 block/partitions/atari.c
 create mode 100644 block/partitions/atari.h
 create mode 100644 block/partitions/check.c
 create mode 100644 block/partitions/check.h
 create mode 100644 block/partitions/efi.c
 create mode 100644 block/partitions/efi.h
 create mode 100644 block/partitions/ibm.c
 create mode 100644 block/partitions/ibm.h
 create mode 100644 block/partitions/karma.c
 create mode 100644 block/partitions/karma.h
 create mode 100644 block/partitions/ldm.c
 create mode 100644 block/partitions/ldm.h
 create mode 100644 block/partitions/mac.c
 create mode 100644 block/partitions/mac.h
 create mode 100644 block/partitions/msdos.c
 create mode 100644 block/partitions/msdos.h
 create mode 100644 block/partitions/osf.c
 create mode 100644 block/partitions/osf.h
 create mode 100644 block/partitions/sgi.c
 create mode 100644 block/partitions/sgi.h
 create mode 100644 block/partitions/sun.c
 create mode 100644 block/partitions/sun.h
 create mode 100644 block/partitions/sysv68.c
 create mode 100644 block/partitions/sysv68.h
 create mode 100644 block/partitions/ultrix.c
 create mode 100644 block/partitions/ultrix.h
 delete mode 100644 fs/partitions/Kconfig
 delete mode 100644 fs/partitions/Makefile
 delete mode 100644 fs/partitions/acorn.c
 delete mode 100644 fs/partitions/acorn.h
 delete mode 100644 fs/partitions/amiga.c
 delete mode 100644 fs/partitions/amiga.h
 delete mode 100644 fs/partitions/atari.c
 delete mode 100644 fs/partitions/atari.h
 delete mode 100644 fs/partitions/check.c
 delete mode 100644 fs/partitions/check.h
 delete mode 100644 fs/partitions/efi.c
 delete mode 100644 fs/partitions/efi.h
 delete mode 100644 fs/partitions/ibm.c
 delete mode 100644 fs/partitions/ibm.h
 delete mode 100644 fs/partitions/karma.c
 delete mode 100644 fs/partitions/karma.h
 delete mode 100644 fs/partitions/ldm.c
 delete mode 100644 fs/partitions/ldm.h
 delete mode 100644 fs/partitions/mac.c
 delete mode 100644 fs/partitions/mac.h
 delete mode 100644 fs/partitions/msdos.c
 delete mode 100644 fs/partitions/msdos.h
 delete mode 100644 fs/partitions/osf.c
 delete mode 100644 fs/partitions/osf.h
 delete mode 100644 fs/partitions/sgi.c
 delete mode 100644 fs/partitions/sgi.h
 delete mode 100644 fs/partitions/sun.c
 delete mode 100644 fs/partitions/sun.h
 delete mode 100644 fs/partitions/sysv68.c
 delete mode 100644 fs/partitions/sysv68.h
 delete mode 100644 fs/partitions/ultrix.c
 delete mode 100644 fs/partitions/ultrix.h

(limited to 'block')

diff --git a/block/Kconfig b/block/Kconfig
index e97934eecec..09acf1b3990 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -99,6 +99,12 @@ config BLK_DEV_THROTTLING
 
 	See Documentation/cgroups/blkio-controller.txt for more information.
 
+menu "Partition Types"
+
+source "block/partitions/Kconfig"
+
+endmenu
+
 endif # BLOCK
 
 config BLOCK_COMPAT
diff --git a/block/Makefile b/block/Makefile
index 514c6e4f427..3199c0cdef5 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,8 @@
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o
+			blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o \
+			partitions/
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
diff --git a/block/partitions/Kconfig b/block/partitions/Kconfig
new file mode 100644
index 00000000000..cb5f0a3f1b0
--- /dev/null
+++ b/block/partitions/Kconfig
@@ -0,0 +1,251 @@
+#
+# Partition configuration
+#
+config PARTITION_ADVANCED
+	bool "Advanced partition selection"
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned under an operating system running on a different
+	  architecture than your Linux system.
+
+	  Note that the answer to this question won't directly affect the
+	  kernel: saying N will just cause the configurator to skip all
+	  the questions about foreign partitioning schemes.
+
+	  If unsure, say N.
+
+config ACORN_PARTITION
+	bool "Acorn partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	help
+	  Support hard disks partitioned under Acorn operating systems.
+
+config ACORN_PARTITION_CUMANA
+	bool "Cumana partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	depends on ACORN_PARTITION
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned using the Cumana interface on Acorn machines.
+
+config ACORN_PARTITION_EESOX
+	bool "EESOX partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	depends on ACORN_PARTITION
+
+config ACORN_PARTITION_ICS
+	bool "ICS partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	depends on ACORN_PARTITION
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned using the ICS interface on Acorn machines.
+
+config ACORN_PARTITION_ADFS
+	bool "Native filecore partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	depends on ACORN_PARTITION
+	help
+	  The Acorn Disc Filing System is the standard file system of the
+	  RiscOS operating system which runs on Acorn's ARM-based Risc PC
+	  systems and the Acorn Archimedes range of machines.  If you say
+	  `Y' here, Linux will support disk partitions created under ADFS.
+
+config ACORN_PARTITION_POWERTEC
+	bool "PowerTec partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	depends on ACORN_PARTITION
+	help
+	  Support reading partition tables created on Acorn machines using
+	  the PowerTec SCSI drive.
+
+config ACORN_PARTITION_RISCIX
+	bool "RISCiX partition support" if PARTITION_ADVANCED
+	default y if ARCH_ACORN
+	depends on ACORN_PARTITION
+	help
+	  Once upon a time, there was a native Unix port for the Acorn series
+	  of machines called RISCiX.  If you say 'Y' here, Linux will be able
+	  to read disks partitioned under RISCiX.
+
+config OSF_PARTITION
+	bool "Alpha OSF partition support" if PARTITION_ADVANCED
+	default y if ALPHA
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned on an Alpha machine.
+
+config AMIGA_PARTITION
+	bool "Amiga partition table support" if PARTITION_ADVANCED
+	default y if (AMIGA || AFFS_FS=y)
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned under AmigaOS.
+
+config ATARI_PARTITION
+	bool "Atari partition table support" if PARTITION_ADVANCED
+	default y if ATARI
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned under the Atari OS.
+
+config IBM_PARTITION
+	bool "IBM disk label and partition support"
+	depends on PARTITION_ADVANCED && S390
+	help
+	  Say Y here if you would like to be able to read the hard disk
+	  partition table format used by IBM DASD disks operating under CMS.
+	  Otherwise, say N.
+
+config MAC_PARTITION
+	bool "Macintosh partition map support" if PARTITION_ADVANCED
+	default y if (MAC || PPC_PMAC)
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned on a Macintosh.
+
+config MSDOS_PARTITION
+	bool "PC BIOS (MSDOS partition tables) support" if PARTITION_ADVANCED
+	default y
+	help
+	  Say Y here.
+
+config BSD_DISKLABEL
+	bool "BSD disklabel (FreeBSD partition tables) support"
+	depends on PARTITION_ADVANCED && MSDOS_PARTITION
+	help
+	  FreeBSD uses its own hard disk partition scheme on your PC. It
+	  requires only one entry in the primary partition table of your disk
+	  and manages it similarly to DOS extended partitions, putting in its
+	  first sector a new partition table in BSD disklabel format. Saying Y
+	  here allows you to read these disklabels and further mount FreeBSD
+	  partitions from within Linux if you have also said Y to "UFS
+	  file system support", above. If you don't know what all this is
+	  about, say N.
+
+config MINIX_SUBPARTITION
+	bool "Minix subpartition support"
+	depends on PARTITION_ADVANCED && MSDOS_PARTITION
+	help
+	  Minix 2.0.0/2.0.2 subpartition table support for Linux.
+	  Say Y here if you want to mount and use Minix 2.0.0/2.0.2
+	  subpartitions.
+
+config SOLARIS_X86_PARTITION
+	bool "Solaris (x86) partition table support"
+	depends on PARTITION_ADVANCED && MSDOS_PARTITION
+	help
+	  Like most systems, Solaris x86 uses its own hard disk partition
+	  table format, incompatible with all others. Saying Y here allows you
+	  to read these partition tables and further mount Solaris x86
+	  partitions from within Linux if you have also said Y to "UFS
+	  file system support", above.
+
+config UNIXWARE_DISKLABEL
+	bool "Unixware slices support"
+	depends on PARTITION_ADVANCED && MSDOS_PARTITION
+	---help---
+	  Like some systems, UnixWare uses its own slice table inside a
+	  partition (VTOC - Virtual Table of Contents). Its format is
+	  incompatible with all other OSes. Saying Y here allows you to read
+	  VTOC and further mount UnixWare partitions read-only from within
+	  Linux if you have also said Y to "UFS file system support" or
+	  "System V and Coherent file system support", above.
+
+	  This is mainly used to carry data from a UnixWare box to your
+	  Linux box via a removable medium like magneto-optical, ZIP or
+	  removable IDE drives. Note, however, that a good portable way to
+	  transport files and directories between unixes (and even other
+	  operating systems) is given by the tar program ("man tar" or
+	  preferably "info tar").
+
+	  If you don't know what all this is about, say N.
+
+config LDM_PARTITION
+	bool "Windows Logical Disk Manager (Dynamic Disk) support"
+	depends on PARTITION_ADVANCED
+	---help---
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned using Windows 2000's/XP's or Vista's Logical Disk
+	  Manager.  They are also known as "Dynamic Disks".
+
+	  Note this driver only supports Dynamic Disks with a protective MBR
+	  label, i.e. DOS partition table.  It does not support GPT labelled
+	  Dynamic Disks yet as can be created with Vista.
+
+	  Windows 2000 introduced the concept of Dynamic Disks to get around
+	  the limitations of the PC's partitioning scheme.  The Logical Disk
+	  Manager allows the user to repartition a disk and create spanned,
+	  mirrored, striped or RAID volumes, all without the need for
+	  rebooting.
+
+	  Normal partitions are now called Basic Disks under Windows 2000, XP,
+	  and Vista.
+
+	  For a fuller description read <file:Documentation/ldm.txt>.
+
+	  If unsure, say N.
+
+config LDM_DEBUG
+	bool "Windows LDM extra logging"
+	depends on LDM_PARTITION
+	help
+	  Say Y here if you would like LDM to log verbosely.  This could be
+	  helpful if the driver doesn't work as expected and you'd like to
+	  report a bug.
+
+	  If unsure, say N.
+
+config SGI_PARTITION
+	bool "SGI partition support" if PARTITION_ADVANCED
+	default y if DEFAULT_SGI_PARTITION
+	help
+	  Say Y here if you would like to be able to read the hard disk
+	  partition table format used by SGI machines.
+
+config ULTRIX_PARTITION
+	bool "Ultrix partition table support" if PARTITION_ADVANCED
+	default y if MACH_DECSTATION
+	help
+	  Say Y here if you would like to be able to read the hard disk
+	  partition table format used by DEC (now Compaq) Ultrix machines.
+	  Otherwise, say N.
+
+config SUN_PARTITION
+	bool "Sun partition tables support" if PARTITION_ADVANCED
+	default y if (SPARC || SUN3 || SUN3X)
+	---help---
+	  Like most systems, SunOS uses its own hard disk partition table
+	  format, incompatible with all others. Saying Y here allows you to
+	  read these partition tables and further mount SunOS partitions from
+	  within Linux if you have also said Y to "UFS file system support",
+	  above. This is mainly used to carry data from a SPARC under SunOS to
+	  your Linux box via a removable medium like magneto-optical or ZIP
+	  drives; note however that a good portable way to transport files and
+	  directories between unixes (and even other operating systems) is
+	  given by the tar program ("man tar" or preferably "info tar"). If
+	  you don't know what all this is about, say N.
+
+config KARMA_PARTITION
+	bool "Karma Partition support"
+	depends on PARTITION_ADVANCED
+	help
+	  Say Y here if you would like to mount the Rio Karma MP3 player, as it
+	  uses a proprietary partition table.
+
+config EFI_PARTITION
+	bool "EFI GUID Partition support"
+	depends on PARTITION_ADVANCED
+	select CRC32
+	help
+	  Say Y here if you would like to use hard disks under Linux which
+	  were partitioned using EFI GPT.
+
+config SYSV68_PARTITION
+	bool "SYSV68 partition table support" if PARTITION_ADVANCED
+	default y if VME
+	help
+	  Say Y here if you would like to be able to read the hard disk
+	  partition table format used by Motorola Delta machines (using
+	  sysv68).
+	  Otherwise, say N.
diff --git a/block/partitions/Makefile b/block/partitions/Makefile
new file mode 100644
index 00000000000..03af8eac51d
--- /dev/null
+++ b/block/partitions/Makefile
@@ -0,0 +1,20 @@
+#
+# Makefile for the linux kernel.
+#
+
+obj-$(CONFIG_BLOCK) := check.o
+
+obj-$(CONFIG_ACORN_PARTITION) += acorn.o
+obj-$(CONFIG_AMIGA_PARTITION) += amiga.o
+obj-$(CONFIG_ATARI_PARTITION) += atari.o
+obj-$(CONFIG_MAC_PARTITION) += mac.o
+obj-$(CONFIG_LDM_PARTITION) += ldm.o
+obj-$(CONFIG_MSDOS_PARTITION) += msdos.o
+obj-$(CONFIG_OSF_PARTITION) += osf.o
+obj-$(CONFIG_SGI_PARTITION) += sgi.o
+obj-$(CONFIG_SUN_PARTITION) += sun.o
+obj-$(CONFIG_ULTRIX_PARTITION) += ultrix.o
+obj-$(CONFIG_IBM_PARTITION) += ibm.o
+obj-$(CONFIG_EFI_PARTITION) += efi.o
+obj-$(CONFIG_KARMA_PARTITION) += karma.o
+obj-$(CONFIG_SYSV68_PARTITION) += sysv68.o
diff --git a/block/partitions/acorn.c b/block/partitions/acorn.c
new file mode 100644
index 00000000000..fbeb697374d
--- /dev/null
+++ b/block/partitions/acorn.c
@@ -0,0 +1,556 @@
+/*
+ *  linux/fs/partitions/acorn.c
+ *
+ *  Copyright (c) 1996-2000 Russell King.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ *  Scan ADFS partitions on hard disk drives.  Unfortunately, there
+ *  isn't a standard for partitioning drives on Acorn machines, so
+ *  every single manufacturer of SCSI and IDE cards created their own
+ *  method.
+ */
+#include <linux/buffer_head.h>
+#include <linux/adfs_fs.h>
+
+#include "check.h"
+#include "acorn.h"
+
+/*
+ * Partition types. (Oh for reusability)
+ */
+#define PARTITION_RISCIX_MFM	1
+#define PARTITION_RISCIX_SCSI	2
+#define PARTITION_LINUX		9
+
+#if defined(CONFIG_ACORN_PARTITION_CUMANA) || \
+	defined(CONFIG_ACORN_PARTITION_ADFS)
+static struct adfs_discrecord *
+adfs_partition(struct parsed_partitions *state, char *name, char *data,
+	       unsigned long first_sector, int slot)
+{
+	struct adfs_discrecord *dr;
+	unsigned int nr_sects;
+
+	if (adfs_checkbblk(data))
+		return NULL;
+
+	dr = (struct adfs_discrecord *)(data + 0x1c0);
+
+	if (dr->disc_size == 0 && dr->disc_size_high == 0)
+		return NULL;
+
+	nr_sects = (le32_to_cpu(dr->disc_size_high) << 23) |
+		   (le32_to_cpu(dr->disc_size) >> 9);
+
+	if (name) {
+		strlcat(state->pp_buf, " [", PAGE_SIZE);
+		strlcat(state->pp_buf, name, PAGE_SIZE);
+		strlcat(state->pp_buf, "]", PAGE_SIZE);
+	}
+	put_partition(state, slot, first_sector, nr_sects);
+	return dr;
+}
+#endif
+
+#ifdef CONFIG_ACORN_PARTITION_RISCIX
+
+struct riscix_part {
+	__le32	start;
+	__le32	length;
+	__le32	one;
+	char	name[16];
+};
+
+struct riscix_record {
+	__le32	magic;
+#define RISCIX_MAGIC	cpu_to_le32(0x4a657320)
+	__le32	date;
+	struct riscix_part part[8];
+};
+
+#if defined(CONFIG_ACORN_PARTITION_CUMANA) || \
+	defined(CONFIG_ACORN_PARTITION_ADFS)
+static int riscix_partition(struct parsed_partitions *state,
+			    unsigned long first_sect, int slot,
+			    unsigned long nr_sects)
+{
+	Sector sect;
+	struct riscix_record *rr;
+	
+	rr = read_part_sector(state, first_sect, &sect);
+	if (!rr)
+		return -1;
+
+	strlcat(state->pp_buf, " [RISCiX]", PAGE_SIZE);
+
+
+	if (rr->magic == RISCIX_MAGIC) {
+		unsigned long size = nr_sects > 2 ? 2 : nr_sects;
+		int part;
+
+		strlcat(state->pp_buf, " <", PAGE_SIZE);
+
+		put_partition(state, slot++, first_sect, size);
+		for (part = 0; part < 8; part++) {
+			if (rr->part[part].one &&
+			    memcmp(rr->part[part].name, "All\0", 4)) {
+				put_partition(state, slot++,
+					le32_to_cpu(rr->part[part].start),
+					le32_to_cpu(rr->part[part].length));
+				strlcat(state->pp_buf, "(", PAGE_SIZE);
+				strlcat(state->pp_buf, rr->part[part].name, PAGE_SIZE);
+				strlcat(state->pp_buf, ")", PAGE_SIZE);
+			}
+		}
+
+		strlcat(state->pp_buf, " >\n", PAGE_SIZE);
+	} else {
+		put_partition(state, slot++, first_sect, nr_sects);
+	}
+
+	put_dev_sector(sect);
+	return slot;
+}
+#endif
+#endif
+
+#define LINUX_NATIVE_MAGIC 0xdeafa1de
+#define LINUX_SWAP_MAGIC   0xdeafab1e
+
+struct linux_part {
+	__le32 magic;
+	__le32 start_sect;
+	__le32 nr_sects;
+};
+
+#if defined(CONFIG_ACORN_PARTITION_CUMANA) || \
+	defined(CONFIG_ACORN_PARTITION_ADFS)
+static int linux_partition(struct parsed_partitions *state,
+			   unsigned long first_sect, int slot,
+			   unsigned long nr_sects)
+{
+	Sector sect;
+	struct linux_part *linuxp;
+	unsigned long size = nr_sects > 2 ? 2 : nr_sects;
+
+	strlcat(state->pp_buf, " [Linux]", PAGE_SIZE);
+
+	put_partition(state, slot++, first_sect, size);
+
+	linuxp = read_part_sector(state, first_sect, &sect);
+	if (!linuxp)
+		return -1;
+
+	strlcat(state->pp_buf, " <", PAGE_SIZE);
+	while (linuxp->magic == cpu_to_le32(LINUX_NATIVE_MAGIC) ||
+	       linuxp->magic == cpu_to_le32(LINUX_SWAP_MAGIC)) {
+		if (slot == state->limit)
+			break;
+		put_partition(state, slot++, first_sect +
+				 le32_to_cpu(linuxp->start_sect),
+				 le32_to_cpu(linuxp->nr_sects));
+		linuxp ++;
+	}
+	strlcat(state->pp_buf, " >", PAGE_SIZE);
+
+	put_dev_sector(sect);
+	return slot;
+}
+#endif
+
+#ifdef CONFIG_ACORN_PARTITION_CUMANA
+int adfspart_check_CUMANA(struct parsed_partitions *state)
+{
+	unsigned long first_sector = 0;
+	unsigned int start_blk = 0;
+	Sector sect;
+	unsigned char *data;
+	char *name = "CUMANA/ADFS";
+	int first = 1;
+	int slot = 1;
+
+	/*
+	 * Try Cumana style partitions - sector 6 contains ADFS boot block
+	 * with pointer to next 'drive'.
+	 *
+	 * There are unknowns in this code - is the 'cylinder number' of the
+	 * next partition relative to the start of this one - I'm assuming
+	 * it is.
+	 *
+	 * Also, which ID did Cumana use?
+	 *
+	 * This is totally unfinished, and will require more work to get it
+	 * going. Hence it is totally untested.
+	 */
+	do {
+		struct adfs_discrecord *dr;
+		unsigned int nr_sects;
+
+		data = read_part_sector(state, start_blk * 2 + 6, &sect);
+		if (!data)
+			return -1;
+
+		if (slot == state->limit)
+			break;
+
+		dr = adfs_partition(state, name, data, first_sector, slot++);
+		if (!dr)
+			break;
+
+		name = NULL;
+
+		nr_sects = (data[0x1fd] + (data[0x1fe] << 8)) *
+			   (dr->heads + (dr->lowsector & 0x40 ? 1 : 0)) *
+			   dr->secspertrack;
+
+		if (!nr_sects)
+			break;
+
+		first = 0;
+		first_sector += nr_sects;
+		start_blk += nr_sects >> (BLOCK_SIZE_BITS - 9);
+		nr_sects = 0; /* hmm - should be partition size */
+
+		switch (data[0x1fc] & 15) {
+		case 0: /* No partition / ADFS? */
+			break;
+
+#ifdef CONFIG_ACORN_PARTITION_RISCIX
+		case PARTITION_RISCIX_SCSI:
+			/* RISCiX - we don't know how to find the next one. */
+			slot = riscix_partition(state, first_sector, slot,
+						nr_sects);
+			break;
+#endif
+
+		case PARTITION_LINUX:
+			slot = linux_partition(state, first_sector, slot,
+					       nr_sects);
+			break;
+		}
+		put_dev_sector(sect);
+		if (slot == -1)
+			return -1;
+	} while (1);
+	put_dev_sector(sect);
+	return first ? 0 : 1;
+}
+#endif
+
+#ifdef CONFIG_ACORN_PARTITION_ADFS
+/*
+ * Purpose: allocate ADFS partitions.
+ *
+ * Params : hd		- pointer to gendisk structure to store partition info.
+ *	    dev		- device number to access.
+ *
+ * Returns: -1 on error, 0 for no ADFS boot sector, 1 for ok.
+ *
+ * Alloc  : hda  = whole drive
+ *	    hda1 = ADFS partition on first drive.
+ *	    hda2 = non-ADFS partition.
+ */
+int adfspart_check_ADFS(struct parsed_partitions *state)
+{
+	unsigned long start_sect, nr_sects, sectscyl, heads;
+	Sector sect;
+	unsigned char *data;
+	struct adfs_discrecord *dr;
+	unsigned char id;
+	int slot = 1;
+
+	data = read_part_sector(state, 6, &sect);
+	if (!data)
+		return -1;
+
+	dr = adfs_partition(state, "ADFS", data, 0, slot++);
+	if (!dr) {
+		put_dev_sector(sect);
+    		return 0;
+	}
+
+	heads = dr->heads + ((dr->lowsector >> 6) & 1);
+	sectscyl = dr->secspertrack * heads;
+	start_sect = ((data[0x1fe] << 8) + data[0x1fd]) * sectscyl;
+	id = data[0x1fc] & 15;
+	put_dev_sector(sect);
+
+	/*
+	 * Work out start of non-adfs partition.
+	 */
+	nr_sects = (state->bdev->bd_inode->i_size >> 9) - start_sect;
+
+	if (start_sect) {
+		switch (id) {
+#ifdef CONFIG_ACORN_PARTITION_RISCIX
+		case PARTITION_RISCIX_SCSI:
+		case PARTITION_RISCIX_MFM:
+			slot = riscix_partition(state, start_sect, slot,
+						nr_sects);
+			break;
+#endif
+
+		case PARTITION_LINUX:
+			slot = linux_partition(state, start_sect, slot,
+					       nr_sects);
+			break;
+		}
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_ACORN_PARTITION_ICS
+
+struct ics_part {
+	__le32 start;
+	__le32 size;
+};
+
+static int adfspart_check_ICSLinux(struct parsed_partitions *state,
+				   unsigned long block)
+{
+	Sector sect;
+	unsigned char *data = read_part_sector(state, block, &sect);
+	int result = 0;
+
+	if (data) {
+		if (memcmp(data, "LinuxPart", 9) == 0)
+			result = 1;
+		put_dev_sector(sect);
+	}
+
+	return result;
+}
+
+/*
+ * Check for a valid ICS partition using the checksum.
+ */
+static inline int valid_ics_sector(const unsigned char *data)
+{
+	unsigned long sum;
+	int i;
+
+	for (i = 0, sum = 0x50617274; i < 508; i++)
+		sum += data[i];
+
+	sum -= le32_to_cpu(*(__le32 *)(&data[508]));
+
+	return sum == 0;
+}
+
+/*
+ * Purpose: allocate ICS partitions.
+ * Params : hd		- pointer to gendisk structure to store partition info.
+ *	    dev		- device number to access.
+ * Returns: -1 on error, 0 for no ICS table, 1 for partitions ok.
+ * Alloc  : hda  = whole drive
+ *	    hda1 = ADFS partition 0 on first drive.
+ *	    hda2 = ADFS partition 1 on first drive.
+ *		..etc..
+ */
+int adfspart_check_ICS(struct parsed_partitions *state)
+{
+	const unsigned char *data;
+	const struct ics_part *p;
+	int slot;
+	Sector sect;
+
+	/*
+	 * Try ICS style partitions - sector 0 contains partition info.
+	 */
+	data = read_part_sector(state, 0, &sect);
+	if (!data)
+	    	return -1;
+
+	if (!valid_ics_sector(data)) {
+	    	put_dev_sector(sect);
+		return 0;
+	}
+
+	strlcat(state->pp_buf, " [ICS]", PAGE_SIZE);
+
+	for (slot = 1, p = (const struct ics_part *)data; p->size; p++) {
+		u32 start = le32_to_cpu(p->start);
+		s32 size = le32_to_cpu(p->size); /* yes, it's signed. */
+
+		if (slot == state->limit)
+			break;
+
+		/*
+		 * Negative sizes tell the RISC OS ICS driver to ignore
+		 * this partition - in effect it says that this does not
+		 * contain an ADFS filesystem.
+		 */
+		if (size < 0) {
+			size = -size;
+
+			/*
+			 * Our own extension - We use the first sector
+			 * of the partition to identify what type this
+			 * partition is.  We must not make this visible
+			 * to the filesystem.
+			 */
+			if (size > 1 && adfspart_check_ICSLinux(state, start)) {
+				start += 1;
+				size -= 1;
+			}
+		}
+
+		if (size)
+			put_partition(state, slot++, start, size);
+	}
+
+	put_dev_sector(sect);
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_ACORN_PARTITION_POWERTEC
+struct ptec_part {
+	__le32 unused1;
+	__le32 unused2;
+	__le32 start;
+	__le32 size;
+	__le32 unused5;
+	char type[8];
+};
+
+static inline int valid_ptec_sector(const unsigned char *data)
+{
+	unsigned char checksum = 0x2a;
+	int i;
+
+	/*
+	 * If it looks like a PC/BIOS partition, then it
+	 * probably isn't PowerTec.
+	 */
+	if (data[510] == 0x55 && data[511] == 0xaa)
+		return 0;
+
+	for (i = 0; i < 511; i++)
+		checksum += data[i];
+
+	return checksum == data[511];
+}
+
+/*
+ * Purpose: allocate ICS partitions.
+ * Params : hd		- pointer to gendisk structure to store partition info.
+ *	    dev		- device number to access.
+ * Returns: -1 on error, 0 for no ICS table, 1 for partitions ok.
+ * Alloc  : hda  = whole drive
+ *	    hda1 = ADFS partition 0 on first drive.
+ *	    hda2 = ADFS partition 1 on first drive.
+ *		..etc..
+ */
+int adfspart_check_POWERTEC(struct parsed_partitions *state)
+{
+	Sector sect;
+	const unsigned char *data;
+	const struct ptec_part *p;
+	int slot = 1;
+	int i;
+
+	data = read_part_sector(state, 0, &sect);
+	if (!data)
+		return -1;
+
+	if (!valid_ptec_sector(data)) {
+		put_dev_sector(sect);
+		return 0;
+	}
+
+	strlcat(state->pp_buf, " [POWERTEC]", PAGE_SIZE);
+
+	for (i = 0, p = (const struct ptec_part *)data; i < 12; i++, p++) {
+		u32 start = le32_to_cpu(p->start);
+		u32 size = le32_to_cpu(p->size);
+
+		if (size)
+			put_partition(state, slot++, start, size);
+	}
+
+	put_dev_sector(sect);
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	return 1;
+}
+#endif
+
+#ifdef CONFIG_ACORN_PARTITION_EESOX
+struct eesox_part {
+	char	magic[6];
+	char	name[10];
+	__le32	start;
+	__le32	unused6;
+	__le32	unused7;
+	__le32	unused8;
+};
+
+/*
+ * Guess who created this format?
+ */
+static const char eesox_name[] = {
+	'N', 'e', 'i', 'l', ' ',
+	'C', 'r', 'i', 't', 'c', 'h', 'e', 'l', 'l', ' ', ' '
+};
+
+/*
+ * EESOX SCSI partition format.
+ *
+ * This is a goddamned awful partition format.  We don't seem to store
+ * the size of the partition in this table, only the start addresses.
+ *
+ * There are two possibilities where the size comes from:
+ *  1. The individual ADFS boot block entries that are placed on the disk.
+ *  2. The start address of the next entry.
+ */
+int adfspart_check_EESOX(struct parsed_partitions *state)
+{
+	Sector sect;
+	const unsigned char *data;
+	unsigned char buffer[256];
+	struct eesox_part *p;
+	sector_t start = 0;
+	int i, slot = 1;
+
+	data = read_part_sector(state, 7, &sect);
+	if (!data)
+		return -1;
+
+	/*
+	 * "Decrypt" the partition table.  God knows why...
+	 */
+	for (i = 0; i < 256; i++)
+		buffer[i] = data[i] ^ eesox_name[i & 15];
+
+	put_dev_sector(sect);
+
+	for (i = 0, p = (struct eesox_part *)buffer; i < 8; i++, p++) {
+		sector_t next;
+
+		if (memcmp(p->magic, "Eesox", 6))
+			break;
+
+		next = le32_to_cpu(p->start);
+		if (i)
+			put_partition(state, slot++, start, next - start);
+		start = next;
+	}
+
+	if (i != 0) {
+		sector_t size;
+
+		size = get_capacity(state->bdev->bd_disk);
+		put_partition(state, slot++, start, size - start);
+		strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	}
+
+	return i ? 1 : 0;
+}
+#endif
diff --git a/block/partitions/acorn.h b/block/partitions/acorn.h
new file mode 100644
index 00000000000..ede82852969
--- /dev/null
+++ b/block/partitions/acorn.h
@@ -0,0 +1,14 @@
+/*
+ * linux/fs/partitions/acorn.h
+ *
+ * Copyright (C) 1996-2001 Russell King.
+ *
+ *  I _hate_ this partitioning mess - why can't we have one defined
+ *  format, and everyone stick to it?
+ */
+
+int adfspart_check_CUMANA(struct parsed_partitions *state);
+int adfspart_check_ADFS(struct parsed_partitions *state);
+int adfspart_check_ICS(struct parsed_partitions *state);
+int adfspart_check_POWERTEC(struct parsed_partitions *state);
+int adfspart_check_EESOX(struct parsed_partitions *state);
diff --git a/block/partitions/amiga.c b/block/partitions/amiga.c
new file mode 100644
index 00000000000..70cbf44a156
--- /dev/null
+++ b/block/partitions/amiga.c
@@ -0,0 +1,139 @@
+/*
+ *  fs/partitions/amiga.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ */
+
+#include <linux/types.h>
+#include <linux/affs_hardblocks.h>
+
+#include "check.h"
+#include "amiga.h"
+
+static __inline__ u32
+checksum_block(__be32 *m, int size)
+{
+	u32 sum = 0;
+
+	while (size--)
+		sum += be32_to_cpu(*m++);
+	return sum;
+}
+
+int amiga_partition(struct parsed_partitions *state)
+{
+	Sector sect;
+	unsigned char *data;
+	struct RigidDiskBlock *rdb;
+	struct PartitionBlock *pb;
+	int start_sect, nr_sects, blk, part, res = 0;
+	int blksize = 1;	/* Multiplier for disk block size */
+	int slot = 1;
+	char b[BDEVNAME_SIZE];
+
+	for (blk = 0; ; blk++, put_dev_sector(sect)) {
+		if (blk == RDB_ALLOCATION_LIMIT)
+			goto rdb_done;
+		data = read_part_sector(state, blk, &sect);
+		if (!data) {
+			if (warn_no_part)
+				printk("Dev %s: unable to read RDB block %d\n",
+				       bdevname(state->bdev, b), blk);
+			res = -1;
+			goto rdb_done;
+		}
+		if (*(__be32 *)data != cpu_to_be32(IDNAME_RIGIDDISK))
+			continue;
+
+		rdb = (struct RigidDiskBlock *)data;
+		if (checksum_block((__be32 *)data, be32_to_cpu(rdb->rdb_SummedLongs) & 0x7F) == 0)
+			break;
+		/* Try again with 0xdc..0xdf zeroed, Windows might have
+		 * trashed it.
+		 */
+		*(__be32 *)(data+0xdc) = 0;
+		if (checksum_block((__be32 *)data,
+				be32_to_cpu(rdb->rdb_SummedLongs) & 0x7F)==0) {
+			printk("Warning: Trashed word at 0xd0 in block %d "
+				"ignored in checksum calculation\n",blk);
+			break;
+		}
+
+		printk("Dev %s: RDB in block %d has bad checksum\n",
+		       bdevname(state->bdev, b), blk);
+	}
+
+	/* blksize is blocks per 512 byte standard block */
+	blksize = be32_to_cpu( rdb->rdb_BlockBytes ) / 512;
+
+	{
+		char tmp[7 + 10 + 1 + 1];
+
+		/* Be more informative */
+		snprintf(tmp, sizeof(tmp), " RDSK (%d)", blksize * 512);
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+	}
+	blk = be32_to_cpu(rdb->rdb_PartitionList);
+	put_dev_sector(sect);
+	for (part = 1; blk>0 && part<=16; part++, put_dev_sector(sect)) {
+		blk *= blksize;	/* Read in terms partition table understands */
+		data = read_part_sector(state, blk, &sect);
+		if (!data) {
+			if (warn_no_part)
+				printk("Dev %s: unable to read partition block %d\n",
+				       bdevname(state->bdev, b), blk);
+			res = -1;
+			goto rdb_done;
+		}
+		pb  = (struct PartitionBlock *)data;
+		blk = be32_to_cpu(pb->pb_Next);
+		if (pb->pb_ID != cpu_to_be32(IDNAME_PARTITION))
+			continue;
+		if (checksum_block((__be32 *)pb, be32_to_cpu(pb->pb_SummedLongs) & 0x7F) != 0 )
+			continue;
+
+		/* Tell Kernel about it */
+
+		nr_sects = (be32_to_cpu(pb->pb_Environment[10]) + 1 -
+			    be32_to_cpu(pb->pb_Environment[9])) *
+			   be32_to_cpu(pb->pb_Environment[3]) *
+			   be32_to_cpu(pb->pb_Environment[5]) *
+			   blksize;
+		if (!nr_sects)
+			continue;
+		start_sect = be32_to_cpu(pb->pb_Environment[9]) *
+			     be32_to_cpu(pb->pb_Environment[3]) *
+			     be32_to_cpu(pb->pb_Environment[5]) *
+			     blksize;
+		put_partition(state,slot++,start_sect,nr_sects);
+		{
+			/* Be even more informative to aid mounting */
+			char dostype[4];
+			char tmp[42];
+
+			__be32 *dt = (__be32 *)dostype;
+			*dt = pb->pb_Environment[16];
+			if (dostype[3] < ' ')
+				snprintf(tmp, sizeof(tmp), " (%c%c%c^%c)",
+					dostype[0], dostype[1],
+					dostype[2], dostype[3] + '@' );
+			else
+				snprintf(tmp, sizeof(tmp), " (%c%c%c%c)",
+					dostype[0], dostype[1],
+					dostype[2], dostype[3]);
+			strlcat(state->pp_buf, tmp, PAGE_SIZE);
+			snprintf(tmp, sizeof(tmp), "(res %d spb %d)",
+				be32_to_cpu(pb->pb_Environment[6]),
+				be32_to_cpu(pb->pb_Environment[4]));
+			strlcat(state->pp_buf, tmp, PAGE_SIZE);
+		}
+		res = 1;
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+
+rdb_done:
+	return res;
+}
diff --git a/block/partitions/amiga.h b/block/partitions/amiga.h
new file mode 100644
index 00000000000..d094585cada
--- /dev/null
+++ b/block/partitions/amiga.h
@@ -0,0 +1,6 @@
+/*
+ *  fs/partitions/amiga.h
+ */
+
+int amiga_partition(struct parsed_partitions *state);
+
diff --git a/block/partitions/atari.c b/block/partitions/atari.c
new file mode 100644
index 00000000000..9875b05e80a
--- /dev/null
+++ b/block/partitions/atari.c
@@ -0,0 +1,149 @@
+/*
+ *  fs/partitions/atari.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ */
+
+#include <linux/ctype.h>
+#include "check.h"
+#include "atari.h"
+
+/* ++guenther: this should be settable by the user ("make config")?.
+ */
+#define ICD_PARTS
+
+/* check if a partition entry looks valid -- Atari format is assumed if at
+   least one of the primary entries is ok this way */
+#define	VALID_PARTITION(pi,hdsiz)					     \
+    (((pi)->flg & 1) &&							     \
+     isalnum((pi)->id[0]) && isalnum((pi)->id[1]) && isalnum((pi)->id[2]) && \
+     be32_to_cpu((pi)->st) <= (hdsiz) &&				     \
+     be32_to_cpu((pi)->st) + be32_to_cpu((pi)->siz) <= (hdsiz))
+
+static inline int OK_id(char *s)
+{
+	return  memcmp (s, "GEM", 3) == 0 || memcmp (s, "BGM", 3) == 0 ||
+		memcmp (s, "LNX", 3) == 0 || memcmp (s, "SWP", 3) == 0 ||
+		memcmp (s, "RAW", 3) == 0 ;
+}
+
+int atari_partition(struct parsed_partitions *state)
+{
+	Sector sect;
+	struct rootsector *rs;
+	struct partition_info *pi;
+	u32 extensect;
+	u32 hd_size;
+	int slot;
+#ifdef ICD_PARTS
+	int part_fmt = 0; /* 0:unknown, 1:AHDI, 2:ICD/Supra */
+#endif
+
+	rs = read_part_sector(state, 0, &sect);
+	if (!rs)
+		return -1;
+
+	/* Verify this is an Atari rootsector: */
+	hd_size = state->bdev->bd_inode->i_size >> 9;
+	if (!VALID_PARTITION(&rs->part[0], hd_size) &&
+	    !VALID_PARTITION(&rs->part[1], hd_size) &&
+	    !VALID_PARTITION(&rs->part[2], hd_size) &&
+	    !VALID_PARTITION(&rs->part[3], hd_size)) {
+		/*
+		 * if there's no valid primary partition, assume that no Atari
+		 * format partition table (there's no reliable magic or the like
+	         * :-()
+		 */
+		put_dev_sector(sect);
+		return 0;
+	}
+
+	pi = &rs->part[0];
+	strlcat(state->pp_buf, " AHDI", PAGE_SIZE);
+	for (slot = 1; pi < &rs->part[4] && slot < state->limit; slot++, pi++) {
+		struct rootsector *xrs;
+		Sector sect2;
+		ulong partsect;
+
+		if ( !(pi->flg & 1) )
+			continue;
+		/* active partition */
+		if (memcmp (pi->id, "XGM", 3) != 0) {
+			/* we don't care about other id's */
+			put_partition (state, slot, be32_to_cpu(pi->st),
+					be32_to_cpu(pi->siz));
+			continue;
+		}
+		/* extension partition */
+#ifdef ICD_PARTS
+		part_fmt = 1;
+#endif
+		strlcat(state->pp_buf, " XGM<", PAGE_SIZE);
+		partsect = extensect = be32_to_cpu(pi->st);
+		while (1) {
+			xrs = read_part_sector(state, partsect, &sect2);
+			if (!xrs) {
+				printk (" block %ld read failed\n", partsect);
+				put_dev_sector(sect);
+				return -1;
+			}
+
+			/* ++roman: sanity check: bit 0 of flg field must be set */
+			if (!(xrs->part[0].flg & 1)) {
+				printk( "\nFirst sub-partition in extended partition is not valid!\n" );
+				put_dev_sector(sect2);
+				break;
+			}
+
+			put_partition(state, slot,
+				   partsect + be32_to_cpu(xrs->part[0].st),
+				   be32_to_cpu(xrs->part[0].siz));
+
+			if (!(xrs->part[1].flg & 1)) {
+				/* end of linked partition list */
+				put_dev_sector(sect2);
+				break;
+			}
+			if (memcmp( xrs->part[1].id, "XGM", 3 ) != 0) {
+				printk("\nID of extended partition is not XGM!\n");
+				put_dev_sector(sect2);
+				break;
+			}
+
+			partsect = be32_to_cpu(xrs->part[1].st) + extensect;
+			put_dev_sector(sect2);
+			if (++slot == state->limit) {
+				printk( "\nMaximum number of partitions reached!\n" );
+				break;
+			}
+		}
+		strlcat(state->pp_buf, " >", PAGE_SIZE);
+	}
+#ifdef ICD_PARTS
+	if ( part_fmt!=1 ) { /* no extended partitions -> test ICD-format */
+		pi = &rs->icdpart[0];
+		/* sanity check: no ICD format if first partition invalid */
+		if (OK_id(pi->id)) {
+			strlcat(state->pp_buf, " ICD<", PAGE_SIZE);
+			for (; pi < &rs->icdpart[8] && slot < state->limit; slot++, pi++) {
+				/* accept only GEM,BGM,RAW,LNX,SWP partitions */
+				if (!((pi->flg & 1) && OK_id(pi->id)))
+					continue;
+				part_fmt = 2;
+				put_partition (state, slot,
+						be32_to_cpu(pi->st),
+						be32_to_cpu(pi->siz));
+			}
+			strlcat(state->pp_buf, " >", PAGE_SIZE);
+		}
+	}
+#endif
+	put_dev_sector(sect);
+
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+
+	return 1;
+}
diff --git a/block/partitions/atari.h b/block/partitions/atari.h
new file mode 100644
index 00000000000..fe2d32a89f3
--- /dev/null
+++ b/block/partitions/atari.h
@@ -0,0 +1,34 @@
+/*
+ *  fs/partitions/atari.h
+ *  Moved by Russell King from:
+ *
+ * linux/include/linux/atari_rootsec.h
+ * definitions for Atari Rootsector layout
+ * by Andreas Schwab (schwab@ls5.informatik.uni-dortmund.de)
+ *
+ * modified for ICD/Supra partitioning scheme restricted to at most 12
+ * partitions
+ * by Guenther Kelleter (guenther@pool.informatik.rwth-aachen.de)
+ */
+
+struct partition_info
+{
+  u8 flg;			/* bit 0: active; bit 7: bootable */
+  char id[3];			/* "GEM", "BGM", "XGM", or other */
+  __be32 st;			/* start of partition */
+  __be32 siz;			/* length of partition */
+};
+
+struct rootsector
+{
+  char unused[0x156];		/* room for boot code */
+  struct partition_info icdpart[8];	/* info for ICD-partitions 5..12 */
+  char unused2[0xc];
+  u32 hd_siz;			/* size of disk in blocks */
+  struct partition_info part[4];
+  u32 bsl_st;			/* start of bad sector list */
+  u32 bsl_cnt;			/* length of bad sector list */
+  u16 checksum;			/* checksum for bootable disks */
+} __attribute__((__packed__));
+
+int atari_partition(struct parsed_partitions *state);
diff --git a/block/partitions/check.c b/block/partitions/check.c
new file mode 100644
index 00000000000..e3c63d1c5e1
--- /dev/null
+++ b/block/partitions/check.c
@@ -0,0 +1,687 @@
+/*
+ *  fs/partitions/check.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ *
+ *  We now have independent partition support from the
+ *  block drivers, which allows all the partition code to
+ *  be grouped in one location, and it to be mostly self
+ *  contained.
+ *
+ *  Added needed MAJORS for new pairs, {hdi,hdj}, {hdk,hdl}
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/kmod.h>
+#include <linux/ctype.h>
+#include <linux/genhd.h>
+#include <linux/blktrace_api.h>
+
+#include "check.h"
+
+#include "acorn.h"
+#include "amiga.h"
+#include "atari.h"
+#include "ldm.h"
+#include "mac.h"
+#include "msdos.h"
+#include "osf.h"
+#include "sgi.h"
+#include "sun.h"
+#include "ibm.h"
+#include "ultrix.h"
+#include "efi.h"
+#include "karma.h"
+#include "sysv68.h"
+
+#ifdef CONFIG_BLK_DEV_MD
+extern void md_autodetect_dev(dev_t dev);
+#endif
+
+int warn_no_part = 1; /*This is ugly: should make genhd removable media aware*/
+
+static int (*check_part[])(struct parsed_partitions *) = {
+	/*
+	 * Probe partition formats with tables at disk address 0
+	 * that also have an ADFS boot block at 0xdc0.
+	 */
+#ifdef CONFIG_ACORN_PARTITION_ICS
+	adfspart_check_ICS,
+#endif
+#ifdef CONFIG_ACORN_PARTITION_POWERTEC
+	adfspart_check_POWERTEC,
+#endif
+#ifdef CONFIG_ACORN_PARTITION_EESOX
+	adfspart_check_EESOX,
+#endif
+
+	/*
+	 * Now move on to formats that only have partition info at
+	 * disk address 0xdc0.  Since these may also have stale
+	 * PC/BIOS partition tables, they need to come before
+	 * the msdos entry.
+	 */
+#ifdef CONFIG_ACORN_PARTITION_CUMANA
+	adfspart_check_CUMANA,
+#endif
+#ifdef CONFIG_ACORN_PARTITION_ADFS
+	adfspart_check_ADFS,
+#endif
+
+#ifdef CONFIG_EFI_PARTITION
+	efi_partition,		/* this must come before msdos */
+#endif
+#ifdef CONFIG_SGI_PARTITION
+	sgi_partition,
+#endif
+#ifdef CONFIG_LDM_PARTITION
+	ldm_partition,		/* this must come before msdos */
+#endif
+#ifdef CONFIG_MSDOS_PARTITION
+	msdos_partition,
+#endif
+#ifdef CONFIG_OSF_PARTITION
+	osf_partition,
+#endif
+#ifdef CONFIG_SUN_PARTITION
+	sun_partition,
+#endif
+#ifdef CONFIG_AMIGA_PARTITION
+	amiga_partition,
+#endif
+#ifdef CONFIG_ATARI_PARTITION
+	atari_partition,
+#endif
+#ifdef CONFIG_MAC_PARTITION
+	mac_partition,
+#endif
+#ifdef CONFIG_ULTRIX_PARTITION
+	ultrix_partition,
+#endif
+#ifdef CONFIG_IBM_PARTITION
+	ibm_partition,
+#endif
+#ifdef CONFIG_KARMA_PARTITION
+	karma_partition,
+#endif
+#ifdef CONFIG_SYSV68_PARTITION
+	sysv68_partition,
+#endif
+	NULL
+};
+ 
+/*
+ * disk_name() is used by partition check code and the genhd driver.
+ * It formats the devicename of the indicated disk into
+ * the supplied buffer (of size at least 32), and returns
+ * a pointer to that same buffer (for convenience).
+ */
+
+char *disk_name(struct gendisk *hd, int partno, char *buf)
+{
+	if (!partno)
+		snprintf(buf, BDEVNAME_SIZE, "%s", hd->disk_name);
+	else if (isdigit(hd->disk_name[strlen(hd->disk_name)-1]))
+		snprintf(buf, BDEVNAME_SIZE, "%sp%d", hd->disk_name, partno);
+	else
+		snprintf(buf, BDEVNAME_SIZE, "%s%d", hd->disk_name, partno);
+
+	return buf;
+}
+
+const char *bdevname(struct block_device *bdev, char *buf)
+{
+	return disk_name(bdev->bd_disk, bdev->bd_part->partno, buf);
+}
+
+EXPORT_SYMBOL(bdevname);
+
+/*
+ * There's very little reason to use this, you should really
+ * have a struct block_device just about everywhere and use
+ * bdevname() instead.
+ */
+const char *__bdevname(dev_t dev, char *buffer)
+{
+	scnprintf(buffer, BDEVNAME_SIZE, "unknown-block(%u,%u)",
+				MAJOR(dev), MINOR(dev));
+	return buffer;
+}
+
+EXPORT_SYMBOL(__bdevname);
+
+static struct parsed_partitions *
+check_partition(struct gendisk *hd, struct block_device *bdev)
+{
+	struct parsed_partitions *state;
+	int i, res, err;
+
+	state = kzalloc(sizeof(struct parsed_partitions), GFP_KERNEL);
+	if (!state)
+		return NULL;
+	state->pp_buf = (char *)__get_free_page(GFP_KERNEL);
+	if (!state->pp_buf) {
+		kfree(state);
+		return NULL;
+	}
+	state->pp_buf[0] = '\0';
+
+	state->bdev = bdev;
+	disk_name(hd, 0, state->name);
+	snprintf(state->pp_buf, PAGE_SIZE, " %s:", state->name);
+	if (isdigit(state->name[strlen(state->name)-1]))
+		sprintf(state->name, "p");
+
+	state->limit = disk_max_parts(hd);
+	i = res = err = 0;
+	while (!res && check_part[i]) {
+		memset(&state->parts, 0, sizeof(state->parts));
+		res = check_part[i++](state);
+		if (res < 0) {
+			/* We have hit an I/O error which we don't report now.
+		 	* But record it, and let the others do their job.
+		 	*/
+			err = res;
+			res = 0;
+		}
+
+	}
+	if (res > 0) {
+		printk(KERN_INFO "%s", state->pp_buf);
+
+		free_page((unsigned long)state->pp_buf);
+		return state;
+	}
+	if (state->access_beyond_eod)
+		err = -ENOSPC;
+	if (err)
+	/* The partition is unrecognized. So report I/O errors if there were any */
+		res = err;
+	if (!res)
+		strlcat(state->pp_buf, " unknown partition table\n", PAGE_SIZE);
+	else if (warn_no_part)
+		strlcat(state->pp_buf, " unable to read partition table\n", PAGE_SIZE);
+
+	printk(KERN_INFO "%s", state->pp_buf);
+
+	free_page((unsigned long)state->pp_buf);
+	kfree(state);
+	return ERR_PTR(res);
+}
+
+static ssize_t part_partition_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%d\n", p->partno);
+}
+
+static ssize_t part_start_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%llu\n",(unsigned long long)p->start_sect);
+}
+
+ssize_t part_size_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%llu\n",(unsigned long long)p->nr_sects);
+}
+
+static ssize_t part_ro_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%d\n", p->policy ? 1 : 0);
+}
+
+static ssize_t part_alignment_offset_show(struct device *dev,
+					  struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%llu\n", (unsigned long long)p->alignment_offset);
+}
+
+static ssize_t part_discard_alignment_show(struct device *dev,
+					   struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%u\n", p->discard_alignment);
+}
+
+ssize_t part_stat_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	int cpu;
+
+	cpu = part_stat_lock();
+	part_round_stats(cpu, p);
+	part_stat_unlock();
+	return sprintf(buf,
+		"%8lu %8lu %8llu %8u "
+		"%8lu %8lu %8llu %8u "
+		"%8u %8u %8u"
+		"\n",
+		part_stat_read(p, ios[READ]),
+		part_stat_read(p, merges[READ]),
+		(unsigned long long)part_stat_read(p, sectors[READ]),
+		jiffies_to_msecs(part_stat_read(p, ticks[READ])),
+		part_stat_read(p, ios[WRITE]),
+		part_stat_read(p, merges[WRITE]),
+		(unsigned long long)part_stat_read(p, sectors[WRITE]),
+		jiffies_to_msecs(part_stat_read(p, ticks[WRITE])),
+		part_in_flight(p),
+		jiffies_to_msecs(part_stat_read(p, io_ticks)),
+		jiffies_to_msecs(part_stat_read(p, time_in_queue)));
+}
+
+ssize_t part_inflight_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
+		atomic_read(&p->in_flight[1]));
+}
+
+#ifdef CONFIG_FAIL_MAKE_REQUEST
+ssize_t part_fail_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%d\n", p->make_it_fail);
+}
+
+ssize_t part_fail_store(struct device *dev,
+			struct device_attribute *attr,
+			const char *buf, size_t count)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	int i;
+
+	if (count > 0 && sscanf(buf, "%d", &i) > 0)
+		p->make_it_fail = (i == 0) ? 0 : 1;
+
+	return count;
+}
+#endif
+
+static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
+static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
+static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
+static DEVICE_ATTR(ro, S_IRUGO, part_ro_show, NULL);
+static DEVICE_ATTR(alignment_offset, S_IRUGO, part_alignment_offset_show, NULL);
+static DEVICE_ATTR(discard_alignment, S_IRUGO, part_discard_alignment_show,
+		   NULL);
+static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
+static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+#ifdef CONFIG_FAIL_MAKE_REQUEST
+static struct device_attribute dev_attr_fail =
+	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
+#endif
+
+static struct attribute *part_attrs[] = {
+	&dev_attr_partition.attr,
+	&dev_attr_start.attr,
+	&dev_attr_size.attr,
+	&dev_attr_ro.attr,
+	&dev_attr_alignment_offset.attr,
+	&dev_attr_discard_alignment.attr,
+	&dev_attr_stat.attr,
+	&dev_attr_inflight.attr,
+#ifdef CONFIG_FAIL_MAKE_REQUEST
+	&dev_attr_fail.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group part_attr_group = {
+	.attrs = part_attrs,
+};
+
+static const struct attribute_group *part_attr_groups[] = {
+	&part_attr_group,
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	&blk_trace_attr_group,
+#endif
+	NULL
+};
+
+static void part_release(struct device *dev)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	free_part_stats(p);
+	free_part_info(p);
+	kfree(p);
+}
+
+struct device_type part_type = {
+	.name		= "partition",
+	.groups		= part_attr_groups,
+	.release	= part_release,
+};
+
+static void delete_partition_rcu_cb(struct rcu_head *head)
+{
+	struct hd_struct *part = container_of(head, struct hd_struct, rcu_head);
+
+	part->start_sect = 0;
+	part->nr_sects = 0;
+	part_stat_set_all(part, 0);
+	put_device(part_to_dev(part));
+}
+
+void __delete_partition(struct hd_struct *part)
+{
+	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+}
+
+void delete_partition(struct gendisk *disk, int partno)
+{
+	struct disk_part_tbl *ptbl = disk->part_tbl;
+	struct hd_struct *part;
+
+	if (partno >= ptbl->len)
+		return;
+
+	part = ptbl->part[partno];
+	if (!part)
+		return;
+
+	blk_free_devt(part_devt(part));
+	rcu_assign_pointer(ptbl->part[partno], NULL);
+	rcu_assign_pointer(ptbl->last_lookup, NULL);
+	kobject_put(part->holder_dir);
+	device_del(part_to_dev(part));
+
+	hd_struct_put(part);
+}
+
+static ssize_t whole_disk_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	return 0;
+}
+static DEVICE_ATTR(whole_disk, S_IRUSR | S_IRGRP | S_IROTH,
+		   whole_disk_show, NULL);
+
+struct hd_struct *add_partition(struct gendisk *disk, int partno,
+				sector_t start, sector_t len, int flags,
+				struct partition_meta_info *info)
+{
+	struct hd_struct *p;
+	dev_t devt = MKDEV(0, 0);
+	struct device *ddev = disk_to_dev(disk);
+	struct device *pdev;
+	struct disk_part_tbl *ptbl;
+	const char *dname;
+	int err;
+
+	err = disk_expand_part_tbl(disk, partno);
+	if (err)
+		return ERR_PTR(err);
+	ptbl = disk->part_tbl;
+
+	if (ptbl->part[partno])
+		return ERR_PTR(-EBUSY);
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	if (!p)
+		return ERR_PTR(-EBUSY);
+
+	if (!init_part_stats(p)) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	pdev = part_to_dev(p);
+
+	p->start_sect = start;
+	p->alignment_offset =
+		queue_limit_alignment_offset(&disk->queue->limits, start);
+	p->discard_alignment =
+		queue_limit_discard_alignment(&disk->queue->limits, start);
+	p->nr_sects = len;
+	p->partno = partno;
+	p->policy = get_disk_ro(disk);
+
+	if (info) {
+		struct partition_meta_info *pinfo = alloc_part_info(disk);
+		if (!pinfo)
+			goto out_free_stats;
+		memcpy(pinfo, info, sizeof(*info));
+		p->info = pinfo;
+	}
+
+	dname = dev_name(ddev);
+	if (isdigit(dname[strlen(dname) - 1]))
+		dev_set_name(pdev, "%sp%d", dname, partno);
+	else
+		dev_set_name(pdev, "%s%d", dname, partno);
+
+	device_initialize(pdev);
+	pdev->class = &block_class;
+	pdev->type = &part_type;
+	pdev->parent = ddev;
+
+	err = blk_alloc_devt(p, &devt);
+	if (err)
+		goto out_free_info;
+	pdev->devt = devt;
+
+	/* delay uevent until 'holders' subdir is created */
+	dev_set_uevent_suppress(pdev, 1);
+	err = device_add(pdev);
+	if (err)
+		goto out_put;
+
+	err = -ENOMEM;
+	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
+	if (!p->holder_dir)
+		goto out_del;
+
+	dev_set_uevent_suppress(pdev, 0);
+	if (flags & ADDPART_FLAG_WHOLEDISK) {
+		err = device_create_file(pdev, &dev_attr_whole_disk);
+		if (err)
+			goto out_del;
+	}
+
+	/* everything is up and running, commence */
+	rcu_assign_pointer(ptbl->part[partno], p);
+
+	/* suppress uevent if the disk suppresses it */
+	if (!dev_get_uevent_suppress(ddev))
+		kobject_uevent(&pdev->kobj, KOBJ_ADD);
+
+	hd_ref_init(p);
+	return p;
+
+out_free_info:
+	free_part_info(p);
+out_free_stats:
+	free_part_stats(p);
+out_free:
+	kfree(p);
+	return ERR_PTR(err);
+out_del:
+	kobject_put(p->holder_dir);
+	device_del(pdev);
+out_put:
+	put_device(pdev);
+	blk_free_devt(devt);
+	return ERR_PTR(err);
+}
+
+static bool disk_unlock_native_capacity(struct gendisk *disk)
+{
+	const struct block_device_operations *bdops = disk->fops;
+
+	if (bdops->unlock_native_capacity &&
+	    !(disk->flags & GENHD_FL_NATIVE_CAPACITY)) {
+		printk(KERN_CONT "enabling native capacity\n");
+		bdops->unlock_native_capacity(disk);
+		disk->flags |= GENHD_FL_NATIVE_CAPACITY;
+		return true;
+	} else {
+		printk(KERN_CONT "truncated\n");
+		return false;
+	}
+}
+
+int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
+{
+	struct parsed_partitions *state = NULL;
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+	int p, highest, res;
+rescan:
+	if (state && !IS_ERR(state)) {
+		kfree(state);
+		state = NULL;
+	}
+
+	if (bdev->bd_part_count)
+		return -EBUSY;
+	res = invalidate_partition(disk, 0);
+	if (res)
+		return res;
+
+	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
+	while ((part = disk_part_iter_next(&piter)))
+		delete_partition(disk, part->partno);
+	disk_part_iter_exit(&piter);
+
+	if (disk->fops->revalidate_disk)
+		disk->fops->revalidate_disk(disk);
+	check_disk_size_change(disk, bdev);
+	bdev->bd_invalidated = 0;
+	if (!get_capacity(disk) || !(state = check_partition(disk, bdev)))
+		return 0;
+	if (IS_ERR(state)) {
+		/*
+		 * I/O error reading the partition table.  If any
+		 * partition code tried to read beyond EOD, retry
+		 * after unlocking native capacity.
+		 */
+		if (PTR_ERR(state) == -ENOSPC) {
+			printk(KERN_WARNING "%s: partition table beyond EOD, ",
+			       disk->disk_name);
+			if (disk_unlock_native_capacity(disk))
+				goto rescan;
+		}
+		return -EIO;
+	}
+	/*
+	 * If any partition code tried to read beyond EOD, try
+	 * unlocking native capacity even if partition table is
+	 * successfully read as we could be missing some partitions.
+	 */
+	if (state->access_beyond_eod) {
+		printk(KERN_WARNING
+		       "%s: partition table partially beyond EOD, ",
+		       disk->disk_name);
+		if (disk_unlock_native_capacity(disk))
+			goto rescan;
+	}
+
+	/* tell userspace that the media / partition table may have changed */
+	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
+
+	/* Detect the highest partition number and preallocate
+	 * disk->part_tbl.  This is an optimization and not strictly
+	 * necessary.
+	 */
+	for (p = 1, highest = 0; p < state->limit; p++)
+		if (state->parts[p].size)
+			highest = p;
+
+	disk_expand_part_tbl(disk, highest);
+
+	/* add partitions */
+	for (p = 1; p < state->limit; p++) {
+		sector_t size, from;
+		struct partition_meta_info *info = NULL;
+
+		size = state->parts[p].size;
+		if (!size)
+			continue;
+
+		from = state->parts[p].from;
+		if (from >= get_capacity(disk)) {
+			printk(KERN_WARNING
+			       "%s: p%d start %llu is beyond EOD, ",
+			       disk->disk_name, p, (unsigned long long) from);
+			if (disk_unlock_native_capacity(disk))
+				goto rescan;
+			continue;
+		}
+
+		if (from + size > get_capacity(disk)) {
+			printk(KERN_WARNING
+			       "%s: p%d size %llu extends beyond EOD, ",
+			       disk->disk_name, p, (unsigned long long) size);
+
+			if (disk_unlock_native_capacity(disk)) {
+				/* free state and restart */
+				goto rescan;
+			} else {
+				/*
+				 * we can not ignore partitions of broken tables
+				 * created by for example camera firmware, but
+				 * we limit them to the end of the disk to avoid
+				 * creating invalid block devices
+				 */
+				size = get_capacity(disk) - from;
+			}
+		}
+
+		if (state->parts[p].has_info)
+			info = &state->parts[p].info;
+		part = add_partition(disk, p, from, size,
+				     state->parts[p].flags,
+				     &state->parts[p].info);
+		if (IS_ERR(part)) {
+			printk(KERN_ERR " %s: p%d could not be added: %ld\n",
+			       disk->disk_name, p, -PTR_ERR(part));
+			continue;
+		}
+#ifdef CONFIG_BLK_DEV_MD
+		if (state->parts[p].flags & ADDPART_FLAG_RAID)
+			md_autodetect_dev(part_to_dev(part)->devt);
+#endif
+	}
+	kfree(state);
+	return 0;
+}
+
+unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
+{
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
+	struct page *page;
+
+	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
+				 NULL);
+	if (!IS_ERR(page)) {
+		if (PageError(page))
+			goto fail;
+		p->v = page;
+		return (unsigned char *)page_address(page) +  ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
+fail:
+		page_cache_release(page);
+	}
+	p->v = NULL;
+	return NULL;
+}
+
+EXPORT_SYMBOL(read_dev_sector);
diff --git a/block/partitions/check.h b/block/partitions/check.h
new file mode 100644
index 00000000000..d68bf4dc3bc
--- /dev/null
+++ b/block/partitions/check.h
@@ -0,0 +1,49 @@
+#include <linux/pagemap.h>
+#include <linux/blkdev.h>
+#include <linux/genhd.h>
+
+/*
+ * add_gd_partition adds a partitions details to the devices partition
+ * description.
+ */
+struct parsed_partitions {
+	struct block_device *bdev;
+	char name[BDEVNAME_SIZE];
+	struct {
+		sector_t from;
+		sector_t size;
+		int flags;
+		bool has_info;
+		struct partition_meta_info info;
+	} parts[DISK_MAX_PARTS];
+	int next;
+	int limit;
+	bool access_beyond_eod;
+	char *pp_buf;
+};
+
+static inline void *read_part_sector(struct parsed_partitions *state,
+				     sector_t n, Sector *p)
+{
+	if (n >= get_capacity(state->bdev->bd_disk)) {
+		state->access_beyond_eod = true;
+		return NULL;
+	}
+	return read_dev_sector(state->bdev, n, p);
+}
+
+static inline void
+put_partition(struct parsed_partitions *p, int n, sector_t from, sector_t size)
+{
+	if (n < p->limit) {
+		char tmp[1 + BDEVNAME_SIZE + 10 + 1];
+
+		p->parts[n].from = from;
+		p->parts[n].size = size;
+		snprintf(tmp, sizeof(tmp), " %s%d", p->name, n);
+		strlcat(p->pp_buf, tmp, PAGE_SIZE);
+	}
+}
+
+extern int warn_no_part;
+
diff --git a/block/partitions/efi.c b/block/partitions/efi.c
new file mode 100644
index 00000000000..6296b403c67
--- /dev/null
+++ b/block/partitions/efi.c
@@ -0,0 +1,675 @@
+/************************************************************
+ * EFI GUID Partition Table handling
+ *
+ * http://www.uefi.org/specs/
+ * http://www.intel.com/technology/efi/
+ *
+ * efi.[ch] by Matt Domsch <Matt_Domsch@dell.com>
+ *   Copyright 2000,2001,2002,2004 Dell Inc.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ * TODO:
+ *
+ * Changelog:
+ * Mon Nov 09 2004 Matt Domsch <Matt_Domsch@dell.com>
+ * - test for valid PMBR and valid PGPT before ever reading
+ *   AGPT, allow override with 'gpt' kernel command line option.
+ * - check for first/last_usable_lba outside of size of disk
+ *
+ * Tue  Mar 26 2002 Matt Domsch <Matt_Domsch@dell.com>
+ * - Ported to 2.5.7-pre1 and 2.5.7-dj2
+ * - Applied patch to avoid fault in alternate header handling
+ * - cleaned up find_valid_gpt
+ * - On-disk structure and copy in memory is *always* LE now - 
+ *   swab fields as needed
+ * - remove print_gpt_header()
+ * - only use first max_p partition entries, to keep the kernel minor number
+ *   and partition numbers tied.
+ *
+ * Mon  Feb 04 2002 Matt Domsch <Matt_Domsch@dell.com>
+ * - Removed __PRIPTR_PREFIX - not being used
+ *
+ * Mon  Jan 14 2002 Matt Domsch <Matt_Domsch@dell.com>
+ * - Ported to 2.5.2-pre11 + library crc32 patch Linus applied
+ *
+ * Thu Dec 6 2001 Matt Domsch <Matt_Domsch@dell.com>
+ * - Added compare_gpts().
+ * - moved le_efi_guid_to_cpus() back into this file.  GPT is the only
+ *   thing that keeps EFI GUIDs on disk.
+ * - Changed gpt structure names and members to be simpler and more Linux-like.
+ * 
+ * Wed Oct 17 2001 Matt Domsch <Matt_Domsch@dell.com>
+ * - Removed CONFIG_DEVFS_VOLUMES_UUID code entirely per Martin Wilck
+ *
+ * Wed Oct 10 2001 Matt Domsch <Matt_Domsch@dell.com>
+ * - Changed function comments to DocBook style per Andreas Dilger suggestion.
+ *
+ * Mon Oct 08 2001 Matt Domsch <Matt_Domsch@dell.com>
+ * - Change read_lba() to use the page cache per Al Viro's work.
+ * - print u64s properly on all architectures
+ * - fixed debug_printk(), now Dprintk()
+ *
+ * Mon Oct 01 2001 Matt Domsch <Matt_Domsch@dell.com>
+ * - Style cleanups
+ * - made most functions static
+ * - Endianness addition
+ * - remove test for second alternate header, as it's not per spec,
+ *   and is unnecessary.  There's now a method to read/write the last
+ *   sector of an odd-sized disk from user space.  No tools have ever
+ *   been released which used this code, so it's effectively dead.
+ * - Per Asit Mallick of Intel, added a test for a valid PMBR.
+ * - Added kernel command line option 'gpt' to override valid PMBR test.
+ *
+ * Wed Jun  6 2001 Martin Wilck <Martin.Wilck@Fujitsu-Siemens.com>
+ * - added devfs volume UUID support (/dev/volumes/uuids) for
+ *   mounting file systems by the partition GUID. 
+ *
+ * Tue Dec  5 2000 Matt Domsch <Matt_Domsch@dell.com>
+ * - Moved crc32() to linux/lib, added efi_crc32().
+ *
+ * Thu Nov 30 2000 Matt Domsch <Matt_Domsch@dell.com>
+ * - Replaced Intel's CRC32 function with an equivalent
+ *   non-license-restricted version.
+ *
+ * Wed Oct 25 2000 Matt Domsch <Matt_Domsch@dell.com>
+ * - Fixed the last_lba() call to return the proper last block
+ *
+ * Thu Oct 12 2000 Matt Domsch <Matt_Domsch@dell.com>
+ * - Thanks to Andries Brouwer for his debugging assistance.
+ * - Code works, detects all the partitions.
+ *
+ ************************************************************/
+#include <linux/crc32.h>
+#include <linux/ctype.h>
+#include <linux/math64.h>
+#include <linux/slab.h>
+#include "check.h"
+#include "efi.h"
+
+/* This allows a kernel command line option 'gpt' to override
+ * the test for invalid PMBR.  Not __initdata because reloading
+ * the partition tables happens after init too.
+ */
+static int force_gpt;
+static int __init
+force_gpt_fn(char *str)
+{
+	force_gpt = 1;
+	return 1;
+}
+__setup("gpt", force_gpt_fn);
+
+
+/**
+ * efi_crc32() - EFI version of crc32 function
+ * @buf: buffer to calculate crc32 of
+ * @len - length of buf
+ *
+ * Description: Returns EFI-style CRC32 value for @buf
+ * 
+ * This function uses the little endian Ethernet polynomial
+ * but seeds the function with ~0, and xor's with ~0 at the end.
+ * Note, the EFI Specification, v1.02, has a reference to
+ * Dr. Dobbs Journal, May 1994 (actually it's in May 1992).
+ */
+static inline u32
+efi_crc32(const void *buf, unsigned long len)
+{
+	return (crc32(~0L, buf, len) ^ ~0L);
+}
+
+/**
+ * last_lba(): return number of last logical block of device
+ * @bdev: block device
+ * 
+ * Description: Returns last LBA value on success, 0 on error.
+ * This is stored (by sd and ide-geometry) in
+ *  the part[0] entry for this disk, and is the number of
+ *  physical sectors available on the disk.
+ */
+static u64 last_lba(struct block_device *bdev)
+{
+	if (!bdev || !bdev->bd_inode)
+		return 0;
+	return div_u64(bdev->bd_inode->i_size,
+		       bdev_logical_block_size(bdev)) - 1ULL;
+}
+
+static inline int
+pmbr_part_valid(struct partition *part)
+{
+        if (part->sys_ind == EFI_PMBR_OSTYPE_EFI_GPT &&
+            le32_to_cpu(part->start_sect) == 1UL)
+                return 1;
+        return 0;
+}
+
+/**
+ * is_pmbr_valid(): test Protective MBR for validity
+ * @mbr: pointer to a legacy mbr structure
+ *
+ * Description: Returns 1 if PMBR is valid, 0 otherwise.
+ * Validity depends on two things:
+ *  1) MSDOS signature is in the last two bytes of the MBR
+ *  2) One partition of type 0xEE is found
+ */
+static int
+is_pmbr_valid(legacy_mbr *mbr)
+{
+	int i;
+	if (!mbr || le16_to_cpu(mbr->signature) != MSDOS_MBR_SIGNATURE)
+                return 0;
+	for (i = 0; i < 4; i++)
+		if (pmbr_part_valid(&mbr->partition_record[i]))
+                        return 1;
+	return 0;
+}
+
+/**
+ * read_lba(): Read bytes from disk, starting at given LBA
+ * @state
+ * @lba
+ * @buffer
+ * @size_t
+ *
+ * Description: Reads @count bytes from @state->bdev into @buffer.
+ * Returns number of bytes read on success, 0 on error.
+ */
+static size_t read_lba(struct parsed_partitions *state,
+		       u64 lba, u8 *buffer, size_t count)
+{
+	size_t totalreadcount = 0;
+	struct block_device *bdev = state->bdev;
+	sector_t n = lba * (bdev_logical_block_size(bdev) / 512);
+
+	if (!buffer || lba > last_lba(bdev))
+                return 0;
+
+	while (count) {
+		int copied = 512;
+		Sector sect;
+		unsigned char *data = read_part_sector(state, n++, &sect);
+		if (!data)
+			break;
+		if (copied > count)
+			copied = count;
+		memcpy(buffer, data, copied);
+		put_dev_sector(sect);
+		buffer += copied;
+		totalreadcount +=copied;
+		count -= copied;
+	}
+	return totalreadcount;
+}
+
+/**
+ * alloc_read_gpt_entries(): reads partition entries from disk
+ * @state
+ * @gpt - GPT header
+ * 
+ * Description: Returns ptes on success,  NULL on error.
+ * Allocates space for PTEs based on information found in @gpt.
+ * Notes: remember to free pte when you're done!
+ */
+static gpt_entry *alloc_read_gpt_entries(struct parsed_partitions *state,
+					 gpt_header *gpt)
+{
+	size_t count;
+	gpt_entry *pte;
+
+	if (!gpt)
+		return NULL;
+
+	count = le32_to_cpu(gpt->num_partition_entries) *
+                le32_to_cpu(gpt->sizeof_partition_entry);
+	if (!count)
+		return NULL;
+	pte = kzalloc(count, GFP_KERNEL);
+	if (!pte)
+		return NULL;
+
+	if (read_lba(state, le64_to_cpu(gpt->partition_entry_lba),
+                     (u8 *) pte,
+		     count) < count) {
+		kfree(pte);
+                pte=NULL;
+		return NULL;
+	}
+	return pte;
+}
+
+/**
+ * alloc_read_gpt_header(): Allocates GPT header, reads into it from disk
+ * @state
+ * @lba is the Logical Block Address of the partition table
+ * 
+ * Description: returns GPT header on success, NULL on error.   Allocates
+ * and fills a GPT header starting at @ from @state->bdev.
+ * Note: remember to free gpt when finished with it.
+ */
+static gpt_header *alloc_read_gpt_header(struct parsed_partitions *state,
+					 u64 lba)
+{
+	gpt_header *gpt;
+	unsigned ssz = bdev_logical_block_size(state->bdev);
+
+	gpt = kzalloc(ssz, GFP_KERNEL);
+	if (!gpt)
+		return NULL;
+
+	if (read_lba(state, lba, (u8 *) gpt, ssz) < ssz) {
+		kfree(gpt);
+                gpt=NULL;
+		return NULL;
+	}
+
+	return gpt;
+}
+
+/**
+ * is_gpt_valid() - tests one GPT header and PTEs for validity
+ * @state
+ * @lba is the logical block address of the GPT header to test
+ * @gpt is a GPT header ptr, filled on return.
+ * @ptes is a PTEs ptr, filled on return.
+ *
+ * Description: returns 1 if valid,  0 on error.
+ * If valid, returns pointers to newly allocated GPT header and PTEs.
+ */
+static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
+			gpt_header **gpt, gpt_entry **ptes)
+{
+	u32 crc, origcrc;
+	u64 lastlba;
+
+	if (!ptes)
+		return 0;
+	if (!(*gpt = alloc_read_gpt_header(state, lba)))
+		return 0;
+
+	/* Check the GUID Partition Table signature */
+	if (le64_to_cpu((*gpt)->signature) != GPT_HEADER_SIGNATURE) {
+		pr_debug("GUID Partition Table Header signature is wrong:"
+			 "%lld != %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->signature),
+			 (unsigned long long)GPT_HEADER_SIGNATURE);
+		goto fail;
+	}
+
+	/* Check the GUID Partition Table header size */
+	if (le32_to_cpu((*gpt)->header_size) >
+			bdev_logical_block_size(state->bdev)) {
+		pr_debug("GUID Partition Table Header size is wrong: %u > %u\n",
+			le32_to_cpu((*gpt)->header_size),
+			bdev_logical_block_size(state->bdev));
+		goto fail;
+	}
+
+	/* Check the GUID Partition Table CRC */
+	origcrc = le32_to_cpu((*gpt)->header_crc32);
+	(*gpt)->header_crc32 = 0;
+	crc = efi_crc32((const unsigned char *) (*gpt), le32_to_cpu((*gpt)->header_size));
+
+	if (crc != origcrc) {
+		pr_debug("GUID Partition Table Header CRC is wrong: %x != %x\n",
+			 crc, origcrc);
+		goto fail;
+	}
+	(*gpt)->header_crc32 = cpu_to_le32(origcrc);
+
+	/* Check that the my_lba entry points to the LBA that contains
+	 * the GUID Partition Table */
+	if (le64_to_cpu((*gpt)->my_lba) != lba) {
+		pr_debug("GPT my_lba incorrect: %lld != %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->my_lba),
+			 (unsigned long long)lba);
+		goto fail;
+	}
+
+	/* Check the first_usable_lba and last_usable_lba are
+	 * within the disk.
+	 */
+	lastlba = last_lba(state->bdev);
+	if (le64_to_cpu((*gpt)->first_usable_lba) > lastlba) {
+		pr_debug("GPT: first_usable_lba incorrect: %lld > %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba),
+			 (unsigned long long)lastlba);
+		goto fail;
+	}
+	if (le64_to_cpu((*gpt)->last_usable_lba) > lastlba) {
+		pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->last_usable_lba),
+			 (unsigned long long)lastlba);
+		goto fail;
+	}
+
+	/* Check that sizeof_partition_entry has the correct value */
+	if (le32_to_cpu((*gpt)->sizeof_partition_entry) != sizeof(gpt_entry)) {
+		pr_debug("GUID Partitition Entry Size check failed.\n");
+		goto fail;
+	}
+
+	if (!(*ptes = alloc_read_gpt_entries(state, *gpt)))
+		goto fail;
+
+	/* Check the GUID Partition Entry Array CRC */
+	crc = efi_crc32((const unsigned char *) (*ptes),
+			le32_to_cpu((*gpt)->num_partition_entries) *
+			le32_to_cpu((*gpt)->sizeof_partition_entry));
+
+	if (crc != le32_to_cpu((*gpt)->partition_entry_array_crc32)) {
+		pr_debug("GUID Partitition Entry Array CRC check failed.\n");
+		goto fail_ptes;
+	}
+
+	/* We're done, all's well */
+	return 1;
+
+ fail_ptes:
+	kfree(*ptes);
+	*ptes = NULL;
+ fail:
+	kfree(*gpt);
+	*gpt = NULL;
+	return 0;
+}
+
+/**
+ * is_pte_valid() - tests one PTE for validity
+ * @pte is the pte to check
+ * @lastlba is last lba of the disk
+ *
+ * Description: returns 1 if valid,  0 on error.
+ */
+static inline int
+is_pte_valid(const gpt_entry *pte, const u64 lastlba)
+{
+	if ((!efi_guidcmp(pte->partition_type_guid, NULL_GUID)) ||
+	    le64_to_cpu(pte->starting_lba) > lastlba         ||
+	    le64_to_cpu(pte->ending_lba)   > lastlba)
+		return 0;
+	return 1;
+}
+
+/**
+ * compare_gpts() - Search disk for valid GPT headers and PTEs
+ * @pgpt is the primary GPT header
+ * @agpt is the alternate GPT header
+ * @lastlba is the last LBA number
+ * Description: Returns nothing.  Sanity checks pgpt and agpt fields
+ * and prints warnings on discrepancies.
+ * 
+ */
+static void
+compare_gpts(gpt_header *pgpt, gpt_header *agpt, u64 lastlba)
+{
+	int error_found = 0;
+	if (!pgpt || !agpt)
+		return;
+	if (le64_to_cpu(pgpt->my_lba) != le64_to_cpu(agpt->alternate_lba)) {
+		printk(KERN_WARNING
+		       "GPT:Primary header LBA != Alt. header alternate_lba\n");
+		printk(KERN_WARNING "GPT:%lld != %lld\n",
+		       (unsigned long long)le64_to_cpu(pgpt->my_lba),
+                       (unsigned long long)le64_to_cpu(agpt->alternate_lba));
+		error_found++;
+	}
+	if (le64_to_cpu(pgpt->alternate_lba) != le64_to_cpu(agpt->my_lba)) {
+		printk(KERN_WARNING
+		       "GPT:Primary header alternate_lba != Alt. header my_lba\n");
+		printk(KERN_WARNING "GPT:%lld != %lld\n",
+		       (unsigned long long)le64_to_cpu(pgpt->alternate_lba),
+                       (unsigned long long)le64_to_cpu(agpt->my_lba));
+		error_found++;
+	}
+	if (le64_to_cpu(pgpt->first_usable_lba) !=
+            le64_to_cpu(agpt->first_usable_lba)) {
+		printk(KERN_WARNING "GPT:first_usable_lbas don't match.\n");
+		printk(KERN_WARNING "GPT:%lld != %lld\n",
+		       (unsigned long long)le64_to_cpu(pgpt->first_usable_lba),
+                       (unsigned long long)le64_to_cpu(agpt->first_usable_lba));
+		error_found++;
+	}
+	if (le64_to_cpu(pgpt->last_usable_lba) !=
+            le64_to_cpu(agpt->last_usable_lba)) {
+		printk(KERN_WARNING "GPT:last_usable_lbas don't match.\n");
+		printk(KERN_WARNING "GPT:%lld != %lld\n",
+		       (unsigned long long)le64_to_cpu(pgpt->last_usable_lba),
+                       (unsigned long long)le64_to_cpu(agpt->last_usable_lba));
+		error_found++;
+	}
+	if (efi_guidcmp(pgpt->disk_guid, agpt->disk_guid)) {
+		printk(KERN_WARNING "GPT:disk_guids don't match.\n");
+		error_found++;
+	}
+	if (le32_to_cpu(pgpt->num_partition_entries) !=
+            le32_to_cpu(agpt->num_partition_entries)) {
+		printk(KERN_WARNING "GPT:num_partition_entries don't match: "
+		       "0x%x != 0x%x\n",
+		       le32_to_cpu(pgpt->num_partition_entries),
+		       le32_to_cpu(agpt->num_partition_entries));
+		error_found++;
+	}
+	if (le32_to_cpu(pgpt->sizeof_partition_entry) !=
+            le32_to_cpu(agpt->sizeof_partition_entry)) {
+		printk(KERN_WARNING
+		       "GPT:sizeof_partition_entry values don't match: "
+		       "0x%x != 0x%x\n",
+                       le32_to_cpu(pgpt->sizeof_partition_entry),
+		       le32_to_cpu(agpt->sizeof_partition_entry));
+		error_found++;
+	}
+	if (le32_to_cpu(pgpt->partition_entry_array_crc32) !=
+            le32_to_cpu(agpt->partition_entry_array_crc32)) {
+		printk(KERN_WARNING
+		       "GPT:partition_entry_array_crc32 values don't match: "
+		       "0x%x != 0x%x\n",
+                       le32_to_cpu(pgpt->partition_entry_array_crc32),
+		       le32_to_cpu(agpt->partition_entry_array_crc32));
+		error_found++;
+	}
+	if (le64_to_cpu(pgpt->alternate_lba) != lastlba) {
+		printk(KERN_WARNING
+		       "GPT:Primary header thinks Alt. header is not at the end of the disk.\n");
+		printk(KERN_WARNING "GPT:%lld != %lld\n",
+			(unsigned long long)le64_to_cpu(pgpt->alternate_lba),
+			(unsigned long long)lastlba);
+		error_found++;
+	}
+
+	if (le64_to_cpu(agpt->my_lba) != lastlba) {
+		printk(KERN_WARNING
+		       "GPT:Alternate GPT header not at the end of the disk.\n");
+		printk(KERN_WARNING "GPT:%lld != %lld\n",
+			(unsigned long long)le64_to_cpu(agpt->my_lba),
+			(unsigned long long)lastlba);
+		error_found++;
+	}
+
+	if (error_found)
+		printk(KERN_WARNING
+		       "GPT: Use GNU Parted to correct GPT errors.\n");
+	return;
+}
+
+/**
+ * find_valid_gpt() - Search disk for valid GPT headers and PTEs
+ * @state
+ * @gpt is a GPT header ptr, filled on return.
+ * @ptes is a PTEs ptr, filled on return.
+ * Description: Returns 1 if valid, 0 on error.
+ * If valid, returns pointers to newly allocated GPT header and PTEs.
+ * Validity depends on PMBR being valid (or being overridden by the
+ * 'gpt' kernel command line option) and finding either the Primary
+ * GPT header and PTEs valid, or the Alternate GPT header and PTEs
+ * valid.  If the Primary GPT header is not valid, the Alternate GPT header
+ * is not checked unless the 'gpt' kernel command line option is passed.
+ * This protects against devices which misreport their size, and forces
+ * the user to decide to use the Alternate GPT.
+ */
+static int find_valid_gpt(struct parsed_partitions *state, gpt_header **gpt,
+			  gpt_entry **ptes)
+{
+	int good_pgpt = 0, good_agpt = 0, good_pmbr = 0;
+	gpt_header *pgpt = NULL, *agpt = NULL;
+	gpt_entry *pptes = NULL, *aptes = NULL;
+	legacy_mbr *legacymbr;
+	u64 lastlba;
+
+	if (!ptes)
+		return 0;
+
+	lastlba = last_lba(state->bdev);
+        if (!force_gpt) {
+                /* This will be added to the EFI Spec. per Intel after v1.02. */
+                legacymbr = kzalloc(sizeof (*legacymbr), GFP_KERNEL);
+                if (legacymbr) {
+                        read_lba(state, 0, (u8 *) legacymbr,
+				 sizeof (*legacymbr));
+                        good_pmbr = is_pmbr_valid(legacymbr);
+                        kfree(legacymbr);
+                }
+                if (!good_pmbr)
+                        goto fail;
+        }
+
+	good_pgpt = is_gpt_valid(state, GPT_PRIMARY_PARTITION_TABLE_LBA,
+				 &pgpt, &pptes);
+        if (good_pgpt)
+		good_agpt = is_gpt_valid(state,
+					 le64_to_cpu(pgpt->alternate_lba),
+					 &agpt, &aptes);
+        if (!good_agpt && force_gpt)
+                good_agpt = is_gpt_valid(state, lastlba, &agpt, &aptes);
+
+        /* The obviously unsuccessful case */
+        if (!good_pgpt && !good_agpt)
+                goto fail;
+
+        compare_gpts(pgpt, agpt, lastlba);
+
+        /* The good cases */
+        if (good_pgpt) {
+                *gpt  = pgpt;
+                *ptes = pptes;
+                kfree(agpt);
+                kfree(aptes);
+                if (!good_agpt) {
+                        printk(KERN_WARNING 
+			       "Alternate GPT is invalid, "
+                               "using primary GPT.\n");
+                }
+                return 1;
+        }
+        else if (good_agpt) {
+                *gpt  = agpt;
+                *ptes = aptes;
+                kfree(pgpt);
+                kfree(pptes);
+                printk(KERN_WARNING 
+                       "Primary GPT is invalid, using alternate GPT.\n");
+                return 1;
+        }
+
+ fail:
+        kfree(pgpt);
+        kfree(agpt);
+        kfree(pptes);
+        kfree(aptes);
+        *gpt = NULL;
+        *ptes = NULL;
+        return 0;
+}
+
+/**
+ * efi_partition(struct parsed_partitions *state)
+ * @state
+ *
+ * Description: called from check.c, if the disk contains GPT
+ * partitions, sets up partition entries in the kernel.
+ *
+ * If the first block on the disk is a legacy MBR,
+ * it will get handled by msdos_partition().
+ * If it's a Protective MBR, we'll handle it here.
+ *
+ * We do not create a Linux partition for GPT, but
+ * only for the actual data partitions.
+ * Returns:
+ * -1 if unable to read the partition table
+ *  0 if this isn't our partition table
+ *  1 if successful
+ *
+ */
+int efi_partition(struct parsed_partitions *state)
+{
+	gpt_header *gpt = NULL;
+	gpt_entry *ptes = NULL;
+	u32 i;
+	unsigned ssz = bdev_logical_block_size(state->bdev) / 512;
+	u8 unparsed_guid[37];
+
+	if (!find_valid_gpt(state, &gpt, &ptes) || !gpt || !ptes) {
+		kfree(gpt);
+		kfree(ptes);
+		return 0;
+	}
+
+	pr_debug("GUID Partition Table is valid!  Yea!\n");
+
+	for (i = 0; i < le32_to_cpu(gpt->num_partition_entries) && i < state->limit-1; i++) {
+		struct partition_meta_info *info;
+		unsigned label_count = 0;
+		unsigned label_max;
+		u64 start = le64_to_cpu(ptes[i].starting_lba);
+		u64 size = le64_to_cpu(ptes[i].ending_lba) -
+			   le64_to_cpu(ptes[i].starting_lba) + 1ULL;
+
+		if (!is_pte_valid(&ptes[i], last_lba(state->bdev)))
+			continue;
+
+		put_partition(state, i+1, start * ssz, size * ssz);
+
+		/* If this is a RAID volume, tell md */
+		if (!efi_guidcmp(ptes[i].partition_type_guid,
+				 PARTITION_LINUX_RAID_GUID))
+			state->parts[i + 1].flags = ADDPART_FLAG_RAID;
+
+		info = &state->parts[i + 1].info;
+		/* Instead of doing a manual swap to big endian, reuse the
+		 * common ASCII hex format as the interim.
+		 */
+		efi_guid_unparse(&ptes[i].unique_partition_guid, unparsed_guid);
+		part_pack_uuid(unparsed_guid, info->uuid);
+
+		/* Naively convert UTF16-LE to 7 bits. */
+		label_max = min(sizeof(info->volname) - 1,
+				sizeof(ptes[i].partition_name));
+		info->volname[label_max] = 0;
+		while (label_count < label_max) {
+			u8 c = ptes[i].partition_name[label_count] & 0xff;
+			if (c && !isprint(c))
+				c = '!';
+			info->volname[label_count] = c;
+			label_count++;
+		}
+		state->parts[i + 1].has_info = true;
+	}
+	kfree(ptes);
+	kfree(gpt);
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	return 1;
+}
diff --git a/block/partitions/efi.h b/block/partitions/efi.h
new file mode 100644
index 00000000000..b69ab729558
--- /dev/null
+++ b/block/partitions/efi.h
@@ -0,0 +1,134 @@
+/************************************************************
+ * EFI GUID Partition Table
+ * Per Intel EFI Specification v1.02
+ * http://developer.intel.com/technology/efi/efi.htm
+ *
+ * By Matt Domsch <Matt_Domsch@dell.com>  Fri Sep 22 22:15:56 CDT 2000  
+ *   Copyright 2000,2001 Dell Inc.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ * 
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ * 
+ ************************************************************/
+
+#ifndef FS_PART_EFI_H_INCLUDED
+#define FS_PART_EFI_H_INCLUDED
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/string.h>
+#include <linux/efi.h>
+
+#define MSDOS_MBR_SIGNATURE 0xaa55
+#define EFI_PMBR_OSTYPE_EFI 0xEF
+#define EFI_PMBR_OSTYPE_EFI_GPT 0xEE
+
+#define GPT_HEADER_SIGNATURE 0x5452415020494645ULL
+#define GPT_HEADER_REVISION_V1 0x00010000
+#define GPT_PRIMARY_PARTITION_TABLE_LBA 1
+
+#define PARTITION_SYSTEM_GUID \
+    EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
+              0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B) 
+#define LEGACY_MBR_PARTITION_GUID \
+    EFI_GUID( 0x024DEE41, 0x33E7, 0x11d3, \
+              0x9D, 0x69, 0x00, 0x08, 0xC7, 0x81, 0xF3, 0x9F)
+#define PARTITION_MSFT_RESERVED_GUID \
+    EFI_GUID( 0xE3C9E316, 0x0B5C, 0x4DB8, \
+              0x81, 0x7D, 0xF9, 0x2D, 0xF0, 0x02, 0x15, 0xAE)
+#define PARTITION_BASIC_DATA_GUID \
+    EFI_GUID( 0xEBD0A0A2, 0xB9E5, 0x4433, \
+              0x87, 0xC0, 0x68, 0xB6, 0xB7, 0x26, 0x99, 0xC7)
+#define PARTITION_LINUX_RAID_GUID \
+    EFI_GUID( 0xa19d880f, 0x05fc, 0x4d3b, \
+              0xa0, 0x06, 0x74, 0x3f, 0x0f, 0x84, 0x91, 0x1e)
+#define PARTITION_LINUX_SWAP_GUID \
+    EFI_GUID( 0x0657fd6d, 0xa4ab, 0x43c4, \
+              0x84, 0xe5, 0x09, 0x33, 0xc8, 0x4b, 0x4f, 0x4f)
+#define PARTITION_LINUX_LVM_GUID \
+    EFI_GUID( 0xe6d6d379, 0xf507, 0x44c2, \
+              0xa2, 0x3c, 0x23, 0x8f, 0x2a, 0x3d, 0xf9, 0x28)
+
+typedef struct _gpt_header {
+	__le64 signature;
+	__le32 revision;
+	__le32 header_size;
+	__le32 header_crc32;
+	__le32 reserved1;
+	__le64 my_lba;
+	__le64 alternate_lba;
+	__le64 first_usable_lba;
+	__le64 last_usable_lba;
+	efi_guid_t disk_guid;
+	__le64 partition_entry_lba;
+	__le32 num_partition_entries;
+	__le32 sizeof_partition_entry;
+	__le32 partition_entry_array_crc32;
+
+	/* The rest of the logical block is reserved by UEFI and must be zero.
+	 * EFI standard handles this by:
+	 *
+	 * uint8_t		reserved2[ BlockSize - 92 ];
+	 */
+} __attribute__ ((packed)) gpt_header;
+
+typedef struct _gpt_entry_attributes {
+	u64 required_to_function:1;
+	u64 reserved:47;
+        u64 type_guid_specific:16;
+} __attribute__ ((packed)) gpt_entry_attributes;
+
+typedef struct _gpt_entry {
+	efi_guid_t partition_type_guid;
+	efi_guid_t unique_partition_guid;
+	__le64 starting_lba;
+	__le64 ending_lba;
+	gpt_entry_attributes attributes;
+	efi_char16_t partition_name[72 / sizeof (efi_char16_t)];
+} __attribute__ ((packed)) gpt_entry;
+
+typedef struct _legacy_mbr {
+	u8 boot_code[440];
+	__le32 unique_mbr_signature;
+	__le16 unknown;
+	struct partition partition_record[4];
+	__le16 signature;
+} __attribute__ ((packed)) legacy_mbr;
+
+/* Functions */
+extern int efi_partition(struct parsed_partitions *state);
+
+#endif
+
+/*
+ * Overrides for Emacs so that we follow Linus's tabbing style.
+ * Emacs will notice this stuff at the end of the file and automatically
+ * adjust the settings for this buffer only.  This must remain at the end
+ * of the file.
+ * --------------------------------------------------------------------------
+ * Local variables:
+ * c-indent-level: 4 
+ * c-brace-imaginary-offset: 0
+ * c-brace-offset: -4
+ * c-argdecl-indent: 4
+ * c-label-offset: -4
+ * c-continued-statement-offset: 4
+ * c-continued-brace-offset: 0
+ * indent-tabs-mode: nil
+ * tab-width: 8
+ * End:
+ */
diff --git a/block/partitions/ibm.c b/block/partitions/ibm.c
new file mode 100644
index 00000000000..d513a07f44b
--- /dev/null
+++ b/block/partitions/ibm.c
@@ -0,0 +1,275 @@
+/*
+ * File...........: linux/fs/partitions/ibm.c
+ * Author(s)......: Holger Smolinski <Holger.Smolinski@de.ibm.com>
+ *                  Volker Sameske <sameske@de.ibm.com>
+ * Bugreports.to..: <Linux390@de.ibm.com>
+ * (C) IBM Corporation, IBM Deutschland Entwicklung GmbH, 1999,2000
+ */
+
+#include <linux/buffer_head.h>
+#include <linux/hdreg.h>
+#include <linux/slab.h>
+#include <asm/dasd.h>
+#include <asm/ebcdic.h>
+#include <asm/uaccess.h>
+#include <asm/vtoc.h>
+
+#include "check.h"
+#include "ibm.h"
+
+/*
+ * compute the block number from a
+ * cyl-cyl-head-head structure
+ */
+static sector_t
+cchh2blk (struct vtoc_cchh *ptr, struct hd_geometry *geo) {
+
+	sector_t cyl;
+	__u16 head;
+
+	/*decode cylinder and heads for large volumes */
+	cyl = ptr->hh & 0xFFF0;
+	cyl <<= 12;
+	cyl |= ptr->cc;
+	head = ptr->hh & 0x000F;
+	return cyl * geo->heads * geo->sectors +
+	       head * geo->sectors;
+}
+
+/*
+ * compute the block number from a
+ * cyl-cyl-head-head-block structure
+ */
+static sector_t
+cchhb2blk (struct vtoc_cchhb *ptr, struct hd_geometry *geo) {
+
+	sector_t cyl;
+	__u16 head;
+
+	/*decode cylinder and heads for large volumes */
+	cyl = ptr->hh & 0xFFF0;
+	cyl <<= 12;
+	cyl |= ptr->cc;
+	head = ptr->hh & 0x000F;
+	return	cyl * geo->heads * geo->sectors +
+		head * geo->sectors +
+		ptr->b;
+}
+
+/*
+ */
+int ibm_partition(struct parsed_partitions *state)
+{
+	struct block_device *bdev = state->bdev;
+	int blocksize, res;
+	loff_t i_size, offset, size, fmt_size;
+	dasd_information2_t *info;
+	struct hd_geometry *geo;
+	char type[5] = {0,};
+	char name[7] = {0,};
+	union label_t {
+		struct vtoc_volume_label_cdl vol;
+		struct vtoc_volume_label_ldl lnx;
+		struct vtoc_cms_label cms;
+	} *label;
+	unsigned char *data;
+	Sector sect;
+	sector_t labelsect;
+	char tmp[64];
+
+	res = 0;
+	blocksize = bdev_logical_block_size(bdev);
+	if (blocksize <= 0)
+		goto out_exit;
+	i_size = i_size_read(bdev->bd_inode);
+	if (i_size == 0)
+		goto out_exit;
+
+	info = kmalloc(sizeof(dasd_information2_t), GFP_KERNEL);
+	if (info == NULL)
+		goto out_exit;
+	geo = kmalloc(sizeof(struct hd_geometry), GFP_KERNEL);
+	if (geo == NULL)
+		goto out_nogeo;
+	label = kmalloc(sizeof(union label_t), GFP_KERNEL);
+	if (label == NULL)
+		goto out_nolab;
+
+	if (ioctl_by_bdev(bdev, BIODASDINFO2, (unsigned long)info) != 0 ||
+	    ioctl_by_bdev(bdev, HDIO_GETGEO, (unsigned long)geo) != 0)
+		goto out_freeall;
+
+	/*
+	 * Special case for FBA disks: label sector does not depend on
+	 * blocksize.
+	 */
+	if ((info->cu_type == 0x6310 && info->dev_type == 0x9336) ||
+	    (info->cu_type == 0x3880 && info->dev_type == 0x3370))
+		labelsect = info->label_block;
+	else
+		labelsect = info->label_block * (blocksize >> 9);
+
+	/*
+	 * Get volume label, extract name and type.
+	 */
+	data = read_part_sector(state, labelsect, &sect);
+	if (data == NULL)
+		goto out_readerr;
+
+	memcpy(label, data, sizeof(union label_t));
+	put_dev_sector(sect);
+
+	if ((!info->FBA_layout) && (!strcmp(info->type, "ECKD"))) {
+		strncpy(type, label->vol.vollbl, 4);
+		strncpy(name, label->vol.volid, 6);
+	} else {
+		strncpy(type, label->lnx.vollbl, 4);
+		strncpy(name, label->lnx.volid, 6);
+	}
+	EBCASC(type, 4);
+	EBCASC(name, 6);
+
+	res = 1;
+
+	/*
+	 * Three different formats: LDL, CDL and unformated disk
+	 *
+	 * identified by info->format
+	 *
+	 * unformated disks we do not have to care about
+	 */
+	if (info->format == DASD_FORMAT_LDL) {
+		if (strncmp(type, "CMS1", 4) == 0) {
+			/*
+			 * VM style CMS1 labeled disk
+			 */
+			blocksize = label->cms.block_size;
+			if (label->cms.disk_offset != 0) {
+				snprintf(tmp, sizeof(tmp), "CMS1/%8s(MDSK):", name);
+				strlcat(state->pp_buf, tmp, PAGE_SIZE);
+				/* disk is reserved minidisk */
+				offset = label->cms.disk_offset;
+				size = (label->cms.block_count - 1)
+					* (blocksize >> 9);
+			} else {
+				snprintf(tmp, sizeof(tmp), "CMS1/%8s:", name);
+				strlcat(state->pp_buf, tmp, PAGE_SIZE);
+				offset = (info->label_block + 1);
+				size = label->cms.block_count
+					* (blocksize >> 9);
+			}
+			put_partition(state, 1, offset*(blocksize >> 9),
+				      size-offset*(blocksize >> 9));
+		} else {
+			if (strncmp(type, "LNX1", 4) == 0) {
+				snprintf(tmp, sizeof(tmp), "LNX1/%8s:", name);
+				strlcat(state->pp_buf, tmp, PAGE_SIZE);
+				if (label->lnx.ldl_version == 0xf2) {
+					fmt_size = label->lnx.formatted_blocks
+						* (blocksize >> 9);
+				} else if (!strcmp(info->type, "ECKD")) {
+					/* formated w/o large volume support */
+					fmt_size = geo->cylinders * geo->heads
+					      * geo->sectors * (blocksize >> 9);
+				} else {
+					/* old label and no usable disk geometry
+					 * (e.g. DIAG) */
+					fmt_size = i_size >> 9;
+				}
+				size = i_size >> 9;
+				if (fmt_size < size)
+					size = fmt_size;
+				offset = (info->label_block + 1);
+			} else {
+				/* unlabeled disk */
+				strlcat(state->pp_buf, "(nonl)", PAGE_SIZE);
+				size = i_size >> 9;
+				offset = (info->label_block + 1);
+			}
+			put_partition(state, 1, offset*(blocksize >> 9),
+				      size-offset*(blocksize >> 9));
+		}
+	} else if (info->format == DASD_FORMAT_CDL) {
+		/*
+		 * New style CDL formatted disk
+		 */
+		sector_t blk;
+		int counter;
+
+		/*
+		 * check if VOL1 label is available
+		 * if not, something is wrong, skipping partition detection
+		 */
+		if (strncmp(type, "VOL1",  4) == 0) {
+			snprintf(tmp, sizeof(tmp), "VOL1/%8s:", name);
+			strlcat(state->pp_buf, tmp, PAGE_SIZE);
+			/*
+			 * get block number and read then go through format1
+			 * labels
+			 */
+			blk = cchhb2blk(&label->vol.vtoc, geo) + 1;
+			counter = 0;
+			data = read_part_sector(state, blk * (blocksize/512),
+						&sect);
+			while (data != NULL) {
+				struct vtoc_format1_label f1;
+
+				memcpy(&f1, data,
+				       sizeof(struct vtoc_format1_label));
+				put_dev_sector(sect);
+
+				/* skip FMT4 / FMT5 / FMT7 labels */
+				if (f1.DS1FMTID == _ascebc['4']
+				    || f1.DS1FMTID == _ascebc['5']
+				    || f1.DS1FMTID == _ascebc['7']
+				    || f1.DS1FMTID == _ascebc['9']) {
+					blk++;
+					data = read_part_sector(state,
+						blk * (blocksize/512), &sect);
+					continue;
+				}
+
+				/* only FMT1 and 8 labels valid at this point */
+				if (f1.DS1FMTID != _ascebc['1'] &&
+				    f1.DS1FMTID != _ascebc['8'])
+					break;
+
+				/* OK, we got valid partition data */
+				offset = cchh2blk(&f1.DS1EXT1.llimit, geo);
+				size  = cchh2blk(&f1.DS1EXT1.ulimit, geo) -
+					offset + geo->sectors;
+				if (counter >= state->limit)
+					break;
+				put_partition(state, counter + 1,
+					      offset * (blocksize >> 9),
+					      size * (blocksize >> 9));
+				counter++;
+				blk++;
+				data = read_part_sector(state,
+						blk * (blocksize/512), &sect);
+			}
+
+			if (!data)
+				/* Are we not supposed to report this ? */
+				goto out_readerr;
+		} else
+			printk(KERN_WARNING "Warning, expected Label VOL1 not "
+			       "found, treating as CDL formated Disk");
+
+	}
+
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	goto out_freeall;
+
+
+out_readerr:
+	res = -1;
+out_freeall:
+	kfree(label);
+out_nolab:
+	kfree(geo);
+out_nogeo:
+	kfree(info);
+out_exit:
+	return res;
+}
diff --git a/block/partitions/ibm.h b/block/partitions/ibm.h
new file mode 100644
index 00000000000..08fb0804a81
--- /dev/null
+++ b/block/partitions/ibm.h
@@ -0,0 +1 @@
+int ibm_partition(struct parsed_partitions *);
diff --git a/block/partitions/karma.c b/block/partitions/karma.c
new file mode 100644
index 00000000000..0ea19312706
--- /dev/null
+++ b/block/partitions/karma.c
@@ -0,0 +1,57 @@
+/*
+ *  fs/partitions/karma.c
+ *  Rio Karma partition info.
+ *
+ *  Copyright (C) 2006 Bob Copeland (me@bobcopeland.com)
+ *  based on osf.c
+ */
+
+#include "check.h"
+#include "karma.h"
+
+int karma_partition(struct parsed_partitions *state)
+{
+	int i;
+	int slot = 1;
+	Sector sect;
+	unsigned char *data;
+	struct disklabel {
+		u8 d_reserved[270];
+		struct d_partition {
+			__le32 p_res;
+			u8 p_fstype;
+			u8 p_res2[3];
+			__le32 p_offset;
+			__le32 p_size;
+		} d_partitions[2];
+		u8 d_blank[208];
+		__le16 d_magic;
+	} __attribute__((packed)) *label;
+	struct d_partition *p;
+
+	data = read_part_sector(state, 0, &sect);
+	if (!data)
+		return -1;
+
+	label = (struct disklabel *)data;
+	if (le16_to_cpu(label->d_magic) != KARMA_LABEL_MAGIC) {
+		put_dev_sector(sect);
+		return 0;
+	}
+
+	p = label->d_partitions;
+	for (i = 0 ; i < 2; i++, p++) {
+		if (slot == state->limit)
+			break;
+
+		if (p->p_fstype == 0x4d && le32_to_cpu(p->p_size)) {
+			put_partition(state, slot, le32_to_cpu(p->p_offset),
+				le32_to_cpu(p->p_size));
+		}
+		slot++;
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	put_dev_sector(sect);
+	return 1;
+}
+
diff --git a/block/partitions/karma.h b/block/partitions/karma.h
new file mode 100644
index 00000000000..c764b2e9df2
--- /dev/null
+++ b/block/partitions/karma.h
@@ -0,0 +1,8 @@
+/*
+ *  fs/partitions/karma.h
+ */
+
+#define KARMA_LABEL_MAGIC		0xAB56
+
+int karma_partition(struct parsed_partitions *state);
+
diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c
new file mode 100644
index 00000000000..bd8ae788f68
--- /dev/null
+++ b/block/partitions/ldm.c
@@ -0,0 +1,1570 @@
+/**
+ * ldm - Support for Windows Logical Disk Manager (Dynamic Disks)
+ *
+ * Copyright (C) 2001,2002 Richard Russon <ldm@flatcap.org>
+ * Copyright (c) 2001-2007 Anton Altaparmakov
+ * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
+ *
+ * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
+ *
+ * This program is free software; you can redistribute it and/or modify it under
+ * the terms of the GNU General Public License as published by the Free Software
+ * Foundation; either version 2 of the License, or (at your option) any later
+ * version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program (in the main directory of the source in the file COPYING); if
+ * not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330,
+ * Boston, MA  02111-1307  USA
+ */
+
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/stringify.h>
+#include <linux/kernel.h>
+#include "ldm.h"
+#include "check.h"
+#include "msdos.h"
+
+/**
+ * ldm_debug/info/error/crit - Output an error message
+ * @f:    A printf format string containing the message
+ * @...:  Variables to substitute into @f
+ *
+ * ldm_debug() writes a DEBUG level message to the syslog but only if the
+ * driver was compiled with debug enabled. Otherwise, the call turns into a NOP.
+ */
+#ifndef CONFIG_LDM_DEBUG
+#define ldm_debug(...)	do {} while (0)
+#else
+#define ldm_debug(f, a...) _ldm_printk (KERN_DEBUG, __func__, f, ##a)
+#endif
+
+#define ldm_crit(f, a...)  _ldm_printk (KERN_CRIT,  __func__, f, ##a)
+#define ldm_error(f, a...) _ldm_printk (KERN_ERR,   __func__, f, ##a)
+#define ldm_info(f, a...)  _ldm_printk (KERN_INFO,  __func__, f, ##a)
+
+static __printf(3, 4)
+void _ldm_printk(const char *level, const char *function, const char *fmt, ...)
+{
+	struct va_format vaf;
+	va_list args;
+
+	va_start (args, fmt);
+
+	vaf.fmt = fmt;
+	vaf.va = &args;
+
+	printk("%s%s(): %pV\n", level, function, &vaf);
+
+	va_end(args);
+}
+
+/**
+ * ldm_parse_hexbyte - Convert a ASCII hex number to a byte
+ * @src:  Pointer to at least 2 characters to convert.
+ *
+ * Convert a two character ASCII hex string to a number.
+ *
+ * Return:  0-255  Success, the byte was parsed correctly
+ *          -1     Error, an invalid character was supplied
+ */
+static int ldm_parse_hexbyte (const u8 *src)
+{
+	unsigned int x;		/* For correct wrapping */
+	int h;
+
+	/* high part */
+	x = h = hex_to_bin(src[0]);
+	if (h < 0)
+		return -1;
+
+	/* low part */
+	h = hex_to_bin(src[1]);
+	if (h < 0)
+		return -1;
+
+	return (x << 4) + h;
+}
+
+/**
+ * ldm_parse_guid - Convert GUID from ASCII to binary
+ * @src:   36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba
+ * @dest:  Memory block to hold binary GUID (16 bytes)
+ *
+ * N.B. The GUID need not be NULL terminated.
+ *
+ * Return:  'true'   @dest contains binary GUID
+ *          'false'  @dest contents are undefined
+ */
+static bool ldm_parse_guid (const u8 *src, u8 *dest)
+{
+	static const int size[] = { 4, 2, 2, 2, 6 };
+	int i, j, v;
+
+	if (src[8]  != '-' || src[13] != '-' ||
+	    src[18] != '-' || src[23] != '-')
+		return false;
+
+	for (j = 0; j < 5; j++, src++)
+		for (i = 0; i < size[j]; i++, src+=2, *dest++ = v)
+			if ((v = ldm_parse_hexbyte (src)) < 0)
+				return false;
+
+	return true;
+}
+
+/**
+ * ldm_parse_privhead - Read the LDM Database PRIVHEAD structure
+ * @data:  Raw database PRIVHEAD structure loaded from the device
+ * @ph:    In-memory privhead structure in which to return parsed information
+ *
+ * This parses the LDM database PRIVHEAD structure supplied in @data and
+ * sets up the in-memory privhead structure @ph with the obtained information.
+ *
+ * Return:  'true'   @ph contains the PRIVHEAD data
+ *          'false'  @ph contents are undefined
+ */
+static bool ldm_parse_privhead(const u8 *data, struct privhead *ph)
+{
+	bool is_vista = false;
+
+	BUG_ON(!data || !ph);
+	if (MAGIC_PRIVHEAD != get_unaligned_be64(data)) {
+		ldm_error("Cannot find PRIVHEAD structure. LDM database is"
+			" corrupt. Aborting.");
+		return false;
+	}
+	ph->ver_major = get_unaligned_be16(data + 0x000C);
+	ph->ver_minor = get_unaligned_be16(data + 0x000E);
+	ph->logical_disk_start = get_unaligned_be64(data + 0x011B);
+	ph->logical_disk_size = get_unaligned_be64(data + 0x0123);
+	ph->config_start = get_unaligned_be64(data + 0x012B);
+	ph->config_size = get_unaligned_be64(data + 0x0133);
+	/* Version 2.11 is Win2k/XP and version 2.12 is Vista. */
+	if (ph->ver_major == 2 && ph->ver_minor == 12)
+		is_vista = true;
+	if (!is_vista && (ph->ver_major != 2 || ph->ver_minor != 11)) {
+		ldm_error("Expected PRIVHEAD version 2.11 or 2.12, got %d.%d."
+			" Aborting.", ph->ver_major, ph->ver_minor);
+		return false;
+	}
+	ldm_debug("PRIVHEAD version %d.%d (Windows %s).", ph->ver_major,
+			ph->ver_minor, is_vista ? "Vista" : "2000/XP");
+	if (ph->config_size != LDM_DB_SIZE) {	/* 1 MiB in sectors. */
+		/* Warn the user and continue, carefully. */
+		ldm_info("Database is normally %u bytes, it claims to "
+			"be %llu bytes.", LDM_DB_SIZE,
+			(unsigned long long)ph->config_size);
+	}
+	if ((ph->logical_disk_size == 0) || (ph->logical_disk_start +
+			ph->logical_disk_size > ph->config_start)) {
+		ldm_error("PRIVHEAD disk size doesn't match real disk size");
+		return false;
+	}
+	if (!ldm_parse_guid(data + 0x0030, ph->disk_id)) {
+		ldm_error("PRIVHEAD contains an invalid GUID.");
+		return false;
+	}
+	ldm_debug("Parsed PRIVHEAD successfully.");
+	return true;
+}
+
+/**
+ * ldm_parse_tocblock - Read the LDM Database TOCBLOCK structure
+ * @data:  Raw database TOCBLOCK structure loaded from the device
+ * @toc:   In-memory toc structure in which to return parsed information
+ *
+ * This parses the LDM Database TOCBLOCK (table of contents) structure supplied
+ * in @data and sets up the in-memory tocblock structure @toc with the obtained
+ * information.
+ *
+ * N.B.  The *_start and *_size values returned in @toc are not range-checked.
+ *
+ * Return:  'true'   @toc contains the TOCBLOCK data
+ *          'false'  @toc contents are undefined
+ */
+static bool ldm_parse_tocblock (const u8 *data, struct tocblock *toc)
+{
+	BUG_ON (!data || !toc);
+
+	if (MAGIC_TOCBLOCK != get_unaligned_be64(data)) {
+		ldm_crit ("Cannot find TOCBLOCK, database may be corrupt.");
+		return false;
+	}
+	strncpy (toc->bitmap1_name, data + 0x24, sizeof (toc->bitmap1_name));
+	toc->bitmap1_name[sizeof (toc->bitmap1_name) - 1] = 0;
+	toc->bitmap1_start = get_unaligned_be64(data + 0x2E);
+	toc->bitmap1_size  = get_unaligned_be64(data + 0x36);
+
+	if (strncmp (toc->bitmap1_name, TOC_BITMAP1,
+			sizeof (toc->bitmap1_name)) != 0) {
+		ldm_crit ("TOCBLOCK's first bitmap is '%s', should be '%s'.",
+				TOC_BITMAP1, toc->bitmap1_name);
+		return false;
+	}
+	strncpy (toc->bitmap2_name, data + 0x46, sizeof (toc->bitmap2_name));
+	toc->bitmap2_name[sizeof (toc->bitmap2_name) - 1] = 0;
+	toc->bitmap2_start = get_unaligned_be64(data + 0x50);
+	toc->bitmap2_size  = get_unaligned_be64(data + 0x58);
+	if (strncmp (toc->bitmap2_name, TOC_BITMAP2,
+			sizeof (toc->bitmap2_name)) != 0) {
+		ldm_crit ("TOCBLOCK's second bitmap is '%s', should be '%s'.",
+				TOC_BITMAP2, toc->bitmap2_name);
+		return false;
+	}
+	ldm_debug ("Parsed TOCBLOCK successfully.");
+	return true;
+}
+
+/**
+ * ldm_parse_vmdb - Read the LDM Database VMDB structure
+ * @data:  Raw database VMDB structure loaded from the device
+ * @vm:    In-memory vmdb structure in which to return parsed information
+ *
+ * This parses the LDM Database VMDB structure supplied in @data and sets up
+ * the in-memory vmdb structure @vm with the obtained information.
+ *
+ * N.B.  The *_start, *_size and *_seq values will be range-checked later.
+ *
+ * Return:  'true'   @vm contains VMDB info
+ *          'false'  @vm contents are undefined
+ */
+static bool ldm_parse_vmdb (const u8 *data, struct vmdb *vm)
+{
+	BUG_ON (!data || !vm);
+
+	if (MAGIC_VMDB != get_unaligned_be32(data)) {
+		ldm_crit ("Cannot find the VMDB, database may be corrupt.");
+		return false;
+	}
+
+	vm->ver_major = get_unaligned_be16(data + 0x12);
+	vm->ver_minor = get_unaligned_be16(data + 0x14);
+	if ((vm->ver_major != 4) || (vm->ver_minor != 10)) {
+		ldm_error ("Expected VMDB version %d.%d, got %d.%d. "
+			"Aborting.", 4, 10, vm->ver_major, vm->ver_minor);
+		return false;
+	}
+
+	vm->vblk_size     = get_unaligned_be32(data + 0x08);
+	if (vm->vblk_size == 0) {
+		ldm_error ("Illegal VBLK size");
+		return false;
+	}
+
+	vm->vblk_offset   = get_unaligned_be32(data + 0x0C);
+	vm->last_vblk_seq = get_unaligned_be32(data + 0x04);
+
+	ldm_debug ("Parsed VMDB successfully.");
+	return true;
+}
+
+/**
+ * ldm_compare_privheads - Compare two privhead objects
+ * @ph1:  First privhead
+ * @ph2:  Second privhead
+ *
+ * This compares the two privhead structures @ph1 and @ph2.
+ *
+ * Return:  'true'   Identical
+ *          'false'  Different
+ */
+static bool ldm_compare_privheads (const struct privhead *ph1,
+				   const struct privhead *ph2)
+{
+	BUG_ON (!ph1 || !ph2);
+
+	return ((ph1->ver_major          == ph2->ver_major)		&&
+		(ph1->ver_minor          == ph2->ver_minor)		&&
+		(ph1->logical_disk_start == ph2->logical_disk_start)	&&
+		(ph1->logical_disk_size  == ph2->logical_disk_size)	&&
+		(ph1->config_start       == ph2->config_start)		&&
+		(ph1->config_size        == ph2->config_size)		&&
+		!memcmp (ph1->disk_id, ph2->disk_id, GUID_SIZE));
+}
+
+/**
+ * ldm_compare_tocblocks - Compare two tocblock objects
+ * @toc1:  First toc
+ * @toc2:  Second toc
+ *
+ * This compares the two tocblock structures @toc1 and @toc2.
+ *
+ * Return:  'true'   Identical
+ *          'false'  Different
+ */
+static bool ldm_compare_tocblocks (const struct tocblock *toc1,
+				   const struct tocblock *toc2)
+{
+	BUG_ON (!toc1 || !toc2);
+
+	return ((toc1->bitmap1_start == toc2->bitmap1_start)	&&
+		(toc1->bitmap1_size  == toc2->bitmap1_size)	&&
+		(toc1->bitmap2_start == toc2->bitmap2_start)	&&
+		(toc1->bitmap2_size  == toc2->bitmap2_size)	&&
+		!strncmp (toc1->bitmap1_name, toc2->bitmap1_name,
+			sizeof (toc1->bitmap1_name))		&&
+		!strncmp (toc1->bitmap2_name, toc2->bitmap2_name,
+			sizeof (toc1->bitmap2_name)));
+}
+
+/**
+ * ldm_validate_privheads - Compare the primary privhead with its backups
+ * @state: Partition check state including device holding the LDM Database
+ * @ph1:   Memory struct to fill with ph contents
+ *
+ * Read and compare all three privheads from disk.
+ *
+ * The privheads on disk show the size and location of the main disk area and
+ * the configuration area (the database).  The values are range-checked against
+ * @hd, which contains the real size of the disk.
+ *
+ * Return:  'true'   Success
+ *          'false'  Error
+ */
+static bool ldm_validate_privheads(struct parsed_partitions *state,
+				   struct privhead *ph1)
+{
+	static const int off[3] = { OFF_PRIV1, OFF_PRIV2, OFF_PRIV3 };
+	struct privhead *ph[3] = { ph1 };
+	Sector sect;
+	u8 *data;
+	bool result = false;
+	long num_sects;
+	int i;
+
+	BUG_ON (!state || !ph1);
+
+	ph[1] = kmalloc (sizeof (*ph[1]), GFP_KERNEL);
+	ph[2] = kmalloc (sizeof (*ph[2]), GFP_KERNEL);
+	if (!ph[1] || !ph[2]) {
+		ldm_crit ("Out of memory.");
+		goto out;
+	}
+
+	/* off[1 & 2] are relative to ph[0]->config_start */
+	ph[0]->config_start = 0;
+
+	/* Read and parse privheads */
+	for (i = 0; i < 3; i++) {
+		data = read_part_sector(state, ph[0]->config_start + off[i],
+					&sect);
+		if (!data) {
+			ldm_crit ("Disk read failed.");
+			goto out;
+		}
+		result = ldm_parse_privhead (data, ph[i]);
+		put_dev_sector (sect);
+		if (!result) {
+			ldm_error ("Cannot find PRIVHEAD %d.", i+1); /* Log again */
+			if (i < 2)
+				goto out;	/* Already logged */
+			else
+				break;	/* FIXME ignore for now, 3rd PH can fail on odd-sized disks */
+		}
+	}
+
+	num_sects = state->bdev->bd_inode->i_size >> 9;
+
+	if ((ph[0]->config_start > num_sects) ||
+	   ((ph[0]->config_start + ph[0]->config_size) > num_sects)) {
+		ldm_crit ("Database extends beyond the end of the disk.");
+		goto out;
+	}
+
+	if ((ph[0]->logical_disk_start > ph[0]->config_start) ||
+	   ((ph[0]->logical_disk_start + ph[0]->logical_disk_size)
+		    > ph[0]->config_start)) {
+		ldm_crit ("Disk and database overlap.");
+		goto out;
+	}
+
+	if (!ldm_compare_privheads (ph[0], ph[1])) {
+		ldm_crit ("Primary and backup PRIVHEADs don't match.");
+		goto out;
+	}
+	/* FIXME ignore this for now
+	if (!ldm_compare_privheads (ph[0], ph[2])) {
+		ldm_crit ("Primary and backup PRIVHEADs don't match.");
+		goto out;
+	}*/
+	ldm_debug ("Validated PRIVHEADs successfully.");
+	result = true;
+out:
+	kfree (ph[1]);
+	kfree (ph[2]);
+	return result;
+}
+
+/**
+ * ldm_validate_tocblocks - Validate the table of contents and its backups
+ * @state: Partition check state including device holding the LDM Database
+ * @base:  Offset, into @state->bdev, of the database
+ * @ldb:   Cache of the database structures
+ *
+ * Find and compare the four tables of contents of the LDM Database stored on
+ * @state->bdev and return the parsed information into @toc1.
+ *
+ * The offsets and sizes of the configs are range-checked against a privhead.
+ *
+ * Return:  'true'   @toc1 contains validated TOCBLOCK info
+ *          'false'  @toc1 contents are undefined
+ */
+static bool ldm_validate_tocblocks(struct parsed_partitions *state,
+				   unsigned long base, struct ldmdb *ldb)
+{
+	static const int off[4] = { OFF_TOCB1, OFF_TOCB2, OFF_TOCB3, OFF_TOCB4};
+	struct tocblock *tb[4];
+	struct privhead *ph;
+	Sector sect;
+	u8 *data;
+	int i, nr_tbs;
+	bool result = false;
+
+	BUG_ON(!state || !ldb);
+	ph = &ldb->ph;
+	tb[0] = &ldb->toc;
+	tb[1] = kmalloc(sizeof(*tb[1]) * 3, GFP_KERNEL);
+	if (!tb[1]) {
+		ldm_crit("Out of memory.");
+		goto err;
+	}
+	tb[2] = (struct tocblock*)((u8*)tb[1] + sizeof(*tb[1]));
+	tb[3] = (struct tocblock*)((u8*)tb[2] + sizeof(*tb[2]));
+	/*
+	 * Try to read and parse all four TOCBLOCKs.
+	 *
+	 * Windows Vista LDM v2.12 does not always have all four TOCBLOCKs so
+	 * skip any that fail as long as we get at least one valid TOCBLOCK.
+	 */
+	for (nr_tbs = i = 0; i < 4; i++) {
+		data = read_part_sector(state, base + off[i], &sect);
+		if (!data) {
+			ldm_error("Disk read failed for TOCBLOCK %d.", i);
+			continue;
+		}
+		if (ldm_parse_tocblock(data, tb[nr_tbs]))
+			nr_tbs++;
+		put_dev_sector(sect);
+	}
+	if (!nr_tbs) {
+		ldm_crit("Failed to find a valid TOCBLOCK.");
+		goto err;
+	}
+	/* Range check the TOCBLOCK against a privhead. */
+	if (((tb[0]->bitmap1_start + tb[0]->bitmap1_size) > ph->config_size) ||
+			((tb[0]->bitmap2_start + tb[0]->bitmap2_size) >
+			ph->config_size)) {
+		ldm_crit("The bitmaps are out of range.  Giving up.");
+		goto err;
+	}
+	/* Compare all loaded TOCBLOCKs. */
+	for (i = 1; i < nr_tbs; i++) {
+		if (!ldm_compare_tocblocks(tb[0], tb[i])) {
+			ldm_crit("TOCBLOCKs 0 and %d do not match.", i);
+			goto err;
+		}
+	}
+	ldm_debug("Validated %d TOCBLOCKs successfully.", nr_tbs);
+	result = true;
+err:
+	kfree(tb[1]);
+	return result;
+}
+
+/**
+ * ldm_validate_vmdb - Read the VMDB and validate it
+ * @state: Partition check state including device holding the LDM Database
+ * @base:  Offset, into @bdev, of the database
+ * @ldb:   Cache of the database structures
+ *
+ * Find the vmdb of the LDM Database stored on @bdev and return the parsed
+ * information in @ldb.
+ *
+ * Return:  'true'   @ldb contains validated VBDB info
+ *          'false'  @ldb contents are undefined
+ */
+static bool ldm_validate_vmdb(struct parsed_partitions *state,
+			      unsigned long base, struct ldmdb *ldb)
+{
+	Sector sect;
+	u8 *data;
+	bool result = false;
+	struct vmdb *vm;
+	struct tocblock *toc;
+
+	BUG_ON (!state || !ldb);
+
+	vm  = &ldb->vm;
+	toc = &ldb->toc;
+
+	data = read_part_sector(state, base + OFF_VMDB, &sect);
+	if (!data) {
+		ldm_crit ("Disk read failed.");
+		return false;
+	}
+
+	if (!ldm_parse_vmdb (data, vm))
+		goto out;				/* Already logged */
+
+	/* Are there uncommitted transactions? */
+	if (get_unaligned_be16(data + 0x10) != 0x01) {
+		ldm_crit ("Database is not in a consistent state.  Aborting.");
+		goto out;
+	}
+
+	if (vm->vblk_offset != 512)
+		ldm_info ("VBLKs start at offset 0x%04x.", vm->vblk_offset);
+
+	/*
+	 * The last_vblkd_seq can be before the end of the vmdb, just make sure
+	 * it is not out of bounds.
+	 */
+	if ((vm->vblk_size * vm->last_vblk_seq) > (toc->bitmap1_size << 9)) {
+		ldm_crit ("VMDB exceeds allowed size specified by TOCBLOCK.  "
+				"Database is corrupt.  Aborting.");
+		goto out;
+	}
+
+	result = true;
+out:
+	put_dev_sector (sect);
+	return result;
+}
+
+
+/**
+ * ldm_validate_partition_table - Determine whether bdev might be a dynamic disk
+ * @state: Partition check state including device holding the LDM Database
+ *
+ * This function provides a weak test to decide whether the device is a dynamic
+ * disk or not.  It looks for an MS-DOS-style partition table containing at
+ * least one partition of type 0x42 (formerly SFS, now used by Windows for
+ * dynamic disks).
+ *
+ * N.B.  The only possible error can come from the read_part_sector and that is
+ *       only likely to happen if the underlying device is strange.  If that IS
+ *       the case we should return zero to let someone else try.
+ *
+ * Return:  'true'   @state->bdev is a dynamic disk
+ *          'false'  @state->bdev is not a dynamic disk, or an error occurred
+ */
+static bool ldm_validate_partition_table(struct parsed_partitions *state)
+{
+	Sector sect;
+	u8 *data;
+	struct partition *p;
+	int i;
+	bool result = false;
+
+	BUG_ON(!state);
+
+	data = read_part_sector(state, 0, &sect);
+	if (!data) {
+		ldm_info ("Disk read failed.");
+		return false;
+	}
+
+	if (*(__le16*) (data + 0x01FE) != cpu_to_le16 (MSDOS_LABEL_MAGIC))
+		goto out;
+
+	p = (struct partition*)(data + 0x01BE);
+	for (i = 0; i < 4; i++, p++)
+		if (SYS_IND (p) == LDM_PARTITION) {
+			result = true;
+			break;
+		}
+
+	if (result)
+		ldm_debug ("Found W2K dynamic disk partition type.");
+
+out:
+	put_dev_sector (sect);
+	return result;
+}
+
+/**
+ * ldm_get_disk_objid - Search a linked list of vblk's for a given Disk Id
+ * @ldb:  Cache of the database structures
+ *
+ * The LDM Database contains a list of all partitions on all dynamic disks.
+ * The primary PRIVHEAD, at the beginning of the physical disk, tells us
+ * the GUID of this disk.  This function searches for the GUID in a linked
+ * list of vblk's.
+ *
+ * Return:  Pointer, A matching vblk was found
+ *          NULL,    No match, or an error
+ */
+static struct vblk * ldm_get_disk_objid (const struct ldmdb *ldb)
+{
+	struct list_head *item;
+
+	BUG_ON (!ldb);
+
+	list_for_each (item, &ldb->v_disk) {
+		struct vblk *v = list_entry (item, struct vblk, list);
+		if (!memcmp (v->vblk.disk.disk_id, ldb->ph.disk_id, GUID_SIZE))
+			return v;
+	}
+
+	return NULL;
+}
+
+/**
+ * ldm_create_data_partitions - Create data partitions for this device
+ * @pp:   List of the partitions parsed so far
+ * @ldb:  Cache of the database structures
+ *
+ * The database contains ALL the partitions for ALL disk groups, so we need to
+ * filter out this specific disk. Using the disk's object id, we can find all
+ * the partitions in the database that belong to this disk.
+ *
+ * Add each partition in our database, to the parsed_partitions structure.
+ *
+ * N.B.  This function creates the partitions in the order it finds partition
+ *       objects in the linked list.
+ *
+ * Return:  'true'   Partition created
+ *          'false'  Error, probably a range checking problem
+ */
+static bool ldm_create_data_partitions (struct parsed_partitions *pp,
+					const struct ldmdb *ldb)
+{
+	struct list_head *item;
+	struct vblk *vb;
+	struct vblk *disk;
+	struct vblk_part *part;
+	int part_num = 1;
+
+	BUG_ON (!pp || !ldb);
+
+	disk = ldm_get_disk_objid (ldb);
+	if (!disk) {
+		ldm_crit ("Can't find the ID of this disk in the database.");
+		return false;
+	}
+
+	strlcat(pp->pp_buf, " [LDM]", PAGE_SIZE);
+
+	/* Create the data partitions */
+	list_for_each (item, &ldb->v_part) {
+		vb = list_entry (item, struct vblk, list);
+		part = &vb->vblk.part;
+
+		if (part->disk_id != disk->obj_id)
+			continue;
+
+		put_partition (pp, part_num, ldb->ph.logical_disk_start +
+				part->start, part->size);
+		part_num++;
+	}
+
+	strlcat(pp->pp_buf, "\n", PAGE_SIZE);
+	return true;
+}
+
+
+/**
+ * ldm_relative - Calculate the next relative offset
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @base:    Size of the previous fixed width fields
+ * @offset:  Cumulative size of the previous variable-width fields
+ *
+ * Because many of the VBLK fields are variable-width, it's necessary
+ * to calculate each offset based on the previous one and the length
+ * of the field it pointed to.
+ *
+ * Return:  -1 Error, the calculated offset exceeded the size of the buffer
+ *           n OK, a range-checked offset into buffer
+ */
+static int ldm_relative(const u8 *buffer, int buflen, int base, int offset)
+{
+
+	base += offset;
+	if (!buffer || offset < 0 || base > buflen) {
+		if (!buffer)
+			ldm_error("!buffer");
+		if (offset < 0)
+			ldm_error("offset (%d) < 0", offset);
+		if (base > buflen)
+			ldm_error("base (%d) > buflen (%d)", base, buflen);
+		return -1;
+	}
+	if (base + buffer[base] >= buflen) {
+		ldm_error("base (%d) + buffer[base] (%d) >= buflen (%d)", base,
+				buffer[base], buflen);
+		return -1;
+	}
+	return buffer[base] + offset + 1;
+}
+
+/**
+ * ldm_get_vnum - Convert a variable-width, big endian number, into cpu order
+ * @block:  Pointer to the variable-width number to convert
+ *
+ * Large numbers in the LDM Database are often stored in a packed format.  Each
+ * number is prefixed by a one byte width marker.  All numbers in the database
+ * are stored in big-endian byte order.  This function reads one of these
+ * numbers and returns the result
+ *
+ * N.B.  This function DOES NOT perform any range checking, though the most
+ *       it will read is eight bytes.
+ *
+ * Return:  n A number
+ *          0 Zero, or an error occurred
+ */
+static u64 ldm_get_vnum (const u8 *block)
+{
+	u64 tmp = 0;
+	u8 length;
+
+	BUG_ON (!block);
+
+	length = *block++;
+
+	if (length && length <= 8)
+		while (length--)
+			tmp = (tmp << 8) | *block++;
+	else
+		ldm_error ("Illegal length %d.", length);
+
+	return tmp;
+}
+
+/**
+ * ldm_get_vstr - Read a length-prefixed string into a buffer
+ * @block:   Pointer to the length marker
+ * @buffer:  Location to copy string to
+ * @buflen:  Size of the output buffer
+ *
+ * Many of the strings in the LDM Database are not NULL terminated.  Instead
+ * they are prefixed by a one byte length marker.  This function copies one of
+ * these strings into a buffer.
+ *
+ * N.B.  This function DOES NOT perform any range checking on the input.
+ *       If the buffer is too small, the output will be truncated.
+ *
+ * Return:  0, Error and @buffer contents are undefined
+ *          n, String length in characters (excluding NULL)
+ *          buflen-1, String was truncated.
+ */
+static int ldm_get_vstr (const u8 *block, u8 *buffer, int buflen)
+{
+	int length;
+
+	BUG_ON (!block || !buffer);
+
+	length = block[0];
+	if (length >= buflen) {
+		ldm_error ("Truncating string %d -> %d.", length, buflen);
+		length = buflen - 1;
+	}
+	memcpy (buffer, block + 1, length);
+	buffer[length] = 0;
+	return length;
+}
+
+
+/**
+ * ldm_parse_cmp3 - Read a raw VBLK Component object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Component object (version 3) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Component VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_cmp3 (const u8 *buffer, int buflen, struct vblk *vb)
+{
+	int r_objid, r_name, r_vstate, r_child, r_parent, r_stripe, r_cols, len;
+	struct vblk_comp *comp;
+
+	BUG_ON (!buffer || !vb);
+
+	r_objid  = ldm_relative (buffer, buflen, 0x18, 0);
+	r_name   = ldm_relative (buffer, buflen, 0x18, r_objid);
+	r_vstate = ldm_relative (buffer, buflen, 0x18, r_name);
+	r_child  = ldm_relative (buffer, buflen, 0x1D, r_vstate);
+	r_parent = ldm_relative (buffer, buflen, 0x2D, r_child);
+
+	if (buffer[0x12] & VBLK_FLAG_COMP_STRIPE) {
+		r_stripe = ldm_relative (buffer, buflen, 0x2E, r_parent);
+		r_cols   = ldm_relative (buffer, buflen, 0x2E, r_stripe);
+		len = r_cols;
+	} else {
+		r_stripe = 0;
+		r_cols   = 0;
+		len = r_parent;
+	}
+	if (len < 0)
+		return false;
+
+	len += VBLK_SIZE_CMP3;
+	if (len != get_unaligned_be32(buffer + 0x14))
+		return false;
+
+	comp = &vb->vblk.comp;
+	ldm_get_vstr (buffer + 0x18 + r_name, comp->state,
+		sizeof (comp->state));
+	comp->type      = buffer[0x18 + r_vstate];
+	comp->children  = ldm_get_vnum (buffer + 0x1D + r_vstate);
+	comp->parent_id = ldm_get_vnum (buffer + 0x2D + r_child);
+	comp->chunksize = r_stripe ? ldm_get_vnum (buffer+r_parent+0x2E) : 0;
+
+	return true;
+}
+
+/**
+ * ldm_parse_dgr3 - Read a raw VBLK Disk Group object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Disk Group object (version 3) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Disk Group VBLK
+ *          'false'  @vb contents are not defined
+ */
+static int ldm_parse_dgr3 (const u8 *buffer, int buflen, struct vblk *vb)
+{
+	int r_objid, r_name, r_diskid, r_id1, r_id2, len;
+	struct vblk_dgrp *dgrp;
+
+	BUG_ON (!buffer || !vb);
+
+	r_objid  = ldm_relative (buffer, buflen, 0x18, 0);
+	r_name   = ldm_relative (buffer, buflen, 0x18, r_objid);
+	r_diskid = ldm_relative (buffer, buflen, 0x18, r_name);
+
+	if (buffer[0x12] & VBLK_FLAG_DGR3_IDS) {
+		r_id1 = ldm_relative (buffer, buflen, 0x24, r_diskid);
+		r_id2 = ldm_relative (buffer, buflen, 0x24, r_id1);
+		len = r_id2;
+	} else {
+		r_id1 = 0;
+		r_id2 = 0;
+		len = r_diskid;
+	}
+	if (len < 0)
+		return false;
+
+	len += VBLK_SIZE_DGR3;
+	if (len != get_unaligned_be32(buffer + 0x14))
+		return false;
+
+	dgrp = &vb->vblk.dgrp;
+	ldm_get_vstr (buffer + 0x18 + r_name, dgrp->disk_id,
+		sizeof (dgrp->disk_id));
+	return true;
+}
+
+/**
+ * ldm_parse_dgr4 - Read a raw VBLK Disk Group object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Disk Group object (version 4) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Disk Group VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_dgr4 (const u8 *buffer, int buflen, struct vblk *vb)
+{
+	char buf[64];
+	int r_objid, r_name, r_id1, r_id2, len;
+	struct vblk_dgrp *dgrp;
+
+	BUG_ON (!buffer || !vb);
+
+	r_objid  = ldm_relative (buffer, buflen, 0x18, 0);
+	r_name   = ldm_relative (buffer, buflen, 0x18, r_objid);
+
+	if (buffer[0x12] & VBLK_FLAG_DGR4_IDS) {
+		r_id1 = ldm_relative (buffer, buflen, 0x44, r_name);
+		r_id2 = ldm_relative (buffer, buflen, 0x44, r_id1);
+		len = r_id2;
+	} else {
+		r_id1 = 0;
+		r_id2 = 0;
+		len = r_name;
+	}
+	if (len < 0)
+		return false;
+
+	len += VBLK_SIZE_DGR4;
+	if (len != get_unaligned_be32(buffer + 0x14))
+		return false;
+
+	dgrp = &vb->vblk.dgrp;
+
+	ldm_get_vstr (buffer + 0x18 + r_objid, buf, sizeof (buf));
+	return true;
+}
+
+/**
+ * ldm_parse_dsk3 - Read a raw VBLK Disk object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Disk object (version 3) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Disk VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_dsk3 (const u8 *buffer, int buflen, struct vblk *vb)
+{
+	int r_objid, r_name, r_diskid, r_altname, len;
+	struct vblk_disk *disk;
+
+	BUG_ON (!buffer || !vb);
+
+	r_objid   = ldm_relative (buffer, buflen, 0x18, 0);
+	r_name    = ldm_relative (buffer, buflen, 0x18, r_objid);
+	r_diskid  = ldm_relative (buffer, buflen, 0x18, r_name);
+	r_altname = ldm_relative (buffer, buflen, 0x18, r_diskid);
+	len = r_altname;
+	if (len < 0)
+		return false;
+
+	len += VBLK_SIZE_DSK3;
+	if (len != get_unaligned_be32(buffer + 0x14))
+		return false;
+
+	disk = &vb->vblk.disk;
+	ldm_get_vstr (buffer + 0x18 + r_diskid, disk->alt_name,
+		sizeof (disk->alt_name));
+	if (!ldm_parse_guid (buffer + 0x19 + r_name, disk->disk_id))
+		return false;
+
+	return true;
+}
+
+/**
+ * ldm_parse_dsk4 - Read a raw VBLK Disk object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Disk object (version 4) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Disk VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_dsk4 (const u8 *buffer, int buflen, struct vblk *vb)
+{
+	int r_objid, r_name, len;
+	struct vblk_disk *disk;
+
+	BUG_ON (!buffer || !vb);
+
+	r_objid = ldm_relative (buffer, buflen, 0x18, 0);
+	r_name  = ldm_relative (buffer, buflen, 0x18, r_objid);
+	len     = r_name;
+	if (len < 0)
+		return false;
+
+	len += VBLK_SIZE_DSK4;
+	if (len != get_unaligned_be32(buffer + 0x14))
+		return false;
+
+	disk = &vb->vblk.disk;
+	memcpy (disk->disk_id, buffer + 0x18 + r_name, GUID_SIZE);
+	return true;
+}
+
+/**
+ * ldm_parse_prt3 - Read a raw VBLK Partition object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Partition object (version 3) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Partition VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_prt3(const u8 *buffer, int buflen, struct vblk *vb)
+{
+	int r_objid, r_name, r_size, r_parent, r_diskid, r_index, len;
+	struct vblk_part *part;
+
+	BUG_ON(!buffer || !vb);
+	r_objid = ldm_relative(buffer, buflen, 0x18, 0);
+	if (r_objid < 0) {
+		ldm_error("r_objid %d < 0", r_objid);
+		return false;
+	}
+	r_name = ldm_relative(buffer, buflen, 0x18, r_objid);
+	if (r_name < 0) {
+		ldm_error("r_name %d < 0", r_name);
+		return false;
+	}
+	r_size = ldm_relative(buffer, buflen, 0x34, r_name);
+	if (r_size < 0) {
+		ldm_error("r_size %d < 0", r_size);
+		return false;
+	}
+	r_parent = ldm_relative(buffer, buflen, 0x34, r_size);
+	if (r_parent < 0) {
+		ldm_error("r_parent %d < 0", r_parent);
+		return false;
+	}
+	r_diskid = ldm_relative(buffer, buflen, 0x34, r_parent);
+	if (r_diskid < 0) {
+		ldm_error("r_diskid %d < 0", r_diskid);
+		return false;
+	}
+	if (buffer[0x12] & VBLK_FLAG_PART_INDEX) {
+		r_index = ldm_relative(buffer, buflen, 0x34, r_diskid);
+		if (r_index < 0) {
+			ldm_error("r_index %d < 0", r_index);
+			return false;
+		}
+		len = r_index;
+	} else {
+		r_index = 0;
+		len = r_diskid;
+	}
+	if (len < 0) {
+		ldm_error("len %d < 0", len);
+		return false;
+	}
+	len += VBLK_SIZE_PRT3;
+	if (len > get_unaligned_be32(buffer + 0x14)) {
+		ldm_error("len %d > BE32(buffer + 0x14) %d", len,
+				get_unaligned_be32(buffer + 0x14));
+		return false;
+	}
+	part = &vb->vblk.part;
+	part->start = get_unaligned_be64(buffer + 0x24 + r_name);
+	part->volume_offset = get_unaligned_be64(buffer + 0x2C + r_name);
+	part->size = ldm_get_vnum(buffer + 0x34 + r_name);
+	part->parent_id = ldm_get_vnum(buffer + 0x34 + r_size);
+	part->disk_id = ldm_get_vnum(buffer + 0x34 + r_parent);
+	if (vb->flags & VBLK_FLAG_PART_INDEX)
+		part->partnum = buffer[0x35 + r_diskid];
+	else
+		part->partnum = 0;
+	return true;
+}
+
+/**
+ * ldm_parse_vol5 - Read a raw VBLK Volume object into a vblk structure
+ * @buffer:  Block of data being worked on
+ * @buflen:  Size of the block of data
+ * @vb:      In-memory vblk in which to return information
+ *
+ * Read a raw VBLK Volume object (version 5) into a vblk structure.
+ *
+ * Return:  'true'   @vb contains a Volume VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_vol5(const u8 *buffer, int buflen, struct vblk *vb)
+{
+	int r_objid, r_name, r_vtype, r_disable_drive_letter, r_child, r_size;
+	int r_id1, r_id2, r_size2, r_drive, len;
+	struct vblk_volu *volu;
+
+	BUG_ON(!buffer || !vb);
+	r_objid = ldm_relative(buffer, buflen, 0x18, 0);
+	if (r_objid < 0) {
+		ldm_error("r_objid %d < 0", r_objid);
+		return false;
+	}
+	r_name = ldm_relative(buffer, buflen, 0x18, r_objid);
+	if (r_name < 0) {
+		ldm_error("r_name %d < 0", r_name);
+		return false;
+	}
+	r_vtype = ldm_relative(buffer, buflen, 0x18, r_name);
+	if (r_vtype < 0) {
+		ldm_error("r_vtype %d < 0", r_vtype);
+		return false;
+	}
+	r_disable_drive_letter = ldm_relative(buffer, buflen, 0x18, r_vtype);
+	if (r_disable_drive_letter < 0) {
+		ldm_error("r_disable_drive_letter %d < 0",
+				r_disable_drive_letter);
+		return false;
+	}
+	r_child = ldm_relative(buffer, buflen, 0x2D, r_disable_drive_letter);
+	if (r_child < 0) {
+		ldm_error("r_child %d < 0", r_child);
+		return false;
+	}
+	r_size = ldm_relative(buffer, buflen, 0x3D, r_child);
+	if (r_size < 0) {
+		ldm_error("r_size %d < 0", r_size);
+		return false;
+	}
+	if (buffer[0x12] & VBLK_FLAG_VOLU_ID1) {
+		r_id1 = ldm_relative(buffer, buflen, 0x52, r_size);
+		if (r_id1 < 0) {
+			ldm_error("r_id1 %d < 0", r_id1);
+			return false;
+		}
+	} else
+		r_id1 = r_size;
+	if (buffer[0x12] & VBLK_FLAG_VOLU_ID2) {
+		r_id2 = ldm_relative(buffer, buflen, 0x52, r_id1);
+		if (r_id2 < 0) {
+			ldm_error("r_id2 %d < 0", r_id2);
+			return false;
+		}
+	} else
+		r_id2 = r_id1;
+	if (buffer[0x12] & VBLK_FLAG_VOLU_SIZE) {
+		r_size2 = ldm_relative(buffer, buflen, 0x52, r_id2);
+		if (r_size2 < 0) {
+			ldm_error("r_size2 %d < 0", r_size2);
+			return false;
+		}
+	} else
+		r_size2 = r_id2;
+	if (buffer[0x12] & VBLK_FLAG_VOLU_DRIVE) {
+		r_drive = ldm_relative(buffer, buflen, 0x52, r_size2);
+		if (r_drive < 0) {
+			ldm_error("r_drive %d < 0", r_drive);
+			return false;
+		}
+	} else
+		r_drive = r_size2;
+	len = r_drive;
+	if (len < 0) {
+		ldm_error("len %d < 0", len);
+		return false;
+	}
+	len += VBLK_SIZE_VOL5;
+	if (len > get_unaligned_be32(buffer + 0x14)) {
+		ldm_error("len %d > BE32(buffer + 0x14) %d", len,
+				get_unaligned_be32(buffer + 0x14));
+		return false;
+	}
+	volu = &vb->vblk.volu;
+	ldm_get_vstr(buffer + 0x18 + r_name, volu->volume_type,
+			sizeof(volu->volume_type));
+	memcpy(volu->volume_state, buffer + 0x18 + r_disable_drive_letter,
+			sizeof(volu->volume_state));
+	volu->size = ldm_get_vnum(buffer + 0x3D + r_child);
+	volu->partition_type = buffer[0x41 + r_size];
+	memcpy(volu->guid, buffer + 0x42 + r_size, sizeof(volu->guid));
+	if (buffer[0x12] & VBLK_FLAG_VOLU_DRIVE) {
+		ldm_get_vstr(buffer + 0x52 + r_size, volu->drive_hint,
+				sizeof(volu->drive_hint));
+	}
+	return true;
+}
+
+/**
+ * ldm_parse_vblk - Read a raw VBLK object into a vblk structure
+ * @buf:  Block of data being worked on
+ * @len:  Size of the block of data
+ * @vb:   In-memory vblk in which to return information
+ *
+ * Read a raw VBLK object into a vblk structure.  This function just reads the
+ * information common to all VBLK types, then delegates the rest of the work to
+ * helper functions: ldm_parse_*.
+ *
+ * Return:  'true'   @vb contains a VBLK
+ *          'false'  @vb contents are not defined
+ */
+static bool ldm_parse_vblk (const u8 *buf, int len, struct vblk *vb)
+{
+	bool result = false;
+	int r_objid;
+
+	BUG_ON (!buf || !vb);
+
+	r_objid = ldm_relative (buf, len, 0x18, 0);
+	if (r_objid < 0) {
+		ldm_error ("VBLK header is corrupt.");
+		return false;
+	}
+
+	vb->flags  = buf[0x12];
+	vb->type   = buf[0x13];
+	vb->obj_id = ldm_get_vnum (buf + 0x18);
+	ldm_get_vstr (buf+0x18+r_objid, vb->name, sizeof (vb->name));
+
+	switch (vb->type) {
+		case VBLK_CMP3:  result = ldm_parse_cmp3 (buf, len, vb); break;
+		case VBLK_DSK3:  result = ldm_parse_dsk3 (buf, len, vb); break;
+		case VBLK_DSK4:  result = ldm_parse_dsk4 (buf, len, vb); break;
+		case VBLK_DGR3:  result = ldm_parse_dgr3 (buf, len, vb); break;
+		case VBLK_DGR4:  result = ldm_parse_dgr4 (buf, len, vb); break;
+		case VBLK_PRT3:  result = ldm_parse_prt3 (buf, len, vb); break;
+		case VBLK_VOL5:  result = ldm_parse_vol5 (buf, len, vb); break;
+	}
+
+	if (result)
+		ldm_debug ("Parsed VBLK 0x%llx (type: 0x%02x) ok.",
+			 (unsigned long long) vb->obj_id, vb->type);
+	else
+		ldm_error ("Failed to parse VBLK 0x%llx (type: 0x%02x).",
+			(unsigned long long) vb->obj_id, vb->type);
+
+	return result;
+}
+
+
+/**
+ * ldm_ldmdb_add - Adds a raw VBLK entry to the ldmdb database
+ * @data:  Raw VBLK to add to the database
+ * @len:   Size of the raw VBLK
+ * @ldb:   Cache of the database structures
+ *
+ * The VBLKs are sorted into categories.  Partitions are also sorted by offset.
+ *
+ * N.B.  This function does not check the validity of the VBLKs.
+ *
+ * Return:  'true'   The VBLK was added
+ *          'false'  An error occurred
+ */
+static bool ldm_ldmdb_add (u8 *data, int len, struct ldmdb *ldb)
+{
+	struct vblk *vb;
+	struct list_head *item;
+
+	BUG_ON (!data || !ldb);
+
+	vb = kmalloc (sizeof (*vb), GFP_KERNEL);
+	if (!vb) {
+		ldm_crit ("Out of memory.");
+		return false;
+	}
+
+	if (!ldm_parse_vblk (data, len, vb)) {
+		kfree(vb);
+		return false;			/* Already logged */
+	}
+
+	/* Put vblk into the correct list. */
+	switch (vb->type) {
+	case VBLK_DGR3:
+	case VBLK_DGR4:
+		list_add (&vb->list, &ldb->v_dgrp);
+		break;
+	case VBLK_DSK3:
+	case VBLK_DSK4:
+		list_add (&vb->list, &ldb->v_disk);
+		break;
+	case VBLK_VOL5:
+		list_add (&vb->list, &ldb->v_volu);
+		break;
+	case VBLK_CMP3:
+		list_add (&vb->list, &ldb->v_comp);
+		break;
+	case VBLK_PRT3:
+		/* Sort by the partition's start sector. */
+		list_for_each (item, &ldb->v_part) {
+			struct vblk *v = list_entry (item, struct vblk, list);
+			if ((v->vblk.part.disk_id == vb->vblk.part.disk_id) &&
+			    (v->vblk.part.start > vb->vblk.part.start)) {
+				list_add_tail (&vb->list, &v->list);
+				return true;
+			}
+		}
+		list_add_tail (&vb->list, &ldb->v_part);
+		break;
+	}
+	return true;
+}
+
+/**
+ * ldm_frag_add - Add a VBLK fragment to a list
+ * @data:   Raw fragment to be added to the list
+ * @size:   Size of the raw fragment
+ * @frags:  Linked list of VBLK fragments
+ *
+ * Fragmented VBLKs may not be consecutive in the database, so they are placed
+ * in a list so they can be pieced together later.
+ *
+ * Return:  'true'   Success, the VBLK was added to the list
+ *          'false'  Error, a problem occurred
+ */
+static bool ldm_frag_add (const u8 *data, int size, struct list_head *frags)
+{
+	struct frag *f;
+	struct list_head *item;
+	int rec, num, group;
+
+	BUG_ON (!data || !frags);
+
+	if (size < 2 * VBLK_SIZE_HEAD) {
+		ldm_error("Value of size is to small.");
+		return false;
+	}
+
+	group = get_unaligned_be32(data + 0x08);
+	rec   = get_unaligned_be16(data + 0x0C);
+	num   = get_unaligned_be16(data + 0x0E);
+	if ((num < 1) || (num > 4)) {
+		ldm_error ("A VBLK claims to have %d parts.", num);
+		return false;
+	}
+	if (rec >= num) {
+		ldm_error("REC value (%d) exceeds NUM value (%d)", rec, num);
+		return false;
+	}
+
+	list_for_each (item, frags) {
+		f = list_entry (item, struct frag, list);
+		if (f->group == group)
+			goto found;
+	}
+
+	f = kmalloc (sizeof (*f) + size*num, GFP_KERNEL);
+	if (!f) {
+		ldm_crit ("Out of memory.");
+		return false;
+	}
+
+	f->group = group;
+	f->num   = num;
+	f->rec   = rec;
+	f->map   = 0xFF << num;
+
+	list_add_tail (&f->list, frags);
+found:
+	if (rec >= f->num) {
+		ldm_error("REC value (%d) exceeds NUM value (%d)", rec, f->num);
+		return false;
+	}
+
+	if (f->map & (1 << rec)) {
+		ldm_error ("Duplicate VBLK, part %d.", rec);
+		f->map &= 0x7F;			/* Mark the group as broken */
+		return false;
+	}
+
+	f->map |= (1 << rec);
+
+	data += VBLK_SIZE_HEAD;
+	size -= VBLK_SIZE_HEAD;
+
+	memcpy (f->data+rec*(size-VBLK_SIZE_HEAD)+VBLK_SIZE_HEAD, data, size);
+
+	return true;
+}
+
+/**
+ * ldm_frag_free - Free a linked list of VBLK fragments
+ * @list:  Linked list of fragments
+ *
+ * Free a linked list of VBLK fragments
+ *
+ * Return:  none
+ */
+static void ldm_frag_free (struct list_head *list)
+{
+	struct list_head *item, *tmp;
+
+	BUG_ON (!list);
+
+	list_for_each_safe (item, tmp, list)
+		kfree (list_entry (item, struct frag, list));
+}
+
+/**
+ * ldm_frag_commit - Validate fragmented VBLKs and add them to the database
+ * @frags:  Linked list of VBLK fragments
+ * @ldb:    Cache of the database structures
+ *
+ * Now that all the fragmented VBLKs have been collected, they must be added to
+ * the database for later use.
+ *
+ * Return:  'true'   All the fragments we added successfully
+ *          'false'  One or more of the fragments we invalid
+ */
+static bool ldm_frag_commit (struct list_head *frags, struct ldmdb *ldb)
+{
+	struct frag *f;
+	struct list_head *item;
+
+	BUG_ON (!frags || !ldb);
+
+	list_for_each (item, frags) {
+		f = list_entry (item, struct frag, list);
+
+		if (f->map != 0xFF) {
+			ldm_error ("VBLK group %d is incomplete (0x%02x).",
+				f->group, f->map);
+			return false;
+		}
+
+		if (!ldm_ldmdb_add (f->data, f->num*ldb->vm.vblk_size, ldb))
+			return false;		/* Already logged */
+	}
+	return true;
+}
+
+/**
+ * ldm_get_vblks - Read the on-disk database of VBLKs into memory
+ * @state: Partition check state including device holding the LDM Database
+ * @base:  Offset, into @state->bdev, of the database
+ * @ldb:   Cache of the database structures
+ *
+ * To use the information from the VBLKs, they need to be read from the disk,
+ * unpacked and validated.  We cache them in @ldb according to their type.
+ *
+ * Return:  'true'   All the VBLKs were read successfully
+ *          'false'  An error occurred
+ */
+static bool ldm_get_vblks(struct parsed_partitions *state, unsigned long base,
+			  struct ldmdb *ldb)
+{
+	int size, perbuf, skip, finish, s, v, recs;
+	u8 *data = NULL;
+	Sector sect;
+	bool result = false;
+	LIST_HEAD (frags);
+
+	BUG_ON(!state || !ldb);
+
+	size   = ldb->vm.vblk_size;
+	perbuf = 512 / size;
+	skip   = ldb->vm.vblk_offset >> 9;		/* Bytes to sectors */
+	finish = (size * ldb->vm.last_vblk_seq) >> 9;
+
+	for (s = skip; s < finish; s++) {		/* For each sector */
+		data = read_part_sector(state, base + OFF_VMDB + s, &sect);
+		if (!data) {
+			ldm_crit ("Disk read failed.");
+			goto out;
+		}
+
+		for (v = 0; v < perbuf; v++, data+=size) {  /* For each vblk */
+			if (MAGIC_VBLK != get_unaligned_be32(data)) {
+				ldm_error ("Expected to find a VBLK.");
+				goto out;
+			}
+
+			recs = get_unaligned_be16(data + 0x0E);	/* Number of records */
+			if (recs == 1) {
+				if (!ldm_ldmdb_add (data, size, ldb))
+					goto out;	/* Already logged */
+			} else if (recs > 1) {
+				if (!ldm_frag_add (data, size, &frags))
+					goto out;	/* Already logged */
+			}
+			/* else Record is not in use, ignore it. */
+		}
+		put_dev_sector (sect);
+		data = NULL;
+	}
+
+	result = ldm_frag_commit (&frags, ldb);	/* Failures, already logged */
+out:
+	if (data)
+		put_dev_sector (sect);
+	ldm_frag_free (&frags);
+
+	return result;
+}
+
+/**
+ * ldm_free_vblks - Free a linked list of vblk's
+ * @lh:  Head of a linked list of struct vblk
+ *
+ * Free a list of vblk's and free the memory used to maintain the list.
+ *
+ * Return:  none
+ */
+static void ldm_free_vblks (struct list_head *lh)
+{
+	struct list_head *item, *tmp;
+
+	BUG_ON (!lh);
+
+	list_for_each_safe (item, tmp, lh)
+		kfree (list_entry (item, struct vblk, list));
+}
+
+
+/**
+ * ldm_partition - Find out whether a device is a dynamic disk and handle it
+ * @state: Partition check state including device holding the LDM Database
+ *
+ * This determines whether the device @bdev is a dynamic disk and if so creates
+ * the partitions necessary in the gendisk structure pointed to by @hd.
+ *
+ * We create a dummy device 1, which contains the LDM database, and then create
+ * each partition described by the LDM database in sequence as devices 2+. For
+ * example, if the device is hda, we would have: hda1: LDM database, hda2, hda3,
+ * and so on: the actual data containing partitions.
+ *
+ * Return:  1 Success, @state->bdev is a dynamic disk and we handled it
+ *          0 Success, @state->bdev is not a dynamic disk
+ *         -1 An error occurred before enough information had been read
+ *            Or @state->bdev is a dynamic disk, but it may be corrupted
+ */
+int ldm_partition(struct parsed_partitions *state)
+{
+	struct ldmdb  *ldb;
+	unsigned long base;
+	int result = -1;
+
+	BUG_ON(!state);
+
+	/* Look for signs of a Dynamic Disk */
+	if (!ldm_validate_partition_table(state))
+		return 0;
+
+	ldb = kmalloc (sizeof (*ldb), GFP_KERNEL);
+	if (!ldb) {
+		ldm_crit ("Out of memory.");
+		goto out;
+	}
+
+	/* Parse and check privheads. */
+	if (!ldm_validate_privheads(state, &ldb->ph))
+		goto out;		/* Already logged */
+
+	/* All further references are relative to base (database start). */
+	base = ldb->ph.config_start;
+
+	/* Parse and check tocs and vmdb. */
+	if (!ldm_validate_tocblocks(state, base, ldb) ||
+	    !ldm_validate_vmdb(state, base, ldb))
+	    	goto out;		/* Already logged */
+
+	/* Initialize vblk lists in ldmdb struct */
+	INIT_LIST_HEAD (&ldb->v_dgrp);
+	INIT_LIST_HEAD (&ldb->v_disk);
+	INIT_LIST_HEAD (&ldb->v_volu);
+	INIT_LIST_HEAD (&ldb->v_comp);
+	INIT_LIST_HEAD (&ldb->v_part);
+
+	if (!ldm_get_vblks(state, base, ldb)) {
+		ldm_crit ("Failed to read the VBLKs from the database.");
+		goto cleanup;
+	}
+
+	/* Finally, create the data partition devices. */
+	if (ldm_create_data_partitions(state, ldb)) {
+		ldm_debug ("Parsed LDM database successfully.");
+		result = 1;
+	}
+	/* else Already logged */
+
+cleanup:
+	ldm_free_vblks (&ldb->v_dgrp);
+	ldm_free_vblks (&ldb->v_disk);
+	ldm_free_vblks (&ldb->v_volu);
+	ldm_free_vblks (&ldb->v_comp);
+	ldm_free_vblks (&ldb->v_part);
+out:
+	kfree (ldb);
+	return result;
+}
diff --git a/block/partitions/ldm.h b/block/partitions/ldm.h
new file mode 100644
index 00000000000..374242c0971
--- /dev/null
+++ b/block/partitions/ldm.h
@@ -0,0 +1,215 @@
+/**
+ * ldm - Part of the Linux-NTFS project.
+ *
+ * Copyright (C) 2001,2002 Richard Russon <ldm@flatcap.org>
+ * Copyright (c) 2001-2007 Anton Altaparmakov
+ * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
+ *
+ * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program (in the main directory of the Linux-NTFS source
+ * in the file COPYING); if not, write to the Free Software Foundation,
+ * Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef _FS_PT_LDM_H_
+#define _FS_PT_LDM_H_
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/genhd.h>
+#include <linux/fs.h>
+#include <asm/unaligned.h>
+#include <asm/byteorder.h>
+
+struct parsed_partitions;
+
+/* Magic numbers in CPU format. */
+#define MAGIC_VMDB	0x564D4442		/* VMDB */
+#define MAGIC_VBLK	0x56424C4B		/* VBLK */
+#define MAGIC_PRIVHEAD	0x5052495648454144ULL	/* PRIVHEAD */
+#define MAGIC_TOCBLOCK	0x544F43424C4F434BULL	/* TOCBLOCK */
+
+/* The defined vblk types. */
+#define VBLK_VOL5		0x51		/* Volume,     version 5 */
+#define VBLK_CMP3		0x32		/* Component,  version 3 */
+#define VBLK_PRT3		0x33		/* Partition,  version 3 */
+#define VBLK_DSK3		0x34		/* Disk,       version 3 */
+#define VBLK_DSK4		0x44		/* Disk,       version 4 */
+#define VBLK_DGR3		0x35		/* Disk Group, version 3 */
+#define VBLK_DGR4		0x45		/* Disk Group, version 4 */
+
+/* vblk flags indicating extra information will be present */
+#define	VBLK_FLAG_COMP_STRIPE	0x10
+#define	VBLK_FLAG_PART_INDEX	0x08
+#define	VBLK_FLAG_DGR3_IDS	0x08
+#define	VBLK_FLAG_DGR4_IDS	0x08
+#define	VBLK_FLAG_VOLU_ID1	0x08
+#define	VBLK_FLAG_VOLU_ID2	0x20
+#define	VBLK_FLAG_VOLU_SIZE	0x80
+#define	VBLK_FLAG_VOLU_DRIVE	0x02
+
+/* size of a vblk's static parts */
+#define VBLK_SIZE_HEAD		16
+#define VBLK_SIZE_CMP3		22		/* Name and version */
+#define VBLK_SIZE_DGR3		12
+#define VBLK_SIZE_DGR4		44
+#define VBLK_SIZE_DSK3		12
+#define VBLK_SIZE_DSK4		45
+#define VBLK_SIZE_PRT3		28
+#define VBLK_SIZE_VOL5		58
+
+/* component types */
+#define COMP_STRIPE		0x01		/* Stripe-set */
+#define COMP_BASIC		0x02		/* Basic disk */
+#define COMP_RAID		0x03		/* Raid-set */
+
+/* Other constants. */
+#define LDM_DB_SIZE		2048		/* Size in sectors (= 1MiB). */
+
+#define OFF_PRIV1		6		/* Offset of the first privhead
+						   relative to the start of the
+						   device in sectors */
+
+/* Offsets to structures within the LDM Database in sectors. */
+#define OFF_PRIV2		1856		/* Backup private headers. */
+#define OFF_PRIV3		2047
+
+#define OFF_TOCB1		1		/* Tables of contents. */
+#define OFF_TOCB2		2
+#define OFF_TOCB3		2045
+#define OFF_TOCB4		2046
+
+#define OFF_VMDB		17		/* List of partitions. */
+
+#define LDM_PARTITION		0x42		/* Formerly SFS (Landis). */
+
+#define TOC_BITMAP1		"config"	/* Names of the two defined */
+#define TOC_BITMAP2		"log"		/* bitmaps in the TOCBLOCK. */
+
+/* Borrowed from msdos.c */
+#define SYS_IND(p)		(get_unaligned(&(p)->sys_ind))
+
+struct frag {				/* VBLK Fragment handling */
+	struct list_head list;
+	u32		group;
+	u8		num;		/* Total number of records */
+	u8		rec;		/* This is record number n */
+	u8		map;		/* Which portions are in use */
+	u8		data[0];
+};
+
+/* In memory LDM database structures. */
+
+#define GUID_SIZE		16
+
+struct privhead {			/* Offsets and sizes are in sectors. */
+	u16	ver_major;
+	u16	ver_minor;
+	u64	logical_disk_start;
+	u64	logical_disk_size;
+	u64	config_start;
+	u64	config_size;
+	u8	disk_id[GUID_SIZE];
+};
+
+struct tocblock {			/* We have exactly two bitmaps. */
+	u8	bitmap1_name[16];
+	u64	bitmap1_start;
+	u64	bitmap1_size;
+	u8	bitmap2_name[16];
+	u64	bitmap2_start;
+	u64	bitmap2_size;
+};
+
+struct vmdb {				/* VMDB: The database header */
+	u16	ver_major;
+	u16	ver_minor;
+	u32	vblk_size;
+	u32	vblk_offset;
+	u32	last_vblk_seq;
+};
+
+struct vblk_comp {			/* VBLK Component */
+	u8	state[16];
+	u64	parent_id;
+	u8	type;
+	u8	children;
+	u16	chunksize;
+};
+
+struct vblk_dgrp {			/* VBLK Disk Group */
+	u8	disk_id[64];
+};
+
+struct vblk_disk {			/* VBLK Disk */
+	u8	disk_id[GUID_SIZE];
+	u8	alt_name[128];
+};
+
+struct vblk_part {			/* VBLK Partition */
+	u64	start;
+	u64	size;			/* start, size and vol_off in sectors */
+	u64	volume_offset;
+	u64	parent_id;
+	u64	disk_id;
+	u8	partnum;
+};
+
+struct vblk_volu {			/* VBLK Volume */
+	u8	volume_type[16];
+	u8	volume_state[16];
+	u8	guid[16];
+	u8	drive_hint[4];
+	u64	size;
+	u8	partition_type;
+};
+
+struct vblk_head {			/* VBLK standard header */
+	u32 group;
+	u16 rec;
+	u16 nrec;
+};
+
+struct vblk {				/* Generalised VBLK */
+	u8	name[64];
+	u64	obj_id;
+	u32	sequence;
+	u8	flags;
+	u8	type;
+	union {
+		struct vblk_comp comp;
+		struct vblk_dgrp dgrp;
+		struct vblk_disk disk;
+		struct vblk_part part;
+		struct vblk_volu volu;
+	} vblk;
+	struct list_head list;
+};
+
+struct ldmdb {				/* Cache of the database */
+	struct privhead ph;
+	struct tocblock toc;
+	struct vmdb     vm;
+	struct list_head v_dgrp;
+	struct list_head v_disk;
+	struct list_head v_volu;
+	struct list_head v_comp;
+	struct list_head v_part;
+};
+
+int ldm_partition(struct parsed_partitions *state);
+
+#endif /* _FS_PT_LDM_H_ */
+
diff --git a/block/partitions/mac.c b/block/partitions/mac.c
new file mode 100644
index 00000000000..11f688bd76c
--- /dev/null
+++ b/block/partitions/mac.c
@@ -0,0 +1,134 @@
+/*
+ *  fs/partitions/mac.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ */
+
+#include <linux/ctype.h>
+#include "check.h"
+#include "mac.h"
+
+#ifdef CONFIG_PPC_PMAC
+#include <asm/machdep.h>
+extern void note_bootable_part(dev_t dev, int part, int goodness);
+#endif
+
+/*
+ * Code to understand MacOS partition tables.
+ */
+
+static inline void mac_fix_string(char *stg, int len)
+{
+	int i;
+
+	for (i = len - 1; i >= 0 && stg[i] == ' '; i--)
+		stg[i] = 0;
+}
+
+int mac_partition(struct parsed_partitions *state)
+{
+	Sector sect;
+	unsigned char *data;
+	int slot, blocks_in_map;
+	unsigned secsize;
+#ifdef CONFIG_PPC_PMAC
+	int found_root = 0;
+	int found_root_goodness = 0;
+#endif
+	struct mac_partition *part;
+	struct mac_driver_desc *md;
+
+	/* Get 0th block and look at the first partition map entry. */
+	md = read_part_sector(state, 0, &sect);
+	if (!md)
+		return -1;
+	if (be16_to_cpu(md->signature) != MAC_DRIVER_MAGIC) {
+		put_dev_sector(sect);
+		return 0;
+	}
+	secsize = be16_to_cpu(md->block_size);
+	put_dev_sector(sect);
+	data = read_part_sector(state, secsize/512, &sect);
+	if (!data)
+		return -1;
+	part = (struct mac_partition *) (data + secsize%512);
+	if (be16_to_cpu(part->signature) != MAC_PARTITION_MAGIC) {
+		put_dev_sector(sect);
+		return 0;		/* not a MacOS disk */
+	}
+	blocks_in_map = be32_to_cpu(part->map_count);
+	if (blocks_in_map < 0 || blocks_in_map >= DISK_MAX_PARTS) {
+		put_dev_sector(sect);
+		return 0;
+	}
+	strlcat(state->pp_buf, " [mac]", PAGE_SIZE);
+	for (slot = 1; slot <= blocks_in_map; ++slot) {
+		int pos = slot * secsize;
+		put_dev_sector(sect);
+		data = read_part_sector(state, pos/512, &sect);
+		if (!data)
+			return -1;
+		part = (struct mac_partition *) (data + pos%512);
+		if (be16_to_cpu(part->signature) != MAC_PARTITION_MAGIC)
+			break;
+		put_partition(state, slot,
+			be32_to_cpu(part->start_block) * (secsize/512),
+			be32_to_cpu(part->block_count) * (secsize/512));
+
+		if (!strnicmp(part->type, "Linux_RAID", 10))
+			state->parts[slot].flags = ADDPART_FLAG_RAID;
+#ifdef CONFIG_PPC_PMAC
+		/*
+		 * If this is the first bootable partition, tell the
+		 * setup code, in case it wants to make this the root.
+		 */
+		if (machine_is(powermac)) {
+			int goodness = 0;
+
+			mac_fix_string(part->processor, 16);
+			mac_fix_string(part->name, 32);
+			mac_fix_string(part->type, 32);					
+		    
+			if ((be32_to_cpu(part->status) & MAC_STATUS_BOOTABLE)
+			    && strcasecmp(part->processor, "powerpc") == 0)
+				goodness++;
+
+			if (strcasecmp(part->type, "Apple_UNIX_SVR2") == 0
+			    || (strnicmp(part->type, "Linux", 5) == 0
+			        && strcasecmp(part->type, "Linux_swap") != 0)) {
+				int i, l;
+
+				goodness++;
+				l = strlen(part->name);
+				if (strcmp(part->name, "/") == 0)
+					goodness++;
+				for (i = 0; i <= l - 4; ++i) {
+					if (strnicmp(part->name + i, "root",
+						     4) == 0) {
+						goodness += 2;
+						break;
+					}
+				}
+				if (strnicmp(part->name, "swap", 4) == 0)
+					goodness--;
+			}
+
+			if (goodness > found_root_goodness) {
+				found_root = slot;
+				found_root_goodness = goodness;
+			}
+		}
+#endif /* CONFIG_PPC_PMAC */
+	}
+#ifdef CONFIG_PPC_PMAC
+	if (found_root_goodness)
+		note_bootable_part(state->bdev->bd_dev, found_root,
+				   found_root_goodness);
+#endif
+
+	put_dev_sector(sect);
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	return 1;
+}
diff --git a/block/partitions/mac.h b/block/partitions/mac.h
new file mode 100644
index 00000000000..3c7d9843638
--- /dev/null
+++ b/block/partitions/mac.h
@@ -0,0 +1,44 @@
+/*
+ *  fs/partitions/mac.h
+ */
+
+#define MAC_PARTITION_MAGIC	0x504d
+
+/* type field value for A/UX or other Unix partitions */
+#define APPLE_AUX_TYPE	"Apple_UNIX_SVR2"
+
+struct mac_partition {
+	__be16	signature;	/* expected to be MAC_PARTITION_MAGIC */
+	__be16	res1;
+	__be32	map_count;	/* # blocks in partition map */
+	__be32	start_block;	/* absolute starting block # of partition */
+	__be32	block_count;	/* number of blocks in partition */
+	char	name[32];	/* partition name */
+	char	type[32];	/* string type description */
+	__be32	data_start;	/* rel block # of first data block */
+	__be32	data_count;	/* number of data blocks */
+	__be32	status;		/* partition status bits */
+	__be32	boot_start;
+	__be32	boot_size;
+	__be32	boot_load;
+	__be32	boot_load2;
+	__be32	boot_entry;
+	__be32	boot_entry2;
+	__be32	boot_cksum;
+	char	processor[16];	/* identifies ISA of boot */
+	/* there is more stuff after this that we don't need */
+};
+
+#define MAC_STATUS_BOOTABLE	8	/* partition is bootable */
+
+#define MAC_DRIVER_MAGIC	0x4552
+
+/* Driver descriptor structure, in block 0 */
+struct mac_driver_desc {
+	__be16	signature;	/* expected to be MAC_DRIVER_MAGIC */
+	__be16	block_size;
+	__be32	block_count;
+    /* ... more stuff */
+};
+
+int mac_partition(struct parsed_partitions *state);
diff --git a/block/partitions/msdos.c b/block/partitions/msdos.c
new file mode 100644
index 00000000000..5f79a6677c6
--- /dev/null
+++ b/block/partitions/msdos.c
@@ -0,0 +1,552 @@
+/*
+ *  fs/partitions/msdos.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *
+ *  Thanks to Branko Lankester, lankeste@fwi.uva.nl, who found a bug
+ *  in the early extended-partition checks and added DM partitions
+ *
+ *  Support for DiskManager v6.0x added by Mark Lord,
+ *  with information provided by OnTrack.  This now works for linux fdisk
+ *  and LILO, as well as loadlin and bootln.  Note that disks other than
+ *  /dev/hda *must* have a "DOS" type 0x51 partition in the first slot (hda1).
+ *
+ *  More flexible handling of extended partitions - aeb, 950831
+ *
+ *  Check partition table on IDE disks for common CHS translations
+ *
+ *  Re-organised Feb 1998 Russell King
+ */
+#include <linux/msdos_fs.h>
+
+#include "check.h"
+#include "msdos.h"
+#include "efi.h"
+
+/*
+ * Many architectures don't like unaligned accesses, while
+ * the nr_sects and start_sect partition table entries are
+ * at a 2 (mod 4) address.
+ */
+#include <asm/unaligned.h>
+
+#define SYS_IND(p)	get_unaligned(&p->sys_ind)
+
+static inline sector_t nr_sects(struct partition *p)
+{
+	return (sector_t)get_unaligned_le32(&p->nr_sects);
+}
+
+static inline sector_t start_sect(struct partition *p)
+{
+	return (sector_t)get_unaligned_le32(&p->start_sect);
+}
+
+static inline int is_extended_partition(struct partition *p)
+{
+	return (SYS_IND(p) == DOS_EXTENDED_PARTITION ||
+		SYS_IND(p) == WIN98_EXTENDED_PARTITION ||
+		SYS_IND(p) == LINUX_EXTENDED_PARTITION);
+}
+
+#define MSDOS_LABEL_MAGIC1	0x55
+#define MSDOS_LABEL_MAGIC2	0xAA
+
+static inline int
+msdos_magic_present(unsigned char *p)
+{
+	return (p[0] == MSDOS_LABEL_MAGIC1 && p[1] == MSDOS_LABEL_MAGIC2);
+}
+
+/* Value is EBCDIC 'IBMA' */
+#define AIX_LABEL_MAGIC1	0xC9
+#define AIX_LABEL_MAGIC2	0xC2
+#define AIX_LABEL_MAGIC3	0xD4
+#define AIX_LABEL_MAGIC4	0xC1
+static int aix_magic_present(struct parsed_partitions *state, unsigned char *p)
+{
+	struct partition *pt = (struct partition *) (p + 0x1be);
+	Sector sect;
+	unsigned char *d;
+	int slot, ret = 0;
+
+	if (!(p[0] == AIX_LABEL_MAGIC1 &&
+		p[1] == AIX_LABEL_MAGIC2 &&
+		p[2] == AIX_LABEL_MAGIC3 &&
+		p[3] == AIX_LABEL_MAGIC4))
+		return 0;
+	/* Assume the partition table is valid if Linux partitions exists */
+	for (slot = 1; slot <= 4; slot++, pt++) {
+		if (pt->sys_ind == LINUX_SWAP_PARTITION ||
+			pt->sys_ind == LINUX_RAID_PARTITION ||
+			pt->sys_ind == LINUX_DATA_PARTITION ||
+			pt->sys_ind == LINUX_LVM_PARTITION ||
+			is_extended_partition(pt))
+			return 0;
+	}
+	d = read_part_sector(state, 7, &sect);
+	if (d) {
+		if (d[0] == '_' && d[1] == 'L' && d[2] == 'V' && d[3] == 'M')
+			ret = 1;
+		put_dev_sector(sect);
+	};
+	return ret;
+}
+
+/*
+ * Create devices for each logical partition in an extended partition.
+ * The logical partitions form a linked list, with each entry being
+ * a partition table with two entries.  The first entry
+ * is the real data partition (with a start relative to the partition
+ * table start).  The second is a pointer to the next logical partition
+ * (with a start relative to the entire extended partition).
+ * We do not create a Linux partition for the partition tables, but
+ * only for the actual data partitions.
+ */
+
+static void parse_extended(struct parsed_partitions *state,
+			   sector_t first_sector, sector_t first_size)
+{
+	struct partition *p;
+	Sector sect;
+	unsigned char *data;
+	sector_t this_sector, this_size;
+	sector_t sector_size = bdev_logical_block_size(state->bdev) / 512;
+	int loopct = 0;		/* number of links followed
+				   without finding a data partition */
+	int i;
+
+	this_sector = first_sector;
+	this_size = first_size;
+
+	while (1) {
+		if (++loopct > 100)
+			return;
+		if (state->next == state->limit)
+			return;
+		data = read_part_sector(state, this_sector, &sect);
+		if (!data)
+			return;
+
+		if (!msdos_magic_present(data + 510))
+			goto done; 
+
+		p = (struct partition *) (data + 0x1be);
+
+		/*
+		 * Usually, the first entry is the real data partition,
+		 * the 2nd entry is the next extended partition, or empty,
+		 * and the 3rd and 4th entries are unused.
+		 * However, DRDOS sometimes has the extended partition as
+		 * the first entry (when the data partition is empty),
+		 * and OS/2 seems to use all four entries.
+		 */
+
+		/* 
+		 * First process the data partition(s)
+		 */
+		for (i=0; i<4; i++, p++) {
+			sector_t offs, size, next;
+			if (!nr_sects(p) || is_extended_partition(p))
+				continue;
+
+			/* Check the 3rd and 4th entries -
+			   these sometimes contain random garbage */
+			offs = start_sect(p)*sector_size;
+			size = nr_sects(p)*sector_size;
+			next = this_sector + offs;
+			if (i >= 2) {
+				if (offs + size > this_size)
+					continue;
+				if (next < first_sector)
+					continue;
+				if (next + size > first_sector + first_size)
+					continue;
+			}
+
+			put_partition(state, state->next, next, size);
+			if (SYS_IND(p) == LINUX_RAID_PARTITION)
+				state->parts[state->next].flags = ADDPART_FLAG_RAID;
+			loopct = 0;
+			if (++state->next == state->limit)
+				goto done;
+		}
+		/*
+		 * Next, process the (first) extended partition, if present.
+		 * (So far, there seems to be no reason to make
+		 *  parse_extended()  recursive and allow a tree
+		 *  of extended partitions.)
+		 * It should be a link to the next logical partition.
+		 */
+		p -= 4;
+		for (i=0; i<4; i++, p++)
+			if (nr_sects(p) && is_extended_partition(p))
+				break;
+		if (i == 4)
+			goto done;	 /* nothing left to do */
+
+		this_sector = first_sector + start_sect(p) * sector_size;
+		this_size = nr_sects(p) * sector_size;
+		put_dev_sector(sect);
+	}
+done:
+	put_dev_sector(sect);
+}
+
+/* james@bpgc.com: Solaris has a nasty indicator: 0x82 which also
+   indicates linux swap.  Be careful before believing this is Solaris. */
+
+static void parse_solaris_x86(struct parsed_partitions *state,
+			      sector_t offset, sector_t size, int origin)
+{
+#ifdef CONFIG_SOLARIS_X86_PARTITION
+	Sector sect;
+	struct solaris_x86_vtoc *v;
+	int i;
+	short max_nparts;
+
+	v = read_part_sector(state, offset + 1, &sect);
+	if (!v)
+		return;
+	if (le32_to_cpu(v->v_sanity) != SOLARIS_X86_VTOC_SANE) {
+		put_dev_sector(sect);
+		return;
+	}
+	{
+		char tmp[1 + BDEVNAME_SIZE + 10 + 11 + 1];
+
+		snprintf(tmp, sizeof(tmp), " %s%d: <solaris:", state->name, origin);
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+	}
+	if (le32_to_cpu(v->v_version) != 1) {
+		char tmp[64];
+
+		snprintf(tmp, sizeof(tmp), "  cannot handle version %d vtoc>\n",
+			 le32_to_cpu(v->v_version));
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+		put_dev_sector(sect);
+		return;
+	}
+	/* Ensure we can handle previous case of VTOC with 8 entries gracefully */
+	max_nparts = le16_to_cpu (v->v_nparts) > 8 ? SOLARIS_X86_NUMSLICE : 8;
+	for (i=0; i<max_nparts && state->next<state->limit; i++) {
+		struct solaris_x86_slice *s = &v->v_slice[i];
+		char tmp[3 + 10 + 1 + 1];
+
+		if (s->s_size == 0)
+			continue;
+		snprintf(tmp, sizeof(tmp), " [s%d]", i);
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+		/* solaris partitions are relative to current MS-DOS
+		 * one; must add the offset of the current partition */
+		put_partition(state, state->next++,
+				 le32_to_cpu(s->s_start)+offset,
+				 le32_to_cpu(s->s_size));
+	}
+	put_dev_sector(sect);
+	strlcat(state->pp_buf, " >\n", PAGE_SIZE);
+#endif
+}
+
+#if defined(CONFIG_BSD_DISKLABEL)
+/* 
+ * Create devices for BSD partitions listed in a disklabel, under a
+ * dos-like partition. See parse_extended() for more information.
+ */
+static void parse_bsd(struct parsed_partitions *state,
+		      sector_t offset, sector_t size, int origin, char *flavour,
+		      int max_partitions)
+{
+	Sector sect;
+	struct bsd_disklabel *l;
+	struct bsd_partition *p;
+	char tmp[64];
+
+	l = read_part_sector(state, offset + 1, &sect);
+	if (!l)
+		return;
+	if (le32_to_cpu(l->d_magic) != BSD_DISKMAGIC) {
+		put_dev_sector(sect);
+		return;
+	}
+
+	snprintf(tmp, sizeof(tmp), " %s%d: <%s:", state->name, origin, flavour);
+	strlcat(state->pp_buf, tmp, PAGE_SIZE);
+
+	if (le16_to_cpu(l->d_npartitions) < max_partitions)
+		max_partitions = le16_to_cpu(l->d_npartitions);
+	for (p = l->d_partitions; p - l->d_partitions < max_partitions; p++) {
+		sector_t bsd_start, bsd_size;
+
+		if (state->next == state->limit)
+			break;
+		if (p->p_fstype == BSD_FS_UNUSED) 
+			continue;
+		bsd_start = le32_to_cpu(p->p_offset);
+		bsd_size = le32_to_cpu(p->p_size);
+		if (offset == bsd_start && size == bsd_size)
+			/* full parent partition, we have it already */
+			continue;
+		if (offset > bsd_start || offset+size < bsd_start+bsd_size) {
+			strlcat(state->pp_buf, "bad subpartition - ignored\n", PAGE_SIZE);
+			continue;
+		}
+		put_partition(state, state->next++, bsd_start, bsd_size);
+	}
+	put_dev_sector(sect);
+	if (le16_to_cpu(l->d_npartitions) > max_partitions) {
+		snprintf(tmp, sizeof(tmp), " (ignored %d more)",
+			 le16_to_cpu(l->d_npartitions) - max_partitions);
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+	}
+	strlcat(state->pp_buf, " >\n", PAGE_SIZE);
+}
+#endif
+
+static void parse_freebsd(struct parsed_partitions *state,
+			  sector_t offset, sector_t size, int origin)
+{
+#ifdef CONFIG_BSD_DISKLABEL
+	parse_bsd(state, offset, size, origin, "bsd", BSD_MAXPARTITIONS);
+#endif
+}
+
+static void parse_netbsd(struct parsed_partitions *state,
+			 sector_t offset, sector_t size, int origin)
+{
+#ifdef CONFIG_BSD_DISKLABEL
+	parse_bsd(state, offset, size, origin, "netbsd", BSD_MAXPARTITIONS);
+#endif
+}
+
+static void parse_openbsd(struct parsed_partitions *state,
+			  sector_t offset, sector_t size, int origin)
+{
+#ifdef CONFIG_BSD_DISKLABEL
+	parse_bsd(state, offset, size, origin, "openbsd",
+		  OPENBSD_MAXPARTITIONS);
+#endif
+}
+
+/*
+ * Create devices for Unixware partitions listed in a disklabel, under a
+ * dos-like partition. See parse_extended() for more information.
+ */
+static void parse_unixware(struct parsed_partitions *state,
+			   sector_t offset, sector_t size, int origin)
+{
+#ifdef CONFIG_UNIXWARE_DISKLABEL
+	Sector sect;
+	struct unixware_disklabel *l;
+	struct unixware_slice *p;
+
+	l = read_part_sector(state, offset + 29, &sect);
+	if (!l)
+		return;
+	if (le32_to_cpu(l->d_magic) != UNIXWARE_DISKMAGIC ||
+	    le32_to_cpu(l->vtoc.v_magic) != UNIXWARE_DISKMAGIC2) {
+		put_dev_sector(sect);
+		return;
+	}
+	{
+		char tmp[1 + BDEVNAME_SIZE + 10 + 12 + 1];
+
+		snprintf(tmp, sizeof(tmp), " %s%d: <unixware:", state->name, origin);
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+	}
+	p = &l->vtoc.v_slice[1];
+	/* I omit the 0th slice as it is the same as whole disk. */
+	while (p - &l->vtoc.v_slice[0] < UNIXWARE_NUMSLICE) {
+		if (state->next == state->limit)
+			break;
+
+		if (p->s_label != UNIXWARE_FS_UNUSED)
+			put_partition(state, state->next++,
+				      le32_to_cpu(p->start_sect),
+				      le32_to_cpu(p->nr_sects));
+		p++;
+	}
+	put_dev_sector(sect);
+	strlcat(state->pp_buf, " >\n", PAGE_SIZE);
+#endif
+}
+
+/*
+ * Minix 2.0.0/2.0.2 subpartition support.
+ * Anand Krishnamurthy <anandk@wiproge.med.ge.com>
+ * Rajeev V. Pillai    <rajeevvp@yahoo.com>
+ */
+static void parse_minix(struct parsed_partitions *state,
+			sector_t offset, sector_t size, int origin)
+{
+#ifdef CONFIG_MINIX_SUBPARTITION
+	Sector sect;
+	unsigned char *data;
+	struct partition *p;
+	int i;
+
+	data = read_part_sector(state, offset, &sect);
+	if (!data)
+		return;
+
+	p = (struct partition *)(data + 0x1be);
+
+	/* The first sector of a Minix partition can have either
+	 * a secondary MBR describing its subpartitions, or
+	 * the normal boot sector. */
+	if (msdos_magic_present (data + 510) &&
+	    SYS_IND(p) == MINIX_PARTITION) { /* subpartition table present */
+		char tmp[1 + BDEVNAME_SIZE + 10 + 9 + 1];
+
+		snprintf(tmp, sizeof(tmp), " %s%d: <minix:", state->name, origin);
+		strlcat(state->pp_buf, tmp, PAGE_SIZE);
+		for (i = 0; i < MINIX_NR_SUBPARTITIONS; i++, p++) {
+			if (state->next == state->limit)
+				break;
+			/* add each partition in use */
+			if (SYS_IND(p) == MINIX_PARTITION)
+				put_partition(state, state->next++,
+					      start_sect(p), nr_sects(p));
+		}
+		strlcat(state->pp_buf, " >\n", PAGE_SIZE);
+	}
+	put_dev_sector(sect);
+#endif /* CONFIG_MINIX_SUBPARTITION */
+}
+
+static struct {
+	unsigned char id;
+	void (*parse)(struct parsed_partitions *, sector_t, sector_t, int);
+} subtypes[] = {
+	{FREEBSD_PARTITION, parse_freebsd},
+	{NETBSD_PARTITION, parse_netbsd},
+	{OPENBSD_PARTITION, parse_openbsd},
+	{MINIX_PARTITION, parse_minix},
+	{UNIXWARE_PARTITION, parse_unixware},
+	{SOLARIS_X86_PARTITION, parse_solaris_x86},
+	{NEW_SOLARIS_X86_PARTITION, parse_solaris_x86},
+	{0, NULL},
+};
+ 
+int msdos_partition(struct parsed_partitions *state)
+{
+	sector_t sector_size = bdev_logical_block_size(state->bdev) / 512;
+	Sector sect;
+	unsigned char *data;
+	struct partition *p;
+	struct fat_boot_sector *fb;
+	int slot;
+
+	data = read_part_sector(state, 0, &sect);
+	if (!data)
+		return -1;
+	if (!msdos_magic_present(data + 510)) {
+		put_dev_sector(sect);
+		return 0;
+	}
+
+	if (aix_magic_present(state, data)) {
+		put_dev_sector(sect);
+		strlcat(state->pp_buf, " [AIX]", PAGE_SIZE);
+		return 0;
+	}
+
+	/*
+	 * Now that the 55aa signature is present, this is probably
+	 * either the boot sector of a FAT filesystem or a DOS-type
+	 * partition table. Reject this in case the boot indicator
+	 * is not 0 or 0x80.
+	 */
+	p = (struct partition *) (data + 0x1be);
+	for (slot = 1; slot <= 4; slot++, p++) {
+		if (p->boot_ind != 0 && p->boot_ind != 0x80) {
+			/*
+			 * Even without a valid boot inidicator value
+			 * its still possible this is valid FAT filesystem
+			 * without a partition table.
+			 */
+			fb = (struct fat_boot_sector *) data;
+			if (slot == 1 && fb->reserved && fb->fats
+				&& fat_valid_media(fb->media)) {
+				strlcat(state->pp_buf, "\n", PAGE_SIZE);
+				put_dev_sector(sect);
+				return 1;
+			} else {
+				put_dev_sector(sect);
+				return 0;
+			}
+		}
+	}
+
+#ifdef CONFIG_EFI_PARTITION
+	p = (struct partition *) (data + 0x1be);
+	for (slot = 1 ; slot <= 4 ; slot++, p++) {
+		/* If this is an EFI GPT disk, msdos should ignore it. */
+		if (SYS_IND(p) == EFI_PMBR_OSTYPE_EFI_GPT) {
+			put_dev_sector(sect);
+			return 0;
+		}
+	}
+#endif
+	p = (struct partition *) (data + 0x1be);
+
+	/*
+	 * Look for partitions in two passes:
+	 * First find the primary and DOS-type extended partitions.
+	 * On the second pass look inside *BSD, Unixware and Solaris partitions.
+	 */
+
+	state->next = 5;
+	for (slot = 1 ; slot <= 4 ; slot++, p++) {
+		sector_t start = start_sect(p)*sector_size;
+		sector_t size = nr_sects(p)*sector_size;
+		if (!size)
+			continue;
+		if (is_extended_partition(p)) {
+			/*
+			 * prevent someone doing mkfs or mkswap on an
+			 * extended partition, but leave room for LILO
+			 * FIXME: this uses one logical sector for > 512b
+			 * sector, although it may not be enough/proper.
+			 */
+			sector_t n = 2;
+			n = min(size, max(sector_size, n));
+			put_partition(state, slot, start, n);
+
+			strlcat(state->pp_buf, " <", PAGE_SIZE);
+			parse_extended(state, start, size);
+			strlcat(state->pp_buf, " >", PAGE_SIZE);
+			continue;
+		}
+		put_partition(state, slot, start, size);
+		if (SYS_IND(p) == LINUX_RAID_PARTITION)
+			state->parts[slot].flags = ADDPART_FLAG_RAID;
+		if (SYS_IND(p) == DM6_PARTITION)
+			strlcat(state->pp_buf, "[DM]", PAGE_SIZE);
+		if (SYS_IND(p) == EZD_PARTITION)
+			strlcat(state->pp_buf, "[EZD]", PAGE_SIZE);
+	}
+
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+
+	/* second pass - output for each on a separate line */
+	p = (struct partition *) (0x1be + data);
+	for (slot = 1 ; slot <= 4 ; slot++, p++) {
+		unsigned char id = SYS_IND(p);
+		int n;
+
+		if (!nr_sects(p))
+			continue;
+
+		for (n = 0; subtypes[n].parse && id != subtypes[n].id; n++)
+			;
+
+		if (!subtypes[n].parse)
+			continue;
+		subtypes[n].parse(state, start_sect(p) * sector_size,
+				  nr_sects(p) * sector_size, slot);
+	}
+	put_dev_sector(sect);
+	return 1;
+}
diff --git a/block/partitions/msdos.h b/block/partitions/msdos.h
new file mode 100644
index 00000000000..38c781c490b
--- /dev/null
+++ b/block/partitions/msdos.h
@@ -0,0 +1,8 @@
+/*
+ *  fs/partitions/msdos.h
+ */
+
+#define MSDOS_LABEL_MAGIC		0xAA55
+
+int msdos_partition(struct parsed_partitions *state);
+
diff --git a/block/partitions/osf.c b/block/partitions/osf.c
new file mode 100644
index 00000000000..764b86a0196
--- /dev/null
+++ b/block/partitions/osf.c
@@ -0,0 +1,86 @@
+/*
+ *  fs/partitions/osf.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ */
+
+#include "check.h"
+#include "osf.h"
+
+#define MAX_OSF_PARTITIONS 18
+
+int osf_partition(struct parsed_partitions *state)
+{
+	int i;
+	int slot = 1;
+	unsigned int npartitions;
+	Sector sect;
+	unsigned char *data;
+	struct disklabel {
+		__le32 d_magic;
+		__le16 d_type,d_subtype;
+		u8 d_typename[16];
+		u8 d_packname[16];
+		__le32 d_secsize;
+		__le32 d_nsectors;
+		__le32 d_ntracks;
+		__le32 d_ncylinders;
+		__le32 d_secpercyl;
+		__le32 d_secprtunit;
+		__le16 d_sparespertrack;
+		__le16 d_sparespercyl;
+		__le32 d_acylinders;
+		__le16 d_rpm, d_interleave, d_trackskew, d_cylskew;
+		__le32 d_headswitch, d_trkseek, d_flags;
+		__le32 d_drivedata[5];
+		__le32 d_spare[5];
+		__le32 d_magic2;
+		__le16 d_checksum;
+		__le16 d_npartitions;
+		__le32 d_bbsize, d_sbsize;
+		struct d_partition {
+			__le32 p_size;
+			__le32 p_offset;
+			__le32 p_fsize;
+			u8  p_fstype;
+			u8  p_frag;
+			__le16 p_cpg;
+		} d_partitions[MAX_OSF_PARTITIONS];
+	} * label;
+	struct d_partition * partition;
+
+	data = read_part_sector(state, 0, &sect);
+	if (!data)
+		return -1;
+
+	label = (struct disklabel *) (data+64);
+	partition = label->d_partitions;
+	if (le32_to_cpu(label->d_magic) != DISKLABELMAGIC) {
+		put_dev_sector(sect);
+		return 0;
+	}
+	if (le32_to_cpu(label->d_magic2) != DISKLABELMAGIC) {
+		put_dev_sector(sect);
+		return 0;
+	}
+	npartitions = le16_to_cpu(label->d_npartitions);
+	if (npartitions > MAX_OSF_PARTITIONS) {
+		put_dev_sector(sect);
+		return 0;
+	}
+	for (i = 0 ; i < npartitions; i++, partition++) {
+		if (slot == state->limit)
+		        break;
+		if (le32_to_cpu(partition->p_size))
+			put_partition(state, slot,
+				le32_to_cpu(partition->p_offset),
+				le32_to_cpu(partition->p_size));
+		slot++;
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	put_dev_sector(sect);
+	return 1;
+}
diff --git a/block/partitions/osf.h b/block/partitions/osf.h
new file mode 100644
index 00000000000..20ed2315ec1
--- /dev/null
+++ b/block/partitions/osf.h
@@ -0,0 +1,7 @@
+/*
+ *  fs/partitions/osf.h
+ */
+
+#define DISKLABELMAGIC (0x82564557UL)
+
+int osf_partition(struct parsed_partitions *state);
diff --git a/block/partitions/sgi.c b/block/partitions/sgi.c
new file mode 100644
index 00000000000..ea8a86dceaf
--- /dev/null
+++ b/block/partitions/sgi.c
@@ -0,0 +1,82 @@
+/*
+ *  fs/partitions/sgi.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ */
+
+#include "check.h"
+#include "sgi.h"
+
+struct sgi_disklabel {
+	__be32 magic_mushroom;		/* Big fat spliff... */
+	__be16 root_part_num;		/* Root partition number */
+	__be16 swap_part_num;		/* Swap partition number */
+	s8 boot_file[16];		/* Name of boot file for ARCS */
+	u8 _unused0[48];		/* Device parameter useless crapola.. */
+	struct sgi_volume {
+		s8 name[8];		/* Name of volume */
+		__be32 block_num;		/* Logical block number */
+		__be32 num_bytes;		/* How big, in bytes */
+	} volume[15];
+	struct sgi_partition {
+		__be32 num_blocks;		/* Size in logical blocks */
+		__be32 first_block;	/* First logical block */
+		__be32 type;		/* Type of this partition */
+	} partitions[16];
+	__be32 csum;			/* Disk label checksum */
+	__be32 _unused1;			/* Padding */
+};
+
+int sgi_partition(struct parsed_partitions *state)
+{
+	int i, csum;
+	__be32 magic;
+	int slot = 1;
+	unsigned int start, blocks;
+	__be32 *ui, cs;
+	Sector sect;
+	struct sgi_disklabel *label;
+	struct sgi_partition *p;
+	char b[BDEVNAME_SIZE];
+
+	label = read_part_sector(state, 0, &sect);
+	if (!label)
+		return -1;
+	p = &label->partitions[0];
+	magic = label->magic_mushroom;
+	if(be32_to_cpu(magic) != SGI_LABEL_MAGIC) {
+		/*printk("Dev %s SGI disklabel: bad magic %08x\n",
+		       bdevname(bdev, b), be32_to_cpu(magic));*/
+		put_dev_sector(sect);
+		return 0;
+	}
+	ui = ((__be32 *) (label + 1)) - 1;
+	for(csum = 0; ui >= ((__be32 *) label);) {
+		cs = *ui--;
+		csum += be32_to_cpu(cs);
+	}
+	if(csum) {
+		printk(KERN_WARNING "Dev %s SGI disklabel: csum bad, label corrupted\n",
+		       bdevname(state->bdev, b));
+		put_dev_sector(sect);
+		return 0;
+	}
+	/* All SGI disk labels have 16 partitions, disks under Linux only
+	 * have 15 minor's.  Luckily there are always a few zero length
+	 * partitions which we don't care about so we never overflow the
+	 * current_minor.
+	 */
+	for(i = 0; i < 16; i++, p++) {
+		blocks = be32_to_cpu(p->num_blocks);
+		start  = be32_to_cpu(p->first_block);
+		if (blocks) {
+			put_partition(state, slot, start, blocks);
+			if (be32_to_cpu(p->type) == LINUX_RAID_PARTITION)
+				state->parts[slot].flags = ADDPART_FLAG_RAID;
+		}
+		slot++;
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	put_dev_sector(sect);
+	return 1;
+}
diff --git a/block/partitions/sgi.h b/block/partitions/sgi.h
new file mode 100644
index 00000000000..b9553ebdd5a
--- /dev/null
+++ b/block/partitions/sgi.h
@@ -0,0 +1,8 @@
+/*
+ *  fs/partitions/sgi.h
+ */
+
+extern int sgi_partition(struct parsed_partitions *state);
+
+#define SGI_LABEL_MAGIC 0x0be5a941
+
diff --git a/block/partitions/sun.c b/block/partitions/sun.c
new file mode 100644
index 00000000000..b5b6fcfb3d3
--- /dev/null
+++ b/block/partitions/sun.c
@@ -0,0 +1,122 @@
+/*
+ *  fs/partitions/sun.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ */
+
+#include "check.h"
+#include "sun.h"
+
+int sun_partition(struct parsed_partitions *state)
+{
+	int i;
+	__be16 csum;
+	int slot = 1;
+	__be16 *ush;
+	Sector sect;
+	struct sun_disklabel {
+		unsigned char info[128];   /* Informative text string */
+		struct sun_vtoc {
+		    __be32 version;     /* Layout version */
+		    char   volume[8];   /* Volume name */
+		    __be16 nparts;      /* Number of partitions */
+		    struct sun_info {           /* Partition hdrs, sec 2 */
+			__be16 id;
+			__be16 flags;
+		    } infos[8];
+		    __be16 padding;     /* Alignment padding */
+		    __be32 bootinfo[3];  /* Info needed by mboot */
+		    __be32 sanity;       /* To verify vtoc sanity */
+		    __be32 reserved[10]; /* Free space */
+		    __be32 timestamp[8]; /* Partition timestamp */
+		} vtoc;
+		__be32 write_reinstruct; /* sectors to skip, writes */
+		__be32 read_reinstruct;  /* sectors to skip, reads */
+		unsigned char spare[148]; /* Padding */
+		__be16 rspeed;     /* Disk rotational speed */
+		__be16 pcylcount;  /* Physical cylinder count */
+		__be16 sparecyl;   /* extra sects per cylinder */
+		__be16 obs1;       /* gap1 */
+		__be16 obs2;       /* gap2 */
+		__be16 ilfact;     /* Interleave factor */
+		__be16 ncyl;       /* Data cylinder count */
+		__be16 nacyl;      /* Alt. cylinder count */
+		__be16 ntrks;      /* Tracks per cylinder */
+		__be16 nsect;      /* Sectors per track */
+		__be16 obs3;       /* bhead - Label head offset */
+		__be16 obs4;       /* ppart - Physical Partition */
+		struct sun_partition {
+			__be32 start_cylinder;
+			__be32 num_sectors;
+		} partitions[8];
+		__be16 magic;      /* Magic number */
+		__be16 csum;       /* Label xor'd checksum */
+	} * label;
+	struct sun_partition *p;
+	unsigned long spc;
+	char b[BDEVNAME_SIZE];
+	int use_vtoc;
+	int nparts;
+
+	label = read_part_sector(state, 0, &sect);
+	if (!label)
+		return -1;
+
+	p = label->partitions;
+	if (be16_to_cpu(label->magic) != SUN_LABEL_MAGIC) {
+/*		printk(KERN_INFO "Dev %s Sun disklabel: bad magic %04x\n",
+		       bdevname(bdev, b), be16_to_cpu(label->magic)); */
+		put_dev_sector(sect);
+		return 0;
+	}
+	/* Look at the checksum */
+	ush = ((__be16 *) (label+1)) - 1;
+	for (csum = 0; ush >= ((__be16 *) label);)
+		csum ^= *ush--;
+	if (csum) {
+		printk("Dev %s Sun disklabel: Csum bad, label corrupted\n",
+		       bdevname(state->bdev, b));
+		put_dev_sector(sect);
+		return 0;
+	}
+
+	/* Check to see if we can use the VTOC table */
+	use_vtoc = ((be32_to_cpu(label->vtoc.sanity) == SUN_VTOC_SANITY) &&
+		    (be32_to_cpu(label->vtoc.version) == 1) &&
+		    (be16_to_cpu(label->vtoc.nparts) <= 8));
+
+	/* Use 8 partition entries if not specified in validated VTOC */
+	nparts = (use_vtoc) ? be16_to_cpu(label->vtoc.nparts) : 8;
+
+	/*
+	 * So that old Linux-Sun partitions continue to work,
+	 * alow the VTOC to be used under the additional condition ...
+	 */
+	use_vtoc = use_vtoc || !(label->vtoc.sanity ||
+				 label->vtoc.version || label->vtoc.nparts);
+	spc = be16_to_cpu(label->ntrks) * be16_to_cpu(label->nsect);
+	for (i = 0; i < nparts; i++, p++) {
+		unsigned long st_sector;
+		unsigned int num_sectors;
+
+		st_sector = be32_to_cpu(p->start_cylinder) * spc;
+		num_sectors = be32_to_cpu(p->num_sectors);
+		if (num_sectors) {
+			put_partition(state, slot, st_sector, num_sectors);
+			state->parts[slot].flags = 0;
+			if (use_vtoc) {
+				if (be16_to_cpu(label->vtoc.infos[i].id) == LINUX_RAID_PARTITION)
+					state->parts[slot].flags |= ADDPART_FLAG_RAID;
+				else if (be16_to_cpu(label->vtoc.infos[i].id) == SUN_WHOLE_DISK)
+					state->parts[slot].flags |= ADDPART_FLAG_WHOLEDISK;
+			}
+		}
+		slot++;
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	put_dev_sector(sect);
+	return 1;
+}
diff --git a/block/partitions/sun.h b/block/partitions/sun.h
new file mode 100644
index 00000000000..2424baa8319
--- /dev/null
+++ b/block/partitions/sun.h
@@ -0,0 +1,8 @@
+/*
+ *  fs/partitions/sun.h
+ */
+
+#define SUN_LABEL_MAGIC          0xDABE
+#define SUN_VTOC_SANITY          0x600DDEEE
+
+int sun_partition(struct parsed_partitions *state);
diff --git a/block/partitions/sysv68.c b/block/partitions/sysv68.c
new file mode 100644
index 00000000000..9627ccffc1c
--- /dev/null
+++ b/block/partitions/sysv68.c
@@ -0,0 +1,95 @@
+/*
+ *  fs/partitions/sysv68.c
+ *
+ *  Copyright (C) 2007 Philippe De Muyter <phdm@macqel.be>
+ */
+
+#include "check.h"
+#include "sysv68.h"
+
+/*
+ *	Volume ID structure: on first 256-bytes sector of disk
+ */
+
+struct volumeid {
+	u8	vid_unused[248];
+	u8	vid_mac[8];	/* ASCII string "MOTOROLA" */
+};
+
+/*
+ *	config block: second 256-bytes sector on disk
+ */
+
+struct dkconfig {
+	u8	ios_unused0[128];
+	__be32	ios_slcblk;	/* Slice table block number */
+	__be16	ios_slccnt;	/* Number of entries in slice table */
+	u8	ios_unused1[122];
+};
+
+/*
+ *	combined volumeid and dkconfig block
+ */
+
+struct dkblk0 {
+	struct volumeid dk_vid;
+	struct dkconfig dk_ios;
+};
+
+/*
+ *	Slice Table Structure
+ */
+
+struct slice {
+	__be32	nblocks;		/* slice size (in blocks) */
+	__be32	blkoff;			/* block offset of slice */
+};
+
+
+int sysv68_partition(struct parsed_partitions *state)
+{
+	int i, slices;
+	int slot = 1;
+	Sector sect;
+	unsigned char *data;
+	struct dkblk0 *b;
+	struct slice *slice;
+	char tmp[64];
+
+	data = read_part_sector(state, 0, &sect);
+	if (!data)
+		return -1;
+
+	b = (struct dkblk0 *)data;
+	if (memcmp(b->dk_vid.vid_mac, "MOTOROLA", sizeof(b->dk_vid.vid_mac))) {
+		put_dev_sector(sect);
+		return 0;
+	}
+	slices = be16_to_cpu(b->dk_ios.ios_slccnt);
+	i = be32_to_cpu(b->dk_ios.ios_slcblk);
+	put_dev_sector(sect);
+
+	data = read_part_sector(state, i, &sect);
+	if (!data)
+		return -1;
+
+	slices -= 1; /* last slice is the whole disk */
+	snprintf(tmp, sizeof(tmp), "sysV68: %s(s%u)", state->name, slices);
+	strlcat(state->pp_buf, tmp, PAGE_SIZE);
+	slice = (struct slice *)data;
+	for (i = 0; i < slices; i++, slice++) {
+		if (slot == state->limit)
+			break;
+		if (be32_to_cpu(slice->nblocks)) {
+			put_partition(state, slot,
+				be32_to_cpu(slice->blkoff),
+				be32_to_cpu(slice->nblocks));
+			snprintf(tmp, sizeof(tmp), "(s%u)", i);
+			strlcat(state->pp_buf, tmp, PAGE_SIZE);
+		}
+		slot++;
+	}
+	strlcat(state->pp_buf, "\n", PAGE_SIZE);
+	put_dev_sector(sect);
+	return 1;
+}
diff --git a/block/partitions/sysv68.h b/block/partitions/sysv68.h
new file mode 100644
index 00000000000..bf2f5ffa97a
--- /dev/null
+++ b/block/partitions/sysv68.h
@@ -0,0 +1 @@
+extern int sysv68_partition(struct parsed_partitions *state);
diff --git a/block/partitions/ultrix.c b/block/partitions/ultrix.c
new file mode 100644
index 00000000000..8dbaf9f77a9
--- /dev/null
+++ b/block/partitions/ultrix.c
@@ -0,0 +1,48 @@
+/*
+ *  fs/partitions/ultrix.c
+ *
+ *  Code extracted from drivers/block/genhd.c
+ *
+ *  Re-organised Jul 1999 Russell King
+ */
+
+#include "check.h"
+#include "ultrix.h"
+
+int ultrix_partition(struct parsed_partitions *state)
+{
+	int i;
+	Sector sect;
+	unsigned char *data;
+	struct ultrix_disklabel {
+		s32	pt_magic;	/* magic no. indicating part. info exits */
+		s32	pt_valid;	/* set by driver if pt is current */
+		struct  pt_info {
+			s32		pi_nblocks; /* no. of sectors */
+			u32		pi_blkoff;  /* block offset for start */
+		} pt_part[8];
+	} *label;
+
+#define PT_MAGIC	0x032957	/* Partition magic number */
+#define PT_VALID	1		/* Indicates if struct is valid */
+
+	data = read_part_sector(state, (16384 - sizeof(*label))/512, &sect);
+	if (!data)
+		return -1;
+	
+	label = (struct ultrix_disklabel *)(data + 512 - sizeof(*label));
+
+	if (label->pt_magic == PT_MAGIC && label->pt_valid == PT_VALID) {
+		for (i=0; i<8; i++)
+			if (label->pt_part[i].pi_nblocks)
+				put_partition(state, i+1, 
+					      label->pt_part[i].pi_blkoff,
+					      label->pt_part[i].pi_nblocks);
+		put_dev_sector(sect);
+		strlcat(state->pp_buf, "\n", PAGE_SIZE);
+		return 1;
+	} else {
+		put_dev_sector(sect);
+		return 0;
+	}
+}
diff --git a/block/partitions/ultrix.h b/block/partitions/ultrix.h
new file mode 100644
index 00000000000..a3cc00b2bde
--- /dev/null
+++ b/block/partitions/ultrix.h
@@ -0,0 +1,5 @@
+/*
+ *  fs/partitions/ultrix.h
+ */
+
+int ultrix_partition(struct parsed_partitions *state);
diff --git a/fs/Kconfig b/fs/Kconfig
index 5f4c45d4aa1..30145d886bc 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -266,14 +266,6 @@ source "fs/9p/Kconfig"
 
 endif # NETWORK_FILESYSTEMS
 
-if BLOCK
-menu "Partition Types"
-
-source "fs/partitions/Kconfig"
-
-endmenu
-endif
-
 source "fs/nls/Kconfig"
 source "fs/dlm/Kconfig"
 
diff --git a/fs/Makefile b/fs/Makefile
index d2c3353d547..c36ebec298b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,7 +52,6 @@ obj-$(CONFIG_FHANDLE)		+= fhandle.o
 obj-y				+= quota/
 
 obj-$(CONFIG_PROC_FS)		+= proc/
-obj-y				+= partitions/
 obj-$(CONFIG_SYSFS)		+= sysfs/
 obj-$(CONFIG_CONFIGFS_FS)	+= configfs/
 obj-y				+= devpts/
diff --git a/fs/partitions/Kconfig b/fs/partitions/Kconfig
deleted file mode 100644
index cb5f0a3f1b0..00000000000
--- a/fs/partitions/Kconfig
+++ /dev/null
@@ -1,251 +0,0 @@
-#
-# Partition configuration
-#
-config PARTITION_ADVANCED
-	bool "Advanced partition selection"
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned under an operating system running on a different
-	  architecture than your Linux system.
-
-	  Note that the answer to this question won't directly affect the
-	  kernel: saying N will just cause the configurator to skip all
-	  the questions about foreign partitioning schemes.
-
-	  If unsure, say N.
-
-config ACORN_PARTITION
-	bool "Acorn partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	help
-	  Support hard disks partitioned under Acorn operating systems.
-
-config ACORN_PARTITION_CUMANA
-	bool "Cumana partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	depends on ACORN_PARTITION
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned using the Cumana interface on Acorn machines.
-
-config ACORN_PARTITION_EESOX
-	bool "EESOX partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	depends on ACORN_PARTITION
-
-config ACORN_PARTITION_ICS
-	bool "ICS partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	depends on ACORN_PARTITION
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned using the ICS interface on Acorn machines.
-
-config ACORN_PARTITION_ADFS
-	bool "Native filecore partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	depends on ACORN_PARTITION
-	help
-	  The Acorn Disc Filing System is the standard file system of the
-	  RiscOS operating system which runs on Acorn's ARM-based Risc PC
-	  systems and the Acorn Archimedes range of machines.  If you say
-	  `Y' here, Linux will support disk partitions created under ADFS.
-
-config ACORN_PARTITION_POWERTEC
-	bool "PowerTec partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	depends on ACORN_PARTITION
-	help
-	  Support reading partition tables created on Acorn machines using
-	  the PowerTec SCSI drive.
-
-config ACORN_PARTITION_RISCIX
-	bool "RISCiX partition support" if PARTITION_ADVANCED
-	default y if ARCH_ACORN
-	depends on ACORN_PARTITION
-	help
-	  Once upon a time, there was a native Unix port for the Acorn series
-	  of machines called RISCiX.  If you say 'Y' here, Linux will be able
-	  to read disks partitioned under RISCiX.
-
-config OSF_PARTITION
-	bool "Alpha OSF partition support" if PARTITION_ADVANCED
-	default y if ALPHA
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned on an Alpha machine.
-
-config AMIGA_PARTITION
-	bool "Amiga partition table support" if PARTITION_ADVANCED
-	default y if (AMIGA || AFFS_FS=y)
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned under AmigaOS.
-
-config ATARI_PARTITION
-	bool "Atari partition table support" if PARTITION_ADVANCED
-	default y if ATARI
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned under the Atari OS.
-
-config IBM_PARTITION
-	bool "IBM disk label and partition support"
-	depends on PARTITION_ADVANCED && S390
-	help
-	  Say Y here if you would like to be able to read the hard disk
-	  partition table format used by IBM DASD disks operating under CMS.
-	  Otherwise, say N.
-
-config MAC_PARTITION
-	bool "Macintosh partition map support" if PARTITION_ADVANCED
-	default y if (MAC || PPC_PMAC)
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned on a Macintosh.
-
-config MSDOS_PARTITION
-	bool "PC BIOS (MSDOS partition tables) support" if PARTITION_ADVANCED
-	default y
-	help
-	  Say Y here.
-
-config BSD_DISKLABEL
-	bool "BSD disklabel (FreeBSD partition tables) support"
-	depends on PARTITION_ADVANCED && MSDOS_PARTITION
-	help
-	  FreeBSD uses its own hard disk partition scheme on your PC. It
-	  requires only one entry in the primary partition table of your disk
-	  and manages it similarly to DOS extended partitions, putting in its
-	  first sector a new partition table in BSD disklabel format. Saying Y
-	  here allows you to read these disklabels and further mount FreeBSD
-	  partitions from within Linux if you have also said Y to "UFS
-	  file system support", above. If you don't know what all this is
-	  about, say N.
-
-config MINIX_SUBPARTITION
-	bool "Minix subpartition support"
-	depends on PARTITION_ADVANCED && MSDOS_PARTITION
-	help
-	  Minix 2.0.0/2.0.2 subpartition table support for Linux.
-	  Say Y here if you want to mount and use Minix 2.0.0/2.0.2
-	  subpartitions.
-
-config SOLARIS_X86_PARTITION
-	bool "Solaris (x86) partition table support"
-	depends on PARTITION_ADVANCED && MSDOS_PARTITION
-	help
-	  Like most systems, Solaris x86 uses its own hard disk partition
-	  table format, incompatible with all others. Saying Y here allows you
-	  to read these partition tables and further mount Solaris x86
-	  partitions from within Linux if you have also said Y to "UFS
-	  file system support", above.
-
-config UNIXWARE_DISKLABEL
-	bool "Unixware slices support"
-	depends on PARTITION_ADVANCED && MSDOS_PARTITION
-	---help---
-	  Like some systems, UnixWare uses its own slice table inside a
-	  partition (VTOC - Virtual Table of Contents). Its format is
-	  incompatible with all other OSes. Saying Y here allows you to read
-	  VTOC and further mount UnixWare partitions read-only from within
-	  Linux if you have also said Y to "UFS file system support" or
-	  "System V and Coherent file system support", above.
-
-	  This is mainly used to carry data from a UnixWare box to your
-	  Linux box via a removable medium like magneto-optical, ZIP or
-	  removable IDE drives. Note, however, that a good portable way to
-	  transport files and directories between unixes (and even other
-	  operating systems) is given by the tar program ("man tar" or
-	  preferably "info tar").
-
-	  If you don't know what all this is about, say N.
-
-config LDM_PARTITION
-	bool "Windows Logical Disk Manager (Dynamic Disk) support"
-	depends on PARTITION_ADVANCED
-	---help---
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned using Windows 2000's/XP's or Vista's Logical Disk
-	  Manager.  They are also known as "Dynamic Disks".
-
-	  Note this driver only supports Dynamic Disks with a protective MBR
-	  label, i.e. DOS partition table.  It does not support GPT labelled
-	  Dynamic Disks yet as can be created with Vista.
-
-	  Windows 2000 introduced the concept of Dynamic Disks to get around
-	  the limitations of the PC's partitioning scheme.  The Logical Disk
-	  Manager allows the user to repartition a disk and create spanned,
-	  mirrored, striped or RAID volumes, all without the need for
-	  rebooting.
-
-	  Normal partitions are now called Basic Disks under Windows 2000, XP,
-	  and Vista.
-
-	  For a fuller description read <file:Documentation/ldm.txt>.
-
-	  If unsure, say N.
-
-config LDM_DEBUG
-	bool "Windows LDM extra logging"
-	depends on LDM_PARTITION
-	help
-	  Say Y here if you would like LDM to log verbosely.  This could be
-	  helpful if the driver doesn't work as expected and you'd like to
-	  report a bug.
-
-	  If unsure, say N.
-
-config SGI_PARTITION
-	bool "SGI partition support" if PARTITION_ADVANCED
-	default y if DEFAULT_SGI_PARTITION
-	help
-	  Say Y here if you would like to be able to read the hard disk
-	  partition table format used by SGI machines.
-
-config ULTRIX_PARTITION
-	bool "Ultrix partition table support" if PARTITION_ADVANCED
-	default y if MACH_DECSTATION
-	help
-	  Say Y here if you would like to be able to read the hard disk
-	  partition table format used by DEC (now Compaq) Ultrix machines.
-	  Otherwise, say N.
-
-config SUN_PARTITION
-	bool "Sun partition tables support" if PARTITION_ADVANCED
-	default y if (SPARC || SUN3 || SUN3X)
-	---help---
-	  Like most systems, SunOS uses its own hard disk partition table
-	  format, incompatible with all others. Saying Y here allows you to
-	  read these partition tables and further mount SunOS partitions from
-	  within Linux if you have also said Y to "UFS file system support",
-	  above. This is mainly used to carry data from a SPARC under SunOS to
-	  your Linux box via a removable medium like magneto-optical or ZIP
-	  drives; note however that a good portable way to transport files and
-	  directories between unixes (and even other operating systems) is
-	  given by the tar program ("man tar" or preferably "info tar"). If
-	  you don't know what all this is about, say N.
-
-config KARMA_PARTITION
-	bool "Karma Partition support"
-	depends on PARTITION_ADVANCED
-	help
-	  Say Y here if you would like to mount the Rio Karma MP3 player, as it
-	  uses a proprietary partition table.
-
-config EFI_PARTITION
-	bool "EFI GUID Partition support"
-	depends on PARTITION_ADVANCED
-	select CRC32
-	help
-	  Say Y here if you would like to use hard disks under Linux which
-	  were partitioned using EFI GPT.
-
-config SYSV68_PARTITION
-	bool "SYSV68 partition table support" if PARTITION_ADVANCED
-	default y if VME
-	help
-	  Say Y here if you would like to be able to read the hard disk
-	  partition table format used by Motorola Delta machines (using
-	  sysv68).
-	  Otherwise, say N.
diff --git a/fs/partitions/Makefile b/fs/partitions/Makefile
deleted file mode 100644
index 03af8eac51d..00000000000
--- a/fs/partitions/Makefile
+++ /dev/null
@@ -1,20 +0,0 @@
-#
-# Makefile for the linux kernel.
-#
-
-obj-$(CONFIG_BLOCK) := check.o
-
-obj-$(CONFIG_ACORN_PARTITION) += acorn.o
-obj-$(CONFIG_AMIGA_PARTITION) += amiga.o
-obj-$(CONFIG_ATARI_PARTITION) += atari.o
-obj-$(CONFIG_MAC_PARTITION) += mac.o
-obj-$(CONFIG_LDM_PARTITION) += ldm.o
-obj-$(CONFIG_MSDOS_PARTITION) += msdos.o
-obj-$(CONFIG_OSF_PARTITION) += osf.o
-obj-$(CONFIG_SGI_PARTITION) += sgi.o
-obj-$(CONFIG_SUN_PARTITION) += sun.o
-obj-$(CONFIG_ULTRIX_PARTITION) += ultrix.o
-obj-$(CONFIG_IBM_PARTITION) += ibm.o
-obj-$(CONFIG_EFI_PARTITION) += efi.o
-obj-$(CONFIG_KARMA_PARTITION) += karma.o
-obj-$(CONFIG_SYSV68_PARTITION) += sysv68.o
diff --git a/fs/partitions/acorn.c b/fs/partitions/acorn.c
deleted file mode 100644
index fbeb697374d..00000000000
--- a/fs/partitions/acorn.c
+++ /dev/null
@@ -1,556 +0,0 @@
-/*
- *  linux/fs/partitions/acorn.c
- *
- *  Copyright (c) 1996-2000 Russell King.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- *  Scan ADFS partitions on hard disk drives.  Unfortunately, there
- *  isn't a standard for partitioning drives on Acorn machines, so
- *  every single manufacturer of SCSI and IDE cards created their own
- *  method.
- */
-#include <linux/buffer_head.h>
-#include <linux/adfs_fs.h>
-
-#include "check.h"
-#include "acorn.h"
-
-/*
- * Partition types. (Oh for reusability)
- */
-#define PARTITION_RISCIX_MFM	1
-#define PARTITION_RISCIX_SCSI	2
-#define PARTITION_LINUX		9
-
-#if defined(CONFIG_ACORN_PARTITION_CUMANA) || \
-	defined(CONFIG_ACORN_PARTITION_ADFS)
-static struct adfs_discrecord *
-adfs_partition(struct parsed_partitions *state, char *name, char *data,
-	       unsigned long first_sector, int slot)
-{
-	struct adfs_discrecord *dr;
-	unsigned int nr_sects;
-
-	if (adfs_checkbblk(data))
-		return NULL;
-
-	dr = (struct adfs_discrecord *)(data + 0x1c0);
-
-	if (dr->disc_size == 0 && dr->disc_size_high == 0)
-		return NULL;
-
-	nr_sects = (le32_to_cpu(dr->disc_size_high) << 23) |
-		   (le32_to_cpu(dr->disc_size) >> 9);
-
-	if (name) {
-		strlcat(state->pp_buf, " [", PAGE_SIZE);
-		strlcat(state->pp_buf, name, PAGE_SIZE);
-		strlcat(state->pp_buf, "]", PAGE_SIZE);
-	}
-	put_partition(state, slot, first_sector, nr_sects);
-	return dr;
-}
-#endif
-
-#ifdef CONFIG_ACORN_PARTITION_RISCIX
-
-struct riscix_part {
-	__le32	start;
-	__le32	length;
-	__le32	one;
-	char	name[16];
-};
-
-struct riscix_record {
-	__le32	magic;
-#define RISCIX_MAGIC	cpu_to_le32(0x4a657320)
-	__le32	date;
-	struct riscix_part part[8];
-};
-
-#if defined(CONFIG_ACORN_PARTITION_CUMANA) || \
-	defined(CONFIG_ACORN_PARTITION_ADFS)
-static int riscix_partition(struct parsed_partitions *state,
-			    unsigned long first_sect, int slot,
-			    unsigned long nr_sects)
-{
-	Sector sect;
-	struct riscix_record *rr;
-	
-	rr = read_part_sector(state, first_sect, &sect);
-	if (!rr)
-		return -1;
-
-	strlcat(state->pp_buf, " [RISCiX]", PAGE_SIZE);
-
-
-	if (rr->magic == RISCIX_MAGIC) {
-		unsigned long size = nr_sects > 2 ? 2 : nr_sects;
-		int part;
-
-		strlcat(state->pp_buf, " <", PAGE_SIZE);
-
-		put_partition(state, slot++, first_sect, size);
-		for (part = 0; part < 8; part++) {
-			if (rr->part[part].one &&
-			    memcmp(rr->part[part].name, "All\0", 4)) {
-				put_partition(state, slot++,
-					le32_to_cpu(rr->part[part].start),
-					le32_to_cpu(rr->part[part].length));
-				strlcat(state->pp_buf, "(", PAGE_SIZE);
-				strlcat(state->pp_buf, rr->part[part].name, PAGE_SIZE);
-				strlcat(state->pp_buf, ")", PAGE_SIZE);
-			}
-		}
-
-		strlcat(state->pp_buf, " >\n", PAGE_SIZE);
-	} else {
-		put_partition(state, slot++, first_sect, nr_sects);
-	}
-
-	put_dev_sector(sect);
-	return slot;
-}
-#endif
-#endif
-
-#define LINUX_NATIVE_MAGIC 0xdeafa1de
-#define LINUX_SWAP_MAGIC   0xdeafab1e
-
-struct linux_part {
-	__le32 magic;
-	__le32 start_sect;
-	__le32 nr_sects;
-};
-
-#if defined(CONFIG_ACORN_PARTITION_CUMANA) || \
-	defined(CONFIG_ACORN_PARTITION_ADFS)
-static int linux_partition(struct parsed_partitions *state,
-			   unsigned long first_sect, int slot,
-			   unsigned long nr_sects)
-{
-	Sector sect;
-	struct linux_part *linuxp;
-	unsigned long size = nr_sects > 2 ? 2 : nr_sects;
-
-	strlcat(state->pp_buf, " [Linux]", PAGE_SIZE);
-
-	put_partition(state, slot++, first_sect, size);
-
-	linuxp = read_part_sector(state, first_sect, &sect);
-	if (!linuxp)
-		return -1;
-
-	strlcat(state->pp_buf, " <", PAGE_SIZE);
-	while (linuxp->magic == cpu_to_le32(LINUX_NATIVE_MAGIC) ||
-	       linuxp->magic == cpu_to_le32(LINUX_SWAP_MAGIC)) {
-		if (slot == state->limit)
-			break;
-		put_partition(state, slot++, first_sect +
-				 le32_to_cpu(linuxp->start_sect),
-				 le32_to_cpu(linuxp->nr_sects));
-		linuxp ++;
-	}
-	strlcat(state->pp_buf, " >", PAGE_SIZE);
-
-	put_dev_sector(sect);
-	return slot;
-}
-#endif
-
-#ifdef CONFIG_ACORN_PARTITION_CUMANA
-int adfspart_check_CUMANA(struct parsed_partitions *state)
-{
-	unsigned long first_sector = 0;
-	unsigned int start_blk = 0;
-	Sector sect;
-	unsigned char *data;
-	char *name = "CUMANA/ADFS";
-	int first = 1;
-	int slot = 1;
-
-	/*
-	 * Try Cumana style partitions - sector 6 contains ADFS boot block
-	 * with pointer to next 'drive'.
-	 *
-	 * There are unknowns in this code - is the 'cylinder number' of the
-	 * next partition relative to the start of this one - I'm assuming
-	 * it is.
-	 *
-	 * Also, which ID did Cumana use?
-	 *
-	 * This is totally unfinished, and will require more work to get it
-	 * going. Hence it is totally untested.
-	 */
-	do {
-		struct adfs_discrecord *dr;
-		unsigned int nr_sects;
-
-		data = read_part_sector(state, start_blk * 2 + 6, &sect);
-		if (!data)
-			return -1;
-
-		if (slot == state->limit)
-			break;
-
-		dr = adfs_partition(state, name, data, first_sector, slot++);
-		if (!dr)
-			break;
-
-		name = NULL;
-
-		nr_sects = (data[0x1fd] + (data[0x1fe] << 8)) *
-			   (dr->heads + (dr->lowsector & 0x40 ? 1 : 0)) *
-			   dr->secspertrack;
-
-		if (!nr_sects)
-			break;
-
-		first = 0;
-		first_sector += nr_sects;
-		start_blk += nr_sects >> (BLOCK_SIZE_BITS - 9);
-		nr_sects = 0; /* hmm - should be partition size */
-
-		switch (data[0x1fc] & 15) {
-		case 0: /* No partition / ADFS? */
-			break;
-
-#ifdef CONFIG_ACORN_PARTITION_RISCIX
-		case PARTITION_RISCIX_SCSI:
-			/* RISCiX - we don't know how to find the next one. */
-			slot = riscix_partition(state, first_sector, slot,
-						nr_sects);
-			break;
-#endif
-
-		case PARTITION_LINUX:
-			slot = linux_partition(state, first_sector, slot,
-					       nr_sects);
-			break;
-		}
-		put_dev_sector(sect);
-		if (slot == -1)
-			return -1;
-	} while (1);
-	put_dev_sector(sect);
-	return first ? 0 : 1;
-}
-#endif
-
-#ifdef CONFIG_ACORN_PARTITION_ADFS
-/*
- * Purpose: allocate ADFS partitions.
- *
- * Params : hd		- pointer to gendisk structure to store partition info.
- *	    dev		- device number to access.
- *
- * Returns: -1 on error, 0 for no ADFS boot sector, 1 for ok.
- *
- * Alloc  : hda  = whole drive
- *	    hda1 = ADFS partition on first drive.
- *	    hda2 = non-ADFS partition.
- */
-int adfspart_check_ADFS(struct parsed_partitions *state)
-{
-	unsigned long start_sect, nr_sects, sectscyl, heads;
-	Sector sect;
-	unsigned char *data;
-	struct adfs_discrecord *dr;
-	unsigned char id;
-	int slot = 1;
-
-	data = read_part_sector(state, 6, &sect);
-	if (!data)
-		return -1;
-
-	dr = adfs_partition(state, "ADFS", data, 0, slot++);
-	if (!dr) {
-		put_dev_sector(sect);
-    		return 0;
-	}
-
-	heads = dr->heads + ((dr->lowsector >> 6) & 1);
-	sectscyl = dr->secspertrack * heads;
-	start_sect = ((data[0x1fe] << 8) + data[0x1fd]) * sectscyl;
-	id = data[0x1fc] & 15;
-	put_dev_sector(sect);
-
-	/*
-	 * Work out start of non-adfs partition.
-	 */
-	nr_sects = (state->bdev->bd_inode->i_size >> 9) - start_sect;
-
-	if (start_sect) {
-		switch (id) {
-#ifdef CONFIG_ACORN_PARTITION_RISCIX
-		case PARTITION_RISCIX_SCSI:
-		case PARTITION_RISCIX_MFM:
-			slot = riscix_partition(state, start_sect, slot,
-						nr_sects);
-			break;
-#endif
-
-		case PARTITION_LINUX:
-			slot = linux_partition(state, start_sect, slot,
-					       nr_sects);
-			break;
-		}
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	return 1;
-}
-#endif
-
-#ifdef CONFIG_ACORN_PARTITION_ICS
-
-struct ics_part {
-	__le32 start;
-	__le32 size;
-};
-
-static int adfspart_check_ICSLinux(struct parsed_partitions *state,
-				   unsigned long block)
-{
-	Sector sect;
-	unsigned char *data = read_part_sector(state, block, &sect);
-	int result = 0;
-
-	if (data) {
-		if (memcmp(data, "LinuxPart", 9) == 0)
-			result = 1;
-		put_dev_sector(sect);
-	}
-
-	return result;
-}
-
-/*
- * Check for a valid ICS partition using the checksum.
- */
-static inline int valid_ics_sector(const unsigned char *data)
-{
-	unsigned long sum;
-	int i;
-
-	for (i = 0, sum = 0x50617274; i < 508; i++)
-		sum += data[i];
-
-	sum -= le32_to_cpu(*(__le32 *)(&data[508]));
-
-	return sum == 0;
-}
-
-/*
- * Purpose: allocate ICS partitions.
- * Params : hd		- pointer to gendisk structure to store partition info.
- *	    dev		- device number to access.
- * Returns: -1 on error, 0 for no ICS table, 1 for partitions ok.
- * Alloc  : hda  = whole drive
- *	    hda1 = ADFS partition 0 on first drive.
- *	    hda2 = ADFS partition 1 on first drive.
- *		..etc..
- */
-int adfspart_check_ICS(struct parsed_partitions *state)
-{
-	const unsigned char *data;
-	const struct ics_part *p;
-	int slot;
-	Sector sect;
-
-	/*
-	 * Try ICS style partitions - sector 0 contains partition info.
-	 */
-	data = read_part_sector(state, 0, &sect);
-	if (!data)
-	    	return -1;
-
-	if (!valid_ics_sector(data)) {
-	    	put_dev_sector(sect);
-		return 0;
-	}
-
-	strlcat(state->pp_buf, " [ICS]", PAGE_SIZE);
-
-	for (slot = 1, p = (const struct ics_part *)data; p->size; p++) {
-		u32 start = le32_to_cpu(p->start);
-		s32 size = le32_to_cpu(p->size); /* yes, it's signed. */
-
-		if (slot == state->limit)
-			break;
-
-		/*
-		 * Negative sizes tell the RISC OS ICS driver to ignore
-		 * this partition - in effect it says that this does not
-		 * contain an ADFS filesystem.
-		 */
-		if (size < 0) {
-			size = -size;
-
-			/*
-			 * Our own extension - We use the first sector
-			 * of the partition to identify what type this
-			 * partition is.  We must not make this visible
-			 * to the filesystem.
-			 */
-			if (size > 1 && adfspart_check_ICSLinux(state, start)) {
-				start += 1;
-				size -= 1;
-			}
-		}
-
-		if (size)
-			put_partition(state, slot++, start, size);
-	}
-
-	put_dev_sector(sect);
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	return 1;
-}
-#endif
-
-#ifdef CONFIG_ACORN_PARTITION_POWERTEC
-struct ptec_part {
-	__le32 unused1;
-	__le32 unused2;
-	__le32 start;
-	__le32 size;
-	__le32 unused5;
-	char type[8];
-};
-
-static inline int valid_ptec_sector(const unsigned char *data)
-{
-	unsigned char checksum = 0x2a;
-	int i;
-
-	/*
-	 * If it looks like a PC/BIOS partition, then it
-	 * probably isn't PowerTec.
-	 */
-	if (data[510] == 0x55 && data[511] == 0xaa)
-		return 0;
-
-	for (i = 0; i < 511; i++)
-		checksum += data[i];
-
-	return checksum == data[511];
-}
-
-/*
- * Purpose: allocate ICS partitions.
- * Params : hd		- pointer to gendisk structure to store partition info.
- *	    dev		- device number to access.
- * Returns: -1 on error, 0 for no ICS table, 1 for partitions ok.
- * Alloc  : hda  = whole drive
- *	    hda1 = ADFS partition 0 on first drive.
- *	    hda2 = ADFS partition 1 on first drive.
- *		..etc..
- */
-int adfspart_check_POWERTEC(struct parsed_partitions *state)
-{
-	Sector sect;
-	const unsigned char *data;
-	const struct ptec_part *p;
-	int slot = 1;
-	int i;
-
-	data = read_part_sector(state, 0, &sect);
-	if (!data)
-		return -1;
-
-	if (!valid_ptec_sector(data)) {
-		put_dev_sector(sect);
-		return 0;
-	}
-
-	strlcat(state->pp_buf, " [POWERTEC]", PAGE_SIZE);
-
-	for (i = 0, p = (const struct ptec_part *)data; i < 12; i++, p++) {
-		u32 start = le32_to_cpu(p->start);
-		u32 size = le32_to_cpu(p->size);
-
-		if (size)
-			put_partition(state, slot++, start, size);
-	}
-
-	put_dev_sector(sect);
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	return 1;
-}
-#endif
-
-#ifdef CONFIG_ACORN_PARTITION_EESOX
-struct eesox_part {
-	char	magic[6];
-	char	name[10];
-	__le32	start;
-	__le32	unused6;
-	__le32	unused7;
-	__le32	unused8;
-};
-
-/*
- * Guess who created this format?
- */
-static const char eesox_name[] = {
-	'N', 'e', 'i', 'l', ' ',
-	'C', 'r', 'i', 't', 'c', 'h', 'e', 'l', 'l', ' ', ' '
-};
-
-/*
- * EESOX SCSI partition format.
- *
- * This is a goddamned awful partition format.  We don't seem to store
- * the size of the partition in this table, only the start addresses.
- *
- * There are two possibilities where the size comes from:
- *  1. The individual ADFS boot block entries that are placed on the disk.
- *  2. The start address of the next entry.
- */
-int adfspart_check_EESOX(struct parsed_partitions *state)
-{
-	Sector sect;
-	const unsigned char *data;
-	unsigned char buffer[256];
-	struct eesox_part *p;
-	sector_t start = 0;
-	int i, slot = 1;
-
-	data = read_part_sector(state, 7, &sect);
-	if (!data)
-		return -1;
-
-	/*
-	 * "Decrypt" the partition table.  God knows why...
-	 */
-	for (i = 0; i < 256; i++)
-		buffer[i] = data[i] ^ eesox_name[i & 15];
-
-	put_dev_sector(sect);
-
-	for (i = 0, p = (struct eesox_part *)buffer; i < 8; i++, p++) {
-		sector_t next;
-
-		if (memcmp(p->magic, "Eesox", 6))
-			break;
-
-		next = le32_to_cpu(p->start);
-		if (i)
-			put_partition(state, slot++, start, next - start);
-		start = next;
-	}
-
-	if (i != 0) {
-		sector_t size;
-
-		size = get_capacity(state->bdev->bd_disk);
-		put_partition(state, slot++, start, size - start);
-		strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	}
-
-	return i ? 1 : 0;
-}
-#endif
diff --git a/fs/partitions/acorn.h b/fs/partitions/acorn.h
deleted file mode 100644
index ede82852969..00000000000
--- a/fs/partitions/acorn.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/*
- * linux/fs/partitions/acorn.h
- *
- * Copyright (C) 1996-2001 Russell King.
- *
- *  I _hate_ this partitioning mess - why can't we have one defined
- *  format, and everyone stick to it?
- */
-
-int adfspart_check_CUMANA(struct parsed_partitions *state);
-int adfspart_check_ADFS(struct parsed_partitions *state);
-int adfspart_check_ICS(struct parsed_partitions *state);
-int adfspart_check_POWERTEC(struct parsed_partitions *state);
-int adfspart_check_EESOX(struct parsed_partitions *state);
diff --git a/fs/partitions/amiga.c b/fs/partitions/amiga.c
deleted file mode 100644
index 70cbf44a156..00000000000
--- a/fs/partitions/amiga.c
+++ /dev/null
@@ -1,139 +0,0 @@
-/*
- *  fs/partitions/amiga.c
- *
- *  Code extracted from drivers/block/genhd.c
- *
- *  Copyright (C) 1991-1998  Linus Torvalds
- *  Re-organised Feb 1998 Russell King
- */
-
-#include <linux/types.h>
-#include <linux/affs_hardblocks.h>
-
-#include "check.h"
-#include "amiga.h"
-
-static __inline__ u32
-checksum_block(__be32 *m, int size)
-{
-	u32 sum = 0;
-
-	while (size--)
-		sum += be32_to_cpu(*m++);
-	return sum;
-}
-
-int amiga_partition(struct parsed_partitions *state)
-{
-	Sector sect;
-	unsigned char *data;
-	struct RigidDiskBlock *rdb;
-	struct PartitionBlock *pb;
-	int start_sect, nr_sects, blk, part, res = 0;
-	int blksize = 1;	/* Multiplier for disk block size */
-	int slot = 1;
-	char b[BDEVNAME_SIZE];
-
-	for (blk = 0; ; blk++, put_dev_sector(sect)) {
-		if (blk == RDB_ALLOCATION_LIMIT)
-			goto rdb_done;
-		data = read_part_sector(state, blk, &sect);
-		if (!data) {
-			if (warn_no_part)
-				printk("Dev %s: unable to read RDB block %d\n",
-				       bdevname(state->bdev, b), blk);
-			res = -1;
-			goto rdb_done;
-		}
-		if (*(__be32 *)data != cpu_to_be32(IDNAME_RIGIDDISK))
-			continue;
-
-		rdb = (struct RigidDiskBlock *)data;
-		if (checksum_block((__be32 *)data, be32_to_cpu(rdb->rdb_SummedLongs) & 0x7F) == 0)
-			break;
-		/* Try again with 0xdc..0xdf zeroed, Windows might have
-		 * trashed it.
-		 */
-		*(__be32 *)(data+0xdc) = 0;
-		if (checksum_block((__be32 *)data,
-				be32_to_cpu(rdb->rdb_SummedLongs) & 0x7F)==0) {
-			printk("Warning: Trashed word at 0xd0 in block %d "
-				"ignored in checksum calculation\n",blk);
-			break;
-		}
-
-		printk("Dev %s: RDB in block %d has bad checksum\n",
-		       bdevname(state->bdev, b), blk);
-	}
-
-	/* blksize is blocks per 512 byte standard block */
-	blksize = be32_to_cpu( rdb->rdb_BlockBytes ) / 512;
-
-	{
-		char tmp[7 + 10 + 1 + 1];
-
-		/* Be more informative */
-		snprintf(tmp, sizeof(tmp), " RDSK (%d)", blksize * 512);
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-	}
-	blk = be32_to_cpu(rdb->rdb_PartitionList);
-	put_dev_sector(sect);
-	for (part = 1; blk>0 && part<=16; part++, put_dev_sector(sect)) {
-		blk *= blksize;	/* Read in terms partition table understands */
-		data = read_part_sector(state, blk, &sect);
-		if (!data) {
-			if (warn_no_part)
-				printk("Dev %s: unable to read partition block %d\n",
-				       bdevname(state->bdev, b), blk);
-			res = -1;
-			goto rdb_done;
-		}
-		pb  = (struct PartitionBlock *)data;
-		blk = be32_to_cpu(pb->pb_Next);
-		if (pb->pb_ID != cpu_to_be32(IDNAME_PARTITION))
-			continue;
-		if (checksum_block((__be32 *)pb, be32_to_cpu(pb->pb_SummedLongs) & 0x7F) != 0 )
-			continue;
-
-		/* Tell Kernel about it */
-
-		nr_sects = (be32_to_cpu(pb->pb_Environment[10]) + 1 -
-			    be32_to_cpu(pb->pb_Environment[9])) *
-			   be32_to_cpu(pb->pb_Environment[3]) *
-			   be32_to_cpu(pb->pb_Environment[5]) *
-			   blksize;
-		if (!nr_sects)
-			continue;
-		start_sect = be32_to_cpu(pb->pb_Environment[9]) *
-			     be32_to_cpu(pb->pb_Environment[3]) *
-			     be32_to_cpu(pb->pb_Environment[5]) *
-			     blksize;
-		put_partition(state,slot++,start_sect,nr_sects);
-		{
-			/* Be even more informative to aid mounting */
-			char dostype[4];
-			char tmp[42];
-
-			__be32 *dt = (__be32 *)dostype;
-			*dt = pb->pb_Environment[16];
-			if (dostype[3] < ' ')
-				snprintf(tmp, sizeof(tmp), " (%c%c%c^%c)",
-					dostype[0], dostype[1],
-					dostype[2], dostype[3] + '@' );
-			else
-				snprintf(tmp, sizeof(tmp), " (%c%c%c%c)",
-					dostype[0], dostype[1],
-					dostype[2], dostype[3]);
-			strlcat(state->pp_buf, tmp, PAGE_SIZE);
-			snprintf(tmp, sizeof(tmp), "(res %d spb %d)",
-				be32_to_cpu(pb->pb_Environment[6]),
-				be32_to_cpu(pb->pb_Environment[4]));
-			strlcat(state->pp_buf, tmp, PAGE_SIZE);
-		}
-		res = 1;
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-
-rdb_done:
-	return res;
-}
diff --git a/fs/partitions/amiga.h b/fs/partitions/amiga.h
deleted file mode 100644
index d094585cada..00000000000
--- a/fs/partitions/amiga.h
+++ /dev/null
@@ -1,6 +0,0 @@
-/*
- *  fs/partitions/amiga.h
- */
-
-int amiga_partition(struct parsed_partitions *state);
-
diff --git a/fs/partitions/atari.c b/fs/partitions/atari.c
deleted file mode 100644
index 9875b05e80a..00000000000
--- a/fs/partitions/atari.c
+++ /dev/null
@@ -1,149 +0,0 @@
-/*
- *  fs/partitions/atari.c
- *
- *  Code extracted from drivers/block/genhd.c
- *
- *  Copyright (C) 1991-1998  Linus Torvalds
- *  Re-organised Feb 1998 Russell King
- */
-
-#include <linux/ctype.h>
-#include "check.h"
-#include "atari.h"
-
-/* ++guenther: this should be settable by the user ("make config")?.
- */
-#define ICD_PARTS
-
-/* check if a partition entry looks valid -- Atari format is assumed if at
-   least one of the primary entries is ok this way */
-#define	VALID_PARTITION(pi,hdsiz)					     \
-    (((pi)->flg & 1) &&							     \
-     isalnum((pi)->id[0]) && isalnum((pi)->id[1]) && isalnum((pi)->id[2]) && \
-     be32_to_cpu((pi)->st) <= (hdsiz) &&				     \
-     be32_to_cpu((pi)->st) + be32_to_cpu((pi)->siz) <= (hdsiz))
-
-static inline int OK_id(char *s)
-{
-	return  memcmp (s, "GEM", 3) == 0 || memcmp (s, "BGM", 3) == 0 ||
-		memcmp (s, "LNX", 3) == 0 || memcmp (s, "SWP", 3) == 0 ||
-		memcmp (s, "RAW", 3) == 0 ;
-}
-
-int atari_partition(struct parsed_partitions *state)
-{
-	Sector sect;
-	struct rootsector *rs;
-	struct partition_info *pi;
-	u32 extensect;
-	u32 hd_size;
-	int slot;
-#ifdef ICD_PARTS
-	int part_fmt = 0; /* 0:unknown, 1:AHDI, 2:ICD/Supra */
-#endif
-
-	rs = read_part_sector(state, 0, &sect);
-	if (!rs)
-		return -1;
-
-	/* Verify this is an Atari rootsector: */
-	hd_size = state->bdev->bd_inode->i_size >> 9;
-	if (!VALID_PARTITION(&rs->part[0], hd_size) &&
-	    !VALID_PARTITION(&rs->part[1], hd_size) &&
-	    !VALID_PARTITION(&rs->part[2], hd_size) &&
-	    !VALID_PARTITION(&rs->part[3], hd_size)) {
-		/*
-		 * if there's no valid primary partition, assume that no Atari
-		 * format partition table (there's no reliable magic or the like
-	         * :-()
-		 */
-		put_dev_sector(sect);
-		return 0;
-	}
-
-	pi = &rs->part[0];
-	strlcat(state->pp_buf, " AHDI", PAGE_SIZE);
-	for (slot = 1; pi < &rs->part[4] && slot < state->limit; slot++, pi++) {
-		struct rootsector *xrs;
-		Sector sect2;
-		ulong partsect;
-
-		if ( !(pi->flg & 1) )
-			continue;
-		/* active partition */
-		if (memcmp (pi->id, "XGM", 3) != 0) {
-			/* we don't care about other id's */
-			put_partition (state, slot, be32_to_cpu(pi->st),
-					be32_to_cpu(pi->siz));
-			continue;
-		}
-		/* extension partition */
-#ifdef ICD_PARTS
-		part_fmt = 1;
-#endif
-		strlcat(state->pp_buf, " XGM<", PAGE_SIZE);
-		partsect = extensect = be32_to_cpu(pi->st);
-		while (1) {
-			xrs = read_part_sector(state, partsect, &sect2);
-			if (!xrs) {
-				printk (" block %ld read failed\n", partsect);
-				put_dev_sector(sect);
-				return -1;
-			}
-
-			/* ++roman: sanity check: bit 0 of flg field must be set */
-			if (!(xrs->part[0].flg & 1)) {
-				printk( "\nFirst sub-partition in extended partition is not valid!\n" );
-				put_dev_sector(sect2);
-				break;
-			}
-
-			put_partition(state, slot,
-				   partsect + be32_to_cpu(xrs->part[0].st),
-				   be32_to_cpu(xrs->part[0].siz));
-
-			if (!(xrs->part[1].flg & 1)) {
-				/* end of linked partition list */
-				put_dev_sector(sect2);
-				break;
-			}
-			if (memcmp( xrs->part[1].id, "XGM", 3 ) != 0) {
-				printk("\nID of extended partition is not XGM!\n");
-				put_dev_sector(sect2);
-				break;
-			}
-
-			partsect = be32_to_cpu(xrs->part[1].st) + extensect;
-			put_dev_sector(sect2);
-			if (++slot == state->limit) {
-				printk( "\nMaximum number of partitions reached!\n" );
-				break;
-			}
-		}
-		strlcat(state->pp_buf, " >", PAGE_SIZE);
-	}
-#ifdef ICD_PARTS
-	if ( part_fmt!=1 ) { /* no extended partitions -> test ICD-format */
-		pi = &rs->icdpart[0];
-		/* sanity check: no ICD format if first partition invalid */
-		if (OK_id(pi->id)) {
-			strlcat(state->pp_buf, " ICD<", PAGE_SIZE);
-			for (; pi < &rs->icdpart[8] && slot < state->limit; slot++, pi++) {
-				/* accept only GEM,BGM,RAW,LNX,SWP partitions */
-				if (!((pi->flg & 1) && OK_id(pi->id)))
-					continue;
-				part_fmt = 2;
-				put_partition (state, slot,
-						be32_to_cpu(pi->st),
-						be32_to_cpu(pi->siz));
-			}
-			strlcat(state->pp_buf, " >", PAGE_SIZE);
-		}
-	}
-#endif
-	put_dev_sector(sect);
-
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-
-	return 1;
-}
diff --git a/fs/partitions/atari.h b/fs/partitions/atari.h
deleted file mode 100644
index fe2d32a89f3..00000000000
--- a/fs/partitions/atari.h
+++ /dev/null
@@ -1,34 +0,0 @@
-/*
- *  fs/partitions/atari.h
- *  Moved by Russell King from:
- *
- * linux/include/linux/atari_rootsec.h
- * definitions for Atari Rootsector layout
- * by Andreas Schwab (schwab@ls5.informatik.uni-dortmund.de)
- *
- * modified for ICD/Supra partitioning scheme restricted to at most 12
- * partitions
- * by Guenther Kelleter (guenther@pool.informatik.rwth-aachen.de)
- */
-
-struct partition_info
-{
-  u8 flg;			/* bit 0: active; bit 7: bootable */
-  char id[3];			/* "GEM", "BGM", "XGM", or other */
-  __be32 st;			/* start of partition */
-  __be32 siz;			/* length of partition */
-};
-
-struct rootsector
-{
-  char unused[0x156];		/* room for boot code */
-  struct partition_info icdpart[8];	/* info for ICD-partitions 5..12 */
-  char unused2[0xc];
-  u32 hd_siz;			/* size of disk in blocks */
-  struct partition_info part[4];
-  u32 bsl_st;			/* start of bad sector list */
-  u32 bsl_cnt;			/* length of bad sector list */
-  u16 checksum;			/* checksum for bootable disks */
-} __attribute__((__packed__));
-
-int atari_partition(struct parsed_partitions *state);
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
deleted file mode 100644
index e3c63d1c5e1..00000000000
--- a/fs/partitions/check.c
+++ /dev/null
@@ -1,687 +0,0 @@
-/*
- *  fs/partitions/check.c
- *
- *  Code extracted from drivers/block/genhd.c
- *  Copyright (C) 1991-1998  Linus Torvalds
- *  Re-organised Feb 1998 Russell King
- *
- *  We now have independent partition support from the
- *  block drivers, which allows all the partition code to
- *  be grouped in one location, and it to be mostly self
- *  contained.
- *
- *  Added needed MAJORS for new pairs, {hdi,hdj}, {hdk,hdl}
- */
-
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/fs.h>
-#include <linux/slab.h>
-#include <linux/kmod.h>
-#include <linux/ctype.h>
-#include <linux/genhd.h>
-#include <linux/blktrace_api.h>
-
-#include "check.h"
-
-#include "acorn.h"
-#include "amiga.h"
-#include "atari.h"
-#include "ldm.h"
-#include "mac.h"
-#include "msdos.h"
-#include "osf.h"
-#include "sgi.h"
-#include "sun.h"
-#include "ibm.h"
-#include "ultrix.h"
-#include "efi.h"
-#include "karma.h"
-#include "sysv68.h"
-
-#ifdef CONFIG_BLK_DEV_MD
-extern void md_autodetect_dev(dev_t dev);
-#endif
-
-int warn_no_part = 1; /*This is ugly: should make genhd removable media aware*/
-
-static int (*check_part[])(struct parsed_partitions *) = {
-	/*
-	 * Probe partition formats with tables at disk address 0
-	 * that also have an ADFS boot block at 0xdc0.
-	 */
-#ifdef CONFIG_ACORN_PARTITION_ICS
-	adfspart_check_ICS,
-#endif
-#ifdef CONFIG_ACORN_PARTITION_POWERTEC
-	adfspart_check_POWERTEC,
-#endif
-#ifdef CONFIG_ACORN_PARTITION_EESOX
-	adfspart_check_EESOX,
-#endif
-
-	/*
-	 * Now move on to formats that only have partition info at
-	 * disk address 0xdc0.  Since these may also have stale
-	 * PC/BIOS partition tables, they need to come before
-	 * the msdos entry.
-	 */
-#ifdef CONFIG_ACORN_PARTITION_CUMANA
-	adfspart_check_CUMANA,
-#endif
-#ifdef CONFIG_ACORN_PARTITION_ADFS
-	adfspart_check_ADFS,
-#endif
-
-#ifdef CONFIG_EFI_PARTITION
-	efi_partition,		/* this must come before msdos */
-#endif
-#ifdef CONFIG_SGI_PARTITION
-	sgi_partition,
-#endif
-#ifdef CONFIG_LDM_PARTITION
-	ldm_partition,		/* this must come before msdos */
-#endif
-#ifdef CONFIG_MSDOS_PARTITION
-	msdos_partition,
-#endif
-#ifdef CONFIG_OSF_PARTITION
-	osf_partition,
-#endif
-#ifdef CONFIG_SUN_PARTITION
-	sun_partition,
-#endif
-#ifdef CONFIG_AMIGA_PARTITION
-	amiga_partition,
-#endif
-#ifdef CONFIG_ATARI_PARTITION
-	atari_partition,
-#endif
-#ifdef CONFIG_MAC_PARTITION
-	mac_partition,
-#endif
-#ifdef CONFIG_ULTRIX_PARTITION
-	ultrix_partition,
-#endif
-#ifdef CONFIG_IBM_PARTITION
-	ibm_partition,
-#endif
-#ifdef CONFIG_KARMA_PARTITION
-	karma_partition,
-#endif
-#ifdef CONFIG_SYSV68_PARTITION
-	sysv68_partition,
-#endif
-	NULL
-};
- 
-/*
- * disk_name() is used by partition check code and the genhd driver.
- * It formats the devicename of the indicated disk into
- * the supplied buffer (of size at least 32), and returns
- * a pointer to that same buffer (for convenience).
- */
-
-char *disk_name(struct gendisk *hd, int partno, char *buf)
-{
-	if (!partno)
-		snprintf(buf, BDEVNAME_SIZE, "%s", hd->disk_name);
-	else if (isdigit(hd->disk_name[strlen(hd->disk_name)-1]))
-		snprintf(buf, BDEVNAME_SIZE, "%sp%d", hd->disk_name, partno);
-	else
-		snprintf(buf, BDEVNAME_SIZE, "%s%d", hd->disk_name, partno);
-
-	return buf;
-}
-
-const char *bdevname(struct block_device *bdev, char *buf)
-{
-	return disk_name(bdev->bd_disk, bdev->bd_part->partno, buf);
-}
-
-EXPORT_SYMBOL(bdevname);
-
-/*
- * There's very little reason to use this, you should really
- * have a struct block_device just about everywhere and use
- * bdevname() instead.
- */
-const char *__bdevname(dev_t dev, char *buffer)
-{
-	scnprintf(buffer, BDEVNAME_SIZE, "unknown-block(%u,%u)",
-				MAJOR(dev), MINOR(dev));
-	return buffer;
-}
-
-EXPORT_SYMBOL(__bdevname);
-
-static struct parsed_partitions *
-check_partition(struct gendisk *hd, struct block_device *bdev)
-{
-	struct parsed_partitions *state;
-	int i, res, err;
-
-	state = kzalloc(sizeof(struct parsed_partitions), GFP_KERNEL);
-	if (!state)
-		return NULL;
-	state->pp_buf = (char *)__get_free_page(GFP_KERNEL);
-	if (!state->pp_buf) {
-		kfree(state);
-		return NULL;
-	}
-	state->pp_buf[0] = '\0';
-
-	state->bdev = bdev;
-	disk_name(hd, 0, state->name);
-	snprintf(state->pp_buf, PAGE_SIZE, " %s:", state->name);
-	if (isdigit(state->name[strlen(state->name)-1]))
-		sprintf(state->name, "p");
-
-	state->limit = disk_max_parts(hd);
-	i = res = err = 0;
-	while (!res && check_part[i]) {
-		memset(&state->parts, 0, sizeof(state->parts));
-		res = check_part[i++](state);
-		if (res < 0) {
-			/* We have hit an I/O error which we don't report now.
-		 	* But record it, and let the others do their job.
-		 	*/
-			err = res;
-			res = 0;
-		}
-
-	}
-	if (res > 0) {
-		printk(KERN_INFO "%s", state->pp_buf);
-
-		free_page((unsigned long)state->pp_buf);
-		return state;
-	}
-	if (state->access_beyond_eod)
-		err = -ENOSPC;
-	if (err)
-	/* The partition is unrecognized. So report I/O errors if there were any */
-		res = err;
-	if (!res)
-		strlcat(state->pp_buf, " unknown partition table\n", PAGE_SIZE);
-	else if (warn_no_part)
-		strlcat(state->pp_buf, " unable to read partition table\n", PAGE_SIZE);
-
-	printk(KERN_INFO "%s", state->pp_buf);
-
-	free_page((unsigned long)state->pp_buf);
-	kfree(state);
-	return ERR_PTR(res);
-}
-
-static ssize_t part_partition_show(struct device *dev,
-				   struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%d\n", p->partno);
-}
-
-static ssize_t part_start_show(struct device *dev,
-			       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%llu\n",(unsigned long long)p->start_sect);
-}
-
-ssize_t part_size_show(struct device *dev,
-		       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%llu\n",(unsigned long long)p->nr_sects);
-}
-
-static ssize_t part_ro_show(struct device *dev,
-			    struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%d\n", p->policy ? 1 : 0);
-}
-
-static ssize_t part_alignment_offset_show(struct device *dev,
-					  struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%llu\n", (unsigned long long)p->alignment_offset);
-}
-
-static ssize_t part_discard_alignment_show(struct device *dev,
-					   struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%u\n", p->discard_alignment);
-}
-
-ssize_t part_stat_show(struct device *dev,
-		       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	int cpu;
-
-	cpu = part_stat_lock();
-	part_round_stats(cpu, p);
-	part_stat_unlock();
-	return sprintf(buf,
-		"%8lu %8lu %8llu %8u "
-		"%8lu %8lu %8llu %8u "
-		"%8u %8u %8u"
-		"\n",
-		part_stat_read(p, ios[READ]),
-		part_stat_read(p, merges[READ]),
-		(unsigned long long)part_stat_read(p, sectors[READ]),
-		jiffies_to_msecs(part_stat_read(p, ticks[READ])),
-		part_stat_read(p, ios[WRITE]),
-		part_stat_read(p, merges[WRITE]),
-		(unsigned long long)part_stat_read(p, sectors[WRITE]),
-		jiffies_to_msecs(part_stat_read(p, ticks[WRITE])),
-		part_in_flight(p),
-		jiffies_to_msecs(part_stat_read(p, io_ticks)),
-		jiffies_to_msecs(part_stat_read(p, time_in_queue)));
-}
-
-ssize_t part_inflight_show(struct device *dev,
-			struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
-}
-
-#ifdef CONFIG_FAIL_MAKE_REQUEST
-ssize_t part_fail_show(struct device *dev,
-		       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%d\n", p->make_it_fail);
-}
-
-ssize_t part_fail_store(struct device *dev,
-			struct device_attribute *attr,
-			const char *buf, size_t count)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	int i;
-
-	if (count > 0 && sscanf(buf, "%d", &i) > 0)
-		p->make_it_fail = (i == 0) ? 0 : 1;
-
-	return count;
-}
-#endif
-
-static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
-static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
-static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
-static DEVICE_ATTR(ro, S_IRUGO, part_ro_show, NULL);
-static DEVICE_ATTR(alignment_offset, S_IRUGO, part_alignment_offset_show, NULL);
-static DEVICE_ATTR(discard_alignment, S_IRUGO, part_discard_alignment_show,
-		   NULL);
-static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
-static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
-#ifdef CONFIG_FAIL_MAKE_REQUEST
-static struct device_attribute dev_attr_fail =
-	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
-#endif
-
-static struct attribute *part_attrs[] = {
-	&dev_attr_partition.attr,
-	&dev_attr_start.attr,
-	&dev_attr_size.attr,
-	&dev_attr_ro.attr,
-	&dev_attr_alignment_offset.attr,
-	&dev_attr_discard_alignment.attr,
-	&dev_attr_stat.attr,
-	&dev_attr_inflight.attr,
-#ifdef CONFIG_FAIL_MAKE_REQUEST
-	&dev_attr_fail.attr,
-#endif
-	NULL
-};
-
-static struct attribute_group part_attr_group = {
-	.attrs = part_attrs,
-};
-
-static const struct attribute_group *part_attr_groups[] = {
-	&part_attr_group,
-#ifdef CONFIG_BLK_DEV_IO_TRACE
-	&blk_trace_attr_group,
-#endif
-	NULL
-};
-
-static void part_release(struct device *dev)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	free_part_stats(p);
-	free_part_info(p);
-	kfree(p);
-}
-
-struct device_type part_type = {
-	.name		= "partition",
-	.groups		= part_attr_groups,
-	.release	= part_release,
-};
-
-static void delete_partition_rcu_cb(struct rcu_head *head)
-{
-	struct hd_struct *part = container_of(head, struct hd_struct, rcu_head);
-
-	part->start_sect = 0;
-	part->nr_sects = 0;
-	part_stat_set_all(part, 0);
-	put_device(part_to_dev(part));
-}
-
-void __delete_partition(struct hd_struct *part)
-{
-	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
-}
-
-void delete_partition(struct gendisk *disk, int partno)
-{
-	struct disk_part_tbl *ptbl = disk->part_tbl;
-	struct hd_struct *part;
-
-	if (partno >= ptbl->len)
-		return;
-
-	part = ptbl->part[partno];
-	if (!part)
-		return;
-
-	blk_free_devt(part_devt(part));
-	rcu_assign_pointer(ptbl->part[partno], NULL);
-	rcu_assign_pointer(ptbl->last_lookup, NULL);
-	kobject_put(part->holder_dir);
-	device_del(part_to_dev(part));
-
-	hd_struct_put(part);
-}
-
-static ssize_t whole_disk_show(struct device *dev,
-			       struct device_attribute *attr, char *buf)
-{
-	return 0;
-}
-static DEVICE_ATTR(whole_disk, S_IRUSR | S_IRGRP | S_IROTH,
-		   whole_disk_show, NULL);
-
-struct hd_struct *add_partition(struct gendisk *disk, int partno,
-				sector_t start, sector_t len, int flags,
-				struct partition_meta_info *info)
-{
-	struct hd_struct *p;
-	dev_t devt = MKDEV(0, 0);
-	struct device *ddev = disk_to_dev(disk);
-	struct device *pdev;
-	struct disk_part_tbl *ptbl;
-	const char *dname;
-	int err;
-
-	err = disk_expand_part_tbl(disk, partno);
-	if (err)
-		return ERR_PTR(err);
-	ptbl = disk->part_tbl;
-
-	if (ptbl->part[partno])
-		return ERR_PTR(-EBUSY);
-
-	p = kzalloc(sizeof(*p), GFP_KERNEL);
-	if (!p)
-		return ERR_PTR(-EBUSY);
-
-	if (!init_part_stats(p)) {
-		err = -ENOMEM;
-		goto out_free;
-	}
-	pdev = part_to_dev(p);
-
-	p->start_sect = start;
-	p->alignment_offset =
-		queue_limit_alignment_offset(&disk->queue->limits, start);
-	p->discard_alignment =
-		queue_limit_discard_alignment(&disk->queue->limits, start);
-	p->nr_sects = len;
-	p->partno = partno;
-	p->policy = get_disk_ro(disk);
-
-	if (info) {
-		struct partition_meta_info *pinfo = alloc_part_info(disk);
-		if (!pinfo)
-			goto out_free_stats;
-		memcpy(pinfo, info, sizeof(*info));
-		p->info = pinfo;
-	}
-
-	dname = dev_name(ddev);
-	if (isdigit(dname[strlen(dname) - 1]))
-		dev_set_name(pdev, "%sp%d", dname, partno);
-	else
-		dev_set_name(pdev, "%s%d", dname, partno);
-
-	device_initialize(pdev);
-	pdev->class = &block_class;
-	pdev->type = &part_type;
-	pdev->parent = ddev;
-
-	err = blk_alloc_devt(p, &devt);
-	if (err)
-		goto out_free_info;
-	pdev->devt = devt;
-
-	/* delay uevent until 'holders' subdir is created */
-	dev_set_uevent_suppress(pdev, 1);
-	err = device_add(pdev);
-	if (err)
-		goto out_put;
-
-	err = -ENOMEM;
-	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
-	if (!p->holder_dir)
-		goto out_del;
-
-	dev_set_uevent_suppress(pdev, 0);
-	if (flags & ADDPART_FLAG_WHOLEDISK) {
-		err = device_create_file(pdev, &dev_attr_whole_disk);
-		if (err)
-			goto out_del;
-	}
-
-	/* everything is up and running, commence */
-	rcu_assign_pointer(ptbl->part[partno], p);
-
-	/* suppress uevent if the disk suppresses it */
-	if (!dev_get_uevent_suppress(ddev))
-		kobject_uevent(&pdev->kobj, KOBJ_ADD);
-
-	hd_ref_init(p);
-	return p;
-
-out_free_info:
-	free_part_info(p);
-out_free_stats:
-	free_part_stats(p);
-out_free:
-	kfree(p);
-	return ERR_PTR(err);
-out_del:
-	kobject_put(p->holder_dir);
-	device_del(pdev);
-out_put:
-	put_device(pdev);
-	blk_free_devt(devt);
-	return ERR_PTR(err);
-}
-
-static bool disk_unlock_native_capacity(struct gendisk *disk)
-{
-	const struct block_device_operations *bdops = disk->fops;
-
-	if (bdops->unlock_native_capacity &&
-	    !(disk->flags & GENHD_FL_NATIVE_CAPACITY)) {
-		printk(KERN_CONT "enabling native capacity\n");
-		bdops->unlock_native_capacity(disk);
-		disk->flags |= GENHD_FL_NATIVE_CAPACITY;
-		return true;
-	} else {
-		printk(KERN_CONT "truncated\n");
-		return false;
-	}
-}
-
-int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
-{
-	struct parsed_partitions *state = NULL;
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-	int p, highest, res;
-rescan:
-	if (state && !IS_ERR(state)) {
-		kfree(state);
-		state = NULL;
-	}
-
-	if (bdev->bd_part_count)
-		return -EBUSY;
-	res = invalidate_partition(disk, 0);
-	if (res)
-		return res;
-
-	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
-	while ((part = disk_part_iter_next(&piter)))
-		delete_partition(disk, part->partno);
-	disk_part_iter_exit(&piter);
-
-	if (disk->fops->revalidate_disk)
-		disk->fops->revalidate_disk(disk);
-	check_disk_size_change(disk, bdev);
-	bdev->bd_invalidated = 0;
-	if (!get_capacity(disk) || !(state = check_partition(disk, bdev)))
-		return 0;
-	if (IS_ERR(state)) {
-		/*
-		 * I/O error reading the partition table.  If any
-		 * partition code tried to read beyond EOD, retry
-		 * after unlocking native capacity.
-		 */
-		if (PTR_ERR(state) == -ENOSPC) {
-			printk(KERN_WARNING "%s: partition table beyond EOD, ",
-			       disk->disk_name);
-			if (disk_unlock_native_capacity(disk))
-				goto rescan;
-		}
-		return -EIO;
-	}
-	/*
-	 * If any partition code tried to read beyond EOD, try
-	 * unlocking native capacity even if partition table is
-	 * successfully read as we could be missing some partitions.
-	 */
-	if (state->access_beyond_eod) {
-		printk(KERN_WARNING
-		       "%s: partition table partially beyond EOD, ",
-		       disk->disk_name);
-		if (disk_unlock_native_capacity(disk))
-			goto rescan;
-	}
-
-	/* tell userspace that the media / partition table may have changed */
-	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
-
-	/* Detect the highest partition number and preallocate
-	 * disk->part_tbl.  This is an optimization and not strictly
-	 * necessary.
-	 */
-	for (p = 1, highest = 0; p < state->limit; p++)
-		if (state->parts[p].size)
-			highest = p;
-
-	disk_expand_part_tbl(disk, highest);
-
-	/* add partitions */
-	for (p = 1; p < state->limit; p++) {
-		sector_t size, from;
-		struct partition_meta_info *info = NULL;
-
-		size = state->parts[p].size;
-		if (!size)
-			continue;
-
-		from = state->parts[p].from;
-		if (from >= get_capacity(disk)) {
-			printk(KERN_WARNING
-			       "%s: p%d start %llu is beyond EOD, ",
-			       disk->disk_name, p, (unsigned long long) from);
-			if (disk_unlock_native_capacity(disk))
-				goto rescan;
-			continue;
-		}
-
-		if (from + size > get_capacity(disk)) {
-			printk(KERN_WARNING
-			       "%s: p%d size %llu extends beyond EOD, ",
-			       disk->disk_name, p, (unsigned long long) size);
-
-			if (disk_unlock_native_capacity(disk)) {
-				/* free state and restart */
-				goto rescan;
-			} else {
-				/*
-				 * we can not ignore partitions of broken tables
-				 * created by for example camera firmware, but
-				 * we limit them to the end of the disk to avoid
-				 * creating invalid block devices
-				 */
-				size = get_capacity(disk) - from;
-			}
-		}
-
-		if (state->parts[p].has_info)
-			info = &state->parts[p].info;
-		part = add_partition(disk, p, from, size,
-				     state->parts[p].flags,
-				     &state->parts[p].info);
-		if (IS_ERR(part)) {
-			printk(KERN_ERR " %s: p%d could not be added: %ld\n",
-			       disk->disk_name, p, -PTR_ERR(part));
-			continue;
-		}
-#ifdef CONFIG_BLK_DEV_MD
-		if (state->parts[p].flags & ADDPART_FLAG_RAID)
-			md_autodetect_dev(part_to_dev(part)->devt);
-#endif
-	}
-	kfree(state);
-	return 0;
-}
-
-unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
-{
-	struct address_space *mapping = bdev->bd_inode->i_mapping;
-	struct page *page;
-
-	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
-				 NULL);
-	if (!IS_ERR(page)) {
-		if (PageError(page))
-			goto fail;
-		p->v = page;
-		return (unsigned char *)page_address(page) +  ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
-fail:
-		page_cache_release(page);
-	}
-	p->v = NULL;
-	return NULL;
-}
-
-EXPORT_SYMBOL(read_dev_sector);
diff --git a/fs/partitions/check.h b/fs/partitions/check.h
deleted file mode 100644
index d68bf4dc3bc..00000000000
--- a/fs/partitions/check.h
+++ /dev/null
@@ -1,49 +0,0 @@
-#include <linux/pagemap.h>
-#include <linux/blkdev.h>
-#include <linux/genhd.h>
-
-/*
- * add_gd_partition adds a partitions details to the devices partition
- * description.
- */
-struct parsed_partitions {
-	struct block_device *bdev;
-	char name[BDEVNAME_SIZE];
-	struct {
-		sector_t from;
-		sector_t size;
-		int flags;
-		bool has_info;
-		struct partition_meta_info info;
-	} parts[DISK_MAX_PARTS];
-	int next;
-	int limit;
-	bool access_beyond_eod;
-	char *pp_buf;
-};
-
-static inline void *read_part_sector(struct parsed_partitions *state,
-				     sector_t n, Sector *p)
-{
-	if (n >= get_capacity(state->bdev->bd_disk)) {
-		state->access_beyond_eod = true;
-		return NULL;
-	}
-	return read_dev_sector(state->bdev, n, p);
-}
-
-static inline void
-put_partition(struct parsed_partitions *p, int n, sector_t from, sector_t size)
-{
-	if (n < p->limit) {
-		char tmp[1 + BDEVNAME_SIZE + 10 + 1];
-
-		p->parts[n].from = from;
-		p->parts[n].size = size;
-		snprintf(tmp, sizeof(tmp), " %s%d", p->name, n);
-		strlcat(p->pp_buf, tmp, PAGE_SIZE);
-	}
-}
-
-extern int warn_no_part;
-
diff --git a/fs/partitions/efi.c b/fs/partitions/efi.c
deleted file mode 100644
index 6296b403c67..00000000000
--- a/fs/partitions/efi.c
+++ /dev/null
@@ -1,675 +0,0 @@
-/************************************************************
- * EFI GUID Partition Table handling
- *
- * http://www.uefi.org/specs/
- * http://www.intel.com/technology/efi/
- *
- * efi.[ch] by Matt Domsch <Matt_Domsch@dell.com>
- *   Copyright 2000,2001,2002,2004 Dell Inc.
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- *
- *
- * TODO:
- *
- * Changelog:
- * Mon Nov 09 2004 Matt Domsch <Matt_Domsch@dell.com>
- * - test for valid PMBR and valid PGPT before ever reading
- *   AGPT, allow override with 'gpt' kernel command line option.
- * - check for first/last_usable_lba outside of size of disk
- *
- * Tue  Mar 26 2002 Matt Domsch <Matt_Domsch@dell.com>
- * - Ported to 2.5.7-pre1 and 2.5.7-dj2
- * - Applied patch to avoid fault in alternate header handling
- * - cleaned up find_valid_gpt
- * - On-disk structure and copy in memory is *always* LE now - 
- *   swab fields as needed
- * - remove print_gpt_header()
- * - only use first max_p partition entries, to keep the kernel minor number
- *   and partition numbers tied.
- *
- * Mon  Feb 04 2002 Matt Domsch <Matt_Domsch@dell.com>
- * - Removed __PRIPTR_PREFIX - not being used
- *
- * Mon  Jan 14 2002 Matt Domsch <Matt_Domsch@dell.com>
- * - Ported to 2.5.2-pre11 + library crc32 patch Linus applied
- *
- * Thu Dec 6 2001 Matt Domsch <Matt_Domsch@dell.com>
- * - Added compare_gpts().
- * - moved le_efi_guid_to_cpus() back into this file.  GPT is the only
- *   thing that keeps EFI GUIDs on disk.
- * - Changed gpt structure names and members to be simpler and more Linux-like.
- * 
- * Wed Oct 17 2001 Matt Domsch <Matt_Domsch@dell.com>
- * - Removed CONFIG_DEVFS_VOLUMES_UUID code entirely per Martin Wilck
- *
- * Wed Oct 10 2001 Matt Domsch <Matt_Domsch@dell.com>
- * - Changed function comments to DocBook style per Andreas Dilger suggestion.
- *
- * Mon Oct 08 2001 Matt Domsch <Matt_Domsch@dell.com>
- * - Change read_lba() to use the page cache per Al Viro's work.
- * - print u64s properly on all architectures
- * - fixed debug_printk(), now Dprintk()
- *
- * Mon Oct 01 2001 Matt Domsch <Matt_Domsch@dell.com>
- * - Style cleanups
- * - made most functions static
- * - Endianness addition
- * - remove test for second alternate header, as it's not per spec,
- *   and is unnecessary.  There's now a method to read/write the last
- *   sector of an odd-sized disk from user space.  No tools have ever
- *   been released which used this code, so it's effectively dead.
- * - Per Asit Mallick of Intel, added a test for a valid PMBR.
- * - Added kernel command line option 'gpt' to override valid PMBR test.
- *
- * Wed Jun  6 2001 Martin Wilck <Martin.Wilck@Fujitsu-Siemens.com>
- * - added devfs volume UUID support (/dev/volumes/uuids) for
- *   mounting file systems by the partition GUID. 
- *
- * Tue Dec  5 2000 Matt Domsch <Matt_Domsch@dell.com>
- * - Moved crc32() to linux/lib, added efi_crc32().
- *
- * Thu Nov 30 2000 Matt Domsch <Matt_Domsch@dell.com>
- * - Replaced Intel's CRC32 function with an equivalent
- *   non-license-restricted version.
- *
- * Wed Oct 25 2000 Matt Domsch <Matt_Domsch@dell.com>
- * - Fixed the last_lba() call to return the proper last block
- *
- * Thu Oct 12 2000 Matt Domsch <Matt_Domsch@dell.com>
- * - Thanks to Andries Brouwer for his debugging assistance.
- * - Code works, detects all the partitions.
- *
- ************************************************************/
-#include <linux/crc32.h>
-#include <linux/ctype.h>
-#include <linux/math64.h>
-#include <linux/slab.h>
-#include "check.h"
-#include "efi.h"
-
-/* This allows a kernel command line option 'gpt' to override
- * the test for invalid PMBR.  Not __initdata because reloading
- * the partition tables happens after init too.
- */
-static int force_gpt;
-static int __init
-force_gpt_fn(char *str)
-{
-	force_gpt = 1;
-	return 1;
-}
-__setup("gpt", force_gpt_fn);
-
-
-/**
- * efi_crc32() - EFI version of crc32 function
- * @buf: buffer to calculate crc32 of
- * @len - length of buf
- *
- * Description: Returns EFI-style CRC32 value for @buf
- * 
- * This function uses the little endian Ethernet polynomial
- * but seeds the function with ~0, and xor's with ~0 at the end.
- * Note, the EFI Specification, v1.02, has a reference to
- * Dr. Dobbs Journal, May 1994 (actually it's in May 1992).
- */
-static inline u32
-efi_crc32(const void *buf, unsigned long len)
-{
-	return (crc32(~0L, buf, len) ^ ~0L);
-}
-
-/**
- * last_lba(): return number of last logical block of device
- * @bdev: block device
- * 
- * Description: Returns last LBA value on success, 0 on error.
- * This is stored (by sd and ide-geometry) in
- *  the part[0] entry for this disk, and is the number of
- *  physical sectors available on the disk.
- */
-static u64 last_lba(struct block_device *bdev)
-{
-	if (!bdev || !bdev->bd_inode)
-		return 0;
-	return div_u64(bdev->bd_inode->i_size,
-		       bdev_logical_block_size(bdev)) - 1ULL;
-}
-
-static inline int
-pmbr_part_valid(struct partition *part)
-{
-        if (part->sys_ind == EFI_PMBR_OSTYPE_EFI_GPT &&
-            le32_to_cpu(part->start_sect) == 1UL)
-                return 1;
-        return 0;
-}
-
-/**
- * is_pmbr_valid(): test Protective MBR for validity
- * @mbr: pointer to a legacy mbr structure
- *
- * Description: Returns 1 if PMBR is valid, 0 otherwise.
- * Validity depends on two things:
- *  1) MSDOS signature is in the last two bytes of the MBR
- *  2) One partition of type 0xEE is found
- */
-static int
-is_pmbr_valid(legacy_mbr *mbr)
-{
-	int i;
-	if (!mbr || le16_to_cpu(mbr->signature) != MSDOS_MBR_SIGNATURE)
-                return 0;
-	for (i = 0; i < 4; i++)
-		if (pmbr_part_valid(&mbr->partition_record[i]))
-                        return 1;
-	return 0;
-}
-
-/**
- * read_lba(): Read bytes from disk, starting at given LBA
- * @state
- * @lba
- * @buffer
- * @size_t
- *
- * Description: Reads @count bytes from @state->bdev into @buffer.
- * Returns number of bytes read on success, 0 on error.
- */
-static size_t read_lba(struct parsed_partitions *state,
-		       u64 lba, u8 *buffer, size_t count)
-{
-	size_t totalreadcount = 0;
-	struct block_device *bdev = state->bdev;
-	sector_t n = lba * (bdev_logical_block_size(bdev) / 512);
-
-	if (!buffer || lba > last_lba(bdev))
-                return 0;
-
-	while (count) {
-		int copied = 512;
-		Sector sect;
-		unsigned char *data = read_part_sector(state, n++, &sect);
-		if (!data)
-			break;
-		if (copied > count)
-			copied = count;
-		memcpy(buffer, data, copied);
-		put_dev_sector(sect);
-		buffer += copied;
-		totalreadcount +=copied;
-		count -= copied;
-	}
-	return totalreadcount;
-}
-
-/**
- * alloc_read_gpt_entries(): reads partition entries from disk
- * @state
- * @gpt - GPT header
- * 
- * Description: Returns ptes on success,  NULL on error.
- * Allocates space for PTEs based on information found in @gpt.
- * Notes: remember to free pte when you're done!
- */
-static gpt_entry *alloc_read_gpt_entries(struct parsed_partitions *state,
-					 gpt_header *gpt)
-{
-	size_t count;
-	gpt_entry *pte;
-
-	if (!gpt)
-		return NULL;
-
-	count = le32_to_cpu(gpt->num_partition_entries) *
-                le32_to_cpu(gpt->sizeof_partition_entry);
-	if (!count)
-		return NULL;
-	pte = kzalloc(count, GFP_KERNEL);
-	if (!pte)
-		return NULL;
-
-	if (read_lba(state, le64_to_cpu(gpt->partition_entry_lba),
-                     (u8 *) pte,
-		     count) < count) {
-		kfree(pte);
-                pte=NULL;
-		return NULL;
-	}
-	return pte;
-}
-
-/**
- * alloc_read_gpt_header(): Allocates GPT header, reads into it from disk
- * @state
- * @lba is the Logical Block Address of the partition table
- * 
- * Description: returns GPT header on success, NULL on error.   Allocates
- * and fills a GPT header starting at @ from @state->bdev.
- * Note: remember to free gpt when finished with it.
- */
-static gpt_header *alloc_read_gpt_header(struct parsed_partitions *state,
-					 u64 lba)
-{
-	gpt_header *gpt;
-	unsigned ssz = bdev_logical_block_size(state->bdev);
-
-	gpt = kzalloc(ssz, GFP_KERNEL);
-	if (!gpt)
-		return NULL;
-
-	if (read_lba(state, lba, (u8 *) gpt, ssz) < ssz) {
-		kfree(gpt);
-                gpt=NULL;
-		return NULL;
-	}
-
-	return gpt;
-}
-
-/**
- * is_gpt_valid() - tests one GPT header and PTEs for validity
- * @state
- * @lba is the logical block address of the GPT header to test
- * @gpt is a GPT header ptr, filled on return.
- * @ptes is a PTEs ptr, filled on return.
- *
- * Description: returns 1 if valid,  0 on error.
- * If valid, returns pointers to newly allocated GPT header and PTEs.
- */
-static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
-			gpt_header **gpt, gpt_entry **ptes)
-{
-	u32 crc, origcrc;
-	u64 lastlba;
-
-	if (!ptes)
-		return 0;
-	if (!(*gpt = alloc_read_gpt_header(state, lba)))
-		return 0;
-
-	/* Check the GUID Partition Table signature */
-	if (le64_to_cpu((*gpt)->signature) != GPT_HEADER_SIGNATURE) {
-		pr_debug("GUID Partition Table Header signature is wrong:"
-			 "%lld != %lld\n",
-			 (unsigned long long)le64_to_cpu((*gpt)->signature),
-			 (unsigned long long)GPT_HEADER_SIGNATURE);
-		goto fail;
-	}
-
-	/* Check the GUID Partition Table header size */
-	if (le32_to_cpu((*gpt)->header_size) >
-			bdev_logical_block_size(state->bdev)) {
-		pr_debug("GUID Partition Table Header size is wrong: %u > %u\n",
-			le32_to_cpu((*gpt)->header_size),
-			bdev_logical_block_size(state->bdev));
-		goto fail;
-	}
-
-	/* Check the GUID Partition Table CRC */
-	origcrc = le32_to_cpu((*gpt)->header_crc32);
-	(*gpt)->header_crc32 = 0;
-	crc = efi_crc32((const unsigned char *) (*gpt), le32_to_cpu((*gpt)->header_size));
-
-	if (crc != origcrc) {
-		pr_debug("GUID Partition Table Header CRC is wrong: %x != %x\n",
-			 crc, origcrc);
-		goto fail;
-	}
-	(*gpt)->header_crc32 = cpu_to_le32(origcrc);
-
-	/* Check that the my_lba entry points to the LBA that contains
-	 * the GUID Partition Table */
-	if (le64_to_cpu((*gpt)->my_lba) != lba) {
-		pr_debug("GPT my_lba incorrect: %lld != %lld\n",
-			 (unsigned long long)le64_to_cpu((*gpt)->my_lba),
-			 (unsigned long long)lba);
-		goto fail;
-	}
-
-	/* Check the first_usable_lba and last_usable_lba are
-	 * within the disk.
-	 */
-	lastlba = last_lba(state->bdev);
-	if (le64_to_cpu((*gpt)->first_usable_lba) > lastlba) {
-		pr_debug("GPT: first_usable_lba incorrect: %lld > %lld\n",
-			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba),
-			 (unsigned long long)lastlba);
-		goto fail;
-	}
-	if (le64_to_cpu((*gpt)->last_usable_lba) > lastlba) {
-		pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
-			 (unsigned long long)le64_to_cpu((*gpt)->last_usable_lba),
-			 (unsigned long long)lastlba);
-		goto fail;
-	}
-
-	/* Check that sizeof_partition_entry has the correct value */
-	if (le32_to_cpu((*gpt)->sizeof_partition_entry) != sizeof(gpt_entry)) {
-		pr_debug("GUID Partitition Entry Size check failed.\n");
-		goto fail;
-	}
-
-	if (!(*ptes = alloc_read_gpt_entries(state, *gpt)))
-		goto fail;
-
-	/* Check the GUID Partition Entry Array CRC */
-	crc = efi_crc32((const unsigned char *) (*ptes),
-			le32_to_cpu((*gpt)->num_partition_entries) *
-			le32_to_cpu((*gpt)->sizeof_partition_entry));
-
-	if (crc != le32_to_cpu((*gpt)->partition_entry_array_crc32)) {
-		pr_debug("GUID Partitition Entry Array CRC check failed.\n");
-		goto fail_ptes;
-	}
-
-	/* We're done, all's well */
-	return 1;
-
- fail_ptes:
-	kfree(*ptes);
-	*ptes = NULL;
- fail:
-	kfree(*gpt);
-	*gpt = NULL;
-	return 0;
-}
-
-/**
- * is_pte_valid() - tests one PTE for validity
- * @pte is the pte to check
- * @lastlba is last lba of the disk
- *
- * Description: returns 1 if valid,  0 on error.
- */
-static inline int
-is_pte_valid(const gpt_entry *pte, const u64 lastlba)
-{
-	if ((!efi_guidcmp(pte->partition_type_guid, NULL_GUID)) ||
-	    le64_to_cpu(pte->starting_lba) > lastlba         ||
-	    le64_to_cpu(pte->ending_lba)   > lastlba)
-		return 0;
-	return 1;
-}
-
-/**
- * compare_gpts() - Search disk for valid GPT headers and PTEs
- * @pgpt is the primary GPT header
- * @agpt is the alternate GPT header
- * @lastlba is the last LBA number
- * Description: Returns nothing.  Sanity checks pgpt and agpt fields
- * and prints warnings on discrepancies.
- * 
- */
-static void
-compare_gpts(gpt_header *pgpt, gpt_header *agpt, u64 lastlba)
-{
-	int error_found = 0;
-	if (!pgpt || !agpt)
-		return;
-	if (le64_to_cpu(pgpt->my_lba) != le64_to_cpu(agpt->alternate_lba)) {
-		printk(KERN_WARNING
-		       "GPT:Primary header LBA != Alt. header alternate_lba\n");
-		printk(KERN_WARNING "GPT:%lld != %lld\n",
-		       (unsigned long long)le64_to_cpu(pgpt->my_lba),
-                       (unsigned long long)le64_to_cpu(agpt->alternate_lba));
-		error_found++;
-	}
-	if (le64_to_cpu(pgpt->alternate_lba) != le64_to_cpu(agpt->my_lba)) {
-		printk(KERN_WARNING
-		       "GPT:Primary header alternate_lba != Alt. header my_lba\n");
-		printk(KERN_WARNING "GPT:%lld != %lld\n",
-		       (unsigned long long)le64_to_cpu(pgpt->alternate_lba),
-                       (unsigned long long)le64_to_cpu(agpt->my_lba));
-		error_found++;
-	}
-	if (le64_to_cpu(pgpt->first_usable_lba) !=
-            le64_to_cpu(agpt->first_usable_lba)) {
-		printk(KERN_WARNING "GPT:first_usable_lbas don't match.\n");
-		printk(KERN_WARNING "GPT:%lld != %lld\n",
-		       (unsigned long long)le64_to_cpu(pgpt->first_usable_lba),
-                       (unsigned long long)le64_to_cpu(agpt->first_usable_lba));
-		error_found++;
-	}
-	if (le64_to_cpu(pgpt->last_usable_lba) !=
-            le64_to_cpu(agpt->last_usable_lba)) {
-		printk(KERN_WARNING "GPT:last_usable_lbas don't match.\n");
-		printk(KERN_WARNING "GPT:%lld != %lld\n",
-		       (unsigned long long)le64_to_cpu(pgpt->last_usable_lba),
-                       (unsigned long long)le64_to_cpu(agpt->last_usable_lba));
-		error_found++;
-	}
-	if (efi_guidcmp(pgpt->disk_guid, agpt->disk_guid)) {
-		printk(KERN_WARNING "GPT:disk_guids don't match.\n");
-		error_found++;
-	}
-	if (le32_to_cpu(pgpt->num_partition_entries) !=
-            le32_to_cpu(agpt->num_partition_entries)) {
-		printk(KERN_WARNING "GPT:num_partition_entries don't match: "
-		       "0x%x != 0x%x\n",
-		       le32_to_cpu(pgpt->num_partition_entries),
-		       le32_to_cpu(agpt->num_partition_entries));
-		error_found++;
-	}
-	if (le32_to_cpu(pgpt->sizeof_partition_entry) !=
-            le32_to_cpu(agpt->sizeof_partition_entry)) {
-		printk(KERN_WARNING
-		       "GPT:sizeof_partition_entry values don't match: "
-		       "0x%x != 0x%x\n",
-                       le32_to_cpu(pgpt->sizeof_partition_entry),
-		       le32_to_cpu(agpt->sizeof_partition_entry));
-		error_found++;
-	}
-	if (le32_to_cpu(pgpt->partition_entry_array_crc32) !=
-            le32_to_cpu(agpt->partition_entry_array_crc32)) {
-		printk(KERN_WARNING
-		       "GPT:partition_entry_array_crc32 values don't match: "
-		       "0x%x != 0x%x\n",
-                       le32_to_cpu(pgpt->partition_entry_array_crc32),
-		       le32_to_cpu(agpt->partition_entry_array_crc32));
-		error_found++;
-	}
-	if (le64_to_cpu(pgpt->alternate_lba) != lastlba) {
-		printk(KERN_WARNING
-		       "GPT:Primary header thinks Alt. header is not at the end of the disk.\n");
-		printk(KERN_WARNING "GPT:%lld != %lld\n",
-			(unsigned long long)le64_to_cpu(pgpt->alternate_lba),
-			(unsigned long long)lastlba);
-		error_found++;
-	}
-
-	if (le64_to_cpu(agpt->my_lba) != lastlba) {
-		printk(KERN_WARNING
-		       "GPT:Alternate GPT header not at the end of the disk.\n");
-		printk(KERN_WARNING "GPT:%lld != %lld\n",
-			(unsigned long long)le64_to_cpu(agpt->my_lba),
-			(unsigned long long)lastlba);
-		error_found++;
-	}
-
-	if (error_found)
-		printk(KERN_WARNING
-		       "GPT: Use GNU Parted to correct GPT errors.\n");
-	return;
-}
-
-/**
- * find_valid_gpt() - Search disk for valid GPT headers and PTEs
- * @state
- * @gpt is a GPT header ptr, filled on return.
- * @ptes is a PTEs ptr, filled on return.
- * Description: Returns 1 if valid, 0 on error.
- * If valid, returns pointers to newly allocated GPT header and PTEs.
- * Validity depends on PMBR being valid (or being overridden by the
- * 'gpt' kernel command line option) and finding either the Primary
- * GPT header and PTEs valid, or the Alternate GPT header and PTEs
- * valid.  If the Primary GPT header is not valid, the Alternate GPT header
- * is not checked unless the 'gpt' kernel command line option is passed.
- * This protects against devices which misreport their size, and forces
- * the user to decide to use the Alternate GPT.
- */
-static int find_valid_gpt(struct parsed_partitions *state, gpt_header **gpt,
-			  gpt_entry **ptes)
-{
-	int good_pgpt = 0, good_agpt = 0, good_pmbr = 0;
-	gpt_header *pgpt = NULL, *agpt = NULL;
-	gpt_entry *pptes = NULL, *aptes = NULL;
-	legacy_mbr *legacymbr;
-	u64 lastlba;
-
-	if (!ptes)
-		return 0;
-
-	lastlba = last_lba(state->bdev);
-        if (!force_gpt) {
-                /* This will be added to the EFI Spec. per Intel after v1.02. */
-                legacymbr = kzalloc(sizeof (*legacymbr), GFP_KERNEL);
-                if (legacymbr) {
-                        read_lba(state, 0, (u8 *) legacymbr,
-				 sizeof (*legacymbr));
-                        good_pmbr = is_pmbr_valid(legacymbr);
-                        kfree(legacymbr);
-                }
-                if (!good_pmbr)
-                        goto fail;
-        }
-
-	good_pgpt = is_gpt_valid(state, GPT_PRIMARY_PARTITION_TABLE_LBA,
-				 &pgpt, &pptes);
-        if (good_pgpt)
-		good_agpt = is_gpt_valid(state,
-					 le64_to_cpu(pgpt->alternate_lba),
-					 &agpt, &aptes);
-        if (!good_agpt && force_gpt)
-                good_agpt = is_gpt_valid(state, lastlba, &agpt, &aptes);
-
-        /* The obviously unsuccessful case */
-        if (!good_pgpt && !good_agpt)
-                goto fail;
-
-        compare_gpts(pgpt, agpt, lastlba);
-
-        /* The good cases */
-        if (good_pgpt) {
-                *gpt  = pgpt;
-                *ptes = pptes;
-                kfree(agpt);
-                kfree(aptes);
-                if (!good_agpt) {
-                        printk(KERN_WARNING 
-			       "Alternate GPT is invalid, "
-                               "using primary GPT.\n");
-                }
-                return 1;
-        }
-        else if (good_agpt) {
-                *gpt  = agpt;
-                *ptes = aptes;
-                kfree(pgpt);
-                kfree(pptes);
-                printk(KERN_WARNING 
-                       "Primary GPT is invalid, using alternate GPT.\n");
-                return 1;
-        }
-
- fail:
-        kfree(pgpt);
-        kfree(agpt);
-        kfree(pptes);
-        kfree(aptes);
-        *gpt = NULL;
-        *ptes = NULL;
-        return 0;
-}
-
-/**
- * efi_partition(struct parsed_partitions *state)
- * @state
- *
- * Description: called from check.c, if the disk contains GPT
- * partitions, sets up partition entries in the kernel.
- *
- * If the first block on the disk is a legacy MBR,
- * it will get handled by msdos_partition().
- * If it's a Protective MBR, we'll handle it here.
- *
- * We do not create a Linux partition for GPT, but
- * only for the actual data partitions.
- * Returns:
- * -1 if unable to read the partition table
- *  0 if this isn't our partition table
- *  1 if successful
- *
- */
-int efi_partition(struct parsed_partitions *state)
-{
-	gpt_header *gpt = NULL;
-	gpt_entry *ptes = NULL;
-	u32 i;
-	unsigned ssz = bdev_logical_block_size(state->bdev) / 512;
-	u8 unparsed_guid[37];
-
-	if (!find_valid_gpt(state, &gpt, &ptes) || !gpt || !ptes) {
-		kfree(gpt);
-		kfree(ptes);
-		return 0;
-	}
-
-	pr_debug("GUID Partition Table is valid!  Yea!\n");
-
-	for (i = 0; i < le32_to_cpu(gpt->num_partition_entries) && i < state->limit-1; i++) {
-		struct partition_meta_info *info;
-		unsigned label_count = 0;
-		unsigned label_max;
-		u64 start = le64_to_cpu(ptes[i].starting_lba);
-		u64 size = le64_to_cpu(ptes[i].ending_lba) -
-			   le64_to_cpu(ptes[i].starting_lba) + 1ULL;
-
-		if (!is_pte_valid(&ptes[i], last_lba(state->bdev)))
-			continue;
-
-		put_partition(state, i+1, start * ssz, size * ssz);
-
-		/* If this is a RAID volume, tell md */
-		if (!efi_guidcmp(ptes[i].partition_type_guid,
-				 PARTITION_LINUX_RAID_GUID))
-			state->parts[i + 1].flags = ADDPART_FLAG_RAID;
-
-		info = &state->parts[i + 1].info;
-		/* Instead of doing a manual swap to big endian, reuse the
-		 * common ASCII hex format as the interim.
-		 */
-		efi_guid_unparse(&ptes[i].unique_partition_guid, unparsed_guid);
-		part_pack_uuid(unparsed_guid, info->uuid);
-
-		/* Naively convert UTF16-LE to 7 bits. */
-		label_max = min(sizeof(info->volname) - 1,
-				sizeof(ptes[i].partition_name));
-		info->volname[label_max] = 0;
-		while (label_count < label_max) {
-			u8 c = ptes[i].partition_name[label_count] & 0xff;
-			if (c && !isprint(c))
-				c = '!';
-			info->volname[label_count] = c;
-			label_count++;
-		}
-		state->parts[i + 1].has_info = true;
-	}
-	kfree(ptes);
-	kfree(gpt);
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	return 1;
-}
diff --git a/fs/partitions/efi.h b/fs/partitions/efi.h
deleted file mode 100644
index b69ab729558..00000000000
--- a/fs/partitions/efi.h
+++ /dev/null
@@ -1,134 +0,0 @@
-/************************************************************
- * EFI GUID Partition Table
- * Per Intel EFI Specification v1.02
- * http://developer.intel.com/technology/efi/efi.htm
- *
- * By Matt Domsch <Matt_Domsch@dell.com>  Fri Sep 22 22:15:56 CDT 2000  
- *   Copyright 2000,2001 Dell Inc.
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- * 
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- * 
- ************************************************************/
-
-#ifndef FS_PART_EFI_H_INCLUDED
-#define FS_PART_EFI_H_INCLUDED
-
-#include <linux/types.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/kernel.h>
-#include <linux/major.h>
-#include <linux/string.h>
-#include <linux/efi.h>
-
-#define MSDOS_MBR_SIGNATURE 0xaa55
-#define EFI_PMBR_OSTYPE_EFI 0xEF
-#define EFI_PMBR_OSTYPE_EFI_GPT 0xEE
-
-#define GPT_HEADER_SIGNATURE 0x5452415020494645ULL
-#define GPT_HEADER_REVISION_V1 0x00010000
-#define GPT_PRIMARY_PARTITION_TABLE_LBA 1
-
-#define PARTITION_SYSTEM_GUID \
-    EFI_GUID( 0xC12A7328, 0xF81F, 0x11d2, \
-              0xBA, 0x4B, 0x00, 0xA0, 0xC9, 0x3E, 0xC9, 0x3B) 
-#define LEGACY_MBR_PARTITION_GUID \
-    EFI_GUID( 0x024DEE41, 0x33E7, 0x11d3, \
-              0x9D, 0x69, 0x00, 0x08, 0xC7, 0x81, 0xF3, 0x9F)
-#define PARTITION_MSFT_RESERVED_GUID \
-    EFI_GUID( 0xE3C9E316, 0x0B5C, 0x4DB8, \
-              0x81, 0x7D, 0xF9, 0x2D, 0xF0, 0x02, 0x15, 0xAE)
-#define PARTITION_BASIC_DATA_GUID \
-    EFI_GUID( 0xEBD0A0A2, 0xB9E5, 0x4433, \
-              0x87, 0xC0, 0x68, 0xB6, 0xB7, 0x26, 0x99, 0xC7)
-#define PARTITION_LINUX_RAID_GUID \
-    EFI_GUID( 0xa19d880f, 0x05fc, 0x4d3b, \
-              0xa0, 0x06, 0x74, 0x3f, 0x0f, 0x84, 0x91, 0x1e)
-#define PARTITION_LINUX_SWAP_GUID \
-    EFI_GUID( 0x0657fd6d, 0xa4ab, 0x43c4, \
-              0x84, 0xe5, 0x09, 0x33, 0xc8, 0x4b, 0x4f, 0x4f)
-#define PARTITION_LINUX_LVM_GUID \
-    EFI_GUID( 0xe6d6d379, 0xf507, 0x44c2, \
-              0xa2, 0x3c, 0x23, 0x8f, 0x2a, 0x3d, 0xf9, 0x28)
-
-typedef struct _gpt_header {
-	__le64 signature;
-	__le32 revision;
-	__le32 header_size;
-	__le32 header_crc32;
-	__le32 reserved1;
-	__le64 my_lba;
-	__le64 alternate_lba;
-	__le64 first_usable_lba;
-	__le64 last_usable_lba;
-	efi_guid_t disk_guid;
-	__le64 partition_entry_lba;
-	__le32 num_partition_entries;
-	__le32 sizeof_partition_entry;
-	__le32 partition_entry_array_crc32;
-
-	/* The rest of the logical block is reserved by UEFI and must be zero.
-	 * EFI standard handles this by:
-	 *
-	 * uint8_t		reserved2[ BlockSize - 92 ];
-	 */
-} __attribute__ ((packed)) gpt_header;
-
-typedef struct _gpt_entry_attributes {
-	u64 required_to_function:1;
-	u64 reserved:47;
-        u64 type_guid_specific:16;
-} __attribute__ ((packed)) gpt_entry_attributes;
-
-typedef struct _gpt_entry {
-	efi_guid_t partition_type_guid;
-	efi_guid_t unique_partition_guid;
-	__le64 starting_lba;
-	__le64 ending_lba;
-	gpt_entry_attributes attributes;
-	efi_char16_t partition_name[72 / sizeof (efi_char16_t)];
-} __attribute__ ((packed)) gpt_entry;
-
-typedef struct _legacy_mbr {
-	u8 boot_code[440];
-	__le32 unique_mbr_signature;
-	__le16 unknown;
-	struct partition partition_record[4];
-	__le16 signature;
-} __attribute__ ((packed)) legacy_mbr;
-
-/* Functions */
-extern int efi_partition(struct parsed_partitions *state);
-
-#endif
-
-/*
- * Overrides for Emacs so that we follow Linus's tabbing style.
- * Emacs will notice this stuff at the end of the file and automatically
- * adjust the settings for this buffer only.  This must remain at the end
- * of the file.
- * --------------------------------------------------------------------------
- * Local variables:
- * c-indent-level: 4 
- * c-brace-imaginary-offset: 0
- * c-brace-offset: -4
- * c-argdecl-indent: 4
- * c-label-offset: -4
- * c-continued-statement-offset: 4
- * c-continued-brace-offset: 0
- * indent-tabs-mode: nil
- * tab-width: 8
- * End:
- */
diff --git a/fs/partitions/ibm.c b/fs/partitions/ibm.c
deleted file mode 100644
index d513a07f44b..00000000000
--- a/fs/partitions/ibm.c
+++ /dev/null
@@ -1,275 +0,0 @@
-/*
- * File...........: linux/fs/partitions/ibm.c
- * Author(s)......: Holger Smolinski <Holger.Smolinski@de.ibm.com>
- *                  Volker Sameske <sameske@de.ibm.com>
- * Bugreports.to..: <Linux390@de.ibm.com>
- * (C) IBM Corporation, IBM Deutschland Entwicklung GmbH, 1999,2000
- */
-
-#include <linux/buffer_head.h>
-#include <linux/hdreg.h>
-#include <linux/slab.h>
-#include <asm/dasd.h>
-#include <asm/ebcdic.h>
-#include <asm/uaccess.h>
-#include <asm/vtoc.h>
-
-#include "check.h"
-#include "ibm.h"
-
-/*
- * compute the block number from a
- * cyl-cyl-head-head structure
- */
-static sector_t
-cchh2blk (struct vtoc_cchh *ptr, struct hd_geometry *geo) {
-
-	sector_t cyl;
-	__u16 head;
-
-	/*decode cylinder and heads for large volumes */
-	cyl = ptr->hh & 0xFFF0;
-	cyl <<= 12;
-	cyl |= ptr->cc;
-	head = ptr->hh & 0x000F;
-	return cyl * geo->heads * geo->sectors +
-	       head * geo->sectors;
-}
-
-/*
- * compute the block number from a
- * cyl-cyl-head-head-block structure
- */
-static sector_t
-cchhb2blk (struct vtoc_cchhb *ptr, struct hd_geometry *geo) {
-
-	sector_t cyl;
-	__u16 head;
-
-	/*decode cylinder and heads for large volumes */
-	cyl = ptr->hh & 0xFFF0;
-	cyl <<= 12;
-	cyl |= ptr->cc;
-	head = ptr->hh & 0x000F;
-	return	cyl * geo->heads * geo->sectors +
-		head * geo->sectors +
-		ptr->b;
-}
-
-/*
- */
-int ibm_partition(struct parsed_partitions *state)
-{
-	struct block_device *bdev = state->bdev;
-	int blocksize, res;
-	loff_t i_size, offset, size, fmt_size;
-	dasd_information2_t *info;
-	struct hd_geometry *geo;
-	char type[5] = {0,};
-	char name[7] = {0,};
-	union label_t {
-		struct vtoc_volume_label_cdl vol;
-		struct vtoc_volume_label_ldl lnx;
-		struct vtoc_cms_label cms;
-	} *label;
-	unsigned char *data;
-	Sector sect;
-	sector_t labelsect;
-	char tmp[64];
-
-	res = 0;
-	blocksize = bdev_logical_block_size(bdev);
-	if (blocksize <= 0)
-		goto out_exit;
-	i_size = i_size_read(bdev->bd_inode);
-	if (i_size == 0)
-		goto out_exit;
-
-	info = kmalloc(sizeof(dasd_information2_t), GFP_KERNEL);
-	if (info == NULL)
-		goto out_exit;
-	geo = kmalloc(sizeof(struct hd_geometry), GFP_KERNEL);
-	if (geo == NULL)
-		goto out_nogeo;
-	label = kmalloc(sizeof(union label_t), GFP_KERNEL);
-	if (label == NULL)
-		goto out_nolab;
-
-	if (ioctl_by_bdev(bdev, BIODASDINFO2, (unsigned long)info) != 0 ||
-	    ioctl_by_bdev(bdev, HDIO_GETGEO, (unsigned long)geo) != 0)
-		goto out_freeall;
-
-	/*
-	 * Special case for FBA disks: label sector does not depend on
-	 * blocksize.
-	 */
-	if ((info->cu_type == 0x6310 && info->dev_type == 0x9336) ||
-	    (info->cu_type == 0x3880 && info->dev_type == 0x3370))
-		labelsect = info->label_block;
-	else
-		labelsect = info->label_block * (blocksize >> 9);
-
-	/*
-	 * Get volume label, extract name and type.
-	 */
-	data = read_part_sector(state, labelsect, &sect);
-	if (data == NULL)
-		goto out_readerr;
-
-	memcpy(label, data, sizeof(union label_t));
-	put_dev_sector(sect);
-
-	if ((!info->FBA_layout) && (!strcmp(info->type, "ECKD"))) {
-		strncpy(type, label->vol.vollbl, 4);
-		strncpy(name, label->vol.volid, 6);
-	} else {
-		strncpy(type, label->lnx.vollbl, 4);
-		strncpy(name, label->lnx.volid, 6);
-	}
-	EBCASC(type, 4);
-	EBCASC(name, 6);
-
-	res = 1;
-
-	/*
-	 * Three different formats: LDL, CDL and unformated disk
-	 *
-	 * identified by info->format
-	 *
-	 * unformated disks we do not have to care about
-	 */
-	if (info->format == DASD_FORMAT_LDL) {
-		if (strncmp(type, "CMS1", 4) == 0) {
-			/*
-			 * VM style CMS1 labeled disk
-			 */
-			blocksize = label->cms.block_size;
-			if (label->cms.disk_offset != 0) {
-				snprintf(tmp, sizeof(tmp), "CMS1/%8s(MDSK):", name);
-				strlcat(state->pp_buf, tmp, PAGE_SIZE);
-				/* disk is reserved minidisk */
-				offset = label->cms.disk_offset;
-				size = (label->cms.block_count - 1)
-					* (blocksize >> 9);
-			} else {
-				snprintf(tmp, sizeof(tmp), "CMS1/%8s:", name);
-				strlcat(state->pp_buf, tmp, PAGE_SIZE);
-				offset = (info->label_block + 1);
-				size = label->cms.block_count
-					* (blocksize >> 9);
-			}
-			put_partition(state, 1, offset*(blocksize >> 9),
-				      size-offset*(blocksize >> 9));
-		} else {
-			if (strncmp(type, "LNX1", 4) == 0) {
-				snprintf(tmp, sizeof(tmp), "LNX1/%8s:", name);
-				strlcat(state->pp_buf, tmp, PAGE_SIZE);
-				if (label->lnx.ldl_version == 0xf2) {
-					fmt_size = label->lnx.formatted_blocks
-						* (blocksize >> 9);
-				} else if (!strcmp(info->type, "ECKD")) {
-					/* formated w/o large volume support */
-					fmt_size = geo->cylinders * geo->heads
-					      * geo->sectors * (blocksize >> 9);
-				} else {
-					/* old label and no usable disk geometry
-					 * (e.g. DIAG) */
-					fmt_size = i_size >> 9;
-				}
-				size = i_size >> 9;
-				if (fmt_size < size)
-					size = fmt_size;
-				offset = (info->label_block + 1);
-			} else {
-				/* unlabeled disk */
-				strlcat(state->pp_buf, "(nonl)", PAGE_SIZE);
-				size = i_size >> 9;
-				offset = (info->label_block + 1);
-			}
-			put_partition(state, 1, offset*(blocksize >> 9),
-				      size-offset*(blocksize >> 9));
-		}
-	} else if (info->format == DASD_FORMAT_CDL) {
-		/*
-		 * New style CDL formatted disk
-		 */
-		sector_t blk;
-		int counter;
-
-		/*
-		 * check if VOL1 label is available
-		 * if not, something is wrong, skipping partition detection
-		 */
-		if (strncmp(type, "VOL1",  4) == 0) {
-			snprintf(tmp, sizeof(tmp), "VOL1/%8s:", name);
-			strlcat(state->pp_buf, tmp, PAGE_SIZE);
-			/*
-			 * get block number and read then go through format1
-			 * labels
-			 */
-			blk = cchhb2blk(&label->vol.vtoc, geo) + 1;
-			counter = 0;
-			data = read_part_sector(state, blk * (blocksize/512),
-						&sect);
-			while (data != NULL) {
-				struct vtoc_format1_label f1;
-
-				memcpy(&f1, data,
-				       sizeof(struct vtoc_format1_label));
-				put_dev_sector(sect);
-
-				/* skip FMT4 / FMT5 / FMT7 labels */
-				if (f1.DS1FMTID == _ascebc['4']
-				    || f1.DS1FMTID == _ascebc['5']
-				    || f1.DS1FMTID == _ascebc['7']
-				    || f1.DS1FMTID == _ascebc['9']) {
-					blk++;
-					data = read_part_sector(state,
-						blk * (blocksize/512), &sect);
-					continue;
-				}
-
-				/* only FMT1 and 8 labels valid at this point */
-				if (f1.DS1FMTID != _ascebc['1'] &&
-				    f1.DS1FMTID != _ascebc['8'])
-					break;
-
-				/* OK, we got valid partition data */
-				offset = cchh2blk(&f1.DS1EXT1.llimit, geo);
-				size  = cchh2blk(&f1.DS1EXT1.ulimit, geo) -
-					offset + geo->sectors;
-				if (counter >= state->limit)
-					break;
-				put_partition(state, counter + 1,
-					      offset * (blocksize >> 9),
-					      size * (blocksize >> 9));
-				counter++;
-				blk++;
-				data = read_part_sector(state,
-						blk * (blocksize/512), &sect);
-			}
-
-			if (!data)
-				/* Are we not supposed to report this ? */
-				goto out_readerr;
-		} else
-			printk(KERN_WARNING "Warning, expected Label VOL1 not "
-			       "found, treating as CDL formated Disk");
-
-	}
-
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	goto out_freeall;
-
-
-out_readerr:
-	res = -1;
-out_freeall:
-	kfree(label);
-out_nolab:
-	kfree(geo);
-out_nogeo:
-	kfree(info);
-out_exit:
-	return res;
-}
diff --git a/fs/partitions/ibm.h b/fs/partitions/ibm.h
deleted file mode 100644
index 08fb0804a81..00000000000
--- a/fs/partitions/ibm.h
+++ /dev/null
@@ -1 +0,0 @@
-int ibm_partition(struct parsed_partitions *);
diff --git a/fs/partitions/karma.c b/fs/partitions/karma.c
deleted file mode 100644
index 0ea19312706..00000000000
--- a/fs/partitions/karma.c
+++ /dev/null
@@ -1,57 +0,0 @@
-/*
- *  fs/partitions/karma.c
- *  Rio Karma partition info.
- *
- *  Copyright (C) 2006 Bob Copeland (me@bobcopeland.com)
- *  based on osf.c
- */
-
-#include "check.h"
-#include "karma.h"
-
-int karma_partition(struct parsed_partitions *state)
-{
-	int i;
-	int slot = 1;
-	Sector sect;
-	unsigned char *data;
-	struct disklabel {
-		u8 d_reserved[270];
-		struct d_partition {
-			__le32 p_res;
-			u8 p_fstype;
-			u8 p_res2[3];
-			__le32 p_offset;
-			__le32 p_size;
-		} d_partitions[2];
-		u8 d_blank[208];
-		__le16 d_magic;
-	} __attribute__((packed)) *label;
-	struct d_partition *p;
-
-	data = read_part_sector(state, 0, &sect);
-	if (!data)
-		return -1;
-
-	label = (struct disklabel *)data;
-	if (le16_to_cpu(label->d_magic) != KARMA_LABEL_MAGIC) {
-		put_dev_sector(sect);
-		return 0;
-	}
-
-	p = label->d_partitions;
-	for (i = 0 ; i < 2; i++, p++) {
-		if (slot == state->limit)
-			break;
-
-		if (p->p_fstype == 0x4d && le32_to_cpu(p->p_size)) {
-			put_partition(state, slot, le32_to_cpu(p->p_offset),
-				le32_to_cpu(p->p_size));
-		}
-		slot++;
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	put_dev_sector(sect);
-	return 1;
-}
-
diff --git a/fs/partitions/karma.h b/fs/partitions/karma.h
deleted file mode 100644
index c764b2e9df2..00000000000
--- a/fs/partitions/karma.h
+++ /dev/null
@@ -1,8 +0,0 @@
-/*
- *  fs/partitions/karma.h
- */
-
-#define KARMA_LABEL_MAGIC		0xAB56
-
-int karma_partition(struct parsed_partitions *state);
-
diff --git a/fs/partitions/ldm.c b/fs/partitions/ldm.c
deleted file mode 100644
index bd8ae788f68..00000000000
--- a/fs/partitions/ldm.c
+++ /dev/null
@@ -1,1570 +0,0 @@
-/**
- * ldm - Support for Windows Logical Disk Manager (Dynamic Disks)
- *
- * Copyright (C) 2001,2002 Richard Russon <ldm@flatcap.org>
- * Copyright (c) 2001-2007 Anton Altaparmakov
- * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
- *
- * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
- *
- * This program is free software; you can redistribute it and/or modify it under
- * the terms of the GNU General Public License as published by the Free Software
- * Foundation; either version 2 of the License, or (at your option) any later
- * version.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
- * FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program (in the main directory of the source in the file COPYING); if
- * not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330,
- * Boston, MA  02111-1307  USA
- */
-
-#include <linux/slab.h>
-#include <linux/pagemap.h>
-#include <linux/stringify.h>
-#include <linux/kernel.h>
-#include "ldm.h"
-#include "check.h"
-#include "msdos.h"
-
-/**
- * ldm_debug/info/error/crit - Output an error message
- * @f:    A printf format string containing the message
- * @...:  Variables to substitute into @f
- *
- * ldm_debug() writes a DEBUG level message to the syslog but only if the
- * driver was compiled with debug enabled. Otherwise, the call turns into a NOP.
- */
-#ifndef CONFIG_LDM_DEBUG
-#define ldm_debug(...)	do {} while (0)
-#else
-#define ldm_debug(f, a...) _ldm_printk (KERN_DEBUG, __func__, f, ##a)
-#endif
-
-#define ldm_crit(f, a...)  _ldm_printk (KERN_CRIT,  __func__, f, ##a)
-#define ldm_error(f, a...) _ldm_printk (KERN_ERR,   __func__, f, ##a)
-#define ldm_info(f, a...)  _ldm_printk (KERN_INFO,  __func__, f, ##a)
-
-static __printf(3, 4)
-void _ldm_printk(const char *level, const char *function, const char *fmt, ...)
-{
-	struct va_format vaf;
-	va_list args;
-
-	va_start (args, fmt);
-
-	vaf.fmt = fmt;
-	vaf.va = &args;
-
-	printk("%s%s(): %pV\n", level, function, &vaf);
-
-	va_end(args);
-}
-
-/**
- * ldm_parse_hexbyte - Convert a ASCII hex number to a byte
- * @src:  Pointer to at least 2 characters to convert.
- *
- * Convert a two character ASCII hex string to a number.
- *
- * Return:  0-255  Success, the byte was parsed correctly
- *          -1     Error, an invalid character was supplied
- */
-static int ldm_parse_hexbyte (const u8 *src)
-{
-	unsigned int x;		/* For correct wrapping */
-	int h;
-
-	/* high part */
-	x = h = hex_to_bin(src[0]);
-	if (h < 0)
-		return -1;
-
-	/* low part */
-	h = hex_to_bin(src[1]);
-	if (h < 0)
-		return -1;
-
-	return (x << 4) + h;
-}
-
-/**
- * ldm_parse_guid - Convert GUID from ASCII to binary
- * @src:   36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba
- * @dest:  Memory block to hold binary GUID (16 bytes)
- *
- * N.B. The GUID need not be NULL terminated.
- *
- * Return:  'true'   @dest contains binary GUID
- *          'false'  @dest contents are undefined
- */
-static bool ldm_parse_guid (const u8 *src, u8 *dest)
-{
-	static const int size[] = { 4, 2, 2, 2, 6 };
-	int i, j, v;
-
-	if (src[8]  != '-' || src[13] != '-' ||
-	    src[18] != '-' || src[23] != '-')
-		return false;
-
-	for (j = 0; j < 5; j++, src++)
-		for (i = 0; i < size[j]; i++, src+=2, *dest++ = v)
-			if ((v = ldm_parse_hexbyte (src)) < 0)
-				return false;
-
-	return true;
-}
-
-/**
- * ldm_parse_privhead - Read the LDM Database PRIVHEAD structure
- * @data:  Raw database PRIVHEAD structure loaded from the device
- * @ph:    In-memory privhead structure in which to return parsed information
- *
- * This parses the LDM database PRIVHEAD structure supplied in @data and
- * sets up the in-memory privhead structure @ph with the obtained information.
- *
- * Return:  'true'   @ph contains the PRIVHEAD data
- *          'false'  @ph contents are undefined
- */
-static bool ldm_parse_privhead(const u8 *data, struct privhead *ph)
-{
-	bool is_vista = false;
-
-	BUG_ON(!data || !ph);
-	if (MAGIC_PRIVHEAD != get_unaligned_be64(data)) {
-		ldm_error("Cannot find PRIVHEAD structure. LDM database is"
-			" corrupt. Aborting.");
-		return false;
-	}
-	ph->ver_major = get_unaligned_be16(data + 0x000C);
-	ph->ver_minor = get_unaligned_be16(data + 0x000E);
-	ph->logical_disk_start = get_unaligned_be64(data + 0x011B);
-	ph->logical_disk_size = get_unaligned_be64(data + 0x0123);
-	ph->config_start = get_unaligned_be64(data + 0x012B);
-	ph->config_size = get_unaligned_be64(data + 0x0133);
-	/* Version 2.11 is Win2k/XP and version 2.12 is Vista. */
-	if (ph->ver_major == 2 && ph->ver_minor == 12)
-		is_vista = true;
-	if (!is_vista && (ph->ver_major != 2 || ph->ver_minor != 11)) {
-		ldm_error("Expected PRIVHEAD version 2.11 or 2.12, got %d.%d."
-			" Aborting.", ph->ver_major, ph->ver_minor);
-		return false;
-	}
-	ldm_debug("PRIVHEAD version %d.%d (Windows %s).", ph->ver_major,
-			ph->ver_minor, is_vista ? "Vista" : "2000/XP");
-	if (ph->config_size != LDM_DB_SIZE) {	/* 1 MiB in sectors. */
-		/* Warn the user and continue, carefully. */
-		ldm_info("Database is normally %u bytes, it claims to "
-			"be %llu bytes.", LDM_DB_SIZE,
-			(unsigned long long)ph->config_size);
-	}
-	if ((ph->logical_disk_size == 0) || (ph->logical_disk_start +
-			ph->logical_disk_size > ph->config_start)) {
-		ldm_error("PRIVHEAD disk size doesn't match real disk size");
-		return false;
-	}
-	if (!ldm_parse_guid(data + 0x0030, ph->disk_id)) {
-		ldm_error("PRIVHEAD contains an invalid GUID.");
-		return false;
-	}
-	ldm_debug("Parsed PRIVHEAD successfully.");
-	return true;
-}
-
-/**
- * ldm_parse_tocblock - Read the LDM Database TOCBLOCK structure
- * @data:  Raw database TOCBLOCK structure loaded from the device
- * @toc:   In-memory toc structure in which to return parsed information
- *
- * This parses the LDM Database TOCBLOCK (table of contents) structure supplied
- * in @data and sets up the in-memory tocblock structure @toc with the obtained
- * information.
- *
- * N.B.  The *_start and *_size values returned in @toc are not range-checked.
- *
- * Return:  'true'   @toc contains the TOCBLOCK data
- *          'false'  @toc contents are undefined
- */
-static bool ldm_parse_tocblock (const u8 *data, struct tocblock *toc)
-{
-	BUG_ON (!data || !toc);
-
-	if (MAGIC_TOCBLOCK != get_unaligned_be64(data)) {
-		ldm_crit ("Cannot find TOCBLOCK, database may be corrupt.");
-		return false;
-	}
-	strncpy (toc->bitmap1_name, data + 0x24, sizeof (toc->bitmap1_name));
-	toc->bitmap1_name[sizeof (toc->bitmap1_name) - 1] = 0;
-	toc->bitmap1_start = get_unaligned_be64(data + 0x2E);
-	toc->bitmap1_size  = get_unaligned_be64(data + 0x36);
-
-	if (strncmp (toc->bitmap1_name, TOC_BITMAP1,
-			sizeof (toc->bitmap1_name)) != 0) {
-		ldm_crit ("TOCBLOCK's first bitmap is '%s', should be '%s'.",
-				TOC_BITMAP1, toc->bitmap1_name);
-		return false;
-	}
-	strncpy (toc->bitmap2_name, data + 0x46, sizeof (toc->bitmap2_name));
-	toc->bitmap2_name[sizeof (toc->bitmap2_name) - 1] = 0;
-	toc->bitmap2_start = get_unaligned_be64(data + 0x50);
-	toc->bitmap2_size  = get_unaligned_be64(data + 0x58);
-	if (strncmp (toc->bitmap2_name, TOC_BITMAP2,
-			sizeof (toc->bitmap2_name)) != 0) {
-		ldm_crit ("TOCBLOCK's second bitmap is '%s', should be '%s'.",
-				TOC_BITMAP2, toc->bitmap2_name);
-		return false;
-	}
-	ldm_debug ("Parsed TOCBLOCK successfully.");
-	return true;
-}
-
-/**
- * ldm_parse_vmdb - Read the LDM Database VMDB structure
- * @data:  Raw database VMDB structure loaded from the device
- * @vm:    In-memory vmdb structure in which to return parsed information
- *
- * This parses the LDM Database VMDB structure supplied in @data and sets up
- * the in-memory vmdb structure @vm with the obtained information.
- *
- * N.B.  The *_start, *_size and *_seq values will be range-checked later.
- *
- * Return:  'true'   @vm contains VMDB info
- *          'false'  @vm contents are undefined
- */
-static bool ldm_parse_vmdb (const u8 *data, struct vmdb *vm)
-{
-	BUG_ON (!data || !vm);
-
-	if (MAGIC_VMDB != get_unaligned_be32(data)) {
-		ldm_crit ("Cannot find the VMDB, database may be corrupt.");
-		return false;
-	}
-
-	vm->ver_major = get_unaligned_be16(data + 0x12);
-	vm->ver_minor = get_unaligned_be16(data + 0x14);
-	if ((vm->ver_major != 4) || (vm->ver_minor != 10)) {
-		ldm_error ("Expected VMDB version %d.%d, got %d.%d. "
-			"Aborting.", 4, 10, vm->ver_major, vm->ver_minor);
-		return false;
-	}
-
-	vm->vblk_size     = get_unaligned_be32(data + 0x08);
-	if (vm->vblk_size == 0) {
-		ldm_error ("Illegal VBLK size");
-		return false;
-	}
-
-	vm->vblk_offset   = get_unaligned_be32(data + 0x0C);
-	vm->last_vblk_seq = get_unaligned_be32(data + 0x04);
-
-	ldm_debug ("Parsed VMDB successfully.");
-	return true;
-}
-
-/**
- * ldm_compare_privheads - Compare two privhead objects
- * @ph1:  First privhead
- * @ph2:  Second privhead
- *
- * This compares the two privhead structures @ph1 and @ph2.
- *
- * Return:  'true'   Identical
- *          'false'  Different
- */
-static bool ldm_compare_privheads (const struct privhead *ph1,
-				   const struct privhead *ph2)
-{
-	BUG_ON (!ph1 || !ph2);
-
-	return ((ph1->ver_major          == ph2->ver_major)		&&
-		(ph1->ver_minor          == ph2->ver_minor)		&&
-		(ph1->logical_disk_start == ph2->logical_disk_start)	&&
-		(ph1->logical_disk_size  == ph2->logical_disk_size)	&&
-		(ph1->config_start       == ph2->config_start)		&&
-		(ph1->config_size        == ph2->config_size)		&&
-		!memcmp (ph1->disk_id, ph2->disk_id, GUID_SIZE));
-}
-
-/**
- * ldm_compare_tocblocks - Compare two tocblock objects
- * @toc1:  First toc
- * @toc2:  Second toc
- *
- * This compares the two tocblock structures @toc1 and @toc2.
- *
- * Return:  'true'   Identical
- *          'false'  Different
- */
-static bool ldm_compare_tocblocks (const struct tocblock *toc1,
-				   const struct tocblock *toc2)
-{
-	BUG_ON (!toc1 || !toc2);
-
-	return ((toc1->bitmap1_start == toc2->bitmap1_start)	&&
-		(toc1->bitmap1_size  == toc2->bitmap1_size)	&&
-		(toc1->bitmap2_start == toc2->bitmap2_start)	&&
-		(toc1->bitmap2_size  == toc2->bitmap2_size)	&&
-		!strncmp (toc1->bitmap1_name, toc2->bitmap1_name,
-			sizeof (toc1->bitmap1_name))		&&
-		!strncmp (toc1->bitmap2_name, toc2->bitmap2_name,
-			sizeof (toc1->bitmap2_name)));
-}
-
-/**
- * ldm_validate_privheads - Compare the primary privhead with its backups
- * @state: Partition check state including device holding the LDM Database
- * @ph1:   Memory struct to fill with ph contents
- *
- * Read and compare all three privheads from disk.
- *
- * The privheads on disk show the size and location of the main disk area and
- * the configuration area (the database).  The values are range-checked against
- * @hd, which contains the real size of the disk.
- *
- * Return:  'true'   Success
- *          'false'  Error
- */
-static bool ldm_validate_privheads(struct parsed_partitions *state,
-				   struct privhead *ph1)
-{
-	static const int off[3] = { OFF_PRIV1, OFF_PRIV2, OFF_PRIV3 };
-	struct privhead *ph[3] = { ph1 };
-	Sector sect;
-	u8 *data;
-	bool result = false;
-	long num_sects;
-	int i;
-
-	BUG_ON (!state || !ph1);
-
-	ph[1] = kmalloc (sizeof (*ph[1]), GFP_KERNEL);
-	ph[2] = kmalloc (sizeof (*ph[2]), GFP_KERNEL);
-	if (!ph[1] || !ph[2]) {
-		ldm_crit ("Out of memory.");
-		goto out;
-	}
-
-	/* off[1 & 2] are relative to ph[0]->config_start */
-	ph[0]->config_start = 0;
-
-	/* Read and parse privheads */
-	for (i = 0; i < 3; i++) {
-		data = read_part_sector(state, ph[0]->config_start + off[i],
-					&sect);
-		if (!data) {
-			ldm_crit ("Disk read failed.");
-			goto out;
-		}
-		result = ldm_parse_privhead (data, ph[i]);
-		put_dev_sector (sect);
-		if (!result) {
-			ldm_error ("Cannot find PRIVHEAD %d.", i+1); /* Log again */
-			if (i < 2)
-				goto out;	/* Already logged */
-			else
-				break;	/* FIXME ignore for now, 3rd PH can fail on odd-sized disks */
-		}
-	}
-
-	num_sects = state->bdev->bd_inode->i_size >> 9;
-
-	if ((ph[0]->config_start > num_sects) ||
-	   ((ph[0]->config_start + ph[0]->config_size) > num_sects)) {
-		ldm_crit ("Database extends beyond the end of the disk.");
-		goto out;
-	}
-
-	if ((ph[0]->logical_disk_start > ph[0]->config_start) ||
-	   ((ph[0]->logical_disk_start + ph[0]->logical_disk_size)
-		    > ph[0]->config_start)) {
-		ldm_crit ("Disk and database overlap.");
-		goto out;
-	}
-
-	if (!ldm_compare_privheads (ph[0], ph[1])) {
-		ldm_crit ("Primary and backup PRIVHEADs don't match.");
-		goto out;
-	}
-	/* FIXME ignore this for now
-	if (!ldm_compare_privheads (ph[0], ph[2])) {
-		ldm_crit ("Primary and backup PRIVHEADs don't match.");
-		goto out;
-	}*/
-	ldm_debug ("Validated PRIVHEADs successfully.");
-	result = true;
-out:
-	kfree (ph[1]);
-	kfree (ph[2]);
-	return result;
-}
-
-/**
- * ldm_validate_tocblocks - Validate the table of contents and its backups
- * @state: Partition check state including device holding the LDM Database
- * @base:  Offset, into @state->bdev, of the database
- * @ldb:   Cache of the database structures
- *
- * Find and compare the four tables of contents of the LDM Database stored on
- * @state->bdev and return the parsed information into @toc1.
- *
- * The offsets and sizes of the configs are range-checked against a privhead.
- *
- * Return:  'true'   @toc1 contains validated TOCBLOCK info
- *          'false'  @toc1 contents are undefined
- */
-static bool ldm_validate_tocblocks(struct parsed_partitions *state,
-				   unsigned long base, struct ldmdb *ldb)
-{
-	static const int off[4] = { OFF_TOCB1, OFF_TOCB2, OFF_TOCB3, OFF_TOCB4};
-	struct tocblock *tb[4];
-	struct privhead *ph;
-	Sector sect;
-	u8 *data;
-	int i, nr_tbs;
-	bool result = false;
-
-	BUG_ON(!state || !ldb);
-	ph = &ldb->ph;
-	tb[0] = &ldb->toc;
-	tb[1] = kmalloc(sizeof(*tb[1]) * 3, GFP_KERNEL);
-	if (!tb[1]) {
-		ldm_crit("Out of memory.");
-		goto err;
-	}
-	tb[2] = (struct tocblock*)((u8*)tb[1] + sizeof(*tb[1]));
-	tb[3] = (struct tocblock*)((u8*)tb[2] + sizeof(*tb[2]));
-	/*
-	 * Try to read and parse all four TOCBLOCKs.
-	 *
-	 * Windows Vista LDM v2.12 does not always have all four TOCBLOCKs so
-	 * skip any that fail as long as we get at least one valid TOCBLOCK.
-	 */
-	for (nr_tbs = i = 0; i < 4; i++) {
-		data = read_part_sector(state, base + off[i], &sect);
-		if (!data) {
-			ldm_error("Disk read failed for TOCBLOCK %d.", i);
-			continue;
-		}
-		if (ldm_parse_tocblock(data, tb[nr_tbs]))
-			nr_tbs++;
-		put_dev_sector(sect);
-	}
-	if (!nr_tbs) {
-		ldm_crit("Failed to find a valid TOCBLOCK.");
-		goto err;
-	}
-	/* Range check the TOCBLOCK against a privhead. */
-	if (((tb[0]->bitmap1_start + tb[0]->bitmap1_size) > ph->config_size) ||
-			((tb[0]->bitmap2_start + tb[0]->bitmap2_size) >
-			ph->config_size)) {
-		ldm_crit("The bitmaps are out of range.  Giving up.");
-		goto err;
-	}
-	/* Compare all loaded TOCBLOCKs. */
-	for (i = 1; i < nr_tbs; i++) {
-		if (!ldm_compare_tocblocks(tb[0], tb[i])) {
-			ldm_crit("TOCBLOCKs 0 and %d do not match.", i);
-			goto err;
-		}
-	}
-	ldm_debug("Validated %d TOCBLOCKs successfully.", nr_tbs);
-	result = true;
-err:
-	kfree(tb[1]);
-	return result;
-}
-
-/**
- * ldm_validate_vmdb - Read the VMDB and validate it
- * @state: Partition check state including device holding the LDM Database
- * @base:  Offset, into @bdev, of the database
- * @ldb:   Cache of the database structures
- *
- * Find the vmdb of the LDM Database stored on @bdev and return the parsed
- * information in @ldb.
- *
- * Return:  'true'   @ldb contains validated VBDB info
- *          'false'  @ldb contents are undefined
- */
-static bool ldm_validate_vmdb(struct parsed_partitions *state,
-			      unsigned long base, struct ldmdb *ldb)
-{
-	Sector sect;
-	u8 *data;
-	bool result = false;
-	struct vmdb *vm;
-	struct tocblock *toc;
-
-	BUG_ON (!state || !ldb);
-
-	vm  = &ldb->vm;
-	toc = &ldb->toc;
-
-	data = read_part_sector(state, base + OFF_VMDB, &sect);
-	if (!data) {
-		ldm_crit ("Disk read failed.");
-		return false;
-	}
-
-	if (!ldm_parse_vmdb (data, vm))
-		goto out;				/* Already logged */
-
-	/* Are there uncommitted transactions? */
-	if (get_unaligned_be16(data + 0x10) != 0x01) {
-		ldm_crit ("Database is not in a consistent state.  Aborting.");
-		goto out;
-	}
-
-	if (vm->vblk_offset != 512)
-		ldm_info ("VBLKs start at offset 0x%04x.", vm->vblk_offset);
-
-	/*
-	 * The last_vblkd_seq can be before the end of the vmdb, just make sure
-	 * it is not out of bounds.
-	 */
-	if ((vm->vblk_size * vm->last_vblk_seq) > (toc->bitmap1_size << 9)) {
-		ldm_crit ("VMDB exceeds allowed size specified by TOCBLOCK.  "
-				"Database is corrupt.  Aborting.");
-		goto out;
-	}
-
-	result = true;
-out:
-	put_dev_sector (sect);
-	return result;
-}
-
-
-/**
- * ldm_validate_partition_table - Determine whether bdev might be a dynamic disk
- * @state: Partition check state including device holding the LDM Database
- *
- * This function provides a weak test to decide whether the device is a dynamic
- * disk or not.  It looks for an MS-DOS-style partition table containing at
- * least one partition of type 0x42 (formerly SFS, now used by Windows for
- * dynamic disks).
- *
- * N.B.  The only possible error can come from the read_part_sector and that is
- *       only likely to happen if the underlying device is strange.  If that IS
- *       the case we should return zero to let someone else try.
- *
- * Return:  'true'   @state->bdev is a dynamic disk
- *          'false'  @state->bdev is not a dynamic disk, or an error occurred
- */
-static bool ldm_validate_partition_table(struct parsed_partitions *state)
-{
-	Sector sect;
-	u8 *data;
-	struct partition *p;
-	int i;
-	bool result = false;
-
-	BUG_ON(!state);
-
-	data = read_part_sector(state, 0, &sect);
-	if (!data) {
-		ldm_info ("Disk read failed.");
-		return false;
-	}
-
-	if (*(__le16*) (data + 0x01FE) != cpu_to_le16 (MSDOS_LABEL_MAGIC))
-		goto out;
-
-	p = (struct partition*)(data + 0x01BE);
-	for (i = 0; i < 4; i++, p++)
-		if (SYS_IND (p) == LDM_PARTITION) {
-			result = true;
-			break;
-		}
-
-	if (result)
-		ldm_debug ("Found W2K dynamic disk partition type.");
-
-out:
-	put_dev_sector (sect);
-	return result;
-}
-
-/**
- * ldm_get_disk_objid - Search a linked list of vblk's for a given Disk Id
- * @ldb:  Cache of the database structures
- *
- * The LDM Database contains a list of all partitions on all dynamic disks.
- * The primary PRIVHEAD, at the beginning of the physical disk, tells us
- * the GUID of this disk.  This function searches for the GUID in a linked
- * list of vblk's.
- *
- * Return:  Pointer, A matching vblk was found
- *          NULL,    No match, or an error
- */
-static struct vblk * ldm_get_disk_objid (const struct ldmdb *ldb)
-{
-	struct list_head *item;
-
-	BUG_ON (!ldb);
-
-	list_for_each (item, &ldb->v_disk) {
-		struct vblk *v = list_entry (item, struct vblk, list);
-		if (!memcmp (v->vblk.disk.disk_id, ldb->ph.disk_id, GUID_SIZE))
-			return v;
-	}
-
-	return NULL;
-}
-
-/**
- * ldm_create_data_partitions - Create data partitions for this device
- * @pp:   List of the partitions parsed so far
- * @ldb:  Cache of the database structures
- *
- * The database contains ALL the partitions for ALL disk groups, so we need to
- * filter out this specific disk. Using the disk's object id, we can find all
- * the partitions in the database that belong to this disk.
- *
- * Add each partition in our database, to the parsed_partitions structure.
- *
- * N.B.  This function creates the partitions in the order it finds partition
- *       objects in the linked list.
- *
- * Return:  'true'   Partition created
- *          'false'  Error, probably a range checking problem
- */
-static bool ldm_create_data_partitions (struct parsed_partitions *pp,
-					const struct ldmdb *ldb)
-{
-	struct list_head *item;
-	struct vblk *vb;
-	struct vblk *disk;
-	struct vblk_part *part;
-	int part_num = 1;
-
-	BUG_ON (!pp || !ldb);
-
-	disk = ldm_get_disk_objid (ldb);
-	if (!disk) {
-		ldm_crit ("Can't find the ID of this disk in the database.");
-		return false;
-	}
-
-	strlcat(pp->pp_buf, " [LDM]", PAGE_SIZE);
-
-	/* Create the data partitions */
-	list_for_each (item, &ldb->v_part) {
-		vb = list_entry (item, struct vblk, list);
-		part = &vb->vblk.part;
-
-		if (part->disk_id != disk->obj_id)
-			continue;
-
-		put_partition (pp, part_num, ldb->ph.logical_disk_start +
-				part->start, part->size);
-		part_num++;
-	}
-
-	strlcat(pp->pp_buf, "\n", PAGE_SIZE);
-	return true;
-}
-
-
-/**
- * ldm_relative - Calculate the next relative offset
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @base:    Size of the previous fixed width fields
- * @offset:  Cumulative size of the previous variable-width fields
- *
- * Because many of the VBLK fields are variable-width, it's necessary
- * to calculate each offset based on the previous one and the length
- * of the field it pointed to.
- *
- * Return:  -1 Error, the calculated offset exceeded the size of the buffer
- *           n OK, a range-checked offset into buffer
- */
-static int ldm_relative(const u8 *buffer, int buflen, int base, int offset)
-{
-
-	base += offset;
-	if (!buffer || offset < 0 || base > buflen) {
-		if (!buffer)
-			ldm_error("!buffer");
-		if (offset < 0)
-			ldm_error("offset (%d) < 0", offset);
-		if (base > buflen)
-			ldm_error("base (%d) > buflen (%d)", base, buflen);
-		return -1;
-	}
-	if (base + buffer[base] >= buflen) {
-		ldm_error("base (%d) + buffer[base] (%d) >= buflen (%d)", base,
-				buffer[base], buflen);
-		return -1;
-	}
-	return buffer[base] + offset + 1;
-}
-
-/**
- * ldm_get_vnum - Convert a variable-width, big endian number, into cpu order
- * @block:  Pointer to the variable-width number to convert
- *
- * Large numbers in the LDM Database are often stored in a packed format.  Each
- * number is prefixed by a one byte width marker.  All numbers in the database
- * are stored in big-endian byte order.  This function reads one of these
- * numbers and returns the result
- *
- * N.B.  This function DOES NOT perform any range checking, though the most
- *       it will read is eight bytes.
- *
- * Return:  n A number
- *          0 Zero, or an error occurred
- */
-static u64 ldm_get_vnum (const u8 *block)
-{
-	u64 tmp = 0;
-	u8 length;
-
-	BUG_ON (!block);
-
-	length = *block++;
-
-	if (length && length <= 8)
-		while (length--)
-			tmp = (tmp << 8) | *block++;
-	else
-		ldm_error ("Illegal length %d.", length);
-
-	return tmp;
-}
-
-/**
- * ldm_get_vstr - Read a length-prefixed string into a buffer
- * @block:   Pointer to the length marker
- * @buffer:  Location to copy string to
- * @buflen:  Size of the output buffer
- *
- * Many of the strings in the LDM Database are not NULL terminated.  Instead
- * they are prefixed by a one byte length marker.  This function copies one of
- * these strings into a buffer.
- *
- * N.B.  This function DOES NOT perform any range checking on the input.
- *       If the buffer is too small, the output will be truncated.
- *
- * Return:  0, Error and @buffer contents are undefined
- *          n, String length in characters (excluding NULL)
- *          buflen-1, String was truncated.
- */
-static int ldm_get_vstr (const u8 *block, u8 *buffer, int buflen)
-{
-	int length;
-
-	BUG_ON (!block || !buffer);
-
-	length = block[0];
-	if (length >= buflen) {
-		ldm_error ("Truncating string %d -> %d.", length, buflen);
-		length = buflen - 1;
-	}
-	memcpy (buffer, block + 1, length);
-	buffer[length] = 0;
-	return length;
-}
-
-
-/**
- * ldm_parse_cmp3 - Read a raw VBLK Component object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Component object (version 3) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Component VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_cmp3 (const u8 *buffer, int buflen, struct vblk *vb)
-{
-	int r_objid, r_name, r_vstate, r_child, r_parent, r_stripe, r_cols, len;
-	struct vblk_comp *comp;
-
-	BUG_ON (!buffer || !vb);
-
-	r_objid  = ldm_relative (buffer, buflen, 0x18, 0);
-	r_name   = ldm_relative (buffer, buflen, 0x18, r_objid);
-	r_vstate = ldm_relative (buffer, buflen, 0x18, r_name);
-	r_child  = ldm_relative (buffer, buflen, 0x1D, r_vstate);
-	r_parent = ldm_relative (buffer, buflen, 0x2D, r_child);
-
-	if (buffer[0x12] & VBLK_FLAG_COMP_STRIPE) {
-		r_stripe = ldm_relative (buffer, buflen, 0x2E, r_parent);
-		r_cols   = ldm_relative (buffer, buflen, 0x2E, r_stripe);
-		len = r_cols;
-	} else {
-		r_stripe = 0;
-		r_cols   = 0;
-		len = r_parent;
-	}
-	if (len < 0)
-		return false;
-
-	len += VBLK_SIZE_CMP3;
-	if (len != get_unaligned_be32(buffer + 0x14))
-		return false;
-
-	comp = &vb->vblk.comp;
-	ldm_get_vstr (buffer + 0x18 + r_name, comp->state,
-		sizeof (comp->state));
-	comp->type      = buffer[0x18 + r_vstate];
-	comp->children  = ldm_get_vnum (buffer + 0x1D + r_vstate);
-	comp->parent_id = ldm_get_vnum (buffer + 0x2D + r_child);
-	comp->chunksize = r_stripe ? ldm_get_vnum (buffer+r_parent+0x2E) : 0;
-
-	return true;
-}
-
-/**
- * ldm_parse_dgr3 - Read a raw VBLK Disk Group object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Disk Group object (version 3) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Disk Group VBLK
- *          'false'  @vb contents are not defined
- */
-static int ldm_parse_dgr3 (const u8 *buffer, int buflen, struct vblk *vb)
-{
-	int r_objid, r_name, r_diskid, r_id1, r_id2, len;
-	struct vblk_dgrp *dgrp;
-
-	BUG_ON (!buffer || !vb);
-
-	r_objid  = ldm_relative (buffer, buflen, 0x18, 0);
-	r_name   = ldm_relative (buffer, buflen, 0x18, r_objid);
-	r_diskid = ldm_relative (buffer, buflen, 0x18, r_name);
-
-	if (buffer[0x12] & VBLK_FLAG_DGR3_IDS) {
-		r_id1 = ldm_relative (buffer, buflen, 0x24, r_diskid);
-		r_id2 = ldm_relative (buffer, buflen, 0x24, r_id1);
-		len = r_id2;
-	} else {
-		r_id1 = 0;
-		r_id2 = 0;
-		len = r_diskid;
-	}
-	if (len < 0)
-		return false;
-
-	len += VBLK_SIZE_DGR3;
-	if (len != get_unaligned_be32(buffer + 0x14))
-		return false;
-
-	dgrp = &vb->vblk.dgrp;
-	ldm_get_vstr (buffer + 0x18 + r_name, dgrp->disk_id,
-		sizeof (dgrp->disk_id));
-	return true;
-}
-
-/**
- * ldm_parse_dgr4 - Read a raw VBLK Disk Group object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Disk Group object (version 4) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Disk Group VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_dgr4 (const u8 *buffer, int buflen, struct vblk *vb)
-{
-	char buf[64];
-	int r_objid, r_name, r_id1, r_id2, len;
-	struct vblk_dgrp *dgrp;
-
-	BUG_ON (!buffer || !vb);
-
-	r_objid  = ldm_relative (buffer, buflen, 0x18, 0);
-	r_name   = ldm_relative (buffer, buflen, 0x18, r_objid);
-
-	if (buffer[0x12] & VBLK_FLAG_DGR4_IDS) {
-		r_id1 = ldm_relative (buffer, buflen, 0x44, r_name);
-		r_id2 = ldm_relative (buffer, buflen, 0x44, r_id1);
-		len = r_id2;
-	} else {
-		r_id1 = 0;
-		r_id2 = 0;
-		len = r_name;
-	}
-	if (len < 0)
-		return false;
-
-	len += VBLK_SIZE_DGR4;
-	if (len != get_unaligned_be32(buffer + 0x14))
-		return false;
-
-	dgrp = &vb->vblk.dgrp;
-
-	ldm_get_vstr (buffer + 0x18 + r_objid, buf, sizeof (buf));
-	return true;
-}
-
-/**
- * ldm_parse_dsk3 - Read a raw VBLK Disk object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Disk object (version 3) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Disk VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_dsk3 (const u8 *buffer, int buflen, struct vblk *vb)
-{
-	int r_objid, r_name, r_diskid, r_altname, len;
-	struct vblk_disk *disk;
-
-	BUG_ON (!buffer || !vb);
-
-	r_objid   = ldm_relative (buffer, buflen, 0x18, 0);
-	r_name    = ldm_relative (buffer, buflen, 0x18, r_objid);
-	r_diskid  = ldm_relative (buffer, buflen, 0x18, r_name);
-	r_altname = ldm_relative (buffer, buflen, 0x18, r_diskid);
-	len = r_altname;
-	if (len < 0)
-		return false;
-
-	len += VBLK_SIZE_DSK3;
-	if (len != get_unaligned_be32(buffer + 0x14))
-		return false;
-
-	disk = &vb->vblk.disk;
-	ldm_get_vstr (buffer + 0x18 + r_diskid, disk->alt_name,
-		sizeof (disk->alt_name));
-	if (!ldm_parse_guid (buffer + 0x19 + r_name, disk->disk_id))
-		return false;
-
-	return true;
-}
-
-/**
- * ldm_parse_dsk4 - Read a raw VBLK Disk object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Disk object (version 4) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Disk VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_dsk4 (const u8 *buffer, int buflen, struct vblk *vb)
-{
-	int r_objid, r_name, len;
-	struct vblk_disk *disk;
-
-	BUG_ON (!buffer || !vb);
-
-	r_objid = ldm_relative (buffer, buflen, 0x18, 0);
-	r_name  = ldm_relative (buffer, buflen, 0x18, r_objid);
-	len     = r_name;
-	if (len < 0)
-		return false;
-
-	len += VBLK_SIZE_DSK4;
-	if (len != get_unaligned_be32(buffer + 0x14))
-		return false;
-
-	disk = &vb->vblk.disk;
-	memcpy (disk->disk_id, buffer + 0x18 + r_name, GUID_SIZE);
-	return true;
-}
-
-/**
- * ldm_parse_prt3 - Read a raw VBLK Partition object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Partition object (version 3) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Partition VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_prt3(const u8 *buffer, int buflen, struct vblk *vb)
-{
-	int r_objid, r_name, r_size, r_parent, r_diskid, r_index, len;
-	struct vblk_part *part;
-
-	BUG_ON(!buffer || !vb);
-	r_objid = ldm_relative(buffer, buflen, 0x18, 0);
-	if (r_objid < 0) {
-		ldm_error("r_objid %d < 0", r_objid);
-		return false;
-	}
-	r_name = ldm_relative(buffer, buflen, 0x18, r_objid);
-	if (r_name < 0) {
-		ldm_error("r_name %d < 0", r_name);
-		return false;
-	}
-	r_size = ldm_relative(buffer, buflen, 0x34, r_name);
-	if (r_size < 0) {
-		ldm_error("r_size %d < 0", r_size);
-		return false;
-	}
-	r_parent = ldm_relative(buffer, buflen, 0x34, r_size);
-	if (r_parent < 0) {
-		ldm_error("r_parent %d < 0", r_parent);
-		return false;
-	}
-	r_diskid = ldm_relative(buffer, buflen, 0x34, r_parent);
-	if (r_diskid < 0) {
-		ldm_error("r_diskid %d < 0", r_diskid);
-		return false;
-	}
-	if (buffer[0x12] & VBLK_FLAG_PART_INDEX) {
-		r_index = ldm_relative(buffer, buflen, 0x34, r_diskid);
-		if (r_index < 0) {
-			ldm_error("r_index %d < 0", r_index);
-			return false;
-		}
-		len = r_index;
-	} else {
-		r_index = 0;
-		len = r_diskid;
-	}
-	if (len < 0) {
-		ldm_error("len %d < 0", len);
-		return false;
-	}
-	len += VBLK_SIZE_PRT3;
-	if (len > get_unaligned_be32(buffer + 0x14)) {
-		ldm_error("len %d > BE32(buffer + 0x14) %d", len,
-				get_unaligned_be32(buffer + 0x14));
-		return false;
-	}
-	part = &vb->vblk.part;
-	part->start = get_unaligned_be64(buffer + 0x24 + r_name);
-	part->volume_offset = get_unaligned_be64(buffer + 0x2C + r_name);
-	part->size = ldm_get_vnum(buffer + 0x34 + r_name);
-	part->parent_id = ldm_get_vnum(buffer + 0x34 + r_size);
-	part->disk_id = ldm_get_vnum(buffer + 0x34 + r_parent);
-	if (vb->flags & VBLK_FLAG_PART_INDEX)
-		part->partnum = buffer[0x35 + r_diskid];
-	else
-		part->partnum = 0;
-	return true;
-}
-
-/**
- * ldm_parse_vol5 - Read a raw VBLK Volume object into a vblk structure
- * @buffer:  Block of data being worked on
- * @buflen:  Size of the block of data
- * @vb:      In-memory vblk in which to return information
- *
- * Read a raw VBLK Volume object (version 5) into a vblk structure.
- *
- * Return:  'true'   @vb contains a Volume VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_vol5(const u8 *buffer, int buflen, struct vblk *vb)
-{
-	int r_objid, r_name, r_vtype, r_disable_drive_letter, r_child, r_size;
-	int r_id1, r_id2, r_size2, r_drive, len;
-	struct vblk_volu *volu;
-
-	BUG_ON(!buffer || !vb);
-	r_objid = ldm_relative(buffer, buflen, 0x18, 0);
-	if (r_objid < 0) {
-		ldm_error("r_objid %d < 0", r_objid);
-		return false;
-	}
-	r_name = ldm_relative(buffer, buflen, 0x18, r_objid);
-	if (r_name < 0) {
-		ldm_error("r_name %d < 0", r_name);
-		return false;
-	}
-	r_vtype = ldm_relative(buffer, buflen, 0x18, r_name);
-	if (r_vtype < 0) {
-		ldm_error("r_vtype %d < 0", r_vtype);
-		return false;
-	}
-	r_disable_drive_letter = ldm_relative(buffer, buflen, 0x18, r_vtype);
-	if (r_disable_drive_letter < 0) {
-		ldm_error("r_disable_drive_letter %d < 0",
-				r_disable_drive_letter);
-		return false;
-	}
-	r_child = ldm_relative(buffer, buflen, 0x2D, r_disable_drive_letter);
-	if (r_child < 0) {
-		ldm_error("r_child %d < 0", r_child);
-		return false;
-	}
-	r_size = ldm_relative(buffer, buflen, 0x3D, r_child);
-	if (r_size < 0) {
-		ldm_error("r_size %d < 0", r_size);
-		return false;
-	}
-	if (buffer[0x12] & VBLK_FLAG_VOLU_ID1) {
-		r_id1 = ldm_relative(buffer, buflen, 0x52, r_size);
-		if (r_id1 < 0) {
-			ldm_error("r_id1 %d < 0", r_id1);
-			return false;
-		}
-	} else
-		r_id1 = r_size;
-	if (buffer[0x12] & VBLK_FLAG_VOLU_ID2) {
-		r_id2 = ldm_relative(buffer, buflen, 0x52, r_id1);
-		if (r_id2 < 0) {
-			ldm_error("r_id2 %d < 0", r_id2);
-			return false;
-		}
-	} else
-		r_id2 = r_id1;
-	if (buffer[0x12] & VBLK_FLAG_VOLU_SIZE) {
-		r_size2 = ldm_relative(buffer, buflen, 0x52, r_id2);
-		if (r_size2 < 0) {
-			ldm_error("r_size2 %d < 0", r_size2);
-			return false;
-		}
-	} else
-		r_size2 = r_id2;
-	if (buffer[0x12] & VBLK_FLAG_VOLU_DRIVE) {
-		r_drive = ldm_relative(buffer, buflen, 0x52, r_size2);
-		if (r_drive < 0) {
-			ldm_error("r_drive %d < 0", r_drive);
-			return false;
-		}
-	} else
-		r_drive = r_size2;
-	len = r_drive;
-	if (len < 0) {
-		ldm_error("len %d < 0", len);
-		return false;
-	}
-	len += VBLK_SIZE_VOL5;
-	if (len > get_unaligned_be32(buffer + 0x14)) {
-		ldm_error("len %d > BE32(buffer + 0x14) %d", len,
-				get_unaligned_be32(buffer + 0x14));
-		return false;
-	}
-	volu = &vb->vblk.volu;
-	ldm_get_vstr(buffer + 0x18 + r_name, volu->volume_type,
-			sizeof(volu->volume_type));
-	memcpy(volu->volume_state, buffer + 0x18 + r_disable_drive_letter,
-			sizeof(volu->volume_state));
-	volu->size = ldm_get_vnum(buffer + 0x3D + r_child);
-	volu->partition_type = buffer[0x41 + r_size];
-	memcpy(volu->guid, buffer + 0x42 + r_size, sizeof(volu->guid));
-	if (buffer[0x12] & VBLK_FLAG_VOLU_DRIVE) {
-		ldm_get_vstr(buffer + 0x52 + r_size, volu->drive_hint,
-				sizeof(volu->drive_hint));
-	}
-	return true;
-}
-
-/**
- * ldm_parse_vblk - Read a raw VBLK object into a vblk structure
- * @buf:  Block of data being worked on
- * @len:  Size of the block of data
- * @vb:   In-memory vblk in which to return information
- *
- * Read a raw VBLK object into a vblk structure.  This function just reads the
- * information common to all VBLK types, then delegates the rest of the work to
- * helper functions: ldm_parse_*.
- *
- * Return:  'true'   @vb contains a VBLK
- *          'false'  @vb contents are not defined
- */
-static bool ldm_parse_vblk (const u8 *buf, int len, struct vblk *vb)
-{
-	bool result = false;
-	int r_objid;
-
-	BUG_ON (!buf || !vb);
-
-	r_objid = ldm_relative (buf, len, 0x18, 0);
-	if (r_objid < 0) {
-		ldm_error ("VBLK header is corrupt.");
-		return false;
-	}
-
-	vb->flags  = buf[0x12];
-	vb->type   = buf[0x13];
-	vb->obj_id = ldm_get_vnum (buf + 0x18);
-	ldm_get_vstr (buf+0x18+r_objid, vb->name, sizeof (vb->name));
-
-	switch (vb->type) {
-		case VBLK_CMP3:  result = ldm_parse_cmp3 (buf, len, vb); break;
-		case VBLK_DSK3:  result = ldm_parse_dsk3 (buf, len, vb); break;
-		case VBLK_DSK4:  result = ldm_parse_dsk4 (buf, len, vb); break;
-		case VBLK_DGR3:  result = ldm_parse_dgr3 (buf, len, vb); break;
-		case VBLK_DGR4:  result = ldm_parse_dgr4 (buf, len, vb); break;
-		case VBLK_PRT3:  result = ldm_parse_prt3 (buf, len, vb); break;
-		case VBLK_VOL5:  result = ldm_parse_vol5 (buf, len, vb); break;
-	}
-
-	if (result)
-		ldm_debug ("Parsed VBLK 0x%llx (type: 0x%02x) ok.",
-			 (unsigned long long) vb->obj_id, vb->type);
-	else
-		ldm_error ("Failed to parse VBLK 0x%llx (type: 0x%02x).",
-			(unsigned long long) vb->obj_id, vb->type);
-
-	return result;
-}
-
-
-/**
- * ldm_ldmdb_add - Adds a raw VBLK entry to the ldmdb database
- * @data:  Raw VBLK to add to the database
- * @len:   Size of the raw VBLK
- * @ldb:   Cache of the database structures
- *
- * The VBLKs are sorted into categories.  Partitions are also sorted by offset.
- *
- * N.B.  This function does not check the validity of the VBLKs.
- *
- * Return:  'true'   The VBLK was added
- *          'false'  An error occurred
- */
-static bool ldm_ldmdb_add (u8 *data, int len, struct ldmdb *ldb)
-{
-	struct vblk *vb;
-	struct list_head *item;
-
-	BUG_ON (!data || !ldb);
-
-	vb = kmalloc (sizeof (*vb), GFP_KERNEL);
-	if (!vb) {
-		ldm_crit ("Out of memory.");
-		return false;
-	}
-
-	if (!ldm_parse_vblk (data, len, vb)) {
-		kfree(vb);
-		return false;			/* Already logged */
-	}
-
-	/* Put vblk into the correct list. */
-	switch (vb->type) {
-	case VBLK_DGR3:
-	case VBLK_DGR4:
-		list_add (&vb->list, &ldb->v_dgrp);
-		break;
-	case VBLK_DSK3:
-	case VBLK_DSK4:
-		list_add (&vb->list, &ldb->v_disk);
-		break;
-	case VBLK_VOL5:
-		list_add (&vb->list, &ldb->v_volu);
-		break;
-	case VBLK_CMP3:
-		list_add (&vb->list, &ldb->v_comp);
-		break;
-	case VBLK_PRT3:
-		/* Sort by the partition's start sector. */
-		list_for_each (item, &ldb->v_part) {
-			struct vblk *v = list_entry (item, struct vblk, list);
-			if ((v->vblk.part.disk_id == vb->vblk.part.disk_id) &&
-			    (v->vblk.part.start > vb->vblk.part.start)) {
-				list_add_tail (&vb->list, &v->list);
-				return true;
-			}
-		}
-		list_add_tail (&vb->list, &ldb->v_part);
-		break;
-	}
-	return true;
-}
-
-/**
- * ldm_frag_add - Add a VBLK fragment to a list
- * @data:   Raw fragment to be added to the list
- * @size:   Size of the raw fragment
- * @frags:  Linked list of VBLK fragments
- *
- * Fragmented VBLKs may not be consecutive in the database, so they are placed
- * in a list so they can be pieced together later.
- *
- * Return:  'true'   Success, the VBLK was added to the list
- *          'false'  Error, a problem occurred
- */
-static bool ldm_frag_add (const u8 *data, int size, struct list_head *frags)
-{
-	struct frag *f;
-	struct list_head *item;
-	int rec, num, group;
-
-	BUG_ON (!data || !frags);
-
-	if (size < 2 * VBLK_SIZE_HEAD) {
-		ldm_error("Value of size is to small.");
-		return false;
-	}
-
-	group = get_unaligned_be32(data + 0x08);
-	rec   = get_unaligned_be16(data + 0x0C);
-	num   = get_unaligned_be16(data + 0x0E);
-	if ((num < 1) || (num > 4)) {
-		ldm_error ("A VBLK claims to have %d parts.", num);
-		return false;
-	}
-	if (rec >= num) {
-		ldm_error("REC value (%d) exceeds NUM value (%d)", rec, num);
-		return false;
-	}
-
-	list_for_each (item, frags) {
-		f = list_entry (item, struct frag, list);
-		if (f->group == group)
-			goto found;
-	}
-
-	f = kmalloc (sizeof (*f) + size*num, GFP_KERNEL);
-	if (!f) {
-		ldm_crit ("Out of memory.");
-		return false;
-	}
-
-	f->group = group;
-	f->num   = num;
-	f->rec   = rec;
-	f->map   = 0xFF << num;
-
-	list_add_tail (&f->list, frags);
-found:
-	if (rec >= f->num) {
-		ldm_error("REC value (%d) exceeds NUM value (%d)", rec, f->num);
-		return false;
-	}
-
-	if (f->map & (1 << rec)) {
-		ldm_error ("Duplicate VBLK, part %d.", rec);
-		f->map &= 0x7F;			/* Mark the group as broken */
-		return false;
-	}
-
-	f->map |= (1 << rec);
-
-	data += VBLK_SIZE_HEAD;
-	size -= VBLK_SIZE_HEAD;
-
-	memcpy (f->data+rec*(size-VBLK_SIZE_HEAD)+VBLK_SIZE_HEAD, data, size);
-
-	return true;
-}
-
-/**
- * ldm_frag_free - Free a linked list of VBLK fragments
- * @list:  Linked list of fragments
- *
- * Free a linked list of VBLK fragments
- *
- * Return:  none
- */
-static void ldm_frag_free (struct list_head *list)
-{
-	struct list_head *item, *tmp;
-
-	BUG_ON (!list);
-
-	list_for_each_safe (item, tmp, list)
-		kfree (list_entry (item, struct frag, list));
-}
-
-/**
- * ldm_frag_commit - Validate fragmented VBLKs and add them to the database
- * @frags:  Linked list of VBLK fragments
- * @ldb:    Cache of the database structures
- *
- * Now that all the fragmented VBLKs have been collected, they must be added to
- * the database for later use.
- *
- * Return:  'true'   All the fragments we added successfully
- *          'false'  One or more of the fragments we invalid
- */
-static bool ldm_frag_commit (struct list_head *frags, struct ldmdb *ldb)
-{
-	struct frag *f;
-	struct list_head *item;
-
-	BUG_ON (!frags || !ldb);
-
-	list_for_each (item, frags) {
-		f = list_entry (item, struct frag, list);
-
-		if (f->map != 0xFF) {
-			ldm_error ("VBLK group %d is incomplete (0x%02x).",
-				f->group, f->map);
-			return false;
-		}
-
-		if (!ldm_ldmdb_add (f->data, f->num*ldb->vm.vblk_size, ldb))
-			return false;		/* Already logged */
-	}
-	return true;
-}
-
-/**
- * ldm_get_vblks - Read the on-disk database of VBLKs into memory
- * @state: Partition check state including device holding the LDM Database
- * @base:  Offset, into @state->bdev, of the database
- * @ldb:   Cache of the database structures
- *
- * To use the information from the VBLKs, they need to be read from the disk,
- * unpacked and validated.  We cache them in @ldb according to their type.
- *
- * Return:  'true'   All the VBLKs were read successfully
- *          'false'  An error occurred
- */
-static bool ldm_get_vblks(struct parsed_partitions *state, unsigned long base,
-			  struct ldmdb *ldb)
-{
-	int size, perbuf, skip, finish, s, v, recs;
-	u8 *data = NULL;
-	Sector sect;
-	bool result = false;
-	LIST_HEAD (frags);
-
-	BUG_ON(!state || !ldb);
-
-	size   = ldb->vm.vblk_size;
-	perbuf = 512 / size;
-	skip   = ldb->vm.vblk_offset >> 9;		/* Bytes to sectors */
-	finish = (size * ldb->vm.last_vblk_seq) >> 9;
-
-	for (s = skip; s < finish; s++) {		/* For each sector */
-		data = read_part_sector(state, base + OFF_VMDB + s, &sect);
-		if (!data) {
-			ldm_crit ("Disk read failed.");
-			goto out;
-		}
-
-		for (v = 0; v < perbuf; v++, data+=size) {  /* For each vblk */
-			if (MAGIC_VBLK != get_unaligned_be32(data)) {
-				ldm_error ("Expected to find a VBLK.");
-				goto out;
-			}
-
-			recs = get_unaligned_be16(data + 0x0E);	/* Number of records */
-			if (recs == 1) {
-				if (!ldm_ldmdb_add (data, size, ldb))
-					goto out;	/* Already logged */
-			} else if (recs > 1) {
-				if (!ldm_frag_add (data, size, &frags))
-					goto out;	/* Already logged */
-			}
-			/* else Record is not in use, ignore it. */
-		}
-		put_dev_sector (sect);
-		data = NULL;
-	}
-
-	result = ldm_frag_commit (&frags, ldb);	/* Failures, already logged */
-out:
-	if (data)
-		put_dev_sector (sect);
-	ldm_frag_free (&frags);
-
-	return result;
-}
-
-/**
- * ldm_free_vblks - Free a linked list of vblk's
- * @lh:  Head of a linked list of struct vblk
- *
- * Free a list of vblk's and free the memory used to maintain the list.
- *
- * Return:  none
- */
-static void ldm_free_vblks (struct list_head *lh)
-{
-	struct list_head *item, *tmp;
-
-	BUG_ON (!lh);
-
-	list_for_each_safe (item, tmp, lh)
-		kfree (list_entry (item, struct vblk, list));
-}
-
-
-/**
- * ldm_partition - Find out whether a device is a dynamic disk and handle it
- * @state: Partition check state including device holding the LDM Database
- *
- * This determines whether the device @bdev is a dynamic disk and if so creates
- * the partitions necessary in the gendisk structure pointed to by @hd.
- *
- * We create a dummy device 1, which contains the LDM database, and then create
- * each partition described by the LDM database in sequence as devices 2+. For
- * example, if the device is hda, we would have: hda1: LDM database, hda2, hda3,
- * and so on: the actual data containing partitions.
- *
- * Return:  1 Success, @state->bdev is a dynamic disk and we handled it
- *          0 Success, @state->bdev is not a dynamic disk
- *         -1 An error occurred before enough information had been read
- *            Or @state->bdev is a dynamic disk, but it may be corrupted
- */
-int ldm_partition(struct parsed_partitions *state)
-{
-	struct ldmdb  *ldb;
-	unsigned long base;
-	int result = -1;
-
-	BUG_ON(!state);
-
-	/* Look for signs of a Dynamic Disk */
-	if (!ldm_validate_partition_table(state))
-		return 0;
-
-	ldb = kmalloc (sizeof (*ldb), GFP_KERNEL);
-	if (!ldb) {
-		ldm_crit ("Out of memory.");
-		goto out;
-	}
-
-	/* Parse and check privheads. */
-	if (!ldm_validate_privheads(state, &ldb->ph))
-		goto out;		/* Already logged */
-
-	/* All further references are relative to base (database start). */
-	base = ldb->ph.config_start;
-
-	/* Parse and check tocs and vmdb. */
-	if (!ldm_validate_tocblocks(state, base, ldb) ||
-	    !ldm_validate_vmdb(state, base, ldb))
-	    	goto out;		/* Already logged */
-
-	/* Initialize vblk lists in ldmdb struct */
-	INIT_LIST_HEAD (&ldb->v_dgrp);
-	INIT_LIST_HEAD (&ldb->v_disk);
-	INIT_LIST_HEAD (&ldb->v_volu);
-	INIT_LIST_HEAD (&ldb->v_comp);
-	INIT_LIST_HEAD (&ldb->v_part);
-
-	if (!ldm_get_vblks(state, base, ldb)) {
-		ldm_crit ("Failed to read the VBLKs from the database.");
-		goto cleanup;
-	}
-
-	/* Finally, create the data partition devices. */
-	if (ldm_create_data_partitions(state, ldb)) {
-		ldm_debug ("Parsed LDM database successfully.");
-		result = 1;
-	}
-	/* else Already logged */
-
-cleanup:
-	ldm_free_vblks (&ldb->v_dgrp);
-	ldm_free_vblks (&ldb->v_disk);
-	ldm_free_vblks (&ldb->v_volu);
-	ldm_free_vblks (&ldb->v_comp);
-	ldm_free_vblks (&ldb->v_part);
-out:
-	kfree (ldb);
-	return result;
-}
diff --git a/fs/partitions/ldm.h b/fs/partitions/ldm.h
deleted file mode 100644
index 374242c0971..00000000000
--- a/fs/partitions/ldm.h
+++ /dev/null
@@ -1,215 +0,0 @@
-/**
- * ldm - Part of the Linux-NTFS project.
- *
- * Copyright (C) 2001,2002 Richard Russon <ldm@flatcap.org>
- * Copyright (c) 2001-2007 Anton Altaparmakov
- * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
- *
- * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program (in the main directory of the Linux-NTFS source
- * in the file COPYING); if not, write to the Free Software Foundation,
- * Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- */
-
-#ifndef _FS_PT_LDM_H_
-#define _FS_PT_LDM_H_
-
-#include <linux/types.h>
-#include <linux/list.h>
-#include <linux/genhd.h>
-#include <linux/fs.h>
-#include <asm/unaligned.h>
-#include <asm/byteorder.h>
-
-struct parsed_partitions;
-
-/* Magic numbers in CPU format. */
-#define MAGIC_VMDB	0x564D4442		/* VMDB */
-#define MAGIC_VBLK	0x56424C4B		/* VBLK */
-#define MAGIC_PRIVHEAD	0x5052495648454144ULL	/* PRIVHEAD */
-#define MAGIC_TOCBLOCK	0x544F43424C4F434BULL	/* TOCBLOCK */
-
-/* The defined vblk types. */
-#define VBLK_VOL5		0x51		/* Volume,     version 5 */
-#define VBLK_CMP3		0x32		/* Component,  version 3 */
-#define VBLK_PRT3		0x33		/* Partition,  version 3 */
-#define VBLK_DSK3		0x34		/* Disk,       version 3 */
-#define VBLK_DSK4		0x44		/* Disk,       version 4 */
-#define VBLK_DGR3		0x35		/* Disk Group, version 3 */
-#define VBLK_DGR4		0x45		/* Disk Group, version 4 */
-
-/* vblk flags indicating extra information will be present */
-#define	VBLK_FLAG_COMP_STRIPE	0x10
-#define	VBLK_FLAG_PART_INDEX	0x08
-#define	VBLK_FLAG_DGR3_IDS	0x08
-#define	VBLK_FLAG_DGR4_IDS	0x08
-#define	VBLK_FLAG_VOLU_ID1	0x08
-#define	VBLK_FLAG_VOLU_ID2	0x20
-#define	VBLK_FLAG_VOLU_SIZE	0x80
-#define	VBLK_FLAG_VOLU_DRIVE	0x02
-
-/* size of a vblk's static parts */
-#define VBLK_SIZE_HEAD		16
-#define VBLK_SIZE_CMP3		22		/* Name and version */
-#define VBLK_SIZE_DGR3		12
-#define VBLK_SIZE_DGR4		44
-#define VBLK_SIZE_DSK3		12
-#define VBLK_SIZE_DSK4		45
-#define VBLK_SIZE_PRT3		28
-#define VBLK_SIZE_VOL5		58
-
-/* component types */
-#define COMP_STRIPE		0x01		/* Stripe-set */
-#define COMP_BASIC		0x02		/* Basic disk */
-#define COMP_RAID		0x03		/* Raid-set */
-
-/* Other constants. */
-#define LDM_DB_SIZE		2048		/* Size in sectors (= 1MiB). */
-
-#define OFF_PRIV1		6		/* Offset of the first privhead
-						   relative to the start of the
-						   device in sectors */
-
-/* Offsets to structures within the LDM Database in sectors. */
-#define OFF_PRIV2		1856		/* Backup private headers. */
-#define OFF_PRIV3		2047
-
-#define OFF_TOCB1		1		/* Tables of contents. */
-#define OFF_TOCB2		2
-#define OFF_TOCB3		2045
-#define OFF_TOCB4		2046
-
-#define OFF_VMDB		17		/* List of partitions. */
-
-#define LDM_PARTITION		0x42		/* Formerly SFS (Landis). */
-
-#define TOC_BITMAP1		"config"	/* Names of the two defined */
-#define TOC_BITMAP2		"log"		/* bitmaps in the TOCBLOCK. */
-
-/* Borrowed from msdos.c */
-#define SYS_IND(p)		(get_unaligned(&(p)->sys_ind))
-
-struct frag {				/* VBLK Fragment handling */
-	struct list_head list;
-	u32		group;
-	u8		num;		/* Total number of records */
-	u8		rec;		/* This is record number n */
-	u8		map;		/* Which portions are in use */
-	u8		data[0];
-};
-
-/* In memory LDM database structures. */
-
-#define GUID_SIZE		16
-
-struct privhead {			/* Offsets and sizes are in sectors. */
-	u16	ver_major;
-	u16	ver_minor;
-	u64	logical_disk_start;
-	u64	logical_disk_size;
-	u64	config_start;
-	u64	config_size;
-	u8	disk_id[GUID_SIZE];
-};
-
-struct tocblock {			/* We have exactly two bitmaps. */
-	u8	bitmap1_name[16];
-	u64	bitmap1_start;
-	u64	bitmap1_size;
-	u8	bitmap2_name[16];
-	u64	bitmap2_start;
-	u64	bitmap2_size;
-};
-
-struct vmdb {				/* VMDB: The database header */
-	u16	ver_major;
-	u16	ver_minor;
-	u32	vblk_size;
-	u32	vblk_offset;
-	u32	last_vblk_seq;
-};
-
-struct vblk_comp {			/* VBLK Component */
-	u8	state[16];
-	u64	parent_id;
-	u8	type;
-	u8	children;
-	u16	chunksize;
-};
-
-struct vblk_dgrp {			/* VBLK Disk Group */
-	u8	disk_id[64];
-};
-
-struct vblk_disk {			/* VBLK Disk */
-	u8	disk_id[GUID_SIZE];
-	u8	alt_name[128];
-};
-
-struct vblk_part {			/* VBLK Partition */
-	u64	start;
-	u64	size;			/* start, size and vol_off in sectors */
-	u64	volume_offset;
-	u64	parent_id;
-	u64	disk_id;
-	u8	partnum;
-};
-
-struct vblk_volu {			/* VBLK Volume */
-	u8	volume_type[16];
-	u8	volume_state[16];
-	u8	guid[16];
-	u8	drive_hint[4];
-	u64	size;
-	u8	partition_type;
-};
-
-struct vblk_head {			/* VBLK standard header */
-	u32 group;
-	u16 rec;
-	u16 nrec;
-};
-
-struct vblk {				/* Generalised VBLK */
-	u8	name[64];
-	u64	obj_id;
-	u32	sequence;
-	u8	flags;
-	u8	type;
-	union {
-		struct vblk_comp comp;
-		struct vblk_dgrp dgrp;
-		struct vblk_disk disk;
-		struct vblk_part part;
-		struct vblk_volu volu;
-	} vblk;
-	struct list_head list;
-};
-
-struct ldmdb {				/* Cache of the database */
-	struct privhead ph;
-	struct tocblock toc;
-	struct vmdb     vm;
-	struct list_head v_dgrp;
-	struct list_head v_disk;
-	struct list_head v_volu;
-	struct list_head v_comp;
-	struct list_head v_part;
-};
-
-int ldm_partition(struct parsed_partitions *state);
-
-#endif /* _FS_PT_LDM_H_ */
-
diff --git a/fs/partitions/mac.c b/fs/partitions/mac.c
deleted file mode 100644
index 11f688bd76c..00000000000
--- a/fs/partitions/mac.c
+++ /dev/null
@@ -1,134 +0,0 @@
-/*
- *  fs/partitions/mac.c
- *
- *  Code extracted from drivers/block/genhd.c
- *  Copyright (C) 1991-1998  Linus Torvalds
- *  Re-organised Feb 1998 Russell King
- */
-
-#include <linux/ctype.h>
-#include "check.h"
-#include "mac.h"
-
-#ifdef CONFIG_PPC_PMAC
-#include <asm/machdep.h>
-extern void note_bootable_part(dev_t dev, int part, int goodness);
-#endif
-
-/*
- * Code to understand MacOS partition tables.
- */
-
-static inline void mac_fix_string(char *stg, int len)
-{
-	int i;
-
-	for (i = len - 1; i >= 0 && stg[i] == ' '; i--)
-		stg[i] = 0;
-}
-
-int mac_partition(struct parsed_partitions *state)
-{
-	Sector sect;
-	unsigned char *data;
-	int slot, blocks_in_map;
-	unsigned secsize;
-#ifdef CONFIG_PPC_PMAC
-	int found_root = 0;
-	int found_root_goodness = 0;
-#endif
-	struct mac_partition *part;
-	struct mac_driver_desc *md;
-
-	/* Get 0th block and look at the first partition map entry. */
-	md = read_part_sector(state, 0, &sect);
-	if (!md)
-		return -1;
-	if (be16_to_cpu(md->signature) != MAC_DRIVER_MAGIC) {
-		put_dev_sector(sect);
-		return 0;
-	}
-	secsize = be16_to_cpu(md->block_size);
-	put_dev_sector(sect);
-	data = read_part_sector(state, secsize/512, &sect);
-	if (!data)
-		return -1;
-	part = (struct mac_partition *) (data + secsize%512);
-	if (be16_to_cpu(part->signature) != MAC_PARTITION_MAGIC) {
-		put_dev_sector(sect);
-		return 0;		/* not a MacOS disk */
-	}
-	blocks_in_map = be32_to_cpu(part->map_count);
-	if (blocks_in_map < 0 || blocks_in_map >= DISK_MAX_PARTS) {
-		put_dev_sector(sect);
-		return 0;
-	}
-	strlcat(state->pp_buf, " [mac]", PAGE_SIZE);
-	for (slot = 1; slot <= blocks_in_map; ++slot) {
-		int pos = slot * secsize;
-		put_dev_sector(sect);
-		data = read_part_sector(state, pos/512, &sect);
-		if (!data)
-			return -1;
-		part = (struct mac_partition *) (data + pos%512);
-		if (be16_to_cpu(part->signature) != MAC_PARTITION_MAGIC)
-			break;
-		put_partition(state, slot,
-			be32_to_cpu(part->start_block) * (secsize/512),
-			be32_to_cpu(part->block_count) * (secsize/512));
-
-		if (!strnicmp(part->type, "Linux_RAID", 10))
-			state->parts[slot].flags = ADDPART_FLAG_RAID;
-#ifdef CONFIG_PPC_PMAC
-		/*
-		 * If this is the first bootable partition, tell the
-		 * setup code, in case it wants to make this the root.
-		 */
-		if (machine_is(powermac)) {
-			int goodness = 0;
-
-			mac_fix_string(part->processor, 16);
-			mac_fix_string(part->name, 32);
-			mac_fix_string(part->type, 32);					
-		    
-			if ((be32_to_cpu(part->status) & MAC_STATUS_BOOTABLE)
-			    && strcasecmp(part->processor, "powerpc") == 0)
-				goodness++;
-
-			if (strcasecmp(part->type, "Apple_UNIX_SVR2") == 0
-			    || (strnicmp(part->type, "Linux", 5) == 0
-			        && strcasecmp(part->type, "Linux_swap") != 0)) {
-				int i, l;
-
-				goodness++;
-				l = strlen(part->name);
-				if (strcmp(part->name, "/") == 0)
-					goodness++;
-				for (i = 0; i <= l - 4; ++i) {
-					if (strnicmp(part->name + i, "root",
-						     4) == 0) {
-						goodness += 2;
-						break;
-					}
-				}
-				if (strnicmp(part->name, "swap", 4) == 0)
-					goodness--;
-			}
-
-			if (goodness > found_root_goodness) {
-				found_root = slot;
-				found_root_goodness = goodness;
-			}
-		}
-#endif /* CONFIG_PPC_PMAC */
-	}
-#ifdef CONFIG_PPC_PMAC
-	if (found_root_goodness)
-		note_bootable_part(state->bdev->bd_dev, found_root,
-				   found_root_goodness);
-#endif
-
-	put_dev_sector(sect);
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	return 1;
-}
diff --git a/fs/partitions/mac.h b/fs/partitions/mac.h
deleted file mode 100644
index 3c7d9843638..00000000000
--- a/fs/partitions/mac.h
+++ /dev/null
@@ -1,44 +0,0 @@
-/*
- *  fs/partitions/mac.h
- */
-
-#define MAC_PARTITION_MAGIC	0x504d
-
-/* type field value for A/UX or other Unix partitions */
-#define APPLE_AUX_TYPE	"Apple_UNIX_SVR2"
-
-struct mac_partition {
-	__be16	signature;	/* expected to be MAC_PARTITION_MAGIC */
-	__be16	res1;
-	__be32	map_count;	/* # blocks in partition map */
-	__be32	start_block;	/* absolute starting block # of partition */
-	__be32	block_count;	/* number of blocks in partition */
-	char	name[32];	/* partition name */
-	char	type[32];	/* string type description */
-	__be32	data_start;	/* rel block # of first data block */
-	__be32	data_count;	/* number of data blocks */
-	__be32	status;		/* partition status bits */
-	__be32	boot_start;
-	__be32	boot_size;
-	__be32	boot_load;
-	__be32	boot_load2;
-	__be32	boot_entry;
-	__be32	boot_entry2;
-	__be32	boot_cksum;
-	char	processor[16];	/* identifies ISA of boot */
-	/* there is more stuff after this that we don't need */
-};
-
-#define MAC_STATUS_BOOTABLE	8	/* partition is bootable */
-
-#define MAC_DRIVER_MAGIC	0x4552
-
-/* Driver descriptor structure, in block 0 */
-struct mac_driver_desc {
-	__be16	signature;	/* expected to be MAC_DRIVER_MAGIC */
-	__be16	block_size;
-	__be32	block_count;
-    /* ... more stuff */
-};
-
-int mac_partition(struct parsed_partitions *state);
diff --git a/fs/partitions/msdos.c b/fs/partitions/msdos.c
deleted file mode 100644
index 5f79a6677c6..00000000000
--- a/fs/partitions/msdos.c
+++ /dev/null
@@ -1,552 +0,0 @@
-/*
- *  fs/partitions/msdos.c
- *
- *  Code extracted from drivers/block/genhd.c
- *  Copyright (C) 1991-1998  Linus Torvalds
- *
- *  Thanks to Branko Lankester, lankeste@fwi.uva.nl, who found a bug
- *  in the early extended-partition checks and added DM partitions
- *
- *  Support for DiskManager v6.0x added by Mark Lord,
- *  with information provided by OnTrack.  This now works for linux fdisk
- *  and LILO, as well as loadlin and bootln.  Note that disks other than
- *  /dev/hda *must* have a "DOS" type 0x51 partition in the first slot (hda1).
- *
- *  More flexible handling of extended partitions - aeb, 950831
- *
- *  Check partition table on IDE disks for common CHS translations
- *
- *  Re-organised Feb 1998 Russell King
- */
-#include <linux/msdos_fs.h>
-
-#include "check.h"
-#include "msdos.h"
-#include "efi.h"
-
-/*
- * Many architectures don't like unaligned accesses, while
- * the nr_sects and start_sect partition table entries are
- * at a 2 (mod 4) address.
- */
-#include <asm/unaligned.h>
-
-#define SYS_IND(p)	get_unaligned(&p->sys_ind)
-
-static inline sector_t nr_sects(struct partition *p)
-{
-	return (sector_t)get_unaligned_le32(&p->nr_sects);
-}
-
-static inline sector_t start_sect(struct partition *p)
-{
-	return (sector_t)get_unaligned_le32(&p->start_sect);
-}
-
-static inline int is_extended_partition(struct partition *p)
-{
-	return (SYS_IND(p) == DOS_EXTENDED_PARTITION ||
-		SYS_IND(p) == WIN98_EXTENDED_PARTITION ||
-		SYS_IND(p) == LINUX_EXTENDED_PARTITION);
-}
-
-#define MSDOS_LABEL_MAGIC1	0x55
-#define MSDOS_LABEL_MAGIC2	0xAA
-
-static inline int
-msdos_magic_present(unsigned char *p)
-{
-	return (p[0] == MSDOS_LABEL_MAGIC1 && p[1] == MSDOS_LABEL_MAGIC2);
-}
-
-/* Value is EBCDIC 'IBMA' */
-#define AIX_LABEL_MAGIC1	0xC9
-#define AIX_LABEL_MAGIC2	0xC2
-#define AIX_LABEL_MAGIC3	0xD4
-#define AIX_LABEL_MAGIC4	0xC1
-static int aix_magic_present(struct parsed_partitions *state, unsigned char *p)
-{
-	struct partition *pt = (struct partition *) (p + 0x1be);
-	Sector sect;
-	unsigned char *d;
-	int slot, ret = 0;
-
-	if (!(p[0] == AIX_LABEL_MAGIC1 &&
-		p[1] == AIX_LABEL_MAGIC2 &&
-		p[2] == AIX_LABEL_MAGIC3 &&
-		p[3] == AIX_LABEL_MAGIC4))
-		return 0;
-	/* Assume the partition table is valid if Linux partitions exists */
-	for (slot = 1; slot <= 4; slot++, pt++) {
-		if (pt->sys_ind == LINUX_SWAP_PARTITION ||
-			pt->sys_ind == LINUX_RAID_PARTITION ||
-			pt->sys_ind == LINUX_DATA_PARTITION ||
-			pt->sys_ind == LINUX_LVM_PARTITION ||
-			is_extended_partition(pt))
-			return 0;
-	}
-	d = read_part_sector(state, 7, &sect);
-	if (d) {
-		if (d[0] == '_' && d[1] == 'L' && d[2] == 'V' && d[3] == 'M')
-			ret = 1;
-		put_dev_sector(sect);
-	};
-	return ret;
-}
-
-/*
- * Create devices for each logical partition in an extended partition.
- * The logical partitions form a linked list, with each entry being
- * a partition table with two entries.  The first entry
- * is the real data partition (with a start relative to the partition
- * table start).  The second is a pointer to the next logical partition
- * (with a start relative to the entire extended partition).
- * We do not create a Linux partition for the partition tables, but
- * only for the actual data partitions.
- */
-
-static void parse_extended(struct parsed_partitions *state,
-			   sector_t first_sector, sector_t first_size)
-{
-	struct partition *p;
-	Sector sect;
-	unsigned char *data;
-	sector_t this_sector, this_size;
-	sector_t sector_size = bdev_logical_block_size(state->bdev) / 512;
-	int loopct = 0;		/* number of links followed
-				   without finding a data partition */
-	int i;
-
-	this_sector = first_sector;
-	this_size = first_size;
-
-	while (1) {
-		if (++loopct > 100)
-			return;
-		if (state->next == state->limit)
-			return;
-		data = read_part_sector(state, this_sector, &sect);
-		if (!data)
-			return;
-
-		if (!msdos_magic_present(data + 510))
-			goto done; 
-
-		p = (struct partition *) (data + 0x1be);
-
-		/*
-		 * Usually, the first entry is the real data partition,
-		 * the 2nd entry is the next extended partition, or empty,
-		 * and the 3rd and 4th entries are unused.
-		 * However, DRDOS sometimes has the extended partition as
-		 * the first entry (when the data partition is empty),
-		 * and OS/2 seems to use all four entries.
-		 */
-
-		/* 
-		 * First process the data partition(s)
-		 */
-		for (i=0; i<4; i++, p++) {
-			sector_t offs, size, next;
-			if (!nr_sects(p) || is_extended_partition(p))
-				continue;
-
-			/* Check the 3rd and 4th entries -
-			   these sometimes contain random garbage */
-			offs = start_sect(p)*sector_size;
-			size = nr_sects(p)*sector_size;
-			next = this_sector + offs;
-			if (i >= 2) {
-				if (offs + size > this_size)
-					continue;
-				if (next < first_sector)
-					continue;
-				if (next + size > first_sector + first_size)
-					continue;
-			}
-
-			put_partition(state, state->next, next, size);
-			if (SYS_IND(p) == LINUX_RAID_PARTITION)
-				state->parts[state->next].flags = ADDPART_FLAG_RAID;
-			loopct = 0;
-			if (++state->next == state->limit)
-				goto done;
-		}
-		/*
-		 * Next, process the (first) extended partition, if present.
-		 * (So far, there seems to be no reason to make
-		 *  parse_extended()  recursive and allow a tree
-		 *  of extended partitions.)
-		 * It should be a link to the next logical partition.
-		 */
-		p -= 4;
-		for (i=0; i<4; i++, p++)
-			if (nr_sects(p) && is_extended_partition(p))
-				break;
-		if (i == 4)
-			goto done;	 /* nothing left to do */
-
-		this_sector = first_sector + start_sect(p) * sector_size;
-		this_size = nr_sects(p) * sector_size;
-		put_dev_sector(sect);
-	}
-done:
-	put_dev_sector(sect);
-}
-
-/* james@bpgc.com: Solaris has a nasty indicator: 0x82 which also
-   indicates linux swap.  Be careful before believing this is Solaris. */
-
-static void parse_solaris_x86(struct parsed_partitions *state,
-			      sector_t offset, sector_t size, int origin)
-{
-#ifdef CONFIG_SOLARIS_X86_PARTITION
-	Sector sect;
-	struct solaris_x86_vtoc *v;
-	int i;
-	short max_nparts;
-
-	v = read_part_sector(state, offset + 1, &sect);
-	if (!v)
-		return;
-	if (le32_to_cpu(v->v_sanity) != SOLARIS_X86_VTOC_SANE) {
-		put_dev_sector(sect);
-		return;
-	}
-	{
-		char tmp[1 + BDEVNAME_SIZE + 10 + 11 + 1];
-
-		snprintf(tmp, sizeof(tmp), " %s%d: <solaris:", state->name, origin);
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-	}
-	if (le32_to_cpu(v->v_version) != 1) {
-		char tmp[64];
-
-		snprintf(tmp, sizeof(tmp), "  cannot handle version %d vtoc>\n",
-			 le32_to_cpu(v->v_version));
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-		put_dev_sector(sect);
-		return;
-	}
-	/* Ensure we can handle previous case of VTOC with 8 entries gracefully */
-	max_nparts = le16_to_cpu (v->v_nparts) > 8 ? SOLARIS_X86_NUMSLICE : 8;
-	for (i=0; i<max_nparts && state->next<state->limit; i++) {
-		struct solaris_x86_slice *s = &v->v_slice[i];
-		char tmp[3 + 10 + 1 + 1];
-
-		if (s->s_size == 0)
-			continue;
-		snprintf(tmp, sizeof(tmp), " [s%d]", i);
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-		/* solaris partitions are relative to current MS-DOS
-		 * one; must add the offset of the current partition */
-		put_partition(state, state->next++,
-				 le32_to_cpu(s->s_start)+offset,
-				 le32_to_cpu(s->s_size));
-	}
-	put_dev_sector(sect);
-	strlcat(state->pp_buf, " >\n", PAGE_SIZE);
-#endif
-}
-
-#if defined(CONFIG_BSD_DISKLABEL)
-/* 
- * Create devices for BSD partitions listed in a disklabel, under a
- * dos-like partition. See parse_extended() for more information.
- */
-static void parse_bsd(struct parsed_partitions *state,
-		      sector_t offset, sector_t size, int origin, char *flavour,
-		      int max_partitions)
-{
-	Sector sect;
-	struct bsd_disklabel *l;
-	struct bsd_partition *p;
-	char tmp[64];
-
-	l = read_part_sector(state, offset + 1, &sect);
-	if (!l)
-		return;
-	if (le32_to_cpu(l->d_magic) != BSD_DISKMAGIC) {
-		put_dev_sector(sect);
-		return;
-	}
-
-	snprintf(tmp, sizeof(tmp), " %s%d: <%s:", state->name, origin, flavour);
-	strlcat(state->pp_buf, tmp, PAGE_SIZE);
-
-	if (le16_to_cpu(l->d_npartitions) < max_partitions)
-		max_partitions = le16_to_cpu(l->d_npartitions);
-	for (p = l->d_partitions; p - l->d_partitions < max_partitions; p++) {
-		sector_t bsd_start, bsd_size;
-
-		if (state->next == state->limit)
-			break;
-		if (p->p_fstype == BSD_FS_UNUSED) 
-			continue;
-		bsd_start = le32_to_cpu(p->p_offset);
-		bsd_size = le32_to_cpu(p->p_size);
-		if (offset == bsd_start && size == bsd_size)
-			/* full parent partition, we have it already */
-			continue;
-		if (offset > bsd_start || offset+size < bsd_start+bsd_size) {
-			strlcat(state->pp_buf, "bad subpartition - ignored\n", PAGE_SIZE);
-			continue;
-		}
-		put_partition(state, state->next++, bsd_start, bsd_size);
-	}
-	put_dev_sector(sect);
-	if (le16_to_cpu(l->d_npartitions) > max_partitions) {
-		snprintf(tmp, sizeof(tmp), " (ignored %d more)",
-			 le16_to_cpu(l->d_npartitions) - max_partitions);
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-	}
-	strlcat(state->pp_buf, " >\n", PAGE_SIZE);
-}
-#endif
-
-static void parse_freebsd(struct parsed_partitions *state,
-			  sector_t offset, sector_t size, int origin)
-{
-#ifdef CONFIG_BSD_DISKLABEL
-	parse_bsd(state, offset, size, origin, "bsd", BSD_MAXPARTITIONS);
-#endif
-}
-
-static void parse_netbsd(struct parsed_partitions *state,
-			 sector_t offset, sector_t size, int origin)
-{
-#ifdef CONFIG_BSD_DISKLABEL
-	parse_bsd(state, offset, size, origin, "netbsd", BSD_MAXPARTITIONS);
-#endif
-}
-
-static void parse_openbsd(struct parsed_partitions *state,
-			  sector_t offset, sector_t size, int origin)
-{
-#ifdef CONFIG_BSD_DISKLABEL
-	parse_bsd(state, offset, size, origin, "openbsd",
-		  OPENBSD_MAXPARTITIONS);
-#endif
-}
-
-/*
- * Create devices for Unixware partitions listed in a disklabel, under a
- * dos-like partition. See parse_extended() for more information.
- */
-static void parse_unixware(struct parsed_partitions *state,
-			   sector_t offset, sector_t size, int origin)
-{
-#ifdef CONFIG_UNIXWARE_DISKLABEL
-	Sector sect;
-	struct unixware_disklabel *l;
-	struct unixware_slice *p;
-
-	l = read_part_sector(state, offset + 29, &sect);
-	if (!l)
-		return;
-	if (le32_to_cpu(l->d_magic) != UNIXWARE_DISKMAGIC ||
-	    le32_to_cpu(l->vtoc.v_magic) != UNIXWARE_DISKMAGIC2) {
-		put_dev_sector(sect);
-		return;
-	}
-	{
-		char tmp[1 + BDEVNAME_SIZE + 10 + 12 + 1];
-
-		snprintf(tmp, sizeof(tmp), " %s%d: <unixware:", state->name, origin);
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-	}
-	p = &l->vtoc.v_slice[1];
-	/* I omit the 0th slice as it is the same as whole disk. */
-	while (p - &l->vtoc.v_slice[0] < UNIXWARE_NUMSLICE) {
-		if (state->next == state->limit)
-			break;
-
-		if (p->s_label != UNIXWARE_FS_UNUSED)
-			put_partition(state, state->next++,
-				      le32_to_cpu(p->start_sect),
-				      le32_to_cpu(p->nr_sects));
-		p++;
-	}
-	put_dev_sector(sect);
-	strlcat(state->pp_buf, " >\n", PAGE_SIZE);
-#endif
-}
-
-/*
- * Minix 2.0.0/2.0.2 subpartition support.
- * Anand Krishnamurthy <anandk@wiproge.med.ge.com>
- * Rajeev V. Pillai    <rajeevvp@yahoo.com>
- */
-static void parse_minix(struct parsed_partitions *state,
-			sector_t offset, sector_t size, int origin)
-{
-#ifdef CONFIG_MINIX_SUBPARTITION
-	Sector sect;
-	unsigned char *data;
-	struct partition *p;
-	int i;
-
-	data = read_part_sector(state, offset, &sect);
-	if (!data)
-		return;
-
-	p = (struct partition *)(data + 0x1be);
-
-	/* The first sector of a Minix partition can have either
-	 * a secondary MBR describing its subpartitions, or
-	 * the normal boot sector. */
-	if (msdos_magic_present (data + 510) &&
-	    SYS_IND(p) == MINIX_PARTITION) { /* subpartition table present */
-		char tmp[1 + BDEVNAME_SIZE + 10 + 9 + 1];
-
-		snprintf(tmp, sizeof(tmp), " %s%d: <minix:", state->name, origin);
-		strlcat(state->pp_buf, tmp, PAGE_SIZE);
-		for (i = 0; i < MINIX_NR_SUBPARTITIONS; i++, p++) {
-			if (state->next == state->limit)
-				break;
-			/* add each partition in use */
-			if (SYS_IND(p) == MINIX_PARTITION)
-				put_partition(state, state->next++,
-					      start_sect(p), nr_sects(p));
-		}
-		strlcat(state->pp_buf, " >\n", PAGE_SIZE);
-	}
-	put_dev_sector(sect);
-#endif /* CONFIG_MINIX_SUBPARTITION */
-}
-
-static struct {
-	unsigned char id;
-	void (*parse)(struct parsed_partitions *, sector_t, sector_t, int);
-} subtypes[] = {
-	{FREEBSD_PARTITION, parse_freebsd},
-	{NETBSD_PARTITION, parse_netbsd},
-	{OPENBSD_PARTITION, parse_openbsd},
-	{MINIX_PARTITION, parse_minix},
-	{UNIXWARE_PARTITION, parse_unixware},
-	{SOLARIS_X86_PARTITION, parse_solaris_x86},
-	{NEW_SOLARIS_X86_PARTITION, parse_solaris_x86},
-	{0, NULL},
-};
- 
-int msdos_partition(struct parsed_partitions *state)
-{
-	sector_t sector_size = bdev_logical_block_size(state->bdev) / 512;
-	Sector sect;
-	unsigned char *data;
-	struct partition *p;
-	struct fat_boot_sector *fb;
-	int slot;
-
-	data = read_part_sector(state, 0, &sect);
-	if (!data)
-		return -1;
-	if (!msdos_magic_present(data + 510)) {
-		put_dev_sector(sect);
-		return 0;
-	}
-
-	if (aix_magic_present(state, data)) {
-		put_dev_sector(sect);
-		strlcat(state->pp_buf, " [AIX]", PAGE_SIZE);
-		return 0;
-	}
-
-	/*
-	 * Now that the 55aa signature is present, this is probably
-	 * either the boot sector of a FAT filesystem or a DOS-type
-	 * partition table. Reject this in case the boot indicator
-	 * is not 0 or 0x80.
-	 */
-	p = (struct partition *) (data + 0x1be);
-	for (slot = 1; slot <= 4; slot++, p++) {
-		if (p->boot_ind != 0 && p->boot_ind != 0x80) {
-			/*
-			 * Even without a valid boot inidicator value
-			 * its still possible this is valid FAT filesystem
-			 * without a partition table.
-			 */
-			fb = (struct fat_boot_sector *) data;
-			if (slot == 1 && fb->reserved && fb->fats
-				&& fat_valid_media(fb->media)) {
-				strlcat(state->pp_buf, "\n", PAGE_SIZE);
-				put_dev_sector(sect);
-				return 1;
-			} else {
-				put_dev_sector(sect);
-				return 0;
-			}
-		}
-	}
-
-#ifdef CONFIG_EFI_PARTITION
-	p = (struct partition *) (data + 0x1be);
-	for (slot = 1 ; slot <= 4 ; slot++, p++) {
-		/* If this is an EFI GPT disk, msdos should ignore it. */
-		if (SYS_IND(p) == EFI_PMBR_OSTYPE_EFI_GPT) {
-			put_dev_sector(sect);
-			return 0;
-		}
-	}
-#endif
-	p = (struct partition *) (data + 0x1be);
-
-	/*
-	 * Look for partitions in two passes:
-	 * First find the primary and DOS-type extended partitions.
-	 * On the second pass look inside *BSD, Unixware and Solaris partitions.
-	 */
-
-	state->next = 5;
-	for (slot = 1 ; slot <= 4 ; slot++, p++) {
-		sector_t start = start_sect(p)*sector_size;
-		sector_t size = nr_sects(p)*sector_size;
-		if (!size)
-			continue;
-		if (is_extended_partition(p)) {
-			/*
-			 * prevent someone doing mkfs or mkswap on an
-			 * extended partition, but leave room for LILO
-			 * FIXME: this uses one logical sector for > 512b
-			 * sector, although it may not be enough/proper.
-			 */
-			sector_t n = 2;
-			n = min(size, max(sector_size, n));
-			put_partition(state, slot, start, n);
-
-			strlcat(state->pp_buf, " <", PAGE_SIZE);
-			parse_extended(state, start, size);
-			strlcat(state->pp_buf, " >", PAGE_SIZE);
-			continue;
-		}
-		put_partition(state, slot, start, size);
-		if (SYS_IND(p) == LINUX_RAID_PARTITION)
-			state->parts[slot].flags = ADDPART_FLAG_RAID;
-		if (SYS_IND(p) == DM6_PARTITION)
-			strlcat(state->pp_buf, "[DM]", PAGE_SIZE);
-		if (SYS_IND(p) == EZD_PARTITION)
-			strlcat(state->pp_buf, "[EZD]", PAGE_SIZE);
-	}
-
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-
-	/* second pass - output for each on a separate line */
-	p = (struct partition *) (0x1be + data);
-	for (slot = 1 ; slot <= 4 ; slot++, p++) {
-		unsigned char id = SYS_IND(p);
-		int n;
-
-		if (!nr_sects(p))
-			continue;
-
-		for (n = 0; subtypes[n].parse && id != subtypes[n].id; n++)
-			;
-
-		if (!subtypes[n].parse)
-			continue;
-		subtypes[n].parse(state, start_sect(p) * sector_size,
-				  nr_sects(p) * sector_size, slot);
-	}
-	put_dev_sector(sect);
-	return 1;
-}
diff --git a/fs/partitions/msdos.h b/fs/partitions/msdos.h
deleted file mode 100644
index 38c781c490b..00000000000
--- a/fs/partitions/msdos.h
+++ /dev/null
@@ -1,8 +0,0 @@
-/*
- *  fs/partitions/msdos.h
- */
-
-#define MSDOS_LABEL_MAGIC		0xAA55
-
-int msdos_partition(struct parsed_partitions *state);
-
diff --git a/fs/partitions/osf.c b/fs/partitions/osf.c
deleted file mode 100644
index 764b86a0196..00000000000
--- a/fs/partitions/osf.c
+++ /dev/null
@@ -1,86 +0,0 @@
-/*
- *  fs/partitions/osf.c
- *
- *  Code extracted from drivers/block/genhd.c
- *
- *  Copyright (C) 1991-1998  Linus Torvalds
- *  Re-organised Feb 1998 Russell King
- */
-
-#include "check.h"
-#include "osf.h"
-
-#define MAX_OSF_PARTITIONS 18
-
-int osf_partition(struct parsed_partitions *state)
-{
-	int i;
-	int slot = 1;
-	unsigned int npartitions;
-	Sector sect;
-	unsigned char *data;
-	struct disklabel {
-		__le32 d_magic;
-		__le16 d_type,d_subtype;
-		u8 d_typename[16];
-		u8 d_packname[16];
-		__le32 d_secsize;
-		__le32 d_nsectors;
-		__le32 d_ntracks;
-		__le32 d_ncylinders;
-		__le32 d_secpercyl;
-		__le32 d_secprtunit;
-		__le16 d_sparespertrack;
-		__le16 d_sparespercyl;
-		__le32 d_acylinders;
-		__le16 d_rpm, d_interleave, d_trackskew, d_cylskew;
-		__le32 d_headswitch, d_trkseek, d_flags;
-		__le32 d_drivedata[5];
-		__le32 d_spare[5];
-		__le32 d_magic2;
-		__le16 d_checksum;
-		__le16 d_npartitions;
-		__le32 d_bbsize, d_sbsize;
-		struct d_partition {
-			__le32 p_size;
-			__le32 p_offset;
-			__le32 p_fsize;
-			u8  p_fstype;
-			u8  p_frag;
-			__le16 p_cpg;
-		} d_partitions[MAX_OSF_PARTITIONS];
-	} * label;
-	struct d_partition * partition;
-
-	data = read_part_sector(state, 0, &sect);
-	if (!data)
-		return -1;
-
-	label = (struct disklabel *) (data+64);
-	partition = label->d_partitions;
-	if (le32_to_cpu(label->d_magic) != DISKLABELMAGIC) {
-		put_dev_sector(sect);
-		return 0;
-	}
-	if (le32_to_cpu(label->d_magic2) != DISKLABELMAGIC) {
-		put_dev_sector(sect);
-		return 0;
-	}
-	npartitions = le16_to_cpu(label->d_npartitions);
-	if (npartitions > MAX_OSF_PARTITIONS) {
-		put_dev_sector(sect);
-		return 0;
-	}
-	for (i = 0 ; i < npartitions; i++, partition++) {
-		if (slot == state->limit)
-		        break;
-		if (le32_to_cpu(partition->p_size))
-			put_partition(state, slot,
-				le32_to_cpu(partition->p_offset),
-				le32_to_cpu(partition->p_size));
-		slot++;
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	put_dev_sector(sect);
-	return 1;
-}
diff --git a/fs/partitions/osf.h b/fs/partitions/osf.h
deleted file mode 100644
index 20ed2315ec1..00000000000
--- a/fs/partitions/osf.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/*
- *  fs/partitions/osf.h
- */
-
-#define DISKLABELMAGIC (0x82564557UL)
-
-int osf_partition(struct parsed_partitions *state);
diff --git a/fs/partitions/sgi.c b/fs/partitions/sgi.c
deleted file mode 100644
index ea8a86dceaf..00000000000
--- a/fs/partitions/sgi.c
+++ /dev/null
@@ -1,82 +0,0 @@
-/*
- *  fs/partitions/sgi.c
- *
- *  Code extracted from drivers/block/genhd.c
- */
-
-#include "check.h"
-#include "sgi.h"
-
-struct sgi_disklabel {
-	__be32 magic_mushroom;		/* Big fat spliff... */
-	__be16 root_part_num;		/* Root partition number */
-	__be16 swap_part_num;		/* Swap partition number */
-	s8 boot_file[16];		/* Name of boot file for ARCS */
-	u8 _unused0[48];		/* Device parameter useless crapola.. */
-	struct sgi_volume {
-		s8 name[8];		/* Name of volume */
-		__be32 block_num;		/* Logical block number */
-		__be32 num_bytes;		/* How big, in bytes */
-	} volume[15];
-	struct sgi_partition {
-		__be32 num_blocks;		/* Size in logical blocks */
-		__be32 first_block;	/* First logical block */
-		__be32 type;		/* Type of this partition */
-	} partitions[16];
-	__be32 csum;			/* Disk label checksum */
-	__be32 _unused1;			/* Padding */
-};
-
-int sgi_partition(struct parsed_partitions *state)
-{
-	int i, csum;
-	__be32 magic;
-	int slot = 1;
-	unsigned int start, blocks;
-	__be32 *ui, cs;
-	Sector sect;
-	struct sgi_disklabel *label;
-	struct sgi_partition *p;
-	char b[BDEVNAME_SIZE];
-
-	label = read_part_sector(state, 0, &sect);
-	if (!label)
-		return -1;
-	p = &label->partitions[0];
-	magic = label->magic_mushroom;
-	if(be32_to_cpu(magic) != SGI_LABEL_MAGIC) {
-		/*printk("Dev %s SGI disklabel: bad magic %08x\n",
-		       bdevname(bdev, b), be32_to_cpu(magic));*/
-		put_dev_sector(sect);
-		return 0;
-	}
-	ui = ((__be32 *) (label + 1)) - 1;
-	for(csum = 0; ui >= ((__be32 *) label);) {
-		cs = *ui--;
-		csum += be32_to_cpu(cs);
-	}
-	if(csum) {
-		printk(KERN_WARNING "Dev %s SGI disklabel: csum bad, label corrupted\n",
-		       bdevname(state->bdev, b));
-		put_dev_sector(sect);
-		return 0;
-	}
-	/* All SGI disk labels have 16 partitions, disks under Linux only
-	 * have 15 minor's.  Luckily there are always a few zero length
-	 * partitions which we don't care about so we never overflow the
-	 * current_minor.
-	 */
-	for(i = 0; i < 16; i++, p++) {
-		blocks = be32_to_cpu(p->num_blocks);
-		start  = be32_to_cpu(p->first_block);
-		if (blocks) {
-			put_partition(state, slot, start, blocks);
-			if (be32_to_cpu(p->type) == LINUX_RAID_PARTITION)
-				state->parts[slot].flags = ADDPART_FLAG_RAID;
-		}
-		slot++;
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	put_dev_sector(sect);
-	return 1;
-}
diff --git a/fs/partitions/sgi.h b/fs/partitions/sgi.h
deleted file mode 100644
index b9553ebdd5a..00000000000
--- a/fs/partitions/sgi.h
+++ /dev/null
@@ -1,8 +0,0 @@
-/*
- *  fs/partitions/sgi.h
- */
-
-extern int sgi_partition(struct parsed_partitions *state);
-
-#define SGI_LABEL_MAGIC 0x0be5a941
-
diff --git a/fs/partitions/sun.c b/fs/partitions/sun.c
deleted file mode 100644
index b5b6fcfb3d3..00000000000
--- a/fs/partitions/sun.c
+++ /dev/null
@@ -1,122 +0,0 @@
-/*
- *  fs/partitions/sun.c
- *
- *  Code extracted from drivers/block/genhd.c
- *
- *  Copyright (C) 1991-1998  Linus Torvalds
- *  Re-organised Feb 1998 Russell King
- */
-
-#include "check.h"
-#include "sun.h"
-
-int sun_partition(struct parsed_partitions *state)
-{
-	int i;
-	__be16 csum;
-	int slot = 1;
-	__be16 *ush;
-	Sector sect;
-	struct sun_disklabel {
-		unsigned char info[128];   /* Informative text string */
-		struct sun_vtoc {
-		    __be32 version;     /* Layout version */
-		    char   volume[8];   /* Volume name */
-		    __be16 nparts;      /* Number of partitions */
-		    struct sun_info {           /* Partition hdrs, sec 2 */
-			__be16 id;
-			__be16 flags;
-		    } infos[8];
-		    __be16 padding;     /* Alignment padding */
-		    __be32 bootinfo[3];  /* Info needed by mboot */
-		    __be32 sanity;       /* To verify vtoc sanity */
-		    __be32 reserved[10]; /* Free space */
-		    __be32 timestamp[8]; /* Partition timestamp */
-		} vtoc;
-		__be32 write_reinstruct; /* sectors to skip, writes */
-		__be32 read_reinstruct;  /* sectors to skip, reads */
-		unsigned char spare[148]; /* Padding */
-		__be16 rspeed;     /* Disk rotational speed */
-		__be16 pcylcount;  /* Physical cylinder count */
-		__be16 sparecyl;   /* extra sects per cylinder */
-		__be16 obs1;       /* gap1 */
-		__be16 obs2;       /* gap2 */
-		__be16 ilfact;     /* Interleave factor */
-		__be16 ncyl;       /* Data cylinder count */
-		__be16 nacyl;      /* Alt. cylinder count */
-		__be16 ntrks;      /* Tracks per cylinder */
-		__be16 nsect;      /* Sectors per track */
-		__be16 obs3;       /* bhead - Label head offset */
-		__be16 obs4;       /* ppart - Physical Partition */
-		struct sun_partition {
-			__be32 start_cylinder;
-			__be32 num_sectors;
-		} partitions[8];
-		__be16 magic;      /* Magic number */
-		__be16 csum;       /* Label xor'd checksum */
-	} * label;
-	struct sun_partition *p;
-	unsigned long spc;
-	char b[BDEVNAME_SIZE];
-	int use_vtoc;
-	int nparts;
-
-	label = read_part_sector(state, 0, &sect);
-	if (!label)
-		return -1;
-
-	p = label->partitions;
-	if (be16_to_cpu(label->magic) != SUN_LABEL_MAGIC) {
-/*		printk(KERN_INFO "Dev %s Sun disklabel: bad magic %04x\n",
-		       bdevname(bdev, b), be16_to_cpu(label->magic)); */
-		put_dev_sector(sect);
-		return 0;
-	}
-	/* Look at the checksum */
-	ush = ((__be16 *) (label+1)) - 1;
-	for (csum = 0; ush >= ((__be16 *) label);)
-		csum ^= *ush--;
-	if (csum) {
-		printk("Dev %s Sun disklabel: Csum bad, label corrupted\n",
-		       bdevname(state->bdev, b));
-		put_dev_sector(sect);
-		return 0;
-	}
-
-	/* Check to see if we can use the VTOC table */
-	use_vtoc = ((be32_to_cpu(label->vtoc.sanity) == SUN_VTOC_SANITY) &&
-		    (be32_to_cpu(label->vtoc.version) == 1) &&
-		    (be16_to_cpu(label->vtoc.nparts) <= 8));
-
-	/* Use 8 partition entries if not specified in validated VTOC */
-	nparts = (use_vtoc) ? be16_to_cpu(label->vtoc.nparts) : 8;
-
-	/*
-	 * So that old Linux-Sun partitions continue to work,
-	 * alow the VTOC to be used under the additional condition ...
-	 */
-	use_vtoc = use_vtoc || !(label->vtoc.sanity ||
-				 label->vtoc.version || label->vtoc.nparts);
-	spc = be16_to_cpu(label->ntrks) * be16_to_cpu(label->nsect);
-	for (i = 0; i < nparts; i++, p++) {
-		unsigned long st_sector;
-		unsigned int num_sectors;
-
-		st_sector = be32_to_cpu(p->start_cylinder) * spc;
-		num_sectors = be32_to_cpu(p->num_sectors);
-		if (num_sectors) {
-			put_partition(state, slot, st_sector, num_sectors);
-			state->parts[slot].flags = 0;
-			if (use_vtoc) {
-				if (be16_to_cpu(label->vtoc.infos[i].id) == LINUX_RAID_PARTITION)
-					state->parts[slot].flags |= ADDPART_FLAG_RAID;
-				else if (be16_to_cpu(label->vtoc.infos[i].id) == SUN_WHOLE_DISK)
-					state->parts[slot].flags |= ADDPART_FLAG_WHOLEDISK;
-			}
-		}
-		slot++;
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	put_dev_sector(sect);
-	return 1;
-}
diff --git a/fs/partitions/sun.h b/fs/partitions/sun.h
deleted file mode 100644
index 2424baa8319..00000000000
--- a/fs/partitions/sun.h
+++ /dev/null
@@ -1,8 +0,0 @@
-/*
- *  fs/partitions/sun.h
- */
-
-#define SUN_LABEL_MAGIC          0xDABE
-#define SUN_VTOC_SANITY          0x600DDEEE
-
-int sun_partition(struct parsed_partitions *state);
diff --git a/fs/partitions/sysv68.c b/fs/partitions/sysv68.c
deleted file mode 100644
index 9627ccffc1c..00000000000
--- a/fs/partitions/sysv68.c
+++ /dev/null
@@ -1,95 +0,0 @@
-/*
- *  fs/partitions/sysv68.c
- *
- *  Copyright (C) 2007 Philippe De Muyter <phdm@macqel.be>
- */
-
-#include "check.h"
-#include "sysv68.h"
-
-/*
- *	Volume ID structure: on first 256-bytes sector of disk
- */
-
-struct volumeid {
-	u8	vid_unused[248];
-	u8	vid_mac[8];	/* ASCII string "MOTOROLA" */
-};
-
-/*
- *	config block: second 256-bytes sector on disk
- */
-
-struct dkconfig {
-	u8	ios_unused0[128];
-	__be32	ios_slcblk;	/* Slice table block number */
-	__be16	ios_slccnt;	/* Number of entries in slice table */
-	u8	ios_unused1[122];
-};
-
-/*
- *	combined volumeid and dkconfig block
- */
-
-struct dkblk0 {
-	struct volumeid dk_vid;
-	struct dkconfig dk_ios;
-};
-
-/*
- *	Slice Table Structure
- */
-
-struct slice {
-	__be32	nblocks;		/* slice size (in blocks) */
-	__be32	blkoff;			/* block offset of slice */
-};
-
-
-int sysv68_partition(struct parsed_partitions *state)
-{
-	int i, slices;
-	int slot = 1;
-	Sector sect;
-	unsigned char *data;
-	struct dkblk0 *b;
-	struct slice *slice;
-	char tmp[64];
-
-	data = read_part_sector(state, 0, &sect);
-	if (!data)
-		return -1;
-
-	b = (struct dkblk0 *)data;
-	if (memcmp(b->dk_vid.vid_mac, "MOTOROLA", sizeof(b->dk_vid.vid_mac))) {
-		put_dev_sector(sect);
-		return 0;
-	}
-	slices = be16_to_cpu(b->dk_ios.ios_slccnt);
-	i = be32_to_cpu(b->dk_ios.ios_slcblk);
-	put_dev_sector(sect);
-
-	data = read_part_sector(state, i, &sect);
-	if (!data)
-		return -1;
-
-	slices -= 1; /* last slice is the whole disk */
-	snprintf(tmp, sizeof(tmp), "sysV68: %s(s%u)", state->name, slices);
-	strlcat(state->pp_buf, tmp, PAGE_SIZE);
-	slice = (struct slice *)data;
-	for (i = 0; i < slices; i++, slice++) {
-		if (slot == state->limit)
-			break;
-		if (be32_to_cpu(slice->nblocks)) {
-			put_partition(state, slot,
-				be32_to_cpu(slice->blkoff),
-				be32_to_cpu(slice->nblocks));
-			snprintf(tmp, sizeof(tmp), "(s%u)", i);
-			strlcat(state->pp_buf, tmp, PAGE_SIZE);
-		}
-		slot++;
-	}
-	strlcat(state->pp_buf, "\n", PAGE_SIZE);
-	put_dev_sector(sect);
-	return 1;
-}
diff --git a/fs/partitions/sysv68.h b/fs/partitions/sysv68.h
deleted file mode 100644
index bf2f5ffa97a..00000000000
--- a/fs/partitions/sysv68.h
+++ /dev/null
@@ -1 +0,0 @@
-extern int sysv68_partition(struct parsed_partitions *state);
diff --git a/fs/partitions/ultrix.c b/fs/partitions/ultrix.c
deleted file mode 100644
index 8dbaf9f77a9..00000000000
--- a/fs/partitions/ultrix.c
+++ /dev/null
@@ -1,48 +0,0 @@
-/*
- *  fs/partitions/ultrix.c
- *
- *  Code extracted from drivers/block/genhd.c
- *
- *  Re-organised Jul 1999 Russell King
- */
-
-#include "check.h"
-#include "ultrix.h"
-
-int ultrix_partition(struct parsed_partitions *state)
-{
-	int i;
-	Sector sect;
-	unsigned char *data;
-	struct ultrix_disklabel {
-		s32	pt_magic;	/* magic no. indicating part. info exits */
-		s32	pt_valid;	/* set by driver if pt is current */
-		struct  pt_info {
-			s32		pi_nblocks; /* no. of sectors */
-			u32		pi_blkoff;  /* block offset for start */
-		} pt_part[8];
-	} *label;
-
-#define PT_MAGIC	0x032957	/* Partition magic number */
-#define PT_VALID	1		/* Indicates if struct is valid */
-
-	data = read_part_sector(state, (16384 - sizeof(*label))/512, &sect);
-	if (!data)
-		return -1;
-	
-	label = (struct ultrix_disklabel *)(data + 512 - sizeof(*label));
-
-	if (label->pt_magic == PT_MAGIC && label->pt_valid == PT_VALID) {
-		for (i=0; i<8; i++)
-			if (label->pt_part[i].pi_nblocks)
-				put_partition(state, i+1, 
-					      label->pt_part[i].pi_blkoff,
-					      label->pt_part[i].pi_nblocks);
-		put_dev_sector(sect);
-		strlcat(state->pp_buf, "\n", PAGE_SIZE);
-		return 1;
-	} else {
-		put_dev_sector(sect);
-		return 0;
-	}
-}
diff --git a/fs/partitions/ultrix.h b/fs/partitions/ultrix.h
deleted file mode 100644
index a3cc00b2bde..00000000000
--- a/fs/partitions/ultrix.h
+++ /dev/null
@@ -1,5 +0,0 @@
-/*
- *  fs/partitions/ultrix.h
- */
-
-int ultrix_partition(struct parsed_partitions *state);
-- 
cgit v1.2.3-70-g09d2


From 94ea4158f1733e3b10cef067d535f504866e0c41 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Fri, 16 Sep 2011 00:45:36 -0400
Subject: separate partition format handling from generic code

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/Makefile            |   2 +-
 block/partition-generic.c | 537 ++++++++++++++++++++++++++++++++++++++++++++++
 block/partitions/check.c  | 523 +-------------------------------------------
 block/partitions/check.h  |   3 +
 4 files changed, 542 insertions(+), 523 deletions(-)
 create mode 100644 block/partition-generic.c

(limited to 'block')

diff --git a/block/Makefile b/block/Makefile
index 3199c0cdef5..39b76ba66ff 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o \
-			partitions/
+			partition-generic.o partitions/
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
diff --git a/block/partition-generic.c b/block/partition-generic.c
new file mode 100644
index 00000000000..d06ec1c829c
--- /dev/null
+++ b/block/partition-generic.c
@@ -0,0 +1,537 @@
+/*
+ *  Code extracted from drivers/block/genhd.c
+ *  Copyright (C) 1991-1998  Linus Torvalds
+ *  Re-organised Feb 1998 Russell King
+ *
+ *  We now have independent partition support from the
+ *  block drivers, which allows all the partition code to
+ *  be grouped in one location, and it to be mostly self
+ *  contained.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/kmod.h>
+#include <linux/ctype.h>
+#include <linux/genhd.h>
+#include <linux/blktrace_api.h>
+
+#include "partitions/check.h"
+
+#ifdef CONFIG_BLK_DEV_MD
+extern void md_autodetect_dev(dev_t dev);
+#endif
+ 
+/*
+ * disk_name() is used by partition check code and the genhd driver.
+ * It formats the devicename of the indicated disk into
+ * the supplied buffer (of size at least 32), and returns
+ * a pointer to that same buffer (for convenience).
+ */
+
+char *disk_name(struct gendisk *hd, int partno, char *buf)
+{
+	if (!partno)
+		snprintf(buf, BDEVNAME_SIZE, "%s", hd->disk_name);
+	else if (isdigit(hd->disk_name[strlen(hd->disk_name)-1]))
+		snprintf(buf, BDEVNAME_SIZE, "%sp%d", hd->disk_name, partno);
+	else
+		snprintf(buf, BDEVNAME_SIZE, "%s%d", hd->disk_name, partno);
+
+	return buf;
+}
+
+const char *bdevname(struct block_device *bdev, char *buf)
+{
+	return disk_name(bdev->bd_disk, bdev->bd_part->partno, buf);
+}
+
+EXPORT_SYMBOL(bdevname);
+
+/*
+ * There's very little reason to use this, you should really
+ * have a struct block_device just about everywhere and use
+ * bdevname() instead.
+ */
+const char *__bdevname(dev_t dev, char *buffer)
+{
+	scnprintf(buffer, BDEVNAME_SIZE, "unknown-block(%u,%u)",
+				MAJOR(dev), MINOR(dev));
+	return buffer;
+}
+
+EXPORT_SYMBOL(__bdevname);
+
+static ssize_t part_partition_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%d\n", p->partno);
+}
+
+static ssize_t part_start_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%llu\n",(unsigned long long)p->start_sect);
+}
+
+ssize_t part_size_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%llu\n",(unsigned long long)p->nr_sects);
+}
+
+static ssize_t part_ro_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%d\n", p->policy ? 1 : 0);
+}
+
+static ssize_t part_alignment_offset_show(struct device *dev,
+					  struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%llu\n", (unsigned long long)p->alignment_offset);
+}
+
+static ssize_t part_discard_alignment_show(struct device *dev,
+					   struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%u\n", p->discard_alignment);
+}
+
+ssize_t part_stat_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	int cpu;
+
+	cpu = part_stat_lock();
+	part_round_stats(cpu, p);
+	part_stat_unlock();
+	return sprintf(buf,
+		"%8lu %8lu %8llu %8u "
+		"%8lu %8lu %8llu %8u "
+		"%8u %8u %8u"
+		"\n",
+		part_stat_read(p, ios[READ]),
+		part_stat_read(p, merges[READ]),
+		(unsigned long long)part_stat_read(p, sectors[READ]),
+		jiffies_to_msecs(part_stat_read(p, ticks[READ])),
+		part_stat_read(p, ios[WRITE]),
+		part_stat_read(p, merges[WRITE]),
+		(unsigned long long)part_stat_read(p, sectors[WRITE]),
+		jiffies_to_msecs(part_stat_read(p, ticks[WRITE])),
+		part_in_flight(p),
+		jiffies_to_msecs(part_stat_read(p, io_ticks)),
+		jiffies_to_msecs(part_stat_read(p, time_in_queue)));
+}
+
+ssize_t part_inflight_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
+		atomic_read(&p->in_flight[1]));
+}
+
+#ifdef CONFIG_FAIL_MAKE_REQUEST
+ssize_t part_fail_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%d\n", p->make_it_fail);
+}
+
+ssize_t part_fail_store(struct device *dev,
+			struct device_attribute *attr,
+			const char *buf, size_t count)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	int i;
+
+	if (count > 0 && sscanf(buf, "%d", &i) > 0)
+		p->make_it_fail = (i == 0) ? 0 : 1;
+
+	return count;
+}
+#endif
+
+static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
+static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
+static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
+static DEVICE_ATTR(ro, S_IRUGO, part_ro_show, NULL);
+static DEVICE_ATTR(alignment_offset, S_IRUGO, part_alignment_offset_show, NULL);
+static DEVICE_ATTR(discard_alignment, S_IRUGO, part_discard_alignment_show,
+		   NULL);
+static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
+static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+#ifdef CONFIG_FAIL_MAKE_REQUEST
+static struct device_attribute dev_attr_fail =
+	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
+#endif
+
+static struct attribute *part_attrs[] = {
+	&dev_attr_partition.attr,
+	&dev_attr_start.attr,
+	&dev_attr_size.attr,
+	&dev_attr_ro.attr,
+	&dev_attr_alignment_offset.attr,
+	&dev_attr_discard_alignment.attr,
+	&dev_attr_stat.attr,
+	&dev_attr_inflight.attr,
+#ifdef CONFIG_FAIL_MAKE_REQUEST
+	&dev_attr_fail.attr,
+#endif
+	NULL
+};
+
+static struct attribute_group part_attr_group = {
+	.attrs = part_attrs,
+};
+
+static const struct attribute_group *part_attr_groups[] = {
+	&part_attr_group,
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	&blk_trace_attr_group,
+#endif
+	NULL
+};
+
+static void part_release(struct device *dev)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	free_part_stats(p);
+	free_part_info(p);
+	kfree(p);
+}
+
+struct device_type part_type = {
+	.name		= "partition",
+	.groups		= part_attr_groups,
+	.release	= part_release,
+};
+
+static void delete_partition_rcu_cb(struct rcu_head *head)
+{
+	struct hd_struct *part = container_of(head, struct hd_struct, rcu_head);
+
+	part->start_sect = 0;
+	part->nr_sects = 0;
+	part_stat_set_all(part, 0);
+	put_device(part_to_dev(part));
+}
+
+void __delete_partition(struct hd_struct *part)
+{
+	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+}
+
+void delete_partition(struct gendisk *disk, int partno)
+{
+	struct disk_part_tbl *ptbl = disk->part_tbl;
+	struct hd_struct *part;
+
+	if (partno >= ptbl->len)
+		return;
+
+	part = ptbl->part[partno];
+	if (!part)
+		return;
+
+	blk_free_devt(part_devt(part));
+	rcu_assign_pointer(ptbl->part[partno], NULL);
+	rcu_assign_pointer(ptbl->last_lookup, NULL);
+	kobject_put(part->holder_dir);
+	device_del(part_to_dev(part));
+
+	hd_struct_put(part);
+}
+
+static ssize_t whole_disk_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	return 0;
+}
+static DEVICE_ATTR(whole_disk, S_IRUSR | S_IRGRP | S_IROTH,
+		   whole_disk_show, NULL);
+
+struct hd_struct *add_partition(struct gendisk *disk, int partno,
+				sector_t start, sector_t len, int flags,
+				struct partition_meta_info *info)
+{
+	struct hd_struct *p;
+	dev_t devt = MKDEV(0, 0);
+	struct device *ddev = disk_to_dev(disk);
+	struct device *pdev;
+	struct disk_part_tbl *ptbl;
+	const char *dname;
+	int err;
+
+	err = disk_expand_part_tbl(disk, partno);
+	if (err)
+		return ERR_PTR(err);
+	ptbl = disk->part_tbl;
+
+	if (ptbl->part[partno])
+		return ERR_PTR(-EBUSY);
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	if (!p)
+		return ERR_PTR(-EBUSY);
+
+	if (!init_part_stats(p)) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	pdev = part_to_dev(p);
+
+	p->start_sect = start;
+	p->alignment_offset =
+		queue_limit_alignment_offset(&disk->queue->limits, start);
+	p->discard_alignment =
+		queue_limit_discard_alignment(&disk->queue->limits, start);
+	p->nr_sects = len;
+	p->partno = partno;
+	p->policy = get_disk_ro(disk);
+
+	if (info) {
+		struct partition_meta_info *pinfo = alloc_part_info(disk);
+		if (!pinfo)
+			goto out_free_stats;
+		memcpy(pinfo, info, sizeof(*info));
+		p->info = pinfo;
+	}
+
+	dname = dev_name(ddev);
+	if (isdigit(dname[strlen(dname) - 1]))
+		dev_set_name(pdev, "%sp%d", dname, partno);
+	else
+		dev_set_name(pdev, "%s%d", dname, partno);
+
+	device_initialize(pdev);
+	pdev->class = &block_class;
+	pdev->type = &part_type;
+	pdev->parent = ddev;
+
+	err = blk_alloc_devt(p, &devt);
+	if (err)
+		goto out_free_info;
+	pdev->devt = devt;
+
+	/* delay uevent until 'holders' subdir is created */
+	dev_set_uevent_suppress(pdev, 1);
+	err = device_add(pdev);
+	if (err)
+		goto out_put;
+
+	err = -ENOMEM;
+	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
+	if (!p->holder_dir)
+		goto out_del;
+
+	dev_set_uevent_suppress(pdev, 0);
+	if (flags & ADDPART_FLAG_WHOLEDISK) {
+		err = device_create_file(pdev, &dev_attr_whole_disk);
+		if (err)
+			goto out_del;
+	}
+
+	/* everything is up and running, commence */
+	rcu_assign_pointer(ptbl->part[partno], p);
+
+	/* suppress uevent if the disk suppresses it */
+	if (!dev_get_uevent_suppress(ddev))
+		kobject_uevent(&pdev->kobj, KOBJ_ADD);
+
+	hd_ref_init(p);
+	return p;
+
+out_free_info:
+	free_part_info(p);
+out_free_stats:
+	free_part_stats(p);
+out_free:
+	kfree(p);
+	return ERR_PTR(err);
+out_del:
+	kobject_put(p->holder_dir);
+	device_del(pdev);
+out_put:
+	put_device(pdev);
+	blk_free_devt(devt);
+	return ERR_PTR(err);
+}
+
+static bool disk_unlock_native_capacity(struct gendisk *disk)
+{
+	const struct block_device_operations *bdops = disk->fops;
+
+	if (bdops->unlock_native_capacity &&
+	    !(disk->flags & GENHD_FL_NATIVE_CAPACITY)) {
+		printk(KERN_CONT "enabling native capacity\n");
+		bdops->unlock_native_capacity(disk);
+		disk->flags |= GENHD_FL_NATIVE_CAPACITY;
+		return true;
+	} else {
+		printk(KERN_CONT "truncated\n");
+		return false;
+	}
+}
+
+int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
+{
+	struct parsed_partitions *state = NULL;
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+	int p, highest, res;
+rescan:
+	if (state && !IS_ERR(state)) {
+		kfree(state);
+		state = NULL;
+	}
+
+	if (bdev->bd_part_count)
+		return -EBUSY;
+	res = invalidate_partition(disk, 0);
+	if (res)
+		return res;
+
+	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
+	while ((part = disk_part_iter_next(&piter)))
+		delete_partition(disk, part->partno);
+	disk_part_iter_exit(&piter);
+
+	if (disk->fops->revalidate_disk)
+		disk->fops->revalidate_disk(disk);
+	check_disk_size_change(disk, bdev);
+	bdev->bd_invalidated = 0;
+	if (!get_capacity(disk) || !(state = check_partition(disk, bdev)))
+		return 0;
+	if (IS_ERR(state)) {
+		/*
+		 * I/O error reading the partition table.  If any
+		 * partition code tried to read beyond EOD, retry
+		 * after unlocking native capacity.
+		 */
+		if (PTR_ERR(state) == -ENOSPC) {
+			printk(KERN_WARNING "%s: partition table beyond EOD, ",
+			       disk->disk_name);
+			if (disk_unlock_native_capacity(disk))
+				goto rescan;
+		}
+		return -EIO;
+	}
+	/*
+	 * If any partition code tried to read beyond EOD, try
+	 * unlocking native capacity even if partition table is
+	 * successfully read as we could be missing some partitions.
+	 */
+	if (state->access_beyond_eod) {
+		printk(KERN_WARNING
+		       "%s: partition table partially beyond EOD, ",
+		       disk->disk_name);
+		if (disk_unlock_native_capacity(disk))
+			goto rescan;
+	}
+
+	/* tell userspace that the media / partition table may have changed */
+	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
+
+	/* Detect the highest partition number and preallocate
+	 * disk->part_tbl.  This is an optimization and not strictly
+	 * necessary.
+	 */
+	for (p = 1, highest = 0; p < state->limit; p++)
+		if (state->parts[p].size)
+			highest = p;
+
+	disk_expand_part_tbl(disk, highest);
+
+	/* add partitions */
+	for (p = 1; p < state->limit; p++) {
+		sector_t size, from;
+		struct partition_meta_info *info = NULL;
+
+		size = state->parts[p].size;
+		if (!size)
+			continue;
+
+		from = state->parts[p].from;
+		if (from >= get_capacity(disk)) {
+			printk(KERN_WARNING
+			       "%s: p%d start %llu is beyond EOD, ",
+			       disk->disk_name, p, (unsigned long long) from);
+			if (disk_unlock_native_capacity(disk))
+				goto rescan;
+			continue;
+		}
+
+		if (from + size > get_capacity(disk)) {
+			printk(KERN_WARNING
+			       "%s: p%d size %llu extends beyond EOD, ",
+			       disk->disk_name, p, (unsigned long long) size);
+
+			if (disk_unlock_native_capacity(disk)) {
+				/* free state and restart */
+				goto rescan;
+			} else {
+				/*
+				 * we can not ignore partitions of broken tables
+				 * created by for example camera firmware, but
+				 * we limit them to the end of the disk to avoid
+				 * creating invalid block devices
+				 */
+				size = get_capacity(disk) - from;
+			}
+		}
+
+		if (state->parts[p].has_info)
+			info = &state->parts[p].info;
+		part = add_partition(disk, p, from, size,
+				     state->parts[p].flags,
+				     &state->parts[p].info);
+		if (IS_ERR(part)) {
+			printk(KERN_ERR " %s: p%d could not be added: %ld\n",
+			       disk->disk_name, p, -PTR_ERR(part));
+			continue;
+		}
+#ifdef CONFIG_BLK_DEV_MD
+		if (state->parts[p].flags & ADDPART_FLAG_RAID)
+			md_autodetect_dev(part_to_dev(part)->devt);
+#endif
+	}
+	kfree(state);
+	return 0;
+}
+
+unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
+{
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
+	struct page *page;
+
+	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
+				 NULL);
+	if (!IS_ERR(page)) {
+		if (PageError(page))
+			goto fail;
+		p->v = page;
+		return (unsigned char *)page_address(page) +  ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
+fail:
+		page_cache_release(page);
+	}
+	p->v = NULL;
+	return NULL;
+}
+
+EXPORT_SYMBOL(read_dev_sector);
diff --git a/block/partitions/check.c b/block/partitions/check.c
index e3c63d1c5e1..bc908672c97 100644
--- a/block/partitions/check.c
+++ b/block/partitions/check.c
@@ -13,14 +13,9 @@
  *  Added needed MAJORS for new pairs, {hdi,hdj}, {hdk,hdl}
  */
 
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/fs.h>
 #include <linux/slab.h>
-#include <linux/kmod.h>
 #include <linux/ctype.h>
 #include <linux/genhd.h>
-#include <linux/blktrace_api.h>
 
 #include "check.h"
 
@@ -39,10 +34,6 @@
 #include "karma.h"
 #include "sysv68.h"
 
-#ifdef CONFIG_BLK_DEV_MD
-extern void md_autodetect_dev(dev_t dev);
-#endif
-
 int warn_no_part = 1; /*This is ugly: should make genhd removable media aware*/
 
 static int (*check_part[])(struct parsed_partitions *) = {
@@ -114,48 +105,8 @@ static int (*check_part[])(struct parsed_partitions *) = {
 #endif
 	NULL
 };
- 
-/*
- * disk_name() is used by partition check code and the genhd driver.
- * It formats the devicename of the indicated disk into
- * the supplied buffer (of size at least 32), and returns
- * a pointer to that same buffer (for convenience).
- */
-
-char *disk_name(struct gendisk *hd, int partno, char *buf)
-{
-	if (!partno)
-		snprintf(buf, BDEVNAME_SIZE, "%s", hd->disk_name);
-	else if (isdigit(hd->disk_name[strlen(hd->disk_name)-1]))
-		snprintf(buf, BDEVNAME_SIZE, "%sp%d", hd->disk_name, partno);
-	else
-		snprintf(buf, BDEVNAME_SIZE, "%s%d", hd->disk_name, partno);
-
-	return buf;
-}
-
-const char *bdevname(struct block_device *bdev, char *buf)
-{
-	return disk_name(bdev->bd_disk, bdev->bd_part->partno, buf);
-}
-
-EXPORT_SYMBOL(bdevname);
-
-/*
- * There's very little reason to use this, you should really
- * have a struct block_device just about everywhere and use
- * bdevname() instead.
- */
-const char *__bdevname(dev_t dev, char *buffer)
-{
-	scnprintf(buffer, BDEVNAME_SIZE, "unknown-block(%u,%u)",
-				MAJOR(dev), MINOR(dev));
-	return buffer;
-}
-
-EXPORT_SYMBOL(__bdevname);
 
-static struct parsed_partitions *
+struct parsed_partitions *
 check_partition(struct gendisk *hd, struct block_device *bdev)
 {
 	struct parsed_partitions *state;
@@ -213,475 +164,3 @@ check_partition(struct gendisk *hd, struct block_device *bdev)
 	kfree(state);
 	return ERR_PTR(res);
 }
-
-static ssize_t part_partition_show(struct device *dev,
-				   struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%d\n", p->partno);
-}
-
-static ssize_t part_start_show(struct device *dev,
-			       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%llu\n",(unsigned long long)p->start_sect);
-}
-
-ssize_t part_size_show(struct device *dev,
-		       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%llu\n",(unsigned long long)p->nr_sects);
-}
-
-static ssize_t part_ro_show(struct device *dev,
-			    struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%d\n", p->policy ? 1 : 0);
-}
-
-static ssize_t part_alignment_offset_show(struct device *dev,
-					  struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%llu\n", (unsigned long long)p->alignment_offset);
-}
-
-static ssize_t part_discard_alignment_show(struct device *dev,
-					   struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	return sprintf(buf, "%u\n", p->discard_alignment);
-}
-
-ssize_t part_stat_show(struct device *dev,
-		       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	int cpu;
-
-	cpu = part_stat_lock();
-	part_round_stats(cpu, p);
-	part_stat_unlock();
-	return sprintf(buf,
-		"%8lu %8lu %8llu %8u "
-		"%8lu %8lu %8llu %8u "
-		"%8u %8u %8u"
-		"\n",
-		part_stat_read(p, ios[READ]),
-		part_stat_read(p, merges[READ]),
-		(unsigned long long)part_stat_read(p, sectors[READ]),
-		jiffies_to_msecs(part_stat_read(p, ticks[READ])),
-		part_stat_read(p, ios[WRITE]),
-		part_stat_read(p, merges[WRITE]),
-		(unsigned long long)part_stat_read(p, sectors[WRITE]),
-		jiffies_to_msecs(part_stat_read(p, ticks[WRITE])),
-		part_in_flight(p),
-		jiffies_to_msecs(part_stat_read(p, io_ticks)),
-		jiffies_to_msecs(part_stat_read(p, time_in_queue)));
-}
-
-ssize_t part_inflight_show(struct device *dev,
-			struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
-}
-
-#ifdef CONFIG_FAIL_MAKE_REQUEST
-ssize_t part_fail_show(struct device *dev,
-		       struct device_attribute *attr, char *buf)
-{
-	struct hd_struct *p = dev_to_part(dev);
-
-	return sprintf(buf, "%d\n", p->make_it_fail);
-}
-
-ssize_t part_fail_store(struct device *dev,
-			struct device_attribute *attr,
-			const char *buf, size_t count)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	int i;
-
-	if (count > 0 && sscanf(buf, "%d", &i) > 0)
-		p->make_it_fail = (i == 0) ? 0 : 1;
-
-	return count;
-}
-#endif
-
-static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
-static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
-static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
-static DEVICE_ATTR(ro, S_IRUGO, part_ro_show, NULL);
-static DEVICE_ATTR(alignment_offset, S_IRUGO, part_alignment_offset_show, NULL);
-static DEVICE_ATTR(discard_alignment, S_IRUGO, part_discard_alignment_show,
-		   NULL);
-static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
-static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
-#ifdef CONFIG_FAIL_MAKE_REQUEST
-static struct device_attribute dev_attr_fail =
-	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
-#endif
-
-static struct attribute *part_attrs[] = {
-	&dev_attr_partition.attr,
-	&dev_attr_start.attr,
-	&dev_attr_size.attr,
-	&dev_attr_ro.attr,
-	&dev_attr_alignment_offset.attr,
-	&dev_attr_discard_alignment.attr,
-	&dev_attr_stat.attr,
-	&dev_attr_inflight.attr,
-#ifdef CONFIG_FAIL_MAKE_REQUEST
-	&dev_attr_fail.attr,
-#endif
-	NULL
-};
-
-static struct attribute_group part_attr_group = {
-	.attrs = part_attrs,
-};
-
-static const struct attribute_group *part_attr_groups[] = {
-	&part_attr_group,
-#ifdef CONFIG_BLK_DEV_IO_TRACE
-	&blk_trace_attr_group,
-#endif
-	NULL
-};
-
-static void part_release(struct device *dev)
-{
-	struct hd_struct *p = dev_to_part(dev);
-	free_part_stats(p);
-	free_part_info(p);
-	kfree(p);
-}
-
-struct device_type part_type = {
-	.name		= "partition",
-	.groups		= part_attr_groups,
-	.release	= part_release,
-};
-
-static void delete_partition_rcu_cb(struct rcu_head *head)
-{
-	struct hd_struct *part = container_of(head, struct hd_struct, rcu_head);
-
-	part->start_sect = 0;
-	part->nr_sects = 0;
-	part_stat_set_all(part, 0);
-	put_device(part_to_dev(part));
-}
-
-void __delete_partition(struct hd_struct *part)
-{
-	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
-}
-
-void delete_partition(struct gendisk *disk, int partno)
-{
-	struct disk_part_tbl *ptbl = disk->part_tbl;
-	struct hd_struct *part;
-
-	if (partno >= ptbl->len)
-		return;
-
-	part = ptbl->part[partno];
-	if (!part)
-		return;
-
-	blk_free_devt(part_devt(part));
-	rcu_assign_pointer(ptbl->part[partno], NULL);
-	rcu_assign_pointer(ptbl->last_lookup, NULL);
-	kobject_put(part->holder_dir);
-	device_del(part_to_dev(part));
-
-	hd_struct_put(part);
-}
-
-static ssize_t whole_disk_show(struct device *dev,
-			       struct device_attribute *attr, char *buf)
-{
-	return 0;
-}
-static DEVICE_ATTR(whole_disk, S_IRUSR | S_IRGRP | S_IROTH,
-		   whole_disk_show, NULL);
-
-struct hd_struct *add_partition(struct gendisk *disk, int partno,
-				sector_t start, sector_t len, int flags,
-				struct partition_meta_info *info)
-{
-	struct hd_struct *p;
-	dev_t devt = MKDEV(0, 0);
-	struct device *ddev = disk_to_dev(disk);
-	struct device *pdev;
-	struct disk_part_tbl *ptbl;
-	const char *dname;
-	int err;
-
-	err = disk_expand_part_tbl(disk, partno);
-	if (err)
-		return ERR_PTR(err);
-	ptbl = disk->part_tbl;
-
-	if (ptbl->part[partno])
-		return ERR_PTR(-EBUSY);
-
-	p = kzalloc(sizeof(*p), GFP_KERNEL);
-	if (!p)
-		return ERR_PTR(-EBUSY);
-
-	if (!init_part_stats(p)) {
-		err = -ENOMEM;
-		goto out_free;
-	}
-	pdev = part_to_dev(p);
-
-	p->start_sect = start;
-	p->alignment_offset =
-		queue_limit_alignment_offset(&disk->queue->limits, start);
-	p->discard_alignment =
-		queue_limit_discard_alignment(&disk->queue->limits, start);
-	p->nr_sects = len;
-	p->partno = partno;
-	p->policy = get_disk_ro(disk);
-
-	if (info) {
-		struct partition_meta_info *pinfo = alloc_part_info(disk);
-		if (!pinfo)
-			goto out_free_stats;
-		memcpy(pinfo, info, sizeof(*info));
-		p->info = pinfo;
-	}
-
-	dname = dev_name(ddev);
-	if (isdigit(dname[strlen(dname) - 1]))
-		dev_set_name(pdev, "%sp%d", dname, partno);
-	else
-		dev_set_name(pdev, "%s%d", dname, partno);
-
-	device_initialize(pdev);
-	pdev->class = &block_class;
-	pdev->type = &part_type;
-	pdev->parent = ddev;
-
-	err = blk_alloc_devt(p, &devt);
-	if (err)
-		goto out_free_info;
-	pdev->devt = devt;
-
-	/* delay uevent until 'holders' subdir is created */
-	dev_set_uevent_suppress(pdev, 1);
-	err = device_add(pdev);
-	if (err)
-		goto out_put;
-
-	err = -ENOMEM;
-	p->holder_dir = kobject_create_and_add("holders", &pdev->kobj);
-	if (!p->holder_dir)
-		goto out_del;
-
-	dev_set_uevent_suppress(pdev, 0);
-	if (flags & ADDPART_FLAG_WHOLEDISK) {
-		err = device_create_file(pdev, &dev_attr_whole_disk);
-		if (err)
-			goto out_del;
-	}
-
-	/* everything is up and running, commence */
-	rcu_assign_pointer(ptbl->part[partno], p);
-
-	/* suppress uevent if the disk suppresses it */
-	if (!dev_get_uevent_suppress(ddev))
-		kobject_uevent(&pdev->kobj, KOBJ_ADD);
-
-	hd_ref_init(p);
-	return p;
-
-out_free_info:
-	free_part_info(p);
-out_free_stats:
-	free_part_stats(p);
-out_free:
-	kfree(p);
-	return ERR_PTR(err);
-out_del:
-	kobject_put(p->holder_dir);
-	device_del(pdev);
-out_put:
-	put_device(pdev);
-	blk_free_devt(devt);
-	return ERR_PTR(err);
-}
-
-static bool disk_unlock_native_capacity(struct gendisk *disk)
-{
-	const struct block_device_operations *bdops = disk->fops;
-
-	if (bdops->unlock_native_capacity &&
-	    !(disk->flags & GENHD_FL_NATIVE_CAPACITY)) {
-		printk(KERN_CONT "enabling native capacity\n");
-		bdops->unlock_native_capacity(disk);
-		disk->flags |= GENHD_FL_NATIVE_CAPACITY;
-		return true;
-	} else {
-		printk(KERN_CONT "truncated\n");
-		return false;
-	}
-}
-
-int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
-{
-	struct parsed_partitions *state = NULL;
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-	int p, highest, res;
-rescan:
-	if (state && !IS_ERR(state)) {
-		kfree(state);
-		state = NULL;
-	}
-
-	if (bdev->bd_part_count)
-		return -EBUSY;
-	res = invalidate_partition(disk, 0);
-	if (res)
-		return res;
-
-	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY);
-	while ((part = disk_part_iter_next(&piter)))
-		delete_partition(disk, part->partno);
-	disk_part_iter_exit(&piter);
-
-	if (disk->fops->revalidate_disk)
-		disk->fops->revalidate_disk(disk);
-	check_disk_size_change(disk, bdev);
-	bdev->bd_invalidated = 0;
-	if (!get_capacity(disk) || !(state = check_partition(disk, bdev)))
-		return 0;
-	if (IS_ERR(state)) {
-		/*
-		 * I/O error reading the partition table.  If any
-		 * partition code tried to read beyond EOD, retry
-		 * after unlocking native capacity.
-		 */
-		if (PTR_ERR(state) == -ENOSPC) {
-			printk(KERN_WARNING "%s: partition table beyond EOD, ",
-			       disk->disk_name);
-			if (disk_unlock_native_capacity(disk))
-				goto rescan;
-		}
-		return -EIO;
-	}
-	/*
-	 * If any partition code tried to read beyond EOD, try
-	 * unlocking native capacity even if partition table is
-	 * successfully read as we could be missing some partitions.
-	 */
-	if (state->access_beyond_eod) {
-		printk(KERN_WARNING
-		       "%s: partition table partially beyond EOD, ",
-		       disk->disk_name);
-		if (disk_unlock_native_capacity(disk))
-			goto rescan;
-	}
-
-	/* tell userspace that the media / partition table may have changed */
-	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
-
-	/* Detect the highest partition number and preallocate
-	 * disk->part_tbl.  This is an optimization and not strictly
-	 * necessary.
-	 */
-	for (p = 1, highest = 0; p < state->limit; p++)
-		if (state->parts[p].size)
-			highest = p;
-
-	disk_expand_part_tbl(disk, highest);
-
-	/* add partitions */
-	for (p = 1; p < state->limit; p++) {
-		sector_t size, from;
-		struct partition_meta_info *info = NULL;
-
-		size = state->parts[p].size;
-		if (!size)
-			continue;
-
-		from = state->parts[p].from;
-		if (from >= get_capacity(disk)) {
-			printk(KERN_WARNING
-			       "%s: p%d start %llu is beyond EOD, ",
-			       disk->disk_name, p, (unsigned long long) from);
-			if (disk_unlock_native_capacity(disk))
-				goto rescan;
-			continue;
-		}
-
-		if (from + size > get_capacity(disk)) {
-			printk(KERN_WARNING
-			       "%s: p%d size %llu extends beyond EOD, ",
-			       disk->disk_name, p, (unsigned long long) size);
-
-			if (disk_unlock_native_capacity(disk)) {
-				/* free state and restart */
-				goto rescan;
-			} else {
-				/*
-				 * we can not ignore partitions of broken tables
-				 * created by for example camera firmware, but
-				 * we limit them to the end of the disk to avoid
-				 * creating invalid block devices
-				 */
-				size = get_capacity(disk) - from;
-			}
-		}
-
-		if (state->parts[p].has_info)
-			info = &state->parts[p].info;
-		part = add_partition(disk, p, from, size,
-				     state->parts[p].flags,
-				     &state->parts[p].info);
-		if (IS_ERR(part)) {
-			printk(KERN_ERR " %s: p%d could not be added: %ld\n",
-			       disk->disk_name, p, -PTR_ERR(part));
-			continue;
-		}
-#ifdef CONFIG_BLK_DEV_MD
-		if (state->parts[p].flags & ADDPART_FLAG_RAID)
-			md_autodetect_dev(part_to_dev(part)->devt);
-#endif
-	}
-	kfree(state);
-	return 0;
-}
-
-unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
-{
-	struct address_space *mapping = bdev->bd_inode->i_mapping;
-	struct page *page;
-
-	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
-				 NULL);
-	if (!IS_ERR(page)) {
-		if (PageError(page))
-			goto fail;
-		p->v = page;
-		return (unsigned char *)page_address(page) +  ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
-fail:
-		page_cache_release(page);
-	}
-	p->v = NULL;
-	return NULL;
-}
-
-EXPORT_SYMBOL(read_dev_sector);
diff --git a/block/partitions/check.h b/block/partitions/check.h
index d68bf4dc3bc..52b100311ec 100644
--- a/block/partitions/check.h
+++ b/block/partitions/check.h
@@ -22,6 +22,9 @@ struct parsed_partitions {
 	char *pp_buf;
 };
 
+struct parsed_partitions *
+check_partition(struct gendisk *, struct block_device *);
+
 static inline void *read_part_sector(struct parsed_partitions *state,
 				     sector_t n, Sector *p)
 {
-- 
cgit v1.2.3-70-g09d2


From ff01bb4832651c6d25ac509a06a10fcbd75c461c Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Fri, 16 Sep 2011 02:31:11 -0400
Subject: fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/powerpc/sysdev/axonram.c   |  1 -
 block/genhd.c                   |  1 -
 block/ioctl.c                   |  2 +-
 drivers/block/amiflop.c         |  2 +-
 drivers/block/brd.c             |  9 ++++----
 drivers/block/floppy.c          |  1 -
 drivers/block/loop.c            |  1 -
 drivers/cdrom/cdrom.c           |  1 -
 drivers/md/dm.c                 |  1 -
 drivers/md/md.c                 |  3 +--
 drivers/mtd/devices/block2mtd.c |  1 -
 drivers/s390/block/dasd.c       |  1 -
 drivers/scsi/scsicam.c          |  1 -
 drivers/tty/sysrq.c             |  2 +-
 fs/block_dev.c                  | 30 ++++++++++++++++++++++---
 fs/buffer.c                     | 50 -----------------------------------------
 fs/cachefiles/interface.c       |  1 -
 fs/cramfs/inode.c               |  1 -
 fs/fs-writeback.c               |  1 -
 fs/libfs.c                      |  2 +-
 fs/quota/dquot.c                |  1 -
 fs/quota/quota.c                |  1 -
 fs/splice.c                     |  1 -
 fs/sync.c                       |  1 -
 include/linux/fs.h              |  3 +++
 kernel/power/swap.c             |  1 -
 mm/page-writeback.c             |  2 +-
 mm/swap_state.c                 |  1 -
 28 files changed, 40 insertions(+), 83 deletions(-)

(limited to 'block')

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ba427191906..1c16141c031 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -25,7 +25,6 @@
 
 #include <linux/bio.h>
 #include <linux/blkdev.h>
-#include <linux/buffer_head.h>
 #include <linux/device.h>
 #include <linux/errno.h>
 #include <linux/fs.h>
diff --git a/block/genhd.c b/block/genhd.c
index bf443a71b93..b7d1a0e4268 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -15,7 +15,6 @@
 #include <linux/slab.h>
 #include <linux/kmod.h>
 #include <linux/kobj_map.h>
-#include <linux/buffer_head.h>
 #include <linux/mutex.h>
 #include <linux/idr.h>
 #include <linux/log2.h>
diff --git a/block/ioctl.c b/block/ioctl.c
index ca939fc1030..91e7b19c86f 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -5,7 +5,7 @@
 #include <linux/blkpg.h>
 #include <linux/hdreg.h>
 #include <linux/backing-dev.h>
-#include <linux/buffer_head.h>
+#include <linux/fs.h>
 #include <linux/blktrace_api.h>
 #include <asm/uaccess.h>
 
diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index 8eba86bba59..386146d792d 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -63,7 +63,7 @@
 #include <linux/mutex.h>
 #include <linux/amifdreg.h>
 #include <linux/amifd.h>
-#include <linux/buffer_head.h>
+#include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/interrupt.h>
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index d22119d49e5..ec246437f5a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -17,7 +17,7 @@
 #include <linux/highmem.h>
 #include <linux/mutex.h>
 #include <linux/radix-tree.h>
-#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
+#include <linux/fs.h>
 #include <linux/slab.h>
 
 #include <asm/uaccess.h>
@@ -402,14 +402,13 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
 	error = -EBUSY;
 	if (bdev->bd_openers <= 1) {
 		/*
-		 * Invalidate the cache first, so it isn't written
-		 * back to the device.
+		 * Kill the cache first, so it isn't written back to the
+		 * device.
 		 *
 		 * Another thread might instantiate more buffercache here,
 		 * but there is not much we can do to close that race.
 		 */
-		invalidate_bh_lrus();
-		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+		kill_bdev(bdev);
 		brd_free_pages(brd);
 		error = 0;
 	}
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 9955a53733b..510fb10ec45 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -188,7 +188,6 @@ static int print_unex = 1;
 #include <linux/init.h>
 #include <linux/platform_device.h>
 #include <linux/mod_devicetable.h>
-#include <linux/buffer_head.h>	/* for invalidate_buffers() */
 #include <linux/mutex.h>
 #include <linux/io.h>
 #include <linux/uaccess.h>
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1e888c9e85b..f00257782fc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -69,7 +69,6 @@
 #include <linux/freezer.h>
 #include <linux/mutex.h>
 #include <linux/writeback.h>
-#include <linux/buffer_head.h>		/* for invalidate_bdev() */
 #include <linux/completion.h>
 #include <linux/highmem.h>
 #include <linux/kthread.h>
diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index f997c27d79e..2118211aff9 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -267,7 +267,6 @@
 
 #include <linux/module.h>
 #include <linux/fs.h>
-#include <linux/buffer_head.h>
 #include <linux/major.h>
 #include <linux/types.h>
 #include <linux/errno.h>
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4720f68f817..b89c548ec3f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -14,7 +14,6 @@
 #include <linux/moduleparam.h>
 #include <linux/blkpg.h>
 #include <linux/bio.h>
-#include <linux/buffer_head.h>
 #include <linux/mempool.h>
 #include <linux/slab.h>
 #include <linux/idr.h>
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f47f1f8ac44..5d1b6762f10 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -36,8 +36,7 @@
 #include <linux/blkdev.h>
 #include <linux/sysctl.h>
 #include <linux/seq_file.h>
-#include <linux/mutex.h>
-#include <linux/buffer_head.h> /* for invalidate_bdev */
+#include <linux/fs.h>
 #include <linux/poll.h>
 #include <linux/ctype.h>
 #include <linux/string.h>
diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index b78f23169d4..ebeabc727f7 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -14,7 +14,6 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/mtd/mtd.h>
-#include <linux/buffer_head.h>
 #include <linux/mutex.h>
 #include <linux/mount.h>
 #include <linux/slab.h>
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index 65894f05a80..2de5b60ee8c 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -17,7 +17,6 @@
 #include <linux/ctype.h>
 #include <linux/major.h>
 #include <linux/slab.h>
-#include <linux/buffer_head.h>
 #include <linux/hdreg.h>
 #include <linux/async.h>
 #include <linux/mutex.h>
diff --git a/drivers/scsi/scsicam.c b/drivers/scsi/scsicam.c
index 6803b1e26ec..92d24d6dcb3 100644
--- a/drivers/scsi/scsicam.c
+++ b/drivers/scsi/scsicam.c
@@ -16,7 +16,6 @@
 #include <linux/genhd.h>
 #include <linux/kernel.h>
 #include <linux/blkdev.h>
-#include <linux/buffer_head.h>
 #include <asm/unaligned.h>
 
 #include <scsi/scsicam.h>
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 43db715f150..7867b7c4538 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -32,7 +32,6 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/writeback.h>
-#include <linux/buffer_head.h>		/* for fsync_bdev() */
 #include <linux/swap.h>
 #include <linux/spinlock.h>
 #include <linux/vt_kern.h>
@@ -41,6 +40,7 @@
 #include <linux/oom.h>
 #include <linux/slab.h>
 #include <linux/input.h>
+#include <linux/uaccess.h>
 
 #include <asm/ptrace.h>
 #include <asm/irq_regs.h>
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7866cdd9fe7..69a5b6fbee2 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -17,6 +17,7 @@
 #include <linux/module.h>
 #include <linux/blkpg.h>
 #include <linux/buffer_head.h>
+#include <linux/swap.h>
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
 #include <linux/mpage.h>
@@ -25,6 +26,7 @@
 #include <linux/namei.h>
 #include <linux/log2.h>
 #include <linux/kmemleak.h>
+#include <linux/cleancache.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -82,13 +84,35 @@ static sector_t max_block(struct block_device *bdev)
 }
 
 /* Kill _all_ buffers and pagecache , dirty or not.. */
-static void kill_bdev(struct block_device *bdev)
+void kill_bdev(struct block_device *bdev)
 {
-	if (bdev->bd_inode->i_mapping->nrpages == 0)
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
+
+	if (mapping->nrpages == 0)
 		return;
+
 	invalidate_bh_lrus();
-	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+	truncate_inode_pages(mapping, 0);
 }	
+EXPORT_SYMBOL(kill_bdev);
+
+/* Invalidate clean unused buffers and pagecache. */
+void invalidate_bdev(struct block_device *bdev)
+{
+	struct address_space *mapping = bdev->bd_inode->i_mapping;
+
+	if (mapping->nrpages == 0)
+		return;
+
+	invalidate_bh_lrus();
+	lru_add_drain_all();	/* make sure all lru add caches are flushed */
+	invalidate_mapping_pages(mapping, 0, -1);
+	/* 99% of the time, we don't need to flush the cleancache on the bdev.
+	 * But, for the strange corners, lets be cautious
+	 */
+	cleancache_flush_inode(mapping);
+}
+EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7fdc8..1a30db77af3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,7 +41,6 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
-#include <linux/cleancache.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -231,55 +230,6 @@ out:
 	return ret;
 }
 
-/* If invalidate_buffers() will trash dirty buffers, it means some kind
-   of fs corruption is going on. Trashing dirty data always imply losing
-   information that was supposed to be just stored on the physical layer
-   by the user.
-
-   Thus invalidate_buffers in general usage is not allwowed to trash
-   dirty buffers. For example ioctl(FLSBLKBUF) expects dirty data to
-   be preserved.  These buffers are simply skipped.
-  
-   We also skip buffers which are still in use.  For example this can
-   happen if a userspace program is reading the block device.
-
-   NOTE: In the case where the user removed a removable-media-disk even if
-   there's still dirty data not synced on disk (due a bug in the device driver
-   or due an error of the user), by not destroying the dirty buffers we could
-   generate corruption also on the next media inserted, thus a parameter is
-   necessary to handle this case in the most safe way possible (trying
-   to not corrupt also the new disk inserted with the data belonging to
-   the old now corrupted disk). Also for the ramdisk the natural thing
-   to do in order to release the ramdisk memory is to destroy dirty buffers.
-
-   These are two special cases. Normal usage imply the device driver
-   to issue a sync on the device (without waiting I/O completion) and
-   then an invalidate_buffers call that doesn't trash dirty buffers.
-
-   For handling cache coherency with the blkdev pagecache the 'update' case
-   is been introduced. It is needed to re-read from disk any pinned
-   buffer. NOTE: re-reading from disk is destructive so we can do it only
-   when we assume nobody is changing the buffercache under our I/O and when
-   we think the disk contains more recent information than the buffercache.
-   The update == 1 pass marks the buffers we need to update, the update == 2
-   pass does the actual I/O. */
-void invalidate_bdev(struct block_device *bdev)
-{
-	struct address_space *mapping = bdev->bd_inode->i_mapping;
-
-	if (mapping->nrpages == 0)
-		return;
-
-	invalidate_bh_lrus();
-	lru_add_drain_all();	/* make sure all lru add caches are flushed */
-	invalidate_mapping_pages(mapping, 0, -1);
-	/* 99% of the time, we don't need to flush the cleancache on the bdev.
-	 * But, for the strange corners, lets be cautious
-	 */
-	cleancache_flush_inode(mapping);
-}
-EXPORT_SYMBOL(invalidate_bdev);
-
 /*
  * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
  */
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 1064805e653..67bef6d0148 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -11,7 +11,6 @@
 
 #include <linux/slab.h>
 #include <linux/mount.h>
-#include <linux/buffer_head.h>
 #include "internal.h"
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 739fb59bcdc..c37adb22211 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -20,7 +20,6 @@
 #include <linux/cramfs_fs.h>
 #include <linux/slab.h>
 #include <linux/cramfs_fs_sb.h>
-#include <linux/buffer_head.h>
 #include <linux/vfs.h>
 #include <linux/mutex.h>
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 517f211a3bd..80a4574028f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -25,7 +25,6 @@
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
-#include <linux/buffer_head.h>
 #include <linux/tracepoint.h>
 #include "internal.h"
 
diff --git a/fs/libfs.c b/fs/libfs.c
index f6d411eef1e..5b2dbb3ba4f 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -12,7 +12,7 @@
 #include <linux/mutex.h>
 #include <linux/exportfs.h>
 #include <linux/writeback.h>
-#include <linux/buffer_head.h>
+#include <linux/buffer_head.h> /* sync_mapping_buffers */
 
 #include <asm/uaccess.h>
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 5b572c89e6c..5d81e92daf8 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -73,7 +73,6 @@
 #include <linux/security.h>
 #include <linux/kmod.h>
 #include <linux/namei.h>
-#include <linux/buffer_head.h>
 #include <linux/capability.h>
 #include <linux/quotaops.h>
 #include "../internal.h" /* ugh */
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 35f4b0ecdeb..7898cd688a0 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -13,7 +13,6 @@
 #include <linux/kernel.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
-#include <linux/buffer_head.h>
 #include <linux/capability.h>
 #include <linux/quotaops.h>
 #include <linux/types.h>
diff --git a/fs/splice.c b/fs/splice.c
index fa2defa8afc..1ec0493266b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -25,7 +25,6 @@
 #include <linux/mm_inline.h>
 #include <linux/swap.h>
 #include <linux/writeback.h>
-#include <linux/buffer_head.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
 #include <linux/uio.h>
diff --git a/fs/sync.c b/fs/sync.c
index 101b8ef901d..f3501ef3923 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -14,7 +14,6 @@
 #include <linux/linkage.h>
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>
-#include <linux/buffer_head.h>
 #include <linux/backing-dev.h>
 #include "internal.h"
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cec429d76ab..e853ba5eddd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2092,6 +2092,7 @@ extern void bd_forget(struct inode *inode);
 extern void bdput(struct block_device *);
 extern void invalidate_bdev(struct block_device *);
 extern int sync_blockdev(struct block_device *bdev);
+extern void kill_bdev(struct block_device *);
 extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
@@ -2099,6 +2100,7 @@ extern int fsync_bdev(struct block_device *);
 #else
 static inline void bd_forget(struct inode *inode) {}
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
+static inline void kill_bdev(struct block_device *bdev) {}
 static inline void invalidate_bdev(struct block_device *bdev) {}
 
 static inline struct super_block *freeze_bdev(struct block_device *sb)
@@ -2415,6 +2417,7 @@ extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
 			int datasync);
+extern void block_sync_page(struct page *page);
 
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 11a594c4ba2..3739ecced08 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -18,7 +18,6 @@
 #include <linux/bitops.h>
 #include <linux/genhd.h>
 #include <linux/device.h>
-#include <linux/buffer_head.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/swap.h>
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 50f08241f98..8616ef3025a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -32,7 +32,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
-#include <linux/buffer_head.h>
+#include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <trace/events/writeback.h>
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 78cc4d1f6cc..ea6b32d6187 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -13,7 +13,6 @@
 #include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/pagemap.h>
-#include <linux/buffer_head.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
-- 
cgit v1.2.3-70-g09d2


From 2c9ede55ecec58099b72e4bb8eab719f32f72c31 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Sat, 23 Jul 2011 20:24:48 -0400
Subject: switch device_get_devnode() and ->devnode() to umode_t *

both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/x86/kernel/cpuid.c                    | 2 +-
 arch/x86/kernel/msr.c                      | 2 +-
 block/bsg.c                                | 2 +-
 block/genhd.c                              | 2 +-
 drivers/base/core.c                        | 4 ++--
 drivers/base/devtmpfs.c                    | 2 +-
 drivers/block/aoe/aoechr.c                 | 2 +-
 drivers/block/pktcdvd.c                    | 2 +-
 drivers/char/mem.c                         | 4 ++--
 drivers/char/misc.c                        | 2 +-
 drivers/char/raw.c                         | 2 +-
 drivers/char/tile-srom.c                   | 2 +-
 drivers/gpu/drm/drm_sysfs.c                | 2 +-
 drivers/hid/usbhid/hiddev.c                | 2 +-
 drivers/infiniband/core/cm.c               | 2 +-
 drivers/infiniband/core/user_mad.c         | 2 +-
 drivers/infiniband/core/uverbs_main.c      | 2 +-
 drivers/input/input.c                      | 2 +-
 drivers/media/dvb/ddbridge/ddbridge-core.c | 2 +-
 drivers/media/dvb/dvb-core/dvbdev.c        | 2 +-
 drivers/media/rc/rc-main.c                 | 2 +-
 drivers/tty/tty_io.c                       | 2 +-
 drivers/usb/class/usblp.c                  | 2 +-
 drivers/usb/core/file.c                    | 2 +-
 drivers/usb/core/usb.c                     | 2 +-
 drivers/usb/misc/iowarrior.c               | 2 +-
 drivers/usb/misc/legousbtower.c            | 2 +-
 include/linux/device.h                     | 6 +++---
 include/linux/genhd.h                      | 2 +-
 include/linux/usb.h                        | 2 +-
 sound/sound_core.c                         | 2 +-
 31 files changed, 35 insertions(+), 35 deletions(-)

(limited to 'block')

diff --git a/arch/x86/kernel/cpuid.c b/arch/x86/kernel/cpuid.c
index 212a6a42527..a524353d93f 100644
--- a/arch/x86/kernel/cpuid.c
+++ b/arch/x86/kernel/cpuid.c
@@ -177,7 +177,7 @@ static struct notifier_block __refdata cpuid_class_cpu_notifier =
 	.notifier_call = cpuid_class_cpu_callback,
 };
 
-static char *cpuid_devnode(struct device *dev, mode_t *mode)
+static char *cpuid_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "cpu/%u/cpuid", MINOR(dev->devt));
 }
diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index 12fcbe2c143..96356762a51 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -236,7 +236,7 @@ static struct notifier_block __refdata msr_class_cpu_notifier = {
 	.notifier_call = msr_class_cpu_callback,
 };
 
-static char *msr_devnode(struct device *dev, mode_t *mode)
+static char *msr_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "cpu/%u/msr", MINOR(dev->devt));
 }
diff --git a/block/bsg.c b/block/bsg.c
index 702f1316bb8..9651ec7b87c 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -1070,7 +1070,7 @@ EXPORT_SYMBOL_GPL(bsg_register_queue);
 
 static struct cdev bsg_cdev;
 
-static char *bsg_devnode(struct device *dev, mode_t *mode)
+static char *bsg_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "bsg/%s", dev_name(dev));
 }
diff --git a/block/genhd.c b/block/genhd.c
index 02e9fca8082..80578f3176e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1109,7 +1109,7 @@ struct class block_class = {
 	.name		= "block",
 };
 
-static char *block_devnode(struct device *dev, mode_t *mode)
+static char *block_devnode(struct device *dev, umode_t *mode)
 {
 	struct gendisk *disk = dev_to_disk(dev);
 
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 919daa7cd5b..1dfa1d616fa 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -198,7 +198,7 @@ static int dev_uevent(struct kset *kset, struct kobject *kobj,
 	if (MAJOR(dev->devt)) {
 		const char *tmp;
 		const char *name;
-		mode_t mode = 0;
+		umode_t mode = 0;
 
 		add_uevent_var(env, "MAJOR=%u", MAJOR(dev->devt));
 		add_uevent_var(env, "MINOR=%u", MINOR(dev->devt));
@@ -1182,7 +1182,7 @@ static struct device *next_device(struct klist_iter *i)
  * freed by the caller.
  */
 const char *device_get_devnode(struct device *dev,
-			       mode_t *mode, const char **tmp)
+			       umode_t *mode, const char **tmp)
 {
 	char *s;
 
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index a4760e095ff..3990f682e69 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -40,7 +40,7 @@ static struct req {
 	struct completion done;
 	int err;
 	const char *name;
-	mode_t mode;	/* 0 => delete */
+	umode_t mode;	/* 0 => delete */
 	struct device *dev;
 } *requests;
 
diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c
index 5f8e39c43ae..e86d2062a16 100644
--- a/drivers/block/aoe/aoechr.c
+++ b/drivers/block/aoe/aoechr.c
@@ -270,7 +270,7 @@ static const struct file_operations aoe_fops = {
 	.llseek = noop_llseek,
 };
 
-static char *aoe_devnode(struct device *dev, mode_t *mode)
+static char *aoe_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "etherd/%s", dev_name(dev));
 }
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index a63b0a2b780..d59edeabd93 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2817,7 +2817,7 @@ static const struct block_device_operations pktcdvd_ops = {
 	.check_events =		pkt_check_events,
 };
 
-static char *pktcdvd_devnode(struct gendisk *gd, mode_t *mode)
+static char *pktcdvd_devnode(struct gendisk *gd, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "pktcdvd/%s", gd->disk_name);
 }
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 14517903371..d6e9d081c8b 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -847,7 +847,7 @@ static const struct file_operations kmsg_fops = {
 
 static const struct memdev {
 	const char *name;
-	mode_t mode;
+	umode_t mode;
 	const struct file_operations *fops;
 	struct backing_dev_info *dev_info;
 } devlist[] = {
@@ -901,7 +901,7 @@ static const struct file_operations memory_fops = {
 	.llseek = noop_llseek,
 };
 
-static char *mem_devnode(struct device *dev, mode_t *mode)
+static char *mem_devnode(struct device *dev, umode_t *mode)
 {
 	if (mode && devlist[MINOR(dev->devt)].mode)
 		*mode = devlist[MINOR(dev->devt)].mode;
diff --git a/drivers/char/misc.c b/drivers/char/misc.c
index 778273c9324..522136d4084 100644
--- a/drivers/char/misc.c
+++ b/drivers/char/misc.c
@@ -258,7 +258,7 @@ int misc_deregister(struct miscdevice *misc)
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
-static char *misc_devnode(struct device *dev, mode_t *mode)
+static char *misc_devnode(struct device *dev, umode_t *mode)
 {
 	struct miscdevice *c = dev_get_drvdata(dev);
 
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index b6de2c04714..54a3a6d0981 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -308,7 +308,7 @@ static const struct file_operations raw_ctl_fops = {
 
 static struct cdev raw_cdev;
 
-static char *raw_devnode(struct device *dev, mode_t *mode)
+static char *raw_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "raw/%s", dev_name(dev));
 }
diff --git a/drivers/char/tile-srom.c b/drivers/char/tile-srom.c
index cf3ee008dca..4dc019408fa 100644
--- a/drivers/char/tile-srom.c
+++ b/drivers/char/tile-srom.c
@@ -329,7 +329,7 @@ static struct device_attribute srom_dev_attrs[] = {
 	__ATTR_NULL
 };
 
-static char *srom_devnode(struct device *dev, mode_t *mode)
+static char *srom_devnode(struct device *dev, umode_t *mode)
 {
 	*mode = S_IRUGO | S_IWUSR;
 	return kasprintf(GFP_KERNEL, "srom/%s", dev_name(dev));
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 0f9ef9bf673..62c3675045a 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -72,7 +72,7 @@ static int drm_class_resume(struct device *dev)
 	return 0;
 }
 
-static char *drm_devnode(struct device *dev, mode_t *mode)
+static char *drm_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "dri/%s", dev_name(dev));
 }
diff --git a/drivers/hid/usbhid/hiddev.c b/drivers/hid/usbhid/hiddev.c
index 4ef02b269a7..7c297d305d5 100644
--- a/drivers/hid/usbhid/hiddev.c
+++ b/drivers/hid/usbhid/hiddev.c
@@ -859,7 +859,7 @@ static const struct file_operations hiddev_fops = {
 	.llseek		= noop_llseek,
 };
 
-static char *hiddev_devnode(struct device *dev, mode_t *mode)
+static char *hiddev_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "usb/%s", dev_name(dev));
 }
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 8b72f39202f..c889aaef341 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3659,7 +3659,7 @@ static struct kobj_type cm_port_obj_type = {
 	.release = cm_release_port_obj
 };
 
-static char *cm_devnode(struct device *dev, mode_t *mode)
+static char *cm_devnode(struct device *dev, umode_t *mode)
 {
 	if (mode)
 		*mode = 0666;
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 07db22997e9..f0d588f8859 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1175,7 +1175,7 @@ static void ib_umad_remove_one(struct ib_device *device)
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }
 
-static char *umad_devnode(struct device *dev, mode_t *mode)
+static char *umad_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "infiniband/%s", dev_name(dev));
 }
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 87963674637..604556d73d2 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -846,7 +846,7 @@ static void ib_uverbs_remove_one(struct ib_device *device)
 	kfree(uverbs_dev);
 }
 
-static char *uverbs_devnode(struct device *dev, mode_t *mode)
+static char *uverbs_devnode(struct device *dev, umode_t *mode)
 {
 	if (mode)
 		*mode = 0666;
diff --git a/drivers/input/input.c b/drivers/input/input.c
index da38d97a51b..1f78c957a75 100644
--- a/drivers/input/input.c
+++ b/drivers/input/input.c
@@ -1624,7 +1624,7 @@ static struct device_type input_dev_type = {
 #endif
 };
 
-static char *input_devnode(struct device *dev, mode_t *mode)
+static char *input_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "input/%s", dev_name(dev));
 }
diff --git a/drivers/media/dvb/ddbridge/ddbridge-core.c b/drivers/media/dvb/ddbridge/ddbridge-core.c
index ba9a643b9c6..d1e91bc80e7 100644
--- a/drivers/media/dvb/ddbridge/ddbridge-core.c
+++ b/drivers/media/dvb/ddbridge/ddbridge-core.c
@@ -1480,7 +1480,7 @@ static const struct file_operations ddb_fops = {
 	.open           = ddb_open,
 };
 
-static char *ddb_devnode(struct device *device, mode_t *mode)
+static char *ddb_devnode(struct device *device, umode_t *mode)
 {
 	struct ddb *dev = dev_get_drvdata(device);
 
diff --git a/drivers/media/dvb/dvb-core/dvbdev.c b/drivers/media/dvb/dvb-core/dvbdev.c
index f7328777595..00a67326c19 100644
--- a/drivers/media/dvb/dvb-core/dvbdev.c
+++ b/drivers/media/dvb/dvb-core/dvbdev.c
@@ -450,7 +450,7 @@ static int dvb_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
-static char *dvb_devnode(struct device *dev, mode_t *mode)
+static char *dvb_devnode(struct device *dev, umode_t *mode)
 {
 	struct dvb_device *dvbdev = dev_get_drvdata(dev);
 
diff --git a/drivers/media/rc/rc-main.c b/drivers/media/rc/rc-main.c
index 29f900065d8..f5db8b949bc 100644
--- a/drivers/media/rc/rc-main.c
+++ b/drivers/media/rc/rc-main.c
@@ -715,7 +715,7 @@ static void ir_close(struct input_dev *idev)
 }
 
 /* class for /sys/class/rc */
-static char *ir_devnode(struct device *dev, mode_t *mode)
+static char *ir_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "rc/%s", dev_name(dev));
 }
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 05085beb83d..3fdebd306b9 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3267,7 +3267,7 @@ void __init console_init(void)
 	}
 }
 
-static char *tty_devnode(struct device *dev, mode_t *mode)
+static char *tty_devnode(struct device *dev, umode_t *mode)
 {
 	if (!mode)
 		return NULL;
diff --git a/drivers/usb/class/usblp.c b/drivers/usb/class/usblp.c
index cb3a93243a0..bc5089f76ce 100644
--- a/drivers/usb/class/usblp.c
+++ b/drivers/usb/class/usblp.c
@@ -1045,7 +1045,7 @@ static const struct file_operations usblp_fops = {
 	.llseek =	noop_llseek,
 };
 
-static char *usblp_devnode(struct device *dev, mode_t *mode)
+static char *usblp_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "usb/%s", dev_name(dev));
 }
diff --git a/drivers/usb/core/file.c b/drivers/usb/core/file.c
index 99458c843d6..d95760de9e8 100644
--- a/drivers/usb/core/file.c
+++ b/drivers/usb/core/file.c
@@ -66,7 +66,7 @@ static struct usb_class {
 	struct class *class;
 } *usb_class;
 
-static char *usb_devnode(struct device *dev, mode_t *mode)
+static char *usb_devnode(struct device *dev, umode_t *mode)
 {
 	struct usb_class_driver *drv;
 
diff --git a/drivers/usb/core/usb.c b/drivers/usb/core/usb.c
index 73cd90012ec..1382c90d083 100644
--- a/drivers/usb/core/usb.c
+++ b/drivers/usb/core/usb.c
@@ -326,7 +326,7 @@ static const struct dev_pm_ops usb_device_pm_ops = {
 #endif	/* CONFIG_PM */
 
 
-static char *usb_devnode(struct device *dev, mode_t *mode)
+static char *usb_devnode(struct device *dev, umode_t *mode)
 {
 	struct usb_device *usb_dev;
 
diff --git a/drivers/usb/misc/iowarrior.c b/drivers/usb/misc/iowarrior.c
index 81457904d6b..5bd4b0526de 100644
--- a/drivers/usb/misc/iowarrior.c
+++ b/drivers/usb/misc/iowarrior.c
@@ -734,7 +734,7 @@ static const struct file_operations iowarrior_fops = {
 	.llseek = noop_llseek,
 };
 
-static char *iowarrior_devnode(struct device *dev, mode_t *mode)
+static char *iowarrior_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "usb/%s", dev_name(dev));
 }
diff --git a/drivers/usb/misc/legousbtower.c b/drivers/usb/misc/legousbtower.c
index a989356f693..94f6566b99f 100644
--- a/drivers/usb/misc/legousbtower.c
+++ b/drivers/usb/misc/legousbtower.c
@@ -269,7 +269,7 @@ static const struct file_operations tower_fops = {
 	.llseek =	tower_llseek,
 };
 
-static char *legousbtower_devnode(struct device *dev, mode_t *mode)
+static char *legousbtower_devnode(struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "usb/%s", dev_name(dev));
 }
diff --git a/include/linux/device.h b/include/linux/device.h
index 3136ede5a1e..2fe0005543e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -294,7 +294,7 @@ struct class {
 	struct kobject			*dev_kobj;
 
 	int (*dev_uevent)(struct device *dev, struct kobj_uevent_env *env);
-	char *(*devnode)(struct device *dev, mode_t *mode);
+	char *(*devnode)(struct device *dev, umode_t *mode);
 
 	void (*class_release)(struct class *class);
 	void (*dev_release)(struct device *dev);
@@ -423,7 +423,7 @@ struct device_type {
 	const char *name;
 	const struct attribute_group **groups;
 	int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
-	char *(*devnode)(struct device *dev, mode_t *mode);
+	char *(*devnode)(struct device *dev, umode_t *mode);
 	void (*release)(struct device *dev);
 
 	const struct dev_pm_ops *pm;
@@ -720,7 +720,7 @@ extern int device_rename(struct device *dev, const char *new_name);
 extern int device_move(struct device *dev, struct device *new_parent,
 		       enum dpm_order dpm_order);
 extern const char *device_get_devnode(struct device *dev,
-				      mode_t *mode, const char **tmp);
+				      umode_t *mode, const char **tmp);
 extern void *dev_get_drvdata(const struct device *dev);
 extern int dev_set_drvdata(struct device *dev, void *data);
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6d18f3531f1..fe23ee76858 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -163,7 +163,7 @@ struct gendisk {
                                          * disks that can't be partitioned. */
 
 	char disk_name[DISK_NAME_LEN];	/* name of major driver */
-	char *(*devnode)(struct gendisk *gd, mode_t *mode);
+	char *(*devnode)(struct gendisk *gd, umode_t *mode);
 
 	unsigned int events;		/* supported events */
 	unsigned int async_events;	/* async events, subset of all */
diff --git a/include/linux/usb.h b/include/linux/usb.h
index d3d0c137433..a59321779f8 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -935,7 +935,7 @@ extern struct bus_type usb_bus_type;
  */
 struct usb_class_driver {
 	char *name;
-	char *(*devnode)(struct device *dev, mode_t *mode);
+	char *(*devnode)(struct device *dev, umode_t *mode);
 	const struct file_operations *fops;
 	int minor_base;
 };
diff --git a/sound/sound_core.c b/sound/sound_core.c
index 6ce277860fd..c6e81fb928e 100644
--- a/sound/sound_core.c
+++ b/sound/sound_core.c
@@ -29,7 +29,7 @@ MODULE_DESCRIPTION("Core sound module");
 MODULE_AUTHOR("Alan Cox");
 MODULE_LICENSE("GPL");
 
-static char *sound_devnode(struct device *dev, mode_t *mode)
+static char *sound_devnode(struct device *dev, umode_t *mode)
 {
 	if (MAJOR(dev->devt) == SOUND_MAJOR)
 		return NULL;
-- 
cgit v1.2.3-70-g09d2


From 07d106d0a33d6063d2061305903deb02489eba20 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 5 Jan 2012 15:40:12 -0800
Subject: vfs: fix up ENOIOCTLCMD error handling

We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.

ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument").  It should be translated as ENOTTY
("Inappropriate ioctl for device").

That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad.  We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually.  In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 block/ioctl.c     | 26 ++++++++++++++++++++++----
 fs/compat_ioctl.c | 38 ++------------------------------------
 fs/ioctl.c        |  2 +-
 net/socket.c      | 16 +---------------
 4 files changed, 26 insertions(+), 56 deletions(-)

(limited to 'block')

diff --git a/block/ioctl.c b/block/ioctl.c
index ca939fc1030..d510c2a4eff 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -179,6 +179,26 @@ int __blkdev_driver_ioctl(struct block_device *bdev, fmode_t mode,
  */
 EXPORT_SYMBOL_GPL(__blkdev_driver_ioctl);
 
+/*
+ * Is it an unrecognized ioctl? The correct returns are either
+ * ENOTTY (final) or ENOIOCTLCMD ("I don't know this one, try a
+ * fallback"). ENOIOCTLCMD gets turned into ENOTTY by the ioctl
+ * code before returning.
+ *
+ * Confused drivers sometimes return EINVAL, which is wrong. It
+ * means "I understood the ioctl command, but the parameters to
+ * it were wrong".
+ *
+ * We should aim to just fix the broken drivers, the EINVAL case
+ * should go away.
+ */
+static inline int is_unrecognized_ioctl(int ret)
+{
+	return	ret == -EINVAL ||
+		ret == -ENOTTY ||
+		ret == -ENOIOCTLCMD;
+}
+
 /*
  * always keep this in sync with compat_blkdev_ioctl()
  */
@@ -196,8 +216,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 			return -EACCES;
 
 		ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
-		/* -EINVAL to handle old uncorrected drivers */
-		if (ret != -EINVAL && ret != -ENOTTY)
+		if (!is_unrecognized_ioctl(ret))
 			return ret;
 
 		fsync_bdev(bdev);
@@ -206,8 +225,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 
 	case BLKROSET:
 		ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
-		/* -EINVAL to handle old uncorrected drivers */
-		if (ret != -EINVAL && ret != -ENOTTY)
+		if (!is_unrecognized_ioctl(ret))
 			return ret;
 		if (!capable(CAP_SYS_ADMIN))
 			return -EACCES;
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 51352de88ef..a10e428b32b 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1506,35 +1506,6 @@ static long do_ioctl_trans(int fd, unsigned int cmd,
 	return -ENOIOCTLCMD;
 }
 
-static void compat_ioctl_error(struct file *filp, unsigned int fd,
-		unsigned int cmd, unsigned long arg)
-{
-	char buf[10];
-	char *fn = "?";
-	char *path;
-
-	/* find the name of the device. */
-	path = (char *)__get_free_page(GFP_KERNEL);
-	if (path) {
-		fn = d_path(&filp->f_path, path, PAGE_SIZE);
-		if (IS_ERR(fn))
-			fn = "?";
-	}
-
-	 sprintf(buf,"'%c'", (cmd>>_IOC_TYPESHIFT) & _IOC_TYPEMASK);
-	if (!isprint(buf[1]))
-		sprintf(buf, "%02x", buf[1]);
-	compat_printk("ioctl32(%s:%d): Unknown cmd fd(%d) "
-			"cmd(%08x){t:%s;sz:%u} arg(%08x) on %s\n",
-			current->comm, current->pid,
-			(int)fd, (unsigned int)cmd, buf,
-			(cmd >> _IOC_SIZESHIFT) & _IOC_SIZEMASK,
-			(unsigned int)arg, fn);
-
-	if (path)
-		free_page((unsigned long)path);
-}
-
 static int compat_ioctl_check_table(unsigned int xcmd)
 {
 	int i;
@@ -1621,13 +1592,8 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 		goto found_handler;
 
 	error = do_ioctl_trans(fd, cmd, arg, filp);
-	if (error == -ENOIOCTLCMD) {
-		static int count;
-
-		if (++count <= 50)
-			compat_ioctl_error(filp, fd, cmd, arg);
-		error = -EINVAL;
-	}
+	if (error == -ENOIOCTLCMD)
+		error = -ENOTTY;
 
 	goto out_fput;
 
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1d9b9fcb2db..066836e8184 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -42,7 +42,7 @@ static long vfs_ioctl(struct file *filp, unsigned int cmd,
 
 	error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
 	if (error == -ENOIOCTLCMD)
-		error = -EINVAL;
+		error = -ENOTTY;
  out:
 	return error;
 }
diff --git a/net/socket.c b/net/socket.c
index 2877647f347..a0053750e37 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2883,7 +2883,7 @@ static int bond_ioctl(struct net *net, unsigned int cmd,
 
 		return dev_ioctl(net, cmd, uifr);
 	default:
-		return -EINVAL;
+		return -ENOIOCTLCMD;
 	}
 }
 
@@ -3210,20 +3210,6 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
 		return sock_do_ioctl(net, sock, cmd, arg);
 	}
 
-	/* Prevent warning from compat_sys_ioctl, these always
-	 * result in -EINVAL in the native case anyway. */
-	switch (cmd) {
-	case SIOCRTMSG:
-	case SIOCGIFCOUNT:
-	case SIOCSRARP:
-	case SIOCGRARP:
-	case SIOCDRARP:
-	case SIOCSIFLINK:
-	case SIOCGIFSLAVE:
-	case SIOCSIFSLAVE:
-		return -EINVAL;
-	}
-
 	return -ENOIOCTLCMD;
 }
 
-- 
cgit v1.2.3-70-g09d2


From b1bd055d397e09f99dcef9b138ed104ff1812fcb Mon Sep 17 00:00:00 2001
From: "Martin K. Petersen" <martin.petersen@oracle.com>
Date: Wed, 11 Jan 2012 16:27:11 +0100
Subject: block: Introduce blk_set_stacking_limits function

Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-settings.c   | 32 ++++++++++++++++++++++++--------
 drivers/md/dm-table.c  |  6 +++---
 drivers/md/md.c        |  1 +
 include/linux/blkdev.h |  1 +
 4 files changed, 29 insertions(+), 11 deletions(-)

(limited to 'block')

diff --git a/block/blk-settings.c b/block/blk-settings.c
index fa1eb0449a0..d3234fc494a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -104,9 +104,7 @@ EXPORT_SYMBOL_GPL(blk_queue_lld_busy);
  * @lim:  the queue_limits structure to reset
  *
  * Description:
- *   Returns a queue_limit struct to its default state.  Can be used by
- *   stacking drivers like DM that stage table swaps and reuse an
- *   existing device queue.
+ *   Returns a queue_limit struct to its default state.
  */
 void blk_set_default_limits(struct queue_limits *lim)
 {
@@ -114,13 +112,12 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->max_integrity_segments = 0;
 	lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
 	lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
-	lim->max_sectors = BLK_DEF_MAX_SECTORS;
-	lim->max_hw_sectors = INT_MAX;
+	lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
 	lim->max_discard_sectors = 0;
 	lim->discard_granularity = 0;
 	lim->discard_alignment = 0;
 	lim->discard_misaligned = 0;
-	lim->discard_zeroes_data = 1;
+	lim->discard_zeroes_data = 0;
 	lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
 	lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
 	lim->alignment_offset = 0;
@@ -130,6 +127,27 @@ void blk_set_default_limits(struct queue_limits *lim)
 }
 EXPORT_SYMBOL(blk_set_default_limits);
 
+/**
+ * blk_set_stacking_limits - set default limits for stacking devices
+ * @lim:  the queue_limits structure to reset
+ *
+ * Description:
+ *   Returns a queue_limit struct to its default state. Should be used
+ *   by stacking drivers like DM that have no internal limits.
+ */
+void blk_set_stacking_limits(struct queue_limits *lim)
+{
+	blk_set_default_limits(lim);
+
+	/* Inherit limits from component devices */
+	lim->discard_zeroes_data = 1;
+	lim->max_segments = USHRT_MAX;
+	lim->max_hw_sectors = UINT_MAX;
+
+	lim->max_sectors = BLK_DEF_MAX_SECTORS;
+}
+EXPORT_SYMBOL(blk_set_stacking_limits);
+
 /**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
@@ -165,8 +183,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	q->nr_batching = BLK_BATCH_REQ;
 
 	blk_set_default_limits(&q->limits);
-	blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
-	q->limits.discard_zeroes_data = 0;
 
 	/*
 	 * by default assume old behaviour and bounce for any highmem page
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 8e913213014..63cc54289af 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -699,7 +699,7 @@ static int validate_hardware_logical_block_alignment(struct dm_table *table,
 	while (i < dm_table_get_num_targets(table)) {
 		ti = dm_table_get_target(table, i++);
 
-		blk_set_default_limits(&ti_limits);
+		blk_set_stacking_limits(&ti_limits);
 
 		/* combine all target devices' limits */
 		if (ti->type->iterate_devices)
@@ -1221,10 +1221,10 @@ int dm_calculate_queue_limits(struct dm_table *table,
 	struct queue_limits ti_limits;
 	unsigned i = 0;
 
-	blk_set_default_limits(limits);
+	blk_set_stacking_limits(limits);
 
 	while (i < dm_table_get_num_targets(table)) {
-		blk_set_default_limits(&ti_limits);
+		blk_set_stacking_limits(&ti_limits);
 
 		ti = dm_table_get_target(table, i++);
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ee981737edf..114ba155af8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4622,6 +4622,7 @@ static int md_alloc(dev_t dev, char *name)
 	mddev->queue->queuedata = mddev;
 
 	blk_queue_make_request(mddev->queue, md_make_request);
+	blk_set_stacking_limits(&mddev->queue->limits);
 
 	disk = alloc_disk(1 << shift);
 	if (!disk) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8bca04873f5..adc34133a56 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -844,6 +844,7 @@ extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
 extern void blk_set_default_limits(struct queue_limits *lim);
+extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 			    sector_t offset);
 extern int bdev_stack_limits(struct queue_limits *t, struct block_device *bdev,
-- 
cgit v1.2.3-70-g09d2


From ef00f59c95fe6e002e7c6e3663cdea65e253f4cc Mon Sep 17 00:00:00 2001
From: "Martin K. Petersen" <martin.petersen@oracle.com>
Date: Wed, 11 Jan 2012 16:29:31 +0100
Subject: block: Add BLKROTATIONAL ioctl

Introduce an ioctl which permits applications to query whether a block
device is rotational.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/compat_ioctl.c | 3 +++
 block/ioctl.c        | 2 ++
 include/linux/fs.h   | 1 +
 3 files changed, 6 insertions(+)

(limited to 'block')

diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index 7b725020823..7c668c8a6f9 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -719,6 +719,9 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	case BLKSECTGET:
 		return compat_put_ushort(arg,
 					 queue_max_sectors(bdev_get_queue(bdev)));
+	case BLKROTATIONAL:
+		return compat_put_ushort(arg,
+					 !blk_queue_nonrot(bdev_get_queue(bdev)));
 	case BLKRASET: /* compatible, but no compat_ptr (!) */
 	case BLKFRASET:
 		if (!capable(CAP_SYS_ADMIN))
diff --git a/block/ioctl.c b/block/ioctl.c
index ca939fc1030..337d207ab14 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -278,6 +278,8 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		return put_uint(arg, bdev_discard_zeroes_data(bdev));
 	case BLKSECTGET:
 		return put_ushort(arg, queue_max_sectors(bdev_get_queue(bdev)));
+	case BLKROTATIONAL:
+		return put_ushort(arg, !blk_queue_nonrot(bdev_get_queue(bdev)));
 	case BLKRASET:
 	case BLKFRASET:
 		if(!capable(CAP_SYS_ADMIN))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0bc4ffb8e7..95dd911506f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -319,6 +319,7 @@ struct inodes_stat_t {
 #define BLKPBSZGET _IO(0x12,123)
 #define BLKDISCARDZEROES _IO(0x12,124)
 #define BLKSECDISCARD _IO(0x12,125)
+#define BLKROTATIONAL _IO(0x12,126)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
-- 
cgit v1.2.3-70-g09d2


From 577ebb374c78314ac4617242f509e2f5e7156649 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu, 12 Jan 2012 16:01:27 +0100
Subject: block: add and use scsi_blk_cmd_ioctl

Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

The function will then be enhanced to detect partition block devices
and, in that case, subject the ioctls to whitelisting.

Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 block/scsi_ioctl.c             | 7 +++++++
 drivers/block/cciss.c          | 6 +++---
 drivers/block/ub.c             | 3 +--
 drivers/block/virtio_blk.c     | 4 ++--
 drivers/cdrom/cdrom.c          | 3 +--
 drivers/ide/ide-floppy_ioctl.c | 3 +--
 drivers/scsi/sd.c              | 2 +-
 include/linux/blkdev.h         | 2 ++
 8 files changed, 18 insertions(+), 12 deletions(-)

(limited to 'block')

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index fbdf0d802ec..a2c11f33087 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -690,6 +690,13 @@ int scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mod
 }
 EXPORT_SYMBOL(scsi_cmd_ioctl);
 
+int scsi_cmd_blk_ioctl(struct block_device *bd, fmode_t mode,
+		       unsigned int cmd, void __user *arg)
+{
+	return scsi_cmd_ioctl(bd->bd_disk->queue, bd->bd_disk, mode, cmd, arg);
+}
+EXPORT_SYMBOL(scsi_cmd_blk_ioctl);
+
 static int __init blk_scsi_ioctl_init(void)
 {
 	blk_set_cmd_filter_defaults(&blk_default_cmd_filter);
diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 587cce57ada..b0f553b26d0 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1735,7 +1735,7 @@ static int cciss_ioctl(struct block_device *bdev, fmode_t mode,
 	case CCISS_BIG_PASSTHRU:
 		return cciss_bigpassthru(h, argp);
 
-	/* scsi_cmd_ioctl handles these, below, though some are not */
+	/* scsi_cmd_blk_ioctl handles these, below, though some are not */
 	/* very meaningful for cciss.  SG_IO is the main one people want. */
 
 	case SG_GET_VERSION_NUM:
@@ -1746,9 +1746,9 @@ static int cciss_ioctl(struct block_device *bdev, fmode_t mode,
 	case SG_EMULATED_HOST:
 	case SG_IO:
 	case SCSI_IOCTL_SEND_COMMAND:
-		return scsi_cmd_ioctl(disk->queue, disk, mode, cmd, argp);
+		return scsi_cmd_blk_ioctl(bdev, mode, cmd, argp);
 
-	/* scsi_cmd_ioctl would normally handle these, below, but */
+	/* scsi_cmd_blk_ioctl would normally handle these, below, but */
 	/* they aren't a good fit for cciss, as CD-ROMs are */
 	/* not supported, and we don't have any bus/target/lun */
 	/* which we present to the kernel. */
diff --git a/drivers/block/ub.c b/drivers/block/ub.c
index 0e376d46bdd..7333b9e4441 100644
--- a/drivers/block/ub.c
+++ b/drivers/block/ub.c
@@ -1744,12 +1744,11 @@ static int ub_bd_release(struct gendisk *disk, fmode_t mode)
 static int ub_bd_ioctl(struct block_device *bdev, fmode_t mode,
     unsigned int cmd, unsigned long arg)
 {
-	struct gendisk *disk = bdev->bd_disk;
 	void __user *usermem = (void __user *) arg;
 	int ret;
 
 	mutex_lock(&ub_mutex);
-	ret = scsi_cmd_ioctl(disk->queue, disk, mode, cmd, usermem);
+	ret = scsi_cmd_blk_ioctl(bdev, mode, cmd, usermem);
 	mutex_unlock(&ub_mutex);
 
 	return ret;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index ffd5ca91929..c4a60badf25 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -250,8 +250,8 @@ static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 	if (!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_SCSI))
 		return -ENOTTY;
 
-	return scsi_cmd_ioctl(disk->queue, disk, mode, cmd,
-			      (void __user *)data);
+	return scsi_cmd_blk_ioctl(bdev, mode, cmd,
+				  (void __user *)data);
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 1bbf7645a97..55eaf474d32 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -2746,12 +2746,11 @@ int cdrom_ioctl(struct cdrom_device_info *cdi, struct block_device *bdev,
 {
 	void __user *argp = (void __user *)arg;
 	int ret;
-	struct gendisk *disk = bdev->bd_disk;
 
 	/*
 	 * Try the generic SCSI command ioctl's first.
 	 */
-	ret = scsi_cmd_ioctl(disk->queue, disk, mode, cmd, argp);
+	ret = scsi_cmd_blk_ioctl(bdev, mode, cmd, argp);
 	if (ret != -ENOTTY)
 		return ret;
 
diff --git a/drivers/ide/ide-floppy_ioctl.c b/drivers/ide/ide-floppy_ioctl.c
index d267b7affad..a22ca846701 100644
--- a/drivers/ide/ide-floppy_ioctl.c
+++ b/drivers/ide/ide-floppy_ioctl.c
@@ -292,8 +292,7 @@ int ide_floppy_ioctl(ide_drive_t *drive, struct block_device *bdev,
 	 * and CDROM_SEND_PACKET (legacy) ioctls
 	 */
 	if (cmd != CDROM_SEND_PACKET && cmd != SCSI_IOCTL_SEND_COMMAND)
-		err = scsi_cmd_ioctl(bdev->bd_disk->queue, bdev->bd_disk,
-				mode, cmd, argp);
+		err = scsi_cmd_blk_ioctl(bdev, mode, cmd, argp);
 
 	if (err == -ENOTTY)
 		err = generic_ide_ioctl(drive, bdev, cmd, arg);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7b3f8075e2a..b4d57bb04c7 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1097,7 +1097,7 @@ static int sd_ioctl(struct block_device *bdev, fmode_t mode,
 			error = scsi_ioctl(sdp, cmd, p);
 			break;
 		default:
-			error = scsi_cmd_ioctl(disk->queue, disk, mode, cmd, p);
+			error = scsi_cmd_blk_ioctl(bdev, mode, cmd, p);
 			if (error != -ENOTTY)
 				break;
 			error = scsi_ioctl(sdp, cmd, p);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 94acd8172b5..ca7b869508c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -675,6 +675,8 @@ extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
+extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
+			      unsigned int, void __user *);
 extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			  unsigned int, void __user *);
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
-- 
cgit v1.2.3-70-g09d2


From 0bfc96cb77224736dfa35c3c555d37b3646ef35e Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu, 12 Jan 2012 16:01:28 +0100
Subject: block: fail SCSI passthrough ioctls on partition devices

Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
will pass the command to the underlying block device.  This is
well-known, but it is also a large security problem when (via Unix
permissions, ACLs, SELinux or a combination thereof) a program or user
needs to be granted access only to part of the disk.

This patch lets partitions forward a small set of harmless ioctls;
others are logged with printk so that we can see which ioctls are
actually sent.  In my tests only CDROM_GET_CAPABILITY actually occurred.
Of course it was being sent to a (partition on a) hard disk, so it would
have failed with ENOTTY and the patch isn't changing anything in
practice.  Still, I'm treating it specially to avoid spamming the logs.

In principle, this restriction should include programs running with
CAP_SYS_RAWIO.  If for example I let a program access /dev/sda2 and
/dev/sdb, it still should not be able to read/write outside the
boundaries of /dev/sda2 independent of the capabilities.  However, for
now programs with CAP_SYS_RAWIO will still be allowed to send the
ioctls.  Their actions will still be logged.

This patch does not affect the non-libata IDE driver.  That driver
however already tests for bd != bd->bd_contains before issuing some
ioctl; it could be restricted further to forbid these ioctls even for
programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.

Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Make it also print the command name when warning - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 block/scsi_ioctl.c     | 45 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/sd.c      | 11 +++++++++--
 include/linux/blkdev.h |  1 +
 3 files changed, 55 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index a2c11f33087..260fa80ef57 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -24,6 +24,7 @@
 #include <linux/capability.h>
 #include <linux/completion.h>
 #include <linux/cdrom.h>
+#include <linux/ratelimit.h>
 #include <linux/slab.h>
 #include <linux/times.h>
 #include <asm/uaccess.h>
@@ -690,9 +691,53 @@ int scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mod
 }
 EXPORT_SYMBOL(scsi_cmd_ioctl);
 
+int scsi_verify_blk_ioctl(struct block_device *bd, unsigned int cmd)
+{
+	if (bd && bd == bd->bd_contains)
+		return 0;
+
+	/* Actually none of these is particularly useful on a partition,
+	 * but they are safe.
+	 */
+	switch (cmd) {
+	case SCSI_IOCTL_GET_IDLUN:
+	case SCSI_IOCTL_GET_BUS_NUMBER:
+	case SCSI_IOCTL_GET_PCI:
+	case SCSI_IOCTL_PROBE_HOST:
+	case SG_GET_VERSION_NUM:
+	case SG_SET_TIMEOUT:
+	case SG_GET_TIMEOUT:
+	case SG_GET_RESERVED_SIZE:
+	case SG_SET_RESERVED_SIZE:
+	case SG_EMULATED_HOST:
+		return 0;
+	case CDROM_GET_CAPABILITY:
+		/* Keep this until we remove the printk below.  udev sends it
+		 * and we do not want to spam dmesg about it.   CD-ROMs do
+		 * not have partitions, so we get here only for disks.
+		 */
+		return -ENOIOCTLCMD;
+	default:
+		break;
+	}
+
+	/* In particular, rule out all resets and host-specific ioctls.  */
+	printk_ratelimited(KERN_WARNING
+			   "%s: sending ioctl %x to a partition!\n", current->comm, cmd);
+
+	return capable(CAP_SYS_RAWIO) ? 0 : -ENOIOCTLCMD;
+}
+EXPORT_SYMBOL(scsi_verify_blk_ioctl);
+
 int scsi_cmd_blk_ioctl(struct block_device *bd, fmode_t mode,
 		       unsigned int cmd, void __user *arg)
 {
+	int ret;
+
+	ret = scsi_verify_blk_ioctl(bd, cmd);
+	if (ret < 0)
+		return ret;
+
 	return scsi_cmd_ioctl(bd->bd_disk->queue, bd->bd_disk, mode, cmd, arg);
 }
 EXPORT_SYMBOL(scsi_cmd_blk_ioctl);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b4d57bb04c7..c691fb50e6c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1075,6 +1075,10 @@ static int sd_ioctl(struct block_device *bdev, fmode_t mode,
 	SCSI_LOG_IOCTL(1, sd_printk(KERN_INFO, sdkp, "sd_ioctl: disk=%s, "
 				    "cmd=0x%x\n", disk->disk_name, cmd));
 
+	error = scsi_verify_blk_ioctl(bdev, cmd);
+	if (error < 0)
+		return error;
+
 	/*
 	 * If we are in the middle of error recovery, don't let anyone
 	 * else try and use this device.  Also, if error recovery fails, it
@@ -1267,6 +1271,11 @@ static int sd_compat_ioctl(struct block_device *bdev, fmode_t mode,
 			   unsigned int cmd, unsigned long arg)
 {
 	struct scsi_device *sdev = scsi_disk(bdev->bd_disk)->device;
+	int ret;
+
+	ret = scsi_verify_blk_ioctl(bdev, cmd);
+	if (ret < 0)
+		return ret;
 
 	/*
 	 * If we are in the middle of error recovery, don't let anyone
@@ -1278,8 +1287,6 @@ static int sd_compat_ioctl(struct block_device *bdev, fmode_t mode,
 		return -ENODEV;
 	       
 	if (sdev->host->hostt->compat_ioctl) {
-		int ret;
-
 		ret = sdev->host->hostt->compat_ioctl(sdev, cmd, (void __user *)arg);
 
 		return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ca7b869508c..0ed1eb06231 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -675,6 +675,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
+extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
 			      unsigned int, void __user *);
 extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
-- 
cgit v1.2.3-70-g09d2


From 5d381efb3d1f1ef10535a31ca0dd9b22fe1e1922 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Sun, 15 Jan 2012 10:29:48 +0100
Subject: Revert "block: recursive merge requests"

This reverts commit 274193224cdabd687d804a26e0150bb20f2dd52c.

We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/elevator.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

(limited to 'block')

diff --git a/block/elevator.c b/block/elevator.c
index 99838f460b4..91e18f8af9b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -515,7 +515,6 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
 				     struct request *rq)
 {
 	struct request *__rq;
-	bool ret;
 
 	if (blk_queue_nomerges(q))
 		return false;
@@ -529,21 +528,14 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
 	if (blk_queue_noxmerges(q))
 		return false;
 
-	ret = false;
 	/*
 	 * See if our hash lookup can find a potential backmerge.
 	 */
-	while (1) {
-		__rq = elv_rqhash_find(q, blk_rq_pos(rq));
-		if (!__rq || !blk_attempt_req_merge(q, __rq, rq))
-			break;
-
-		/* The merged request could be merged with others, try again */
-		ret = true;
-		rq = __rq;
-	}
+	__rq = elv_rqhash_find(q, blk_rq_pos(rq));
+	if (__rq && blk_attempt_req_merge(q, __rq, rq))
+		return true;
 
-	return ret;
+	return false;
 }
 
 void elv_merged_request(struct request_queue *q, struct request *rq, int type)
-- 
cgit v1.2.3-70-g09d2


From 54b466e44b1c7809144bbd8cd6be3f85877ca46f Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Tue, 17 Jan 2012 21:26:11 +0100
Subject: cfq-iosched: fix use-after-free of cfqq

With the changes in life time management between the cfq IO contexts
and the cfq queues, we now risk having cfqd->active_queue being
freed when cfq_slice_expired() is being called. cfq_preempt_queue()
caches this queue and uses it after calling said function, causing
a use-after-free condition. This triggers the following oops,
when cfqq_type() attempts to dereference it:

BUG: unable to handle kernel paging request at ffff8800746c4f0c
IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20
PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU 3
Modules linked in:

Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs
RIP: 0010:[<ffffffff81266d59>]  [<ffffffff81266d59>] cfqq_type+0xb/0x20
RSP: 0018:ffff880079c11778  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000
RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08
RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001
R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0
R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002
FS:  00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040)
Stack:
 ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43
 ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98
 0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149
Call Trace:
 [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c
 [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31
 [<ffffffff81254149>] __elv_add_request+0x1ca/0x200
 [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312
 [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0
 [<ffffffff8125907b>] submit_bio+0xbf/0xca
 [<ffffffff81136ec7>] submit_bh+0xdf/0xfe
 [<ffffffff81176d04>] ext3_bread+0x50/0x99
 [<ffffffff811785b3>] dx_probe+0x38/0x291
 [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219
 [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406
 [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0
 [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191
 [<ffffffff8117a330>] ext3_lookup+0x39/0xe1
 [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c
 [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5
 [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef
 [<ffffffff8111b557>] path_lookupat+0x59/0x5ea
 [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a
 [<ffffffff8111bce0>] do_path_lookup+0x23/0x59
 [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99
 [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56
 [<ffffffff8111d02d>] user_path_at+0x11/0x13
 [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64
 [<ffffffff8111425a>] vfs_stat+0x1b/0x1d
 [<ffffffff81114359>] sys_newstat+0x1a/0x33
 [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42
 [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b
Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3
RIP  [<ffffffff81266d59>] cfqq_type+0xb/0x20
 RSP <ffff880079c11778>
CR2: ffff8800746c4f0c

Get rid of the caching of cfqd->active_queue, and reorder the
check so that it happens before we expire the active queue.

Thanks to Tejun for pin pointing the error location.

Reported-by: Chris Mason <chris.mason@oracle.com>
Tested-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 163263ddd38..ee55019066a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	struct cfq_queue *old_cfqq = cfqd->active_queue;
-
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
 
 	/*
 	 * workload type is changed, don't save slice, otherwise preempt
 	 * doesn't happen
 	 */
-	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
+	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
 		cfqq->cfqg->saved_workload_slice = 0;
 
+	cfq_slice_expired(cfqd, 1);
+
 	/*
 	 * Put the new queue at the front of the of the current list,
 	 * so we know that it will be selected next.
-- 
cgit v1.2.3-70-g09d2


From df0793abb929e66606fa25f3875ff1b89de5ad32 Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Thu, 19 Jan 2012 09:20:09 +0100
Subject: block,cfq: change code order

cfq_slice_expired will change saved_workload_slice. It should be called
first so saved_workload_slice is correctly set to 0 after workload type
is changed.
This fixes the code order changed by 54b466e44b1c7.

Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/cfq-iosched.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

(limited to 'block')

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ee55019066a..da21c24dbed 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3117,17 +3117,18 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
+	enum wl_type_t old_type = cfqq_type(cfqd->active_queue);
+
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
+	cfq_slice_expired(cfqd, 1);
 
 	/*
 	 * workload type is changed, don't save slice, otherwise preempt
 	 * doesn't happen
 	 */
-	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
+	if (old_type != cfqq_type(cfqq))
 		cfqq->cfqg->saved_workload_slice = 0;
 
-	cfq_slice_expired(cfqd, 1);
-
 	/*
 	 * Put the new queue at the front of the of the current list,
 	 * so we know that it will be selected next.
-- 
cgit v1.2.3-70-g09d2


From 05c30b9551f1904d9950ad0d28e65fc4ff3c8a8e Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Thu, 19 Jan 2012 09:20:10 +0100
Subject: block: fix NULL icq_cache reference

Vivek reported a kernel crash:
[   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
[   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   94.218004] CPU 0
[   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
[   94.218004]
[   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
[   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
[   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
[   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
[   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
[   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
[   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
[   94.218004] Stack:
[   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
[   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
[   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
[   94.218004] Call Trace:
[   94.218004]  <IRQ>
[   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
[   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
[   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
[   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
[   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
[   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
[   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
[   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
[   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
[   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
[   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80

Once a queue is quiesced, it's not supposed to have any elvpriv data or
icq's, and elevator switching depends on that.  Request alloc path
followed the rule for elvpriv data but forgot apply it to icq's
leading to the following crash during elevator switch. Fix it by not
allocating icq's if ELVPRIV is not set for the request.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index e6c05a97ee2..63670257511 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -872,13 +872,15 @@ retry:
 	spin_unlock_irq(q->queue_lock);
 
 	/* create icq if missing */
-	if (unlikely(et->icq_cache && !icq))
+	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
 		icq = ioc_create_icq(q, gfp_mask);
+		if (!icq)
+			goto fail_icq;
+	}
 
-	/* rqs are guaranteed to have icq on elv_set_request() if requested */
-	if (likely(!et->icq_cache || icq))
-		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
+	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
 
+fail_icq:
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
-- 
cgit v1.2.3-70-g09d2


From 9fa73472ddbcd3da87d35a7f4566eaaf345f798e Mon Sep 17 00:00:00 2001
From: Shaohua Li <shaohua.li@intel.com>
Date: Mon, 6 Feb 2012 08:57:29 +0100
Subject: block: fix ioc locking warning

Meelis reported a warning:

WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
Hardware name: 939Dual-SATA2
timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176
Call Trace:
 <IRQ>  [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96
 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
 [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43
 [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa
 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
 [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec
 [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31
 [<ffffffff810145fd>] ? apic_write+0x11/0x13
 [<ffffffff81027475>] __do_softirq+0x74/0xfa
 [<ffffffff812f337a>] call_softirq+0x1a/0x30
 [<ffffffff81002ff9>] do_softirq+0x31/0x68
 [<ffffffff810276cf>] irq_exit+0x3d/0xa3
 [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77
 [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70
 <EOI>  [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
 [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
 [<ffffffff8100801f>] ? default_idle+0x1e/0x32
 [<ffffffff81008019>] ? default_idle+0x18/0x32
 [<ffffffff810008b1>] cpu_idle+0x87/0xd1
 [<ffffffff812de861>] rest_init+0x85/0x89
 [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8
 [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82
 [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7

this_q == locked_q is possible. There are two problems here:
1. In UP case, there is preemption counter issue as spin_trylock always
successes.
2. In SMP case, the loop breaks too earlier.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Knut Petersen <Knut_Petersen@t-online.de>
Tested-by: Knut Petersen <Knut_Petersen@t-online.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 27a06e00eae..7490b6da245 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -204,7 +204,9 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 				spin_unlock(last_q->queue_lock);
 			last_q = NULL;
 
-			if (!spin_trylock(this_q->queue_lock))
+			/* spin_trylock() always successes in UP case */
+			if (this_q != locked_q &&
+			    !spin_trylock(this_q->queue_lock))
 				break;
 			last_q = this_q;
 			continue;
-- 
cgit v1.2.3-70-g09d2


From 11a3122f6cf2d988a77eb8883d0fc49cd013a6d5 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 7 Feb 2012 07:51:30 +0100
Subject: block: strip out locking optimization in put_io_context()

put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue.  It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.

While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization.  Strip it out.  If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c        |  2 +-
 block/blk-core.c          |  2 +-
 block/blk-ioc.c           | 92 ++++++-----------------------------------------
 block/cfq-iosched.c       |  2 +-
 fs/ioprio.c               |  2 +-
 include/linux/blkdev.h    |  3 --
 include/linux/iocontext.h |  5 ++-
 kernel/fork.c             |  2 +-
 8 files changed, 18 insertions(+), 92 deletions(-)

(limited to 'block')

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fa8f2630944..75642a352a8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
 		if (ioc) {
 			ioc_cgroup_changed(ioc);
-			put_io_context(ioc, NULL);
+			put_io_context(ioc);
 		}
 	}
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index 63670257511..532b3a21b38 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -642,7 +642,7 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 	if (rq->cmd_flags & REQ_ELVPRIV) {
 		elv_put_request(q, rq);
 		if (rq->elv.icq)
-			put_io_context(rq->elv.icq->ioc, q);
+			put_io_context(rq->elv.icq->ioc);
 	}
 
 	mempool_free(rq, q->rq.rq_pool);
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 7490b6da245..9884fd7427f 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -29,21 +29,6 @@ void get_io_context(struct io_context *ioc)
 }
 EXPORT_SYMBOL(get_io_context);
 
-/*
- * Releasing ioc may nest into another put_io_context() leading to nested
- * fast path release.  As the ioc's can't be the same, this is okay but
- * makes lockdep whine.  Keep track of nesting and use it as subclass.
- */
-#ifdef CONFIG_LOCKDEP
-#define ioc_release_depth(q)		((q) ? (q)->ioc_release_depth : 0)
-#define ioc_release_depth_inc(q)	(q)->ioc_release_depth++
-#define ioc_release_depth_dec(q)	(q)->ioc_release_depth--
-#else
-#define ioc_release_depth(q)		0
-#define ioc_release_depth_inc(q)	do { } while (0)
-#define ioc_release_depth_dec(q)	do { } while (0)
-#endif
-
 static void icq_free_icq_rcu(struct rcu_head *head)
 {
 	struct io_cq *icq = container_of(head, struct io_cq, __rcu_head);
@@ -75,11 +60,8 @@ static void ioc_exit_icq(struct io_cq *icq)
 	if (rcu_dereference_raw(ioc->icq_hint) == icq)
 		rcu_assign_pointer(ioc->icq_hint, NULL);
 
-	if (et->ops.elevator_exit_icq_fn) {
-		ioc_release_depth_inc(q);
+	if (et->ops.elevator_exit_icq_fn)
 		et->ops.elevator_exit_icq_fn(icq);
-		ioc_release_depth_dec(q);
-	}
 
 	/*
 	 * @icq->q might have gone away by the time RCU callback runs
@@ -149,81 +131,29 @@ static void ioc_release_fn(struct work_struct *work)
 /**
  * put_io_context - put a reference of io_context
  * @ioc: io_context to put
- * @locked_q: request_queue the caller is holding queue_lock of (hint)
  *
  * Decrement reference count of @ioc and release it if the count reaches
- * zero.  If the caller is holding queue_lock of a queue, it can indicate
- * that with @locked_q.  This is an optimization hint and the caller is
- * allowed to pass in %NULL even when it's holding a queue_lock.
+ * zero.
  */
-void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
+void put_io_context(struct io_context *ioc)
 {
-	struct request_queue *last_q = locked_q;
 	unsigned long flags;
 
 	if (ioc == NULL)
 		return;
 
 	BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
-	if (locked_q)
-		lockdep_assert_held(locked_q->queue_lock);
-
-	if (!atomic_long_dec_and_test(&ioc->refcount))
-		return;
 
 	/*
-	 * Destroy @ioc.  This is a bit messy because icq's are chained
-	 * from both ioc and queue, and ioc->lock nests inside queue_lock.
-	 * The inner ioc->lock should be held to walk our icq_list and then
-	 * for each icq the outer matching queue_lock should be grabbed.
-	 * ie. We need to do reverse-order double lock dancing.
-	 *
-	 * Another twist is that we are often called with one of the
-	 * matching queue_locks held as indicated by @locked_q, which
-	 * prevents performing double-lock dance for other queues.
-	 *
-	 * So, we do it in two stages.  The fast path uses the queue_lock
-	 * the caller is holding and, if other queues need to be accessed,
-	 * uses trylock to avoid introducing locking dependency.  This can
-	 * handle most cases, especially if @ioc was performing IO on only
-	 * single device.
-	 *
-	 * If trylock doesn't cut it, we defer to @ioc->release_work which
-	 * can do all the double-locking dancing.
+	 * Releasing ioc requires reverse order double locking and we may
+	 * already be holding a queue_lock.  Do it asynchronously from wq.
 	 */
-	spin_lock_irqsave_nested(&ioc->lock, flags,
-				 ioc_release_depth(locked_q));
-
-	while (!hlist_empty(&ioc->icq_list)) {
-		struct io_cq *icq = hlist_entry(ioc->icq_list.first,
-						struct io_cq, ioc_node);
-		struct request_queue *this_q = icq->q;
-
-		if (this_q != last_q) {
-			if (last_q && last_q != locked_q)
-				spin_unlock(last_q->queue_lock);
-			last_q = NULL;
-
-			/* spin_trylock() always successes in UP case */
-			if (this_q != locked_q &&
-			    !spin_trylock(this_q->queue_lock))
-				break;
-			last_q = this_q;
-			continue;
-		}
-		ioc_exit_icq(icq);
+	if (atomic_long_dec_and_test(&ioc->refcount)) {
+		spin_lock_irqsave(&ioc->lock, flags);
+		if (!hlist_empty(&ioc->icq_list))
+			schedule_work(&ioc->release_work);
+		spin_unlock_irqrestore(&ioc->lock, flags);
 	}
-
-	if (last_q && last_q != locked_q)
-		spin_unlock(last_q->queue_lock);
-
-	spin_unlock_irqrestore(&ioc->lock, flags);
-
-	/* if no icq is left, we're done; otherwise, kick release_work */
-	if (hlist_empty(&ioc->icq_list))
-		kmem_cache_free(iocontext_cachep, ioc);
-	else
-		schedule_work(&ioc->release_work);
 }
 EXPORT_SYMBOL(put_io_context);
 
@@ -238,7 +168,7 @@ void exit_io_context(struct task_struct *task)
 	task_unlock(task);
 
 	atomic_dec(&ioc->nr_tasks);
-	put_io_context(ioc, NULL);
+	put_io_context(ioc);
 }
 
 /**
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index da21c24dbed..5684df6848b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1794,7 +1794,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfqd->active_queue = NULL;
 
 	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue);
+		put_io_context(cfqd->active_cic->icq.ioc);
 		cfqd->active_cic = NULL;
 	}
 }
diff --git a/fs/ioprio.c b/fs/ioprio.c
index f84b380d65e..0f1b9515213 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 	ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
 	if (ioc) {
 		ioc_ioprio_changed(ioc, ioprio);
-		put_io_context(ioc, NULL);
+		put_io_context(ioc);
 	}
 
 	return err;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6c6a1f00806..606cf339bb5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -399,9 +399,6 @@ struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
-#ifdef CONFIG_LOCKDEP
-	int			ioc_release_depth;
-#endif
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 7e1371c4bcc..119773eebe3 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -133,7 +133,7 @@ static inline struct io_context *ioc_task_link(struct io_context *ioc)
 
 struct task_struct;
 #ifdef CONFIG_BLOCK
-void put_io_context(struct io_context *ioc, struct request_queue *locked_q);
+void put_io_context(struct io_context *ioc);
 void exit_io_context(struct task_struct *task);
 struct io_context *get_task_io_context(struct task_struct *task,
 				       gfp_t gfp_flags, int node);
@@ -141,8 +141,7 @@ void ioc_ioprio_changed(struct io_context *ioc, int ioprio);
 void ioc_cgroup_changed(struct io_context *ioc);
 #else
 struct io_context;
-static inline void put_io_context(struct io_context *ioc,
-				  struct request_queue *locked_q) { }
+static inline void put_io_context(struct io_context *ioc) { }
 static inline void exit_io_context(struct task_struct *task) { }
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 051f090d40c..c574aefa8d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 			return -ENOMEM;
 
 		new_ioc->ioprio = ioc->ioprio;
-		put_io_context(new_ioc, NULL);
+		put_io_context(new_ioc);
 	}
 #endif
 	return 0;
-- 
cgit v1.2.3-70-g09d2


From 050c8ea80e3e90019d9e981c6a117ef614e882ed Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 8 Feb 2012 09:19:38 +0100
Subject: block: separate out blk_rq_merge_ok() and blk_try_merge() from
 elevator functions

blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test.  blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge().  elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge().  While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c         |  4 ++--
 block/blk-merge.c        | 37 ++++++++++++++++++++++++++++++++
 block/blk.h              |  2 ++
 block/elevator.c         | 55 ++++--------------------------------------------
 include/linux/elevator.h |  3 +--
 5 files changed, 46 insertions(+), 55 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 532b3a21b38..fa697bf691e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1282,10 +1282,10 @@ static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
 
 		(*request_count)++;
 
-		if (rq->q != q)
+		if (rq->q != q || !elv_rq_merge_ok(rq, bio))
 			continue;
 
-		el_ret = elv_try_merge(rq, bio);
+		el_ret = blk_try_merge(rq, bio);
 		if (el_ret == ELEVATOR_BACK_MERGE) {
 			ret = bio_attempt_back_merge(q, rq, bio);
 			if (ret)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index cfcc37cb222..160035f5488 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -471,3 +471,40 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 {
 	return attempt_merge(q, rq, next);
 }
+
+bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
+{
+	if (!rq_mergeable(rq))
+		return false;
+
+	/* don't merge file system requests and discard requests */
+	if ((bio->bi_rw & REQ_DISCARD) != (rq->bio->bi_rw & REQ_DISCARD))
+		return false;
+
+	/* don't merge discard requests and secure discard requests */
+	if ((bio->bi_rw & REQ_SECURE) != (rq->bio->bi_rw & REQ_SECURE))
+		return false;
+
+	/* different data direction or already started, don't merge */
+	if (bio_data_dir(bio) != rq_data_dir(rq))
+		return false;
+
+	/* must be same device and not a special request */
+	if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
+		return false;
+
+	/* only merge integrity protected bio into ditto rq */
+	if (bio_integrity(bio) != blk_integrity_rq(rq))
+		return false;
+
+	return true;
+}
+
+int blk_try_merge(struct request *rq, struct bio *bio)
+{
+	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_sector)
+		return ELEVATOR_BACK_MERGE;
+	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_sector)
+		return ELEVATOR_FRONT_MERGE;
+	return ELEVATOR_NO_MERGE;
+}
diff --git a/block/blk.h b/block/blk.h
index 7efd772336d..9c12f80882b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -137,6 +137,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 				struct request *next);
 void blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
+bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
+int blk_try_merge(struct request *rq, struct bio *bio);
 
 void blk_queue_congestion_threshold(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8af9b..f016855a46b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -70,39 +70,9 @@ static int elv_iosched_allow_merge(struct request *rq, struct bio *bio)
 /*
  * can we safely merge with this request?
  */
-int elv_rq_merge_ok(struct request *rq, struct bio *bio)
+bool elv_rq_merge_ok(struct request *rq, struct bio *bio)
 {
-	if (!rq_mergeable(rq))
-		return 0;
-
-	/*
-	 * Don't merge file system requests and discard requests
-	 */
-	if ((bio->bi_rw & REQ_DISCARD) != (rq->bio->bi_rw & REQ_DISCARD))
-		return 0;
-
-	/*
-	 * Don't merge discard requests and secure discard requests
-	 */
-	if ((bio->bi_rw & REQ_SECURE) != (rq->bio->bi_rw & REQ_SECURE))
-		return 0;
-
-	/*
-	 * different data direction or already started, don't merge
-	 */
-	if (bio_data_dir(bio) != rq_data_dir(rq))
-		return 0;
-
-	/*
-	 * must be same device and not a special request
-	 */
-	if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
-		return 0;
-
-	/*
-	 * only merge integrity protected bio into ditto rq
-	 */
-	if (bio_integrity(bio) != blk_integrity_rq(rq))
+	if (!blk_rq_merge_ok(rq, bio))
 		return 0;
 
 	if (!elv_iosched_allow_merge(rq, bio))
@@ -112,23 +82,6 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 }
 EXPORT_SYMBOL(elv_rq_merge_ok);
 
-int elv_try_merge(struct request *__rq, struct bio *bio)
-{
-	int ret = ELEVATOR_NO_MERGE;
-
-	/*
-	 * we can merge and sequence is ok, check if it's possible
-	 */
-	if (elv_rq_merge_ok(__rq, bio)) {
-		if (blk_rq_pos(__rq) + blk_rq_sectors(__rq) == bio->bi_sector)
-			ret = ELEVATOR_BACK_MERGE;
-		else if (blk_rq_pos(__rq) - bio_sectors(bio) == bio->bi_sector)
-			ret = ELEVATOR_FRONT_MERGE;
-	}
-
-	return ret;
-}
-
 static struct elevator_type *elevator_find(const char *name)
 {
 	struct elevator_type *e;
@@ -478,8 +431,8 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	/*
 	 * First try one-hit cache.
 	 */
-	if (q->last_merge) {
-		ret = elv_try_merge(q->last_merge, bio);
+	if (q->last_merge && elv_rq_merge_ok(q->last_merge, bio)) {
+		ret = blk_try_merge(q->last_merge, bio);
 		if (ret != ELEVATOR_NO_MERGE) {
 			*req = q->last_merge;
 			return ret;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c24f3d7fbf1..d6dfb65c888 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -122,7 +122,6 @@ extern void elv_dispatch_add_tail(struct request_queue *, struct request *);
 extern void elv_add_request(struct request_queue *, struct request *, int);
 extern void __elv_add_request(struct request_queue *, struct request *, int);
 extern int elv_merge(struct request_queue *, struct request **, struct bio *);
-extern int elv_try_merge(struct request *, struct bio *);
 extern void elv_merge_requests(struct request_queue *, struct request *,
 			       struct request *);
 extern void elv_merged_request(struct request_queue *, struct request *, int);
@@ -155,7 +154,7 @@ extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
 extern int elevator_init(struct request_queue *, char *);
 extern void elevator_exit(struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
-extern int elv_rq_merge_ok(struct request *, struct bio *);
+extern bool elv_rq_merge_ok(struct request *, struct bio *);
 
 /*
  * Helper functions.
-- 
cgit v1.2.3-70-g09d2


From 07c2bd37350c9b1af71b35d05f16e300a6602948 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 8 Feb 2012 09:19:42 +0100
Subject: block: don't call elevator callbacks for plug merges

Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.

For regular merges, bypass ensures elvpriv count to reach zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.

This, for example, leads to the following crash during elevator
switch.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
 PGD 112cbc067 PUD 115d5c067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU 1
 Modules linked in: deadline_iosched

 Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
 RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
 RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
 RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
 RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
 RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
 R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
 FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
 Stack:
  ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
  ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
  ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
 Call Trace:
  [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
  [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
  [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
  [<ffffffff81396a5a>] generic_make_request+0xca/0x100
  [<ffffffff81396b04>] submit_bio+0x74/0x100
  [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
  [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
  [<ffffffff811986b2>] do_sync_read+0xe2/0x120
  [<ffffffff81199345>] vfs_read+0xc5/0x180
  [<ffffffff81199501>] sys_read+0x51/0x90
  [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b

There are multiple ways to fix this including making plug merge check
ELVPRIV; however,

* Calling into elevator outside queue lock is confusing and
  error-prone.

* Requests on plug list aren't known to the elevator.  They aren't on
  the elevator yet, so there's no elevator specific state to update.

* Given the nature of plug merges - collecting bio's for the same
  purpose from the same issuer - elevator specific restrictions aren't
  applicable.

So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Note that this makes per-cgroup merged stats skip plug merging.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c         | 19 +++++++++----------
 block/cfq-iosched.c      | 15 ++++-----------
 include/linux/elevator.h |  6 ------
 3 files changed, 13 insertions(+), 27 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index fa697bf691e..3a78b00edd7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1212,7 +1212,6 @@ static bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
 
 	drive_stat_acct(req, 0);
-	elv_bio_merged(q, req, bio);
 	return true;
 }
 
@@ -1243,7 +1242,6 @@ static bool bio_attempt_front_merge(struct request_queue *q,
 	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
 
 	drive_stat_acct(req, 0);
-	elv_bio_merged(q, req, bio);
 	return true;
 }
 
@@ -1257,13 +1255,12 @@ static bool bio_attempt_front_merge(struct request_queue *q,
  * on %current's plugged list.  Returns %true if merge was successful,
  * otherwise %false.
  *
- * This function is called without @q->queue_lock; however, elevator is
- * accessed iff there already are requests on the plugged list which in
- * turn guarantees validity of the elevator.
- *
- * Note that, on successful merge, elevator operation
- * elevator_bio_merged_fn() will be called without queue lock.  Elevator
- * must be ready for this.
+ * Plugging coalesces IOs from the same issuer for the same purpose without
+ * going through @q->queue_lock.  As such it's more of an issuing mechanism
+ * than scheduling, and the request, while may have elvpriv data, is not
+ * added on the elevator at this point.  In addition, we don't have
+ * reliable access to the elevator outside queue lock.  Only check basic
+ * merging parameters without querying the elevator.
  */
 static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
 			       unsigned int *request_count)
@@ -1282,7 +1279,7 @@ static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
 
 		(*request_count)++;
 
-		if (rq->q != q || !elv_rq_merge_ok(rq, bio))
+		if (rq->q != q || !blk_rq_merge_ok(rq, bio))
 			continue;
 
 		el_ret = blk_try_merge(rq, bio);
@@ -1347,12 +1344,14 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	el_ret = elv_merge(q, &req, bio);
 	if (el_ret == ELEVATOR_BACK_MERGE) {
 		if (bio_attempt_back_merge(q, req, bio)) {
+			elv_bio_merged(q, req, bio);
 			if (!attempt_back_merge(q, req))
 				elv_merged_request(q, req, el_ret);
 			goto out_unlock;
 		}
 	} else if (el_ret == ELEVATOR_FRONT_MERGE) {
 		if (bio_attempt_front_merge(q, req, bio)) {
+			elv_bio_merged(q, req, bio);
 			if (!attempt_front_merge(q, req))
 				elv_merged_request(q, req, el_ret);
 			goto out_unlock;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5684df6848b..d0ba5053366 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1699,18 +1699,11 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 
 	/*
 	 * Lookup the cfqq that this bio will be queued with and allow
-	 * merge only if rq is queued there.  This function can be called
-	 * from plug merge without queue_lock.  In such cases, ioc of @rq
-	 * and %current are guaranteed to be equal.  Avoid lookup which
-	 * requires queue_lock by using @rq's cic.
+	 * merge only if rq is queued there.
 	 */
-	if (current->io_context == RQ_CIC(rq)->icq.ioc) {
-		cic = RQ_CIC(rq);
-	} else {
-		cic = cfq_cic_lookup(cfqd, current->io_context);
-		if (!cic)
-			return false;
-	}
+	cic = cfq_cic_lookup(cfqd, current->io_context);
+	if (!cic)
+		return false;
 
 	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
 	return cfqq == RQ_CFQQ(rq);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index d6dfb65c888..7d4e0356f32 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,12 +42,6 @@ struct elevator_ops
 	elevator_merged_fn *elevator_merged_fn;
 	elevator_merge_req_fn *elevator_merge_req_fn;
 	elevator_allow_merge_fn *elevator_allow_merge_fn;
-
-	/*
-	 * Used for both plugged list and elevator merging and in the
-	 * former case called without queue_lock.  Read comment on top of
-	 * attempt_plug_merge() for details.
-	 */
 	elevator_bio_merged_fn *elevator_bio_merged_fn;
 
 	elevator_dispatch_fn *elevator_dispatch_fn;
-- 
cgit v1.2.3-70-g09d2


From 37b40adf2d1b4a5e51323be73ccf8ddcf3f15dd3 Mon Sep 17 00:00:00 2001
From: Stanislaw Gruszka <sgruszka@redhat.com>
Date: Wed, 8 Feb 2012 20:02:03 +0100
Subject: bsg: fix sysfs link remove warning

We create "bsg" link if q->kobj.sd is not NULL, so remove it only
when the same condition is true.

Fixes:

WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
sysfs: can not remove 'bsg', no directory
Call Trace:
  [<c0429683>] warn_slowpath_common+0x6a/0x7f
  [<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
  [<c042970b>] warn_slowpath_fmt+0x2b/0x2f
  [<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
  [<c053969a>] sysfs_remove_link+0x20/0x23
  [<c05d88f1>] bsg_unregister_queue+0x40/0x6d
  [<c0692263>] __scsi_remove_device+0x31/0x9d
  [<c069149f>] scsi_forget_host+0x41/0x52
  [<c0689fa9>] scsi_remove_host+0x71/0xe0
  [<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
  [<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
  [<c06c29de>] usb_unbind_interface+0x4e/0x109
  [<c067a80f>] __device_release_driver+0x6b/0xa6
  [<c067a861>] device_release_driver+0x17/0x22
  [<c067a46a>] bus_remove_device+0xd6/0xe6
  [<c06785e2>] device_del+0xf2/0x137
  [<c06c101f>] usb_disable_device+0x94/0x1a0

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bsg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/bsg.c b/block/bsg.c
index 4cf703fd98b..ff64ae3bace 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -983,7 +983,8 @@ void bsg_unregister_queue(struct request_queue *q)
 
 	mutex_lock(&bsg_mutex);
 	idr_remove(&bsg_minor_idr, bcd->minor);
-	sysfs_remove_link(&q->kobj, "bsg");
+	if (q->kobj.sd)
+		sysfs_remove_link(&q->kobj, "bsg");
 	device_unregister(bcd->class_dev);
 	bcd->class_dev = NULL;
 	kref_put(&bcd->ref, bsg_kref_release_function);
-- 
cgit v1.2.3-70-g09d2


From d8c66c5d59247e25a69428aced0b79d33b9c66d6 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Sat, 11 Feb 2012 12:37:25 +0100
Subject: block: fix lockdep warning on io_context release put_io_context()

11a3122f6c "block: strip out locking optimization in put_io_context()"
removed ioc_lock depth lockdep annoation along with locking
optimization; however, while recursing from put_io_context() is no
longer possible, ioc_release_fn() may still end up putting the last
reference of another ioc through elevator, which wlil grab ioc->lock
triggering spurious (as the ioc is always different one) A-A deadlock
warning.

As this can only happen one time from ioc_release_fn(), using non-zero
subclass from ioc_release_fn() is enough.  Use subclass 1.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-ioc.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

(limited to 'block')

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 9884fd7427f..8b782a63c29 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -80,8 +80,15 @@ static void ioc_release_fn(struct work_struct *work)
 	struct io_context *ioc = container_of(work, struct io_context,
 					      release_work);
 	struct request_queue *last_q = NULL;
+	unsigned long flags;
 
-	spin_lock_irq(&ioc->lock);
+	/*
+	 * Exiting icq may call into put_io_context() through elevator
+	 * which will trigger lockdep warning.  The ioc's are guaranteed to
+	 * be different, use a different locking subclass here.  Use
+	 * irqsave variant as there's no spin_lock_irq_nested().
+	 */
+	spin_lock_irqsave_nested(&ioc->lock, flags, 1);
 
 	while (!hlist_empty(&ioc->icq_list)) {
 		struct io_cq *icq = hlist_entry(ioc->icq_list.first,
@@ -103,15 +110,15 @@ static void ioc_release_fn(struct work_struct *work)
 			 */
 			if (last_q) {
 				spin_unlock(last_q->queue_lock);
-				spin_unlock_irq(&ioc->lock);
+				spin_unlock_irqrestore(&ioc->lock, flags);
 				blk_put_queue(last_q);
 			} else {
-				spin_unlock_irq(&ioc->lock);
+				spin_unlock_irqrestore(&ioc->lock, flags);
 			}
 
 			last_q = this_q;
-			spin_lock_irq(this_q->queue_lock);
-			spin_lock(&ioc->lock);
+			spin_lock_irqsave(this_q->queue_lock, flags);
+			spin_lock_nested(&ioc->lock, 1);
 			continue;
 		}
 		ioc_exit_icq(icq);
@@ -119,10 +126,10 @@ static void ioc_release_fn(struct work_struct *work)
 
 	if (last_q) {
 		spin_unlock(last_q->queue_lock);
-		spin_unlock_irq(&ioc->lock);
+		spin_unlock_irqrestore(&ioc->lock, flags);
 		blk_put_queue(last_q);
 	} else {
-		spin_unlock_irq(&ioc->lock);
+		spin_unlock_irqrestore(&ioc->lock, flags);
 	}
 
 	kmem_cache_free(iocontext_cachep, ioc);
-- 
cgit v1.2.3-70-g09d2


From 97387e3baaf3c35ad560f8878e943c720a77da1b Mon Sep 17 00:00:00 2001
From: Anton Altaparmakov <anton@tuxera.com>
Date: Fri, 24 Feb 2012 09:37:42 +0000
Subject: LDM: Fix reassembly of extended VBLKs.

From: Ben Hutchings <ben@decadent.org.uk>

Extended VBLKs (those larger than the preset VBLK size) are divided
into fragments, each with its own VBLK header.  Our LDM implementation
generally assumes that each VBLK is contiguous in memory, so these
fragments must be assembled before further processing.

Currently the reassembly seems to be done quite wrongly - no VBLK
header is copied into the contiguous buffer, and the length of the
header is subtracted twice from each fragment.  Also the total
length of the reassembled VBLK is calculated incorrectly.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
---
 block/partitions/ldm.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

(limited to 'block')

diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c
index bd8ae788f68..e507cfbd044 100644
--- a/block/partitions/ldm.c
+++ b/block/partitions/ldm.c
@@ -2,7 +2,7 @@
  * ldm - Support for Windows Logical Disk Manager (Dynamic Disks)
  *
  * Copyright (C) 2001,2002 Richard Russon <ldm@flatcap.org>
- * Copyright (c) 2001-2007 Anton Altaparmakov
+ * Copyright (c) 2001-2012 Anton Altaparmakov
  * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
  *
  * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
@@ -1341,20 +1341,17 @@ found:
 		ldm_error("REC value (%d) exceeds NUM value (%d)", rec, f->num);
 		return false;
 	}
-
 	if (f->map & (1 << rec)) {
 		ldm_error ("Duplicate VBLK, part %d.", rec);
 		f->map &= 0x7F;			/* Mark the group as broken */
 		return false;
 	}
-
 	f->map |= (1 << rec);
-
+	if (!rec)
+		memcpy(f->data, data, VBLK_SIZE_HEAD);
 	data += VBLK_SIZE_HEAD;
 	size -= VBLK_SIZE_HEAD;
-
-	memcpy (f->data+rec*(size-VBLK_SIZE_HEAD)+VBLK_SIZE_HEAD, data, size);
-
+	memcpy(f->data + VBLK_SIZE_HEAD + rec * size, data, size);
 	return true;
 }
 
-- 
cgit v1.2.3-70-g09d2