From 18404756765c713a0be4eb1082920c04822ce588 Mon Sep 17 00:00:00 2001
From: Max Krasnyansky
Date: Thu, 29 May 2008 11:02:52 -0700
Subject: genirq: Expose default irq affinity mask (take 3)

Current IRQ affinity interface does not provide a way to set affinity
for the IRQs that will be allocated/activated in the future.
This patch creates /proc/irq/default_smp_affinity that lets users set
default affinity mask for the newly allocated IRQs. Changing the default
does not affect affinity masks for the currently active IRQs, they
have to be changed explicitly.

Updated based on Paul J's comments and added some more documentation.

Signed-off-by: Max Krasnyansky
Cc: pj@sgi.com
Cc: a.p.zijlstra@chello.nl
Cc: tglx@linutronix.de
Cc: rdunlap@xenotime.net
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner
---
 Documentation/filesystems/proc.txt | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dbc3c6a3650..7f268f327d7 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
 Of some interest is the introduction of the /proc/irq directory to 2.4.
 It could be used to set IRQ to CPU affinity, this means that you can "hook" an
 IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
-irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask
+irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
+prof_cpu_mask.
 For example
   > ls /proc/irq/
   0  10  12  14  16  18  2  4  6  8  prof_cpu_mask
-  1  11  13  15  17  19  3  5  7  9
+  1  11  13  15  17  19  3  5  7  9  default_smp_affinity
   > ls /proc/irq/0/
   smp_affinity
 
-The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ
-is the same by default:
+smp_affinity is a bitmask, in which you can specify which CPUs can handle the
+IRQ, you can set it by doing:
 
-  > cat /proc/irq/0/smp_affinity
-  ffffffff
+  > echo 1 > /proc/irq/10/smp_affinity
+
+This means that only the first CPU will handle the IRQ, but you can also echo
+5 which means that only the first and third CPU can handle the IRQ.
 
-It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can
-set it by doing:
+The contents of each smp_affinity file is the same by default:
+
+  > cat /proc/irq/0/smp_affinity
+  ffffffff
 
-  > echo 1 > /proc/irq/prof_cpu_mask
+The default_smp_affinity mask applies to all non-active IRQs, which are the
+IRQs which have not yet been allocated/activated, and hence which lack a
+/proc/irq/[0-9]* directory.
 
-This means that only the first CPU will handle the IRQ, but you can also echo 5
-which means that only the first and fourth CPU can handle the IRQ.
+prof_cpu_mask specifies which CPUs are to be profiled by the system wide
+profiler. Default value is ffffffff (all cpus).
 
 The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
 between all the CPUs which are allowed to handle it. As usual the kernel has
--
cgit v1.2.3-70-g09d2


From 9f1585cb03866452e0df61a83c88302181e50054 Mon Sep 17 00:00:00 2001
From: Steven Whitehouse
Date: Thu, 26 Jun 2008 08:25:57 +0100
Subject: [GFS2] Glock documentation

This patch adds a file describing the internals of GFS2's glock
abstraction.
Signed-off-by: Steven Whitehouse
---
 Documentation/filesystems/gfs2-glocks.txt | 114 ++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 Documentation/filesystems/gfs2-glocks.txt

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt
new file mode 100644
index 00000000000..4dae9a3840b
--- /dev/null
+++ b/Documentation/filesystems/gfs2-glocks.txt
@@ -0,0 +1,114 @@
+                   Glock internal locking rules
+                  ------------------------------
+
+This documents the basic principles of the glock state machine
+internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
+has two main (internal) locks:
+
+ 1. A spinlock (gl_spin) which protects the internal state such
+    as gl_state, gl_target and the list of holders (gl_holders)
+ 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
+    threads from making calls to the DLM, etc. at the same time. If a
+    thread takes this lock, it must then call run_queue (usually via the
+    workqueue) when it releases it in order to ensure any pending tasks
+    are completed.
+
+The gl_holders list contains all the queued lock requests (not
+just the holders) associated with the glock. If there are any
+held locks, then they will be contiguous entries at the head
+of the list. Locks are granted in strictly the order that they
+are queued, except for those marked LM_FLAG_PRIORITY which are
+used only during recovery, and even then only for journal locks.
+
+There are three lock states that users of the glock layer can request,
+namely shared (SH), deferred (DF) and exclusive (EX). Those translate
+to the following DLM lock modes:
+
+Glock mode    | DLM lock mode
+------------------------------
+    UN        |    IV/NL  Unlocked (no DLM lock associated with glock) or NL
+    SH        |    PR     (Protected read)
+    DF        |    CW     (Concurrent write)
+    EX        |    EX     (Exclusive)
+
+Thus DF is basically a shared mode which is incompatible with the "normal"
+shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
+operations. The glocks are basically a lock plus some routines which deal
+with cache management. The following rules apply for the cache:
+
+Glock mode   | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
+--------------------------------------------------------------------------
+    UN       |     No     |       No       |     No     |      No
+    SH       |     Yes    |       Yes      |     No     |      No
+    DF       |     No     |       Yes      |     No     |      No
+    EX       |     Yes    |       Yes      |     Yes    |      Yes
+
+These rules are implemented using the various glock operations which
+are defined for each type of glock. Not all types of glocks use
+all the modes. Only inode glocks use the DF mode for example.
+
+Table of glock operations and per type constants:
+
+Field            | Purpose
+----------------------------------------------------------------------------
+go_xmote_th      | Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh      | Called after remote state change (e.g. to refill cache)
+go_inval         | Called if remote state change requires invalidating the cache
+go_demote_ok     | Returns boolean value of whether it's ok to demote a glock
+                 | (e.g. checks timeout, and that there is no cached data)
+go_lock          | Called for the first local holder of a lock
+go_unlock        | Called on the final local unlock of a lock
+go_dump          | Called to print content of object for debugfs file, or on
+                 | error to dump glock to the log.
+go_type          | The type of the glock, LM_TYPE_.....
+go_min_hold_time | The minimum hold time
+
+The minimum hold time for each lock is the time after a remote lock
+grant for which we ignore remote demote requests. This is in order to
+prevent a situation where locks are being bounced around the cluster
+from node to node with none of the nodes making any progress. This
+tends to show up most with shared mmapped files which are being written
+to by multiple nodes. By delaying the demotion in response to a
+remote callback, that gives the userspace program time to make
+some progress before the pages are unmapped.
+
+There is a plan to try and remove the go_lock and go_unlock callbacks
+if possible, in order to try and speed up the fast path through the locking.
+Also, eventually we hope to make the glock "EX" mode locally shared
+such that any local locking will be done with the i_mutex as required
+rather than via the glock.
+
+Locking rules for glock operations:
+
+Operation    |  GLF_LOCK bit lock held |  gl_spin spinlock held
+-----------------------------------------------------------------
+go_xmote_th  |       Yes               |       No
+go_xmote_bh  |       Yes               |       No
+go_inval     |       Yes               |       No
+go_demote_ok |       Sometimes         |       Yes
+go_lock      |       Yes               |       No
+go_unlock    |       Yes               |       No
+go_dump      |       Sometimes         |       Yes
+
+N.B. Operations must not drop either the bit lock or the spinlock
+if it is held on entry. go_dump and go_demote_ok must never block.
+Note that go_dump will only be called if the glock's state
+indicates that it is caching uptodate data.
+
+Glock locking order within GFS2:
+
+ 1. i_mutex (if required)
+ 2. Rename glock (for rename only)
+ 3. Inode glock(s)
+    (Parents before children, inodes at "same level" with same parent in
+     lock number order)
+ 4. Rgrp glock(s) (for (de)allocation operations)
+ 5. Transaction glock (via gfs2_trans_begin) for non-read operations
+ 6. Page lock  (always last, very important!)
+
+There are two glocks per inode. One deals with access to the inode
+itself (locking order as above), and the other, known as the iopen
+glock is used in conjunction with the i_nlink field in the inode to
+determine the lifetime of the inode in question. Locking of inodes
+is on a per-inode basis.
Locking of rgrps is on a per rgrp basis. + -- cgit v1.2.3-70-g09d2 From 93e3270c87549dc531a0b0e5d06362d998d810cb Mon Sep 17 00:00:00 2001 From: "Jose R. Santos" Date: Fri, 11 Jul 2008 19:27:31 -0400 Subject: ext4: Documentation updates. Some of the information in Documentation/filesystems/ext4.txt is out of date and in need of an update. Signed-off-by: Jose R. Santos Signed-off-by: "Theodore Ts'o" --- Documentation/filesystems/ext4.txt | 106 ++++++++++++++++++++++--------------- 1 file changed, 62 insertions(+), 44 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 0c5086db835..7e940c64be4 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -13,72 +13,89 @@ Mailing list: linux-ext4@vger.kernel.org 1. Quick usage instructions: =========================== - - Grab updated e2fsprogs from - ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ - This is a patchset on top of e2fsprogs-1.39, which can be found at + - Compile and install the latest version of e2fsprogs (as of this + writing version 1.41) from: + + http://sourceforge.net/project/showfiles.php?group_id=2406 + + or + ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ - - It's still mke2fs -j /dev/hda1 + or grab the latest git repository from: + + git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git + + - Create a new filesystem using the ext4dev filesystem type: + + # mke2fs -t ext4dev /dev/hda1 + + Or configure an existing ext3 filesystem to support extents and set + the test_fs flag to indicate that it's ok for an in-development + filesystem to touch this filesystem: - - mount /dev/hda1 /wherever -t ext4dev + # tune2fs -O extents -E test_fs /dev/hda1 - - To enable extents, + If the filesystem was created with 128 byte inodes, it can be + converted to use 256 byte for greater efficiency via: - mount /dev/hda1 /wherever -t ext4dev -o extents + # 
tune2fs -I 256 /dev/hda1 - - The filesystem is compatible with the ext3 driver until you add a file - which has extents (ie: `mount -o extents', then create a file). + (Note: we currently do not have tools to convert an ext4dev + filesystem back to ext3; so please do not do try this on production + filesystems.) - NOTE: The "extents" mount flag is temporary. It will soon go away and - extents will be enabled by the "-o extents" flag to mke2fs or tune2fs + - Mounting: + + # mount -t ext4dev /dev/hda1 /wherever - When comparing performance with other filesystems, remember that - ext3/4 by default offers higher data integrity guarantees than most. So - when comparing with a metadata-only journalling filesystem, use `mount -o - data=writeback'. And you might as well use `mount -o nobh' too along - with it. Making the journal larger than the mke2fs default often helps - performance with metadata-intensive workloads. + ext3/4 by default offers higher data integrity guarantees than most. + So when comparing with a metadata-only journalling filesystem, such + as ext3, use `mount -o data=writeback'. And you might as well use + `mount -o nobh' too along with it. Making the journal larger than + the mke2fs default often helps performance with metadata-intensive + workloads. 2. 
Features =========== 2.1 Currently available -* ability to use filesystems > 16TB +* ability to use filesystems > 16TB (e2fsprogs support not available yet) * extent format reduces metadata overhead (RAM, IO for access, transactions) * extent format more robust in face of on-disk corruption due to magics, * internal redunancy in tree - -2.1 Previously available, soon to be enabled by default by "mkefs.ext4": - -* dir_index and resize inode will be on by default -* large inodes will be used by default for fast EAs, nsec timestamps, etc +* improved file allocation (multi-block alloc, delayed alloc) +* fix 32000 subdirectory limit +* nsec timestamps for mtime, atime, ctime, create time +* inode version field on disk (NFSv4, Lustre) +* reduced e2fsck time via uninit_bg feature +* journal checksumming for robustness, performance +* persistent file preallocation (e.g for streaming media, databases) +* ability to pack bitmaps and inode tables into larger virtual groups via the + flex_bg feature +* large file support +* Inode allocation using large virtual block groups via flex_bg 2.2 Candidate features for future inclusion -There are several under discussion, whether they all make it in is -partly a function of how much time everyone has to work on them: +* Online defrag (patches available but not well tested) +* reduced mke2fs time via lazy itable initialization in conjuction with + the uninit_bg feature (capability to do this is available in e2fsprogs + but a kernel thread to do lazy zeroing of unused inode table blocks + after filesystem is first mounted is required for safety) -* improved file allocation (multi-block alloc, delayed alloc; basically done) -* fix 32000 subdirectory limit (patch exists, needs some e2fsck work) -* nsec timestamps for mtime, atime, ctime, create time (patch exists, - needs some e2fsck work) -* inode version field on disk (NFSv4, Lustre; prototype exists) -* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists) -* journal 
checksumming for robustness, performance (prototype exists) -* persistent file preallocation (e.g for streaming media, databases) +There are several others under discussion, whether they all make it in is +partly a function of how much time everyone has to work on them. Features like +metadata checksumming have been discussed and planned for a bit but no patches +exist yet so I'm not sure they're in the near-term roadmap. -Features like metadata checksumming have been discussed and planned for -a bit but no patches exist yet so I'm not sure they're in the near-term -roadmap. +The big performance win will come with mballoc, delalloc and flex_bg +grouping of bitmaps and inode tables. Some test results available here: -The big performance win will come with mballoc and delalloc. CFS has -been using mballoc for a few years already with Lustre, and IBM + Bull -did a lot of benchmarking on it. The reason it isn't in the first set of -patches is partly a manageability issue, and partly because it doesn't -directly affect the on-disk format (outside of much better allocation) -so it isn't critical to get into the first round of changes. I believe -Alex is working on a new set of patches right now. + - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html + - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html 3. Options ========== @@ -224,7 +241,7 @@ stripe=n Number of filesystem blocks that mballoc will try disks * RAID chunk size in file system blocks. 
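For the stripe= option, the rule of thumb that ext4.txt gives for RAID5/6 is the number of data disks multiplied by the RAID chunk size expressed in filesystem blocks. The arithmetic can be sketched as below; this is only an illustration, the helper name ext4_stripe_blocks is hypothetical, and the example values (3 data disks, 64 KiB chunks, 4 KiB blocks) are assumed rather than taken from the patch.

```shell
# Hypothetical helper: suggested stripe= value for a RAID5/6 array.
#   $1 = number of data disks (parity disks excluded)
#   $2 = RAID chunk size in KiB
#   $3 = filesystem block size in KiB
ext4_stripe_blocks() {
    # chunk size in filesystem blocks, times the number of data disks
    echo $(( $1 * ($2 / $3) ))
}

# Assumed example values: 3 data disks, 64 KiB chunks, 4 KiB blocks.
ext4_stripe_blocks 3 64 4
```

With these assumed values the chunk is 16 filesystem blocks, so the helper prints 48, which would then be supplied as stripe=48.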
Data Mode ---------- +========= There are 3 different data modes: * writeback mode @@ -256,7 +273,8 @@ kernel source: programs: http://e2fsprogs.sourceforge.net/ - http://ext2resize.sourceforge.net useful links: http://fedoraproject.org/wiki/ext3-devel http://www.bullopensource.org/ext4/ + http://ext4.wiki.kernel.org/index.php/Main_Page + http://fedoraproject.org/wiki/Features/Ext4 -- cgit v1.2.3-70-g09d2 From 49f1487b2e41bd8439ea39a4f15b4064e823cc54 Mon Sep 17 00:00:00 2001 From: Mingming Cao Date: Fri, 11 Jul 2008 19:27:31 -0400 Subject: ext4: Documention update for new ordered mode and delayed allocation Adding some documentations for delayed allocation and new ordered mode. Signed-off-by: Mingming Cao Signed-off-by: "Theodore Ts'o" --- Documentation/filesystems/ext4.txt | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 7e940c64be4..80e193d82e2 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -66,7 +66,7 @@ Mailing list: linux-ext4@vger.kernel.org * extent format reduces metadata overhead (RAM, IO for access, transactions) * extent format more robust in face of on-disk corruption due to magics, * internal redunancy in tree -* improved file allocation (multi-block alloc, delayed alloc) +* improved file allocation (multi-block alloc) * fix 32000 subdirectory limit * nsec timestamps for mtime, atime, ctime, create time * inode version field on disk (NFSv4, Lustre) @@ -77,6 +77,10 @@ Mailing list: linux-ext4@vger.kernel.org flex_bg feature * large file support * Inode allocation using large virtual block groups via flex_bg +* delayed allocation +* large block (up to pagesize) support +* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force + the ordering) 2.2 Candidate features for future inclusion @@ -239,7 +243,9 @@ stripe=n Number of filesystem 
blocks that mballoc will try to use for allocation size and alignment. For RAID5/6 systems this should be the number of data disks * RAID chunk size in file system blocks. - +delalloc (*) Deferring block allocation until write-out time. +nodelalloc Disable delayed allocation. Blocks are allocation + when data is copied from user to page cache. Data Mode ========= There are 3 different data modes: @@ -253,10 +259,10 @@ typically provide the best ext4 performance. * ordered mode In data=ordered mode, ext4 only officially journals metadata, but it logically -groups metadata and data blocks into a single unit called a transaction. When -it's time to write the new metadata out to disk, the associated data blocks -are written first. In general, this mode performs slightly slower than -writeback but significantly faster than journal mode. +groups metadata information related to data changes with the data blocks into a +single unit called a transaction. When it's time to write the new metadata +out to disk, the associated data blocks are written first. In general, +this mode performs slightly slower than writeback but significantly faster than journal mode. * journal mode data=journal mode provides full data and metadata journaling. All new data is @@ -264,7 +270,8 @@ written to the journal first, and then to its final location. In the event of a crash, the journal can be replayed, bringing both data and metadata into a consistent state. This mode is the slowest except when data needs to be read from and written to disk at the same time where it -outperforms all others modes. +outperforms all others modes. Curently ext4 does not have delayed +allocation support if this data journalling mode is selected. References ========== -- cgit v1.2.3-70-g09d2 From 11c3b79218390a139f2d474ee1e983a672d5839a Mon Sep 17 00:00:00 2001 From: Joel Becker Date: Thu, 12 Jun 2008 14:00:18 -0700 Subject: configfs: Allow ->make_item() and ->make_group() to return detailed errors. 
The configfs operations ->make_item() and ->make_group() currently return a new item/group. A return of NULL signifies an error. Because of this, -ENOMEM is the only return code bubbled up the stack. Multiple folks have requested the ability to return specific error codes when these operations fail. This patch adds that ability by changing the ->make_item/group() ops to return an int. Also updated are the in-kernel users of configfs. Signed-off-by: Joel Becker --- Documentation/filesystems/configfs/configfs.txt | 10 +++-- .../filesystems/configfs/configfs_example.c | 14 ++++--- drivers/net/netconsole.c | 10 +++-- fs/configfs/dir.c | 13 +++---- fs/dlm/config.c | 45 ++++++++++++++-------- fs/ocfs2/cluster/heartbeat.c | 17 ++++---- fs/ocfs2/cluster/nodemanager.c | 45 ++++++++++++++-------- include/linux/configfs.h | 4 +- 8 files changed, 94 insertions(+), 64 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt index 44c97e6accb..15838d706ea 100644 --- a/Documentation/filesystems/configfs/configfs.txt +++ b/Documentation/filesystems/configfs/configfs.txt @@ -233,10 +233,12 @@ accomplished via the group operations specified on the group's config_item_type. 
struct configfs_group_operations { - struct config_item *(*make_item)(struct config_group *group, - const char *name); - struct config_group *(*make_group)(struct config_group *group, - const char *name); + int (*make_item)(struct config_group *group, + const char *name, + struct config_item **new_item); + int (*make_group)(struct config_group *group, + const char *name, + struct config_group **new_group); int (*commit_item)(struct config_item *item); void (*disconnect_notify)(struct config_group *group, struct config_item *item); diff --git a/Documentation/filesystems/configfs/configfs_example.c b/Documentation/filesystems/configfs/configfs_example.c index 25151fd5c2c..0b422acd470 100644 --- a/Documentation/filesystems/configfs/configfs_example.c +++ b/Documentation/filesystems/configfs/configfs_example.c @@ -273,13 +273,13 @@ static inline struct simple_children *to_simple_children(struct config_item *ite return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; } -static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +static int simple_children_make_item(struct config_group *group, const char *name, struct config_item **new_item) { struct simple_child *simple_child; simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); if (!simple_child) - return NULL; + return -ENOMEM; config_item_init_type_name(&simple_child->item, name, @@ -287,7 +287,8 @@ static struct config_item *simple_children_make_item(struct config_group *group, simple_child->storeme = 0; - return &simple_child->item; + *new_item = &simple_child->item; + return 0; } static struct configfs_attribute simple_children_attr_description = { @@ -359,20 +360,21 @@ static struct configfs_subsystem simple_children_subsys = { * children of its own. 
*/ -static struct config_group *group_children_make_group(struct config_group *group, const char *name) +static int group_children_make_group(struct config_group *group, const char *name, struct config_group **new_group) { struct simple_children *simple_children; simple_children = kzalloc(sizeof(struct simple_children), GFP_KERNEL); if (!simple_children) - return NULL; + return -ENOMEM; config_group_init_type_name(&simple_children->group, name, &simple_children_type); - return &simple_children->group; + *new_group = &simple_children->group; + return 0; } static struct configfs_attribute group_children_attr_description = { diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c index 665341e4305..387a1339501 100644 --- a/drivers/net/netconsole.c +++ b/drivers/net/netconsole.c @@ -585,8 +585,9 @@ static struct config_item_type netconsole_target_type = { * Group operations and type for netconsole_subsys. */ -static struct config_item *make_netconsole_target(struct config_group *group, - const char *name) +static int make_netconsole_target(struct config_group *group, + const char *name, + struct config_item **new_item) { unsigned long flags; struct netconsole_target *nt; @@ -598,7 +599,7 @@ static struct config_item *make_netconsole_target(struct config_group *group, nt = kzalloc(sizeof(*nt), GFP_KERNEL); if (!nt) { printk(KERN_ERR "netconsole: failed to allocate memory\n"); - return NULL; + return -ENOMEM; } nt->np.name = "netconsole"; @@ -615,7 +616,8 @@ static struct config_item *make_netconsole_target(struct config_group *group, list_add(&nt->list, &target_list); spin_unlock_irqrestore(&target_list_lock, flags); - return &nt->item; + *new_item = &nt->item; + return 0; } static void drop_netconsole_target(struct config_group *group, diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c index 614e382a604..0e64312a084 100644 --- a/fs/configfs/dir.c +++ b/fs/configfs/dir.c @@ -1073,25 +1073,24 @@ static int configfs_mkdir(struct inode *dir, struct dentry *dentry, 
int mode) group = NULL; item = NULL; if (type->ct_group_ops->make_group) { - group = type->ct_group_ops->make_group(to_config_group(parent_item), name); - if (group) { + ret = type->ct_group_ops->make_group(to_config_group(parent_item), name, &group); + if (!ret) { link_group(to_config_group(parent_item), group); item = &group->cg_item; } } else { - item = type->ct_group_ops->make_item(to_config_group(parent_item), name); - if (item) + ret = type->ct_group_ops->make_item(to_config_group(parent_item), name, &item); + if (!ret) link_obj(parent_item, item); } mutex_unlock(&subsys->su_mutex); kfree(name); - if (!item) { + if (ret) { /* - * If item == NULL, then link_obj() was never called. + * If ret != 0, then link_obj() was never called. * There are no extra references to clean up. */ - ret = -ENOMEM; goto out_put; } diff --git a/fs/dlm/config.c b/fs/dlm/config.c index eac23bd288b..492d8caaaf2 100644 --- a/fs/dlm/config.c +++ b/fs/dlm/config.c @@ -41,16 +41,20 @@ struct comm; struct nodes; struct node; -static struct config_group *make_cluster(struct config_group *, const char *); +static int make_cluster(struct config_group *, const char *, + struct config_group **); static void drop_cluster(struct config_group *, struct config_item *); static void release_cluster(struct config_item *); -static struct config_group *make_space(struct config_group *, const char *); +static int make_space(struct config_group *, const char *, + struct config_group **); static void drop_space(struct config_group *, struct config_item *); static void release_space(struct config_item *); -static struct config_item *make_comm(struct config_group *, const char *); +static int make_comm(struct config_group *, const char *, + struct config_item **); static void drop_comm(struct config_group *, struct config_item *); static void release_comm(struct config_item *); -static struct config_item *make_node(struct config_group *, const char *); +static int make_node(struct config_group *, const char 
*, + struct config_item **); static void drop_node(struct config_group *, struct config_item *); static void release_node(struct config_item *); @@ -392,8 +396,8 @@ static struct node *to_node(struct config_item *i) return i ? container_of(i, struct node, item) : NULL; } -static struct config_group *make_cluster(struct config_group *g, - const char *name) +static int make_cluster(struct config_group *g, const char *name, + struct config_group **new_g) { struct cluster *cl = NULL; struct spaces *sps = NULL; @@ -431,14 +435,15 @@ static struct config_group *make_cluster(struct config_group *g, space_list = &sps->ss_group; comm_list = &cms->cs_group; - return &cl->group; + *new_g = &cl->group; + return 0; fail: kfree(cl); kfree(gps); kfree(sps); kfree(cms); - return NULL; + return -ENOMEM; } static void drop_cluster(struct config_group *g, struct config_item *i) @@ -466,7 +471,8 @@ static void release_cluster(struct config_item *i) kfree(cl); } -static struct config_group *make_space(struct config_group *g, const char *name) +static int make_space(struct config_group *g, const char *name, + struct config_group **new_g) { struct space *sp = NULL; struct nodes *nds = NULL; @@ -489,13 +495,14 @@ static struct config_group *make_space(struct config_group *g, const char *name) INIT_LIST_HEAD(&sp->members); mutex_init(&sp->members_lock); sp->members_count = 0; - return &sp->group; + *new_g = &sp->group; + return 0; fail: kfree(sp); kfree(gps); kfree(nds); - return NULL; + return -ENOMEM; } static void drop_space(struct config_group *g, struct config_item *i) @@ -522,19 +529,21 @@ static void release_space(struct config_item *i) kfree(sp); } -static struct config_item *make_comm(struct config_group *g, const char *name) +static int make_comm(struct config_group *g, const char *name, + struct config_item **new_i) { struct comm *cm; cm = kzalloc(sizeof(struct comm), GFP_KERNEL); if (!cm) - return NULL; + return -ENOMEM; config_item_init_type_name(&cm->item, name, &comm_type); 
 	cm->nodeid = -1;
 	cm->local = 0;
 	cm->addr_count = 0;
-	return &cm->item;
+	*new_i = &cm->item;
+	return 0;
 }
 
 static void drop_comm(struct config_group *g, struct config_item *i)
@@ -554,14 +563,15 @@ static void release_comm(struct config_item *i)
 	kfree(cm);
 }
 
-static struct config_item *make_node(struct config_group *g, const char *name)
+static int make_node(struct config_group *g, const char *name,
+		     struct config_item **new_i)
 {
 	struct space *sp = to_space(g->cg_item.ci_parent);
 	struct node *nd;
 
 	nd = kzalloc(sizeof(struct node), GFP_KERNEL);
 	if (!nd)
-		return NULL;
+		return -ENOMEM;
 
 	config_item_init_type_name(&nd->item, name, &node_type);
 	nd->nodeid = -1;
@@ -573,7 +583,8 @@ static struct config_item *make_node(struct config_group *g, const char *name)
 	sp->members_count++;
 	mutex_unlock(&sp->members_lock);
 
-	return &nd->item;
+	*new_i = &nd->item;
+	return 0;
 }
 
 static void drop_node(struct config_group *g, struct config_item *i)
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index f02ccb34604..443d108211a 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1489,25 +1489,28 @@ static struct o2hb_heartbeat_group *to_o2hb_heartbeat_group(struct config_group
 		: NULL;
 }
 
-static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *group,
-							  const char *name)
+static int o2hb_heartbeat_group_make_item(struct config_group *group,
+					  const char *name,
+					  struct config_item **new_item)
 {
 	struct o2hb_region *reg = NULL;
-	struct config_item *ret = NULL;
+	int ret = 0;
 
 	reg = kzalloc(sizeof(struct o2hb_region), GFP_KERNEL);
-	if (reg == NULL)
-		goto out; /* ENOMEM */
+	if (reg == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
 
 	config_item_init_type_name(&reg->hr_item, name, &o2hb_region_type);
 
-	ret = &reg->hr_item;
+	*new_item = &reg->hr_item;
 
 	spin_lock(&o2hb_live_lock);
 	list_add_tail(&reg->hr_all_item, &o2hb_all_regions);
 	spin_unlock(&o2hb_live_lock);
 out:
-	if (ret == NULL)
+	if (ret)
 		kfree(reg);
 
 	return ret;
diff --git a/fs/ocfs2/cluster/nodemanager.c b/fs/ocfs2/cluster/nodemanager.c
index cfdb08b484e..b364b7052e4 100644
--- a/fs/ocfs2/cluster/nodemanager.c
+++ b/fs/ocfs2/cluster/nodemanager.c
@@ -644,27 +644,32 @@ out:
 	return ret;
 }
 
-static struct config_item *o2nm_node_group_make_item(struct config_group *group,
-						     const char *name)
+static int o2nm_node_group_make_item(struct config_group *group,
+				     const char *name,
+				     struct config_item **new_item)
 {
 	struct o2nm_node *node = NULL;
-	struct config_item *ret = NULL;
+	int ret = 0;
 
-	if (strlen(name) > O2NM_MAX_NAME_LEN)
-		goto out; /* ENAMETOOLONG */
+	if (strlen(name) > O2NM_MAX_NAME_LEN) {
+		ret = -ENAMETOOLONG;
+		goto out;
+	}
 
 	node = kzalloc(sizeof(struct o2nm_node), GFP_KERNEL);
-	if (node == NULL)
-		goto out; /* ENOMEM */
+	if (node == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
 
 	strcpy(node->nd_name, name); /* use item.ci_namebuf instead? */
 	config_item_init_type_name(&node->nd_item, name, &o2nm_node_type);
 	spin_lock_init(&node->nd_lock);
 
-	ret = &node->nd_item;
+	*new_item = &node->nd_item;
 
 out:
-	if (ret == NULL)
+	if (ret)
 		kfree(node);
 
 	return ret;
@@ -751,25 +756,31 @@ static struct o2nm_cluster_group *to_o2nm_cluster_group(struct config_group *gro
 }
 #endif
 
-static struct config_group *o2nm_cluster_group_make_group(struct config_group *group,
-							  const char *name)
+static int o2nm_cluster_group_make_group(struct config_group *group,
+					 const char *name,
+					 struct config_group **new_group)
 {
 	struct o2nm_cluster *cluster = NULL;
 	struct o2nm_node_group *ns = NULL;
-	struct config_group *o2hb_group = NULL, *ret = NULL;
+	struct config_group *o2hb_group = NULL;
 	void *defs = NULL;
+	int ret = 0;
 
 	/* this runs under the parent dir's i_mutex; there can be only
 	 * one caller in here at a time */
-	if (o2nm_single_cluster)
-		goto out; /* ENOSPC */
+	if (o2nm_single_cluster) {
+		ret = -ENOSPC;
+		goto out;
+	}
 
 	cluster = kzalloc(sizeof(struct o2nm_cluster), GFP_KERNEL);
 	ns = kzalloc(sizeof(struct o2nm_node_group), GFP_KERNEL);
 	defs = kcalloc(3, sizeof(struct config_group *), GFP_KERNEL);
 	o2hb_group = o2hb_alloc_hb_set();
-	if (cluster == NULL || ns == NULL || o2hb_group == NULL || defs == NULL)
+	if (cluster == NULL || ns == NULL || o2hb_group == NULL || defs == NULL) {
+		ret = -ENOMEM;
 		goto out;
+	}
 
 	config_group_init_type_name(&cluster->cl_group, name,
 				    &o2nm_cluster_type);
@@ -786,11 +797,11 @@ static struct config_group *o2nm_cluster_group_make_group(struct config_group *g
 	cluster->cl_idle_timeout_ms = O2NET_IDLE_TIMEOUT_MS_DEFAULT;
 	cluster->cl_keepalive_delay_ms = O2NET_KEEPALIVE_DELAY_MS_DEFAULT;
 
-	ret = &cluster->cl_group;
+	*new_group = &cluster->cl_group;
 	o2nm_single_cluster = cluster;
 
 out:
-	if (ret == NULL) {
+	if (ret) {
 		kfree(cluster);
 		kfree(ns);
 		o2hb_free_hb_set(o2hb_group);
diff --git a/include/linux/configfs.h b/include/linux/configfs.h
index 3ae65b1bf90..0488f937634 100644
--- a/include/linux/configfs.h
+++ b/include/linux/configfs.h
@@ -165,8 +165,8 @@ struct configfs_item_operations {
 };
 
 struct configfs_group_operations {
-	struct config_item *(*make_item)(struct config_group *group, const char *name);
-	struct config_group *(*make_group)(struct config_group *group, const char *name);
+	int (*make_item)(struct config_group *group, const char *name, struct config_item **new_item);
+	int (*make_group)(struct config_group *group, const char *name, struct config_group **new_group);
 	int (*commit_item)(struct config_item *item);
 	void (*disconnect_notify)(struct config_group *group, struct config_item *item);
 	void (*drop_item)(struct config_group *group, struct config_item *item);
--
cgit v1.2.3-70-g09d2

From e56a99d5a42dcb91e622ae7a0289d8fb2ddabffb Mon Sep 17 00:00:00 2001
From: Artem Bityutskiy
Date: Mon, 14 Jul 2008 19:08:34 +0300
Subject: UBIFS: add brief documentation

Signed-off-by: Artem Bityutskiy
Signed-off-by: Adrian Hunter
---
 Documentation/filesystems/ubifs.txt | 164 ++++++++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)
 create mode 100644 Documentation/filesystems/ubifs.txt

(limited to 'Documentation/filesystems')

diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt
new file mode 100644
index 00000000000..540e9e7f59c
--- /dev/null
+++ b/Documentation/filesystems/ubifs.txt
@@ -0,0 +1,164 @@
+Introduction
+=============
+
+UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
+Block Images". UBIFS is a flash file system, which means it is designed
+to work with flash devices. It is important to understand, that UBIFS
+is completely different to any traditional file-system in Linux, like
+Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems
+which work with MTD devices, not block devices. The other Linux
+file-system of this class is JFFS2.
+
+To make it more clear, here is a small comparison of MTD devices and
+block devices.
+
+1 MTD devices represent flash devices and they consist of eraseblocks of
+  rather large size, typically about 128KiB. Block devices consist of
+  small blocks, typically 512 bytes.
+2 MTD devices support 3 main operations - read from some offset within an
+  eraseblock, write to some offset within an eraseblock, and erase a whole
+  eraseblock. Block devices support 2 main operations - read a whole
+  block and write a whole block.
+3 The whole eraseblock has to be erased before it becomes possible to
+  re-write its contents. Blocks may be just re-written.
+4 Eraseblocks become worn out after some number of erase cycles -
+  typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC
+  NAND flashes. Blocks do not have the wear-out property.
+5 Eraseblocks may become bad (only on NAND flashes) and software should
+  deal with this. Blocks on hard drives typically do not become bad,
+  because hardware has mechanisms to substitute bad blocks, at least in
+  modern LBA disks.
+
+It should be quite obvious why UBIFS is very different to traditional
+file-systems.
+
+UBIFS works on top of UBI. UBI is a separate software layer which may be
+found in drivers/mtd/ubi. UBI is basically a volume management and
+wear-leveling layer. It provides so called UBI volumes which is a higher
+level abstraction than a MTD device. The programming model of UBI devices
+is very similar to MTD devices - they still consist of large eraseblocks,
+they have read/write/erase operations, but UBI devices are devoid of
+limitations like wear and bad blocks (items 4 and 5 in the above list).
+
+In a sense, UBIFS is a next generation of JFFS2 file-system, but it is
+very different and incompatible to JFFS2. The following are the main
+differences.
+
+* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on
+  top of UBI volumes.
+* JFFS2 does not have on-media index and has to build it while mounting,
+  which requires full media scan. UBIFS maintains the FS indexing
+  information on the flash media and does not require full media scan,
+  so it mounts many times faster than JFFS2.
+* JFFS2 is a write-through file-system, while UBIFS supports write-back,
+  which makes UBIFS much faster on writes.
+
+Similarly to JFFS2, UBIFS supports on-the-fly compression which makes
+it possible to fit quite a lot of data to the flash.
+
+Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts.
+It does not need stuff like fsck.ext2. UBIFS automatically replays its
+journal and recovers from crashes, ensuring that the on-flash data
+structures are consistent.
+
+UBIFS scales logarithmically (most of the data structures it uses are
+trees), so the mount time and memory consumption do not linearly depend
+on the flash size, like in case of JFFS2. This is because UBIFS
+maintains the FS index on the flash media. However, UBIFS depends on
+UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly.
+Nevertheless, UBI/UBIFS scales considerably better than JFFS2.
+
+The authors of UBIFS believe, that it is possible to develop UBI2 which
+would scale logarithmically as well. UBI2 would support the same API as UBI,
+but it would be binary incompatible to UBI. So UBIFS would not need to be
+changed to use UBI2.
+
+
+Mount options
+=============
+
+(*) == default.
+
+norm_unmount (*)        commit on unmount; the journal is committed
+                        when the file-system is unmounted so that the
+                        next mount does not have to replay the journal
+                        and it becomes very fast;
+fast_unmount            do not commit on unmount; this option makes
+                        unmount faster, but the next mount slower
+                        because of the need to replay the journal.
+
+
+Quick usage instructions
+========================
+
+The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
+where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
+UBI volume name.
+
+Mount volume 0 on UBI device 0 to /mnt/ubifs:
+$ mount -t ubifs ubi0_0 /mnt/ubifs
+
+Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
+name):
+$ mount -t ubifs ubi0:rootfs /mnt/ubifs
+
+The following is an example of the kernel boot arguments to attach mtd0
+to UBI and mount volume "rootfs":
+ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs
+
+
+Module Parameters for Debugging
+===============================
+
+When UBIFS has been compiled with debugging enabled, there are 3 module
+parameters that are available to control aspects of testing and debugging.
+The parameters are unsigned integers where each bit controls an option.
+The parameters are:
+
+debug_msgs      Selects which debug messages to display, as follows:
+
+                Message Type                            Flag value
+
+                General messages                        1
+                Journal messages                        2
+                Mount messages                          4
+                Commit messages                         8
+                LEB search messages                     16
+                Budgeting messages                      32
+                Garbage collection messages             64
+                Tree Node Cache (TNC) messages          128
+                LEB properties (lprops) messages        256
+                Input/output messages                   512
+                Log messages                            1024
+                Scan messages                           2048
+                Recovery messages                       4096
+
+debug_chks      Selects extra checks that UBIFS can do while running:
+
+                Check                                   Flag value
+
+                General checks                          1
+                Check Tree Node Cache (TNC)             2
+                Check indexing tree size                4
+                Check orphan area                       8
+                Check old indexing tree                 16
+                Check LEB properties (lprops)           32
+                Check leaf nodes and inodes             64
+
+debug_tsts      Selects a mode of testing, as follows:
+
+                Test mode                               Flag value
+
+                Force in-the-gaps method                2
+                Failure mode for recovery testing       4
+
+For example, set debug_msgs to 5 to display General messages and Mount
+messages.
+
+
+References
+==========
+
+UBIFS documentation and FAQ/HOWTO at the MTD web site:
+http://www.linux-mtd.infradead.org/doc/ubifs.html
+http://www.linux-mtd.infradead.org/faq/ubifs.html
--
cgit v1.2.3-70-g09d2
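
Note on the debug_msgs flag values in the UBIFS documentation patch above: because each bit controls one option, a parameter value is the bitwise OR of the selected flag values, which is why 5 selects General (1) and Mount (4) messages. The following is a minimal sketch in C of that arithmetic only. The identifiers (dbg_msg_flag, dbg_msgs_value, dbg_msg_enabled) are hypothetical names made up for illustration and are not taken from the UBIFS sources; only the numeric values come from the table.

```c
#include <stddef.h>

/*
 * Flag values copied from the debug_msgs table in ubifs.txt.
 * The identifiers are hypothetical, invented for this sketch.
 */
enum dbg_msg_flag {
	DBG_MSG_GENERAL  = 1,
	DBG_MSG_JOURNAL  = 2,
	DBG_MSG_MOUNT    = 4,
	DBG_MSG_COMMIT   = 8,
	DBG_MSG_FIND     = 16,	/* LEB search messages */
	DBG_MSG_BUDGET   = 32,
	DBG_MSG_GC       = 64,
	DBG_MSG_TNC      = 128,
	DBG_MSG_LPROPS   = 256,
	DBG_MSG_IO       = 512,
	DBG_MSG_LOG      = 1024,
	DBG_MSG_SCAN     = 2048,
	DBG_MSG_RECOVERY = 4096
};

/* OR the selected flags together to obtain the parameter value. */
unsigned int dbg_msgs_value(const enum dbg_msg_flag *flags, size_t n)
{
	unsigned int value = 0;
	size_t i;

	for (i = 0; i < n; i++)
		value |= (unsigned int)flags[i];
	return value;
}

/* Return nonzero if the given message type is selected in value. */
int dbg_msg_enabled(unsigned int value, enum dbg_msg_flag flag)
{
	return (value & (unsigned int)flag) != 0;
}
```

Passing DBG_MSG_GENERAL and DBG_MSG_MOUNT to dbg_msgs_value() gives 5, matching the example in the documentation. The same OR-ing applies to the debug_chks and debug_tsts tables; how the value is handed to the module is outside the scope of this sketch.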