UBIFS - UBI File-System
Table of contents
- Big red note
- Overview
- Source code
- Mailing list
- User-space tools
- Scalability
- Write-back support
- Compression
- Checksumming
- Read-ahead
- Space for superuser
- Extended attributes
- Mount options
- Flash space accounting issues
- Documentation
- How to send an UBIFS bugreport?
- Raw flash vs. FTL devices
Big red note
One thing people have to understand when dealing with UBIFS is that UBIFS is very different to any traditional file system - it does not work on top of block devices (like hard drives, MMC/SD cards, USB flash drives, SSDs, etc). UBIFS was designed to work on top of raw flash, which has nothing to do with block devices. This is why UBIFS does not work on MMC cards or USB flash drives - they look like block devices to the outside world because they implement FTL (Flash Translation Layer) support in hardware, which simply speaking emulates a block device on top of the built-in flash chip. Please, make sure you understand the difference between raw flash and, say, MMC flash before reading about UBIFS. This section should help.
Overview
UBIFS is a new flash file system developed by Nokia engineers with the help of the University of Szeged. In a way, UBIFS may be considered the next generation of the JFFS2 file system.
The JFFS2 file system works on top of MTD devices, but UBIFS works on top of UBI volumes and cannot operate on top of MTD devices. In other words, there are 3 subsystems involved:
- MTD subsystem, which provides a uniform interface to access flash chips. MTD provides a notion of MTD devices (e.g., /dev/mtd0) which basically represent raw flash;
- UBI subsystem, which is a wear-leveling and volume management system for flash devices; UBI works on top of MTD devices and provides a notion of UBI volumes; UBI volumes are higher level entities than MTD devices and they are devoid of many unpleasant issues MTD devices have (e.g., wearing and bad blocks); see here for more information;
- UBIFS file system, which works on top of UBI volumes.
Here is a list of some of UBIFS features:
- scalability - UBIFS scales well with respect to flash size; namely, mount time, memory consumption and I/O speed do not depend on flash size (this is not 100% true for memory consumption, but the dependency is very weak); UBIFS (not UBI!) should work fine for flashes of hundreds of GiB; however, UBIFS depends on UBI which has scalability limitations (see here); nonetheless, the UBI/UBIFS stack scales much better than JFFS2, and if UBI becomes a bottleneck, it is always possible to implement UBI2 without changing UBIFS;
- fast mount - unlike JFFS2, UBIFS does not have to scan the whole media when mounting; it takes milliseconds for UBIFS to mount the media, and this does not depend on flash size; however, UBI initialization time depends on flash size and has to be taken into account (see here for more details);
- write-back support - this dramatically improves the throughput of the file system compared to JFFS2, which is write-through; see here for more details;
- tolerance to unclean reboots - UBIFS is a journaling file system and it tolerates sudden crashes and unclean reboots; UBIFS just replays the journal and recovers from the unclean reboot; mount time is a little bit slower in this case, because of the need to replay the journal, but UBIFS does not need to scan the whole media, so it still takes fractions of a second to mount UBIFS; note, the authors paid special attention to this aspect of UBIFS, see this FAQ entry;
- fast I/O - even with write-back disabled (e.g., if UBIFS is mounted with the "-o sync" mount option) UBIFS shows good performance which is close to JFFS2 performance; bear in mind, it is extremely difficult to compete with JFFS2 in synchronous I/O, because JFFS2 does not maintain indexing data structures on flash, so it does not have the maintenance overhead, while UBIFS does have it; but UBIFS is still fast because of the way UBIFS commits the journal - it does not move the data physically from one place to another but instead, it just adds corresponding information to the file system index and picks different eraseblocks for the new journal (i.e., UBIFS has a sort of "wandering" journal which constantly changes position); there are other tricks like the multi-headed journal which make UBIFS perform well;
- on-the-fly compression - the data is stored in compressed form on the flash media, which makes it possible to put considerably more data on the flash than if the data was not compressed; this is very similar to what JFFS2 has; UBIFS also allows switching compression on/off on a per-inode basis, which is very flexible; for example, one may switch compression off by default and enable it only for certain files which are supposed to compress well; or one may switch compression on by default but disable it for supposedly incompressible data like multimedia files; at the moment UBIFS supports only zlib and LZO compressors and it is not difficult to add more; see this section for more information;
- recoverability - UBIFS may be fully recovered if the indexing information gets corrupted; each piece of information in UBIFS has a header which describes this piece of information, and it is possible to fully reconstruct the file system index by scanning the flash media; to make it more clear, imagine you wiped out the FAT table on your FAT file system (the FAT table is the index of the FS); for the FAT FS this is fatal; but if you similarly wipe out the UBIFS index, you may still re-construct it, although a special user-space tool would be required to do this;
- integrity - UBIFS (as well as UBI) checksums everything it writes to the flash media to guarantee data integrity, UBIFS does not leave data or meta-data corruptions unnoticed (JFFS2 is doing the same, though); however, you may disable CRC checking for data to improve file-system read speed and lessen CPU usage - read this section.
Source code
UBIFS is in mainline since 17 July 2008 and the first kernel release which
contains UBIFS is 2.6.27. But since UBIFS is a very new file system,
it is recommended to pick the latest updates and fixes from the
linux-next branch of the UBIFS git tree:
git://git.infradead.org/ubifs-2.6.git
Here is the corresponding Git-web view.
The git tree has master and linux-next branches.
The master branch contains the most recent stuff which is often
incomplete, buggy, or not tested very well. This branch may be re-based from
time to time. The linux-next branch contains stable UBIFS updates
and fixes. This branch is included in the
linux-next git tree and goes
to mainline. The linux-next branch is never re-based. Thus,
unless you are an active UBIFS developer, use the linux-next
branch.
There are also 2.6.21, 2.6.22, 2.6.23, 2.6.24, 2.6.25, 2.6.26 and 2.6.27 kernel back-ports, although they are not always up-to-date
and one may want to pick up additional patches from the main development git
tree to utilize the latest UBIFS version. The back-ports may be found at:
git://git.infradead.org/~dedekind/ubifs-v2.6.27.git
git://git.infradead.org/~dedekind/ubifs-v2.6.26.git
git://git.infradead.org/~dedekind/ubifs-v2.6.25.git
git://git.infradead.org/~dedekind/ubifs-v2.6.24.git
git://git.infradead.org/~dedekind/ubifs-v2.6.23.git
git://git.infradead.org/~dedekind/ubifs-v2.6.22.git
git://git.infradead.org/~dedekind/ubifs-v2.6.21.git
The Linux kernel is changing rapidly and it is rather difficult to make back-ports fully match the mainline code, so the back-ports have various limitations. Below are some of them:
- Since it is impossible to register a memory shrinker in kernel versions 2.6.21 and 2.6.22, the UBIFS shrinker does not work there, which means that the UBIFS TNC cache never gets shrunk and the system may run out of memory if the file system is very large.
- Because of VFS interface changes since kernel version 2.6.24 (write_begin() instead of prepare_write()), back-ports 2.6.21, 2.6.22, and 2.6.23 may have slightly worse performance compared to the mainline code.
Thus, only 2.6.25 and higher version back-ports have full
UBIFS functionality and are not much different to the mainline UBIFS. Note, all
the back-port trees also have many MTD patches back-ported, and they have most
of the UBI changes back-ported. Also bear in mind, the back-port trees are not
going to be maintained forever and will be deleted at some point.
Mailing list
You are welcome to send feed-back, bug-reports, patches, etc to the MTD mailing list. Feel free to ask questions. It might make sense to check the UBIFS FAQ as well.
User-space tools
There is only one UBIFS user-space tool at the moment -
mkfs.ubifs, which creates UBIFS images. The tool may be found in
the MTD utils repository (mkfs.ubifs sub-directory):
git://git.infradead.org/mtd-utils.git
The images produced by mkfs.ubifs may be written to
UBI volumes using ubiupdatevol or may be further fed to the ubinize
tool to create an UBI image which may be put to the MTD device.
Scalability
All the data structures UBIFS is using are trees, so it scales logarithmically in terms of flash size. However, UBI scales linearly (see here) which makes overall UBI/UBIFS stack scalability linear. But the UBIFS authors believe it is always possible to create logarithmically scalable UBI2 and improve the situation. Current UBI should be OK for 2-16GiB flashes, depending on the I/O speed and requirements.
Note, although the UBI scalability is linear, it still scales much better than JFFS2, which was originally designed for small ~32MiB NOR flashes. JFFS2 has scalability issues on the "file system level", while the UBI/UBIFS stack has scalability issues only on the lower "raw flash level". The following table describes the issues in more detail.
| Scalability issue | JFFS2 | UBIFS |
| Mount time linearly depends on the flash size | True, the dependency is linear, because JFFS2 has to scan whole flash media when mounting. | UBIFS mount time does not depend on the flash size. But UBI needs to scan the flash media, which is actually quicker than JFFS2 scanning. So overall, UBI/UBIFS has this linear dependency. |
| Memory consumption linearly depends on the flash size | True, the dependency is linear. | UBIFS memory does depend on the flash size in the current implementation, because the LPT shrinker is not implemented. But it is not difficult to implement the LPT shrinker and get rid of the dependency. It is not implemented only because the memory consumption is too small to make the coding work worth it. UBI memory consumption linearly depends on flash size. Thus, overall UBI/UBIFS stack has the linear dependency. |
| Mount time linearly depends on the file system contents | True, the more data is stored on the file system, the longer it takes to mount it, because JFFS2 has to do more scanning work. | False, mount time does not depend on the file system contents. In the worst case (if there was an unclean reboot), UBIFS has to scan and replay the journal, which has a fixed and configurable size. |
| Full file system checking is required after each mount | True. JFFS2 has to check whole file system just after it has been mounted in case of NAND flash. The checking involves reading all nodes for each inode and checking their CRC checksums, which consumes a lot of CPU. For example, this may be seen by running the top utility just after JFFS2 has been mounted. This slows down overall system boot-up time. Fundamentally, this is needed because JFFS2 does not store space accounting information (i.e., free/dirty space) on the flash media but instead, gathers this information by scanning the flash media. | False. UBIFS does not scan/check whole file system because it stores the space accounting information on the flash media in the so-called LPT (Logical eraseblock Properties Tree) tree. |
| Memory consumption linearly depends on file system contents | True. JFFS2 keeps a small data structure in RAM for each node on flash, so the more data is stored on the flash media, the more memory JFFS2 consumes. | False. UBIFS memory consumption does not depend on how much data is stored on the flash media. |
| The first file access time linearly depends on its size | True. JFFS2 has to keep in RAM a so-called "fragment tree" for each inode corresponding to an opened file. The fragment tree is an in-memory RB-tree which is indexed by file offset and refers to the on-flash nodes corresponding to this offset. The fragment tree is not stored on the flash media. Instead, it is built on-the-fly when the file is opened for the first time. To build the fragment tree, JFFS2 has to read each data node corresponding to this inode from the flash. This means that the larger the file, the longer it takes to open it for the first time, and the more memory it takes when it is opened. Depending on the system, JFFS2 becomes nearly unusable starting from a certain file size. | False. UBIFS stores all the indexing information on the media in the indexing B-tree. Whenever a piece of data has to be read from the file system, the B-tree is looked up and the corresponding flash address to read is found. There is a TNC cache which caches the B-tree nodes when the B-tree is looked up, and the cache is shrinkable, which means it might be shrunk when the kernel needs more memory. |
| File-system performance depends on I/O history | True. Since JFFS2 is fully synchronous, it writes data to the flash media as soon as the data arrives. If one changes few bytes in the middle of a file, JFFS2 writes a data node which contains those bytes to the flash. If there are many random small writes all over the place, the file system becomes fragmented. JFFS2 merges small fragments to 4KiB chunks, which involves re-compression and re-writing the data. But this "de-fragmentation" is happening during garbage collection and at random time, because JFFS2 wear-leveling algorithm is based on random eraseblock selection. So if there were a lot of small writes, JFFS2 becomes slower some time later - the performance just goes down out of the blue which makes the system less predictable. | False. UBIFS always writes in 4KiB chunks. This does not hurt the performance much because of the write-back support: the data changes do not go to the flash straight away - they are instead deferred and are done later, when (hopefully) more data is changed at the same data page and usually in background. |
Write-back support
UBIFS supports write-back, which means that file changes do not go to the
flash media straight away, but they are cached and go to the flash later, when
it is absolutely necessary. This helps to greatly reduce the amount of I/O
which results in better performance. Write-back caching is a standard technique
which is used by most file systems like ext3 or XFS.
In contrast, JFFS2 does not have write-back support and all the
JFFS2 file system changes go to the flash synchronously. Well, this is not
completely true: JFFS2 does have a small buffer of a NAND page size (if the
underlying flash is NAND). This buffer contains the last written data and is
flushed once it is full. However, because the amount of cached data is very
small, JFFS2 is very close to a synchronous file system.
Write-back support requires application programmers to take extra care to synchronize important files in time. Otherwise the files may become corrupted or disappear in case of power cuts, which happen very often in many embedded devices. Let's look at what the Linux manual pages say:
$ man 2 write
....
NOTES
A successful return from write() does not make any guarantee that data
has been committed to disk. In fact, on some buggy implementations, it
does not even guarantee that space has successfully been reserved for
the data. The only way to be sure is to call fsync(2) after you are
done writing all your data.
...
This is true for UBIFS (except for the "some buggy implementations" part, because UBIFS does reserve space for cached dirty data). This is also true for JFFS2, as well as for any other Linux file system.
However, some (perhaps not very good) user-space programmers do not take write-back into account. They do not read man pages carefully. And some applications which have been used in embedded systems which run JFFS2 worked fine, because JFFS2 is so close to being synchronous. Of course, the applications are buggy, but they appeared to work well enough with JFFS2. But the bugs show up when UBIFS is used. Please, be careful and check/test your applications with respect to power cut tolerance if you switch from JFFS2 to UBIFS. The following is a list of useful hints and advice.
- If you want to switch into synchronous mode, use the -o sync option when mounting UBIFS; however, the file system performance may drop - be careful;
- Always keep in mind the above statement from the manual pages and run fsync() for all important files you change; of course, there is no need to synchronize "throw-away" temporary files; just think about how important the file data is and decide;
- If you want to be more accurate, you may use fdatasync(), in which case only data changes will be flushed, but not inode meta-data changes (e.g., "mtime" or permissions); this might be more efficient than using fsync() if the synchronization is done often, e.g., in a loop; otherwise just stick with fsync();
- In shell, the sync command may be used, but it synchronizes the whole file system, which might not be optimal; there is a similar libc sync() function;
- As an alternative to fdatasync(), you may use the O_SYNC flag of the open() call; this will make sure all the data (but not meta-data) changes go to the media before the write() operation returns;
- It is possible to make certain inodes synchronous by default by setting the "sync" inode flag; in a shell, the chattr +S command may be used; in C programs, use the FS_IOC_SETFLAGS ioctl command; note, the mkfs.ubifs tool checks for the "sync" flag in the original FS tree, so the synchronous files in the original FS tree will be synchronous in the resulting UBIFS image.
Let us stress that the above items are true for any Linux file system,
including JFFS2.
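The fsync() advice above can be illustrated with a minimal sketch. The helper name write_file_durably() and the file mode are assumptions made for this example and are not part of UBIFS or of any library; only open(), write(), fsync() and close() are real calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path and do not
 * return success until the data has been pushed to the media.
 * Returns 0 on success, -1 on error (errno is preserved).
 */
int write_file_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved_errno;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* write() may be interrupted or write less than requested. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * A successful write() only means the data is in the page cache.
     * fsync() flushes data and inode meta-data; fdatasync() could be
     * used here instead if only the data matters.
     */
    if (fsync(fd) < 0)
        goto fail;

    return close(fd);

fail:
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return -1;
}
```

A caller would use it as write_file_durably("/etc/myapp.conf", data, size) (the path is only an example) and treat a non-zero return as "the file may not be on the flash".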
fsync() may be called for directories - it synchronizes
the directory inode meta-data. The "sync" flag may also be set for
directories to make the directory inode synchronous. But the flag is inherited,
which means all new children of this directory will also have this flag. New
files and sub-directories of this directory will also be synchronous, and their
children, and so forth. This feature is very useful if one needs to create a
whole sub-tree of synchronous files and directories, or to make all new children
of some directory to be synchronous by default (e.g., /etc).
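For C programs, setting the "sync" inode flag might look roughly like the following sketch. It assumes a Linux system where <linux/fs.h> provides FS_IOC_GETFLAGS, FS_IOC_SETFLAGS and FS_SYNC_FL; the function name set_sync_flag() is made up for this example. Whether the ioctl succeeds depends on the file system and on permissions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the "sync" inode flag on a file or directory,
 * similar to what "chattr +S" does from the shell.
 * Returns 0 on success, -1 on error (errno is preserved).
 */
int set_sync_flag(const char *path)
{
    int flags = 0;
    int ret, saved_errno;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* Read the current inode flags so the other flags are kept. */
    ret = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    if (ret == 0) {
        flags |= FS_SYNC_FL;
        ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }

    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret == 0 ? 0 : -1;
}
```

Calling set_sync_flag() on a directory would, according to the inheritance rule described above, make new children of that directory synchronous as well.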
The fdatasync() call for directories is a "no-op" in UBIFS and
all UBIFS operations which change directory entries are synchronous.
However, you should not assume this for portability (e.g., this is not
true for ext2). Similarly, the "dirsync" inode flag has
no effect in UBIFS.
The functions mentioned above work on file-descriptors, not on streams
(FILE *). To synchronize a stream, you should first get its file
descriptor using the fileno() libc function, then flush the
stream using fflush(), and then synchronize the file using
fsync() or fdatasync(). You may use other
synchronization methods, but remember to flush the stream before synchronizing
the file. The fflush() function flushes the libc-level
buffers, while sync(), fsync(), etc flush
kernel-level buffers.
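As a minimal sketch of the stream case, assuming a hypothetical helper named sync_stream(): the libc buffer is flushed first, and only then is the kernel asked to write the file out.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: push everything written to a stdio stream
 * all the way to the media. Returns 0 on success, -1 on error.
 */
int sync_stream(FILE *fp)
{
    int fd;

    /* Step 1: move data from the libc-level buffer to the kernel. */
    if (fflush(fp) != 0)
        return -1;

    /* Step 2: get the file descriptor behind the stream. */
    fd = fileno(fp);
    if (fd < 0)
        return -1;

    /* Step 3: flush the kernel-level buffers for this file. */
    return fsync(fd);
}
```

Calling fsync() without the preceding fflush() would synchronize only what libc has already handed to the kernel, which is why the order matters.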
Updating a file atomically
This sub-section describes a common technique for updating the contents of a file atomically. This technique is applicable to all POSIX-compatible file systems, not only to UBIFS.
To atomically update the contents of file foo, you have to
first make a copy bar of this file, then change file
bar, synchronize file bar, and re-name
bar to foo. Because the re-name operation is atomic
(this is a POSIX requirement) and bar is
synchronized, the whole update operation is atomic as well. Indeed, if a power
cut happens, you will end up with intact foo and half-updated
bar, in which case the whole atomic update operation may be run
again.
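The sequence above can be sketched in C as follows. This is an illustration under simplifying assumptions, not a definitive implementation: the helper name update_file_atomically() is hypothetical, and instead of copying foo and editing the copy, the sketch writes the complete new contents into the temporary file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of "path" with buf/len so
 * that after a power cut either the old or the new contents are seen.
 * "tmp_path" must be on the same file system as "path".
 * Returns 0 on success, -1 on error.
 */
int update_file_atomically(const char *path, const char *tmp_path,
                           const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* Write the new contents to the temporary file ("bar"). */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            goto fail;
        p += n;
        len -= (size_t)n;
    }

    /* Make sure "bar" is on the media before it replaces "foo". */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* rename() is atomic: "foo" is either the old or the new file. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

On file systems where directory operations are not synchronous, one may additionally want to fsync() the containing directory after the rename.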
Compression
UBIFS supports on-the-fly compression, which means it compresses data before writing them to the flash media and decompresses them after reading them, and this is absolutely transparent to users. UBIFS compresses only regular file data. Directories, device nodes and so on are not compressed. Meta-data and the indexing information are not compressed either.
At the moment UBIFS supports LZO and zlib compressors. Zlib
provides better compression ratio, but LZO is faster in both compression and
decompression. LZO is the default compressor for in-kernel UBIFS and for the
mkfs.ubifs utility. And of course you may disable UBIFS
compression altogether using the "-x none"
mkfs.ubifs option.
UBIFS splits all data into 4KiB chunks and compresses each chunk
independently. This is not optimal, because larger chunks of data would
compress better, but this still provides noticeable flash space economy. For
example, real-life root file-system image for an ARM platform becomes ~40%
smaller with LZO compression and ~50% smaller with zlib compression. This
means that you may fit a 300MiB rootfs image into a 256MiB UBI volume and still
have about 100MiB of free space. However, the figures may be different
depending on the contents of the file-system. For example, if your file-system
mostly contains mp3 files, UBIFS will be unable to efficiently
compress them, just because mp3 files are already compressed.
In UBIFS it is possible to enable or disable compression individually for each inode by setting or clearing the compression flag. Note, the compression flag of directories is inherited, which means that when files and sub-directories are created, they inherit the compression flag of the parent directory. Please, refer to this section for instructions on how the compression flag may be toggled.
It is also possible to combine LZO and zlib compressors, see this FAQ section.
It's also worth noting that JFFS2 zlib compression is a little bit different from UBIFS zlib compression. UBIFS uses the crypto-API deflate method, while JFFS2 uses the zlib library directly. As a result, UBIFS and JFFS2 use different zlib compression options. Namely, JFFS2 uses deflate level 3 and window bits 15, while UBIFS uses deflate level 6 and window bits -11 (the minus makes zlib avoid putting a header into the output data stream). Experiments with compressing ARM code showed that the JFFS2 compression ratio is slightly smaller, decompression speed is also slightly slower, but compression speed is faster.
Checksumming
Every piece of information UBIFS writes to the media has a CRC-32 checksum, and UBIFS verifies the checksum for every piece of information it reads from the media. CRC-32 is a quite strong function and any data corruption will most probably be noticed. The same is true for UBI.
CRC-32 loads the CPU and makes the file-system slower - this is the price we
pay for providing very high data integrity level. But UBIFS allows to switch the
data checksumming off using the no_chk_data_crc mount option. If
UBIFS is mounted with this option, it does not check CRC-32 checksum for data,
but it does check it for the internal indexing information. And this option
only affects reading, not writing, because UBIFS always calculates CRC-32 when
writing the data.
Disabling checksum verification for data improves file-system read speed and reduces CPU usage. But of course, it also lowers the file-system integrity level, so you should decide whether you want to use it or not depending on your system requirements. In general, if you use SLC NAND flash or NOR flash, it is probably fine to disable CRC-32. In case of MLC NAND flash, you should probably be more careful. However, see this FAQ section for more information about UBIFS on MLC NAND.
Note, currently UBIFS cannot disable CRC-32 calculations on write, because
the UBIFS recovery process depends on it. When recovering from an unclean reboot
and re-playing the journal, UBIFS has to be able to detect broken and
half-written UBIFS nodes and drop them, and UBIFS depends on the CRC-32
checksum here. So the no_chk_data_crc mount option does not
improve UBIFS write speed. However, UBIFS write speed should not be a problem
for a great deal of standard work-loads because of
write-back support.
In other words, if you use UBIFS with data CRC-32 checking disabled, you still have the CRC-32 checksum attached to each piece of data, and you may mount UBIFS with default options to enable CRC-32 checking at any time (e.g., when you suspect the file-system might be corrupted because you visited the Large Hadron Collider and exposed your flash to proton beams).
Read-ahead
Read-ahead is an optimization technique which makes the file system read a little bit more data than users actually ask. The idea is that files are often read sequentially from the beginning to the end, so the file system tries to make next data available before the user actually asks for them.
Linux VFS is capable of doing read-ahead and this does not require any support from the file system. This probably works well for traditional block-based file systems, however this does not work well for UBIFS. UBIFS works with the UBI API, which works with the MTD API, which is synchronous. The MTD API is pretty trivial and does not have any request queues. This means that VFS blocks UBIFS readers and makes them wait for the read-ahead process. In contrast, the block-device API is asynchronous and readers do not wait for read-ahead.
VFS read-ahead was designed for hard drives, and it was benchmarked with hard drives. But the nature of raw flash devices is very different from the nature of hard drives. Raw flash devices do not have such a huge seek time as hard drives do, so the techniques which work for HDDs do not necessarily work well for flash.
That said, VFS read-ahead only slows UBIFS down instead of improving it,
so UBIFS disables VFS read-ahead. But UBIFS has its own internal read-ahead,
which we call "bulk-read". You may enable bulk-read using the
"bulk_read" UBIFS mount option.
Some flashes may read faster if the data is read at one go, rather than at several read requests. For example, OneNAND can do "read-while-load" if it reads more than one NAND page. So UBIFS may benefit from reading large data chunks at one go, and this is exactly what bulk-read does.
If UBIFS notices that a file is being read sequentially (at least 3 sequential 4KiB blocks have been read), and if UBIFS sees that the further file data resides sequentially in the same eraseblock, it starts reading data ahead using large read requests, which makes it possible to read at higher rates. So UBIFS reads more than it is asked to, and it pushes the read-ahead data to the file caches, so the data become instantly available for further user read requests.
Here is an example. Suppose the user is reading a file sequentially. We are lucky and the file is not fragmented on the media. Suppose LEB 25 contains data nodes belonging to this file, and the data nodes are logically (in terms of logical file offset) and physically (in terms of LEB/offset addresses) sequential. Suppose user requests to read data node at LEB 25 offset 0. In this case UBIFS will actually read whole LEB 25 at one go, then populate the file cache with all the read data. And when the user asks the next piece of data, it will already be in the cache.
Obviously, the bulk-read feature may slow UBIFS down in some work-loads, so you should be careful. It is also worth noting that the bulk-read feature cannot help on highly fragmented file-systems. Although UBIFS does not fragment file-systems (e.g., the Garbage-Collector does not re-order data nodes), it also does not try to de-fragment them. For example, if you write a file sequentially, it won't be fragmented. But if you write more than one file at a time, they may become fragmented (well, this also depends on how write-back flushes the changes), and UBIFS won't automatically de-fragment them. However, it is possible to implement a background de-fragmenter. It is also possible to have a per-inode journal head and avoid mixing data nodes belonging to different inodes in the same LEB. So there is room for improvement.
Space for superuser
UBIFS reserves some space for the superuser (root), which means that when
the file-system is full for normal users, there is still a little space for the
super-user. File-systems like ext2 have a similar feature.
mkfs.ubifs
reserves ~1%, but at maximum ~5MiB of the space by default. The amount of
reserved space is stored in the UBIFS superblock and may be changed arbitrarily.
Currently mkfs.ubifs does not have a command line option to
override the defaults, but it should be trivial to implement.
By default only root may use the reserved space. But it is possible to
extend the list of power users who are able to utilize the reserved space.
UBIFS may record several user and group IDs at the superblock and allow them
to utilize the reserved space as well. But again, current
mkfs.ubifs utility does not have corresponding command line
options, but it should be trivial to implement them. UBIFS authors added the
mechanism, but did not use it so did not implement corresponding
mkfs.ubifs options.
Note, UBIFS prints the amount of reserved space when it mounts the file-system. See UBIFS messages in the system log.
Extended attributes
UBIFS supports extended attributes if the corresponding configuration option
is enabled (no additional mount options are required). It supports the
user, trusted, and security name-spaces.
However, access control lists (ACL) support is not implemented.
Mount options
The following are UBIFS-specific mount options.
- norm_unmount (default) - commit on unmount; the journal is committed when the file system is unmounted so that the next mount does not have to replay the journal and it becomes very fast;
- fast_umount - do not commit on unmount; this option makes unmount faster, but the next mount slower because of the need to replay the uncommitted journal;
- chk_data_crc (default) - check data CRC-32 checksums;
- no_chk_data_crc - do not check data CRC-32 checksums, see this section for more details;
- bulk_read - enable bulk-read, see here;
- no_bulk_read (default) - do not bulk-read.
Example:
$ mount -t ubifs -o fast_umount,no_chk_data_crc ubi0:rootfs /mnt/ubifs
mounts UBIFS file-system to /mnt/ubifs, enables fast
unmount and disables data CRC checking.
Besides, UBIFS supports the standard sync mount option which
may be used to disable UBIFS write-back and write-buffer caching and make it
fully synchronous. Note, UBIFS does not support "atime", so the
atime mount option has no effect.
Flash space accounting issues
Traditional file systems like ext2 can easily calculate the amount
of free space. The calculation is usually quite precise and users are
accustomed to this. However, the situation is very different in UBIFS - it
cannot really report the precise amount of free space, which confuses users.
Instead, it reports the minimum amount of free space, which is usually
less than the real amount. Sometimes the error may be very large. For example,
UBIFS may report (via the statfs() system call) that there is no
free space, but one would still be able to write quite a lot.
To put it differently, UBIFS is often lying about the amount of free space it has. As a rule, it may fit considerably more bytes than it reports. However, it never reports more free space than it has. It reports less, and very rarely it may report the exact amount. And this is not because UBIFS authors are jerks, there are fundamental reasons for this, which are discussed below.
Effect of compression
The first factor is UBIFS on-the-fly compression. Users usually seem to expect that if a file system reports N bytes of free space, then it is possible to create an N-byte file. Because of compression, this is not quite true for UBIFS. Depending on how well the file data compresses, UBIFS may fit several times more than it reports.
When UBIFS calculates free space, it does not a-priori know anything about the data which is going to be written, so it cannot take into account the compression, so it always assumes the worst-case scenario when the data does not compress.
Well, this does not sound as a big issue. However, compression becomes an issue for free space reporting when compression is combined with write-back. Namely, UBIFS cannot know how well the cached dirty data would compress, and the only way to find this out is to actually compress it. See below.
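The worst-case assumption can be illustrated with a small Python sketch. It is not UBIFS code: Python's zlib module stands in for the UBIFS compressors, and the 4KiB chunk size and the helper name compressed_size are assumptions made for the illustration. It shows why the size of the uncompressed data says little about how much flash the data will eventually occupy.

```python
import os
import zlib

CHUNK_SIZE = 4096  # assumed size of one chunk of file data, in bytes


def compressed_size(chunk: bytes) -> int:
    """Bytes needed to store the chunk: compressed size, or the
    original size when compression does not help."""
    packed = zlib.compress(chunk)
    return min(len(packed), len(chunk))


# Highly repetitive data compresses well; random data does not.
text_like = (b"UBIFS compresses file data on the fly. " * 200)[:CHUNK_SIZE]
random_like = os.urandom(CHUNK_SIZE)

for name, chunk in (("text-like", text_like), ("random", random_like)):
    print(f"{name}: {len(chunk)} bytes -> {compressed_size(chunk)} bytes")
```

Before the data is actually compressed, both chunks look the same to the free space calculation: 4096 bytes each.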
Effect of write-back
Suppose there are X bytes of dirty file data in the page cache. They will be flushed to the flash media later, but they are in RAM so far. UBIFS (namely, the budgeting sub-system) has reserved X + O bytes on the flash media for this data, where O is file system overhead (e.g., the data has to be indexed, each data node has a header, etc).
The problem is that UBIFS cannot accurately calculate X and O, and it uses pessimistic worst-case calculations, so that when the cached data are flushed, they may take considerably less flash space than the reserved X + O. For example, this may lead to the following situations.
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
ubi0:ubifs               49568     49568         0 100% /mnt/ubifs
$ sync
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
ubi0:ubifs               49568     39164      7428  85% /mnt/ubifs
The first time, df reported zero free space, but after
the sync it reported 15% free space. This is because
there was a lot of cached dirty data, and UBIFS reserved all flash space
for it. But once the data reached the flash media, it took considerably
less flash space.
Here are the reasons why UBIFS reserves more space than is needed.
- One of the reasons is again related to compression. The data is stored in uncompressed form in the cache, and UBIFS does not know how well it would compress, so it assumes the data wouldn't compress at all. However, real-life data usually compresses quite well (unless it is already compressed, e.g. it belongs to a .tgz or .mp3 file). This leads to major over-estimation of the X component.
- Due to the design, UBIFS nodes never cross logical eraseblock (or LEB, see here) boundaries, so there are small spots of wasted space at the end of eraseblocks. The amount of this wasted flash space depends on the data and on the order in which this data has been written or changed. And traditionally UBIFS pessimistically assumes the maximum possible amount of wasted space, which leads to over-estimation of the O component. See the next sub-section.
Thus, UBIFS reports more accurate free space value if it is synchronized.
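An application that wants a better estimate can synchronize before asking. The following Python sketch reads the value reported through statvfs() before and after sync(). The function names are hypothetical, and the fallback to "/" is only there so the sketch also runs on a machine without a UBIFS mount.

```python
import os


def available_bytes(path: str) -> int:
    """Free space reported to unprivileged users via statvfs()."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def available_before_and_after_sync(path: str):
    """Return the (before, after) free space reports around a sync()."""
    before = available_bytes(path)
    os.sync()  # flush cached dirty data to the media
    after = available_bytes(path)
    return before, after


if __name__ == "__main__":
    # /mnt/ubifs is the mount point used in the examples of this page.
    mount_point = "/mnt/ubifs" if os.path.ismount("/mnt/ubifs") else "/"
    before, after = available_before_and_after_sync(mount_point)
    print(f"{mount_point}: {before} bytes before sync, {after} bytes after")
```

Even the value obtained after sync() is still a lower bound, for the reasons described in the following sub-sections.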
Wastage
As it was mentioned above, UBIFS nodes do not cross LEB boundaries. Consider the following numbers:
- maximum UBIFS node size (non-compressed data node) is 4256 bytes;
- smallest UBIFS node size (a data node with 8 bytes of data, corresponding to a file tail) is 56 bytes;
- depending on name length, directory entry nodes take 56-304 bytes;
- typical LEB size in case of NAND flash with 128KiB physical eraseblocks and 2048 bytes NAND page is 126KiB (or 124KiB if the NAND chip does not support sub-pages, see here).
Thus, if the vast majority of nodes on the flash were non-compressed data nodes, UBIFS would waste 1344 bytes at the ends of 126KiB LEBs. But real-life data is often compressible, so data node sizes vary, and the amount of wasted space at the ends of eraseblocks varies from 0 to 4255.
UBIFS makes some effort to put small nodes like directory entries at the ends of LEBs to lessen the amount of wasted space, but this is not ideal and UBIFS may still waste unnecessarily large chunks of flash space at the ends of eraseblocks.
When reporting free space, UBIFS does not know which kind of data is going to be written to the flash media, and in which sequence. Thus, it assumes the maximum possible wastage of 4255 bytes per LEB. This calculation is too pessimistic for most real-life situations and the average real-life wastage is considerably less than 4255 bytes per LEB. However, UBIFS reports the absolute minimum amount of free space user-space applications may count on.
The above means that the larger the LEB size, the better the UBIFS free space prediction. E.g., UBIFS is better in this respect on NANDs with 128KiB eraseblock size compared to NANDs with 16KiB eraseblock size.
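The numbers in this sub-section can be reproduced with a few lines of Python. This is only arithmetic on the figures quoted above. The function names are hypothetical, and the 15KiB LEB size for a NAND with 16KiB eraseblocks is an assumption made for the comparison.

```python
MAX_NODE_SIZE = 4256  # largest UBIFS node (non-compressed data node), bytes


def tail_waste(leb_size: int, node_size: int = MAX_NODE_SIZE) -> int:
    """Bytes left unused at the end of a LEB filled with equal-sized nodes."""
    return leb_size % node_size


def worst_case_fraction(leb_size: int) -> float:
    """Share of a LEB assumed wasted when 4255 bytes per LEB are written off."""
    return (MAX_NODE_SIZE - 1) / leb_size


print(tail_waste(126 * 1024))  # 1344

# 126KiB LEB (128KiB eraseblock) versus an assumed 15KiB LEB (16KiB eraseblock).
for kib in (126, 15):
    print(f"{kib}KiB LEB: {worst_case_fraction(kib * 1024):.1%} assumed wasted")
```

With these assumptions the pessimistic estimate writes off roughly 3% of a 126KiB LEB, but more than a quarter of a 15KiB LEB.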
Dirty space
Dirty space is the flash space occupied by UBIFS nodes which were invalidated because they were changed or removed. For example, if the contents of a file are re-written, then the corresponding data nodes are invalidated and new data nodes are written to the flash media. The invalidated nodes comprise dirty space. There are other mechanisms by which dirty space appears as well.
UBIFS cannot re-use dirty space straight away, because the corresponding flash areas do not contain all 0xFF bytes. Before dirty space can be re-used, UBIFS has to garbage-collect the corresponding LEBs. The idea of the garbage collector which reclaims dirty space is the same as in JFFS2. Please, refer to the JFFS2 design document for more information.
Roughly, the UBIFS garbage collector picks a victim LEB which has some dirty space and moves valid UBIFS nodes from the victim LEB to the LEB which was reserved for GC. This produces some amount of free space at the end of the reserved LEB. Then GC picks a new victim LEB and moves the data to the reserved LEB. When the reserved LEB is full, UBIFS picks another empty LEB (e.g., the old victim which had been made free a step ago), and continues moving nodes from the victim LEB to the new reserved LEB. The process continues until a full empty LEB is produced.
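The procedure can be modelled with a toy Python sketch. It is not the UBIFS implementation: LEBs are plain lists of valid node sizes, the LEB size is 100 bytes, and picking the victim with the least valid data first is an assumption. The name garbage_collect is hypothetical.

```python
LEB_SIZE = 100  # toy LEB size in bytes; real LEBs are e.g. 126KiB


def garbage_collect(victims, leb_size=LEB_SIZE):
    """Toy GC pass.

    `victims` is a list of LEBs, each a list of valid node sizes; the rest
    of each LEB counts as dirty space. One extra empty LEB is assumed to be
    reserved for GC. Returns (gc_lebs, untouched): the LEBs GC wrote to and
    the LEBs it did not have to process.
    """
    queue = sorted(victims, key=sum)  # least valid data (dirtiest) first
    gc_lebs = [[]]                    # start with the reserved LEB
    empty = 0                         # erased victims not yet re-used
    processed = 0
    for victim in queue:
        for node in victim:
            if sum(gc_lebs[-1]) + node > leb_size:
                # Reserved LEB is full: continue in a previously freed victim.
                gc_lebs.append([])
                empty -= 1
            gc_lebs[-1].append(node)
        empty += 1                    # victim has no valid data left
        processed += 1
        if empty == 2:                # one LEB to reserve again, one gained
            break
    return gc_lebs, queue[processed:]


victims = [[50, 40], [30, 30], [30, 30], [30, 30]]
gc_lebs, untouched = garbage_collect(victims)
print(gc_lebs)    # [[30, 30, 30], [30, 30, 30]]
print(untouched)  # [[50, 40]]
```

In this run three victims with 40 bytes of dirty space each are packed into two LEBs, so one additional empty LEB is produced while the fourth LEB is left alone.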
UBIFS has a notion of minimum I/O unit size, which characterizes minimum amount of data which may be written to the flash (see here for more information). Typically, UBIFS works on large-page NAND flashes and min. I/O size is 2KiB.
Consider a situation when GC picks eraseblocks with less than min. I/O unit size of dirty space. When all nodes from the victim LEB have been moved to the reserved LEB, the last min. I/O unit of the reserved LEB has to be written to the flash media, which means no space would be reclaimed. The reason the last min. I/O unit of the reserved LEB has to be written immediately is that the victim LEB cannot be erased before all the moved nodes have reached the media. Indeed, otherwise an unclean reboot would result in lost data.
Well, things are actually not that simple and UBIFS GC actually tries not to waste space, but it is not always possible and UBIFS GC is far from being ideal. Anyway, what matters is that UBIFS cannot always reclaim dirty space if the amount of it is less than min. I/O unit size.
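The effect of the min. I/O unit can be expressed as simple arithmetic. The Python sketch below is a simplified model and not the real GC logic: it assumes a victim LEB with no free space, an empty reserved LEB, an immediate flush after the move, and a LEB size which is a multiple of the min. I/O unit size. The function names are hypothetical.

```python
def align_up(value: int, unit: int) -> int:
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def reclaimed_by_gc(leb_size: int, dirty: int, min_io: int = 2048) -> int:
    """Free bytes gained by moving one victim's valid data to an empty LEB.

    The moved data is padded to a min. I/O unit boundary because the last
    unit has to be written out before the victim can be erased.
    """
    valid = leb_size - dirty
    return leb_size - align_up(valid, min_io)


leb = 126 * 1024
print(reclaimed_by_gc(leb, dirty=1000))  # 0
print(reclaimed_by_gc(leb, dirty=2048))  # 2048
print(reclaimed_by_gc(leb, dirty=5000))  # 4096
```

In this model 1000 bytes of dirty space yield nothing, and 5000 bytes of dirty space yield only 4096 bytes of free space.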
When UBIFS reports free space to the users, it treats dirty space as available for new data, because after garbage-collection dirty space becomes free space. But as we have just shown, UBIFS cannot reclaim all dirty space and turn it into free space. Worse, UBIFS does not precisely know how much dirty space it can reclaim. So it again uses pessimistic calculations.
Thus, the less dirty space the FS has, and the smaller the dirty space fragmentation, the more precise the UBIFS free space reporting is. In practice this means that a file system which is close to being full has less accurate free space reporting compared to a less full file system, because it presumably has more dirty space.
Note, to fix this issue, UBIFS would need to run GC in
statfs(), which would turn as much dirty space as possible into
free space, which would result in more precise free space reporting. However,
this would make statfs() very slow. Another possibility would be
to implement background GC in UBIFS (just like in JFFS2), which would lessen
the effect of dirty space with time.
Precise index size is not known
As you probably know, UBIFS maintains the FS index on flash. The index takes some flash space. There is also the UBIFS journal, which contains FS data. The FS data in the journal is not indexed, which means that the on-flash index does not refer to it. Instead, UBIFS keeps indexing information for the journal in RAM. When the file system is mounted, UBIFS has to scan the journal and build this part of the index in RAM. So the journal is like a small JFFS2 file system inside UBIFS.
The journal becomes indexed as the result of the commit operation. During the commit UBIFS updates the on-flash index and makes it refer to the information in the journal. Then UBIFS picks other LEBs for the new journal, so the journal changes its position after the commit.
UBIFS maintains precise accounting of the index size. That is, UBIFS always knows how many bytes the on-flash index takes. However, UBIFS does not know precisely how much the index will grow (or shrink) after the commit. In other words, it does not know how much the index size will change once the journal data references are included into the on-flash index. So UBIFS again makes pessimistic calculations here and assumes the worst-case scenario.
However, after the commit operation, UBIFS knows the exact index size again. The
sync() system call not only flushes all dirty data, but it also
makes UBIFS commit the journal. This means that file system synchronization
makes the free space prediction error smaller.
It is worth noting that this is not a fundamental thing. It is just an UBIFS implementation detail. UBIFS could calculate precise index size without actually running the commit operation, but the UBIFS authors found it difficult to implement. And the effect of the index size uncertainty should be low.
Documentation
If flash file systems are a completely new area for you, it is recommended to start by learning JFFS2, because many basic ideas are the same in UBIFS. Read the JFFS2 design document.
You may find the description of main JFFS2 issues, as well as very basic UBIFS ideas in the JFFS3 design document. Remember, the document in general is old and out-of-date. We do not use the "JFFS3" name anymore, and JFFS3 was re-named to UBIFS. The document was written when UBI did not exist and the document assumes that JFFS3 is talking directly to the MTD device, just like JFFS2. However, the JFFS2 overview, JFFS3 Requirements, and Introduction to JFFS3 chapters are still mostly valid and give a good introduction into basic UBIFS ideas like wandering tree and the journal. Although please note, that the superblock description is irrelevant for UBIFS. UBIFS is based on UBI and does not need that trick. However, the superblock location idea may be used to create new scalable UBI2 layer.
This web-page as well as the UBIFS FAQ contain plenty of UBIFS information. And you have to study UBI as well, because UBIFS depends on the services provided by the UBI layer. See the UBI documentation and UBI FAQ sections.
Look at UBIFS presentation slides (ubifs.odp) which
give another UBI/UBIFS overview. The slides were prepared in OpenOffice.org
Impress 2.4, so you need
OpenOffice to see them. The
slides contain animation, so you have to watch them in "slide show" mode
(use F5 key). And if you do not have any possibility to get
OpenOffice, here is a pdf version, but it is very ugly
because it does not store the animation and draws all animation steps at
once.
There is an UBIFS white-paper document available as well. However, it might be rather difficult for newbies, so we recommend to start with the JFFS3 design document. The UBIFS white-paper gives a complete UBIFS design picture and describes the UBIFS internals. The white-paper does not contain some details which you may find at this web-page or in the UBIFS FAQ, and vice-versa.
And finally, there is UBIFS source code. The code has a great deal of comments, so we recommend to look there if you need all the details. And of course, you are welcome to ask questions at the UBIFS mailing list.
How to send an UBIFS bugreport?
Before sending a bug report, please:
- make sure you have compiled kernel symbols in (CONFIG_KALLSYMS_ALL=y in .config);
- mark the Enable debugging check-box in the kernel configuration menu (CONFIG_UBIFS_FS_DEBUG=y in .config); please, do not enable other debugging sub-options like debugging messages unless you know what you are doing.
Then reproduce the bug (hopefully it is reproducible). Attach all the
bug-related messages including the UBIFS messages from the kernel ring buffer,
which may be collected using the dmesg utility or using
minicom with serial console capturing. And of course, it is wise
to describe how the problem can be reproduced. The bugreport should be sent to
the MTD mailing list.
Note, sometimes UBIFS bugs may appear to be UBI bugs. If you suspect there are UBI problems, please, also enable UBI debugging. Please, refer to the UBI debugging section for more information.
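The two kernel configuration options named in the list above can be checked before reproducing the bug. The following Python sketch scans the text of a .config file for them. The function name missing_options is hypothetical, and the location of your .config file depends on your build tree.

```python
REQUIRED_OPTIONS = ("CONFIG_KALLSYMS_ALL", "CONFIG_UBIFS_FS_DEBUG")


def missing_options(config_text: str, required=REQUIRED_OPTIONS):
    """Return the required options which are not set to 'y' in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [option for option in required if option not in enabled]


sample = "CONFIG_KALLSYMS_ALL=y\n# CONFIG_UBIFS_FS_DEBUG is not set\n"
print(missing_options(sample))  # ['CONFIG_UBIFS_FS_DEBUG']
```

For a real kernel tree, pass the contents of its .config file instead of the sample string.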
Raw flash vs. FTL devices
FTL stands for "Flash Translation Layer" and it is software which emulates a block device on top of flash hardware. In the early days FTL ran on the host computer. For example, old PCMCIA flash devices were essentially raw flash devices, and the PCMCIA standard defined the media storage format for them. So the host computer had to run the FTL software driver which implements PCMCIA FTL. However, nowadays FTL is usually firmware, and it is run by the controller which is built into the storage device. For example, if you look inside a USB flash drive, you'll find there a NAND chip (or several of them), and a micro-controller, which runs FTL firmware. Some USB flash drives are known to have quite powerful ARM processors inside. Similarly, MMC, eMMC, SD, SSD, and other FTL devices have a built-in controller which runs FTL firmware.
All FTL devices have an interface which provides block I/O access. Well, the interfaces are different and they are defined by different specifications, e.g., MMC, eMMC, SD, USB mass storage, ATA, and so on. But all of them provide block-based access to the device. By block-based access we mean that the whole device is represented as a linear array of (usually 512-byte) blocks. Each block may be read or written.
Linux has an abstraction of a block device. For example, hard drives are block devices. Linux has many file systems and the block I/O subsystem, which includes elevators and so on, which have been created to work with block devices (historically - hard drives). So the idea is that the same software may be used with FTL devices. For example, you may use the FAT file system on your MMC card, or the ext3 file system on your SSD.
Although most flashes on commodity hardware have FTL, there are systems which have bare flashes and do not use FTL. Those are mostly various handheld devices and embedded systems. Raw flash devices are very different to block devices. They have a different work model, they have tighter constraints and more issues than block devices. In case of FTL devices these constraints and issues are hidden, but in case of raw flash the software has to deal with them. Please, refer to this table for some more details about the difference between block devices and raw flashes.
The UBIFS file system has been designed for raw flash. It does not work with block devices and it assumes the raw flash device model. In other words, it assumes the device has eraseblocks, which may be written to, read from, or erased. UBIFS takes care of writing all data out-of-place, doing garbage-collection and so on. UBIFS utilizes UBI, which is doing stuff like wear-leveling and bad eraseblock handling. All these things are not normally needed for block devices.
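The raw flash device model can be pictured with a toy Python class. It is a deliberately simplified illustration and does not model any real driver or chip: the class name, the 16-byte eraseblock size and the rule that a write is accepted only over erased (all 0xFF) bytes are assumptions of the sketch.

```python
class Eraseblock:
    """Toy eraseblock: readable, writable only where erased, erased as a whole."""

    def __init__(self, size=16):
        self.size = size
        self.data = bytearray(b"\xff" * size)

    def erase(self):
        # Erasing always affects the whole eraseblock.
        self.data = bytearray(b"\xff" * self.size)

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        end = offset + len(payload)
        if offset < 0 or end > self.size:
            raise ValueError("write crosses the eraseblock boundary")
        if any(byte != 0xFF for byte in self.data[offset:end]):
            raise ValueError("area is not erased (not all 0xFF bytes)")
        self.data[offset:end] = payload


block = Eraseblock()
block.write(0, b"v1")      # first version of the data
try:
    block.write(0, b"v2")  # an in-place update is refused
except ValueError as err:
    print("in-place update failed:", err)
block.write(2, b"v2")      # out-of-place update; bytes 0..1 are now dirty
print(block.read(2, 2))    # b'v2'
block.erase()              # only an erase makes the dirty area usable again
block.write(0, b"v3")
```

A block device hides all of this: any 512-byte block may simply be overwritten, which is why file systems written for block devices do not need out-of-place updates or garbage collection.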
Very often people ask questions like "why would one need to use raw flash and why not just use eMMC, or something like this?". Well, there is no simple answer, and the following is what UBIFS developers think. Please, take it as just our opinion and take into account the date of this writing (15 October 2008). The answer is given in the form of a list of non-structured items, and the reader should structure and interpret it in a way which is appropriate for their system. And because mass storage systems mostly use NAND flash (modern FTL devices have NAND flash arrays inside), the answer talks specifically about NAND flashes. Also, we'd like to emphasize that we do not give general recommendations and everything depends on system requirements.
- Bare NAND chips are cheaper and simpler, which is very important for small systems. However, it seems like the industry pushes FTL devices forward and the situation is not that simple and obvious anymore. Indeed, an FTL device is more complex than a raw NAND of similar size, because an FTL device has to have an additional controller and so on. But since the industry tends to produce a lot of FTL devices, and actually sells a lot of them, the price is going down.
- If you need flash storage on which you are going to use the FAT file system, then in most cases you should stick with an FTL device (eMMC, MMC, SD or whatever). Just make sure the FTL device is doing proper wear-leveling. We believe the good brand FTL devices do it well.
- The other situation is when you are going to use your FTL device for system storage (e.g. for rootfs) and use a more robust file system like ext3. In this situation you should take into account various system requirements like tolerance to sudden power cuts. The following items are mostly related to system storage situations.
- FTL devices are "black boxes". FTL algorithms are normally vendor secrets. But we know that NAND flash has issues like wear-leveling, bad blocks handling, read-disturb and so on. And it is important to get them right, especially in case of MLC NAND flash, which may have very short eraseblock life-time (e.g., only 1000 erase-cycles). But because FTL algorithms are closed, it is difficult to be sure whether a specific FTL device gets everything right or not.
- If you start thinking about how FTL could be implemented, you realize that it must do things like garbage collection (sometimes referred to as "reclaim process"). And flash hardware pretty much requires most writes to be out-of-place. But how does FTL behave in case of sudden power-cuts? What if a power-cut happens while it is in the middle of doing garbage collection? Does the FTL device guarantee that the data which was on the flash media before the power cut happened will not disappear or become corrupted?
- The power-cut tolerance may be tested, while it is quite difficult to test stuff like wear-leveling or read-disturb handling, because it may require too much time.
- We have heard reports that some USB flash drives wear out very quickly, i.e., they start reporting I/O errors after a few weeks of intensive use. This means that their FTL does not do proper wear-leveling. This does not mean that all USB flash drives are bad, but you should be careful.
- We have heard reports that MMC and SD cards corrupt and lose data if power is cut during writing. Even data which was there a long time before may become corrupted or disappear. This means that they have bad FTL which does not do things properly. But again, this does not have to be true for all MMCs and SDs - there are many different vendors. Still, you should be careful.
- In general, if you glance back into the history, many FTL devices were mostly used with the FAT file system for storing stuff like photos and video. The FAT file system is not reliable by definition, which suggests that FTL devices may also not be very reliable, just because historically this was not really required. Indeed, it is not a big deal to lose a couple of photos. However, it is crucial to make sure that system libraries do not become corrupted because of power-cuts.
- Good FTL, especially if it deals with MLC NAND (which is used in modern mass storage devices), must be a rather complex piece of software. Implementing it in firmware might not be a very simple task. And running it might require a powerful controller. Obviously, we may suspect that vendors go for various kinds of tricks or compromises to keep their devices "good enough" and cheap. For example, it is known that some vendors optimize their FTL devices for FAT, and if you start using ext3 on top of such a device, you might face some unexpected problems or the device may turn out to be not as good as you would imagine. Again, with closed FTL it is often difficult to verify this.
- SSD drives are probably very different to eMMC, MMC/SD etc. We have not worked with SSD drives. They are expensive and they probably have powerful CPUs inside, which run complex firmware which is probably getting things right.
- FTL devices are becoming more popular and better, although it is not easy to distinguish between good and bad FTL devices (of course vendors would assure you their device is perfect). Generally, there is nothing wrong in using an FTL device as long as you trust it, or have tested it, or it simply fits your system requirements.
- In case of raw flash we know exactly what we are doing. UBI/UBIFS handles all aspects of NAND flash like bad erase-blocks and wear-leveling. It guarantees power-cut tolerance. It is open and available, so you may always validate, test, and fix it. In contrast, it is not that simple to fix firmware bugs.
- Theoretically, UBIFS may do a better job, because it knows much more about the files than FTL does. For example, UBIFS knows about deleted files, while FTL does not, so FTL may do unneeded work trying to preserve the sectors belonging to deleted files. However, some FTL devices support "discard" requests and may benefit from the file system hints about unused sectors. Nevertheless, in general, UBIFS should do a better job on a bare NAND than a traditional FS on an FTL device with a similar NAND chip. On the other hand, FTL devices may include multiple NAND chips, highly parallelise things and provide fast I/O. Probably SSD is a good example. But this may affect the cost.
- Obviously, the advantage of FTL devices is that you use old and trusted software on top of them.
So it is indeed difficult to give an answer. Just think about the pros and cons, take into account your system requirements and decide. Nonetheless, raw flashes are used, mostly in the embedded world, and this is why UBIFS has been developed.