Fault Tolerance and Data Integrity in a File System

azblue · **Joined:** Sat Feb 27, 2010 8:55 pm **Posts:** 147

To a large extent, there is a tradeoff between simplicity and what may be called, for lack of a better phrase, "useful features". I'm trying to create a file system that makes use of some useful features (e.g., keeping track of who "owns" a file) while maintaining simplicity.

I've run into a couple issues; I'm going to use FAT to explain the issues, as I assume we're all familiar with FAT.

Fault Tolerance

Let's say I want to create a file named "MyFile" in the root directory, it takes 3 clusters, and clusters x,y, & z are free. I need to:
A) make an entry in the root directory for a file named "MyFile" that starts at cluster x
B1) modify the xth entry in the FAT to point to y
B2) modify the yth entry in the FAT to point to z
B3) modify the zth entry in the FAT to be "End of file"
C) write the contents of the file to the clusters x,y,z

The most intuitive order is what is what is shown above, but this is not fault tolerant; to cite but one example, if the computer loses power after B3 the file system will appear to be valid but "MyFile" will contain garbage.

The first step should be C; if we lose power after that we "lose" our data (it's there on the disk, but we have no idea it's there), but the file system is in a valid state. The data lose can't be considered that big a deal as there is literally no way to prevent data loss in the case of abrupt power failure.

If step A is done next, a power failure will result in "MyFile" showing that it starts at cluster x, but cluster x will be marked as free, and we won't know where clusters y and z are. On top of that, we have no way of knowing there is any problem until an attempt is made to access the file. So A cannot be the 2nd step.

If, after step C, step B is done in reverse (B3, then B2, then B1) any interruption in the middle of or end of step B can be caught, as there will be some or all of a cluster chain that has no file pointing to it. However, this can only be caught with a full scan of the file system -- and there's no way of knowing a scan of the file system is necessary.

It looks as if my options are:
1. Add a journal
2. Have a file system that is not fault tolerant.

My primary design goal is simplicity. While a journal obviously isn't that complicated, it does add a tad more complexity. So,

QUESTION ONE: Is there a way to ensure the integrity of the file system without a journal?

One idea I have to ensure the integrity of the data on the disk was to have a checksum or a hash for every file. At creation time this is no big deal: compute the magic number while writing the file. After the fact it presents problems: Checking the checksum/hash every time a file is read is no problem for a file that's a couple KB, but what about one that's 50GB? Reading a single byte from the file would incur the enormous cost of reading the entire file to ensure it checks out first. Not performing that check would defeat the purpose of having the checksum/hash. The solution I've come up with is to have checksum/hash of every cluster. So,

QUESTION TWO Is there a better way of ensuring the integrity of data read from the disk than having a checksum or hash of every cluster (or group of clusters)?

I hope I've managed to adequately explain myself

Brendan · **Posted:** Tue Oct 10, 2017 8:53 pm

Hi,

azblue wrote:

QUESTION ONE: Is there a way to ensure the integrity of the file system without a journal?

Yes. For FAT you can do the writes in a special order to ensure that (if there's a power failure, unexpected device removal, device driver crash, etc) the file system remains consistent and you can only end up with lost clusters. For example:

A) write the contents of the file to the clusters x,y,z (if power fails during or immediately after this step there's no problem)
B) mark those clusters as "used" (and create the cluster chain - "x points to y points to z") in the cluster allocation table (if power fails during or immediately after this step you only get lost clusters)
C) ("atomically" - e.g. by replacing an existing sector for the root directory) add an entry in the root directory for a file named "MyFile" that starts at cluster x

The main problem with this is performance. Ensuring that all writes happen in a "correct for fault tolerance" order is significantly slower than ensuring writes happen in the best order for performance (e.g. to minimise seek times or whatever).

azblue wrote:

One idea I have to ensure the integrity of the data on the disk was to have a checksum or a hash for every file. At creation time this is no big deal: compute the magic number while writing the file. After the fact it presents problems: Checking the checksum/hash every time a file is read is no problem for a file that's a couple KB, but what about one that's 50GB? Reading a single byte from the file would incur the enormous cost of reading the entire file to ensure it checks out first. Not performing that check would defeat the purpose of having the checksum/hash. The solution I've come up with is to have checksum/hash of every cluster. So,

To ensure the integrity of which data on the disk? For file system metadata (e.g. to guard against file system corruption caused by things like power failures, etc) a checksum for each file would do nothing.

Note that you could have "per block" checksums instead of "per file" checksums; so that you can read a block in the middle of a file and check the block's checksum (and don't have to read the entire file); and modern storage device drivers already have "per sector" checksums, so there isn't that much advantage in having a "per block" checksum in the file system itself.

azblue wrote:

QUESTION TWO Is there a better way of ensuring the integrity of data read from the disk than having a checksum or hash of every cluster (or group of clusters)?

If a file's data is messed up and/or can't be read, that's merely unfortunate. If the file system's meta-data is messed up (e.g. whatever you use to keep track of which blocks were free got messed up so file system thought a lot of "in use" blocks were free and corrupted random files spread everywhere) then that's a major disaster. For file system meta-data I'd consider using some sort of error correcting code (and not just a checksum) that's been designed so that if a whole sector can't be read (storage device's built in "per sector" checksum/CRC failed so storage device refuses to allow that sector to be read at all) that sector's contents can be reconstructed. For file data, I'd design file formats with checksum/CRC built-in so that applications can do "end to end integrity" themselves, regardless of how or where the file is passed between layers, transferred or stored; and then I wouldn't bother doing anything to ensure file data integrity in the file system at all.

Cheers,

Brendan

mikegonta · **Joined:** Thu May 19, 2011 5:13 am **Posts:** 229

Brendan wrote:

The main problem with this is performance. Ensuring that all writes happen in a "correct for fault tolerance" order is significantly
slower than ensuring writes happen in the best order for performance (e.g. to minimise seek times or whatever).

Good point.
However, modern rotating media hard drives internally remap bad sectors and buffer recent requests so there is no way of knowing
where the heads are actually physically seeking to.
What would the other best order for performance examples be?

Brendan · **Posted:** Wed Oct 11, 2017 8:25 am

Hi,

mikegonta wrote:

Brendan wrote:

The main problem with this is performance. Ensuring that all writes happen in a "correct for fault tolerance" order is significantly
slower than ensuring writes happen in the best order for performance (e.g. to minimise seek times or whatever).

Good point.
However, modern rotating media hard drives internally remap bad sectors and buffer recent requests so there is no way of knowing
where the heads are actually physically seeking to.

The spare sectors used to replace bad sectors are typically on the same track, so that (unless all of the spare sectors for that track are already in use) you'd only suffer extra rotational latency and won't get the extra cost of shifting heads. The other thing that complicates things is that for modern hard disks "sectors per track" is not a constant - for tracks near the outer edge (larger circumference) there's more sectors per track, and for tracks near the inner edge (smaller circumference) there's less sectors per track; and information about the actual "sectors per track" for each track isn't easily obtained.

Despite all of this, a simple "closer LBA numbers mean faster seek times" approximation still gives useful results (significantly better than random allocation).

Note: I have wondered whether the real geometry (including "sectors per track" for each track) could be determined using (very time consuming) benchmarking/measurement, and then stored in some sort of database of hard disk characteristics; where it could be auto-fetched/auto-installed if anyone else has a hard disk with the same manufacturer and model number and then used for optimising seek times (and probably other purposes).

mikegonta wrote:

What would the other best order for performance examples be?

CD-ROM and DVD is mostly a single track arranged in a spiral (so you'd want to optimise for "sequential reads in the forward direction" as much as possible, and "all reads in a forward direction (skipping over unwanted sectors)" if you can't make the reads sequential). Seeking back one sector means moving heads (ouch) and waiting for almost a full rotation of the disk.

For backup tape it should be relatively obvious (sequential reads in the forward direction are ideal; cost of seeking approximated by difference between LBAs , e.g. "distance from end of last used sector to start of next wanted sector", possibly with an additional small penalty for backward seeks to account for angular momentum).

Flash is typically large physical blocks (where "logical blocks" are emulated - e.g. to write one 512-byte emulated sector; it might read a 256 KiB block into its cache, modify 512 bytes in cache, then write the 256 KiB block back) and has other quirks (TRIM, wear levelling). For this you'd want to try to read/write whole blocks and/or access logical blocks that happen to be in the same larger physical block; but it's another case where getting accurate performance characteristics (e.g. figuring out the physical block size) is hard.

RAM is like flash, except that the physical block size is much smaller (e.g. 64-byte cache line size). Optimising access patterns would mostly follow "TLB rules" (e.g. accesses within the same 4 KiB page are cheapest because you've already fetched that TLB, accesses within the same 2 MiB, 1 GiB or 512 GiB area might also be cheaper due to either large page use or "TLB directory cache" support in the CPU).

For networked file systems the access patterns are the server's problem (depending on the underlying file system/s and their storage devices in the server). However; I'd be tempted to assume that (for client/s) network bandwidth will be the bottleneck; and to improve network bandwidth I'd be tempted to consider a block size that's related to MTU (in an "MTU = header_size + block_size" way), especially for smaller MTUs (where packet headers are a larger fraction of the total packet). Sadly, for typical ethernet (e.g. 1500-byte MTU) this gives a relatively ugly/awkward block size (not a nice friendly power of 2).

Cheers,

Brendan

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5134

azblue wrote:

there's no way of knowing a scan of the file system is necessary

Windows uses one of the reserved bits somewhere to indicate whether the filesystem was cleanly unmounted or not. It's not foolproof (Windows also allows users to skip the check, and it won't ask again unless the disk is disconnected again without a clean unmount) but it's better than nothing.

Brendan · **Posted:** Wed Oct 11, 2017 7:41 pm

Hi,

Octocontrabass wrote:

azblue wrote:

there's no way of knowing a scan of the file system is necessary

Windows uses one of the reserved bits somewhere to indicate whether the filesystem was cleanly unmounted or not. It's not foolproof (Windows also allows users to skip the check, and it won't ask again unless the disk is disconnected again without a clean unmount) but it's better than nothing.

I'd prefer "slow continuous scan always in progress in the background" (where the OS mostly uses idle time to do things like check for bad blocks, optimise/defragment the file system, recover any lost sectors, compress "never read" files, do data de-duplication, etc).

There's few things that are more annoying than rebooting a server and seeing the dreaded "345 days since file system last checked, please wait for hours while customers are screaming at you about downtime" message in poorly designed OSs (e.g. Linux).

Cheers,

Brendan

Sik · **Joined:** Wed Aug 17, 2016 4:55 am **Posts:** 251

Answering to the first post: yeah, C should be done first, but after that if the remainder steps are not all completed then you will end up with invalid cluster pointers regardless of which order you did them in. That said, determining whether you need to do a scan should be relatively easy... if the filesystem wasn't unmounted properly (or the OS wasn't shut down properly) just assume you need a scan.

It's still full of faulty assumptions though. The biggest issue with FAT is that the table doesn't have a proper way to recover, so it's not easy to fix if it goes wrong. On top of that, when you're modifying an entry you need to rewrite the sector, and if that procedure wasn't completed you risk the whole sector containing garbage (affecting many adjacent entries). Not to mention that the drive may be lying about whether it wrote a sector yet and/or if it does it in the order you told it (meaning you shouldn't trust recent writes to be entirely safe).

So yeah, if you really want to make it more robust for real you may actually need something more complex =/

OSDev.org

Fault Tolerance and Data Integrity in a File System

Who is online