OSDev.org

Posted: **Sun Sep 25, 2022 6:44 am**

Hello,

I finished the first version of my custom filesystem. You can find it here. Current features include:

Error detection (not correction) with a fast 64-bit hash.
Transparent compression with LZ4 (can be extended with other compression algorithms)
Transactional updates, i.e. the filesystem will never be in an inconsistent state on disk even if power is cut in the middle of a transaction.
Sparse files.
Up to 2^32 entries per directory, indexed with a hashmap and a cryptographic hash.
File names up to 255 bytes long.
Extensions that can be enabled per directory. Current extensions include "unix", which adds UID, GID and permissions to each entry, and "mtime", which adds a 64-bit modification time in milliseconds to each entry.
Small files (up to 2^16) can be embedded directly inside directories to reduce space usage and improve locality.

It includes a FUSE driver which has been tested on Linux.

Why yet another filesystem?

Personally, I'm a fan of filesystems that support transparent compression since you can save a huge amount of space with it. On my Linux desktop I use ZFS which currently reports a x1.96 compressratio, i.e. the total size of all data is halved, which doubles the effective capacity of my main disk.

However, ZFS is very complex and porting it to my own OS is likely very difficult. i've looked at other filesystems with compression such as BTRFS but I'm unable to find any that isn't either very complex or a read-only FS. Hence I decided to roll my own.

Design (briefly)

I ensured that all updates can be performed atomically as to reduce the odds the filesystem becomes corrupt. I went for a transaction instead of journal based system since a) it makes it easier to ensure consistency (IMO) and b) there is constant copying between host and device anyways, but at least transaction based systems should need less copies as you don't need to write twice.

All data is grouped into power-of-two sized "records". Each record points to a variable amount of blocks. By splitting data into records it is possible to compress and decompress only parts of large files instead of the entire file.

While all modern disks have ECC to protect against bitrot a hash has also been added to records anyways. This should help with catching errors that may have occurred during transmission which may occur over a poor link. While the hash only provides detection capabilities it at least let the user be aware of issues immediately instead of letting corruption happen silently.

I use hashmaps for directories instead of e.g. B-trees since the object layer of NRFS acts as a sort of MMU, i.e. it exposes each object as a contiguous space. Since hashmaps are much simpler to implement and the object layer makes it practical I went with that.

Specification & Future work

I made a specification of NRFS, which is 430 lines long. I first wrote the specification and then made an implementation based on this specification, so it should be complete as of writing.

While the storage format is mostly finished, the implementation is admittedly not great: it doesn't support asynchronous requests, writes out records redundantly and doesn't handle errors such as a hash mismatch gracefully. I will address this once I figure a good design for other planned features (notably, mirroring and error correction).

Feedback is welcome!

Posted: **Sun Sep 25, 2022 6:56 am**

Seems good.

Posted: **Sun Sep 25, 2022 11:25 am**

Demindiro wrote: Transparent compression with LZ4 (can be extended with other compression algorithms)

I do hope this one can be turned off. This is just futile effort for encrypted or already compressed files (e.g. JPEG or MP3, or most video formats), because there the entropy is too large for LZ compression to make a meaningful difference. And for small files you have the problem that compression doesn't even save any disk space, since a file cannot occupy less than one unit of storage. In that case it also doesn't save any transfer time, since you cannot transfer less than one block into memory. So it would just be a total waste of CPU time to compress there.

Demindiro wrote:While all modern disks have ECC to protect against bitrot a hash has also been added to records anyways.

That is a good idea in general. You do not always read back what you wrote to the disk.

Demindiro wrote:I use hashmaps for directories instead of e.g. B-trees

The only reason ext3/4 went with B-trees for indexed directories is because they wanted to remain at least read-only compatible with ext2 drivers. In general, a directory is a map from string (filename) to file/inode, and a hashmap is the best known implementation of such a thing (giving average constant lookup time, rather than logarithmic in the directory size, as with most trees, or linear in the directory size, as with most lists).

Posted: **Mon Sep 26, 2022 2:06 pm**

nullplan wrote:I do hope this one can be turned off.

The compression is optional and can be turned off entirely. It is also set per record so that the driver can pick the optimal choice.

nullplan wrote:So it would just be a total waste of CPU time to compress there.

LZ4 compression and decompression is very cheap, especially compared to the transfer speeds of e.g. SATA disks. The library I'm using claims it achieves 1897 MiB/s compression and 7123 MiB/s decompression (running the benchmarks myself after disabled the "safe" features gives ~1000 MiB/s and ~4400 MiB/s respectively). Unless you have a very fast disk (e.g. NVMe SSD) it usually makes more sense to keep it enabled.

EDIT: I ran a benchmark with a 1.8G MKV file and measured ~6.0 GiB/s for both compression and decompression. Even the fastest SSDs can only barely keep up with this, let alone if you use multiple cores for compression. Keeping LZ4 on by default is a sane choice IMO.

Posted: **Tue Dec 13, 2022 1:41 pm**

I've been working on v0.2 of the filesystem. It is not ready yet but I'd like to make a brief progress report.

The major changes are:

Partial support for RAID 10 (needs testing + error correction hasn't been implemented yet)
Record cache has been completely overhauled.
Object layer is fully asynchronous, though the filesystem layer doesn't take advantage of this yet.

RAID 10

The idea is pretty simple: make chains of devices and mirror the contents.

This scheme allows mixing devices with variable sizes, which can be useful to expand the filesystem over time. It is also technically possible to reduce the size of the filesystem as records can freely be moved around.

Record cache

Implementing a proper and decently performant cache proved to be exceptionally difficult. I spent a signficant amount of time coming up with a scheme that allows reading and writing cached data in constant time while still making it practical to work with on-disk records.

The key insights are:

"Record trees" are similar to page tables, where an offset (akin to a "virtual address") is split up into indices.
User software only writes to leaf records. When dirty leaves are flushed, a new record "reference" is created which is then put in the parent. In turn, these parent records are marked dirty. In a sense, the dirty status "bubbles up" all the way to the filesystem header.
The cache can make private instances of the record tree helper structure, with functionality that is not directly exposed to user software (e.g. the bubble up mechanism).
Every record has exactly one parent.

Resizing turned out to be the most difficult, as the depth of a tree may change and an arbitary amount of records may need to be destroyed.

Growing a tree was quite trivial, as it can be done by creating a single new record, making the root record a child of it and clearing the original root. The changes will eventually bubble up and set a proper root record.

Shrinking was far from trivial and I spent the most time on it. Fuzzing helped a great deal in finding bugs, which weren't limited to the resize function. Shrinking a tree potentially involves:

Making the appropriate child record the new root.
Reducing the depth, destroying top-level records and all children except the left-most ones.
Removing out-of-range records that may still be in the cache and have children that must be destroyed.
Note that due to the bubble up mechanism these records have not yet been committed to disk.
Trimming records so if the tree is grown again the data is zeroed.

Asynchronous

The object layer is fully asynchronous, which allows multiple requests to be made concurrently with relatively little overhead.

However, the API can only be accessed by a single thread at any time. I do not consider it to be worth it to parallelize it as the caching mechanism itself should be plenty fast on a single thread. Copying, compression and hashing may be parallelized, though this will remain an implementation detail.

Posted: **Tue Dec 13, 2022 7:08 pm**

Demindiro wrote:RAID 10

The idea is pretty simple: make chains of devices and mirror the contents.

That's RAID 1 on top of two JBOD spanned volumes. RAID 10 is RAID 0 on top of two or more RAID 1 volumes.

They have the same amount of redundancy, but RAID 10 implies higher performance than the underlying devices, and I don't see how your design would achieve that - especially when mixing devices of different sizes.

Posted: **Wed Dec 14, 2022 4:01 am**

Octocontrabass wrote:
Demindiro wrote:RAID 10

The idea is pretty simple: make chains of devices and mirror the contents.
That's RAID 1 on top of two JBOD spanned volumes. RAID 10 is RAID 0 on top of two or more RAID 1 volumes.

They have the same amount of redundancy, but RAID 10 implies higher performance than the underlying devices, and I don't see how your design would achieve that - especially when mixing devices of different sizes.

You can use devices with identical sizes in which case it is equivalent to RAID 10. The current design is more flexible however, which may be useful if one wants to e.g. use a bunch of random, old drives.

Posted: **Wed Dec 14, 2022 7:09 am**

I haven't read your code yet.

But do you plan to create portability layer so other developers can integrate your code into kernel?

Posted: **Wed Dec 14, 2022 7:14 am**

arseniuss wrote:But do you plan to create portability layer so other developers can integrate your code into kernel?

It is intended to be relatively easy to integrate in OSes other than mine, so yes.

Currently there is only a Rust API + a FUSE driver, but once the code is somewhat stable I intend to add a C API + allow providing a custom (fallible) allocator.

OSDev.org

Nora's Reliable FileSystem (NRFS) v0.1

Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1

Re: Nora's Reliable FileSystem (NRFS) v0.1