Object Oriented File System

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

Hey guys. I am in the process of redesigning my Object Oriented OS, and one of the things that I want to redesign from scratch is the File System.

Here are some of the features I'm trying to include:

Object based - Instead of "files", everything stored on disk will be an object of some type. The details about that type (fields, methods, inheritance, etc.) may be located on a different volume (if possible).

Indexes - In order to improve performance, I would like to have indexes (database style) that would allow me to quickly find objects by name, type, date created, etc. These indexes would need to be kept up to date, or recalculated as needed, or on demand, and the user should be able to add new indexes as needed to find specific information quickly (all files created by me with the tag "cars").

Memory manager compatible - If possible, I'd like the file system "structures" to be used by the memory manager as well, so that they could be simply copied or moved from disk to memory and back with very little "serialization" needed. Also, things like garbage collection and memory allocation could essentially use the same code.

Let me know what you guys think, and if you have any ideas on how all of this could be implemented.

Thanks,
Joshua

And before you even get started... Posts that contain the word "stupid", or that don't contain any helpful suggestions will be ignored.

Also, instead of "folders", I'm planning on using "collections" of objects. Objects can be, at any given time, included in one or more collections. And collections can contain collections (cause they are objects).

Brendan · **Posted:** Sun Jun 15, 2014 10:23 am

Hi,

SpyderTL wrote:

Object based - Instead of "files", everything stored on disk will be an object of some type. The details about that type (fields, methods, inheritance, etc.) may be located on a different volume (if possible).

What actually is an "object" in this case? Do you mean that data and the code to use that data will be stored together as an object (e.g. rather than just having a 123 KiB text file you'd have something containing 123 KiB of data plus 345 KiB of code that implements some sort of abstracted interface for doing things like getting, setting, sorting and searching that data)?

In that case, it'd be more efficient to store the data and the code separately (to avoid duplicating the code in every single file); but that would mean you're back to just having "data files" and "executable files" with no encapsulation.

SpyderTL wrote:

Memory manager compatible - If possible, I'd like the file system "structures" to be used by the memory manager as well, so that they could be simply copied or moved from disk to memory and back with very little "serialization" needed. Also, things like garbage collection and memory allocation could essentially use the same code.

How would portability work (e.g. if an object is stored on a "read only" CD and that same object is loaded by a little-endian 80x86 machine and a big-endian POWER machine)? If an object is not just data (includes code); then how would that work?
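One common answer to the byte-order question is to fix the byte order of the on-disk format and decode fields one byte at a time, so the same code gives the same result on little-endian and big-endian hosts. The following is a minimal sketch under the assumption that the on-disk format is little-endian; the helper names `load_le32` and `store_le32` are made up for illustration and are not part of any design in this thread.

```c
#include <stdint.h>

/* Assumption: the on-disk format is defined as little-endian.
 * These helper names are illustrative only. */

/* Decode a 32-bit little-endian value from raw bytes,
 * independent of the host's byte order. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Encode a 32-bit value as little-endian bytes,
 * independent of the host's byte order. */
static void store_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

The trade-off is that a host whose byte order differs from the on-disk order can no longer use the on-disk structures directly in memory, which works against the "very little serialization" goal on those machines.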

Also, what about inter-operability? For example, if someone downloads a file via FTP; or copies files from a FAT or ISO9660 file system; then do you transparently transform the raw file into an object before storing it; and where does any missing meta-data come from?

Cheers,

Brendan

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

Of course, all of this is subject to change, but here is what I'm thinking...

Objects are just the instance data that make up an object. The code and "reflection" information is stored elsewhere in a Class (which is an object, of type Class.)

All objects have a reference (of some sort) back to their Class object. myObject.GetClass() (currently). This will either be a pointer, or a unique value that designates a specific class name/namespace. (I'm currently storing both)

For real "Files" downloaded or copied from other file systems, they will probably be treated as a generic File object, and perhaps have specific subclasses (Mpeg3AudioFile, AdobePdfFile, HtmlFile, etc.) Those classes would have methods that would extract/convert the data into the actual objects (AudioClip, Document, WebPage, etc.)

Some ideas that I've been playing with in my head:

Catalog - A collection (array, linked list, etc.) of object pointers, or object structs with pointers to their data. This would be a master table containing every object on the disk (or in memory). Could be used for garbage collection / defragmenting fairly easily. Would be slow to scan through looking for specific objects, but would only be used as a last resort.

Indexes - Collection of calculated values, and pointers to objects, sorted array, or paged tables, or binary tree, etc. Could be used to quickly find objects of a specific type, or specific name, or any other field or calculated value. (objects larger than 1MB, for example).

Application Catalogs - In addition to disk catalogs and memory catalogs, applications may have their own catalog of objects, so that an entire application can be shut down and all of its local objects and memory can be freed at once.

Application Packages - Applications live within their own sandbox, and they have their own data on disk and in memory. Access outside of an application's sandbox will be gated and provided by the operating system.

Classes - Classes are objects of a specific type that contain size, field location and method description information for all objects of that type. Of course, they are stored on disk, like all other objects, but they must be loaded into memory before they can be used (although leaving them on disk would be an interesting exercise).

Methods - Methods live within classes (or perhaps are just referenced by classes). Normally, methods would contain platform specific pre-compiled code, but at some level they could actually contain any compiled (platform specific) or uncompiled (source code / IL) code, as long as the method description within the class could designate what type of code to expect at that location. Non-platform specific code could be compiled Just-In-Time before it is executed.

Fields - Fields contain data, or pointers to data. I can see several different usages for fields -- simple data (byte, int, string) stored within the object, static data (simple data stored within the class object), indirect pointer to data (object field data contains pointer or offset to actual data), object reference (object field contains object ID, or address of entry in Catalog array above, so that objects can be moved). No one approach listed covers all of the bases you would need, so all of these will probably need to be supported.

Object size - The size of an object will probably need to be stored separately from the object data, since an object's data may not necessarily contain size information (for example, zero terminated strings), but knowing the exact size of an object's data is important if you are going to be moving objects around in memory (garbage collection) or on disk (defragmenting).

Byte Level Addresses - In order to maintain compatibility between the file system and the memory manager (use the same structs), pointers to objects will (probably) need to be able to point to a specific byte address. Block addresses make less sense in memory than they do on disk, unless you are going to assume paged memory. Otherwise, pointers will probably need to be 64-bit absolute addresses, and disk access code will need to be able to convert from addresses to blocks, and vice versa.

No Catalog - I've started to consider getting rid of the catalog and just using indexes. As long as one of the indexes was guaranteed to hold every object (say the object class index), it could serve the same purpose, and could be used as the catalog.
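The "No Catalog" idea, where one index is guaranteed to hold every object, could be sketched as an array of entries sorted by class and searched with a binary search. This is only a sketch under assumptions: the entry layout and every name in it (`index_entry`, `class_id`, `object_id`, `address`, `size`) are hypothetical and chosen to mirror the Indexes, Object size and Byte Level Addresses points above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry: every field name here is an assumption. */
struct index_entry {
    uint64_t class_id;   /* key: unique value identifying the object's class */
    uint64_t object_id;  /* stable ID, so the object itself can be moved */
    uint64_t address;    /* 64-bit byte address of the object's data */
    uint64_t size;       /* size kept outside the object data */
};

/* Return the position of the first entry whose class_id is >= the given
 * one (a lower-bound binary search). The array must be sorted by class_id. */
static size_t index_lower_bound(const struct index_entry *idx, size_t count,
                                uint64_t class_id)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].class_id < class_id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Count the objects of one class: they form a contiguous run
 * in the sorted index. */
static size_t index_count_class(const struct index_entry *idx, size_t count,
                                uint64_t class_id)
{
    size_t first = index_lower_bound(idx, count, class_id);
    size_t n = 0;
    while (first + n < count && idx[first + n].class_id == class_id)
        n++;
    return n;
}
```

Walking the whole array from start to end visits every object once, which is the job the separate catalog would otherwise do for garbage collection or defragmenting.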

Hopefully this gives you guys an idea of the direction I'm headed. Suggestions or issues are welcome. As usual, negative comments will be ignored.

GhostlyDeath · **Joined:** Tue Jun 05, 2012 8:21 pm **Posts:** 6

So you want a file system that accesses data through schema-like accessors, which wrap proprietary data and expose the internal data through a common interface (e.g. Text, Pictures, Music, Sound), while also making it possible to store user-added schema-like classes/programs on the disk, so that you can still access the data when you move it to another system?

I believe this has been done before.

iansjack · **Posted:** Sun Jun 15, 2014 3:06 pm

I'm a great believer in KISS. I think what you are proposing is too open to disastrous file-system corruption, particularly if different parts of the object are stored on different disks. It's going to take a lot of work to implement utilities to guarantee file-system integrity or (more likely) correct the inevitable corruption.

(I realize that this is a negative comment and will be ignored.)

I would be more interested in a file system that emulated the OS/400 idea of single-level storage.

jnc100 · **Posted:** Sun Jun 15, 2014 3:42 pm

Most current object models (Java, .NET, Python, etc.) support the ability to serialize objects to binary streams, and the serialize/deserialize functions respect host byte order. Thus they can be saved in current file systems, alongside other file formats. This ability to store objects as well as 'simple' binary data is therefore already present in most file systems without requiring any extra functionality from the file system. However, some current file systems go further anyway, e.g. with extended attributes in ext or the ability to support multiple streams per file in NTFS. On this basis, I cannot really see what your file system adds (aside from extra incompatibility with existing file systems). I would therefore advise adding this functionality at the standard library level (e.g. a modified fopen etc.) rather than at the file system level, and using currently available file systems instead.

I do, however, like the idea of indexing files according to other attributes aside from location. For example, most media players will often rescan an entire drive looking for playable media. If it were indexed as part of the file system, then that does have a performance benefit.

Regards,
John.

Brendan · **Posted:** Sun Jun 15, 2014 7:00 pm

Hi,

Let me see if I understand this correctly...

SpyderTL wrote:

Objects are just the instance data that make up an object. The code and "reflection" information is stored elsewhere in a Class (which is an object, of type Class.)

All objects have a reference (of some sort) back to their Class object. myObject.GetClass() (currently). This will either be a pointer, or a unique value that designates a specific class name/namespace. (I'm currently storing both)

Ok, so you'd have instance data stored in one place (e.g. a file called "hello.txt") and the class stored in another place (e.g. a file called "notepad.exe"). Then, each object would have some sort of reference to its class (e.g. the file name's extension, like ".txt" or ".exe").

Of course for an object oriented system you could also support polymorphism. For example, when the user right-clicks on the icon representing an object's instance data, you could display a context sensitive menu that includes a list of all classes that can handle that instance data. For example, by right clicking on "hello.txt" you might get a list that includes the classes "notepad.exe", "edit.com" and "firefox.exe".

SpyderTL wrote:

For real "Files" downloaded or copied from other file systems, they will probably be treated as a generic File object, and perhaps have specific subclasses (Mpeg3AudioFile, AdobePdfFile, HtmlFile, etc.) Those classes would have methods that would extract/convert the data into the actual objects (AudioClip, Document, WebPage, etc.)

Alternatively; you could have special classes that can use the downloaded file "as is", to avoid the need for conversion.

SpyderTL wrote:

Some ideas that I've been playing with in my head:

Catalog - A collection (array, linked list, etc.) of object pointers, or object structs with pointers to their data. This would be a master table containing every object on the disk (or in memory). Could be used for garbage collection / defragmenting fairly easily. Would be slow to scan through looking for specific objects, but would only be used as a last resort.

That's a good idea. You could even have an "object manager" that displays the contents of a collection, where each piece of instance data in the collection is represented by an icon. For command line, you could have a command to list the objects in a collection by name (e.g. "ls myCollectionName").

Of course a collection would be able to hold other collections; and (for the GUI) the "object manager" would let you move from one collection to another by clicking its icon (e.g. if it's showing the contents of "C:\myCollection", you could click on the icon for "C:\myCollection\mySubCollection" to see the contents of that). For command line, you could even have a utility called "tree" that shows you the tree of collections.

SpyderTL wrote:

Indexes - Collection of calculated values, and pointers to objects, sorted array, or paged tables, or binary tree, etc. Could be used to quickly find objects of a specific type, or specific name, or any other field or calculated value. (objects larger than 1MB, for example).

That's a good idea. However, you should also have some sort of dialog box that administrators can use to control the object indexing. For a rough example, maybe something like this:

Attachment: index.png

SpyderTL wrote:

Application Catalogs - In addition to disk catalogs and memory catalogs, applications may have their own catalog of objects, so that an entire application can be shut down and all of its local objects and memory can be freed at once.

That's a good idea too. Each application could have its own catalogue for its own files (e.g. the application "foo" could have a collection called "C:\Programs\foo").

SpyderTL wrote:

Classes - Classes are objects of a specific type that contain size, field location and method description information for all objects of that type. Of course, they are stored on disk, like all other objects, but they must be loaded into memory before they can be used (although leaving them on disk would be an interesting exercise).

You could support "memory mapped classes", where the class remains on disk but "pages" are loaded into memory by the OS when that page of the class is first accessed. Of course you could do the same for instance data too.
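For comparison, on an existing POSIX system this kind of demand paging is what `mmap` provides: the file stays on disk and pages are brought in when they are first touched. The sketch below is an analogy using that existing API, not a proposal for how the new OS would implement it; the function name `map_file_readonly` is an assumption made for the example.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only; returns NULL on failure.
 * Pages are loaded by the OS when they are first accessed. */
static const void *map_file_readonly(const char *path, size_t *length)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *length = (size_t)st.st_size;
    return p;
}
```

The caller releases the mapping with `munmap` when the class or instance data is no longer needed.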

Cheers,

Brendan

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

iansjack wrote:

I would be more interested in a file system that emulated the OS/400 idea of single-level storage.

I did some digging on OS/400, and in the process found the BeFS/Haiku/AtheOS group of file systems, and also some information about WinFS, ReFS and GNOME Storage. They all look promising for storing objects with properties, instead of just storing bytes.

I'll look through the structures for these file systems over the next few days and see if any of them can be used directly, or if I can "borrow" some ideas from them.

embryo · **Posted:** Mon Jun 16, 2014 12:46 am

SpyderTL wrote:

one of the things that I want to redesign from scratch is the File System.

Here are some of the features I'm trying to include:

Object based - Instead of "files", everything stored on disk will be an object of some type. The details about that type (fields, methods, inheritance, etc.) may be located on a different volume (if possible).

Indexes - ...

The main advantage of information storage is its ability to give us full information about the stored information (metadata). The internal organization of the storage is absolutely irrelevant to an end user, but the ability to provide rich metadata is important. You have started from the internals without any thought about the "externals" (the end user), and as a result the existing object world has swallowed you whole and keeps you in a kind of darkness.

My bet is that the file system of a bright future is a metadata processing system. And a future OS will have artificial intelligence.

(And this comment can be considered negative, so it can be ignored.)

AndrewAPrice · **Posted:** Mon Jun 16, 2014 7:56 am

A while ago, I had this idea that files wouldn't contain a single 'stream' of bytes as they do now, but they would have multiple streams: http://forum.osdev.org/viewtopic.php?f=15&t=15615&p=113093#p113093

Kind of like:

Code:

// Hypothetical three-argument fopen(): the third argument names the stream to open.
FILE *desc     = fopen("painting.img", "r", "description-data");     // opens the 'description' stream, maybe a document of what's in the image
FILE *desc_enc = fopen("painting.img", "r", "description-encoding"); // ascii text? utf8 text? markdown? rtf? pdf? word document? odf?
FILE *img      = fopen("painting.img", "r", "image-data");           // opens the 'image-data' stream, maybe a jpeg compressed file
FILE *img_enc  = fopen("painting.img", "r", "image-encoding");       // jpeg? bmp? png? svg?
FILE *img_res  = fopen("painting.img", "r", "image-resolution");     // 12x12px, 3 inch x 3 inch, etc

So a file would essentially be a hash table of streams or something.

But, could you store array data in a file? For example, if you had a PDF file, you may want to embed fonts, something like:

Code:

FILE *f = fopen("painting.img", "r", "description-embedded-font-1");
FILE *f = fopen("painting.img", "r", "description-embedded-font-2");
FILE *f = fopen("painting.img", "r", "description-embedded-font-3");

Would it be wise to wrap this around some iterable data structure? For example, allow you to iterate over "description-embedded-font"? Perhaps some kind of 'array', for example:

Code:

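// fcountstreams() and the three-argument fopen() are hypothetical.
size_t embedded_fonts = fcountstreams("painting.img", "r", "description-embedded-font-*");
for (size_t i = 1; i <= embedded_fonts; i++) {
  char name[64];
  snprintf(name, sizeof name, "description-embedded-font-%zu", i); // build the stream name
  FILE *font = fopen("painting.img", "r", name);
  /* ... use font ... */
  fclose(font);
}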

A similar method could handle pattern matching on textual stream names.
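To make the pattern-matching idea concrete, here is a minimal sketch that treats a file as a table of named streams and counts the names matching a pattern. Everything in it is an assumption for illustration: the `stream` struct, the function names, and the rule that the only wildcard is an optional trailing `*`. It uses a plain array and a linear scan rather than a real hash table to stay short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of a multi-stream file:
 * a table of named streams. */
struct stream {
    const char *name;
    const void *data;
    size_t      size;
};

/* Match a name against a pattern whose only wildcard is an
 * optional trailing '*'. */
static int stream_name_matches(const char *pattern, const char *name)
{
    size_t n = strlen(pattern);
    if (n > 0 && pattern[n - 1] == '*')
        return strncmp(pattern, name, n - 1) == 0;  /* prefix match */
    return strcmp(pattern, name) == 0;              /* exact match */
}

/* Count the streams whose names match the pattern. */
static size_t count_streams(const struct stream *streams, size_t count,
                            const char *pattern)
{
    size_t matches = 0;
    for (size_t i = 0; i < count; i++)
        if (stream_name_matches(pattern, streams[i].name))
            matches++;
    return matches;
}
```

With streams named "description-embedded-font-1" and "description-embedded-font-2" in the table, the pattern "description-embedded-font-*" would count both.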

sortie · **Joined:** Wed Mar 21, 2012 3:01 pm **Posts:** 930

You just reinvented Resource Forks. The concept is interesting and can be useful, but has one main problem: Interaction with dumb tools. What should a tool like cp(1) do when it encounters such a file? What about mv(1)? How can a user upload such a file and share it over HTTP? Windows actually has these and uses them to store information such as the origin of .exe files (which is how it knows to warn you when you execute a .exe file you downloaded from the internet) - but you don't necessarily want that information to propagate along with the file if you copy it.

The idea of having multiple streams can still be useful, however. A reasonable compromise could be to layer it above traditional filesystem semantics:

Code:

FILE* traditional_fp = fopen("foo", "r");
STREAM_FILE* stream_file = stream_open_from_file(traditional_fp, "description-data");

Or any other way of realizing that. You could actually do the opposite, at least in theory, where the traditional API is layered above the stream API: If you try to open a file with multiple streams and the file is on a stream-aware filesystem, it is transformed into a canonical byte representation that can be losslessly stored on a dumb filesystem or transferred over the wire.

Combuster · **Posted:** Mon Jun 16, 2014 9:01 am

sortie wrote:

What should a tool like cp(1) do when it encounters such a file? What about mv(1)?

Not be there in the first place because it's unix and not nearly modern enough? :mrgreen:

sortie · **Joined:** Wed Mar 21, 2012 3:01 pm **Posts:** 930

Combuster wrote:

sortie wrote:

What should a tool like cp(1) do when it encounters such a file? What about mv(1)?

Not be there in the first place because it's unix and not nearly modern enough? :mrgreen:

Now you are just missing the point. :-)

AndrewAPrice · **Posted:** Mon Jun 16, 2014 9:07 am

Thanks for the link on resource forks.

sortie wrote:

What should a tool like cp(1) do when it encounters such a file? What about mv(1)? How can a user upload such a file and share it over HTTP?

Yep, traditional tools would have to be redesigned.

For transferring over conventional protocols, you could 'box' files - you hinted at it here:

sortie wrote:

it is transformed into a canonical byte representation that can be losslessly stored on a dumb filesystem or transferred over the wire.

'Boxing' a file flattens all streams into a single stream, while 'unboxing' a file extracts a single stream into many. You could have a file system flag indicating if a file is boxed or not.
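As a rough illustration of boxing, the sketch below flattens a set of named streams into one buffer as length-prefixed records. The record layout, the `named_stream` struct and the function names are all assumptions made for this example, not a format proposed in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical boxed layout (an assumption): for each stream,
 *   [name length: 4 bytes LE][name][data length: 4 bytes LE][data] */
struct named_stream {
    const char          *name;
    const unsigned char *data;
    uint32_t             size;
};

/* Write a 32-bit value as little-endian bytes. */
static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)v;
    p[1] = (unsigned char)(v >> 8);
    p[2] = (unsigned char)(v >> 16);
    p[3] = (unsigned char)(v >> 24);
}

/* Number of bytes the boxed form will occupy. */
static size_t boxed_size(const struct named_stream *s, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += 4 + strlen(s[i].name) + 4 + s[i].size;
    return total;
}

/* Flatten all streams into out, which must hold boxed_size() bytes.
 * Returns the number of bytes written. */
static size_t box_streams(const struct named_stream *s, size_t count,
                          unsigned char *out)
{
    unsigned char *p = out;
    for (size_t i = 0; i < count; i++) {
        uint32_t name_len = (uint32_t)strlen(s[i].name);
        put_le32(p, name_len);
        p += 4;
        memcpy(p, s[i].name, name_len);
        p += name_len;
        put_le32(p, s[i].size);
        p += 4;
        memcpy(p, s[i].data, s[i].size);
        p += s[i].size;
    }
    return (size_t)(p - out);
}
```

Unboxing would walk the same records in order, reading each length prefix to find where the name and the data end.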

Even more complex - what if the streams were iterable trees? I'll use pseudo-JSON to describe what I mean:

Code:

"painting.img" = {
  description: {
    data: "<raw data here>",
    encoding: "pdf",
    embedded-fonts: [
       "Comic Sans MS": {
          data: "<true-type data here>",
          encoding: "true-type"
       },
       "FixedSys": {
          data: "<bitmap data here>",
          encoding: "bitmap"
       }
    ]
  },
  image: {
    data: "<png data here>",
    encoding: "png",
    resolution: "320x128",
    size: "250x100mm",
    geotag: "<geocords here>",
    colour-profile: {
       data: "<icc profile data here>",
       encoding: "icc"
    }
  }
}

I don't see why you couldn't serialize/box this for transferring across a non-stream-aware medium.

Losing compatibility with existing applications (non-stream aware APIs) is not such an issue for me - since my OS is about starting everything from scratch (including my own language.) If compatibility is a thing for you, you could either allow the API to auto-box variables, or simply indicate one of the streams as a default stream.

Could your file system simply be a database? It might not be that hard to implement - you could probably port SQLite into your kernel, and point the database to a raw disk or partition instead of a file. Although I doubt SQL would be very efficient, I'd probably look into a NoSQL solution instead.

Alternatively...

sortie wrote:

A reasonable compromise could be to layer it above traditional filesystem semantics

You could store the database at the file level, instead of the FS level - removing the need to 'box' files and maintaining compatibility with existing file systems and software. I think it may be easier to optimise at the FS level simply because you could garbage collect as you defragment/compress your streams.

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

One of the ideas that I want to accomplish if at all possible is to have the same structures in memory as on disk for keeping track of objects, so that they can be easily moved or copied from one to the other. This almost mandates that all pointer references would have to be 64-bit byte level addresses... no blocks, no cylinders, no sectors, no heads.

In order to pull this off, I'm going to have to, essentially, emulate a stream/character device for what is actually a block device. I would need to wrap up all of the logic that takes an address, converts it to a block number and an offset, loads the block into memory (cache) and handles loading new blocks as needed (as individual bytes are read from the "stream"). I'm pretty sure this is how FileStreams in .NET and Java work.
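The translation described here could look roughly like the following sketch: a byte-addressed read that splits the address into a block number and an offset and keeps a single block in a cache. The 512-byte block size, the `read_block_fn` callback, the `byte_device` struct and the RAM-backed test device are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512u  /* assumed block size */

/* Hypothetical callback that reads one whole block; returns 0 on success. */
typedef int (*read_block_fn)(void *dev, uint64_t block, unsigned char *buf);

struct byte_device {
    read_block_fn read_block;
    void         *dev;
    uint64_t      cached_block;       /* which block is in the cache */
    int           cache_valid;
    unsigned char cache[BLOCK_SIZE];  /* single-block cache */
};

/* Read len bytes starting at a 64-bit byte address; returns 0 on success. */
static int byte_read(struct byte_device *bd, uint64_t address,
                     void *out, size_t len)
{
    unsigned char *dst = out;
    while (len > 0) {
        uint64_t block  = address / BLOCK_SIZE;            /* block number */
        size_t   offset = (size_t)(address % BLOCK_SIZE);  /* offset in block */
        size_t   chunk  = BLOCK_SIZE - offset;
        if (chunk > len)
            chunk = len;

        if (!bd->cache_valid || bd->cached_block != block) {
            bd->cache_valid = 0;  /* a failed read leaves the cache undefined */
            if (bd->read_block(bd->dev, block, bd->cache) != 0)
                return -1;
            bd->cached_block = block;
            bd->cache_valid = 1;
        }
        memcpy(dst, bd->cache + offset, chunk);
        dst += chunk;
        address += chunk;
        len -= chunk;
    }
    return 0;
}

/* Example backend for trying the code out: a RAM buffer treated
 * as a block device. */
static int ram_read_block(void *dev, uint64_t block, unsigned char *buf)
{
    memcpy(buf, (unsigned char *)dev + block * BLOCK_SIZE, BLOCK_SIZE);
    return 0;
}
```

Writes would need the reverse path: load the block, change the affected bytes in the cache, then write the whole block back, because the device still only transfers whole blocks.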

So what do you guys think about applying this to the entire file system? All addresses would be 64-bit offsets from the beginning of the device, and all data access would translate between bytes and blocks on the fly.

Can you guys think of any technical reason why this would not work?

Wouldn't this effectively eliminate the wasted "slack space" that you get on most (if not all) other file systems?
