OSDev.org

The Place to Start for Operating System Developers
 Post subject: Asynchronous mess of a VFS..
PostPosted: Tue Apr 24, 2007 7:24 am 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
So it seems my quest for ultimate asynchronicity is about to cause the Voix VFS to get rewritten again (that'd be VFSv3 then), and since the design is getting pretty nasty, I thought I'd subject it to comments here. Basically, the idea is to make all I/O and filesystem operations asynchronous. I'm going to start by making reads/writes asynchronous first, because that's required for the rest anyway.

Since this is going to result in rather ugly code with callbacks, I'm trying to abstract the stuff as much as possible, so that I don't need to special-case any more than I have to. So I've basically come up with the following ideas so far:

1. Any I/O is done on abstract files, which are represented by block-lists.

2. It is an FS-specific issue how to come up with a block-list. The VFS (or the FS itself) just asks for a list of N blocks starting from block M. This allows either proper inodes or FAT-style linked lists, but also handles free-space bitmaps and other special filesystem areas as "abstract files" (block-lists).

3. To do I/O on a device, one tells what block one wants, what function to call with the block, and what additional data to pass to the function (there's a rough sketch of this below).

4. To do I/O on an abstract file (well, vnode mostly) one asks the relevant filesystem for a block-list, and gives it a function to call and additional data to pass to that function. Coming up with a block-list might require the FS to do some other I/O, but it can do so, as long as it keeps the original callback-data as part of the "additional info" of its own callback.

5. We don't actually call any of the callbacks directly, but magically cause them to be called just before exit from the kernel (as delayed procedure invocations, as they are sometimes called). This avoids stack growth (recursion) when the blocks are already in cache.

6. Magically the user process gets informed when the operation either completes or fails.

6b. Even more magically the user process might cancel the operation before it has completed.


Does this sound general enough?
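
To make points 3 and 4 concrete, here's a rough sketch of what the interfaces might look like (all names here are made up, nothing final):

Code:
#include <stddef.h>
#include <stdint.h>

/* Point 3: device I/O names the block wanted, the function to call
   with the block, and extra data to pass to that function. */
typedef void (*block_callback)(void *block, int status, void *extra);

void device_request_block(struct device *dev, uint64_t block_nr,
                          block_callback fn, void *extra);

/* Point 4: the FS resolves N blocks starting from block M for an
   abstract file (vnode) and calls back with the block-list. If the
   FS needs its own I/O first, it chains the original (fn, extra)
   through the "extra" of its own callback. */
typedef void (*blocklist_callback)(struct blocklist *bl, int status,
                                   void *extra);

void fs_get_blocklist(struct vnode *vn, uint64_t first_block,
                      size_t count, blocklist_callback fn, void *extra);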

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Tue Apr 24, 2007 8:36 am 

Joined: Tue Oct 17, 2006 6:06 pm
Posts: 1437
Location: Vancouver, BC, Canada
I still don't understand why you're using callbacks instead of a message-oriented API. Is it just that you want it to look POSIX-ish?

_________________
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!


 Post subject:
PostPosted: Tue Apr 24, 2007 9:10 am 

Joined: Thu Apr 05, 2007 6:07 am
Posts: 214
http://www.bullopensource.org/posix/ + http://www.namesys.com/v4/v4.html ?


 Post subject:
PostPosted: Tue Apr 24, 2007 10:51 am 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
I don't wanna use thread-per-pending-request since that doesn't really scale, and the whole point of asynchronous I/O is specifically avoiding the need for thread-per-pending-request, because otherwise you could just block the threads and none of the complications would exist. That is how my VFSv2 works.

So I could use a filesystem thread and message with that, right? But that doesn't buy anything, because the filesystem thread would need to use state machines or callbacks or something anyway, if it wanted to avoid blocking the whole filesystem simply because somebody decided to access a floppy. Even worse, while the floppy is loading something, I might be able to serve a lot of stuff from cache..

I didn't mean that I'm doing callbacks to userspace, though. That I'll do with a form of messaging, not too different from the Linux epoll mechanism. It's just internal to the kernel that I need to deal with callbacks, if I want to avoid the thread-per-pending-request stuff, and that's the very goal I set myself when I started writing the system. 8)

The target is a system where no thread ever blocks on anything, unless it wants to. That means there's never any reason not to kill a thread. There's never any reason for the user not to be able to cancel an operation that seems to hang. And that means a busy web server will never need more than one thread per processor, even if it loaded some of its files from floppies. :D

This will actually get me pretty damn far from POSIX. Sure, I can write a library which provides a degenerate POSIX-like API. But if you issue an open() in POSIX, I'd like to see how you do something useful (in the same thread) while the kernel is trying to resolve the path you gave from a dead NFS server. Still trying, mate. Still trying. Sure, POSIX has aio_read/aio_write, but that's roughly equivalent to just the first step toward what I'm aiming for.
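
To illustrate the epoll-ish idea, userspace might end up looking vaguely like this (completely hypothetical names, just a sketch):

Code:
/* Hypothetical userspace view: submit asynchronously, keep working
   in the same thread, then collect completions epoll-style. */
void example(void) {
    int q  = event_queue_create();              /* completion queue */
    int op = async_open(q, "/mnt/nfs/path", OPEN_READ);

    do_other_useful_work();    /* runs while the path resolves */

    struct io_event ev;
    if (event_queue_wait(q, &ev) == 1 && ev.id == op)
        handle_open_result(ev.status, ev.handle);
}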

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Tue Apr 24, 2007 12:57 pm 

Joined: Tue Oct 17, 2006 6:06 pm
Posts: 1437
Location: Vancouver, BC, Canada
mystran wrote:
I don't wanna use thread-per-pending-request since that doesn't really scale, and the whole point of asynchronous I/O is specifically avoiding the need for thread-per-pending-request


Why not use a thread pool? (Sorry if this has been suggested before).

I'm thinking in terms of a microkernel though, which may not match your OS architecture...

Quote:
I didn't mean that I'm doing callbacks to userspace, though. That I'll do with a form of messaging, not too different from the Linux epoll mechanism. It's just internal to the kernel that I need to deal with callbacks, if I want to avoid the thread-per-pending-request stuff, and that's the very goal I set myself when I started writing the system.


Ok, that makes more sense now. Callbacks to user space give me the willies. :)

Quote:
The target is a system where no thread ever blocks on anything, unless it wants to.


I think "I want to receive a message now" qualifies as wanting to block...

_________________
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!


 Post subject:
PostPosted: Wed Apr 25, 2007 2:02 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

I'm planning a similar thing for my OS - for me everything uses asynchronous messaging, so fully asynchronous file I/O is natural. I partially implemented it ages ago (in an old version of my OS), and it worked fairly well.

For a (very rough and dodgy) example:

Code:
void myVFS(void) {
    for(;;) {
        getMessage();
        switch(message.type) {
            case REQUEST_OPEN:
                handleOpen();
                break;
            case FS_RESPONSE_DIRECTORY:
                handleFSdirectory();
                break;
            case REQUEST_CLOSE:
                handleClose();
                break;
            case REQUEST_READ:
                handleRead();
                break;
            case FS_RESPONSE_READ:
                handleFSread();
                break;
        }
    }
}


void handleOpen(void) {
    DIR_INFO *directoryInfo;

    directoryInfo = findFileInfoInDirectoryCache(message.fileName);
    if(directoryInfo != NULL) {
        if(directoryInfo->status != PENDING) {
            satisfyOpenRequest(directoryInfo, message.sender, message.openType);
            return;
        }
    } else {
        /* not cached yet; mark it pending and ask the file system */
        createFileInfoInDirectoryCache(message.fileName, PENDING);
        sendMessage(file_system, FS_REQUEST_DIRECTORY, message.fileName);
    }
    createOPENrequestStructure(message.sender, message.fileName, message.openType);
}


void satisfyOpenRequest(DIR_INFO *directoryInfo, THREAD_ID client, int openType) {
    int status;
    int handle;

    status = checkFilePermissionsAndStuff(directoryInfo, openType);
    if(status != OK) {
        sendMessage(client, RESPONSE_OPEN, status);
        return;
    }
    handle = createFileHandle(client, directoryInfo, openType);
    sendMessage(client, RESPONSE_OPEN, OK, handle);
}


void handleFSdirectory(void) {
    DIR_INFO *directoryInfo;
    OPEN_REQUEST_STRUCTURE *request;

    insertReceivedDirectoryDataIntoDirectoryCache();

    request = firstOPENrequestStructure;
    while(request != NULL) {
        directoryInfo = findFileInfoInDirectoryCache(request->fileName);

        if(directoryInfo != NULL) {
            if(directoryInfo->status != PENDING) {
                satisfyOpenRequest(directoryInfo, request->client, request->openType);
                /* a fuller version would also unlink the satisfied request here */
            }
        }
        request = request->next;
    }
}


void handleClose(void) {
    int status;

    status = destroyFileHandle(message.sender, message.fileHandle);
    sendMessage(message.sender, RESPONSE_CLOSE, status, message.fileHandle);
}


void handleRead(void) {
    int status;
    int handle;
    void *data;
    int size;

    status = findFileHandle(&handle, message.sender, message.fileHandle);
    if(status != OK) {
        sendMessage(message.sender, RESPONSE_READ, status, message.fileHandle);
        return;
    }
    status = getFileData( getHandleFileName(handle), &data, &size);
    if(status == OK) {
        sendMessage(message.sender, RESPONSE_READ, OK, data, size);
        return;
    }
    sendMessage(file_system, FS_REQUEST_READ, getHandleFileName(handle));
    createREADrequestStructure(handle);
}


void handleFSread(void) {
    READ_REQUEST_STRUCTURE *request;
    int status;
    void *data;
    int size;

    insertReceivedFileDataIntoFileDataCache();

    request = firstREADrequestStructure;
    while(request != NULL) {
        status = getFileData( getHandleFileName(request->handle), &data, &size);
        if(status == OK) {
            sendMessage(request->client, RESPONSE_READ, OK, data, size);
            /* a fuller version would also unlink the satisfied request here */
        }
        request = request->next;
    }
}


There are no callbacks (the VFS and file systems all run as separate processes in user-space). There is a state machine, but I can't see how you'd keep track of multiple requests without keeping track of the state of each request. File systems can be simple (when one receives a request from the VFS it does that request, sends a response and waits for the next request) - no state machine or caching needed.

It also easily supports nesting. For example, the hard disk driver could be responsible for the file "/dev/hda", which might be mounted by "file system A" at "/mnt/fsA". File system A might contain the file "/mnt/fsA/someDiskImage" which might be mounted by "file system B" at "/mnt/fsB". In this case an application can ask the VFS to read data from the file "/mnt/fsB/foo", the VFS would ask file system B to get the data, file system B would ask the VFS to read data from "/mnt/fsA/someDiskImage", the VFS would ask file system A to get the data, file system A would ask the VFS to read data from "/dev/hda" and the VFS would ask the hard disk driver to get the data. In this case it'd be good if file system code used a special "don't cache" mode when it opens the file it's mounting to prevent the same data being cached in different places in the VFS cache.

You could make the VFS multi-threaded (e.g. one thread per CPU), but IMHO if you've got the VFS, file systems and disk drivers all implemented as separate processes then most of the work can already be spread across other CPUs, and there isn't much point complicating the VFS itself with re-entrancy locking (and lock contention), race conditions, etc. If there's a huge amount of RAM and almost all requests can be satisfied from the VFS cache, then it might be worthwhile having a multi-threaded VFS (but then you'd have to wonder who's making all the requests and what they're doing with all the data).

Of course if you succeed you could have problems with synchronous file I/O. For example, when a process accesses a memory mapped file it needs to block until that data is read from disk, which can be tricky when the file I/O is entirely asynchronous (waiting for the right message to arrive within the page fault handler?). Of course memory mapped files have other problems too because there's no reasonable way for processes to do their own error handling (e.g. Windows generates a "blue screen of death" when it gets a read error in a memory mapped file).


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject:
PostPosted: Wed Apr 25, 2007 12:04 pm 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Colonel Kernel wrote:
I think "I want to receive a message now" qualifies as wanting to block...


Depending on how you look at it, yes or no. If you mean "I don't have anything else to do until a message arrives", then yeah, it qualifies.

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Wed Apr 25, 2007 12:32 pm 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Brendan wrote:
There are no callbacks (the VFS and file systems all run as separate processes in user-space). There is a state machine, but I can't see how you'd keep track of multiple requests without keeping track of the state of each request.


Yeah well, my point about callbacks is that personally I find it more flexible to keep the state machine in the form of callbacks. If you prefer an explicit state machine, it's really just an implementation detail.

Quote:
File systems can be simple (when one receives a request from the VFS it does that request, sends a response and waits for the next request) - no state machine or caching needed.


Filesystems still need state and caching if you don't want to block the filesystem either. I mean, my purpose is not only to have the VFS layer be fully asynchronous, but also the specific filesystems to be fully asynchronous, such that if a FAT driver needs to load a given cluster of the infamous file allocation table, it won't block itself on device I/O while doing so. Rather, it just tells the device to wake it up when the block arrives.

Quote:
You could make the VFS multi-threaded (e.g. one thread per CPU), but IMHO if you've got the VFS, file systems and disk drivers all implemented as separate processes then most of the work can already be spread across other CPUs, and there isn't much point complicating the VFS itself with re-entrancy locking (and lock contention), race conditions, etc.


Actually I'm looking to do a "zero-thread" VFS, which does all its work from deferred function calls (generalized bottom halves). My kernel panics the moment a deferred function call tries to call any function that could potentially block, so maybe that gives you an idea. :)

In fact, there are three different contexts my kernel currently knows about:

1. normal context which can block all it wants, and normally runs with interrupts enabled
2. interrupt context which isn't allowed to block and must always run interrupts disabled
3. deferred context, which isn't allowed to block, but normally runs with interrupts enabled

Most VFS functionality should go into the deferred context.
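
For illustration, the context check can be as simple as this (sketch; not my actual code):

Code:
/* Sketch: tag the current context so blocking primitives can catch
   misuse. A real kernel would track this per CPU. */
enum context { CTX_NORMAL, CTX_INTERRUPT, CTX_DEFERRED };
static enum context current_context = CTX_NORMAL;

void assert_may_block(void) {
    if (current_context != CTX_NORMAL)
        panic("blocking call from interrupt or deferred context");
}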

Quote:
Of course if you succeed you could have problems with synchronous file I/O.


Not really. When a system call or exception arrives in the kernel, it's in normal context. It can allocate a small event queue (for a single event), start the synchronous I/O operation with reporting set to this temporary queue, then block on that queue, and return to userspace once the operation completes.
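
As a sketch (made-up names):

Code:
/* Synchronous read layered on the asynchronous core: start the
   operation with a single-slot event queue as its report target,
   then block on that queue (allowed in normal context). */
int sync_read(struct vnode *vn, uint64_t block_nr, void *buf) {
    struct event_queue q;
    struct event ev;

    event_queue_init(&q, 1);            /* room for one event */
    async_read(vn, block_nr, buf, &q);  /* report completion to q */
    event_queue_block(&q, &ev);         /* sleep until it arrives */
    return ev.status;
}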

Quote:
Of course memory mapped files have other problems too because there's no reasonable way for processes to do their own error handling (e.g. Windows generates a "blue screen of death" when it gets a read error in a memory mapped file).


Well, there are basically two reasonable ways to handle this. One is to fake memory-mapped files by having every page fault generate a signal, and have the signal handler request the page. This is a good solution if you want to get memory management out of a micro-kernel as much as possible.

Alternatively, you could send the signal only if a read fails, and let the application figure out what to do with the issue. Say, in some cases it could just tell the system to put some zero pages in there to get rid of the fault, and set a flag somewhere that the main process can then check to see that it didn't get the actual contents.

Default action for such a signal would be to kill the process, and if you got a double fault from trying to send the signal, you'd kill the process anyway. Of course it's naive to hope for programs to actually deal with such issues other than in some special cases, but blue-screening seems a bit too extreme to me..
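
In userspace, handling such a signal might look vaguely like this (pure sketch, hypothetical API):

Code:
/* Sketch: a handler for a hypothetical "mapped read failed" signal.
   It papers over the fault with zero pages and leaves a flag for
   the main code to check later. */
static volatile int lost_mapped_data = 0;

void on_map_read_error(void *fault_address) {
    lost_mapped_data = 1;                  /* main code checks this */
    map_zero_pages(fault_address, 1);      /* hypothetical syscall */
}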

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Wed Apr 25, 2007 6:18 pm 

Joined: Tue Nov 09, 2004 12:00 am
Posts: 843
Location: United States
I thought about that method of continuation you talked about. It actually seems like a good idea, although I have never actually implemented it, so you cannot take my word for it.

However, I got to thinking about the parts of a request being:

[request]--->[execution]--->[result]

A normal thread code path for asynchronous I/O would break off after the [request] phase, while the device performs the actual [execution] or transfer; finally the driver handles the [result] phase at the end. Synchronous I/O, by contrast, blocks until the result phase is completed.

Of course interrupts remove the execution part, unless you are forced to do some sort of polling during the execution, which I doubt.

So everything seems fine and dandy until I take a closer look at the internals of the request part. I would think that at some point two or more threads would collide inside the request portion of the timeline. This portion would most likely vary in execution time, since the code path runs through different filesystem drivers and instances of device drivers. Still, collisions remain quite likely, for example when doing a lot of I/O on the root device.

The only way to tame this problem is with mutual exclusion. The trouble is that this presents a blocking operation of sorts: the thread never knows exactly how long the request portion could take. If you consider the scheduler, you might run into starvation problems where a slightly higher-priority thread keeps a lower one blocked for a significant amount of time. There are ways to fix this, of course, so I'm not saying it's impossible; I'm sure it's already done in many places.

But on the other side you could say, "So what?". Well, I said that too, then got to thinking that the idea is to keep all data buses in the system at one hundred percent utilization whenever any work exists. With this in mind I conclude that starvation could turn into an inefficiency problem rather than just one thread being treated unfairly.

----
You might want some sort of intelligent decision that checks the requests of other threads to find one that wants to read or write a block closer to where the disk head just left, which would help decrease latency and potentially allow your kernel to keep the data bus at full utilization instead of wasting bus cycles.
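
For instance (just a sketch with made-up names):

Code:
#include <stdint.h>

/* Sketch: pick the pending request whose block is nearest to where
   the disk head last was (shortest-seek-first); purely illustrative. */
struct request *pick_next(struct request *pending, uint64_t head_pos) {
    struct request *best = NULL;
    uint64_t best_dist = ~(uint64_t)0;

    for (struct request *r = pending; r != NULL; r = r->next) {
        uint64_t dist = (r->block > head_pos) ? r->block - head_pos
                                              : head_pos - r->block;
        if (dist < best_dist) {
            best_dist = dist;
            best = r;
        }
    }
    return best;
}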

And you might run into a situation where true asynchronous I/O does not actually exist because of locking operations and unfair scheduling complications.

\\\\edit\\\\
I think I ended up sliding right alongside another post above regarding the second issue, that true asynchronous I/O might not actually exist if not done correctly.
Brendan wrote:
You could make the VFS multi-threaded (e.g. one thread per CPU), but IMHO if you've got the VFS, file systems and disk drivers all implemented as separate processes then most of the work can already be spread across other CPUs, and there isn't much point complicating the VFS itself with re-entrancy locking (and lock contention), race conditions, etc. If there's a huge amount of RAM and almost all requests can be satisfied from the VFS cache, then it might be worthwhile having a multi-threaded VFS (but then you'd have to wonder who's making all the requests and what they're doing with all the data).


And just to be clear, I am not dogging your plan; not but a couple of days ago I had the same idea, until I came to this conclusion.

_________________
Projects


 Post subject:
PostPosted: Thu Apr 26, 2007 7:29 pm 

Joined: Tue Nov 09, 2004 12:00 am
Posts: 843
Location: United States
Quote:
1. Any I/O is done on abstract files, which are represented by block-lists.


This idea I have decided to use myself. I ran into a problem using a thread in my VFS, where it would not be able to access a user-space data buffer except by switching address spaces constantly. So within the past few hours I decided to use the same method, where instead a process writes directly to the file's blocks in memory.

I also concluded that I should support a flush operation on file blocks, such that:

Code:
int _write(int fd, unsigned char *buffer, unsigned int size, unsigned int flags){
    /// lookup where the page is in memory for the current file position in
    /// the file referenced by the file descriptor identifier.
    ......
    /// memcpy the data onto this page.
    ......
    /// flush the block pages that we dirtied.
    .....
}
int write(int fd, unsigned char *buffer, unsigned int size){
    /// might support a NO_FLUSH flag or such.
    return _write(fd, buffer, size, 0);
}


So I think it was possibly a good idea to do things that way. I say possibly because I have yet to implement it completely, to the point where I can actually see whether it is viable.


Quote:
6. Magically the user process gets informed when the operation either completes or fails.


I think in this case I might offer two methods, chosen when the request is actually submitted to the kernel:

1. The result will be pushed into a queue that can be accessed or checked by the user process.
2. The result will not be pushed into a queue; instead a thread will be spawned to enter a callback function designated by the user process. Potentially a unique callback for each request.

So generally, once a request is issued by an asynchronous I/O function, the only way to not generate a page fault is to use one of the higher-level standard library functions for file access as a stream (above), which calculates where the block was mapped into the address space using the file descriptor. In theory, though, you should also be able to sleep for the result and access the virtual address directly from the queue, skipping the standard library if you liked.. I dunno.

In some cases the user process might never need to be informed of a result for quite a few calls. After so many, it might check the queue to determine whether any have failed and why, so as to make a recovery if possible. In other cases it could check a variable indicating whether and how many requests had a failed result code, to determine whether an error-recovery scheme needs to be executed or whether to just fail completely and die.

_________________
Projects


 Post subject:
PostPosted: Thu Apr 26, 2007 11:55 pm 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Kevin McGuire wrote:
I think in this case I might offer two methods, chosen when the request is actually submitted to the kernel:

1. The result will be pushed into a queue that can be accessed or checked by the user process.
2. The result will not be pushed into a queue; instead a thread will be spawned to enter a callback function designated by the user process. Potentially a unique callback for each request.



Actually, my idea was to have something like the queue; I already have support for edge-triggered event queues.

But I don't like your method 2. Spawning a thread is excessive, and I think it kinda defeats the purpose of not spawning a thread in the first place. You can do callbacks without having a separate thread, after all: you interrupt the original user thread kinda like you'd do for a signal, make it run the callback, and then return as if from an interrupt.

The downside of the callback in the same thread is that if one allows such callbacks to be run at any time, one needs lock/wait-free methods for getting the result to the main thread, which can be tricky. They can be made safe, though, either by letting userspace explicitly disable them at times, or by only allowing them when userspace calls certain "safepoint" functions. Of course with safepoint functions it's not too different from using a queue in the first place.. :)

So I personally believe that the queue approach is the only really good one.
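
With safepoints it basically degenerates into draining a queue anyway; something like (hypothetical names):

Code:
/* Sketch: a "safepoint" drains pending completions and runs their
   callbacks in the caller's own context; no extra thread needed. */
void safepoint(void) {
    struct completion c;

    while (completion_queue_try_pop(&c))   /* non-blocking pop */
        c.callback(c.status, c.extra);
}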

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Fri Apr 27, 2007 8:22 am 

Joined: Thu Mar 08, 2007 11:08 am
Posts: 670
Actually, I thought about the cache thingie a bit more, and came to the conclusion that instead of preferring slow devices over fast devices, it might be a better idea to prefer devices with few blocks in cache over devices with lots of blocks. For floppy vs. hard disk this will normally give the same result, since while a floppy is slow, it's also small, and therefore unlikely to have more blocks in cache than a hard disk.

But there are other nice properties. Let's suppose no more memory is available, a hard disk has 10MB of cache, and a full floppy (1.44MB) is loaded into the cache. It makes sense to throw the floppy (or at least some of it) away much faster than if the hard disk had 100MB or maybe even 1GB of cache. It also gets rid of pathological cases where we trashed the whole cache reading a couple of huge files from a scratched DVD (with lots of retries needed), and it would otherwise take ages before the hard disk could get a healthy cache again.

It's also easy to implement: just do per-device caching, but when you run out of memory, look at each device's candidate block to be thrown away, get a score of some sort for each (for LRU that'd be the time since last use), scale it by the number of blocks the device currently has, and select the one that scored worst.
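
In rough code (sketch only, made-up names):

Code:
#include <stdint.h>

/* Sketch: each device offers its LRU candidate; score = idle time
   scaled by how many blocks the device holds; evict the worst. */
struct cache_block *pick_victim(struct device *devs, int ndevs,
                                uint64_t now) {
    struct cache_block *victim = NULL;
    uint64_t worst = 0;

    for (int i = 0; i < ndevs; i++) {
        struct cache_block *cand = lru_candidate(&devs[i]);
        if (cand == NULL)
            continue;
        uint64_t score = (now - cand->last_used) * devs[i].cached_blocks;
        if (score > worst) {
            worst = score;
            victim = cand;
        }
    }
    return victim;
}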

_________________
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.


 Post subject:
PostPosted: Sat Apr 28, 2007 6:48 am 

Joined: Tue Nov 09, 2004 12:00 am
Posts: 843
Location: United States
I like that better. The size factor.

_________________
Projects

