Re-usable virtual filesystem design

alnyan · **Posted:** Thu Mar 07, 2019 2:50 am

Hi everybody,

I'm designing a new VFS implementation for my kernel, and the part I find tricky is how drivers should transfer data to and from memory when reading or writing. The source/destination buffers may live in a task's page directory (which is a problem because, given only a pointer, the driver cannot tell which task it belongs to) or somewhere in kernel space.

Just for clarification - the memory map in my kernel is static: addresses below 0xC0000000 belong to userspace and everything above is kernel and possibly device mappings.
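As an illustration of what a static split like this allows, here is a minimal sketch of a range check against the 0xC0000000 boundary. The helper name `range_is_user` is hypothetical and 32-bit addresses are assumed; it only tells user space from kernel space, not which task a user pointer belongs to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Boundary taken from the post: everything below is userspace. */
#define KERNEL_BASE 0xC0000000u

/* Hypothetical helper: true if [addr, addr + len) lies entirely below
 * KERNEL_BASE. Addresses are modelled as 32-bit values (assumption). */
bool range_is_user(uint32_t addr, uint32_t len)
{
    if (addr >= KERNEL_BASE)
        return false;
    /* Compare against the room left below the boundary instead of
     * computing addr + len, which could wrap around. */
    return len <= KERNEL_BASE - addr;
}
```

A check like this would typically run once at the syscall boundary, before any driver sees the pointer.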

Could you give any advice? I suspect there are better methods that I haven't heard of or thought of yet.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1605

Here's my design: when a user reads from a file, the file's read method is called, which calls the FS's read method, which goes to the volume, which goes to the disk (usually; there could be a loopback device or LVM in there as well). At this point, there are two possibilities: Either the data is in the disk buffer, then the query can be answered directly, or it is not. In the latter case, a request for the data is put to the IO manager and the calling task is put to sleep uninterruptibly. The IO manager (a kernel task) will at some point deign to answer the request, which loads the data into kernel space. Then it wakes the thread.
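A rough sketch of that hit-or-request flow, simulated in plain C so it can run outside a kernel. Everything here is an assumption made for illustration: the names (`cache_lookup`, `ioman_submit`, `ioman_service_one`), the fixed-size cache and queue, and the fake disk that fills a sector with a pattern. It is not the actual design's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define CACHE_SLOTS 8
#define QUEUE_SLOTS 8

struct task { bool sleeping; };            /* stand-in for a real task struct */

struct cache_entry {
    bool valid;
    int disk;
    uint32_t lba;
    uint8_t data[SECTOR_SIZE];             /* lives in "kernel space" */
};

struct io_request { int disk; uint32_t lba; struct task *waiter; };

static struct cache_entry cache[CACHE_SLOTS];
static struct io_request queue[QUEUE_SLOTS];
static size_t q_head, q_count;

/* Disk buffer check: return the cached sector, or NULL on a miss. */
const uint8_t *cache_lookup(int disk, uint32_t lba)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].disk == disk && cache[i].lba == lba)
            return cache[i].data;
    return NULL;
}

/* Miss path: queue a request and mark the caller as sleeping. */
bool ioman_submit(int disk, uint32_t lba, struct task *waiter)
{
    if (q_count == QUEUE_SLOTS)
        return false;                      /* queue full */
    queue[(q_head + q_count) % QUEUE_SLOTS] =
        (struct io_request){ disk, lba, waiter };
    q_count++;
    waiter->sleeping = true;
    return true;
}

/* Stand-in for the driver: fill the sector with a recognizable pattern. */
static void fake_disk_read(int disk, uint32_t lba, uint8_t *out)
{
    memset(out, (int)((lba + (uint32_t)disk) & 0xFFu), SECTOR_SIZE);
}

/* One step of the IO manager task: load a sector, then wake the waiter. */
bool ioman_service_one(void)
{
    if (q_count == 0)
        return false;
    struct io_request req = queue[q_head];
    q_head = (q_head + 1) % QUEUE_SLOTS;
    q_count--;

    /* Trivial placement policy: overwrite whatever occupies the slot. */
    struct cache_entry *e = &cache[req.lba % CACHE_SLOTS];
    fake_disk_read(req.disk, req.lba, e->data);
    e->disk = req.disk;
    e->lba = req.lba;
    e->valid = true;

    req.waiter->sleeping = false;
    return true;
}
```

After the wake-up, the sleeping reader would retry `cache_lookup`, which now hits, and copy from the kernel buffer to the user buffer itself.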

All tasks have the same kernel-space mappings, so a virtual address in kernel space in the IO manager has the same meaning in the user task that called originally. So then the system call merely has to copy the new data from kernel space to user space. This copy has to take place, anyway, for safety reasons.

File mappings are similar, only the request comes from the page fault handler, not the read system call.

alnyan · **Posted:** Thu Mar 07, 2019 1:24 pm

nullplan wrote:

At this point, there are two possibilities: Either the data is in the disk buffer, then the query can be answered directly, or it is not.

Does this mean the driver only has to write to buffers in kernel space, and the copy to userspace will be handled by the IO manager instead? That seems like a good idea, thanks for the suggestion. I already had a similar design in which data going to/from the disk has to pass through a kernel-space buffer before reaching userspace.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1605

In my design, the IO manager directly calls into the driver. And the copy to user space is handled by the syscall or page fault (well, OK, the page fault just remaps the buffer page).

alnyan · **Posted:** Thu Mar 07, 2019 1:31 pm

I'm not sure my syscall handlers can immediately send a reply to the task. The idea I had is that when a syscall handler determines the operation would block, it puts the task to sleep and switches to some other task, leaving the operation running in the background (where it is then taken care of by the IO manager).

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1605

I think this might be a difference in design. I have one kernel stack per task. That way, I can leave one task sleeping while the IO manager fetches the data. When the IO manager is finished fetching the data, it just wakes the original thread, which can then return to userspace with the data.

Code:

ssize_t read(int fd, void* __user buf, size_t len) {
  /* Reject descriptors that are out of range or not open. */
  if ((unsigned)fd >= current->process->fdlim || !current->process->files[fd])
    return -EBADF;
  /* [a few layers further down] */
  /* Sleep until the IO manager has loaded the block into the disk buffer. */
  while (!(p = find_in_disk_buffer(disk, lba))) {
    current->flags |= TIF_SLEEP_UNINT;
    ioman_request_read(disk, lba, current);
    schedule();
  }
  /* [and back up] */
  return copy_to_user(buf, p, len);
}

The schedule() function stops the calling thread until it is scheduled again. The TIF_SLEEP_UNINT flag means the thread will not be scheduled again until the flag is cleared, which is why the ioman has to know about the calling thread, so it can wake it (by removing that task flag; the scheduler takes care of the rest).
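To make the flag mechanics concrete, here is a minimal sketch of a round-robin pick that skips flagged tasks, plus the wake-up step. `TIF_SLEEP_UNINT` is the name from the post; its value, the `struct task` layout, and the functions `pick_next` and `ioman_wake` are assumptions for illustration.

```c
#include <stddef.h>

#define TIF_SLEEP_UNINT 0x1u   /* name from the post; value is assumed */

struct task { unsigned flags; };

/* Round-robin: index of the next task after `cur` that is not in
 * uninterruptible sleep, or -1 if every task is sleeping. */
int pick_next(const struct task *tasks, size_t n, size_t cur)
{
    for (size_t step = 1; step <= n; step++) {
        size_t i = (cur + step) % n;
        if (!(tasks[i].flags & TIF_SLEEP_UNINT))
            return (int)i;
    }
    return -1;
}

/* Completion side: clearing the flag is the whole wake-up; the next
 * scheduler pass can select the task again. */
void ioman_wake(struct task *t)
{
    t->flags &= ~TIF_SLEEP_UNINT;
}
```

A real kernel would also need to guard against the wake-up arriving between setting the flag and calling schedule(); this sketch ignores locking and interrupts entirely.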

alnyan · **Posted:** Thu Mar 07, 2019 4:06 pm

nullplan wrote:

The schedule() function stops the calling thread until it is scheduled again. The TIF_SLEEP_UNINT flag means the thread will not be scheduled again until the flag is cleared, which is why the ioman has to know about the calling thread, so it can wake it (by removing that task flag; the scheduler takes care of the rest).

So this means the syscall itself gets suspended until the operation completes? My idea was that the syscall just selects a new task (i.e. calls schedule() for userspace tasks), suspending the current one. I, too, have a kernel stack per task, but it's intended only for storing the userspace task's context, and I didn't think a syscall could or should be interrupted (I do have kernel tasks, though).

alnyan · **Posted:** Thu Mar 07, 2019 4:11 pm

Just to clarify, here's the algorithm I thought of for reads from userspace (for kernel space, I think, it would just poll the status of the operation):

Code:

void sys_read(int fd, void *buf, size_t req, ssize_t *res) {
   fd_t *f = ...; // Look up the descriptor in current task's table
   // [ set result pointer in fd structure: it points to %eax on the task's
   //   kernel stack, i.e. the value the task gets from the syscall once resumed ]
   if (f->read_blocks) {
      // [ set task's status to waiting and start an IO operation using IO manager ]
      schedule();
   } else {
      // Just read the file directly
      *res = f->read(f, buf, req);
   }
}
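The deferred-result part of this idea could look roughly like the sketch below: the pending operation remembers where the saved %eax lives, and the completion path writes the result there before marking the task ready. All names (`pending_read`, `read_begin`, `read_complete`, `saved_regs`) are hypothetical, and the saved context is reduced to a single register.

```c
#include <stdint.h>

enum task_state { TASK_READY, TASK_WAITING };

/* Context saved on the task's kernel stack; only %eax is modelled. */
struct saved_regs { uint32_t eax; };

struct task {
    enum task_state state;
    struct saved_regs regs;
};

/* Bookkeeping for one in-flight blocking read (hypothetical). */
struct pending_read {
    struct task *owner;
    uint32_t *result;      /* points at the owner's saved %eax */
};

/* Syscall side: record where the result goes and park the task. */
void read_begin(struct pending_read *op, struct task *t)
{
    op->owner = t;
    op->result = &t->regs.eax;
    t->state = TASK_WAITING;
}

/* IO manager side: store the byte count (or a negative error) in the
 * saved register and make the task runnable again. */
void read_complete(struct pending_read *op, int32_t nbytes_or_error)
{
    *op->result = (uint32_t)nbytes_or_error;
    op->owner->state = TASK_READY;
}
```

This covers only the return value. The data itself still has to reach the task's buffer, either by copying while the owner's page directory is active or by going through a kernel-space buffer first.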

OSDev.org
