OSDev.org

The Place to Start for Operating System Developers
It is currently Mon Sep 23, 2019 12:29 pm

All times are UTC - 6 hours




Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: Proper buffer design for file reading
PostPosted: Sun Jul 07, 2019 5:42 pm 
Offline
Member
Member
User avatar

Joined: Wed Dec 12, 2018 12:16 pm
Posts: 119
Location: Chile
Let's say that we implemented an X filesystem (in this case FAT32) into our OS.
We have a function that reads a file, which as parameters needs the filename and a buffer.
Now we have a problem, the buffer size. The file can be larger or smaller than the buffer, we can't just create a buffer with the size of the file because can happen that the file is a big one (2GB for example). It will use almost all the RAM. Obiously, that is a very very terrible design.
So what I have to do? How I implement a proper buffer for file reading, independently of the file size?
This function should read a file (not working), with a buffer as parameter.
Code:
int fat32_read_file(uint8_t* buffer, struct file fp)
{
   if (!hd_exists())
      return 1;
   hd_read(start_of_root, FAT32_FILES_PER_DIRECTORY * sizeof(struct DirectoryEntry), (uint8_t*)&drce[0]);
   uint8_t* buff = 0;
   uint8_t* fatbuff = 0;
   if (strcmp((char*)drce[fp.file_no].file_name, (char*)fp.file_name) == 0) {
      uint8_t fcluster = ((uint32_t)drce[fp.file_no].cluster_number_hi) << 16 | ((uint32_t)drce[fp.file_no].cluster_number_lo);   
      int32_t ncluster = fcluster;
      int32_t file_size = fp.file_size;
      /* 1 sector file (less than 512 bytes) */
      if (file_size < 512) {
         hd_read(fcluster, 512, buff);
         buff[file_size] = '\0';
         //kputs("%s", (char*)buff);
      }

      while (file_size > 0) {
         uint32_t fsect = start_of_data + bpb.sectors_per_cluster * (ncluster - 2);
         uint32_t sector_offset = 0;
         for (; file_size > 0; file_size -= 512) {
            hd_read(fsect + sector_offset, 512, buff);
            buff[file_size > 512 ? 512 : file_size] = '\0';
            //kputs("%s", (char*)buff);
            memcpy(buffer, buff, 512);
            if (++sector_offset > bpb.sectors_per_cluster)
               break;
         }
         uint32_t fsectcurrentcl = ncluster / (512 / sizeof(uint32_t));
         hd_read(fat_start + fsectcurrentcl, 512, fatbuff);
         uint32_t foffsectcurrentcl = ncluster % (512 / sizeof (uint32_t));
         ncluster = ((uint32_t*)&fatbuff)[foffsectcurrentcl] & 0x0FFFFFFF;
      }
      return 0;
   }
   return 1;
}

We "open" the file with this:
Code:
struct file fat32_open_file(uint8_t* filename)
{
   if (!hd_exists() && !filename)
      return;
   struct file fp;
   hd_read(start_of_root, FAT32_FILES_PER_DIRECTORY * sizeof(struct DirectoryEntry), (uint8_t*)&drce[0]);

   for (int i = 0; i < FAT32_FILES_PER_DIRECTORY; ++i) {
      if (drce[i].file_name[0] == FAT32_NO_FILES)
         break;
      /* LFN */
      if (drce[i].attributes & FAT32_LFN)
         continue;
      /* folder */
      if (drce[i].attributes & FAT32_DIRECTORY)
         continue;

      /* If the first byte of file_name is 0xE5, means that the file is deleted */
      if (drce[i].file_name[0] == FAT32_DELETED_FILE)
         continue;
      
      uint8_t fil[12];
      fat2human(drce[i].file_name, fil);
      trimName(fil, 11);
      if (strcmp((char*)fil, (char*)filename) == 0) {
         strcpy((char*)fp.file_name, (char*)filename);
         fp.file_size = drce[i].file_size;
         fp.file_no = i;
         return fp;
      }
   }
   kputs("\nFile %s not found\n", filename);
   strcpy((char*)fp.file_name, (char*)filename);
   fp.file_size = 0;
   fp.file_no = 0;
   return fp;
}

"fp" is a file structure:
Code:
struct file
{
   uint8_t file_name[11]; /* let's assume fat32 SFN for now */
   uint32_t file_size;
   uint32_t file_no; /* Number of the file in the directory (starting from 0) */
   uint32_t file_pos;
};

Now, opening the file:
Code:
uint8_t a[2048]; /* Main problem here, the buffer is limited to 2048 bytes! */
fat32_read_file(a, fp);
kputs("%s", (char*)a);


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Sun Jul 07, 2019 10:27 pm 
Offline
Member
Member

Joined: Wed Aug 30, 2017 8:24 am
Posts: 241
Learning from UNIX means learning to win. In UNIX there is a read function defined as:
Code:
ssize_t read(int, char*, size_t);
The first parameter is the FD, the OS token used to identify the file (returned from open()). After that you specify a buffer as pointer and size. The function tries to fill the buffer as far as possible, but returns how much it read. Or -1 on error. Importantly, it can return short for any reason at all. For example, on FAT32, it might just decide to only finish up the current cluster of the file and return data from the next cluster only on the next call.

End of file is marked by this function returning zero.


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 5:44 am 
Offline
Member
Member
User avatar

Joined: Thu Oct 13, 2016 4:55 pm
Posts: 375
Hi,

hextakatt wrote:
Now we have a problem, the buffer size. The file can be larger or smaller than the buffer
nullplan is right, you should not worry about this. Most OS lets the userspace application define the buffer and its size. Your concern is more how to translate the file offset into series of block buffers. You see, the storage has 512 bytes blocks. One or more blocks (which may or may not be contiguous) make up the data that you have to copy to the caller's buffer. Copying is important, because those nasty userland applications can read from any arbitrary offsets, and any arbitrary lengths, not just 512 byte aligned offsets and multiples of 512 bytes. This means there's a 1 to 511 chance that the data you'll need to return starts in the middle of a sector. Also because data sectors may not be contiguous on the storage, you'll have to copy arbitrary sequence of sectors to get the correct data.

This whole situation is very similar to virtual address mappings. Only this time "virtual address space" is the user buffer, "physical address space" are the LBA sectors, page frame is 512 bytes, and there's no hardware to do the translation for you, you have to do it from software. The actual translation scheme depends on what FS you're using. The trick here is, how to implement the (offset) .. (offset + buffer size) mapping (in "virtual address space") to a sequence of sectors (in "physical address space") efficiently and FS independently (if you want to support more FSes with VFS).

I'd suggest to download the Minix2 source and take a look at minix/fs/read.c. Most of the code deals with splitting up the "virtual address space" into a sequence of LBA addresses. I would not recommend Minix3 or Linux as they are much more bloated and complicated. Minix2 is simple.

Another possible source would be Xv6, although I must admit the function names are a bit cryptic, and you must read the doc to understand how file system buffers work in this little OS.

More precisely you have three layers to deal with:
1. layer: LBA sectors
2. layer: file content, each with it's own "address space" up to file size (depends on FS how this is mapped to sectors)
3. layer: user space buffers, specified by (file offset) .. (file offset + buffer size)
You'll need a code that gets layer 3 as its input, and walks down to layer 1 to get a list of sectors. Typically layer 3 is implemented in the VFS code. Layer 2 is implemented in the FS driver, finally layer 1 is implemented in the block device drivers.

Cheers,
bzt


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 4:58 pm 
Offline
Member
Member
User avatar

Joined: Wed Dec 12, 2018 12:16 pm
Posts: 119
Location: Chile
bzt wrote:
hextakatt wrote:
Now we have a problem, the buffer size. The file can be larger or smaller than the buffer
nullplan is right, you should not worry about this. Most OS lets the userspace application define the buffer and its size.

But for internal OS usage and If I need to read an entire file? (e.g: binary files)


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 5:01 pm 
Offline
Member
Member
User avatar

Joined: Mon May 22, 2017 5:56 am
Posts: 142
hextakatt wrote:
But for internal OS usage and If I need to read an entire file? (e.g: binary files)

You mean in the kernel? You need kmalloc as a kernel-space equivalent of malloc, AFAIK. Get the size, kmalloc a big enough buffer, ...

_________________
Wir mussen wissen, wir werden wissen.


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 5:10 pm 
Offline
Member
Member
User avatar

Joined: Wed Dec 12, 2018 12:16 pm
Posts: 119
Location: Chile
eekee wrote:
hextakatt wrote:
But for internal OS usage and If I need to read an entire file? (e.g: binary files)

You mean in the kernel? You need kmalloc as a kernel-space equivalent of malloc, AFAIK. Get the size, kmalloc a big enough buffer, ...

Well, that was my initial solution, but that doesn't works for big files (magnitude of gigabytes), it will use all the RAM. It works?


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 6:21 pm 
Offline
Member
Member
User avatar

Joined: Mon May 22, 2017 5:56 am
Posts: 142
Ooh... I didn't understand what you're asking until I looked at the code. Not that I fully understand it, but I think I get what part of the job you're asking now. I suppose you'd have to read into a cluster-size or sector-size buffer, then copy from there into the buffer the user supplies. If the user's buffer is bigger, read another cluster or sector into your buffer, and copy again. If the user's buffer is smaller, well that's the reason you have a buffer the size of a full sector or cluster. The file size barely matters here, you only need it so you can calculate the right amount to copy from the last cluster.

As a simple disk cache, you could have a whole array of these cluster-buffers together with data on which buffer corresponds to which cluster on the disk. Just an idea I probably haven't given enough thought to. :)

This is all about read(); mmap() is a whole other planet.

Edit: This would probably benefit from being separated into layers as bzt suggests. I'm not sure which layers, though. :)

_________________
Wir mussen wissen, wir werden wissen.


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 7:43 pm 
Offline
Member
Member
User avatar

Joined: Wed Dec 12, 2018 12:16 pm
Posts: 119
Location: Chile
Well, I tried to fix my code, but now I get the goddamn invalid opcode exception. Nice.
Code:
int fat32_read_file(uint8_t* filename, uint8_t* buffer, struct file fp)
{
   /* Check HDD precense, I don't want to shoot my foot */
   if (!hd_exists() && !filename)
      return 1;
   hd_read(start_of_root, FAT32_FILES_PER_DIRECTORY * sizeof(struct DirectoryEntry), (uint8_t*)&drce[0]);

   uint64_t buffer_size = sizeof(buffer) / sizeof(uint8_t);
   uint8_t* buff = 0;
   uint8_t* fatbuff = 0;
   uint8_t fil[12];
   fat2human(drce[fp.file_no].file_name, fil);
   trimName(fil, 11);
   if (strcmp((char*)fil, (char*)filename) == 0) {
      uint8_t fcluster = ((uint32_t)drce[fp.file_no].cluster_number_hi) << 16 | ((uint32_t)drce[fp.file_no].cluster_number_lo);   
      int32_t ncluster = fcluster;
      int32_t file_size = fp.file_size;
      int32_t bytesr = 0;

      kputs("\nFile content: \n");

      /* 1 sector file (less than 512 bytes) */
      if (file_size < 512 && buffer_size > 512) {
         hd_read(fcluster, 512, buff);
         memcpy(buffer, buff, 512);
         //buff[file_size] = '\0';
         //kputs("%s", (char*)buff);
      }

      /* File bigger than a sector, cluster */
      while (file_size > 0) {
         uint32_t fsect = start_of_data + bpb.sectors_per_cluster * (ncluster - 2);
         uint32_t sector_offset = 0;
         for (; file_size > 0; file_size -= 512) {
            hd_read(fsect + sector_offset, 512, buff);
            //buff[file_size > 512 ? 512 : file_size] = '\0';
            //kputs("%s", (char*)buff);
            memcpy(buffer, buff, 512);
            bytesr += (file_size > 512 ? 512 : file_size);
            if (++sector_offset > bpb.sectors_per_cluster)
               break;
         }
         uint32_t fsectcurrentcl = ncluster / (512 / sizeof(uint32_t));

         hd_read(fat_start + fsectcurrentcl, 512, fatbuff);
         uint32_t foffsectcurrentcl = ncluster % (512 / sizeof (uint32_t));
         ncluster = ((uint32_t*)&fatbuff)[foffsectcurrentcl] & 0x0FFFFFFF;
      }
      return 0;
   }
   kputs("\nFile %s not found\n", filename);
   return 1;
}


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Mon Jul 08, 2019 9:44 pm 
Offline
Member
Member

Joined: Wed Aug 30, 2017 8:24 am
Posts: 241
A pointer without size is like ice cream without a tub: Sort of what you wanted, but your hands are gross afterwards.

I haven't read much of your code, but this line can't work:
Code:
uint64_t buffer_size = sizeof(buffer) / sizeof(uint8_t);

Leaving aside that, if uint8_t exists, it must be an unsigned char, the size of which is defined as 1, buffer in this case is a pointer. sizeof therefore returns 4 or 8, depending on architecture. You should always pass the buffer size as argument. The called function has no other means of identifying the size.

That holds for most kinds of code, BTW. Pointer without size is usually an error. Sometimes the size can be inferred, but not often.


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Tue Jul 09, 2019 8:05 pm 
Offline
Member
Member
User avatar

Joined: Wed Dec 12, 2018 12:16 pm
Posts: 119
Location: Chile
nullplan wrote:
A pointer without size is like ice cream without a tub: Sort of what you wanted, but your hands are gross afterwards.

I haven't read much of your code, but this line can't work:
Code:
uint64_t buffer_size = sizeof(buffer) / sizeof(uint8_t);

Leaving aside that, if uint8_t exists, it must be an unsigned char, the size of which is defined as 1, buffer in this case is a pointer. sizeof therefore returns 4 or 8, depending on architecture. You should always pass the buffer size as argument. The called function has no other means of identifying the size.

That holds for most kinds of code, BTW. Pointer without size is usually an error. Sometimes the size can be inferred, but not often.

I deleted that faulty line, and now as a paremeter I specify the buffer size (uint32_t). But, still having the same error.
Code:
int fat32_read_file(uint8_t* filename, uint8_t* buffer, uint32_t buffsiz, struct file fp)
{
   /* Check HDD precense, I don't want to shoot my foot */
   if (!hd_exists() && !filename)
      return 1;
   hd_read(start_of_root, FAT32_FILES_PER_DIRECTORY * sizeof(struct DirectoryEntry), (uint8_t*)&drce[0]);

   uint8_t* buff = 0;
   uint8_t* fatbuff = 0;
   uint8_t fil[12];
   fat2human(drce[fp.file_no].file_name, fil);
   trimName(fil, 11);
   if (strcmp((char*)fil, (char*)filename) == 0) {
      uint8_t fcluster = ((uint32_t)drce[fp.file_no].cluster_number_hi) << 16 | ((uint32_t)drce[fp.file_no].cluster_number_lo);   
      int32_t ncluster = fcluster;
      int32_t file_size = fp.file_size;

      kputs("\nFile content: \n");

      /* 1 sector file (less than 512 bytes) */
      if (file_size < 512) {
         hd_read(fcluster, 512, buff);
         memcpy(buffer, buff, 512);
      }

      /* File bigger than a sector, cluster */
      while (file_size > 0) {
         uint32_t fsect = start_of_data + bpb.sectors_per_cluster * (ncluster - 2);
         uint32_t sector_offset = 0;
         for (; file_size > 0; file_size -= 512) {
            hd_read(fsect + sector_offset, 512, buff);
            memcpy(buffer, buff, 512);
   
            if (++sector_offset > bpb.sectors_per_cluster)
               break;
         }
         uint32_t fsectcurrentcl = ncluster / (512 / sizeof(uint32_t));

         hd_read(fat_start + fsectcurrentcl, 512, fatbuff);
         uint32_t foffsectcurrentcl = ncluster % (512 / sizeof (uint32_t));
         ncluster = ((uint32_t*)&fatbuff)[foffsectcurrentcl] & 0x0FFFFFFF;
      }
      return 0;
   }
   kputs("\nFile %s not found\n", filename);
   return 1;
}


Code:
uint8_t a[512];
fat32_read_file(filename, a, 512, fp);
kputs("%s", (char*)a);

Even if this code worked, I still cannot read the entire file at the same time!


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Tue Jul 09, 2019 8:19 pm 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 56
If I may ask... why can't you read the file in chunks and treat the buffer as a ring buffer?
Code:
uint8_t a[512];
fat32_read_file(filename, a, 512, fp);
kputs("%s", (char*)a);
// Zero out buffer
for (int i = 0; i < 512; ++i) {
a[i] = 0;
}
// Keep reading

Would this not suffice for your use-case?


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Tue Jul 09, 2019 9:21 pm 
Offline
Member
Member
User avatar

Joined: Wed Dec 12, 2018 12:16 pm
Posts: 119
Location: Chile
Ethin wrote:
If I may ask... why can't you read the file in chunks and treat the buffer as a ring buffer?
Code:
uint8_t a[512];
fat32_read_file(filename, a, 512, fp);
kputs("%s", (char*)a);
// Zero out buffer
for (int i = 0; i < 512; ++i) {
a[i] = 0;
}
// Keep reading

Would this not suffice for your use-case?

Because I need to read the entire file at once.


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Wed Jul 10, 2019 1:28 am 
Offline
Member
Member

Joined: Thu May 17, 2007 1:27 pm
Posts: 607
The usual approach (if you cannot predict the file size from the inode / directory entry) is to reallocate a buffer that is c times as large as the current one once you run out of space, for some constant c. That way, you're only wasting a factor of c space, with only constant amortized performance overhead.

That being said, allocating unlimited buffers in the kernel (i.e., from non-swapable memory) is a clear code smell and a potential DOS vector. Why are you trying to read an entire file at once?

_________________
managarm: A microkernel-based OS that is capable of running a Wayland desktop


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Wed Jul 10, 2019 9:51 am 
Offline

Joined: Sat Jul 06, 2019 3:24 pm
Posts: 5
First you alloc linear memory without any physical RAM mapped to it. You need to know max file size for min complexity.
Then user(or your kernel) accesses 1st 2MB chunk, #PF triggers, and you read 2MB from disk and map appropriate ram into 1st 2MB.
Repeat for the remaining 2MB chunks.
This way you can read all file at once or only a portion. Covers all cases.
You can choose 64MB chunks or larger - up to you.

hextakatt wrote:
we can't just create a buffer with the size of the file because can happen that the file is a big one (2GB for example).
You have no choice but to alloc linear memory of the required size if you want entire file at once. But you don't have to map physical RAM into that right away.

hextakatt wrote:
But for internal OS usage and If I need to read an entire file? (e.g: binary files)
Set reasonable limits and your life will get easier


Top
 Profile  
 
 Post subject: Re: Proper buffer design for file reading
PostPosted: Wed Jul 10, 2019 11:19 am 
Offline
Member
Member
User avatar

Joined: Mon May 22, 2017 5:56 am
Posts: 142
loonie's post basically describes a simple mmap, doesn't it? The other day I was thinking about mmap versus read/write, and realised mmap is really awesome for certain tasks, it really simplifies the code and especially how you think about the task.

Korona's right if you haven't got the size ahead of time, (or don't want to get it; stat() can be slow,) but I want to add a little caveat.
Korona wrote:
The usual approach (if you cannot predict the file size from the inode / directory entry) is to reallocate a buffer that is c times as large as the current one once you run out of space, for some constant c. That way, you're only wasting a factor of c space, with only constant amortized performance overhead.

Yes, but you want to cap the multiplication at some point. Make it grow linearly after a certain size. Python uses c=2, but they found they had to cap it when someone presented them with a real-life case of allocations big enough to reach some huge size but not much more than that. Hundreds of megabytes or maybe gigabytes of address space were wasted.

_________________
Wir mussen wissen, wir werden wissen.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next

All times are UTC - 6 hours


Who is online

Users browsing this forum: Google [Bot], qookie and 9 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group