Providing fork and exec for 64-bit in an existing 32-bit OS

rdos · Post by **rdos** » Sun Nov 18, 2012 3:04 pm

Background: I have a DOS/Win32 compatible interface with CreateProcess and Exec integrated. I also have an "app" interface which allows chaining several applications in the same process. This is needed for things like compilers. 32-bit flat applications can both be chained (and typically relocated), and run in a new process (and console). Creating a new console is integrated in CreateProcess.

What I want is a *nix compatible interface with fork and exec. Fork should just create a new address space, and mark all memory as "copy-on-write". When a page in the forked process is written, a page fault would occur which would copy the contents to a new page, and restart the operation. Exec would clear the address space, and load a new executable file. File handles would be stored outside of the executable area, and would be inherited by the forked process even if it does exec.

The problem is that I cannot use 32-bit CreateProcess to create the new address space for fork, since that would imply the address space would be filled with exec, since those are one and the same. Additionally, I would get a new console for fork, which wouldn't work as it should.

My thinking is that the 32-bit address space between 3G and 4G, which controls consoles and handles, would be inherited as-is in the new 64-bit address space. Thus, fork would create a new CR3, mark all entries above 4G as copy-on-write, but link to an identical copy of the page for 3-4G. In order to create a new console, the 64-bit app would need to use spawn (which would call CreateProcess) instead of exec. Initially, the root 64-bit address space would be created by the 32-bit executable loader with CreateProcess. Thus, there could be multiple 64-bit root address spaces.

I also think that the area between 1G and 3G, which is normally reserved for 32-bit applications, could be used for passing buffers between the 64-bit process and the 32-bit kernel. That would mean buffers as large as 2G could be passed to kernel. This area could be setup to "non-present" when a new 64-bit process is created, and then allocated on-demand by page fault handler.

rdos · Post by **rdos** » Tue Nov 20, 2012 9:36 am

Seems like the original definition of fork was that parent and child was totally independent, and that the child got a copy of all the parents memory. That also means that a copy-on-write implementation needs to mark both parent and child as copy-on-write. A special problem then is that when a write occurs in one of the processes, a new page needs to be allocated in one of the processes (preferably in the one where the write occurs), the content needs to be copied there, and then the copy-on-write state in both processes must be cleared. This operation must be atomic on SMP-systems (both processes might decide to write the same page on different cores at the same time).

Another problem is what happens if the parent or child does another fork. Then the fork would involve three address spaces, and things start to become really complicated and nasty.

I think a possible solution would be that only two processes can be interlinked. An exec in one of the processes would need to clear all the copy-on-write bits in the other process (as an atomic operation), and another fork in one of the processes would force a copy of all the process contents, and clearing the copy-on-write bits in both processes before doing another fork.

With those requirements, it would be possible to add another page table alias (possibly as the 511:th entry in the PLM4), where the page tables of the interlinked process could be accessed directly. Additionally, there would need to be an kernel object for synchronization that could ensure mutual exclusion so the copy-on-write bits could be made SMP-safe. The link to this structure could be kept in the address space of both processes.

I also think that fork itself only needs to allocate one new physical page (the CR3), and then could set copy-on-write bits for page tables entries at all levels (directories, dir ptrs, PML4). When a single page with copy-on-write bit set is written, a copy of the 3 paging levels is also done if necessary.

bluemoon · Post by **bluemoon** » Tue Nov 20, 2012 9:58 am

rdos wrote:Seems like the original definition of fork was that parent and child was totally independent, and that the child got a copy of all the parents memory.

Not all memory, but all user-space, visible memory, and system handles (prefer of same value as viewed from application, but this can be in fact pointed to different place in kernel memory).

rdos · Post by **rdos** » Tue Nov 20, 2012 10:08 am

bluemoon wrote:
rdos wrote:Seems like the original definition of fork was that parent and child was totally independent, and that the child got a copy of all the parents memory.
Not all memory, but all user-space, visible memory, and system handles (prefer of same value as viewed from application, but this can be in fact pointed to different place in kernel memory).

Yes, of course. The new process would simply get the same page directory entries to the kernel, and it's data structures (just like is the case when using CreateProcess the "Windows"-way, which creates the initial address space for 64-bit applications). The 2G sharing area between long mode and kernel would also become initialized to empty, just like when creating a new thread.

Another problem is to clone system handles, but I already did much of the work on this when I last looked at fork in the 32-bit environment.

rdos · Post by **rdos** » Wed Dec 26, 2012 7:01 am

I've redesigned the memory-layout because of the above requirements. First, the 4G address space is partitioned into 4 areas when PAE/ long mode paging is used. These thus could have different mappings.

The last 1G (0xC0000000 - 0xFFFFFFFF) contain 32-bit kernel data and code. This area is increased (mainly the global page allocation pool) to match the natural alignment of PAE-paging. Just below (0xBFC00000-0xBFFFFFFF) is the per-application handle space, and other global per-application data (like the ELF header & tables for 64-bit). This area will be the same after a fork, but will be reinitialized (with a new paging-structiure) when exec is called. Handling of input focus is still bound to a process, and resides in the last 1G, and thus new consoles will not be started with fork or exec (as should be), but will be started with spawn.

The first 1G (0x00000000 - 0x3FFFFFFF) will also be assigned to a process, and thus cannot be used by 64-bit code for per-application data. It contains the linear address of BIOS, text-mode buffer and other things that are related to input focus. It also contains the DOS / V86 area, and potientially the real-mode idt. This area also contains the unity-mapped long mode monitor (0x110000 - 0x11FFFF).

The middle 2G (0x40000000 - 0xBFBFFFFF) will be used for passing buffers between 64-bit applications and kernel. It will also be used for reading contents of the 64-bit ELF-executable before passing pages to user-space.

About the only problem I have with this is the conflict between the 32-bit page table alias (which is between 3-4G) and the long mode page table alias (which is in the last PML4 entry). I think that 64-bit processes must always use the long mode page table alias, which means that there needs to be a third variant of all page-table functions (today I have protected mode and PAE variant). Some of these could also be extended to handle long mode addresses. OTOH, if I map the 32-bit page table alias in the middle 2G area, it would be per-application, and thus could map the correct entries for the middle 2G area per application.

Edit: I've just re-arranged the memory mappings so that the aliased 32-bit page tables are just below 3G, and the 32-bit system page table alias is just above. That means I can use the 32-bit page table alias for everything below 4G, and be sure it is consistent with the long mode page table alias.

Gigasoft · Post by **Gigasoft** » Wed Dec 26, 2012 12:42 pm

Here's a way you can implement copy on write with an arbitrary number of cloned address spaces. For each area in memory, have a linked list of the clones. Every page in each area has a counter containing the number of copies and a flag which is set if the page is a clone. If either are set, the page is marked as read only. When writing and the flag is set, make a copy of the page, clear the flag, replace the current PTE, and decrement the counter for the corresponding original page. If the counter is not 0, make a copy for every cloned memory area, clear the flags and recursively replace the PTEs in each cloned area and their cloned areas, and so on.

Another, perhaps better solution is to keep a flag denoting a page as a clone, and either a pointer to the area it came from, or the number of copies, depending on the flag. When writing to an original, go through the cloned areas until you find one where the corresponding page is a clone. Set the original for all the other cloned pages to the first area you found. When writing to a copy, go through the original pointers until you find the first original, and decrement its counter. Then make a copy of the written page and replace it with the copy. Making a copy of a page involves finding the first original page and incrementing its counter.

rdos · Post by **rdos** » Wed Dec 26, 2012 1:05 pm

It looks simple, but it is not. First, there is no area in the page tables themselves that could keep linked lists when the pages are used, so this must be done somewhere else. Second, when an address space is exited or does exec, it needs to update cloned address spaces (and this must be atomic!). Updating address spaces need direct access to the address spaces themselves, and each address space alias takes 1/512th of the total address space.

Then there is the issue if there is ever address spaces that have more than one clone as the usual scenario is a fork + an exec, which will never have more than two interlinked address spaces.

I think I'll use a single bit in the page table entry to denote copy-on-write, and no lists. I also don't support backup-store and swapping, which makes it easier and also makes lists unnecessary for other uses.

bluemoon · Post by **bluemoon** » Wed Dec 26, 2012 2:53 pm

I have not yet implemented CoW but I would do it this way:

When page fault:
1) Check if the page is present, user bit clear, and write error, otherwise it is not CoW, handle it accordingly.
2) To speed up the identification of the cause (ie. CoW), may consider to use the OS specific bit; but see notes below.
3) loop thru' a kernel data structure that record all CoW pages, with counter, process list and their address. This structure may be implemented with one or more map (for parallel access) with physical address as hash key.
4) If a record is found, handle it accordingly (clone page, release reference, possibility remove entry if zero reference, and remap into address)
5) If no record found, it is basically access error.

Note:
Step 1 is pretty quick and if that passed, there are only two possible scenario - writing to a CoW page, or write error.
Step 2 is optional, since at that point it's either CoW or kill process(write error) - we may assume CoW is the usual case here.

The point is, quickly identify if the cause is CoW, so that time consuming task (like finding the structure) is not performed for irrelevant causes.

rdos · Post by **rdos** » Wed Dec 26, 2012 5:10 pm

bluemoon wrote:I have not yet implemented CoW but I would do it this way:

When page fault:
1) Check if the page is present, user bit clear, and write error, otherwise it is not CoW, handle it accordingly.
2) To speed up the identification of the cause (ie. CoW), may consider to use the OS specific bit; but see notes below.
3) loop thru' a kernel data structure that record all CoW pages, with counter, process list and their address. This structure may be implemented with one or more map (for parallel access) with physical address as hash key.
4) If a record is found, handle it accordingly (clone page, release reference, possibility remove entry if zero reference, and remap into address)
5) If no record found, it is basically access error.

Note:
Step 1 is pretty quick and if that passed, there are only two possible scenario - writing to a CoW page, or write error.
Step 2 is optional, since at that point it's either CoW or kill process(write error) - we may assume CoW is the usual case here.

The point is, quickly identify if the cause is CoW, so that time consuming task (like finding the structure) is not performed for irrelevant causes.

I would say like this:

1) Check for write-error. If not write error, not relevant for CoW, so leave
2) Check if CoW bit is set. If not set, not relevant for CoW, so generate page fault exception
3) Lock CoW section
4) Allocate a new page and copy contents to new page. Set new page to R/W and clear CoW
5) In the aliased page structure of the other process, make the page R/W, and clear CoW bit
6) Unlock CoW section

rdos · Post by **rdos** » Thu Dec 27, 2012 4:10 am

Even more optimal, but a little more complex, is to implement CoW like this:

Code: Select all

1) On fork, mark all writable (and user level accessible) top-level PML4 entries as read-only and CoW. 
2) On page fault, check if the error reason is a write to a readonly page. If not, handle elsewhere
3) Lock
4) Start at the bottom of the page table tree
5) If CoW page, goto 8)
6) If readonly page,  unlock and generate page fault exception.
7) If PML4, go to 13). Otherwise move up one level in the tree and goto 5)
8) Start at the bottom of the page table tree
9) If not bottom level, set page content entries that are writable, except current page, to read-only and CoW.
10) Allocate a new page and copy page contents.
11) Set page attributes to R/W and not CoW
12) If the current page was not marked as CoW, move up in the tree and do 9) again
13) Unlock
14) Flush TLB-entry
15) Re-execute

The advantage here is that no entries needs to be copied initially. If only lowest level page tables are marked as CoW, the whole page tree must first be copied on a fork.

Edit: The nice thing with this algorithm is that the page table tree of the forked process never needs to be accessed!

The tricky thing is where TLB-flushes are needed, and which entries that must be flushed. A quick-and-dirty solution would be to flush the entire TLB after 3), but that would not be optimal.

rdos · Post by **rdos** » Thu Dec 27, 2012 5:00 am

Possible solution including TLB-flushes: (also some changes in the algorithm)

Code: Select all

1) On fork, mark all writable (and user level accessible) top-level PML4 entries as read-only and CoW. 
2) On page fault, check if the error reason is a write to a readonly page. If not, handle elsewhere
3) Lock
4) Start at the bottom of the page table tree
5) Flush current TLB entry
6) If CoW page, goto 10)
7) If readonly page,  unlock and generate page fault exception.
8) If PML4, go to 21). 
9) Move up one level in the tree and goto 5)
10) Start at the bottom of the page table tree
11) Set saved page to 0
12) Save CoW status of current page
13) If not CoW, set page table entries that are writable to read-only and CoW for all pages except current page which should be R/W and not CoW.
14) Allocate a new page and copy page contents.
15) If saved page, fill in saved page as R/W and not CoW in the correct entry in the new page.
16) If CoW goto 19)
17) Set saved page to newly allocated page
18) Move up one level in the tree and goto 12)
19) Setup new page as R/W and not CoW
20) Clear CoW and set R/W in forked process
21) Unlock
22) Re-execute

Edit: There is still a need to keep the alias of the forked process. That's because pages marked as CoW are already different in the two processes, and both of them needs to be changed on a write-access.

A simplified solution (which rely on multiple page faults when CoW is not at the lowest level:

Code: Select all

1) On fork, mark all writable (and user level accessible) top-level PML4 entries as read-only and CoW. 
2) On page fault, check if the error reason is a write to a readonly page. If not, handle elsewhere
3) Lock
4) Start at the bottom of the page table tree
5) Flush current TLB entry
6) If CoW page, goto 10)
7) If readonly page,  unlock and generate page fault exception.
8) If PML4, go to 14). 
9) Move up one level in the tree and goto 5)
10) If not bottom level, update page contents to make writable entries read-only and CoW 
11) Allocate a new page and copy page contents
12) Link new entry as R/W and CoW in current process
13) Clear CoW and set R/W in forked process
14) Unlock
15) Re-execute

Owen · Post by **Owen** » Thu Dec 27, 2012 1:36 pm

You can't use just a bit in the page table entry, because a process may fork() multiple times (indeed, most do)

Most OSes allocate a structure per page of physical memory for handling this (and more) data. For example, I allocate a 32 byte (32-bit)/64 byte (64-bit) structure per page.

rdos · Post by **rdos** » Thu Dec 27, 2012 5:54 pm

Owen wrote:You can't use just a bit in the page table entry, because a process may fork() multiple times (indeed, most do)

I'll solve that by copying all CoW pages before a second fork.

And I think the typical scenario is fork followed by exec. Exec will clear all CoW pages as well. So once the forked process does exec, it can once more fork without any problems.

Typically, multiple execution within the same address space will use threads, not fork.

Owen · Post by **Owen** » Fri Dec 28, 2012 8:26 am

Forking multiple times from a single process has traditionally been the way network servers were written (inetd, apache, etc). It's still incredibly common (e.g. Apache is still most often deployed in the prefork mode, because some common plugins e.g. mod_php are not thread safe, and probably never will be)

For servers which need no global context it's still a very common model

rdos · Post by **rdos** » Fri Dec 28, 2012 9:16 am

Owen wrote:Forking multiple times from a single process has traditionally been the way network servers were written (inetd, apache, etc). It's still incredibly common (e.g. Apache is still most often deployed in the prefork mode, because some common plugins e.g. mod_php are not thread safe, and probably never will be)

I don't see such servers being ported to RDOS. I already have a simplified httpd class that can be extended with C++, and pretty easy could provide dynamic content. I'm more interested in getting virtual machines (JVM) ported, and Mozilla. Possibly also SSH and SCH. And even if making a full copy of the process state costs a lot of memory, it works. Multiple forks will still be supported but at higher cost. This kind of is a design-issue. I don't want to have additional overhead for linear/physical memory. That's also why I will not implement swapping and virtual memory.

OSDev.org

Providing fork and exec for 64-bit in an existing 32-bit OS

Providing fork and exec for 64-bit in an existing 32-bit OS

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit

Re: Providing fork and exec for 64-bit in an existing 32-bit