Memory Segmentation in the x86 platform and elsewhere

Schol-R-LEA · **Posted:** Fri Feb 03, 2017 10:53 am

I have collected the (recent) discussions on memory segmentation from previous threads to move it out of the 'architectural lowest common denominator' discussion. I think that this needs to be aired, and I would like to give rdos a chance to both address standing critiques (including my own) and present his case for segmentation.

Now, to anticipate one particular point, I should mention that in the end, the decision is rdos's, not ours. I am asking this mainly to understand his reasoning, and see if it applies more generally, but for RDOS itself, it is his call, and in this particular case, he had reasons why (in particular) AMD64/EM64T support isn't relevant to the system's immediate future.

---------------------------------------

From Why do people say C is better than Assembly?:

rdos wrote:

From my point of view, the major reason why C sucks for OS development is that it cannot handle segmentation properly, and thus any C-coded OS always relies on paging and IPC, which is horribly slow, or uses a memory model that can easily create errors that will never be detected and that lie in the kernel as latent bombs.

So, the argument really isn't if hand-coded assembly is faster than an optimizing C compiler, but if a micro-kernel design with lots of TLB shootdowns and context switches ever can beat a monolithic kernel written in assembly using segmentation, which I think it never can. So the C people continue with their flat monolithic kernels that are prone to memory corruption issues, as this is the only way they can reach decent performance.

Schol-R-LEA wrote:

Far be it from me to defend C, but I really don't think you are right in this, for a number of reasons.

First, the main complaint you make has no applicability to long mode, nor to CPU architectures which have never had memory segmentation to begin with. Memory segmentation is a peculiarity of the 8086 born of the cost of memory addressing hardware at the time it was designed - memory protection wasn't a significant consideration, as the 8086 was only meant to be a short-term design primarily intended for embedded controllers anyway - and one which even Intel has admitted was a mistake. The only other segmented-memory CPU architecture of any note, the IBM System/360, similarly abandoned segmentation as a bad design choice in the mid 1970s.

Second, segmentation is actually significantly slower than page-based memory protection for 32-bit protected mode in most current x86 models. Now, admittedly, this is because Intel has de-emphasized support for segmentation since the 80386, and not an inherent aspect of the segmented model, but it is still the case.

Third, the question of segmentation has nothing to do with the C language at all. While the current dominant implementations (GCC and Visual C++) only work with a flat memory model, both the Borland and Watcom C compilers handled segmentation quite satisfactorily, and AFAIK their successors still do. The Amsterdam Compiler Kit C compiler used in early versions of Minix had no problems providing for a segmented OS written in C.

Fourth, the issue of monolithic vs. micro-kernel is orthogonal to the issues of languages and memory models. Both micro kernel systems (such as the above mentioned Minix) and monolithic kernel designs (e.g., Xenix, Coherent, UCSD Pascal) have been written for x86 real mode, using assembly, C, or even some other languages (e.g., Pascal, Parasol, Modula-2), so arguing about the kernel design methodology is pretty meaningless here.

(As an aside, I would like to remind you that the original impetus for developing the micro-kernel model in the late 1970s was to provide better process isolation in systems with no hardware memory protection - such as the 8086.)

Finally, segmentation does not actually provide any improvement in the level of memory protection from paged protection in 32-bit protected mode, and [erroneous statements about current implementations elided].

[W]hat does matter is that segmentation is both less flexible than paging, and provides no protections which paging doesn't - it provides fewer, in fact, as the read-only limitations on user-space memory in a given segment are all-or-nothing, whereas page protection can be set on a per-process and per-page basis. While it is true that a given segment can be as small as a page or as large as the full addressing space, whereas paging requires all pages to be the same relatively small size, using segments in a fine-grained manner would, in effect, be using the segmentation to emulate paging - there would be no significant advantages.

While it is true that C compilers running in flat mode rarely take proper care in handling the page settings, and for that matter few operating systems give the compilers fine-grained control over setting page protection to begin with, that is a flaw in the OS and/or the compiler, not in the language itself, I would say.

Brendan wrote:

Kazinsal wrote:

rdos wrote:

and far more efficient than any C compiler would be able to produce

Your efficiency claims are as good as fiction without proof.

You only need to see the "small address spaces" research done by L4 to see that there's some potential benefits in using segmentation to reduce task switch costs (at least for artificial micro-benchmarks that may or may not have any practical value, for systems that don't use asynchronousity to reduce the number of task switches).

Of course that has nothing to do with "C vs. assembly" and nothing to do with the relatively bizarre (and likely very erroneous) "assembly is less error prone because it can use segmentation while most C compilers can't" issue.

---------------------------------------

From "What features can I rely on the existence of across arch":

Korona wrote:

rdos wrote:

Yeah, thinking in terms of "how many architectures can I support" is all wrong. Portability always sacrifices speed and the ability to exploit specific features of a CPU. That might be ok if you want something that is mediocre on many CPUs, but not otherwise. I only support x86 because only x86 has segmentation, which I consider a vital component of memory protection. I also rely heavily on assembler code instead of C, which I can do since I only support x86.

x86_64 does not support segmentation. You're basically saying "Rather than designing my kernel so that it can adapt to different scenarios I optimize for an architecture that has been obsolete for at least 10 years and today only serves a niche market".

I also really doubt the "portability hurts performance" statement. Supporting a feature on one architecture does not mean that you have to require it on every architecture. However it is very easy to convince me if that statement is indeed true: Just show me a single (general purpose) OS that exploits all the nice special features of x86 and outperforms Linux or Windows on common workloads (or even a single real-world workload; microbenchmarks don't count). Hint: You can't; performance depends on the design of your algorithms and not on "Let me use segmentation instead of paging".

Brendan · **Posted:** Fri Feb 03, 2017 4:10 pm

Hi,

Korona wrote:

rdos wrote:

Yeah, thinking in terms of "how many architectures can I support" is all wrong. Portability always sacrifices speed and the ability to exploit specific features of a CPU. That might be ok if you want something that is mediocre on many CPUs, but not otherwise. I only support x86 because only x86 has segmentation, which I consider a vital component of memory protection. I also rely heavily on assembler code instead of C, which I can do since I only support x86.

x86_64 does not support segmentation. You're basically saying "Rather than designing my kernel so that it can adapt to different scenarios I optimize for an architecture that has been obsolete for at least 10 years and today only serves a niche market".

I also really doubt the "portability hurts performance" statement. Supporting a feature on one architecture does not mean that you have to require it on every architecture. However it is very easy to convince me if that statement is indeed true: Just show me a single (general purpose) OS that exploits all the nice special features of x86 and outperforms Linux or Windows on common workloads (or even a single real-world workload; microbenchmarks don't count). Hint: You can't; performance depends on the design of your algorithms and not on "Let me use segmentation instead of paging".

It's impossible to write a portable OS. If anyone doesn't believe me; try deleting all the non-portable code in Linux (e.g. the entire "/src/linux/arch" directory) and replace it with pure portable C code (without any "#ifdef ARCH...", etc) - you are guaranteed to fail.

For an OS/kernel to work at all; at some point "portable code" must depend on "non-portable code". The real question is where. This "where" is a compromise between how much work is involved in porting the OS and how much non-portable optimisation can be used. In other words; "where" is an unavoidable compromise between portability and (potential) performance.

For a lot of monolithic kernels (including Linux) "where" is a set of internal abstractions in a special (delineated) area of the source code. For some kernels (e.g. Windows) "where" begins with a hardware abstraction layer/HAL. For some kernels (e.g. the original L4 micro-kernel) "where" is the kernel API (e.g. kernel/s written in pure assembly language and only user-space is portable).

Cheers,

Brendan

Kazinsal · **Joined:** Wed Jul 13, 2011 7:38 pm **Posts:** 558

I have exactly one argument in favour of segmentation in x86, and that's using it in conjunction with paging and only in conjunction with paging to provide W^X on pre-NX x86 machines. I could go over how it works in detail but instead I'll link to the relevant slide from an OpenBSD presentation on it given by Theo de Raadt in 2005:

Source: https://www.openbsd.org/papers/ven05-deraadt/mgp00015.html

OpenBSD has been using segmentation as a W^X assist on non-NX capable x86 machines for 12 years now, and has had it mandatory and with no exceptions in the amd64 and i386 kernels for a couple years now. I am planning on implementing this in Araxes as soon as I have something useful to test it with, at the very least in userspace, but with intent to expand it to the kernel as well.

I accept this being non-portable because it serves as a way to work around non-NX capable x86 machines literally lacking the per-page execute flag.
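For readers who have not seen the trick, here is a rough C model of the idea, not OpenBSD's actual code. It assumes a flat code segment with base 0 whose limit is lowered so that only the low part of the address space is fetchable; the `EXEC_TOP` boundary, the `struct page` fields and both function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: executable mappings live below EXEC_TOP,
 * writable data (heap, stack) lives at or above it. */
#define EXEC_TOP 0x20000000u

/* Simplified page state; pre-NX page tables have no execute bit at all. */
struct page {
    bool present;
    bool writable;
};

/* Model of an instruction fetch with a flat CS (base 0):
 * first the code-segment limit check, then the page-present check. */
bool fetch_allowed(uint32_t addr, uint32_t cs_limit, struct page pg)
{
    if (addr > cs_limit)   /* beyond the CS limit: faults whatever the page says */
        return false;
    return pg.present;     /* paging alone cannot veto execution here */
}

/* W^X holds in this model if no writable page sits at or below the CS limit. */
bool mapping_keeps_wx(uint32_t addr, struct page pg, uint32_t cs_limit)
{
    return !(pg.present && pg.writable && addr <= cs_limit);
}
```

With the limit set to `EXEC_TOP - 1`, a fetch from a writable stack page high in the address space is refused, while the same fetch succeeds under a flat 4 GiB code segment. That difference is the whole point of the workaround.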

alexfru · **Joined:** Tue Mar 04, 2014 5:27 am **Posts:** 1108

There are two problems with using segmentation to restrict execution to a part of the address space. One is dynamically loaded libraries. Their code needs to fall within this executable region, which requires some flexibility in loading and relocation. The other is JITs and such. Perhaps, the executable/library format should explicitly describe .bss-like regions with read/write/execute rights.

Schol-R-LEA · **Posted:** Fri Feb 03, 2017 11:13 pm

Brendan wrote:

Korona wrote:

I also really doubt the "portability hurts performance" statement. [...]

It's impossible to write a portable OS. [...]

Oops. I meant to cut that section of the quote from Korona - it was relevant in the original thread, but not so much in this one.

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Even though it is a bit off topic in this thread I still want to respond to Brendan's post and elaborate my point.

Brendan wrote:

It's impossible to write a portable OS. If anyone doesn't believe me; try deleting all the non-portable code in Linux (e.g. the entire "/src/linux/arch" directory) and replace it with pure portable C code (without any "#ifdef ARCH...", etc) - you are guaranteed to fail.

For an OS/kernel to work at all; at some point "portable code" must depend on "non-portable code". The real question is where. This "where" is a compromise between how much work is involved in porting the OS and how much non-portable optimisation can be used. In other words; "where" is an unavoidable compromise between portability and (potential) performance.

I did not claim that every single line of code should be "portable" in an OS. As you pointed out that would be really silly and impossible. However I claim that

It is possible to write an OS core (i.e. kernel + drivers) that is clearly divided into a smallish architecture-specific and a larger generic part. Stuff that has to be handled in an architecture-specific way includes context management, page table management, interrupt table management, hardware access (e.g. I/O ports, MMIO and the PCI configuration space) and management of CPU specific features (e.g. VT-x, SMEP, performance counters, machine check handling and so on). Things that can and should be handled in the generic part include scheduling algorithms, memory management algorithms (e.g. buddy allocators for physical allocation, radix or red-black tree for virtual allocation and slab allocators for heap allocations) and driver code.
It does not make sense to write those generic algorithms in an architecture-specific way. I claim that there is zero performance benefit (and a huge maintenance burden) if you write your red-black tree in x86 assembly. It might even perform worse because your compiler does better instruction scheduling than you do.
It is possible (albeit difficult) to design APIs that abstract over the architecture specific parts without hurting practical performance. This does not mean that there has to be a single generic code path for every little detail. Those APIs may very well expose architecture specific differences when doing so makes sense. Because the original thread was about paging let's talk about that: For example there are many archs that do not have write-combining mappings. Does that mean that a kernel must not expose write-combining mappings to drivers? Of course the answer is no: An API should allow drivers to use them when they are available. An elegant solution could for example introduce flags like READS_HAVE_NO_SIDE_EFFECTS | ALLOW_BURST_WRITES for memory ranges. I do not claim that it is possible to design perfect APIs. There will always be trade-offs. I do however claim that it is possible to design good-enough APIs so that there are no practical (i.e. outside of microbenchmarks) performance differences.
Portability is only harmed if an OS requires a specific highly architecture-dependent feature and not if it enables the use of a specific feature. For example it might totally make sense to use segmentation if it is available. However it does not make sense (for a general-purpose OS) to require segmentation to be present. For example in my OS I use the kernel CS and SS registers to keep track of the execution context (e.g. an exception can look at the previous CS register to determine if it interrupted an IRQ handler, the idle routine or a kernel thread and CS atomically gets updated by interrupts and IRET). However if I have to port to an arch that has no CS register I will just find another way to do the same job and I won't have to rewrite my whole kernel.
An OS that is written to be highly architecture-specific but does not put enough effort on API and algorithm design will always be outperformed by an OS that does make some portability trade-offs but focuses on API and algorithm design and on well-written and practical implementations of those algorithms.
As a corollary of the last point: Because nobody is able to maintain and evolve sophisticated algorithms written in assembler it is impractical to write an efficient general-purpose OS in assembly.
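The mapping-flags idea from the write-combining point above could look something like the following C sketch. Only the two flag names come from the post; `enum cache_mode`, `struct arch_caps` and `choose_cache_mode()` are made-up names, and requiring both hints before selecting write-combining is an assumption of this sketch, not something the post specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Architecture-neutral hints a driver may pass when mapping a memory range. */
enum map_hint {
    READS_HAVE_NO_SIDE_EFFECTS = 1u << 0,
    ALLOW_BURST_WRITES         = 1u << 1
};

/* Cache modes an architecture backend might offer (illustrative set). */
enum cache_mode {
    CACHE_UNCACHED,
    CACHE_WRITE_COMBINING
};

/* Capability description filled in by the architecture-specific layer. */
struct arch_caps {
    int has_write_combining;
};

/* Pick the most permissive mode that the hints allow and the hardware has.
 * Falling back to uncached is always correct, only slower. */
enum cache_mode choose_cache_mode(unsigned hints, const struct arch_caps *caps)
{
    const unsigned wanted = READS_HAVE_NO_SIDE_EFFECTS | ALLOW_BURST_WRITES;

    assert(caps != NULL);
    if ((hints & wanted) == wanted && caps->has_write_combining)
        return CACHE_WRITE_COMBINING;
    return CACHE_UNCACHED;
}
```

The driver states what it can tolerate rather than naming an x86 memory type, so the generic code path stays the same on an architecture without write-combining and simply gets the slower mapping.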

Brendan · **Posted:** Sat Feb 04, 2017 12:19 pm

Hi,

Korona wrote:

It is possible to write an OS core (i.e. kernel + drivers) that is clearly divided into a smallish architecture-specific and a larger generic part. Stuff that has to be handled in an architecture-specific way includes context management, page table management, interrupt table management, hardware access (e.g. I/O ports, MMIO and the PCI configuration space) and management of CPU specific features (e.g. VT-x, SMEP, performance counters, machine check handling and so on).

Let's talk about the "accessed" and "dirty" flags for pages. They're very important for virtual memory management (determining what to send to swap space, figuring out if a page from a memory mapped file was modified since it was read from disk, etc). Let's assume that there are:

Some architectures which have "accessed" and "dirty" flags for each virtual page (but not for physical pages)
Some architectures which have "accessed" and "dirty" flags for each physical page (but not for each virtual page)
Some architectures which only have "accessed" flags and don't have "dirty" flags (where you have to emulate your own "dirty" via page faults)
Some architectures which don't have "accessed" or "dirty" flags (where you have to emulate it all via page faults)

You can achieve "portability" by:

Never using "accessed" or "dirty" flags on any architecture (and always emulating them with page faults).
Have an "isAccessedFlagSupportedForPhysicalPages()", "isAccessedFlagSupportedForVirtualPages()", "isDirtyFlagSupportedForPhysicalPages()", and "isDirtyFlagSupportedForVirtualPages()" functions or macros as part of your lower level abstraction; and end up with a virtual memory management that is a complicated mess that tries to handle all the possibilities.
Have a different "mostly generic" virtual memory manager for each case, where your abstraction only abstracts some details (like what format page table entries use) and not others (if there's "accessed" or "dirty" flags).
Have a different virtual memory manager for each architecture, where your abstraction is at a higher level (the interface to the virtual memory manager itself).

You can see that there's already a clear compromise between how easy/hard it is to port and (potential) optimisation.
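To make the "emulate it via page faults" option concrete, here is a small simulation in C of software-maintained dirty tracking. It is a sketch only: `struct sim_pte` and all three functions are hypothetical, and a real kernel would do this in its page-fault handler against real page tables rather than a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated page-table entry for hardware with no dirty flag. */
struct sim_pte {
    bool present;
    bool hw_writable;  /* what the (simulated) MMU enforces */
    bool sw_writable;  /* what the owner is really allowed to do */
    bool sw_dirty;     /* dirty bit maintained by software */
};

/* Map a page "clean": write-protected in hardware so the first store traps. */
void map_clean(struct sim_pte *pte, bool writable)
{
    assert(pte != NULL);
    pte->present = true;
    pte->sw_writable = writable;
    pte->hw_writable = false;
    pte->sw_dirty = false;
}

/* Write-fault handler: returns true if the store may be retried. */
bool handle_write_fault(struct sim_pte *pte)
{
    assert(pte != NULL);
    if (!pte->present || !pte->sw_writable)
        return false;         /* genuine protection violation */
    pte->sw_dirty = true;     /* record the first write */
    pte->hw_writable = true;  /* later writes no longer fault */
    return true;
}

/* Simulated store: faults once per clean page, then succeeds directly. */
bool sim_store(struct sim_pte *pte)
{
    assert(pte != NULL);
    if (pte->present && pte->hw_writable)
        return true;
    return handle_write_fault(pte);
}
```

The cost being traded is one extra fault per page per cleaning cycle, which is exactly the overhead the first "portable" option imposes even on hardware that has the flags.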

However, this is just 2 flags, and there are many other differences - is "no execute" supported, can caching be controlled or influenced by paging, does the TLB exist and does hardware maintain "TLB coherency" or does it need invalidation, are multiple different page sizes supported (and which), etc. The end result is a large number of possible compromises between how easy/hard it is to port and (potential) optimisation.

However, that's just looking at support for architectures before you even start thinking about optimisation properly. Can the virtual memory allocation strategy be optimised to improve cache efficiency, can the virtual memory allocation strategy be optimised for NUMA (and if so, which way/s are best under which conditions), can performance be improved by replacing/discarding "multi-CPU locking" for the single-CPU case, etc. The end result is a huge number of possible compromises between how easy/hard it is to port and (potential) optimisation.

However, that's just looking at the virtual memory manager in isolation. Maybe (on some architectures) some CPUs share TLB entries with other CPUs, so you want to optimise your scheduler to improve TLB efficiency to suit the computer's paging hardware. Maybe (on some architectures) there's a trick you can use to avoid copying data between communicating processes by optimising IPC to suit the computer's paging hardware. Maybe (on some architectures) there's hardware virtualisation that can be heavily optimised to suit the computer's paging hardware. The end result is a massive number of possible compromises between how easy/hard it is to port and (potential) optimisation.

However, that's just looking at "things that involve virtual memory management".

Korona wrote:

Things that can and should be handled in the generic part include scheduling algorithms, memory management algorithms (e.g. buddy allocators for physical allocation, radix or red-black tree for virtual allocation and slab allocators for heap allocations) and driver code.

I'd estimate that this sort of naivety will cost between 5% and 50% of performance (compared to much higher level abstractions - e.g. scheduler, memory manager, etc for each architecture that's optimised specifically for that specific architecture). Note that this has nothing at all to do with "assembly vs. C" and everything to do with "portable vs. non-portable" (e.g. "portable C vs. non-portable C").

Cheers,

Brendan

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3192

I never said that segmentation should be used on its own. It is necessary to use both paging and segmentation. Paging is used for what it was intended for: To map linear memory to physical memory, and provide multiple address spaces. Paging is NOT used for protection because data structures in programs are not page-aligned. This is where segmentation comes in.

In my design, every driver gets two segments: A code segment and a data segment. The driver will use these for internal operations as much as possible because this creates an efficient protection where the driver cannot execute code outside its own address space, and also cannot access memory outside its own address space. It can only access other memory by getting passed 48-bit segmented pointers, and it can only call other code by using predefined OS-calls.
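As an illustration of the check being relied on here (this is not RDOS code), the following C sketch models how a 48-bit far pointer, a 16-bit selector plus a 32-bit offset, is bounds-checked against a descriptor. The struct layouts and `seg_translate()` are hypothetical, and the model deliberately omits privilege checks (RPL/DPL), the GDT/LDT selector bit, segment types and expand-down segments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified descriptor: base and byte-granular limit (last valid offset). */
struct seg_desc {
    uint32_t base;
    uint32_t limit;
    bool present;
};

/* A 48-bit far pointer: 16-bit selector plus 32-bit offset. */
struct far_ptr {
    uint16_t selector;
    uint32_t offset;
};

/* Translate a far pointer for an access of `size` bytes.
 * Returns false where real hardware would raise a fault. */
bool seg_translate(const struct seg_desc *table, size_t entries,
                   struct far_ptr p, uint32_t size, uint32_t *linear)
{
    size_t index = (size_t)(p.selector >> 3);  /* bits 3..15 index the table */

    assert(table != NULL && linear != NULL);
    if (index >= entries || !table[index].present || size == 0)
        return false;

    const struct seg_desc *d = &table[index];
    /* The last byte touched must not exceed the limit (overflow-safe form). */
    if (p.offset > d->limit || size - 1u > d->limit - p.offset)
        return false;

    *linear = d->base + p.offset;
    return true;
}
```

Note that the check only bounds accesses made through a given selector; whether a driver is able to load some other selector is a separate question that this model does not address.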

This scheme can be moved to x86-64 too. Just substitute segment for 4G memory region, and give each driver its own 4G linear address space. Then make sure drivers can use only rip-relative code, and you basically have a similar protection scheme.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

rdos wrote:

In my design, every driver gets two segments: A code segment and a data segment. The driver will use these for internal operations as much as possible because this creates an efficient protection where the driver cannot execute code outside its own address space, and also cannot access memory outside its own address space. It can only access other memory by getting passed 48-bit segmented pointers, and it can only call other code by using predefined OS-calls.

In my design, every driver gets its own address space. The driver can't access anything outside its own address space because those pages either require privileges it doesn't have (the kernel's pages) or aren't mapped at all (everything else). It can only access other memory if the kernel decides it has permission to access additional pages, and it can only call other code through the kernel's system calls.

Which part of my design is lacking in appropriate protection? (Which part of your design prevents a malicious or compromised driver from guessing the segment selector belonging to another program?)

rdos wrote:

This scheme can be moved to x86-64 too. Just substitute segment for 4G memory region, and give each driver its own 4G linear address space. Then make sure drivers can use only rip-relative code, and you basically have a similar protection scheme.

What stops a malicious or compromised driver from generating addresses outside of its 4G memory region?

Schol-R-LEA · **Posted:** Sat Feb 04, 2017 6:42 pm

rdos wrote:

I never said that segmentation should be used on its own. It is necessary to use both paging and segmentation. Paging is used for what it was intended for: To map linear memory to physical memory, and provide multiple address spaces.

This argument has a serious problem, however: segmentation was not (originally) designed for memory protection, either. The 8086 had no memory protection, and all x86 CPUs since then retain this when operating in real mode. In the original design, segmentation was a means of simplifying the addressing, by allowing a 20-bit memory address space to be mapped to 16 address lines. It was designed that way in part to make it easier to port 8-bit 8080/Z80/8080A code, by allowing 16-bit segment-local addressing, but the primary purpose was to save four address pins in the CPU's DIP with the understanding that you'd just take the hit from double-dipping when you needed a wider pointer. No matter how you handled the segments, a program could always use a FAR address to get to anywhere in the address space and the only penalties were in performance and memory use.
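For reference, the real-mode address arithmetic being described is small enough to write out. This is a plain C sketch of the standard 8086 calculation; the function name `real_mode_linear()` is just an illustrative choice.

```c
#include <assert.h>
#include <stdint.h>

/* 8086 real-mode address: the 16-bit segment is shifted left by 4 bits and
 * added to the 16-bit offset. The sum is truncated to 20 bits, which is the
 * wrap-around behaviour of the original 8086. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    return linear & 0xFFFFFu;
}
```

Because segments overlap every 16 bytes, many segment:offset pairs name the same byte (0000:0010 and 0001:0000, for example), and nothing in the calculation restricts which segment value a program may load. That is why a FAR pointer could reach anywhere.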

While memory protection based on segments was added in the 80286, then redesigned and expanded in the 80386, memory protection was no more the purpose of segmentation than it was for paging - and unlike segmentation, a certain amount of memory protection has been part of the paging system from the start.

rdos wrote:

Paging is NOT used for protection because data structures in programs are not page-aligned. This is where segmentation comes in.

I am unclear on just what it is you are arguing for at this point. Is the issue that

a page is too large for most individual data structures, and thus wasteful if each is given a separate page (or, conversely, that you would need to pack multiple data structures into each page, which might expose them to unwanted access if more than one process is sharing a given page which holds data that shouldn't be shared - I can't see why anyone would do that);
that it is too small for some, and thus contains hidden breaks in the data which could affect locality;
that the requirement for each page's permissions to be set rather than having all the related ones set at once for the segment is inefficient compared to segment selectors (except that you can group pages, by page directory entry, so that makes no sense), and/or that the system is too likely to handle the page permissions incorrectly (except that you said you are using both pages and multiple segments, which means that actually adds complexity over a flat layout),
something I have missed entirely?

rdos wrote:

In my design, every driver gets two segments: A code segment and a data segment. The driver will use these for internal operations as much as possible because this creates an efficient protection where the driver cannot execute code outside its own address space, and also cannot access memory outside its own address space. It can only access other memory by getting passed 48-bit segmented pointers, and it can only call other code by using predefined OS-calls.

OK, I am lost here. I was assuming that your argument was that segmentation was more fine grained than paging, but not only is this extremely coarse-grained (if I am reading this right, you aren't even separating the heap and the stack!), it also isn't actually memory protection at all. Each userland process can still read and write any other process' userland segments; you are simply enforcing non-access by convention, a pure honor system. True, you would have to hand it a 48-bit pointer for it to know where the other process' memory is ahead of time - but that does nothing to stop it from trying to scribble over another process' memory by generating a random segment selector address and either segfaulting or trashing something, with some parent process endlessly restarting it when it faults.

Now, I am not going to say you are wrong, or even argue specifically against segmentation when discussing an OS that is explicitly meant to only run on x86-32. I am trying to understand why you disagree with conventional wisdom on this, and what you are seeing as a problem in a flat memory model. Hell, I agree that it is tempting to use segmentation in some ways, and that the flat model requirement in GCC is one of the main reasons it isn't used more. However, even if I were thinking of using segmentation - which is an option if I decide to do a 32-bit x86 implementation of Kether - I would not bake it into the design, but bury it in the runtime synthesis. Since I only know of one other OS that uses runtime synthesis, and it was written in 68000 assembly language, it is safe to say that trying to abstract away segmentation on a design meant for multiple implementations would be pretty problematic.

Brendan · **Posted:** Sat Feb 04, 2017 11:57 pm

Hi,

Octocontrabass wrote:

rdos wrote:

In my design, every driver gets two segments: A code segment and a data segment. The driver will use these for internal operations as much as possible because this creates an efficient protection where the driver cannot execute code outside its own address space, and also cannot access memory outside its own address space. It can only access other memory by getting passed 48-bit segmented pointers, and it can only call other code by using predefined OS-calls.

In my design, every driver gets its own address space. The driver can't access anything outside its own address space because those pages either require privileges it doesn't have (the kernel's pages) or aren't mapped at all (everything else). It can only access other memory if the kernel decides it has permission to access additional pages, and it can only call other code through the kernel's system calls.

Which part of my design is lacking in appropriate protection? (Which part of your design prevents a malicious or compromised driver from guessing the segment selector belonging to another program?)

Possibly "no-execute" page protection (if CPU is ancient and doesn't support it). Apart from that rare case, nothing.

Octocontrabass wrote:

rdos wrote:

This scheme can be moved to x86-64 too. Just substitute segment for 4G memory region, and give each driver its own 4G linear address space. Then make sure drivers can use only rip-relative code, and you basically have a similar protection scheme.

What stops a malicious or compromised driver from generating addresses outside of its 4G memory region?

My guess is that nothing prevents a malicious (or buggy) driver from trashing all other drivers (and all other processes). The funny part is that most of the segments would be larger than 1 MiB and you'd have to use "multiple of 4 KiB granularity" for the segment limits (and wouldn't save much RAM by not using "4 KiB granularity paging" for protection); and that long mode does support segmentation for 32-bit processes and the same "segmented processes" could run unmodified (without even recompiling) in both long mode and protected mode.
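The granularity point follows from the descriptor format: the limit field is only 20 bits wide, so anything past 1 MiB has to be expressed in 4 KiB units. A minimal C sketch of that encoding follows; `struct limit_enc`, `encode_limit()` and `effective_limit()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The two descriptor fields that determine a segment's limit. */
struct limit_enc {
    uint32_t raw20;   /* 20-bit limit field */
    bool granular;    /* G bit: false = byte units, true = 4 KiB units */
};

/* Encode the desired last valid offset into the 20-bit field plus G bit. */
struct limit_enc encode_limit(uint32_t last_offset)
{
    struct limit_enc e;
    if (last_offset <= 0xFFFFFu) {
        e.raw20 = last_offset;        /* fits: byte granularity */
        e.granular = false;
    } else {
        e.raw20 = last_offset >> 12;  /* too large: count 4 KiB units */
        e.granular = true;
    }
    return e;
}

/* With G set, the low 12 bits of the effective limit are all ones. */
uint32_t effective_limit(struct limit_enc e)
{
    return e.granular ? ((e.raw20 << 12) | 0xFFFu) : e.raw20;
}
```

So a segment that needs to end at offset 0x123456 actually ends at 0x123FFF: above 1 MiB the bounds check is no finer than a page.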

Schol-R-LEA wrote:

rdos wrote:

I never said that segmentation should be used on its own. It is necessary to use both paging and segmentation. Paging is used for what it was intended for: To map linear memory to physical memory, and provide multiple address spaces.

This argument has a serious problem, however: segmentation was not (originally) designed for memory protection, either. The 8086 had no memory protection, and all x86 CPUs since then retain this when operating in real mode. In the original design, segmentation was a means of simplifying the addressing, by allowing a 20-bit memory address space to be mapped to 16 address lines. It was designed that way in part to make it easier to port 8-bit 8080/Z80/8080A code, by allowing 16-bit segment-local addressing, but the primary purpose was to save four address pins in the CPU's DIP with the understanding that you'd just take the hit from double-dipping when you needed a wider pointer. No matter how you handled the segments, a program could always use a FAR address to get to anywhere in the address space and the only penalties were in performance and memory use.

Segmentation on 80x86 has a slightly more convoluted history. Once upon a time (a long time ago) some CPU manufacturers used/supported capability-based addressing. Intel implemented it in their iAPX 432 CPU, but that CPU sucked and died (too complex, too slow, too expensive, and not even slightly compatible with 8086 or anything else). Segmentation on 80286 was a kind of "hybrid mangled merge" of segmentation on 8086 (which was only really used to extend the physical address space) and capability-based addressing ideas from iAPX 432.

Cheers,

Brendan

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Hi,

Brendan wrote:

You can see that there's already a clear compromise between how easy/hard it is to port and (potential) optimisation.

Yes, that is absolutely right, but remember that the original post that I responded to did not claim that there are compromises but that "good" compromises are impossible! I claim that it is often possible to find an efficient compromise between code deduplication across all architectures (and thus maintainability) and performance. We can look at your accessed-dirty example to illustrate that.

Brendan wrote:

Let's talk about the "accessed" and "dirty" flags for pages. They're very important for virtual memory management (determining what to send to swap space, figuring out if a page from a memory mapped file was modified since it was read from disk, etc). Let's assume that there are:

Some architectures which have "accessed" and "dirty" flags for each virtual page (but not for physical pages)
Some architectures which have "accessed" and "dirty" flags for each physical page (but not for each virtual page)
Some architectures which only have "accessed" flags and don't have "dirty" flags (where you have to emulate your own "dirty" via page faults)
Some architectures which don't have "accessed" or "dirty" flags (where you have to emulate it all via page faults)

You can achieve "portability" by:

1. Never using "accessed" or "dirty" flags on any architecture (and always emulating them with page faults).
2. Having "isAccessedFlagSupportedForPhysicalPages()", "isAccessedFlagSupportedForVirtualPages()", "isDirtyFlagSupportedForPhysicalPages()", and "isDirtyFlagSupportedForVirtualPages()" functions or macros as part of your lower level abstraction; and ending up with virtual memory management that is a complicated mess that tries to handle all the possibilities.
3. Having a different "mostly generic" virtual memory manager for each case, where your abstraction only abstracts some details (like what format page table entries use) and not others (if there's "accessed" or "dirty" flags).
4. Having a different virtual memory manager for each architecture, where your abstraction is at a higher level (the interface to the virtual memory manager itself).

I guess we both agree that the first two designs would lead to abysmal performance or complexity so we can ignore them. Let's take the third design and try to make it work. Why do we need accessed and dirty bits? Basically there are two operations we need to perform:

Determine inactive pages that can be evicted or moved to swap.
Determine if those pages need to be written back to disk (i.e. check if their physical or virtual dirty bit is set).

When implementing the first operation it is difficult to abstract over per-physical-page vs. per-virtual-page accessed bits. So we just do not try to do that. We expect each arch to implement either a scan_physical_range() or a scan_virtual_range() function that updates an LRU list of accessed pages. Each arch can choose whatever algorithm suits its page table format best. There is no performance compromise here. Now we can have generic code that detects memory pressure (with arch-specific hooks as needed) and determines when to free pages. When a page is actually freed we call an arch-specific is_lru_item_dirty() function that tells us if we have to write the page back to disk. Again, each arch can use an efficient arch-specific algorithm for that and we avoid performance compromises.

So now we require each arch to implement two functions related to accessed-dirty bits. However, we did not really have to compromise performance at all. Note that this somewhat over-simplifies the situation: We still need hooks in the page fault handler to emulate dirty bits based on page faults and so on. But even if we do that it is still a big win over the fourth design: We can share all the allocate-virtual-range and resolve-address-to-virtual-range code, the memory pressure detection, the copy-on-write / fork() logic that we need if we want to implement POSIX, the memory-mapped file logic, things like userfaultfd() if we choose to implement it and so on.
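A rough sketch of how the two hooks could look in C, to show that the generic eviction code never has to know where the bits live. Only is_lru_item_dirty() is a name from the description above. The struct layout, the single scan_range hook (standing in for the scan_physical_range() / scan_virtual_range() choice), the idle counter and the mock architecture are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page record; the hw_* fields stand in for hardware flags. */
struct page {
    bool hw_accessed;     /* set by "hardware" when the page is touched */
    bool hw_dirty;        /* set by "hardware" when the page is written */
    bool resident;        /* still in RAM? */
    unsigned idle_scans;  /* scans since the page was last seen accessed */
};

/* The two arch-specific hooks the generic code relies on. */
struct arch_vm_ops {
    void (*scan_range)(struct page *pages, size_t count);
    bool (*is_lru_item_dirty)(const struct page *page);
};

/* Mock architecture: harvest and clear the accessed flag of each page. */
static void mock_scan_range(struct page *pages, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (pages[i].hw_accessed) {
            pages[i].idle_scans = 0;
            pages[i].hw_accessed = false;
        } else {
            pages[i].idle_scans++;
        }
    }
}

static bool mock_is_lru_item_dirty(const struct page *page)
{
    return page->hw_dirty;
}

const struct arch_vm_ops mock_arch_ops = {
    mock_scan_range, mock_is_lru_item_dirty
};

/* Generic code: evict the resident page that has been idle longest.
 * Returns its index (or -1 if nothing is resident) and reports through
 * *needs_writeback whether it must be written to disk first. */
int evict_one(const struct arch_vm_ops *ops, struct page *pages,
              size_t count, bool *needs_writeback)
{
    int victim = -1;

    ops->scan_range(pages, count);
    for (size_t i = 0; i < count; i++) {
        if (!pages[i].resident)
            continue;
        if (victim < 0 || pages[i].idle_scans > pages[victim].idle_scans)
            victim = (int)i;
    }
    if (victim < 0)
        return -1;

    *needs_writeback = ops->is_lru_item_dirty(&pages[victim]);
    pages[victim].resident = false;
    return victim;
}
```

An architecture that has to emulate dirty bits with page faults would supply a different is_lru_item_dirty() (reading its own software bookkeeping) while evict_one() stays unchanged.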

Yes accessed-dirty is not the only arch-related part of virtual memory management and a real OS will require many more arch-specific hooks. But notice how we reduced the number of VMM code paths from complete duplication to a few hooks per architecture. That is still a big win without compromising performance. Also notice that our abstraction did not lead to an exponential amount of code paths in the generic code. There are a few diverging code paths but we do not need O(archs) but only O(arch-features) code paths. This is still way easier to maintain.

On the other hand if we implement the fourth design and duplicate things like the allocate-virtual-range code across all archs we can be sure to have bit rotting code paths after a few iterations. Technologies like persistent memory or the need for userfaultfd()-like techniques to handle virtual machine migration show that VMM still needs to evolve even if paging has been around for decades. Having to implement these things on a per-arch basis is a great way to introduce performance regressions and bugs.

Brendan wrote:

Note that this has nothing at all to do with "assembly vs. C" and everything to do with "portable vs. non-portable" (e.g. "portable C vs. non-portable C").

Yes I know. I only made remarks about assembly because Schol-R-LEA cited some posts from the "Why do people say C is better than Assembly?" thread and that was tangential to that discussion :wink:

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3192

Octocontrabass wrote:

rdos wrote:

In my design, every driver gets two segments: A code segment and a data segment. The driver will use these for internal operations as much as possible because this creates an efficient protection where the driver cannot execute code outside its own address space, and also cannot access memory outside its own address space. It can only access other memory by getting passed 48-bit segmented pointers, and it can only call other code by using predefined OS-calls.

In my design, every driver gets its own address space. The driver can't access anything outside its own address space because those pages either require privileges it doesn't have (the kernel's pages) or aren't mapped at all (everything else). It can only access other memory if the kernel decides it has permission to access additional pages, and it can only call other code through the kernel's system calls.

I suppose you are writing a microkernel. The problem with your solution is that it is far more expensive than segmentation (two address space switches per syscall, causing two TLB invalidations). It is also not safe because any driver could fabricate a request for your server process and include fraudulent data. Which means you need even more overhead to check parameters, in addition to the huge overhead of the context switches involved.

Octocontrabass wrote:

Which part of my design is lacking in appropriate protection? (Which part of your design prevents a malicious or compromised driver from guessing the segment selector belonging to another program?)

My solution doesn't target malicious code, and I don't think yours can either. It's a solution to isolate bugs to a smaller scope. When you link a flat kernel into a single binary, and stuff all the data together in one segment, any bug in limits or erroneous address calculations is likely to stay undetected and corrupt things for other drivers. The use of segmentation largely avoids this and finds the bugs much faster.

Octocontrabass wrote:

rdos wrote:

This scheme can be moved to x86-64 too. Just substitute segment for 4G memory region, and give each driver its own 4G linear address space. Then make sure drivers can use only rip-relative code, and you basically have a similar protection scheme.

What stops a malicious or compromised driver from generating addresses outside of its 4G memory region?

Nothing, but just like segmentation, it narrows the scope of bugs which means they are found quicker and are less likely to remain in release versions as fatal errors.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3192

Schol-R-LEA wrote:

This argument has a serious problem, however: segmentation was not (originally) designed for memory protection, either. The 8086 had no memory protection, and all x86 CPUs since then retain this when operating in real mode. In the original design, segmentation was a means of simplifying the addressing, by allowing a 20-bit memory address space to be mapped to 16 address lines. It was designed that way in part to make it easier to port 8-bit 8080/Z80/8080A code, by allowing 16-bit segment-local addressing, but the primary purpose was to save four address pins in the CPU's DIP with the understanding that you'd just take the hit from double-dipping when you needed a wider pointer. No matter how you handled the segments, a program could always use a FAR address to get to anywhere in the address space and the only penalties were in performance and memory use.

There is no requirement that it was designed for memory protection. The 32-bit offsets used in RIP-relative x86-64 code were not intended for protection either, but they can still be used that way. By putting code out-of-scope (except for specific references), we narrow down the effect of bugs, which makes them easier to find. If you have problems with the network driver, you can be pretty sure the bug originates there, and not in the video driver. That is something the packed-flat-memory model cannot guarantee.

Schol-R-LEA wrote:

While memory protection based on segments was added in the 80286, then redesigned and expanded in the 80386, memory protection was no more the purpose of segmentation than it was for paging - and unlike segmentation, a certain amount of memory protection has been part of the paging system from the start.

Paging alone is not much better than using physical memory directly, and using physical memory directly would boost performance considerably, especially with the four paging levels in x86-64. Sure, totally fraudulent pointers are likely to generate page faults, but limit violations or miscalculated addresses are not.

Schol-R-LEA wrote:

rdos wrote:

Paging is NOT used for protection because data structures in programs are not page-aligned. This is where segmentation comes in.

I am unclear on just what it is you are arguing for at this point. Is the issue that

a page is too large for most individual data structures, and thus wasteful if each is given a separate page (or, conversely, that you would need to pack multiple data structures into each page, which might expose them to unwanted access if more than one process is sharing a given page which holds data that shouldn't be shared - I can't see why anyone would do that);
that it is too small for some, and thus contains hidden breaks in the data which could affect locality;
that the requirement for each page's permissions to be set rather than having all the related ones set at once for the segment is inefficient compared to segment selectors (except that you can group pages, by page directory entry, so that makes no sense), and/or that the system is too likely to handle the page permissions incorrectly (except that you said you are using both pages and multiple segments, which means that actually adds complexity over a flat layout),
something I have missed entirely?

If Intel had designed for 32-bit selectors, it would be natural to map every object to its own selector. Unfortunately, when Intel designed their 32-bit architecture, they didn't extend selectors to 32 bits, but kept them at 16 bits. That means a segmented OS needs to make compromises in how it uses segmentation so as not to run out of selectors. That's why I have protection at the level of the driver, and then also map some major objects, like thread descriptors and other objects that are not likely to be produced in large quantities, to selectors. Things like FS buffers use flat addresses because they cannot be mapped to selectors.
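To put a number on the selector shortage: a 16-bit selector spends two bits on the requested privilege level and one bit on the table indicator, which leaves 13 bits of index, so each descriptor table tops out at 8192 entries. A minimal C sketch of the encoding follows; the struct and function names are only illustrative.

```c
#include <stdint.h>

#define SELECTOR_INDEX_BITS 13
#define MAX_DESCRIPTORS_PER_TABLE (1u << SELECTOR_INDEX_BITS) /* 8192 */

/* Decoded view of a 16-bit x86 segment selector. */
struct selector_fields {
    unsigned index;  /* descriptor index within the table (13 bits) */
    unsigned table;  /* table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;    /* requested privilege level (2 bits) */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.rpl   = sel & 0x3u;
    f.table = (sel >> 2) & 0x1u;
    f.index = sel >> 3;
    return f;
}

uint16_t make_selector(unsigned index, unsigned table, unsigned rpl)
{
    return (uint16_t)(((index & 0x1FFFu) << 3) |
                      ((table & 0x1u) << 2) |
                      (rpl & 0x3u));
}
```

For example, selector 0x23 decodes to index 4 in the GDT with RPL 3. An allocator that hands out one selector per object runs into the 8192-entry ceiling quickly, which is the compromise described above.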

All page-level protection will operate on page sizes, and thus never maps naturally to software structures. Therefore, paging doesn't support exact limit checking, which segmentation supports.
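A small C model of the difference, under the assumptions that the object starts on a page boundary, that nothing else shares its last page, and that it is small enough (at most 1 MiB) for a byte-granular segment limit. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Paging-style check: every offset inside the pages backing the object
 * is accepted, including the slack after the object's last byte. */
bool page_check_allows(uint32_t object_size, uint32_t offset)
{
    uint32_t mapped = ((object_size + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
    return offset < mapped;
}

/* Segment-style check with a byte-granular limit of object_size - 1. */
bool segment_check_allows(uint32_t object_size, uint32_t offset)
{
    return object_size != 0 && offset <= object_size - 1;
}
```

For a 100-byte object, an off-by-one access at offset 100 passes the page check but fails the segment check. That is the kind of limit violation meant here.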

Schol-R-LEA wrote:

OK, I am lost here. I was assuming that your argument was that segmentation was more fine grained than paging, but not only is this extremely coarse-grained (if I am reading this right, you aren't even separating the heap and the stack!), it also isn't actually memory protection at all.

Each thread has its own kernel stack that is mapped to a selector (and thus has exact limit checking). The heap can allocate both selectors and linear addresses, and the selection is based on the above compromise that 16-bit selectors cause.

Schol-R-LEA wrote:

Each userland process can still read and write any other process' userland segments; you are simply enforcing non-access by convention, a pure honor system. True, you would have to hand it a 48-bit pointer for it to know where the other process' memory is ahead of time - but that does nothing to stop it from trying to scribble over another process' memory by generating a random segment selector address and either segfaulting or trashing something, with some parent process endlessly restarting it when it faults.

Currently, userland uses C/C++ and a flat memory model. Processes are separated by paging just as they are in Windows and Linux. I even implemented the fork and exec functionality of Unix recently. It's only the kernel that uses segmentation. Although it is certainly possible to add a segmented userland executable format since all references passed to kernel use 48-bit addresses. I once had such a format, but I no longer use it.

Schol-R-LEA wrote:

Now, I am not going to say you are wrong, or even argue specifically against segmentation when discussing an OS that is explicitly meant to only run on x86-32. I am trying to understand why you disagree with conventional wisdom on this, and what you are seeing as a problem in a flat memory model.

I think I answered that above and in another post. The flat memory model packs code and data in such a way that there is no separation between drivers, and in fact, all drivers operate in a shared context.

Schol-R-LEA wrote:

Hell, I agree that it is tempting to use segmentation in some ways, and that the flat model requirement in GCC is one of the main reasons it isn't used more. However, even if I were thinking of using segmentation - which is an option if I decide to do a 32-bit x86 implementation of Kether - I would not bake it into the design, but bury it in the runtime synthesis. Since I only know of one other OS that uses runtime synthesis, and it was written in 68000 assembly language, it is safe to say that trying to abstract away segmentation on a design meant for multiple implementations would be pretty problematic.

Probably. I have ported a few C drivers to my OS (ACPI, FreeType), and it wasn't a big problem, but there was a need for an assembly layer to interface with the OS. Today, I sometimes decide to use C for a driver, but I almost always end up with an assembly layer for the interface, so the code is not portable.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

rdos wrote:

I suppose you are writing a microkernel. The problem with your solution is that it is far more expensive than segmentation (two address space switches per syscall, causing two TLB invalidations). It is also not safe because any driver could fabricate a request for your server process and include fraudulent data. Which means you need even more overhead to check parameters, in addition to the huge overhead of the context switches involved.

I'm actually aiming for more of a hybrid kernel, where components that can do a lot of damage regardless of separation (like the memory manager) are part of the kernel since the benefits of separation do not outweigh the costs. This means system calls for things like memory allocation don't require an address space switch at all.

For things like passing messages between tasks, the kernel has no need to validate the contents of the message; it only checks to make sure that the two tasks are allowed to communicate. It's up to the receiver of the message to validate its contents. What's to stop a driver from making fraudulent requests in your system?
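As a rough illustration of that split of responsibilities (not the actual design, which isn't shown in this thread), here is a toy C model: the kernel-side send only consults a permission table and copies opaque bytes, and the receiver decides whether the request makes sense. Every name, the one-slot inbox and the one-byte opcode format are assumptions made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_TASKS   4
#define MAX_PAYLOAD 32

struct message {
    int sender;
    size_t length;
    unsigned char payload[MAX_PAYLOAD];
};

/* may_send[a][b] is true if task a is allowed to message task b. */
static bool may_send[MAX_TASKS][MAX_TASKS];
static struct message inbox[MAX_TASKS];   /* one-slot inbox per task */
static bool inbox_full[MAX_TASKS];

void allow_channel(int from, int to)
{
    may_send[from][to] = true;
}

/* Kernel side: checks task ids, permission and size, then copies the
 * payload without interpreting it. Returns 0 on success, -1 on refusal. */
int kernel_send(int from, int to, const void *data, size_t length)
{
    if (from < 0 || from >= MAX_TASKS || to < 0 || to >= MAX_TASKS)
        return -1;
    if (!may_send[from][to] || length > MAX_PAYLOAD || inbox_full[to])
        return -1;
    inbox[to].sender = from;
    inbox[to].length = length;
    memcpy(inbox[to].payload, data, length);
    inbox_full[to] = true;
    return 0;
}

/* Receiver side: consumes the pending message and validates it itself.
 * In this toy protocol a valid request is a single opcode byte 0..2. */
bool receiver_accepts(int task)
{
    if (!inbox_full[task])
        return false;
    inbox_full[task] = false;
    const struct message *m = &inbox[task];
    return m->length == 1 && m->payload[0] <= 2;
}
```

A task without a channel is refused by kernel_send(); a task with a channel can still send garbage, but it is the receiver that rejects it.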

Don't ignore the costs of using segments with non-zero base addresses. Remember, on modern x86, that adds a lot of overhead too. I can work towards a system that minimizes address space switches; there is nothing you can do to remove segmentation overhead.

rdos wrote:

My solution doesn't target malicious code, and I don't think yours can either. It's a solution to isolate bugs to a smaller scope. When you link a flat kernel into a single binary, and stuff all the data together in one segment, any bug in limits or erroneous address calculations is likely to stay undetected and corrupt things for other drivers. The use of segmentation largely avoids this and finds the bugs much faster.

The use of separate address spaces means my drivers can't corrupt one another, either. So, where is my design lacking in adequate protection? Where does your design prevent malicious code from guessing the right segment to use to access someone else's data?

rdos wrote:

Nothing, but just like segmentation, it narrows the scope of bugs which means they are found quicker and are less likely to remain in release versions as fatal errors.

So you're saying that RDOS is as easy to compromise as MS-DOS?
