Brendan wrote:
20.0 million "near calls" per second on a 1.2 GHz CPU works out to about 60 cycles per "near call". I'd expect that a near call actually costs about 4 cycles, so this first test indicates that the loop overhead is probably about 56 cycles per iteration (probably because the compiler is crap - a decent compiler would have inlined the "do nothing" function, then decided that "sync_val" never changes because it's not volatile and generated a "jmp $" infinite loop).
That's probably not correct. The overhead is not in the loop itself, but in the C procedure that saves registers, checks the stack, and so on. I think it is more reasonable to put the loop overhead at about 10 cycles and the procedure overhead at about 46 instead.
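To make the disagreement concrete, here is a hypothetical reconstruction of the kind of test being discussed (the names and structure are my guesses, not Brendan's actual code):

```c
#include <stdint.h>

int sync_val;        /* deliberately not volatile, as in the quote;
                        a timer or second thread sets it when the
                        timing window ends                          */
uint64_t count;      /* calls/second = count / window length       */

/* The "near call" target. Its prologue/epilogue (register saves,
   stack check and so on) is where I'd place most of the ~46 cycles
   of procedure overhead. */
void do_nothing(void)
{
}

void bench(void)
{
    /* The loop itself (load, test, branch, increment) accounts for
       roughly 10 cycles per iteration. With a non-volatile sync_val
       an optimizer may hoist the load and emit the "jmp $" infinite
       loop Brendan mentions. */
    while (!sync_val) {
        do_nothing();
        count++;
    }
}
```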
Brendan wrote:
2.7 million "call gates" per second on a 1.2 GHz CPU works out to about 444 cycles per "call gate". By subtracting the "about 56 cycles per iteration" loop overhead from above this gives us an actual figure closer to 388 cycles for the call gate alone.
That would be 434 cycles using the corrected loop-overhead figure.
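Spelling the arithmetic out:

\[
\frac{1.2 \times 10^9 \ \text{cycles/s}}{2.7 \times 10^6 \ \text{calls/s}} \approx 444 \ \text{cycles}, \qquad 444 - 10 = 434.
\]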
Brendan wrote:
3.8 million "sysenters" per second on a 1.2 GHz CPU works out to about 316 cycles per "sysenter". By subtracting the "about 56 cycles per iteration" loop overhead again this gives us an actual figure closer to 260 cycles for the sysenter alone. This is about 50% faster than the call gate method.
And that would be 306 cycles, making it just over 40% faster than the call gate.
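With the corrected figures, the speedup over the call gate works out to:

\[
\frac{434}{306} \approx 1.42,
\]

i.e. just over 40% rather than the roughly 50% from the uncorrected figures.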
Brendan wrote:
6.5 million "alternative sysenters" per second on a 1.2 GHz CPU works out to about 185 cycles per "alternative sysenter". By subtracting the "about 56 cycles per iteration" loop overhead again this gives us an actual figure closer to 129 cycles for the alternative sysenter alone. This is about 300% faster than the call gate method, and about 200% faster than the original "sysenter" method.
The corrected figure would be 175 cycles, which is about 150% faster than the call gate.
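And for the alternative sysenter:

\[
\frac{434}{175} \approx 2.48, \qquad \frac{306}{175} \approx 1.75,
\]

so about 150% faster than the call gate and about 75% faster than the original sysenter.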
Brendan wrote:
The only difference between the "sysenter" and "alternative sysenter" method is that the former loads a different SS:ESP while the latter doesn't. Because the alternative method is about 200% faster, this means that loading a different SS:ESP must halve the performance. Loading a different value into ESP is just a normal "mov" and should only cost about 2 cycles. Therefore loading a different value into the SS register must be costing about 126 cycles all by itself. Loading a different value into CS would cost about the same. Therefore, without these (CS and SS) segment loads (at 126 cycles each) the cost of the sysenter and sysexit instructions alone would be about 5 cycles.
I think loading CS is also a lot faster than loading SS (probably similar to loading a general segment register), and SYSENTER/SYSEXIT don't cost 5 cycles, but a lot more.
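Worth noting: the loop-overhead assumption cancels out of the comparison between the two sysenter variants, since

\[
306 - 175 = 260 - 129 = 131 \ \text{cycles},
\]

so the ~131-cycle cost of loading a different SS:ESP isn't really in dispute; the disagreement is only about how the remaining cycles split between the segment loads and the SYSENTER/SYSEXIT pair itself.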
Brendan wrote:
This is undeniable proof that if RDOS didn't use segmentation system calls would be about as fast as a near call.
Yeah, and unreliable.
Brendan wrote:
If a flat application does attempt to forge dodgy values for EIP or ESP it'd only cause a page fault due to the correct use of the supervisor/user flag in page table entries, and would be no worse than the same application doing "jmp somewhere_in_kernel" or "mov esp,somewhere_in_kernel".
Not so, since these values are used to load/save stack state in application space from inside the kernel. The user/supervisor flags are useless when the operations take place in kernel mode: accesses performed at supervisor privilege pass the U/S check no matter where a forged pointer points.
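A hypothetical sketch of the failure mode (the names and layout are illustrative, not the actual RDOS entry code):

```c
#include <stdint.h>

/* The application enters the kernel via SYSENTER with its stack
   pointer in ECX; the kernel then spills state onto that stack. */
void sysenter_entry(uint32_t user_esp, uint32_t user_eip)
{
    uint32_t *ustack = (uint32_t *)(uintptr_t)user_esp;

    /* This store executes at CPL 0. The U/S bit in the page tables
       cannot reject it: supervisor-mode accesses are permitted to
       both user and supervisor pages (absent SMAP). So if user_esp
       was forged to point into another process's data or into the
       kernel itself, the write succeeds unless a segment limit or
       an explicit range check stops it first. */
    *--ustack = user_eip;
}
```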
Brendan wrote:
My main concern (if I understand RDOS enough) would be segmented applications using the SYSENTER interface to break their segments. For example, if you have several segmented applications in the same virtual address space, then one of them could use SYSENTER to modify its SS and then use its SS:ESP to read a different application's data.
The latest version provides full protection. Besides, segmented applications cannot use the SYSENTER interface in the first place (their CS/SS are not flat with a zero base), so they default to call gates only.
There is one issue, though: the CS and SS set up by SYSEXIT have an incorrect limit, which means CS and SS could be used to address the kernel. This is not a big issue, however, as RDOS gives kernel pages supervisor-only access. If you read the code carefully, you can see that I deliberately use DS (which is loaded with a limit that excludes the kernel) when I address the user-supplied stack, so if the user forges ECX, the stack operations will fault in the kernel. For the same reason, I use a CS override for data located in the kernel.
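A minimal sketch of those two tricks, assuming 32-bit code and GCC-style inline assembly (the function names are mine; the real code is presumably plain assembly):

```c
#include <stdint.h>

/* DS-relative store to the user-supplied stack. DS carries a limit
   that excludes the kernel, so if user_esp was forged to point at
   kernel addresses, the limit check raises #GP before any write. */
static inline void push_to_user_stack(uint32_t user_esp, uint32_t value)
{
    asm volatile("movl %0, %%ds:-4(%1)"
                 :
                 : "r"(value), "r"(user_esp)
                 : "memory");
}

/* CS-override load for kernel data. The kernel code segment still
   covers kernel space (and is readable), so kernel-side data can be
   reached through CS even while DS is restricted to user space. */
static inline uint32_t read_kernel_dword(uint32_t kernel_addr)
{
    uint32_t value;
    asm volatile("movl %%cs:(%1), %0"
                 : "=r"(value)
                 : "r"(kernel_addr));
    return value;
}
```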