Hi,
rdos wrote:
I think that it can be concluded that the primary segmentation issue on modern processors is changing the SS register. Since little of the RDOS kernel-mode code manipulates the stack, it would be a good idea to use a flat SS selector in kernel mode on modern processors. That in itself, in conjunction with using SYSENTER/SYSEXIT, could provide a considerable speed-up of syscalls.
For all segment register loads the CPU needs to fetch data from the L1 data cache (or worse) to get the GDT/LDT entry, and then do protection checks on it. An L1 data cache hit alone probably costs about 4 cycles on typical modern x86 CPUs, and that's before any of the checks. For DS, ES, FS and GS loads the CPU can use things like out-of-order execution and register renaming to hide the performance problem, so these segment loads seem to suck less. For CS loads the CPU can't hide the performance problem - it has to wait for the CS load to complete before it can fetch the next instruction. For SS loads I'd assume similar restrictions (e.g. all calls/returns/pushes/pops need to wait for the earlier segment load to complete).
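To make the "GDT/LDT entry" concrete: each one is an 8-byte descriptor with the base, limit and access bits scattered across it, and the CPU has to fetch and validate it on every segment register load. Here's a minimal sketch in C of how such a descriptor is packed, using the flat kernel data segment from the quote as the example. The function name make_descriptor is made up for illustration (it isn't from RDOS or any real kernel), and I'm assuming the usual 32-bit protected mode layout.

```c
#include <stdint.h>

/* Hypothetical helper: pack a legacy 8-byte segment descriptor.
 * access = P | DPL | S | type bits, flags = G | D/B | L | AVL (4 bits). */
static uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);              /* limit 15:0   -> bits 0-15  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;       /* base 23:0    -> bits 16-39 */
    d |= (uint64_t)access << 40;                  /* access byte  -> bits 40-47 */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;   /* limit 19:16  -> bits 48-51 */
    d |= (uint64_t)(flags & 0xF) << 52;           /* flags        -> bits 52-55 */
    d |= (uint64_t)(base >> 24) << 56;            /* base 31:24   -> bits 56-63 */
    return d;
}
```

A flat ring 0 data segment would be make_descriptor(0, 0xFFFFF, 0x92, 0xC): base 0, 4 GiB limit with 4 KiB granularity, present, DPL 0, writable data. That works out to the familiar 0x00CF92000000FFFF. The present bit, DPL and type in the access byte are the sort of thing the protection checks have to look at after the fetch.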
Basically, all segment register loads suck, potentially including (e.g.) loading DS in code where all/most of the following instructions depend on DS; it's just that in some cases the CPU can hide the suckage. Call gates suck twice as much (as the CPU has to fetch the gate's descriptor before it can start fetching the code segment's descriptor). Both SYSENTER and SYSCALL avoid the need to fetch data from the L1 data cache (or worse), and avoid most of the protection checks; therefore they have far less impact on a typical CPU's out-of-order execution pipeline (they'd still cause a temporary blockage, but the blockage is cleared a lot sooner). The same would apply to SYSEXIT/SYSRET compared to "RETF".
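As a rough illustration of why SYSENTER/SYSEXIT don't need the descriptor fetch: the selectors are derived arithmetically from the IA32_SYSENTER_CS MSR, and the CPU fills in fixed flat segment attributes itself instead of reading them from the GDT. The sketch below only models the selector arithmetic as Intel documents it for 32-bit SYSENTER/SYSEXIT; the struct and function names are hypothetical, and it doesn't touch any real MSR (that would need ring 0).

```c
#include <stdint.h>

/* Hypothetical model of the selectors SYSENTER/SYSEXIT derive from
 * the IA32_SYSENTER_CS MSR (32-bit SYSEXIT variant). */
struct fast_selectors {
    uint16_t kernel_cs, kernel_ss, user_cs, user_ss;
};

static struct fast_selectors sysenter_selectors(uint32_t sysenter_cs_msr)
{
    /* Only the low 16 bits are used; the RPL bits are cleared for ring 0. */
    uint16_t base = (uint16_t)(sysenter_cs_msr & 0xFFFC);
    struct fast_selectors s;

    s.kernel_cs = base;                        /* SYSENTER: CS = MSR value      */
    s.kernel_ss = (uint16_t)(base + 8);        /* SYSENTER: SS = CS + 8         */
    s.user_cs   = (uint16_t)((base + 16) | 3); /* SYSEXIT:  CS = MSR + 16, RPL 3 */
    s.user_ss   = (uint16_t)((base + 24) | 3); /* SYSEXIT:  SS = MSR + 24, RPL 3 */
    return s;
}
```

For example, assuming the MSR holds 0x08, this gives kernel CS/SS of 0x08/0x10 and user CS/SS of 0x1B/0x23. This is also why the GDT has to be laid out with those four descriptors in consecutive slots, even though the CPU never reads them on the fast path.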
Cheers,
Brendan