Hi,
rdos wrote:
Brendan wrote:
No. To load a normal register the CPU has to:
- fetch the value
- store that value into the register
To load a segment register the CPU has to (the steps marked with * are things that aren't done when loading a normal register):
- fetch the value
- * determine if it's in the GDT or LDT
- * check if the value is above the GDT/LDT limit
- * check if the entry is marked as "present"
- * check if there's a privilege level problem (for data segments, "max(CPL, RPL) > DPL")
- * load (and decode) the GDT/LDT entry and extract "base address", "limit" and "attributes"
- store that information into the register
All of the things you marked could be done in parallel in one cycle.
That depends on how big a cycle is. If you make the CPU slow enough, even things like "fsqrt" can be done in one cycle. Of course if Intel did make the cycles larger they'd also want most instructions to complete in a fraction of a cycle; and you'd be complaining that your segment register loads take an entire cycle while most other things only take an eighth of a cycle.
Of course not all of these things can be done in parallel either. For example, you can't store the information into a register unless you already know what that information is.
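As a rough sketch (in C, with made-up names; real hardware does much of this in parallel where it can), the extra work for a protected-mode segment load looks something like this:

```c
#include <assert.h>
#include <stdint.h>

/* Decoded "hidden part" of a segment register (struct name is made up). */
typedef struct {
    uint32_t base;
    uint32_t limit;
    uint8_t  attrs;
} seg_cache;

/* Returns 0 on success, or a negative value for the fault the CPU would
   raise. 'table' is the GDT or LDT (selecting between them via bit 2 of
   the selector isn't modelled here) and 'table_limit' is the GDTR/LDTR
   limit in bytes. */
static int load_segment(uint16_t selector, const uint64_t *table,
                        uint32_t table_limit, int cpl, seg_cache *out)
{
    uint32_t index = selector >> 3;               /* bits 15..3: table index */

    if (index * 8u + 7u > table_limit)            /* past the GDT/LDT limit? */
        return -1;                                /* #GP */

    uint64_t d = table[index];                    /* fetch the 8-byte entry  */
    uint8_t access = (uint8_t)(d >> 40);

    if (!(access & 0x80))                         /* "present" bit clear?    */
        return -2;                                /* #NP */

    int dpl = (access >> 5) & 3;
    if (cpl > dpl)                                /* simplified priv. check  */
        return -3;                                /* #GP */

    /* Decode the scattered base/limit fields into the hidden register. */
    out->base  = (uint32_t)(((d >> 16) & 0xFFFFFFu) | ((d >> 56) << 24));
    out->limit = (uint32_t)((d & 0xFFFFu) | (((d >> 48) & 0xFu) << 16));
    out->attrs = access;
    return 0;
}
```

Note how the base and limit fields are scattered across the descriptor (a leftover from the 286 format), so even the final decode step isn't free.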
rdos wrote:
Brendan wrote:
Because loading a normal register is relatively simple it can be done without micro-code; and for some CPUs, in some cases (e.g. loading a register with another register) it isn't even a micro-op (e.g. handled by "register renaming" in the front-end). Loading a segment register is complicated (too complex for a simple micro-op) and therefore there's an extra "fetch micro-ops from micro-code" step involved (in addition to all the other work).
That's no argument, but a preference of the chip designer, who will bother to optimize some things but not others. In fact, segment register loads were optimized in the beginning, and it is a later development to not bother with optimizing them any more.
Brendan wrote:
For all these reasons; loading a segment register should probably be around 10 times slower than loading a normal register.
Look at the instruction timings for i386, and you can see that you are mistaken.
Here's what the "INTEL 80386 PROGRAMMER'S REFERENCE MANUAL 1986" says:
Code:
Opcode    Instruction        Clocks        Description
88 /r     MOV r/m8,r8        2/2           Move byte register to r/m byte
89 /r     MOV r/m16,r16      2/2           Move word register to r/m word
89 /r     MOV r/m32,r32      2/2           Move dword register to r/m dword
8A /r     MOV r8,r/m8        2/4           Move r/m byte to byte register
8B /r     MOV r16,r/m16      2/4           Move r/m word to word register
8B /r     MOV r32,r/m32      2/4           Move r/m dword to dword register
8C /r     MOV r/m16,Sreg     2/2           Move segment register to r/m word
8E /r     MOV Sreg,r/m16     2/5,pm=18/19  Move r/m word to segment register
A0        MOV AL,moffs8      4             Move byte at (seg:offset) to AL
A1        MOV AX,moffs16     4             Move word at (seg:offset) to AX
A1        MOV EAX,moffs32    4             Move dword at (seg:offset) to EAX
A2        MOV moffs8,AL      2             Move AL to (seg:offset)
A3        MOV moffs16,AX     2             Move AX to (seg:offset)
A3        MOV moffs32,EAX    2             Move EAX to (seg:offset)
B0 + rb   MOV reg8,imm8      2             Move immediate byte to register
B8 + rw   MOV reg16,imm16    2             Move immediate word to register
B8 + rd   MOV reg32,imm32    2             Move immediate dword to register
C6        MOV r/m8,imm8      2/2           Move immediate byte to r/m byte
C7        MOV r/m16,imm16    2/2           Move immediate word to r/m word
C7        MOV r/m32,imm32    2/2           Move immediate dword to r/m dword
Code:
Clock counts for instructions that have an r/m (register or memory) operand
are separated by a slash. The count to the left is used for a register
operand; the count to the right is used for a memory operand.
Code:
pm=, a clock count that applies when the instruction executes in
Protected Mode. pm= is not given when the clock counts are the same for
Protected and Real Address Modes.
From this you can see that in protected mode something like "mov es,ax" costs 18 cycles, which is 9 times higher than most other MOV instructions and 4.5 times higher than the second slowest MOV instruction involving normal registers.
From this it's very obvious that (for protected mode) segment register loads have always sucked.
rdos wrote:
Brendan wrote:
For using (rather than loading) a normal index register vs. using a segment register; the CPU would calculate the virtual address (and would probably be optimised to do this very quickly as it's done very often in all code), and would then convert that virtual address into a linear address.
Both GDT and LDT already have linear addresses, so there is one step less there when fetching a descriptor.
For using (rather than loading) a segment register; there's no need to fetch the segment's descriptor at all.
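By contrast, the work for using a segment that's already loaded is just an add and a compare against the cached/hidden fields. A minimal sketch (hypothetical names again):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: once a segment register is loaded, each access only needs the
   cached base and limit from the hidden part of the register; no
   descriptor fetch happens at all. */
static uint32_t seg_to_linear(uint32_t cached_base, uint32_t cached_limit,
                              uint32_t offset, int *fault)
{
    *fault = (offset > cached_limit);     /* out of range would raise #GP */
    return cached_base + offset;          /* linear address */
}
```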
rdos wrote:
You might also compare with the 4-level paging scheme used for long mode. If that had a similar level of optimization to segmentation, code would run hundreds of times slower than it does. And the 4-level paging scheme affects each and every memory access, including code fetches, so it is really hardware intensive.
Intel or AMD could have something like a "GDT/LDT descriptor cache" to make segment register loads faster (in the same way that the TLBs make paging faster). Ironically, AMD researched this, patented it, and then never bothered implementing it.
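To illustrate the idea (purely hypothetical; all names and sizes are made up): a descriptor cache would let repeated loads of the same selector skip the GDT/LDT walk, the same way a TLB hit skips the page table walk:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DCACHE_SLOTS 16   /* made-up size for the sketch */

typedef struct {
    bool     valid;
    uint16_t selector;    /* tag */
    uint32_t base, limit;
} dcache_entry;

static dcache_entry dcache[DCACHE_SLOTS];
static unsigned dcache_hits, dcache_misses;

/* Stand-in for the slow path: full descriptor fetch, checks and decode. */
static void walk_descriptor_table(uint16_t sel, uint32_t *base, uint32_t *limit)
{
    *base  = (uint32_t)sel << 12;   /* fake values, enough for the sketch */
    *limit = 0xFFFF;
}

static void load_segment_cached(uint16_t sel, uint32_t *base, uint32_t *limit)
{
    dcache_entry *e = &dcache[(sel >> 3) % DCACHE_SLOTS];
    if (e->valid && e->selector == sel) {
        dcache_hits++;                       /* fast path: no table access */
    } else {
        dcache_misses++;                     /* slow path: walk the table  */
        walk_descriptor_table(sel, &e->base, &e->limit);
        e->selector = sel;
        e->valid    = true;
    }
    *base  = e->base;
    *limit = e->limit;
}
```

A real implementation would also have to snoop or invalidate entries whenever the GDT/LDT is modified (like TLB shootdowns for page tables), which is presumably part of why nobody bothered.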
Cheers,
Brendan