Brendan wrote:
With this in mind, the first thing I'd be doing (for multi-CPU) is saying "FPU state is saved during task switches whenever the previous task used it". That avoids all synchronisation, IPI overhead and task migration problems. It also means I'd only be considering "delayed FPU state loading" (and not saving).
Exactly. Avoiding IPIs, especially to many CPUs, is essential both for performance and for ease of debugging and getting it to work under all conditions.
Brendan wrote:
When a "device not available" exception occurs you load the FPU state, and set a flag so the scheduler knows that the FPU state needs to be saved when there's a task switch.
The second step would be tracking how often each task uses the FPU. If you detect that a task uses the FPU most of the time, then you can avoid the overhead of a likely "device not available" exception by pre-loading the FPU state during the task switch.
This is my current logic, except that I also check whether the next scheduled thread is the one that already owns the FPU context.
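To make the logic concrete, here is a rough sketch of it as ordinary C, with the hardware simulated so it can be compiled and stepped through in user space. Everything in it is made up for illustration: the struct and function names, the `ts` field standing in for CR0.TS, the counters standing in for real save/restore instructions, and the preload threshold of 3. It also assumes a single CPU; with migration, a thread that ran elsewhere would have to lose ownership of the stale registers here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FPU_PRELOAD_THRESHOLD 3   /* hypothetical tuning value */

struct thread {
    int  id;
    bool used_fpu;     /* FPU state must be saved at the next switch */
    int  fpu_streak;   /* consecutive time slices that used the FPU */
};

struct cpu {
    struct thread *current;
    struct thread *fpu_owner;  /* thread whose state sits in the registers */
    bool ts;                   /* models CR0.TS */
    int  saves, loads, traps;  /* counters standing in for real work */
};

/* Stand-ins for the real save/restore instructions. */
static void fpu_save(struct cpu *c, struct thread *t) { (void)t; c->saves++; }
static void fpu_load(struct cpu *c, struct thread *t) { c->loads++; c->fpu_owner = t; }

/* "Device not available" (#NM) handler: load on first use in this slice. */
void device_not_available(struct cpu *c)
{
    struct thread *t = c->current;
    assert(t != NULL);
    c->traps++;
    c->ts = false;                 /* what CLTS would do */
    if (c->fpu_owner != t)
        fpu_load(c, t);
    t->used_fpu = true;            /* scheduler must save at next switch */
}

void task_switch(struct cpu *c, struct thread *next)
{
    struct thread *prev = c->current;

    if (prev != NULL) {
        if (prev->used_fpu) {      /* always save if used: no IPIs needed */
            fpu_save(c, prev);
            prev->fpu_streak++;
        } else {
            prev->fpu_streak = 0;
        }
        prev->used_fpu = false;
    }

    c->current = next;
    if (next == c->fpu_owner) {
        /* Registers still hold this thread's state: skip the reload.
         * With TS clear we cannot see whether it uses the FPU, so
         * conservatively assume it does. */
        c->ts = false;
        next->used_fpu = true;
    } else if (next->fpu_streak >= FPU_PRELOAD_THRESHOLD) {
        fpu_load(c, next);         /* heavy user: preload, avoid the trap */
        c->ts = false;
        next->used_fpu = true;
    } else {
        c->ts = true;              /* first FPU instruction will trap */
    }
}
```

The owner shortcut costs some accuracy in the usage tracking, since a thread that keeps getting its registers back for free is always counted as an FPU user. Leaving TS set in that case and skipping only the load inside the exception handler would keep the statistics exact at the price of one trap.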
Brendan wrote:
The next step would be "delayed FPU state initialisation". When a task is created, set a flag saying "FPU state not initialised", and if/when the task uses the FPU, initialise the FPU state in the "device not available" exception handler.
I'd initialize the state in the thread control block at thread creation time instead. There is no need to execute the "finit" operation in the new task context anyway. Basically, setting the tag register to 0xFFFF (and the control word to the finit default of 0x037F, so that exceptions stay masked) will do it.
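As a sketch of what that could look like, assuming the thread control block holds a 32-bit protected-mode FSAVE/FRSTOR image (108 bytes): the struct and function names below are invented for the example, but the values are the ones FINIT leaves behind (control word 0x037F, status word 0, tag word 0xFFFF, pointers cleared).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit protected-mode FSAVE/FRSTOR memory image (hypothetical name). */
struct fsave_image {
    uint16_t control_word, reserved0;
    uint16_t status_word,  reserved1;
    uint16_t tag_word,     reserved2;
    uint32_t instr_ptr;
    uint16_t instr_sel,    opcode;
    uint32_t operand_ptr;
    uint16_t operand_sel,  reserved3;
    uint8_t  st[8][10];            /* ST0..ST7, 80 bits each */
};

_Static_assert(sizeof(struct fsave_image) == 108,
               "FSAVE image must be 108 bytes");

/* Fill in the state FINIT would produce, without touching the FPU. */
void fpu_image_init(struct fsave_image *img)
{
    assert(img != NULL);
    memset(img, 0, sizeof *img);
    img->control_word = 0x037F;    /* all exceptions masked, 64-bit precision */
    img->tag_word     = 0xFFFF;    /* all eight registers tagged empty */
}
```

Calling `fpu_image_init()` once at thread creation means the first "device not available" exception can simply restore the image like any other. Note that the FXSAVE layout is different (its abridged tag uses a zero bit for "empty"), so this particular image only fits FSAVE/FRSTOR.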
Brendan wrote:
Of course all of the above applies to MMX (and SSE and AVX) too.
However, for SSE it may be possible to also use the "OSFXSR" flag in CR4 to detect if a task actually uses SSE; and avoid loading and saving SSE state for tasks that only use FPU/MMX and don't use SSE. When a task is created you'd set a flag saying "FPU state not initialised" and another flag saying "SSE state not initialised". When you get the first "device not available" exception you initialise FPU state, set the "FPU state initialised" flag and return (like before); and if/when you get an "invalid opcode" exception you check if it was an SSE instruction, initialise SSE state and set the "SSE state initialised" flag (and also initialise FPU state if it hasn't already been initialised). If SSE state has been initialised, when you switch to the task you'd set TS and OSFXSR (and any "device not available" exception would cause both FPU and SSE to be loaded, rather than just FPU).
After all that comes AVX, where things get messy (but the basic idea behind "avoid loading and saving SSE state" should work for AVX too, just with XGETBV, XSETBV and XCR0 instead of OSFXSR alone).
Messy. I don't bother with MMX, SSE and AVX yet. They are not used by any tasks yet.
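For whenever SSE does become relevant, the flag handling Brendan describes could be sketched roughly as below. This is only the bookkeeping, not working kernel code: the struct, the function names, and the `is_sse_insn` parameter (which assumes some instruction decoder has already classified the faulting opcode) are all hypothetical. Only the bit positions are real: TS is bit 3 of CR0 and OSFXSR is bit 9 of CR4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CR0_TS      (1u << 3)   /* task switched: FPU use raises #NM */
#define CR4_OSFXSR  (1u << 9)   /* clear: SSE instructions raise #UD */

/* Per-task flags (hypothetical layout). */
struct task_fp {
    bool fpu_init;   /* "FPU state initialised" */
    bool sse_init;   /* "SSE state initialised" */
};

/* CR0 value to use when switching to any task: always arm the #NM trap. */
uint32_t cr0_for_task(uint32_t cr0)
{
    return cr0 | CR0_TS;
}

/* CR4 value to use when switching to a task: SSE is only enabled for
 * tasks that have already been seen using it. */
uint32_t cr4_for_task(uint32_t cr4, const struct task_fp *t)
{
    assert(t != NULL);
    return t->sse_init ? (cr4 | CR4_OSFXSR) : (cr4 & ~CR4_OSFXSR);
}

/* "Device not available": first FPU use initialises the FPU state. */
void on_device_not_available(struct task_fp *t)
{
    t->fpu_init = true;
}

/* "Invalid opcode": returns true if the instruction should be retried
 * because SSE has just been enabled for this task, false if the fault
 * is genuine and should be reported to the task. */
bool on_invalid_opcode(struct task_fp *t, bool is_sse_insn)
{
    if (!is_sse_insn || t->sse_init)
        return false;
    t->sse_init = true;
    t->fpu_init = true;   /* initialise FPU state too if not done yet */
    return true;
}
```

A task that never touches SSE keeps OSFXSR clear for its whole life, so only the FPU/MMX part of its state would ever need saving and loading.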