@OP
The problem is that you view switch_task as a procedure that on its own takes you from any context to any other context. That is, it will take you from a kernel mode context and land you directly in user mode somewhere. You can implement it that way, but the code will be complicated and stuffed with branches, carrying far too much responsibility (including the iret and ret return paths, as you noticed). It will be coupled to the implementation details of all kinds of other kernel facilities, like interrupt handlers and thread creation routines.
Instead, kernels have a function that does something more appropriately named switch_kernel_call_stack (versus your switch_task). switch_kernel_call_stack cannot take you to user mode, cannot switch the processor mode, and cannot exit interrupt handlers. It only switches the current kernel call stack and performs a normal return. The caller then has the responsibility to perform any work that the new thread context additionally requires. If switch_kernel_call_stack returns into an interrupt handler, the handler will iret. If that iret lands the eip in kernel code, this will most likely resume a kernel thread or an incomplete system call. If it lands in user mode, it will resume an interrupted user thread. In the special (and rather convoluted) case that switch_kernel_call_stack returns into a user thread setup routine, the latter will have to complete the user space thread creation (i.e. the part which has to be performed inside the kernel anyway). I hope this will not confuse you, but another way to think about it is that you should be implementing user space and kernel space preemption on top of cooperative multitasking: switch_kernel_call_stack implements the classic cooperative call stack switching, and the interrupt routines provide the preemption mechanics.
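To make the "preemption on top of cooperative switching" idea concrete, here is a rough 32-bit x86 timer interrupt stub (NASM syntax). The names scheduler_tick and switch_kernel_call_stack, and the PIC EOI details, are assumptions for illustration, not a prescribed design:

```nasm
; Hypothetical timer interrupt stub. Preemption happens only because
; scheduler_tick may, deep inside C code, call switch_kernel_call_stack.
; The interrupted thread's full context stays parked on its own kernel
; stack until that thread is scheduled again.
timer_irq:
    pusha                    ; save the interrupted register context
    call scheduler_tick      ; may cooperatively switch stacks inside
    ; if a switch happened above, execution resumes here only when
    ; this thread's kernel stack is switched back in
    mov al, 0x20
    out 0x20, al             ; EOI to the master PIC (placeholder setup)
    popa
    iret                     ; resume whatever this stack's frame describes
```

The point of the sketch is that the iret belongs to the interrupt handler, not to the stack switch; the switch itself is just an ordinary call made from within the handler.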
Your question about eflags again stems from the fact that you are trying to stuff too much responsibility into switch_task, which as I said should be semantically more like switch_kernel_call_stack. From now on, I will talk about switch_kernel_call_stack instead, to assert that point. As Brendan already stated, the flags indicate transient state, which is relevant when you are interrupted midway through your code, which is why iret restores them, but is not relevant when you perform an explicit call. Why? Because they are not callee-saved state. And switch_kernel_call_stack has not interrupted anyone. It has been called. Thus it needs to save and restore not all possible registers, but only the callee-saved registers. It may be called from inside an interrupt handler, but it is not its job to deal with that directly for the most part. The only relevant exception is dealing carefully with the interrupt flag (i.e. IF), because you need to avoid stack overflow hazards, as I already mentioned in another post. The point is that when you call a function, you don't expect it to restore ZF or SF. The arithmetic flags in FLAGS are not callee-saved state. IF must be preserved, and the remaining EFLAGS bits will not change from one kernel thread context to another. And again, switch_kernel_call_stack is just a normal call, as far as its caller is concerned. It will return much later, but when it returns, it will behave just as a normal routine returning from its job.
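The whole routine can therefore be tiny. Here is a sketch under assumed cdecl conventions, switch_kernel_call_stack(uint32_t *old_esp_save, uint32_t new_esp), saving only the i386 callee-saved registers plus EFLAGS for the sake of IF; the argument layout is my invention, not the only way to do it:

```nasm
; switch_kernel_call_stack(uint32_t *old_esp_save, uint32_t new_esp)
; Saves callee-saved state on the old kernel stack, publishes the old
; ESP, loads the new one, and restores the new thread's saved state.
switch_kernel_call_stack:
    pushfd                   ; EFLAGS, mainly so IF survives the switch
    push ebp                 ; callee-saved registers only; caller-saved
    push ebx                 ; registers and arithmetic flags are the
    push esi                 ; caller's problem, as with any normal call
    push edi
    mov eax, [esp + 24]      ; arg 1: where to store the old ESP
    mov [eax], esp           ; publish old stack top
    mov esp, [esp + 28]      ; arg 2 (read via old ESP): new stack top
    pop edi                  ; from here on we unwind the NEW thread's
    pop esi                  ; kernel stack
    pop ebx
    pop ebp
    popfd                    ; restores the new thread's IF state
    ret                      ; "returns" into the new thread's caller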
Regarding new user threads, it will be good to understand how control is passed into user mode with iret. This happens for various reasons. It can happen when a system call completes (if int 0x80 is used), when an interrupt handler ends, or even deliberately at any point in the kernel code. The kernel can perform an iret from any place, to call out into user space if it so desires. To do so, it will push ss, esp, eflags, cs, eip, and then perform iret. This will launch whatever user procedure the kernel wants on whatever user stack the kernel wishes. This is very specific, but it illustrates that iret is the general mechanism for starting user mode code, whether in order to resume it, to call out into it, or to initially start the user thread.
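That "call out into user space" sequence looks something like this. The selectors 0x23/0x1b and the labels user_esp/user_eip are placeholders; your GDT layout will differ:

```nasm
; Deliberately dropping to ring 3 from arbitrary kernel code.
; iret pops eip, cs, eflags, esp, ss in that order, so we push
; them in reverse.
    push dword 0x23          ; user SS (placeholder selector, RPL 3)
    push dword [user_esp]    ; user stack pointer
    pushfd
    pop eax
    or eax, 0x200            ; make sure IF is set for user mode
    push eax                 ; user EFLAGS
    push dword 0x1b          ; user CS (placeholder selector, RPL 3)
    push dword [user_eip]    ; user entry point
    iret                     ; processor loads all five and enters ring 3
```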
So, we can assume that you need to arrange for switch_kernel_call_stack to return into some run_user_thread type of routine. For simplicity, let's also assume that the thread creator has populated the kernel stack beneath the return address to run_user_thread with the user context. That is, let's assume that below the callee-saved context that switch_kernel_call_stack restores (which the thread creator must also populate), lies the address of run_user_thread (which switch_kernel_call_stack returns into), and below that are the eip, cs, eflags, esp, and ss for the initial user context. run_user_thread can consist of a simple iret in this case. You may want to distribute more work here to offload the creator, but this will be more complicated and thus I avoid it deliberately.
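Spelled out, the creator would write a frame like the following onto the fresh kernel stack (names and the exact register order are assumptions; the order must simply mirror whatever your switch routine pops):

```nasm
; New kernel stack as prepared by the thread creator, lowest address
; (stack top, what switch_kernel_call_stack sees first) at the top:
;
;   edi     = 0                       ; callee-saved frame that the
;   esi     = 0                       ; switch routine will pop
;   ebx     = 0
;   ebp     = 0
;   eflags  = 0x2                     ; IF clear during the setup tail
;   retaddr = run_user_thread         ; where the switch's ret lands
;   eip     = user entry point        ; iret frame for the first
;   cs      = user code selector      ; entry into ring 3
;   eflags  = user EFLAGS, IF set
;   esp     = user stack top
;   ss      = user data selector
;
; With that frame in place, the setup routine is one instruction:
run_user_thread:
    iret                     ; consume the pre-built frame, enter user mode
```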
Last, but not least, I am not talking about process creation, but thread creation. Process creation requires that you set up a new address space, allocate a user stack, start an image loader somewhere, etc. This requires a lot of additional thought about how the work will be distributed between kernel space and user space libraries, and whether the fork-exec model or the create-from-executable model will be used.
Edit: I noticed that we are avoiding the issue of populating the GS descriptor. This is the only rather exotic functionality that switch_kernel_call_stack may have to perform if you use gs-based addressing to access per-cpu/per-thread data like most kernels do. It is technically callee-saved state, although descriptors are not something ABIs explicitly talk about.
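One common way to do it, sketched under the assumption of a dedicated GDT entry per CPU (gdt_percpu_entry and SEL_PERCPU are made-up names): the scheduler rewrites the descriptor's base to point at the incoming thread's data and then reloads gs so the processor refreshes its cached copy.

```nasm
; set_gs_base(uint32_t base): point the per-thread GDT descriptor at
; 'base' and reload gs. Byte offsets follow the i386 segment
; descriptor layout: base bits 0..15 at bytes 2-3, 16..23 at byte 4,
; 24..31 at byte 7.
set_gs_base:
    mov eax, [esp + 4]
    mov [gdt_percpu_entry + 2], ax   ; base bits 0..15
    shr eax, 16
    mov [gdt_percpu_entry + 4], al   ; base bits 16..23
    mov [gdt_percpu_entry + 7], ah   ; base bits 24..31
    mov ax, SEL_PERCPU
    mov gs, ax                       ; refresh the hidden descriptor cache
    ret
```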