iretq returning to incorrect location when I context switch

j4cobgarby · Post by **j4cobgarby** » Sat Feb 25, 2023 10:35 am

I'm working on some code to test the theory behind context switching. I have two assembly procedures `save_regs` and `load_regs`, and this question concerns `load_regs`.

This procedure is supposed to load register values from memory into the CPU. The values are passed to it from C, as the first and only parameter to the procedure. Of course one of the registers it loads is RIP, and so the behaviour I want is that the CPU will jump to execute wherever RIP points.

The code is as follows (GNU assembly):

Code: Select all

load_regs:
    push %rbp
    mov %rsp,   %rbp

    xchg %bx, %bx

    # Load RFLAGS
    mov 152(%rdi), %rax
    push %rax
    popfq

    # Load CR3
    mov 160(%rdi), %rax
    mov %rax, %cr3

    # Load general purpose regs
    mov 8(%rdi), %rbx
        ... These lines omitted
    mov 120(%rdi), %r15

    cli
    # Push RFLAGS
    pushfq
    # Push CS
    mov 136(%rdi), %rax
    push %rax
    # Push saved instruction pointer as iret return address
    mov 128(%rdi), %rax
    push %rax
    mov (%rdi), %rax

    iretq

If I call this procedure where the target RIP is 0xffff,ffff,8000,0d10, for example, then just before the iretq instruction the stack looks like this:

Code: Select all

 | STACK 0xffff80007ff85f18 [0xffffffff80000d10]
 | STACK 0xffff80007ff85f20 [0x0010001000100008]
 | STACK 0xffff80007ff85f28 [0x0000000000000082]

As you can see, at the top of the stack is the correct address for RIP, followed by the correct CS value and the correct RFLAGS value. When iretq is executed though, RIP is for some reason set to 0xffffffff80000cd0.

My understanding of iretq is that it should perform the following operation:(https://namazso.github.io/x86/html/IRET ... IRETQ.html)

Code: Select all

RIP := Pop();
CS := Pop(); (* 64-bit pop, high-order 48 bits discarded *)
tempRFLAGS := Pop();

and so I don't see why RIP should not end up being equal to 0xffffffff80000d10 rather than 0x[...]0cd0. It seems my understanding must be lacking somewhere.

I'll explain the logic of the `load_regs` procedure:
- Up until the cli instruction I simply take the corresponding register values from the struct pointer and load them into the registers. This works.
- After the cli instruction, I set up the top of the stack for iretq to use. First RFLAGS is pushed, then the CS register is pushed from the struct.
- Finally, I push the RIP value from the struct on to the stack, and then put the correct value of rax into that register.

What's going wrong here? Clearly most of this procedure is working as intended, at least to my understanding, because the stack looks as if it's set up correctly by the end.

j4cobgarby · Post by **j4cobgarby** » Sat Feb 25, 2023 10:37 am

Update: It appears that when I run iretq I may be getting a #GP fault, and the RIP it's jumping to may be the #GP interrupt handler. I'm looking more into this, will update.

qookie · Post by **qookie** » Sat Feb 25, 2023 11:21 am

In 64-bit mode the CPU also pops RSP and SS off of the stack during an IRET, even with CS RPL=0 (although I don't remember whether it uses the value it pops).

rdos · Post by **rdos** » Sat Feb 25, 2023 1:09 pm

In addition to missing SS & RSP, there is also a need to handle CR3 and/or TR.

It doesn't make sense to save the value of CS (and SS) in the task structure in long mode since those are disabled, and so the most reasonable is to use push ss and push cs, and not something from a task structure.

Of course, if you switch tasks, the destination must have a new stack image, and you cannot pass the original stack.

In addition to that, there is no need for cli. A real task switcher needs some more advanced method to ensure the task switcher is not disturbed by interrupts. In fact, iretq and the pushing of CS & RIP is not sensitive to interrupts.

nullplan · Post by **nullplan** » Sat Feb 25, 2023 1:49 pm

I don't think this can work. I would combine both routines into one. Make one routine that takes as argument a pointer to the current task's stack pointer and as second argument the new task's stack pointer. Something along these lines:

Code: Select all

struct task {
  ...
  void *sp;
  ...
};

void switch_tasks(void **old_sp, void *new_sp);

The AMD64 ABI specifies which registers are preserved across function calls, and so it is unnecessary to save the other registers. You need to save RBX, RSP, RBP, R12-R15, and of course RIP. Although RIP is already on the stack from the call instruction, so it doesn't need explicit saving. I would write it like this:

Code: Select all

switch_tasks:
  pushq %rbx
  pushq %rbp
  pushq %r12
  pushq %r13
  pushq %r14
  pushq %r15
  movq %rsp, (%rdi)
  movq %rsi, %rsp
  popq %r15
  popq %r14
  popq %r13
  popq %r12
  popq %rbp
  popq %rbx
  retq

So that saves all the nonvolatile registers, saves the stack pointer into the old task structure, switches stacks, then restores the nonvolatile registers and returns. This means this call "returns" to the new task. But another call to the same function will at some point return to this call's old task. That is why the function is not declared as "noreturn". I also see no reason to muck with the interrupt flag at any point during this call. Interrupts can be handled the entire time this code is running. At no point does RSP not point to the stack.

Anyway, initializing this for a new thread is as simple as generating a stack frame as the above function expects it. Maybe something like

Code: Select all

extern const char start_kernel_task[];
struct task *new_kernel_task(void (*fn)(void *), void *arg)
{
  struct task *r = alloc_task();
  r->stack = alloc_stack(); /* I don't know how you organize your stacks. Make sure you get the top end of the stack here. */
  struct initframe {
    uint64_t r15, r14, r13, r12, rbp, rbx, rip;
  ] *iframe = (struct initframe*)r->stack - 1;
/* save arguments to stack, set initial routine to start_kernel_task */
  iframe->r15 = fn;
  iframe->r14 = arg;
  iframe->rip = start_kernel_task;
  r->stack = iframe;
  set_task_state(r, TASK_RUNNABLE);
  return r;
}

Code: Select all

start_kernel_task:
/* function in r15, argument in r14. Need to set RBP to zero to mark lowest stack frame, and align RSP. */
  xorl %ebp, %ebp
  andq $-16, %rsp
  movq %r14, %rdi
  callq *%r15
/* now call end_task and hope that never returns.  */
  callq end_task
  ud2

I am assuming end_task() is a C routine that does not return. The "ud2" there prevents the prefetcher from running off into uncharted territory after that call. But it has to be a call for stack alignment.

This routine is only capable of switching between tasks in kernel mode. But that is no problem, since even a user task is put into kernel mode when a timer interrupt occurs. I would not save segment registers here. I see no reason to since they should be the same before and after the switch, since you are only changing from one kernel-mode task to another one. Once user tasks get into the mix, you will need to manage CR3. There is a host of ways to do this. Not sure what rdos means with TR; that is only loaded once. However, you will need to update TSS.RSP0 if user tasks get into the mix and you implement one kernel stack per task (as I am going to assume here).

rdos · Post by **rdos** » Sat Feb 25, 2023 4:42 pm

TR comes into play if you decide to have one kernel stack per task, and don't want to mess with fields in a single TSS on task switches. I have one TSS per thread as that is more convinient. The TSS also contain debug registers, CR3 and the IO permission bitmap, all which I use too. This design also allows user-mode code to identify threads with str without needing syscalls.

nullplan · Post by **nullplan** » Sun Feb 26, 2023 12:30 am

rdos wrote:TR comes into play if you decide to have one kernel stack per task, and don't want to mess with fields in a single TSS on task switches. I have one TSS per thread as that is more convinient. The TSS also contain debug registers, CR3 and the IO permission bitmap, all which I use too. This design also allows user-mode code to identify threads with str without needing syscalls.

Ah, OK. Well, I only update TSS.RSP0, keep the CR3 separately in the task structure, and don't use IOPL. And userspace can identify threads with by just reading %fs:0 (as each thread has its own registers). Yes, the PID requires a syscall to obtain in my design, that is true.

nullplan wrote:I don't think this can work.

I think I need to explain a bit more what I meant. I meant that conceptually, you have a big problem even if your save and load functions did what they say. Say your scheduler switches tasks with

Code: Select all

save_registers(&current_task.regs);
load_registers(&new_task.regs);

What registers get saved to the current task? In particular, what RIP gets saved to the current task? It's the RIP from just before calling load_registers with the next task. So you call load_registers on the next task, it immediately calls load_registers on the task it switched to when it called save_registers, and the task you wanted to jump to is skipped. Indeed, tasks saved in this manner can form a circle and you end up with an infinite loop. And since you are also disabling interrupts somewhere in there, you can never break out of it.

That is a problem I solve by only recording the top of stack in the task structure and having a single function that switches stacks and non-volatile registers. And for inspecting registers of inactive tasks, e.g. for debugging, those are not the registers you want to look at, anyway. You want to look at the registers created at the interrupt/system call boundary, since those are the userspace registers. My solution for that is to put the task structure at the top end of the kernel stack and set RSP0 equal to the task structure's address. All entry code into the kernel creates the same stackframe, so there is only one stackframe type to give to userspace, and the registers are always saved in that manner directly below the task structure whenever the task is suspended. Which is the only safe time to inspect and change these regs anyway.

rdos · Post by **rdos** » Sun Feb 26, 2023 4:42 am

nullplan wrote:
rdos wrote:TR comes into play if you decide to have one kernel stack per task, and don't want to mess with fields in a single TSS on task switches. I have one TSS per thread as that is more convinient. The TSS also contain debug registers, CR3 and the IO permission bitmap, all which I use too. This design also allows user-mode code to identify threads with str without needing syscalls.
Ah, OK. Well, I only update TSS.RSP0, keep the CR3 separately in the task structure, and don't use IOPL. And userspace can identify threads with by just reading %fs:0 (as each thread has its own registers). Yes, the PID requires a syscall to obtain in my design, that is true.
nullplan wrote:I don't think this can work.
I think I need to explain a bit more what I meant. I meant that conceptually, you have a big problem even if your save and load functions did what they say. Say your scheduler switches tasks with
Code: Select all
save_registers(&current_task.regs);
load_registers(&new_task.regs);
What registers get saved to the current task? In particular, what RIP gets saved to the current task? It's the RIP from just before calling load_registers with the next task. So you call load_registers on the next task, it immediately calls load_registers on the task it switched to when it called save_registers, and the task you wanted to jump to is skipped. Indeed, tasks saved in this manner can form a circle and you end up with an infinite loop. And since you are also disabling interrupts somewhere in there, you can never break out of it.

That is a problem I solve by only recording the top of stack in the task structure and having a single function that switches stacks and non-volatile registers. And for inspecting registers of inactive tasks, e.g. for debugging, those are not the registers you want to look at, anyway. You want to look at the registers created at the interrupt/system call boundary, since those are the userspace registers. My solution for that is to put the task structure at the top end of the kernel stack and set RSP0 equal to the task structure's address. All entry code into the kernel creates the same stackframe, so there is only one stackframe type to give to userspace, and the registers are always saved in that manner directly below the task structure whenever the task is suspended. Which is the only safe time to inspect and change these regs anyway.

There are many ways to solve this. The TSS and the task structure are shared in my design. CPU register contents are saved in this shared area. When I still used hardware task switching, the registers were saved & restored from the TSS automatically, but as I stopped using it because of problems with multicore operation of these functions, I saved & loaded them with software from their normal positions. Now I can save long mode registers too, and so the registers no longer are at their default positions. Hardware task-switching is not supported in long mode, but registers can still be saved in the task structure. The advantage of saving registers in the task structure is that it is much easier to create a kernel debugger that needs to single step and modify register context, which is a lot more complicated if registers are saved on the kernel stack for user mode and in other ways in kernel (if kernel debugging is supported at all).

j4cobgarby · Post by **j4cobgarby** » Sun Feb 26, 2023 8:50 am

nullplan wrote:I don't think this can work. I would combine both routines into one. Make one routine that takes as argument a pointer to the current task's stack pointer and as second argument the new task's stack pointer. Something along these lines:
Code: Select all
struct task {
  ...
  void *sp;
  ...
};

void switch_tasks(void **old_sp, void *new_sp);
The AMD64 ABI specifies which registers are preserved across function calls, and so it is unnecessary to save the other registers. You need to save RBX, RSP, RBP, R12-R15, and of course RIP. Although RIP is already on the stack from the call instruction, so it doesn't need explicit saving. I would write it like this:
Code: Select all
switch_tasks:
  pushq %rbx
  pushq %rbp
  pushq %r12
  pushq %r13
  pushq %r14
  pushq %r15
  movq %rsp, (%rdi)
  movq %rsi, %rsp
  popq %r15
  popq %r14
  popq %r13
  popq %r12
  popq %rbp
  popq %rbx
  retq
So that saves all the nonvolatile registers, saves the stack pointer into the old task structure, switches stacks, then restores the nonvolatile registers and returns. This means this call "returns" to the new task. But another call to the same function will at some point return to this call's old task. That is why the function is not declared as "noreturn". I also see no reason to muck with the interrupt flag at any point during this call. Interrupts can be handled the entire time this code is running. At no point does RSP not point to the stack.

Anyway, initializing this for a new thread is as simple as generating a stack frame as the above function expects it. Maybe something like
Code: Select all
extern const char start_kernel_task[];
struct task *new_kernel_task(void (*fn)(void *), void *arg)
{
  struct task *r = alloc_task();
  r->stack = alloc_stack(); /* I don't know how you organize your stacks. Make sure you get the top end of the stack here. */
  struct initframe {
    uint64_t r15, r14, r13, r12, rbp, rbx, rip;
  ] *iframe = (struct initframe*)r->stack - 1;
/* save arguments to stack, set initial routine to start_kernel_task */
  iframe->r15 = fn;
  iframe->r14 = arg;
  iframe->rip = start_kernel_task;
  r->stack = iframe;
  set_task_state(r, TASK_RUNNABLE);
  return r;
}
Code: Select all
start_kernel_task:
/* function in r15, argument in r14. Need to set RBP to zero to mark lowest stack frame, and align RSP. */
  xorl %ebp, %ebp
  andq $-16, %rsp
  movq %r14, %rdi
  callq *%r15
/* now call end_task and hope that never returns.  */
  callq end_task
  ud2
I am assuming end_task() is a C routine that does not return. The "ud2" there prevents the prefetcher from running off into uncharted territory after that call. But it has to be a call for stack alignment.

This routine is only capable of switching between tasks in kernel mode. But that is no problem, since even a user task is put into kernel mode when a timer interrupt occurs. I would not save segment registers here. I see no reason to since they should be the same before and after the switch, since you are only changing from one kernel-mode task to another one. Once user tasks get into the mix, you will need to manage CR3. There is a host of ways to do this. Not sure what rdos means with TR; that is only loaded once. However, you will need to update TSS.RSP0 if user tasks get into the mix and you implement one kernel stack per task (as I am going to assume here).

Thanks, this seems like a really good approach! One question though, even though some registers (e.g. RAX) don't need to be saved across function calls, surely they should still be saved in this case? Because if a task is preempted while running and the CPU switches to another task, and then back to the first task, it would want to keep RAX the same as if the switch never happened, right?

nullplan · Post by **nullplan** » Sun Feb 26, 2023 10:08 am

j4cobgarby wrote:Thanks, this seems like a really good approach! One question though, even though some registers (e.g. RAX) don't need to be saved across function calls, surely they should still be saved in this case? Because if a task is preempted while running and the CPU switches to another task, and then back to the first task, it would want to keep RAX the same as if the switch never happened, right?

That should be handled by your interrupt entry and exit code. On interrupt entry, you save all registers, and on interrupt exit, you restore them. Timer interrupt hits, you handle the interrupt, do all the song and dance your hardware needs, then switch stacks to another thread. At some point, you will switch back to the interrupted task, and then return from the interrupt, and in the course of that restore all the registers.

OSDev.org

iretq returning to incorrect location when I context switch

iretq returning to incorrect location when I context switch

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi

Re: iretq returning to incorrect location when I context swi