Inconsistent faults when switching to and from user task

tabz · **Posted:** Wed Mar 20, 2019 1:30 pm

I've been working (a.k.a. banging my head against a brick wall) on adding x86 user mode to my kernel for the past couple of weeks. I designed the task switching protocol and stack layouts, and everything makes sense, but I keep getting inconsistent faults some time after switching to a user task. Switching between kernel tasks works perfectly, so I'm baffled as to why switching to user tasks causes faults.

My task switching protocol is as follows:

1. On a timer interrupt, get the next process as per some scheduling policy.
2. If it's a kernel task, call arch_switch_to_kernel_task (source attached); otherwise call arch_switch_to_user_task.

arch_switch_to_kernel_task:
* Save the current CPU state into the process' saved kernel state.
* Restore the next process' CPU state from its saved kernel state.
* Return to the address stored at the top of the kernel stack, effectively continuing from where the task was halted.

arch_switch_to_user_task:
* Save the current CPU state into the process' saved kernel state.
* Restore the next process' CPU state from its saved kernel state.
* Return to the address stored at the top of the kernel stack, which eventually ends up in the IRQ handler. The handler pops the saved user state and enters user mode with an iret.

To facilitate this protocol, each task is initialised:

kernel task:
    the registers in the saved kernel state are set to 0
    stack:
        top: exit function address
        top - 4: entry function address <-- kernel_state.esp

user task:
    the registers in both the kernel state and user state are set to 0
    kernel stack:
        top: exit function address
        top - 4: initial user cpu state
        top - 4 - sizeof(arch_cpu_state_t): Unused eax value
        top - 8 - sizeof(arch_cpu_state_t): Return address <-- kernel_state.esp
    user stack:
        top: exit function address <-- user_state.useresp

Sometimes I get a page fault at c0104450 in:

Code:

c010444a <linkedlist_size>:
c010444a:       55                      push   %ebp
c010444b:       89 e5                   mov    %esp,%ebp
c010444d:       8b 45 08                mov    0x8(%ebp),%eax
c0104450:       8b 40 08                mov    0x8(%eax),%eax
c0104453:       5d                      pop    %ebp
c0104454:       c3                      ret

with the CPU state being

Code:

[ERROR] PANIC @ src/arch/x86/idt/exceptions.c:38: Page fault
   cs=0x1B   ss=0x23   gs=0x23   fs=0x23   es=0x23   ds=0x23
   ebp=0xC052D127   esp=0xC052D26F
   edi=0x0   esi=0x0   ebx=0x0   edx=0x0   ecx=0x0   eax=0x13FEAC4
   int=0xE      err=0x4
   eip=0xC0104450   ef=0x286   uesp=0xC052D127
   cr0=0x80000011   cr2=0x13FEACC   cr3=0x400000

Then sometimes it fails at c01095c4 in:

Code:

c01095a1 <arch_switch_to_kernel_task>:
c01095a1:       fa                      cli    
c01095a2:       57                      push   %edi
c01095a3:       50                      push   %eax
c01095a4:       8b 7c 24 0c             mov    0xc(%esp),%edi
c01095a8:       8b 44 24 10             mov    0x10(%esp),%eax
c01095ac:       8f 47 2c                popl   0x2c(%edi)
c01095af:       8f 47 10                popl   0x10(%edi)
c01095b2:       89 77 14                mov    %esi,0x14(%edi)
c01095b5:       89 6f 18                mov    %ebp,0x18(%edi)
c01095b8:       89 67 1c                mov    %esp,0x1c(%edi)
c01095bb:       89 5f 20                mov    %ebx,0x20(%edi)
c01095be:       89 57 24                mov    %edx,0x24(%edi)
c01095c1:       89 4f 28                mov    %ecx,0x28(%edi)
c01095c4:       8b 78 10                mov    0x10(%eax),%edi
c01095c7:       8b 70 14                mov    0x14(%eax),%esi
c01095ca:       8b 68 18                mov    0x18(%eax),%ebp
c01095cd:       8b 60 1c                mov    0x1c(%eax),%esp
c01095d0:       8b 58 20                mov    0x20(%eax),%ebx
c01095d3:       8b 50 24                mov    0x24(%eax),%edx
c01095d6:       8b 48 28                mov    0x28(%eax),%ecx
c01095d9:       8b 40 2c                mov    0x2c(%eax),%eax
c01095dc:       89 25 04 dc 10 c0       mov    %esp,0xc010dc04
c01095e2:       fb                      sti    
c01095e3:       c3                      ret

with the CPU state being

Code:

[ERROR] PANIC @ src/arch/x86/idt/exceptions.c:38: Page fault
   cs=0x8   ss=0xC052D603   gs=0x10   fs=0x10   es=0x10   ds=0x10
   ebp=0xC052CE53   esp=0xC052CE03
   edi=0xC052D603   esi=0x0   ebx=0x0   edx=0x3FDC0   ecx=0x1   eax=0x3FDC0
   int=0xE      err=0x0
   eip=0xC01095C4   ef=0x82   uesp=0xC0104F7B
   cr0=0x80000011   cr2=0x3FDD0   cr3=0x400000
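
The `err` values in the two dumps can be read with the standard x86 page-fault error code layout (bit 0 = present, bit 1 = write, bit 2 = user). The following is only a minimal sketch; the `pf_error_t` type and `decode_pf_error` function are illustrative names, not part of the kernel's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative decoding of the low three bits of the x86 page-fault error code. */
typedef struct {
    bool present;   /* bit 0: set = protection violation, clear = page not present */
    bool write;     /* bit 1: set = write access, clear = read access */
    bool user;      /* bit 2: set = fault raised while the CPU was in user mode */
} pf_error_t;

pf_error_t decode_pf_error(uint32_t err)
{
    pf_error_t e;
    e.present = (err & 0x1u) != 0;
    e.write   = (err & 0x2u) != 0;
    e.user    = (err & 0x4u) != 0;
    return e;
}
```

Under this decoding, err=0x4 in the first dump is a user-mode read of a non-present page (consistent with cs=0x1B), and err=0x0 in the second is a supervisor-mode read of a non-present page.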

This always happens after a seemingly arbitrary number of task switches and it all works perfectly if I make the "cleaner" task a kernel task rather than a user one.

I have a feeling that it could be to do with my setting of ebp/esp, or me missing a part of the switching protocol, but I can't for the life of me figure it out. Does anybody have any idea?

Source for reference:
* Boot file
* Where the switching happens
* Where a process is initialised
* IRQ and ISR handling
* The scheduler
* CPU state definition

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

Without looking further at your code: In both examples the faulting instruction dereferenced EAX, and EAX did not contain a kernel pointer (I guess all your kernel pointers start with a C, right?). So therefore, either the state was corrupted in memory, or EAX was changed before it could be used. Since your interrupt handling seems solid enough at first glance, I guess it's the former.
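
This observation can be checked mechanically against the two register dumps in the first post. The sketch below assumes the kernel occupies addresses from 0xC0000000 upward; `KERNEL_BASE` and both helper names are hypothetical and only serve the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: every kernel pointer lies at or above 0xC0000000. */
#define KERNEL_BASE 0xC0000000u

/* Hypothetical helper: true if addr falls inside the kernel's address range. */
bool is_kernel_pointer(uint32_t addr)
{
    return addr >= KERNEL_BASE;
}

/* True if cr2 is the address that `mov disp(%base), ...` would have accessed. */
bool fault_matches_base(uint32_t base, uint32_t disp, uint32_t cr2)
{
    return base + disp == cr2;
}
```

For the first dump, `fault_matches_base(0x13FEAC4, 0x8, 0x13FEACC)` holds and `is_kernel_pointer(0x13FEAC4)` does not; for the second, the same is true of eax=0x3FDC0 with displacement 0x10 and cr2=0x3FDD0. In both cases the faulting access is exactly the dereference of a non-kernel EAX.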

tabz · **Posted:** Thu Mar 21, 2019 4:44 am

nullplan wrote:

Without looking further at your code: In both examples the faulting instruction dereferenced EAX, and EAX did not contain a kernel pointer (I guess all your kernel pointers start with a C, right?).

Yeah that's right, all kernel symbols and heap are in the upper 1GB of memory.

nullplan wrote:

So therefore, either the state was corrupted in memory, or EAX was changed before it could be used. Since your interrupt handling seems solid enough at first glance, I guess it's the former.

Your point about memory state corruption seems the most likely and is made more likely by another fault that I couldn't reproduce just before posting this thread, but has happened again now:

It sometimes triggers an invalid opcode fault when a task does a ret and jumps to an address in the heap, which makes sense since the heap doesn't contain instructions. Below is the output of "qemu -d int,in_asm":

Code:

Servicing hardware INT=0x20
     8: v=20 e=0000 i=0 cpl=0 IP=0008:c0103172 pc=c0103172 SP=0010:c0111e88 env->regs[R_EAX]=00020000
EAX=00020000 EBX=00100000 ECX=00000001 EDX=c010a0d1
ESI=00111f94 EDI=00000000 EBP=c0111eb0 ESP=c0111e88
EIP=c0103172 EFL=00200206 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c010dc00 0000dc68 00008900 DPL=0 TSS32-avl
GDT=     c010dbc0 0000002f
IDT=     c010d3c0 000007ff
CR0=80000011 CR2=00000000 CR3=00400000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000 
DR6=ffff0ff0 DR7=00000400
CCS=00000004 CCD=c0111e7c CCO=EFLAGS  
EFER=0000000000000000
[DEBUG] Switching to user task cleaner
----------------
IN: 
0xc0525c46:  (bad)  0x52(%edi)

check_exception old: 0xffffffff new 0x6
     9: v=06 e=0000 i=0 cpl=3 IP=001b:c0525c46 pc=c0525c46 SP=0023:c052614f env->regs[R_EAX]=00000000
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000000
ESI=00000000 EDI=00000000 EBP=00000002 ESP=c052614f
EIP=c0525c46 EFL=00000286 [--S--P-] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0023 00000000 ffffffff 00cff300 DPL=3 DS   [-WA]
CS =001b 00000000 ffffffff 00cffa00 DPL=3 CS32 [-R-]
SS =0023 00000000 ffffffff 00cff300 DPL=3 DS   [-WA]
DS =0023 00000000 ffffffff 00cff300 DPL=3 DS   [-WA]
FS =0023 00000000 ffffffff 00cff300 DPL=3 DS   [-WA]
GS =0023 00000000 ffffffff 00cff300 DPL=3 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0028 c010dc00 0000dc68 00008900 DPL=0 TSS32-avl
GDT=     c010dbc0 0000002f
IDT=     c010d3c0 000007ff
CR0=80000011 CR2=00000000 CR3=00400000 CR4=00000010
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000 
DR6=ffff0ff0 DR7=00000400
CCS=00000010 CCD=c052613f CCO=ADDL    
EFER=0000000000000000

As you can see, a kernel task is interrupted by the timer so it switches to the user task which eventually does a ret into the heap, at which point the CPU sees a stream of arbitrary instructions and faults on the one that is invalid. This screams stack corruption to me but I can't see where I've gone wrong.

linuxyne · **Joined:** Sat Jul 02, 2016 7:02 am **Posts:** 210

If the stack size is 1024 bytes (or 256 4-byte entries), ARCH_INIT_PROCESS_STATE's writing to the stack[256] entry is generally considered as a buffer overflow.

ARCH_INIT_PROCESS_STATE sets esp to entry 251, with the intention of storing 6 (ebp, edi, esi, ebx, entry, exit) entries beginning at 251, but it can only store 5 entries (251 to 255).

Unless the above is deliberately as designed, it could be a potential cause.
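
To make the off-by-one concrete, here is a minimal sketch of computing the initial stack index so that the whole frame fits below the end of the buffer. The function names and the flat array of 32-bit entries are assumptions for illustration, not the kernel's actual initialisation code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lowest index of a frame of frame_entries words that ends at the last valid
 * entry of a stack with stack_entries words (valid indices 0..stack_entries-1). */
size_t initial_sp_index(size_t stack_entries, size_t frame_entries)
{
    assert(frame_entries <= stack_entries);
    return stack_entries - frame_entries;
}

/* Hypothetical initialiser: copies the initial frame into the top of the stack
 * and returns the index that the saved esp should point at. */
size_t init_stack(uint32_t *stack, size_t stack_entries,
                  const uint32_t *frame, size_t frame_entries)
{
    size_t sp = initial_sp_index(stack_entries, frame_entries);
    for (size_t i = 0; i < frame_entries; i++) {
        /* The highest write is stack[stack_entries - 1], never stack[stack_entries]. */
        stack[sp + i] = frame[i];
    }
    return sp;
}
```

With 256 entries and a 6-entry frame this yields index 250, so the frame occupies entries 250 to 255; starting at 251 leaves room for only 5.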

tabz · **Posted:** Thu Mar 21, 2019 7:47 am

linuxyne wrote:

If the stack size is 1024 bytes (or 256 4-byte entries), ARCH_INIT_PROCESS_STATE's writing to the stack[256] entry is generally considered as a buffer overflow.

ARCH_INIT_PROCESS_STATE sets esp to entry 251, with the intention of storing 6 (ebp, edi, esi, ebx, entry, exit) entries beginning at 251, but it can only store 5 entries (251 to 255).

Unless the above is deliberately as designed, it could be a potential cause.

That is a good observation and isn't how it was designed, so I will look into that and report back. It could certainly explain the stack corruption.

Edit: I think you may be looking at some old code since I removed the ARCH_INIT_PROCESS_STATE macro and replaced it with the arch_init_process_state function. Your observation with buffer overflow still applies to the new code though.

linuxyne · **Joined:** Sat Jul 02, 2016 7:02 am **Posts:** 210

tabz wrote:

I think you may be looking at some old code since I removed the ARCH_INIT_PROCESS_STATE macro and replaced it with the arch_init_process_state function.

True. I went where the Github search on the name took me, and that, I now realize, was a commit on 16th Feb 2019. The code search on github seems to be quite limited in power.

tabz · **Posted:** Thu Mar 21, 2019 11:15 am

linuxyne wrote:

tabz wrote:

I think you may be looking at some old code since I removed the ARCH_INIT_PROCESS_STATE macro and replaced it with the arch_init_process_state function.

True. I went where the Github search on the name took me, and that, I now realize, was a commit on 16th Feb 2019. The code search on github seems to be quite limited in power.

Yeah you're right! I don't like it either.

Addressing your observations has gotten rid of the first two issues that I reported (thanks!) but the issue where the code jumps into the heap is still occurring. Below is an excerpt from GDB that I got when breaking at the isr14 handler. Does it reveal anything helpful (perhaps the "corrupt stack?" message :p)?

Code:

Breakpoint 1, isr14 () at src/arch/x86/idt/idt_asm.s:41
41   ISR_ERR   14
(gdb) bt
#0  isr14 () at src/arch/x86/idt/idt_asm.s:41
#1  0x00000004 in ?? ()
warning: (Internal error: pc 0xc0526791 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0xc0526790 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0xc0526790 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0xc0526790 in read in psymtab, but not in symtab.)

#2  0xc0526791 in ?? () at src/arch/x86/multitasking/multitasking.s:81
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

linuxyne · **Joined:** Sat Jul 02, 2016 7:02 am **Posts:** 210

tabz wrote:

Below is an excerpt from GDB I got when breaking at the isr14 handler, does it reveal anything helpful (perhaps the "corrupt stack?" message :p)?

No.

But, I believe the arch_switch_to_*_task functions are causing stack underflow by forcing the cpu to set the esp progressively lower and lower every time an interrupt/exception arrives.

tss.esp0 is set when switching to a task, and the CPU reads it back at the time of an interrupt/exception in order to push parameters onto that stack. But the value written to tss.esp0 comes from the esp value the same task had when it was switched out a moment ago. Although the stack unwinds properly because of the return mechanisms, the unwind (or part of it) is not taken into account when tss.esp0 is set.

I hope my words above made some kind of sense about the circular nature of the processing.

This should be easy to verify by progressively increasing the cleaner's stacks from 1 KB to, say, 512 KB in chunks of 1 KB, and then seeing whether the failure takes longer and longer to hit. (Edit: Or, since you have a working debugger setup, you can put a conditional break-print-and-go on the instructions which store esp into (tss+4) to see whether the value settles for each thread or runs into a progression.)

The older output shows that the stacks become misaligned. Does kmalloc, etc. return aligned buffers?

tabz · **Posted:** Fri Mar 22, 2019 8:20 am

linuxyne wrote:

tabz wrote:

Below is an excerpt from GDB I got when breaking at the isr14 handler, does it reveal anything helpful (perhaps the "corrupt stack?" message :p)?

No.

But, I believe the arch_switch_to_*_task functions are causing stack underflow by forcing the cpu to set the esp progressively lower and lower every time an interrupt/exception arrives.

[...]

This should be easy to verify by progressively increasing the cleaner's stacks from 1 KB to, say, 512 KB in chunks of 1 KB, and then seeing whether the failure takes longer and longer to hit. (Edit: Or, since you have a working debugger setup, you can put a conditional break-print-and-go on the instructions which store esp into (tss+4) to see whether the value settles for each thread or runs into a progression.)

You're precisely correct. I kept doubling both the user stack and kernel stack and it went on longer and longer before hitting the page fault. Well spotted!

linuxyne wrote:

The older output shows that the stacks become misaligned. Does kmalloc, etc. return aligned buffers?

kmalloc doesn't return aligned buffers but kmalloc_a does. The only place I remember using kmalloc_a was in my paging code.

linuxyne wrote:

tss.esp0 is set when switching to a task, and the CPU reads it back at the time of an interrupt/exception in order to push parameters onto that stack. But the value written to tss.esp0 comes from the esp value the same task had when it was switched out a moment ago. Although the stack unwinds properly because of the return mechanisms, the unwind (or part of it) is not taken into account when tss.esp0 is set.

I hope my words above made some kind of sense about the circular nature of the processing.

I think I understand. Which part of the stack is not taken into account? Is it the parameters passed to arch_switch_to_*_task? If it won't take too much time, I think an ascii graphic/example would be of help

linuxyne · **Joined:** Sat Jul 02, 2016 7:02 am **Posts:** 210

tabz wrote:

kmalloc doesn't return aligned buffers

If kmalloc does not return an aligned buffer, then cleaner's user and kernel stacks are likely misaligned, although I can't recall at the moment what restrictions x86 places on esp, but that should be easy to find.

tabz wrote:

I think an ascii graphic/example would be of help

Some diagrams below. To see whether the situation described below occurs, a breakpoint can be placed at the 'mov %esp, (tss + 4)' instructions inside arch_switch_to_user_task (for now) to see the %esp value being stored. Since the tasks should preserve the stack, and since they perform (or can be made to perform) nothing more than an idle, tight loop, the esp value must settle to some constant (one constant for each of the tasks).

Code:

arch_switch_to_user_task  <----- (A)
switch_to_next
on_tick
irq
init

(A)
        1. Store esp into init.esp
        2. Load esp from cleaner.esp
        3. Store esp into (tss+4). Denote a pointer to this stack location by V.
        4. Unwind and go to user mode

arch_switch_to_kernel_task  <----- (C)
switch_to_next
on_tick
irq                       <----- (B)
cleaner

(B)
        1. Load esp from (tss+4). Loads the value from A.3
        2. Cpu pushes parameters. V -= 0x10 (or some appropriate decrement).
        3. Further calls on_tick, and others, cause V -= 0x100 (for e.g.)

(C)
        1. Store esp into cleaner.esp. The value that gets stored here is current V.
        2. Load esp from init.esp
        3. Store esp into (tss+4).
        4. Unwind and go to init task.

Now we can see that situation (C) loops back to situation (A), and V continues to decrement.
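
The loop (A) -> (B) -> (C) can also be modelled as a toy simulation. This is only a sketch under stated assumptions: the 20-byte figure is what the CPU pushes on a ring 3 to ring 0 interrupt (ss, esp, eflags, cs, eip), the 0x100 handler depth is the example figure from the diagram, and `simulate_esp0` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes pushed by the CPU on a ring 3 -> ring 0 interrupt: ss, esp, eflags, cs, eip. */
#define IRET_FRAME_BYTES 20u
/* Example depth of the irq -> on_tick -> switch_to_next call chain (illustrative). */
#define HANDLER_BYTES    0x100u

/* Returns the value written to tss.esp0 on the last switch into the user task.
 * reset_to_top = false models the behaviour described above (tss.esp0 taken from
 * the saved esp); reset_to_top = true models always using the stack top. */
uint32_t simulate_esp0(uint32_t stack_top, uint32_t initial_saved_esp,
                       unsigned rounds, bool reset_to_top)
{
    uint32_t saved_esp = initial_saved_esp;
    uint32_t tss_esp0 = 0;

    for (unsigned i = 0; i < rounds; i++) {
        /* (A) Switch to the user task: load its esp and publish it in the TSS. */
        tss_esp0 = reset_to_top ? stack_top : saved_esp;

        /* (B) Interrupt in user mode: the CPU starts pushing at tss.esp0. */
        uint32_t esp = tss_esp0;
        esp -= IRET_FRAME_BYTES;
        esp -= HANDLER_BYTES;

        /* (C) Switch away: the deeper esp becomes the task's saved esp. */
        saved_esp = esp;
    }
    return tss_esp0;
}
```

In this model the published value drops by 0x114 bytes per round when it is taken from the saved esp, and stays fixed when it is reset to the stack top, which matches the "settles to a constant or runs into a progression" test.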

tabz · **Posted:** Fri Mar 22, 2019 9:58 am

linuxyne wrote:

tabz wrote:

kmalloc doesn't return aligned buffers

If kmalloc does not return an aligned buffer, then cleaner's user and kernel stacks are likely misaligned, although I can't recall at the moment what restrictions x86 places on esp, but that should be easy to find.

I'll investigate, experiment and report back.

linuxyne wrote:

tabz wrote:

I think an ascii graphic/example would be of help

Some diagrams below. To see whether the situation described below occurs, a breakpoint can be placed at the 'mov %esp, (tss + 4)' instructions inside arch_switch_to_user_task (for now) to see the %esp value being stored. Since the tasks should preserve the stack, and since they perform (or can be made to perform) nothing more than an idle, tight loop, the esp value must settle to some constant (one constant for each of the tasks).

Code:

arch_switch_to_user_task  <----- (A)
switch_to_next
on_tick
irq
init

(A)
        1. Store esp into init.esp
        2. Load esp from cleaner.esp
        3. Store esp into (tss+4). Denote a pointer to this stack location by V.
        4. Unwind and go to user mode

arch_switch_to_kernel_task  <----- (C)
switch_to_next
on_tick
irq                       <----- (B)
cleaner

(B)
        1. Load esp from (tss+4). Loads the value from A.3
        2. Cpu pushes parameters. V -= 0x10 (or some appropriate decrement).
        3. Further calls on_tick, and others, cause V -= 0x100 (for e.g.)

(C)
        1. Store esp into cleaner.esp. The value that gets stored here is current V.
        2. Load esp from init.esp
        3. Store esp into (tss+4).
        4. Unwind and go to init task.

Now we can see that situation (C) loops back to situation (A), and V continues to decrement.

Ah thank you for that, it makes a lot of sense. Using your suggestion to set a break point does reveal that it never reaches a constant and keeps decrementing. To address this, would you suggest setting TSS.esp0 to some predefined constant or setting it in some other part of the switching tree?

linuxyne · **Joined:** Sat Jul 02, 2016 7:02 am **Posts:** 210

tabz wrote:

Using your suggestion to set a break point does reveal that it never reaches a constant and keeps decrementing.

If so, once the %esp goes below the lower address of the kernel stack, you have the proof of the corruption.

tabz wrote:

To address this, would you suggest setting TSS.esp0 to some predefined constant or setting it in some other part of the switching tree?

The value set in tss.esp0 depends on the reason the control needs to go out of the kernel mode to the user mode or to a different (interrupt/exception) context.

For instance, it is possible for a thread to interweave its activity in both the modes. In such a case, a backtrace might look like below:

Code:

umode_func13  <----- control is here at the moment.
kmode_func06
umode_func02
umode_func84
kmode_func61
umode_func47
kmode_func06

Each kmode_func that decides to relinquish control to a umode_func must set tss.esp0 to a value that both preserves the currently active kernel mode stack frames and allows further calls into kernel mode (through syscalls/interrupts/exceptions) as long as there's stack space.

In jaq, it can help to separate the two concepts:
(1) switching between threads, and
(2) transitioning between the modes/contexts for a single thread.

That separation then moves the responsibility of determining the appropriate value for tss.esp0 into part (2).

Part (1) does not need to worry about tss.esp0 in particular, beyond copying the value from the incoming thread's state into cpu.tss.esp0 as part of the context switch.

In jaq, if we always return to usermode after completely unwinding the kernel stack, then setting tss.esp0 to the base of the kernel stack looks sufficient to me. That means committing to the fact that as long as the task is running in the usermode, its kernel stack does not contain any active frames (it may continue to hold state, etc. which lie beyond the base as we define it).

There might be other ways to handle this situation. For instance, linux has a per-cpu entry stack to which tss.esp0/sysenter_esp points, and the interrupt/exception handling switches back and forth between the entry stack and the task's kernel stack.
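
As a rough sketch of the "base of the kernel stack" option: on a downward-growing stack the base is the highest address, i.e. where esp sits when the stack is empty. The types and function names below are hypothetical, and the TSS structure shows only its first three fields (esp0 sits at offset 4, matching the stores to (tss+4) discussed earlier).

```c
#include <stdint.h>

/* First fields of the 32-bit TSS; the real structure continues after ss0. */
typedef struct {
    uint32_t prev_task_link;   /* offset 0 */
    uint32_t esp0;             /* offset 4: stack pointer loaded on entry to ring 0 */
    uint32_t ss0;              /* offset 8: stack segment loaded on entry to ring 0 */
} tss_head_t;

/* Hypothetical per-task bookkeeping for the kernel stack. */
typedef struct {
    uint32_t kernel_stack_low;    /* lowest address of the kernel stack buffer */
    uint32_t kernel_stack_size;   /* size of the buffer in bytes */
} task_t;

/* Address just past the buffer: where esp points when the stack is empty. */
uint32_t kernel_stack_top(const task_t *task)
{
    return task->kernel_stack_low + task->kernel_stack_size;
}

/* Called when switching to a task that will return to user mode. Assumes the
 * kernel stack is fully unwound whenever the task runs in user mode. */
void tss_prepare_for_user(tss_head_t *tss, const task_t *next)
{
    tss->esp0 = kernel_stack_top(next);
}
```

Because the value depends only on where the stack was allocated, it is the same on every switch, no matter how deep the kernel stack was when the task was last switched out.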

tabz · **Posted:** Mon Mar 25, 2019 3:04 pm

linuxyne wrote:

tabz wrote:

Using your suggestion to set a break point does reveal that it never reaches a constant and keeps decrementing.

If so, once the %esp goes below the lower address of the kernel stack, you have the proof of the corruption.

tabz wrote:

To address this, would you suggest setting TSS.esp0 to some predefined constant or setting it in some other part of the switching tree?

The value set in tss.esp0 depends on the reason the control needs to go out of the kernel mode to the user mode or to a different (interrupt/exception) context.

For instance, it is possible for a thread to interweave its activity in both the modes. In such a case, a backtrace might look like below:

Code:

umode_func13  <----- control is here at the moment.
kmode_func06
umode_func02
umode_func84
kmode_func61
umode_func47
kmode_func06

Each kmode_func that decides to relinquish control to a umode_func must set tss.esp0 to a value that both preserves the currently active kernel mode stack frames and allows further calls into kernel mode (through syscalls/interrupts/exceptions) as long as there's stack space.

In jaq, it can help to separate the two concepts:
(1) switching between threads, and
(2) transitioning between the modes/contexts for a single thread.

That separation then moves the responsibility of determining the appropriate value for tss.esp0 into part (2).

Part (1) does not need to worry about tss.esp0 in particular, beyond copying the value from the incoming thread's state into cpu.tss.esp0 as part of the context switch.

In jaq, if we always return to usermode after completely unwinding the kernel stack, then setting tss.esp0 to the base of the kernel stack looks sufficient to me. That means committing to the fact that as long as the task is running in the usermode, its kernel stack does not contain any active frames (it may continue to hold state, etc. which lie beyond the base as we define it).

I can't believe I didn't think of that! Setting tss.esp0 to the base of the kernel stack worked perfectly. Thank you for your help. I learnt a thing or two from this.

OSDev.org
