Mutex Problem (Or Anything Else) in SMP Scheduling

Js2xxx · **Posted:** Sun Apr 30, 2017 1:07 am

Hi everyone. I've made threads run in my OS with round-robin algorithm and with uniprocessor(and it works well). Now I want to make threads run with another CPU. So I've correctly made AP's awake and let it load another thread. It works well for several milliseconds (the first picture: "B" runs on an AP and "I" runs on the BSP, the "!" suggests causing reentrant).
However, after that, everything is broken and the system reset (the second picture).

What's wrong? Is it a problem in mutex? Or any other causes?
Thanks for any help and advice.

Some useful(I think it is) information about my OS:
1.The timer handler(written in asm) and the scheduler(written in C) is shared with all the CPUs.
2.Every CPU has got it's own thread queue(based on array and with proper enqueue and dequeue operation).
3.The CPUs' logical IDs are stored in IA32_TSC_AUX and the scheduler can determine which thread queue to choose.
4.Before calling the scheduler, there's a spinlock and after that the lock will be released so the scheduler is atomic.
5.The thread context is stored in stacks.

max · **Posted:** Fri May 05, 2017 5:45 pm

Hey,

you can't really expect help if you don't post any code or have done any more research on what's going wrong.

My crystal ball says you're causing a triple fault somewhere in your code.

Greets

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

at this level, you cant have mutexes/semaphores. those are high-level features, that you ensure with software assistance in your kernel. cache coherency is not ensured beethwen cores, you need a few thousands of cycles to sync the cache beethwen cores, a mutex-like feature will not magically work.

in my kernel, i organize everything from core 0. that controls what thread will execute what thread. your solution sounds nice, but its too nice to be true, dont try writing spectacularry optimistic code involving kernel funcionality to be placed on the AP-s, becouse in most cases .... uh.. what example i could i illustrate it to an asian... so a dragon will come and rape your lolis.

LtG · **Joined:** Thu Aug 13, 2015 4:57 pm **Posts:** 384

Geri wrote:

at this level, you cant have mutexes/semaphores. those are high-level features, that you ensure with software assistance in your kernel. cache coherency is not ensured beethwen cores, you need a few thousands of cycles to sync the cache beethwen cores, a mutex-like feature will not magically work.

I guess I don't understand what you are talking about at all. Are you saying that x86 doesn't have cache coherency? I mean, isn't cache coherency _only_ about multi-core to begin with? You're saying that either x86 is incoherent or it lacks multi-core cache coherency, but cache coherency is only relevant when there's multiple cores, so..?

Also where does that "few thousand cycles" come from? Any source? Yes, moving cache lines between cores costs, but it would be pretty stupid if it costs more than actually going all the way to RAM, and going to RAM I think is something like 100-300 cycles and cache would be thousands according to you?

Geri wrote:

in my kernel, i organize everything from core 0. that controls what thread will execute what thread. your solution sounds nice, but its too nice to be true, dont try writing spectacularry optimistic code involving kernel funcionality to be placed on the AP-s, becouse in most cases .... uh.. what example i could i illustrate it to an asian... so a dragon will come and rape your lolis.

I plan to do "kernel" functionality on every core, not just BSP and as far as I know there's absolutely no issues doing it that way, and that's what pretty much every "real" OS does too...

You do have to make sure that your kernel functionality doesn't mess with what the other cores are doing but that's just basic multi-threading, if you don't understand multi-threading then you shouldn't be bringing the AP's up until you do understand.

Depending on your system, going to BSP for every kernel call may have significant performance impact..

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Geri wrote:

at this level, you cant have mutexes/semaphores. those are high-level features, that you ensure with software assistance in your kernel. cache coherency is not ensured beethwen cores, you need a few thousands of cycles to sync the cache beethwen cores, a mutex-like feature will not magically work.

That is just bullshit. You need kernel level mutexes/semaphores to implement a SMP-safe kernel. The most common form of mutexes at the kernel level are spinlocks that protect data structures from concurrent modification. This has to be coupled with disabling interrupts if the structures are also accessed by IRQ handlers. More complex kernels will require more complex mutexes that e.g. can block/unblock kernel threads.

dozniak · **Posted:** Sat May 06, 2017 7:22 am

LtG wrote:

Geri wrote:

at this level, you cant have mutexes/semaphores. those are high-level features, that you ensure with software assistance in your kernel. cache coherency is not ensured beethwen cores, you need a few thousands of cycles to sync the cache beethwen cores, a mutex-like feature will not magically work.

I guess I don't understand what you are talking about at all. Are you saying that x86 doesn't have cache coherency? I mean, isn't cache coherency _only_ about multi-core to begin with? You're saying that either x86 is incoherent or it lacks multi-core cache coherency, but cache coherency is only relevant when there's multiple cores, so..?

Don't listen to him, he has absolutely no clue.

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

dozniak:

who didnt yet did smp (or anything), leave the topic.

oh jesus, you are out from the game

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

LtG wrote:

Also where does that "few thousand cycles" come from? Any source?

i benchmarked it

LtG wrote:

I think is something like 100-300 cycles and cache would be thousands according to you?

you thinking. i thinking, and i benchmarking. dont just think. do.

LtG · **Joined:** Thu Aug 13, 2015 4:57 pm **Posts:** 384

Geri wrote:

LtG wrote:

Also where does that "few thousand cycles" come from? Any source?

i benchmarked it

What did you benchmark? I thought you said there's no cache coherency, so what is it that you benchmarked?

Care to show the code used to do the benchmarking? Maybe you accidentally included some other things in the benchmark that radically affect the results.

Here's some info w.r.t. cache coherency:
https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf

dozniak · **Posted:** Sat May 06, 2017 2:25 pm

Geri wrote:

who didnt yet did smp (or anything), leave the topic.

You didn', so why don't you leave the game?

Brendan · **Posted:** Sat May 06, 2017 2:38 pm

Hi,

Geri wrote:

at this level, you cant have mutexes/semaphores. those are high-level features, that you ensure with software assistance in your kernel.

At which level? The original poster said they're using spinlocks (and not mutexes) for their scheduler, and didn't say what mutexes were being used for. With the lack of information available its reasonable to assume that the mutexes are only used at a higher level.

Korona wrote:

You need kernel level mutexes/semaphores to implement a SMP-safe kernel. The most common form of mutexes at the kernel level are spinlocks that protect data structures from concurrent modification.

Most people define "spinlock" as a kind of lock that spins when the lock can't be acquired (that never does a task switch), and "mutex" as a kind of lock that does a task switch when the lock can't be acquired (potentially after spinning briefly). With these definitions a mutex can't be used by the scheduler's task switch code without causing a nesting problem (e.g. when the scheduler can't acquire a mutex needed to switch tasks, the task blocks and the scheduler tries to acquire a mutex to switch tasks, and ...).

Geri wrote:

cache coherency is not ensured beethwen cores, you need a few thousands of cycles to sync the cache beethwen cores, a mutex-like feature will not magically work.

For normal caches; cache coherency is ensured between cores, unless you've done something extremely wrong (misconfigured the MTRRs such that one CPU thinks an area is uncached while a different CPU thinks the same area is cached).

The TLBs are the only caches where coherency (between the TLB on one CPU and the caches on the same CPU, and between the TLB on a CPU and normal caches on any other CPU, and between the TLB on one CPU and the TLBs on another CPU) isn't ensured.

If you have done something extremely wrong (misconfigured the MTRRs) it's impossible to sync the caches. The only things you can do is invalidate specific cache lines, invalidate the entire cache or disable the cache/s.

Geri wrote:

in my kernel, i organize everything from core 0. that controls what thread will execute what thread.

That's extremely poor for scalability - you'd end up with most CPUs wasted most of the time because they're waiting for one CPU to reschedule; plus all CPUs fighting over the same cache lines.

Geri wrote:

your solution sounds nice, but its too nice to be true, dont try writing spectacularry optimistic code involving kernel funcionality to be placed on the AP-s, becouse in most cases ....

AP CPUs using kernel functionality, including AP CPUs doing scheduling independently, is how every sane OS has always done it.

LtG wrote:

Geri wrote:

LtG wrote:

Also where does that "few thousand cycles" come from? Any source?

i benchmarked it

What did you benchmark? I thought you said there's no cache coherency, so what is it that you benchmarked?

As a random guess, I'd assume Geri benchmarked a WBINVD instruction - they're very expensive (but unnecessary for cache coherency).

Cheers,

Brendan

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

Brendan: the problem is that he was not very specific about his problem. his only hint was that the problem was directly with his smp code (thats why he named the topic like that). a lot of thing can go wrong in his code - so i told him hints what is generally his issue. if we dont see any code, i dont think i or we can be more specific. reading his post about his SMP implementation, it is shockingly complex (compared to mine). basically he tried to gain very lot of performance and/or increase the effective clock cycle of his sheduler. he sacraficed the simplicity and the cleanness for a brutal optimization by placing a kernel mode task switching code snippet on the AP-s, making it almost undebuggable. this isnt a question of an used instruction, this basically could and will lead to various logic errors in the handling, and we cant distant-heal it.

Brendan · **Posted:** Sat May 06, 2017 3:38 pm

Hi,

Geri wrote:

Brendan: the problem is that he was not very specific about his problem. his only hint was that the problem was directly with his smp code (thats why he named the topic like that). a lot of thing can go wrong in his code - so i told him hints what is generally his issue. if we dont see any code, i dont think i or we can be more specific.

Yes; and this is why max asked for some code (in the very first reply to this topic), and why everyone (except you) was waiting for Js2xxx to respond to max's request for some code and didn't reply.

Geri wrote:

reading his post about his SMP implementation, it is shockingly complex (compared to mine). basically he tried to gain very lot of performance and/or increase the effective clock cycle of his sheduler. he sacraficed the simplicity and the cleanness for a brutal optimization by placing a kernel mode task switching code snippet on the AP-s, making it almost undebuggable. this isnt a question of an used instruction, this basically could and will lead to various logic errors in the handling, and we cant distant-heal it.

Complexity is relative, and (compared to most multi-CPU schedulers) this scheduler is simple.

When you say it's "too complex" what you're really saying is "simple compared to most multi-CPU schedulers, but complex relative to your scheduler". From this we can infer that your scheduler is "simplistic" compared to the scheduler Js2xxx described and that your scheduler is "extremely simplistic" compared to most multi-CPU schedulers.

Cheers,

Brendan

Js2xxx · **Posted:** Sat May 06, 2017 8:31 pm

I haven't logged in for some days and just now I saw a lot of posts... It scared me... :shock:

Well, someone asked me for the code, so here it is:

Code:

; The whole code of the timer handler: (It works well in single processor)
TimerLock   db   0
MsgReentrant   db   "!", 0
aqtemp      dq   0 ; other variables used in the handler are extern.
TimerHandler:
   sub   rsp, 8
   
   mov   [rel aqtemp], rsp
   push   rax
   push   rcx
   push   rdx
   push   rbx
   push   qword   [rel aqtemp]
   push   rbp
   push   rsi
   push   rdi
   push   r8
   push   r9
   push   r10
   push   r11
   push   r12
   push   r13
   push   r14
   push   r15

   mov   ecx, IA32_X2APIC_LVT_TIMER
   rdmsr
   bts   eax, 16
   wrmsr

   mov   ecx, IA32_X2APIC_EOI
   xor   rax, rax
   xor   rdx, rdx
   wrmsr

   call   GetPcsrId
   movzx   rax, ax
   lea   rbx, [rel ReentrantFlag]
   inc   dword   [rbx + rax * 4]
   cmp   dword [rbx + rax * 4], 0
   ja   .reentrant

   mov   rsp, 0xFFFFFF8000590000

   call   SetTempTss

   mov   ax, 0x11 << 3
   ltr   ax

   sti

   mov   rdi, TimerLock
   call   EnterSpinLock

   call   GetPcsrId
   mov   rdi, rax
   call   Schedule

   cli

   mov   ecx, IA32_X2APIC_LVT_TIMER
   rdmsr
   btr   eax, 16
   wrmsr

   mov   rax, [rel CurrentT]
   mov   rsp, [rax]

   lea   rax, [rsp + 22 * 8]
   mov   rbx, 0xFFFFFF8000601800
   mov   qword [rbx + 4], rax

   call   GetPcsrId
   
   mov   rdi, TimerLock
   call   LeaveSpinLock

   shl   ax, 3
   mov   ax, 0x9 << 3
   ltr   ax

   jmp   .next

.reentrant:
   mov   rdi, MsgReentrant
   call   DispStr
.next:
   call   GetPcsrId
   movzx   rax, ax
   lea   rbx, [rel ReentrantFlag]
   dec   dword   [rbx + rax * 4]

   pop   r15
   pop   r14
   pop   r13
   pop   r12
   pop   r11
   pop   r10
   pop   r9
   pop   r8
   pop   rdi
   pop   rsi
   pop   rbp
   add   rsp, 8
   pop   rbx
   pop   rdx
   pop   rcx
   pop   rax
   add   rsp, 8

   iretq

And the main code of the scheduler is here:

Code:

public void Schedule(int pcsr)
{
      CurrentT = &(ReadyPtr[pcsr]);
      (*CurrentT)->Ticks--;
      if((*CurrentT)->Ticks <= 0)
      {
            (*CurrentT)->Ticks = (*CurrentT)->Priority;
            Switch(pcsr);
      }
      ReadyPtr[pcsr] = &(ReadyThreads[pcsr].Data[ReadyThreads[pcsr].Front]);
      CurrentT = &(ReadyPtr[pcsr]);
      (*CurrentT)->State = Running;
      TssDesc[pcsr]->AttrLow = TSS | DESC_P;
}

I believe there're a lot of mistakes but I can't find them. Thanks for any help and advice.

Brendan · **Posted:** Sat May 06, 2017 8:53 pm

Hi,

I see multiple problems in the timer IRQ handler's code that could/will cause problems when different CPUs are handling timer IRQs at the same time. Specifically, "aqtemp" and the stack at 0xFFFFFF8000590000 are shared state that would need a lock to prevent one CPU from trashing another CPU's data.

I'd also suggest commenting the code properly, so that you will be able to maintain it later on (after you've forgotten why you've done things), and so that other people can figure out why you're doing horrific things like "mov rsp, 0xFFFFFF8000590000".

Finally, I'd say that the entire scheduler is dependant on timer IRQs and completely useless for tasks blocking/unblocking, and needs to be redesigned and rewritten because of that. Note: You might like the "steps for writing a scheduler" I wrote at the end of this post.

Cheers,

Brendan

OSDev.org

Mutex Problem (Or Anything Else) in SMP Scheduling

Who is online