Atomicity

KemyLand · Post by **KemyLand** » Tue Sep 09, 2014 6:35 pm

Hi!

A few days ago I posted a question(s) about Memory Management. (which led to a _BIG_ discussion

). I've finally managed to provide basic C code for the VFS, KERNEL, MM, LIB, SYSCALL, ARCH (only i386) submodules of the kernel, with a few drivers compiled inside the kernel. Now I want to go by a modular, multitasking model by starting implementing the Process/Driver manager. My problem is with atomicity.

My #include <lib/def.h> file contains all the kernel's basic defs and REALLY most common utilities, such as UInt32, NULL, or Size. Here, Atomic() is a macro which I should use to make a specific section of code atomic

It's code is:

Code: Select all

/* ArchImplement is a previous #define that marks a symbol as defined by arch-specific code
Note: In my kernel every function used by, and only by a macro internally is preceded by a '_'
*/
ArchImplement void _AtomicEnter();
ArchImplement void _AtomicExit();

/* Make a specific part of code completely atomic and thread-safe 
It's hackish (for supporting commas) I know, but it works well :)
The _NE version just doesn't exits the atomicity by itself. This is done
by the Atomic(...) extension if necessary.
*/
#define Atomic_NE(...) AtomicEnter(); __VA_ARGS__;
#define Atomic(...) Atomic_NE(__VA_ARGS__); _AtomicExit();

Now let's see for example my PMContextSwitch() function (in <proc/task.c>):

Code: Select all

/* Arch-independent context switch code. It must be able to change
all machine-specific stuff and process all hackish operations needed
to set user/kernel-space, virtual memory, registers and other things
specific as per process/thread.
volatile void PMContextSwitch(Bool autoSelect, PID pid)
{
	/* If autoselect == true, ignore pid and continue the chain. If false, switch to pid */
	Process *new = autoselect ? _runningProcess->next : PMGetProcess(pid);
	
	/* ContextSwitch() is an ArchImplement-ish function which (obviously) changes
	to another context. ContextSwitch() is responsible for disabling the atomicity as
	its final act before the actual far jump. If it fails, a Kernel Panic will ocurr. (An
	simple exception message here isn't enough to know how torepair the damage made).
	*/
	Atomic_NE(
		ContextSwitch(new);
	);
}

As you can see, the atomicity here is only a prevention and/or optimization, but it's still an example. Returning to the Atomicity problem, how can I implement _AtomicEnter/Exit(), in x86 for example? I DON'T want to use spinlocks or semaphores, as I think they're too

and unneeded. There MUST be a way to do this, probabling doing a CLI, but from then on, my mind is just blank.

FallenAvatar · Post by **FallenAvatar** » Tue Sep 09, 2014 11:37 pm

KemyLand wrote:... I DON'T want to use spinlocks or semaphores, as I think they're too and unneeded.

Explain.

- Monk

P.S. They are used "de facto" for many reasons, that may or may not be because it is the best option. Please explain why you do NOT want to use those techniques, and more importantly what you are trying to accomplish.

P.P.S. If your answer is, "I just want Atomicity" see the spinlock/semaphore wiki articles available here and through Google and other sources.

Brendan · Post by **Brendan** » Tue Sep 09, 2014 11:43 pm

Hi,

Some things about stuff:

a) The general idea is to ensure that different contexts (CPUs, threads, whatever) aren't able to access the same data in ways that conflict and may cause problems.

b) For acceptable performance/scalability, you also need to ensure that different contexts are able to modify different pieces of data at the same time (where there is no conflict). For example, if one thread is modifying one thing, then you do not want all other threads to be unable to access/modify completely unrelated things.

c) With something like "_AtomicEnter()", there's no way to know what data is involved, and therefore no way to ensure that different contexts are able to modify different pieces of data at the same time, and no way to get acceptable performance/scalability.

d) For acceptable performance/scalability, you need something that's associated with the data. This "something" is some kind of lock.

e) Some data does not need any lock. This includes:

Data that is never modified
Data that is only ever accessed (read or modified) from a single context
Data that does any combination of the above at different times (e.g. data can be only ever accessed from a single context for a while, then be transferred to a different context where it's only ever accessed from that context for a while, then become "read only" and shared by many contexts for a while; and still wouldn't need a lock).

It is possible to design a system such that almost all data meets these criteria, where almost no locks are needed.

f) For avoiding deadlocks; you need to establish some kind of "global lock order" and ensure that locks are always acquired in the correct order. For example, if a function uses 2 different pieces of data and needs to acquire 2 different locks, then the order that the function acquires those locks must be correct. This means that for a function like "void myFunction ( myStruct *first, myStruct *second );" you may need to determine the order at run-time (e.g. maybe "if(first < second) acquire_lock(first->lock); acquire_lock(second->lock); } else { acquire_lock(second->lock); acquire_lock(first->lock); }").

Cheers,

Brendan

KemyLand · Post by **KemyLand** » Wed Sep 10, 2014 1:57 am

Thanks for the replies!
Now I understand better the concept. I (stupidly) though you would block the entire system

But, which would be a better version of _AtomicEnter()? (I don't wan't a function as per resource context, say Process Table, VFS Hierarchy)

KemyLand · Post by **KemyLand** » Wed Sep 10, 2014 2:20 am

Do you think it will be okay to rewrite Atomic/_NE() as follows?:

Code: Select all

#define Atomic_NE(res, ...) _AtomicEnter(res); __VA_ARGS__;
#define Atomic(res, ...) Atomic_NE(res, __VA_ARGS__); _AtomicExit(res);

But, if I think of this again, there's no need for a platform-dependent _Atomic*(), so this should make the thing:

Code: Select all

// In <lib/spinlock.h>

typedef struct
{
    PID pid;
    Size tid; // Thread ID
    volatile Bool lock; // "GCC, please don't think in single-threading when compiling me. Thanks!"
} Spinlock;

void spinlockacquire(Spinlock *lock);
void spinlockrelease(Spinlock *lock);

// ... In <lib/spinlock.c>

#include <proc/task.h>

// ...

volatile void spinlockacquire(Spinlock *lock)
{
    while(lock->lock && (lock->pid != _currentProcess->pid && lock->tid != _currentThread->tid)) PMYield(); // Yield timeslice
    lock->lock = true;
    lock->pid = _currentProcess->pid;
    lock->tid = _currentThread->tid;
}

volatile void spinlockrelease(Spinlock *lock)
{
    if(lock->pid != _currentProcess->pid || lock->tid != _currentThread->tid) return; // Silently fail
    lock->lock = false;
}

// ... In <lib/def.h>

#include <lib/spinlock.h>

// ...

#define Atomic_NE(res, ...) spinlockacquire(&res); __VA_ARGS__;
#define Atomic(res, ...) Atomic_NE(res, __VA_ARGS__); spinlockrelease(&res);

With the PMYield() thing my doubt about spinlocks is dead.

Icee · Post by **Icee** » Wed Sep 10, 2014 2:40 am

Your spinlockacquire() and spinlockrelease() functions won't work in a concurrent environment because they are not atomic. You should really read a bit about atomicity.

In short, one problem is as follows: a complete or incomplete acquire operation from a different thread can occur between the "while" and the "lock->lock = true" operators. What state would that leave the spinlock in?

Brendan · Post by **Brendan** » Wed Sep 10, 2014 3:48 am

Hi,

KemyLand wrote:Do you think it will be okay to rewrite Atomic/_NE() as follows?:

Is it supposed to be a spinlock (which "spins" in a busy loop and never yields); or is it supposed to be a mutex (which does yield)? I'll assume you want spinlocks (and even if you didn't, it's best to understand spinlocks before worrying about mutexes anyway

).

Your spinlock code won't actually work correctly. It test until "lock->lock" is false; then any number of threads may acquire the lock after that but before the code has set "lock->lock" to true. This means it's possible for 2 or more threads to acquiring the lock at the same time. For multi-CPU you must use an atomic "test and set" instruction (e.g. "lock bts" or "lock xadd"). For single-CPU you can use a "test and set" instruction (but it doesn't need to be atomic - e.g. "bts" rather than "lock bts"), or you could disable all IRQs instead.

If a thread acquires the same lock twice; then typically that's a bug and the bug should not be ignored (e.g. kernel panic). If you really do want to allow a thread to acquire the same lock 2 or more times, then you need to use a counter - e.g. if a thread acquires a lock 6 times, then the lock is only released after the thread calls "spinlockrelease()" 6 times (and the lock must not be released the first time the thread calls "spinlockrelease()").

For spinlocks (that don't yield), it'd be very bad if a task switch occurs while a thread is holding a lock, as that will cause all other tasks that attempt to acquire the lock to spin (wasting a lot of CPU time for nothing). For this reason you need to disable task switches while the lock is acquired (and re-enable task switches when the lock is released). However, if a thread acquires 2 or more locks (including 2 or more completely different/unrelated locks) then you shouldn't re-enable task switches until the last lock is released; and you need a counter for that. To avoid race conditions, you need to disable task switches before doing the spinlock's "atomic test and set", not after.

Note that when I say "disable task switches", what I normally do is postpone them. For example, in the scheduler I'd do something like:

Code: Select all

    if( scheduler_disable_counter != 0) {
        task_switch_postponed_flag = true;
    } else {
        do_task_switch_like_normal();
   }

And in the code for "spinlockrelease()":

Code: Select all

    scheduler_disable_counter--;
    if(scheduler_disable_counter == 0) {
        if(task_switch_postponed_flag == true) {
            task_switch_postponed_flag = false;
            do_postponed_task_switch_now();
        }
    }

Of course all of that is example code that will need to be done with atomic instructions (e.g. "lock sub dword [scheduler_disable_counter],1; jne .counterNotZeroYet") for multi-CPU systems, and something more than plain C for single-CPU systems.

For any spinlocks used by interrupt handlers (including spinlocks acquired by functions called by functions called by interrupt handlers); disabling task switches is not enough. In addition you must ensure that the interrupt can't occur while anything is holding a lock that the interrupt handler may need; otherwise the interrupt handler will spin until the interrupt handler returns (which never happens because it's busy spinning) and the system will deadlock. For this reason I typically have 2 different types of spinlocks. One that only disables task switches, and another that disables both task switches and IRQs. You may just have the latter (if you don't need to care much about locks effecting IRQ latency). It is also possible to only disable interrupts if you never want a task switch while holding a lock, but I find that less convenient (I like being able to ask for a "postponed task switch" while holding lock/s).

Also; there is more involved in designing efficient spinlocks than I've described above (the PAUSE instruction, the "test, then atomically test and set" pattern, lock fairness and cache thrashing prevention). These issues are better left until after you've got simpler spinlocks working.

Cheers

Brendan

KemyLand · Post by **KemyLand** » Wed Sep 10, 2014 4:47 pm

Could I use GCC's built-ins, such as __sync_lock_test_and_set, instead of reducing perfomance by CLI?

Brendan · Post by **Brendan** » Wed Sep 10, 2014 6:39 pm

Hi,

KemyLand wrote:Could I use GCC's built-ins, such as __sync_lock_test_and_set, instead of reducing perfomance by CLI?

Yes; however don't be surprised if their "__sync_lock_test_and_set" is the equivalent of one instruction (e.g. "xchg") that doesn't help when trying to create well designed spinlocks.

Cheers,

Brendan

KemyLand · Post by **KemyLand** » Wed Sep 10, 2014 11:16 pm

I hope this is my final solution:

Code: Select all


// In <arch/x86/proc/atomic.c>

void spinlockacquire(volatile Spinlock *lock)
{
    while(lock->lock);
    
    // Atomical race condition (I think it is okay). How will win?
    INASM("cli");
    lock->tid = _currentProcess->pid;
    lock->tid = _currentThread->tid;
    lock->lock = true;
    INASM("sti");
}

void spinlockrelease(volatile Spinlock *lock)
{
    if(!lock->lock || lock->pid != _currentProcess->pid || lock->tid != _currentThread->tid) Panic(); // Pseudo-code
    
    INASM("cli");
    lock->lock = false;
    INASM("sti");
}

Please tell me if something is wrong here (

)

Combuster · Post by **Combuster** » Wed Sep 10, 2014 11:33 pm

This doesn't work on either single or multiprocessor systems:

Code: Select all

    while(lock->lock);
   // NOTE: interrupt happens here
    INASM("cli");

Brendan · Post by **Brendan** » Thu Sep 11, 2014 12:26 am

Hi,

Continued from:

Brendan wrote:Also; there is more involved in designing efficient spinlocks than I've described above (the PAUSE instruction, the "test, then atomically test and set" pattern, lock fairness and cache thrashing prevention). These issues are better left until after you've got simpler spinlocks working.

The simplest spinlock might look like this:

Code: Select all

.retry:
     lock bts [myLock],0     ;Set bit 0 of "myLock", and return the previous value in the carry flag
     jc .retry               ;Retry if bit 0 was already set

This is simple, but bad.

Modern CPUs try to do as many instructions as they can in parallel. For small/tight loops this means that the CPU can have many iterations of the loop in progress at the same time, which is a good thing (usually).

However, for tight polling loops (e.g. "while(lock->lock);") it's a bad thing - "multiple iterations of the loop in progress at the same time" wastes CPU resources (which is especially bad when hyper-threading is involved), and also means that when the condition changes there's multiple "future iterations" in progress that need to be discarded (which slows down exiting the loop). To solve this you want a way to tell the CPU not to do multiple iterations of the loop in parallel. This is what the "PAUSE" instruction does.

Note: The "PAUSE" instruction was introduced about 10 years ago, and isn't supported on older CPUs. However, Intel were smart and recycled an encoding that was previously a "NOP" (no operation). This means that if a "PAUSE" instruction is executed on a CPU that doesn't support it, nothing goes wrong, and you can simply assume that all CPUs support the "PAUSE" instruction.

With "PAUSE" the simple spinlock might become:

Code: Select all

.retry:
     pause                   ;Don't waste CPU resources
     lock bts [myLock],0     ;Set bit 0 of "myLock", and return the previous value in the carry flag
     jc .retry               ;Retry if bit 0 was already set

This is still simple, but still not good.

The next problem is that the "lock bts [lock],0" is a bit expensive because it ties up the bus (or cache line); and when CPU/s are constantly retrying it's bad. Fortunately you can check if there's any hope of acquiring the lock (without the expense of doing the "test and set") and then only do the more expensive "test and set" when there is some hope of acquiring the lock. This is the "test, then atomically test and set" pattern.

With "test, then atomically test and set" pattern the spinlock might become:

Code: Select all

.retry:
     pause                   ;Don't waste CPU resources
     bt [myLock],0           ;Test if bit 0 of "myLock" is already set
     jc .retry               ;Retry (without doing the more expensive "lock bts [myLock],0") if bit 0 was set

     lock bts [myLock],0     ;Set bit 0 of "myLock", and return the previous value in the carry flag
     jc .retry               ;Retry if bit 0 was already set

Now this is reasonable for when the you have to wait to acquire the lock. However, if the spinlock can be acquired immediately (no contention) then it's a bit slower that it could be. Often a spinlock can be acquired without waiting, so we should improve the lock to be more optimistic:

Code: Select all

     lock bts [myLock],0     ;Optimistically set bit 0 of "myLock", and return the previous value in the carry flag
     jnc .acquired           ;Lock was acquired is bit 0 was previously clear

.retry:
     pause                   ;Don't waste CPU resources
     bt [myLock],0           ;Test if bit 0 of "myLock" is already set
     jc .retry               ;Retry (without doing the more expensive "lock bts [myLock],0") if bit 0 was set

     lock bts [myLock],0     ;Set bit 0 of "myLock", and return the previous value in the carry flag
     jc .retry               ;Retry if bit 0 was already set

.acquired:

As far as your basic spinlock goes, this isn't that bad. However, what if you want to ensure that interrupts are disabled when the lock is acquired? This sounds easy (or at least it should), but it's a little trickier than it sounds - if you need to spin for ages, then you do not want interrupts to be disabled while you're spinning.

Here's an example of disabling interrupts properly:

Code: Select all

     cli                     ;Disable IRQs (in the hope that we do acquire the lock)
     lock bts [myLock],0     ;Optimistically set bit 0 of "myLock", and return the previous value in the carry flag
     jnc .acquired           ;Lock was acquired is bit 0 was previously clear

.retry:
     sti                     ;Enable IRQs (because we failed to acquire the lock)
     pause                   ;Don't waste CPU resources
     bt [myLock],0           ;Test if bit 0 of "myLock" is already set
     jc .retry               ;Retry (without doing the more expensive "lock bts [myLock],0") if bit 0 was set

     cli                     ;Disable IRQs (in the hope that we do acquire the lock)
     lock bts [myLock],0     ;Set bit 0 of "myLock", and return the previous value in the carry flag
     jc .retry               ;Retry if bit 0 was already set

.acquired:

That's probably enough to think about for now!

Please note that there is more involved in designing efficient spinlocks than I've described above - I still haven't discussed lock fairness or cache thrashing prevention (or instrumenting locks to detect bugs and/or heavy contention). These issues are better left until after you've got "slightly less simple" spinlocks working.

Cheers,

Brendan

KemyLand · Post by **KemyLand** » Thu Sep 11, 2014 9:54 pm

A lot of thanks to all of you, specially to Brendan. With Brendan's last code, I can finally understand atomicity (which I was misunderstanding). Also that optimizations will be very helpful.

So, just to reply, this is the Final Code:

Code: Select all

;we are in <arch/x86/proc/atomic.s
;PID/TID verifying is not included
.code32
.text
.global Spinlock
Spinlock:
push %ebp
movl %esp, %ebp
cli
lock bts $0, 8(%ebp)
jnc .acquired
.retry:
sti
pause
btl $0, 8(%ebp)
jc .retry
cli
lock bts $0, 8(%ebp)
jc .retry
.acquired:
movl %ebp, %esp
pop %ebp
ret

Antti · Post by **Antti** » Fri Sep 12, 2014 1:42 am

If this was the final code, it would not work. Is this syntaxically correct? Is this semantically correct? I am a little bit worried about copy-pasting code like this. You should only look at the logic and idea Brendan gave us. After that, you should write it by yourself using your own style (including comments).

KemyLand · Post by **KemyLand** » Fri Sep 12, 2014 2:29 am

I already read the code. Its okay (unless the unlikely case I missed something).

OSDev.org

Atomicity

Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity

Re: Atomicity