Best or fastest way to determine which CPU is running (SMP)

rod · **Joined:** Mon Feb 10, 2014 7:42 am **Posts:** 21

I enabled SMP in x86_64 and now I want to know which CPU or core is running the interrupt handler in each moment (timer, etc.) in order to make some decisions (scheduling, etc.).

As far as I know there are several methods:

- Provide different page table mappings for different cores and store different values at the same virtual address, then read those values. This should be fast, but I've read that with HyperThreading the two threads of a core share the same page tables, so it wouldn't work in that case.
- CPUID with eax=1 returns the initial APIC ID in bits 31:24 of ebx. But I've read that the CPUID instruction is quite slow; could it take 100 cycles?
- Read the local APIC's memory-mapped registers: APIC_BASE (usually 0xFEE00000) + 0x20 is the APIC ID Register, which should return the same value as CPUID. For this to work, all cores should share the same APIC_BASE address (as obtained from the corresponding bits of rdmsr(0x1B)). Is that guaranteed? Is there much latency when reading from that memory-mapped area?
- The RDTSCP instruction, which also loads IA32_TSC_AUX into ecx (that MSR could be used to hold a per-CPU value).
- Some value stored in the GDT.
- Some other processor-specific register that can be quickly checked.

Which one would be better or faster? Are there any other methods?
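
For the CPUID option, here is a minimal sketch in C, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid`. The function names `apic_id_from_cpuid1_ebx` and `current_apic_id_cpuid` are made up for illustration. It only covers the 8-bit initial APIC ID from leaf 1; systems with more than 255 logical CPUs would need the x2APIC ID from leaf 0xB instead.

```c
#include <stdint.h>

/* CPUID leaf 1 returns the initial APIC ID in bits 31:24 of EBX. */
static inline uint8_t apic_id_from_cpuid1_ebx(uint32_t ebx)
{
    return (uint8_t)(ebx >> 24);
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Returns the initial APIC ID of the logical CPU executing this code,
 * or -1 if leaf 1 is not supported. CPUID is a serializing instruction,
 * which is a large part of why it is considered slow. */
static inline int current_apic_id_cpuid(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return apic_id_from_cpuid1_ebx(ebx);
}
#endif
```

A kernel would typically call something like `current_apic_id_cpuid()` once per CPU during bring-up and cache the result in per-CPU data, rather than executing CPUID in every interrupt.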

LtG · **Joined:** Thu Aug 13, 2015 4:57 pm **Posts:** 384

HyperThreading doesn't cause paging to be shared; the two logical CPUs are independent in that respect. However, the TLB resources may (AFAIK will) be shared, so effectively the TLB available to each HT logical CPU is halved, which may (in practice will) impact performance. You'll still likely get more performance from proper HT usage...

Another alternative is to use separate IDTs for each core. The ISR that runs then already knows which core it's on, because the code is different for each core.

Whatever you choose you'll likely need CPU/core specific data areas (the first option you listed) so that might be the easiest and most convenient option.

Edit: I don't know how slow CPUID is, but assuming your 100 cycles, it's possible (depending on your OS) that accessing memory will in practice almost always cause a cache miss and thus be even slower, for example if the "core specific data area" is accessed so infrequently that it is always out of cache. However, I wouldn't optimize something this small at this point. After your OS is "complete" you can decide what gives the best performance; for now use what makes the most sense and leave optimizations till later.

iansjack · **Posted:** Fri Jul 07, 2017 9:40 am

Read the ID register in the local APIC?
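
A minimal sketch of that read in C, assuming xAPIC mode (memory-mapped registers, APIC ID in bits 31:24 of the register at offset 0x20). The names `lapic_read` and `lapic_current_id`, and the `base` parameter (the virtual address at which the kernel mapped the APIC page), are hypothetical.

```c
#include <stdint.h>

#define LAPIC_DEFAULT_BASE 0xFEE00000u  /* usual physical base after reset */
#define LAPIC_REG_ID       0x20u        /* APIC ID register offset */

/* Read one 32-bit local APIC register. 'base' is wherever the kernel
 * mapped the APIC register page (assumed uncached). */
static inline uint32_t lapic_read(const volatile void *base, uint32_t reg)
{
    return *(const volatile uint32_t *)((const volatile uint8_t *)base + reg);
}

/* APIC ID of the CPU executing the read: each CPU sees its own local
 * APIC at the same address, so no per-CPU mapping is needed. */
static inline uint8_t lapic_current_id(const volatile void *base)
{
    return (uint8_t)(lapic_read(base, LAPIC_REG_ID) >> 24);
}
```

This only works as written if every CPU's IA32_APIC_BASE MSR holds the same base address, which is the reset default unless the OS or firmware relocates it.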

xenos · **Posted:** Fri Jul 07, 2017 11:02 am

What about reading the task register? For interrupts with privilege level change you should have one TSS per core, and so each core should have a unique TSS selector, to which the task register points.

I haven't compared the reading performance with APIC ID register, though.
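
A rough sketch of the task-register idea in C, assuming the kernel placed one TSS descriptor per CPU back to back in the GDT (a layout choice, not something the hardware requires). `cpu_index_from_tr`, `read_tr` and `first_tss_sel` are illustrative names.

```c
#include <stdint.h>

/* In long mode a TSS descriptor is 16 bytes, i.e. two GDT slots. */
#define TSS_DESC_SIZE 16u

/* Map a task-register selector to a CPU index. Assumes per-CPU TSS
 * descriptors are contiguous in the GDT, starting at first_tss_sel.
 * The low three selector bits (RPL and TI) are masked off. */
static inline unsigned cpu_index_from_tr(uint16_t tr, uint16_t first_tss_sel)
{
    return (unsigned)((tr & ~7u) - (first_tss_sel & ~7u)) / TSS_DESC_SIZE;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Read the task register with STR. Intended for kernel code; in user
 * mode the instruction may fault when UMIP is enabled. */
static inline uint16_t read_tr(void)
{
    uint16_t sel;

    __asm__ volatile("str %0" : "=r"(sel));
    return sel;
}
#endif
```

In an ISR this would be used as `cpu_index_from_tr(read_tr(), first_tss_sel)`, which costs one STR plus a subtract and a shift.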

Korona · **Joined:** Thu May 17, 2007 1:27 pm **Posts:** 999

Use gs to point to cpu-specific data on x86_64. Use the swapgs instruction to swap between user-mode gs and the cpu-specific pointer in the kernel. syscall basically forces you to use gs/swapgs for this purpose, as it does not give you a stack to save your other registers on.
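
As a sketch of what the per-CPU block behind gs might look like (the struct layout and the names `struct percpu`, `percpu_cpu_id` and `this_cpu_id` are assumptions for illustration, and the inline assembly assumes GCC or Clang):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU block. During bring-up the kernel points each
 * CPU's kernel GS base at its own instance. */
struct percpu {
    struct percpu *self;     /* linear address of this block */
    uint32_t cpu_id;         /* index assigned at boot */
    uint32_t apic_id;        /* cached local APIC ID */
    void *kernel_stack_top;  /* loaded by the syscall entry path */
};

/* Portable accessor, usable once a pointer to the block is known. */
static inline uint32_t percpu_cpu_id(const struct percpu *p)
{
    return p->cpu_id;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Kernel-only: fetch cpu_id with a single GS-relative load. Only valid
 * after swapgs has installed the kernel GS base on this CPU. */
static inline uint32_t this_cpu_id(void)
{
    uint32_t id;

    __asm__ volatile("movl %%gs:%c1, %0"
                     : "=r"(id)
                     : "i"(offsetof(struct percpu, cpu_id)));
    return id;
}
#endif
```

The `self` pointer is there because a GS-relative access cannot easily yield the block's own linear address; storing it at offset 0 lets the kernel recover a normal pointer with one load.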

simeonz · **Joined:** Fri Aug 19, 2016 10:28 pm **Posts:** 360

rod wrote:

Provide different page table mappings for different cores and store different values in the same virtual address, then read those values.

I believe this will require one version of every process's address space for each cpu. In particular, one cpu specific page table must be created for each address translation level. And on-the-fly changes to such address space will get complicated as well.

rod wrote:

Are there any other methods?

Honestly, I am mostly a spectator here (for educational purposes), but skimming over the Linux kernel sources I see that the x86-64 ISR uses the "swapgs" instruction on entry. This exchanges the GS base with a value held in an MSR (IA32_KERNEL_GS_BASE). In kernel mode the GS base points to a per-cpu structure (their ABI, you could say), which means that the kernel can store all sorts of cpu-specific information as fields in it, including a CPU id (which you want), pointers to per-cpu scheduler queues, etc. You can also get the GS base or the cpu id from your ISR stack, assuming it was configured through the interrupt stack table individually for each cpu. Essentially, you either need to get the kernel stack from the per-cpu structures or you need to get the per-cpu structures from the kernel stack. But either way, once you end up with a per-cpu state, you will have a "cache" of the cpu id as a field in the per-cpu data. The instruction is actually mentioned in the wiki.

Now, this may not actually be as reliable as some of the methods you have mentioned. The technique here assumes that the per-cpu structure is consistent between ISR invocations.

Edit: Korona gave you the answer already, but I will leave my answer as well, in case there is something useful in it.

iansjack · **Posted:** Fri Jul 07, 2017 12:20 pm

You might want to read this note about problems with the swapgs instruction. https://www.kernel.org/doc/Documentatio ... try_64.txt If all you want to do is identify which processor a task is running on I would suggest that the local APIC is the simplest and most reliable source of information.

LtG · **Joined:** Thu Aug 13, 2015 4:57 pm **Posts:** 384

iansjack wrote:

You might want to read this note about problems with the swapgs instruction. https://www.kernel.org/doc/Documentatio ... try_64.txt If all you want to do is identify which processor a task is running on I would suggest that the local APIC is the simplest and most reliable source of information.

Is there something that makes it _more_ reliable than some of the others (paging for instance)? Also, assuming you need CPU specific data anyway, then I don't really see it as simpler either.

tsdnz · **Joined:** Sun Jun 16, 2013 4:09 am **Posts:** 333

XenOS wrote:

I haven't compared the reading performance with APIC ID register, though.

I write a handler for each core, which is very fast. Reading the APIC ID register takes a few cycles that I am not willing to waste.

Ali

iansjack · **Posted:** Sat Jul 08, 2017 12:32 am

I suppose it's a trade-off between code size, complexity, and speed. As far as I am concerned, interrupts typically occur at the end of a relatively lengthy pause (waiting for a key press, waiting for a network or USB frame, waiting for a disk sector read, etc.) so a clock cycle here or there isn't going to make any difference. An exception would be the timer tick, so it might be sensible to use the local APIC timer to drive separate interrupts on individual cores.

Brendan · **Posted:** Sat Jul 08, 2017 12:34 am

Hi,

rod wrote:

Are there any other methods?

The only other method that I've heard of (that someone hasn't already mentioned) is using a debug register (e.g. DR3). This might actually be the fastest method (if you're willing to limit things like debuggers to 3 breakpoints instead of 4).

tsdnz wrote:

XenOS wrote:

I haven't compared the reading performance with APIC ID register, though.

I write a handler for each core, which is very fast. Reading the APIC ID register takes a few cycles that I am not willing to waste.

You'd pay for that in terms of cache misses. E.g. if L1 cache is shared by 2 CPUs, L2 cache is shared by 4 CPUs and L3 cache is shared by 8 CPUs; then using "different interrupt handler per CPU" means that other CPUs don't cause the cache line/s you need to be brought into caches you share; which means that it's more likely a CPU will have to fetch the IDT entry and the interrupt handler's code from further away (e.g. from RAM instead of L2 cache).

simeonz wrote:

rod wrote:

Provide different page table mappings for different cores and store different values in the same virtual address, then read those values.

I believe this will require one version of every process's address space for each cpu. In particular, one cpu specific page table must be created for each address translation level. And on-the-fly changes to such address space will get complicated as well.

If you support multi-threaded processes (where 2 threads that belong to the same process could be running on different CPUs at the same time) it'd have to be worse than "virtual address space per process per CPU".

What I do is have a virtual address space for each thread; then patch part of the thread's virtual address space during task switches (before loading CR3 to avoid TLB invalidation) to get "per-CPU", "per-core" and "per NUMA domain" areas of kernel space. However, I'm using "virtual address space for each thread" for other reasons (to split user-space into "process space" and "thread space", and ensure one thread can't access data in a different thread's "thread space") and wouldn't do it like this if I wasn't already using "virtual address space for each thread".

Cheers,

Brendan

simeonz · **Joined:** Fri Aug 19, 2016 10:28 pm **Posts:** 360

Brendan wrote:

tsdnz wrote:

XenOS wrote:

I haven't compared the reading performance with APIC ID register, though.

I write a handler for each core, which is very fast. Reading the APIC ID register takes a few cycles that I am not willing to waste.

You'd pay for that in terms of cache misses...

That was my first thought as well. However, couldn't you have an array of trampolines that simply call into the shared ISR code? The trampolines will effectively push the following address to the stack, which the ISR can use to determine the cpu index. It may be the fastest way actually, albeit a bit idiosyncratic.
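
To make the trampoline idea concrete, here is a sketch of the index arithmetic in C. It assumes each stub begins with a 5-byte `call rel32` to the shared ISR and that stubs are padded to a fixed stride; `TRAMPOLINE_STRIDE`, `cpu_from_return_address` and `table_base` are hypothetical names, and the stub table itself would be written in assembly.

```c
#include <stdint.h>

#define TRAMPOLINE_STRIDE 8u  /* assumed: each stub padded to 8 bytes */
#define CALL_REL32_LEN    5u  /* "call rel32": opcode E8 plus 4-byte offset */

/* The call in stub N pushes (table_base + N * stride + 5) as its return
 * address. Since 5 is smaller than the stride, integer division recovers
 * N without subtracting the call length first. */
static inline unsigned cpu_from_return_address(uintptr_t ret, uintptr_t table_base)
{
    return (unsigned)((ret - table_base) / TRAMPOLINE_STRIDE);
}
```

The shared ISR would pop that return address off the top of the interrupt frame before using it, and each CPU still needs its own IDT (or at least its own IDT entries) pointing at its stub, so only the handler body is shared between CPUs.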

Brendan wrote:

What I do is have a virtual address space for each thread

This obviously aligns with the security goals of the OS, especially considering applications that service different clients using threads (possibly with impersonation). However, it is also interesting because the program parallelism in such a case is almost process-based. The relation to multi-threading (assuming that I understood the scheme) is that shared data pointers match in both thread-like processes and can be worked with natively, without translation. If we extrapolate the same principle, maybe multiple executables can share data in this way if the shared region is mapped at a consistent location by the OS. So multi-threading can be replaced entirely by a memory mapping/sharing API that supports a consistent inter-process layout. (I do not discuss sharing function pointers here, because of the negative security implications.)

Sorry for the off-topic.

Geri · **Joined:** Sun Jul 14, 2013 6:01 pm **Posts:** 442

I do it with cpuid.

Also, if you feel that you must execute cpuid again and again, to such an extent that it slows your code down, you are probably doing something terribly wrong.

iansjack · **Posted:** Sat Jul 08, 2017 6:45 am

I think you may have misunderstood the question.

samiam95124 · **Joined:** Sun Sep 11, 2016 12:54 pm **Posts:** 9

My guess:

Task register, to a unique TSS (which is usually per core in any case), then to a value in the TSS. The TSS limit is set in the descriptor, which implies that you can make the segment longer than needed and store goodies after the TSS data, so voila, there is a place for a core number at a simple offset. You will have more than one thread per core, but in most implementations you don't use hardware task switching and instead use the (fake) TSS to bounce the stack pointer. Thus all threads on the same core use the same TSS, and all of them yield the same core number from it.

Other idea: since you are not actually using the TSS to store registers, you can repurpose those fields so that you are not wasting the whole TSS per core.
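
A sketch of the first idea (a TSS made longer than the hardware needs) in C, assuming the 64-bit TSS layout and GCC/Clang's `packed` attribute. `struct cpu_tss`, `cpu_id_from_tss` and `CPU_TSS_LIMIT` are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS as the architecture defines it: 104 bytes. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];       /* RSP0..RSP2 */
    uint64_t reserved1;
    uint64_t ist[7];       /* IST1..IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;   /* set at or beyond the limit for "no I/O bitmap" */
} __attribute__((packed));

/* Hypothetical extension: kernel-private data appended after the
 * hardware-defined part, covered by a larger descriptor limit. */
struct cpu_tss {
    struct tss64 hw;
    uint32_t cpu_id;
} __attribute__((packed));

_Static_assert(sizeof(struct tss64) == 104, "64-bit TSS must be 104 bytes");

/* Value for the limit field of the TSS descriptor: size minus one. */
#define CPU_TSS_LIMIT (sizeof(struct cpu_tss) - 1)

/* Given a pointer to this CPU's TSS, the core number is one load away. */
static inline uint32_t cpu_id_from_tss(const struct cpu_tss *t)
{
    return t->cpu_id;
}
```

Getting from the task register to the `struct cpu_tss` pointer still takes a lookup (for example indexing a kernel array by selector), so this is mainly convenient when the kernel already keeps such a pointer around.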

Scott Franco
San Jose
