Process-Context Identifiers (PCIDs)

JasonBond · **Joined:** Sun Dec 27, 2015 6:23 am **Posts:** 2

I read about Process-Context Identifiers (PCIDs) for TLB/paging-structure caches in Intel's manual but don't understand exactly how they should be used. For one thing, are there any real-life OSes (Windows 10?) that actually use them?

I suppose the point is to avoid flushing some TLB entries when we switch to a new CR3, and to re-use the same TLB entries when switching back to a previous CR3 value. But the processor operation outlined in Intel's manual does not seem to support this. Exactly how does it benefit performance?

Is there anything similar on AMD?

Brendan · **Posted:** Sun Dec 27, 2015 7:16 am

Hi,

JasonBond wrote:

I read about Process-Context Identifiers (PCIDs) for TLB/paging-structure caches in Intel's manual but don't understand exactly how they should be used. For one thing, are there any real-life OSes (Windows 10?) that actually use them?

I don't know if any OS (Windows, OS X, Linux, *BSD, ..) supports it yet. It would be a relatively difficult thing to retro-fit into an existing kernel design (without breaking corner-cases, etc).

JasonBond wrote:

I suppose the point is to avoid flushing some TLB entries when we switch to a new CR3, and to re-use the same TLB entries when switching back to a previous CR3 value. But the processor operation outlined in Intel's manual does not seem to support this. Exactly how does it benefit performance?

Imagine the same CPU is rapidly switching between 5 different processes. In this case the performance benefit should be obvious - instead of blowing away all of a process' TLB entries every time you switch between processes, you don't (and should get a huge decrease in the number of TLB misses caused by task switching).
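As a rough illustration of the mechanism (a sketch, not taken from any particular kernel): with CR4.PCIDE set, bits 11:0 of CR3 hold the PCID, and bit 63 of the value written by MOV to CR3 tells the CPU whether it may keep the entries already tagged with that PCID. With bit 63 clear the entries for the new PCID are invalidated, which is probably why the manual's description looks like it doesn't support re-use. The helper name `make_cr3` and the constant names below are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK 0xFFFull       /* bits 11:0 hold the PCID when CR4.PCIDE = 1 */
#define CR3_NOFLUSH   (1ull << 63)   /* bit 63 of the MOV-to-CR3 source operand */

/* Hypothetical helper: build the value a kernel would load into CR3.
 * pml4_phys must be 4 KiB aligned and pcid must fit in 12 bits. */
static inline uint64_t make_cr3(uint64_t pml4_phys, uint16_t pcid, bool keep_tlb)
{
    assert((pml4_phys & CR3_PCID_MASK) == 0);
    assert(pcid <= CR3_PCID_MASK);

    uint64_t value = pml4_phys | pcid;
    if (keep_tlb)
        value |= CR3_NOFLUSH;        /* ask the CPU not to invalidate this PCID's entries */
    return value;
}
```

A kernel would pass `keep_tlb = true` only when it knows the cached entries for that PCID are still valid on the current CPU.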

The problem is multi-CPU TLB invalidation, which can get expensive even without PCID (the more CPUs you have the worse it gets, in an exponential way). With PCID you can't assume that a CPU that is no longer running a process has no TLB entries left for that process; so PCID (if implemented in a simple/bad way) can make multi-CPU TLB invalidation overhead significantly worse.

To avoid making multi-CPU TLB invalidation overhead significantly worse you need something clever/complex; and it's this "clever/complex" that would make it hard to retro-fit into existing kernels that were never designed for it.
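One possible shape for the "clever/complex" part, offered purely as an assumption: give each address space a generation counter that is bumped on every change needing an invalidation, and let each CPU remember the generation its cached entries reflect. A CPU that is not currently running the process is not interrupted; it just notices on the next switch-in that it is stale and flushes that PCID then. All names below are hypothetical, and the sketch ignores atomics and memory ordering, which a real kernel could not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct address_space {
    uint64_t tlb_gen;    /* bumped on every change that needs a TLB invalidation */
};

struct cpu_view {
    uint64_t seen_gen;   /* generation this CPU's cached entries are known to reflect */
};

/* Called by whichever CPU modifies the page tables. CPUs currently running
 * the address space would still need an IPI; all others simply become stale. */
static void note_mapping_change(struct address_space *as)
{
    as->tlb_gen++;
}

/* Called on a CPU about to switch back to the address space.
 * Returns true if the entries tagged with its PCID must be flushed first. */
static bool must_flush_on_switch_in(const struct address_space *as,
                                    struct cpu_view *view)
{
    assert(view->seen_gen <= as->tlb_gen);
    if (view->seen_gen == as->tlb_gen)
        return false;                 /* cached entries are still current */
    view->seen_gen = as->tlb_gen;     /* caller flushes, then we are up to date */
    return true;
}
```

The trade-off is that stale CPUs pay for a flush later instead of an interrupt now.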

JasonBond wrote:

Is there anything similar on AMD?

That depends what you mean by "similar". AMD's virtualisation has had "Address Space IDs" for a long time, but they can only be used for guests running inside VMs.

Cheers,

Brendan

JasonBond · **Joined:** Sun Dec 27, 2015 6:23 am **Posts:** 2

Also, the Intel manual says bits 0-11 of CR3 are used as the PCID. Is it somehow related to the usual process ID that user-mode code sees? If yes, does that impose a limit on the number of user processes allowed (4096)?

Brendan · **Posted:** Sun Dec 27, 2015 7:38 am

Hi,

JasonBond wrote:

Also, the Intel manual says bits 0-11 of CR3 are used as the PCID. Is it somehow related to the usual process ID that user-mode code sees? If yes, does that impose a limit on the number of user processes allowed (4096)?

There's 3 alternatives:

- Have a limit of 4095 processes, and use "PCID = OS process ID". This is probably fine for small systems (embedded?).
- Have some sort of PCID recycling (e.g. so that only the 4095 most recently used processes have one of the CPU's PCIDs and the others don't), plus some way to determine "CPU's PCID" from "OS process ID" where PCIDs are global (same on all CPUs). This is probably fine for medium systems (e.g. typical "single 4-core chip").
- Have some sort of PCID recycling, plus some way to determine "CPU's PCID on CPU #N" from "OS process ID" where the same process uses a different PCID on different CPUs. This might be the only sane option for large/huge/NUMA systems.
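To make the third alternative a bit more concrete, here is a minimal sketch of per-CPU PCID recycling, assuming a simple least-recently-used policy. Each CPU keeps its own small table mapping PCIDs to OS-level address-space IDs, so the same process can end up with a different PCID on each CPU. The names (`pcid_for`, `cpu_pcid_state`) and the tiny table size are invented for illustration; the hardware limit is 4096 PCIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCIDS_PER_CPU 4   /* deliberately tiny for illustration */

struct pcid_slot {
    uint64_t owner;       /* OS-level address-space ID, 0 = slot unused */
    uint64_t last_used;   /* per-CPU logical clock value at last use */
};

struct cpu_pcid_state {
    struct pcid_slot slots[PCIDS_PER_CPU];
    uint64_t clock;
};

/* Pick a PCID on this CPU for address space `owner` (must be non-zero).
 * Sets *need_flush when the PCID was unused or taken from another address
 * space, so entries tagged with it must be invalidated before re-use. */
static uint16_t pcid_for(struct cpu_pcid_state *cpu, uint64_t owner,
                         bool *need_flush)
{
    unsigned victim = 0;

    assert(owner != 0);
    cpu->clock++;
    for (unsigned i = 0; i < PCIDS_PER_CPU; i++) {
        if (cpu->slots[i].owner == owner) {
            cpu->slots[i].last_used = cpu->clock;
            *need_flush = false;          /* still ours: keep the TLB entries */
            return (uint16_t)i;
        }
        if (cpu->slots[i].last_used < cpu->slots[victim].last_used)
            victim = i;                   /* remember the least recently used slot */
    }
    cpu->slots[victim].owner = owner;
    cpu->slots[victim].last_used = cpu->clock;
    *need_flush = true;
    return (uint16_t)victim;
}
```

The slot index doubles as the PCID here; the `need_flush` result decides whether the CR3 load may preserve existing entries for that PCID.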

Cheers,

Brendan

Owen · **Posted:** Sun Dec 27, 2015 8:53 am

Linux, Windows and OS X (well, XNU) supported "PCID"s (ASIDs) long before x86 did. While the x86 PCID extension is new, ASIDs are old hat to other architectures; for example, ARM has supported them for close to two decades. Linux implemented support for the PCID extension before Intel shipped it (this is of course usual: CPU vendors upstream feature support to Linux before they ship features).

The performance characteristics vary depending upon the architecture. For example, ARM architecture CPUs support broadcast TLB invalidate instructions, and therefore the overhead of multi-core TLB invalidations is orders of magnitude lower than on x86, where interrupts are required.

Brendan · **Posted:** Sun Dec 27, 2015 9:21 am

Hi,

Owen wrote:

Linux, Windows and OS X (well, XNU) supported "PCID"s (ASIDs) long before x86 did. While the x86 PCID extension is new, ASIDs are old hat to other architectures; for example, ARM has supported them for close to two decades. Linux implemented support for the PCID extension before Intel shipped it (this is of course usual: CPU vendors upstream feature support to Linux before they ship features).

I haven't been able to find one single thing online that suggests Linux supports PCID on 80x86. The closest I found is emails on the Linux kernel mailing list (from April 2015) talking about maybe implementing support for it one day, where (as far as I can tell) they all forgot about it and didn't implement anything.

Apparently Intel did add support for it, but it didn't help performance and nobody ever saw the patch.

Owen wrote:

The performance characteristics vary depending upon the architecture. For example, ARM architecture CPUs support broadcast TLB invalidate instructions, and therefore the overhead of multi-core TLB invalidations is orders of magnitude lower than on x86, where interrupts are required.

I'd assume that making the TLBs cache coherent (e.g. checking all TLB entries whenever any CPU does any write) would be a massive performance disaster (but after you've already got that massive performance disaster, there's no additional pain involved in adding support for ASID/PCID).

Cheers,

Brendan

Owen · **Posted:** Sun Dec 27, 2015 3:06 pm

Brendan wrote:

Owen wrote:

The performance characteristics vary depending upon the architecture. For example, ARM architecture CPUs support broadcast TLB invalidate instructions, and therefore the overhead of multi-core TLB invalidations is orders of magnitude lower than on x86, where interrupts are required.

I'd assume that making the TLBs cache coherent (e.g. checking all TLB entries whenever any CPU does any write) would be a massive performance disaster (but after you've already got that massive performance disaster, there's no additional pain involved in adding support for ASID/PCID).

Cheers,

Brendan

TLBs aren't cache coherent, but the TLBI instruction has variants which broadcast a TLB invalidation to all CPUs in a cache coherency domain; for example, "TLBI VAE1IS" invalidates the TLB entry for the given ASID and VA on all CPUs in the same inner shareable domain, and "TLBI ASIDE1IS" does the same for every entry tagged with the given ASID (all cores running a given OS must be in the same inner shareable domain).

OSDev.org
