How is TLS DTV generation even supposed to work?

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

The TLS model with its generation number in the DTV vector seems flawed.

A few documents (there are many of them):
https://www.uclibc.org/docs/tls.pdf
https://android.googlesource.com/platfo ... elf-tls.md

Having one generation number for the entire DTV suggests that the DTV vector must be monotonically increased when you load more dynamic modules. Nothing in the documentation tells us what happens when you unload the dynamic module. Can a DTV entry be replaced with a new module? In that case there is no possibility to track such change with only one generation number for the entire DTV.

I think of the DTV as a collection of possible values: an entry could be a pointer to the initial executable TLS area, a pointer to a dynamic module TLS area, or just a custom value provided by interfaces like TlsAlloc/TlsFree/TlsGetValue/TlsSetValue on Windows or pthread_key_create/pthread_key_delete/pthread_getspecific/pthread_setspecific on POSIX.

In order for the DTV to be dynamically allocated (meaning you can be handed an arbitrary entry in the DTV), you would need a generation number for each DTV entry in order to verify that the entry is valid; otherwise it will not work.

Is it just me who doesn't understand this, or is there something really clever I haven't understood?

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

Each module is assigned a DTV number on load. In musl, this can just be a growing number, since musl does not support module unloading (in fact, TLS was one of the reasons for that). However, more generally, a libc would keep track of used DTV numbers and assign the next unused one on load. Similar to how file descriptors work.
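To illustrate the file-descriptor analogy, here is a minimal sketch of "assign the lowest unused number on load". The names `tls_modid_alloc`, `tls_modid_free` and the fixed `MAX_TLS_MODULES` limit are made up for this example and are not taken from any libc. ID 0 is treated as invalid in the sketch, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TLS_MODULES 64   /* hypothetical fixed limit for this sketch */

/* in_use[i] is true while module ID i belongs to a loaded module.
 * ID 0 is reserved and doubles as the "no ID available" return value. */
static bool in_use[MAX_TLS_MODULES] = { [0] = true };

/* Return the lowest free module ID, or 0 if the table is full. */
size_t tls_modid_alloc(void)
{
    for (size_t i = 1; i < MAX_TLS_MODULES; i++) {
        if (!in_use[i]) {
            in_use[i] = true;
            return i;
        }
    }
    return 0;
}

/* Mark a module ID as free again, e.g. when the module is unloaded. */
void tls_modid_free(size_t id)
{
    assert(id > 0 && id < MAX_TLS_MODULES && in_use[id]);
    in_use[id] = false;
}
```

After allocating IDs 1, 2 and 3 and freeing 2, the next allocation returns 2 again, just as open() would hand back the lowest free descriptor.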

The DTV number is an index into a thread-local array of pointers. When a new module is loaded, new DTV vectors are allocated for all the threads. All the existing pointers are copied, and for the new module you get a new memory block. Deallocation of the old DTV vectors is the real hard trick, but that's beside the point.
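A rough sketch of the copy-and-extend step for a single thread might look like this. The `struct dtv` layout and `dtv_grow` are assumptions for illustration only; real implementations differ, and the hard part mentioned above (safely freeing the old vector while the thread may still be using it) is deliberately not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread DTV: slot[modid] is the base of that module's
 * TLS block for this thread. Slot 0 is unused in this sketch. */
struct dtv {
    size_t len;      /* number of slots, including slot 0 */
    void  *slot[];
};

/* Build a DTV that also covers new_modid: copy the existing pointers and
 * install new_block. Returns NULL on allocation failure. The caller stays
 * responsible for the old vector. */
struct dtv *dtv_grow(const struct dtv *old, size_t new_modid, void *new_block)
{
    assert(new_modid > 0);
    size_t old_len = old ? old->len : 1;
    size_t new_len = new_modid + 1 > old_len ? new_modid + 1 : old_len;

    /* calloc leaves all slots zeroed, which this sketch treats as NULL. */
    struct dtv *d = calloc(1, sizeof *d + new_len * sizeof d->slot[0]);
    if (d == NULL)
        return NULL;

    d->len = new_len;
    if (old)
        memcpy(d->slot, old->slot, old->len * sizeof d->slot[0]);
    d->slot[new_modid] = new_block;
    return d;
}
```

dlopen() would run something like this once per thread, which is why it needs a way to reach every thread in the process.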

When a module is unloaded, the corresponding pointer can be set NULL and the DTV number marked free. When the next module is loaded, you can re-use that spot. This time, deallocation is actually easy, since the module can only be unloaded when it is no longer in use.

The keys for POSIX thread-specific data would in theory be a good use for TLS, but unfortunately the requirement that the new TSD pointer be NULL after pthread_key_create() combined with it being explicitly allowed that TSD pointers be not NULL on pthread_key_delete() means the implementation has to be able to NULL all the pointers itself at either create or delete time, and normal ELF TLS doesn't allow this. Beyond this, those keys have nothing to do with DTV numbers.

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

nullplan wrote:

When a module is unloaded, the corresponding pointer can be set NULL and the DTV number marked free. When the next module is loaded, you can re-use that spot. This time, deallocation is actually easy, since the module can only be unloaded when it is no longer in use.

Hypothetically, when you have a TLS access (with __tls_get_addr, which it must be, as this is a shared library), it will discover that the generation number is out of date. The size of the DTV might be the same, so there is no reason to resize. It can then go through the DTV to check whether any modules have been unloaded and set those entries to NULL. However, if a new module is in the same spot as an old module that was previously unloaded, then there is no way to determine whether the pointer to the local TLS area belongs to the old or the new module with only a pointer. You need extra information in the DTV entry to determine that, such as a generation number. With that, you can detect that a new module is in the same spot and run the initialization code for that TLS area.

That's why I question if the TLS model is sane for reusing DTV entry spots. Maybe that's what was discovered by the Musl developers.
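The per-entry generation idea could be sketched roughly as follows. This is not how any existing libc does it; `struct tls_module`, `struct dtv_entry` and `tls_lookup` are hypothetical names, the loader is assumed to draw `gen` values from a global counter that never repeats, and all synchronization is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODS 8   /* hypothetical limit */

/* Process-wide description of what currently occupies each slot. */
struct tls_module {
    unsigned long gen;    /* unique per load into this slot; 0 = slot empty */
    const void *image;    /* initialization image */
    size_t image_size;    /* bytes to copy from image */
    size_t size;          /* total block size; the rest is zero-filled */
};
static struct tls_module modules[MAX_MODS];

/* Per-thread DTV entry that remembers which load it was built for. */
struct dtv_entry {
    unsigned long gen;
    void *block;
};

/* Lazy lookup: rebuild the entry if the slot was reused since last access.
 * Returns NULL if the slot is empty or memory is exhausted. */
void *tls_lookup(struct dtv_entry *dtv, size_t modid, size_t offset)
{
    assert(modid < MAX_MODS);
    const struct tls_module *m = &modules[modid];
    struct dtv_entry *e = &dtv[modid];

    if (e->gen != m->gen) {
        free(e->block);              /* block of the old, unloaded module */
        e->block = NULL;
        if (m->gen != 0) {
            void *p = calloc(1, m->size);
            if (p == NULL)
                return NULL;         /* e->gen stays stale, so we retry later */
            if (m->image_size)
                memcpy(p, m->image, m->image_size);
            e->block = p;
        }
        e->gen = m->gen;
    }
    return e->block ? (char *)e->block + offset : NULL;
}
```

With only a bare pointer in the entry, the `e->gen != m->gen` comparison would have nothing to compare against, which is the gap described above.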

nullplan wrote:

The keys for POSIX thread-specific data would in theory be a good use for TLS, but unfortunately the requirement that the new TSD pointer be NULL after pthread_key_create() combined with it being explicitly allowed that TSD pointers be not NULL on pthread_key_delete() means the implementation has to be able to NULL all the pointers itself at either create or delete time, and normal ELF TLS doesn't allow this. Beyond this, those keys have nothing to do with DTV numbers.

Correct that the keys have nothing to do with DTV entries but implementation wise it could be possible. The infrastructure is already there and it can be reused rather than having yet another vector for the thread keys.

There is a new format called TLS descriptors; the DTV is still there, but the question is whether it solves the problem that I just described. This new format is, however, not widely available yet, only for certain architectures and compilers (GCC). It looks like the generation number is injected as a relocation into the descriptor, which I think is how this should be done.
https://www.fsfla.org/~lxoliva/writeups ... lk2006.pdf

This paper is from 2006 and this model is still not used everywhere. Some things are introduced slowly.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

OSwhatever wrote:

Hypothetically, when you have a TLS access (with __tls_get_addr, which it must be, as this is a shared library), it will discover that the generation number is out of date. The size of the DTV might be the same, so there is no reason to resize. It can then go through the DTV to check whether any modules have been unloaded and set those entries to NULL. However, if a new module is in the same spot as an old module that was previously unloaded, then there is no way to determine whether the pointer to the local TLS area belongs to the old or the new module with only a pointer. You need extra information in the DTV entry to determine that, such as a generation number. With that, you can detect that a new module is in the same spot and run the initialization code for that TLS area.

Why would you unload the TLS lazily? After unloading a module with TLS, there is no reason to presume that any other module with TLS even remains in the process, or accesses the TLS soon after. No, I was thinking the thread calling dlclose() could just iterate over all other threads and set the DTV pointer for that module to NULL. This of course requires having a good thread list implementation. Then, next time dlopen() is called (on a module with TLS, natch), it can just check if it has a NULL pointer in the existing DTV and reuse the DTV number instead of increasing the size.
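As a sketch of that eager approach, assuming TLS blocks are allocated eagerly in dlopen() (so a NULL slot really means "free") and assuming some thread list exists: `struct thread`, `tls_unload_module` and `tls_find_free_slot` are invented names, and the list locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DTV_SLOTS 8   /* hypothetical fixed DTV size */

struct thread {
    struct thread *next;     /* hypothetical thread list, NULL-terminated */
    void *dtv[DTV_SLOTS];    /* dtv[modid] = this thread's block; slot 0 unused */
};

/* dlclose() side: free and clear the module's slot in every thread.
 * The caller is assumed to hold whatever lock protects the list. */
void tls_unload_module(struct thread *threads, size_t modid)
{
    assert(modid > 0 && modid < DTV_SLOTS);
    for (struct thread *t = threads; t != NULL; t = t->next) {
        free(t->dtv[modid]);
        t->dtv[modid] = NULL;
    }
}

/* dlopen() side: with eager allocation, a NULL slot in any one thread's
 * DTV means the number is free. Returns 0 if every slot is taken. */
size_t tls_find_free_slot(const struct thread *self)
{
    for (size_t i = 1; i < DTV_SLOTS; i++)
        if (self->dtv[i] == NULL)
            return i;
    return 0;
}
```

Because the slot is cleared in all threads before the number is handed out again, no thread can mistake an old block for the new module's block.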

Having dlopen() allocate the TLS memory would allow it to fail on memory exhaustion. __tls_get_addr() cannot fail, it can only crash (well, abort, but there is no real difference).

OSwhatever wrote:

Correct that the keys have nothing to do with DTV entries but implementation wise it could be possible. The infrastructure is already there and it can be reused rather than having yet another vector for the thread keys.

As I tried to say, it unfortunately fails to match with the specification. You have to iterate over all threads and set the new TSD pointer NULL either in pthread_key_create() or pthread_key_delete(). Which is easy if you have the TSD vector as part of the thread descriptor, similar to the DTV vector, but next to impossible if you have the TSD vector as some thread-local array in libc's TLS memory.
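A minimal sketch of the "TSD vector in the thread descriptor" variant, clearing at create time: `struct thread`, `tsd_key_create` and `tsd_key_delete` are hypothetical stand-ins rather than real libc internals, and destructors and locking are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TSD_KEYS 16   /* hypothetical limit */

struct thread {
    struct thread *next;     /* hypothetical thread list, NULL-terminated */
    void *tsd[TSD_KEYS];     /* TSD vector living in the thread descriptor */
};

static bool key_used[TSD_KEYS];

/* Claim a free key and reset its value to NULL in every thread, so the
 * "new key reads as NULL" requirement holds even if the previous owner
 * of the key left values behind. Returns 0 on success, -1 if out of keys. */
int tsd_key_create(struct thread *threads, unsigned *key)
{
    for (unsigned k = 0; k < TSD_KEYS; k++) {
        if (key_used[k])
            continue;
        key_used[k] = true;
        for (struct thread *t = threads; t != NULL; t = t->next)
            t->tsd[k] = NULL;
        *key = k;
        return 0;
    }
    return -1;
}

/* Deleting only releases the key; stale values may remain in threads. */
void tsd_key_delete(unsigned key)
{
    assert(key < TSD_KEYS && key_used[key]);
    key_used[key] = false;
}
```

The loop over `threads` is the part that becomes awkward if the vector instead lives in libc's own ELF TLS block, since one thread has no ordinary way to reach another thread's copy.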

OSwhatever wrote:

This paper is from 2006 and this model is still not used everywhere. Some things are introduced slowly.

The TLS paper itself is only from a few years prior to that. Bear in mind that through most of the '90s, threading was this weird research project some people were apparently excited about, but the Unix buffs didn't get the hype at the time. And the now-ubiquitous NPTL implementation of POSIX threads on Linux would also take some time to develop (and before that you had this weird system with the thread server, where the threads were actually different processes).

No, I don't think its young age is the reason for lack of adoption of this extension, it is because it does not solve a pressing need. Several CPU extensions were rolled out in the time since then and have seen greater adoption, partly because they actually do solve a problem.

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

nullplan wrote:

Why would you unload the TLS lazily? After unloading a module with TLS, there is no reason to presume that any other module with TLS even remains in the process, or accesses the TLS soon after. No, I was thinking the thread calling dlclose() could just iterate over all other threads and set the DTV pointer for that module to NULL. This of course requires having a good thread list implementation. Then, next time dlopen() is called (on a module with TLS, natch), it can just check if it has a NULL pointer in the existing DTV and reuse the DTV number instead of increasing the size.

Having dlopen() allocate the TLS memory would allow it to fail on memory exhaustion. __tls_get_addr() cannot fail, it can only crash (well, abort, but there is no real difference).

My mindset is really set to do everything as lazy as possible. You are right that you can iterate through all threads and clear the DTV entry when the module is unloaded. One problem is that threads and the number of threads is a moving target and it perhaps requires locking some thread list which I would like to avoid. There are lockless variants but they often have other limitations and iterating through those might open up for race condition bugs.

nullplan wrote:

As I tried to say, it unfortunately fails to match with the specification. You have to iterate over all threads and set the new TSD pointer NULL either in pthread_key_create() or pthread_key_delete(). Which is easy if you have the TSD vector as part of the thread descriptor, similar to the DTV vector, but next to impossible if you have the TSD vector as some thread-local array in libc's TLS memory.

I see this as exactly the same problem as with dynamic modules.

Anyway, as we are making our own systems we are free to do whatever we want.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

OSwhatever wrote:

My mindset is really set to do everything as lazy as possible.

Mine is the opposite. Many lazy primitives have bad effects when they fail for some reason. For example: lazy binding. If you process relocations in dlopen(), you can fail immediately if some required function is not there. (Or, on program startup, you can abort the program before it has a chance to run.) If you do it lazily, however, and a function is referenced that is not there, you can only abort the program. That is, after the program has already started to run for an unknown amount of time. Imagine the program is some kind of productivity app, and the function that is not there is somehow required to save an open document. With lazy relocation, the program appears to work fine, but when you click on "save", the program crashes. With eager relocation, the program would not even start, and you can sort out the issue immediately, before wasting time working on something that then could not be saved.

Besides, the argument in favor of lazy TLS allocation is to minimize resource usage, since likely not every thread will access every TLS module. Fine, agreed, but that argument only works for allocation. Even in such a system you would want eager deallocation, to return the memory to the available pool as quickly as possible.

OSwhatever wrote:

One problem is that threads and the number of threads is a moving target and it perhaps requires locking some thread list which I would like to avoid.

You will probably have a reason to have a thread list before long. And if you use a circular doubly-linked list, then insertion is very fast, you only need to set four pointers in three cache lines. So the list is not locked for long. On exit, you only need to set two pointers, but you need the kernel to unlock the list on thread exit. Otherwise you could have unlisted threads running around, and that can have negative consequences.
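For reference, the pointer updates in question look roughly like this. It is a generic circular doubly-linked list with a sentinel head; the names are made up, and the lock that must surround these calls (and the kernel-assisted unlock on thread exit) is not shown.

```c
#include <assert.h>

struct thread_node {
    struct thread_node *next, *prev;
};

/* An empty circular list is a head node linked to itself. */
void thread_list_init(struct thread_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n right after head: four pointer stores, touching three nodes
 * (n, head, and head's old successor). */
void thread_list_insert(struct thread_node *head, struct thread_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n: two pointer stores, both in the neighbours. */
void thread_list_remove(struct thread_node *n)
{
    assert(n->next != n);   /* never unlink the head of an empty list */
    n->prev->next = n->next;
    n->next->prev = n->prev;
}
```

Since neither operation has to search the list, the critical section is a handful of stores regardless of how many threads exist.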

I agree that the lockfree variants are likely not worth it. You would need some very specific circumstances to make them work, and as soon as you get significantly beyond a singly-linked list, the complexity just mounts up. It is so easy to get something wrong with these.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3192

I think I fail to understand why the TLS function would be related to modules / DLLs, and why it would need thread lists. In my implementation, TLS is handled by the executable loader (PE loader) in kernel. On thread creation it will allocate memory for a bitmap of pointers, and will set FS to reference the TLS storage in user space. On thread termination, the bitmap will be deallocated. The executable loader also handles TLS sections in the image by allocating indexes on load and deallocating them on unload. TLS indexes are per application, not per module.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

rdos wrote:

I think I fail to understand why the TLS function would be related to modules / DLLs, and why it would need thread lists. In my implementation, TLS is handled by the executable loader (PE loader) in kernel. On thread creation it will allocate memory for a bitmap of pointers, and will set FS to reference the TLS storage in user space. On thread termination, the bitmap will be deallocated. The executable loader also handles TLS sections in the image by allocating indexes on load and deallocating them on unload. TLS indexes are per application, not per module.

Then how does PE handle TLS in DLLs? In ELF, each module has its own TLS section. The only runtime relocation that happens is the function that provides the thread-local base address of the current module's TLS segment. When a module is loaded, it is assigned a TLS ID dynamically (there are relocations to handle that, too). That way, the code referencing the TLS stays exactly the same, and different processes can share the same text segment.

The thread list is needed so that dlopen() can allocate the necessary memory for the new TLS for all threads, as well as a new DTV vector (those are the thread-local vectors containing all the TLS base pointers for each module) and then actually install all those DTVs in the threads. On thread creation, you need to conversely allocate and initialize the TLS for the new thread. That means allocating the memory, assigning the DTV base pointers, and then initializing the first part of them all with the TLS image from the module, so you actually need to iterate over your list of TLS bearing modules and copy the initialization data over.

With just one TLS block, how are the offsets fixed up in the DLLs? I mean, the DLL could be loaded into different processes with different sets of DLLs loaded, so the layout of the TLS block would be different each time.

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

nullplan wrote:

With just one TLS block, how are the offsets fixed up in the DLLs? I mean, the DLL could be loaded into different processes with different sets of DLLs loaded, so the layout of the TLS block would be different each time.

It sounds like there is only support for the initial exec model to me.

Anyway, I managed to cook up a version that has more or less a totally lazy TLS. This only works when the local exec and initial exec models are accessed through a global function (__aeabi_read_tp on ARM). If a HW register is used for tp, then the initial exec TLS area must be initialized at every thread start even if TLS is never accessed which is something I want to avoid. Those who use TLS should be punished, not the ones who don't.

Dynamic TLS is also completely lazy. The complexity to get there is rather large and I had to step outside the TLS model, but the TLS model sucks anyway. The newer TLSDESC one seems much better at handling new module loads/unloads as well as handling concurrency. I figured out a way to avoid going through all the DTVs for all threads during a module unload; instead, a thread's entry is only updated when it accesses a TLS variable or at thread destruction. It's more like I'm emulating the TLSDESC model, but the descriptors are in a special dynamic DTV.

The TlsGetValue/TlsSetValue interface also uses the dynamic DTV, sharing it with dynamic TLS modules. The infrastructure is already there and I can use it. This interface is really legacy, so I didn't want to implement completely new functionality just to support it. Quite frankly, this interface might be better than the built-in compiler-generated TLS, as then those threads can deal with the TLS manually without pestering other threads.

BTW, the TLS model is falsely claiming that you only need a DTV with only pointers to the TLS areas. With dynamic TLS entries you need an extra entry for the allocated pointer, because there might be extra alignment that comes from the ELF file.

I give the (old) TLS model a D-, they clearly didn't think this through.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

OSwhatever wrote:

If a HW register is used for tp, then the initial exec TLS area must be initialized at every thread start even if TLS is never accessed which is something I want to avoid.

Well, tough. On i386, they use GS, on x86_64 they use FS (in both cases set by system call), and on armv7 and newer, they use something in coproc 15 somewhere. That one seems to be an endless bag of tricks. __aeabi_read_tp is only used by applications wanting binary compatibility with pre-armv7 platforms.

OSwhatever wrote:

BTW, the TLS model is falsely claiming that you only need a DTV with only pointers to the TLS areas. With dynamic TLS entries you need an extra entry for the allocated pointer, because there might be extra alignment that comes from the ELF file.

You... you do only need the aligned pointer. If you want to be able to free each area individually, you need to allocate them with posix_memalign() or something, but you only need the aligned pointer there.
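To make that concrete, here is a small sketch using posix_memalign(), whose result can be handed straight to free(), so no separate "raw" pointer has to be stored. `tls_block_alloc` and its parameters are hypothetical; `align` stands in for the alignment requested by the ELF file and is assumed to be zero or a power of two.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate one thread's TLS block for one module: copy the initialization
 * image and zero the remainder. Returns NULL on failure. */
void *tls_block_alloc(const void *image, size_t image_size,
                      size_t size, size_t align)
{
    void *block;

    assert(image_size <= size);
    if (align < sizeof(void *))
        align = sizeof(void *);   /* posix_memalign() requires at least this */
    if (posix_memalign(&block, align, size) != 0)
        return NULL;
    assert((uintptr_t)block % align == 0);

    if (image_size)
        memcpy(block, image, image_size);
    memset((char *)block + image_size, 0, size - image_size);
    return block;   /* the aligned pointer is also the pointer to free() */
}
```

An allocator that cannot return aligned memory would indeed need a second pointer per entry, but that is a property of the allocator rather than of the DTV.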

OSwhatever · **Joined:** Mon Jul 05, 2010 4:15 pm **Posts:** 595

nullplan wrote:

Well, tough. On i386, they use GS, on x86_64 they use FS (in both cases set by system call), and on armv7 and newer, they use something in coproc 15 somewhere. That one seems to be an endless bag of tricks. __aeabi_read_tp is only used by applications wanting binary compatibility with pre-armv7 platforms.

This is a problem, and I think the compilers should always offer a function call alternative to the hardware register methods for obvious reasons. The problem with the ARM cp15 register is that the compiler is set to use a register that can only be set in supervisor mode, and I need the one that can be set in user mode. There are three thread ID registers in ARM.

https://developer.arm.com/documentation ... -registers

so you have to go in somewhere in the compiler in order to change this which is inconvenient.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3192

nullplan wrote:

rdos wrote:

I think I fail to understand why the TLS function would be related to modules / DLLs, and why it would need thread lists. In my implementation, TLS is handled by the executable loader (PE loader) in kernel. On thread creation it will allocate memory for a bitmap of pointers, and will set FS to reference the TLS storage in user space. On thread termination, the bitmap will be deallocated. The executable loader also handles TLS sections in the image by allocating indexes on load and deallocating them on unload. TLS indexes are per application, not per module.

Then how does PE handle TLS in DLLs? In ELF, each module has its own TLS section. The only runtime relocation that happens is the function that provides the thread-local base address of the current module's TLS segment. When a module is loaded, it is assigned a TLS ID dynamically (there are relocations to handle that, too). That way, the code referencing the TLS stays exactly the same, and different processes can share the same text segment.

PE uses the same principle. There is a TLS section in programs and DLLs.

nullplan wrote:

The thread list is needed so that dlopen() can allocate the necessary memory for the new TLS for all threads, as well as a new DTV vector (those are the thread-local vectors containing all the TLS base pointers for each module) and then actually install all those DTVs in the threads. On thread creation, you need to conversely allocate and initialize the TLS for the new thread. That means allocating the memory, assigning the DTV base pointers, and then initializing the first part of them all with the TLS image from the module, so you actually need to iterate over your list of TLS bearing modules and copy the initialization data over.

With just one TLS block, how are the offsets fixed up in the DLLs? I mean, the DLL could be loaded into different processes with different sets of DLLs loaded, so the layout of the TLS block would be different each time.

First, I don't support sharing DLLs between processes, so having different TLS blocks is a non-issue.

The FS register points to a memory block that contains a bitmap (which TLS indexes are allocated), and the TLS values for the current thread. When a new TLS entry is allocated (directly, or through a DLL load), a new entry in the bitmap is allocated. When the TLS is freed (directly, or through DLL unload), the bitmap position is set to available. The set & get TLS functions are implemented by reading or writing the TLS values for the current thread.

What I don't support is that all TLS values are set to zero in all threads when a new TLS entry is allocated. For the OpenWatcom environment, I don't need to support this. Not sure how Windows handles this since the TLS function is borrowed from there.
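A rough model of the bitmap scheme described above, purely as an illustration: `struct tls_area` and the function names are invented, the 32-slot limit is arbitrary, and since the post does not say how the bitmap is shared between threads, the sketch only models the block of a single thread. As in the description, allocating an index does not zero the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_SLOTS 32   /* hypothetical limit: one bit per index */

/* The block that the thread's FS base would point at. */
struct tls_area {
    uint32_t allocated;        /* bit i set => index i is in use */
    void *value[TLS_SLOTS];    /* TLS values for the current thread */
};

/* Claim the lowest free index, or return -1 if none is left. */
int tls_alloc(struct tls_area *a)
{
    for (int i = 0; i < TLS_SLOTS; i++) {
        if (!(a->allocated & (UINT32_C(1) << i))) {
            a->allocated |= UINT32_C(1) << i;
            return i;
        }
    }
    return -1;
}

/* Mark an index as available again; the stored value is left as it is. */
void tls_free(struct tls_area *a, int i)
{
    assert(i >= 0 && i < TLS_SLOTS);
    a->allocated &= ~(UINT32_C(1) << i);
}

void tls_set_value(struct tls_area *a, int i, void *v)
{
    assert(i >= 0 && i < TLS_SLOTS);
    a->value[i] = v;
}

void *tls_get_value(const struct tls_area *a, int i)
{
    assert(i >= 0 && i < TLS_SLOTS);
    return a->value[i];
}
```

A freed index is handed out again by the next tls_alloc() call, and whatever value the previous user stored there is still visible until someone overwrites it.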
