OSDev.org

The Place to Start for Operating System Developers





 Post subject: Memset implementation optimizing ... to itself
PostPosted: Sun Apr 15, 2018 2:36 pm 

Joined: Sun Aug 25, 2013 5:47 pm
Posts: 86
Location: Nebraska, USA
I was playing with my toy OS code, and I realized that, while I had been compiling it with -O2, I should try -O3 and see what would happen. It triple-faulted, so I went on a quest to find the culprit. I narrowed the optimization causing the fault down to -ftree-loop-distribute-patterns, which:
Quote:
Perform loop distribution of patterns that can be code generated with calls to a library.


So I compiled my code with -O2, then again with -O2 and -ftree-loop-distribute-patterns, and disassembled each build. Three new calls were being emitted, two of which were in my page allocator. Both replaced loops with calls to memset, and if the page allocator was built without that optimization, no triple fault happened. But I looked at the disassembly and the C code of the page allocator, and nothing seemed wrong.

But, I realized there was one final call to check for, and it led me to this monstrosity, which I annotated:

Code:
00100070 <memset>:
  100070:   53                      push   %ebx ;save ebx
  100071:   83 ec 08                sub    $0x8,%esp ; make stack space
  100074:   8b 44 24 18             mov    0x18(%esp),%eax ;size
  100078:   8b 5c 24 10             mov    0x10(%esp),%ebx ;ptr
  10007c:   85 c0                   test   %eax,%eax ;is size zero
  10007e:   74 13                   je     100093 <memset+0x23> ; end
  100080:   83 ec 04                sub    $0x4,%esp ; more stack  space?
  100083:   50                      push   %eax ;push size
  100084:   0f b6 44 24 1c          movzbl 0x1c(%esp),%eax ;get value
  100089:   50                      push   %eax ; push value
  10008a:   53                      push   %ebx ; push ptr
  10008b:   e8 e0 ff ff ff          call   100070 <memset> ; call memset again?  WHAT?!
  100090:   83 c4 10                add    $0x10,%esp ; stack cleanup
  100093:   83 c4 08                add    $0x8,%esp ; more cleanup
  100096:   89 d8                   mov    %ebx,%eax ; put the pointer into the return value
  100098:   5b                      pop    %ebx ; restore ebx
  100099:   c3                      ret    ; return
  10009a:   8d b6 00 00 00 00       lea    0x0(%esi),%esi


For clarity, my memset in C is this:

Code:
void*    memset(void* ptr, int value, size_t num) {
   unsigned char* ptr_byte = (unsigned char*)ptr;

   for (size_t i = 0; i < num; ptr_byte[i] = (unsigned char)value, i++);
   return ptr;
}


The extra optimization was optimizing away the loop inside of memset ... into a memset call.

I am going to have to rewrite it in assembly if I want the -O3 option to work.

God fucking dammit :roll:

_________________
"Procrastination is the art of keeping up with yesterday."


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Sun Apr 15, 2018 3:14 pm 

Joined: Fri Aug 19, 2016 10:28 pm
Posts: 360
You could also try applying something like "__attribute__((optimize("-fno-tree-loop-distribute-patterns")))" to the function.
Or you could try compiling the library with "-ffreestanding"; hopefully that stops the compiler from assuming the standard meaning of library functions and substituting loops with calls to them.
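For illustration, applied to the memset from the first post, that attribute might look like this (just a sketch, untested):

Code:
#include <stddef.h>

/* Sketch: disable the loop-pattern distribution pass for this one function,
   so the rest of the kernel can still be built with -O3. */
__attribute__((optimize("-fno-tree-loop-distribute-patterns")))
void* memset(void* ptr, int value, size_t num) {
   unsigned char* ptr_byte = (unsigned char*)ptr;

   for (size_t i = 0; i < num; i++)
      ptr_byte[i] = (unsigned char)value;
   return ptr;
}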

Edit: Interesting caveat though.


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Sun Apr 15, 2018 3:29 pm 

Joined: Sun Aug 25, 2013 5:47 pm
Posts: 86
Location: Nebraska, USA
For -ffreestanding, no dice. The attribute setting is probably what I will have to settle for. Thanks for the suggestion; I always forget that GCC has those attribute features.

_________________
"Procrastination is the art of keeping up with yesterday."


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Sun Apr 15, 2018 3:49 pm 

Joined: Sun Aug 25, 2013 5:47 pm
Posts: 86
Location: Nebraska, USA
Set the new attributes, and the triple fault went away. I will say my memset looks a little .... big :lol:

Code:
00100070 <memset>:
  100070:   55                      push   %ebp
  100071:   57                      push   %edi
  100072:   56                      push   %esi
  100073:   53                      push   %ebx
  100074:   83 ec 10                sub    $0x10,%esp
  100077:   8b 6c 24 2c             mov    0x2c(%esp),%ebp
  10007b:   8b 44 24 24             mov    0x24(%esp),%eax
  10007f:   8b 5c 24 28             mov    0x28(%esp),%ebx
  100083:   85 ed                   test   %ebp,%ebp
  100085:   0f 84 eb 00 00 00       je     100176 <memset+0x106>
  10008b:   89 c2                   mov    %eax,%edx
  10008d:   8d 75 ff                lea    -0x1(%ebp),%esi
  100090:   bf 05 00 00 00          mov    $0x5,%edi
  100095:   f7 da                   neg    %edx
  100097:   88 5c 24 07             mov    %bl,0x7(%esp)
  10009b:   83 e2 03                and    $0x3,%edx
  10009e:   8d 4a 03                lea    0x3(%edx),%ecx
  1000a1:   83 f9 05                cmp    $0x5,%ecx
  1000a4:   0f 42 cf                cmovb  %edi,%ecx
  1000a7:   39 ce                   cmp    %ecx,%esi
  1000a9:   0f 82 d1 00 00 00       jb     100180 <memset+0x110>
  1000af:   85 d2                   test   %edx,%edx
  1000b1:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  1000b8:   74 27                   je     1000e1 <memset+0x71>
  1000ba:   83 fa 01                cmp    $0x1,%edx
  1000bd:   88 18                   mov    %bl,(%eax)
  1000bf:   c7 04 24 01 00 00 00    movl   $0x1,(%esp)
  1000c6:   74 19                   je     1000e1 <memset+0x71>
  1000c8:   83 fa 03                cmp    $0x3,%edx
  1000cb:   88 58 01                mov    %bl,0x1(%eax)
  1000ce:   c7 04 24 02 00 00 00    movl   $0x2,(%esp)
  1000d5:   75 0a                   jne    1000e1 <memset+0x71>
  1000d7:   88 58 02                mov    %bl,0x2(%eax)
  1000da:   c7 04 24 03 00 00 00    movl   $0x3,(%esp)
  1000e1:   89 ee                   mov    %ebp,%esi
  1000e3:   31 c9                   xor    %ecx,%ecx
  1000e5:   8a 4c 24 07             mov    0x7(%esp),%cl
  1000e9:   29 d6                   sub    %edx,%esi
  1000eb:   0f b6 fb                movzbl %bl,%edi
  1000ee:   89 74 24 08             mov    %esi,0x8(%esp)
  1000f2:   c1 ee 02                shr    $0x2,%esi
  1000f5:   89 74 24 0c             mov    %esi,0xc(%esp)
  1000f9:   88 dd                   mov    %bl,%ch
  1000fb:   89 fe                   mov    %edi,%esi
  1000fd:   c1 e6 10                shl    $0x10,%esi
  100100:   0f b7 c9                movzwl %cx,%ecx
  100103:   c1 e7 18                shl    $0x18,%edi
  100106:   09 f1                   or     %esi,%ecx
  100108:   8b 74 24 0c             mov    0xc(%esp),%esi
  10010c:   09 cf                   or     %ecx,%edi
  10010e:   8d 0c 10                lea    (%eax,%edx,1),%ecx
  100111:   31 d2                   xor    %edx,%edx
  100113:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  100119:   8d bc 27 00 00 00 00    lea    0x0(%edi,%eiz,1),%edi
  100120:   89 3c 91                mov    %edi,(%ecx,%edx,4)
  100123:   83 c2 01                add    $0x1,%edx
  100126:   39 d6                   cmp    %edx,%esi
  100128:   77 f6                   ja     100120 <memset+0xb0>
  10012a:   8b 7c 24 08             mov    0x8(%esp),%edi
  10012e:   8b 14 24                mov    (%esp),%edx
  100131:   89 f9                   mov    %edi,%ecx
  100133:   83 e1 fc                and    $0xfffffffc,%ecx
  100136:   01 ca                   add    %ecx,%edx
  100138:   39 cf                   cmp    %ecx,%edi
  10013a:   74 3a                   je     100176 <memset+0x106>
  10013c:   8d 4a 01                lea    0x1(%edx),%ecx
  10013f:   88 1c 10                mov    %bl,(%eax,%edx,1)
  100142:   39 cd                   cmp    %ecx,%ebp
  100144:   76 30                   jbe    100176 <memset+0x106>
  100146:   8d 4a 02                lea    0x2(%edx),%ecx
  100149:   88 5c 10 01             mov    %bl,0x1(%eax,%edx,1)
  10014d:   39 cd                   cmp    %ecx,%ebp
  10014f:   76 25                   jbe    100176 <memset+0x106>
  100151:   8d 4a 03                lea    0x3(%edx),%ecx
  100154:   88 5c 10 02             mov    %bl,0x2(%eax,%edx,1)
  100158:   39 cd                   cmp    %ecx,%ebp
  10015a:   76 1a                   jbe    100176 <memset+0x106>
  10015c:   8d 4a 04                lea    0x4(%edx),%ecx
  10015f:   88 5c 10 03             mov    %bl,0x3(%eax,%edx,1)
  100163:   39 cd                   cmp    %ecx,%ebp
  100165:   76 0f                   jbe    100176 <memset+0x106>
  100167:   8d 4a 05                lea    0x5(%edx),%ecx
  10016a:   88 5c 10 04             mov    %bl,0x4(%eax,%edx,1)
  10016e:   39 cd                   cmp    %ecx,%ebp
  100170:   76 04                   jbe    100176 <memset+0x106>
  100172:   88 5c 10 05             mov    %bl,0x5(%eax,%edx,1)
  100176:   83 c4 10                add    $0x10,%esp
  100179:   5b                      pop    %ebx
  10017a:   5e                      pop    %esi
  10017b:   5f                      pop    %edi
  10017c:   5d                      pop    %ebp
  10017d:   c3                      ret


Edit: Holy crap! Before this, I had been debugging triple faulting in my scheduler. That went away after this too! Phew!

_________________
"Procrastination is the art of keeping up with yesterday."


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Tue Apr 24, 2018 8:38 am 

Joined: Thu Mar 22, 2018 11:02 am
Posts: 2
I believe it's better to implement memset in assembly, so the compiler cannot interfere with it.

NOTE: The following assembly code uses Intel syntax.

Code:
memset:
        push ebp
        mov ebp,esp
        push edi                      ; edi is callee-saved in cdecl
        mov edi,[ebp+8]               ; dest
        movzx eax,byte [ebp+12]       ; fill byte
        mov ecx,[ebp+16]              ; count
        rep stosb
        mov eax,[ebp+8]               ; memset returns dest
        pop edi
        pop ebp
        ret


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Thu Apr 26, 2018 1:37 am 

Joined: Wed Aug 30, 2017 8:24 am
Posts: 1593
Travor... that's a little basic, wouldn't you say? For memset, you could get a lot of mileage out of first aligning the destination pointer to a 4-byte boundary, then REP STOSD, then REP STOSB for the remaining bytes.

And you could even implement that in C:
Code:
#include <stddef.h>
#include <stdint.h>

#define LONG_MASK (sizeof (unsigned long) - 1)
void *memset(void* _dst, int val, size_t len)
{
    unsigned char *dst = _dst;
    unsigned long *ldst;
    unsigned long lval = (val & 0xFF) * (-1ul / 255); //the multiplier becomes 0x0101... of the same length as long

    if (len >= 16) //optimize only if it's worth it (limit is a guess)
    {
        while ((uintptr_t)dst & LONG_MASK)
        {
            *dst++ = val;
            len--;
        }
        ldst = (void*)dst;
        while (len > sizeof (long))
        {
             *ldst++ = lval;
             len -= sizeof (long);
        }
        dst = (void*)ldst;
    }
    while (len--)
        *dst++ = val;
    return _dst; /* standard memset returns the destination */
}
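In assembly, the same align-then-wide-stores idea might look roughly like this (32-bit cdecl, Intel syntax; an untested sketch):

Code:
; Untested sketch: byte stores until the destination is 4-byte aligned,
; then REP STOSD for the bulk, then REP STOSB for the 0..3 byte tail.
memset:
        push ebp
        mov ebp,esp
        push edi                      ; edi is callee-saved in cdecl
        mov edi,[ebp+8]               ; dest
        movzx eax,byte [ebp+12]       ; fill byte
        mov ecx,[ebp+16]              ; count
        imul eax,eax,0x01010101       ; replicate the byte into all four lanes
.align_head:
        test edi,3                    ; destination 4-byte aligned yet?
        jz .bulk
        test ecx,ecx                  ; anything left to store?
        jz .done
        stosb                         ; store one byte (al), advance edi
        dec ecx
        jmp .align_head
.bulk:
        mov edx,ecx
        shr ecx,2                     ; dword count
        rep stosd
        mov ecx,edx
        and ecx,3                     ; remaining bytes
        rep stosb
.done:
        mov eax,[ebp+8]               ; memset returns dest
        pop edi
        pop ebp
        ret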


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Thu Apr 26, 2018 3:04 am 

Joined: Fri Aug 19, 2016 10:28 pm
Posts: 360
nullplan wrote:
For memset, you could get a lot of mileage out of first aligning the destination pointer to a 4-byte boundary, then REP STOSD, then REP STOSB for the remaining bytes.
I had the same feelings about TravorLZH's response originally, but it turns out that I was wrong. You can read this SO post, which gives rather detailed benchmarks of a feature of modern CPUs that optimizes rep stosb and rep movsb in microcode, doing what a classical memset does in software, but faster.

There are a few problems with this - it is not marked by CPUID as far as I can tell, so you have to either keep a hardware database or use on-site testing. Furthermore, if you scroll down in the same SO post, there is a quoted comment by an insider indicating that such microcode optimizations may be of inconsistent quality across different models. Thus, benchmarks are the only way to be sure. And finally, if you can only decide which code variant is optimal at runtime (or while deploying), then you cannot handle the symbol bindings with the classical mechanisms; you will need something better than ELF or PE load-time symbol resolution. Ultimately, conservatively speaking, the classic align-then-loop approach that you propose is probably going to be the most balanced one, if the old models are given consideration and a dynamic choice of the function code is not possible.

P.S.: This is not only true for this CPU optimization. A lot of CPU optimizations exist only under the hood and are not something that can be identified programmatically. For example, access misalignment (edit: when cache line straddling is not involved) is not an issue on the more modern CPUs, and thus structure padding to achieve alignment will in fact impair cache performance (due to the increased structure size). Another example - newer CPUs handle cache capacity overflow in long sequential runs better than older ones, by allowing the cache to keep the initial portion of a transfer rather than evicting the old entries when new data comes in, FIFO style.


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Sat May 05, 2018 11:52 am 

Joined: Sun Aug 25, 2013 5:47 pm
Posts: 86
Location: Nebraska, USA
From the looks of my -O3 memset disassembly, it's heavily optimized anyway. It's usually best to let the compiler do its thing.

_________________
"Procrastination is the art of keeping up with yesterday."


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Tue May 22, 2018 4:09 pm 

Joined: Sat Sep 24, 2016 12:06 am
Posts: 90
Hi, the fastest way to do memcpy/memset is to use SSE/MMX registers.
Here is my implementation of memcpy; use it as an example.
Code:
void memcpy(unsigned char * s, unsigned char * d, size_t count) {
   __asm__("pusha\n\
         mov %2,%%ecx\n\
         mov %0,%%esi\n\
         mov %1,%%edi\n\
         test %%ecx,%%ecx\n\
         jz zr\n\
         memcpySSE:\
            movups (%%esi),%%xmm0   \n\
            movups %%xmm0,(%%edi)   \n\
            add $16, %%esi         \n\
            add $16, %%edi         \n\
            dec %%ecx            \n\
            test %%ecx,%%ecx              \n\
            jnz memcpySSE                 \n\
         zr:                     \n\
         mov %3, %%ecx            \n\
         test %%ecx,%%ecx                 \n\
         jz send                  \n\
         memcpySSE_1:                    \
            movsb               \n\
            dec %%ecx            \n\
            jnz memcpySSE_1         \n\
         send:                  \
         popa"
            ::"r"(s),"r"(d),"r"(count>>4),"r"(count%16));
}


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Wed May 23, 2018 12:00 pm 

Joined: Mon Mar 25, 2013 7:01 pm
Posts: 5099
MrLolthe1st wrote:
Hi, the fastest way memcpy/memset - use SSE/MMX regs.

In a kernel, you may not be able to use SSE or MMX. The speed doesn't matter too much until you're able to benchmark it anyway.

MrLolthe1st wrote:
Threre is my implemetation of memcpy, use that as example.

Your implementation is a bad example. The source and destination are in the wrong order, the return value is the wrong type, and your inline assembly is not using constraints or clobbers correctly.

A better example for use in a kernel would be something like this:
Code:
#include <stddef.h>
#include <stdint.h>

void * memcpy( void * d, const void * s, size_t c )
{
    void * temp = d;
    __asm__ volatile (
        "rep movsb"
        :"=D"(d),"=S"(s),"=c"(c)
        :"0"(d),"1"(s),"2"(c)
        :"memory"
    );
    return temp;
}

void * memmove( void * d, const void * s, size_t c )
{
    void * temp = d;
    if( (uintptr_t)d <= (uintptr_t)s || (uintptr_t)d >= (uintptr_t)s + c )
    {
        __asm__ volatile (
            "rep movsb"
            :"=D"(d),"=S"(s),"=c"(c)
            :"0"(d),"1"(s),"2"(c)
            :"memory"
        );
    }
    else
    {
        __asm__ volatile (
            "std\n\t"
            "rep movsb\n\t"
            "cld"
            :"=D"(d),"=S"(s),"=c"(c)
            :"0"((uintptr_t)d + c - 1),"1"((uintptr_t)s + c - 1),"2"(c)
            :"memory"
        );
    }
    return temp;
}

void * memset( void * d, int s, size_t c )
{
    void * temp = d;
    __asm__ volatile (
        "rep stosb"
        :"=D"(d),"=c"(c)
        :"0"(d),"a"(s),"1"(c)
        :"memory"
    );
    return temp;
}

int memcmp( const void * s1, const void * s2, size_t c )
{
    if( !c ) return 0;
    __asm__ volatile (
        "repe cmpsb"
        :"=D"(s1),"=S"(s2),"=c"(c)
        :"0"(s1),"1"(s2),"2"(c)
        :"memory","cc"
    );
    return ((const unsigned char *)s1)[-1] - ((const unsigned char *)s2)[-1];
}

(I haven't actually tested this code, so it may not work correctly. Use it at your own risk. Edit: case in point, I've edited to fix some mistakes.)

At least memcpy and memset should perform pretty well on CPUs that report ERMS through CPUID. I wouldn't be surprised if they're faster than what GCC generates.
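For reference, a check for ERMS at boot could look something like this (another untested sketch; it assumes GCC's cpuid.h, and ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9):

Code:
#include <cpuid.h>
#include <stdbool.h>

/* Sketch: returns true if the CPU advertises Enhanced REP MOVSB/STOSB. */
static bool cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;           /* leaf 7 not supported */
    return (ebx >> 9) & 1;      /* CPUID.(EAX=7,ECX=0):EBX bit 9 = ERMS */
}

The kernel could then pick between a rep-based and a loop/SSE-based implementation once at boot.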


Last edited by Octocontrabass on Thu May 24, 2018 1:00 am, edited 3 times in total.

 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Wed May 23, 2018 12:47 pm 

Joined: Fri Aug 19, 2016 10:28 pm
Posts: 360
Octocontrabass wrote:
At least memcpy and memset should perform pretty well on CPUs that report ERMS through CPUID.
So, it does report ERMS. I was looking at stale information. That is pretty good. If you could dynamically choose the code to use (say, at boot or load time), SSE or trivial, that would be best. I will reiterate that according to some insider source, just because the ERMS optimization exists in a range of models, it doesn't mean that it is equally efficiently implemented in microcode in all of them. But I doubt that trusting it will be much worse than manually aligned SSE transfers.


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Thu May 24, 2018 9:17 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

simeonz wrote:
Octocontrabass wrote:
At least memcpy and memset should perform pretty well on CPUs that report ERMS through CPUID.
So, it does report ERMS. I was looking at stale information. That is pretty good. If you could dynamically choose the code to use (say, at boot or load time), SSE or trivial, that would be best. I will reiterate that according to some insider source, just because the ERMS optimization exists in a range of models, it doesn't mean that it is equally efficiently implemented in microcode in all of them. But I doubt that trusting it will be much worse than manually aligned SSE transfers.


In general you can split it into four things:
  • startup overhead - e.g. things like checking for alignment and doing any initial bytes to force alignment
  • the middle - e.g. the part where you know everything is nicely aligned and you can do large pieces at once
  • ending overhead - e.g. doing any final bytes that weren't a whole large piece
  • cache management - e.g. any prefetching/preloading and flushing data after use to avoid cache pollution

For tiny copies, the startup overhead and ending overhead dominate, the middle may not exist and cache management isn't worth bothering with.

For huge copies, the startup overhead and ending overhead are insignificant, but the middle and cache management are very important.

In other words, it's impossible to implement a "memcpy()" that is ideal for both tiny and huge copies.

However, bad code (that fails to avoid copying large amounts of data) is going to be slow regardless of how "memcpy()" is implemented because it's almost impossible to avoid the "RAM bandwidth" bottleneck for almost all cases (excluding extremely rare cases where, in theory, you could use "copy on write" paging tricks); and for almost all good code the amount of data copied by each individual "memcpy()" call is typically less than 1 KiB (and often the mean size is around 16 bytes). This implies that a good "memcpy()" should be optimised for tiny/small copies.

It also means that for some special cases "memcpy()" shouldn't be used at all. This includes code that is using SSE/AVX anyway (where the data is guaranteed to be aligned nicely, etc); and code that copies an entire buffer of pixels to the video card's frame buffer, where you should probably have a function pointer to one of many implementations (where the function that's selected when the video mode is set depends on things like if there's padding between rows of pixels in the video card's frame buffer, if each row of pixels is nicely aligned, etc) and where (especially for discrete video cards) you should try to avoid copying data that didn't change (dirty rectangles, "line changed flags", ...); and things like "fork()" where you should use paging tricks and copy nothing.

Note that most of the above applies to "memset()" too. For "memset()" the special cases (where "memset()" shouldn't be used) are things like filling an executable's ".bss" area with zeros (where you should use "allocate on write" paging tricks), filling an individual page with zeros (e.g. in the page fault handler when an "allocate on write" page is written to), etc. Also; for filling (larger) areas with zeros it's worth mentioning AMD's new "CLZERO" instruction, which is designed to fill an entire cache line (64 bytes) with zeros in a way that avoids the cost of fetching the cache line's previous contents (e.g. for filling a 4096-byte page with zeros in the page fault handler, I'd strongly consider using a linear sequence of 64 CLZERO instructions with no loop at all, and I'd consider having multiple different implementations of the page fault handler where during boot the kernel sets the IDT entry to point to the best page fault handler depending on various things, including if CLZERO is supported or not and including if cache line size is 64 bytes or something else).
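As a rough illustration of the CLZERO part (an untested sketch that assumes CLZERO support and 64-byte cache lines, and uses a loop for brevity where a fully unrolled sequence was suggested):

Code:
/* Sketch: zero one 4096-byte page with CLZERO. The instruction takes the
   address implicitly in rAX and zeroes the whole cache line without
   fetching its previous contents. */
static void clzero_page(void *page)
{
    char *line = (char *)page;
    for (int i = 0; i < 4096 / 64; i++, line += 64)
        __asm__ volatile ("clzero" : : "a"(line) : "memory");
}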


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Fri May 25, 2018 4:09 am 

Joined: Fri Aug 19, 2016 10:28 pm
Posts: 360
Brendan wrote:
However, bad code (that fails to avoid copying large amounts of data) is going to be slow regardless of how "memcpy()" is implemented because it's almost impossible to avoid the "RAM bandwidth" bottleneck for almost all cases (excluding extremely rare cases where, in theory, you could use "copy on write" paging tricks); and for almost all good code the amount of data copied by each individual "memcpy()" call is typically less than 1 KiB (and often the mean size is around 16 bytes). This implies that a good "memcpy()" should be optimised for tiny/small copies.
To be honest, I am a little concerned about the memory bottleneck argument, because it is a very important argument. In theory, you could have memory transfers interspersed with data processing. And the overall speed will depend on the rate of retiring in the memory transfer portion as well as in the processing portion. And again in theory, the processing gives enough time for the cache to lazily flush the data to the next level and ultimately to memory. But the Intel cache to my knowledge is write-back only and will flush data on eviction, unless explicitly instructed to do so immediately. And the eviction will be triggered in the transfer portion of the code, not the processing portion. Some AMD CPUs apparently use a write-through cache strategy according to this SO answer and rely on the write combining buffers to avoid redundant transfers. If so, it means that they can exploit CPU-to-memory parallelism better and the speed of retiring instructions in the transfer portion becomes relevant for them. Another argument could be made that multi-channel DDR3 RAM offers a lot of performance, but considering that you can retire multiple instructions per CPU cycle and those are movs and branches, and they are already kept decoded in the instruction cache, I don't think this is a valid argument. However, there is one insight in the SO benchmarks (from a link in my prior post) that I think is relevant, which is that rep movs loops can avoid read-for-write (i.e. RFO) in cached memory regions, because they can foretell the transfer size. This is a huge benefit, because if implemented in microcode, it will halve the memory bandwidth used. Note that the benchmarks there show sensitivity to the approach used. Unless the tests were broken, there is some performance to be gained in the extreme case.

Brendan wrote:
It also means that for some special cases "memcpy()" shouldn't be used at all. This includes code that is using SSE/AVX anyway (where the data is guaranteed to be aligned nicely, etc); and code that copies an entire buffer of pixels to the video card's frame buffer, where you should probably have a function pointer to one of many implementations (where the function that's selected when the video mode is set depends on things like if there's padding between rows of pixels in the video card's frame buffer, if each row of pixels is nicely aligned, etc) and where (especially for discrete video cards) you should try to avoid copying data that didn't change (dirty rectangles, "line changed flags", ...); and things like "fork()" where you should use paging tricks and copy nothing.
Even setting aside the cases in which the software has to specialize, the runtime would still need to have many variants to be suitable for most situations. Too many to be used comfortably in practice - to cater for alignment, hot and cold memory, VM pseudo-initialization, etc. Source level libraries can actually do better, since the interface can be simplified with generics (preferably) or macros, and by relying on decent LTO. But source level libraries have no hardware information at their disposal. Not with AOT compilation. Thus I agree that memcpy is obviously optimizing for the least productive case. It is also true that things like strings and structures, which are less than 32 bytes long, are sometimes transferred in memory with memcpy. ERMS makes things worse in this regard, because rep movs becomes unsuitable for those transfers as well.

There are however justified uses of memcpy in my opinion. For example, applications that do not use zero-copy I/O receive a duplicate of a memory range from the system cache into a user-space buffer. This transfer by kernel code is commonly performed without alignment assumptions, and without coldness hints, etc. It may not be done literally with memcpy (although theoretically it could be), but the code will have similar function. The transfers may be small, or bigger, or variably-sized, depending on the data being copied. Another justifiable use is in library code. The reasons are similar - it is hard for library code to assume the alignment or size of client buffers, so it resorts to the least common denominator. Applications may use libraries for most of their memory-intensive work. This is mostly in user-space, although it could happen in kernel space if kernel space libraries are used. And again - the function may not be called memcpy, but it is still some generic copy routine.

A more questionable usage results from compiler optimizations that turn loops into memcpy calls (unless prohibited). There is a bright side to this phenomenon - the compiler will optimize some of the short fixed-size memcpy transfers (like copying small structures) into a series of movs. So, especially with working LTO, a portion of the least productive memcpy calls will be optimized away.

I think there are reasons to make memcpy reasonably efficient. A programmer usually assumes that it is. Whether you optimize this routine to the bleeding edge is a trade-off of sorts. I cannot say decisively whether to prioritize small or big transfers. Both could apply in practice.

Regarding the dynamically changing code bindings I hinted at earlier - I was not thinking solely of memcpy here. Libraries and/or kernel modules can use architecture dependent optimizations for many things, like lock-free concurrency, hashing, vector processing, etc. Either that, or you need a level of indirection after the routine's entry point, or very precise package management with detailed hardware versioning, or dynamic code generation. Such binding could be implemented by extending the executable format with a per-export callback, which returns the object/function variant the loader should use. It is just an idea and is not something I think is crucial. Not to mention that there are details, like running the callback in pre-runtime execution environment and with multi-threading, that need to be addressed.
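As an untested sketch of that kind of indirection (hypothetical names), the crude version is just a function pointer selected once at boot:

Code:
#include <stddef.h>

/* Sketch: two implementations, one pointer, chosen after CPU detection. */
void *memset_generic(void *d, int v, size_t n);
void *memset_erms(void *d, int v, size_t n);

void *(*memset_impl)(void *, int, size_t) = memset_generic;

void select_memset(int have_erms)
{
    memset_impl = have_erms ? memset_erms : memset_generic;
}

That costs an indirect call on every use, which is exactly the extra level of indirection mentioned above.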

Brendan wrote:
Note that most of the above applies to "memset()" too. For "memset()" the special cases (where "memset()" shouldn't be used) are things like filling an executable's ".bss" area with zeros (where you should use "allocate on write" paging tricks), filling an individual page with zeros (e.g. in the page fault handler when an "allocate on write" page is written to), etc. Also; for filling (larger) areas with zeros it's worth mentioning AMD's new "CLZERO" instruction, which is designed to fill an entire cache line (64 bytes) with zeros in a way that avoids the cost of fetching the cache line's previous contents (e.g. for filling a 4096-byte page with zeros in the page fault handler, I'd strongly consider using a linear sequence of 64 CLZERO instructions with no loop at all, and I'd consider having multiple different implementations of the page fault handler where during boot the kernel sets the IDT entry to point to the best page fault handler depending on various things, including if CLZERO is supported or not and including if cache line size is 64 bytes or something else).
Reading that paragraph, I am realizing now that Intel has no easy mechanism to discard cache lines or to write to cache without performing read-for-write. Or is there? I had the delusion that write combining store buffers might skip read-for-writes if the code is writing out entire cache lines (in cached regions), but now I am thinking that this will not happen. The CPU cannot know if the write sequence will not end before filling the cache line, thus causing the read to be performed anyway, potentially stalling the pipeline. So, without explicit instructions this is not generally possible.


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Fri May 25, 2018 11:43 am 

Joined: Fri Aug 19, 2016 10:28 pm
Posts: 360
A couple of additional links for those interested:
How to avoid cache fill on write miss?

This is a question asked on the Intel forums, and was answered essentially as no go, albeit with certain tips.

Copying Accelerated Video Decode Frame Buffers

This is an article from Intel, where they describe an interesting copy technique. They use non-temporal (aka streaming) loads to transfer 4K blocks from the source memory range into an ordinary cached address range, and non-temporal stores to transfer block-wise from the cached range into the destination. A load of a 4K block from the source is followed by a store of that block into the destination. They use the cache as an intermediate buffer, and since the lines are never evicted, no unnecessary fills are triggered by write misses. If you make the block very big, as big as the cache itself, then aside from evicting other data, the approach will obviously break with hyperthreading.
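A sketch of that block-copy idea with SSE intrinsics (hypothetical helper, untested; assumes SSE4.1, 16-byte-aligned pointers and a size that is a multiple of 4K):

Code:
#include <emmintrin.h>
#include <smmintrin.h>
#include <stddef.h>

/* Sketch: stream each 4K block from the source into a small cacheable
   bounce buffer, then stream it out to the destination. The buffer is
   reused every iteration, so write misses trigger no extra cache fills. */
static void stream_copy_blocks(void *dst, const void *src, size_t bytes)
{
    static unsigned char bounce[4096] __attribute__((aligned(64)));
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t block = 0; block < bytes / 4096; block++) {
        __m128i *b = (__m128i *)bounce;
        for (int i = 0; i < 4096 / 16; i++)         /* streaming loads */
            b[i] = _mm_stream_load_si128((__m128i *)&s[i]);
        for (int i = 0; i < 4096 / 16; i++)         /* streaming stores */
            _mm_stream_si128(&d[i], b[i]);
        s += 4096 / 16;
        d += 4096 / 16;
    }
    _mm_sfence();   /* make the non-temporal stores globally visible */
}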

Edit: After giving this some thought, the technique in itself is not very useful for general purpose transfers. It is however a good indicator of the kind of processing that could avoid unnecessary read for write requests to memory. Namely, processing which operates on blocks and reuses the same scratch memory block for many iterations will only trigger fills in it during the first iteration. This is not really something new or surprising to think of it.

And lastly - a note regarding AMD's write coalescing cache (or write coalescing buffer, which I called write combining buffer in my previous post). It is only present in the Bulldozer architecture. I am not sure what other functions it has, but it enables asynchronous write-through policy, because synchronous write through results in abysmal amount of stalls. Essentially, every write to L1D is also replicated to the write coalescing cache, which are both sitting at approximately the same distance from the core. While the L1D is used to serve reads to the core, the write coalescing cache is used to make transfers to L2 in parallel. Evictions from the L1 cache are free (do not trigger a store) since the cache lines are never dirty in this model, although the write coalescing cache is set associative and may stall due to set collisions. It is 4K in size, which is much bigger than the total amount of fill buffers (write combining buffers) in Intel, but processing in data chunks greater than 4K at a time will probably stall heavily. Since it was used only in Bulldozer and then abandoned, it was apparently not very successful and currently has no future impact.


 Post subject: Re: Memset implementation optimizing ... to itself
PostPosted: Sat May 26, 2018 2:42 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

simeonz wrote:
Brendan wrote:
However, bad code (that fails to avoid copying large amounts of data) is going to be slow regardless of how "memcpy()" is implemented because it's almost impossible to avoid the "RAM bandwidth" bottleneck for almost all cases (excluding extremely rare cases where, in theory, you could use "copy on write" paging tricks); and for almost all good code the amount of data copied by each individual "memcpy()" call is typically less than 1 KiB (and often the mean size is around 16 bytes). This implies that a good "memcpy()" should be optimised for tiny/small copies.
To be honest, I am a little concerned about the memory bottleneck argument, because it is a very important argument. In theory, you could have memory transfers interspersed with data processing. And the overall speed will depend on the rate of retiring in the memory transfer portion as well as in the processing portion.


You're on the right track - one of the best ways to get rid of a memory copy is to merge it with whatever code touched the source previously, or merge it with whatever code will touch the destination next. This either avoids additional reads, or avoids additional writes; and the majority of 80x86 CPUs are "out-of-order enough" to end up doing processing and memory transfers in parallel.

simeonz wrote:
And again in theory, the processing gives enough time for the cache to lazily flush the data to the next level and ultimately to memory. But the Intel cache to my knowledge is write-back only and will flush data on eviction, unless explicitly instructed to do so immediately.


Since the 80386, Intel's caches have been controlled by global flags in CR0 and "per page" flags in page table entries that can be used to either disable caches or disable write-back/enable write-through. Then they added MTRRs (in the Pentium) to allow caching to be controlled on a "per physical range" basis, then added support for write combining (in the Pentium 4, I think). Of course other 80x86 CPU manufacturers (e.g. AMD, VIA) copied everything and support all of it; but Cyrix had something like MTRRs before Intel did (with a different name and a different/incompatible access mechanism), and AMD added some extra stuff to MTRRs (mostly shifting some of the memory controller configuration from the chipset into MTRRs).

The result of all of this is that, for all 80x86 CPUs made in the last 10 years or more, there are multiple ways to configure an area as either "write-back" (the default for RAM), "write-through", "write combining" or "uncached" (the default for memory mapped devices).

There's also a few variations of prefetch instructions, non-temporal moves (which use "write combining" behaviour for individual writes) and CLFLUSH (which explicitly flushes a cache line).

For large memory copies; you mostly want to prefetch from the source (note: hardware prefetchers don't work across page boundaries) and use CLFLUSH to explicitly evict cache lines from both the source and the destination after you're finished with them (to avoid leaving the cache full of data that won't be used again "soon" and causing everything that is more likely to be used again to be evicted from cache). For huge memory copies; you might also want to trick the CPU into prefetching TLB entries (e.g. using a dummy load, like "cmp eax,[somewhere]", where nothing depends on instruction); but recent CPUs do have "TLB hardware prefetchers" making this unnecessary, and it shouldn't be done on CPUs that aren't "out-of-order" (e.g. first generation Atom).
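A very rough sketch of that kind of loop (untested; assumes 64-byte cache lines and pointers/sizes that are multiples of 64, and uses the NTA prefetch hint as one possible choice):

Code:
#include <immintrin.h>
#include <stddef.h>

/* Sketch: copy line by line, prefetch a few lines ahead in the source, and
   flush finished source/destination lines so the copy does not crowd more
   useful data out of the cache. */
static void copy_large_flushing(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    for (size_t off = 0; off < bytes; off += 64) {
        _mm_prefetch((const char *)s + off + 256, _MM_HINT_NTA);
        for (size_t i = 0; i < 64 / sizeof(unsigned long); i++)
            ((unsigned long *)(d + off))[i] = ((const unsigned long *)(s + off))[i];
        _mm_clflush(s + off);   /* done with this source line */
        _mm_clflush(d + off);   /* push the destination line out as well */
    }
}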

simeonz wrote:
And the eviction will be triggered in the transfer portion of the code, not the processing portion. Some AMD CPUs apparently use a write-through cache strategy according to this SO answer and rely on the write combining buffers to avoid redundant transfers.


That's only talking about (e.g.) the implementation of the L1 cache, not the overall behaviour (including L2 and L3). For those AMD CPUs, if a page uses "write-back caching" then the L1 cache will "write through" to L2 (but will not "write through" to L3 or RAM).

However this is mixing different concepts (the overall behaviour, and implementation details of one of several caches in the middle).

For better terminology; Intel CPUs typically use "exclusive caches" (where a cache line can not be in L1 and L2 at the same time) and AMD CPUs typically used "inclusive caches" (where a cache line can be in both L1 and L2 at the same time). For "inclusive caches", to maintain cache coherency, if you modify the data in a cache line you must also update or evict any/all copies in other caches.

simeonz wrote:
If so, it means that they can exploit CPU-to-memory parallelism better and the speed of retiring instructions in the transfer portion becomes relevant for them. Another argument could be made that multi-channel DDR3 RAM offers a lot of performance, but considering that you can retire multiple instructions per CPU cycle and those are movs and branches, and they are already kept decoded in the instruction cache, I don't think this is a valid argument.


An argument can be made that there's 4 or more CPUs (plus maybe an integrated GPU, plus bus mastering PCI devices for SATA and network and whatever else) all pounding the RAM at the same time; and that each individual CPU can only expect a fraction of the total memory bandwidth.

simeonz wrote:
Brendan wrote:
It also means that for some special cases "memcpy()" shouldn't be used at all. This includes code that is using SSE/AVX anyway (where the data is guaranteed to be aligned nicely, etc); and code that copies an entire buffer of pixels to the video card's frame buffer, where you should probably have a function pointer to one of many implementations (where the function that's selected when the video mode is set depends on things like if there's padding between rows of pixels in the video card's frame buffer, if each row of pixels is nicely aligned, etc) and where (especially for discrete video cards) you should try to avoid copying data that didn't change (dirty rectangles, "line changed flags", ...); and things like "fork()" where you should use paging tricks and copy nothing.
Even setting aside the cases in which the software has to specialize, the runtime would still need to have many variants to be suitable for most situations. Too many to be used comfortably in practice - to cater for alignment, hot and cold memory, VM pseudo-initialization, etc.


For "extreme specialisation", yes; but there's a lot of area between "no specialisation" and "extreme specialisation". For example, what if you provided "memcpy()" for when you expect the amount of memory being copied will be relatively small, plus a "memcpy_large()" for when you expect the amount of memory being copied will be relatively large? You'd be able to make a significant difference with very little specialisation.

simeonz wrote:
Source level libraries can actually do better, since the interface can be simplified with generics (preferably) or macros, and by relying on decent LTO. But source level libraries have no hardware information at their disposal. Not with AOT compilation.


That depends (e.g. consider something like Gentoo Linux, where almost everything is AOT compiled from source when installed). In this case it's not the hardware information that libraries lack, it's usage information (will the caller be copying aligned data, and how much, and how often).

simeonz wrote:
There are however justified uses of memcpy in my opinion.


Yes, for smaller memory copies (and not huge memory copies).

simeonz wrote:
For example, applications that do not use zero-copy I/O receive a duplicate of a memory range from the system cache into a user-space buffer.


If the process doesn't care about performance and doesn't use things like memory mapped files, etc; then they deserve whatever (poor) performance they get; and I'd argue that deliberately making performance as bad as possible is a good way to encourage them to write better software (and/or encourage users to replace the bad software with better software) and improve performance in the long run.

simeonz wrote:
Brendan wrote:
Note that most of the above applies to "memset()" too. For "memset()" the special cases (where "memset()" shouldn't be used) are things like filling an executable's ".bss" area with zeros (where you should use "allocate on write" paging tricks), filling an individual page with zeros (e.g. in the page fault handler when an "allocate on write" page is written to), etc. Also; for filling (larger) areas with zeros it's worth mentioning AMD's new "CLZERO" instruction, which is designed to fill an entire cache line (64 bytes) with zeros in a way that avoids the cost of fetching the cache line's previous contents (e.g. for filling a 4096-byte page with zeros in the page fault handler, I'd strongly consider using a linear sequence of 64 CLZERO instructions with no loop at all, and I'd consider having multiple different implementations of the page fault handler where during boot the kernel sets the IDT entry to point to the best page fault handler depending on various things, including if CLZERO is supported or not and including if cache line size is 64 bytes or something else).
Reading that paragraph, I am realizing now that Intel has no easy mechanism to discard cache lines or to write to cache without performing read-for-write. Or is there?


For discarding individual cache lines there's CLFLUSH.

Writing to part of a cache line (e.g. writing 8 bytes of a 64-byte cache line) means that the unmodified bytes must be read. The only cases where read-for-write can be avoided are when the stored data is not cached ("uncached", "write combining", non-temporal stores), or when the entire cache line is being modified (CLZERO, "rep movsb/d/q" and AVX512).
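For example, a fill done with non-temporal stores avoids the read-for-write entirely (untested sketch; assumes SSE2, a 16-byte-aligned destination and a size that is a multiple of 16):

Code:
#include <emmintrin.h>
#include <stddef.h>

/* Sketch: MOVNTDQ stores go through the write combining buffers and bypass
   the cache, so destination lines are never read before being overwritten. */
static void fill_nontemporal(void *dst, unsigned char val, size_t bytes)
{
    __m128i v = _mm_set1_epi8((char)val);
    __m128i *d = (__m128i *)dst;

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&d[i], v);
    _mm_sfence();   /* order the non-temporal stores before later accesses */
}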

simeonz wrote:
I had the delusion that write combining store buffers might skip read-for-writes if the code is writing out entire cache lines (in cached regions), but now I am thinking that this will not happen. The CPU cannot know if the write sequence will not end before filling the cache line, thus causing the read to be performed anyway, potentially stalling the pipeline. So, without explicit instructions this is not generally possible.


Why?

You do a bunch of little writes to a "write combining" area, and they get stored into the CPU's write combining buffer (ignoring the cache), where they're combined if possible. Later (when the CPU needs to evict/flush part of the write combining buffer, hopefully after you've finished writing to that area) the CPU stores the "combined writes" using a small number of larger stores and knows exactly how large each store is (and whether it does or doesn't fill an entire cache line).


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

