portasynthinca3 wrote:
I have tried using SSE2 as @bzt suggested with the PREFETCHNTA instruction, but didn't seem to get much out of it. It's most likely a problem with my implementation.
I've checked your source. Yep, that's not how one uses prefetch. Most importantly, you should prefetch the data for the NEXT iteration.
As a Christmas gift, here's my implementation:
Code:
/**
* Copyright (c) 2017 bzt (bztsrc@gitlab)
* https://creativecommons.org/licenses/by-nc-sa/4.0/
*
* void *memcpy(void *dst, const void *src, size_t len)
*/
memcpy:
cld
/* remember the original destination, memcpy must return it */
movq %rdi, %r11
/* check input parameters */
orq %rdi, %rdi
jz 2f
orq %rsi, %rsi
jz 2f
orq %rdx, %rdx
jz 2f
/* if it's a small block */
cmpq $512, %rdx
jb 1f
/* if both source and destination are 16 bytes aligned */
movb %sil, %al
orb %dil, %al
andb $15, %al
jnz 1f
/* copy big blocks, 256 bytes per iteration */
movq %rdx, %rcx
xorq %rdx, %rdx
movb %cl, %dl /* rdx = remaining bytes (count mod 256) */
shrq $8, %rcx /* rcx = number of 256 byte blocks */
/* prefetch the source data of the NEXT iteration */
0: prefetchnta 256(%rsi)
prefetchnta 288(%rsi)
prefetchnta 320(%rsi)
prefetchnta 352(%rsi)
prefetchnta 384(%rsi)
prefetchnta 416(%rsi)
prefetchnta 448(%rsi)
prefetchnta 480(%rsi)
movdqa 0(%rsi), %xmm0
movdqa 16(%rsi), %xmm1
movdqa 32(%rsi), %xmm2
movdqa 48(%rsi), %xmm3
movdqa 64(%rsi), %xmm4
movdqa 80(%rsi), %xmm5
movdqa 96(%rsi), %xmm6
movdqa 112(%rsi), %xmm7
movdqa 128(%rsi), %xmm8
movdqa 144(%rsi), %xmm9
movdqa 160(%rsi), %xmm10
movdqa 176(%rsi), %xmm11
movdqa 192(%rsi), %xmm12
movdqa 208(%rsi), %xmm13
movdqa 224(%rsi), %xmm14
movdqa 240(%rsi), %xmm15
movntdq %xmm0, 0(%rdi)
movntdq %xmm1, 16(%rdi)
movntdq %xmm2, 32(%rdi)
movntdq %xmm3, 48(%rdi)
movntdq %xmm4, 64(%rdi)
movntdq %xmm5, 80(%rdi)
movntdq %xmm6, 96(%rdi)
movntdq %xmm7, 112(%rdi)
movntdq %xmm8, 128(%rdi)
movntdq %xmm9, 144(%rdi)
movntdq %xmm10,160(%rdi)
movntdq %xmm11,176(%rdi)
movntdq %xmm12,192(%rdi)
movntdq %xmm13,208(%rdi)
movntdq %xmm14,224(%rdi)
movntdq %xmm15,240(%rdi)
addq $256, %rsi
addq $256, %rdi
decq %rcx
jnz 0b
sfence /* flush the weakly-ordered non-temporal stores */
/* copy small block */
1: movq %rdx, %rcx
shrq $3, %rcx /* number of qwords */
jz 1f
rep movsq
1: movb %dl, %cl
andb $0x7, %cl /* trailing bytes */
jz 2f
rep movsb
2: movq %r11, %rax /* return the original destination */
ret
Notes: it only guarantees high performance if you pass properly aligned buffers to it. If you copy the entire framebuffer, this shouldn't be a problem. If you copy areas, then I suggest expanding the area's starting X coordinate so that its pixel address is 16 bytes aligned; in the worst case that means copying 3 extra pixels per line. (Alternatively, you could use REP MOVSD until you reach an aligned buffer address, as sketched below.)
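For illustration, that prologue could look something like this (just a sketch of mine, assuming 32 bpp so every pixel address is 4 bytes aligned; note it only helps when source and destination are misaligned by the same amount, otherwise MOVDQA would still fault):
Code:
movq %rdi, %rcx
negq %rcx
andq $15, %rcx /* bytes needed to reach 16 bytes alignment */
jz 3f /* already aligned */
subq %rcx, %rdx /* take them off the main byte count */
shrq $2, %rcx /* dword (pixel) count */
rep movsl /* REP MOVSD in Intel syntax */
3: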
As your CPU supports AVX, you can replace the main loop with AVX registers to copy more bytes per iteration, which would make it even faster. (Basically, you should utilize as many and as big registers as you can to increase the throughput of the main loop.) Your CPU should also support ERMSB, so try 16 bytes aligned REP MOVSB; it should be much faster than REP MOVSQ or the SSE versions (I know, it sounds crazy, but read the Intel Optimization Manual I've linked). Some have reported that REP MOVSB does not work as expected, and unfortunately I'm one of them: my CPU is newer than Ivy Bridge, yet I see no ERMSB effect.
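For reference, here's roughly how the inner loop could look with AVX; this is just an untested sketch of mine, the loop control around it stays the same. Note that VMOVDQA and VMOVNTDQ require 32 bytes alignment, so the alignment check would have to use $31 instead of $15; and since a cache line is 64 bytes, prefetching every 64th byte of the next block is enough:
Code:
0: prefetchnta 256(%rsi)
prefetchnta 320(%rsi)
prefetchnta 384(%rsi)
prefetchnta 448(%rsi)
vmovdqa 0(%rsi), %ymm0
vmovdqa 32(%rsi), %ymm1
vmovdqa 64(%rsi), %ymm2
vmovdqa 96(%rsi), %ymm3
vmovdqa 128(%rsi), %ymm4
vmovdqa 160(%rsi), %ymm5
vmovdqa 192(%rsi), %ymm6
vmovdqa 224(%rsi), %ymm7
vmovntdq %ymm0, 0(%rdi)
vmovntdq %ymm1, 32(%rdi)
vmovntdq %ymm2, 64(%rdi)
vmovntdq %ymm3, 96(%rdi)
vmovntdq %ymm4, 128(%rdi)
vmovntdq %ymm5, 160(%rdi)
vmovntdq %ymm6, 192(%rdi)
vmovntdq %ymm7, 224(%rdi)
addq $256, %rsi
addq $256, %rdi
decq %rcx
jnz 0b
(If the rest of your code uses legacy SSE, execute a VZEROUPPER when you're done to avoid AVX-SSE transition penalties.)
And if you want to verify ERMSB before relying on REP MOVSB: it's reported by CPUID leaf 7, EBX bit 9. Keep in mind CPUID clobbers RBX, which is callee-saved in the SysV ABI:
Code:
movl $7, %eax /* CPUID leaf 7, subleaf 0 */
xorl %ecx, %ecx
cpuid
btl $9, %ebx /* CF=1 if ERMSB is supported */
jnc 4f /* no ERMSB: use the SSE loop instead (hypothetical label) */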
And as others and I have suggested, check in the MMU configuration how your framebuffer is mapped. You should never read video memory directly, only write it; therefore normal (write-back) caching would actually slow down your driver. Write-Combining is the usual memory type for framebuffers.
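In case it helps, here's a minimal sketch of one way to get such a mapping on x86-64 through the PAT; I'm assuming you are free to repurpose PAT entry 1 (Write-Through by default), which may or may not suit your kernel:
Code:
/* program PAT entry 1 to Write-Combining (type 01h) */
movl $0x277, %ecx /* IA32_PAT MSR */
rdmsr
andl $0xffff00ff, %eax /* clear PA1 (bits 15:8) */
orl $0x00000100, %eax /* PA1 = 01h, WC */
wrmsr
/* then set bit 3 (PWT) and clear bits 4 (PCD) and 7 (PAT) in every
   PTE that maps the framebuffer, and flush the TLB */
Write-back caching is still the right choice for your back buffer in RAM; it's only the framebuffer itself that wants Write-Combining.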
Cheers and Merry Christmas,
bzt