rdos wrote: I've tried this on many different machines, and it is my impression that it is Intel machines that have lousy frame-buffer performance while most AMD-based machines work fine. It probably relates to Intel assuming that everybody uses their GPU software, and so they use some kind of emulation for the framebuffer.

Could be! There are exceptions though. Before any fixes were made, I watched video in 9front on a Thinkpad R400, which is an Intel machine, and thought it was okay. I think Thinkpads are generally all right, at least the T and X series. (This R400 seems to be a cheaper X301 in a T61 case.)
Slow framebuffer access on a real PC
Re: Slow framebuffer access on a real PC
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
Re: Slow framebuffer access on a real PC
I had the same problem more than 10 years ago, on a very old graphics card: copying data into the framebuffer using repz movsX was very slow. The performance improved after properly changing the caching attributes (MTRRs, PAT, bits in the page table entry), as bzt, Octocontrabass and eekee already suggested.
hypervisor-based solutions developer (Intel, AMD)
Re: Slow framebuffer access on a real PC
portasynthinca3 wrote: I have tried using SSE2 as @bzt suggested with the PREFETCHNTA instruction, but didn't seem to get much out of it. It's most likely a problem with my implementation.

I've checked your source. Yep, this is not how one uses prefetch. Most importantly, you should prefetch data for the NEXT iteration.

As a Christmas gift, here's my implementation:
/**
* Copyright (c) 2017 bzt (bztsrc@gitlab)
* https://creativecommons.org/licenses/by-nc-sa/4.0/
*
* void *memcpy(void *dst, const void *src, size_t len)
*/
memcpy:
cld
/* check input parameters */
orq %rdi, %rdi
jz 2f
orq %rsi, %rsi
jz 2f
orq %rdx, %rdx
jz 2f
/* if it's a small block */
cmpq $512, %rdx
jb 1f
/* only use the SSE path if both source and destination are 16-byte aligned
 * (movdqa and movntdq fault on unaligned addresses) */
movb %sil, %al
orb %dil, %al
andb $15, %al
jnz 1f
/* copy big blocks, 256 bytes per iteration */
movq %rdx, %rcx
xorq %rdx, %rdx
movb %cl, %dl /* %rdx = len % 256, the tail left for the small-block path */
shrq $8, %rcx /* %rcx = number of 256-byte blocks */
0: prefetchnta 256(%rsi)
prefetchnta 288(%rsi)
prefetchnta 320(%rsi)
prefetchnta 352(%rsi)
prefetchnta 384(%rsi)
prefetchnta 416(%rsi)
prefetchnta 448(%rsi)
prefetchnta 480(%rsi)
movdqa 0(%rsi), %xmm0
movdqa 16(%rsi), %xmm1
movdqa 32(%rsi), %xmm2
movdqa 48(%rsi), %xmm3
movdqa 64(%rsi), %xmm4
movdqa 80(%rsi), %xmm5
movdqa 96(%rsi), %xmm6
movdqa 112(%rsi), %xmm7
movdqa 128(%rsi), %xmm8
movdqa 144(%rsi), %xmm9
movdqa 160(%rsi), %xmm10
movdqa 176(%rsi), %xmm11
movdqa 192(%rsi), %xmm12
movdqa 208(%rsi), %xmm13
movdqa 224(%rsi), %xmm14
movdqa 240(%rsi), %xmm15
movntdq %xmm0, 0(%rdi)
movntdq %xmm1, 16(%rdi)
movntdq %xmm2, 32(%rdi)
movntdq %xmm3, 48(%rdi)
movntdq %xmm4, 64(%rdi)
movntdq %xmm5, 80(%rdi)
movntdq %xmm6, 96(%rdi)
movntdq %xmm7, 112(%rdi)
movntdq %xmm8, 128(%rdi)
movntdq %xmm9, 144(%rdi)
movntdq %xmm10,160(%rdi)
movntdq %xmm11,176(%rdi)
movntdq %xmm12,192(%rdi)
movntdq %xmm13,208(%rdi)
movntdq %xmm14,224(%rdi)
movntdq %xmm15,240(%rdi)
addq $256, %rsi
addq $256, %rdi
decq %rcx
jnz 0b
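/* order the weakly ordered non-temporal stores before any later stores */
sfence
/* copy small block */
1: movq %rdx, %rcx
shrq $3, %rcx
orq %rcx, %rcx
jz 1f
rep movsq
1: movb %dl, %cl
andb $0x7, %cl
jz 2f
rep movsb
/* note: after a copy %rdi points past the last byte written, so this
 * returns dst + len, not dst as the C library memcpy does */
2: movq %rdi, %rax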
ret
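For anyone who would rather read this in C, here is a rough sketch of the same loop shape using SSE2 intrinsics: prefetch the data for the next iteration, do aligned loads, then non-temporal stores, then finish the tail with an ordinary copy. This is not a translation of the routine above: the name `stream_copy` is made up for illustration, it moves 64 bytes per iteration instead of 256 to stay short, and it returns the original `dst`. On non-x86 compilers the guarded block drops out and it degrades to plain `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics, also pulls in _mm_prefetch and _mm_sfence */
#endif

/* Hypothetical helper: copy len bytes to a destination that is only written,
 * never read back (for example a framebuffer). Returns dst. */
void *stream_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(__SSE2__)
    /* The streaming path needs both pointers 16-byte aligned. */
    if (len >= 64 && ((((uintptr_t)d) | ((uintptr_t)s)) & 15) == 0) {
        size_t blocks = len / 64;
        len %= 64;
        while (blocks--) {
            /* Prefetch the data the NEXT iteration will read. */
            _mm_prefetch((const char *)(s + 64), _MM_HINT_NTA);
            __m128i r0 = _mm_load_si128((const __m128i *)(s + 0));
            __m128i r1 = _mm_load_si128((const __m128i *)(s + 16));
            __m128i r2 = _mm_load_si128((const __m128i *)(s + 32));
            __m128i r3 = _mm_load_si128((const __m128i *)(s + 48));
            /* Non-temporal stores bypass the cache on the way out. */
            _mm_stream_si128((__m128i *)(d + 0), r0);
            _mm_stream_si128((__m128i *)(d + 16), r1);
            _mm_stream_si128((__m128i *)(d + 32), r2);
            _mm_stream_si128((__m128i *)(d + 48), r3);
            s += 64;
            d += 64;
        }
        /* Order the streaming stores before whatever is stored next. */
        _mm_sfence();
    }
#endif
    /* Tail, small or unaligned copies: leave it to the ordinary memcpy. */
    memcpy(d, s, len);
    return dst;
}
```

Whether this beats a plain `memcpy` depends heavily on how the destination is mapped, so measure it on the real hardware.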
Since your CPU supports AVX, you can rewrite the main loop with AVX registers to copy more bytes per iteration, which should make it even faster. (Basically, use as many and as wide registers as you can to increase the throughput of the main loop.) Your CPU should also support ERMSB, so try REP MOVSB on 16-byte-aligned buffers; it should be much faster than REP MOVSQ or the SSE versions (I know, it sounds crazy, but read the Intel Optimization Manual I've linked). Some have reported that REP MOVSB does not speed up as expected, and unfortunately I'm one of them: my CPU is newer than Ivy Bridge, yet I see no ERMSB effect.
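If you want to check the ERMSB flag rather than guess from the CPU generation, CPUID leaf 7, sub-leaf 0 reports it in EBX bit 9. A minimal sketch follows, assuming a GCC or Clang toolchain that ships `<cpuid.h>` with `__get_cpuid_count`; the function name `cpu_has_erms` is made up for illustration. In a freestanding kernel you would likely issue CPUID with your own inline assembly instead.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; an assumption about the toolchain */
#endif

/* Hypothetical helper: true if CPUID advertises Enhanced REP MOVSB/STOSB. */
bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid_count returns 0 when leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> 9) & 1u) != 0;   /* CPUID.(EAX=7,ECX=0):EBX bit 9 = ERMS */
#else
    return false;                    /* not an x86 build */
#endif
}
```

The flag only says the feature is advertised. As noted above, it does not guarantee that REP MOVSB is actually the fastest option on a given machine, so benchmark both paths.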
And as I and others have suggested, check the MMU configuration to see how your framebuffer is mapped. You should never read video memory directly, only write to it, so mapping it with normal caching would actually slow down your driver.
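To make the PAT part of that concrete, here is a small C sketch of the bookkeeping involved: replacing one entry in the IA32_PAT value with the write-combining type, and working out which page table entry bits select a given PAT entry for a 4 KiB page. The helper names are invented for this post, and picking entry 4 for WC is just one possible choice. Actually writing the MSR (IA32_PAT is MSR 0x277) and flushing caches and TLBs afterwards is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PAT_TYPE_UC 0x00u          /* uncacheable */
#define PAT_TYPE_WC 0x01u          /* write-combining */
#define PAT_TYPE_WB 0x06u          /* write-back */

#define PTE_PWT     (1ull << 3)    /* page-level write-through */
#define PTE_PCD     (1ull << 4)    /* page-level cache disable */
#define PTE_PAT_4K  (1ull << 7)    /* PAT bit in a 4 KiB page table entry */

/* Hypothetical helper: return the IA32_PAT value with entry `index` (0..7)
 * replaced by memory type `type`. Each entry occupies one byte. */
uint64_t pat_set_entry(uint64_t pat, unsigned index, uint8_t type)
{
    assert(index < 8);
    unsigned shift = index * 8;
    pat &= ~(0xffull << shift);
    return pat | ((uint64_t)type << shift);
}

/* Hypothetical helper: PTE bits that select PAT entry `index` for a 4 KiB
 * page. The index is formed as PAT*4 + PCD*2 + PWT. */
uint64_t pte_cache_bits_4k(unsigned index)
{
    assert(index < 8);
    uint64_t bits = 0;
    if (index & 1) bits |= PTE_PWT;
    if (index & 2) bits |= PTE_PCD;
    if (index & 4) bits |= PTE_PAT_4K;
    return bits;
}
```

For example, starting from the power-on default PAT value 0x0007040600070406, `pat_set_entry(pat, 4, PAT_TYPE_WC)` gives 0x0007040100070406, and the framebuffer pages would then be mapped with `pte_cache_bits_4k(4)`, which is just the PAT bit. Large pages keep the PAT bit in a different position, so this only covers 4 KiB mappings.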
Cheers and Merry Christmas,
bzt