Slow framebuffer access on a real PC

eekee

Re: Slow framebuffer access on a real PC

Post by eekee »

rdos wrote: I've tried this on many different machines, and it is my impression that it is Intel machines that have lousy framebuffer performance, while most AMD-based machines work fine. It probably relates to Intel assuming that everybody uses their GPU software, and so they use some kind of emulation for the framebuffer.
Could be! There are exceptions though. Before any fixes were made, I watched video in 9front on a Thinkpad R400 which is an Intel machine, and thought it okay. I think Thinkpads are generally all right, at least the T and X series. (This R400 seems to be a cheaper X301 in a T61 case.)
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
feryno

Re: Slow framebuffer access on a real PC

Post by feryno »

I had the same problem more than 10 years ago, on a very old graphics card: copying data into the framebuffer with rep movsX was very slow. Performance improved once I set the caching attributes properly (MTRRs, PAT, bits in the page table entry), as bzt, Octocontrabass, and eekee have already suggested.
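
To make the "PAT plus page table bits" part concrete, here is a minimal C sketch. It is only an illustration under assumptions: it turns PAT entry 1 (write-through after reset) into write-combining, a common choice for framebuffers, and selects that entry through the PWT bit of a 4 KiB page table entry. The helper names are hypothetical; the bit positions and the IA32_PAT MSR number are the architectural ones. Writing the MSR and flushing caches and TLBs is left out.

```c
#include <stdint.h>

#define MSR_IA32_PAT  0x277u                  /* MSR that holds the eight PAT entries */
#define PAT_RESET     0x0007040600070406ull   /* reset value: WB, WT, UC-, UC (twice) */
#define PAT_TYPE_WC   0x01u                   /* write-combining memory type */

#define PTE_PWT (1ull << 3)
#define PTE_PCD (1ull << 4)
#define PTE_PAT (1ull << 7)   /* 4 KiB pages only; large pages use bit 12 instead */

/* Return a copy of the PAT value with entry `index` (0..7) set to `type`. */
static uint64_t pat_set_entry(uint64_t pat, unsigned index, uint8_t type)
{
    unsigned shift = (index & 7u) * 8u;
    pat &= ~(0xFFull << shift);
    pat |= (uint64_t)type << shift;
    return pat;
}

/* Page table entry bits that select PAT entry `index` (PAT:PCD:PWT = index). */
static uint64_t pte_flags_for_pat_index(unsigned index)
{
    uint64_t flags = 0;
    if (index & 1u) flags |= PTE_PWT;
    if (index & 2u) flags |= PTE_PCD;
    if (index & 4u) flags |= PTE_PAT;
    return flags;
}
```

A kernel would write `pat_set_entry(PAT_RESET, 1, PAT_TYPE_WC)` to the MSR on every CPU and then OR `pte_flags_for_pat_index(1)` into the entries that map the framebuffer.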
hypervisor-based solutions developer (Intel, AMD)
bzt

Re: Slow framebuffer access on a real PC

Post by bzt »

portasynthinca3 wrote: I have tried using SSE2 as @bzt suggested with the PREFETCHNTA instruction, but didn't seem to get much out of it. It's most likely a problem with my implementation.
I've checked your source. Yep, this is not how one uses prefetch. Most importantly, you should prefetch the data for the NEXT iteration :-)

As a Christmas gift, here's my implementation:


/**
 * Copyright (c) 2017 bzt (bztsrc@gitlab)
 * https://creativecommons.org/licenses/by-nc-sa/4.0/
 *
 * void *memcpy(void *dst, const void *src, size_t len)
 */
memcpy:
    cld
    /* return value: the original destination pointer, as C's memcpy requires */
    movq    %rdi, %rax
    /* check input parameters */
    orq     %rdi, %rdi
    jz      2f
    orq     %rsi, %rsi
    jz      2f
    orq     %rdx, %rdx
    jz      2f
    /* if it's a small block */
    cmpq    $512, %rdx
    jb      1f
    /* if both source and destination are 16-byte aligned
       (OR, not XOR: movdqa and movntdq fault on any unaligned address) */
    movb    %sil, %cl
    orb     %dil, %cl
    andb    $15, %cl
    jnz     1f
    /* copy big blocks, 256 bytes per iteration;
       rcx = number of blocks, rdx = remaining bytes (len & 255) */
    movq    %rdx, %rcx
    xorq    %rdx, %rdx
    movb    %cl, %dl
    shrq    $8, %rcx
    /* prefetch the source data for the NEXT iteration */
0:  prefetchnta 256(%rsi)
    prefetchnta 288(%rsi)
    prefetchnta 320(%rsi)
    prefetchnta 352(%rsi)
    prefetchnta 384(%rsi)
    prefetchnta 416(%rsi)
    prefetchnta 448(%rsi)
    prefetchnta 480(%rsi)
    movdqa    0(%rsi), %xmm0
    movdqa   16(%rsi), %xmm1
    movdqa   32(%rsi), %xmm2
    movdqa   48(%rsi), %xmm3
    movdqa   64(%rsi), %xmm4
    movdqa   80(%rsi), %xmm5
    movdqa   96(%rsi), %xmm6
    movdqa  112(%rsi), %xmm7
    movdqa  128(%rsi), %xmm8
    movdqa  144(%rsi), %xmm9
    movdqa  160(%rsi), %xmm10
    movdqa  176(%rsi), %xmm11
    movdqa  192(%rsi), %xmm12
    movdqa  208(%rsi), %xmm13
    movdqa  224(%rsi), %xmm14
    movdqa  240(%rsi), %xmm15
    movntdq %xmm0,   0(%rdi)
    movntdq %xmm1,  16(%rdi)
    movntdq %xmm2,  32(%rdi)
    movntdq %xmm3,  48(%rdi)
    movntdq %xmm4,  64(%rdi)
    movntdq %xmm5,  80(%rdi)
    movntdq %xmm6,  96(%rdi)
    movntdq %xmm7, 112(%rdi)
    movntdq %xmm8, 128(%rdi)
    movntdq %xmm9, 144(%rdi)
    movntdq %xmm10,160(%rdi)
    movntdq %xmm11,176(%rdi)
    movntdq %xmm12,192(%rdi)
    movntdq %xmm13,208(%rdi)
    movntdq %xmm14,224(%rdi)
    movntdq %xmm15,240(%rdi)
    addq    $256, %rsi
    addq    $256, %rdi
    decq    %rcx
    jnz     0b
    /* non-temporal stores are weakly ordered; fence before returning */
    sfence
    /* copy small block, or the remainder of a big one */
1:  movq    %rdx, %rcx
    shrq    $3, %rcx
    orq     %rcx, %rcx
    jz      1f
    rep     movsq
    /* copy the last 0..7 bytes (rcx is zero here, so only cl needs setting) */
1:  movb    %dl, %cl
    andb    $0x7, %cl
    jz      2f
    rep     movsb
2:  ret
Notes: it only guarantees high performance if you pass it properly (16-byte) aligned buffers. If you copy the entire framebuffer, this shouldn't be a problem. If you copy areas, then I suggest expanding the area's starting X coordinate so that it falls on a pixel address which is 16-byte aligned. That is, in the worst case, you copy 3 more pixels per line (at 4 bytes per pixel). (Alternatively, you could use REP MOVSD until you reach a buffer address which is aligned.)
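
The "expand the starting X coordinate" idea can be sketched in C as follows. This is a minimal illustration, not part of the routine above, and it rests on two assumptions: pixels are 4 bytes wide, and every scanline starts on a 16-byte boundary. The struct and function names are made up for the example.

```c
#include <stddef.h>

#define BYTES_PER_PIXEL   4u    /* assumption: 32 bits per pixel */
#define ALIGN_BYTES       16u   /* alignment required by movdqa/movntdq */
#define PIXELS_PER_CHUNK  (ALIGN_BYTES / BYTES_PER_PIXEL)

/* Horizontal extent of the area to copy, in pixels. */
struct span {
    size_t x;      /* first pixel of the line to copy */
    size_t width;  /* number of pixels to copy */
};

/* Move the left edge back to the previous 16-byte boundary and widen the
 * span by the same amount, so the right edge stays where it was. */
static struct span align_span_left(size_t x, size_t width)
{
    size_t extra = x % PIXELS_PER_CHUNK;   /* 0..3 additional pixels */
    struct span s = { x - extra, width + extra };
    return s;
}
```

For example, a span starting at pixel 5 with a width of 10 becomes a span starting at pixel 4 with a width of 11.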

As your CPU supports AVX, you can replace the main loop with AVX registers to copy more bytes per iteration, which would make it even faster. (Basically, you should use as many registers, and as wide, as you can to increase the throughput of the main loop.) Your CPU should also support ERMSB, so try REP MOVSB on 16-byte-aligned buffers; it should be much faster than REP MOVSQ or the SSE version (I know, it sounds crazy, but read the Intel Optimization Manual I've linked). Some have reported that REP MOVSB does not work as expected, and unfortunately I'm one of them: my CPU is newer than Ivy Bridge, yet I see no ERMSB effect.
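
Whether a CPU advertises ERMSB can be checked with CPUID: leaf 7, subleaf 0, EBX bit 9. The sketch below is a hedged example rather than a drop-in routine. The function names are hypothetical, and the wrapper assumes GCC or Clang on x86 with `__get_cpuid_count` from `<cpuid.h>`. The flag only says the feature is advertised; as noted above, the actual speedup still has to be measured.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID7_EBX_ERMS (1u << 9)   /* Enhanced REP MOVSB/STOSB */

/* Pure bit test on the EBX value returned by CPUID leaf 7, subleaf 0. */
static bool ebx_has_erms(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & CPUID7_EBX_ERMS) != 0;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Query the running CPU; reports false when leaf 7 is not available. */
static bool cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ebx_has_erms(ebx);
}
#endif
```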

And as I and others have suggested, check in the MMU configuration how your framebuffer is mapped. You should never read video memory directly, only write to it, so normal caching would actually slow down your driver.
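
One common way to follow the "write only" rule is to keep a shadow copy of the screen in ordinary RAM, do every read-modify-write there, and send only finished rows to video memory. The C sketch below illustrates that under assumptions: the `surface` struct, its fields, and `invert_rect` are hypothetical names, pixels are 32 bits wide, and the caller keeps the rectangle inside the surface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical surface description; all names are assumptions for this sketch. */
struct surface {
    uint32_t *fb;      /* mapped video memory: written, never read */
    uint32_t *shadow;  /* same-sized copy in ordinary RAM */
    size_t    width;   /* visible pixels per line */
    size_t    pitch;   /* pixels per scanline, at least width */
};

/* Invert the colors of a rectangle. The read-modify-write touches only the
 * shadow copy; video memory just receives the finished rows. */
static void invert_rect(struct surface *s, size_t x, size_t y,
                        size_t w, size_t h)
{
    for (size_t row = y; row < y + h; row++) {
        uint32_t *line = s->shadow + row * s->pitch + x;

        for (size_t col = 0; col < w; col++)
            line[col] ^= 0x00FFFFFFu;
        memcpy(s->fb + row * s->pitch + x, line, w * sizeof *line);
    }
}
```

In a kernel, the `memcpy` call here would be the place to use an optimized copy routine.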

Cheers and Merry Christmas,
bzt