OSDev.org

The Place to Start for Operating System Developers
All times are UTC - 6 hours




 Post subject: Re: Slow framebuffer access on a real PC
Posted: Tue Dec 24, 2019 3:51 pm

Joined: Mon May 22, 2017 5:56 am
Posts: 315
rdos wrote:
I've tried this on many different machines, and my impression is that Intel machines have lousy framebuffer performance, while most AMD-based machines work fine. It probably relates to Intel assuming that everybody uses their GPU software, so they use some kind of emulation for the framebuffer.

Could be! There are exceptions, though. Before any fixes were made, I watched video in 9front on a Thinkpad R400, which is an Intel machine, and thought it was okay. I think Thinkpads are generally all right, at least the T and X series. (This R400 seems to be a cheaper X301 in a T61 case.)


 Post subject: Re: Slow framebuffer access on a real PC
Posted: Wed Dec 25, 2019 4:34 am

Joined: Thu Feb 09, 2012 6:53 am
Posts: 45
Location: Czechoslovakia
I had the same problem more than 10 years ago on a very old graphics card: copying data into the framebuffer with rep movsX was very slow. Performance improved once I set the caching attributes properly (MTRRs, PAT, and the cache bits in the page table entries), as bzt, Octocontrabass, and eekee have already suggested.

_________________
hypervisor-based solutions developer (Intel, AMD)


 Post subject: Re: Slow framebuffer access on a real PC
Posted: Wed Dec 25, 2019 12:21 pm

Joined: Thu Oct 13, 2016 4:55 pm
Posts: 858
portasynthinca3 wrote:
I have tried using SSE2 as @bzt suggested with the PREFETCHNTA instruction, but didn't seem to get much out of it. It's most likely a problem with my implementation.
I've checked your source. Yep, this is not how one uses prefetch. Most importantly, you should prefetch data for the NEXT iteration :-)
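
To make the "next iteration" point concrete, here is a minimal C sketch using SSE2 intrinsics. It is only an illustration under assumptions, not the implementation posted in this thread: the function name copy_prefetch_next is hypothetical, both pointers are assumed to be 16-byte aligned, and len is assumed to be a multiple of 64.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _mm_sfence */

/* Hypothetical helper: copies len bytes in 64-byte steps.
 * Assumes dst and src are 16-byte aligned and len is a multiple of 64. */
static void copy_prefetch_next(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(len % 64 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 64; i++, s += 4, d += 4) {
        /* Hint the block that the NEXT iteration will load, not this one.
         * A prefetch never faults, so running one block past the end is harmless. */
        _mm_prefetch((const char *)(s + 4), _MM_HINT_NTA);

        /* Load the current 64 bytes with aligned loads. */
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);

        /* Write them with non-temporal stores that bypass the cache. */
        _mm_stream_si128(d + 0, a);
        _mm_stream_si128(d + 1, b);
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);
    }

    /* Non-temporal stores are weakly ordered; fence before anyone relies on them. */
    _mm_sfence();
}
```

The prefetch distance here is a single 64-byte block, chosen only to keep the sketch short; a real loop would likely prefetch further ahead and move more data per iteration.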

As a Christmas gift, here's my implementation:
Code:
/**
* Copyright (c) 2017 bzt (bztsrc@gitlab)
* https://creativecommons.org/licenses/by-nc-sa/4.0/
*
* void *memcpy(void *dst, const void *src, size_t len)
*/
memcpy:
    cld
    /* memcpy returns the original dst, so save it before %rdi advances */
    movq    %rdi, %rax
    /* check input parameters */
    orq     %rdi, %rdi
    jz      2f
    orq     %rsi, %rsi
    jz      2f
    orq     %rdx, %rdx
    jz      2f
    /* if it's a small block */
    cmpq    $512, %rdx
    jb      1f
    /* movdqa/movntdq fault unless BOTH source and destination are 16-byte aligned */
    movb    %sil, %cl
    orb     %dil, %cl
    andb    $15, %cl
    jnz     1f
    /* copy big blocks, 256 bytes per iteration */
    movq    %rdx, %rcx
    xorq    %rdx, %rdx
    movb    %cl, %dl            /* %rdx = len % 256, left for the tail copy */
    shrq    $8, %rcx            /* %rcx = number of 256-byte blocks */
    /* prefetch the source of the NEXT iteration */
0:  prefetchnta 256(%rsi)
    prefetchnta 288(%rsi)
    prefetchnta 320(%rsi)
    prefetchnta 352(%rsi)
    prefetchnta 384(%rsi)
    prefetchnta 416(%rsi)
    prefetchnta 448(%rsi)
    prefetchnta 480(%rsi)
    movdqa    0(%rsi), %xmm0
    movdqa   16(%rsi), %xmm1
    movdqa   32(%rsi), %xmm2
    movdqa   48(%rsi), %xmm3
    movdqa   64(%rsi), %xmm4
    movdqa   80(%rsi), %xmm5
    movdqa   96(%rsi), %xmm6
    movdqa  112(%rsi), %xmm7
    movdqa  128(%rsi), %xmm8
    movdqa  144(%rsi), %xmm9
    movdqa  160(%rsi), %xmm10
    movdqa  176(%rsi), %xmm11
    movdqa  192(%rsi), %xmm12
    movdqa  208(%rsi), %xmm13
    movdqa  224(%rsi), %xmm14
    movdqa  240(%rsi), %xmm15
    movntdq %xmm0,   0(%rdi)
    movntdq %xmm1,  16(%rdi)
    movntdq %xmm2,  32(%rdi)
    movntdq %xmm3,  48(%rdi)
    movntdq %xmm4,  64(%rdi)
    movntdq %xmm5,  80(%rdi)
    movntdq %xmm6,  96(%rdi)
    movntdq %xmm7, 112(%rdi)
    movntdq %xmm8, 128(%rdi)
    movntdq %xmm9, 144(%rdi)
    movntdq %xmm10,160(%rdi)
    movntdq %xmm11,176(%rdi)
    movntdq %xmm12,192(%rdi)
    movntdq %xmm13,208(%rdi)
    movntdq %xmm14,224(%rdi)
    movntdq %xmm15,240(%rdi)
    addq    $256, %rsi
    addq    $256, %rdi
    decq    %rcx
    jnz     0b
    /* non-temporal stores are weakly ordered; fence before returning */
    sfence
    /* copy small block, 8 bytes at a time */
1:  movq    %rdx, %rcx
    shrq    $3, %rcx
    orq     %rcx, %rcx
    jz      1f
    rep     movsq
    /* copy the last 0..7 bytes (%rcx is zero here, so only %cl needs setting) */
1:  movb    %dl, %cl
    andb    $0x7, %cl
    jz      2f
    rep     movsb
2:  ret

Notes: it only guarantees high performance if you pass it properly aligned buffers. If you copy the entire framebuffer, this shouldn't be a problem. If you copy areas, I suggest expanding the area's starting X coordinate so that it falls on a pixel address that is 16-byte aligned. That is, in the worst case (with 4-byte pixels), you copy 3 more pixels per line. (Alternatively, you could use REP MOVSD until you reach an aligned buffer address.)
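
The X-coordinate expansion can be sketched in C as follows. This is only an illustration: struct span and align_span_16 are hypothetical names, and the sketch assumes that the framebuffer base address and the pitch (bytes per scanline) are both multiples of 16, and that the pixel size divides 16 evenly (so 4-byte pixels work, 3-byte pixels do not).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a horizontal run of pixels on one scanline. */
struct span {
    uint32_t x;      /* first pixel column */
    uint32_t width;  /* number of pixels */
};

/* Moves the start of the span left to the nearest column whose byte offset
 * is a multiple of 16 and widens the span so it still covers the same pixels.
 * Assumes base address and pitch are 16-byte aligned. */
static struct span align_span_16(uint32_t x, uint32_t width, uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel != 0 && 16 % bytes_per_pixel == 0);

    uint32_t pixels_per_16 = 16 / bytes_per_pixel;  /* 4 for 32 bpp */
    uint32_t extra = x % pixels_per_16;             /* 0..3 for 32 bpp */

    struct span s = { x - extra, width + extra };
    return s;
}
```

For example, with 4-byte pixels a span starting at x = 7 with width 10 becomes x = 4 with width 13, which is the 3-pixel worst case mentioned above.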

As your CPU supports AVX, you can rewrite the main loop with AVX registers to copy more bytes per iteration, which would make it even faster. (Basically, you should use as many and as large registers as you can to increase the throughput of the main loop.) Your CPU should also support ERMSB, so try a 16-byte-aligned REP MOVSB; it should be much faster than REP MOVSQ or the SSE versions (I know, it sounds crazy, but read the Intel Optimization Manual I've linked). Some have reported that REP MOVSB does not work as expected, and unfortunately I'm one of them: my CPU is newer than Ivy Bridge, yet I see no ERMSB effect.
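
If you want to try the ERMSB route, here is a rough C sketch for GCC or Clang on x86. The feature bit is CPUID leaf 7, subleaf 0, EBX bit 9; the names ebx_has_erms, cpu_has_erms and copy_rep_movsb are hypothetical, and the sketch assumes the direction flag is clear, as the C calling convention guarantees on entry.

```c
#include <assert.h>
#include <cpuid.h>   /* GCC/Clang: __get_cpuid_count */
#include <stddef.h>
#include <stdint.h>

#define CPUID7_EBX_ERMS (1u << 9)  /* Enhanced REP MOVSB/STOSB */

/* Pure helper: tests the ERMS bit in an EBX value from CPUID leaf 7. */
static int ebx_has_erms(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & CPUID7_EBX_ERMS) != 0;
}

/* Asks the running CPU; returns 0 if leaf 7 is not available at all. */
static int cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ebx_has_erms(ebx);
}

/* Plain REP MOVSB copy; returns the original dst like memcpy does. */
static void *copy_rep_movsb(void *dst, const void *src, size_t len)
{
    void *ret = dst;

    assert(len == 0 || (dst != NULL && src != NULL));
    /* RDI, RSI and RCX are both inputs and outputs of REP MOVSB. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(len)
                     :
                     : "memory");
    return ret;
}
```

A caller could check cpu_has_erms() once at startup and pick copy_rep_movsb only when it returns nonzero. Whether that is actually faster on a given machine has to be measured, as the mixed reports above suggest.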

And as I and others have suggested, check the MMU configuration to see how your framebuffer is mapped. You should never read video memory directly, only write to it; therefore, normal caching would actually slow down your driver.
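
One common way to set the caching attributes mentioned in this thread is to put the write-combining (WC) type into a PAT entry and select that entry from the page table entries that map the framebuffer. The C sketch below only computes the values; the helper names are hypothetical, and it assumes 4 KiB pages (large pages keep the PAT bit in a different position). Writing the MSR and flushing caches and TLBs afterwards are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings used in IA32_PAT entries. */
enum { PAT_UC = 0, PAT_WC = 1, PAT_WT = 4, PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7 };

#define IA32_PAT_MSR     0x277u
#define IA32_PAT_DEFAULT 0x0007040600070406ull  /* power-on value of the MSR */

#define PTE_PWT    (1ull << 3)
#define PTE_PCD    (1ull << 4)
#define PTE_PAT_4K (1ull << 7)  /* PAT bit of a 4 KiB page table entry */

/* Returns pat with entry `index` (0..7) replaced by memory type `type`. */
static uint64_t pat_set_entry(uint64_t pat, unsigned index, unsigned type)
{
    assert(index < 8 && type < 8);
    unsigned shift = index * 8;
    return (pat & ~(0xFFull << shift)) | ((uint64_t)type << shift);
}

/* Returns the PTE bits that select PAT entry `index`:
 * index = PAT << 2 | PCD << 1 | PWT. */
static uint64_t pte_bits_for_pat_index(unsigned index)
{
    assert(index < 8);
    uint64_t bits = 0;
    if (index & 1) bits |= PTE_PWT;
    if (index & 2) bits |= PTE_PCD;
    if (index & 4) bits |= PTE_PAT_4K;
    return bits;
}
```

For example, pat_set_entry(IA32_PAT_DEFAULT, 1, PAT_WC) gives 0x0007040600070106, and pte_bits_for_pat_index(1) is just PTE_PWT, so framebuffer pages mapped with PWT set and PCD and PAT clear would become write-combining. Using entry 1 is an arbitrary choice for this sketch; it replaces the default write-through slot.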

Cheers and Merry Christmas,
bzt

