OSDev.org

The Place to Start for Operating System Developers
 Post subject: Unable to mark a memory region as WC using MTRRs
PostPosted: Sat Sep 01, 2018 8:47 am 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Hi everyone,
some time ago I added framebuffer support to my kernel. I noticed back then that its performance was not good enough and, after some research, it turned out that the problem was that the framebuffer's memory region was not marked as write-combining.
Now I've finally decided to solve this problem using x86's MTRRs (or PAT). After studying how to do that, I added a set of functions to control the variable-range MTRRs, and I made the framebuffer's memory region WC by adding a new MTRR in the first available slot.

On my UDOO x86 (https://www.udoo.org/udoo-x86/) it worked great and I achieved a performance of about 1.7 cycles / pixel, which is about 10x faster than what I was able to achieve before using MTRRs.
Note: I actually cheated a little before working with MTRRs, because I used FPU instructions (SSE, AVX2) when available in order to increase performance [the larger the register, the better]. Anyway, I mention that just as proof that, at least on some hardware, the new MTRR-related code has a real and tangible effect.

Now, my (apparently unsolvable) problem is with my Dell XPS 13" 9360: the kernel runs "well" but the new MTRR entry has no effect. The performance is terrible: about ~250 cycles/pixel. After some debugging, it turned out that the new MTRR entry is overridden by the MTRR entry 0, which marks the memory from +2 GB to +4 GB as UC (uncacheable). According to Intel's manual, UC always wins over other types of memory, in case of an overlap.
Note: the framebuffer's address on that machine is: 0x90000000 [+2.25 GB].

My first question is:
What's the point of having such a big region of physical memory marked as uncacheable?
My machine has 16 GB of physical memory, so I'd expect that region to be a perfectly regular (WB) part of the RAM. Usually the memory regions used for memory-mapped I/O are much smaller, a few MBs at most (like other MTRRs, on the same machine). There must be something I'm missing in the big picture.

My second question is:
So, what could I do to make the framebuffer's memory WC? I tried to just invalidate MTRR entry 0, hopefully in the proper way, as described in Intel's System Programming Guide, Section 11.11.7.2, but it did not work. The screen just froze while trying to re-enable the MTRRs after the change [I implemented pre_mtrr_change() and post_mtrr_change() as described in Intel's documentation].

If necessary, I can copy-paste my code here, but the theoretical problem is: am I allowed to do that [remove an MTRR set by the firmware] in general? If not, what am I supposed to do? I believe that there is some kind of solution to this issue, since Linux's framebuffer on the same machine is as fast as expected.

Please tell me that there are other solutions than "surrender" and deal with the IOMMU in order to re-map the framebuffer.

Thanks a lot for the help guys,
Vlad

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sat Sep 01, 2018 12:39 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

vvaltchev wrote:
My first question is:
What's the point of having such a big region of physical memory marked as uncacheable?
My machine has 16 GB of physical memory, so I'd expect that region to be a perfectly regular (WB) part of the RAM. Usually the memory regions used for memory-mapped I/O are much smaller, a few MBs at most (like other MTRRs, on the same machine). There must be something I'm missing in the big picture.


This is a "memory mapped device hole" that has nothing to do with RAM - it's used for memory mapped PCI devices plus various pieces of hardware built into the chipset (e.g. firmware ROM, local APIC, HPET, ...). Each video card's frame buffer is also in this hole as a memory mapped PCI device area (even if it's an integrated video card using recycled "stolen RAM" under the hood, it's not considered RAM and is considered a memory mapped PCI device).

Almost everything in the "memory mapped device hole" needs to be uncached, and for things that the OS might or might not change (which is typically framebuffers and nothing else) uncached is a good/safe default for firmware to use, partly because it avoids compatibility problems with software that isn't expecting the peculiar nuances of "write-combining" (which breaks/weakens the standard memory ordering model 80x86 uses) and partly because you don't want firmware to be 500 MiB of bugs due to special cases.

vvaltchev wrote:
My second question is:
So, what could I do to make the framebuffer's memory WC? I tried to just invalidate MTRR entry 0, hopefully in the proper way, as described in Intel's System Programming Guide, Section 11.11.7.2, but it did not work. The screen just froze while trying to re-enable the MTRRs after the change [I implemented pre_mtrr_change() and post_mtrr_change() as described in Intel's documentation].


By disabling MTRR entry 0, the CPU would have fallen back to the "default type" for the entire area. If the default type is "write-back" (which is very likely, given you had a large "uncached" area to begin with) then you would have made everything (all memory mapped areas used by all PCI devices, plus ROM, HPET, local APIC, IO APIC, ...) "write-back", and would've caused problems for everything (including the firmware's SMM code) that tries to use any of these pieces of hardware. It's a little bit like a brain surgeon using a hand grenade to remove a patient's blood clot.

vvaltchev wrote:
If necessary, I can copy-paste my code here, but the theoretical problem is: am I allowed to do that [remove an MTRR set by the firmware] in general? If not, what am I supposed to do? I believe that there is some kind of solution to this issue, since Linux's framebuffer on the same machine is as fast as expected.


You're allowed to reconfigure the MTRRs (if you do it carefully). Specifically; you need to prepare by deciding what each area of the entire physical address space should be (while honouring the "total number of variable range MTRRs" restriction, plus taking into account any model specific quirks); and then, once you've prepared, there's a special sequence to ensure that cache coherency doesn't cause corruption (involving disabling caches on all CPUs, flushing all cached data with WBINVD on all CPUs, then reprogramming the MTRRs on all cores, then re-enabling caches on all CPUs) where all CPUs have to do this in "lock-step", synchronised at each step.
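The sequence in question (Intel SDM Vol. 3, "MTRR Considerations in MP Systems") looks roughly like the following on each CPU, with all CPUs rendezvousing between steps. This is a pseudocode sketch with hypothetical helper names, not drop-in code:

```
/* Pseudocode; every CPU executes this in lock-step. */
disable_interrupts();
enter_no_fill_cache_mode();   /* CR0.CD = 1, CR0.NW = 0               */
wbinvd();                     /* flush and invalidate all caches      */
flush_tlb();                  /* toggle CR4.PGE, or reload CR3        */
disable_mtrrs();              /* IA32_MTRR_DEF_TYPE.E = 0             */
update_mtrrs();               /* reprogram the base/mask MSR pairs    */
enable_mtrrs();               /* IA32_MTRR_DEF_TYPE.E = 1             */
wbinvd();
flush_tlb();
leave_no_fill_cache_mode();   /* CR0.CD = 0                           */
enable_interrupts();
```

Getting the "all CPUs at each step" synchronisation wrong is one classic way to end up with the kind of undebuggable freeze described earlier in the thread.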

vvaltchev wrote:
Please tell me that there are other solutions than "surrender" and deal with the IOMMU in order to re-map the framebuffer.


The best solution is to optimise your code so that write-combining makes very little difference.

For example:
  • your "memcpy_256_nt" is likely to be significantly slower than a simple "rep movsd";
  • you're copying one line at a time, even in the "bytes_per_pixel * horizontal_resolution == pitch" case where you could do a single "memcpy()";
  • you're relying on the compiler to optimise for you (including "-O3", which almost never improves performance) but then preventing the compiler from being able to optimise at all (e.g. splitting things across multiple compilation units so the compiler can't inline functions, etc.);
  • you're overusing "memcpy()" (when you could be doing a small number of explicit writes) and then using "special" functions ("memcpy32()") that prevent the compiler from generating optimal code when the size could/should be constant;
  • you're not hoisting branches out of loops (e.g. "for each line { if(bpp == 32) .." rather than "if(bpp == 32) { for each line {...");
  • for some cases (horizontal lines at 24-bpp) you're doing "fb_draw_pixel()" loops that re-calculate the address of every pixel, when they could be optimised to do groups of 4 pixels as 3 aligned 32-bit writes with a "next_address += 12" every 4 pixels, instead of "address = buf + y * pitch + x * 3" for every individual pixel;
and so on.

After you've got it down to about 1 cycle per pixel; then try to enable write-combining to get it down to slightly less than 1 cycle per pixel. ;)


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sat Sep 01, 2018 4:44 pm 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Hi Brendan,

thank you very much for your exhaustive response to my questions! Now I understand why resetting MTRR[0] on that machine was such a big deal. I got a freeze that was undebuggable (no faults, nothing). Your theory that I may have messed with SMM's expectations makes perfect sense to me.

Brendan wrote:

vvaltchev wrote:
Please tell me that there are other solutions than "surrender" and deal with the IOMMU in order to re-map the framebuffer.


The best solution is to optimise your code so that write-combining makes very little difference.

For example:
  • your "memcpy_256_nt" is likely to be significantly slower than a simple "rep movsd";
  • you're copying one line at a time, even in the "bytes_per_pixel * horizontal_resolution == pitch" case where you could do a single "memcpy()";
  • you're relying on the compiler to optimise for you (including "-O3", which almost never improves performance) but then preventing the compiler from being able to optimise at all (e.g. splitting things across multiple compilation units so the compiler can't inline functions, etc.);
  • you're overusing "memcpy()" (when you could be doing a small number of explicit writes) and then using "special" functions ("memcpy32()") that prevent the compiler from generating optimal code when the size could/should be constant;
  • you're not hoisting branches out of loops (e.g. "for each line { if(bpp == 32) .." rather than "if(bpp == 32) { for each line {...");
  • for some cases (horizontal lines at 24-bpp) you're doing "fb_draw_pixel()" loops that re-calculate the address of every pixel, when they could be optimised to do groups of 4 pixels as 3 aligned 32-bit writes with a "next_address += 12" every 4 pixels, instead of "address = buf + y * pitch + x * 3" for every individual pixel;
and so on.

After you've got it down to about 1 cycle per pixel; then try to enable write-combining to get it down to slightly less than 1 cycle per pixel. ;)

Brendan


About that, I just wanted to point out that my performance test has nothing to do with my framebuffer console implementation. I wrote the following code to test the framebuffer's performance:

Code:
void selftest_fb_perf_manual()
{
   const int iters = 20;
   u64 start, duration, cycles;

   start = RDTSC();

   for (int i = 0; i < iters; i++) {

      fb_raw_color_lines(0, fb_get_height(),
                         vga_rgb_colors[i % 2 ? COLOR_WHITE : COLOR_BLACK]);

      if (framebuffer_vi.flush_buffers) // Also test the double-buffered case;
         fb_full_flush();               // the primary test writes directly to the framebuffer.
   }

   duration = RDTSC() - start;
   cycles = duration / (iters);

   u64 pixels = fb_get_width() * fb_get_height();
   printk("fb size (pixels): %llu\n", pixels);
   printk("cycles per redraw: %llu\n", cycles);
   printk("cycles per 32 pixels: %llu\n", 32 * cycles / pixels);

   fb_draw_banner(); // You can completely ignore this
   fb_flush_banner(); // You can completely ignore this
}

fb_raw_color_lines() is implemented as:

Code:
void fb_raw_color_lines(u32 iy, u32 h, u32 color)
{
   if (fb_bpp == 32) {
      memset32((void *)(fb_vaddr + (fb_pitch * iy)),
               color, (fb_pitch * h) >> 2);
   } else {

      // Generic (but slower version)
      for (u32 y = iy; y < (iy + h); y++)
         for (u32 x = 0; x < fb_width; x++)
            fb_draw_pixel(x, y, color);
   }
}



So, in practice, I use just a plain memset32() to write a constant value to the framebuffer and yes, I'm re-drawing the whole screen with just one rep stosl instruction (in memset32). It couldn't be simpler than that.
[Note: the "if (fb_bpp == 32)" is checked once per screen redraw, so 20 times. Nothing, in practice.]

Now, about my memcpy_256_nt(). Actually, my first implementation used just memcpy() and it was a lot slower.
That's why I decided to deal with the FPU, introduce the fpu_context etc., in order to use bigger registers and see what would happen. The result: I achieved a performance improvement proportional to the size of the register I was using to copy the data. For example, by using AVX2 registers, which are 256 bits wide, I was able to achieve roughly an ~8x improvement over plain 32-bit registers.

Brendan wrote:
you're not hoisting branches out of loops (e.g. "for each line { if(bpp == 32) .." rather than "if(bpp == 32) { for each line {..."), for some cases


You're probably looking at the failsafe functions [which use fb_draw_pixel()].
You might be curious to take a look at fb_draw_char_optimized_row(), which is the most important function drawing one row. Also, I've really optimized only the 32-bpp case, the only one I'm testing.
You're perfectly right that the bpp=24 case is not optimized, but that's because I care only about the bpp=32 case.

Anyway, there's one more thing I didn't get:

Brendan wrote:
After you've got it down to about 1 cycle per pixel; then try to enable write-combining to get it down to slightly less than 1 cycle per pixel. ;)


How could I achieve ~1 cycle per pixel with uncacheable memory?
As far as I know, an access to RAM (the cost of an L3 cache miss / the case where the memory is UC) takes roughly 100 ns on modern machines.

[Source: https://software.intel.com/en-us/forums ... pic/287236, which points out to this: https://software.intel.com/sites/produc ... _guide.pdf].
In practice, for a Xeon 5500 I could get roughly (copy-pasting the text):

Code:
Data Source Latency (approximate)

L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local Dram ~60 ns
Remote Dram ~100 ns


My CPU runs at 2.7 GHz. So in 100 ns my CPU will go through roughly:

Code:
           
2,700,000,000 * ( 100 / 1,000,000,000 ) = 270 cycles


Therefore, having something like ~250 cycles per (32-bit) pixel seems to me close to the best I could theoretically achieve with uncacheable memory on that machine. OK, the Intel PDF states that it can go as low as 60 ns, which means 162 cycles, but that depends on the hardware [...] I think you got the point.
Am I missing something in this analysis?

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sun Sep 02, 2018 12:25 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

vvaltchev wrote:
So, in practice I use just a plain memset32() to write a constant value on the framebuffer and yes, I'm re-drawing the whole screen with just one rep stosl instruction (in memset32). It couldn't be simpler than that.


For 24-bpp, it's not one memset32(). For 32-bpp it is one memset32(), but it's potentially wrong. For example, if the video mode is 1280 pixels wide and the video card rounds it up to a power of 2 so that the pitch is 8192 (i.e. each line has 1280*4 bytes of pixel data then 768*4 bytes of padding), you'd end up copying pixel data into the padding and it'll all be distorted.

You need an "if(bytes_per_pixel * horizontal_resolution == pitch)" to know if you can do a single copy or if it needs to be split into a copy per line.

vvaltchev wrote:
[Note: the "if (fb_bpp == 32)" is checked once per screen-redraw, so 20 times. Nothing in practice.]


That depends on if it's 24-bpp or not.

vvaltchev wrote:
Now, about my memcpy_256_nt(). Actually, my first implementation used just memcpy() and it was a lot slower.
That's why I decided to deal with the FPU, introduce the fpu_context etc. in order to use bigger registers and see what would happen. The result was: I achieved a performance improvement proportional to the size of the register I was using to copy the data. For example, by using AVX2 registers, which are 256-bit wide, I was able to achieve roughly an ~8x improvement over plain 32-bit registers.


In general; for older CPUs that don't support AVX, "rep movsb" works on 128-bit pieces and is equivalent to AVX1 anyway; and for CPUs that do support AVX most of them also support ERMSB which (according to Intel's optimisation guide) is as fast or faster than 128-bit AVX for copies larger than 128 bytes anyway. For AVX2 it's "hit or miss" - some CPUs implement it as "pairs of 128-bit" and it's no better (for write sizes) than AVX1 or SSE.

However, for AVX in general, the CPU turns it off to save power when it's not being used; when you first start using AVX it takes several thousand cycles to reach its max. speed, and then, shortly after that, the CPU says "hey, I'm using more power so I need to compensate" and reduces its clock speed. If you're constantly using AVX then it's mostly fine; but if you're only using it in specific places (e.g. only when copying data to the frame buffer) it can be a performance disaster, because you're always paying for the "slow startup".

vvaltchev wrote:
Brendan wrote:
you're not hoisting branches out of loops (e.g. "for each line { if(bpp == 32) .." rather than "if(bpp == 32) { for each line {..."), for some cases


You're probably looking at the failsafe functions [which use fb_draw_pixel()] .


I was looking at all of it (including the failsafe parts).

vvaltchev wrote:
Anyway, there's one more thing I didn't get:

Brendan wrote:
After you've got it down to about 1 cycle per pixel; then try to enable write-combining to get it down to slightly less than 1 cycle per pixel. ;)


How could I achieve ~1 cycle per pixel with uncacheable memory?
As far as I know, an access to RAM (the cost of an L3 cache miss / the case where the memory is UC) takes roughly 100 ns on modern machines.


You're mixing up latency with bandwidth. E.g. if it takes 100 ns for one write to complete, but there are 50 writes "in flight" at the same time, then on average you'd be completing a write every 2 nanoseconds.

You're also mixing up "cache miss" (where CPU has to fetch the entire 64-byte cache line into the cache before doing the write) with "uncached" (where CPU doesn't have to fetch anything into cache before doing the write).


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sun Sep 02, 2018 4:26 am 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Brendan wrote:
For 24-bpp, it's not one memset32(). For 32-bit it is one memset32() but it's potentially wrong. For example, if the video mode is 1280 pixels wide and the video card rounds it up to a power of 2 so that the pitch is 8192 (e.g. each line has 1280*4 bytes of pixel data then 768*4 bytes of padding), you'd end up copying pixel data into the padding and it'll all be distorted.

You need an "if(bytes_per_pixel * horizontal_resolution == pitch)" to know if you can do a single copy or if it needs to be split into a copy per line.

vvaltchev wrote:
[Note: the "if (fb_bpp == 32)" is checked once per screen-redraw, so 20 times. Nothing in practice.]


That depends on if it's 24-bpp or not.


Let's please forget the bpp=24 case. I have "some" support for it just in order to be able to show a warning
on the screen explaining that Tilck is not optimized for it.
I care exclusively about the bpp=32 case, and my bootloaders look only for 32-bpp resolutions among the available ones.

You're right about the missing "if(bytes_per_pixel * horizontal_resolution == pitch)" in that function. My bad.
Still, in all the configurations where I've run the test, "bytes_per_pixel * horizontal_resolution == pitch" was true,
so the results are reliable.

Now, do you trust the results of that test? In the context:
- bpp == 32
- bytes_per_pixel * horizontal_resolution == pitch
- framebuffer_vi.flush_buffers == NULL

A screen redraw using a constant color (either black or white) is done with just one rep stosl instruction.
If we agree to trust the test, we have to accept its results. Again, they are:

- ~250 cycles / pixel (the test reports about ~8200 cycles/32 pixel)
on my Dell XPS, using UC memory, no matter which resolution I use.

- ~1.7 cycles / pixel (the test reports about ~54.4 cycles/32 pixel)
on my UDOO x86, using WC memory.
Note: the UDOO is a single-board computer with hardware far inferior to that of the high-end Dell laptop.

Brendan wrote:
In general; for older CPUs that don't support AVX, "rep movsb" works on 128-bit pieces and is equivalent to AVX1 anyway; and for CPUs that do support AVX most of them also support ERMSB which (according to Intel's optimisation guide) is as fast or faster than 128-bit AVX for copies larger than 128 bytes anyway. For AVX2 it's "hit or miss" - some CPUs implement it as "pairs of 128-bit" and it's no better (for write sizes) than AVX1 or SSE.

However, for AVX in general, the CPU turns it off to save power when it's not being used, and when you first start using AVX it takes several thousand cycles to reach it's max. speed, and then shortly after that the CPU says "hey, I'm using more power so I need to compensate" and reduces its clock speed. If you're constantly using AVX then it's mostly fine; but if you're only using it in specific places (e.g. only when copying data to frame buffer) it can be a performance disaster because you're always paying for "slow startup".


I understand your argument, but my measurements show that using AVX2 actually pays off.
If you're curious, I can create another test like selftest_fb_perf_manual() that, like this one, is
completely independent from my console implementation, and uses the FPU registers (AVX2) to do the
memset instead of rep stosl.

Brendan wrote:
You're mixing up latency with bandwidth. E.g. if it takes 100 ns for one write to complete, but there are 50 writes "in flight" at the same time, then on average you'd be completing a write every 2 nanoseconds.


OK, I totally agree that latency != bandwidth. My bad for mixing them up.
But I don't agree about having 50 writes "in flight" at the same time. As far as I know, a few (2-8) writes at most can be done in parallel, but I doubt that the CPU is allowed to do that for UC memory. That typically happens when the CPU flushes (writes back) a cache line, or when the memory is write-combining and the CPU is allowed to "do anything it wants". I might be wrong, but the empirical evidence at the moment shows me that apparently no fancy parallelization occurs. That will be even more evident if I make the test use 256-bit AVX2 registers and the redraw gets ~8x faster.

If you have any official paper, or simple benchmark code like mine that I can run directly on hardware, that can prove otherwise, I'd really appreciate it.

Vlad

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sun Sep 02, 2018 1:16 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

vvaltchev wrote:
Now, do you trust the results on that test? In the context:
- bpp == 32
- bytes_per_pixel * horizontal_resolution == pitch
- framebuffer_vi.flush_buffers == NULL

A screen redraw using a constant color (either black or white), is just done with 1 rep stosl instruction.
If we agree to trust the test, we have to accept its results. Again, they are:

- ~250 cycles / pixel (the test reports about ~8200 cycles/32 pixel)
on my Dell XPS, using UC memory, no matter which resolution I use.

- ~1.7 cycles / pixel (the test reports about ~54.4 cycles/32 pixel)
on my UDOO x86, using WC memory.
Note: the Udoo is a single-board computer with a hardware by far inferior than the high-end Dell laptop.


Can you do an "apples vs. apples" test; like the UDOO with WC vs. the UDOO without WC and everything else the same?

For these figures, it's impossible to say what effect WC had. For one random example, maybe the Dell uses AVX2 and the UDOO doesn't and AVX2 kills performance.

Also note that I think you're using non-temporal stores for AVX (I can't be sure - I didn't find it in your source code and think it's in a missing header file), and non-temporal stores use the same underlying mechanics as WC (in both cases the data is stored in write-combining buffers, bypassing cache).

vvaltchev wrote:
Brendan wrote:
In general; for older CPUs that don't support AVX, "rep movsb" works on 128-bit pieces and is equivalent to AVX1 anyway; and for CPUs that do support AVX most of them also support ERMSB which (according to Intel's optimisation guide) is as fast or faster than 128-bit AVX for copies larger than 128 bytes anyway. For AVX2 it's "hit or miss" - some CPUs implement it as "pairs of 128-bit" and it's no better (for write sizes) than AVX1 or SSE.

However, for AVX in general, the CPU turns it off to save power when it's not being used, and when you first start using AVX it takes several thousand cycles to reach it's max. speed, and then shortly after that the CPU says "hey, I'm using more power so I need to compensate" and reduces its clock speed. If you're constantly using AVX then it's mostly fine; but if you're only using it in specific places (e.g. only when copying data to frame buffer) it can be a performance disaster because you're always paying for "slow startup".


I understand your argument, but my measurements show that using AVX 2 it is actually convenient.
Actually, if you're curious, I can create another test like selftest_fb_perf_manual(), that will be like this one,
completely independent from my console implementation, and use the FPU registers (AVX 2) to do the
memset, instead of using rep stosl.


For the purpose of a pure "with/without WC" throughput test; it'd make more sense to delete that part entirely (just copy the same data from double buffer to frame buffer without changing any pixels).

For the purpose of testing "real world performance"; it'd make more sense to modify a few characters and then update the screen (to simulate someone typing and cursor moving) instead of testing a rare "all pixels change every frame" pathological case; where things like avoiding copying data that didn't change (from double buffer to frame buffer) will make the "pure with/without WC throughput test" seem as irrelevant as it should be.

vvaltchev wrote:
Brendan wrote:
You're mixing up latency with bandwidth. E.g. if it takes 100 ns for one write to complete, but there are 50 writes "in flight" at the same time, then on average you'd be completing a write every 2 nanoseconds.


OK, I totally agree that latency != bandwidth. My bad for mixing them up.
But I don't agree about having 50 writes "in flight" at the same time. As far as I know, a few (2-8) writes at most can be done in parallel, but I doubt that the CPU is allowed to do that for UC memory. That typically happens when the CPU flushes (writes back) a cache line, or when the memory is write-combining and the CPU is allowed to "do anything it wants". I might be wrong, but the empirical evidence at the moment shows me that apparently no fancy parallelization occurs. That will be even more evident if I make the test use 256-bit AVX2 registers and the redraw gets ~8x faster.

If you have any official paper or simple benchmark code like mine that I can run directly on hardware and that can prove otherwise I'll really appreciate that.


When I first wrote that sentence I used "1234 writes "in flight" at the same time", because I always tend to use "1234" in examples where the point has nothing to do with what the number is (e.g. where the point is to highlight the difference between bandwidth and latency and any number larger than 1 will do that). Then (while writing the other part of the paragraph) I was too lazy to calculate "100 ns / 1234" so I changed the original 1234 to 50 to make the division easy (so I could just write "2 ns" instead of finding a calculator and writing "~0.081 ns").


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sat Sep 08, 2018 6:21 am 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Brendan wrote:
Can you do an "apples vs. apples" test; like the UDOO with WC vs. the UDOO without WC and everything else the same?


Sure. I also added a flag to the test that controls FPU usage. The original test used only REP STOSL.
Here are my results:

Code:
UDOO x86:

   No FPU (rep stosl, 32-bit registers):

      - WC:   56 cycles/32 pixel   =   ~1.75 cycles/pixel
      - UC: 5657 cycles/32 pixel   = ~176.80 cycles/pixel

   FPU (SSE 2, 128-bit registers):

      - WC:   59 cycles/32 pixel   =  ~1.84 cycles/pixel
      - UC: 1417 cycles/32 pixel   = ~44.28 cycles/pixel



A few observations:

  • The impact of using write-combining memory is HUGE on that machine: about a ~100x performance improvement.
  • Using FPU registers when the memory is WC makes a small (negative) difference.
  • In the case of UC memory, the improvement from using FPU registers is proportional to their size: almost 4x in this case.
  • Observing the avg. number of cycles per write to UC memory, it is clear that no parallelism exists, and that perfectly explains why, by using 4x wider registers, we get about 4x the throughput. In the WC case, the CPU is allowed to use the full bandwidth of the memory and, at least on that hardware, we can get even 100x more throughput.

On the Dell machine, the effect on UC memory is the same: I'm able to reach almost an 8x improvement, because the AVX2 registers are 256 bits wide (the UDOO does not support AVX).

Therefore, clearly, using WC memory is the key to reaching good performance.
My effort in using the FPU was, in practice, a work-around to mitigate the effects of UC memory. To some degree, that work-around did a great job at smaller resolutions like 800x600, where the absolute number of cycles needed to redraw the whole screen is not that high, but at the native resolution (3200x1800) the high per-pixel cost gets very noticeable.

Now, I'll have to find a way to somehow mark the framebuffer's memory region as WC, but without removing the whole MTRR[0], which marks the range [+2 GB, +4 GB] as UC. But I have no idea how. As far as I know, PAT gets overridden by the MTRRs, so it seems useless in this case, right?


--------------------------------------------------------------------------------------------------------------------------------------------------------------

Note:
In case anybody is interested in looking at the code of my test, the function is called "internal_selftest_fb_perf(bool)".
To run the test from Tilck's shell, just run:

Code:
selftest fb_perf


Or, for the FPU version
Code:
selftest fb_perf_fpu


In order to see the system's memory map + MTRRs, just press F3. It's easy to see whether the framebuffer (FBUF) is in a UC region or not on your hardware.

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sat Sep 08, 2018 8:06 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

vvaltchev wrote:
Brendan wrote:
Can you do an "apples vs. apples" test; like the UDOO with WC vs. the UDOO without WC and everything else the same?


Sure. I also added a flag to the test that enables the FPU path; the original test used only REP STOSL.


If I remember correctly; "REP STOSL" was only used to set data in RAM (which has nothing to do with frame buffer bandwidth); and then "something else" was used to copy the data from RAM to the frame buffer (where that "something else" may have been "REP MOVSB" to take advantage of ERMSB in Ivy Bridge or later, but may not have been).

vvaltchev wrote:
Here are my results:

Code:
UDOO x86:

   No FPU (rep stosl, 32-bit registers):

      - WC:   56 cycles/32 pixel   =   ~1.75 cycles/pixel
      - UC: 5657 cycles/32 pixel   = ~176.80 cycles/pixel

   FPU (SSE 2, 128-bit registers):

      - WC:   59 cycles/32 pixel   =  ~1.84 cycles/pixel
      - UC: 1417 cycles/32 pixel   = ~44.28 cycles/pixel



A few observations:

  • The impact of using write-combining memory is HUGE on that machine: about a ~100x performance improvement.
  • Using FPU registers when the memory is WC makes little (slightly negative) difference.
  • With UC memory, the improvement from using FPU registers is proportional to their size: almost 4x in this case.
  • Looking at the average number of cycles per write to UC memory, it is clear that no parallelism exists, which perfectly explains why 4x-wider registers give about 4x the throughput. In the WC case, instead, the CPU is allowed to use the full bandwidth of the memory, and, at least on that hardware, we get up to 100x more throughput.

On the Dell machine, the effect on UC memory follows the same pattern: I'm able to reach almost an 8x improvement because the AVX2 registers are 256 bits wide (the UDOO does not support AVX).

Therefore, clearly, using WC memory is the key to reaching good performance.


Well, no.

For some cases (console, GUI, ...) the key to reaching good performance is avoiding copying data to the frame buffer in the first place. For other cases (full screen games where all pixels always change every frame) the key to reaching good performance is not using the CPU at all.

Using WC and SSE/AVX is the key to good meaningless benchmark results for a silly pathological case that is almost entirely irrelevant for all cases in practice. What happens is that once you start trying to do anything that isn't irrelevant the "magical" performance advantages disappear because you're no longer doing a nice "extremely prefetchable" special case where everything happens to be perfectly aligned and everything happens to fill each 256-bit register or a whole write combining buffer.

Also note that I think both of your computers are using integrated Intel graphics (without any eDRAM); and you should expect different results for your irrelevant benchmark when there's a discrete video card (and you have to push all that data through a PCI bus) and when there is eDRAM (and you're writing to something faster than RAM). I also assume you're not testing the "video mode requires padding at the end of each line" case where you can't just have a single "do the whole frame buffer" loop (and need nested loops with extra overhead). Finally, I strongly suspect that the code the compiler generates for the SSE and AVX cases is very inefficient and that this is poisoning your results.

Note that I'm not saying WC shouldn't help; it's just that WC should be used as a last resort (and shouldn't be used to salvage a few scraps of performance when everything that actually matters isn't optimised). For perspective; what I tend to expect is 20 different "copy data from buffer in RAM to frame buffer" functions, where the initialisation code sets a function pointer to point to whichever function makes sense; where none of the functions have any unnecessary branches and none of the functions call anything, and most of the functions use "mostly pure" assembly (including things like using prefetch hints and cache flushes where appropriate, and including making sure that all available SSE/AVX registers are used in parallel and that it's not just a single register being pounded in a loop). For the most recent version/s of my own code I go one step further - I generate code at run-time (during initialisation) so that "variables" (e.g. horizontal, vertical resolution, bytes between lines/pitch) become constants baked directly into the executed code.
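The "function pointer set once at initialisation" approach Brendan describes can be sketched in a few lines of C. This is a hedged illustration, not Brendan's actual code: the capability flag and the two routine names are made up, and a real implementation would have many specialised variants (SSE, AVX, padded-line, format-converting, ...) rather than two:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One specialised "copy buffer to framebuffer" routine per situation. */
static void blit_generic(uint32_t *fb, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fb[i] = src[i];
}

static void blit_wide(uint32_t *fb, const uint32_t *src, size_t n)
{
    /* stand-in for an SSE/AVX variant; behaviour must be identical */
    memcpy(fb, src, n * sizeof(*fb));
}

/* The hot path always calls through this pointer: no branches, no
 * capability checks at draw time. */
static void (*blit)(uint32_t *, const uint32_t *, size_t);

/* Called once at init: pick the best routine for this machine. */
static void blit_init(int cpu_has_wide_regs)
{
    blit = cpu_has_wide_regs ? blit_wide : blit_generic;
}
```

The point of the pattern is that all per-machine decisions (register width, pixel format, pitch) are paid once at boot, and the per-frame code is a single indirect call.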

vvaltchev wrote:
Now, I'll have to find a way to somehow mark the framebuffer memory region as WC, but without removing the whole MTRR[0] which marks the the range [+2 GB, +4 GB] as UC. But I have no idea how. As far as I know PAT gets overriden my MTRRs, so it seem useless in this case, right?


For PAT, it depends on which CPU it is. Mostly:
  • Ancient CPUs (Pentium, 80486, 80386) don't support WC at all and nothing you do can change that
  • Old CPUs (Pentium Pro, Pentium II) don't support PAT and the only way to get WC is with MTRRs.
  • Pentium III was supposed to work like newer CPUs, but there are errata (CPU bugs) that cause it to work wrong; so for these CPUs, if the MTRR says UC then PAT can't change it and make it WC instead.
  • For newer CPUs (Pentium 4, ...), if the MTRRs say UC and PAT says WC, then it becomes WC (a special-case hack that breaks the "most conservative wins" rule used for everything else).

Note: I'm not sure about CPUs from other vendors (e.g. AMD, VIA, etc) - you'd have to check the datasheets for each of them.
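Where PAT works, the effective memory type of a page is picked from the 8-entry IA32_PAT MSR, indexed by three page-table bits. The index arithmetic is trivial; here is a sketch for 4 KiB pages (bit positions per Intel's SDM, and 0x0007040600070406 is the documented power-on default value of IA32_PAT):

```c
#include <stdint.h>

/* PTE flag bits relevant to PAT (4 KiB pages) */
#define PTE_PWT (1ull << 3)   /* write-through */
#define PTE_PCD (1ull << 4)   /* cache disable */
#define PTE_PAT (1ull << 7)   /* PAT bit       */

/* Index into the 8-entry IA32_PAT MSR selected by a PTE. */
static unsigned pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4u : 0u) |
           ((pte & PTE_PCD) ? 2u : 0u) |
           ((pte & PTE_PWT) ? 1u : 0u);
}

/* Extract the 3-bit memory type stored in IA32_PAT for that index
 * (0 = UC, 1 = WC, 4 = WT, 6 = WB, ...). */
static unsigned pat_type(uint64_t ia32_pat, unsigned index)
{
    return (unsigned)((ia32_pat >> (index * 8)) & 0x7);
}
```

So, to get WC via PAT, the kernel programs one IA32_PAT entry to type 0x01 and maps the framebuffer pages with the PWT/PCD/PAT bit combination that indexes that entry.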


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sat Sep 08, 2018 10:27 am 

Joined: Fri Oct 27, 2006 9:42 am
Posts: 1925
Location: Athens, GA, USA
@vvaltchev: I think that Brendan's point about the performance needs of different use cases may need to be looked at in more detail. I will not pretend to be an expert on this topic, but I do agree that the focus on the speed of block copying into the FB is misplaced.

It sounds as if your current approach is to perform all of the rendering in software, writing to a buffer in general memory, and once this is done, copying the entire resulting rendered page to the frame buffer.

While this is a fairly general approach to this, mostly independent of the GPU and certainly better than writing changes directly to the framebuffer as you compute the rendering, as Brendan points out, for the majority of use cases this is not an ideal approach.

Unfortunately, as Brendan also mentioned, two of the most common scenarios have essentially opposite solutions.

For use cases where the screen changes infrequently, and most changes only affect a small part of the rendered screen, it is often better to write to just those bytes of the frame buffer which are affected. Now, this does not mean you should write the changes to the frame buffer directly; that is still generally a bad idea. Rather, you may still want to render outside of the frame buffer, but use various techniques to ensure that the changed elements, and only those changed elements, get written to the frame buffer when the time comes. This is where things such as bounding boxes, clipping methods, and image masks come in.

Conversely, in cases where the majority of the screen is being updated frequently, the best approach is to let the GPU do the heavy lifting of rendering the screen - that is what it is mainly for, after all. Even with a minimal GPU, such as the ones integrated into Intel CPUs, knowing how - and when - to hand a rendering process over to the GPU is an important part of designing a video system.

A truly general approach, though, requires the use of both, and having at least some ability to tell which to use when. Even with a mostly-static screen, the rendering of certain elements - such as the cursor sprite and other elements which move more or less independently from the rest of the screen - is often best delegated to the GPU.

On one final note: your code and post mention double buffering, but you seem to be referring to the general-memory buffer as your backing buffer. While I do not have much experience with this myself, my understanding is that in the case of a GPU with its own dedicated memory, an effective double buffering technique requires that there be two independent buffers within the GPU memory, to allow the video system to flip between them in a single atomic action. Use of a separate rendering buffer in general memory and then copying from that to the frame buffer is not the same thing as true double buffering, AFAIK.

Even with an integrated video memory, to get proper double-buffering, you need to be able to map the backing buffer into the video memory, so the page flip gets done by the GPU. The backing buffer becomes the new facing buffer, and vice versa. For the page flipping itself, no copying should be needed.

Now, using such a general-memory buffer for software rendering is often used in conjunction with double-buffering, and this is sometimes (a bit misleadingly) referred to as 'triple-buffering', but in those cases, the purpose is to avoid writing to the frame buffer on the fly, regardless of whether the memory being accessed is facing, backing, or compute/scratch.

The reason for this has to do with how the CPU and GPU communicate with each other, and with the different types of memory involved. While the GDRAM used by a dedicated card is often faster overall than the system DRAM, and more quickly accessed by the GPU than general memory is by the CPU, access to the GDRAM by the CPU has to pass through the relatively slow PCI bus or some similar sub-system. IIUC, this means that a write to GDRAM by the CPU may be slower than a write to general memory by one or more orders of magnitude, regardless of caching and other considerations.

This also becomes a significant factor when the GPU needs more memory than there is available GDRAM, as it then has to spill to general memory. In this instance, once again the reads and writes have to pass across the bus connecting the GPU to the general-purpose RAM, with a resulting loss of performance.

This is much less of a consideration when the GPU is using general memory rather than a dedicated memory, as is usually the case with an iGPU, but that has its own issues. These include the specific problem you are talking about where the video memory needs to be uncached.

_________________
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.


 Post subject: Re: Unable to mark a memory region as WC using MTRRs [solved
PostPosted: Sat Sep 08, 2018 12:16 pm 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Brendan wrote:
If I remember correctly; "REP STOSL" was only used to set data in RAM (which has nothing to do with frame buffer bandwidth); and then "something else" was used to copy the data from RAM to the frame buffer (where that "something else" may have been "REP MOVSB" to take advantage of ERMSB in Ivy Bridge or later, but may not have been).


Hi Brendan,
Actually, it is not. The perf test I wrote never used the double buffer (yes, there was an IF, but I stated clearly that I'm not testing the case with the double buffer ON). The double buffer was used by the fb_console, but I never wanted to talk about that. All along I only wanted to talk about the performance of a simple test (~30 lines of code) writing directly to the framebuffer (with rep stosl and, in my latest post, also with SSE 2 registers), with UC and with WC memory.
The reason was: in one case, we're talking about my specific implementation, which people might like or not, and which might be good or bad; in the other case, we're talking about something interesting to everybody, because everybody can experience it: the raw performance of the framebuffer, in the best case, using a trivial test that just writes to it. This second case matters to everybody, and that's why I wrote that specific test and made our discussion about its results.
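For clarity, the shape of such a trivial test is roughly the following. This is a sketch, not Tilck's actual internal_selftest_fb_perf(): it fills an ordinary heap buffer (the real test writes to the machine-specific framebuffer address), and the cycle counts plugged into the arithmetic are the ones quoted in this thread:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill n pixels with a 32-bit colour: what `rep stosl` does, in C terms.
 * `volatile` keeps the compiler from eliding the stores, as writes to a
 * real framebuffer must not be elided. */
static void fb_fill(volatile uint32_t *fb, uint32_t color, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fb[i] = color;
}

/* Cycles measured for a batch of pixels -> cycles per pixel. */
static double cycles_per_pixel(uint64_t cycles, unsigned pixels)
{
    return (double)cycles / pixels;
}
```

Feeding in the UDOO numbers from this thread: cycles_per_pixel(56, 32) gives 1.75 for the WC case, and cycles_per_pixel(5657, 32) gives ~176.8 for the UC case.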

Brendan wrote:
For newer CPUs (Pentium 4, ...) if the MTRRs say UC and PAT says WC, then it becomes WC (it was a special case hack to make it break the "most conservative wins" rule used for everything else).


That's actually the only thing that helped me. Thanks a lot for saying that. I realized that I had a small bug in the function calling set_pages_pat_wc() for the framebuffer, and that's why it didn't work (black screen). Once I understood that PAT can override MTRRs, I spent some time debugging the issue in my code and fixed it. The result? Impressive. Now on the Dell laptop I'm able to mark the framebuffer's memory as WC, and it got fast as hell!

In numbers, the same perf test now shows:
~6.6 million cycles per redraw [3200 x 1800 x 32 bbp] (again, rep stosl, no fpu, DIRECT WRITE => NO DOUBLE-BUFFER)
36 cycles / 32 pixels = ~1.125 cycles / pixel. Previously it was ~250 cycles/pixel. Improvement: ~222x.

So, I'm really happy about that.

Now, if anybody is interested in the performance of my console implementation, we could talk about that too (with numbers, mostly).

Now that I've fixed the low-level problem with the framebuffer, we can compare the raw performance test, which just writes a constant 32-bit value (color) to the entire screen, with the whole console implementation, which is capable of drawing characters from a scroll buffer. In other words: how many cycles will it take, in the worst case, to redraw the whole screen, character by character, while scrolling up and down?

According to my measurements, in the worst case: 7.1 million cycles (again using 3200x1800x32 bpp).
[You can see that benchmark by pressing F4, if you run Tilck].

If we compare that to the numbers from the raw performance test, where a full redraw took about ~6.6 million cycles, we can measure a ~7.5% overhead caused by all of my upper-level logic. I'm pretty happy with that: from the user-experience point of view, there is no visible "scrolling effect" at all anymore.

Now I've gotten curious to write an actual console benchmark (using a user-mode application), which tests the whole stack by writing characters all over the screen and then erasing it. How many cycles will a full-screen redraw take, character by character? Since the whole point of Tilck is running Linux applications natively, I could run the same program on Linux and compare the results. Probably my console will be slower than the Linux one, but I'll be happy even if the difference turns out to be not that large.

Vlad

P.S. I've committed my latest changes to the master branch, so if anybody is curious to run the test on a VM and/or real hardware, it's now possible to do that. I'll be happy to see the same test run on machines completely different from mine and compare the results.

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs[solved]
PostPosted: Sat Sep 08, 2018 2:47 pm 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Schol-R-LEA wrote:
@vvaltchev: I think that Brendan's point about the performance needs of different use cases may need to be looked at in more detail. I will not pretend to be an expert on this topic, but I do agree that the focus on the speed of block copying into the FB is misplaced.


Hi Schol-R-LEA,
thanks for your analysis. I agree that the "speed of block copying into the FB" is only part of the whole picture, but I hope you'd agree it might be a bottleneck. I started investigating this performance issue when I noticed that the framebuffer console was far faster on QEMU + KVM than on real hardware. That allowed me to identify writes to the framebuffer as the bottleneck.

Schol-R-LEA wrote:
It sounds as if your current approach is to perform all of the rendering in software, writing to a buffer in general memory, and once this is done, copying the entire resulting rendered page to the frame buffer.


That was true, but not for the test I talked about here the whole time; therefore it should be irrelevant. But anyway, I can explain the point of using a shadow buffer in (regular) RAM: I used it to avoid writing to the framebuffer itself as much as possible. Whenever possible, I copied from it to the framebuffer only the rows that had changed. That led to significant performance improvements in my case (super-slow raw FB access). It also optimized the scroll-by-one-line case: instead of redrawing the whole screen character by character (moderately expensive, but not too much) or scrolling up using the framebuffer itself both for reading and writing (insanely slow), I did everything in the shadow buffer and then flushed that buffer to the framebuffer as fast as I could. Again, I measured every single change, and that configuration was the best I was able to achieve before today, when I finally succeeded in marking the FB memory as write-combining. But again, the shadow buffer was an implementation detail that Brendan noticed while looking at my code; it did not affect my test.
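The shadow-buffer-with-dirty-rows idea can be sketched like this. This is an illustrative reconstruction, not Tilck's fb_console code: the dimensions are tiny placeholders, and "one uint32_t per pixel" stands in for whatever the real row layout is:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FB_W 80          /* illustrative sizes, not Tilck's */
#define FB_H 25

static uint32_t shadow[FB_H][FB_W];   /* rendered in ordinary (WB) RAM */
static bool     dirty[FB_H];          /* rows needing a flush          */

/* All rendering goes to the shadow buffer and marks the row dirty. */
static void put_pixel(unsigned x, unsigned y, uint32_t c)
{
    shadow[y][x] = c;
    dirty[y] = true;
}

/* Copy only the rows that changed since the last flush; returns how
 * many rows were written to the (slow) framebuffer. */
static unsigned flush(uint32_t fb[FB_H][FB_W])
{
    unsigned flushed = 0;

    for (unsigned y = 0; y < FB_H; y++) {
        if (!dirty[y])
            continue;
        memcpy(fb[y], shadow[y], sizeof(shadow[y]));
        dirty[y] = false;
        flushed++;
    }
    return flushed;
}
```

The win is that the expensive stores to framebuffer memory are limited to rows that actually changed, while all reads (for scrolling, blending, etc.) hit fast regular RAM.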

Schol-R-LEA wrote:
On one final note: your code and post mention double buffering, but you seem to be referring to the general-memory buffer as your backing buffer. While I do not have much experience with this myself, my understanding is that in the case of a GPU with its own dedicated memory, an effective double buffering technique requires that there be two independent buffers within the GPU memory, to allow the video system to flip between them in an single atomic action. Use of a separate rendering buffer in general memory and then copying from that to the frame buffer is not the same thing as true double buffering, AFAIK.

Even with an integrated video memory, to get proper double-buffering, you need to be able to map the backing buffer into the video memory, so the page flip gets done by the GPU. The backing buffer becomes the new facing buffer, and vice versa. For the page flipping itself, no copying should be needed.

Now, using such a general-memory buffer for software rendering is often used in conjunction with double-buffering, and this is sometimes (a bit misleadingly) referred to as 'triple-buffering', but in those cases, the purpose is to avoid writing to the frame buffer on the fly, regardless of whether the memory being accessed is facing, backing, or compute/scratch.

The reason for this has to do with how the CPU and GPU communicate with each other, and with the different types of the memory. While the GDRAM used by a dedicated card is often faster overall than the system DRAM, and more quickly accessed by the GPU than the general memory is by the CPU, access to the GDRAM by the CPU has to pass through the relatively slow PCI bus or some similar sub-system. IIUC, this means that a write to GDRAM by the CPU may be slower than a write to general memory by one or more orders of magnitude, regardless of caching and other considerations.

This also becomes a significant factor when the GPU needs more memory than there is available GDRAM, as it then has to spill to general memory. In this instance, once again the reads and writes have to pass across the bus connecting the GPU to the general-purpose RAM, with a resulting loss of performance.

This is much less of a consideration when the GPU is using general memory rather than a dedicated memory, as is usually the case with an iGPU, but that has its own issues. These include the specific problem you are talking about where the video memory needs to be uncached.


I understand, and I fully agree with your general point about using a buffer in the GPU for double-buffering: with a single instruction (or a few instructions at most) we should be able to tell the GPU to swap the two buffers. That's the way it is supposed to be, and that's how real GPU drivers work. And that's also my problem with doing it: writing a real GPU driver is A LOT of work, and there are a ton of different video cards out there. Even if I wrote a perfect driver, it would cover just my GPU. Just consider the size of the Linux driver for my integrated GPU (family: Intel i915): 152,000 lines of code (wc -l *.c). In comparison, my whole project is about 32,000 lines of code. For those reasons, using the GPU properly and writing a real video driver was never (and never will be) part of my goals. I just wanted to achieve the best possible with a generic framebuffer obtained from the bootloader, which in turn uses either VBE or EFI's native graphics support: those mechanisms are pretty generic and work on almost all PCs, no matter which GPU is installed.
Clearly, what can be achieved this way is pretty limited compared to the real power of any modern GPU, even an integrated one. Still, it was worth trying to achieve the best that can be achieved within this limitation.

Note: now that the framebuffer uses WC memory, I removed the shadow buffer from the fb_console since it stopped being helpful: now it is faster to write directly to the framebuffer.

Anyway, many thanks to both of you for the help and the patience :-)
I hope this whole discussion might help many other people as well.

Vlad

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs [solved
PostPosted: Sat Sep 08, 2018 9:56 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

vvaltchev wrote:
Brendan wrote:
If I remember correctly; "REP STOSL" was only used to set data in RAM (which has nothing to do with frame buffer bandwidth); and then "something else" was used to copy the data from RAM to the frame buffer (where that "something else" may have been "REP MOVSB" to take advantage of ERMSB in Ivy Bridge or later, but may not have been).


actually it is not. The perf test I wrote in no case used the double buffer (yes, there was an IF, but I stated clearly that I'm not testing the case with the double buffer ON). The double buffer was used by the fb_console, but I never wanted to talk about that. The whole time I only wanted to talk about the performance of a simple test (~30 lines of code) writing directly to the framebuffer (with rep stosl and, in my latest post also with SSE 2 registers), in the case of UC and WC memory.


Ah. That makes a little more sense, given that "REP STOSD" is not properly optimised on modern CPUs (unlike "REP MOVSB" which is).

The part that still doesn't make sense to me is that SSE or AVX using non-temporal stores should be using the write-combining buffers, and should give exactly the same performance regardless of whether the area is UC or WC in the MTRRs or PAT.

vvaltchev wrote:
In numbers, the same perf test now shows:
~6.6 million cycles per redraw [3200 x 1800 x 32 bbp] (again, rep stosl, no fpu, DIRECT WRITE => NO DOUBLE-BUFFER)
36 cycles / 32 pixels = ~1.125 cycles / pixel. Previously it was ~250 cycles/pixel. Improvement: ~222x.


Note that 32 bits per pixel means that 25% of the bytes written to frame buffer do nothing more than waste bandwidth. With 24 bits per pixel you should be able to get it 1.333 times faster (about 0.85 cycles per pixel for the WC case).

Of course for rendering pixels it's faster to have each pixel nicely aligned; which means that the fastest method is to do all the rendering with 32 bits per pixel and then convert from 32 bits per pixel to 24 bits per pixel as part of copying data from your buffer in RAM to the frame buffer.
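The convert-while-copying step Brendan describes might look like the following sketch (an illustration under the assumption of a little-endian xRGB source layout; real code would use wide registers rather than byte stores):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n 32-bpp pixels into a 24-bpp destination, dropping the unused
 * high byte of each source pixel (assumes little-endian xRGB). */
static void copy_32_to_24(uint8_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t px = src[i];

        dst[3 * i + 0] = (uint8_t)(px & 0xff);          /* B */
        dst[3 * i + 1] = (uint8_t)((px >> 8) & 0xff);   /* G */
        dst[3 * i + 2] = (uint8_t)((px >> 16) & 0xff);  /* R */
    }
}
```

Rendering stays on nicely aligned 32-bit pixels, and the 25% bandwidth saving is taken only once, on the final trip to the framebuffer.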

vvaltchev wrote:
Now that I've fixed the low-level problem with the framebuffer, we can compare the raw performance test, which just writes a constant 32-bit value (color) to the entire screen, with the whole console implementation, which is capable of drawing characters from a scroll buffer. In other words: how many cycles will it take, in the worst case, to redraw the whole screen, character by character, while scrolling up and down?


What performance do you get if you scroll up by one pixel (and not by a whole character)?

Note that for this case (assuming that the buffer in RAM is larger than "bare minimum" - e.g. maybe 3200x6000 if the video mode is 3200x1800; and assuming that there's no scroll bar or menus or status bar or...) the scrolling itself should cost a total of literally nothing.
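The "scrolling costs literally nothing" claim relies on the visible screen being a window into a taller buffer: scrolling just moves the window's starting row. A minimal sketch (sizes are illustrative, and one value per row stands in for a full pixel row):

```c
#include <stdint.h>

#define BUF_ROWS 600   /* buffer taller than the visible screen */
#define SCR_ROWS 200   /* visible rows                          */

static uint32_t buffer[BUF_ROWS];   /* one value per row, for brevity */
static unsigned top;                /* index of the first visible row */

/* Scrolling by one line is a single increment; no pixels are copied. */
static void scroll_up(void)
{
    if (top + SCR_ROWS < BUF_ROWS)
        top++;
}

/* Map a visible row (y in [0, SCR_ROWS)) to its backing storage. */
static uint32_t *visible_row(unsigned y)
{
    return &buffer[top + y];
}
```

Only the copy-to-framebuffer step (or, with a real driver, reprogramming the display's start address) then has any per-scroll cost.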

vvaltchev wrote:
According to my measurements, in the worst case: 7.1 million cycles (again using 3200x1800x32 bpp).
[You can see that benchmark by pressing F4, if you run Tilck].

If we compare that to the numbers from the raw performance test, where a full redraw took about ~6.6 million cycles, we can measure a ~7.5% overhead caused by all of my upper-level logic. I'm pretty happy with that: from the user-experience point of view, there is no visible "scrolling effect" at all anymore.


When scrolling the screen by a whole character, the gaps between lines of characters remain in the same place, so you end up with about 10% of pixels that don't change colour because of that (more if a lot of characters are lower case); with a fixed-width font the same happens for gaps between columns of characters; sometimes you'll get lucky and some characters won't change (e.g. "The" on one line and "There" on the line below means three whole characters remain the same when you scroll); and typically there's lots of white space (at the ends of lines, at the start of lines if there's an indented list, etc). In other words, the amount of data you need to change in the frame buffer is probably less than 50% of all pixels; so it should've cost you zero cycles for the scrolling and then ~3.3 million cycles to update the screen, meaning your code is probably more than twice as slow as it could be.
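A cell-diff update of the kind described above can be sketched as follows. This is a hypothetical helper, not from either poster's code; the draw_cell callback stands in for the actual glyph renderer and may be NULL when only counting:

```c
#include <stddef.h>
#include <stdint.h>

/* Redraw only the character cells that differ between the old and the
 * new text buffer; returns how many cells were (notionally) redrawn.
 * draw_cell may be NULL if the caller only wants the count. */
static unsigned redraw_diff(const uint16_t *old_chars,
                            const uint16_t *new_chars,
                            size_t cells,
                            void (*draw_cell)(size_t idx, uint16_t ch))
{
    unsigned drawn = 0;

    for (size_t i = 0; i < cells; i++) {
        if (old_chars[i] != new_chars[i]) {
            if (draw_cell)
                draw_cell(i, new_chars[i]);
            drawn++;
        }
    }
    return drawn;
}
```

The comparison loop runs entirely in fast regular RAM, so its cost is small next to the framebuffer stores it avoids; whether that holds in practice is exactly the question debated in the next post.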

vvaltchev wrote:
Now I just got curious to write an actual console bechmark (using an user mode application), which tests the whole stack by writing characters all over the screen and than erasing it. How many cycles will take a full screen redraw character-by-character? Since the whole point of Tilck is running Linux applications natively, I could run the same program on Linux and compare the results. Probably, my console will be slower than the Linux one, but I'll be happy even if the difference turns out to be not that much.


Oh my... You're planning to use a modern 4K display capable of millions of colours to emulate an 80*25 monochrome terminal from the 1970s?


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Unable to mark a memory region as WC using MTRRs [solved
PostPosted: Sun Sep 09, 2018 5:22 am 

Joined: Fri May 11, 2018 6:51 am
Posts: 274
Brendan wrote:

The part that still doesn't make sense to me is that SSE or AVX using non-temporal stores should be using the write-combining buffers, and should give exactly the same performance regardless of whether the area is UC or WC in the MTRRs or PAT.



That's the same thing I initially thought, but it turns out not to be true on all of my machines.
My current understanding is that stores with a non-temporal hint matter only for write-back memory: they force the CPU to bypass the cache and write directly to main memory. The gain is evident because it avoids throwing away precious cached data for something we're not going to read any time soon. But in my case, for UC memory, that hint has no effect, since the CPU already bypasses the cache: each store goes directly to the RAM, and the CPU busy-waits until the store is completed. I got the idea of using non-temporal stores from: https://software.intel.com/en-us/articl ... me-buffers but that article explains the opposite: how to READ in an efficient way from a framebuffer, not how to write to it. I wanted to try the non-temporal stores anyway because I thought the framebuffer was using write-back memory. After I added support for MTRRs, I realized that it was not.

The write-combining memory, instead, is faster (in my understanding) because stores to it are asynchronous: the CPU continues its execution after each store and flushes its WC buffers out of order, using the best strategy for the memory (writing to contiguous addresses might not always be the best strategy).
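For reference, a non-temporal fill of the kind being discussed looks like this on x86 with SSE2 intrinsics. Note the hedge: run on ordinary cacheable heap memory, this only demonstrates the instruction pattern; it cannot reproduce the UC-vs-WC difference measured in this thread, which depends on how the target range is mapped:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

static void fill_nt(uint32_t *dst, uint32_t color, size_t n)
{
#if defined(__SSE2__)
    size_t i = 0;

    /* scalar head until dst+i is 16-byte aligned */
    while (i < n && ((uintptr_t)(dst + i) & 15) != 0)
        dst[i++] = color;

    /* non-temporal 128-bit stores: routed through the WC buffers,
     * bypassing the cache hierarchy */
    __m128i v = _mm_set1_epi32((int)color);
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    _mm_sfence();   /* make the WC-buffered stores globally visible */

    /* scalar tail */
    for (; i < n; i++)
        dst[i] = color;
#else
    for (size_t i = 0; i < n; i++)   /* portable fallback */
        dst[i] = color;
#endif
}
```

The _mm_sfence() at the end matters: WC-buffered stores are weakly ordered, so a fence is needed before anything that must observe the completed frame.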

Brendan wrote:
Note that 32 bits per pixel means that 25% of the bytes written to frame buffer do nothing more than waste bandwidth. With 24 bits per pixel you should be able to get it 1.333 times faster (about 0.85 cycles per pixel for the WC case).

Of course for rendering pixels it's faster to have each pixel nicely aligned; which means that the fastest method is to do all the rendering with 32 bits per pixel and then convert from 32 bits per pixel to 24 bits per pixel as part of copying data from your buffer in RAM to the frame buffer.


Yeah, I get that, but I no longer have my 2nd buffer, because it's faster without it, and I'd rather avoid re-introducing it just for the 24-bit case.
OK, maybe with a lot of tricks I might gain something even without the 2nd buffer, by packing the pixels locally, but that's also additional code, and I'm not sure what the odds are of achieving a meaningful improvement. Also, UEFI uses only 32-bit modes (EFI_GRAPHICS_OUTPUT_BLT_PIXEL is a struct with size = 4), therefore I decided to stick with that.

Brendan wrote:
What performance do you get if you scroll up by one pixel (and not by a whole character)?


My console does not support that: it works with rows. The fastest scroll I have now is just a plain redraw of the whole screen, using as source a buffer of characters (not pixels). I tried an "image scroll" using the framebuffer, and it's terribly slow (because reading from the framebuffer is terribly slow).

Brendan wrote:
Note that for this case (assuming that the buffer in RAM is larger than "bare minimum" - e.g. maybe 3200x6000 if the video mode is 3200x1800; and assuming that there's no scroll bar or menus or status bar or...) the scrolling itself should cost a total of literally nothing.


How would you achieve that? A scroll requires both loads and stores. Loads from the framebuffer are insanely slow. With the 2nd buffer in RAM it's much better, but overall it's still slower than a simple redraw. The Linux kernel uses a full-redraw strategy for the console as well, as far as I know.

Brendan wrote:
When scrolling the screen by a whole character the gaps between lines of characters remains in the same place so you end up with about 10% of pixels that don't change colour because of that (more if a lot of characters are lower case), and with a fixed width font the same happens for gaps between rows of characters, and sometimes you'll get lucky and some characters won't change (e.g. "The" on one line and "There" on the line below means three whole characters remain the same when you scroll), and typically there's lots of white space (at the ends of lines, at the start of lines if there's an indented list, etc). In other words the amount of data you need to change in the frame buffer is probably less than 50% of all pixels; so it should've costed you zero cycles for the scrolling and then ~3.3 million cycles to update the screen; so your code is probably more than twice as slow as it could be.


OK, I get that, but it seems very tricky. You'd also have to handle special characters like the full blank block, where the whole character area (8x16 or 16x32) is filled with a single color. Are you sure that the overhead of that tricky code won't throw away (most of) the gain from the reduced stores to the framebuffer? I agree that in theory something like that could be done, but I'm not sure how big the benefit would be in practice. It depends a lot on how many stores you can skip and on how much the skipping logic costs. Have you written code like that? I'd be very curious to see a console using such strategies in practice.
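Just to be clear about what I understand the proposal to be: compare the character-cell buffer before and after the scroll and redraw only the cells that actually changed. A rough sketch (purely illustrative; `cell`, `redraw_diff` and the `draw_char` callback are made-up names, not any real console's API):

```c
/* Illustrative sketch of diff-based redraw (assumed names/types). */
typedef struct { char ch; unsigned char color; } cell;

/* Redraw only cells that differ between the old and new screen contents;
 * returns how many cells actually required stores to the framebuffer. */
static int redraw_diff(const cell *old_scr, const cell *new_scr,
                       int rows, int cols,
                       void (*draw_char)(int r, int c, cell cl))
{
   int drawn = 0;
   for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
         cell o = old_scr[r * cols + c];
         cell n = new_scr[r * cols + c];
         if (o.ch != n.ch || o.color != n.color) {
            draw_char(r, c, n);   /* only changed cells touch the fb */
            drawn++;
         }
      }
   }
   return drawn;
}
```

The per-cell compare itself is cheap since it runs entirely on cached RAM; my doubt is whether, with a WC framebuffer, skipping stores pays off enough to beat a straight sequential redraw.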

Brendan wrote:
Oh my... You're planning to use a modern 4K display capable of millions of colours to emulate an 80*25 monochrome terminal from the 1970s?


Aahahhaha :-)
Not exactly. At 3200x1800, using a 16x32 font plus my banner, I have 200x54 characters on screen (200x56 without the banner).
16 colors, VGA style :-)

_________________
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck


 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sun Sep 09, 2018 7:44 am 
Guys, I've just written a console performance test that fills the screen 10 times with regular letters, and I've run it on the Linux console (not on a terminal emulator) and on Tilck, on two of my machines (Dell XPS 13" and Lenovo Ideapad 700).

Here is the test's code (it's just a function in a regular user application):

Code:
void console_perf_test(void)
{
   static const char letters[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

   const int iters = 10;
   struct winsize w;
   char *buf;

   if (ioctl(1, TIOCGWINSZ, &w) != 0) {
      printf("ioctl(TIOCGWINSZ) failed\n");
      return;
   }

   buf = malloc(w.ws_row * w.ws_col);

   if (!buf) {
      printf("Out of memory\n");
      return;
   }

   for (int i = 0; i < w.ws_row * w.ws_col; i++) {
      buf[i] = letters[i % (sizeof(letters) - 1)];
   }

   printf("%s", CSI_ERASE_DISPLAY CSI_MOVE_CURSOR_TOP_LEFT);

   uint64_t start = RDTSC();

   for (int i = 0; i < iters; i++) {
      write(1, buf, w.ws_row * w.ws_col);
   }

   uint64_t end = RDTSC();
   unsigned long long c = (end - start) / iters;

   printf("Term size: %d rows x %d cols\n", w.ws_row, w.ws_col);
   printf("Screen redraw:       %10llu cycles\n", c);
   printf("Avg. character cost: %10llu cycles\n", c / (w.ws_row * w.ws_col));
   free(buf);
}


You can compile it even outside of Tilck's build system:
Code:
gcc tilck/usermode_apps/termtest.c -o termtest

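Note for anyone trying to build it standalone: RDTSC() and the CSI_* macros come from Tilck's headers. Outside the tree you'd need something like the following (my own assumed definitions, x86-only for the rdtsc part; the originals may differ):

```c
#include <stdint.h>

/* Assumed stand-ins for Tilck's helpers (not the original definitions). */
#define CSI_ERASE_DISPLAY          "\x1b[2J"     /* ANSI: erase whole display  */
#define CSI_MOVE_CURSOR_TOP_LEFT   "\x1b[1;1H"   /* ANSI: cursor to row 1, col 1 */

/* Read the x86 time-stamp counter (GCC/Clang inline asm, x86 only). */
static inline uint64_t RDTSC(void)
{
   uint32_t lo, hi;
   __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
   return ((uint64_t)hi << 32) | lo;
}
```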

Here are the screenshots of the benchmark:

Dell XPS, Linux console
Attachment:
File comment: The console test, Linux, Dell XPS.
dell_linux.jpg


Dell XPS, Tilck console
Attachment:
File comment: The console test, Tilck, Dell XPS.
dell_tilck.jpg


By looking at the average number of cycles per character, we can see that the test runs 11x faster on Tilck than on Linux: same machine, same framebuffer (resolution, color mode), same 16x32 font. Honestly, I never expected that. In the next post I'll also put the screenshots of the benchmark run on the Lenovo machine (there's a 3-attachment limit per post).



 Post subject: Re: Unable to mark a memory region as WC using MTRRs
PostPosted: Sun Sep 09, 2018 7:53 am 
Lenovo Ideapad 700, Linux
Attachment:
File comment: The console test, Linux, Lenovo Ideapad 700.
lenovo_linux.jpg


Lenovo Ideapad 700, Tilck
Attachment:
File comment: The console test, Tilck, Lenovo Ideapad 700.
lenovo_tilck.jpg


Note: here the font size is still 16x32, but the resolution is 1920x1080x32 bpp.

In this case the gap is much more reasonable, but still huge: Tilck is 3.76x faster than Linux.
My best explanation for such a gap is the complexity (feature-richness) and generality of the Linux console. After all, my console implementation is optimized for the 32-bpp case and for the 16x32 (and 8x16) font sizes (while Linux supports a variety of font sizes), supports only 16 colors, and takes advantage of that. Also, Tilck's console certainly supports fewer escape sequences than Linux's. Still, I'm happy that overall it's pretty fast now.
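To give an idea of why the fixed 32-bpp assumption helps: each font scanline byte can be expanded into 8 pixels with a branchless select between two precomputed 32-bit colors. This is just an illustrative sketch (`blit_scanline8` is a made-up name, not Tilck's actual code):

```c
#include <stdint.h>

/* Illustrative sketch (assumed name): expand one 8-pixel font scanline
 * into 32-bpp pixels, choosing fg or bg per bit without branches. */
static void blit_scanline8(uint32_t *dst, uint8_t bits,
                           uint32_t fg, uint32_t bg)
{
   for (int i = 0; i < 8; i++) {
      /* bit 7 is the leftmost pixel; mask is all-ones when the bit is set */
      uint32_t mask = (uint32_t)0 - ((bits >> (7 - i)) & 1u);
      dst[i] = (fg & mask) | (bg & ~mask);
   }
}
```

With only 16 colors, the fg/bg pairs can all be precomputed once, so the inner loop never touches a palette at draw time.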

Vlad


