Hi Brendan,
thank you very much for your exhaustive response to my questions! Now I understand why resetting MTRR[0] on that machine was such a big deal: I got a freeze that was impossible to debug (no faults, nothing). Your theory that I may have messed with SMM's expectations makes perfect sense to me.
Brendan wrote:
vvaltchev wrote:
Please tell me that there are other solutions than "surrender" and deal with the IOMMU in order to re-map the framebuffer.
The best solution is to optimise your code so that write-combining makes very little difference.
For examples; your "memcpy_256_nt" is likely to be significantly slower than a simple "rep movsd", you're copying one line at a time (even in the "bytes_per_pixel * horizontal_resolution == pitch" case where you could do a single "memcpy()"), you're relying on the compiler to optimise for you (including "-O3" which almost never improves performance) but then preventing the compiler from being able to optimise at all (e.g. splitting things across multiple compilation units so compiler can't inline functions, etc), you're overusing "memcpy()" (when you could be doing a small number of explicit writes) and then using "special" functions ("memcpy32()") that prevent the compiler from generating optimal code when the size could/should be constant, you're not hoisting branches out of loops (e.g. "for each line { if(bpp == 32) .." rather than "if(bpp == 32) { for each line {..."), for some cases (horizontal lines for 24-bpp) you're doing "fb_draw_pixel()" loops that re-calculate an address of every pixel (when it could be optimised to do groups of 4 pixels as 3 aligned 32-bit writes with an "next_address += 12" every 4 pixels instead of "address = buf + y * pitch + x * 3" for every individual pixel), etc.
After you've got it down to about 1 cycle per pixel; then try to enable write-combining to get it down to slightly less than 1 cycle per pixel.
Brendan
About that, I just wanted to point out that my performance test has nothing to do with my framebuffer console implementation. I wrote the following code to test the framebuffer's performance:
Code:
void selftest_fb_perf_manual()
{
   const int iters = 20;
   u64 start, duration, cycles;

   start = RDTSC();

   for (int i = 0; i < iters; i++) {

      fb_raw_color_lines(0, fb_get_height(),
                         vga_rgb_colors[i % 2 ? COLOR_WHITE : COLOR_BLACK]);

      if (framebuffer_vi.flush_buffers) // This also tests the double-buffered case.
         fb_full_flush();               // My primary test writes directly to the framebuffer.
   }

   duration = RDTSC() - start;
   cycles = duration / iters;

   u64 pixels = fb_get_width() * fb_get_height();
   printk("fb size (pixels): %llu\n", pixels);
   printk("cycles per redraw: %llu\n", cycles);
   printk("cycles per 32 pixels: %llu\n", 32 * cycles / pixels);

   fb_draw_banner();  // You can completely ignore this
   fb_flush_banner(); // You can completely ignore this
}
fb_raw_color_lines() is implemented as:
Code:
void fb_raw_color_lines(u32 iy, u32 h, u32 color)
{
   if (fb_bpp == 32) {
      memset32((void *)(fb_vaddr + (fb_pitch * iy)),
               color, (fb_pitch * h) >> 2);
   } else {
      // Generic (but slower) version
      for (u32 y = iy; y < (iy + h); y++)
         for (u32 x = 0; x < fb_width; x++)
            fb_draw_pixel(x, y, color);
   }
}
So, in practice I use just a plain memset32() to write a constant value to the framebuffer and, yes, I'm redrawing the whole screen with a single rep stosl instruction (inside memset32). It couldn't be simpler than that.
[Note: the "if (fb_bpp == 32)" is checked once per screen redraw, so 20 times in total: negligible in practice.]
Now, about my memcpy_256_nt(): actually, my first implementation used just memcpy() and it was a lot slower.
That's why I decided to deal with the FPU, introduce the fpu_context machinery, etc., in order to use wider registers and see what would happen. The result: the performance improvement was roughly proportional to the width of the register used for the copy. For example, with AVX2 registers, which are 256 bits wide, I achieved roughly an 8x improvement over plain 32-bit registers.
Brendan wrote:
you're not hoisting branches out of loops (e.g. "for each line { if(bpp == 32) .." rather than "if(bpp == 32) { for each line {..."), for some cases
You're probably looking at the failsafe functions [the ones using fb_draw_pixel()].
You might be curious to take a look at fb_draw_char_optimized_row(), the most important function, which draws one row. Also, I've really optimized only the 32-bpp case, the only one I'm testing.
You're perfectly right that the 24-bpp case is not optimized, but that's because I care only about the 32-bpp case.
Anyway, there's one more thing I didn't get:
Brendan wrote:
After you've got it down to about 1 cycle per pixel; then try to enable write-combining to get it down to slightly less than 1 cycle per pixel.
How could I achieve ~1 cycle per pixel with uncacheable memory?
As far as I know, an access to RAM (the cost of an L3 cache miss, or the case where the memory is UC) takes roughly 100 ns on modern machines.
[Source: https://software.intel.com/en-us/forums ... pic/287236, which points to this: https://software.intel.com/sites/produc ... _guide.pdf]
In practice, for a Xeon 5500 I could get roughly (copy-pasting the text):
Code:
Data Source                                Latency (approximate)
L1 CACHE hit                               ~4 cycles
L2 CACHE hit                               ~10 cycles
L3 CACHE hit, line unshared                ~40 cycles
L3 CACHE hit, shared line in another core  ~65 cycles
L3 CACHE hit, modified in another core     ~75 cycles
remote L3 CACHE                            ~100-300 cycles
Local DRAM                                 ~60 ns
Remote DRAM                                ~100 ns
My CPU runs at 2.7 GHz. So in 100 ns my CPU will go through roughly:
Code:
2,700,000,000 * (100 / 1,000,000,000) = 270 cycles
Therefore, something like ~250 cycles per pixel (32-bit) seems to me close to the theoretical best I could achieve with uncacheable memory on that machine. OK, the Intel PDF states that the latency can go as low as 60 ns, which means 162 cycles, but that depends on the hardware [...] I think you get the point.
Am I missing something in this analysis?