It's probably about time I post in this thread... I worked on Compiz many years ago, spent some time at Apple working on the window server in OS X, and have written two different compositing window managers of my own. Hopefully this isn't too much of a rambling mess.
I'll spare you the usual history-of-window-drawing explanation and jump straight to "how we do it now". Every modern windowing system employs a compositor - be it Windows' DWM, OS X's WindowServer, Wayland, iOS's WindowServer (which is based on the OS X one), or Android's SurfaceFlinger.

It seems you get the general idea of compositing: store a representation of the window contents within the server so you don't need to ask the clients to redraw. There are many ways to do this, but all of these examples just store bitmaps, usually in GPU memory. Obviously, GPU memory is a tall order for a hobby OS, so you'll likely just want to use plain old RAM. OS X's WindowServer used to work this way in its early days - it actually took several years for it to be moved over to the GPU. You can also store draw calls and re-run them, but that can cost more at runtime than it's worth in potential space savings (and RAM is cheap).

These "surface-based" systems all present a canvas/surface/texture/bitmap/whatever-term-you-want-to-use to the client. The client renders into it however it wants, then informs the server it is ready. Usually there's some copying so the client can go back to drawing again (though not always; sometimes the server and client share the same surface, which can lead to tearing and a few other artifacts), and the server redraws the affected areas of the screen.

There are a lot of things compositing is really good at, like rendering semi-transparent windows over other windows that are constantly changing. Since these compositors employ GPUs, they're very similar to video game rendering engines, and you can think of windows as big rectangular textured meshes rendered back to front to ensure that semi-transparency stacks up correctly. But video games render the whole visible scene on each frame, and that's obviously not a good idea for a general UI that's unlikely to be changing much. This is where "dirty" or "damaged" regions come into play.
Whenever a window updates its texture, it informs the server of where those updates occurred. This can be a whole window or just a small subset of it (your blinking terminal cursor shouldn't signal the server to update the whole terminal, for example). Dirty regions are summed together, probably with an efficient region merging algorithm (
jojo's guide covers this pretty well), and then only those regions are rendered and copied to the framebuffer.
As a practical example, I'll explain how the process works for my own compositor. As I don't particularly care about tearing, I use a single surface for each window rather than having separate server and client copies. Surfaces are 32-bit ARGB regions with stride==width, allocated by the server and made available to clients through a simple shared memory facility in my kernel. A port of my compositor to Sortix used normal Unix shared memory. An ideal implementation would use GPU memory, but we don't have any GPU drivers, so that's not likely to happen any time soon.

Clients can use whatever they want to draw onto that surface (most of them just set pixel values manually; some use Cairo, some use software Mesa, a few use SDL), and then send a message to the server to inform it that a region has been updated. Because the client and server share the single surface, the server may render updates made within a window before the client has informed the server of them, so most clients employ client-side double buffering (they render into their own buffer and then copy it to the shared surface) to avoid putting things on screen that should never be displayed (such as the layers behind a button or something).

The server itself uses Cairo for all of its rendering. My first implementation of a compositor was written from scratch, and I wanted to simplify and speed up my second iteration by employing an existing rendering library. To perform dirty region clipping, we use Cairo's built-in support for clipping. Because Cairo can clip to arbitrary shapes, it actually has two clip modes, and which one gets used depends on the sort of clip regions you provide. We always use pixel-aligned rectangular clips to ensure that we stick to the fast clipping mode (the other mode is the slow one, used for arbitrarily-shaped and semi-transparent clips). From there, we render all of the windows from bottom to top into a temporary buffer. Cairo's rendering backend ensures we're not actually doing anything expensive with the windows that don't intersect the clip region. We then copy, using the same clip regions, from the temporary buffer to the actual framebuffer.