Exploring Hardware Compositing With the Raspberry Pi
Introduction
Hardware compositing is something I took for granted until earlier this year. The mental model I previously had was a single framebuffer sitting at the end of the display pipeline getting scanned out to whatever protocol the monitor supported. The OS’s windowing system then composited itself to that framebuffer through traditional GPU exchanges, using an API like OpenGL or Direct3D. When I learned about hardware compositing, specifically Android’s SurfaceFlinger, I realized that some systems have the capability to offload some compositing work to a special hardware unit.
I’ve been interested in “pulling back the veil” of low-level graphics for a while now, and given that Broadcom (the maker of the chip in the Pi) previously released documentation on the VideoCore® IV GPU inside the Raspberry Pi 3, this seemed like a good place to start.
So today we’re going to interface with the Raspberry Pi’s hardware composition unit, the Hardware Video Scaler, through a bare-metal kernel. If you want to follow along at home, read Getting Started With the Raspberry Pi first, which describes how I got up and running with kernel development on the Pi. I’m using a project called raspberry-pi-os as boilerplate for the kernel code, which gets us up and running quickly.
Introducing the HVS
Getting pixels on the screen can be thought of as a pipeline. At one end sits a framebuffer, a matrix of numbers waiting to be converted into light. At the other end sits the monitor. The unit of the pipeline is the scanline, a row of pixels.
The chip used in the Raspberry Pi 3 is the Broadcom BCM2837 System on a Chip (SoC), a close relative of the BCM2835 found in the original Pi. The Hardware Video Scaler, or HVS, is one of the components on the chip and is part of the display pipeline.
A brief overview of the display pipeline will help orient us. Let’s work backwards, starting with the monitor. Contained within the Pi’s SoC is an HDMI encoder. The HDMI encoder encodes pixels and sends them, scanline by scanline, down the physical wire to the monitor. The monitor decodes them and turns on subpixels, which emit various wavelengths of light that your brain perceives as continuous colors.
Upstream of the HDMI encoder is a memory buffer known as the FIFO. It temporarily houses scanlines until the HDMI encoder dequeues them. At the opposite end of the FIFO is the HVS. The HVS connects one or more framebuffers to the FIFO. It’s configured to read pixels from a chunk of memory (the framebuffer), and enqueue them, scanline by scanline, onto the FIFO.
As we’ll see, the HVS has a few tricks up its sleeve. It can be configured to composite framebuffers from different areas of memory into a single stream of pixels. It can also scale, rotate, and blend the framebuffers along the way.
Understanding that hardware compositors exist was one of my first “aha” moments of understanding low-level graphics. The pixels you see on your monitor can come from various regions of memory.
While Broadcom’s documentation covers the majority of the GPU, it doesn’t cover the HVS, and I haven’t found any official documentation on it. What I know comes primarily from reading the Linux source code. So bear with me, and please let me know if I got anything wrong.
The Display List
The HVS works by consuming a structure in memory called the display list, which is what we need to prepare. The display list lives in memory-mapped I/O, a special portion of physical memory that allows the kernel to talk to peripherals, like the HVS. It looks just like any other memory address, and we can use regular ARM instructions to write to it.
The display list is simply a list of commands or words, 4 bytes each. They configure the scale, position, and other properties for each plane.
A plane is an image source to be composited into the final pixel stream with attributes such as its pixel format, where on screen it should be overlaid, if it should be scaled, how it should be rotated, etc. Remember, the HVS does the compositing in real time, i.e., it doesn’t “save” the result of the composition anywhere. Its logic works scanline-by-scanline. The result of the composition only exists as an illusion on your monitor. (This definition of plane is not to be confused with a plane such as a luminance plane in a YUV image.)
We’ll start by crafting a display list that composites a single plane on the screen (we’ll add more later). Let’s look at the words that make up a display list with a single plane:
- Control Word
- Position Word 0
- Position Word 2
- Position Word 3
- Pointer Word
- Pointer Context Word
- Pitch Word
- End of Display List
In this case, we have 7 words plus an “end of display list” word. 8 words * 4 bytes = 32 bytes in total.
Where the Display List Lives
The display list memory lives at address 0x3F402000, which, as I mentioned, is in memory-mapped I/O that the HVS can access. We can treat this memory region as an array of 4 byte words:
static volatile uint32_t* dlist_memory = (uint32_t*) 0x3F402000;
Writing to the display list is as simple as:
dlist_memory[0] = 0x12345678; // write a single word at offset 0
These writes have side effects, so notice that the volatile keyword is used on the dlist_memory pointer. If we wrote to the memory location like any other, the C compiler would be free to optimize away the write, because we never read it back.
We’ll keep track of the current word offset and increment it every time we write a word. The display list memory is 16 KiB in size, so there’s room for several display lists. We’ll later see how we tell the HVS where in memory we’ve written our display list.
A Single Plane Display List
We’re going to start simple: a single plane centered in the screen. We’re assuming a 1920x1080 monitor, and the plane will take up a single quadrant (960x540):
If you like reading the Linux kernel in all its gory detail, follow along in vc4_plane.c. The function is called vc4_plane_mode_set. By the way, different systems refer to hardware compositing by different names. In Linux, it’s part of KMS, kernel mode setting.
The hvs_plane Struct
We’ll be filling in the details for this function as we go along:
static void write_plane(uint16_t* offset, hvs_plane plane);
The function will write a plane to display list memory at the given word offset. offset is passed as a pointer. The function will increment the offset for each word so the caller knows how many have been written. We can define a macro that will write out a word and increment the offset:
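Such a macro might look roughly like the sketch below. The name WRITE_WORD and the bounds check are my own assumptions, and so that it can be compiled and tried on an ordinary computer, dlist_memory points at a plain array here instead of the real memory-mapped region.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for display list memory. On the Pi, dlist_memory would
   point at the memory-mapped region instead: (uint32_t*) 0x3F402000. */
#define DLIST_WORDS 4096u /* 16 KiB / 4 bytes per word */
static uint32_t fake_dlist[DLIST_WORDS];
static volatile uint32_t* dlist_memory = fake_dlist;

/* Hypothetical helper: store one word at *offset, then advance *offset.
   It expects a variable named `offset` of type uint16_t* to be in scope. */
#define WRITE_WORD(word)                    \
    do {                                    \
        assert(*offset < DLIST_WORDS);      \
        dlist_memory[(*offset)++] = (word); \
    } while (0)

/* Example caller: writes two arbitrary words starting at *offset. */
void write_two_words(uint16_t* offset) {
    WRITE_WORD(0x12345678u);
    WRITE_WORD(0x9ABCDEF0u);
}
```

Wrapping the body in do { ... } while (0) lets the macro behave like a single statement, so it stays safe inside an unbraced if or else.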
write_plane takes a hvs_plane struct which we need to define. It contains all the necessary information to write out the plane:
typedef struct { |
The rough equivalent in the Linux kernel, if you’re interested in comparing, is a struct called drm_plane_state.
There are two enums used in hvs_plane that need some explanation, hvs_pixel_format and hvs_pixel_order.
Pixel Format
The hvs_pixel_format is an enum that tells the HVS what type of pixels are in our framebuffer. Here we see some of the pixel formats that the HVS natively supports:
typedef enum { |
I left a few out, like YUV. You can see the full list in the Linux driver.
HVS_PIXEL_FORMAT_RGB565 is the format we’ll be working with. Each pixel is 16 bits. The first 5 bits are for red, the next 6 for green, and the last 5 for blue.
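To make the bit layout concrete, here is a small sketch of packing an ordinary 8-bit-per-channel color into a 16 bit pixel. The rgb565 helper is hypothetical, and it assumes that “first” means the most significant bits, so red lands in bits 15 to 11.

```c
#include <stdint.h>

/* Hypothetical helper: pack 8-bit red, green, and blue into one RGB565 pixel.
   Red occupies bits 15..11, green bits 10..5, and blue bits 4..0. The low
   bits of each channel are dropped to fit the narrower fields. */
uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Green gets the extra bit because the eye is most sensitive to it, which is why it keeps 6 bits while red and blue keep 5.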
Pixel Order
The order of the pixels is another enum, one of the following (also taken straight from the VC4 driver):
typedef enum { |
So far as I can tell, the HVS requires the alpha component to come first (if there is one), so we’ll always use HVS_PIXEL_ORDER_ARGB.
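Putting the pieces from this section together, the types could be sketched as follows. The enum values are what I recall from the VC4 driver’s register header, so treat them as assumptions to check against the kernel source. The hvs_plane field names and the hvs_bytes_per_pixel helper are my own guesses at what is needed, inferred from how the plane gets used in the rest of this post.

```c
#include <stdint.h>

/* Pixel formats (values assumed from the Linux VC4 driver; YUV left out). */
typedef enum {
    HVS_PIXEL_FORMAT_RGB332   = 0, /* 8 bits per pixel */
    HVS_PIXEL_FORMAT_RGBA4444 = 1, /* 16 bits per pixel */
    HVS_PIXEL_FORMAT_RGB555   = 2,
    HVS_PIXEL_FORMAT_RGBA5551 = 3,
    HVS_PIXEL_FORMAT_RGB565   = 4,
    HVS_PIXEL_FORMAT_RGB888   = 5, /* 24 bits per pixel */
    HVS_PIXEL_FORMAT_RGBA6666 = 6,
    HVS_PIXEL_FORMAT_RGBA8888 = 7  /* 32 bits per pixel */
} hvs_pixel_format;

/* Component orders (values assumed from the Linux VC4 driver). */
typedef enum {
    HVS_PIXEL_ORDER_RGBA = 0,
    HVS_PIXEL_ORDER_BGRA = 1,
    HVS_PIXEL_ORDER_ARGB = 2,
    HVS_PIXEL_ORDER_ABGR = 3
} hvs_pixel_order;

/* Hypothetical plane description; field names are illustrative. */
typedef struct {
    hvs_pixel_format format;
    hvs_pixel_order  pixel_order;
    uint16_t start_x, start_y; /* position on screen, in pixels */
    uint16_t width, height;    /* framebuffer dimensions, in pixels */
    uint16_t pitch;            /* bytes per row of pixels */
    void*    framebuffer;      /* where the pixels live */
} hvs_plane;

/* Hypothetical helper: storage size of one pixel in the given format. */
uint16_t hvs_bytes_per_pixel(hvs_pixel_format format) {
    switch (format) {
    case HVS_PIXEL_FORMAT_RGB332:   return 1;
    case HVS_PIXEL_FORMAT_RGB888:
    case HVS_PIXEL_FORMAT_RGBA6666: return 3;
    case HVS_PIXEL_FORMAT_RGBA8888: return 4;
    default:                        return 2; /* the 16 bit formats */
    }
}
```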
Writing the Display List
We’re now ready to take the hvs_plane and write out a display list.
Control Word
First up is the control word. It conveys:
- A signal bit that this word is the start of a plane
- A signal bit that the plane has no scaling
- The pixel format
- The pixel component order
- The number of words in this plane
The control word is formed by bitshifting and ORing all of that together.
/* Control word */ |
This is a pattern you’ll see with these words. We cram several arguments into a single word by bitshifting some over so they can fit within 32 bits. I figured out the amount to bitshift by taking a look at the Linux kernel driver.
SCALER_CTL0_VALID and SCALER_CTL0_UNITY are defined as such:
These are just signals to the HVS.
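As a rough sketch, the control word could be assembled like this. The bit positions are the ones I remember from the Linux driver’s register definitions (valid at bit 30, unity at bit 4, size at bits 29 to 24, component order at bits 14 to 13, format at bits 3 to 0), so treat them as assumptions. The hvs_control_word function is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assumed from the Linux VC4 driver's register header. */
#define SCALER_CTL0_VALID              (1u << 30) /* start of a plane */
#define SCALER_CTL0_UNITY              (1u << 4)  /* no scaling */
#define SCALER_CTL0_SIZE_SHIFT         24         /* number of words in the plane */
#define SCALER_CTL0_ORDER_SHIFT        13         /* pixel component order */
#define SCALER_CTL0_PIXEL_FORMAT_SHIFT 0          /* pixel format */

/* Hypothetical helper: build the control word for an unscaled plane. */
uint32_t hvs_control_word(uint32_t format, uint32_t order, uint32_t word_count) {
    assert(format < 16u);     /* 4-bit field */
    assert(order < 4u);       /* 2-bit field */
    assert(word_count < 64u); /* 6-bit field */
    return SCALER_CTL0_VALID
         | SCALER_CTL0_UNITY
         | (word_count << SCALER_CTL0_SIZE_SHIFT)
         | (order << SCALER_CTL0_ORDER_SHIFT)
         | (format << SCALER_CTL0_PIXEL_FORMAT_SHIFT);
}
```

With format 4 (RGB565), order 2 (ARGB), and 7 words in the plane, the fields OR together to 0x47004014.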
Position Word 0
Position Word 0 conveys the plane’s position on screen. It contains the X and Y positions:
/* Position Word 0 */ |
Position Word 2
Position Word 2 conveys the dimensions of the framebuffer, its width and height in pixels.
/* Position Word 2 */ |
Note that I’m referring to this word as Position Word 2, even though there’s no “Position Word 1” before it, as you might expect. The so-called Position Word 1 is only present if we’re doing scaling, which we aren’t. I’ve kept the names the same as their Linux kernel counterparts, for those following along.
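Both position words follow the same packing pattern, sketched below. I’m assuming 12-bit fields, with Y shifted up by 12 in Position Word 0 and height shifted up by 16 in Position Word 2, which is how I read the Linux driver. The upper bits, which the driver uses for alpha settings, are left at zero here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: X in bits 11..0, Y in bits 23..12 (assumed layout). */
uint32_t hvs_position_word_0(uint16_t start_x, uint16_t start_y) {
    assert(start_x < 4096u && start_y < 4096u); /* each field is 12 bits */
    return (uint32_t)start_x | ((uint32_t)start_y << 12);
}

/* Hypothetical helper: width in bits 11..0, height in bits 27..16 (assumed layout). */
uint32_t hvs_position_word_2(uint16_t width, uint16_t height) {
    assert(width < 4096u && height < 4096u); /* each field is 12 bits */
    return (uint32_t)width | ((uint32_t)height << 16);
}
```

For our centered 960x540 plane at position (480, 270), these come out to 0x0010E1E0 and 0x021C03C0.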
Position Word 3
Position Word 3 is super easy. It’s just a placeholder for the HVS to store some context information for its own use, which we don’t have to worry about. Leave it uninitialized (but be sure to skip a word) or fill it with your favorite Hexspeak:
/* Position Word 3: used by HVS */ |
Pointer Word
The pointer word is important: it gives the memory location of the actual framebuffer. We’ll set up the memory later. For now, just write out the pointer present in the struct:
/* This cast is okay, because the framebuffer pointer can always be held in 4 bytes |
By the way, if you’re familiar with the concept of page flipping, this would be the pointer you’d “flip” to implement it.
Pointer Context Word
The Pointer Context Word is another placeholder word for the HVS to use for its own bidding:
/* Pointer Context: used by HVS */ |
Pitch Word
Last but not least there’s the Pitch Word. The Pitch Word conveys the pitch of the framebuffer, also known as stride. This is the number of bytes in a row of pixels.
/* Pitch word */ |
End of Display List
We’ve finished writing the display list for the first plane. If we had additional planes, this is where they’d go. Since we’re only doing one for now, we need to move on to the final word of the display list, which signifies that the whole thing is done.
/* End word */ |
SCALER_CTL0_END is defined as such:
Again, it’s just another signal bit that tells the HVS that the display list has finished.
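Here is how all eight words might fit together, as a sketch that fills an ordinary array so it can be inspected on any machine. I’m assuming the end bit is bit 31 (as I recall from the Linux driver), a plane_desc struct of my own invention that holds the framebuffer address as a plain 32 bit number, and 0xDEADBEEF as the filler for the two words the HVS uses itself.

```c
#include <stddef.h>
#include <stdint.h>

#define SCALER_CTL0_END   (1u << 31) /* assumed from the Linux VC4 driver */
#define SCALER_CTL0_VALID (1u << 30)
#define SCALER_CTL0_UNITY (1u << 4)
#define HVS_SCRATCH_WORD  0xDEADBEEFu /* arbitrary filler for HVS-owned words */

/* Hypothetical, flattened plane description used only by this sketch. */
typedef struct {
    uint32_t format, order;     /* raw enum values, e.g. 4 = RGB565, 2 = ARGB */
    uint16_t start_x, start_y;
    uint16_t width, height;
    uint16_t pitch;             /* bytes per row */
    uint32_t framebuffer_addr;  /* physical address of the pixels */
} plane_desc;

/* Fills out[0..7] with a single-plane display list and returns the word count. */
size_t build_single_plane_dlist(const plane_desc* p, uint32_t* out) {
    size_t n = 0;
    const uint32_t words_in_plane = 7;
    out[n++] = SCALER_CTL0_VALID | SCALER_CTL0_UNITY      /* Control Word */
             | (words_in_plane << 24) | (p->order << 13) | p->format;
    out[n++] = (uint32_t)p->start_x | ((uint32_t)p->start_y << 12); /* Position Word 0 */
    out[n++] = (uint32_t)p->width | ((uint32_t)p->height << 16);    /* Position Word 2 */
    out[n++] = HVS_SCRATCH_WORD;    /* Position Word 3: used by HVS */
    out[n++] = p->framebuffer_addr; /* Pointer Word */
    out[n++] = HVS_SCRATCH_WORD;    /* Pointer Context Word: used by HVS */
    out[n++] = p->pitch;            /* Pitch Word */
    out[n++] = SCALER_CTL0_END;     /* End of Display List */
    return n;
}
```

On the Pi, the same words would be written into display list memory rather than a local array. Building them in a plain buffer first makes it easy to print them and compare against what the Linux driver produces.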
Testing It Out
And that’s a basic display list! There are a few more things we need to do before we can test it out. First off, we need to create the framebuffer.
Getting a Framebuffer
Let’s talk for a minute about the memory layout of the Pi. Remember, we’re writing kernel code. That means our code is running directly on the Pi without the luxuries of an operating system beneath us. Most pertinent to us, we’ve forgone the concept of virtual memory: we have only physical memory, about 1 GB. We don’t have a malloc function at our disposal. In our case, we’ll need room for some framebuffers. These framebuffers can go almost anywhere in the address space, since the Pi has a unified memory architecture and the GPU can see all of RAM.
There are a few locations that are off-limits:
- Anything above 0x3F000000 is peripheral memory
- Our stack starts at 0x00400000 and grows downward
- The kernel image itself sits at 0x00080000
We don’t want to overwrite anything important. Address 0x10000000 will do nicely, giving us plenty of room for additional framebuffers.
uint16_t* const framebuffer = (uint16_t*)(0x10000000);
That’s it. We cast it to a uint16_t pointer because we’re using a 16 bit pixel format, RGB565.
Informing The HVS Of Our Display List
Let’s create a hvs_plane and call our function, writing it to the display list at offset 0. This plane will be a quarter of the screen size, centered on the screen:
const uint16_t screen_width = 1920, screen_height = 1080;
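The geometry for a centered, quarter-size plane is simple arithmetic, sketched here with a hypothetical rect16 type and centered_quarter helper: halve each screen dimension, then split the leftover space evenly on both sides.

```c
#include <stdint.h>

/* Hypothetical rectangle type: position and size in pixels. */
typedef struct {
    uint16_t x, y, width, height;
} rect16;

/* Hypothetical helper: a plane half as wide and half as tall as the screen
   (a quarter of its area), centered on it. */
rect16 centered_quarter(uint16_t screen_width, uint16_t screen_height) {
    rect16 r;
    r.width  = (uint16_t)(screen_width / 2);
    r.height = (uint16_t)(screen_height / 2);
    r.x = (uint16_t)((screen_width - r.width) / 2);
    r.y = (uint16_t)((screen_height - r.height) / 2);
    return r;
}
```

For a 1920x1080 screen this gives a 960x540 plane whose top-left corner is at (480, 270).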
All that’s left to do is tell the HVS where the display list is. There’s another special memory location which is memory-mapped to a register on the HVS, called SCALER_DISPLIST1. It’s at 0x3F400024.
/* Tell the HVS where the display list is by writing to the SCALER_DISPLIST1 register. */ |
The put32 function is inherited from the raspberry-pi-os project:
.globl put32 |
It merely stores a 32 bit word to a memory location. We use this function to write to the register for the same reason we marked the dlist_memory pointer as volatile: so the C compiler won’t optimize the write away.
By the way, here’s a listing of all the HVS registers on the Pi. If you search for SCALER_DISPLIST1, you’ll see that it’s listed at address 0x7E400024, not 0x3F400024. The BCM2835 Peripheral Guide clears up that ambiguity:
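The bus addresses for peripherals are set up to map onto the peripheral bus address range starting at 0x7E000000. Thus a peripheral advertised here at bus address 0x7Ennnnnn is available at physical address 0x20nnnnnn.
That figure is for the original BCM2835. On the Raspberry Pi 3, the same peripherals are mapped at physical address 0x3Fnnnnnn, which is why we use 0x3F400024.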
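The translation is just a fixed offset, which can be sketched as a small helper. The bus_to_physical name is hypothetical, and the 0x3F000000 base assumes a Raspberry Pi 3 (the original Pi would use 0x20000000, as in the quote).

```c
#include <assert.h>
#include <stdint.h>

#define BUS_PERIPHERAL_BASE 0x7E000000u /* where the datasheet lists peripherals */
#define PI3_PERIPHERAL_BASE 0x3F000000u /* where the ARM sees them on a Pi 3 (assumption) */

/* Hypothetical helper: convert a peripheral bus address from the
   documentation into the physical address the ARM core should use. */
uint32_t bus_to_physical(uint32_t bus_addr) {
    assert(bus_addr >= BUS_PERIPHERAL_BASE); /* only valid for peripheral addresses */
    return bus_addr - BUS_PERIPHERAL_BASE + PI3_PERIPHERAL_BASE;
}
```

Feeding it 0x7E400024 yields 0x3F400024, the address we used for SCALER_DISPLIST1.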
If you run the kernel on the Pi, you should see something that looks like this:
What we’re looking at is an uninitialized framebuffer! Adding a clear_plane function is easy enough:
void clear_plane_16(hvs_plane plane, uint16_t color) |
We’ll call it clear_plane_16 as it’s clearing a 16 bit framebuffer.
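A minimal sketch of such a function follows. It uses a trimmed-down hvs_plane with only the fields the clear needs (my own assumption about the struct’s shape), and it walks the framebuffer row by row using the pitch so that it would still work if rows were padded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down plane for this sketch; the field names are assumptions. */
typedef struct {
    uint16_t width, height; /* in pixels */
    uint16_t pitch;         /* bytes per row */
    void*    framebuffer;
} hvs_plane;

/* Fill every pixel of a 16 bit framebuffer with one color. */
void clear_plane_16(hvs_plane plane, uint16_t color) {
    /* The pitch must cover at least one full row of 16 bit pixels. */
    assert(plane.pitch >= plane.width * sizeof(uint16_t));
    uint16_t* pixels = (uint16_t*)plane.framebuffer;
    const size_t pixels_per_row = plane.pitch / sizeof(uint16_t);
    for (size_t y = 0; y < plane.height; y++) {
        uint16_t* row = pixels + y * pixels_per_row;
        for (size_t x = 0; x < plane.width; x++) {
            row[x] = color;
        }
    }
}
```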
Let’s clear the memory after we’ve set up the display list so we see the clear in “real time.”
It’s also easy to implement some drawing functions, like draw_rectangle and draw_circle. These are inside of draw.c.
Here’s another run, this time with clearing and drawing some shapes:
Adding Additional Planes
Adding additional planes is easy. Let’s refactor and add a new function called write_display_list, which will take an array of planes. All we need to do is loop through the planes, write each out, and then write the End Word after (make sure to remove the End Word write from write_plane, so it’s only written once):
void write_display_list(hvs_plane planes[], uint8_t count) { |
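The shape of that loop can be sketched independently of the plane encoding. To keep the example short, this version assumes each plane has already been encoded into its seven words (the encoded_plane type is hypothetical), and it writes into a caller-supplied array instead of display list memory, so its signature differs from the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCALER_CTL0_END (1u << 31) /* assumed from the Linux VC4 driver */
#define WORDS_PER_PLANE 7u         /* an unscaled plane, as described above */

/* Hypothetical: a plane that has already been turned into its seven words. */
typedef struct {
    uint32_t words[WORDS_PER_PLANE];
} encoded_plane;

/* Writes every plane back to back, then a single End Word.
   Returns the total number of words written. */
size_t write_display_list(const encoded_plane planes[], uint8_t count,
                          uint32_t* dlist, size_t capacity) {
    size_t offset = 0;
    assert((size_t)count * WORDS_PER_PLANE + 1u <= capacity);
    for (uint8_t i = 0; i < count; i++) {
        for (size_t w = 0; w < WORDS_PER_PLANE; w++) {
            dlist[offset++] = planes[i].words[w];
        }
    }
    dlist[offset++] = SCALER_CTL0_END; /* written once, after the last plane */
    return offset;
}
```

Four planes therefore take 4 * 7 + 1 = 29 words, or 116 bytes of display list memory.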
The HVS can merge several different formats together, so as a demonstration, we’ll choose a different pixel format for one of the framebuffers. Let’s use the RGBA8 format, which will also let us test alpha blending. Note that this framebuffer will need additional memory compared to the 16 bit framebuffers. That’s okay as there’s nothing else for it to collide with.
/* "Allocate" 4 framebuffers in memory. Each is 1MiB in size, which is plenty for our purposes. */ |
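It’s worth checking the arithmetic behind those sizes. The sketch below uses a hypothetical framebuffer_bytes helper, assuming quadrant-sized (960x540) framebuffers with no row padding.

```c
#include <stdint.h>

#define ONE_MIB (1024u * 1024u)

/* Hypothetical helper: bytes needed for a tightly packed framebuffer. */
uint32_t framebuffer_bytes(uint16_t width, uint16_t height, uint8_t bytes_per_pixel) {
    return (uint32_t)width * height * bytes_per_pixel;
}

/* Hypothetical helper: nonzero if a framebuffer of `bytes` fits in `region` bytes. */
int fits_in(uint32_t bytes, uint32_t region) {
    return bytes <= region;
}
```

A 16 bit quadrant needs 960 * 540 * 2 = 1,036,800 bytes, which just fits under 1 MiB (1,048,576 bytes). A 32 bit quadrant needs 2,073,600 bytes, roughly 2 MiB, which is the additional memory mentioned above and the reason it matters that nothing sits after that framebuffer.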
Then we can initialize the display list with 4 planes, each one taking up a quadrant of the screen:
/* Set up 4 planes. */ |
write them out to the display list:
write_display_list(planes, 4);
and clear them:
/* Clear the 4 framebuffers. */ |
Here’s what it looks like for me:
Double-Buffering the Display List
There’s one more improvement we should make. If we want to update a display list (for animation, for example), we can’t just write over the current one being used by the HVS. Remember, the HVS is continuously scanning out pixels, and updating the display list needs to be done atomically, otherwise we’ll see a brief flash of who-knows-what.
What we’ll do is keep 2 display list “slots” in display list memory, far enough apart that we need not worry about them stepping on each other. Call them A and B. We’ll start out writing our display list to slot A and set SCALER_DISPLIST1 to A’s location. When we need to perform an update, we’ll write out to display list slot B, then atomically update SCALER_DISPLIST1 to point to B. Next time we update, we’ll write to A, and so on. We’re essentially “double-buffering” our display list.
This means we have to recalculate the display list every time write_display_list is called, but this is good enough for our simple usage.
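The slot bookkeeping can be sketched on its own, separate from the hardware write. Everything here is an assumption for illustration: the names, the choice to split the 4096-word display list memory into two equal halves, and the idea that SCALER_DISPLIST1 takes a word offset into display list memory (which is how I read the Linux driver).

```c
#include <assert.h>
#include <stdint.h>

#define DLIST_TOTAL_WORDS 4096u                    /* 16 KiB / 4 bytes per word */
#define DLIST_SLOT_WORDS  (DLIST_TOTAL_WORDS / 2u) /* hypothetical: two equal halves */

/* Tracks which slot the HVS is currently scanning out: 0 = A, 1 = B. */
typedef struct {
    uint8_t active_slot;
} dlist_state;

/* Word offset at which a slot begins. */
uint16_t dlist_slot_offset(uint8_t slot) {
    assert(slot < 2u);
    return (uint16_t)(slot * DLIST_SLOT_WORDS);
}

/* Where the next display list should be written: the slot NOT in use. */
uint16_t dlist_next_offset(const dlist_state* state) {
    return dlist_slot_offset((uint8_t)(state->active_slot ^ 1u));
}

/* Call after the new list is fully written. Flips the active slot and returns
   the offset to hand to the HVS, e.g. put32(SCALER_DISPLIST1, offset) on the Pi. */
uint16_t dlist_commit(dlist_state* state) {
    state->active_slot ^= 1u;
    return dlist_slot_offset(state->active_slot);
}
```

The update stays atomic from the HVS’s point of view because the only thing that changes while it is scanning out is a single 32 bit register write. The new list is complete in memory before the register ever points at it.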
The final implementation can be found here on GitHub, along with the rest of the source code. src/hvs.c has the interesting HVS code.
I updated kernel.c with a full showcase of the HVS, demonstrating multiple planes, positioning, and transparent blending. At the end, it goes into a loop, swapping the framebuffers around:
Conclusion
I hope this gives you a sense of the possibilities of hardware compositors. This only skims the surface of what’s possible with the RPi’s HVS. Among other features are YUV framebuffers, color conversion, scaling, and rotation. Armed with this knowledge, exploring the VC4 driver in the Linux kernel should be a bit less daunting.
If you’re interested in exploring the concept further, check out Android’s documentation on the Hardware Composer HAL, which describes the interface hardware vendors implement to support hardware compositing on Android. Armed with knowledge of the Pi’s HVS, you can begin to imagine what an implementation looks like.
That’s it for now. In a future post, I’d love to take a look at getting some GPU triangles on the screen!