Exploring Hardware Compositing With the Raspberry Pi

Benjamin Doherty's Blog

2019-05-21

Raspberry Pi

Introduction

Hardware compositing is something I took for granted until earlier this year. The mental model I previously had was a single framebuffer sitting at the end of the display pipeline getting scanned out to whatever protocol the monitor supported. The OS’s windowing system then composited itself to that framebuffer through traditional GPU exchanges, using an API like OpenGL or Direct3D. When I learned about hardware compositing, specifically Android’s SurfaceFlinger, I realized that some systems have the capability to offload some compositing work to a special hardware unit.

I’ve been interested in “pulling back the veil” of low-level graphics for a while now, and given that Broadcom (the makers of the chip in the Pi) previously released documentation on the VideoCore® IV GPU inside the Raspberry Pi 3, this seemed like a good place to start.

So today we’re going to interface with the Raspberry Pi’s hardware composition unit, the Hardware Video Scaler, through a bare-metal kernel. If you want to follow along at home, read Getting Started With the Raspberry Pi first, which describes how I got up and running with kernel development on the Pi. I’m using a project called raspberry-pi-os as boilerplate for the kernel code, which gets us up and running quickly.

Introducing the HVS

Getting pixels on the screen can be thought of as a pipeline. On one end sits a framebuffer, a matrix of numbers waiting to be converted into light. On the other end sits the monitor. The unit of the pipeline is the scanline, a row of pixels.

The chip in the original Raspberry Pi is a Broadcom BCM2835 System on a Chip (SoC); the Pi 3 uses its successor, the BCM2837, which keeps the same VideoCore IV GPU. The Hardware Video Scaler, or HVS, is one of the components on the chip and is part of the display pipeline.

A brief overview of the display pipeline will help orient us. Let’s work backwards starting with the monitor. Contained within the Pi’s SoC is an HDMI encoder. The HDMI encoder encodes pixels, scanline by scanline, and sends them down the physical wire to the monitor. The monitor decodes them and turns on subpixels which emit various wavelengths of light that your brain perceives as continuous colors.

Upstream of the HDMI encoder is a memory buffer known as the FIFO. It temporarily houses scanlines until the HDMI encoder dequeues them. At the opposite end of the FIFO is the HVS. The HVS connects one or more framebuffers to the FIFO. It’s configured to read pixels from a chunk of memory (the framebuffer), and enqueue them, scanline by scanline, onto the FIFO.

As we’ll see, the HVS has a few tricks up its sleeve. It can be configured to composite framebuffers from different areas of memory into a single stream of pixels. It can also scale, rotate, and blend the framebuffers along the way.

Understanding that hardware compositors exist was one of my first “aha” moments of understanding low-level graphics. The pixels you see on your monitor can come from various regions of memory.

While Broadcom’s documentation covers the majority of the GPU, it doesn’t cover the HVS, and I haven’t found any official documentation on it. My main source has been the Linux source code. So bear with me, and please let me know if I got anything wrong.

The Display List

The HVS works by consuming a structure in memory called the display list, which is what we need to prepare. The display list lives in memory-mapped I/O, a special portion of the physical address space that allows the kernel to talk to peripherals, like the HVS. It appears just like any other memory address, and we can use regular ARM instructions to write to it.

The display list is simply a list of commands or words, 4 bytes each. They configure the scale, position, and other properties for each plane.

A plane is an image source to be composited into the final pixel stream with attributes such as its pixel format, where on screen it should be overlaid, if it should be scaled, how it should be rotated, etc. Remember, the HVS does the compositing in real time, i.e., it doesn’t “save” the result of the composition anywhere. Its logic works scanline-by-scanline. The result of the composition only exists as an illusion on your monitor. (This definition of plane is not to be confused with a plane such as a luminance plane in a YUV image.)

We’ll start by crafting a display list that composites a single plane on the screen (we’ll add more later). Let’s look at the words that make up a display list with a single plane:

Control Word
Position Word 0
Position Word 2
Position Word 3
Pointer Word
Pointer Context Word
Pitch Word
End of Display List

In this case, we have 7 words plus an “end of display list” word. 8 words * 4 bytes = 32 bytes in total.
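The word count above can be written down as a small sketch. The enum names and the `dlist_size_bytes` helper are made up for illustration; only the ordering and the 7-plus-1 word count come from the list above, and the helper assumes every plane is unscaled.

```c
#include <assert.h>
#include <stdint.h>

/* Word indices within one unscaled plane (illustrative names). */
enum {
    PLANE_WORD_CONTROL = 0,
    PLANE_WORD_POS0,
    PLANE_WORD_POS2,
    PLANE_WORD_POS3,
    PLANE_WORD_POINTER,
    PLANE_WORD_POINTER_CONTEXT,
    PLANE_WORD_PITCH,
    PLANE_WORD_COUNT            /* 7 words per unscaled plane */
};

static_assert(PLANE_WORD_COUNT == 7, "an unscaled plane is 7 words");

/* Bytes needed for `planes` unscaled planes plus the single end word. */
static inline uint32_t dlist_size_bytes(uint32_t planes)
{
    return (planes * PLANE_WORD_COUNT + 1u) * (uint32_t) sizeof(uint32_t);
}
```

For one plane this gives the 32 bytes computed above.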

Where the Display List Lives

The display list memory lives at address 0x3F402000 which, as I mentioned, is in memory-mapped I/O that the HVS can access. We can treat this memory region as an array of 4-byte words:

static volatile uint32_t* dlist_memory = (volatile uint32_t*) 0x3F402000;

Writing to the display list is as simple as:

dlist_memory[0] = 0x12345678;   // write a single word at offset 0
dlist_memory[1] = 0x12345678;   // write a single word at offset 1

These writes have side effects, so notice that the volatile keyword is used on the dlist_memory pointer. If we wrote to the memory location like any other, the C compiler would be free to optimize away the write, because it’s never read back again by us.

We’ll keep track of the current word offset and every time we write a word, increment the offset. The display list memory is 16 KiB in size (4096 words), so there’s room for many display lists. We’ll later see how we tell the HVS where in memory we’ve written our display list.
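Here is a minimal sketch of offset tracking with a bounds check. It's written to run on a desktop, so an ordinary array stands in for display list memory; `fake_dlist` and `dlist_write` are hypothetical names, not part of the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define DLIST_SIZE_BYTES (16 * 1024)
#define DLIST_WORDS      (DLIST_SIZE_BYTES / sizeof(uint32_t))   /* 4096 words */

/* Host-side stand-in for display list memory. On the Pi this would be
   (volatile uint32_t*) 0x3F402000 rather than an ordinary array. */
static volatile uint32_t fake_dlist[DLIST_WORDS];

/* Write one word at *offset and advance the offset, refusing to run off the end. */
static inline void dlist_write(uint16_t* offset, uint32_t word)
{
    assert(*offset < DLIST_WORDS);
    fake_dlist[(*offset)++] = word;
}
```

The assert is only a development aid. A bare-metal build would need its own way of reporting the overflow.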

A Single Plane Display List

We’re going to start simple: a single plane centered on the screen. We’re assuming a 1920x1080 monitor, and the plane will cover a quarter of its area (960x540):

If you like reading the Linux kernel in all its gory detail, follow along in vc4_plane.c. The function is called vc4_plane_mode_set. By the way, different systems refer to hardware compositing by different names. In Linux, it’s part of KMS, kernel mode setting.

The hvs_plane Struct

We’ll be filling in the details for this function as we go along:

static void write_plane(uint16_t* offset, hvs_plane plane);

The function will write a plane to display list memory at the given word offset. offset is passed as a pointer. The function will increment the offset for each word so the caller knows how many have been written. We can define a macro that will write out a word and increment the offset:

#define WRITE_WORD(word) (dlist_memory[(*offset)++] = (word))

write_plane takes a hvs_plane struct which we need to define. It contains all the necessary information to write out the plane:

typedef struct {
    hvs_pixel_format format;            // format of the pixels in the plane
    hvs_pixel_order pixel_order;        // order of the components in each pixel
    uint16_t start_x;                   // x position of the left of the plane
    uint16_t start_y;                   // y position of the top of the plane
    uint16_t height;                    // height of the plane, in pixels
    uint16_t width;                     // width of the plane, in pixels
    uint16_t pitch;                     // number of bytes between the start of each scanline
    void* framebuffer;                  // pointer to the pixels in memory
} hvs_plane;

The rough equivalent in the Linux kernel, if you’re interested in comparing, is a struct called drm_plane_state.

There are two enums used in hvs_plane that need some explanation, hvs_pixel_format and hvs_pixel_order.

Pixel Format

The hvs_pixel_format is an enum that tells the HVS what type of pixels are in our framebuffer. Here we see some of the pixel formats that the HVS natively supports:

typedef enum {
    /* 8bpp */
    HVS_PIXEL_FORMAT_RGB332 = 0,

    /* 16bpp */
    HVS_PIXEL_FORMAT_RGBA4444 = 1,
    HVS_PIXEL_FORMAT_RGB555 = 2,
    HVS_PIXEL_FORMAT_RGBA5551 = 3,
    HVS_PIXEL_FORMAT_RGB565 = 4,

    /* 24bpp */
    HVS_PIXEL_FORMAT_RGB888 = 5,
    HVS_PIXEL_FORMAT_RGBA6666 = 6,

    /* 32bpp */
    HVS_PIXEL_FORMAT_RGBA8888 = 7,
} hvs_pixel_format;

I left a few out, like YUV. You can see the full list in the Linux driver.

HVS_PIXEL_FORMAT_RGB565 is the format we’ll be working with. Each pixel is 16 bits. The first 5 bits are for red, the next 6 for green, and the last 5 for blue.
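As a sketch of what that layout means in practice, here is one way to pack 8-bit channels into a 16-bit pixel. The `rgb565` helper is hypothetical, and it assumes "first" means the most significant bits, so red lands in bits 15..11.

```c
#include <assert.h>
#include <stdint.h>

enum { RED_BITS = 5, GREEN_BITS = 6, BLUE_BITS = 5 };
static_assert(RED_BITS + GREEN_BITS + BLUE_BITS == 16, "RGB565 fills one 16-bit pixel");

/* Pack 8-bit channels into RGB565 by keeping each channel's top bits.
   Assumption: red occupies the most significant bits. */
static inline uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t) (((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Under that assumption, pure red is 0xF800, pure green is 0x07E0, and pure blue is 0x001F.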

Pixel Order

The order of the components within each pixel is another enum, one of the following (also taken straight from the VC4 driver):

typedef enum {
    HVS_PIXEL_ORDER_RGBA = 0,
    HVS_PIXEL_ORDER_BGRA = 1,
    HVS_PIXEL_ORDER_ARGB = 2,
    HVS_PIXEL_ORDER_ABGR = 3
} hvs_pixel_order;

So far as I can tell, the HVS requires the alpha component to come first (if there is one), so we’ll always use HVS_PIXEL_ORDER_ARGB.

Writing the Display List

We’re now ready to take the hvs_plane and write out a display list.

Control Word

First up is the control word. It conveys:

A signal bit that this word is the start of a plane
A signal bit that the plane has no scaling
The pixel format
The pixel component order
The number of words in this plane

The control word is formed by bitshifting and ORing all of that together.

/* Control word */
const uint8_t number_of_words = 7;
uint32_t control_word = SCALER_CTL0_VALID              |        // denotes the start of a plane
                        SCALER_CTL0_UNITY              |        // indicates no scaling
                        plane.pixel_order       << 13  |        // pixel order
                        number_of_words         << 24  |        // number of words in this plane
                        plane.format;                           // pixel format
WRITE_WORD(control_word);

This is a pattern you’ll see with these words. We cram several arguments into a single word by bitshifting some over so they can fit within 32 bits. I figured out the amount to bitshift by taking a look at the Linux kernel driver.

SCALER_CTL0_VALID and SCALER_CTL0_UNITY are defined as such:

#define SCALER_CTL0_VALID                       (1U << 30)
#define SCALER_CTL0_UNITY                       (1U << 4)

These are just signals to the HVS.
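To check the bit packing on a desktop, the control word can be built by a pure function. This is a sketch with made-up helper names. The field limits in the asserts are inferred from the shifts used above: the format must stay below the unity bit at bit 4, and the word count must stay below the valid bit at bit 30.

```c
#include <assert.h>
#include <stdint.h>

#define CTL0_VALID        (1u << 30)
#define CTL0_UNITY        (1u << 4)
#define CTL0_SIZE_SHIFT   24
#define CTL0_ORDER_SHIFT  13

/* Build a control word for an unscaled plane. */
static inline uint32_t make_control_word(uint32_t format, uint32_t order, uint32_t words)
{
    assert(format < 16);   /* must not reach the unity bit (bit 4) */
    assert(order < 4);     /* four pixel orders are defined */
    assert(words < 64);    /* must not reach the valid bit (bit 30) */
    return CTL0_VALID | CTL0_UNITY | (words << CTL0_SIZE_SHIFT) |
           (order << CTL0_ORDER_SHIFT) | format;
}

/* Recover the word count from a control word. */
static inline uint32_t control_word_size(uint32_t control_word)
{
    return (control_word >> CTL0_SIZE_SHIFT) & 0x3Fu;
}
```

For our plane (format 4 for RGB565, order 2 for ARGB, 7 words) this yields 0x47004014.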

Position Word 0

Position Word 0 conveys the plane’s position on screen. It contains the X and Y positions:

/* Position Word 0 */
uint32_t position_word_0 = plane.start_x        << 0   |
                           plane.start_y        << 12;
WRITE_WORD(position_word_0);

Position Word 2

Position Word 2 conveys the dimensions of the framebuffer, its width and height in pixels.

/* Position Word 2 */
uint32_t position_word_2 = plane.width         << 0    |
                           plane.height        << 16;
WRITE_WORD(position_word_2);

Note that I’m referring to this word as Position Word 2 even though there’s no “Position Word 1” before it, as you might expect. The so-called Position Word 1 is only present if we’re doing scaling, which we aren’t. I’ve kept the names the same as their Linux kernel counterparts, for those following along.
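The two position words can be sanity-checked the same way. A sketch with hypothetical helper names follows. The only limits asserted are the ones implied by the shifts: x has to fit below bit 12 and width below bit 16.

```c
#include <assert.h>
#include <stdint.h>

/* Position Word 0: x in the low bits, y starting at bit 12. */
static inline uint32_t make_position_word_0(uint32_t start_x, uint32_t start_y)
{
    assert(start_x < (1u << 12));   /* x must not spill into the y field */
    return start_x | (start_y << 12);
}

/* Position Word 2: width in the low bits, height starting at bit 16. */
static inline uint32_t make_position_word_2(uint32_t width, uint32_t height)
{
    assert(width < (1u << 16));     /* width must not spill into the height field */
    return width | (height << 16);
}
```

For a 960x540 plane at (480, 270), the words come out as 0x0010E1E0 and 0x021C03C0.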

Position Word 3

Position Word 3 is super easy. It’s just a placeholder for the HVS to store some context information for its own use, which we don’t have to worry about. Leave it uninitialized (but be sure to skip a word) or fill it with your favorite Hexspeak:

/* Position Word 3: used by HVS */
WRITE_WORD(0xDEADBEEF);

Pointer Word

The pointer word is important: it gives the memory location of the actual framebuffer. We’ll set up the memory later. For now, just write out the pointer present in the struct:

/* This cast is okay, because the framebuffer pointer can always be held in 4 bytes
   even though we're on a 64 bit architecture. */
uint32_t framebuffer = (uint32_t) (intptr_t) plane.framebuffer;
WRITE_WORD(framebuffer);

By the way, if you’re familiar with the concept of page flipping, this would be the pointer you’d “flip” to implement it.

Pointer Context Word

The Pointer Context Word is another placeholder word for the HVS to use for its own bidding:

/* Pointer Context: used by HVS */
WRITE_WORD(0xDEADBEEF);

Pitch Word

Last but not least there’s the Pitch Word. The Pitch Word conveys the pitch of the framebuffer, also known as stride. This is the number of bytes in a row of pixels.

/* Pitch word */
uint32_t pitch_word = plane.pitch;
WRITE_WORD(pitch_word);
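The pitch is also what turns an (x, y) coordinate into a memory location. A small sketch, using a hypothetical `pixel_byte_offset` helper:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of pixel (x, y) from the start of the framebuffer. */
static inline uint32_t pixel_byte_offset(uint32_t x, uint32_t y,
                                         uint32_t pitch, uint32_t bytes_per_pixel)
{
    assert((x + 1) * bytes_per_pixel <= pitch);   /* pixel must lie within its row */
    return y * pitch + x * bytes_per_pixel;
}
```

With our 960-pixel-wide RGB565 plane the pitch is 1920 bytes, so the first pixel of the second row sits 1920 bytes in, and the last pixel of the plane sits at byte 1,036,798.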

End of Display List

We’ve finished writing the display list for the first plane. If we had additional planes, this is where they’d go. Since we’re only doing one for now, we need to move on to the final word of the display list, which signifies that the whole thing is done.

/* End word */
WRITE_WORD(SCALER_CTL0_END);

SCALER_CTL0_END is defined as such:

#define SCALER_CTL0_END                         (1U << 31)

Again, it’s just another signal bit that tells the HVS that the display list has finished.
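A handy way to debug a display list is to walk it on the host before handing it to the hardware. The sketch below is hypothetical. It assumes the next plane is found by skipping ahead by the word count stored in each control word, which is how I read the size field, but I haven't confirmed that the HVS itself works this way.

```c
#include <assert.h>
#include <stdint.h>

#define CTL0_END   (1u << 31)
#define CTL0_VALID (1u << 30)

/* Count the planes in a display list, or return -1 if it looks malformed.
   `max_words` bounds the walk so a missing end word cannot run away. */
static inline int count_planes(const uint32_t* dlist, uint32_t max_words)
{
    assert(dlist != 0);
    int planes = 0;
    uint32_t i = 0;
    while (i < max_words) {
        uint32_t word = dlist[i];
        if (word & CTL0_END) return planes;             /* reached the end word */
        uint32_t size = (word >> 24) & 0x3Fu;           /* words in this plane */
        if (!(word & CTL0_VALID) || size == 0) return -1;
        i += size;                                      /* skip to the next control word */
        planes++;
    }
    return -1;                                          /* no end word found */
}
```

Skipping by size matters because filler such as 0xDEADBEEF has bit 31 set. A scan that tested every word for the end bit would stop early.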

Testing It Out

And that’s a basic display list! There are a few more things we need to do before we can test it out. First off, we need to create the framebuffer.

Getting a Framebuffer

Let’s talk for a minute about the memory layout of the Pi. Remember, we’re writing kernel code. That means our code is running directly on the Pi without the luxuries of an operating system beneath us. Most pertinent to us, we’ve forgone virtual memory: we have only physical memory, about 1 GB. We don’t have a malloc function at our disposal either. In our case, we’ll need room for some framebuffers. These framebuffers can go almost anywhere in the address space, since the Pi has a unified memory architecture and the GPU can see all of RAM.

There are a few locations that are off-limits:

Anything at or above 0x3F000000 is peripheral memory
Our stack starts at 0x00400000 and grows downward
The kernel image itself sits at 0x00080000

    ^
    |
0x3F000000              <-- peripheral base (memory-mapped I/O)

~ free space ~

0x00400000              <-- stack (grows downward)
    |
    v


    ^                   <-- kernel image
    |
0x00080000              <-- raspi bootloader loads kernel8.img here

0x00000000

We don’t want to overwrite anything important. Address 0x10000000 will do nicely, giving us plenty of room for additional framebuffers.

uint16_t* const framebuffer = (uint16_t*)(0x10000000);

That’s it. We cast it to a uint16_t pointer because we’re using a 16 bit pixel format, RGB565.
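Since there's no allocator to complain, it can help to check a hand-picked address against the layout above. This sketch uses hypothetical names and is deliberately conservative: it treats everything below the top of the stack as off-limits.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_BASE      0x00080000u
#define STACK_TOP        0x00400000u
#define PERIPHERAL_BASE  0x3F000000u

static_assert(KERNEL_BASE < STACK_TOP && STACK_TOP < PERIPHERAL_BASE,
              "layout constants must be ordered");

/* True if [start, start + size) lies entirely between the stack top
   and the peripheral base. */
static inline int region_is_free(uint32_t start, uint32_t size)
{
    return start >= STACK_TOP &&
           start < PERIPHERAL_BASE &&
           size <= PERIPHERAL_BASE - start;
}
```

Our 960x540 RGB565 framebuffer at 0x10000000 passes this check with plenty of room to spare.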

Informing The HVS Of Our Display List

Let’s create a hvs_plane and call our function, writing it to the display list at offset 0. This plane will be a quarter of the screen size, centered in the middle:

const uint16_t screen_width = 1920, screen_height = 1080;
const uint16_t fb_width = screen_width / 2, fb_height = screen_height / 2;
hvs_plane plane = {
    .format = HVS_PIXEL_FORMAT_RGB565,
    .pixel_order = HVS_PIXEL_ORDER_ARGB,
    .start_x = (screen_width - fb_width) / 2,
    .start_y = (screen_height - fb_height) / 2,
    .height = fb_height,
    .width = fb_width,
    .pitch = fb_width * sizeof(uint16_t),
    .framebuffer = (void*) 0x10000000
};
uint16_t offset = 0;
write_plane(&offset, plane);

All that’s left to do is tell the HVS where the display list is. There’s another special memory location which is memory-mapped to a register on the HVS, called SCALER_DISPLIST1. It’s at 0x3F400024.

/* Tell the HVS where the display list is by writing to the SCALER_DISPLIST1 register. */
put32(SCALER_DISPLIST1, 0);

The put32 function is inherited from the raspberry-pi-os project:

.globl put32
put32:
	str w1,[x0]
	ret

It merely stores a 32 bit word to a memory location. We use this function to write to the register for similar reasons to why we marked the dlist_memory pointer as volatile- so the C compiler won’t optimize the write away.

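By the way, here’s a listing of all the HVS registers on the Pi. If you search for SCALER_DISPLIST1, you’ll see that it’s listed at address 0x7E400024, not 0x3F400024. The BCM2835 Peripheral Guide explains the discrepancy. It describes the original Pi, whose peripherals start at physical address 0x20000000; on the Pi 3 the same peripherals start at 0x3F000000, so bus address 0x7E400024 becomes physical address 0x3F400024: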

The bus addresses for peripherals are set up to map onto the peripheral bus address range starting at 0x7E000000. Thus a peripheral advertised here at bus address 0x7Ennnnnn is available at physical address 0x20nnnnnn.
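The translation is simple enough to write as a helper. This is a sketch with a hypothetical `bus_to_phys` function; the peripheral base is passed in because it differs between boards (0x20000000 in the quote above, 0x3F000000 for the addresses used in this post).

```c
#include <assert.h>
#include <stdint.h>

#define BUS_PERIPHERAL_BASE  0x7E000000u
#define PI3_PERIPHERAL_BASE  0x3F000000u

/* Translate a peripheral bus address (0x7Ennnnnn) to a physical address. */
static inline uint32_t bus_to_phys(uint32_t bus_addr, uint32_t phys_base)
{
    assert((bus_addr & 0xFF000000u) == BUS_PERIPHERAL_BASE);
    return phys_base | (bus_addr & 0x00FFFFFFu);
}
```

Calling `bus_to_phys(0x7E400024, PI3_PERIPHERAL_BASE)` gives 0x3F400024, the address we wrote to above.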

If you run the kernel on the Pi, you should see something that looks like this:

What we’re looking at is an uninitialized framebuffer! Adding a clear_plane function is easy enough:

void clear_plane_16(hvs_plane plane, uint16_t color)
{
    uint16_t* pixels = (uint16_t*) plane.framebuffer;
    for (int i = 0; i < plane.width * plane.height; ++i) {
        pixels[i] = color;
    }
}

We’ll call it clear_plane_16 as it’s clearing a 16 bit framebuffer.

Let’s clear the memory after we’ve set up the display list so we see the clear in “real time.”

It’s also easy to implement some drawing functions, like draw_rectangle, and draw_circle. These are inside of draw.c.
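To give a flavor of what such a function might look like, here's a minimal filled-rectangle sketch. It is not the code from draw.c: the name `fill_rect_16` and its parameters are made up, and it takes a raw pixel pointer so it can be tried on a desktop.

```c
#include <assert.h>
#include <stdint.h>

/* Fill a w x h rectangle at (x, y) in a 16-bit framebuffer, clipped to the buffer. */
static inline void fill_rect_16(uint16_t* fb, uint32_t fb_width, uint32_t fb_height,
                                uint32_t pitch_bytes,
                                uint32_t x, uint32_t y, uint32_t w, uint32_t h,
                                uint16_t color)
{
    assert(pitch_bytes % sizeof(uint16_t) == 0);
    uint32_t stride = pitch_bytes / sizeof(uint16_t);   /* pixels per row, incl. padding */
    for (uint32_t row = y; row < y + h && row < fb_height; row++) {
        for (uint32_t col = x; col < x + w && col < fb_width; col++) {
            fb[row * stride + col] = color;
        }
    }
}
```

Rows are addressed through the pitch rather than the width, so the same function keeps working if a framebuffer ever has padding at the end of each scanline.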

Here’s another run, this time with clearing and drawing some shapes:

Adding Additional Planes

Adding additional planes is easy. Let’s refactor and add a new function called write_display_list, which will take an array of planes. All we need to do is loop through the planes, write each out, and then write the End Word after (make sure to remove the End Word write from write_plane, so it’s only written once):

void write_display_list(hvs_plane planes[], uint8_t count) {
    uint16_t offset = 0;

    /* Write out each plane. */
    for (uint8_t p = 0; p < count; p++) {
        write_plane(&offset, planes[p]);
    }

    /* End word */
    dlist_memory[offset] = SCALER_CTL0_END;

    /* Tell the HVS where the display list is by writing to the SCALER_DISPLIST1 register. */
    put32(SCALER_DISPLIST1, 0);
}

The HVS can merge several different formats together, so as a demonstration, we’ll choose a different pixel format for one of the framebuffers. Let’s use the RGBA8888 format (HVS_PIXEL_FORMAT_RGBA8888), which will also let us test alpha blending. Note that this framebuffer needs twice the memory of the 16-bit framebuffers. That’s okay, as there’s nothing after it to collide with.

/* "Allocate" 4 framebuffers in memory, spaced 1 MiB apart. A 960x540 16-bit framebuffer
   needs just under 1 MiB; the 32-bit one needs about 2 MiB, but nothing follows it. */
uint16_t* const fb_one     = (uint16_t*)(0x10000000);   // the first 3 will use 16-bit pixels.
uint16_t* const fb_two     = (uint16_t*)(0x10100000);
uint16_t* const fb_three   = (uint16_t*)(0x10200000);
uint32_t* const fb_four    = (uint32_t*)(0x10300000);   // this one will use a 32-bit pixel format.
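The spacing is worth double-checking with a little arithmetic. A sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes occupied by a framebuffer with the given pitch and height. */
static inline uint32_t framebuffer_bytes(uint32_t pitch_bytes, uint32_t height)
{
    return pitch_bytes * height;
}

/* True if the framebuffer fits within a slot of `slot_bytes` bytes. */
static inline int fits_in_slot(uint32_t pitch_bytes, uint32_t height, uint32_t slot_bytes)
{
    assert(slot_bytes > 0);
    return framebuffer_bytes(pitch_bytes, height) <= slot_bytes;
}
```

A 960x540 framebuffer at 2 bytes per pixel is 1,036,800 bytes, just under the 1,048,576 bytes in 1 MiB. At 4 bytes per pixel it is 2,073,600 bytes, so it would overrun a 1 MiB slot if anything were placed after it.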

Then we can initialize the display list with 4 planes, each one taking up a quadrant of the screen:

/* Set up 4 planes. */
hvs_plane planes[4];
int i = 0;
for (int y = 0; y < 2; y++) {
    for (int x = 0; x < 2; x++) {
        planes[i].format = HVS_PIXEL_FORMAT_RGB565;
        planes[i].pixel_order = HVS_PIXEL_ORDER_ARGB;
        planes[i].start_x = fb_width * x;
        planes[i].start_y = fb_height * y;
        planes[i].height = fb_height;
        planes[i].width = fb_width;
        planes[i].pitch = fb_width * 2;
        i++;
    }
}
planes[0].framebuffer = fb_one;
planes[1].framebuffer = fb_two;
planes[2].framebuffer = fb_three;
planes[3].framebuffer = fb_four;

/* We'll make the fourth framebuffer a 32-bit pixel format, just for demonstrations. */
planes[3].format = HVS_PIXEL_FORMAT_RGBA8888;
planes[3].pitch = fb_width * sizeof(uint32_t);

write them out to the display list:

write_display_list(planes, 4);

and clear them:

/* Clear the 4 framebuffers. */
clear_plane_16(planes[0], BLUE_16);
clear_plane_16(planes[1], WHITE_16);
clear_plane_16(planes[2], RED_16);
clear_plane_32(planes[3], YELLOW_32);
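The 32-bit clear isn't shown above, so here's a guess at its shape. This is a sketch rather than the real clear_plane_32: it takes a raw pointer and dimensions instead of an hvs_plane so it can be compiled on its own, and it treats the color as an opaque 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Fill a 32-bit framebuffer with one pixel value, honoring the row pitch. */
static inline void clear_32(uint32_t* pixels, uint32_t width, uint32_t height,
                            uint32_t pitch_bytes, uint32_t color)
{
    assert(pitch_bytes >= width * sizeof(uint32_t));
    uint32_t stride = pitch_bytes / sizeof(uint32_t);   /* pixels per row, incl. padding */
    for (uint32_t y = 0; y < height; y++) {
        for (uint32_t x = 0; x < width; x++) {
            pixels[y * stride + x] = color;
        }
    }
}
```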

Here’s what it looks like for me:

Double-Buffering the Display List

There’s one more improvement we should make. If we want to update a display list (for animation, for example), we can’t just write over the current one being used by the HVS. Remember, the HVS is continuously scanning out pixels, and updating the display list needs to be done atomically, otherwise we’ll see a brief flash of who-knows-what.

What we’ll do is keep 2 display list “slots” in display list memory, far enough apart that we need not worry about them stepping on each other. Call them A and B. We’ll start out writing our display list to slot A and set SCALER_DISPLIST1 to A’s location. When we need to perform an update, we’ll write out to display list slot B, then atomically update SCALER_DISPLIST1 to point to B. Next time we update, we’ll write to A, and so on. We’re essentially “double-buffering” our display list.

This means we have to recalculate the display list every time write_display_list is called, but this is good enough for our simple usage.
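Here is a host-side sketch of the slot-swapping idea. Everything in it is a stand-in: an array plays the role of display list memory, a plain variable plays the role of SCALER_DISPLIST1, and `publish_display_list` is a made-up name. It assumes the register takes a word offset into display list memory, as the earlier write of 0 suggests.

```c
#include <assert.h>
#include <stdint.h>

#define DLIST_WORDS  4096u                /* 16 KiB of display list memory */
#define SLOT_WORDS   (DLIST_WORDS / 2u)   /* slot A at word 0, slot B at word 2048 */

static volatile uint32_t fake_dlist[DLIST_WORDS];   /* stand-in for 0x3F402000 */
static volatile uint32_t fake_displist1;            /* stand-in for SCALER_DISPLIST1 */
static uint32_t active_slot = 1;                    /* start at B so the first write goes to A */

/* Copy a finished display list into the inactive slot, then switch to it.
   Returns the word offset that was published. */
static inline uint32_t publish_display_list(const uint32_t* words, uint32_t count)
{
    assert(count <= SLOT_WORDS);
    uint32_t next = active_slot ^ 1u;
    uint32_t base = next * SLOT_WORDS;
    for (uint32_t i = 0; i < count; i++) {
        fake_dlist[base + i] = words[i];
    }
    fake_displist1 = base;   /* the only write the "hardware" observes */
    active_slot = next;
    return base;
}
```

The slot in use is never touched while the new list is being written. The switch itself is a single register write.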

The final implementation can be found here on GitHub, along with the rest of the source code. src/hvs.c has the interesting HVS code.

I updated kernel.c with a full showcase of the HVS, demonstrating multiple planes, positioning, and transparent blending. At the end, it goes into a loop, swapping the framebuffers around:

Conclusion

I hope this gives you a sense of the possibilities of hardware compositors. This only skims the surface of what’s possible with the RPi’s HVS. Among other features are YUV framebuffers, color conversion, scaling, and rotation. Armed with this knowledge, exploring the VC4 driver in the Linux kernel should be a bit less daunting.

If you’re interested in exploring the concept further, check out Android’s documentation on the Hardware Composer HAL, which describes the interface hardware vendors implement to support hardware compositing on Android. Armed with knowledge of the Pi’s HVS, you can begin to imagine what an implementation looks like.

That’s it for now. In a future post, I’d love to take a look at getting some GPU triangles on the screen!