Hi, I am Sebastian Aaltonen. I have over 20 years of experience in graphics
programming. In the past I worked at Ubisoft and Unity, building their cross
platform rendering technologies.
I joined HypeHype one year ago with a mission to rewrite their mobile rendering
technology. Today I am going to be talking about the first milestone of that project:
rewriting the low level graphics API and the platform backends.
HypeHype is a mobile game development platform. You create games directly on the
touchscreen and upload them to our cloud server.
Gamers use a TikTok-style feed to browse the games. The games are instant
loading. This is a big technical challenge. Both the game binary size and the loading
code have to be highly optimized. To make the initial binary smaller, we store data in
highly compressed form and also lean on streaming.
HypeHype has up to 8 player multiplayer. Multiplayer features and the player count
will increase in the future, once our cloud game server infrastructure is deployed.
We have a full blown game editor inside the mobile app. A visual scripting system is
used for writing the game logic. Players can spectate creators creating games, and
multiple creators can build a game together in real time. It’s a bit like
Google Docs, but for game creation. Test play is instant and all the spectators join
the multiplayer test session as players. This improves the iteration time drastically.
We of course have a full set of social features, including chat, leaderboards, replays
and similar.
HypeHype is mainly targeting mobile devices and tablets. But we also have a web
client and native PC and Mac applications.
I have a console development background at Ubisoft, so I like to compare mobile
devices to the past console generations to understand them better.
Xbox 360 and PS3 are nowadays equal to low-mid tier mobile devices in GPU
performance. This is excellent news, since these consoles offered the greatest visual
jump we have seen between console generations so far: We got HD output
resolution, and were able to implement proper HDR lighting pipelines, physically
based material models and image post processing for the first time. All of that is
possible today on mainstream mobile devices. And we can scale that down to bottom
tier devices at 30 fps with upscaling.
When you look at the high end, you see the latest $1000+ phones reaching Xbox One
and PS4 levels of performance already. However these phones run at higher native
resolution and are thermally constrained, thus in reality, we can’t quite reach that
generation of visual fidelity yet on mobile devices in real games. And we don’t even
want to, since that would make the devices hot and drain their battery in a couple of
hours.
HypeHype games have been limited to simple visuals: Stylized untextured objects,
simple gamma space lighting and tiny scenes with short view distance. This has been
fine for simple hyper-casual games.
This is however a big limitation of the platform, so we started building a new renderer
from scratch one year ago. The visual fidelity target for the new renderer is to
match the best looking Xbox 360 and PS3 games. We will be bringing a full PBR
pipeline with modern lighting, shadowing and post processing techniques. We will be
targeting larger game worlds and longer draw distances to allow more game genres
to be built properly on the platform.
This is all nice of course, but we have to be really careful about the performance cost
of all these new improvements. We still want to be running HypeHype games at
locked 60 fps on mid tier mobile phones without throttling the devices. This is a big
concern for us and is the main reason why we are focusing heavily on performance in
our new rendering architecture.
If you compare current mainstream phones with Xbox 360, you notice a lot of
similarities.
Both designs have slow shared main memory. Bandwidth is the main limiting factor.
Both designs also employ techniques to reduce the memory bandwidth usage. The
most important one being on-chip storage for render targets. On Xbox 360 you had a
10MB EDRAM buffer for your whole render target. On mobile phones, you have a
smaller on-chip tile memory. Both technologies solve similar problems. Overdraw
doesn’t require extra memory bandwidth and Z-buffering and blending happens fully
on chip. On mobile phones, you also have framebuffer fetch, allowing you to load
back the previous pixel from the same render target location without a memory round
trip. The newer Xbox One console also was equipped with read-write ESRAM
allowing similar optimizations.
Since the main memory is slow, you want to avoid resolving render targets as much
as possible. You want to minimize the number of render passes. Doing multiple things
at once is the key to good performance. Modern mobile phones also have framebuffer
compression, to reduce the render target resolve and sampling bandwidth cost. This
is a good addition, but doesn’t fully solve the problem. ASTC texture compression
also helps. It offers better quality and smaller footprint than DXT5 back in the day.
Mobile phones also have double rate fp16 math. This helps, since you don’t want to
lean on memory lookups on bandwidth starved devices. And there’s now better lower
precision HDR framebuffer formats available.
But some old limitations still remain: Mobile GPUs are still designed around uniform
buffers. SSBO loads from dynamic addresses are still slow. If you are able to
scalarize your memory access patterns, you are hitting the performance sweet spot. This
limits the algorithms we are able to implement efficiently. Many mobile phones also
write vertex varyings to main memory, which costs a significant amount of precious
bandwidth. Optimizing the size of the varyings is key for good performance on these
devices.
I was already talking about GPU-driven rendering 8 years ago at SIGGRAPH. We
presented the core ideas, such as clustered rendering and 2-pass occlusion
culling, which have become a de facto standard nowadays.
Recently Nanite by Epic made GPU-driven rendering available to the mainstream. They
combine a V-buffer, material classification, analytic derivatives and a software rasterizer
to make GPU-driven rendering robust enough for a generic engine.
However, there’s still a lot of unsolved performance problems with GPU-driven
rendering on mainstream mobile GPUs.
Mobile GPUs are not optimized yet for SSBO loads. AMD and Nvidia optimized their
data paths a couple of generations ago when they added ray-tracing. Ray-tracing
access patterns are dynamic and you can’t lean on tiny on-chip buffers for vertex
attributes anymore. We still need to wait for mobile GPUs with similar optimizations to
become mainstream.
V-buffer requires you to run the vertex shader 3 times per pixel, and this includes
fetching all the vertex attributes of these 3 vertices. You also need to fetch all the
instance data and material data from dynamic locations. This is over 20 non-uniform
memory loads in the pixel shader. Mobile chips simply aren’t designed for this kind
of memory heavy workload.
No current mobile GPU supports framebuffer compression for compute shader writes.
Compute shader is the most efficient way to implement full screen material passes in
deferred V-buffer shading. If you do that on mobile, you waste a lot of bandwidth.
64 bit atomics are commonly used in software rasterizers. You pack the Z value in
high bits and payload in low bits and let the atomic resolve the closest surface.
There’s no 64 bit atomic support in mobile GPUs. SampleGrad is also slow: ⅛ rate or
even slower, which makes deferred texturing with analytic gradients quite costly. And
support for wave intrinsics is spotty, and you even have emulated groupshared
memory on some low end devices.
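The 64 bit atomic trick mentioned above can be illustrated on the CPU. This is only a minimal sketch, not a rasterizer: the names `packVisibility`, `unpackPayload` and `writeIfCloser` are hypothetical, and it assumes that a smaller depth value means a closer surface, so the atomic operation is a min.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Depth goes in the high 32 bits, so comparing two packed values compares
// depth first. The payload (for example a triangle id) rides in the low bits.
inline uint64_t packVisibility(float depth, uint32_t payload) {
    assert(depth >= 0.0f); // non-negative IEEE floats sort like their bit patterns
    uint32_t depthBits;
    std::memcpy(&depthBits, &depth, sizeof(depthBits));
    return (uint64_t(depthBits) << 32) | payload;
}

inline uint32_t unpackPayload(uint64_t packed) {
    return uint32_t(packed & 0xFFFFFFFFu);
}

// Emulates a 64 bit atomic min with a compare-exchange loop.
// Assumption of this sketch: smaller depth is closer to the camera.
inline void writeIfCloser(std::atomic<uint64_t>& pixel, uint64_t candidate) {
    uint64_t current = pixel.load(std::memory_order_relaxed);
    while (candidate < current &&
           !pixel.compare_exchange_weak(current, candidate,
                                        std::memory_order_relaxed)) {
        // compare_exchange_weak refreshed 'current'; retry while still closer.
    }
}
```

Without a native 64 bit atomic, depth and payload cannot be resolved in one operation like this, which is why the missing feature matters for software rasterizers.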
As a result, traditional CPU-based rendering is still the best bet for mainstream mobile
phones today. We could do 10,000 draw calls on Xbox 360 back in the day at 60 fps.
To reach that goal on mobile devices today, we need to write very well optimized
rendering code.
Let’s talk about our roadmap.
We split the renderer rewrite into two stages. First we rewrote the low level gfx API
and all the platform specific backend code. In order to run both the old and the new
backends in tandem, we introduced a minimal wrapper with ifdefs, so that we can keep
shipping the old rendering code and switch between the new and the old to compare
them. We have already deleted 200 files of old rendering code and recently started
tearing down the wrapper and replacing it with direct calls to the new platform API.
This presentation will focus on the low level platform API and the backends. I will be
later talking about our new high level rendering code. Our design allows us to refactor
these pieces completely independently of each other. I will come back to that topic
later in the presentation.
The first thing we need to decide is the platform abstraction level. What code is
platform specific and what code is platform independent.
Game engines generally limit platform specific code to the lowest levels of the stack.
This minimizes the amount of code that is platform specific and reduces the
implementation and maintenance cost. However some engine and renderer specifics
tend to leak to the lowest level platform code.
If you look at mobile apps, the platform specific code tends to reach a bit higher
levels in the stack. For example the popular Google Flutter app framework is
developed by multiple platform specific teams. They usually ship new features first on
mobiles and later to desktops. Android and iOS don’t have full feature parity either.
Their high level rendering code is different on desktop and mobile platforms, including
Mac and iOS, even though they both use the same Metal API.
Many mobile apps bring the code separation even higher. There are often completely
separate iOS and Android teams with their dedicated code bases. Most of the business
logic in these apps tends to be running in a cloud server, which is of course shared
and maintained by a third team.
HypeHype is a real time game engine, so we of course need to have all of our world
state locally. Games must run identically on all devices and cross play must work
across all devices. The old HypeHype gfx code base had duplicated shaders for
Metal and some duplicated higher level code as well. This bloated the test matrix,
added maintenance cost and made adding new features slow. This was the first thing
I wanted to solve. The goal was to bring the platform API to even lower level
compared to existing game engines.
We want as small amount of platform specific code as possible. This leads to a
design where we tightly wrap the existing low level gfx APIs.
The design work started by cross referencing Vulkan, Metal and WebGPU docs. I was
already familiar with all of these APIs, which made the work easier and less error
prone.
When writing a wrapper, you first want to find the common set of features. These are
often straightforward to wrap. The difficulties arise when there’s differences in the API
design. Care must be taken in order to abstract these differences in a way that is
performance optimal. We chose to use Metal 2.0 because it’s closer to Vulkan and
WebGPU and it provides placement heaps, argument buffers and manual fences,
allowing us to extract a bit more performance out of Apple devices too. We also
support MoltenVK to make cross platform development easier, but we don’t ship it
since our Metal 2.0 backend is roughly 40% faster for CPU.
In order to make the API more compact, we trim all deprecated stuff that nobody uses
anymore. These things were failed experiments that never lived up to our
expectations regarding performance. Vertex buffers are an interesting topic. At
Ubisoft, we deprecated vertex buffers in our GPU-driven renderer already 8 years
ago. But at HypeHype we still support vertex buffers, since some mobile GPU shader
compilers generate better code for them. Also we are still using WebGL2 in our web
client, since WebGPU coverage is not yet good enough. I will likely be removing
vertex buffers from the API in a few years.
Single set of shaders is crucial for tech artist productivity. We use modern open
source tools such as SPIRV-Cross to cross compile our shaders to all target
platforms.
Let’s talk about the design goals of this new platform API.
First, we want it to be a standalone library. Designed and maintained independently
from the HypeHype engine. It needs to have a stable API that doesn’t change often.
I have seen a lot of graphics platform abstractions during my career and the problem
in most abstractions is that user land concepts creep into the hardware API. Having a
mesh and material in the platform code is the most common issue. This is problematic
since mesh and material both have change pressure. Meshlets and bindless textures
are the future. We don’t want to commit to a certain way of presenting them. Mesh
can be simply represented as an index buffer binding + N vertex buffer bindings, and
material can be represented as a bind group containing multiple texture descriptors
and a buffer for value data.
Automatic uniform handling might feel like a good idea in the beginning, but
eventually you want to add stuff like geometry instancing and now you need to
refactor your backend code to change the data layout. Or even worse, add a new fast
path to complicate the API. And eventually you add a new fast path for GPU-driven
rendering too, bloating the API further. In our design, the user land code is
responsible for setting up all the data!
Zero extra API overhead is another core design pillar for us. The platform
interface should not add significant cost. It should be as easy to use as DX11, but
always as efficient as hand written optimized DX12. A wrong solution is to copy the
DX11 API as is. This way you end up emulating the DX11 driver in your code base,
and rest assured that Nvidia and AMD do this better than your team. Thus your
modern backends are slower than DX11. The reason for this is that you have too fine
grained inputs, too fine grained render state, lots of shadow state and data copies.
PSO and render state tracking and caching is a big performance drain, and slow
software command buffer design usually adds to the cost.
So, we have very strict performance standards for our API, but at the same time we
want it to be as easy to use as DX11. How can we achieve this?
We need a good process for designing the API.
The traditional way would be to spend months researching the API documentation
and writing a big technical design document describing the new API in detail, splitting
it into tasks and estimating the implementation time for each task for each backend.
The issue with this approach is that you lock in the design too early and it’s hard to
change later. Small nitty gritty details matter a lot in platform specific graphics code.
You can’t really understand the performance impact of all the corner cases without
writing any code. Now you will notice those issues when there’s a lot of production
ready code written. It’s too hard to justify a full rewrite of plans and code at this point.
Agile test driven development has the opposite problem. You are focusing on what
you need in the next sprints. You implement small independent pieces of code that
have full test coverage. The assumption is that once you put these pieces together,
you have good architecture. But you didn’t even do any architecture design. More
pieces equals more interfaces, equals more communication overhead. It is difficult to
reach optimal performance with this kind of programming practice. And it’s even more
difficult to throw away lots of production ready code with full test coverage and lots of
story points spent once you notice that the architecture needs a big overhaul to meet
the performance goals.
Our solution for this problem is to use a highly iterative design process.
I start by writing mock user land code. I use my prior expertise and start writing my
dream graphics code, assuming I have the perfect API. That API doesn’t exist yet, but
I keep writing mock code until I am happy with it. I write code for creating all the
resources I need for rendering: textures, shaders, buffers and so on. Then I write a
small draw loop using these resources. The draw loop is called multiple times, with
some resources mutated to implement animation. It’s important to design both
dynamic and static data paths early.
Once I am happy with the first iteration of the user land code, I write a mock platform
API for it. This is just a hollow API at this point. There’s no backend implementation.
But it allows me to start using the compiler to do syntax checking and autocomplete.
Now I can really start experimenting with the API to see how good it feels to use. I will
of course refactor all the time when I find even the slightest need for it. I add missing
mock use cases and go through the Vulkan, Metal and WebGPU API docs to ensure
that I have not missed anything important.
Then I will do a performance check for all the user land code. I have a good
understanding of how all the platform APIs work, so I think through what kind of implementation
each API call would require in the Vulkan, Metal and WebGPU backends. If the
implementation is trivial then it’s fine. If the implementation requires extra data copies,
hash map lookups, memory allocations or other expensive operations, then I scrap
that design and rewrite that part of the API to be more efficient. As you remember, our
goal is to be as fast as hand optimized DX12 in every single case. We can’t do that if
our API doesn’t map perfectly to the underlying hardware API.
Once we are happy with the performance of our mock code, we start implementing
the backends. We of course notice nitty gritty details that we missed during this
process and immediately refactor the mock code and the mock API when this
happens. We don’t write a heavy test suite just yet, as that would slow down our
iteration time. Instead we are leaning on Vulkan and Metal validation layers to provide
us with thousands of test cases for free. We hook the validation layer error callback to our
automated tests to ensure our code keeps functioning when we refactor it.
The last topic about API design I wanted to discuss today is doing things at the right
frequency and granularity.
A big problem in rendering code tends to be that expensive operations are done at
too high frequency. This also tends to add tracking cost to the hot draw loop.
Games have a lot of temporal coherency. You load the game world and you slowly
mutate it every frame. Most of the data stays the same. Also the camera is most of
the time moving slowly. The human brain needs temporal coherency between the frames
to see smooth movement. This is great for us! We want to exploit it!
Let’s take a look at all the stuff happening: Loading the game world and all the shader
PSOs happens in the beginning. If you have a larger level, you also load textures,
meshes and materials when you move around. Most of the objects are spawned in
the beginning, but sections of the level might be spawned during streaming. Enemies,
loot, projectiles, etc. are generally spawned throughout the game, but not that many
every frame. The only really high frequency operation is culling all the objects and
drawing the visible objects. The culling and draw loops are the most time sensitive
loops in your whole code base.
I have highlighted problem cases with red. People tend to do processing related to
these inside their hot draw loop.
Modifying material bindings is not common. How often do you replace the normal
map of an already loaded material? How often do you change the shader that’s used
to render an object? How often do you change the render state that is used to render an
object? Pretty much never, except in some special effects. Animating object color and
object transform are more common operations. A small subset of objects animate
every frame. We only want to pay for these things when they happen. Not for every
draw call.
Our solution for this problem is to fully separate all data modification from drawing. All
the data is ready before the draw loop.
Pipeline state objects (PSO) should be built at application startup or at level load time.
Building PSOs at runtime causes stuttering. In our philosophy shader variants are
authored by coders and tech artists and hand optimized. There’s only a limited
number of them. This is similar to what id Software does and provides very good
performance.
We store the PSO handle directly in each object’s visual component. We don’t need
hash map lookups to obtain it every frame.
We precreate all the bind groups (descriptor sets). Material descriptor sets contain all
their textures and a buffer for value data. We store the material bind group handle in
each object’s visual component. This avoids a hashmap lookup and makes it possible to
efficiently change the material bindings with a single Vulkan, Metal or WebGPU
command.
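A minimal sketch of what storing the handles in the visual component buys you. All type and function names here (`PipelineHandle`, `BindGroupHandle`, `VisualComponent`, `DrawCall`, `buildDrawCalls`) are hypothetical stand-ins, not the actual HypeHype API. The point is that the hot loop only copies precomputed handles and performs no hash map lookups or state tracking.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical opaque handles to prebuilt GPU objects.
struct PipelineHandle  { uint32_t id = 0; };
struct BindGroupHandle { uint32_t id = 0; };

// Per-object data prepared at load time, outside the draw loop.
struct VisualComponent {
    PipelineHandle  pipeline;   // PSO built at startup or level load
    BindGroupHandle material;   // precreated bind group (textures + value buffer)
    uint32_t        indexCount = 0;
};

struct DrawCall {
    PipelineHandle  pipeline;
    BindGroupHandle material;
    uint32_t        indexCount = 0;
};

// Hot loop: combines stored handles into draw calls for the visible objects.
inline std::vector<DrawCall> buildDrawCalls(const std::vector<VisualComponent>& visuals,
                                            const std::vector<uint32_t>& visibleIndices) {
    std::vector<DrawCall> draws;
    draws.reserve(visibleIndices.size());
    for (uint32_t i : visibleIndices) {
        assert(i < visuals.size());
        const VisualComponent& v = visuals[i];
        draws.push_back(DrawCall{v.pipeline, v.material, v.indexCount});
    }
    return draws;
}
```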
Separating persistent and dynamic data is important. Persistent data is uploaded at
startup and delta updated when changed. I had a talk about this topic two years ago
at REAC 2021. You can refer to that presentation if you want more information about
that topic.
Dynamic data should be batch uploaded once per pass instead of using map/unmap
per draw call. Global data should be separated from per-draw data to minimize the
wasted bandwidth cost.
Resource synchronization is a big CPU cost in many engines. Our current solution is
simple: When a render pass begins, we transition the render target to writable layout.
At the end of the render pass, we transition it back to sampled texture layout. This
way all the textures (both static and dynamic) are always in sampler readable layout.
We never need to do per-draw call resource tracking at all. This saves a significant
amount of CPU cycles.
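The synchronization rule above can be sketched as a tiny state machine. The `Layout`, `Texture` and `CommandRecorder` names are hypothetical, and a real backend would record the platform's layout transition where this sketch only flips a field. It shows that the only transitions happen at render pass boundaries, and that draws can assume every texture is readable.

```cpp
#include <cassert>
#include <cstdint>

enum class Layout : uint8_t { Sampled, RenderTarget };

struct Texture {
    Layout layout = Layout::Sampled;  // invariant outside render passes
};

struct CommandRecorder {
    uint32_t transitionCount = 0;

    void transition(Texture& tex, Layout to) {
        // A real backend would record the platform's layout transition here.
        tex.layout = to;
        ++transitionCount;
    }

    void beginRenderPass(Texture& target) {
        assert(target.layout == Layout::Sampled);
        transition(target, Layout::RenderTarget);
    }

    void endRenderPass(Texture& target) {
        assert(target.layout == Layout::RenderTarget);
        transition(target, Layout::Sampled);
    }

    // Draws never track or change layouts; they rely on the invariant.
    void draw(const Texture& sampled) {
        assert(sampled.layout == Layout::Sampled);
    }
};
```

The cost is two transitions per render target per pass, regardless of how many draw calls the pass contains.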
And now we are ready to talk about the implementation details.
A renderer needs textures, buffers, shaders and several other resource objects. We
need a good way to store these objects and ensure they are safe to use.
The modern C++ practice would be to use smart pointers, reference counting and
RAII (resource acquisition is initialization).
Frankly, these are too slow for us. Reference counted smart pointers tie the life time
of a reference and the backing memory together. This results in a lot of small memory
allocations. Memory allocations are expensive in current highly multithreaded
systems. Allocations are also randomly scattered around the system memory, making
data access patterns worse and increasing the cache misses. Copying reference
counted smart pointers requires two atomics (add, sub), since we are in a
multithreaded system. Ownership can be shared between threads.
There’s also safety issues. Ref counting makes object lifetime vague. Hard to reason
about. It might die in any thread. RAII objects such as listeners cause destructors to
have side effects. Example: An object’s ref count runs out in another thread, and the destructor of
its listener de-registers it from an array. Another thread is just iterating that array.
CRASH! To avoid this crash, you have to protect part of the destructor with a mutex.
This means that every time you delete an object, you need a mutex lock. This is very
expensive. HypeHype is loading and unloading games rapidly in the feed. We can’t
afford slow loading and tear down code!
Our solution for this problem (and most other problems too) is to use arrays!
There’s one big allocation containing all objects of the same type. Array index is a
surprisingly nice data handle. If I have an array of textures, I can simply ask texture at
index 4. Index is POD data. It’s trivial to copy around. I can pass it to worker threads
safely too. As long as that thread doesn’t have access to the array, it can do
nothing dangerous with the indices. This allows us to write culling and draw stream
generation tasks with no safety concerns. These threads simply take array indices
from one place and combine them to form draw calls. No access to the data arrays is
needed at all.
But there’s a critical flaw: An array index doesn’t guarantee object lifetime. We of
course reuse slots in our data arrays. Data could have died and slot reused…
To solve the problem, we need to replace our array indices with generational handles.
Let’s discuss what this means.
A pool is similar to our data array. It is a typed array of objects, but now it also has
an additional generation counter array. Generation counter tells how many times the
slot has been reused. Counter is increased when the current data in that slot is freed.
We also have a freelist in the pool. A freelist is simply a linear array containing the
indices of all free slots. It has stack semantics. When you allocate a new object,
you pop a free index from the top. When you delete an object you push the index of
the freed slot on top of the free list. These are both fast O(1) operations. When the
freelist runs out, we double the size of the pool. This is safe as nobody is allowed to
have direct pointer references to the data in the array. All references are done using
handles.
The handle is just a POD struct. It contains the array index just like in the previous
slide, but now we also have the generation counter next to it. This is in total 32 bits (for
example a 16+16 bit split) or 64 bits (32+32 bit), depending on how many simultaneously
active resources you require and how short the lifetimes of the objects are. In
HypeHype we use 32 bit (16+16) handles for all graphics resources. This is enough for
65536 resources of each particular type.
Pool offers a getter API, which takes the handle as a parameter. It reads pool’s
generation counter array at handle index and compares it to the handle’s generation
counter. If they match, you get the data. If they don’t you get a null pointer.
This results in weak reference semantics. It’s completely safe to use stale handles.
You just get a null pointer back. A null check is a predictable branch, and is almost
free on modern CPUs. The branch prediction fails once when the handle is deleted.
At that point you clean up yourself too. Weak references result in coding practice that
doesn’t require callbacks. Callbacks require buffering or mutexes to avoid race
conditions in multithreaded systems.
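The pool, freelist and generational handle described above can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not the actual HypeHype code: the names `Handle`, `Pool`, `allocate`, `release` and `get` are hypothetical, generations start at 1 so that a default-constructed handle is always stale, and 16 bit generation wrap-around is not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 32 bit POD handle: 16 bit slot index + 16 bit generation.
template <typename Tag>
struct Handle {
    uint16_t index = 0;
    uint16_t generation = 0;  // 0 never matches a live slot in this sketch
};

template <typename T>
class Pool {
public:
    using HandleT = Handle<T>;

    explicit Pool(uint32_t initialCapacity = 16) { grow(initialCapacity); }

    HandleT allocate(const T& value) {
        if (freeList_.empty()) grow(uint32_t(data_.size()) * 2);  // double when full
        uint16_t index = freeList_.back();                        // pop: O(1)
        freeList_.pop_back();
        data_[index] = value;
        return HandleT{index, generations_[index]};
    }

    void release(HandleT h) {
        if (!get(h)) return;            // stale handle: nothing to do
        ++generations_[h.index];        // invalidates every outstanding handle
        freeList_.push_back(h.index);   // push: O(1)
    }

    // Weak reference semantics: a stale handle yields nullptr.
    T* get(HandleT h) {
        if (h.index >= data_.size() || generations_[h.index] != h.generation)
            return nullptr;
        return &data_[h.index];
    }

private:
    void grow(uint32_t newCapacity) {
        uint32_t old = uint32_t(data_.size());
        assert(newCapacity > old && newCapacity <= 65536);
        data_.resize(newCapacity);
        generations_.resize(newCapacity, 1);
        // Push new slots in reverse so the lowest index is popped first.
        for (uint32_t i = newCapacity; i > old; --i)
            freeList_.push_back(uint16_t(i - 1));
    }

    std::vector<T> data_;
    std::vector<uint16_t> generations_;
    std::vector<uint16_t> freeList_;
};
```

Growing the pool may move the data array, which is safe here only because callers hold handles and re-fetch the pointer through `get` instead of keeping it.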
One of our main goals was to make the API as easy to use as DX11. This requires us
to bundle auxiliary data in our data structs in the pools. In Vulkan the VkImage
handle doesn’t know anything about itself, which is annoying if you try to write your
rendering code in pure Vulkan. We want our texture struct to know its size, format,
data pointer for writing, allocator for deleting it and so forth.
This auxiliary data is required for low frequency tasks such as modifying the resource
and deleting the resource. Since our design principle is to separate resource
modifications from drawing, we are accessing this data only when the resource is
modified or deleted. This means that putting the auxiliary data in the same struct as
the data required in the hot draw loop is not cache efficient. The draw loop will load
data to L1$ that is not used. I hate trade-offs between performance and usability.
Our solution for this problem is to use SoA layout inside the pools. We identify which
data is required every frame in the hot draw loops and put that data in one struct and
the remaining low frequency auxiliary data in another struct. The pool now has two
data arrays instead of one. We can use the same array index in the handle to access
either of the data arrays (or both). This way we only need to load the hot data to
caches in the performance critical draw loop. The auxiliary data struct is only loaded
at low frequency, solving our performance issue with L1$ cache utilization.
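A minimal sketch of the hot/cold split inside a pool. The struct fields (`gpuDescriptor`, `width`, `cpuWritePtr` and so on) and the `HotColdPool` name are hypothetical examples, and pool growth is omitted for brevity. The same handle index addresses both arrays, so the draw loop only pulls the hot array into cache.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Handle {
    uint16_t index = 0;
    uint16_t generation = 0;
};

// Hot data: the only part the draw loop reads (field is hypothetical).
struct TextureHot {
    uint64_t gpuDescriptor = 0;
};

// Cold data: needed only when modifying or deleting (fields are hypothetical).
struct TextureCold {
    uint32_t width = 0;
    uint32_t height = 0;
    uint32_t format = 0;
    void*    cpuWritePtr = nullptr;
};

template <typename Hot, typename Cold>
class HotColdPool {
public:
    explicit HotColdPool(uint16_t capacity)
        : hot_(capacity), cold_(capacity), generations_(capacity, 1) {
        for (uint16_t i = capacity; i > 0; --i)
            freeList_.push_back(uint16_t(i - 1));
    }

    Handle allocate(const Hot& hot, const Cold& cold) {
        assert(!freeList_.empty());  // growth omitted to keep the sketch short
        uint16_t index = freeList_.back();
        freeList_.pop_back();
        hot_[index] = hot;
        cold_[index] = cold;
        return Handle{index, generations_[index]};
    }

    void release(Handle h) {
        if (!isLive(h)) return;
        ++generations_[h.index];
        freeList_.push_back(h.index);
    }

    // Two separate arrays share one index: structure-of-arrays layout.
    Hot*  getHot(Handle h)  { return isLive(h) ? &hot_[h.index]  : nullptr; }
    Cold* getCold(Handle h) { return isLive(h) ? &cold_[h.index] : nullptr; }

private:
    bool isLive(Handle h) const {
        return h.index < generations_.size() &&
               generations_[h.index] == h.generation;
    }

    std::vector<Hot> hot_;
    std::vector<Cold> cold_;
    std::vector<uint16_t> generations_;
    std::vector<uint16_t> freeList_;
};
```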
Now we have a good way to store and refer to graphics resources. The next topic is
creating the resources.
Creating graphics resources in Vulkan and DX12 is cumbersome. You need to fill big
structs that contain other big structs. Some of these structs also contain pointers to
arrays of structs too. This makes it possible to shoot yourself in the foot with
temporary object life times.
The most common existing solution for this problem is using builder pattern for
resource descriptors: The builder object contains good default state for the descriptor.
It offers an API to mutate itself to set all the fields you want to change. Once you are
ready, you call the build function to get the final descriptor struct. This is easy to use, but
the codegen, especially in debug mode, is far from perfect. At HypeHype we use
debug mode a lot during development, so we want it to be fast too.
Our solution for this problem is to use C++20 designated struct initializers in
combination with C++11 struct aggregate initialization. These two features in
combination allow us to set default values to each struct in a trivial way. Look at the
code example box below. If you want to override one of these defaults, you use the
designated struct initializer syntax to override the values of named fields. The syntax
is super clean and codegen is perfect.
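A minimal sketch of this pattern, assuming a C++20 compiler. The `TextureDesc` struct, its fields and the `Format`/`Usage` enums are hypothetical, not the real HypeHype descriptors. Default member initializers supply the defaults, and designated initializers override only the named fields.

```cpp
#include <cassert>
#include <cstdint>

enum class Format : uint8_t { RGBA8_UNORM, BC7, ASTC_4x4 };
enum class Usage  : uint8_t { Sampled, RenderTarget };

// Every field has a sensible default, so callers name only what differs.
struct TextureDesc {
    const char* debugName = "";
    uint32_t    width     = 1;
    uint32_t    height    = 1;
    uint32_t    mipCount  = 1;
    Format      format    = Format::RGBA8_UNORM;
    Usage       usage     = Usage::Sampled;
};

inline TextureDesc makeAlbedoDesc() {
    // Designated initializers must follow the declaration order of the fields.
    return TextureDesc{ .width = 1024, .height = 1024, .format = Format::ASTC_4x4 };
}
```

The fields left out (`debugName`, `mipCount`, `usage`) keep their defaults, and no builder object or mutating call is involved.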
To solve the array data cleanly, we have to write our own span class. C++20 built-in
span class doesn’t support initializer lists, because initializer lists have very short life
time. They die immediately after the statement. It was too dangerous to allow putting
an initializer list inside a span in the generic case. However we use this only in a special
case, and we have a solution for it: C++ const && function parameter only accepts
temporary unnamed objects. C++ guarantees that temporary objects in function
parameter list live long enough to finish the function call. This gives us enough
guarantees to safely store initializer lists inside spans in our resource descriptor
structs.
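A minimal sketch of the span idea, assuming C++20. The `Span`, `TextureHandle`, `BindGroupDesc` and `createBindGroup` names are hypothetical. The span stores a pointer into the initializer list's backing array, which is only safe because the descriptor is accepted exclusively as a temporary (`const BindGroupDesc&&`) and is consumed before the call expression ends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Non-owning view that can be built directly from a braced list.
// Some compilers warn about storing this pointer; that is exactly the
// lifetime hazard that the rvalue-only parameter below guards against.
template <typename T>
class Span {
public:
    Span() = default;
    Span(std::initializer_list<T> list) : data_(list.begin()), size_(list.size()) {}

    const T* begin() const { return data_; }
    const T* end() const { return data_ + size_; }
    std::size_t size() const { return size_; }

private:
    const T* data_ = nullptr;
    std::size_t size_ = 0;
};

struct TextureHandle { uint32_t id = 0; };

struct BindGroupDesc {
    Span<TextureHandle> textures;
    uint32_t layoutIndex = 0;  // hypothetical field
};

// Accepts only temporaries: a named BindGroupDesc lvalue will not compile here,
// so a descriptor cannot outlive the initializer list it points into.
// For illustration the function returns the sum of the texture ids.
inline uint32_t createBindGroup(const BindGroupDesc&& desc) {
    uint32_t sum = 0;
    for (const TextureHandle& t : desc.textures) sum += t.id;
    return sum;
}
```

A call such as `createBindGroup({ .textures = {albedo, normal, properties} })` builds everything on the stack; the backing array lives until the end of the full call expression.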
And this is how it looks in practice.
Let’s start with the left side: First we are creating a vertex buffer and a texture. The
syntax here is nice and we are only declaring fields that differ from the struct default
values.
If you look at the bottom left, you see us declaring a material. This is a bind group.
The bind group has an array of textures: albedo, normal and properties. We are using
an initializer list here to provide the array. This makes the syntax super clean. And it’s
worth noting that this array doesn’t require any heap allocations. The initializer list and
the whole descriptor struct live on the stack. It is never copied. We just pass a
reference to it in the resource creation function call. This is as fast as raw DX12 or
raw Vulkan.
On the right side you see us initializing a more complex resource. This looks a bit
like JSON. We have named fields, arrays and fields and arrays inside each other with
proper indentation. This is much easier to write and read compared to raw DX12
and Vulkan. Yet still we pay no runtime cost. There’s still no memory allocations or
data copies. Everything is pure stack data.
Now that we have a good way to create and store resources, we need to allocate
GPU memory for them.
I prefer to use temporary memory whenever possible. Temporary memory doesn’t
fragment your memory pools and allocating it is as simple as adding a number to a
counter.
We use 128MB memory heaps in our bump allocator. The heaps are stored in a ring.
If the bump allocator reaches the tail, we allocate a new heap block. Once we reach
stable state, there’s no heap allocations happening at all. We create a platform
specific buffer handle for each GPU heap we create. This buffer handle maps the
whole heap. This way we don’t need to create platform specific buffer objects at
runtime. Our buffer struct simply contains a heap index and an offset. It’s super
efficient to construct them at runtime and pass them to the user.
As an extra optimization, we give user land code a concrete bump allocator object.
This has a function to allocate N bytes. This function inlines perfectly into the caller. It
simply increments a counter and then tests whether the counter is over the heap
block boundary. This check is a predictable branch. When the block runs out, we call
a virtual function in the gfx API to obtain the new temp allocator block. This happens
only once for 128MB of data, making it highly efficient.
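The fast path and the rare refill can be sketched roughly as follows. This is an illustration under assumptions: the names (`GfxDevice`, `acquireTempBlock`, `BumpAllocator`) are hypothetical, alignment handling is added for completeness, and `CountingDevice` is a stand-in that hands out block indices instead of real GPU heaps.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; a sketch of the pattern, not a real API.
constexpr uint32_t kDefaultBlockSize = 128u * 1024u * 1024u;  // 128MB

struct TempBlock {
    uint32_t heapIndex = 0;  // which heap in the ring
    uint32_t size = 0;       // block size in bytes
};

struct TempAllocation {
    uint32_t heapIndex;
    uint32_t offset;
};

// The slow path lives behind a virtual call into the gfx API.
struct GfxDevice {
    virtual ~GfxDevice() = default;
    virtual TempBlock acquireTempBlock() = 0;
};

class BumpAllocator {
public:
    explicit BumpAllocator(GfxDevice& device) : m_device(device) {}

    // alignment must be a power of two; bytes must fit in one block.
    inline TempAllocation allocate(uint32_t bytes, uint32_t alignment = 16) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        uint32_t start = (m_offset + alignment - 1) & ~(alignment - 1);
        if (uint64_t(start) + bytes > m_block.size) {  // rarely taken branch
            m_block = m_device.acquireTempBlock();     // virtual call, slow path
            assert(bytes <= m_block.size);
            start = 0;
        }
        m_offset = start + bytes;
        return {m_block.heapIndex, start};
    }

private:
    GfxDevice& m_device;
    TempBlock m_block{};    // size 0 forces a refill on the first allocation
    uint32_t m_offset = 0;
};

// Stand-in device: hands out consecutive heap indices and counts refills.
struct CountingDevice final : GfxDevice {
    explicit CountingDevice(uint32_t bytes = kDefaultBlockSize) : blockSize(bytes) {}
    TempBlock acquireTempBlock() override { return {blocksHandedOut++, blockSize}; }
    uint32_t blockSize;
    uint32_t blocksHandedOut = 0;
};
```

The returned heap index and offset pair is all the caller needs to bind the data, since each heap already has one platform buffer handle covering the whole block.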
Since WebGPU doesn’t yet have 100% coverage, we had to add WebGL2 support
during the project. We use the same temp allocator abstraction for WebGL2. User
land code doesn’t need to know whether the returned pointer is a CPU pointer or a
GPU pointer. In WebGL2 we use 8MB CPU side temp buffers and we copy these
buffers using a single glBufferSubData at the beginning of each render pass. This
amortizes the cost of data updates, and is a big performance win over calling
map/unmap per draw call.
We do persistent allocations only when needed, since persistent allocation is always
much slower than temporary.
I implemented a two-level segregated fit algorithm. This is an O(1) hard real-time
allocator. It uses a two-level bitfield and two lzcnt instructions to find the bin. Bin size
classes follow a floating point distribution. This guarantees that the overhead percentage is
always small, independent of the size class. The delete operation is similar to allocate,
but in addition you check the neighbor pointers on both sides and merge empty
memory regions. This is also O(1).
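The bin lookup can be sketched as below. This is a simplified illustration, not the open sourced allocator itself: it assumes a 3-bit mantissa (8 bins per power of two), uses hypothetical names, and uses portable loops where a real implementation would use bit scan instructions. With these assumptions neighboring bins are at most 12.5% apart, which is what keeps the overhead bounded for every size class.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMantissaBits = 3;  // 8 bins per power of two (assumption)
constexpr uint32_t kMantissaMask = (1u << kMantissaBits) - 1u;

// Portable bit scans. A real allocator would use lzcnt/tzcnt style intrinsics.
inline uint32_t highestSetBit(uint32_t v) { uint32_t i = 0; while (v >>= 1) ++i; return i; }
inline uint32_t lowestSetBit(uint32_t v) {
    assert(v != 0);
    uint32_t i = 0;
    while ((v & 1u) == 0) { v >>= 1; ++i; }
    return i;
}

// Map a size to a bin index laid out like a tiny float: exponent in the high
// bits, mantissa in the low bits. Free regions are stored rounded down,
// allocation requests are rounded up so that any region in the bin fits.
inline uint32_t sizeToBin(uint32_t size, bool roundUp) {
    if (size <= kMantissaMask) return size;  // small sizes get one bin each
    const uint32_t mantissaStartBit = highestSetBit(size) - kMantissaBits;
    const uint32_t exponent = mantissaStartBit + 1;
    uint32_t mantissa = (size >> mantissaStartBit) & kMantissaMask;
    const uint32_t truncated = size & ((1u << mantissaStartBit) - 1u);
    if (roundUp && truncated != 0) ++mantissa;      // may carry into the exponent
    return (exponent << kMantissaBits) + mantissa;
}

struct FreeBins {
    uint32_t topMask = 0;        // bit t: top bin t has at least one free leaf
    uint8_t  leafMasks[32] = {}; // bit l of leafMasks[t]: bin t*8+l is non-empty

    void markFree(uint32_t bin) {
        assert(bin < 256);
        topMask |= 1u << (bin >> kMantissaBits);
        leafMasks[bin >> kMantissaBits] |= uint8_t(1u << (bin & kMantissaMask));
    }

    // Lowest non-empty bin >= minBin, or -1 if none. Two bit scans at most.
    int32_t findBin(uint32_t minBin) const {
        assert(minBin < 256);
        uint32_t top = minBin >> kMantissaBits;
        const uint32_t leaf = minBin & kMantissaMask;
        uint32_t leaves = leafMasks[top] & (0xFFu << leaf);
        if (leaves == 0) {
            // Nothing big enough in this top bin: scan for the next top bin.
            const uint32_t tops = (top + 1 < 32) ? (topMask & (~0u << (top + 1))) : 0u;
            if (tops == 0) return -1;
            top = lowestSetBit(tops);
            leaves = leafMasks[top];
        }
        return int32_t((top << kMantissaBits) | lowestSetBit(leaves));
    }
};
```

The lookup cost does not depend on how many free regions exist, only on the two bitfield levels, which is where the O(1) bound comes from.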
We use the same allocator for both Vulkan and Metal 2.0 (placement heaps). I open
sourced the offset allocator. It can be used for sub-allocating GPU heaps or buffers,
and generally anything that requires a contiguous range of elements (and doesn’t
require CPU memory backing for embedded metadata).
One of the biggest differences between our design and other renderers is user land
bind groups (descriptor sets in Vulkan terminology).
The traditional way is to have separate bindings for each texture and buffer. Before
drawing you set all bindings separately. The gfx backend has to combine these
bindings into a shader-specific layout and create the respective bind groups (WebGPU),
descriptor sets (Vulkan), argument buffers (Metal) or descriptor tables (DX12). These
bind group objects are GPU objects and are expensive to create. IHVs recommend
precreating all GPU objects to avoid stalls and memory fragmentation issues.
The common workaround is to cache bind groups in a hash map in the backend. All
bindings are hashed and a lookup is made. If the bind group exists, then it is reused
instead of being created. The problem with this approach is that hashing is expensive and
hash map lookups randomize your memory access pattern. If you are rendering from
multiple threads, you might even need to protect your bind group hash with a mutex,
making it even more expensive.
Our solution is to bring bind groups directly to user land: the user creates immutable bind
groups ahead of time. For example, a material bind group contains 5 textures and one
uniform buffer (filled with value data). You get a handle, which you use to bind the
material.
Our draw call API exposes three bind group slots to the user land. Vulkan on Android
and WebGPU mandate a minimum of four bind group slots. The first three groups are
exposed directly to user land code, matching the GLSL set=X semantics. This is easy
for gfx programmers and tech artists to understand.
HypeHype’s higher level rendering code uses a convention to split data into bind groups
by binding frequency. The first group has render pass global bindings (sun light,
camera matrices, shadow maps, etc), the second slot has material bindings, the third slot
has shader specific bindings and the last slot is special.
We use the last slot in Vulkan and WebGPU for dynamic offset bound buffers. This is
important for bump allocated temporary data, such as uniform buffers. The Metal API
doesn’t have a similar offset update API for buffer bindings inside argument buffers. Instead we
use the Metal setBuffer API to set these dynamic buffers separately, and use the setOffset
API to change their offset. This provides an abstraction that uses the most efficient
code paths on all platform APIs.
Push constants are emulated on some mobile GPUs. It’s faster to bump allocate your
uniforms and just change the offset.
I already said that software command buffers are slow, yet we have one :)
This software command buffer is entirely different to the ones most people are familiar
with. We don’t have any data in the software command buffer. We only have
metadata pointing to already uploaded data. The metadata is also grouped, making it
much smaller than individual bindings and individual state. This allows us to represent
a draw call with only 64 bytes of data, which is just a single CPU cache line.
Our initial design was to use an array of draw structs. The draw struct contains
handles to the shader (this is a resolved PSO variant including all render state), 3
user land bind groups, dynamic buffers (for temp allocated offset bound data), index
and vertex buffers and some offsets. Offsets are needed because sub-allocating
resources is usually a big performance win.
This 64 byte struct is already pretty good, but I wanted to improve it further. I analyzed
the data and noticed that all fields are 32 bits. Optimized rendering uses sorted order
to minimize the costly PSO and render state transitions. When rendering binned
content we notice that most fields don’t change between draw calls. On average only
18 bytes change between the draws. We want to take advantage of this.
The idea is to store only the fields that change. This leads to a draw stream design.
We store a 32 bit bitmask in front of each draw call. This bit mask tells which fields in
the draw struct have changed.
It’s the responsibility of the user land code to write data according to the stream data
API contract. For this we have a user land draw stream writer class. It contains a single
draw struct describing the current state and a dirty mask. The draw stream writer
provides a function for setting each field in the struct. These functions check whether
the data value has changed. If it has, they set the corresponding dirty bit and write that
field to the stream. After writing all fields the user calls draw, which simply writes the
dirty bitmask in front of the data values.
The backend is simple: For each draw call it reads the dirty bitmask. Then it reads
one uint32 from the stream for each bit and calls the corresponding platform API call
to set that binding/state/value. The advantage of this design is that the backend
doesn’t need any state filtering. We have already done that in the user land code.
This is handy on platforms where secondary command buffers are not available or
are slow (some Qualcomm GPUs disable optimizations with secondary command
buffers). We can still generate the draw stream using multiple worker threads and offload
the state filtering cost there. The render thread is as fast as possible, which is a big
win since the platform API calls are slow on mobile devices. We also use roughly 3x
less bandwidth than full blown 64 byte structs: on average about 18 bytes of changed
fields plus the 4 byte bitmask per draw.
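A compact sketch of the writer and the backend loop described above. The field names and their order are hypothetical. As a simplification, this writer buffers the changed values in its state struct and emits the mask followed by the dirty fields when draw is called, and it marks every field dirty for the first draw so the backend starts from a fully defined state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical field layout: 16 x 32-bit fields = 64 bytes per full draw.
enum Field : uint32_t {
    Shader, BindGroup0, BindGroup1, BindGroup2, DynamicBuffer,
    IndexBuffer, VertexBuffer, IndexOffset, VertexOffset,
    DynamicOffset0, DynamicOffset1, InstanceOffset, InstanceCount,
    TriangleOffset, TriangleCount, Reserved,
    FieldCount
};
static_assert(FieldCount == 16, "draw struct is 64 bytes");

class DrawStreamWriter {
public:
    // State filtering happens here, in user land (possibly on a worker thread).
    void set(Field field, uint32_t value) {
        if (m_state[field] != value) {
            m_state[field] = value;
            m_dirty |= 1u << field;
        }
    }

    // Emits the dirty mask followed by the changed fields in bit order.
    void draw() {
        m_stream.push_back(m_dirty);
        for (uint32_t f = 0; f < FieldCount; ++f)
            if (m_dirty & (1u << f)) m_stream.push_back(m_state[f]);
        m_dirty = 0;
    }

    const std::vector<uint32_t>& stream() const { return m_stream; }

private:
    uint32_t m_state[FieldCount] = {};
    uint32_t m_dirty = (1u << FieldCount) - 1u;  // first draw writes everything
    std::vector<uint32_t> m_stream;
};

// Backend side: no state filtering, just one call per set bit, then the draw.
// setField and issueDraw stand in for the platform API calls.
template <typename SetFn, typename DrawFn>
void replay(const std::vector<uint32_t>& stream, SetFn&& setField, DrawFn&& issueDraw) {
    std::size_t i = 0;
    while (i < stream.size()) {
        const uint32_t mask = stream[i++];
        for (uint32_t f = 0; f < FieldCount; ++f) {
            if (mask & (1u << f)) {
                assert(i < stream.size());
                setField(static_cast<Field>(f), stream[i++]);
            }
        }
        issueDraw();
    }
}
```

In this sketch two consecutive draws that differ only in their index offset cost 17 words for the first draw and 2 words for the second, instead of 16 words each.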
Let’s talk about draw call performance.
This slide represents a quite traditional DX11 and OpenGL style draw loop. For each
draw call we call map/unmap and write uniforms separately. We also bind the vertex
buffer and the index buffer, and we bind our textures and buffers. Here I am simply
binding set 2 (material) and set 3 following the conventions we have.
In total this is 6 to 7 API calls per draw call. 6 calls when the material doesn’t change
and 7 otherwise. If we bin by material, then we can assume that the number is closer
to 6 than 7.
This is using the temp allocator to bump allocate uniforms (and other dynamic data).
Now we don’t need to call map/unmap per draw call. This reduces the API call count
to 4-5 per draw call.
Map/unmap are surprisingly expensive calls. Our old GLES backend was uploading
uniforms per draw call. The biggest difference in our new GLES3 backend (WebGL2)
was the lack of map/unmap per draw and this change alone got around 3x CPU
performance gain for us.
We didn’t implement per draw map/unmap in our new Vulkan backend (Vulkan
supports persistent mapping), so unfortunately I can’t show you Vulkan numbers
here.
The next optimization with big impact was packing meshes. We allocate big 128MB
heap blocks and have one platform buffer handle for each. This makes it easy for us
to sub-allocate meshes and simply change the base vertex and base index in each
draw call to change the mesh.
This way we get rid of two API calls: set vertex buffer and set index buffer. We are
down to 2-3 API calls per draw, which is very nice!
This optimization improved the CPU throughput on all devices. We saw the biggest gains
on desktop GPUs (close to 2x), but mobile GPUs also showed notable gains (30% to
40%).
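To illustrate the packed mesh idea, here is a small sketch that turns a mesh sub-allocation inside a shared heap buffer into indexed draw parameters. The struct and function names are hypothetical, and the sketch assumes each sub-allocation is aligned to the vertex stride and the index size, so that the byte offsets convert to whole element offsets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types. Byte offsets are relative to the shared heap buffer.
struct MeshAllocation {
    uint32_t vertexByteOffset;
    uint32_t indexByteOffset;
    uint32_t indexCount;
};

// Parameters in the shape most indexed draw calls take (for example the
// firstIndex and vertexOffset arguments of vkCmdDrawIndexed).
struct DrawArgs {
    uint32_t indexCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
};

inline DrawArgs makeDrawArgs(const MeshAllocation& mesh,
                             uint32_t vertexStride, uint32_t indexSize) {
    assert(vertexStride != 0 && indexSize != 0);
    // The sub-allocator must hand out offsets aligned to the element sizes.
    assert(mesh.vertexByteOffset % vertexStride == 0);
    assert(mesh.indexByteOffset % indexSize == 0);
    return {mesh.indexCount,
            mesh.indexByteOffset / indexSize,
            static_cast<int32_t>(mesh.vertexByteOffset / vertexStride)};
}
```

With this, changing the mesh only changes three integers in the draw call, and the vertex and index buffer bindings stay untouched.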
The last optimization I want to discuss is base instance.
Base instance drawing uses the same data layout as instancing. You use a tightly
packed array of draw data. On mobile, uniform buffers have a 16KB binding size
limit. The idea is to change the binding offset once per 16KB, amortizing the cost
of rebinding the temp allocator buffer with a different offset. This cuts our API call
count by 1 and we now have the optimal number of API calls: just the draw itself and the
possible material bind group change. The draw call has a base instance parameter,
which we change to point to a different slot in the uniform buffer data array.
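The offset arithmetic can be sketched as follows, assuming a fixed per-draw data size that divides the 16KB window evenly. The names are hypothetical. For example, with 64 bytes per draw (one 4x4 float matrix) a window holds 256 draws, so 10,000 draws need 40 rebinds instead of 10,000.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kWindowBytes = 16u * 1024u;  // mobile uniform binding limit

struct InstanceSlot {
    uint32_t bindingOffset;  // byte offset to bind the uniform buffer at
    uint32_t baseInstance;   // index of this draw's data inside the window
};

// Locate the data of draw number drawIndex in a tightly packed array.
inline InstanceSlot locate(uint32_t drawIndex, uint32_t bytesPerDraw) {
    assert(bytesPerDraw != 0 && kWindowBytes % bytesPerDraw == 0);
    const uint32_t drawsPerWindow = kWindowBytes / bytesPerDraw;
    return {(drawIndex / drawsPerWindow) * kWindowBytes,
            drawIndex % drawsPerWindow};
}

// Counts how often the binding offset has to change for a run of draws.
inline uint32_t countRebinds(uint32_t drawCount, uint32_t bytesPerDraw) {
    uint32_t rebinds = 0;
    uint32_t boundOffset = UINT32_MAX;  // nothing bound yet
    for (uint32_t i = 0; i < drawCount; ++i) {
        const InstanceSlot slot = locate(i, bytesPerDraw);
        if (slot.bindingOffset != boundOffset) {
            boundOffset = slot.bindingOffset;
            ++rebinds;
        }
    }
    return rebinds;
}
```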
So why not use instancing instead? Base instance results in better shader codegen
on many platforms. The reason is that instance ID is a dynamic offset. GPUs pack
multiple instances in the same vertex wave, meaning that all data indexed by instance
ID must use vector registers and vector loads. This is a lot of extra register bloat for
loading 4x4 matrices and similar. Base instance on the other hand is a static per-draw
offset. Every lane loads from the same location. This means that compilers can use scalar
code paths and/or fast constant buffer hardware.
In practice, however, we ran into various issues. While the base instance codegen is
perfect on PC, on mobile GPUs it’s a mixed bag. Some drivers simply don’t optimize
this properly. Also this feature has poor coverage. DX12 doesn’t support base
instance at all and WebGL and WebGPU also have no support. So I wouldn’t
recommend this optimization unless you are shipping only on desktop. It is not worth it
on mobile.
Let’s take a look at the performance numbers.
This is using a single render thread. Ten thousand actual draw calls without any
instancing tricks. Each draw call uses a unique mesh and a unique material. With
bind groups and packed meshes it’s fast to change the material and the mesh.
I didn’t have time to implement GPU-persistent scene data yet in HypeHype. These
numbers are with per-draw bump allocated uniforms, as described in the previous
slides.
We are targeting 10k draw calls because that’s what we managed to push 15 years
ago on Xbox 360 at 60 fps. And the results are impressive. Even the low end $99
Android phones are close to hitting 60 fps in this stress test. In a real kit bashed UGC
game scene we will have lots of repeated meshes and materials, allowing batching
and a reduction in gfx API call counts. We also intend to multithread the rendering.
On AMD’s modern integrated GPUs (found also in the Steam Deck and ROG Ally
handheld) our renderer can push 10k draws in less than one millisecond. When
multithreading is used, our renderer could push up to 1 million draw calls at 60 fps on
modern AMD and Nvidia GPUs.