Clustered Rendering on WebGPU

Been working on clustered rendering over the past few weekends.

So first of all, a quick refresher on what clustered rendering is and why we care.

Clustered rendering is a technique for optimizing shader performance when you have lots of lights in the scene.

If you look inside three.js, you will see that for each light we do a simple iteration inside the shader:
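
(Below is a simplified WGSL-style sketch of that loop rather than the actual three.js GLSL; the PointLight struct, shadePointLight and the bindings are made up for illustration.)

struct PointLight {
  position  : vec3<f32>,
  radius    : f32,
  color     : vec3<f32>,
  intensity : f32,
}

// Every light in the scene, uploaded once per frame.
@group(0) @binding(0) var<storage, read> lights : array<PointLight>;

fn shadePointLight(light : PointLight, worldPos : vec3<f32>, normal : vec3<f32>) -> vec3<f32> {
  // A simple attenuated N.L diffuse term, just enough to illustrate the cost.
  let toLight = light.position - worldPos;
  let dist    = length(toLight);
  let atten   = clamp(1.0 - dist / light.radius, 0.0, 1.0);
  let ndotl   = max(dot(normal, toLight / max(dist, 0.0001)), 0.0);
  return light.color * light.intensity * ndotl * atten * atten;
}

fn shadeAllLights(worldPos : vec3<f32>, normal : vec3<f32>) -> vec3<f32> {
  var result = vec3<f32>(0.0);
  // Every pixel pays for every light in the scene, visible or not.
  for (var i = 0u; i < arrayLength(&lights); i = i + 1u) {
    result += shadePointLight(lights[i], worldPos, normal);
  }
  return result;
}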

What this means is that for each light in the scene, you have to do some computation for each pixel on the screen. And light calculations happen to be the most expensive part of the whole thing.

Say you have a scene of a street at night with 20 street lights. Your POV is at the ground level, half of the street lights are far behind the camera, 5 more are behind a building, so really you only have 5 lights that are contributing to what you see on screen.

In this situation, you would still run lighting calculations for those other 15 lights that are not contributing. You can extrapolate this and quickly see how the whole thing becomes a huge bottleneck.
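
To put a rough number on it (assuming 1080p): 1920 x 1080 ≈ 2.07 million pixels, so with 1024 lights the naive loop is on the order of 2 billion per-light evaluations every single frame.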

To demonstrate this, I’m going to use Toji’s implementation.

If we have a scene with 1024 point lights and we run this whole thing on an RTX 4090, with the loop approach we get ~50 FPS.

If we stare straight at the floor, we do a bit better; now we’re at ~80 FPS.

Now, if we use clustered rendering:

we’re at around 240 FPS in both cases, which is my monitor’s limit.

You may think that this example is contrived and proves little, as 50 FPS is already pretty good, but remember that we’re on an RTX 4090 here.

So clustered rendering is a very simple thing conceptually: you just slice up your view frustum into parts called “clusters”, and do a bit of precomputation before each frame to bin lights to clusters. That is, for each cluster we record which lights overlap that cluster.

In our shader, we then simply look up which cluster a pixel belongs to, and only iterate over the lights that overlap that cluster. Pretty simple really, at least in concept.
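
Here’s a rough sketch of the lookup side, reusing shadePointLight and lights from the snippet above (the grid config, the clusterCounts / clusterLightIndices buffers and the 128-entry stride are my own illustration, not necessarily how a real implementation lays things out):

struct ClusterConfig {
  gridSize   : vec3<u32>, // e.g. 32 x 18 x 48
  screenSize : vec2<f32>,
  zNear      : f32,
  zFar       : f32,
}

@group(0) @binding(1) var<uniform> config : ClusterConfig;
// Written by the binning pass: how many lights overlap each cluster, and which ones.
@group(0) @binding(2) var<storage, read> clusterCounts       : array<u32>;
@group(0) @binding(3) var<storage, read> clusterLightIndices : array<u32>; // MAX_LIGHTS_PER_CLUSTER entries per cluster

const MAX_LIGHTS_PER_CLUSTER : u32 = 128u;

// Which cluster does a fragment fall into, given its screen position and (positive) view-space depth?
fn clusterIndex(fragCoord : vec2<f32>, viewZ : f32) -> u32 {
  // X and Y just follow the screen tiles.
  let tileF = fragCoord / config.screenSize * vec2<f32>(config.gridSize.xy);
  let tile  = min(vec2<u32>(tileF), config.gridSize.xy - vec2<u32>(1u));
  // Z slices are distributed logarithmically between the near and far planes.
  let zRatio = log2(viewZ / config.zNear) / log2(config.zFar / config.zNear);
  let slice  = u32(clamp(zRatio * f32(config.gridSize.z), 0.0, f32(config.gridSize.z - 1u)));
  return tile.x + tile.y * config.gridSize.x + slice * config.gridSize.x * config.gridSize.y;
}

fn shadeClustered(fragCoord : vec2<f32>, viewZ : f32, worldPos : vec3<f32>, normal : vec3<f32>) -> vec3<f32> {
  let cluster = clusterIndex(fragCoord, viewZ);
  let count   = clusterCounts[cluster];

  var result = vec3<f32>(0.0);
  // Only iterate over the lights that were binned to this cluster.
  for (var i = 0u; i < count; i = i + 1u) {
    let lightIndex = clusterLightIndices[cluster * MAX_LIGHTS_PER_CLUSTER + i];
    result += shadePointLight(lights[lightIndex], worldPos, normal);
  }
  return result;
}

The log2 Z slicing and the 128-light per-cluster limit in the sketch are the same ones I mention further down.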


So, now that we’re up to date on what clustered rendering is, back to the new part.

I’ve written a clustered renderer in the past, and I was pretty happy with it, except for the fact that binning was done on the CPU, which imposed pretty hard limits on the number of lights and on how finely you can slice up the frustum. As I was working on my new WebGPU renderer (Shade), I had it in the back of my mind to port my existing work over, but run the binning on the GPU instead, so that’s pretty much all I did.

Turned out to be a lot more work than I imagined, as what’s fast on the CPU is not necessarily fast on the GPU, so I had to rewrite the whole thing pretty much from scratch. Toji’s code was actually quite helpful, even if I ended up doing things quite differently.

The 2 biggest differences from standard approaches:

  1. Using actual frusta, instead of an AABB approximation for each cluster, resulting in much tighter bounds and far fewer false positives.
  2. Pre-culling. We start with, say, 1024 lights in the scene; we could bin them all, but 99.9% of the time some of the lights will not be visible at all, so by culling the initial list before binning we can typically reduce the number of lights to be binned quite drastically (see the sketch right after this list).
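
To illustrate the pre-culling bit, here is a minimal sketch of the idea as a compute pass, assuming one thread per light, the PointLight struct from the earlier sketch, and made-up buffer names: each light’s bounding sphere is tested against the camera’s six frustum planes, and survivors are appended to a compact list.

struct CullUniforms {
  // Camera frustum planes as (normal.xyz, distance), normals pointing inwards.
  planes     : array<vec4<f32>, 6>,
  lightCount : u32,
}

@group(0) @binding(0) var<uniform> cull : CullUniforms;
@group(0) @binding(1) var<storage, read> lights : array<PointLight>;

// Compacted list of the lights that survived culling, consumed by the binning pass.
@group(0) @binding(2) var<storage, read_write> visibleLights : array<u32>;
@group(0) @binding(3) var<storage, read_write> visibleCount  : atomic<u32>;

@compute @workgroup_size(64)
fn cullLights(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= cull.lightCount) {
    return;
  }

  let light = lights[i];
  // Sphere-vs-frustum: reject as soon as the bounding sphere is fully behind any plane.
  for (var p = 0u; p < 6u; p = p + 1u) {
    let plane = cull.planes[p];
    if (dot(plane.xyz, light.position) + plane.w < -light.radius) {
      return; // entirely outside the view frustum
    }
  }

  // Potentially visible: append the light's index to the compact list.
  let slot = atomicAdd(&visibleCount, 1u);
  visibleLights[slot] = i;
}

The binning pass then only has to consider the (usually much shorter) visibleLights list, and it does essentially the same sphere-vs-plane test again, but against each cluster’s own little frustum, which is what point 1 above is about.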

I don’t really have numbers to back up my claims for the first point, you’re just going to have to take my word for it. But you can convince yourself by imagining a frustum and how one would create an AABB around it: it should be obvious that the AABB has a larger volume, and, being axis-aligned, it grows even larger the less aligned the frustum is with the axes.

I do have some pictures and numbers though, so fret not.

First, let’s take a test scene with 1025 lights

That is, obviously, 1 directional light and 1024 randomly placed point lights of different colors.

We slice up our frustum into 32x18x48 clusters; here’s what that looks like along the Z axis:

X and Y are boring, as they just align with the screen.

Now, if we visualize each pixel’s cluster, hopefully it will make some sense

Here’s the same thing, with the cluster resolution dropped to 4x4x48:

Okay, so that’s not very interesting. More interesting is the data that’s stored in the clusters, so let’s visualize that:


The scale goes up to 11 lights, here’s the scale:

So on average we’re shading about 3 lights per pixel here, instead of the original 1024

And here’s the actual per-pixel contribution:

Now, for the numbers. Shading scales incredibly well:

Light count    Shading time    Binning      Without Clusters (default)    Speedup (including binning)
16             302.08 µs       21.50 µs     369.66 µs                     +14.2%
256            306.18 µs       83.97 µs     1.93 ms                       +394%
1024           318.46 µs       462.85 µs    5.42 ms                       +594%
16384          401.41 µs       2.85 ms      54.90 ms                      +1588%

NOTE: µs is a microsecond (1/1000 of a millisecond)
NOTE: All lights are point lights with a radius of 1; the Sponza dimensions are the default 29.7 x 12.5 x 18.3
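
A note on the speedup column: it’s the without-clusters time divided by the clustered time (shading + binning). For example, at 1024 lights: 5.42 ms / (318.46 µs + 462.85 µs) ≈ 6.9x, i.e. roughly +594%.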

Theoretically the system scales to infinity, but realistically speaking, about 64 thousand lights in the scene is the point where further optimization would be required.

For those who are curious, I’m using a log2 scale for the Z slices, and the per-cluster limit is set to 128 lights, which realistically is plenty: if you have more than 128 lights overlapping a single small cluster of space, you’re probably doing something wrong in your scene. That said, the limits are adjustable.
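
For a rough sense of the memory cost of that limit (back-of-the-envelope, assuming 32-bit light indices): 32x18x48 = 27,648 clusters, and 27,648 clusters x 128 indices x 4 bytes ≈ 14 MB for the index lists, plus a small per-cluster count buffer. That memory footprint is the main thing the limit trades off.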


Managed to connect the PIX profiler to Shade today, so I collected actual GPU timings for the first time. Previously I was using WebGPU’s timestamp-query, which has wonky timings for “security reasons”.

Here are the precise timings from PIX for the 1024-light scenario:
Culling - 7.84 µs
Binning - 193 µs
Shading - 274.37 µs

This is pretty much the first time I’ve been using a real GPU profiler, and I must say I’m really impressed. I’ve been creating a lot of tooling in Shade just to provide a window into what the GPU is doing, and this tool does it all and more out of the box. I wouldn’t say I regret writing the tools that I have, but I will definitely be relying on PIX more in the future.

That said, setting it up has been a real pain. I’ve tried it in the past only to fail to record anything; today I managed to obtain a recording, but only on Chrome Canary, and only with a very specific version of PIX (I tried 5 different versions, only one worked). And even then, many of the recording features don’t work, and recording only succeeds about 20% of the time. So a lot is left to be desired.

For posterity:


I’m on Windows 11 x64
PIX version is 2405.15.002-OneBranch_release
Chrome Canary is Version 137.0.7125.0 (Official Build) canary (64-bit)
Launch parameters are: --disable-gpu-sandbox --disable-gpu-watchdog --enable-dawn-features=emit_hlsl_debug_symbols,disable_symbol_renaming "http://localhost:5173/"

Tickboxes


When everything works (the 20% I mentioned), you get something like this:

Especially enlightening for me was wave occupancy on ray traced shadows:

The warps are only ~30% utilized on average, with the rest sitting idle. Another way to picture this would be to imagine something like this:

if (x) {
   // code 1
} else if (y) {
   // code 2
} else {
   // code 3
}

Where each thread hits a random branch, so on average only ~30% of the threads are running the same branch at any given time.

This is not too bad for an inline ray tracer, for somewhat obvious reasons. But the main takeaway for me here was that optimizing the memory access further is pointless, and any BVH improvements will have marginal gains, as the main issue is occupancy. I’m curious about that last bit at the end where we see a lot of blue; it indicates a lot of stragglers.

Imagine if you could reach close to 90% occupancy - suddenly ray tracing would be 3 times faster.

More relevant to this topic is shading. Here’s the shading occupancy:

Red is occupied, orange is idle and blue is free. Occupancy is ~40% throughout.

The reason seems to be register pressure,

which hovers around 93% at the peak, so it seems we can only schedule about 40% of the threads in a warp :person_shrugging:

Also, take the timings with a grain of salt; GPU timings are not super reliable because:

  1. The GPU has acceleration/parking strategies to limit power usage, which causes timings to drift
  2. Ambient temperature has an impact on the above, and I am not in a temperature-controlled environment (I assume you aren’t either)
  3. The GPU is doing other stuff, from drawing your other application windows to things like video decoding
  4. Modern GPUs run on virtual memory, so there’s paging involved
  5. The browser actually interacts with the driver for you, so there’s a fat layer of abstraction between WebGPU and the GPU
  6. The GPU scheduler provides no guarantees on reproducibility of scheduling. That is, your thread/warp allocation will vary pretty much every time, and some allocations might lead to better occupancy
  7. The GPU has caches

I’m sure I’m missing at least a few extra reasons in there; the point is, timings will vary. The only useful metrics are averages over relatively large sample sizes, considered relative to other metrics.


I’m guessing this is also deferred? Very cool writeup :slight_smile: Really interested in the PIX stuff too, I will be revisiting this at some point.

Yep, deferred. Shade is explicitly deferred. Except for the transparencies of course, but that’s the same everywhere right now.
