Been working on clustered rendering over the past few weekends.
So first of all, a quick refresher on what clustered rendering is and why we care.
Clustered rendering is a technique for optimizing shader performance when you have lots of lights in the scene.
If you look inside three.js, you will see that for each light we do a simple iteration inside the shader.
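Conceptually it boils down to something like this. I'm sketching it in WGSL for consistency with the rest of the post (the actual three.js code is GLSL and more involved), and the struct layout, names and falloff here are made up for illustration:

```wgsl
struct PointLight { position : vec3f, radius : f32, color : vec3f, intensity : f32 }

@group(0) @binding(0) var<storage, read> lights : array<PointLight>;

fn shadeAllLights(worldPos : vec3f, normal : vec3f, albedo : vec3f) -> vec3f {
  var result = vec3f(0.0);
  // Every light in the scene is evaluated for every shaded fragment.
  for (var i = 0u; i < arrayLength(&lights); i++) {
    let toLight = lights[i].position - worldPos;
    let dist = length(toLight);
    let atten = max(1.0 - dist / lights[i].radius, 0.0);
    result += albedo * lights[i].color * lights[i].intensity
              * max(dot(normal, toLight / dist), 0.0) * atten;
  }
  return result;
}
```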
What this means is that for each light in the scene, you have to do some computation for every pixel on the screen. And light calculations happen to be the most expensive part of the whole thing.
Say you have a scene of a street at night with 20 street lights. Your POV is at ground level: half of the street lights are far behind the camera, 5 more are behind a building, so really only 5 lights are contributing to what you see on screen.
In this situation, you would still run lighting calculations for those other 15 lights that are not contributing. You can extrapolate this and quickly see how the whole thing becomes a huge bottleneck.
To demonstrate this, I’m going to use Toji’s implementation.
If we have a scene with 1024 point lights and run this whole thing on an RTX 4090, the loop approach gets us ~50 FPS.
If we stare straight at the floor, we do a bit better: now we’re at ~80 FPS.
Now, if we use clustered rendering, we’re at around 240 FPS in both cases, which is my monitor’s limit.
You may think this example is contrived and proves little, as 50 FPS is already pretty good, but remember that we’re on an RTX 4090 here.
So clustered rendering is conceptually a very simple thing: you slice up your view frustum into parts called “clusters”, and do a bit of precomputation before each frame to bin lights into clusters. That is, for each cluster we record which lights overlap that cluster.
In our shader, we then simply look up which cluster a pixel belongs to, and only iterate over the lights that overlap that cluster. Pretty simple really, at least in concept.
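Here’s the clustered version of the same idea, again as a rough WGSL sketch. The buffer layout and names are illustrative assumptions, not the actual code from my renderer or Toji’s sample; the 32x18x48 grid and the 128-light-per-cluster limit are the values I use in the example below:

```wgsl
struct PointLight { position : vec3f, radius : f32, color : vec3f, intensity : f32 }

struct Cluster {
  lightCount : u32,
  lightIndices : array<u32, 128>,  // per-cluster limit, see the note at the end of the post
}

@group(0) @binding(0) var<storage, read> lights : array<PointLight>;
@group(0) @binding(1) var<storage, read> clusters : array<Cluster>;

// X/Y are uniform screen-space tiles, the Z slice comes from view-space depth on a log scale.
const GRID = vec3u(32u, 18u, 48u);

fn clusterIndexFor(fragCoord : vec2f, screenSize : vec2f, zSlice : u32) -> u32 {
  let tile = vec2u(fragCoord / screenSize * vec2f(GRID.xy));
  return tile.x + tile.y * GRID.x + zSlice * GRID.x * GRID.y;
}

fn shadeClustered(fragCoord : vec2f, screenSize : vec2f, zSlice : u32,
                  worldPos : vec3f, normal : vec3f, albedo : vec3f) -> vec3f {
  let c = clusterIndexFor(fragCoord, screenSize, zSlice);
  var result = vec3f(0.0);
  // Only the lights binned into this cluster, not every light in the scene.
  for (var i = 0u; i < clusters[c].lightCount; i++) {
    let light = lights[clusters[c].lightIndices[i]];
    let toLight = light.position - worldPos;
    let dist = length(toLight);
    let atten = max(1.0 - dist / light.radius, 0.0);
    result += albedo * light.color * light.intensity
              * max(dot(normal, toLight / dist), 0.0) * atten;
  }
  return result;
}
```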
So, now that we’re up to date on what clustered rendering is, back to the new part.
I’ve written a clustered renderer in the past, and I was pretty happy with it, except for the fact that binning was done on the CPU, which imposed pretty hard limits on the number of lights and how finely you can slice up your frustum. As I was working on a new WebGPU renderer, I had it in the back of my mind to port my existing work over but run the binning work on the GPU instead, so that’s pretty much all I did.
It turned out to be a lot more work than I imagined, as what’s fast on the CPU is not necessarily fast on the GPU, so I had to rewrite the whole thing pretty much from scratch. Toji’s code was actually quite helpful, even if I ended up doing things quite differently.
The 2 biggest differences from standard approaches:
- Using actual frusta instead of an AABB approximation for each cluster, resulting in much tighter bounds and far fewer false positives.
- Pre-culling. We start with, say, 1024 lights in the scene. We could bin them all, but 99.9% of the time some of the lights will not be visible at all, so by culling the initial list before binning we can typically reduce the number of lights to be binned quite drastically (a rough sketch of this pass follows below).
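Here’s roughly what I mean by pre-culling, sketched as a WGSL compute pass. The buffer names and layout are made up for illustration, and everything is assumed to be in view space: one thread per light, a sphere-vs-camera-frustum test, and an atomic append of the survivors into a compact list that the binning pass then consumes.

```wgsl
struct Light { position : vec3f, radius : f32 }

@group(0) @binding(0) var<storage, read> allLights : array<Light>;
@group(0) @binding(1) var<storage, read_write> visibleLightIndices : array<u32>;
@group(0) @binding(2) var<storage, read_write> visibleLightCount : atomic<u32>;
@group(0) @binding(3) var<uniform> frustumPlanes : array<vec4f, 6>;  // xyz = inward normal, w = distance

@compute @workgroup_size(64)
fn cullLights(@builtin(global_invocation_id) id : vec3u) {
  if (id.x >= arrayLength(&allLights)) { return; }
  let light = allLights[id.x];
  for (var p = 0u; p < 6u; p++) {
    // The light's bounding sphere is fully outside one plane -> it can't touch anything visible.
    if (dot(frustumPlanes[p].xyz, light.position) + frustumPlanes[p].w < -light.radius) {
      return;
    }
  }
  let slot = atomicAdd(&visibleLightCount, 1u);
  visibleLightIndices[slot] = id.x;
}
```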
I don’t really have numbers to back up my claims for the first point, so you’re just going to have to take my word for it. But you can convince yourself by imagining a frustum and how one would create an AABB around it: the AABB necessarily has a larger volume, and, being axis-aligned, it grows even larger when the frustum itself is not axis-aligned.
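To make the “actual frusta” point a bit more tangible, here’s what the binning test looks like when each cluster carries its own six planes instead of an AABB. Again, the layout and names are illustrative assumptions rather than my actual code, and light positions are assumed to be in the same space as the cluster planes:

```wgsl
struct Light { position : vec3f, radius : f32 }

// Six planes per cluster, normals pointing inward; xyz = normal, w = distance.
struct ClusterBounds { planes : array<vec4f, 6> }

struct ClusterLightList {
  lightCount : u32,
  lightIndices : array<u32, 128>,
}

@group(0) @binding(0) var<storage, read> visibleLights : array<Light>;
@group(0) @binding(1) var<storage, read> bounds : array<ClusterBounds>;
@group(0) @binding(2) var<storage, read_write> clusterLists : array<ClusterLightList>;

// One invocation per cluster, looping over the pre-culled lights.
@compute @workgroup_size(64)
fn binLights(@builtin(global_invocation_id) id : vec3u) {
  if (id.x >= arrayLength(&bounds)) { return; }
  var count = 0u;
  for (var l = 0u; l < arrayLength(&visibleLights); l++) {
    let light = visibleLights[l];
    var inside = true;
    for (var p = 0u; p < 6u; p++) {
      let plane = bounds[id.x].planes[p];
      // Sphere entirely outside any plane -> no overlap with this cluster.
      if (dot(plane.xyz, light.position) + plane.w < -light.radius) {
        inside = false;
        break;
      }
    }
    if (inside && count < 128u) {
      clusterLists[id.x].lightIndices[count] = l;
      count++;
    }
  }
  clusterLists[id.x].lightCount = count;
}
```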
I do have some pictures and numbers though, so fret not.
First, let’s take a test scene with 1025 lights:
That is, obviously, 1 directional light and 1024 randomly placed point lights of different colors.
We slice up our frustum into 32x18x48 clusters; here’s what that looks like along the Z axis:
X and Y are boring, as they just align with the screen.
Now, if we visualize each pixel’s cluster, hopefully it will make some sense:
Here’s the same thing, with the cluster resolution dropped to 4x4x48:
Okay, so that’s not very interesting. More interesting is the data that’s stored in the clusters, so let’s visualize that:
The scale goes up to 11 lights; here’s the scale:

So on average we’re shading about 3 lights per pixel here, instead of the original 1024.
And here’s the actual (perfect) contribution:
Now, for the numbers: shading scales incredibly well.
Light count | Shading time (clustered) | Binning time | Shading time without clusters (default) | Speedup (including binning) |
---|---|---|---|---|
16 | 302.08 µs | 21.50 µs | 369.66 µs | + 14.2% |
256 | 306.18 µs | 83.97 µs | 1.93 ms | + 394% |
1024 | 318.46 µs | 462.85 µs | 5.42 ms | + 594% |
16384 | 401.41 µs | 2.85 ms | 54.90 ms | + 1588% |
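NOTE: Speedup compares the default path against shading + binning combined, e.g. for 1024 lights: 5.42 ms / (318.46 µs + 462.85 µs) ≈ 6.9x, i.e. ~+594%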
NOTE: µs is a microsecond (1/1000 of a millisecond)
NOTE: All lights are point lights with a radius of 1; the Sponza dimensions are the default 29.7 x 12.5 x 18.3
Theoretically the system scales indefinitely, but realistically speaking, somewhere around 64 thousand lights in the scene is the point where further optimization would be required.
For those who are curious, I’m using a log2 scale for the Z slices, and the limit per cluster is set to 128, which realistically is plenty: if you have more than 128 lights overlapping a single small cluster of space, you’re probably doing something wrong in your scene. That said, the limits are adjustable.
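For completeness, the usual form of that log2 depth-to-slice mapping looks something like this; take it as a sketch, the exact constants in my renderer may differ:

```wgsl
// Map a view-space depth to one of numSlices Z slices on a log2 scale,
// so near clusters are thin and far clusters get progressively thicker.
fn zSlice(viewZ : f32, zNear : f32, zFar : f32, numSlices : f32) -> u32 {
  return u32(floor(log2(viewZ / zNear) / log2(zFar / zNear) * numSlices));
}
```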