Thursday, February 9, 2012

Chasing Shadows

Upping Your Shadow Map Performance

Shadow mapping is a well-known technique for rendering shadows in 3D scenes. With the rapid development of graphics hardware and the increasing geometric complexity of modern game scenes, shadow maps have become an essential tool for real-time rendering.

Though they're popular, shadow maps are notoriously hard to get working robustly. One of the biggest issues with shadow mapping is poor scalability, in terms of both quality and performance, with respect to the increasing complexity of real-world game scenes. In theory, we should need only one shadow map sample per pixel, and the shadow map generation step should be output sensitive, i.e., its running time should depend only on the number of objects that cast visible shadows, not on the total number of objects in the scene. In practice, however, everyone has their own bag of tricks to cope with shadow map resolution management and shadow map rendering issues.

There exists an ever-growing body of literature on different shadow mapping techniques, most of which focus solely on improving shadow quality (see Eisemann in References). Indeed, given a sufficiently high shadow map resolution, these methods are able to achieve shadows of very high quality. Hence, in this article our focus is on scalable shadow map performance.

We already have a good understanding of how to render large environments in a scalable fashion, at least without shadows. Occlusion culling helps us to get rid of all hidden objects, and we can apply level-of-detail techniques to the remaining visible ones in order to bound the geometric and shading complexity to an acceptable level. Thus, it is not uncommon to find out that the performance bottleneck shifts to shadow mapping (see Silvennoinen in References).

Shadow map performance can be characterized by two things: generation cost, and sampling and filtering cost. Sampling and filtering are essentially independent of the geometric complexity of the scene since they are texture space operators. On the other hand, shadow map generation, i.e., the rendering of shadow casters to the shadow map, can consume a big portion of the rendering budget, unless we take care to bound the number of rendered shadow casters somehow. This effect is potentially amplified when using cascaded shadow maps because some shadow casters could be rendered multiple times during the shadow map generation phase.

Background

In this article, our goal is to speed up shadow map generation. A naive solution to this problem would again use occlusion culling, this time from the light's point of view, to reduce the number of shadow casters rendered to the shadow map. However, as Bittner et al. observed (see Bittner in References), the effectiveness of this approach is completely dependent on the depth complexity of the light view.

Large outdoor environments combined with a global shadow casting light source such as the sun or the moon might not gain much from occlusion culling alone. In the worst case, we might end up rendering all the potential shadow casters contained in the intersection of the view frustum and the light frustum, even though most of the shadow casters do not actually contribute to the shadows in the final rendered image.

Bittner et al. observed that in addition to using occlusion culling from the light's point of view, it is essential to cull shadow casters that do not cast a visible shadow. The group first rendered the scene from the camera's point of view and used occlusion culling to identify all the visible shadow receivers. Second, they rendered the visible shadow receivers into a light space shadow receiver mask to mark, as seen from the light's point of view, the parts of the scene visible to the camera. Finally, they applied occlusion culling from the light's point of view--together with the shadow receiver mask--to identify the potential shadow casters that should be rendered to the shadow map.

Compared to the naive approach, the main overhead in this method comes from generating the shadow receiver mask, which is not guaranteed to be optimal in all cases. In the worst case, all visible shadow receivers are already in shadow, yet they are still rendered to the shadow receiver mask, because their visibility from the light's point of view is determined only after the mask is created. In addition, the method relies on an efficient hardware occlusion culling algorithm, which limits its applicability.

In this article we introduce a practical variant of the shadow caster culling algorithm based on the ideas of Bittner et al., with the aim of making the method a viable option in a wider array of scenarios. In particular, our method assumes only the availability of the main view depth buffer, and we will demonstrate a technique for generating the shadow mask directly from the depth buffer. Furthermore, the shadow mask generation step is independent of the geometric complexity of the scene and avoids the worst-case scenario of the method described above.

Shadow Masking

The first thing to do is identify all light space shadow map texels that will contribute to the shadow in the final image. A shadow map texel T will contribute to the final image if T gets sampled during the deferred lighting pass. We call the set of contributing shadow map texels a shadow mask. There is a direct connection between the main view depth buffer and the shadow mask since each visible pixel in the main view will generate shadow map lookups during the deferred lighting pass, and we will use this property to generate the shadow mask directly from the depth buffer.
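To make that connection concrete, the sketch below shows how a single main view depth buffer pixel maps to a shadow map texel. This is an illustration only, not code from the article: it assumes a row-vector matrix convention and hypothetical constants InvViewProj (camera clip space to world) and LightViewProj (world to light clip space).

float2 ShadowMapUVFromDepth(float2 ScreenUV, float Depth)
{
    // Reconstruct the world space point behind this depth buffer pixel.
    float4 ClipPos  = float4(ScreenUV * float2(2, -2) + float2(-1, 1), Depth, 1);
    float4 WorldPos = mul(ClipPos, InvViewProj);
    WorldPos /= WorldPos.w;

    // Project it into light space: this is the texel the deferred lighting
    // pass will sample, i.e., a texel that belongs to the shadow mask.
    float4 LightPos = mul(WorldPos, LightViewProj);
    LightPos /= LightPos.w;
    return LightPos.xy * float2(0.5, -0.5) + 0.5;
}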

Given a fully initialized shadow mask, the culling part is relatively simple: for each shadow caster, we rasterize the caster's light space AABB, and if the rasterized bounding box does not contain a contributing shadow map texel, we can safely cull the shadow caster. Otherwise, we have a potential shadow caster that should be rendered to the shadow map.

A straightforward method for generating the shadow mask from the main view depth buffer would be to reproject the depth buffer pixels to light space, by treating each depth buffer pixel as a world space point and rasterizing this point cloud to the shadow mask. However, this approach has two obvious drawbacks. First of all, the number of points generated from a full resolution depth buffer creates a non-trivial amount of work to be executed on the GPU. Second, we lose all information about the topology of the original geometry by using the point cloud approximation, which means that all bets are off when considering the connectivity of the projected point cloud. In particular, small holes or cracks are likely to appear in the shadow mask, which could lead to false occlusion and missing shadows. Point splatting could fix the light space topology, but it adds another burden to our already-overworked GPU.

We propose a scalable method for shadow mask generation by subdividing the depth buffer into a screen space tile grid. Given the screen space tile grid, we compute a world space bounding frustum for each tile based on the minimum and maximum depth values in the tile. Note that after this process each world space tile frustum contains all the world space points corresponding to the screen space depth buffer pixels that reside inside the screen space tile. Then, instead of rasterizing the point cloud resulting from the full resolution depth buffer we only need to render the frusta to the light space shadow mask buffer to obtain a conservative approximation of the shadow mask.
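As a minimal sketch of this construction (the names and the row-vector matrix convention are assumptions, not code from the article), the eight world space corners of one tile's bounding frustum fall out of unprojecting the tile's NDC corners at the tile's minimum and maximum depths:

void TileFrustumCorners(float2 NDCMin, float2 NDCMax, float MinZ, float MaxZ,
                        out float4 Corners[8])
{
    [unroll]
    for (int i = 0; i < 8; i++)
    {
        // Pick one of the four tile corners in NDC xy, paired with the
        // tile's min or max depth to form the near/far slab of the frustum.
        float3 NDC = float3((i & 1) ? NDCMax.x : NDCMin.x,
                            (i & 2) ? NDCMax.y : NDCMin.y,
                            (i & 4) ? MaxZ : MinZ);
        float4 P = mul(float4(NDC, 1), InvViewProj);
        Corners[i] = P / P.w; // world space corner
    }
}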

It turns out that min/max depth pyramids offer an efficient way to compute the world space tile frusta. Given the full resolution depth buffer, we compute a min/max depth pyramid by successively downsampling the original depth buffer until we obtain the grid resolution we want. We store the depth values in the R and G channels of a single texture.

In order to guarantee that we do not lose any information in the original depth buffer--and to make sure the tile grid matches the last mip level in the chain--we round up the depth buffer resolution to the next multiple of 2^N in each dimension for the lowest mip level (i.e., highest resolution) in the min/max pyramid, where N is the number of mip levels in the min/max pyramid, and mip level N-1, the last level in the chain, has the same resolution as the tile grid. Then, we bootstrap the min/max pyramid downsampling by upsampling the depth buffer to the lowest mip level of the min/max pyramid.
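As a worked example with hypothetical numbers: a 1280x720 depth buffer with N = 6 mip levels is padded up to 1280x768 (the next multiples of 2^6 = 64), so the last mip level is a 40x24 tile grid, with each tile covering 32x32 depth buffer pixels. The bootstrap pass itself might look like the following sketch, which converts the single-channel depth buffer into the two-channel min/max format consumed by Listing 1 (DepthTexture and PointClampSampler are assumed names):

float4 MinMaxBootstrap_PixelShader(in float4 Position : SV_Position,
                                   in float2 UV : TEXCOORD) : SV_Target
{
    float Z = DepthTexture.SampleLevel(PointClampSampler, UV, 0).x;
    // R = min depth, G = max depth; a far plane (sky) sample is invalidated
    // in the max channel, matching the downsampling filter in Listing 1.
    return float4(Z, Z < 1 ? Z : 0, 0, 0);
}

With clamp addressing, the padded border simply repeats the edge texels, which keeps the upsampled result conservative.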

Another subtle consideration during min/max pyramid construction is how to correctly downsample the maximum depth values. Suppose we have an outdoor scene consisting of terrain and a visible skybox at infinity with depth 1.0. If we simply took the maximum of the four samples from the lower mip level, there would almost surely be a set of bounding frusta spanning the whole terrain out to the horizon. The correct way to handle this case is to consider only maximum depth values less than one during the downsample operation (see Listing 1).

After we have obtained the min/max depth pyramid that corresponds to the screen space tile grid of the original depth buffer, the next step is to rasterize the frusta defined by the grid to the light space shadow mask. In our implementation, we chose to utilize the geometry shader stage and stream the emitted triangles from the geometry shader directly to the rasterizer stage, eliminating the need to explicitly compute the bounding geometry or read back the results to the CPU.

To feed the geometry shader we render an immutable point list, where each point corresponds to a single tile in the tile grid. During each geometry shader invocation we emit the frustum triangles by looking up the minimum and maximum depth values from the last mip level of the previously constructed min/max depth pyramid. In addition, we disable depth and color writes, and write out only to the stencil buffer associated with the shadow map, setting the stencil value to one for each passed fragment. The final shadow mask will then consist of all the shadow map texels with an associated stencil value of one.
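The following is a minimal sketch of such a geometry shader under stated assumptions: MinMaxPyramid, TileGridSize, LastMip, InvViewProj, and LightViewProj are hypothetical resources and constants, rasterizer culling is assumed to be disabled (so triangle winding is irrelevant for the stencil-only pass), and a row-vector matrix convention is used.

struct TileVertex { uint TileIndex : TILEINDEX; };
struct MaskVertex { float4 Position : SV_Position; };

// Index list covering the 12 triangles of a frustum (box) over corners 0..7.
static const uint BoxIndices[36] =
{
    0,1,2, 2,1,3,  4,6,5, 5,6,7,  0,4,1, 1,4,5,
    2,3,6, 6,3,7,  0,2,4, 4,2,6,  1,5,3, 3,5,7
};

[maxvertexcount(36)]
void ShadowMaskGS(point TileVertex Input[1],
                  inout TriangleStream<MaskVertex> Stream)
{
    uint2 Tile = uint2(Input[0].TileIndex % TileGridSize.x,
                       Input[0].TileIndex / TileGridSize.x);

    // Min/max depth for this tile from the last mip of the pyramid.
    float2 MinMax = MinMaxPyramid.Load(int3(Tile, LastMip)).xy;
    if (MinMax.y == 0)
        return; // the tile contains only sky; nothing to rasterize

    // Unproject the tile frustum corners to world space, then reproject
    // them into light clip space.
    MaskVertex Corners[8];
    [unroll]
    for (int i = 0; i < 8; i++)
    {
        float2 T = float2(Tile + uint2(i & 1, (i >> 1) & 1))
                   / float2(TileGridSize);
        float3 NDC = float3(T.x * 2 - 1, 1 - T.y * 2,
                            (i & 4) ? MinMax.y : MinMax.x);
        float4 W = mul(float4(NDC, 1), InvViewProj);
        W /= W.w;
        Corners[i].Position = mul(W, LightViewProj);
    }

    // Emit the frustum as 12 independent triangles.
    for (int t = 0; t < 36; t += 3)
    {
        Stream.Append(Corners[BoxIndices[t + 0]]);
        Stream.Append(Corners[BoxIndices[t + 1]]);
        Stream.Append(Corners[BoxIndices[t + 2]]);
        Stream.RestartStrip();
    }
}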

In addition to obtaining the binary shadow mask, we could also prime a conservative depth buffer for subsequent occlusion culling passes by rasterizing the backfacing triangles of the view frustum. This might be especially useful in cases where the light depth complexity is high.

Shadow Caster Culling

Now that we have a fully initialized shadow mask constructed from the original depth buffer, we can run an occlusion culling pass from the light's point of view using the shadow map as our depth render target, in a spirit similar to Bittner et al. Depending on the circumstances, however, we might not need a fully hierarchical occlusion culling pass to obtain significant performance gains in the shadow map generation step. In some cases, especially when dealing with low depth complexity from the light's point of view, it is more beneficial to avoid all forms of GPU read-backs in order to eliminate the synchronization issues--GPU starvation, CPU stalls, and added latency--usually associated with hardware occlusion queries.

In our implementation, we keep all the data on the GPU at all times, and choose to use predicated rendering for shadow caster culling. In particular, we issue predicate queries for each shadow caster candidate in the intersection of the light frustum and view frustum by rendering the shadow caster candidate's world space AABB and checking for intersection with the shadow mask. We disable both depth and stencil writes during this pass and set the stencil test for equality with one (i.e., the shadow mask stencil reference value). Then, in a second pass we issue predicated draw calls for each candidate and profit each time a predicate culls a shadow caster candidate.
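As a sketch of the predicate pass (shader and constant names are assumptions), the shaders themselves are trivial; the actual test is carried out by the stencil state and the occlusion predicate:

float4 PredicateVS(float3 AabbCorner : POSITION) : SV_Position
{
    // The candidate's world space AABB corners are projected into light
    // clip space; depth and stencil writes are disabled for this pass.
    return mul(float4(AabbCorner, 1), LightViewProj);
}

float4 PredicatePS() : SV_Target
{
    // Color writes are off; any fragment surviving the stencil EQUAL test
    // against the shadow mask value of one marks the candidate as a
    // potential shadow caster, so its predicated draw call will execute.
    return 0;
}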

Since the query geometry consists only of a few vertices, the predicate initialization pass is completely fill-bound on the GPU. As an additional optimization to reduce the fill cost of this process, we could conservatively downsample the shadow mask to a lower resolution for the predicate rendering pass.

Software Occlusion Systems

Software-based visibility systems have recently regained popularity, and several high-end game engines and AAA titles have adopted this approach. Regardless of whether the system is based on potentially visible sets (PVS), cells and portals, or a more straightforward software rasterizer with custom occluder geometry, it is possible to output a conservative depth buffer based on the visibility query results.

The good news is that as long as we have a conservative depth buffer available, we can compute the shadow mask using the min/max pyramid approach--as described above--by using a trivial minimum depth bound of zero. This operation and the subsequent shadow caster culling are thus perfectly suited for a software implementation.

Conclusions

Our shadow caster culling method is compatible with both hardware and software-based visibility systems. We assume only the availability of a (conservative) depth buffer and hence believe that the presented technique is easy to integrate into an existing rendering engine.

The shadow mask concept has applications beyond shadow caster culling. One potentially interesting direction of future work is to generalize cascaded shadow maps with the combination of hardware-supported sparse textures together with the shadow mask, aiming for a more flexible level-of-detail management for shadows. In particular, the shadow mask could be used to select which tiles in the sparse shadow texture need updating as well as aid in selecting the correct resolution for each tile.

References

Real-Time Shadows - Elmar Eisemann, Michael Schwarz, Ulf Assarsson and Michael Wimmer, CRC Press 2011

Occlusion Culling in Alan Wake - Ari Silvennoinen, Hiding Complexity, SIGGRAPH 2011 Talks

Shadow Caster Culling for Efficient Shadow Mapping - Jiri Bittner, Oliver Mattausch, Ari Silvennoinen and Michael Wimmer, Symposium on Interactive 3D Graphics and Games 2011

Listing 1:

Pixel shader code for the min/max pyramid downsampling pass.

float4 MinMaxDownsample_PixelShader(in float4 Position : SV_Position,
                                    in float2 UV : TEXCOORD) : SV_Target
{
    float4 Samples[4] = {
        Texture.SampleLevel(PointSampler, UV, 0),
        Texture.SampleLevel(PointSampler, UV, 0, int2(1, 0)),
        Texture.SampleLevel(PointSampler, UV, 0, int2(1, 1)),
        Texture.SampleLevel(PointSampler, UV, 0, int2(0, 1))
    };

    // Use only valid depth values in the downsampling filter: a max depth
    // of 1 marks a far plane (sky) sample and is dropped from the filter.
    for (int i = 0; i < 4; i++)
        Samples[i].y = Samples[i].y < 1 ? Samples[i].y : 0;

    float MinZ = min(min(Samples[0].x, Samples[1].x),
                     min(Samples[2].x, Samples[3].x));
    float MaxZ = max(max(Samples[0].y, Samples[1].y),
                     max(Samples[2].y, Samples[3].y));

    return float4(MinZ, MaxZ, 0, 0);
}

Source Citation: Silvennoinen, Ari. "Chasing Shadows." Game Developer, 1 Feb. 2012: 49.

