Wednesday, August 12, 2009

Optimizing the light far plane while PSSMing

I recently had a cool idea about Parallel Split Shadow Mapping (PSSM) and the light projection for the different frustum splits. By default, every frustum split is handled separately, either with uniform shadow mapping or with a special projection algorithm such as PSM or LiSPSM. This ensures that every split frustum is enclosed as tightly as possible, which maximizes the texel density of the shadow/depth map on that volume. But since we have several splits (I use four, encoded in an ARGB 32-bit floating-point texture), we also get a different light far plane for the orthographic projection of each split.

I have done a small and nasty sketch to visualize this issue and since my sketching skills are lim => 0.. eh.. you’ll get the point:

Keeping that construct (?) in mind, you may notice that the depth representation differs between the splits, since each one is “normalized” against its own far plane when using a linearized depth representation. However, the effect also occurs with the default non-linear (post-projection) depth representation. To illustrate this gizmo, see the attached picture:

Visualizing this by simply outputting the projected shadow map, that is, the depth values, on a 3D scene makes it clearer:

Indiana Jones told me some time ago, that a cross never marks an important spot, but a red mark does :)

So what we actually want is a depth representation that is constant over the whole view frustum and thus across all splits. My idea was to use the light far plane of the last frustum split for all the preceding splits as well. You can simply do this by computing/rendering the frustum splits in reverse order, which also saves the far plane computation for n-1 splits. Cool, eh?
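To make the reversed-order idea concrete, here is a minimal C++ sketch. The function names and the `computeFar` callback are hypothetical and stand in for whatever the engine uses to fit a light far plane to one split. It assumes the last split is the one that needs the largest far plane.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Walks the splits back to front. The light far plane is computed only once,
// for the last split, and then reused by every earlier split.
// computeFar(i) is a hypothetical callback returning the far plane that
// split i would need on its own.
template <typename ComputeFar>
std::vector<float> SharedLightFarPlanes(std::size_t splitCount, ComputeFar computeFar)
{
    assert(splitCount > 0);
    std::vector<float> farPlanes(splitCount);
    float sharedFar = 0.0f;
    for (std::size_t i = splitCount; i-- > 0; ) {
        if (i + 1 == splitCount)
            sharedFar = computeFar(i);  // the only far plane computation
        farPlanes[i] = sharedFar;
    }
    return farPlanes;
}

// Linear depth as it would be written into a shadow map channel.
float NormalizedDepth(float lightSpaceDepth, float farPlane)
{
    assert(farPlane > 0.0f);
    return lightSpaceDepth / farPlane;
}
```

With a shared far plane, the same light-space distance maps to the same stored depth value in every split, which is the property the blur relies on.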

But why do I do this? It’s because of blurring the depth map in some way. Most common blur algorithms, like the good ol’ Gaussian one, have the property that on a modern GPU they work on all components of a non-scalar type at the same time. But this causes the blurred shadow map to suck, because the same filter kernel is applied to all splits at once, resulting in a larger blur (in world space) in the distant splits. It’s a good idea to scale the filter kernel by the distance for each split, but a depth representation that differs between the splits is a problem too. So basically, we want a depth representation like this:

With the depth representation across the splits as nice as this, you don’t even notice the split borders anymore if you simply project the shadow map onto the scene (and if you scale your blur filter’s kernel size, which I don’t do at the moment):

And here is the respective shadow/depth map for the picture above:

Note that the alpha channel is not visible… for some reason you may guess.
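The per-split kernel scaling mentioned above could look roughly like the following C++ sketch, which models one row of the four-channel shadow map on the CPU. All names are hypothetical, a box filter stands in for the Gaussian, and `splitExtent` is an assumed world-space width covered by each split.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Texel = std::array<float, 4>;  // one depth value per split channel

// Shrinks the kernel for splits that cover more world space, so the blur
// has a similar world-space width in every split (rounded to whole texels).
std::array<int, 4> KernelRadiusPerSplit(int baseRadius, const std::array<float, 4>& splitExtent)
{
    std::array<int, 4> radius{};
    for (std::size_t c = 0; c < 4; ++c) {
        assert(splitExtent[c] > 0.0f);
        radius[c] = static_cast<int>(baseRadius * splitExtent[0] / splitExtent[c] + 0.5f);
    }
    return radius;
}

// 1D box blur over a row of texels with a separate radius per channel.
// Samples outside the row are clamped to the nearest edge texel.
std::vector<Texel> BlurRow(const std::vector<Texel>& row, const std::array<int, 4>& radius)
{
    const int n = static_cast<int>(row.size());
    std::vector<Texel> out(row.size());
    for (int x = 0; x < n; ++x) {
        for (std::size_t c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = -radius[c]; k <= radius[c]; ++k) {
                int s = x + k;
                if (s < 0) s = 0;
                if (s >= n) s = n - 1;
                sum += row[static_cast<std::size_t>(s)][c];
            }
            out[static_cast<std::size_t>(x)][c] = sum / static_cast<float>(2 * radius[c] + 1);
        }
    }
    return out;
}
```

On the GPU this would mean giving up the "one kernel for all four components" convenience, which is exactly the trade-off described above.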

Hope this helps :)

Wednesday, July 30, 2008

Selecting the correct frustum split

I have been working on my Parallel Split Shadow Mapping implementation for a while.. a while? Hm.. almost five weeks. Yesterday, I proved that there is still some room for optimization. My implementation renders four frustum splits into the ARGB channels of a single texture instead of using one shadow map per split. This saves three textures and thus a whole bunch of texture memory, but it makes switching between the channels a bit more complicated: selecting the correct split channel and matrices within the fragment shader became a mess.

This is what the splits actually look like:
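Once a split index is known, one common way to pick the matching channel without branching is a dot product with a one-hot mask. The C++ sketch below models that on the CPU. The names are hypothetical, and it assumes that split i is stored in component i, which may not match the actual channel order in the texture.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Float4 = std::array<float, 4>;

// Plain four-component dot product, as a shader would compute it.
float Dot(const Float4& a, const Float4& b)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}

// One-hot mask: 1 in the component of the requested split, 0 elsewhere.
Float4 ChannelMask(int splitIndex)
{
    assert(splitIndex >= 0 && splitIndex < 4);
    Float4 mask{};  // value-initialized to all zeros
    mask[static_cast<std::size_t>(splitIndex)] = 1.0f;
    return mask;
}

// Extracts the stored depth of one split from a packed shadow map texel.
float SelectSplitDepth(const Float4& texel, int splitIndex)
{
    return Dot(texel, ChannelMask(splitIndex));
}
```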

Where Black = 0, Red = 1, Yellow = 2, White = 3

Actually, selecting the proper split is very easy, or at least easy to solve. Generally, we need a function that satisfies the following equation:
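Written out, one plausible form of such a function looks like this. It assumes ascending split distances $d_1 < d_2 < d_3$ that mark where splits 1 to 3 begin, and a depth $z$ measured in the same space. Both are assumptions read off the shader code in this post.

```latex
\[
\operatorname{split}(z) =
\begin{cases}
3 & \text{if } z \ge d_3 \\
2 & \text{if } d_2 \le z < d_3 \\
1 & \text{if } d_1 \le z < d_2 \\
0 & \text{otherwise}
\end{cases}
\]
```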

As you can see, this function needs to perform at least three tests to output the proper index. But encoding this in HLSL is a bit more complicated when you want it optimized. My first approach was very stupid, but see for yourself:

half GetSplitByDepth(float fDepth)
{
	half nSplitID = 3;

	while( fDepth >= g_fSplitDistances[nSplitID] )
		nSplitID--;

	return nSplitID;
}

Note that asymmetric returns are not supported by my old GeForce 7800 GTX… I don’t even know if they are by newer ones, but it doesn’t matter anyway, because they would break the rules of well-structured programming. Then again, breaking the rules is a good thing when it ends up in a performance boost. But let me stop the dumb talk, here are the results of this method:

ps_3_0
def c1, 1, 0, -1, 0
dcl_texcoord1 v0.z
add r0, -c0.wzyx, v0.z
cmp r0, r0, c1.x, c1.y
mul r0.x, r0.y, r0.x
mul r0.x, r0.z, r0.x
mul r0.y, r0.w, r0.x
cmp_pp r0.x, -r0.x, c1.x, c1.y
cmp_pp oC0, -r0.y, r0.x, c1.z
// approximately 7 instruction slots used

ps_2_0
def c1, 1, 0, -1, 0
dcl t1.xyz
add r0.w, t1.z, -c0.w
cmp r0.y, r0.y, c1.x, c1.y
mul r0.x, r0.x, r0.y
add r0.y, t1.z, -c0.y
cmp r0.y, r0.y, c1.x, c1.y
mul r0.x, r0.x, r0.y
add r0.y, t1.z, -c0.x
cmp r0.y, r0.y, c1.x, c1.y
mul r0.y, r0.x, r0.y
cmp_pp r0.x, -r0.x, c1.x, c1.y
cmp_pp r0, -r0.y, r0.x, c1.z
mov_pp oC0, r0

// approximately 14 instruction slots used

So this is the crappiest solution. 14 instruction slots is probably the shittiest result possible. Let’s just forget this gimp and take a look at my second approach:

half GetSplitByDepth(float fDepth)
{
	// Default to the first split; the compiled output below falls back to 0.
	half nSplitID = 0;

	if( fDepth >= g_fSplitDistances[3] )
		nSplitID = 3;
	else if( fDepth >= g_fSplitDistances[2] )
		nSplitID = 2;
	else if( fDepth >= g_fSplitDistances[1] )
		nSplitID = 1;

	return nSplitID;
}

So this should be logically the same as approach number one, but you never know what the compiler does with it. Actually, it’s very different:

ps_3_0
def c1, 1, 0, 2, 3
dcl_texcoord1 v0.z
add r0.xyz, -c0.wzyw, v0.z
cmp_pp r0.z, r0.z, c1.x, c1.y
cmp_pp r0.y, r0.y, c1.z, r0.z
cmp_pp oC0, r0.x, c1.w, r0.y

// approximately 4 instruction slots used

ps_2_0
def c1, 1, 0, 2, 3
dcl t1.xyz
add r0.w, t1.z, -c0.y
cmp_pp r0.x, r0.w, c1.x, c1.y
add r0.y, t1.z, -c0.z
cmp_pp r0.x, r0.y, c1.z, r0.x
add r0.y, t1.z, -c0.w
cmp_pp r0, r0.y, c1.w, r0.x
mov_pp oC0, r0

// approximately 7 instruction slots used

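Four instructions on SM 3.0 and seven on SM 2.0. The SM 3.0 listing contains no actual branch instructions, though; the saving seems to come from folding the three subtractions into a single add with the .wzyw swizzle, which the ps_2_0 output does not use. So on the good ol’ vanilla SM 2.0.. it’s not perfect. But I was able to get it (IMHO) perfect: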

half GetSplitByDepth(float fDepth)
{
	float4 fTest = fDepth > g_fSplitDistances;
	return dot(fTest, fTest);
}

ps_3_0
def c1, 0, 1, 0, 0
dcl_texcoord1 v0.z
add r0, c0, -v0.z
cmp r0, r0, c1.x, c1.y
dp4_pp oC0, r0, r0

// approximately 3 instruction slots used

ps_2_0
def c1, 0, 1, 0, 0
dcl t1.xyz
add r0, -t1.z, c0
cmp r0, r0, c1.x, c1.y
dp4 r0, r0, r0
mov_pp oC0, r0

// approximately 4 instruction slots used
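To see what the compare-and-dot version computes, here is a small C++ model of both the cascaded version and the counting version. The function names are hypothetical. The count equals the number of split distances the depth exceeds, so it only matches the cascade when the distances are sorted ascending and component 0 is never exceeded; how the constant is actually filled is not shown in this post, so treat both layouts below as assumptions.

```cpp
#include <array>
#include <cassert>

using Float4 = std::array<float, 4>;

// CPU model of the cascaded if/else version (tests components 3, 2, 1).
int SplitByCascade(float depth, const Float4& d)
{
    if (depth >= d[3]) return 3;
    if (depth >= d[2]) return 2;
    if (depth >= d[1]) return 1;
    return 0;
}

// CPU model of the compare-and-dot version: every passed test yields 1,
// and dot(t, t) sums 1*1 for each of them.
int SplitByCount(float depth, const Float4& d)
{
    float sum = 0.0f;
    for (float di : d) {
        const float t = (depth > di) ? 1.0f : 0.0f;
        sum += t * t;
    }
    return static_cast<int>(sum);
}
```

With distances such as `{1e30, 10, 30, 90}` both functions agree for depths away from the exact boundaries. With `{0, 10, 30, 90}` the counting version returns one more than the cascade, because the first test always passes. Note also that the cascade uses `>=` while the counting version uses `>`, so the two can differ exactly on a split boundary.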

So this is THE solution, isn’t it? Think a bit about it :)