Constant buffers are the most basic way to provide your data to Shaders. Even if you don't see them, they are all implicitly created and updated under the hood.
Even tho, knowing how to use them and organize them can lead to significant performance improvements.
So let's get started.
float4x4 tW : WORLD; float4x4 tVP : VIEWPROJECTION; float4x4 tWVP : WORLDVIEWPROJECTION; float4x4 tP : PROJECTION; float Alpha <float uimin=0.0; float uimax=1.0;> = 1; float4 cAmb <bool color=true;String uiname="Color";>; float4x4 tTex <string uiname="Texture Transform"; >; float4x4 tColor <string uiname="Color Transform";>;
This is a basic example (I only show the shader variables to keep things compact).
Since we don't explicitly manage our data, all of it will automatically be integrated into one of those Constant buffers for you, so this is equivalent to :
cbuffer cbGlobal : register( b0 ) { float4x4 tW : WORLD; float4x4 tVP : VIEWPROJECTION; float4x4 tWVP : WORLDVIEWPROJECTION; float4x4 tP : PROJECTION; float Alpha <float uimin=0.0; float uimax=1.0;> = 1; float4 cAmb <bool color=true;String uiname="Color";>; float4x4 tTex <string uiname="Texture Transform"; >; float4x4 tColor <string uiname="Color Transform";>; };
Now let's see what we have here:
Now what we can also notice:
But also (and that's the most important):
Ok so now what happens when we draw our objects (in this scenario):
Create a structure (in CPU) that is same size of our buffer (with some potential differences, see below)
Now, for each slice
Now the line that will attract our interest is
Here we have a buffer with a size of:
64(transform)* 6 + 16 (float4) + 4 (float) = 404 bytes
Now let's say we draw 512 objects, we need the to upload:
404*512 = 206848 bytes
In our little simple scenario that's not much, but start to think when you need to start loading more complex structures, skinning transforms...
So let's modify that a little:
So as we said, and since we are allowed to use several of those buffers, we now split all of those in that way:
cbuffer cbPerRender : register( b0 ) { float4x4 tVP : VIEWPROJECTION; float4x4 tP : PROJECTION; }; cbuffer cbPerObject : register (b1) { float4x4 tW : WORLD; float4x4 tWVP : WORLDVIEWPROJECTION; float Alpha <float uimin=0.0; float uimax=1.0;> = 1; float4 cAmb <bool color=true;String uiname="Color";>; float4x4 tTex <string uiname="Texture Transform"; >; float4x4 tColor <string uiname="Color Transform";>; };
Now shader node will react like that:
So that gives us a total of:
64 * 2 (once)
64 * 4 + 16 + 4 = 276 (per element)
For he same amount of calls, we end up with 141440 bytes uploaded instead.
Compared to the above version that's a decent gain.
Please note that also, normally the effects runtime will notice that cbPerRender is not used by pixel shader and in that case will no bind it (saves you an API call as well).
You remember when I shown you constant buffer size with all that primary school level calculations... well technically they are all wrong ;)
GPU wants Constant buffer elements to be aligned on a 16 bytes boundary (yes you know same as when we bitch about using SSE, so yes this is important and yes we want faster transforms ! ;)
So let's decompose our structure (i'll remove semantic /annotation for clarity)
cbuffer cbPerObject : register (b1) { float4x4 tW; //Offset = 0 float4x4 tWVP; //Offset = 64 float Alpha; //Offset = 128 float4 cAmb; //Offset = 132 float4x4 tTex; //Offset = 148 float4x4 tColor; //Offset = 212 }; Total size : 276
Actually it is not, let's look at cAmb, offset is 84, which is not a multiple of 16. So your cbuffer will instead be like:
cbuffer cbPerObject : register (b1) { float4x4 tW; //Offset = 0 float4x4 tWVP; //Offset = 64 float Alpha; //Offset = 128 float3 SomeRandomDataAdded ; //Offset = 132 float4 cAmb; //Offset = 144 float4x4 tTex; //Offset = 160 float4x4 tColor; //Offset = 224 }; Total size : 288
So technically we got 12 bytes added by runtime.
This is mostly done for the same reasons as when using SIMD (which I will not cover here), you need your reads to start as 16 bytes boundaries.
Another way to pack is :
cbuffer cbPerObject : register (b1) { float4x4 tW; float4x4 tWVP; float4 cAmb; float4x4 tTex; float4x4 tColor; float Alpha; };
Which will be... exactly same size (since Constant Buffer size also needs to be a multiple of 16).
But at least in this case our data is organized in a way that all elements got their start point which is not modified.
There's also a few other parts that would need to be covered, but let's say that they become more situational (for people interested about manual packing, google packoffset hlsl and you'll be presented with tons of info).
Now another important thing is that the shader node will also check if your input pin is spreaded or not.
If pin is not spreaded that part of the buffer will be only updated once (same as PROJECTION for example)
So now let's say tColor is always the same, so is cAmb (eg: always single spread count).
You can modify your buffers like this:
cbuffer cbPerRender : register( b0 ) { float4x4 tVP : VIEWPROJECTION; float4x4 tP : PROJECTION; float4x4 tColor; float4 cAmb; }; cbuffer cbPerObject : register (b1) { float4x4 tW : WORLD; float4x4 tWVP : WORLDVIEWPROJECTION; float4x4 tTex; float Alpha; };
Here since the shader node will know that tColor and cAmb are single spread, buffer will be uploaded only once.
BUT (and this is IMPORTANT):
If now you spread cAmb pin, cbPerRender will be updated EVERY frame again. So you go back to point 0.
So make sure that you always only combine elements which will have a single count.
Please note that in next major release(s) for dx11 nodes that fact will become even more important (and please bitch about .NET 4.5 framework support :)
tWVP will multiply your transform in CPU, using mul (tV,TVP) will do it in GPU.
Once big difference is when you do it in CPU it's only once per model, whereas mul(tW,TVP) is once per vertex.
So it's also situational (please note that pre transform hardly works in instancing scenario by the way, unless you do it in Compute Shader).
But generally I just keep 2 versions of the shader and use the fastest one (if i run into perf issues, if not, I don't care ;)
If you don't use color transform (or anything else, texture sampling probably being the most important), get rid of it, or do a version with/without it.
Operations at pixel level can add up very quickly, so if you don't need them get rid of them (and it's so simple to have 2 versions of shader with/without texture that it should be in your toolkit).
anonymous user login
~2d ago
~2d ago
~9d ago
~11d ago
~13d ago
~16d ago
~16d ago
~23d ago
~30d ago
~30d ago