Options cost estimates
In order to help engineers and technical artists optimize their projects, o3de has a flexible concept of shader builds based on a tree of "virtual" variants that may or may not be physically present. "Physically present" means having a bytecode built with hardcoded values for the specific options of that variant. If a variant is not present, there is a dynamic fallback where the options remain variables, unpacked from a bitfield in an auto-generated fallback key (usually a uint4) packed in a constant buffer with the rest of the SRG constants. The root of the tree is the bytecode with all options as variables.
Here is the o3de document page about them: https://www.o3de.org/docs/atom-guide/dev-guide/shaders/azsl/shader-variant-options/
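As an illustration of the fallback mechanism, here is a minimal Python sketch (not the Atom runtime code) of how an option value could be read back from such a packed fallback key; the keyOffset/keySize fields match what azslc reports per option in the --options output shown further down.

# Minimal sketch, NOT the Atom runtime code: reading a shader option value
# back out of a packed fallback key, treated here as a single 128-bit integer
# (the uint4 mentioned above).
def unpack_option(fallback_key: int, key_offset: int, key_size: int) -> int:
    """Extract the option's raw value from the packed bitfield."""
    mask = (1 << key_size) - 1
    return (fallback_key >> key_offset) & mask

# Hypothetical example: an option stored at bit offset 0 over 2 bits
# (like o_opacity_mode in the JSON output further down), holding the value 2,
# i.e. the third entry of its values list (OpacityMode::Blended).
packed_key = 0b10
print(unpack_option(packed_key, key_offset=0, key_size=2))  # -> 2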
The Shader Management Console (SMC) is a tool that allows toying with the presence or absence of physical bytecodes depending on which options the engineer/designer deems important.
Here is the o3de wiki document user guide for SMC: https://github.com/o3de/o3de/wiki/%5BShader-Management-Console%5D-User-Guide
To help with this estimation, we can optionally use AMD RGA to count the number of registers used by a variant.
Secondly, AZSLc computes an estimate we can call the "impact cost" of an option, with the intention of providing a reasonable default order of priority for variant baking: hopefully, the variants whose bytecode bakes the most impactful options as constants get built first.
As a supporting example, let's take this extract from the enhanced PBR forward pass shader of o3de:
void ApplyDirectLighting( SurfaceData_EnhancedPBR surface, inout LightingData_BasePBR lightingData, float4 screenUv)
{
    if( IsDirectLightingEnabled() )
    {
        if (o_enableDirectionalLights)
        {
            ApplyDirectionalLights(surface, lightingData, screenUv);
        }
        if (o_enablePunctualLights)
        {
            ApplySimplePointLights(surface, lightingData);
            ApplySimpleSpotLights(surface, lightingData);
        }
        if (o_enableAreaLights)
        {
            ApplyPointLights(surface, lightingData);
            ApplyDiskLights(surface, lightingData);
            ApplyCapsuleLights(surface, lightingData);
            ApplyQuadLights(surface, lightingData);
            ApplyPolygonLights(surface, lightingData);
        }
    }
    else if(IsDebuggingEnabled_PLACEHOLDER() && GetRenderDebugViewMode() == RenderDebugViewMode::CascadeShadows)
    {
        if (o_enableDirectionalLights)
        {
            ApplyDirectionalLights(surface, lightingData, screenUv);
        }
    }
}
The options o_enableDirectionalLights, o_enablePunctualLights and o_enableAreaLights are protecting the execution of code blocks; therefore we can imagine that the cost of, say, the o_enablePunctualLights option is the cost of ApplySimplePointLights added to the cost of ApplySimpleSpotLights.
This is how AZSLc will estimate the "impact cost" of options.
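To make the idea concrete, here is a toy model of that summation in Python; the per-function costs below are made-up numbers, not azslc output:

# Toy model of the "impact cost" heuristic, with invented per-function costs.
# The real numbers come from azslc's static analysis, shown further down.
hypothetical_function_cost = {
    "ApplySimplePointLights": 120,
    "ApplySimpleSpotLights": 95,
}

# o_enablePunctualLights guards the block calling both functions,
# so its estimated impact is the sum of their costs.
print(sum(hypothetical_function_cost.values()))  # -> 215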
Let's analyze the practical results we have as of the latest PR for that feature: https://github.com/o3de/o3de-azslc/pull/85
First, we need to find the azslc executable and the .azslin file that is prepared by the asset processor.
Let's use the Everything application (the equivalent of Unix's locate).
I got a bunch of them because I version the binaries for regression testing. But in your case you want to find the one you just built from the relevant git branch. So it will look like o3de-azslc/build...
Then, about the input shader: first you'll need to have run a project (the Editor or the sample viewer for instance), because we need a cache of assets made by the AP. The AP has shader builders that execute complex preparations prior to the azslc invocation, like preprocessing, SRG header injections, etc. The result is a .azslin file. That is what azslc.exe can digest (unintuitively, not the source-controlled .azsl files from the git repo; they aren't ripe for compilation, since shader building is a long chain of tools).
Here is how I went about it: just use a bit of wildcard, sort by size to get the biggest monster, and pick the DX12, non-customz one, then shift+right click -> copy as path.
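If you'd rather script that search than use a GUI tool, a small Python walk over the asset cache does the job too (a sketch; the cache path is the one from the command below, adjust to your project):

# Sketch: list candidate .azslin files from the asset cache, biggest first,
# so the "biggest monster" DX12 variant is easy to spot.
from pathlib import Path

cache = Path(r"D:\o3de-atom-sampleviewer\Cache\pc")   # adjust to your project
candidates = cache.rglob("*forwardpass*dx12*.azslin")
for p in sorted(candidates, key=lambda p: p.stat().st_size, reverse=True):
    print(f"{p.stat().st_size:>10}  {p}")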
And here is the command line in my case:
$ "D:\o3de-azslc\build\win_x64\Release\azslc.exe" "D:\o3de-atom-sampleviewer\Cache\pc\materials\types\enhancedpbr_mainpipeline_forwardpass_enhancedlighting_dx12.azslin" --options
The result will look like:
{
   "ShaderOptions":[
      {
         "costImpact":7,
         "defaultValue":"",
         "keyOffset":0,
         "keySize":2,
         "kind":"user-defined",
         "name":"o_opacity_mode",
         "order":0,
         "range":false,
         "type":"OpacityMode",
         "values":[
            "OpacityMode::Opaque",
            "OpacityMode::Cutout",
            "OpacityMode::Blended",
            "OpacityMode::TintedTransparent"
         ]
To list only the costs from that output, we can muster a tiny Python filter such as this one:
import json
import sys

# read the whole "azslc --options" JSON document from stdin
f = json.loads("\n".join(sys.stdin.readlines()))
for so in f["ShaderOptions"]:
    print(so["name"], " ", so["costImpact"])
Now we can pipe the previous command into that python (simplifying paths for readability):
$ azslc.exe enhancedpbr_mainpipeline_forwardpass_enhancedlighting_dx12.azslin --options | py filter.py
result:
o_opacity_mode 7
o_specularF0_enableMultiScatterCompensation 2
o_enableShadows 2489
o_enableDirectionalLights 4192
o_enablePunctualLights 392
o_enableAreaLights 3881
o_enableSphereLights 736
o_enableSphereLightShadows 310
o_enableDiskLights 756
o_enableDiskLightShadows 284
o_enableCapsuleLights 479
o_enableQuadLightLTC 1864
o_enableQuadLightApprox 1843
o_enablePolygonLights 464
o_enableIBL 1
...
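Since the whole point of the impact cost is to suggest a baking priority, a slightly extended version of the filter (same assumptions about the JSON shape) can also sort the options by decreasing cost:

# Same input as before ("azslc --options" JSON on stdin), but sorted by
# decreasing costImpact to suggest a variant-baking priority order.
import json
import sys

doc = json.load(sys.stdin)
for so in sorted(doc["ShaderOptions"], key=lambda so: so["costImpact"], reverse=True):
    print(f'{so["costImpact"]:>6}  {so["name"]}')

Piped the same way as the previous filter, this puts o_enableDirectionalLights and o_enableAreaLights at the top of the list for this shader.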
Let's take a look at what's going on with o_enableIBL: why does it cost only 1?
When we search through the .azslin file, we find only 2 occurrences: the declaration, and this line:
OUT.m_normal.a = EncodeUnorm2BitFlags(o_enableIBL, o_specularF0_enableMultiScatterCompensation);
The IBL at this stage is only a GBuffer flag, therefore its associated cost is (at worst) its mere access from the constant buffer.
This is a specificity that optimizing engineers will have to keep in mind: shaders have their own compartmentalized job, so an option's cost doesn't necessarily reflect the whole cost of the feature it indicates, just its cost in that shader.
How about o_enableDirectionalLights with score 4192, versus o_enableAreaLights with score 3881? In the previous snippet of code we saw that o_enableDirectionalLights was covering a single call to ApplyDirectionalLights, while o_enableAreaLights covers 5 calls to the point, disk, capsule, quad and polygon evaluators, which ought to be more complex than the directional one alone.
Let's use the --verbose flag of azslc to get more data points.
CLI:
azslc.exe enhancedpbr_mainpipeline_forwardpass_enhancedlighting_dx12.azslin --options --verbose | grep 'ApplyDirectionalLights\|o_enableDirectionalLights' | grep -v 'seenat\|new'
The greps are there to filter out all the prior semantic analysis and symbol registration output.
Result:
214: var decl: o_enableDirectionalLights
10449: register func: /ApplyDirectionalLights full identity: /ApplyDirectionalLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR,?float4)
Analyzing /o_enableDirectionalLights
/ApplyDirectionalLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR,?float4) non-memoized. discovering cost
/ApplyDirectionalLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR,?float4) call score 2094 added
/ApplyDirectionalLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR,?float4) call score 2094 added
/o_enableDirectionalLights final cost 4192
"name" : "o_enableDirectionalLights",
The function covered by the option actually accounts for about half of the option's reported cost: 2094. It just turns out that in a second if block, a bit below the normal shader execution path, there is a debug else case that doubles the estimated cost by re-calling the Apply function.
So we can manually correct the estimate and assume that the "real" score is closer to 2094 than 4192, which makes the order of importance for baking variants: o_enableAreaLights > o_enableDirectionalLights > o_enablePunctualLights.
For science, let's compare it to the AreaLights block with this command:
$ azslc.exe enhancedpbr_mainpipeline_forwardpass_enhancedlighting_dx12.azslin --options --verbose | grep 'o_enableAreaLights\|ApplyPointLights\|ApplyDiskLights\|ApplyCapsuleLights\|ApplyQuadLights\|ApplyPolygonLights' | grep -v 'seenat\|new'
we get:
217: var decl: o_enableAreaLights
9608: register func: /ApplyPointLights full identity: /ApplyPointLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR)
9865: register func: /ApplyCapsuleLights full identity: /ApplyCapsuleLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR)
10783: register func: /ApplyDiskLights full identity: /ApplyDiskLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR)
11551: register func: /ApplyPolygonLights full identity: /ApplyPolygonLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR)
11845: register func: /ApplyQuadLights full identity: /ApplyQuadLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR)
Analyzing /o_enableAreaLights
/ApplyPointLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) non-memoized. discovering cost
/ApplyPointLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) call score 767 added
/ApplyDiskLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) non-memoized. discovering cost
/ApplyDiskLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) call score 787 added
/ApplyCapsuleLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) non-memoized. discovering cost
/ApplyCapsuleLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) call score 510 added
/ApplyQuadLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) non-memoized. discovering cost
/ApplyQuadLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) call score 1348 added
/ApplyPolygonLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) non-memoized. discovering cost
/ApplyPolygonLights(/SurfaceData_EnhancedPBR,/LightingData_BasePBR) call score 463 added
/o_enableAreaLights final cost 3881
"name" : "o_enableAreaLights",
767+787+510+1348+463 = 3875. The 6 remaining points up to the reported 3881 presumably come from the counting rules described below: the 5 call statements inside the guarded block and the option's own appearance each add 1 point.
How does it work?
The cost estimate is a static analysis, essentially a statement counter. It postulates approximations such as 1 statement = 1 cost point.
Then it lists all options. For each option, it lists all appearances of that option's identifier (called references, or seenats) throughout the code. Each appearance site counts for 1 cost point (since it's supposed to be roughly equivalent to 1 register read or 1 memory load).
And on top of that, it will judge whether the reference is linked to a sub-tree of code, meaning expressions with a dependent block: for, if, while, do, switch.
It can't distinguish false positives though, such as if (false && o_myOption) {code}: in that case the costof(code) will be added to the score of o_myOption.
To limit this problem, it is heuristically assumed that the deeper the expression, the less effect it has on the final result. So we restrict the depth at which the reference may appear in the block's controlling expression to about 6 AST nodes.
In other words, something like if (a && (b || (c + (o_myOption * 3) ? var1 : var2))) is just forfeited.
The exception to the "1 statement = 1 point" rule is for embedded function call expressions: when such calls exist, they are resolved and their call tree is fully evaluated for cost, recursively.
When intrinsic functions are encountered, an exceptional hardcoded cost is applied. Notably, CallShader and TraceRay have a cost of 100. Memory fetch functions such as Sample and Load have a cost of 10.
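Putting the rules above together, here is a hedged Python sketch of the kind of recursive walk described in this section. It is not the actual AZSLc implementation (which works on the AST of the .azslin source); the data shapes, names and exact accounting are illustrative only.

# Illustrative sketch of the impact-cost estimation described above, NOT the
# actual AZSLc code. Functions and option references are plain dicts standing
# in for AST nodes.
INTRINSIC_COSTS = {"CallShader": 100, "TraceRay": 100, "Sample": 10, "Load": 10}
MAX_REF_DEPTH = 6  # references buried deeper than this in an expression are forfeited

def function_cost(name, functions, memo):
    """1 statement = 1 point; embedded calls are resolved and costed recursively."""
    if name in memo:                      # memoized: each function analyzed once
        return memo[name]
    cost = 0
    for stmt in functions[name]["statements"]:
        cost += 1                         # the statement itself
        for callee in stmt.get("calls", ()):
            if callee in INTRINSIC_COSTS:          # hardcoded intrinsic costs
                cost += INTRINSIC_COSTS[callee]
            elif callee in functions:              # user function: recurse
                cost += function_cost(callee, functions, memo)
    memo[name] = cost
    return cost

def option_impact(option, references, functions):
    """Each reference costs 1 point; if it guards a block (if/for/while/do/switch)
    and is not nested too deeply, the guarded statements and calls are added."""
    memo, total = {}, 0
    for ref in references[option]:
        total += 1                        # the access itself (~1 register read)
        block = ref.get("guarded_block")
        if block and ref["depth"] <= MAX_REF_DEPTH:
            for callee in block["calls"]:
                total += 1                # the call statement
                total += function_cost(callee, functions, memo)
    return total

If the accounting works roughly like this sketch, o_enableDirectionalLights comes out as 2 references + 2 guarded call statements + 2 x 2094 = 4192, and o_enableAreaLights as 1 reference + 5 call statements + 3875 = 3881, which lines up with the verbose logs above.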