Move scene-graph management to the GPU #96
TODO:
Buffer update methods:
Maybe transform update and cull could be split into …
Will add a small proof of concept to …
We should definitely use only node/object IDs (indices) to then indirectly read from GPU Buffers that contain actual per-node data. Rationale:
A Triangle and Triangle Cluster Culling pass could be added (page 32)
This is moving up in the ranks due to the need to access node and bone matrices for: …
See #303 for further discussion of GPU scene-graph management.
Everything here is hopelessly out of date and superseded by a new design in Nabla.
Outline
The idea is to calculate the common shader uniforms/inputs for all objects such as:
AbsoluteTransformViewProjection matrix (modelviewproj) for multiple passes/viewports
AbsoluteTransform matrix (model) for multiple passes/viewports
WorldNormal matrix
etc.
from the scene representation as a hierarchy (a scene-tree, though commonly misnamed a scene-graph).
By objects we mean meshbuffer instances (instances as in copies in the world, not hardware instancing, although hardware instances do count towards the total), and other scene nodes such as bones, etc.
The purpose is three-fold:
This should include view frustum and occlusion culling of the results.
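The per-level propagation this design relies on can be sketched as a CPU reference. The `Node`/`propagate` names and the level-range layout are assumptions of this sketch, not engine API:

```cpp
#include <array>
#include <utility>
#include <vector>

using Mat4 = std::array<float, 16>; // column-major 4x4

Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int c = 0; c < 4; ++c)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r[c*4 + row] += a[k*4 + row] * b[c*4 + k];
    return r;
}

Mat4 translate(float x, float y, float z) {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.f;
    m[12] = x; m[13] = y; m[14] = z;
    return m;
}

struct Node { int parent; Mat4 local; }; // parent == -1 for root nodes

// One "dispatch" per hierarchy level: roots copy their local transform,
// deeper levels read the parent's already-finished absolute transform,
// so there are no dependencies within one level.
std::vector<Mat4> propagate(const std::vector<Node>& nodes,
                            const std::vector<std::pair<int,int>>& levelRanges) {
    std::vector<Mat4> absolute(nodes.size());
    for (const auto& [begin, end] : levelRanges)   // level 0, 1, 2, ...
        for (int i = begin; i < end; ++i)
            absolute[i] = nodes[i].parent < 0
                ? nodes[i].local
                : mul(absolute[nodes[i].parent], nodes[i].local);
    return absolute;
}
```

On the GPU each level range would become one compute dispatch with a memory barrier between levels.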
Implementation
Each node should have a reference/pointer/handle to an allocated input range of LocalTransforms, for most nodes this will be a range of 1 matrix but for instanced nodes it will be a range appropriate for the reserved number of instances.
One compute shader should be dispatched once per node hierarchy level to scan the node hierarchy from top to bottom propagating the relative transformation matrices and calculating the view dependent transformations, with optional frustum and occlusion culling of two types:
Culled FAMILY flag will only be issued if the object is culled across all viewports tested.
The input to this compute shader shall be:
Viewport viewproj matrix list and viewport count
[optional: z culling] HiZ approximate depth buffers
[possibly implicit] SelfID to retrieve its own renderpass bitfield, instanceID, maxInstances, bounding box and transform plus user-defined data
Parent ID/Handle to retrieve parent transform and culling flag
Draw Type (none/unique/Instanced) + DrawID for instanced draws (to increment instance count)
The output of this compute shader:
DrawIndirect parameters output to multiple append output buffer streams (1 per separate Pipeline [VAO,Texture,Shader] and Renderpass combination)
DrawIndirect parameters for instanced meshes (needing possible compaction -- unlikely) without append
Global UBOs (one for non-instanced, one for instanced) with per-object attribute pointers
NOTE 1: Hierarchy level 0 contains only root nodes ergo we should really only launch this computation for levels 1 onwards
NOTE 2: There will be an extra hierarchy level for instanced nodes.
NOTE 3: Globally controlled bones (EBUM_CONTROL) should become flat root nodes despite having parents
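The FAMILY-flag rule above (culled only if culled across all tested viewports, or inherited from an already-culled parent) can be sketched as follows; the `CullResult` struct and function names are hypothetical, and the per-viewport visibility mask is assumed to come from the bounding-box tests:

```cpp
#include <cstdint>

// bit v of visibleMask set => node visible in viewport v
struct CullResult { uint32_t visibleMask; bool familyCulled; };

// FAMILY flag: culled in *all* tested viewports, or inherited from an
// already FAMILY-culled parent, so deeper levels can skip the subtree.
CullResult cullNode(uint32_t ownVisibleMask, uint32_t viewportCount,
                    const CullResult* parent) {
    if (parent && parent->familyCulled)
        return {0u, true}; // whole subtree is dead, don't even test
    const uint32_t allViewports =
        (viewportCount >= 32u) ? ~0u : ((1u << viewportCount) - 1u);
    const uint32_t mask = ownVisibleMask & allViewports;
    return {mask, mask == 0u}; // FAMILY only if culled everywhere
}
```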
Technical considerations
If the work in higher levels can make further work in lower levels useless, shouldn't we use dispatch indirect?
This can mess up and complicate multiple things, so it's best left as an end-stage optimization.
Because surplus static data such as material parameters, or data relevant only to some objects, such as light parameters, material parameters, etc. shouldn't be copied around needlessly into the output or indirect draw buffers.
Moreover, if the surplus data is not copied and just lives inside the input buffer, it will either cause memory explosion (everything must be an UBER-STRUCT with all parameters for everything) or introduce non-uniform-sized allocations, which defeat the point of implicit object IDs for the culling shader.
Finally, any surplus data would be duplicated if an object is included in multiple render-passes, especially if the buffer space were pre-allocated for all possible drawindirects (as if no culling).
Right now IrrlichtBAW has about 5 built-in passes (camera, shadow[stock irr], solid, transparent, effect) which it automagically resolves via looking up blending equations in the material (insane hackishness/spaghetti). BaW has 6 passes with explicit root nodes (one extra for water).
So a uint8_t with 8 bits would be acceptable, and 32 bits would be the max.
Blender has 20 layers.
We can still dynamically reduce the maximum render-pass count.
Better not to iterate over all 32 bits of the renderpass mask: that makes a loop of 32 iterations with 32 conditional stores for every object, and is particularly bad for objects active in the same number of renderpasses but different ones. Instead use `bitcount` and progressively `findLSB` to iterate only over the active passes, eliminating execution divergence and making the cost independent of the maximum render-pass count.
No, because end-level objects such as meshbuffers and instanced meshbuffers have no local transforms, so they just produce indirect draws and the only unique data for culling is their aabbox. Ergo at least two shaders are needed, `updateParent` and `updateChildless`; if culling and drawindirect preparation are to be decoupled then we need 3 (the last one would join the updated absolute transforms with the viewport and HiZ buffers to produce culled indirect draws).
You don't, not in the same frame.
I'd allocate the last few layers for dynamic usage and schedule light updates over several frames.
However we could provide an option to separate transformation update (without the FAMILY culling flag) from culling.
It would be tempting to allocate an address from a memory allocator so that the object attribute data storage does not reallocate over the object's lifetime.
However this would lead to fragmentation where the ranges of neighbouring addresses would belong to objects at vastly different hierarchy levels, which would require very scattered reads in the hierarchy transform shader.
Also we would have to walk the entire tree to collect "indices" (addresses) of the input data from the nodes to be processed.
Another solution would be to use a granular memory allocator like the InstancedMeshSceneNode uses for continuous memory layouts; there is an indirection involved, but it's a reverse indirection (link) from the data to the object. As long as all necessary data is contained in the input buffer, the compute shader would not need to do a dependent read.
We could flatten out the tree in memory and process sub-ranges, however this could over-constrain the memory allocation, so best approach would be to have a dynamically sub-allocated buffer region per hierarchy level.
It would be best to make sure that all children of a node are allocated within the same subrange (for performance reasons, could do with an std::sort on the parent handle).
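The suggested `std::sort` on the parent handle could look like this; the `NodeSlot` record is a hypothetical layout for this sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sorting by (level, parentHandle) keeps each hierarchy level contiguous
// and, within a level, makes all children of one parent neighbours, so the
// transform shader reads each parent matrix from a coherent address range.
struct NodeSlot { uint32_t level; uint32_t parentHandle; uint32_t nodeId; };

void layoutForLevelRanges(std::vector<NodeSlot>& slots) {
    std::sort(slots.begin(), slots.end(),
              [](const NodeSlot& a, const NodeSlot& b) {
                  if (a.level != b.level) return a.level < b.level;
                  return a.parentHandle < b.parentHandle;
              });
}
```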
Definitely not SoA for the final output, bad for performance.
One output for perObject dynamic data (modelviewproj, culled renderpasses flag, model, normal matrix, global object data handle etc.)
[if not decoupled] One output for drawIndirect which is tightly packed (AMD requires this for performance).
[optional pending on resolve of item 1] indirect dispatch parameters and children to update
GLSL only supports unsized arrays in the SSBO at the end of the declaration.
So our only three options are either …, or …, or ….
Benchmarking should provide the answer, however idea 3 counts as an "untyped" load and could see performance similar to TBOs instead of SSBOs.
EDIT: Actually SoA is a really bad idea for the output buffer, it gives good performance for the culling shader but bad for the actual draws which will access very small ranges of object-data per triangle batch.
Obviously the per-object input data buffer (local tforms) would need to be bound, it would be best not to rebind its particular level ranges at different offsets for performance reasons (unless per-region instead of whole object synchronisation primitives are available in Vulkan).
Another input to each stage is the previous iteration's output (parent xforms), so that could be rebound as a read-only range and the output bound as a write-only range; however, that would require binding and unbinding of buffer ranges, which needs benchmarking.
Suppose each object drawn requires a generous 3 matrices, 8 bindless handles (not sure if they can come from indirect reads), and some extra dynamic attributes which somehow rely on the culling results. This gives 256-512 bytes per object; the rest of the "uniform" data lives in a static buffer.
We can expect around 2GB of VRAM from a GPU, which gives a theoretical max of 4-8 million objects.
With legacy draw methods we can hope for 2-4 million drawcalls/sec, so for acceptable interactivity we get 30-130 thousand drawcalls per frame, and that's with the CPU doing nothing else.
This amount of draws would require a 15-75 MB buffer, which is acceptable.
With this system we just have to enable the drawing of just as many objects, we don't have to top it.
Even at the limit this could go up to 150-750MB but the only thing we'd be drawing would be a synthetic benchmark of billboards or other polycount<50 objects.
All of the above is assuming one renderpass, but we can get the exact numbers of possible draws for each renderpass so smaller buffers can be allocated for renderpasses with less max drawcalls.
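Spelling the budget arithmetic above out (the inputs are the estimates from this discussion, not measurements):

```cpp
#include <cstdint>

// Estimates from the discussion above: ~512 bytes of dynamic per-object
// data, 2 GiB of VRAM, 4 million legacy drawcalls/sec at 30 fps.
constexpr uint64_t vram           = 2ull << 30;                      // 2 GiB
constexpr uint64_t bytesPerObject = 512;                             // upper estimate
constexpr uint64_t maxObjects     = vram / bytesPerObject;           // ~4 million
constexpr uint64_t drawsPerSecond = 4'000'000;                       // optimistic CPU rate
constexpr uint64_t drawsPerFrame  = drawsPerSecond / 30;             // ~133 thousand
constexpr uint64_t perFrameBuffer = drawsPerFrame * bytesPerObject;  // ~65 MiB
```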
Definitely not compiled-in static; we should be able to reset it at runtime, so a hard max (idea 2 in item 9) is infeasible.
Definitely not going to do a CPU->GPU readback.
We already don't react (with a few exceptions) to local transformation changes after ISceneManager::OnAnimate(timeMs)
We can enable a shadow AbsoluteTransform cached variable and do duplicate CPU updates of just the ancestors for the node in question.
This is duplicate work but we rely on the number of explicitly accessed nodes to be small.
[OPTIMIZATION] Cut down branches or skip computing already computed ancestors
For levels `i` with a number of objects `N_i < K` (where K could be 8192), do the update on the CPU. After the first level with more than K objects, do the rest on the GPU.
We need to set `baseInstance` to achieve a DrawID and resign from using vertex attribute divisors for instancing (fetch data from an SSBO or TBO using `gl_InstanceID`). We need to clear the indirect draw buffer before every compute
OR
Clear the indirect draw buffer with a shader before every compute
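For reference, this is the standard GL indirect draw record, with `baseInstance` repurposed as a per-draw object ID as suggested above. The clear step is sketched on the CPU under the assumption that only `instanceCount` changes between frames:

```cpp
#include <cstdint>
#include <vector>

// Layout of GL_DRAW_INDIRECT_BUFFER entries for glDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    uint32_t count;         // index count
    uint32_t instanceCount; // incremented (atomically) per surviving instance
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;  // repurposed as a per-draw DrawID/object index
};

// Only instanceCount needs to return to 0 before the culling compute runs;
// the geometry fields and the DrawID can stay untouched.
void clearIndirect(std::vector<DrawElementsIndirectCommand>& cmds) {
    for (auto& c : cmds) c.instanceCount = 0u;
}
```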
It would make sense not to use atomic counters to append to the draw buffers, because plain SSBO atomics get you the same result.
Have a function in `IDummyTransformationSceneNode` which will fetch the pointer to the allocated matrix in the compute shader input back buffer. Treat as Usual (TM), just like any CPU animation or position/scale/rotation setting.
Memcpy the entire matrix of the physics simulated items into the compute shader output buffer while setting a "do not recompute" flag on the globally set objects.
We could actually re-upload the output buffer contents to both set globally controlled object's final transforms AND clear the output buffer at the same time.
Bonus points for a unified pipeline for globally controlled/transformed objects [and possibly objects whose absolute transform was calculated randomly] being turned into root nodes.
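A sketch of that combined upload (the `PerObjectOutput` layout and helper name are assumptions): a zero-initialised staging copy of the output buffer clears last frame's results, while the globally controlled entries carry their final transforms plus a do-not-recompute flag.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

struct PerObjectOutput {
    float    absoluteTransform[16];
    uint32_t doNotRecompute; // update shader leaves this object alone
    uint32_t padding[3];
};

// globallyControlled: (object index, final transform) pairs, e.g. from physics.
std::vector<PerObjectOutput> buildStagingUpload(
        std::size_t objectCount,
        const std::vector<std::pair<std::size_t, std::array<float,16>>>& globallyControlled) {
    std::vector<PerObjectOutput> staging(objectCount); // value-init: all zeroes
    for (const auto& [index, tform] : globallyControlled) {
        std::memcpy(staging[index].absoluteTransform, tform.data(), sizeof(float)*16);
        staging[index].doNotRecompute = 1u;
    }
    return staging; // one upload both clears and seeds the output buffer
}
```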
One solution would be to make these HiZ buffers into texture arrays and redirect into them; however, the HiZ buffers could all be different sizes, which would kind-of break this idea.
However HiZ buffers don't follow ordinary mip-mapping rules (strictly 4:1 downsample) so we could get away with that.
Another solution would be to sort the renderpasses such that the passes with HiZ are in the first N slots and the compute shader does a dynamically uniform loop over the first N bits, and then resorts to a scan.
Well they have to have some sort of max bounds in the first place so just add a render pass with the larger viewport bounds.
Then refine (more culling and transform) the resultant semi-filtered output.
Culling a bbox against the viewproj matrix is extremely fast however we can allow more than 6 culling planes for other viewports and cull almost as fast.
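A sketch of plane-set culling that naturally extends past 6 planes (the plane convention and names are this sketch's assumptions): a box is culled as soon as its positive ("p") vertex lies behind any plane.

```cpp
#include <array>
#include <vector>

struct Plane { float nx, ny, nz, d; }; // inside iff n.x*x + n.y*y + n.z*z + d >= 0

// Test only the AABB corner furthest along each plane normal; extra
// viewports just append their planes to the same list.
bool aabbOutside(const std::array<float,3>& mn, const std::array<float,3>& mx,
                 const std::vector<Plane>& planes) {
    for (const auto& p : planes) {
        const float px = p.nx >= 0.f ? mx[0] : mn[0];
        const float py = p.ny >= 0.f ? mx[1] : mn[1];
        const float pz = p.nz >= 0.f ? mx[2] : mn[2];
        if (p.nx*px + p.ny*py + p.nz*pz + p.d < 0.f)
            return true; // entirely behind one plane -> culled
    }
    return false; // conservative: may still report an occluded box visible
}
```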
P.S. It seems that separating update from culling is very logical.
P.P.S. Maybe not separating, but definitely have an option to output updated absolute world transforms to an extra buffer (plus whether to force position update on all objects regardless of culling).