Move scene-graph management to the GPU #96
TODO:
Buffer update methods:
Maybe transform update and cull could be split into …
Will add a small proof of concept to …
We should definitely use only node/object IDs (indices) to then indirectly read from GPU Buffers that contain actual per-node data. Rationale:
A Triangle and Triangle Cluster Culling pass could be added (page 32)
This is moving up in the ranks due to the need to access node and bone matrices for: …
See #303 for further discussion of GPU scene-graph management.
Everything here is hopelessly out of date and superseded by a new design in Nabla.
Outline
The idea is to calculate the common shader uniforms/inputs for all objects such as:
AbsoluteTransformViewProjection matrix (modelviewproj) for multiple passes/viewports
AbsoluteTransform matrix (model) for multiple passes/viewports
WorldNormal matrix
etc.
from the scene representation as a hierarchy (a scene-tree, though commonly misnamed a scene-graph).
By objects we mean meshbuffer instances (instances as in copies in the world, not hardware instancing, although hardware instances do count towards the total), and other scene nodes such as bones, etc.
The purpose is three-fold:
This should include view frustum and occlusion culling of the results.
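The per-level propagation this design relies on can be sketched as a CPU reference. The `Node`/`propagate` names and the level-range layout are assumptions of this sketch, not engine API:

```cpp
#include <array>
#include <utility>
#include <vector>

using Mat4 = std::array<float, 16>; // column-major 4x4

Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int c = 0; c < 4; ++c)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r[c*4 + row] += a[k*4 + row] * b[c*4 + k];
    return r;
}

Mat4 translate(float x, float y, float z) {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.f;
    m[12] = x; m[13] = y; m[14] = z;
    return m;
}

struct Node { int parent; Mat4 local; }; // parent == -1 for root nodes

// One "dispatch" per hierarchy level: roots copy their local transform,
// deeper levels read the parent's already-finished absolute transform,
// so there are no dependencies within one level.
std::vector<Mat4> propagate(const std::vector<Node>& nodes,
                            const std::vector<std::pair<int,int>>& levelRanges) {
    std::vector<Mat4> absolute(nodes.size());
    for (const auto& [begin, end] : levelRanges)   // level 0, 1, 2, ...
        for (int i = begin; i < end; ++i)
            absolute[i] = nodes[i].parent < 0
                ? nodes[i].local
                : mul(absolute[nodes[i].parent], nodes[i].local);
    return absolute;
}
```

On the GPU each level range would become one compute dispatch with a memory barrier between levels.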
Implementation
Each node should have a reference/pointer/handle to an allocated input range of LocalTransforms, for most nodes this will be a range of 1 matrix but for instanced nodes it will be a range appropriate for the reserved number of instances.
One compute shader should be dispatched once per node hierarchy level to scan the node hierarchy from top to bottom propagating the relative transformation matrices and calculating the view dependent transformations, with optional frustum and occlusion culling of two types:
Culled FAMILY flag will only be issued if the object is culled across all viewports tested.
The input to this compute shader shall be:
Viewport viewproj matrix list and viewport count
[optional: z culling] HiZ approximate depth buffers
[possibly implicit] SelfID to retrieve its own renderpass bitfield, instanceID, maxInstances, bounding box and transform plus user-defined data
Parent ID/Handle to retrieve parent transform and culling flag
Draw Type (none/unique/Instanced) + DrawID for instanced draws (to increment instance count)
The output of this compute shader:
DrawIndirect parameters output to multiple append output buffer streams (1 per separate Pipeline [VAO,Texture,Shader] and Renderpass combination)
DrawIndirect parameters for instanced meshes (needing possible compaction -- unlikely) without append
Global UBOs (one for non-instanced, one for instanced) with per-object attribute pointers
NOTE 1: Hierarchy level 0 contains only root nodes ergo we should really only launch this computation for levels 1 onwards
NOTE 2: There will be an extra hierarchy level for instanced nodes.
NOTE 3: Globally controlled bones (EBUM_CONTROL) should become flat root nodes despite having parents
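The FAMILY-flag rule above (culled only if culled across all tested viewports, or inherited from an already-culled parent) can be sketched as follows; the `CullResult` struct and function names are hypothetical, and the per-viewport visibility mask is assumed to come from the bounding-box tests:

```cpp
#include <cstdint>

// bit v of visibleMask set => node visible in viewport v
struct CullResult { uint32_t visibleMask; bool familyCulled; };

// FAMILY flag: culled in *all* tested viewports, or inherited from an
// already FAMILY-culled parent, so deeper levels can skip the subtree.
CullResult cullNode(uint32_t ownVisibleMask, uint32_t viewportCount,
                    const CullResult* parent) {
    if (parent && parent->familyCulled)
        return {0u, true}; // whole subtree is dead, don't even test
    const uint32_t allViewports =
        (viewportCount >= 32u) ? ~0u : ((1u << viewportCount) - 1u);
    const uint32_t mask = ownVisibleMask & allViewports;
    return {mask, mask == 0u}; // FAMILY only if culled everywhere
}
```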
Technical considerations
If the work in higher levels can make further work in lower levels useless, shouldn't we use dispatch indirect?
This can mess up and complicate multiple things, so it's best left as an end-stage optimization.
Because surplus static data such as material parameters, or data relevant only to some objects, such as light parameters, material parameters, etc. shouldn't be copied around needlessly into the output or indirect draw buffers.
Moreover, if the surplus data is not copied and just lives inside the input buffer, it will either cause memory explosion (everything must be an UBER-STRUCT with all parameters for everything) or introduce non-uniform-sized allocations, which defeat the point of implicit object IDs for the culling shader.
Finally, any surplus data would be duplicated if an object is included in multiple render-passes, especially if the buffer space were pre-allocated for all possible drawindirects (as if no culling).
Right now IrrlichtBAW has about 5 built-in passes (camera, shadow[stock irr], solid, transparent, effect) which it automagically resolves via looking up blending equations in the material (insane hackishness/spaghetti). BaW has 6 passes with explicit root nodes (one extra for water).
So a uint8_t with 8 bits would be acceptable, and 32 bits would be the max.
Blender has 20 layers.
We can still dynamically reduce the maximum render-pass count.
Better not to iterate over all 32 bits of the renderpass mask: that makes a loop of 32 iterations with 32 conditional stores for every object, and is particularly bad for objects active in the same number of renderpasses but different ones. Instead use `bitcount` and progressively `findLSB` to iterate only over the active passes, eliminating execution divergence and making the cost independent of the maximum render-pass count.
No, because end-level objects such as meshbuffers and instanced meshbuffers have no local transforms, so they just produce indirect draws and the only unique data for culling is their aabbox. Ergo at least two shaders are needed, `updateParent` and `updateChildless`; if culling and drawindirect preparation are to be decoupled then we need 3 (the last one would join the updated absolute transforms with the viewport and HiZ buffers to produce culled indirect draws).
You don't, not in the same frame.
I'd allocate the last few layers for dynamic usage and schedule light updates over several frames.
However we could provide an option to separate transformation update (without the FAMILY culling flag) from culling.
It would be tempting to allocate an address from a memory allocator so that the object attribute data storage does not reallocate over the object's lifetime.
However this would lead to fragmentation where the ranges of neighbouring addresses would belong to objects at vastly different hierarchy levels, which would require very scattered reads in the hierarchy transform shader.
Also we would have to walk the entire tree to collect "indices" (addresses) of the input data from the nodes to be processed.
Another solution would be to use a granular memory allocator like the InstancedMeshSceneNode uses for continuous memory layouts; there is an indirection involved, but it's a reverse indirection (link) from the data to the object. As long as all necessary data is contained in the input buffer, the compute shader would not need to do a dependent read.
We could flatten out the tree in memory and process sub-ranges, however this could over-constrain the memory allocation, so best approach would be to have a dynamically sub-allocated buffer region per hierarchy level.
It would be best to make sure that all children of a node are allocated within the same subrange (for performance reasons, could do with an std::sort on the parent handle).
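The suggested `std::sort` on the parent handle could look like this; the `NodeSlot` record is a hypothetical layout for this sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sorting by (level, parentHandle) keeps each hierarchy level contiguous
// and, within a level, makes all children of one parent neighbours, so the
// transform shader reads each parent matrix from a coherent address range.
struct NodeSlot { uint32_t level; uint32_t parentHandle; uint32_t nodeId; };

void layoutForLevelRanges(std::vector<NodeSlot>& slots) {
    std::sort(slots.begin(), slots.end(),
              [](const NodeSlot& a, const NodeSlot& b) {
                  if (a.level != b.level) return a.level < b.level;
                  return a.parentHandle < b.parentHandle;
              });
}
```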
Definitely not SoA for the final output, bad for performance.
One output for perObject dynamic data (modelviewproj, culled renderpasses flag, model, normal matrix, global object data handle etc.)
[if not decoupled] One output for drawIndirect which is tightly packed (AMD requires this for performance).
[optional pending on resolve of item 1] indirect dispatch parameters and children to update
GLSL only supports unsized arrays in the SSBO at the end of the declaration.
So our only three options are either …, or …, or ….
Benchmarking should provide the answer, however idea 3 counts as an "untyped" load and could see performance similar to TBOs instead of SSBOs.
EDIT: Actually SoA is a really bad idea for the output buffer, it gives good performance for the culling shader but bad for the actual draws which will access very small ranges of object-data per triangle batch.
Obviously the per-object input data buffer (local tforms) would need to be bound, it would be best not to rebind its particular level ranges at different offsets for performance reasons (unless per-region instead of whole object synchronisation primitives are available in Vulkan).
Another input to each stage is the previous iteration's output (parent xforms), so that could be rebound as a read-only range and the output bound as a write-only range; however, that would require binding and unbinding of buffer ranges, which needs benchmarking.
Suppose each object drawn requires a generous 3 matrices, 8 bindless handles (not sure if they can come from indirect reads), and some extra dynamic attributes which somehow rely on the culling results. This gives 256-512 bytes per object; the rest of the "uniform" data lives in a static buffer.
We can expect around 2GB of VRAM from a GPU, which gives a theoretical max of 4-8 million objects.
With legacy draw methods we can hope for 2-4 million drawcalls/sec, so for acceptable interactivity we get 30-130 thousand drawcalls per frame, and that's with the CPU doing nothing else.
This amount of draws would require a 15-75 MB buffer, which is acceptable.
With this system we just have to enable the drawing of just as many objects, we don't have to top it.
Even at the limit this could go up to 150-750MB but the only thing we'd be drawing would be a synthetic benchmark of billboards or other polycount<50 objects.
All of the above is assuming one renderpass, but we can get the exact numbers of possible draws for each renderpass so smaller buffers can be allocated for renderpasses with less max drawcalls.
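Spelling the budget arithmetic above out (the inputs are the estimates from this discussion, not measurements):

```cpp
#include <cstdint>

// Estimates from the discussion above: ~512 bytes of dynamic per-object
// data, 2 GiB of VRAM, 4 million legacy drawcalls/sec at 30 fps.
constexpr uint64_t vram           = 2ull << 30;                      // 2 GiB
constexpr uint64_t bytesPerObject = 512;                             // upper estimate
constexpr uint64_t maxObjects     = vram / bytesPerObject;           // ~4 million
constexpr uint64_t drawsPerSecond = 4'000'000;                       // optimistic CPU rate
constexpr uint64_t drawsPerFrame  = drawsPerSecond / 30;             // ~133 thousand
constexpr uint64_t perFrameBuffer = drawsPerFrame * bytesPerObject;  // ~65 MiB
```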
Definitely not compiled-in static; we should be able to reset it at runtime, so a hard max (idea 2 in item 9) is infeasible.
Definitely not going to do a CPU->GPU readback.
We already don't react (with a few exceptions) to local transformation changes after ISceneManager::OnAnimate(timeMs)
We can enable a shadow AbsoluteTransform cached variable and do duplicate CPU updates of just the ancestors for the node in question.
This is duplicate work but we rely on the number of explicitly accessed nodes to be small.
[OPTIMIZATION] Cut down branches or skip computing already computed ancestors
For levels `i` with a number of objects `N_i < K` (where K could be 8192), do the update on the CPU. After the first level with more than K objects, do the rest on the GPU.
We need to set `baseInstance` to achieve a DrawID and resign from using vertex attribute divisors for instancing (fetch data from an SSBO or TBO using `gl_InstanceID`). We need to clear the indirect draw buffer before every compute
OR
Clear the indirect draw buffer with a shader before every compute
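For reference, this is the standard GL indirect draw record, with `baseInstance` repurposed as a per-draw object ID as suggested above. The clear step is sketched on the CPU under the assumption that only `instanceCount` changes between frames:

```cpp
#include <cstdint>
#include <vector>

// Layout of GL_DRAW_INDIRECT_BUFFER entries for glDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    uint32_t count;         // index count
    uint32_t instanceCount; // incremented (atomically) per surviving instance
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;  // repurposed as a per-draw DrawID/object index
};

// Only instanceCount needs to return to 0 before the culling compute runs;
// the geometry fields and the DrawID can stay untouched.
void clearIndirect(std::vector<DrawElementsIndirectCommand>& cmds) {
    for (auto& c : cmds) c.instanceCount = 0u;
}
```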
It would make sense not to use atomic counters to append to the draw buffers, because plain SSBO atomics get you the same result.
Have a function in `IDummyTransformationSceneNode` which will fetch the pointer to the allocated matrix in the compute shader input back buffer. Treat as Usual (TM), just like any CPU animation or position/scale/rotation setting.
Memcpy the entire matrix of the physics simulated items into the compute shader output buffer while setting a "do not recompute" flag on the globally set objects.
We could actually re-upload the output buffer contents to both set globally controlled object's final transforms AND clear the output buffer at the same time.
Bonus points for a unified pipeline for globally controlled/transformed objects [and possibly objects whose absolute transform was calculated randomly] being turned into root nodes.
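A sketch of that combined upload (the `PerObjectOutput` layout and helper name are assumptions): a zero-initialised staging copy of the output buffer clears last frame's results, while the globally controlled entries carry their final transforms plus a do-not-recompute flag.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

struct PerObjectOutput {
    float    absoluteTransform[16];
    uint32_t doNotRecompute; // update shader leaves this object alone
    uint32_t padding[3];
};

// globallyControlled: (object index, final transform) pairs, e.g. from physics.
std::vector<PerObjectOutput> buildStagingUpload(
        std::size_t objectCount,
        const std::vector<std::pair<std::size_t, std::array<float,16>>>& globallyControlled) {
    std::vector<PerObjectOutput> staging(objectCount); // value-init: all zeroes
    for (const auto& [index, tform] : globallyControlled) {
        std::memcpy(staging[index].absoluteTransform, tform.data(), sizeof(float)*16);
        staging[index].doNotRecompute = 1u;
    }
    return staging; // one upload both clears and seeds the output buffer
}
```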
One solution would be to make these HiZ buffers into texture arrays and redirect into them; however, the HiZ buffers could all be different sizes, which would kind-of break this idea.
However HiZ buffers don't follow ordinary mip-mapping rules (strictly 4:1 downsample) so we could get away with that.
Another solution would be to sort the renderpasses such that the passes with HiZ are in the first N slots and the compute shader does a dynamically uniform loop over the first N bits, and then resorts to a scan.
Well they have to have some sort of max bounds in the first place so just add a render pass with the larger viewport bounds.
Then refine (more culling and transform) the resultant semi-filtered output.
Culling a bbox against the viewproj matrix is extremely fast however we can allow more than 6 culling planes for other viewports and cull almost as fast.
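A sketch of plane-set culling that naturally extends past 6 planes (the plane convention and names are this sketch's assumptions): a box is culled as soon as its positive ("p") vertex lies behind any plane.

```cpp
#include <array>
#include <vector>

struct Plane { float nx, ny, nz, d; }; // inside iff n.x*x + n.y*y + n.z*z + d >= 0

// Test only the AABB corner furthest along each plane normal; extra
// viewports just append their planes to the same list.
bool aabbOutside(const std::array<float,3>& mn, const std::array<float,3>& mx,
                 const std::vector<Plane>& planes) {
    for (const auto& p : planes) {
        const float px = p.nx >= 0.f ? mx[0] : mn[0];
        const float py = p.ny >= 0.f ? mx[1] : mn[1];
        const float pz = p.nz >= 0.f ? mx[2] : mn[2];
        if (p.nx*px + p.ny*py + p.nz*pz + p.d < 0.f)
            return true; // entirely behind one plane -> culled
    }
    return false; // conservative: may still report an occluded box visible
}
```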
P.S. It seems that separating update from culling is very logical.
P.P.S. Maybe not separating, but definitely have an option to output updated absolute world transforms to an extra buffer (plus whether to force position update on all objects regardless of culling).