
Move scene-graph management to the GPU #96

Closed
devshgraphicsprogramming opened this issue May 17, 2018 · 9 comments
Labels: large big < task size < enormous priority

devshgraphicsprogramming commented May 17, 2018

Outline

The idea is to calculate the common shader uniforms/inputs for all objects, such as:
AbsoluteTransformViewProjection matrix (modelviewproj) for multiple passes/viewports
AbsoluteTransform matrix (model) for multiple passes/viewports
WorldNormal matrix
etc.

from the scene representation as a hierarchy (a scene-tree, commonly misnamed a scene-graph).

By "objects" we mean meshbuffer instances (instances as in copies in the world, not hardware instancing, although hardware instances count towards the total), and other scene nodes such as bones, etc.

The purpose is three-fold:

  1. To fill the Uniform Buffer Object with per-object draw parameters efficiently
  2. To construct the MultiDrawIndirect command buffer on the GPU
  3. To perform bone update from the Animations on the GPU

This should include view frustum and occlusion culling of the results.

Implementation

Each node should have a reference/pointer/handle to an allocated input range of LocalTransforms; for most nodes this will be a range of 1 matrix, but for instanced nodes it will be a range sized for the reserved number of instances.

One compute shader should be dispatched once per node hierarchy level, scanning the node hierarchy from top to bottom, propagating the relative transformation matrices and calculating the view-dependent transformations, with optional frustum and occlusion culling of two types:

  1. Type SELF flags self as culled but allows subsequent children updates and draws
  2. Type FAMILY flags self as culled and kills all children

The culled FAMILY flag will only be issued if the object is culled across all viewports tested.
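
The interaction of the two flag types with the parent's flag can be sketched on the CPU (the names and the `uint8_t` encoding below are illustrative, not the engine's actual types):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical culling flags mirroring the two types described above.
enum CullFlag : uint8_t { CULL_NONE = 0, CULL_SELF = 1, CULL_FAMILY = 2 };

// CULL_SELF suppresses this node's draw but still lets children update/draw;
// CULL_FAMILY kills the whole subtree, so it overrides whatever the child set.
inline uint8_t propagate(uint8_t parentFlag, uint8_t selfFlag)
{
    if (parentFlag == CULL_FAMILY)
        return CULL_FAMILY; // killed by an ancestor, nothing else matters
    return selfFlag;        // a parent's SELF flag does not affect children
}

inline bool shouldDraw(uint8_t flag)   { return flag == CULL_NONE; }
inline bool shouldUpdate(uint8_t flag) { return flag != CULL_FAMILY; }
```
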

The input to this compute shader shall be:
Viewport viewproj matrix list and viewport count
[optional: z culling] HiZ approximate depth buffers
[possibly implicit] SelfID to retrieve its own renderpass bitfield, instanceID, maxInstances, bounding box and transform, plus user-defined data
Parent ID/Handle to retrieve parent transform and culling flag
Draw Type (none/unique/Instanced) + DrawID for instanced draws (to increment instance count)

The output of this compute shader:
DrawIndirect parameters output to multiple append output buffer streams (1 per separate Pipeline [VAO,Texture,Shader] and Renderpass combination)
DrawIndirect parameters for instanced meshes (needing possible compaction -- unlikely) without append
Global UBOs (one for non-instanced, one for instanced) with per-object attribute pointers

NOTE 1: Hierarchy level 0 contains only root nodes ergo we should really only launch this computation for levels 1 onwards
NOTE 2: There will be an extra hierarchy level for instanced nodes.
NOTE 3: Globally controlled bones (EBUM_CONTROL) should become flat root nodes despite having parents

Technical considerations

  1. Should we use DispatchIndirect?
    If the work in higher levels can make further work in lower levels useless, shouldn't we use dispatch indirect?
    This can mess up and complicate multiple things, so it's best left as an end-stage optimization.
  2. Why the global objectID pointer/handle? RESOLVED
    Because surplus static data such as material parameters, or data relevant only to some objects, such as light parameters, material parameters, etc. shouldn't be copied around needlessly into the output or indirect draw buffers.
    Moreover if the surplus data is not copied and just lives inside the input buffer, it will either cause memory explosion (everything must be an UBER-STRUCT with all parameters for everything) or introduce non-uniform-sized allocations which defeat the point of implicit object IDs for the culling shader.
    Finally, any surplus data would be duplicated if an object is included in multiple render-passes, especially if the buffer space would be pre-allocated for all possible drawindirects (as if no culling).
  3. How many renderpasses should we support? RESOLVED
    Right now IrrlichtBAW has about 5 built-in passes (camera, shadow[stock irr], solid, transparent, effect) which it automagically resolves via looking up blending equations in the material (insane hackishness/spaghetti). BaW has 6 passes with explicit root nodes (one extra for water).
    So a uint8_t with 8 bits would be acceptable, and 32 bits would be the max.
    Blender has 20 layers.
    We can still dynamically reduce the maximum render-pass count.
  4. If the shader unifies transform update, culling and drawindirect preparation, how to ensure convergent execution? RESOLVED
    Better not to iterate over all 32 bits of the renderpass mask; this would cause a loop of 32 iterations for all objects, with 32 conditional stores. It would be particularly bad for objects active in the same number of renderpasses but different ones.
    Instead, use bitCount and progressively findLSB to iterate only over the active passes, eliminating execution divergence and making the cost independent of the maximum render-pass count.
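
A CPU-side sketch of that bit-popping loop (`__builtin_ctz` stands in for GLSL's `findLSB`); the loop trip count equals `bitCount(mask)` regardless of *which* passes are active, so two objects active in the same number of passes execute in lockstep:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pop set bits lowest-first instead of scanning all 32 renderpass bits.
std::vector<int> activePasses(uint32_t mask)
{
    std::vector<int> passes;
    while (mask) {
        int pass = __builtin_ctz(mask); // index of lowest set bit (~findLSB)
        passes.push_back(pass);
        mask &= mask - 1u;              // clear that bit
    }
    return passes;
}
```
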
  5. Do we use the same shader for all processing passes (all hierarchy levels)? RESOLVED
    No, because end-level objects such as meshbuffers and instanced meshbuffers have no local transforms, so they just produce indirect draws, and the only unique data for culling is their AABB.
    Ergo at least two are needed, updateParent and updateChildless; if culling and drawindirect preparation are to be decoupled then we need 3 (the last one would join updated absolute transforms with viewport and HiZ buffers to produce culled indirect draws)
  6. If you only support 32 layers, how do we do 100 point light shadows?
    You don't, not in the same frame.
    I'd allocate the few last layers for dynamic usage and schedule light updates over several frames.
    However we could provide an option to separate transformation update (without the FAMILY culling flag) from culling.
  7. How should we store the input data buffer?
    It would be tempting to allocate an address from a memory allocator so that the object attribute data storage does not reallocate over the object's lifetime.
    However this would lead to fragmentation where the ranges of neighbouring addresses would belong to objects at vastly different hierarchy levels, which would require very scattered reads in the hierarchy transform shader.
    Also we would have to walk the entire tree to collect "indices" (addresses) of the input data from the nodes to be processed.
    Another solution would be to use a granular memory allocator like the InstancedMeshSceneNode uses for continuous memory layouts; there is an indirection involved, but it's a reverse indirection (link) from the data to the object. As long as all necessary data is contained in the input buffer, the compute shader would not need to do a dependent read.
    We could flatten out the tree in memory and process sub-ranges, however this could over-constrain the memory allocation, so best approach would be to have a dynamically sub-allocated buffer region per hierarchy level.
    It would be best to make sure that all children of a node are allocated within the same subrange (for performance reasons, could do with an std::sort on the parent handle).
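
The sibling-grouping mentioned above is a plain sort on the parent handle; a sketch (the `LevelEntry` layout is illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One entry per node in a given hierarchy level.
struct LevelEntry { uint32_t parentHandle; uint32_t nodeID; };

// Order a level's nodes by parent handle so that siblings end up contiguous,
// giving the hierarchy transform shader coherent reads of parent data.
void groupSiblings(std::vector<LevelEntry>& level)
{
    std::sort(level.begin(), level.end(),
              [](const LevelEntry& a, const LevelEntry& b)
              { return a.parentHandle < b.parentHandle; });
}
```
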
  8. What should be the underlying structure of the output buffers?
    Definitely not SoA for the final output, bad for performance.
    One output for perObject dynamic data (modelviewproj, culled renderpasses flag, model, normal matrix, global object data handle etc.)
    [if not decoupled] One output for drawIndirect which is tightly packed (AMD requires this for performance).
    [optional pending on resolve of item 1] indirect dispatch parameters and children to update
  9. How should the SoA be implemented ?
    GLSL only supports unsized arrays in the SSBO at the end of the declaration.
    So our only three options are:
layout(binding=X) readonly buffer ObjectInputData
{
    type0 attr0[MAX_OBJECTS_IN_LEVEL];
    ...
    typeN attrN[MAX_OBJECTS_IN_LEVEL];
};

or

layout(binding=X) readonly buffer ObjectInputData_attr0
{
    type0 data[];
};
...
layout(binding=X+N) readonly buffer ObjectInputData_attrN
{
    typeN data[];
};

or

layout(binding=X) readonly buffer ObjectInputData
{
   vec4 allData[]; //or uint
};

type0 getAttribute0(uint objID)
{
    ... do decode stuff from an offset ...
}
...
typeN getAttributeN(uint objID)
{
    ... do decode stuff from an offset ...
}

Benchmarking should provide the answer; however, idea 3 counts as an "untyped" load and could see performance similar to TBOs instead of SSBOs.
EDIT: Actually SoA is a really bad idea for the output buffer; it gives good performance for the culling shader but bad performance for the actual draws, which will access very small ranges of object data per triangle batch.
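
A CPU-side sketch of what option 3's decode getters amount to; the layout (stride, member offsets, attribute types) is entirely made up for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat layout: per object, 2 uints of attr0 then 1 float attr1.
constexpr uint32_t kObjectStride = 3; // uints per object

// "... do decode stuff from an offset ..." made concrete for this layout.
uint32_t getAttribute0(const std::vector<uint32_t>& buf, uint32_t objID)
{
    return buf[objID * kObjectStride + 0]; // first word of attr0
}
float getAttribute1(const std::vector<uint32_t>& buf, uint32_t objID)
{
    float f; // bit-cast the raw word back to its typed form
    std::memcpy(&f, &buf[objID * kObjectStride + 2], sizeof(f));
    return f;
}
```
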

  10. How should input and output buffers be bound?
    Obviously the per-object input data buffer (local tforms) would need to be bound, it would be best not to rebind its particular level ranges at different offsets for performance reasons (unless per-region instead of whole object synchronisation primitives are available in Vulkan).
    Another input to each stage is the previous iteration's output (parent xforms) so that could be rebound as a read-only range, so that the output could be bound as a write-only range, however that would require binding and unbinding of buffer ranges which requires benchmarking.
  11. Can we accept a limit on the number of nodes in a hierarchy level? RESOLVED
    Suppose each object drawn requires a generous 3 matrices, 8 bindless handles (not sure if they can come from indirect reads), and some extra dynamic attributes which somehow rely on the culling results. This gives 256-512 bytes per object; the rest of the "uniform" data lives in a static buffer.
    We can expect around 2GB of VRAM from a GPU, which gives a theoretical max of 4-8 million objects.
    With legacy draw methods we can hope for 2-4 million drawcalls/sec, so for acceptable interactivity we get 30-130 thousand drawcalls per frame, and that's with the CPU doing nothing else.
    This amount of draws would require a 15-75 MB buffer, which is acceptable.
    With this system we just have to enable the drawing of just as many objects, we don't have to top it.
    Even at the limit this could go up to 150-750MB but the only thing we'd be drawing would be a synthetic benchmark of billboards or other polycount<50 objects.
    All of the above is assuming one renderpass, but we can get the exact numbers of possible draws for each renderpass so smaller buffers can be allocated for renderpasses with less max drawcalls.
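
The back-of-envelope numbers above, made explicit (2 GB of VRAM at 256-512 bytes per object bounds the object count; 2-4M draws/sec at 30-60 fps bounds the per-frame draw count):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kVRAM = 2ull << 30; // 2 GB, the assumed VRAM budget

// Theoretical object cap if all of VRAM held per-object dynamic data.
constexpr uint64_t maxObjects(uint64_t bytesPerObject)
{
    return kVRAM / bytesPerObject;
}

// Interactive per-frame draw budget from a drawcall-rate and a framerate.
constexpr uint64_t drawsPerFrame(uint64_t drawsPerSec, uint64_t fps)
{
    return drawsPerSec / fps;
}
```
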
  12. What should the maximum objects per hierarchy level limit be?
    Definitely not compiled-in static; we should be able to reset it at runtime, so a hard max (idea 2 in item 9) is infeasible
  13. What about CPU readback of transformations when we NEED to obtain the absolute position or transformation of an object?
    Definitely not going to do a CPU->GPU readback.
    We already don't react to any local transformation changes (with a few exceptions) after ISceneManager::OnAnimate(timeMs)
    We can enable a shadow AbsoluteTransform cached variable and do duplicate CPU updates of just the ancestors for the node in question.
    This is duplicate work but we rely on the number of explicitly accessed nodes to be small.
    [OPTIMIZATION] Cut down branches or skip computing already computed ancestors
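
A sketch of that CPU fallback with the shadow cache and the ancestor-skipping optimization; floats stand in for matrices again, and the struct/function names are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// parent < 0 marks a root; cachedAbs is the shadow AbsoluteTransform.
struct CachedNode { int32_t parent; float local; float cachedAbs; bool valid; };

// Recompute only the queried node's ancestor chain, caching along the way so
// repeated queries skip already-computed ancestors ([OPTIMIZATION] above).
float absoluteTransform(std::vector<CachedNode>& nodes, int32_t id)
{
    CachedNode& n = nodes[id];
    if (n.valid)
        return n.cachedAbs; // ancestor already computed, cut the branch here
    n.cachedAbs = (n.parent < 0) ? n.local
                                 : absoluteTransform(nodes, n.parent) * n.local;
    n.valid = true;
    return n.cachedAbs;
}
```
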
  14. What to do about really flat hierarchies, or ones where the top levels have very few objects? RESOLVED
    For levels i with number of objects N_i < K (where K could be 8192), do the update on the CPU.
    After the first level with more than K objects, do the rest on the GPU
  15. How does the lack of ARB_shader_draw_parameters affect us?
    We need to set baseInstance to achieve a DrawID and resign from using vertex attribute divisors for instancing (fetch data from SSBO or TBO using gl_InstanceID).
  16. And the lack of ARB_indirect_parameters?
    We need to clear the indirect draw buffer before every compute
    OR
    Clear the indirect draw buffer with a shader before every compute
    It would make sense not to use atomic counters to append to draw buffers because it gets you the same result.
  17. How do we integrate things which modify the local transform of an object on the CPU, such as AI, IK, etc.?
    Have a function in IDummyTransformationSceneNode which will fetch the pointer to the allocated matrix in the compute shader input back buffer.
    Treat as Usual (TM) just like any CPU animation or position/scale/rotation setting.
  18. How do we integrate things which modify the global transform, such as global space IK or physics simulations?
    Memcpy the entire matrix of the physics simulated items into the compute shader output buffer while setting a "do not recompute" flag on the globally set objects.
    We could actually re-upload the output buffer contents to both set globally controlled object's final transforms AND clear the output buffer at the same time.
    Bonus points for a unified pipeline for globally controlled/transformed objects [and possibly objects whose absolute transform was calculated randomly] being turned into root nodes.
  19. But how does item 4 fit in with viewports which have culling against their HiZ buffers enabled? RESOLVED
    One solution would be to make these HiZ buffers into texture arrays, and redirect, however the HiZ buffers could all be different sizes which would kind-of break this idea.
    However HiZ buffers don't follow ordinary mip-mapping rules (strictly 4:1 downsample) so we could get away with that.
    Another solution would be to sort the renderpasses such that the passes with HiZ are in the first N slots and the compute shader does a dynamically uniform loop over the first N bits, and then resorts to a scan.
  20. How to handle culling against view-dependent viewports such as Adaptive Shadow Maps or Adaptive Reflections? RESOLVED
    Well they have to have some sort of max bounds in the first place so just add a render pass with the larger viewport bounds.
    Then refine (more culling and transform) the resultant semi-filtered output.
  21. Cull against viewproj matrix or partitioning planes? RESOLVED
    Culling a bbox against the viewproj matrix is extremely fast however we can allow more than 6 culling planes for other viewports and cull almost as fast.
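
A sketch of the plane-based variant: an AABB is culled by a plane set if it lies entirely on the negative side of any plane, tested via the "positive vertex" trick (only the corner furthest along the plane normal needs checking). The plane count is a template parameter since non-frustum viewports may want more than 6 planes:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Plane in the form nx*x + ny*y + nz*z + d >= 0 for the "inside" half-space.
struct Plane { float nx, ny, nz, d; };
struct AABB  { float minX, minY, minZ, maxX, maxY, maxZ; };

template<std::size_t N> // N can exceed 6 for other viewport partitions
bool isCulled(const AABB& b, const std::array<Plane, N>& planes)
{
    for (const Plane& p : planes) {
        // pick the box corner furthest along the plane normal
        float x = p.nx > 0.f ? b.maxX : b.minX;
        float y = p.ny > 0.f ? b.maxY : b.minY;
        float z = p.nz > 0.f ? b.maxZ : b.minZ;
        if (p.nx*x + p.ny*y + p.nz*z + p.d < 0.f)
            return true; // even the most favourable corner is outside
    }
    return false; // not provably outside any plane
}
```
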

P.S. It seems that separating update from culling is very logical.
P.P.S. Maybe not separating, but definitely have an option to output updated absolute world transforms to an extra buffer (plus whether to force position update on all objects regardless of culling).

@devshgraphicsprogramming
Collaborator Author

devshgraphicsprogramming commented May 17, 2018

TODO:

  • Benchmark CPU host-side buffer vs. persistently mapped vs. GPU buffer as drawindirect input
  • Benchmark how much the indirection from indices costs in a culling-like shader
  • Benchmark items 9 and 10 (separate ranges and packed SoA)
  • Benchmark buffer swapping/update methodologies
  • Benchmark instancing via drawInstanced vs. MultiDrawIndirect

Buffer update methods:

  1. Have a CPU side back buffer and glNamedBufferSubData to the front GPU-only buffer at various ranges
  2. In case of scattered updates break update into multiple gl*SubData calls following some heuristic
  3. Scatter updates with a compute shader reading from streaming staging buffer
  4. Same as 3 but use a GPU-side staging buffer
    (in the benchmark send all geometry off-screen)

@devshgraphicsprogramming
Collaborator Author

Maybe transform update and cull should be split into:
Transform Update + Frustum [and optional rough Z] Cull, and Occlusion Cull with bool flags or atomic counters,
where the final Z cull could be enabled for some special objects which need accurate pixel counts etc.
In the spirit of page 37 from http://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf

@devshgraphicsprogramming
Collaborator Author

Will add a small proof of concept to ext for creating buffers (which can be used as either UBO or SSBO) containing worldViewProjection and normal matrices for scene nodes.

@devshgraphicsprogramming
Collaborator Author

We should definitely use only node/object IDs (indices) to then indirectly read from GPU Buffers that contain actual per-node data.

Rationale:

  • Compute Shader needs a way of mapping the continuous range of InvocationID to a unique objectInstanceID from a set of IDs that can contain holes
  • So we need an index list (lookup table of InvocationID to objectInstanceID)
  • With some mild CPU culling at the top levels of the scene-tree, only the list of Indices needs to be dynamically re-built per-frame, which will perform nicely even with millions of instances because it will only be indices (not all the extra per-object data which can be 20x more memory)
  • If per-object data does not change then no buffer transfer is necessary
  • Top-Level nodes could keep static IGPUBuffer lists of child object IDs, and then fire off the whole CS scene update stage as multiple Dispatch calls
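
A sketch of the per-frame index rebuild: the object pool contains holes (freed slots), so a compact list maps the shader's continuous InvocationID range onto valid objectInstanceIDs. Only these 4-byte indices get re-uploaded each frame (the hole marker is a made-up convention):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kInvalidID = 0xffffffffu; // hypothetical "hole" marker

// indices[invocationID] == objectInstanceID, skipping holes in the pool.
std::vector<uint32_t> buildIndexList(const std::vector<uint32_t>& pool)
{
    std::vector<uint32_t> indices;
    indices.reserve(pool.size());
    for (uint32_t id : pool)
        if (id != kInvalidID)
            indices.push_back(id);
    return indices;
}
```
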

@devshgraphicsprogramming
Collaborator Author

A Triangle and Triangle Cluster Culling pass could be added (page 32)
https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

@devshgraphicsprogramming
Collaborator Author

This is moving up in the ranks due to need for access of node and bone matrices for:

  1. TSSAA
  2. Raytracing
  3. GPU Culling

@devshgraphicsprogramming
Collaborator Author

See #303 for further discussion of GPU scene-graph management

@devshgraphicsprogramming
Collaborator Author

Everything here is hopelessly out of date and superseded by a new design in Nabla
