prod_env_mat: allocate GPU memory out of frame loop #2832
Merged
Allocating GPU memory is not a cheap operation. This PR moves the allocations for `int_temp`, `uint64_temp`, and `tensor_list[0, 1, 3, 4, 5, 6]` out of the frame loop, so the buffers are reused in each iteration instead of being allocated many times.

In the original code, `tensor_list[3]`, `tensor_list[4]`, and `tensor_list[6]` may need to be reallocated when the existing memory is not large enough; this behavior is kept.

The shape of `tensor_list[2]` is dynamic, so it is not refactored in this PR.

With CUDA enabled, the C++ and Python unit tests pass, and the examples run correctly. The speedup is observable when the number of frames (samples) in a batch is not small.
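As a minimal sketch of the pattern this PR applies: buffers are allocated once before the frame loop and reused, with a grow-on-demand reallocation only when a frame needs more space than the current buffer holds. The example below uses NumPy as a stand-in for the GPU allocations, and the function and variable names (`process_batch`, `buf_size`) are hypothetical, not taken from the actual `prod_env_mat` code.

```python
import numpy as np

def process_batch(frames, buf_size=1024):
    # Pre-allocate reusable buffers once, outside the frame loop
    # (analogous to int_temp / uint64_temp in the PR).
    int_temp = np.empty(buf_size, dtype=np.int64)

    results = []
    for frame in frames:
        n = frame.shape[0]
        # Grow-on-demand: reallocate only when the existing buffer is too
        # small, mirroring the retained behavior of tensor_list[3], [4], [6].
        if n > int_temp.shape[0]:
            int_temp = np.empty(n, dtype=np.int64)
        int_temp[:n] = frame          # reuse the buffer in place
        results.append(int(int_temp[:n].sum()))
    return results
```

The key point is that the common case performs zero allocations per frame; a reallocation happens only when a frame exceeds the largest size seen so far.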