Adding synchronize to cupy implementations #335
Merged
Successive cupy kernel invocations are queued to the same CUDA stream, but they execute asynchronously with respect to host code. `cuda.stream.synchronize` calls are needed to ensure that the kernel code completes execution before the data is accessed by the host. This PR adds these `synchronize` calls to the kernels.

The report from the dpbench execution generated after this change is below.
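A minimal sketch of the pattern described above, not the code from this PR. The kernel `l2_norm_sketch` and the helpers `synchronize` and `timed_ms` are hypothetical names chosen for illustration. It assumes cupy's `cupy.cuda.get_current_stream().synchronize()` is the call in use, and it falls back to NumPy (where the synchronize step is a no-op) so the sketch still runs on a machine without a CUDA device.

```python
import time

try:
    import cupy as xp

    # Raises if no CUDA driver/device is available, triggering the fallback.
    xp.cuda.runtime.getDeviceCount()

    def synchronize():
        # Block the host until all work queued on the current stream is done.
        xp.cuda.get_current_stream().synchronize()

except Exception:
    # Assumption for illustration only: run on the host when cupy is unusable.
    import numpy as xp

    def synchronize():
        # Host execution is already synchronous, so there is nothing to wait for.
        pass


def l2_norm_sketch(a):
    """Hypothetical kernel: row-wise L2 norm, followed by a synchronize."""
    d = xp.sqrt(xp.sum(a * a, axis=1))
    # Without this call the function could return while the GPU is still
    # computing, so a host-side timer would only measure the launch overhead.
    synchronize()
    return d


def timed_ms(fn, *args):
    """Return (result, elapsed milliseconds) for one call of fn."""
    synchronize()  # make sure earlier queued work is not counted
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return result, elapsed_ms


if __name__ == "__main__":
    data = xp.ones((1024, 3))
    norms, ms = timed_ms(l2_norm_sketch, data)
    print(f"first norm = {float(norms[0]):.6f}, elapsed = {ms:.3f} ms")
```

Placing the synchronize inside the kernel function, rather than in the caller, keeps the measured time tied to kernel completion no matter how the benchmark harness invokes it.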
Summary of current implementation
   input_size  benchmark          problem_preset  cupy
0  20MB        black_scholes      S               Success
1  8KB         gpairs             S               Success
2  1MB         l2_norm            S               Success
3  8MB         pairwise_distance  S               Success
4  1MB         pca                S               Success
5  7MB         rambo              S               Success
Summary of current implementation
   input_size  benchmark          problem_preset  cupy
0  5GB         black_scholes      M               97.05ms
1  512KB       gpairs             M               60.02ms
2  8GB         l2_norm            M               493.05ms
3  8GB         pairwise_distance  M               399.43ms
4  1GB         pca                M               133.48ms
5  1GB         rambo              M               60.23ms