Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding synchronize to cupy implementations #335

Merged
merged 1 commit into from
Feb 13, 2024
Merged

Conversation

adarshyoga
Copy link
Contributor

@adarshyoga adarshyoga commented Feb 12, 2024

  • Have you provided a meaningful PR description?

Multiple successive cupy kernel invocations are queued to the same cuda stream, but are executed asynchronously with host code. cuda.stream.synchronize calls are need to ensure that the kernel code completes execution before the data is accessed by the host. This PR adds these synchronize calls to the kernels.

Report from dpbench execution generated after this change is below.

Summary of current implementation

input_size benchmark problem_preset cupy
0 20MB black_scholes S Success
1 8KB gpairs S Success
2 1MB l2_norm S Success
3 8MB pairwise_distance S Success
4 1MB pca S Success
5 7MB rambo S Success

Summary of current implementation

input_size benchmark problem_preset cupy
0 5GB black_scholes M 97.05ms
1 512KB gpairs M 60.02ms
2 8GB l2_norm M 493.05ms
3 8GB pairwise_distance M 399.43ms
4 1GB pca M 133.48ms
5 1GB rambo M 60.23ms

@adarshyoga adarshyoga self-assigned this Feb 12, 2024
@adarshyoga adarshyoga enabled auto-merge February 13, 2024 07:16
@oleksandr-pavlyk
Copy link
Collaborator

FYI, here is how CuPy's own benchmark does it https://github.com/cupy/cupy/blob/main/examples/gemm/utils.py#L12-L22:

def benchmark(func, args, n_run):
    times = []
    for _ in range(n_run):
        start = cp.cuda.Event()
        end = cp.cuda.Event()
        start.record()
        func(*args)
        end.record()
        end.synchronize()
        times.append(cp.cuda.get_elapsed_time(start, end))  # milliseconds
    return times

Or https://github.com/cupy/cupy/blob/main/examples/cg/cg.py#L10-L17:

@contextlib.contextmanager
def timer(message):
    cupy.cuda.Stream.null.synchronize()
    start = time.time()
    yield
    cupy.cuda.Stream.null.synchronize()
    end = time.time()
    print('%s:  %f sec' % (message, end - start))

But what is being proposed in this PR works just the same.

Copy link
Collaborator

@oleksandr-pavlyk oleksandr-pavlyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I like the change to use cp instead of np in cupy implementation files.

@adarshyoga adarshyoga merged commit 0646f1e into main Feb 13, 2024
70 of 71 checks passed
@adarshyoga adarshyoga deleted the cupy_sync_add branch February 13, 2024 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants