
Implement PreFetcher to Increase ImageLoader Throughput #327

Open
Yancey1989 opened this issue Sep 14, 2020 · 2 comments

Yancey1989 (Collaborator) commented Sep 14, 2020

Background

ImageLoader scans a tgz archive of labeled image files, converts them into minibatch Tensors, and feeds them to the model through the following pipeline:

retrieve minibatch -> memcpy CPU to CUDA device -> training

The above pipeline runs synchronously: the training step has to wait for the memory copy step to finish, which keeps GPU utilization very low. This issue proposes a Prefetcher feature that overlaps memory copy with CUDA kernel execution to increase GPU utilization and training throughput.

The prefetcher feature works as the following pipeline:

prefetch minibatch -> training
  |-wait previous minibatch ready
  |-prefetch next minibatch

We can copy the next minibatch Tensor from host to device while training on the current one, so the training step only needs to wait for the previous memory copy to complete.
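As a rough illustration, the intended training loop could look like the sketch below, assuming a hypothetical Prefetcher wrapper around ImageLoader (the names are illustrative, not existing API):

// Hypothetical sketch: Prefetcher, Next, and trainOneStep are placeholders.
// Next() waits for the host-to-device copy issued on the previous call and
// immediately kicks off the copy of the following minibatch on a side stream.
prefetcher := NewPrefetcher(imageLoader)
for {
	data, label, ok := prefetcher.Next()
	if !ok {
		break
	}
	trainOneStep(model, data, label) // runs while the next copy is in flight
}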

What to do

Maintain the Tensor GC Status in ImageLoader

In the current codebase, each iteration calls torch.GC() to wait until all Tensors have been cleared, but the prefetcher always keeps one or more minibatches alive in ImageLoader.

Port CUDA API to Go

cf. https://github.com/pytorch/pytorch/blob/master/torch/cuda/streams.py#L12

  1. GetCurrentCUDAStream
  2. SetCUDAStream
  3. WaitCUDAStream
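A possible shape for these Go bindings is sketched below. This is only a sketch: the package name, the CUDAStream type, and the function bodies are assumptions; only the at::cuda function names in the comments come from libtorch.

// Hypothetical Go bindings for the CUDA stream API; bodies are stubs where
// the real cgo calls would go.
package cuda

import "unsafe"

// CUDAStream wraps a libtorch at::cuda::CUDAStream handle.
type CUDAStream struct{ ptr unsafe.Pointer }

// GetCurrentCUDAStream would call at::cuda::getCurrentCUDAStream through cgo
// and return the stream that kernels on the given device are launched on.
func GetCurrentCUDAStream(device int) CUDAStream {
	panic("sketch only: cgo call to at::cuda::getCurrentCUDAStream goes here")
}

// SetCUDAStream would call at::cuda::setCurrentCUDAStream, making s the
// current stream so that subsequently launched kernels bind to it.
func SetCUDAStream(s CUDAStream) {
	panic("sketch only: cgo call to at::cuda::setCurrentCUDAStream goes here")
}

// WaitCUDAStream would block the caller until all work queued on s has
// finished, e.g. by synchronizing the underlying cudaStream_t.
func WaitCUDAStream(s CUDAStream) {
	panic("sketch only: stream synchronization goes here")
}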

Reference

  1. https://github.com/NVIDIA/apex/blob/4ef930c1c884fdca5f472ab2ce7cb9b505d26c1a/examples/imagenet/main_amp.py#L264
  2. https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
wangkuiyi (Owner) commented:

Is this prefetcher the asynchronous image loader with a buffer size equal to 1?

QiJune (Collaborator) commented Sep 15, 2020

@wangkuiyi

On GPU, the implementation of the prefetcher is a little tricky. Because the GPU device executes asynchronously, we can write sequential code and still achieve parallelism. In libtorch, however, we cannot use goroutines in the prefetcher, because doing so breaks the CUDA stream binding.

Every CUDA kernel has to bind to a CUDA stream. In libtorch, the current CUDA stream is obtained with at::cuda::getCurrentCUDAStream.

Initially, at::cuda::getCurrentCUDAStream returns CUDA stream 0 (the default CUDA stream), and all CUDA kernels in libtorch use this stream. To run libtorch's computation kernels in parallel with the prefetcher's memory-copy kernels, we have to bind the memory-copy kernels to another CUDA stream.

However, libtorch does not expose a way to assign a CUDA stream to a kernel directly. The only thing we can do is call setCurrentCUDAStream, which makes a user-defined stream the current CUDA stream.

# the default CUDA stream
stream0 = getCurrentCUDAStream()
# user-defined CUDA stream
stream1 = Stream()

# memory copy kernel
setCurrentCUDAStream(stream1)
MemCpy(t1, t2, getCurrentCUDAStream())

# computation kernel
setCurrentCUDAStream(stream0)
Conv2d(t2, getCurrentCUDAStream())

CUDA kernels are launched asynchronously, so MemCpy returns at once. Then Conv2d is launched on stream0. Since the MemCpy kernel and the Conv2d kernel are bound to two different CUDA streams, they can execute in parallel.

However, if we launch MemCpy in a goroutine and keep Conv2d in the main goroutine, everything breaks. Calling setCurrentCUDAStream(stream1) in the copy goroutine affects the whole GPU device, so the Conv2d kernel will also be launched on stream1, which is not what we expect.
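To make this concrete, here is a minimal Go sketch of the single-goroutine prefetch loop, reusing the stream API sketched under "Port CUDA API to Go" above. NewCUDAStream, the Minibatch type, and the copy/train helpers are placeholders, not existing gotorch API.

// Placeholder types and helpers standing in for ImageLoader and the model:
type Minibatch struct{ data, label int } // stand-in for a pair of Tensors

func nextHostMinibatch() (Minibatch, bool) { return Minibatch{}, false } // read the next batch from the tgz
func copyToDevice(m Minibatch) Minibatch   { return m }                  // would launch the H2D MemCpy kernel
func trainStep(m Minibatch)                {}                            // would launch forward/backward kernels

// trainLoop keeps everything in one goroutine and only switches the current
// CUDA stream, so the stream binding is never clobbered by another goroutine.
func trainLoop() {
	computeStream := GetCurrentCUDAStream(0) // stream 0, the default stream
	copyStream := NewCUDAStream(0)           // hypothetical: create a user-defined stream

	// Issue the copy of the first minibatch on the copy stream.
	SetCUDAStream(copyStream)
	next, ok := nextHostMinibatch()
	if ok {
		next = copyToDevice(next) // async memcpy, bound to copyStream
	}

	for ok {
		WaitCUDAStream(copyStream) // the minibatch we are about to use is now on the device
		cur := next

		// Kick off the copy of the following minibatch, still on copyStream.
		next, ok = nextHostMinibatch()
		if ok {
			next = copyToDevice(next)
		}

		// Training kernels go back on the compute stream and overlap with
		// the copy issued just above.
		SetCUDAStream(computeStream)
		trainStep(cur)
		SetCUDAStream(copyStream)
	}
}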

Yancey1989 added the enhancement label Sep 16, 2020