Background

ImageLoader scans a tgz file of labeled images, assembles them into minibatch Tensors, and feeds them to the model through the following pipeline:

retrieve minibatch -> memcpy CPU to CUDA device -> training
The above pipeline runs synchronously: the training step has to wait until the memory copy step finishes, which keeps GPU utilization very low. This issue proposes a Prefetcher feature that overlaps the memory copy with CUDA kernel execution, to increase GPU utilization and training throughput.
The prefetcher feature works as the following pipeline:
prefetch minibatch -> training
|-wait previous minibatch ready
|-prefetch next minibatch
With prefetching, we copy the next minibatch Tensor from host to device while training on the current one, so the training step only waits for the previous memory copy to become ready.
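As a rough illustration, the training loop could take the following shape. This is only a sketch: Prefetch, Wait, and HasNext are hypothetical ImageLoader methods standing in for whatever API the feature ends up exposing.

```go
// Sketch of a prefetching training loop (hypothetical API).
loader.Prefetch() // start the async host-to-device copy of the first minibatch
for loader.HasNext() {
	batch := loader.Wait() // wait until the previously issued copy is ready
	loader.Prefetch()      // immediately start copying the next minibatch
	train(model, batch)    // training kernels overlap with the in-flight copy
}
```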
What to do
Maintain the Tensor GC Status in ImageLoader
In the current codebase, each iteration calls torch.GC() to wait until all Tensors have been cleared, but the prefetcher always keeps one or more minibatches alive in ImageLoader, so ImageLoader has to maintain the GC status of those Tensors itself.
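To make the conflict concrete, here is a schematic of today's loop, not the exact gotorch code: torch.GC() acts as a barrier on outstanding Tensors, and a minibatch deliberately pinned by the prefetcher would never pass that barrier.

```go
for i := 0; i < iters; i++ {
	// torch.GC() blocks until every Tensor from the previous iteration
	// has been cleared. A minibatch that the prefetcher intentionally
	// keeps alive inside ImageLoader would stall this barrier, so the
	// loader must track and exempt its own Tensors.
	torch.GC()
	batch := loader.Next() // hypothetical accessor
	train(model, batch)
}
```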
Port CUDA API to Go

On the GPU, the implementation of the prefetcher is a little tricky. The GPU device is asynchronous, so we can write sequential code and still get parallel execution. In libtorch, however, we cannot use goroutines in the prefetcher, because they break the CUDA stream binding.
Every CUDA kernel has to bind to a CUDA stream. In libtorch, the current CUDA stream is obtained via at::cuda::getCurrentCUDAStream.

Initially, at::cuda::getCurrentCUDAStream returns CUDA stream 0 (the default CUDA stream), and all CUDA kernels in libtorch run on it. To parallelize libtorch's computation kernels with the prefetcher's memory copy kernels, we have to bind the memory copy kernels to another CUDA stream.

However, libtorch does not expose a way to assign a CUDA stream to a kernel directly. The only thing we can do is call setCurrentCUDAStream, which sets a user-defined stream as the current CUDA stream:
```
# the default CUDA stream
stream0 = getCurrentCUDAStream()
# a user-defined CUDA stream
stream1 = Stream()

# memory copy kernel
setCurrentCUDAStream(stream1)
MemCpy(t1, t2, getCurrentCUDAStream())

# computation kernel
setCurrentCUDAStream(stream0)
Conv2d(t2, getCurrentCUDAStream())
```
CUDA kernels are launched asynchronously, so MemCpy returns at once, and Conv2d is then launched on stream0. Since the MemCpy kernel and the Conv2d kernel are bound to two different CUDA streams, they can execute in parallel.

However, if we launch MemCpy in a goroutine and keep Conv2d in the main goroutine, everything breaks. The setCurrentCUDAStream(stream1) call in the copy goroutine influences the whole GPU device, so the Conv2d kernel is also launched on stream1, which is not what we expect.
To support this, we need to port the following CUDA stream APIs to Go (c.f. https://github.com/pytorch/pytorch/blob/master/torch/cuda/streams.py#L12):

GetCurrentCUDAStream
SetCUDAStream
WaitCUDAStream
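Once those are available, the sequential trick above could be expressed in Go roughly as follows. This is a sketch of the proposed API, not working code: NewCUDAStream, CopyTo, and Forward are assumed names, and only GetCurrentCUDAStream, SetCUDAStream, and WaitCUDAStream come from the list above.

```go
// Sketch: overlap the host-to-device copy and the computation from a
// single goroutine by switching the current CUDA stream (proposed Go API).
stream0 := torch.GetCurrentCUDAStream() // the default CUDA stream
stream1 := torch.NewCUDAStream()        // assumed constructor for a new stream

torch.SetCUDAStream(stream1)
gpuBatch := cpuBatch.CopyTo(device) // async copy kernel launches on stream1

torch.SetCUDAStream(stream0)
out := model.Forward(prevGpuBatch) // computation kernels stay on stream0

torch.WaitCUDAStream(stream1) // block until the prefetched copy is ready
```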