Background

ImageLoader scans a tgz file of labeled images, assembles them into minibatch Tensors, and feeds them to the model through the following pipeline:

retrieve minibatch -> memcpy CPU to CUDA device -> training
The above pipeline runs synchronously: the training step has to wait until the memory copy step finishes, which keeps GPU utilization very low. This issue proposes a Prefetcher feature that overlaps the memory copy with CUDA kernel execution, to increase GPU utilization and training throughput.
The prefetcher feature works as the following pipeline:
prefetch minibatch -> training
|-wait previous minibatch ready
|-prefetch next minibatch
With prefetching, we copy the next minibatch Tensor from host to device while training on the current one, so the training step only waits for the previous memory copy to become ready.
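As a rough illustration, the training loop could take the following shape. This is only a sketch: Prefetch, Wait, and HasNext are hypothetical ImageLoader methods standing in for whatever API the feature ends up exposing.

```go
// Sketch of a prefetching training loop (hypothetical API).
loader.Prefetch() // start the async host-to-device copy of the first minibatch
for loader.HasNext() {
	batch := loader.Wait() // wait until the previously issued copy is ready
	loader.Prefetch()      // immediately start copying the next minibatch
	train(model, batch)    // training kernels overlap with the in-flight copy
}
```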
What to do
Maintain the Tensor GC Status in ImageLoader
In the current codebase, each iteration calls torch.GC() to wait until all Tensors have been cleared, but the prefetcher always keeps one or more minibatches alive in ImageLoader, so ImageLoader has to maintain the GC status of those Tensors itself.
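To make the conflict concrete, here is a schematic of today's loop, not the exact gotorch code: torch.GC() acts as a barrier on outstanding Tensors, and a minibatch deliberately pinned by the prefetcher would never pass that barrier.

```go
for i := 0; i < iters; i++ {
	// torch.GC() blocks until every Tensor from the previous iteration
	// has been cleared. A minibatch that the prefetcher intentionally
	// keeps alive inside ImageLoader would stall this barrier, so the
	// loader must track and exempt its own Tensors.
	torch.GC()
	batch := loader.Next() // hypothetical accessor
	train(model, batch)
}
```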
Port CUDA API to Go

On the GPU, the implementation of the prefetcher is a little tricky. The GPU device is asynchronous, so we can write sequential code and still get parallel execution. In libtorch, however, we cannot use goroutines in the prefetcher, because they break the CUDA stream binding.
Every CUDA kernel has to bind to a CUDA stream. In libtorch, the current CUDA stream is obtained via at::cuda::getCurrentCUDAStream.

Initially, at::cuda::getCurrentCUDAStream returns CUDA stream 0 (the default CUDA stream), and all CUDA kernels in libtorch run on it. To parallelize libtorch's computation kernels with the prefetcher's memory copy kernels, we have to bind the memory copy kernels to another CUDA stream.

However, libtorch does not expose a way to assign a CUDA stream to a kernel directly. The only thing we can do is call setCurrentCUDAStream, which sets a user-defined stream as the current CUDA stream:
```
# the default CUDA stream
stream0 = getCurrentCUDAStream()
# a user-defined CUDA stream
stream1 = Stream()

# memory copy kernel
setCurrentCUDAStream(stream1)
MemCpy(t1, t2, getCurrentCUDAStream())

# computation kernel
setCurrentCUDAStream(stream0)
Conv2d(t2, getCurrentCUDAStream())
```
CUDA kernels are launched asynchronously, so MemCpy returns at once, and Conv2d is then launched on stream0. Since the MemCpy kernel and the Conv2d kernel are bound to two different CUDA streams, they can execute in parallel.

However, if we launch MemCpy in a goroutine and keep Conv2d in the main goroutine, everything breaks. The setCurrentCUDAStream(stream1) call in the copy goroutine influences the whole GPU device, so the Conv2d kernel is also launched on stream1, which is not what we expect.
To support this, we need to port the following CUDA stream APIs to Go (c.f. https://github.com/pytorch/pytorch/blob/master/torch/cuda/streams.py#L12):

GetCurrentCUDAStream
SetCUDAStream
WaitCUDAStream
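Once those are available, the sequential trick above could be expressed in Go roughly as follows. This is a sketch of the proposed API, not working code: NewCUDAStream, CopyTo, and Forward are assumed names, and only GetCurrentCUDAStream, SetCUDAStream, and WaitCUDAStream come from the list above.

```go
// Sketch: overlap the host-to-device copy and the computation from a
// single goroutine by switching the current CUDA stream (proposed Go API).
stream0 := torch.GetCurrentCUDAStream() // the default CUDA stream
stream1 := torch.NewCUDAStream()        // assumed constructor for a new stream

torch.SetCUDAStream(stream1)
gpuBatch := cpuBatch.CopyTo(device) // async copy kernel launches on stream1

torch.SetCUDAStream(stream0)
out := model.Forward(prevGpuBatch) // computation kernels stay on stream0

torch.WaitCUDAStream(stream1) // block until the prefetched copy is ready
```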