About the Project

During the training of Convolutional Neural Networks (CNNs), the convolutional layers are the most time-consuming. We therefore set out to accelerate the forward-pass convolution operation on GPUs, which reduces the time spent in those layers.

Researchers are actively working on faster convolution methods, including the Winograd algorithm and FFT-based convolution.

From our literature survey, we found that relatively few researchers are working on accelerating general matrix multiplication (GEMM) based convolution through more efficient memory access patterns. We therefore decided to implement and verify one of these techniques.
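For readers unfamiliar with GEMM-based convolution, the idea can be sketched in plain Python: every filter-sized patch of the image is unrolled into a row of a matrix (often called im2col), and the convolution then becomes a single matrix product. This is only an illustration under simplifying assumptions (one channel, no padding, no filter flipping, as is usual in CNNs). It is not the kernel from this project, nor the cuDNN implementation, and the function names `im2col`, `matmul`, `conv_gemm` and `conv_direct` are hypothetical.

```python
def im2col(image, k, stride=1):
    """Unroll each k x k patch of a 2-D image into one row of a matrix."""
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    rows = [
        [image[i * stride + di][j * stride + dj] for di in range(k) for dj in range(k)]
        for i in range(out_h)
        for j in range(out_w)
    ]
    return rows, out_h, out_w


def matmul(a, b):
    """Plain matrix product of a (m x n) and b (n x p), standing in for GEMM."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def conv_gemm(image, filters, stride=1):
    """Single-channel, unpadded convolution expressed as one matrix product."""
    k = len(filters[0])
    patches, out_h, out_w = im2col(image, k, stride)
    # One flattened filter per row: n_filters x (k*k).
    weights = [[f[di][dj] for di in range(k) for dj in range(k)] for f in filters]
    # Transpose the patch matrix to (k*k) x (out_h*out_w) so the shapes line up.
    patches_t = [list(col) for col in zip(*patches)]
    flat = matmul(weights, patches_t)  # n_filters x (out_h*out_w)
    # Fold each flat output row back into an out_h x out_w feature map.
    return [[row[r * out_w:(r + 1) * out_w] for r in range(out_h)] for row in flat]


def conv_direct(image, filters, stride=1):
    """Reference implementation: slide each filter over the image directly."""
    k = len(filters[0])
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            [
                sum(f[di][dj] * image[i * stride + di][j * stride + dj]
                    for di in range(k) for dj in range(k))
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for f in filters
    ]


# Small demo: a 4x4 image and one 3x3 vertical-edge filter, stride 1.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
edge_filter = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
result = conv_gemm(image, [edge_filter])
print(result)
print(result == conv_direct(image, [edge_filter]))
```

Note that `im2col` copies each interior pixel into several rows, once for every patch that overlaps it. That duplication is one reason memory access patterns matter so much for this family of kernels.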

Our implementation of the convolution kernel is based on the algorithms mentioned in the conference paper titled "Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs" which was accepted at DAC'17.

Our implementation is benchmarked, using nvprof, against the single-precision general matrix multiplication (SGEMM) based convolution kernel available in NVIDIA's cuDNN library.

Special thanks to Peter Goldsborough for his blog post and gist, which explain how to use the convolution algorithm routines in the cuDNN library. Without his work, we would have had a hard time working through the cuDNN developer guide to benchmark our kernel.

Benchmarking Environment

OS : Ubuntu 16.04.3 LTS

GPU : GeForce GTX 650 Ti BOOST

CUDA Driver Version : 9.0

CUDA Runtime Version : 8.0

CUDA Capability Version : 3.0

cuDNN Major Version : 7

Benchmarking Results

For the purpose of benchmarking, we refer to our implementation of the memory-efficient kernel as Kernel A and to the SGEMM-based convolution kernel of cuDNN as Kernel B.

Some of the benchmarking results are listed here for a stride of 1, a 3*3 filter and a single channel:

| Kernel | Image Dimension | Avg. Time |
| --- | --- | --- |
| Kernel A | 2048*2048 | 8.2038 ms |
| Kernel B | 2048*2048 | 15.149 ms |
| Kernel A | 1024*1024 | 2.0776 ms |
| Kernel B | 1024*1024 | 3.7918 ms |
| Kernel A | 512*512 | 531.65 µs |
| Kernel B | 512*512 | 955.65 µs |

From the above table, Kernel A outperforms Kernel B at every image size, taking roughly 44% to 46% less time, which corresponds to a speedup of about 1.8x.
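The comparison can be reproduced from the table with a few lines of arithmetic. The sketch below uses only the average times listed above, converted to microseconds. The helper name `compare` is ours for illustration and is not part of the benchmark code.

```python
# Average kernel times from the table above, converted to microseconds.
timings_us = {
    "2048*2048": {"A": 8203.8, "B": 15149.0},
    "1024*1024": {"A": 2077.6, "B": 3791.8},
    "512*512": {"A": 531.65, "B": 955.65},
}


def compare(time_a, time_b):
    """Return (speedup of A over B, percentage reduction in time)."""
    speedup = time_b / time_a
    reduction_pct = (1.0 - time_a / time_b) * 100.0
    return speedup, reduction_pct


for size, t in timings_us.items():
    speedup, reduction = compare(t["A"], t["B"])
    print(f"{size}: {speedup:.2f}x speedup, {reduction:.1f}% less time")
```

Reporting both figures avoids a common ambiguity: a speedup of 2x would be needed for a 50% reduction in time, whereas the measured speedups here are somewhat below that.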