
GSoC 2010 CUDA


The CUDA Massively Parallel Trajectory Evolution project aims to use GPGPU computing, which has recently become a strong focus of scientific computing (often allowing substantial speed-ups in CPU-intensive tasks). Since evolving neurocontrollers for spacecraft trajectories is extremely time-consuming and requires a large number of numerical integrations, the appeal of high-performance GPGPU computing is obvious.

The goal of this project is to parallelise a neurocontroller evolution for a spacecraft docking sequence.

Students: Kashif Kaleem, Tee

Mentors: Juxi Leitner, Francesco Biscani

Previous Work / Context

Technical Details: PaGMO version 1.0 will provide a neural network toolbox (at present implemented in the evolving docking branch) to represent a neurocontroller (see docking.cpp) for a docking problem. A single objective function evaluation for this problem requires several integrations of a dynamical system using a numerical scheme (e.g., Runge-Kutta), and these integrations can be performed in parallel. The aim is to implement the following in CUDA / OpenCL (a sketch of the integrator idea follows the list):

  • a numerical integrator,
  • the equations governing the spacecraft's dynamics,
  • a neural network representation of the spacecraft control, and
  • the interface to a pagmo::problem class representing the overall optimisation problem.
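
As a rough illustration of the parallelisation idea (not actual PaGMO code): each GPU thread can integrate one candidate trajectory independently with a fixed-step Runge-Kutta scheme. In the sketch below, the kernel name, the state dimension and the placeholder dynamics are all assumptions made for illustration.

```cpp
#define STATE_DIM 6  // position + velocity; an assumption for the sketch

// Placeholder right-hand side (a harmonic oscillator per axis); the real
// kernel would evaluate the spacecraft's equations of motion instead.
__device__ void deriv(const float *x, float *dx, float t)
{
    dx[0] = x[3]; dx[1] = x[4]; dx[2] = x[5];
    dx[3] = -x[0]; dx[4] = -x[1]; dx[5] = -x[2];
}

// One thread integrates one trajectory: states holds n states of
// STATE_DIM floats each, advanced in place by `steps` RK4 steps of size dt.
__global__ void rk4_integrate(float *states, int n, float t0, float dt, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x[STATE_DIM];
    for (int j = 0; j < STATE_DIM; ++j) x[j] = states[i * STATE_DIM + j];

    float t = t0;
    for (int s = 0; s < steps; ++s) {
        float k1[STATE_DIM], k2[STATE_DIM], k3[STATE_DIM], k4[STATE_DIM], tmp[STATE_DIM];
        deriv(x, k1, t);
        for (int j = 0; j < STATE_DIM; ++j) tmp[j] = x[j] + 0.5f * dt * k1[j];
        deriv(tmp, k2, t + 0.5f * dt);
        for (int j = 0; j < STATE_DIM; ++j) tmp[j] = x[j] + 0.5f * dt * k2[j];
        deriv(tmp, k3, t + 0.5f * dt);
        for (int j = 0; j < STATE_DIM; ++j) tmp[j] = x[j] + dt * k3[j];
        deriv(tmp, k4, t + dt);
        for (int j = 0; j < STATE_DIM; ++j)
            x[j] += dt / 6.0f * (k1[j] + 2.0f * k2[j] + 2.0f * k3[j] + k4[j]);
        t += dt;
    }
    for (int j = 0; j < STATE_DIM; ++j) states[i * STATE_DIM + j] = x[j];
}
```

Because each trajectory is independent, no synchronisation between threads is needed; the kernel maps naturally onto the many objective function evaluations of a population-based optimiser.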

Timeline

The main dates here are based on the GSoC2010Timeline but are a bit more flexible than required by Google.

  • May 24, 2010: start of programming
  • July 15, 2010: mid-term review (similar to Google's)
  • August 20, 2010: pencils down; until the beginning of September there is time to give feedback and finish the project

Documentation

Build environment

(In Progress) - The build environment needs to support both CUDA and OpenCL, even though we have selected one of the two for the docking problem.

Integration of CUDA (Completed) - CUDA has been integrated into the CMake build sequence using FindCUDA.cmake. This lets us enable or disable CUDA through the ccmake utility, so that CUDA isn't a compilation requirement. It has been tested with the emulator (libcudartemu.so) and a basic test program was verified; note that emulator mode is deprecated and will be disabled in future CUDA releases. CUDA works with gcc-4.3, while gcc-4.4 can cause compilation problems.

Integration of OpenCL (On Hold) - There isn't a CMake script for OpenCL, so we will work with CUDA for now.

Integration of PyCUDA/PyOpenCL

Parallel numerical integrator

Runge-Kutta method

Spacecraft's dynamics equations

Hill-Clohessy-Wiltshire equations
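
For reference, the Hill-Clohessy-Wiltshire equations describe the relative motion of a chaser spacecraft with respect to a target on a circular orbit. In the target's local orbital frame (x radial, y along-track, z cross-track), with n the target's orbital mean motion:

```latex
\begin{aligned}
\ddot{x} &= 3n^2 x + 2n\dot{y},\\
\ddot{y} &= -2n\dot{x},\\
\ddot{z} &= -n^2 z.
\end{aligned}
```

Being linear with constant coefficients, these equations are cheap to evaluate per thread, which makes them a good first test case for the parallel integrator.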

Neural network representation

  • Perceptron (a device-side sketch follows this list)
  • Multilayer perceptron (In Progress) - Interface considerations are being made.
  • Elman network
  • Continuous-time recurrent neural network (CTRNN)
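
To give an idea of what the perceptron evaluation might look like on the device, here is a minimal sketch. The memory layout (row-major weights with a trailing bias per output neuron) and the sigmoid activation are assumptions made for the sketch, not the project's actual interface.

```cpp
// One thread computes one output neuron: weighted sum of inputs plus a
// bias, passed through a sigmoid. Weights are stored row-major with
// n_in + 1 entries per output neuron (the last one being the bias).
__global__ void perceptron_forward(const float *weights, const float *inputs,
                                   float *outputs, int n_in, int n_out)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= n_out) return;

    const float *w = weights + o * (n_in + 1);
    float sum = w[n_in];  // bias term
    for (int j = 0; j < n_in; ++j)
        sum += w[j] * inputs[j];
    outputs[o] = 1.0f / (1.0f + expf(-sum));  // sigmoid activation
}
```

A multilayer perceptron is then a sequence of such layer evaluations, with each layer's outputs becoming the next layer's inputs.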

Docking problem implementation

Reports

  • list of reports produced

Mid-Term Review

Juxi: Please write 1-2 paragraphs about your work done so far here! Deadline: July 15, 2010

Kashif

Right now I have basic implementations of everything except the CTRNN. From here on I plan to do the following:

  1. Memory management: Right now I am very concerned about the large amount of data that will have to be synchronously copied from the host to the device and back (some of which will be redundant). To handle this I will create an allocator class (probably not similar to the STL allocators, though) that deals with allocations depending on the device's (and maybe host's) abilities. The memory loading (cudaMemcpy) can be optimised by executing it asynchronously over a number of streams (see the streams sketch after this list). I am not yet sure how to find the number of streams that is actually beneficial, but once I do, one plan is to split big memory copies into smaller sections depending on the number of streams I can use. Another plan is to interlace memory loads with kernel executions, although this requires the device-overlap feature from the device. In short, there is a real need for an allocator optimised for the device's abilities and for the fact that it will be allocating and deallocating blocks of very consistent sizes. This will also be useful for other code that uses CUDA, like the fitness function evaluation.
  2. Kernel management: Right now the kernels are designed for large single neural networks, whereas our scenario requires many smaller ones. We can execute multiple neural networks at the same time if they are not interdependent. This requires modifying the kernels to include information about how big each neural network's data is, so that multiple neural networks can run in the same kernel. We will then have to modify the docking problem to bank the neural networks' data (in a class possibly named something like CudaExecutionContext) and execute them concurrently. Different classes of nets will have different implementations for this, depending on whether they have memory. The CudaExecutionContext will have to deal with details such as how many threads to execute at the same time (based on warp size, the possibility of multiple kernel execution, etc.).
  3. Other optimisations: The present kernel doesn't load the neural network data into shared memory, and it doesn't perform other optimisation steps (loop unrolling, etc.). These will be added soon. Apart from this, since kernels can be templatised, we can make one kernel work for both single- and double-precision floating-point types (depending on compute capability).
  4. The neural network interface will at some point need to change as well. The approach I am going for is to build a neural network instance from a number of layer instances and connection instances (similar, perhaps, to PyBrain). The key, of course, is that it will still be able to load a weight vector of a given size; for this we will determine the weight vector's size in the constructor.
  5. Performance management: The changes described above will be made while measuring execution times with the CudaTimer class (a sketch of event-based timing follows this list). Using this class we now have a good way to measure how fast we are going, so we know whether a change is an improvement or not. I will soon start producing reports from these measurements for the wiki too.
  6. Reuse: I expect most of the above code to be reusable for the rest of the docking problem's internals (fitness function evaluation and the integrator). We will also be able to reuse it for the Monte Carlo problem.
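
As a sketch of the stream idea from point 1: splitting a large transfer into chunks issued with cudaMemcpyAsync on several streams lets copies overlap with kernels queued on other streams. This requires page-locked (pinned) host memory and a device with the overlap capability; the chunking scheme and N_STREAMS below are placeholders to be tuned per device.

```cpp
#include <cuda_runtime.h>

const int N_STREAMS = 4;  // placeholder; the useful count is device-dependent

// h_data must be allocated with cudaMallocHost (page-locked) for the
// asynchronous copies to actually overlap with computation.
void run_batches(float *h_data, float *d_data, int n, int chunk)
{
    cudaStream_t streams[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int off = 0, s = 0; off < n; off += chunk, s = (s + 1) % N_STREAMS) {
        int len = (n - off < chunk) ? n - off : chunk;
        // The copy and any kernel launch queued in the same stream run in
        // order, but can overlap with work queued in the other streams.
        cudaMemcpyAsync(d_data + off, h_data + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        // some_kernel<<<blocks, threads, 0, streams[s]>>>(d_data + off, len);
    }

    for (int s = 0; s < N_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```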
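
Point 5 mentions the CudaTimer class; the usual way to implement such a timer is with CUDA events, as in the sketch below (the project's actual class may differ in interface).

```cpp
#include <cuda_runtime.h>

// Measure the GPU time of whatever work is queued between the two events.
float time_kernel_ms()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // ... kernel launch under test goes here ...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Unlike host-side timers, event-based timing measures only the device work recorded between the two events, so it is largely insensitive to host scheduling noise.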

Tee

tba
