This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

GPGPUProposal

magnific0 edited this page Feb 24, 2014 · 5 revisions

OpenCL massively parallel docking trajectory evolution

Proposals from the GSoC applications ...

Comments

I would also design a flexible interface to integrate these CUDA implementations into PaGMO. This interface would facilitate future CUDA implementations of other parallelizable functions in PaGMO and provide a common framework.

OpenCL is very promising (it supports both NVIDIA and ATI cards), but it is just getting started and is still too new.

There are mainly two ways: the CUDA implementation could be absolutely transparent to the user, so that whenever a compatible device is detected, the CUDA version of a function is used. The second choice is to provide CUDA functions (with a suffix such as _cu) to the user, who decides whether or not to use them. Another significant choice is the floating-point precision used in CUDA. Indeed, for most CUDA devices (except the newest generation), double precision is much slower (around 8 times) than single precision, so we have to decide whether to give priority to precision or to speed. We also have to decide whether or not to support multiple cards (SLI); computation on multiple cards is not transparent in CUDA.

Some parameters used in the execution of CUDA kernels (functions) depend heavily on the hardware (the model of the card). We have to decide whether these parameters will be computed at build time (in the installation procedure) or at run time (in an initialization function). I think these design choices must be analyzed carefully at the beginning of the project in order to start the implementation phase in good condition.

Project Architecture and My Approach towards it

As mentioned on the website, my aim is to perform three tasks:

a) Develop a parallel numerical integrator to run on the GPU: this requires solving ODEs using the Runge-Kutta method (with a fixed step size).

b) The equations governing the spacecraft's dynamics: these also contain equations that have to be parallelized (the Hill-Clohessy-Wiltshire equations).

c) Neural-network representation of the spacecraft control: its equations cannot themselves be parallelized, but for the training of the neural network and the optimization these calculations need to be done many times, and that can be done in parallel on the GPU using CUDA.

Then combine the results at the end.

The method would be to allocate memory on the GPU (cudaMalloc) and then transfer the contents from main memory to GPU memory (cudaMemcpy) using the built-in CUDA functions. Then identify the parallelism in the method and implement it accordingly (inside the kernel function); it could be task-parallel or data-parallel. Finally, gather the output in GPU global memory and copy it back to CPU main memory.

  1. We know that each thread has its own local storage. If a thread needs particular data repeatedly, I will copy that data to the thread's own local storage. This will reduce latency.
  2. Each block of threads has a shared memory. If a block of threads uses particular data repeatedly, I will copy it to the block's shared memory, which again reduces latency. Further, I will apply the other optimization methods from the CUDA manuals so that the code runs faster.
  3. Allocating blocks and threads in such a manner that the maximum potential of the GPU can be exploited.

Proposed Timelines

0

May 9 to May 23 - Interact with mentor and create class-level design.
May 24 to June 05 - Implement functions relevant to the neural network in CUDA.
June 06 to June 12 - Test the neural network and work with mentor to make necessary changes.
June 13 to June 26 - Implement the numerical integrator using Runge-Kutta methods for ODEs in CUDA.
June 27 to July 3 - Test the integrator and receive feedback from mentor for changes and improvements.
July 4 to July 17 - Parallelize the objective function and the dynamics governing the spacecraft.
July 18 to July 24 - Test the above implementation and work with mentor to make necessary changes.
July 25 to Aug 2 - Integrate all the individual components and interface with the "pagmo::problem" class.
August 3 to Aug 8 - Work with mentor to test and make final changes.
August 9 - Submit final source code and documentation.

1

### PHASE 0: From now till May 23rd, 2010 ###

  •    Identify how to make the Runge-Kutta ODE solver parallel and confirm the approach with the mentor.

  •    Look at the part of the code that calls the ODE solver function (using the Boost libraries).

  •    Then write an algorithm for the parallel ODE solver, to be implemented later in CUDA (identify the allocation of blocks and threads).

  •    Read up on the necessary material on orbital mechanics (the HCW equations; <https://netfiles.uiuc.edu/prussing/www/OM.html>, chapter 8). Understand what algorithm is used. Design the parallel algorithm that would parallelize the equations in order to make the program run faster.

  •    Now for the third part, look at the neural-network representation part of the code. Read the necessary material on the internet. Go through the code and identify the parts that have to be performed repeatedly.

### PHASE 1: From 24th May – 10th June (2.5 weeks) ###

AIM: {Write and run the CUDA code for ODE solving} {Optimize the CUDA code for Runge-Kutta (I have mentioned techniques for the same on page 6)}

Methodology in chronological order:

  •    Go through the parallel Runge-Kutta method. Develop the proper algorithm.

  •    Complete writing the code (CUDA). Make sure there is proper thread synchronization and that no race condition is created. (Confirm with mentor.)

  •    Try to run the code on a machine with a CUDA-enabled card.

  •    Fix compilation/run-time errors, if any (check syntax, included libraries, and proper allocation and freeing of memory).

  •    Feed dummy input and check the output; check whether it gives correct answers. Check the same with different test cases and compare results. (In case of errors, check the synchronization of threads and look for race conditions.)

  •    Also check whether emulation of the code on the CPU gives the same, correct results (using the emulation flag of the nvcc command).

  •    Once the results are correct, check the speed-up. Then optimize the code one method at a time, checking the speed-up after each step.

  •    After optimization, again use all test cases to check whether the code gives correct results. Test the CPU emulation results as well.
    

### PHASE 2: From 11th June – 6th July (3.5 weeks) ###

AIM: {Write and run the CUDA code for the equations governing the spacecraft dynamics (the Hill-Clohessy-Wiltshire equations)} {Optimize the CUDA code for the same (I have mentioned techniques for the same on page 6)}

Methodology in chronological order:

  •    Identify the allocation of blocks and threads. (How many blocks, and how many threads in each block?)

  •    Complete writing the code (CUDA). Make sure there is proper thread synchronization and that no race condition is created. (Confirm with mentor.)

  •    Try to run the code on a machine with a CUDA-enabled card.

  •    Fix compilation/run-time errors, if any (check syntax, included libraries, and proper allocation and freeing of memory).

  •    Consult with the mentor to check whether the program is correct and ask how to perform tests on it. Then perform the tests.

  •    Also check whether emulation of the code on the CPU gives the same, correct results (using the emulation flag of the nvcc command).

  •    Once the results are correct, check the speed-up. Then optimize the code one method at a time, checking the speed-up after each step.

  •    After optimization, again use all test cases to check whether the code gives correct results. Test the CPU emulation results as well.
    

Mid-term submission: check all the work done again and prepare for submission (2 days).

### PHASE 3: From 9th July – 9th August (1 month) ###

AIM: {Write and run the CUDA code for the neural-network representation of the spacecraft control} {Optimize the CUDA code for the same (I have mentioned techniques for the same on page 6)}

Methodology in chronological order:

  •    Make sure there is proper thread synchronization and that no race condition is created. (Confirm with mentor.)

  •    Try to run the code on a machine with a CUDA-enabled card.

  •    Fix compilation/run-time errors, if any (check syntax, included libraries, and proper allocation and freeing of memory).

  •    Consult with the mentor to check whether the program is correct and ask how to perform tests on it. Then perform the tests.

  •    Also check whether emulation of the code on the CPU gives the same, correct results (using the emulation flag of the nvcc command).

  •    Once the results are correct, check the speed-up. Then optimize the code one method at a time, checking the speed-up after each step.

  •    After optimization, again use all test cases to check whether the code gives correct results. Test the CPU emulation results as well.
    

After 9th August till 16th August: test all the code and start documenting.

3

The problem as I see it is to perform many iterations of a simulated docking sequence controlled by a neural-network-based controller. The first part of this is to calculate the trajectory of the spacecraft from a known position towards its target. We would do this by integrating the equations of motion of the spacecraft numerically; candidate integrators are Runge-Kutta, modified Euler, Gaussian quadrature and Newton-Cotes. The second part of the problem is to evolve a neural-network controller to control the spacecraft. The real work in this project will be implementing the trajectory and learning algorithms in the CUDA programming language so they can be run on GPU hardware, and integrating this with PaGMO. I have been playing with CUDA in my spare time using my home machine's graphics card and am in the early stages of implementing algorithms from my multi-agent work.

4

  1. Project: in this section I present the specific details of the current project. Tasks: I intend to complete the following tasks during GSoC 2010.
    1. Understand the design and structure of PaGMO.
    2. Come-up with a parallelize-able design for numerical integrators and neural networks to fit with CUDA.
    3. Implement and test these functions on a CUDA supported graphics card. In parallel, design and implement an interface into PaGMO for these functions.
    4. Integrate CUDA implementations through the interface into PaGMO.
    5. Provide brief documentation for the CUDA-based design of these functions as well as the interface.

Road Map

I would propose the following time-line of deliverables for the current project.

Time-line Projected Deliverables
May 05 - May 20 Exploring PaGMO and familiarizing myself with the code
May 20 - July 10 Porting numerical integrators and neural networks to CUDA and designing an interface for these into PaGMO
July 16 - Midterm Evaluation Deadline Deliverable: CUDA code and design of the interface
July 16 - August 09 Testing the CUDA code and integrating it through the interface into PaGMO
August 09 - August 16 Documenting the CUDA-based designs and the interface
August 20 - Final Evaluation Deadline Deliverable: Cleaned-up code and design documents

Future Plans

I would like to explore the possibility of publishing the optimization of these algorithms at relevant conferences.

5

The proposed project aims at reducing these bottlenecks by using CUDA to implement these tasks in a parallelized fashion on the GPGPU. The Runge-Kutta method can also be parallelized, and many authors have proposed methods to achieve that. Therefore, this parallelization is the natural next step in optimizing the code for this problem.

6

The docking problem is an important process in space missions involving more than one spacecraft. The docking process involves a series of manoeuvres and controlled trajectories, which successively bring the active vehicle (chaser) into the vicinity of, and eventually into contact with, the passive vehicle (target). The guidance, navigation and control (GNC) system of the chaser controls the vehicle state parameters required for entry into the docking interfaces of the target vehicle and for capture. More information can be found in "Automated Rendezvous and Docking of Spacecraft, Wigbert Fehse (2003)". The task is to reimplement and optimize in CUDA the components needed for the docking problem and integrate them into the PaGMO system. The components include:

  • a numerical integrator,
  • the equations governing the spacecraft's dynamics,
  • a neural network representation of the spacecraft control,
  • and the interface to a pagmo::problem class representing the overall optimisation problem.

Project Plan:

  1. During the bonding period (besides the steps mentioned here):
    • Understanding the ANN model, the fitness evaluation and the PaGMO problem interface.
    • Identifying parallelizable parts of the code and designing the possible parallel implementation.
    • Beginning the implementation and testing phase.
  2. Before the midterm evaluation:
    • Finishing the implementation and testing phase.
    • Integrating the implementation into PaGMO.
  3. After the midterm evaluation:
    • Optimizing the code.
    • Benchmarking the performance of the new version against the initial implementation and the serial version.
    • Preparing documentation.

Deliverables:

  1. Optimized and tested CUDA implementation of the required problem.
  2. Detailed benchmarking of the performance of the CUDA implementation vs. the serial C++ implementation.
  3. Documentation of the problem, the code and the implementation.

7

  • Read the code and conduct experiments on a stripped-down version of it along with CUDA.
  • Talk with developers on the mailing lists of NVIDIA as well as the ACT.

  • Implement and deploy CUDA kernels within the existing neuro-controller framework:
    • Design proper data structures and a strategic distribution of data between main memory and GPU memory.
    • Write CUDA kernels with proper parameters to be called from the main program.
    • Write calls to the CUDA kernels in the main program with proper block dimensions.
    • The setup should be able to perform simulations and output an optimal neural-network representation of the spacecraft control. The CUDA modules should be used properly, without exceptions or errors.
  • Optimize the CUDA kernels developed to provide maximum speed-up to the neuro-controller:
    • Redistribute data and optimize GPU data access.
    • Use the NVIDIA CUDA Profiler to achieve maximum GPU throughput.
    • Optimize GPU register and memory usage.
    • Attain a satisfactory speed-up, as desired by the ACT team and the mentors.
  • Test the final developed modules and make sure that the application is bug-free:
    • Run simulations and test cases to test the application.
    • Request ACT members to check the application for loopholes and other bugs, and ask them to provide further suggestions.
    • Incorporate ACT suggestions and resolve issues. The setup should be stable under standard running conditions and free of any CUDA exceptions.
  • Write proper documentation for the code implemented and the CUDA kernels developed:
    • Get the documentation read by the mentors to catch any errors or ambiguities.
    • Produce proper documentation that clearly defines the project objectives and how the solution is implemented.

Kashif

The requirements of the project are tentatively as follows:

Requirement Description
OpenCL integration to PaGMO OpenCL will be added to the PaGMO project as an optional add-on. The intent of having OpenCL is to allow a wide host of computational devices to be available for PaGMO instead of just the ones specific to a vendor and to support an open standard. OpenCL will be usable in the context of a problem so that computationally intensive parts can take advantage of parallelism.
Ease of use and flexibility It will need to be easy to use so that code writers won't have to deal with managing OpenCL inside the problems unless they want to.
might be a problem--Juxi 10:29, 28 May 2010 (UTC)
OpenCL toolkit A toolkit will be developed to make it easy for code writers to write and use OpenCL kernels and to manage OpenCL entities quickly and easily.
OpenCL for Neural network Toolkit The neural network toolkit will use OpenCL (in the form of the toolkit).
How are you planning that? --Juxi 10:29, 28 May 2010 (UTC)
OpenCL for numerical integrator, dynamical system and fitness function OpenCL will be usable in the context of a problem so that computationally intensive parts can take advantage of parallelism.
Docking problem implementation A sample set of docking problems will be changed to use OpenCL.
Performance logging A log of the times taken by the various activities will be produced.

Proposed design

OpenCL Toolkit

This will contain the following:

Part Purpose
Preprocessor macros This will provide a set of macros which will make it easier to write OpenCL kernels.
Auxiliary kernels This will be a set of kernels which are generic enough to be reusable yet complex enough to not fit in as macros.
ideas for 'generic enough' kernels --Juxi 10:29, 28 May 2010 (UTC)
OpenCL management functions A set of classes/functions to manage OpenCL contexts, workgroups, queues, kernels and also memory.
Integration with GOClasses The toolkit will be integrated with PaGMO::problem initially.
you mean with all pagmo::problems? that is rather hard I would say --Juxi 10:29, 28 May 2010 (UTC)

Docking problem implementation

The docking problem will use the OpenCL toolkit that was mentioned earlier. It will use the different levels of OpenCL constructs as follows:

Level Description
Context We will assume one GPU for the current implementation.
Workspace A number of workspaces will be instantiated for individual purposes; neural network training, ODE/Integrator evaluation etc. These may be loosely related to individual execution of the docking problem. Sequential steps of each execution will share data via buffers to avoid main memory access latency.
Work item Depending on the size of the neural network and the requirements of the Dynamical system's simulation, a number of work items will be used for each task. The kernels will use the OpenCL toolkit.

Essentially this means that the artificial neural network will employ a number of work items (depending on the number of neurons layers? --Juxi 10:29, 28 May 2010 (UTC)) to compute outputs and weights. These outputs will be fed into the trajectory simulation, which will compute the expected outputs and then evaluate its fitness. For example, we could have the following configuration. Comments on the picture:

  • why is there no link between the dyn system and the ANN? (the ANN inputs come from the system not from the fitness!) --Juxi 10:29, 28 May 2010 (UTC)
  • why is there a link between evaluate fitness and ANN (towards ann!) --Juxi 10:29, 28 May 2010 (UTC)
  • what is the reason for the workspace not including the ANN?! --Juxi 10:29, 28 May 2010 (UTC)

Proposed Timeline

Period Activity
26th April to 24th May Integration of NVIDIA OpenCL into PaGMO
Review of requirements
Familiarization with docking problem
Evaluation of various implementation strategies
24th May to 16th July Implementation
Performance Analysis (at the end of every week)
16th July to 19th August Final touches and bug fixes
Performance Analysis (at the end of every 3 days)