
Goal of compyte

Make a common GPU ndarray (an n-dimensional array, i.e. a matrix/tensor) that can be reused by all projects.

Related pages: Mailing list, Comparison of existing implementation, Branch

Motivation

  • Currently there are at least six different GPU array types in Python
    • CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat), GPUArray (PyOpenCL), Clyther, Copperhead, ...
    • There are even more if we include other languages.
  • They are incompatible
    • No two share the same properties and interface
  • Each of them implements a subset of numpy.ndarray on the GPU!

Lack of a Standard Creates Problems

  • Duplicates work
    • GPU code is harder and slower to get both correct and fast than CPU/Python code
  • Harder to port/reuse code
  • Harder to find/distribute code
  • Divides development work

Pitfalls to Avoid

  • Starting alone
    • We need different people/groups to "adopt" the new GpuNdArray
  • Too simple - other projects won't adopt
  • Too general - other projects will implement "light" versions... and not adopt
    • Having an easy way to check conditions and convert arrays, as numpy offers, could alleviate this.

The preferred option is a general version with easy checks/conversions, so that a project can support only a subset!
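
As a concrete illustration of the check/conversion idea, here is a minimal Python sketch. The helper name as_c_contiguous is hypothetical, and it assumes the GPU array exposes a numpy-like flags mapping and copy() method; it mirrors what numpy.ascontiguousarray does:

```python
def as_c_contiguous(arr):
    """Return arr as-is if it is already C-contiguous, else a contiguous copy.

    Hypothetical helper mirroring numpy.ascontiguousarray(): assumes the
    GPU array exposes a numpy-like flags mapping and a copy() method that
    produces a C-contiguous array.
    """
    if arr.flags['C_CONTIGUOUS']:
        return arr
    return arr.copy()
```

A project that only handles contiguous data can then accept any GpuArray and normalize it up front, instead of rejecting the general case.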

Design Goals

  • Make it VERY similar to numpy.ndarray (see the sketch after this list)
    • Easier to attract people from the Python community
  • Have the base object in C to allow collaboration with more projects.
    • We want people from C, C++, Ruby, R, ... to all use the same base GPU ndarray.
  • Be compatible with CUDA and OpenCL
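
A short sketch of the intended numpy-like feel. The GPU lines are commented out because the module and function names (gpu_ndarray, asarray) are our illustrative assumptions, not the actual API:

```python
import numpy

a = numpy.arange(12, dtype='float32').reshape(3, 4)

# Hypothetical GPU equivalent with the same shape/dtype/indexing semantics:
# g = gpu_ndarray.asarray(a)       # upload a numpy array (name assumed)
# assert g.shape == (3, 4) and str(g.dtype) == 'float32'
# b = numpy.asarray(g[1:3])        # slice on the GPU, download to verify
# assert (b == a[1:3]).all()
```

Code written against numpy.ndarray should port with minimal changes.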

Current behavior we want to avoid

  • No CPU code should be generated from the Python interface (as currently happens with PyOpenCL and PyCUDA); generating GPU code is OK.

Implementation plan

All of the basic C code is done. We are currently working on elementwise functionality in preparation for PyOpenCL/PyCUDA integration.

Sketch of the file structure and the reasoning behind it

This section details the file structure and gives you a hint of what to expect if you intend to ship a project integrating this code. Note that this applies to the code in the reorg branch, which will become the mainline soon. It is located here: http://github.com/abergeron/compyte/tree/reorg

Some of these files are not in the repository yet; the corresponding functionality is still being worked on.

The main files are:

  • ndarray/compyte_buffer.h:
    • Defines the base compyte_buffer object
    • Also defines the structures for GpuArray and GpuKernel
  • ndarray/compyte_buffer_cuda.c:
    • Implements the CUDA version of the compyte_buffer API
  • ndarray/compyte_buffer_opencl.c:
    • Implements the OpenCL version of the compyte_buffer API
  • ndarray/pygpu_ndarray.pyx:
    • Defines a Cython wrapper that exposes the GpuArray object and a couple of functions to mimic the interface of numpy.ndarray
  • elemwise.py:
    • Supports running arbitrary elementwise kernels on GpuArrays of arbitrary memory layout (Python-only); see the sketch after this list.
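
For reference, here is the analogous pattern in PyCUDA, whose ElemwiseKernel API is a plausible model for elemwise.py (this is an assumption on our part; the compyte API may differ). Per the description above, elemwise.py additionally aims to handle arbitrary memory layouts:

```python
import numpy
import pycuda.autoinit                    # creates a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElemwiseKernel

# Generate and compile a CUDA kernel computing z[i] = x[i] + y[i]
# over every element of the input arrays.
add = ElemwiseKernel(
    "float *z, float *x, float *y",
    "z[i] = x[i] + y[i]",
    "add_kernel")

x = gpuarray.to_gpu(numpy.random.rand(16).astype(numpy.float32))
y = gpuarray.to_gpu(numpy.random.rand(16).astype(numpy.float32))
z = gpuarray.empty_like(x)
add(z, x, y)                              # launch over all 16 elements
```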

These files serve as support for the functionality above:

  • ndarray/compyte_types.{c,h}:
    • generated by ndarray/gen_types.py
    • serve as a type table for operations that need information about the types involved
  • ndarray/compyte_util.{c,h}:
    • some generally useful functions that don't really fit anywhere else.
  • ndarray/setup.py:
    • Builds the python module implemented in pygpu_ndarray.pyx along with all the supporting code

These files provide portability support (mainly for Windows):

  • ndarray/compyte_compat.h
  • ndarray/compyte_mkstemp.c
  • ndarray/compyte_strl.c
  • ndarray/wincompat/*

Some tests for the Python interface (that also exercise the underlying C code):

  • ndarray/test_gpu_ndarray.py (tests basic functionality: init, copy, indexing, ...)
  • tests/test_elemwise.py (tests that the numpy-like elementwise operations on arrays work correctly)

Some gotchas and differences from numpy

  • We have the updateifcopy flag, like numpy, but it is always False and we expect it to stay False.
  • Buffer offsets (like the one generated when you do a[1:3]) are only partially supported under OpenCL 1.0: you cannot run kernels on such arrays without copying them beforehand.
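
To illustrate the offset gotcha in numpy terms (the GPU lines are commented out, and the copy() behavior is an assumption based on the numpy-like interface above):

```python
import numpy

a = numpy.arange(10, dtype='float32')
view = a[1:3]           # shares a's buffer, starting at a non-zero offset
assert view.base is a   # a view, not an owner of fresh storage

# Under OpenCL 1.0 the GpuArray analogue of `view` cannot be passed to a
# kernel directly; the workaround is an explicit copy into a fresh buffer:
# dense = gpu_view.copy()   # offset 0, safe to use in kernels (assumed API)
```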