Skip to content

CLOC Compiler and Sample SDK

Shreyas Ramalingam edited this page Aug 15, 2014 · 26 revisions

Introduction

CLOC (CL Offline Compiler) is a script that helps developers to easily take advantage of HSA accelerators (like GPUs) by writing the host code (running on CPU device) and kernel (running on HSA accelerator device). To keep things simple, we have taken OpenCL-like language to write the kernel and compile into HSAIL/BRIG. This can be loaded and launched on the device using the host program. This is an effort to encourage developers to write simple applications which can show the power of HSA features on AMD platforms.

HSA Foundation has released the CLOC tool (along with required binaries to build the HSAIL/BRIG files from kernels) along with samples.

Downloading CLOC and related components

  1. Download the CLOC utility from https://github.com/HSAFoundation/CLOC
  2. Download the High Level Compiler components - There are two options.
    1. Open source version - git clone https://github.com/HSAFoundation/HSAIL-HLC-Development
    2. Closed source version - git clone https://github.com/HSAFoundation/HSAIL-HLC-Stable
  3. Download HSAIL-Tools from git clone https://github.com/HSAFoundation/HSAIL-Tools. Follow build instructions to build hsailasm. This tool is required to disassemble BRIG generated by CLOC.

Compiling OpenCL Kernels to BRIG/HSAIL

  1. Setting environment variables for CLOC.
    1. export HSA_LLVM_PATH=HSAIL-HLC-Developement/bin or HSAIL-HLC-Stable/bin
    2. export HSA_LIBHSAIL_PATH=HSAIL-Tool/libHSAIL/build_linux
  2. Compiling a cl kernel file
    1. If you want to generate a brig file from a cl file - Execute ./cloc kernel.cl
    2. If you want to generate a hsail file from a cl file - Execute ./cloc -hsail kernel.cl

Mapping OpenCL variables to HSAIL kernel arguments

The High Level Compiler generates HSAIL for OpenCL kernels. The HLC can generate code in large (64-bit) or small mode (32-bit). Each OpenCL type is mapped to a HSAIL type. All HSAIL types generated by the HLC are naturally aligned. Below are the general rules for mapping

  1. OpenCL Local/global pointers and size_t are mapped to kernarg_u32 in small mode and kernarg_u64 in large mode.
  2. OpenCL char, integer, floats and doubles are mapped to kernarg_u8, kernarg_u32, kernarg_f32 and kernarg_f64 respectively.
  3. OpenCL Vectors - This depends on whether HSAIL_HLC_Stable or HSAIL_HLC_Development
    1. HSAIL-HLC-Stable - Vectors are flattened. A float16 is flattened out to 16 different HSAIL arguments of type kernarg32.
    2. HSAIL-HLC-Development - Vectors are converted to HSAIL Vectors - A float 16 is converted to kernarg_f32 arg[16].
  4. OpenCL Images are converted to HSAIL types Kernarg_rwimg or kernarg_roimg. The size of the image is 48 bytes and the alignment in the kernel argument buffer is 16 bytes.
  5. OpenCL Samplers are converted to Kernarg_samp. The size of the sampler is 32 bytes and alignment in the kernel argument buffer is 32 bytes.

Limitations

Note that certain features of OpenCL such as handling images, global offsets, global variables, pipes, printf require OpenCL runtime .

Extra arguments in HSAIL_HLC_Stable

1. Dummy Arguments : The OpenCL HLC/runtime handles global offsets and other features by always adding six additional arguments at the beginning of the argument list. These size additional arguments are of type size_t and are set by the OpenCL runtime. The user must set the first three arguments to zero if he/she does not want to use global offsets in the calculation.

2. Vector Flattening : Vectors such as float4 are flattened i.e they are broken down into 4 different arguments. Hence, the user must take care to pass 4 different arguments

CLOC Examples:

The CLOC/examples/hsa directory contain examples that use CLOC + HSA runtime to dispatch kernels to the GPU. The CLOC/examples/okra contain examples the use CLOC + OKRA runtime to dispatch kernels. To build and execute the examples, you need to download the following github repositories

Set the environment variables

  • HSA_RUNTIME_PATH= Path to HSA-Runtime-AMD
  • HSA_KMT_PATH= Path to HSA-Drivers-Linux-AMD/kfd-0.8/libhsakmt/
  • HSA_OKRA_PATH= Path to Okra-Interface-to-HSA-Device/okra/

Building HSA example

  • cd CLOC/examples/hsa && make all && make test

Building OKRA examples

  • export OKRA_DISABLE_FIX_HSAIL=1
  • cd CLOC/examples/okra && make all && make test

Note : If the HSAIL_HLC_Stable is used for the higher level compiler. Run "make all CFLAGS=-DDUMMY_ARGS=1" instead of "make all".

An OKRA Example:

Suppose you wish to write an HSA program using CLOC and host CPP program to compute sum of two vectors of numbers. This operation is inherently parallel: the addition of corresponding vector elements can be performed in parallel by individual GPU threads. This is a classic case in which the power of GPU compute can be utilized.

The kernel code:

The first step is to write the kernel using OpenCL. This would be:

kernel void test(global int *a, global int *b, global int *sum) { int id = get_global_id(0); sum[id] = a[id] + b[id]; }

As we can see the above kernel just adds the elements of two input vectors and puts the sum into another vector. The host would create these three vectors and pass the pointers into this kernel. This kernel computes the sum of the vectors.

We can use the CLOC utility on this kernel to create the HSAIL/BRIG format.

amd@msdnkv69:~/CLOC/CLOC/example/vector_copy$ ./cloc -p ~/Prakash/BenTest/bin/D2/ -hsail vector_copy.cl
Info: Version: 0.6
Info: OpenCL file: /home/amd/CLOC/CLOC/example/vector_copy/vector_copy.cl
Info: Output file: /home/amd/CLOC/CLOC/example/vector_copy/vector_copy.hsail
Info: Run date: Wed Jul 30 16:29:48 IST 2014
Info: Compile(clc) cl --> bc ...
Info: Disassmbl(llvm-as) bc --> ll ...
Info: Link(llvm-link) bc --> lnkd.bc ...
Info: Optimize(opt) lnkd.bc --> opt.bc ...
Info: llc arch=hsail opt.bc --> brig ...
Info: hsailasm brig --> vector_copy.hsail ...
Info: Done

Now, we have the HSAIL program for this kernel, which can be used in the host side.

Host Program:

Let us look at the host side of this program. We will show this example by using the OKRA interface to HSAIL RT. The OKRA APIs is just a layer over HSAIL RT APIs for simpler usage. The host program has to do the following:

  1. Create the OKRA context giving

     OkraContext *context = OkraContext::Create();
     if (context == NULL) {cout << "...unable to create context\n"; exit(-1);}
    
  2. Create the "kernel" from the HSAIL source that we generated in the previous section by

     string sourceFileName = "vector_copy.hsail";
     char* vcopySource = buildStringFromSourceFile(sourceFileName);
    
     OkraContext::Kernel *kernel = context->createKernel(vcopySource, "&run");
     if (kernel == NULL) {cout << "...unable to create kernel\n"; exit(-1);}
    
  3. Create and register the vectors that are to be used for computing the vector additions:

     context->registerArrayMemory(vecA, NUMELEMENTS * sizeof(float));       
     context->registerArrayMemory(vecB,  NUMELEMENTS * sizeof(float));
     context->registerArrayMemory(vecS,  NUMELEMENTS * sizeof(float));  
    
  4. Set the arguments to the kernel. We need to set the pointers of three arrays in the same order as in the kernel: kernel->clearArgs(); kernel->pushPointerArg(vecA); kernel->pushPointerArg(vecB); kernel->pushPointerArg(vecS);

  5. Set the "NDRange" of the kernel using global and local dimensions of the kernel execution

     size_t globalDims[] = {NUMELEMENTS};  
     size_t localDims[] = {NUMELEMENTS};
     kernel->setLaunchAttributes(1, globalDims, localDims);  // 1 dimension
    
  6. Dispatch the kernel synchronously to the device and wait for the device to complete execution

        kernel->dispatchKernelWaitComplete();
    
  7. Finally, check for the results from the kernel execution

The complete C++ program for this sample can be seen in the SDK supplied along with this CLOC utility.

Conclusions :

This CLOC utility just illustrates the usage of HSA programs using OpenCL-like language to write the kernels. It also demonstrates that HSA is a standalone stack and does not require the runtime of any other language runtime (such as OpenCL runtime). One only needs to compile the kernel source into an HSAIL program. If one can write the HSAIL program directly, then one does not even need the device compiler.