forked from ROCm/ROCm.github.io
-
Notifications
You must be signed in to change notification settings - Fork 0
/
rocncloc.html
218 lines (151 loc) · 14.2 KB
/
rocncloc.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
<!DOCTYPE html>
<html>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-83146017-1', 'auto');
ga('send', 'pageview');
</script>
<head>
<meta charset='utf-8'>
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<meta name="description" content="ROCm, A New Era in Open GPU Computing : Platform for GPU Enabled HPC and UltraScale Computing ">
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>ROCm, A New Era in Open GPU Computing</title>
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/RadeonOpenCompute">View on GitHub</a>
<img class="wrap" src="images/ROCm_Logo_128.png" alt="ROCm_Logo" />
<h2 id="project_title">ROCm, A New Era in Open GPU Computing</h2>
<h3 id="project_tagline">Platform for GPU Enabled HPC and UltraScale Computing </h3>
</header>
</div>
<div id="nav">
<div id="nbar">
<ul>
<li><a href="index.html">Overview</a></li>
<li><a href="install.html">Install</a></li>
<li><a href="languages.html">Languages</a></li>
<li><a href="http://rocm-documentation.readthedocs.io/en/latest/index.html">Documentation</a></li>
<li><a href="packages.html">ROCm Solutions </a></li>
<li><a href="tutorials.html">Tutorials</a></li>
<li><a href="hardware.html">ROCm Hardware</a></li>
<li><a href="blog.html">ROCm BLOG</a></li>
</ul>
</div>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<h2>ROCm With Harmony: Combining OpenCL, HCC, and HSA in a Single Program</h2>
<h3>Introduction</h3>
<p>In a <a href="http:gpuopen.com/rocm-do-you-speaka-my-language/">previous blog</a> we discussed the different languages available on the ROCm platform. <a href="https://github.com/RadeonOpenCompute/HCC-Example-Application/tree/master/BitonicSort-CL-from-HCC" target="_blank">Here </a>we’ll show you how to combine several of these languages in a single program:</p>
<ul>
<li>We’ll use an offline OpenCL™ compiler to compile the “BitonicSort” OpenCL kernel (from the AMD APP SDK) into a standard HSA code object (“hsaco”) format.</li>
<li>The host code will employ HCC’s hc dialect for device discovery (ie <code>hc::accelerator</code> and <code>hc::accelerator_view</code>) and memory management (<code>hc::array</code>)</li>
<li>The actual dispatch will use the low-level HSA Runtime calls. Recall that ROCR is an implementation of the HSA Runtime with extensions for multi-GPU configurations. We’ll show you how to extract HSA queue and agent structures from the HCC C++ ones, and then use them to perform the kernel launch.</li>
</ul>
<p>There are several reasons you might want to do something along these lines. First, many kernels exist in OpenCL and re-using this existing investment can save time. The OpenCL kernel language is widely-used, and it enables programmers to use advanced GPU features including local memory, rich math functions, and vector operations. But the OpenCL runtime can be verbose and the memory interface can be difficult to control and optimize. HCC provides the advantage of a full C++ runtime but also full control over the memory allocation and copies. Using the techniques we'll show you here, you can employ OpenCL kernels without having to port the host runtime code to OpenCL. This approach offers a significant advantage for larger C++ programs that can use a few optimized OpenCL kernels while sticking with C++ kernels and features for the rest of the program.</p>
<h2>hsaco : The Common Currency</h2>
Hsaco is informally pronounced “sock-o” (with a slight emphasis on the first letter to reflect the otherwise silent “h”). It's a standard ELF file ; <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format" target="_blank">ELF</a> (“Executable and Linkable Format”) is a container format widely used in Linux to store object code, and the hsaco ELF container organization matches the one generated by the popular LLVM tool chain. Hsaco stores the compiled GCN code in the .text section, it optionally contains debug information, and it defines symbols that allow the host code to find the kernel entrypoints and functions. Like other ELF files, code objects can contain multiple kernels, functions, and data – so when using hsaco you will need to specify both the code object and the desired symbol. Refer to the <a href="https://github.com/RadeonOpenCompute/ROCm-Docs" target="_blank">detailed description</a> of the hsaco format for more information.
Many tools in AMD’s compiler chain generate and use the hsaco format including OpenCL, HCC, HIP, the GCN assembler and the HSAIL Finalizer. Kernel code contained in hsaco can be extracted and then launched onto the GPU. Additionally, the <a href="https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra" target="_blank">dissembler tool </a>can disassemble hsaco files so you can see what is going on inside the kernel. In a future blog, we’ll talk about using the same techniques described here to assemble and then launch kernels written in GCN assembly. Essentially, hsaco is the interchange format used to pass code between these different tools, and allows code written in different languages to be used together.
<h3>Compiling an OpenCL Kernel into hsaco</h3>
The Makefile shows the usage of the <a href="https://github.com/HSAFoundation/CLOC" target="_blank">CLOC </a>(CL Offline Compiler) tool to compile the CL kernel into the hsaco file. Here’s the relevant call to CLOC:
<code>/opt/rocm/cloc/bin/cloc.sh BitonicSort_Kernels.cl -o BitonicSort_Kernels.hsaco</code>
<h3>Using hsaco:</h3>
This example shows two methods for accessing the hsaco data from the host application :
<ul>
<li>Use a separate file and load it using C++ file I/O code. See the load_hsa_from_file() command. This path is enabled when p_loadKernelFromFile=true.</li>
<li>Serialize the code into a global string and thus directly link the hsaco into the executable. This approach avoids the need to find the hsaco file at runtime. This path is enabled when p_loadKernelFromFile=false.</li>
</ul>
The “load_hsa_code_object” shows the use of the standard HSA Runtime API calls to load the code object into memory and extract the pointer to the BitonicSort kernel. If we were working with an HSAIL or BRIG kernel we would first call the finalizer which would produce hsaco data, and the use these exact same finalizer APIs to load the hsaco into memory and find the desired symbols. This is a powerful and extremely useful concept that allows applications using the HSA Runtime to support either:
<ul>
<li>An industry standard portable intermediate language (HSAIL/BRIG) that can be finalized to a vendor-specific binary, or</li>
<li>A standard ELF container that stores vendor-specific binary code (hsaco). This flavor supports vendor-specific ISA inside a standard container format, and still benefits from the standard HSA runtime API. Effectively this enables use cases where apps and tools can use the HSA Runtime APIs without using HSAIL, and still retain source code portability.</li>
</ul>
The picture below shows the different steps in the code loading process, and in particular the clean separation between the pre-finalization (green) and post-finalization (yellow) steps.
<a href="http://gpuopen.com/wp-content/uploads/2016/06/finalizer-executable-hsail-flow.png" target="_blank"><img class="alignnone size-medium wp-image-3413" src="http://gpuopen.com/wp-content/uploads/2016/06/finalizer-executable-hsail-flow-300x255.png" alt="finalizer executable hsail flow" width="300" height="255" /></a>
<h2>Making HCC Sing</h2>
The example uses the <a href="http://gpuopen.com/rocm-do-you-speaka-my-language/" target="_blank">hc C++ dialect</a> to select the default accelerator and queue. To launch the hsaco file we’ve created, we need to make HCC reveal the details of the HSA data structure that live under the covers. Here’s the critical piece of code that shows how to get from the HCC world to the HSA world using “hc::accelerator_view::get_hsa_queue”:
<pre>
//_acc is type hc::accelerator.
// Select default queue
hc::accelerator_view av = _acc.get_default_view();
// Extract the HSA queue from the accelerator view:
hsa_queue_t *hsaQueue = static_cast<hsa_queue_t*> (av.get_hsa_queue());
</pre>
<h3></h3>
Now that we have an HSA queue we can use the low-level HSA runtime API to enqueue the kernel for execution on the GPU. The code creates an "AQL" packet, uses the hsa runtime APIs (such as hsa_queue_store_write_index_relaxed) to place the packet into the queue and make it visible to the GPU for execution. More details in the code.
This capability is a quite useful since we can now mix HCC kernels (submitted with parallel_for_each) with kernels in hsaco format (from OpenCL kernels, or assembly, or other sources) in the same application or even in the same queue. For example, libraries can benefit from this architecture : the library interface can be based on HCC structures (accelerator, accelerator_view, completion_future) while the implementation uses HSA Runtime and hsacos.
<h3>Extracting Data Pointers</h3>
The example under discussion uses <code>hc::array<></code>to store the array of integers that are sorted. The original OpenCL kernel of course knows nothing of the <code></code>hc::array<></code> data-type. Here’s the OpenCL kernel signature:
<pre>
__kernel
void bitonicSort(__global uint * theArray, const uint stage, const uint passOfStage, const uint direction)
</pre>
When calling this kernel, the first parameter (theArray) is an 8-byte pointer. Fortunately the hc syntax defines an API that allows us to retrieve this pointer on the host side so we can later pass it to the kernel in the expected position:
<pre>
_inputAccPtr = _inputArray->;accelerator_pointer();
</pre>
<p>Our application is still responsible for ensuring that the data at this pointer is valid on the accelerator, before calling the kernel. In this case, the application copies from host data (allocated with malloc) to the inputArray.</p>
<p>The code also shows the use of hc’s accelerator memory interface to allocate and copy the data. This is an alternative to using <code>hc::array<></code>, and can be select by setting p_useHcArray=false in the top of the source code. Here’s the relevant code snippet:</p>
<pre> // Alternative allocation technique using am_alloc
_inputAccPtr = hc::am_alloc(sizeBytes, _acc, 0);
hc::am_copy(_inputAccPtr, _input, sizeBytes);
</pre>
We do not recommended usinge <code>hc::array_view<></code> with the direct hsaco code launching techniques we are discussing here. <code>hc::array_view<></code> is designed to automatically synchronize the data before and after parallel_for_each blocks are launched. Direct launching with HSA runtime APIs will not automatically synchronize <code>hc::array_view<></code>.
<p>Finally, HCC provides accessors that allow easy retrieval of the the HSA “regions” associated with an accelerator. The HSA runtime API uses regions to specify where memory on an agent is located - for example coarse-grain device memory or fine-grain system memory. When enumerating accelerators, HCC scans the supported regions for each underlying HSA agent and provides the following accessors:</p>
<pre>
void* get_hsa_am_region();// Accelerator-memory region. On discrete GPUs its the device memory ; on APUs its shared host memory
void* get_hsa_am_system_region() // Pinned or registered host memory accessible to this accelerator
void* get_hsa_kernarg_region() // Memory for kernel arguments.
</pre>
<p>This example uses <strong>get_hsa_kernarg_region()</strong> to allocate memory for the kernel arguments passed to the BitonicSort kernel. Kernarg memory is typically written by the host CPU and read by the accelerator executing the kernel. The example defines a host-side structure to describe the layout of the arguments expected by the kernel, and then typecasts the pointer returned by the kernarg pointer.</p>
<pre>
/*
* This is the host-side representation of the kernel arguments expected by the BitonicSort kernel.
* The format and alignment for this structure must exactly match the kernel signature defined in the kernel
*/
struct BitonicSort_args_t {
uint32_t * theArray;
uint32_t stage;
uint32_t passOfStage;
uint32_t direction;
} ;
/*
* Allocate the kernel argument buffer from the correct region.
*/
BitonicSort_args_t * args = NULL;
hsa_region_t kernarg_region = *(static_cast<hsa_region_t*> (_acc.get_hsa_kernarg_region()));
hsa_status = hsa_memory_allocate(kernarg_region, sizeof(args), (void**)(&args));
aql.kernarg_address = args;
assert(HSA_STATUS_SUCCESS == hsa_status);
/*
* Write the args directly into the kernargs buffer:
*/
args->theArray = _inputAccPtr;
args->stage = 0;
args->passOfStage = 0;
args->direction = _sortIncreasing;
</pre>
<h2>Summary</h2>
<p>We learned how to use offline compilation to convert an OpenCL kernel into a standard hsaco file and then employed the HSA Runtime API to launch that kernel from an HCC program. Harmony! In the future we'll look at how to optimize the HSA Runtime calls, and also how to use other tools to create hsaco files (such as the AMDGCN assembler). Stay tuned.</p>
<h2>Reference:</h2>
<a href="https://github.com/RadeonOpenCompute/HCC-Example-Application/tree/master/BitonicSort-CL-from-HCC" target="_blank">GitHub Code for this example </a>
https://en.wikipedia.org/wiki/Bitonic_sorter
<!-- FOOTER -->
<div id="footer_wrap" class="outer">
<footer class="inner">
<p> © 2016 AMD Corporation <a href="legal.html">Disclaimer and Legal Information</a> </p>
</footer>
</div>
</body>
</html>