<!DOCTYPE html>
<html>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-83146017-1', 'auto');
ga('send', 'pageview');
</script>
<head>
<meta charset='utf-8'>
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<meta name="description" content="ROCm, A New Era in Open GPU Computing : Platform for GPU Enabled HPC and UltraScale Computing ">
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>ROCm, A New Era in Open GPU Computing</title>
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/RadeonOpenCompute">View on GitHub</a>
<img class="wrap" src="images/ROCm_Logo_128.png" alt="ROCm_Logo" />
<h2 id="project_title">ROCm, A New Era in Open GPU Computing</h2>
<h3 id="project_tagline">Platform for GPU Enabled HPC and UltraScale Computing </h3>
</header>
</div>
<div id="nav">
<div id="nbar">
<ul>
<li><a href="index.html">Overview</a></li>
<li><a href="ROCmInstall.md">Install</a></li>
<li><a href="languages.html">Languages</a></li>
<li><a href="http://rocm-documentation.readthedocs.io/en/latest/index.html">Documentation</a></li>
<li><a href="packages.html">ROCm Solutions </a></li>
<li><a href="tutorials.html">Tutorials</a></li>
<li><a href="hardware.html">ROCm Hardware</a></li>
<li><a href="blog.html">ROCm BLOG</a></li>
</ul>
</div>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<h2>ROCm with Rapid Harmony: Optimizing HSA Dispatch</h2>
We <a href="/rocm-with-harmony-combining-opencl-hcc-hsa-in-a-single-program">previously</a> looked at how to launch an OpenCL™ kernel using the HSA runtime, covering the basics of the runtime API. <a href="https://github.com/RadeonOpenCompute/HCC-Example-Application/tree/master/BitonicSort-CL-from-HCC" target="_blank">Here</a> we'll turn up the tempo a bit by optimizing the launch code: moving expensive operations out of the dispatch path and into the setup code, removing host-side synchronization, and trimming the memory fences to the bare minimum required. We'll measure the contribution of each optimization and discuss the results. The code is available at the <a href="https://github.com/RadeonOpenCompute/HCC-Example-Application/tree/master/BitonicSort-CL-from-HCC" target="_blank">same GitHub repository as before</a>, and the optimizations can be enabled with a series of command-line switches.
<h2>Optimizing</h2>
Bitonic sort involves running the same kernel several times. For the default array length of 32768, the algorithm launches 120 kernels. The original OpenCL code, and the associated port used in the example, synchronize with the host after each kernel completes. To improve performance, we can submit all 120 kernels at once and synchronize with the host only after the last one completes.
To make this change, we will need to restructure the BitonicSort::run call as follows:
<ul>
<li>Each kernel still needs to wait for the previous kernel to finish executing. The AQL packet format in the HSA system architecture defines a "barrier" bit which provides exactly this synchronization: packets with the barrier bit set will wait for all preceding kernels in the same queue to complete before beginning their own execution. Barrier-bit synchronization only works for commands in the same queue, but is more efficient than using signals in the cases where it applies. So we'll set the barrier bit on all the kernels to provide the required ordering between them, and therefore only need a completion_signal for the last kernel in the sequence. (All other kernels set the completion_signal to 0, which saves an atomic decrement operation when the command finishes.) This optimization is marked with <strong>p_optPreallocSignal</strong>; the one-time signal creation appears in the sketch after this list.</li>
<li>In HSA, each kernel submission requires a block of "kernarg" memory to hold the kernel arguments. The baseline implementation allocates a single kernarg block and re-uses it for each kernel submission. In the optimized version we submit all the kernels at the same time, each with different kernel arguments, so we must ensure that each kernel has its own kernarg block. The code performs a single kernarg allocation with enough space to cover all of the in-flight kernels. Additionally, the code aligns each kernarg block on a 64-byte cache-line boundary. This avoids false sharing, where the GPU is reading kernargs for one command while the host is writing arguments for another kernel, causing the cache line to ping-pong between the CPU and GPU caches. The kernarg optimizations are marked with <strong>p_optPreallocKernarg</strong> and also appear in the sketch after this list.</li>
<li>The function <strong>bitonicSortGPU_opt</strong> contains the optimized loop which submits the batch of 120 kernels to the GPU. This code is marked with <strong>p_optAvoidHostSync</strong>; the batched dispatch loop is shown in the sketch after this list.</li>
<li>Each AQL kernel dispatch packet contains a field that controls the memory fences applied before and after the kernel executes. In the baseline implementation, the fences conservatively specify system visibility for both acquire and release. (The subject of fences and what they control is well beyond the scope of this document, but it is covered extensively in the HSA System Architecture Specification Memory Model.) It turns out we can make more surgical use of these fences in the optimized version (code marked with <strong>p_optFence</strong>):
<ul>
<li>The first kernel needs a system acquire fence to make sure it gets the data from the host->device copy.</li>
<li>The last kernel needs a system release fence to make sure it releases the data for the device->host copy.</li>
<li>All of the intermediate kernels only need "agent"-level fences. On AMD Fiji hardware, agent-scope fences are significantly faster than system-scope fences, since the former flush only the L1 caches while the latter flush both the L1 and L2 caches.</li>
</ul>
<codebox>
<pre>// Optimize HSA Fences
if (p_optFence) {
    aql.header =
        (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
        (1 << HSA_PACKET_HEADER_BARRIER);

    bool setFence = false;

    if (kernelCount == 1) {
        // First packet acquires at system scope to make sure it sees the host->device copy:
        aql.header |= (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE);
        aql.header |= (HSA_FENCE_SCOPE_AGENT << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
        setFence = true;
    }

    if (kernelCount == numKernels) {
        // Last packet releases to system scope to make sure data is visible for the device->host copy:
        aql.header |= (HSA_FENCE_SCOPE_AGENT << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE);
        aql.header |= (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
        setFence = true;
    }

    if (!setFence) {
        // All intermediate packets fence at agent scope only:
        aql.header |= (HSA_FENCE_SCOPE_AGENT << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE);
        aql.header |= (HSA_FENCE_SCOPE_AGENT << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
    }
}</pre>
</codebox></li>
<li>The flag <strong>p_optPinHost</strong> uses <strong>hc::am_alloc</strong> with the <strong>amPinnedHost</strong> flag to allocate pinned host memory. Pinned host memory accelerates the data transfer operations: the runtime sees that the memory is already pinned and immediately starts the DMA transactions, achieving a peak transfer rate of 13-14GB/s. Unpinned memory is transferred through a host-side staging buffer and reaches only 6-7GB/s. (An alternative pinning sketch using the core runtime follows the dispatch sketch below.)</li>
</ul>
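To make these steps concrete, here is a minimal sketch (not the repository's exact code) combining the first three optimizations above: the signal and the kernarg blocks are allocated once up front, and the loop then submits all the kernels before a single host-side wait. The names <strong>queue</strong>, <strong>kernelObject</strong>, <strong>kernargRegion</strong>, <strong>kernargSizePerKernel</strong>, and the launch dimensions are placeholders for values obtained during initialization.
<codebox>
<pre>// One-time setup: a single completion signal plus one 64-byte-aligned
// kernarg block per in-flight kernel (p_optPreallocSignal / p_optPreallocKernarg).
const uint32_t numKernels = 120;
const size_t kAlign = 64;                       // cache-line alignment

hsa_signal_t completionSignal;
hsa_signal_create(1, 0, NULL, &completionSignal);

size_t blockSize = (kernargSizePerKernel + kAlign - 1) & ~(kAlign - 1);
char *kernargPool;
hsa_memory_allocate(kernargRegion, blockSize * numKernels, (void **)&kernargPool);

// Batched dispatch (p_optAvoidHostSync): enqueue everything, wait once at the end.
// (A production loop would also watch the queue's read index to avoid
// overrunning a full queue.)
for (uint32_t k = 1; k <= numKernels; k++) {
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *aql =
        &((hsa_kernel_dispatch_packet_t *)queue->base_address)[index & (queue->size - 1)];

    // Each kernel gets its own argument block inside the preallocated pool:
    aql->kernarg_address = kernargPool + (k - 1) * blockSize;
    // ... write this kernel's arguments into its block here ...

    aql->kernel_object = kernelObject;          // finalized code handle
    aql->workgroup_size_x = 256; aql->workgroup_size_y = 1; aql->workgroup_size_z = 1;
    aql->grid_size_x = 32768;    aql->grid_size_y = 1;      aql->grid_size_z = 1;
    aql->private_segment_size = 0;
    aql->group_segment_size = 0;
    aql->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;

    // Only the last packet carries the completion signal; the barrier bit
    // orders all the packets in between.
    hsa_signal_t noSignal = {0};
    aql->completion_signal = (k == numKernels) ? completionSignal : noSignal;

    // Agent-scope fences shown here; the first and last packets widen them
    // to system scope as in the p_optFence snippet above.
    uint16_t header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
                      (1 << HSA_PACKET_HEADER_BARRIER) |
                      (HSA_FENCE_SCOPE_AGENT << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
                      (HSA_FENCE_SCOPE_AGENT << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
    // Publish the header last so the packet processor never sees a partial packet:
    __atomic_store_n(&aql->header, header, __ATOMIC_RELEASE);

    // Ring the doorbell so the packet processor picks up the new packet:
    hsa_signal_store_relaxed(queue->doorbell_signal, index);
}

// One host-side wait for the entire batch:
hsa_signal_wait_acquire(completionSignal, HSA_SIGNAL_CONDITION_LT, 1,
                        UINT64_MAX, HSA_WAIT_STATE_ACTIVE);</pre>
</codebox>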
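The repository's pinning path goes through <strong>hc::am_alloc</strong> as described above. As an illustrative alternative (a sketch, not the benchmark's code), the same effect can be had from the core HSA runtime by registering an existing host allocation:
<codebox>
<pre>size_t bytes = 32768 * sizeof(int);   // the 128KB array from this example
int *hostBuf = (int *)malloc(bytes);

// Pin the buffer so the DMA engines can access it directly, skipping the
// staging-buffer path (roughly 6-7GB/s unpinned vs. 13-14GB/s pinned):
hsa_memory_register(hostBuf, bytes);

// ... run the host->device copy, the kernels, and the device->host copy ...

hsa_memory_deregister(hostBuf, bytes);
free(hostBuf);</pre>
</codebox>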
<h2>Results</h2>
<p>After making these changes, we see the speedups shown in the chart and table below.</p>
<img class="aligncenter size-full wp-image-3435" src="http://gpuopen.com/wp-content/uploads/2016/06/perf-data.png" alt="perf-data" width="743" height="416" />
<p>The timing numbers shown here include the time to transfer the array to the GPU, run all of the kernels, and transfer the result back. They do not include time spent initializing the runtime, allocating memory, or verifying the result. Each measurement covers sorting 32768 integers using 120 kernels. This is a relatively small problem to offload to the GPU (only 128KB of data), so each kernel runs in just 3-4 us, which stresses exactly the HSA runtime features we want to discuss.</p>
<table>
<thead>
<tr>
<th></th>
<th>Baseline</th>
<th>+optPreallocSignal</th>
<th>+optPreallocKernarg</th>
<th>+optAvoidHostSync</th>
<th>+optFence</th>
<th>+optPinnedHost</th>
</tr>
</thead>
<tbody>
<tr>
<td>RunTime/Iteration (us)</td>
<td>1943</td>
<td>1906</td>
<td>1869</td>
<td>1665</td>
<td>1221</td>
<td>1137</td>
</tr>
<tr>
<td>Delta/Iteration (us)</td>
<td></td>
<td>-37</td>
<td>-37</td>
<td>-204</td>
<td>-444</td>
<td>-84</td>
</tr>
</tbody>
</table>
<p>The results show that signal allocation and kernarg allocation each take approximately 37us, which makes sense since both operations require a trip into kernel space (through the ROCK kernel driver) and perform memory allocation. Note that even the baseline shares a single signal and a single kernarg block across all 120 kernels, and the overhead is still significant. Kernels can be dispatched in 5-10us each, so optimal programs will definitely want to perform these operations outside the critical path. The optimized code path here moves them into the setup routine. Another option is to create a buffer pool of signals and kernargs (this is the approach used by <a href="https://github.com/RadeonOpenCompute/hcc/blob/master/lib/hsa/mcwamp_hsa.cpp#L47" target="_blank">HCC</a>), or to use thread-local storage if thread safety is required; a sketch of such a pool follows.</p>
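<p>As an illustration of the pooling idea (a sketch under our own naming, not HCC's actual implementation), a small pool can recycle signals so the kernel-space allocation cost is paid only once, at initialization:</p>
<codebox>
<pre>// Simple round-robin signal pool (illustrative). hsa_signal_create, the
// expensive kernel-space call, runs only in initSignalPool().
#define POOL_SIZE 16

hsa_signal_t signalPool[POOL_SIZE];
int nextFree = 0;   // guard with a mutex, or make the pool thread-local

void initSignalPool() {
    for (int i = 0; i < POOL_SIZE; i++) {
        hsa_signal_create(0, 0, NULL, &signalPool[i]);   // start "completed"
    }
}

hsa_signal_t getSignal() {
    hsa_signal_t s = signalPool[nextFree];
    nextFree = (nextFree + 1) % POOL_SIZE;
    // Make sure the previous dispatch that used this slot has finished:
    hsa_signal_wait_acquire(s, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_ACTIVE);
    hsa_signal_store_relaxed(s, 1);   // re-arm for the next dispatch, no kernel call
    return s;
}</pre>
</codebox>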
<p>Avoiding the host synchronization saves 204us, or about 1.7us per kernel.</p>
<p>The system-scope fences are fairly expensive: Fiji has a 2MB L2 cache, and it takes 3-4 us to flush the entire thing. Meanwhile, the default bitonic data set is only 128KB (32K elements * 4 bytes/element), which fits easily in the L2 cache, so each kernel in the sequence reads from the L2 and writes the data back to it. By optimizing these fences to use agent scope where possible, we save approximately 3.7us per kernel launch (444us across 120 kernels).</p>
<p>Finally, using pinned host memory improves the transfer speeds from around 6GB/s to 14GB/s. In this workload we see a modest overall improvement (84us), since most of the benchmark time is spent running the kernels and synchronizing between them.</p>
<p>Overall, these optimizations make this workload 1.7X faster than the baseline version.</p>
<h2>References</h2>
<a href="https://en.m.wikipedia.org/wiki/Bitonic_sorter" target="_blank">Wikipedia</a> has a nice description of the Bitonic sort algorithm, including pictures.
Eric Bainville wrote a nice explanation <a href="http://www.bealto.com/gpu-sorting_intro.html" target="_blank">here</a> describing how to optimize Bitonic sort for the GPU.
</section>
</div>
<!-- FOOTER -->
<div id="footer_wrap" class="outer">
<footer class="inner">
<p> © 2016 AMD Corporation <a href="legal.html">Disclaimer and Legal Information</a> </p>
</footer>
</div>
</body>
</html>