---
title: Linear Algebra
subtitle: Matrix Multiplication Beyond Auto-Tuning — Rewrite Based GPU Code Generation
layout: project_page
permalink: algebra
---
<div id="main">
<div class="container">
<div class="row">
<!-- Content -->
<div id="content" class="8u skel-cell-important">
<section>
<header>
<h2>Abstract</h2>
</header>
<p>Graphics Processing Units (GPUs) are used as general purpose
parallel accelerators in a wide range of applications.
They are found in most computing systems, and mobile devices
are no exception. The recent availability of programming
APIs such as OpenCL for mobile GPUs promises to
open up new types of applications on these devices.</p>
<p>
However, producing high performance GPU code is extremely
difficult. Subtle differences in device characteristics
can lead to large performance variations when different optimizations
are applied. As we will see, this is especially true
for a mobile GPU such as the ARM Mali GPU which has a
very different architecture than desktop-class GPUs. Code
optimized and tuned for one type of GPU is unlikely to
achieve its performance potential on another type of GPU.</p>
<p>
Auto-tuners have traditionally been an answer to this performance
portability challenge. For instance, they have been
successful on CPUs for matrix operations, which are used
as building blocks in many high-performance applications.
However, they are much harder to design for different classes
of GPUs, given the wide variety of hardware characteristics.</p>
<p>
We take a different perspective and show how
performance portability for matrix multiplication is achieved
using a compiler approach. This approach is based on a
recently developed generic technique that combines a high-level
programming model with a system of rewrite rules. Programs
are automatically rewritten in successive steps, during which
optimization decisions are made. This approach is truly performance
portable, resulting in high-performance code for
very different types of architectures such as desktop and mobile
GPUs. In particular, we achieve a speedup of 1.7x over a
state-of-the-art auto-tuner on the ARM Mali GPU.</p>
</section>
<section>
<header>
<h2>Publications</h2>
</header>
<ul>
<li>
Michel Steuwer, Toomas Remmelg, and Christophe Dubach:
<strong><a href="publications/2017/steuwer17LiftIR.pdf">
Lift: A Functional Data-Parallel
IR for High-Performance GPU Code Generation</a></strong>,
in the <i><a href="http://www.cgo.org/" target="blank">Proceedings of the 2017
International Symposium on Code Generation and Optimization (CGO)</a></i>.
</li>
<li>Michel Steuwer, Toomas Remmelg, and Christophe Dubach:
<strong><a href="publications/2016/steuwer16beyondAutoTuning.pdf">
Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation</a></strong>,
in the <i><a href="http://www.esweek.org/cases/about" target="blank">
Proceedings of the 2016 International Conference on Compilers, Architectures and
Synthesis for Embedded Systems (CASES)</a></i>.</li>
<li>Toomas Remmelg, Thibaut Lutz, Michel Steuwer, and Christophe Dubach:
<strong><a href="publications/2016/remmelg16perfport.pdf">
Performance Portable GPU Code Generation for Matrix Multiplication</a></strong>,
in the <i><a href="http://conf.researchr.org/track/PPoPP-2016/GPGPU-2016-papers" target="blank">
9th Workshop on General Purpose Processing using GPUs</a> (GPGPU) @ PPoPP</i>.</li>
</ul>
</section>
</div>
<!-- Sidebar -->
<div id="sidebar" class="4u">
<section style="text-align: center">
<header style="text-align: left">
<h2>Posters</h2>
</header>
<div class="thumb_frame" style="margin: 0 auto;">
<a href="posters/2016/CASES-2016.pdf">
<div class="thumb_container">
<img class="thumb_image" src="posters/2016/thumbnails/CASES-2016_thumb.png"
width="300" height="212" alt="Thumbnail of the CASES 2016 poster">
<div class="thumb_overlay">
<div class="thumb_text">
<p class="posted">Oct 4, 2016</p>
<p class="posted">@ the International Conference on Compilers, Architectures
and Synthesis for Embedded Systems (CASES) 2016 in Pittsburgh, USA</p>
<p>Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code
Generation</p>
</div>
</div>
</div>
</a>
</div>
</section>
<section>
<header>
<h2>Talks</h2>
</header>
<div class="row">
<section>
<ul class="style">
<li>
<p class="posted">Feb 6, 2017 @ the International Symposium on Code Generation and
Optimization (CGO) 2017 in Austin, USA</p>
<p><a href="presentations/2017/CGO-2017.pdf">
Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation</a></p>
</li>
<li>
<p class="posted">Oct 4, 2016 @ the International Conference on Compilers, Architectures
and Synthesis for Embedded Systems (CASES) 2016 in Pittsburgh, USA</p>
<p><a href="presentations/2016/CASES-2016.pdf">
Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation</a></p>
</li>
<li>
<p class="posted">Apr 21, 2016 @ the Institute for Computing Systems Architecture
(ICSA) Sessions at the University of Edinburgh, UK</p>
<p><a href="presentations/2016/ICSA-2016.pdf">
Expressing Optimisations as Rewrites</a></p>
</li>
<li>
<p class="posted">Mar 12, 2016 @ the Annual Workshop on General Purpose Processing
using Graphics Processing Units (GPGPU) 2016 in Barcelona, Spain</p>
<p><a href="presentations/2016/GPGPU-2016.pdf">
Performance Portable GPU Code Generation for Matrix Multiplication</a></p>
</li>
<li>
<p class="posted">Oct 19, 2015 @ the Programming Language Interest Group at the
University of Edinburgh, UK</p>
<p><a href="presentations/2015/PLInG-2015.pdf">
A Functional Approach to Performance Portable GPU Code Generation: A Case
Study on Matrix Multiplication</a></p>
</li>
</ul>
</section>
</div>
</section>
<section class="profile">
<header>
<h2>Researchers</h2>
</header>
<div class="row">
<section class="6u">
<a href="https://www.inf.ed.ac.uk/people/students/Toomas_Remmelg.html" class="image full">
<img src="images/toomas.jpg" alt="Toomas Remmelg"></a>
<a href="https://www.inf.ed.ac.uk/people/students/Toomas_Remmelg.html">Toomas Remmelg</a>
<br>
PhD Student
<br>
<a href="http://www.ed.ac.uk/informatics/">University of Edinburgh</a>
</section>
<section class="6u">
<a href="https://michel-steuwer.github.io/" class="image full">
<img src="images/msteuwer.jpg" alt="Michel Steuwer"></a>
<a href="https://michel-steuwer.github.io/">Michel Steuwer</a>
<br>
Lecturer
<br>
<a href="https://www.gla.ac.uk/schools/computing/">University of Glasgow</a>
</section>
</div>
<div class="row">
<section class="6u">
<a href="http://homepages.inf.ed.ac.uk/cdubach/" class="image full">
<img src="images/cdubach.png" alt="Christophe Dubach"></a>
<a href="http://homepages.inf.ed.ac.uk/cdubach/">Christophe Dubach</a>
<br>
Reader
<br>
<a href="http://www.ed.ac.uk/informatics/">University of Edinburgh</a>
</section>
</div>
</section>
</div>
</div>
</div>
</div>