---
title: Linear Algebra
subtitle: "Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation"

layout: project_page

permalink: algebra
---
<div id="main">
    <div class="container">
        <div class="row">

            <!-- Content -->
            <div id="content" class="8u skel-cell-important">
                <section>
                    <header>
                        <h2>Abstract</h2>
                    </header>
                    <p>Graphics Processing Units (GPUs) are used as general purpose
                        parallel accelerators in a wide range of applications.
                        They are found in most computing systems, and mobile devices
                        are no exception. The recent availability of programming
                        APIs such as OpenCL for mobile GPUs promises to
                        open up new types of applications on these devices.</p>
                    <p>
                        However, producing high performance GPU code is extremely
                        difficult. Subtle differences in device characteristics
                        can lead to large performance variations when different optimizations
                        are applied. As we will see, this is especially true
                        for a mobile GPU such as the ARM Mali GPU, which has a
                        very different architecture from desktop-class GPUs. Code
                        optimized and tuned for one type of GPU is unlikely to
                        achieve its performance potential on another type of GPU.</p>
                    <p>
                        Auto-tuners have traditionally been an answer to this performance
                        portability challenge. For instance, they have been
                        successful on CPUs for matrix operations, which are used
                        as building blocks in many high-performance applications.
                        However, they are much harder to design for different classes
                        of GPUs, given the wide variety of hardware characteristics.</p>
                    <p>
                        We take a different perspective and show how
                        performance portability for matrix multiplication is achieved
                        using a compiler approach. This approach is based on a
                        recently developed generic technique that combines a high-level
                        programming model with a system of rewrite rules. Programs
                        are automatically rewritten in successive steps, during which
                        optimization decisions are made. This approach is truly performance
                        portable, resulting in high-performance code for
                        very different types of architectures such as desktop and mobile
                        GPUs. In particular, we achieve a speedup of 1.7x over a
                        state-of-the-art auto-tuner on the ARM Mali GPU.</p>
                    
                </section>
                
                <section>
                    <header>
                        <h2>Publications</h2>
                    </header>
                    <ul>
                        <li>
                            Michel Steuwer, Toomas Remmelg, and Christophe Dubach:
                            <strong><a href="publications/2017/steuwer17LiftIR.pdf">
                                Lift: A Functional Data-Parallel
                                IR for High-Performance GPU Code Generation</a></strong>, 
                            in the <i><a href="http://www.cgo.org/" target="_blank">Proceedings of the 2017
                            International Symposium on Code Generation and Optimization (CGO)</a></i>.
                        </li>
                        <li>Michel Steuwer, Toomas Remmelg, and Christophe Dubach:
                            <strong><a href="publications/2016/steuwer16beyondAutoTuning.pdf">
                                Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation</a></strong>, 
                            in the <i><a href="http://www.esweek.org/cases/about" target="_blank">
                                Proceedings of the 2016 International Conference on Compilers, Architecture and 
                                Synthesis for Embedded Systems (CASES)</a></i>.</li>
                        <li>Toomas Remmelg, Thibaut Lutz, Michel Steuwer, and Christophe Dubach:
                            <strong><a href="publications/2016/remmelg16perfport.pdf">
                                Performance Portable GPU Code Generation for Matrix Multiplication</a></strong>,
                            in the <i><a href="http://conf.researchr.org/track/PPoPP-2016/GPGPU-2016-papers" target="_blank">
                                9th Workshop on General Purpose Processing using GPUs</a> (GPGPU) @ PPoPP</i>.</li>
                    </ul>
                </section>
            </div>

            <!-- Sidebar -->
            <div id="sidebar" class="4u">
                <section style="text-align: center">
                    <header style="text-align: left">
                        <h2>Posters</h2>
                    </header>
                    <div class="thumb_frame" style="margin: 0 auto;">
                        <a href="posters/2016/CASES-2016.pdf">
                            <div class="thumb_container">
                                <img class="thumb_image" src="posters/2016/thumbnails/CASES-2016_thumb.png"
                                     alt="CASES 2016 poster thumbnail" width="300" height="212">
                                <div class="thumb_overlay">
                                    <div class="thumb_text">
                                        <p class="posted">Oct 4, 2016</p>
                                        <p class="posted">@ the International Conference on Compilers, Architecture
                                            and Synthesis for Embedded Systems (CASES) 2016 in Pittsburgh, USA</p>
                                        <p>Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code
                                            Generation</p>
                                    </div>
                                </div>
                            </div>
                        </a>
                    </div>
                </section>
                <section>
                    <header>
                        <h2>Talks</h2>
                    </header>
                    <div class="row">
                        <section>
                            <ul class="style">
                                <li>
                                    <p class="posted">Feb 6, 2017 @ the International Symposium on Code Generation and
                                        Optimization (CGO) 2017 in Austin, USA</p>
                                    <p><a href="presentations/2017/CGO-2017.pdf">
                                        Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation</a></p>
                                </li>
                                <li>
                                    <p class="posted">Oct 4, 2016 @ the International Conference on Compilers, Architecture
                                        and Synthesis for Embedded Systems (CASES) 2016 in Pittsburgh, USA</p>
                                    <p><a href="presentations/2016/CASES-2016.pdf">
                                        Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation</a></p>
                                </li>
                                <li>
                                    <p class="posted">Apr 21, 2016 @ the Institute for Computing Systems Architecture
                                        (ICSA) Sessions at the University of Edinburgh, UK</p>
                                    <p><a href="presentations/2016/ICSA-2016.pdf">
                                        Expressing Optimisations as Rewrites</a></p>
                                </li>
                                <li>
                                    <p class="posted">Mar 12, 2016 @ the Annual Workshop on General Purpose Processing
                                        using Graphics Processing Units (GPGPU) 2016 in Barcelona, Spain</p>
                                    <p><a href="presentations/2016/GPGPU-2016.pdf">
                                        Performance Portable GPU Code Generation for Matrix Multiplication</a></p>
                                </li>
                                <li>
                                    <p class="posted">Oct 19, 2015 @ the Programming Language Interest Group at the
                                        University of Edinburgh, UK</p>
                                    <p><a href="presentations/2015/PLInG-2015.pdf">
                                        A Functional Approach to Performance Portable GPU Code Generation: A Case 
                                        Study on Matrix Multiplication</a></p>
                                </li>
                            </ul>
                        </section>
                    </div>
                </section>
                <section class="profile">
                    <header>
                        <h2>Researchers</h2>
                    </header>
                    <div class="row">
                        <section class="6u">
                            <a href="https://www.inf.ed.ac.uk/people/students/Toomas_Remmelg.html" class="image full">
                                <img src="images/toomas.jpg" alt="Toomas Remmelg"></a>
                            <a href="https://www.inf.ed.ac.uk/people/students/Toomas_Remmelg.html">Toomas Remmelg</a>
                            <br>
                            PhD Student
                            <br>
                            <a href="http://www.ed.ac.uk/informatics/">University of Edinburgh</a>
                        </section>
                        <section class="6u">
                            <a href="https://michel-steuwer.github.io/" class="image full">
                                <img src="images/msteuwer.jpg" alt="Michel Steuwer"></a>
                            <a href="https://michel-steuwer.github.io/">Michel Steuwer</a>
                            <br>
                                Lecturer
                            <br>
                            <a href="https://www.gla.ac.uk/schools/computing/">University of Glasgow</a>
                        </section>
                    </div>
                    <div class="row">
                        <section class="6u">
                            <a href="http://homepages.inf.ed.ac.uk/cdubach/" class="image full">
                                <img src="images/cdubach.png" alt="Christophe Dubach"></a>
                            <a href="http://homepages.inf.ed.ac.uk/cdubach/">Christophe Dubach</a>
                            <br>
                                Reader
                            <br>
                            <a href="http://www.ed.ac.uk/informatics/">University of Edinburgh</a>
                        </section>
                    </div>
                </section>
            </div>
        </div>
    </div>
</div>