
Hydride: A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures


Introduction

Domain-specific hardware accelerators and complex vector extensions to existing architectures are emerging to efficiently support modern tensor and image processing workloads. These custom accelerators and CPU/GPU extensions provide high performance and energy efficiency for important operations like matrix multiplication, tensor convolution, and so on. Accelerators such as the Qualcomm Hexagon DSP and x86 vector extensions such as Intel AVX-512 have been extended with specialized instructions that help optimize tensor and stencil computations. These include specialized SIMD and non-SIMD (cross-lane) instructions to perform dot products and various reduction operations, and to change data layout by moving data across vector lanes. The number of instructions in ISAs such as x86 and Hexagon will continue to grow to support more compute-intensive operations in deep learning and image processing applications.

A major challenge faced by compilers for such architectures is the difficulty of generating efficient code for these complex vector instructions. Modern compilers for general-purpose languages rely on a rich language-independent, highly retargetable compiler infrastructure such as GCC or LLVM to compile to a wide range of architectures, as shown in Figure 1(a). Domain-specific language (DSL) compilers such as Halide, TVM, XLA, and others also rely on LLVM for low-level code generation, but these languages depend much more heavily on complex vector and tensor hardware for high performance than traditional general-purpose languages do. Unfortunately, extending even highly retargetable infrastructures like GCC and LLVM to keep pace with evolving hardware vector instruction sets has proven difficult because the target-independent compiler IR has no inherent extensibility mechanisms: retargetability instead focuses on translating this IR to a wide range of hardware targets. While this works well for scalar and popular SIMD instructions, experience from multiple teams (as well as our own) shows that instruction selection in LLVM often fails to generate complex vector instructions from LLVM IR, such as the dot product and swizzle (i.e., vector shuffle) instructions found in many architectures today. As a result, compiler IRs have had to include target-specific intrinsics to allow high-performance code generation for DSLs, even though doing so hinders retargetability. To generate efficient code for different architectures, performance-sensitive DSL compilers like Halide use target-specific back ends that emit target-specific LLVM intrinsics for architectures such as x86, Hexagon, ARM, RISC-V, and Power, relying on manually crafted pattern-matching rules to map the DSL-specific IR to complex vector hardware instructions (Figure 1(b)). Such DSLs inherently fail to benefit from the long investment in language-independent, retargetable compiler infrastructures, as classical languages have for decades.

The fundamental weakness in using manually engineered target-specific back ends in DSL compilers to support modern ISAs is that it eliminates retargetability. In particular, manually crafting separate, target-specific code generators for every hardware target is cumbersome, error-prone, and requires substantial engineering effort and time. This makes supporting large ISAs or extending existing ISAs with new instructions difficult. (A separate problem, shared with traditional compilers shown in Figure 1(a), is that a pattern-matching approach is inherently brittle because the patterns only apply to fixed, predetermined sequences of instructions, i.e., if the input sequence of instructions does not match the hard-coded patterns, the patterns simply do not fire.) Also, because the DSL back ends are manually implemented, maximizing target instruction coverage is difficult, especially for large ISAs such as x86. For example, the x86 back end for Halide supports only 24 non-SIMD integer vector operations, a small fraction of all the variants supported by the hardware. More generally, requiring every performance-sensitive DSL like Halide or high-level compiler infrastructure like TVM to implement its own back ends for every hardware target is extraordinarily inefficient compared with investing in reusable, language-independent compiler infrastructures like LLVM, GCC, or MLIR.

Figure 1: (a) General-purpose compilers built on a language-independent, retargetable infrastructure such as GCC or LLVM; (b) DSL compilers with manually engineered, target-specific back ends.

A few recent works have looked at automatically generating back ends for DSLs and at improving scalar and vector compilation more broadly, but they still assume a fixed compiler representation such as LLVM IR. One of the first uses program synthesis to achieve better coverage of a large target ISA, but it focuses on scalar x86 instructions, which are fairly stable across processor generations and therefore are well supported by a fixed compiler IR. Vegen improves target instruction coverage for cross-lane (non-SIMD) vector operations by automatically generating pattern-matching rules using x86 instruction semantics. However, it does not support generation of specialized data movement instructions, which are critical for the performance of deep learning and stencil computations. More generally, it fundamentally does not address the lack of infrastructure reuse across DSLs because it does not provide a way to evolve the language-independent compiler IR (it assumes a fixed LLVM IR). Other works that leverage program synthesis techniques, like Rake, Diospyros, and Porcupine, only support generation of a small number of target instructions. Moreover, Rake requires users to modify the source code of Halide programs that use matrix multiplication, convolution, and similar operations to explicitly expose the dot-product and data-swizzle patterns in order to make synthesis tractable, which places a heavy burden on the programmer and violates the core Halide principle of separating computational specifications from scheduling optimizations.

Of all these systems, only Rake attempts to improve the fundamental weakness of the current DSL compiler approach illustrated in Figure 1: the inability to share infrastructure investment for optimization and code generation across DSLs. Rake [2] developed a target-agnostic, language-independent IR, called Uber IR, that Halide uses to generate HVX and ARM intrinsics in LLVM. However, Uber IR was manually designed, and that approach does not scale to large, evolving ISAs like HVX, x86, ARM, and numerous others.

In this work, we propose an alternative approach in which (formally defined) language-independent and target-independent LLVM IR instructions are automatically generated to support one or more target ISAs. We call this autogenerated IR the AutoLLVM IR. This enables a shared compiler infrastructure to evolve rapidly while also achieving nearly complete hardware instruction coverage: together, these capabilities make the infrastructure both highly extensible (to support new instructions in an existing ISA) and retargetable (across multiple ISAs). Moreover, because all the IR operations are defined using a precise, executable formal semantics, we use program synthesis in a language front end (e.g., for a DSL like Halide) to generate highly efficient target-specific code that benefits fully from evolving vector hardware operations, without using brittle pattern matching. More specifically, we describe a system called Hydride, which simultaneously solves three key problems: (1) enabling automated design of the language-independent compiler IR to support multiple, rapidly evolving hardware ISAs; (2) maximizing instruction coverage for complex ISAs; and (3) avoiding the brittle pattern-matching approach to instruction selection. Hydride operates in two phases: offline and online. The offline phase automatically generates compiler code at compiler development time, while the online phase uses program synthesis for target-specific code generation at compilation time.

In the offline phase, Hydride uses a novel similarity-checking technique to automatically discover “similar instructions” across multiple hardware architectures and represents them as target-independent IR operations. The similarity checking uses target ISA formal semantics derived automatically from vendor-specified ISA pseudocode (this automatic derivation of formal semantics is similar to Vegen’s). We represent these ISA instructions in a custom IR, which we call Hydride IR, and perform code transformations on it to derive a canonical form. Next, we abstract away differences in constants, vector lane sizes, and types across thousands of ISA instructions represented in Hydride IR to build a set of equivalence classes of ISA instructions across multiple target architectures. From these equivalence classes we automatically create target-independent IR instructions, which we call AutoLLVM IR. Our results show that only hundreds of AutoLLVM IR instructions are needed to represent thousands of hardware instructions, e.g., 235 AutoLLVM IR instructions for a combined total of 2377 x86 and HVX machine instructions.

In the online phase, Hydride takes as input the formal semantics of a language (e.g., Halide) and the IR of a program in that language, and synthesizes semantically equivalent AutoLLVM IR using syntax-guided synthesis (SyGuS). Although it produces AutoLLVM IR, this synthesis is target-specific, i.e., it selects particular hardware instructions within each AutoLLVM IR equivalence class; this is important for performance, but the synthesis algorithm itself is completely language-independent and target-independent, and fully reusable for other languages with a formal semantics. A key aspect of Hydride is that the synthesis approach is carefully crafted to achieve synthesis times comparable to, or faster than, the state of the art, while considering 10x–100x more possible AutoLLVM IR instructions, i.e., providing far better coverage. Arbitrary LLVM IR passes can be applied unchanged to AutoLLVM IR before it is translated to the target-specific LLVM intrinsics for the machine instructions previously selected by the language front end. The online phase synthesizes not only vector compute instructions but also data movement (swizzle) instructions. We generate computation and data movement code in two stages to significantly speed up synthesis while sacrificing little code quality. In the first stage, the front end fixes the data movement to a few representative patterns and synthesizes the instructions required for the computation. In the second stage, the front end synthesizes the best swizzle instruction sequence to realize the representative data movement pattern. This two-stage approach achieves significantly faster synthesis times without putting the burden of encoding data movement patterns on the programmer.
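To make the two-stage split concrete, the following is a minimal sketch in Python; the op names, the representative movement pattern, and the random-testing check (standing in for the actual syntax-guided synthesis and verification) are invented for illustration and are not Hydride’s implementation.

```python
# Hypothetical sketch of the two-stage compute/swizzle synthesis split.
# Stage 1 fixes data movement to a representative pattern and searches only
# over compute ops; stage 2 searches for a swizzle op realizing that pattern.
import random

def interleave_lo(v):
    """Representative data-movement pattern: interleave the two halves."""
    h = len(v) // 2
    return [v[h * (i % 2) + i // 2] for i in range(len(v))]

# Executable semantics of a few candidate AutoLLVM-style ops (made up).
COMPUTE_OPS = {"autollvm.add": lambda a, b: [x + y for x, y in zip(a, b)],
               "autollvm.mul": lambda a, b: [x * y for x, y in zip(a, b)]}
SWIZZLE_OPS = {"autollvm.shuffle_interleave": interleave_lo,
               "autollvm.reverse": lambda v: v[::-1]}

def agree(f, g, arity, lanes=8, trials=8):
    """Toy stand-in for a SyGuS equivalence check: compare on random inputs."""
    for _ in range(trials):
        args = [[random.randint(-4, 4) for _ in range(lanes)]
                for _ in range(arity)]
        if f(*args) != g(*args):
            return False
    return True

# Program fragment to compile: multiply a by the interleaved form of b.
def spec(a, b):
    return [x * y for x, y in zip(a, interleave_lo(b))]

# Stage 1: with the movement fixed to interleave_lo, synthesize the compute.
compute = next(name for name, op in COMPUTE_OPS.items()
               if agree(lambda a, b: op(a, interleave_lo(b)), spec, 2))
# Stage 2: synthesize a swizzle op that realizes the fixed movement pattern.
swizzle = next(name for name, op in SWIZZLE_OPS.items()
               if agree(op, interleave_lo, 1))
print(compute, swizzle)   # -> autollvm.mul autollvm.shuffle_interleave
```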

Overall, Hydride achieves several key goals that address the three problems laid out above. First, it uses vendor-provided pseudocode specifications of target ISAs to automatically derive a set of target-independent compiler IR operations that represent the full functionality of those ISAs, together with an executable formal semantics of these operations. Second, by supporting all existing target ISA instructions in the compiler IR and providing a full formal semantics of that IR, Hydride enables a front-end compiler to use program synthesis instead of brittle pattern matching to compile source programs into the compiler IR, while taking full advantage of non-trivial code sequences that a pattern-matching approach may never have generated. Third, it retains the full infrastructure reuse and compiler engineering benefits of traditional retargetable, language-independent compiler infrastructures because all components of Hydride are fully language-independent and target-independent, relying only on a source-language formal semantics, a target ISA pseudocode specification, and a parser for the latter.

Hydride Overview

Hydride is a DSL compiler infrastructure that is retargetable to a wide range of rich and complex scalar and vector architectures, is easily extensible to new instructions and instruction sets, uses formal semantics of the input languages, the compiler IR, and the target ISAs, and provides robust code generation with high instruction coverage by using fully automated synthesis to translate between them. To our knowledge, Hydride is the first system to automatically generate language- and target-independent compiler IR operations from machine code semantics. The LLVM IR operations Hydride generates are fully compatible with manually designed LLVM IR and can take full advantage of all relevant LLVM IR analysis and transformation passes, metadata and attributes, and extensive build infrastructure and programming tools. The key insight in our work is that formal semantics make it possible to design (or extend) the mid-level compiler IR automatically instead of manually. Automatic generation can ensure that the compiler IR supports all hardware instructions for multiple, complex, potentially large ISAs. Moreover, by automatically generating the formal semantics of both the compiler IR and the target instructions, pattern matching can be replaced with program synthesis, enabling more robust code generation even for complex vector operations. Hydride achieves these goals in practice by solving the following challenges:

  • Developing an automated strategy to find similar instructions across multiple targets in order to design operations for a language-independent compiler IR, including a formal semantics for the IR (a minimal sketch of this similarity grouping appears after this list).
  • Automatically generating code for mapping the language-independent compiler IR instructions to target-specific instructions.
  • Leveraging formal semantics of compiler IR instructions to perform program synthesis in reasonable times for multiple hardware targets, both large and small.
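The following is a minimal, hypothetical sketch of the similarity grouping behind the first challenge; the expression representation, the placeholder abstraction, and the instruction names are invented for illustration and are not Hydride IR or its actual canonicalization.

```python
# Hypothetical sketch: instruction semantics are modeled as small canonical
# expression trees; constants, element widths, and lane counts are abstracted
# into placeholders, and instructions whose abstracted trees coincide fall
# into one equivalence class, which becomes one AutoLLVM-style IR operation.
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Op:
    name: str          # operator, e.g. "mul", "add", "saturate"
    args: tuple        # operand expressions (Op instances or "ARG" leaves)
    elem_bits: int     # element width in bits
    lanes: int         # number of vector lanes

def abstract(expr):
    """Keep only the operator structure; widths, lane counts, and leaf
    operands collapse to placeholders."""
    if isinstance(expr, Op):
        return (expr.name, tuple(abstract(a) for a in expr.args), "W", "L")
    return "ARG"

def group_into_classes(isa_semantics):
    """Map canonicalized instruction semantics to equivalence classes."""
    classes = defaultdict(list)
    for instr, sem in isa_semantics.items():
        classes[abstract(sem)].append(instr)
    return classes

# Two made-up instructions from different ISAs with the same operator shape
# (pairwise multiply feeding an add) but different widths and lane counts:
x86_like = Op("add", (Op("mul", ("ARG", "ARG"), 16, 16),
                      Op("mul", ("ARG", "ARG"), 16, 16)), 32, 8)
hvx_like = Op("add", (Op("mul", ("ARG", "ARG"), 8, 64),
                      Op("mul", ("ARG", "ARG"), 8, 64)), 16, 32)

classes = group_into_classes({"x86.dot_like": x86_like, "hvx.dot_like": hvx_like})
for shape, members in classes.items():
    print(members)     # both instructions land in the same class -> one IR op
```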

There are multiple advantages to generating target-independent IR operations (representing equivalence classes of similar machine instructions across architectures) instead of using program synthesis directly from a DSL IR to target-specific instructions. In our current work, the primary advantage is that the program synthesis from the DSL IR (e.g., Halide IR) exploits these equivalence classes to prune the search space of output instructions, greatly reducing synthesis cost. Moreover, the total number of synthesized IR operations is an order of magnitude smaller than the number of target-specific instructions, which would simplify practical considerations like testing, debugging, and maintenance. In the future, formal-methods-based IR-level tools that take advantage of the full formal semantics, such as tools for symbolic execution, concolic testing, or program verification, might also be greatly simplified by having far fewer operations to support.

The overall workflow of Hydride is shown in Figure 2. Hydride operates in two phases: an offline phase and an online phase. During the offline phase, the Hydride Code Generator Generator (described in Section 3) uses the ISA pseudocode specifications commonly provided by hardware vendors to automatically generate target-agnostic, retargetable IR instructions, along with their semantics, for a language-independent compiler IR like LLVM; we call the result AutoLLVM IR. During the online phase (i.e., at compilation time), Hydride’s Code Synthesizer uses syntax-guided synthesis to translate the code of an input program (for example, written in Halide) to AutoLLVM IR.
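As a rough illustration of the pruning advantage described above, the snippet below (with invented class and instruction names, and a trivial predicate standing in for the real synthesis check) separates the expensive outer search over IR equivalence classes from the cheap inner choice of a concrete target instruction within the selected class.

```python
# Hypothetical sketch: equivalence classes prune the online search space.
# The outer search ranges over AutoLLVM-style IR ops, one per class; only the
# chosen class is then specialized to a concrete target instruction.
IR_CLASSES = {
    "autollvm.mul_add_pairs":  {"x86": ["vpmaddwd_like"], "hvx": ["vdmpy_like"]},
    "autollvm.saturating_add": {"x86": ["paddsw_like"],   "hvx": ["vaddsat_like"]},
}

def synthesize_ir_op(spec_matches):
    """Outer search over IR classes; spec_matches stands in for the real
    SyGuS equivalence check against the program fragment."""
    for ir_op in IR_CLASSES:
        if spec_matches(ir_op):
            return ir_op
    return None

def lower(ir_op, target, pick):
    """Inner step: choose one concrete instruction of the target inside the
    selected class (the real system resolves widths and lane counts here)."""
    return pick(IR_CLASSES[ir_op][target])

chosen = synthesize_ir_op(lambda op: op == "autollvm.mul_add_pairs")
print(lower(chosen, "hvx", lambda variants: variants[0]))   # -> vdmpy_like
```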

Figure 2: Overall workflow of Hydride, with an offline phase (the Code Generator Generator derives AutoLLVM IR from ISA pseudocode semantics) and an online phase (the Code Synthesizer translates input programs to AutoLLVM IR).