Thoughts for Portable ISAs
By Portable ISA I mean an Instruction Set Architecture that can be implemented on many microarchitectures, probably with the intent of being implemented by more than one independent company or other organization.
I distinguish
- a portable ISA used for software distribution
- from an intermediate language used in a compiler tool chain: HLL --> IR --> assembly or machine code
Here briefly, for more see [[Portable ISAs versus Virtual Machines]].
My term "Portable ISA" is associated with, and can mostly be considered a subset of, a software virtual machine, which provides the system services, such as system calls, privilege domains, threads/processes/processors, and (sometimes) exceptions and interrupts and other forms of communication.
I.e. virtual machine = Portable ISA + System Service Architecture. Distribution formats may be included.
Here briefly, see Examples of Portable ISAs for more.
RISC-V is an example of a "portable ISA" - portable at a machine code level: binary portable, for any of the several versions - RISC-V with 32-bit instructions, with scalar register width (XLEN) of 32 or 64 bits.
Java byte codes, and the many other examples of stack-based software virtual machines, are examples of portable ISAs at a high level.
GPU intermediate languages are portable ISAs in the sense that they target multiple microarchitectures. Some are company specific, some are intended to be portable across many hardware vendors.
Enough preliminaries. I might as well admit that these thoughts are largely inspired by what I perceive to be weaknesses in RISC-V.
RISC-V could be used as a portable ISA. However, IMHO RISC-V is handicapped by the tradeoffs inevitably associated with trying to encode a large number of desired operations into a small number of instruction bits - typically 32-bit instructions, with 3 or 4 input registers, and 1 output register. Although RISC-V variants with 64-bit and 128-bit long instructions are anticipated, AFAIK machines that support such wide instruction widths will also be required to support the original, legacy, 32-bit wide instructions.
It might be better for a portable ISA to be defined in a format that was not so constrained by instruction width.
- either variable-length binary, or textual

However, it also makes sense to define a limited number of projections from the unconstrained ISA to particular machine code representations.
- e.g. 32-bit fixed length instructions - classic RISC, 2 src + 1 dest
- e.g. fixed width 128-bit instructions (suitable for GPUs)
- e.g. variable length instructions - e.g. n*16-bit instructions

More generally, a particular embedded system may define a machine code representation that includes only the instructions it requires, possibly reducing code size by omitting representations of unused instructions. Or... by representing the larger space of instructions in a standard wide format, say 64 or 128 bits, while also representing the most frequently used instructions in a shorter variable length format, optimized for code size, both static and dynamic.
H&P tell us that all decisions in computer architecture should be driven by quantitative analysis. But we can only do quantitative analysis when we have a set of candidates to evaluate. In some ways, what I am proposing is that the unconstrained portable ISA support a very large number of "reasonable" instructions. Quantitative analysis should be used to choose which parts of the unconstrained portable ISA should be projected into a particular machine ISA, dependent on workload, HW costs, etc. For that matter, I am also proposing a framework by which extremely workload dependent projections can be made, and still have a chance of portable tools being able to operate on these idiosyncratic and tightly encoded machine code representations.
Terminology: I may say "un/less-constrained portable ISAs" rather than "the unconstrained portable ISA, or the less constrained varieties such as those with instructions 128 bits wide".
I will assume a register based ISA, not stack based or memory based, since most modern machines are register based.
"Register" => storage for a given number of bits. Distinct from memory. Typically not shared. Typically not indirectly addressed.
The unconstrained ISA may support an arbitrary number of registers. Installing such code would require a step to map this arbitrary number of registers to a small fixed number of registers.
One of the first and most obvious constraints would be to use only a small number of registers. E.g. 32 for a classic 32-bit RISC instantiation, 16 for compact code, 128 for GPU style code.
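The installation step mentioned above - mapping an arbitrary register count down to a small fixed one - can be sketched as a tiny allocator. This is a hedged illustration, not part of the proposal: `remap_registers` is a hypothetical install-time pass, and a real one would spill to memory when physical registers run out (omitted here).

```python
# Hedged sketch: mapping an unbounded virtual register space onto a
# fixed physical register file at install time, by freeing dead values.

def remap_registers(code, num_phys):
    """code: list of (dest, src1, src2) virtual-register triples.
    Returns the same code rewritten over physical registers 0..num_phys-1."""
    last_use = {}
    for i, (d, s1, s2) in enumerate(code):
        for v in (d, s1, s2):
            last_use[v] = i
    phys_of = {}                       # virtual -> physical
    free = list(range(num_phys))
    out = []
    for i, (d, s1, s2) in enumerate(code):
        def get(v):
            if v not in phys_of:
                # reclaim registers whose virtual value is dead
                for u, p in list(phys_of.items()):
                    if last_use[u] < i:
                        free.append(p)
                        del phys_of[u]
                phys_of[v] = free.pop(0)   # a real pass would spill if empty
            return phys_of[v]
        out.append((get(d), get(s1), get(s2)))
    return out
```

The same pass, parameterized by num_phys = 16, 32, or 128, serves the compact, classic RISC, and GPU-style projections respectively.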
A further constraint: unified or typed register files - e.g. separate integer and floating point, address or data, or not.
Note: the unified or typed register constraint is separate from the instruction width and encoding. The latter defines the field widths that specify a register - but is agnostic wrt whether they are in a unified register file or separate. Optimized code obviously cares.
A machine code instruction format is mostly agnostic wrt separate or unified register files... EXCEPT for transfers between such register files. The unconstrained ISA should contain (pseudo-)instructions to effect such transfers between register file types, if the projection includes separate register files. Particular implementations may implement these transfers directly, force them to go through memory, and/or use type punning, by overlaying different types on the same register. These transfers between register file types should also include transformations such as widening (sign or zero extending) and narrowing (scaling and rounding, with saturation or wrap).
Issues such as whether 64-bit FP is overlaid as pairs of 32-bit FP values, vs 32-bit FP occupying the low bits of a 64-bit register, must also be specified.
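The transfer transformations mentioned above can be sketched concretely. This is a hedged illustration, not part of the proposal itself: the function names are hypothetical, and the sketch covers only 32/64-bit integer transfers with the widening (sign/zero extend) and narrowing (saturate or wrap) options the text lists.

```python
# Hedged sketch (hypothetical names): transfers between register file
# types with the widening/narrowing transformations described above.

def widen32_to_64(x, signed):
    """Sign- or zero-extend a 32-bit value to 64 bits."""
    x &= 0xFFFFFFFF
    if signed and (x >> 31):
        x |= 0xFFFFFFFF00000000   # replicate the sign bit
    return x

def narrow64_to_32(x, saturate):
    """Narrow a 64-bit unsigned value to 32 bits: saturating or wrapping."""
    if saturate:
        return min(x, 0xFFFFFFFF)
    return x & 0xFFFFFFFF
```

A scaling-and-rounding narrow (for fixed point) would follow the same shape, with a shift and rounding step before the saturate/wrap choice.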
Wide data types such as vectors are typically viewed separately from narrow data types such as scalars. Such separation must be supported - IMHO it is more important than int/FP type separation.
But there is a certain elegance in the "Zilog style" of overlapped register file representations:
- E.g. 32 32-bit registers overlaid on the low 16 64-bit registers
- E.g. 32 64-bit registers overlaid on the lowest 16 128-bit registers
- => 32 32-bit registers overlaid on the low 8 128-bit registers

and so on for whatever register widths are desired (VLEN in RISC-V parlance). IMHO 512 bits and 2K bits are reasonable assumptions.
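The overlaid naming in the bullets above amounts to a simple address computation into one flat register file. A hedged sketch, assuming (as in the bullets) 32 architectural names at each width overlaid on a flat file of 128-bit registers; `byte_span` is a hypothetical helper name:

```python
# Hedged sketch: Zilog-style overlaid register naming. One flat file of
# 128-bit registers; narrower register names alias its low portion.

NUM_NAMES = 32   # assumption: 32 architectural names at each width

def byte_span(width, index):
    """Return (start_byte, end_byte) occupied by register <index> of
    <width> bits within the flat 128-bit register file."""
    assert width in (32, 64, 128) and 0 <= index < NUM_NAMES
    nbytes = width // 8
    start = index * nbytes
    return (start, start + nbytes)
```

So 32-bit registers 2 and 3 occupy exactly the bytes of 64-bit register 1, reproducing the overlay described above.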
I have seen too many examples where the RISC constraint of only writing a single 32 or 64 (or 128 or 512) bit destination is overly constraining, leading to inefficient code.
Worse - IMHO it is important to encourage checks for integer overflow, etc. But determining whether an integer overflow has occurred for arbitrary code takes a lot of instructions.
The unconstrained instruction set should support operations that write multiple logical registers
- e.g. swap r1 :=: r2, or arbitrary permutations of registers
- which may be mapped to register to register moves, or register renaming operations
- e.g. widening operations,
- such as 2x-widening multiplication, (Rhi,Rlo) := Ra * Rb,
- or 4x widening multiply-accumulate (R3..R0) += Ra * Rb

although in the completely unconstrained representation, each register might be considered to have a variable number of bits.
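The two widening examples above can be sketched by modeling 64-bit registers as masked Python ints. A hedged illustration only - the function names are hypothetical, and unsigned semantics are assumed:

```python
# Hedged sketch: multi-destination widening operations, unsigned,
# with 64-bit registers modeled as masked Python ints.

MASK64 = (1 << 64) - 1

def widening_mul(ra, rb):
    """(Rhi, Rlo) := Ra * Rb - a 64x64 -> 128-bit multiply writing
    two destination registers."""
    p = (ra & MASK64) * (rb & MASK64)
    return (p >> 64) & MASK64, p & MASK64

def widening_mac4(acc, ra, rb):
    """(R3..R0) += Ra * Rb - a 256-bit accumulator held in four
    64-bit registers, least significant first."""
    total = sum(r << (64 * i) for i, r in enumerate(acc))
    total = (total + (ra & MASK64) * (rb & MASK64)) & ((1 << 256) - 1)
    return tuple((total >> (64 * i)) & MASK64 for i in range(4))
```

Note how naturally these write 2 or 4 logical destinations - exactly what a single-destination RISC encoding cannot express directly.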
The standard RISC argument that condition code side effects impede high performance implementations or out-of-order execution is not true in general - some of the highest performance microarchitectures support condition codes. However, it is important that such condition codes be all-or-nothing: written as a whole, rather than partially written or merged with previous values.
The unconstrained portable ISA should support such condition code side effects,
- e.g. (CCd,Rd) := ADD Ra + Rb, where width(Rd) = width(Ra) = width(Rb)

There can be multiple condition codes in general: usually separate, possibly overlaid on integer registers, possibly associated with any particular register.
- e.g. Rd := ADD Ra + Rb , where width(Rd) = width(Ra) = width(Rb)
- with predicates like unsigned_overflow(Rd) and signed_overflow(Rd)
- with definitions of the state required
For integer operations, such condition codes should include signed and unsigned overflow, which otherwise cannot be easily computed from the same-width output. There is little value in providing condition codes in the unconstrained portable ISA for predicates such as negative or zero or parity, since those can be easily determined from the result. I.e. of the traditional CONZ condition codes (C=carry (aka unsigned overflow), O=signed overflow, N=negative, Z=zero), only CO need be provided. NZ can be computed, as long as the ISA contains instructions such as "BLT: branch if register is negative".
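The claim that C and O cannot be cheaply recovered from a same-width result, while N and Z can, is easy to make concrete. A hedged sketch (hypothetical function name) for 32-bit ADD:

```python
# Hedged sketch: computing the C and O flags for a 32-bit add -
# the two CONZ flags that are not derivable from the result alone.

MASK32 = (1 << 32) - 1

def add32_flags(a, b):
    """Return (result, C, O): the 32-bit sum, unsigned overflow (carry),
    and signed overflow."""
    r = (a + b) & MASK32
    carry = (a + b) > MASK32                 # C: unsigned overflow
    sa, sb, sr = a >> 31, b >> 31, r >> 31   # sign bits
    overflow = (sa == sb) and (sa != sr)     # O: same-sign inputs, flipped sign out
    return r, carry, overflow
```

By contrast, N is just `r >> 31` and Z is just `r == 0` - computable from `r` after the fact, so only C and O need dedicated support.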
However, I suspect that an important projection to machine code will have all 4 CONZ condition codes. Particularly compacted variable length instructions with reduced register count.
Unconstrained => multiple condition codes, and/or associated per register. Constrained => usually only one set of condition codes. Possibly a SetCC bit, although that wastes encoding space for compact 16 or 32 bit instruction widths, and might be reserved for 64 or 128 bit ISA projections.
However, non-condition-code implementations are also common. Therefore the portable unconstrained ISA should provide separate instructions to compute one or more of the condition codes:
- e.g. Rd := low32_bits( Ra + Rb ); Rcc := signed_overflow( Ra + Rb )
- e.g. Rd := low32_bits( Ra << shamt ); Rcc := or_reduce( bits_shifted_out_32( Ra << shamt ) )
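The second example above - a separate instruction yielding the OR-reduction of the bits shifted out - can be sketched as follows (hedged, hypothetical names):

```python
# Hedged sketch: the shift result and its lost-bits predicate as two
# separate instructions, per the example above.

MASK32 = (1 << 32) - 1

def shl32(ra, shamt):
    """Rd := low32_bits( Ra << shamt )."""
    return ((ra & MASK32) << shamt) & MASK32

def shl32_bits_lost(ra, shamt):
    """Rcc := or_reduce( bits shifted out of Ra << shamt )."""
    shifted_out = ((ra & MASK32) << shamt) >> 32   # bits above bit 31
    return 1 if shifted_out != 0 else 0
```

A projection with condition code side effects would fuse these into one instruction; a projection without them keeps them separate, as the text says.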
Side effect condition codes can be mapped to these separate instructions, or vice versa.
Particular machine-code implementations will provide one or the other, and require appropriate optimizations.
Similar considerations apply to vector condition codes: ideally there would be a set of CO(NZ) per vector element.
For packed SIMD RF representations - fixed number of bits, variable number of elements - this implies a variable number of condition codes - condition codes packed corresponding to the different packed fields. If 4-bit CONZ, this suggests 4-bits as the narrowest element width. 2-bits for CO. 1 bit if the type of condition is specified as part of the operation (only in the unconstrained or least constrained projections such as 128-bit).
Vector masks/flags are one of the standard issues in vector ISA design: flag granularity (e.g. one set of flags per 32-bits?) , how many flag bits per vector element? overlaid or separate, if separate how many flag registers?
The considerations above suggest that the unconstrained representation have flag vectors of the same bit width as the physical vector width, to support packed SIMD, in a unified (infinite) RF. But, of course, there will be constrained projections to machine code that have separate mask/flag registers, possibly only a single mask register with a single bit per 32-bit data element.
The standard vector predication is
- Vd[i] := VOP( Va[i] + Vb[i] ) IF Vcond[i] ELSE unchanged
- or (Vfout,Vd[i]) := VOP( Va[i] + Vb[i] ) IF Vfin[i] ELSE unchanged

Compound conditions are possible:
- Vd[i] := VOP( Vs1[i] + Vs2[i] ) IF cond1(Vs1[i]) OR cond2(Vs2[i])
(This last using the "flags associated with each data value" approach.)
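The predication forms above can be sketched over Python lists standing in for vector registers. A hedged illustration, with hypothetical names, showing the "ELSE unchanged" merge semantics and the compound per-input-condition form:

```python
# Hedged sketch: standard and compound vector predication, with lists
# as vector registers.

def vadd_masked(vd, va, vb, vcond):
    """Vd[i] := Va[i] + Vb[i] IF Vcond[i] ELSE unchanged."""
    return [a + b if c else d for d, a, b, c in zip(vd, va, vb, vcond)]

def vadd_compound(vd, vs1, vs2, cond1, cond2):
    """Vd[i] := Vs1[i] + Vs2[i] IF cond1(Vs1[i]) OR cond2(Vs2[i])."""
    return [a + b if (cond1(a) or cond2(b)) else d
            for d, a, b in zip(vd, vs1, vs2)]
```

The "unchanged" merge is what makes the destination also an input - another place where pure 2-source/1-destination encoding is strained.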
Conditions associated with inputs arise in operations such as convolutions of irregular patches, e.g. 2D (although the general form of such operations involves fractional moves, and hence numeric filtering and inter/extrapolation, not just masking).
SIMT style predication uses a vector mask to indicate control flow where divergent control flow branches in individual SIMT lane threads are mapped to vector masks for the converged execution engine that executes multiple such threads in lockstep.
Standard: SIMT vector ISAs should at the very least have branch if none/any/all mask bit is set (basically, zero/non-zero) (more generally, if none/any/all conditions are met by vector mask/flag element.)
More generally, combined mask computation and none/any/all branches... although this might be done as separate instructions, fused dynamically or if statically projected.
SIMT threading can take advantage of more than a single mask input, especially if the lane threads are themselves using predicated execution (branchless code):
- Vd[i] := VOP( Vs1[i] + Vs2[i] ) IF intraLaneMask[i] AND interLaneSimtMask[i]

Note, however, that the intraLaneMask[i] may be packed SIMD in the usual manner (more mask bits for smaller vector element widths), whereas SIMT lanes probably correspond to a fixed granularity. I.e. SIMD within a conceptual lane thread, SIMT between.
Vd.n[i] := VOP( Vs1.n[i] + Vs2.n[i] ) IF cond.intraLaneMask.n[i] AND cond.interLaneSimtMask.m[i]
where m = SIMT lane width, e.g. 32 bits, but n may be 8, 16, or 32... (or even 64 with register pairs)
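The none/any/all branches and the two-mask predication form above can be sketched together. Hedged, with hypothetical names; masks are simple bit lists here, ignoring the packed-SIMD granularity subtlety just noted:

```python
# Hedged sketch: mask-reduction branches plus two-mask SIMT predication.

def branch_none(mask):
    """Taken iff no mask bit is set (the 'zero' reduction)."""
    return not any(mask)

def branch_any(mask):
    """Taken iff at least one mask bit is set."""
    return any(mask)

def branch_all(mask):
    """Taken iff every mask bit is set."""
    return all(mask)

def vop_two_masks(vd, vs1, vs2, intra_lane_mask, inter_lane_simt_mask):
    """Vd[i] := Vs1[i] + Vs2[i] IF intraLaneMask[i] AND interLaneSimtMask[i]."""
    return [a + b if (m1 and m2) else d
            for d, a, b, m1, m2 in zip(vd, vs1, vs2,
                                       intra_lane_mask, inter_lane_simt_mask)]
```

A SIMT engine uses `branch_none` on the combined mask to skip a divergent region entirely, and `branch_all` to run it unmasked.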
Although there is value in a SIMT aware compiler adjusting number of threads / lane width per physical vector register as element width changes.
Vector operations tend to be hard-pressed to fit within a small 32-bit fixed width instruction encoding.
- Starting with vector-vector operations Vd.n[i] := VOP( Vs1.n[i] + Vs2.n[i] )
- then memory operations - unit stride, stride N, scatter/gather
- then masks
Vd.n[i] := VOP( Vs1.n[i] + Vs2.n[i] ) IF Vmask.cond[i]
- sometimes both input and output masks, sometimes more than one input mask, as described above
- then reductions
- then prefixes Vd.n[i] := SUM( Vs1.n[j] ) j = 0..i
- then 2x widening operations
- packed SIMD
- HI/LO forms (lane crossing, if overlaid)
  LO: Vd.2n[i] := Vs1.n[i] + Vs2.n[i]
  HI: Vd.2n[i] := Vs1.n[i+h] + Vs2.n[i+h], where h = VLEN/2/n
- Even/Odd forms (non-lane crossing)
  Vd.2n[j] := Vs1.n[2j+eo] + Vs2.n[2j+eo], where eo = 0,1
- packed SIMD
- then 4x widening then ...
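Two of the forms listed above - prefixes and the even/odd widening add - can be sketched over lists as vector registers. Hedged, hypothetical names; since Python ints are unbounded, widening is simply exact arithmetic:

```python
# Hedged sketch of two forms from the list above: prefix sums, and the
# even/odd (non-lane-crossing) widening add.

def vprefix_sum(vs1):
    """Vd.n[i] := SUM( Vs1.n[j] ), j = 0..i."""
    out, acc = [], 0
    for x in vs1:
        acc += x
        out.append(acc)
    return out

def vwadd_even_odd(vs1, vs2, eo):
    """Even/odd widening add: Vd.2n[j] := Vs1.n[2j+eo] + Vs2.n[2j+eo],
    eo = 0 (even) or 1 (odd); the result has half as many, wider, elements."""
    return [vs1[2 * j + eo] + vs2[2 * j + eo] for j in range(len(vs1) // 2)]
```

The even/odd form halves the element count per instruction, which is exactly why it avoids lane crossing: each 2n-bit destination element is built from the n-bit elements already in its lane.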
RISC-V's approach to the combinatoric explosion is to express vector instructions in terms of a SEW (Selected Element Width), requiring changes to the control register holding SEW as operand width changes. Some instructions have operand EEW (Effective Element Width) expressed as scalings of the global SEW. Similarly for LMUL (length multipliers). Essentially this amounts to creating a variable length instruction encoding for vectors, but even more so: the same code may be interpreted as operating on different data widths, if SEW and LMUL are set to different values on different paths that branch into shared code.
RISC-V's predecessors used a polymorphic register file...
While I understand the motivation for RISC-V's SEW/LMUL instruction encodings, IMHO the technique should be rejected for the unconstrained ISA, and for the wider ISAs like 128-bits. More constrained projections may explore SEW/LMUL techniques.
The unconstrained portable ISA should support many of these, with explicit data widths - indeed, a good argument for being able to represent vector instructions as FOR-ALL loops across vector elements, where the computations are over the vector elements, possibly textual, in the unconstrained form. Many can be supported in the wider instruction formats - e.g. 128-bit vector instructions.
Early vector machines like the BSP and the CDC STAR-100 were memory-to-memory. Although the vector lengths might be parameterized, the vector operands tended to have the same number of elements.
The Cray-1 is the prototypical vector register machine, with 8 vector registers, each of 64 64-bit elements. Narrower 32-bit data was typically loaded one 32-bit value per 64-bit vector element => we can consider the vector registers as having the same number of elements no matter the data type. Other vector ISAs inspired by the Cray machines, such as the Gould NP1 were even more explicit in this regard. However, smart programmers were sometimes able to perform computations on 2 32-bit values packed into a single 64-bit vector element, and so on.
SIMD packed vector instructions, such as Intel MMX, SSE, etc. turned this around: instead of a fixed number of elements per vector register, a "packed SIMD register" was a fixed number of bits, interpreted as a different number of elements according to data type.
SIMD packed vectors require more instructions to pack and interleave data of different widths. Typically the physical vector width is smaller - 64b/MMX, 128b/SSE, then 256 and 512, vs 64x64=4K bits for the Cray-1 and many GPUs. There is a temptation to provide operations that cross the smaller data widths.
Compilers for vector machines (both classic and packed) strip-mine vector loops into physical register sized chunks. Obviously easy for homogeneous data widths. Scaled addressing modes are particularly convenient, allowing the same counter to be used for different data widths. Classic and packed diverge for mixed data widths: classic widens into a uniform number of elements, whereas packed tends to load a fixed number of bits, then unpack and shuffle to align corresponding vector elements.
IMHO an un/less-constrained vector ISA should support both approaches. It should have widening and narrowing loads and stores, to permit fixed number of vector elements, variable width in memory, aligned same width in registers, aligned in corresponding vector elements. But it should also allow fixed number of bits operations. Indeed, there is not much difference for the computational instructions, except wrt vector masks as mentioned above. Packed tends to require vector mask predicates at the smallest data width.
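Strip-mining itself is simple enough to sketch. A hedged illustration (hypothetical name) of a vector loop chunked to a physical vector length, with the tail handled by a shorter final chunk:

```python
# Hedged sketch: strip-mining an elementwise add into
# physical-register-sized chunks of vlen elements.

def strip_mine(src_a, src_b, vlen):
    """Compute a[i] + b[i] for all i, vlen elements at a time."""
    n = len(src_a)
    out = []
    for base in range(0, n, vlen):
        chunk = min(vlen, n - base)                  # shorter tail chunk
        va = src_a[base:base + chunk]                # "vector load" a
        vb = src_b[base:base + chunk]                # "vector load" b
        out.extend(x + y for x, y in zip(va, vb))    # "vector add" + "store"
    return out
```

The classic-vs-packed divergence shows up inside the loop body: a classic machine would load mixed-width operands with widening loads into the same element count, while a packed machine would load fixed bit counts and shuffle.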
While it would seem that packed vector ISAs require more data shuffling, I believe this is a red herring. First, in my experience all vector ISAs must be careful of sliding down the slippery slope of inter-element computations: first prefixes, then nearest neighbour - or maybe in the opposite order. Second, packed vector ISAs tend to provide more shuffling operations (a) if they lack widening loads and narrowing stores, for vector-vector corresponding element operations, and (b) because packed vector ISAs at first tended to have relatively narrow physical vector widths - 64/MMX, 128/SSE1... On such relatively narrow widths, permutations are practical. However, as physical vector widths grow to 512 bits and beyond, long distance permutations are increasingly impractical.
Therefore, IMHO define a lane width - say 128 bits. Allow a reasonable set of permutations within this lane width, but strongly resist going beyond it.
128-bits seems to me to be reasonable as a maximum architectural lane width, since 64x64=128 bit multipliers are common - might as well use the full width, beneficial for crypto, as well as complex arithmetic on pairs of FP64 values. Possibly 4xFP32 elements as are common in graphics, although those are usually planar. Tiles: 2x2x32, 4x4x16, ...
The un/less constrained portable ISA should support lane widths of 16/32/64/128-bits - although for the most part there are no special instructions for the narrower lane widths. Implementations may constrain the lane widths supported. IMHO it is unreasonable in general to require a JITter to adapt code that exploits a 128-bit wide lane width to a 32-bit narrow lane width - so the distribution format should indicate it. Projections to compact machine code may take advantage of reduced encoding pressure for narrower lane widths.
A third reason for the proliferation of shuffles is the RISC 2-input 1-output straightjacket. Many permutations, even when restricted to within a 128-bit lane, are of the form concat(vs1,vs2) permute ..., and are not amenable to srcdest form.
It would be nice if the unconstrained portable ISA could reflect an arbitrary permutation within the lane. It is possible to express an arbitrary permutation of the 8 bytes of a 64-bit number as an 8x8 bit matrix; more compactly, only 6 bits are required to specify each byte. Or 16x7 bits per element for a 128-bit lane. Therefore, the portable ISA should support byte granular permutations within a lane up to 128 bits. There's even room to encode zero / sign extension.
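The per-byte-selector encoding just described can be sketched as follows. Hedged, hypothetical names: one selector per destination byte picks a source byte within the lane, with a reserved value standing in for the "zero" option mentioned above (a sign-extend option would be encoded similarly):

```python
# Hedged sketch: byte-granular permutation within a lane, one selector
# per destination byte; selector -1 models the "zero" encoding.

def byte_permute(lane, selectors):
    """lane: list of source byte values (up to 16 for a 128-bit lane).
    selectors[i]: index of the source byte for destination byte i,
    or -1 to zero that destination byte."""
    assert len(lane) <= 16
    return [0 if s < 0 else lane[s] for s in selectors]
```

For a 16-byte lane each selector needs 4 bits to address a source byte, plus encoding room for zero/sign options - consistent with the 16x7-bit sizing above.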
This works for a permutation control contained in a 128-bit register. Frequently that would be a scalar register - except that for the most part scalar registers are only 64 bits wide, necessitating a register pair, which is even more RISC hostile. (The un/less-constrained portable ISA should allow any 128-bit vector "element" to be used as a scalar - natural for packed vectors, doable by concatenating classic vector elements.)
It even works when the permutation is different in each 128-bit lane, as may be needed for packed data elements whose size is not a nice power of two. Although that requires cross-lane data movement, and in my experience that is often obviated by good scatter/gather support.
If the permutation is constant, known at compile time... most ISAs do not support the 64-bit or 128-bit constants necessary to express permutations in 128-bit lanes. Or even 32-bit constants. Note that even 128-bit wide instructions are challenged to provide 128-bit byte permutation constants. Nevertheless, the unconstrained variable instruction length portable ISAs should support constants this wide. (In general, efficient constant support is important.) Particular patterns (like the various packs, unpacks, and shuffles) correspond to particular instances of these wide constants - and may be provided narrower encodings for the more constrained ISA projections.
(The un/less constrained vector ISA should support long-distance data movement - but mostly in the form of vector register-register permutations. Common patterns would be hardwired. Which is basically a handwave over the proliferation of shuffle and pack instructions.)