i64x2.widen_(low/high)_i32x4_(s/u) instructions extensions for 8-bit and 16-bit integers #372
Comments
This operation doesn't have a direct equivalent in ARM NEON or ARM64.
I'm still investigating myself, but I have found what seems to be an efficient solution for widening 16 8-bit integers to 16 32-bit integers. You can see the Godbolt here. (Edited to fix a typo in the Godbolt. This is definitely a work in progress.)
This Godbolt has three samples showcasing different implementations on Intel. Interestingly, the shortest implementation underperforms both of the longer ones because pmovzxbd and palignr block the shuffle port. The two longer ones finish in 604 cycles, whereas the shortest one takes 704.
@Maratyszcza Please take a look at this Godbolt. Regarding your earlier point about missing ARM instructions, you're right. However, if you look at this analysis, you'll see that on ARM the 6-instruction pass that yields 4 vectors of 32-bit integers is most efficient in two specific cases:
This is mostly because you don't get the individual 32-bit conversion efficiency seen with lesser ops, or the large unsigned 32-bit cases, where you can take advantage of the table indices multiple times. If the load of the indices only needs to happen once, the table transform will outperform in all cases where the data output will remain unsigned. The signed versus unsigned distinction is important on ARM64 since On the other hand, x64 chips get tremendous performance advantages no matter which way you look at it, and that's despite a great increase in the number of instructions. Going from 6 shuffles (
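To make the signed versus unsigned point concrete, here is a minimal scalar sketch of a table-lookup widening, assuming the "table transform" above refers to a byte table lookup in the style of ARM64 TBL, where an out-of-range index yields a zero byte. The helper names `tbl16` and `make_zext_indices` are hypothetical and are not taken from the linked Godbolt. Because a lookup can only copy a source byte or produce zero, it can zero-extend but not sign-extend, which is why this approach would only apply when the output stays unsigned.

```c
#include <stdint.h>

/* Scalar model of a 16-byte table lookup with ARM64 TBL-style semantics:
   any index outside 0..15 produces a zero byte. */
static void tbl16(const uint8_t table[16], const uint8_t idx[16], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = idx[i] < 16 ? table[idx[i]] : 0;
}

/* Hypothetical helper: build the index vector that zero-extends source
   bytes part*4 .. part*4+3 into four little-endian 32-bit lanes.
   The low byte of each lane selects a source byte; the upper three
   bytes use the out-of-range index 0xFF so the lookup writes zeros. */
static void make_zext_indices(unsigned part, uint8_t idx[16]) {
    for (unsigned k = 0; k < 4; ++k) {
        idx[4 * k + 0] = (uint8_t)((part & 3u) * 4u + k);
        idx[4 * k + 1] = 0xFF;
        idx[4 * k + 2] = 0xFF;
        idx[4 * k + 3] = 0xFF;
    }
}
```

The four index vectors (one per `part`) depend only on the lane layout, so in a loop they could be built once and reused for every input vector, which matches the remark about loading the indices only once.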
Introduction
This proposal mirrors #290 to add new variants of existing widen instructions and extends the 32-bit and 64-bit widen instructions to support widening from 16-bit and 8-bit integers. The practical use case for this is signal processing, specifically audio and image processing, but the use cases are broad in general. For a non-image-processing use case, these could be very helpful any time someone wants to convert an 8-bit value to a floating-point number. Currently, this requires multiple conversion steps between integer widths before converting to float, but modern architectures provide operations to convert from just about any integer size to another. Because the widening ratio is no longer a simple 2:1 (for example, 8 bits to 64 bits), this instruction replaces the high/low terminology with a constant immediate parameter. This ticket will serve as the foundation for the PR that follows and will be updated with implementation details for each instruction set.
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
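Before any target-specific lowering, it may help to pin down the intended lane semantics with a scalar reference model. The sketch below is an assumption about how the constant immediate from the introduction could behave: `part` (0..3) selects which group of four 8-bit lanes is widened to 32 bits. The function names are hypothetical and do not come from the proposal. The last helper illustrates the 8-bit to float use case as a single widening step followed by a conversion.

```c
#include <stdint.h>

/* Hypothetical reference model: sign-extend four 8-bit lanes,
   selected by the immediate `part` (0..3), to 32-bit lanes. */
static void widen_i8x16_to_i32x4_s(const int8_t in[16], unsigned part, int32_t out[4]) {
    for (unsigned i = 0; i < 4; ++i)
        out[i] = (int32_t)in[(part & 3u) * 4u + i];
}

/* Hypothetical reference model: zero-extend four 8-bit lanes,
   selected by the immediate `part` (0..3), to 32-bit lanes. */
static void widen_i8x16_to_i32x4_u(const uint8_t in[16], unsigned part, uint32_t out[4]) {
    for (unsigned i = 0; i < 4; ++i)
        out[i] = (uint32_t)in[(part & 3u) * 4u + i];
}

/* Illustration of the float use case: widen unsigned 8-bit lanes
   directly to 32 bits, then convert each lane to float. */
static void u8x16_part_to_f32x4(const uint8_t in[16], unsigned part, float out[4]) {
    uint32_t wide[4];
    widen_i8x16_to_i32x4_u(in, part, wide);
    for (unsigned i = 0; i < 4; ++i)
        out[i] = (float)wide[i];
}
```

Under this assumed encoding, `part = 0` and `part = 3` would play the roles that low and high play in the existing 2:1 widen instructions, with `1` and `2` covering the lanes in between.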
x86/x86-64 processors with AVX instruction set
x86/x86-64 processors with SSE4.1 instruction set
ARM64 processors
ARMv7 processors with NEON instruction set