This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Sign Select instructions #124

Closed

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Oct 28, 2019

Introduction

Piece-wise approximations with a discontinuity at zero are common in numerical algorithms. Such approximations take the form y := x < 0 ? f(x) : g(x). E.g. in artificial neural networks, the ReLU [1], Leaky ReLU [2], Parametric ReLU [3], ELU [4], and SELU [5] functions involve such piece-wise approximations. Selecting between the two pieces depending on the sign of the input can be implemented with two WebAssembly SIMD instructions from the current specification: one instruction to replicate the highest bit of each input element into all bits of that element so it can be used as a bit-selection mask, and the bitselect instruction to choose between the two values, f(x) and g(x), depending on the constructed selection mask. There are, however, two considerations which make such a decomposition suboptimal in practice:

  1. There are two ways to replicate the high bit of the input elements: do an integer comparison with zero (e.g. i32x4.lt_s(x, v128.const(0)) for 32-bit elements) or do an integer arithmetic shift right by sizeof(element) * CHAR_BIT - 1 (e.g. i32x4.shr_s(x, 31) for 32-bit elements). The optimal way to do the replication depends on the target architecture: typically integer comparison is more efficient than integer shift, but if the WebAssembly engine has to explicitly instantiate v128.const(0), the trade-off skews the other way. ARM NEON and ARM64 provide instructions for integer comparison with zero (e.g. VCLT.S32 Qd, Qn, #0/CMLT Vd.4S, Vn.4S, #0 for 32-bit elements), but on x86 a comparison with zero would always require two instructions and an extra register, so arithmetic shift right (e.g. PSRAD xmm_x, 31) is often preferable.
  2. Since SSE4.1, x86 processors provide PBLENDVB/BLENDVPS/BLENDVPD instructions which choose between two vectors of 8-/32-/64-bit elements depending on the sign of the elements in the selection mask. Without WebAssembly instructions that map directly to these SSE4.1 instructions, WebAssembly SIMD engines have to generate 4-5 instructions (1-2 to construct the bit mask, and 3 to emulate v128.bitselect) where 1 would suffice.

To enable efficient implementation of piece-wise approximations with a discontinuity at zero, this PR introduces new Sign Select instructions, which select between elements of two input vectors depending on the sign of the corresponding elements in a mask vector.
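
For reference, the intended per-lane semantics can be sketched with scalar code (illustrative only; lane count and width follow the instruction name, here the 32-bit variant):

#include <stdint.h>

// Illustrative scalar model of v32x4.signselect (not normative): each output
// lane takes v1's lane if the corresponding lane of the mask c is negative
// (sign bit set), and v2's lane otherwise.
void v32x4_signselect_scalar(const int32_t v1[4], const int32_t v2[4],
                             const int32_t c[4], int32_t out[4]) {
  for (int i = 0; i < 4; i++) {
    out[i] = (c[i] < 0) ? v1[i] : v2[i];
  }
}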

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code-generation patterns.

x86/x86-64 processors with AVX instruction set

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to VPBLENDVB xmm_y, xmm_v2, xmm_v1, xmm_c
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to VPSRAW xmm_tmp, xmm_c, 15 + VPBLENDVB xmm_y, xmm_v2, xmm_v1, xmm_tmp
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to VBLENDVPS xmm_y, xmm_v2, xmm_v1, xmm_c
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to VBLENDVPD xmm_y, xmm_v2, xmm_v1, xmm_c

x86/x86-64 processors with SSE4.1 instruction set

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to:
      • MOVDQA xmm0, xmm_c
      • MOVDQA xmm_y, xmm_v2
      • PBLENDVB xmm_y, xmm_v1 [implicit xmm0 input]
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to:
      • PXOR xmm0, xmm0
      • PCMPGTW xmm0, xmm_c
      • MOVDQA xmm_y, xmm_v2
      • PBLENDVB xmm_y, xmm_v1 [implicit xmm0 input]
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to:
      • MOVAPS xmm0, xmm_c
      • MOVAPS xmm_y, xmm_v2
      • BLENDVPS xmm_y, xmm_v1 [implicit xmm0 input]
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to:
      • MOVAPD xmm0, xmm_c
      • MOVAPD xmm_y, xmm_v2
      • BLENDVPD xmm_y, xmm_v1 [implicit xmm0 input]
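
For readers more familiar with intrinsics, a rough sketch of the SSE4.1 lowerings above (function names are made up for illustration; the i16x8 variant uses an arithmetic shift instead of the PCMPGTW sequence listed, which produces the same mask):

#include <emmintrin.h>  // SSE2
#include <smmintrin.h>  // SSE4.1

// v8x16.signselect: PBLENDVB keys off the sign bit of each byte of c.
static __m128i signselect_i8x16(__m128i v1, __m128i v2, __m128i c) {
  return _mm_blendv_epi8(v2, v1, c);  // picks v1 where c's byte is negative
}

// v16x8.signselect: broadcast each 16-bit sign into the whole element first,
// since PBLENDVB only looks at per-byte sign bits.
static __m128i signselect_i16x8(__m128i v1, __m128i v2, __m128i c) {
  return _mm_blendv_epi8(v2, v1, _mm_srai_epi16(c, 15));
}

// v32x4.signselect / v64x2.signselect: BLENDVPS/BLENDVPD look only at the
// per-element sign bit, so c can be used as the mask directly.
static __m128 signselect_f32x4(__m128 v1, __m128 v2, __m128 c) {
  return _mm_blendv_ps(v2, v1, c);
}
static __m128d signselect_f64x2(__m128d v1, __m128d v2, __m128d c) {
  return _mm_blendv_pd(v2, v1, c);
}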

x86/x86-64 processors with SSE2 instruction set

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTB xmm_tmp, xmm_c
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTW xmm_tmp, xmm_c
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTD xmm_tmp, xmm_c
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTD xmm_tmp, xmm_c
      • PSHUFD xmm_tmp, xmm_tmp, 0xF5
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
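
Roughly the same SSE2 pattern as intrinsics (a sketch; PANDN corresponds to _mm_andnot_si128, and the 0xF5 shuffle copies each high 32-bit comparison result over the low half of its 64-bit element):

#include <emmintrin.h>  // SSE2

// v32x4.signselect on SSE2: build an all-ones mask in lanes where c is
// negative, then select with and/andnot/or.
static __m128i signselect_i32x4_sse2(__m128i v1, __m128i v2, __m128i c) {
  const __m128i mask = _mm_cmpgt_epi32(_mm_setzero_si128(), c);  // 0 > c
  return _mm_or_si128(_mm_and_si128(mask, v1), _mm_andnot_si128(mask, v2));
}

// v64x2.signselect on SSE2: no 64-bit compare, so compare the 32-bit halves
// and replicate each element's high-half result across the whole element.
static __m128i signselect_i64x2_sse2(__m128i v1, __m128i v2, __m128i c) {
  __m128i mask = _mm_cmpgt_epi32(_mm_setzero_si128(), c);
  mask = _mm_shuffle_epi32(mask, _MM_SHUFFLE(3, 3, 1, 1));  // imm8 = 0xF5
  return _mm_or_si128(_mm_and_si128(mask, v1), _mm_andnot_si128(mask, v2));
}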

ARM64 processors

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to CMLT Vy.16B, Vc.16B, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to CMLT Vy.8H, Vc.8H, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to CMLT Vy.4S, Vc.4S, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to CMLT Vy.2D, Vc.2D, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
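
Expressed with AArch64 NEON intrinsics, the same two-instruction pattern looks roughly like this (a sketch; the listings above remain the intended lowering):

#include <arm_neon.h>

// v32x4.signselect on AArch64: compare-less-than-zero yields an all-ones mask
// in negative lanes, and BSL does a bitwise select with that mask.
static int32x4_t signselect_i32x4_neon(int32x4_t v1, int32x4_t v2, int32x4_t c) {
  const uint32x4_t mask = vcltzq_s32(c);  // per lane: c < 0 ? ~0 : 0
  return vbslq_s32(mask, v1, v2);
}

// v64x2.signselect: same pattern with the 64-bit compare (AArch64 only).
static int64x2_t signselect_i64x2_neon(int64x2_t v1, int64x2_t v2, int64x2_t c) {
  return vbslq_s64(vcltzq_s64(c), v1, v2);
}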

ARMv7 processors with NEON extension

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to VCLT.S8 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to VCLT.S16 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to VCLT.S32 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to VSHR.S64 Qy, Qc, #63 + VBSL Qy, Qv1, Qv2

Other processors and instruction sets

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i8x16.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i8x16.shr_s(c, 7))
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i16x8.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i16x8.shr_s(c, 15))
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i32x4.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i32x4.shr_s(c, 31))
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i64x2.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i64x2.shr_s(c, 63))
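
With clang's wasm_simd128.h intrinsics, the generic fallback for the 32-bit variant corresponds to something like the sketch below (intrinsic names as in the clang header; both forms replicate the sign of c into a full-lane mask before bitselect):

#include <wasm_simd128.h>

// Compare-based fallback for v32x4.signselect.
static v128_t signselect_i32x4_compare(v128_t v1, v128_t v2, v128_t c) {
  return wasm_v128_bitselect(v1, v2, wasm_i32x4_lt(c, wasm_i32x4_splat(0)));
}

// Shift-based fallback: arithmetic shift right by 31 replicates the sign bit.
static v128_t signselect_i32x4_shift(v128_t v1, v128_t v2, v128_t c) {
  return wasm_v128_bitselect(v1, v2, wasm_i32x4_shr(c, 31));
}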

References

[1] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning (ICML), 2010.
[2] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML), 2013.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv e-prints, February 2015.
[4] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289.
[5] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. arXiv e-prints, 2017.

@ngzhian
Member

ngzhian commented Jul 6, 2020

There are two ways to replicate the high bit of the input elements: do an integer comparison with zero (e.g. i32x4.lt_s(x, v128.const(0)) for 32-bit elements) or do an integer arithmetic shift right by sizeof(element) * CHAR_BIT - 1 (e.g. i32x4.shr_s(x, 31) for 32-bit elements). The optimal way to do the replication depends on the target architecture: typically integer comparison is more efficient than integer shift, but if the WebAssembly engine has to explicitly instantiate v128.const(0), the trade-off skews the other way. ARM NEON and ARM64 provide instructions for integer comparison with zero (e.g. VCLT.S32 Qd, Qn, #0/CMLT Vd.4S, Vn.4S, #0 for 32-bit elements), but on x86 a comparison with zero would always require two instructions and an extra register, so arithmetic shift right (e.g. PSRAD xmm_x, 31) is often preferable.

I think shift right by an immediate is more efficient for all archs (on v8 at least); it's a single instruction.

Comparison with an immediate 0 is not supported in Wasm, so you won't be able to get the fast behavior on ARM; it will be i32x4.lt_s(x, v128.const(0)), as you said, on all archs. [^1]

WebAssembly SIMD engines have to generate 4-5 instructions (1-2 to construct bit mask, and 3 to emulate v128.bitselect) where 1 would suffice.

Yea, bitselect is 4 instructions on x64 :(

Looking at the lowering, this instruction mostly helps AVX, from 5 instructions to 1 (or 2). The ARM codegen looks exactly like what cmp/shift+bitselect would be (if emulated using cmp/shr_s+bitselect).

From the guidelines:

  • hard/expensive to emulate in absence of spec change

This instruction will mostly be helpful for AVX, and is much more expensive without sign select.

  • well supported on 2+ major archs

It is well supported in that codegen on ARM/ARM64 isn't awful, but those archs don't gain from sign select. So it isn't as useful.

SSE4.1 sees slight improvement, but I don't think it's sufficient.

I haven't looked at SSE2 that much, but the codegen looks long (particularly for v64x2.signselect; cmp+bitselect might be shorter).

  • useful in 2+ important use cases

All of the references are machine learning (neural networks) related. I do not have the expertise in this field to distinguish how different they are - they all refer to *LU, and so look the same to my untrained eye. Thus, I'm not certain this counts as "2+".

In comparison, #237 shows the use in codecs/media, machine learning, molecular dynamics, etc.

There were some discussions in #131 regarding mask/select related instructions, which led to bitmask being proposed (and merged). Is there anyone who would like to offer examples of sign select usages in their fields?

[^1] v128.const(0) will be a single instruction

@Maratyszcza
Contributor Author

Maratyszcza commented Jul 7, 2020

@ngzhian One thing to keep in mind is that processors typically can handle integer comparisons in a larger number of SIMD units than shifts. E.g. AMD Zen 2 processors can do 3 integer compare instructions ([V]PCMPGTD) per cycle but only one arithmetic shift instruction ([V]PSRAD) per cycle, and ARM Cortex-A76 cores can do 2 SIMD integer compare instructions per cycle but one SIMD shift per cycle. Thus, compare-based lowering is always preferable when it doesn't increase the number of generated instructions.

@zeux
Contributor

zeux commented Aug 5, 2020

@ngzhian One use of signselect, in workloads where desktop performance is important, is to replace bitselect and get better codegen on x64 in cases where the mask is known to be all 0s/1s - most commonly as a result of a comparison instruction. How feasible do you think it is for TurboFan to generate pblend instead of the current instruction sequence when the input is a comparison mask?

@ngzhian
Member

ngzhian commented Aug 5, 2020

@ngzhian
Member

ngzhian commented Oct 28, 2020

Prototyped in V8 for x64 in https://crrev.com/c/2486235, using pretty weird opcodes:

  V(I8x16SignSelect, 0xfd7d, s_sss)      \
  V(I16x8SignSelect, 0xfd7e, s_sss)      \
  V(I32x4SignSelect, 0xfd7f, s_sss)      \
  V(I64x2SignSelect, 0xfd94, s_sss)      \

Thomas is aware of these opcodes.

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 29, 2020
As proposed in WebAssembly/simd#124, using the opcodes
adopted by V8 in
https://chromium-review.googlesource.com/c/v8/v8/+/2486235/2/src/wasm/wasm-opcodes.h.
Uses new builtin functions and a new target intrinsic exclusively to ensure that
the new instructions are only emitted when a user explicitly opts in to using
them since they are still in the prototyping and evaluation phase.

Differential Revision: https://reviews.llvm.org/D90357
@tlively
Member

tlively commented Oct 29, 2020

This has landed in LLVM using the V8 opcodes and is usable via __builtin_wasm_signselect_{i8x16,i16x8,i32x4,i64x2}. It will be usable in tip-of-tree Emscripten in a few hours. It hasn't landed in Binaryen yet, but that won't be a problem as long as you don't pass optimization flags at link time.

@Maratyszcza Maratyszcza force-pushed the signselect branch 2 times, most recently from 05c0499 to 3d06a28 on December 4, 2020 06:17
@planglois

Hi all!

While looking at the AArch32 implementation in V8, we realized NEON does not support the i64x2 variant of the VCLT instruction. After discussing this with some colleagues, we came up with an alternative implementation:

;; Use the i32x4 variant of the compare with zero instruction.
VCLT.S32 Qy, Qc, #0 
;; The sign bit should be set in lanes 1 and 3, with don't-care values in lanes 0 and 2:
;; | SSSSSSSS | xxxxxxxx | SSSSSSSS | xxxxxxxx |
;; We can shift those down to propagate the sign bit:
VSHR.S64 Qy, Qy, #32
VBSL Qy.16B, Qv1.16B, Qv2.16B

I thought I'd comment on here too in case another VM needs to implement this, better suggestions welcome of course!

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 8, 2020

@planglois Good point. Although a more efficient lowering would be VSHR.S64 Qy, Qc, #63 + VBSL Qy, Qv1, Qv2. Updated the PR description with the new lowering.

@ngzhian
Member

ngzhian commented Dec 10, 2020

This is now prototyped in V8 on ARM, ARM64, IA32 as well.

@Maratyszcza
Contributor Author

The experiment to enable these instructions in XNNPACK is in google/XNNPACK#1236, but fails to link: [parse exception: invalid code after SIMD prefix: 127 (at 0:118627)]

tlively added a commit to tlively/binaryen that referenced this pull request Dec 11, 2020
tlively added a commit to WebAssembly/binaryen that referenced this pull request Dec 12, 2020
@Maratyszcza
Contributor Author

I evaluated the performance impact on two neural network operators in XNNPACK: Sigmoid and ELU. Results for Sigmoid:

| Processor (Device) | Performance with WAsm SIMD + Sign Select | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 7.31 GB/s | 6.87 GB/s | 6% |
| Intel Celeron N3060 | 1.02 GB/s | 0.98 GB/s | 4% |
| AMD PRO A10-8700B | 3.99 GB/s | 3.59 GB/s | 11% |
| AMD A4-7210 | 1.46 GB/s | 1.46 GB/s | 0% |
| Qualcomm Snapdragon 670 (Pixel 3a) | 1.97 GB/s | 1.97 GB/s | 0% |
| Qualcomm Snapdragon 855 (LG G8 ThinQ) | 4.66 GB/s | 4.63 GB/s | 1% |
| Samsung Exynos 8895 (Galaxy S8) | 2.28 GB/s | 2.21 GB/s | 3% |

Results for ELU:

| Processor (Device) | Performance with WAsm SIMD + Sign Select | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 8.53 GB/s | 8.04 GB/s | 6% |
| Intel Celeron N3060 | 1.58 GB/s | 1.60 GB/s | -1% |
| AMD PRO A10-8700B | 4.25 GB/s | 3.90 GB/s | 9% |
| AMD A4-7210 | 1.43 GB/s | 1.41 GB/s | 1% |
| Qualcomm Snapdragon 670 (Pixel 3a) | 2.02 GB/s | 2.06 GB/s | -2% |
| Qualcomm Snapdragon 855 (LG G8 ThinQ) | 4.77 GB/s | 4.75 GB/s | 0% |
| Samsung Exynos 8895 (Galaxy S8) | 3.21 GB/s | 3.22 GB/s | 0% |

@Maratyszcza
Contributor Author

Code changes can be seen in google/XNNPACK#1236

@penzn
Contributor

penzn commented Dec 18, 2020

Why does the Celeron N3060 go from +4% on one benchmark to -1% on the other?

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 18, 2020

There are two factors at play on Celeron:

  • Sign Select reduces register pressure, as we don't need to explicitly compute the mask for v128.bitselect.
  • Sign Select on SSE4.1 lowers to the BLENDVPS instruction, which is very inefficient on the Celeron N3060. In fact, it is worthwhile to detect these types of CPUs (Silvermont, Goldmont, Goldmont Plus cores) and use the SSE2-level code-generation pattern for them.

@zeux
Contributor

zeux commented Dec 22, 2020

Let me unpack my comments from the meeting a bit.

bitselect is not great

There's an existing problem with bitselect in that it results in optimal codegen on ARM, but - on CPUs with a fast blend instruction - it results in suboptimal codegen. v8 synthesizes 4 instructions for bitselect today - mov, xor, and, xor. mov is probably reg-renamed, but the remaining three instructions are dependent on each other, so this has 3-cycle latency; given 3 execution ports this should have rec. throughput of 1. Compared to this, blendps has latency of 2 and throughput of 0.66 on Skylake. This is mentioned in #192. Worth noting that v8's lowering seems particularly inefficient compared to standard and/andnot/or assuming the availability of VEX prefixes, as the latter has latency of 2 (but rec. throughput of 1 all the same).
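
For concreteness, the two bitselect lowerings being compared here look roughly like this in SSE intrinsics (a sketch of the shapes, not of V8's actual emitted code):

#include <emmintrin.h>

// bitselect(a, b, m) = (a & m) | (b & ~m).

// xor-shaped lowering (the mov/xor/and/xor shape described above):
// three dependent bitwise ops, so ~3 cycles of latency.
static __m128i bitselect_xor_shape(__m128i a, __m128i b, __m128i m) {
  return _mm_xor_si128(b, _mm_and_si128(_mm_xor_si128(a, b), m));
}

// and/andnot/or lowering: with non-destructive VEX encodings the AND and
// ANDN are independent, so the critical path is only two ops.
static __m128i bitselect_andnot_shape(__m128i a, __m128i b, __m128i m) {
  return _mm_or_si128(_mm_and_si128(a, m), _mm_andnot_si128(m, b));
}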

signselect seems like a better bitselect but it's not

Crucially, bitselect is optimal for ARM; signselect has a 2-instruction lowering on ARM, and those two instructions are dependent on each other. With hit-or-miss NEON implementations in general, I would probably want to prioritize ARM performance.

So signselect can only really be used when it's replacing a comparison with 0. It's tempting to use it as a general bitselect replacement but oh-so-wrong; doing so will result in performance uplift on Intel systems (... which is where the code is probably tested...), but a small performance degradation on ARM systems.

bitselect can be fixed

During the call we talked about challenges with blendv on Goldmont, which has 4-cycle latency and 4-cycle rec. throughput - not great; this means that this instruction cannot be lowered to blendv directly on any system that has a Goldmont chip, so there's going to be a perf. disparity - on a Goldmont-carrying system we'd need to convert the sign bit to a mask (arithmetic right shift by 31) followed by the usual bitselect code, for a total latency of 4 cycles with the existing v8 lowering and rec. throughput of 2 (Goldmont only has 2 EUs for vectorized bitops, so 4 instructions with rec. throughput of 0.5 each add up to 2).

So assuming this needs to be figured out for any use of blendv, including the proposal above, bitselect codegen can be optimal on Intel.

To do this during codegen, the code generator needs to track whether the register originated from an instruction that can only produce all-1 or all-0 in a given lane; effectively each value grows an extra enum, something like not-a-mask | mask8 | mask16 | mask32 | mask64. Crucially, this propagates throughout the code in a forward fashion - you just need to have each Wasm SIMD value track this as the code is generated, and the type depends entirely on the instruction and the mask-mode of the instruction's arguments. For example, the basic implementation would result in not-a-mask for all instructions but comparisons; a slightly more involved implementation would have logical ops return maskN if both operands are maskN and not-a-mask otherwise.
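
A hypothetical sketch of that forward tracking (names invented for illustration; it only shows the propagation rule, not an actual engine):

// Each Wasm SIMD value carries a tag saying whether it is known to be
// all-0s/all-1s per lane, and at what lane width.
enum class MaskKind { NotAMask, Mask8, Mask16, Mask32, Mask64 };

// Comparisons produce a mask of their lane width.
inline MaskKind maskKindOfCompare(int lane_bits) {
  switch (lane_bits) {
    case 8:  return MaskKind::Mask8;
    case 16: return MaskKind::Mask16;
    case 32: return MaskKind::Mask32;
    case 64: return MaskKind::Mask64;
    default: return MaskKind::NotAMask;
  }
}

// Bitwise ops (and/or/xor/not) preserve mask-ness only when their inputs
// agree; everything else conservatively yields NotAMask, so the analysis is
// a cheap single forward pass over the instruction stream.
inline MaskKind maskKindOfBitwise(MaskKind a, MaskKind b) {
  return (a == b) ? a : MaskKind::NotAMask;
}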

I would argue this can be done with minimal codegen perf loss for any SIMD engine, including ones that do a single pass through Wasm code and generate resulting assembly directly.

I would also argue that any Wasm SIMD engine that wants to have decent performance has to do similar instruction analysis for constants, as shifts can't be lowered efficiently otherwise. In the simplest case, both analyses can be done directly on the arguments (effectively as a peep-hole matcher on the source Wasm instructions) at minimal cost.

If we don't want to fix bitselect, we should add qselect

If, hypothetically speaking, we decide that the above is impractical - which I don't believe, and if we were to decide that I'd ask that we also introduce immediate variants for some instructions like shifts - we could introduce a quasi-select instruction.

There's a behavior mismatch here between modern Intel and ARM systems: on Intel it's most efficient to do a select based on the top bit; on ARM it's most efficient to do a select based on all bits. Resolving the difference on either side creates perf. problems, but if we had an instruction that allowed using any bits from the source lanes when doing the select (and e.g. required this choice to be made consistently during codegen of a given instruction), this would allow us to shift the mask analysis to either the user or the compiler that generates Wasm code.

signselect is less interesting if we have better-bitselect or qselect

Given the importance of selects in SIMD code, I think we have to either fix bitselect in the implementations or introduce qselect; qselect has to wait for post-MVP as it will by necessity have different behaviors on different archs. I would strongly prefer fixing bitselect, but either way, all of this says that on systems that have an efficient blendv, the Wasm engine will be able to synthesize one.

Once we have this, we can compare signselect to explicit compare-to-0 + bitselect/qselect. Given a decent code generation engine, I'd expect:

  • On Intel compare+bitselect would generate compare+blendv, which is 3 cycles of latency and 1 cycle of rec throughput; the variant without compare would have 2 cycles of latency and 0.66 cycles of rec throughput (Skylake)
  • On ARM compare+bitselect and signselect would be equivalent

I suspect that if we had better bitselect lowering, the gains on the benchmarks posted would be smaller accordingly. signselect would still be better, but it would be better in a fairly special case.

Of course improving bitselect situation and adding signselect don't have to be mutually exclusive; signselect will never be worse (given that the codegen engine can detect Goldmont) than an equivalent replacement, but if I had to choose what to spend the resources on, I would focus on making bitselect better.

@omnisip

omnisip commented Dec 22, 2020

@zeux thanks for the comments, really.

I'm working on a design proposal for V8 that will help us perform significantly better optimization across instructions - first by constant folding and hoisting, then by peephole optimization. Would you mind commenting on that when it's ready for review? This is a good use case for the proposal.

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 22, 2020

There are several use-cases for v128.bitselect in XNNPACK right now:

  • In clamping the results (because f32x4.min/f32x4.max are too slow on x86). This use-case would be replaced with f32x4.pmin/f32x4.pmax once it has enough user coverage.
  • In rounding operators. This use-case would be replaced with f32x4.nearest/f32x4.floor/f32x4.ceil/f32x4.trunc once these instructions have enough user coverage.
  • In Leaky ReLU, ELU, PReLU, and Sigmoid operators. In ALL of these operators the mask for v128.bitselect is generated from the sign of the input.

In other words, XNNPACK doesn't really have a use-case for v128.bitselect that wouldn't be covered by Sign Select instructions.
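
To make the third bullet concrete, here is a simplified sketch (not XNNPACK's actual kernel; the helper name and slope argument are illustrative) of the Leaky ReLU pattern, where the bitselect mask comes straight from the sign of the input - exactly what signselect would replace:

#include <wasm_simd128.h>

// x >= 0 ? x : x * slope, with the selection mask derived from x's sign bit.
static v128_t leaky_relu_f32x4(v128_t x, v128_t slope) {
  const v128_t negative_part = wasm_f32x4_mul(x, slope);
  const v128_t sign_mask = wasm_i32x4_shr(x, 31);  // replicate the sign bit
  return wasm_v128_bitselect(negative_part, x, sign_mask);
}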

@omnisip

omnisip commented Dec 22, 2020

I don't have a particular opinion on the utility of this proposal, but I'm very curious about the code generation. Specifically, why is ARM64 performing worse when Cortex-A55/Cortex-A75 both have only one shift port, and signselect on ARM64 and bitselect with a signed shift right both yield two instructions?

Are you able to post the codegen before and after? Similarly, if I give you a couple of patches for ARM64 to try, can you see if it yields a difference in performance? We only need to pick one of the bad performers.

@lars-t-hansen
Contributor

FWIW, I think optimizers can and should deal with a compare-with-zero argument to bitselect and lower to a blend when that is appropriate; this should not be hard and would largely make signselect unnecessary. (Ties into our discussion about what optimizers should do.) We should not add signselect until we've seen the outcome of that work. In addition, it looks as if there's a broader discussion to be had here about the general utility of signselect.

@rrwinterton

rrwinterton commented Jan 15, 2021

The following are the performance numbers from running SignSelect with the XNNPACK elu and sigmoid tests. The table shows the scaling performance benefit between the baseline and the new signselect instruction. The Compute Core / Scaling Core is the ratio of the SignSelect performance difference; the Compute and Scaling columns are the performance ratios from the Baseline time to the SignSelect time on the compute and scaling cores, respectively.

[Image "wasmsignsel": performance table comparing Baseline and SignSelect]

@dtig
Member

dtig commented Jan 25, 2021

Adding a preliminary vote for the inclusion of Sign Select operations to the SIMD proposal below. Please vote with -

👍 For including Sign Select operations
👎 Against including Sign Select operations

@Maratyszcza
Contributor Author

To quote @lars-t-hansen from #430:

And given a low complexity limit, I think signselect ought to be a dedicated instruction, the tree required to express it being relatively large: Bitselect(compare(x, v128.const(0)), a, b).

@zeux
Contributor

zeux commented Jan 25, 2021

fwiw my take on signselect is that I would not expect engines to optimize the tree above into a single instruction, but I also wouldn't consider this instruction as being so fundamental as to warrant a special opcode, and would settle on a compare+select lowering - on ARM that happens to already be the optimal instruction pattern anyhow.

@ngzhian ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@jan-wassenberg

@zeux that means we would need i64.gt_s, right?

@zeux
Contributor

zeux commented Jan 26, 2021

@jan-wassenberg I don't see these as directly related. If we don't standardize i64.gt_s - which is essentially what's going to happen I suspect (I'd love to have these in the spec but the vote was split) - we're effectively making a statement that you can't efficiently use SIMD for 64-bit integer workloads that involve comparisons. Presence or absence of signselect is orthogonal to this decision - if you work with f64 you can still use f64.gt to feed bitselect.

If we got signselect then naturally you'd be able to emulate some i64.gt_s-based selects with it - but not all! For example, (a >= 0 && a < 10) ? b : c is going to be tricky to emulate with signselect. Since I believe the conclusion at the point when i64.gt_s was taken out of the spec a year or more ago was that in general we weren't able to find SIMD workloads that use i64 comparisons, whether we get signselect or not isn't material for i64 IMO.

And of course you can emulate signselect still with i64.shr(63) + bitselect although I'm not sure what the codegen looks like on v8 today.

@jan-wassenberg

@zeux Ah, I didn't realize we had i64.shr. That does allow some limited form of comparison. Its implementation is probably using something like signselect on SSE4; here's what Highway is doing:

template <size_t N>
HWY_API Vec128<int64_t, N> BroadcastSignBit(const Vec128<int64_t, N> v) {
#if HWY_TARGET == HWY_AVX3
  return Vec128<int64_t, N>{_mm_srai_epi64(v.raw, 63)};
#elif HWY_TARGET == HWY_AVX2
  return VecFromMask(v < Zero(Simd<int64_t, N>()));
#else
  // Efficient Gt() requires SSE4.2 but we only have SSE4.1, so it is faster to
  // generate constants and BLENDVPD, which only looks at the sign bit.
  const Simd<int64_t, N> di;
  const Simd<double, N> df;
  const auto zero = Zero(di);
  const auto all = BitCast(df, VecFromMask(zero == zero));
  const auto sign = MaskFromVec(BitCast(df, v));
  return BitCast(di, IfThenElse(sign, all, BitCast(df, zero)));
#endif
}

template <int kBits, size_t N>
HWY_API Vec128<int64_t, N> ShiftRight(const Vec128<int64_t, N> v) {
#if HWY_TARGET == HWY_AVX3
  return Vec128<int64_t, N>{_mm_srai_epi64(v.raw, kBits)};
#else
  const Simd<int64_t, N> di;
  const Simd<uint64_t, N> du;
  const auto right = BitCast(di, ShiftRight<kBits>(BitCast(du, v)));
  const auto sign = ShiftLeft<64 - kBits>(BroadcastSignBit(v));
  return right | sign;
#endif
}

Seems rather wasteful to use such a shr instead of signselect. If both signselect and gt are frowned upon, perhaps this BroadcastSignBit would be a useful compromise and low-level building block? It would allow engines to use shr (NEON/AVX512)/gt(SSE4.2)/blendv(SSE4), and users to call bitselect without all the extra wasted work that a general shr would require.

@lars-t-hansen
Contributor

To quote @lars-t-hansen from #430:

And given a low complexity limit, I think signselect ought to be a dedicated instruction, the tree required to express it being relatively large: Bitselect(compare(x, v128.const(0)), a, b).

I'm neutral on whether we want signselect or not, but I'm not committing to recognizing that tree as an instance of signselect :-)

@jan-wassenberg

We didn't have time in the meeting to cover this, but I'd like to propose a small building block that may avoid the need for signselect and possibly i64.gt and i64.abs.

Currently all we have for this is i64.shr + bitselect, or f64.gt (3+ cycle plus domain switch penalty).
i64.shr is expensive on x86: 3 ALU + 2 const + BLENDV.

Various platforms have fast ways to implement this that users cannot match/know:

  • AVX3: shr, or test+blend
  • ARM: shr + BSL
  • AVX2: i64.gt
  • SSE4: 2 const + BLENDV, 3 ops less than i64.shr

I have found this abstraction useful in multiple settings, perhaps it is less controversial than these other ops?

@Maratyszcza
Contributor Author

@jan-wassenberg Maybe I'm missing something, but what exactly is the proposal?

@jan-wassenberg

@Maratyszcza Ah, sorry to be unclear. The proposal is to add broadcastsign (or some better name), with the semantics of i##.shr (arithmetic shift right) by ##-1. It allows doing less work than an actual i64.shr on SSE4. I proposed blendv above but even better would be i32.gt(zero) + _mm_shuffle_epi32(gt, _MM_SHUFFLE(3,3,1,1)) to copy the upper 32 bits to the lower 32.

This operation would be useful for implementing various user code including 64-bit comparison or abs(), which then might not have to add as new instructions.
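
For illustration, a rough SSE2 sketch of the 64-bit lowering suggested above (the function name is made up; it just mirrors the i32.gt(zero) + shuffle idea):

#include <emmintrin.h>  // SSE2

// broadcastsign for i64x2 on plain SSE2: compare the 32-bit halves against
// zero, then copy each element's high-half result over its low half so every
// bit of the 64-bit lane equals the sign bit.
static __m128i broadcast_sign_i64x2(__m128i v) {
  const __m128i gt = _mm_cmpgt_epi32(_mm_setzero_si128(), v);  // 0 > v, per i32
  return _mm_shuffle_epi32(gt, _MM_SHUFFLE(3, 3, 1, 1));
}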

@Maratyszcza
Contributor Author

@jan-wassenberg I agree that this is an important operation, but I don't think we need an extra instruction for that. Probably the specification should emphasise that i##.shr by ##-1 is a special case that justifies additional optimization. I'll create a separate issue for that.

@dtig dtig added needs discussion Proposal with an unclear resolution and removed 2021-01-29 Agenda for sync meeting 1/29/21 labels Feb 2, 2021
@dtig
Member

dtig commented Mar 5, 2021

Closing as per #436.

@dtig dtig closed this Mar 5, 2021
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021