This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Sign Select instructions #124

Closed

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Oct 28, 2019

Introduction

Piece-wise approximations with a discontinuity at zero are common in numerical algorithms. Such approximations take the form y := x < 0 ? f(x) : g(x). E.g. in artificial neural networks, the ReLU [1], Leaky ReLU [2], Parametric ReLU [3], ELU [4], and SELU [5] functions involve such piece-wise approximations. Selecting between the two pieces depending on the sign of the input can be implemented with two WebAssembly SIMD instructions from the current specification: one instruction to replicate the highest bit of each input element into all bits of that element so it can be used as a bit-selection mask, and the bitselect instruction to choose between the two values, f(x) and g(x), depending on the constructed selection mask. There are, however, two considerations which make such a decomposition suboptimal in practice:

  1. There are two ways to replicate the high bit of the input elements: do an integer comparison with zero (e.g. i32x4.lt_s(x, v128.const(0)) for 32-bit elements) or do an integer arithmetic shift right by sizeof(element) * CHAR_BIT - 1 (e.g. i32x4.shr_s(x, 31) for 32-bit elements). The optimal way to do the replication depends on the target architecture: typically integer comparison is more efficient than integer shift, but if the WebAssembly engine has to explicitly instantiate v128.const(0), the trade-off skews the other way. ARM NEON and ARM64 provide instructions for integer comparison with zero (e.g. VCLT.S32 Qd, Qn, #0/CMLT Vd.4S, Vn.4S, #0 for 32-bit elements), but on x86 a comparison with zero would always require two instructions and an extra register, so arithmetic shift right (e.g. PSRAD xmm_x, 31) is often preferable.
  2. Since SSE4.1, x86 processors provide PBLENDVB/BLENDVPS/BLENDVPD instructions which choose between two vectors of 8-/32-/64-bit elements depending on the sign of the elements in the selection mask. Without WebAssembly instructions that map directly to these SSE4.1 instructions, WebAssembly SIMD engines have to generate 4-5 instructions (1-2 to construct the bit mask, and 3 to emulate v128.bitselect) where 1 would suffice.

To enable efficient implementation of piece-wise approximations with a discontinuity at zero, this PR introduces new Sign Select instructions, which select between elements of two input vectors depending on the sign of the corresponding elements in a mask vector.
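
For reference, the intended per-lane semantics can be sketched with scalar code (illustrative only; lane count and width follow the instruction name, here the 32-bit variant):

#include <stdint.h>

// Illustrative scalar model of v32x4.signselect (not normative): each output
// lane takes v1's lane if the corresponding lane of the mask c is negative
// (sign bit set), and v2's lane otherwise.
void v32x4_signselect_scalar(const int32_t v1[4], const int32_t v2[4],
                             const int32_t c[4], int32_t out[4]) {
  for (int i = 0; i < 4; i++) {
    out[i] = (c[i] < 0) ? v1[i] : v2[i];
  }
}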

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code-generation patterns.

x86/x86-64 processors with AVX instruction set

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to VPBLENDVB xmm_y, xmm_v2, xmm_v1, xmm_c
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to VPSRAW xmm_tmp, xmm_c, 15 + VPBLENDVB xmm_y, xmm_v2, xmm_v1, xmm_tmp
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to VBLENDVPS xmm_y, xmm_v2, xmm_v1, xmm_c
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to VBLENDVPD xmm_y, xmm_v2, xmm_v1, xmm_c

x86/x86-64 processors with SSE4.1 instruction set

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to:
      • MOVDQA xmm0, xmm_c
      • MOVDQA xmm_y, xmm_v2
      • PBLENDVB xmm_y, xmm_v1 [implicit xmm0 input]
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to:
      • PXOR xmm0, xmm0
      • PCMPGTW xmm0, xmm_c
      • MOVDQA xmm_y, xmm_v2
      • PBLENDVB xmm_y, xmm_v1 [implicit xmm0 input]
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to:
      • MOVAPS xmm0, xmm_c
      • MOVAPS xmm_y, xmm_v2
      • BLENDVPS xmm_y, xmm_v1 [implicit xmm0 input]
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to:
      • MOVAPD xmm0, xmm_c
      • MOVAPD xmm_y, xmm_v2
      • BLENDVPD xmm_y, xmm_v1 [implicit xmm0 input]
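
For readers more familiar with intrinsics, a rough sketch of the SSE4.1 lowerings above (function names are made up for illustration; the i16x8 variant uses an arithmetic shift instead of the PCMPGTW sequence listed, which produces the same mask):

#include <emmintrin.h>  // SSE2
#include <smmintrin.h>  // SSE4.1

// v8x16.signselect: PBLENDVB keys off the sign bit of each byte of c.
static __m128i signselect_i8x16(__m128i v1, __m128i v2, __m128i c) {
  return _mm_blendv_epi8(v2, v1, c);  // picks v1 where c's byte is negative
}

// v16x8.signselect: broadcast each 16-bit sign into the whole element first,
// since PBLENDVB only looks at per-byte sign bits.
static __m128i signselect_i16x8(__m128i v1, __m128i v2, __m128i c) {
  return _mm_blendv_epi8(v2, v1, _mm_srai_epi16(c, 15));
}

// v32x4.signselect / v64x2.signselect: BLENDVPS/BLENDVPD look only at the
// per-element sign bit, so c can be used as the mask directly.
static __m128 signselect_f32x4(__m128 v1, __m128 v2, __m128 c) {
  return _mm_blendv_ps(v2, v1, c);
}
static __m128d signselect_f64x2(__m128d v1, __m128d v2, __m128d c) {
  return _mm_blendv_pd(v2, v1, c);
}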

x86/x86-64 processors with SSE2 instruction set

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTB xmm_tmp, xmm_c
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTW xmm_tmp, xmm_c
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTD xmm_tmp, xmm_c
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to:
      • PXOR xmm_tmp, xmm_tmp
      • PCMPGTD xmm_tmp, xmm_c
      • PSHUFD xmm_tmp, xmm_tmp, 0xF5
      • MOVDQA xmm_y, xmm_v1
      • PAND xmm_y, xmm_tmp
      • PANDN xmm_tmp, xmm_v2
      • POR xmm_y, xmm_tmp
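
Roughly the same SSE2 pattern as intrinsics (a sketch; PANDN corresponds to _mm_andnot_si128, and the 0xF5 shuffle copies each high 32-bit comparison result over the low half of its 64-bit element):

#include <emmintrin.h>  // SSE2

// v32x4.signselect on SSE2: build an all-ones mask in lanes where c is
// negative, then select with and/andnot/or.
static __m128i signselect_i32x4_sse2(__m128i v1, __m128i v2, __m128i c) {
  const __m128i mask = _mm_cmpgt_epi32(_mm_setzero_si128(), c);  // 0 > c
  return _mm_or_si128(_mm_and_si128(mask, v1), _mm_andnot_si128(mask, v2));
}

// v64x2.signselect on SSE2: no 64-bit compare, so compare the 32-bit halves
// and replicate each element's high-half result across the whole element.
static __m128i signselect_i64x2_sse2(__m128i v1, __m128i v2, __m128i c) {
  __m128i mask = _mm_cmpgt_epi32(_mm_setzero_si128(), c);
  mask = _mm_shuffle_epi32(mask, _MM_SHUFFLE(3, 3, 1, 1));  // imm8 = 0xF5
  return _mm_or_si128(_mm_and_si128(mask, v1), _mm_andnot_si128(mask, v2));
}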

ARM64 processors

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to CMLT Vy.16B, Vc.16B, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to CMLT Vy.8H, Vc.8H, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to CMLT Vy.4S, Vc.4S, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to CMLT Vy.2D, Vc.2D, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
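
Expressed with AArch64 NEON intrinsics, the same two-instruction pattern looks roughly like this (a sketch; the listings above remain the intended lowering):

#include <arm_neon.h>

// v32x4.signselect on AArch64: compare-less-than-zero yields an all-ones mask
// in negative lanes, and BSL does a bitwise select with that mask.
static int32x4_t signselect_i32x4_neon(int32x4_t v1, int32x4_t v2, int32x4_t c) {
  const uint32x4_t mask = vcltzq_s32(c);  // per lane: c < 0 ? ~0 : 0
  return vbslq_s32(mask, v1, v2);
}

// v64x2.signselect: same pattern with the 64-bit compare (AArch64 only).
static int64x2_t signselect_i64x2_neon(int64x2_t v1, int64x2_t v2, int64x2_t c) {
  return vbslq_s64(vcltzq_s64(c), v1, v2);
}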

ARMv7 processors with NEON extension

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered to VCLT.S8 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered to VCLT.S16 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered to VCLT.S32 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered to VSHR.S64 Qy, Qc, #63 + VBSL Qy, Qv1, Qv2

Other processors and instruction sets

  • v8x16.signselect
    • y = v8x16.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i8x16.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i8x16.shr_s(c, 7))
  • v16x8.signselect
    • y = v16x8.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i16x8.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i16x8.shr_s(c, 15))
  • v32x4.signselect
    • y = v32x4.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i32x4.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i32x4.shr_s(c, 31))
  • v64x2.signselect
    • y = v64x2.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i64x2.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i64x2.shr_s(c, 63))
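
With clang's wasm_simd128.h intrinsics, the generic fallback for the 32-bit variant corresponds to something like the sketch below (intrinsic names as in the clang header; both forms replicate the sign of c into a full-lane mask before bitselect):

#include <wasm_simd128.h>

// Compare-based fallback for v32x4.signselect.
static v128_t signselect_i32x4_compare(v128_t v1, v128_t v2, v128_t c) {
  return wasm_v128_bitselect(v1, v2, wasm_i32x4_lt(c, wasm_i32x4_splat(0)));
}

// Shift-based fallback: arithmetic shift right by 31 replicates the sign bit.
static v128_t signselect_i32x4_shift(v128_t v1, v128_t v2, v128_t c) {
  return wasm_v128_bitselect(v1, v2, wasm_i32x4_shr(c, 31));
}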

References

[1] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning (ICML), 2010.
[2] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML), 2013.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv e-prints, February 2015.
[4] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289.
[5] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. arXiv e-prints, 2017.

@ngzhian
Member

ngzhian commented Jul 6, 2020

There are two ways to replicate the high bit of the input elements: do an integer comparison with zero (e.g. i32x4.lt_s(x, v128.const(0)) for 32-bit elements) or do an integer arithmetic shift right by sizeof(element) * CHAR_BIT - 1 (e.g. i32x4.shr_s(x, 31) for 32-bit elements). The optimal way to do the replication depends on the target architecture: typically integer comparison is more efficient than integer shift, but if the WebAssembly engine has to explicitly instantiate v128.const(0), the trade-off skews the other way. ARM NEON and ARM64 provide instructions for integer comparison with zero (e.g. VCLT.S32 Qd, Qn, #0/CMLT Vd.4S, Vn.4S, #0 for 32-bit elements), but on x86 a comparison with zero would always require two instructions and an extra register, so arithmetic shift right (e.g. PSRAD xmm_x, 31) is often preferable.

I think shift right by an immediate is more efficient for all archs (on v8 at least); it's a single instruction.

Comparison with an immediate 0 is not supported in Wasm, so you won't be able to get the fast behavior on ARM; it will be i32x4.lt_s(x, v128.const(0)), as you said, on all archs. [^1]

WebAssembly SIMD engines have to generate 4-5 instructions (1-2 to construct bit mask, and 3 to emulate v128.bitselect) where 1 would suffice.

Yea, bitselect is 4 instructions on x64 :(

Looking at the lowering, this instruction mostly helps AVX, from 5 instructions to 1 (or 2). The ARM codegen looks exactly like what cmp/shift+bitselect would be (if emulated using cmp/shr_s+bitselect).

From the guidelines:

  • hard/expensive to emulate in absence of spec change

This instruction will mostly be helpful for AVX, and is much more expensive without sign select.

  • well supported on 2+ major archs

It is well supported in that codegen on ARM/ARM64 isn't awful, but those archs don't gain from sign select. So it isn't as useful.

SSE4.1 sees slight improvement, but I don't think it's sufficient.

I haven't looked at SSE2 that much, but the codegen looks long (particularly for v64x2.signselect; cmp+bitselect might be shorter).

  • useful in 2+ important use cases

All of the references are machine learning (neural networks) related. I do not have the expertise in this field to distinguish how different they are - they all refer to *LU, and so look the same to my untrained eye. Thus, I'm not certain this counts as "2+".

In comparison, #237 shows the use in codecs/media, machine learning, molecular dynamics, etc.

There were some discussions in #131 regarding mask/select related instructions, which led to bitmask being proposed (and merged). Is there anyone who would like to offer examples of sign select usages in their fields?

[^1] v128.const(0) will be a single instruction

@Maratyszcza
Contributor Author

Maratyszcza commented Jul 7, 2020

@ngzhian One thing to keep in mind is that processors typically can handle integer comparisons in a larger number of SIMD units than shifts. E.g. AMD Zen 2 processors can do 3 integer compare instructions ([V]PCMPGTD) per cycle but only one arithmetic shift instruction ([V]PSRAD) per cycle, and ARM Cortex-A76 cores can do 2 SIMD integer compare instructions per cycle but one SIMD shift per cycle. Thus, compare-based lowering is always preferable when it doesn't increase the number of generated instructions.

@zeux
Contributor

zeux commented Aug 5, 2020

@ngzhian One use of signselect, in workloads where desktop performance is important, is to replace bitselect and get better codegen on x64 in cases where the mask is known to be all 0s/1s - most commonly as a result of a comparison instruction. How feasible do you think it is for TurboFan to generate pblend instead of the current instruction sequence when the input is a comparison mask?

@ngzhian
Member

ngzhian commented Aug 5, 2020

@ngzhian
Member

ngzhian commented Oct 28, 2020

Prototyped in V8 for x64 in https://crrev.com/c/2486235, using pretty weird opcodes:

  V(I8x16SignSelect, 0xfd7d, s_sss)      \
  V(I16x8SignSelect, 0xfd7e, s_sss)      \
  V(I32x4SignSelect, 0xfd7f, s_sss)      \
  V(I64x2SignSelect, 0xfd94, s_sss)      \

Thomas is aware of these opcodes.

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 29, 2020
As proposed in WebAssembly/simd#124, using the opcodes
adopted by V8 in
https://chromium-review.googlesource.com/c/v8/v8/+/2486235/2/src/wasm/wasm-opcodes.h.
Uses new builtin functions and a new target intrinsic exclusively to ensure that
the new instructions are only emitted when a user explicitly opts in to using
them since they are still in the prototyping and evaluation phase.

Differential Revision: https://reviews.llvm.org/D90357
@tlively
Member

tlively commented Oct 29, 2020

This has landed in LLVM using the V8 opcodes and is usable via __builtin_wasm_signselect_{i8x16,i16x8,i32x4,i64x2}. It will be usable in tip-of-tree Emscripten in a few hours. It hasn't landed in Binaryen yet, but that won't be a problem as long as you don't pass optimization flags at link time.

@Maratyszcza Maratyszcza force-pushed the signselect branch 2 times, most recently from 05c0499 to 3d06a28 on December 4, 2020 06:17
@planglois

Hi all!

While looking at the AArch32 implementation in V8, we realized NEON does not support the i64x2 variant of the VCLT instruction. After discussing this with some colleagues, we came up with an alternative implementation:

;; Use the i32x4 variant of the compare with zero instruction.
VCLT.S32 Qy, Qc, #0 
;; The sign bit should be set in lanes 1 and 3, with don't-care values in lanes 0 and 2:
;; | SSSSSSSS | xxxxxxxx | SSSSSSSS | xxxxxxxx |
;; We can shift those down to propagate the sign bit:
VSHR.S64 Qy, Qy, #32
VBSL Qy.16B, Qv1.16B, Qv2.16B

I thought I'd comment on here too in case another VM needs to implement this, better suggestions welcome of course!

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 8, 2020

@planglois Good point. Although a more efficient lowering would be VSHR.S64 Qy, Qc, #63 + VBSL Qy, Qv1, Qv2. Updated the PR description with the new lowering.

@ngzhian
Member

ngzhian commented Dec 10, 2020

This is now prototyped in V8 on ARM, ARM64, IA32 as well.

@Maratyszcza
Contributor Author

The experiment to enable these instructions in XNNPACK is in google/XNNPACK#1236, but fails to link: [parse exception: invalid code after SIMD prefix: 127 (at 0:118627)]

tlively added a commit to tlively/binaryen that referenced this pull request Dec 11, 2020
tlively added a commit to WebAssembly/binaryen that referenced this pull request Dec 12, 2020
@Maratyszcza
Contributor Author

I evaluated the performance impact on two neural network operators in XNNPACK: Sigmoid and ELU. Results for Sigmoid:

| Processor (Device) | Performance with WAsm SIMD + Sign Select | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 7.31 GB/s | 6.87 GB/s | 6% |
| Intel Celeron N3060 | 1.02 GB/s | 0.98 GB/s | 4% |
| AMD PRO A10-8700B | 3.99 GB/s | 3.59 GB/s | 11% |
| AMD A4-7210 | 1.46 GB/s | 1.46 GB/s | 0% |
| Qualcomm Snapdragon 670 (Pixel 3a) | 1.97 GB/s | 1.97 GB/s | 0% |
| Qualcomm Snapdragon 855 (LG G8 ThinQ) | 4.66 GB/s | 4.63 GB/s | 1% |
| Samsung Exynos 8895 (Galaxy S8) | 2.28 GB/s | 2.21 GB/s | 3% |

Results for ELU:

| Processor (Device) | Performance with WAsm SIMD + Sign Select | Performance with WAsm SIMD (baseline) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 8.53 GB/s | 8.04 GB/s | 6% |
| Intel Celeron N3060 | 1.58 GB/s | 1.60 GB/s | -1% |
| AMD PRO A10-8700B | 4.25 GB/s | 3.90 GB/s | 9% |
| AMD A4-7210 | 1.43 GB/s | 1.41 GB/s | 1% |
| Qualcomm Snapdragon 670 (Pixel 3a) | 2.02 GB/s | 2.06 GB/s | -2% |
| Qualcomm Snapdragon 855 (LG G8 ThinQ) | 4.77 GB/s | 4.75 GB/s | 0% |
| Samsung Exynos 8895 (Galaxy S8) | 3.21 GB/s | 3.22 GB/s | 0% |

@Maratyszcza
Contributor Author

Code changes can be seen in google/XNNPACK#1236

@penzn
Contributor

penzn commented Dec 18, 2020

Why does the Celeron N3060 go from +4% on one benchmark to -1% on the other?

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 18, 2020

There are two factors at play on Celeron:

  • Sign Select reduces register pressure, as we don't need to explicitly compute the mask for v128.bitselect.
  • Sign Select on SSE4.1 lowers to the BLENDVPS instruction, which is very inefficient on the Celeron N3060. In fact, it is worthwhile to detect these types of CPUs (Silvermont, Goldmont, Goldmont Plus cores) and use the SSE2-level code-generation pattern for them.

@zeux
Contributor

zeux commented Dec 22, 2020

Let me unpack my comments from the meeting a bit.

bitselect is not great

There's an existing problem with bitselect in that it results in optimal codegen on ARM, but - on CPUs with a fast blend instruction - it results in suboptimal codegen. v8 synthesizes 4 instructions for bitselect today - mov, xor, and, xor. mov is probably reg-renamed, but the remaining three instructions are dependent on each other, so this has 3-cycle latency; given 3 execution ports this should have rec. throughput of 1. Compared to this, blendps has latency of 2 and throughput of 0.66 on Skylake. This is mentioned in #192. Worth noting that v8's lowering seems particularly inefficient compared to standard and/andnot/or assuming the availability of VEX prefixes, as the latter has latency of 2 (but rec. throughput of 1 all the same).
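
For concreteness, the two bitselect lowerings being compared here look roughly like this in SSE intrinsics (a sketch of the shapes, not of V8's actual emitted code):

#include <emmintrin.h>

// bitselect(a, b, m) = (a & m) | (b & ~m).

// xor-shaped lowering (the mov/xor/and/xor shape described above):
// three dependent bitwise ops, so ~3 cycles of latency.
static __m128i bitselect_xor_shape(__m128i a, __m128i b, __m128i m) {
  return _mm_xor_si128(b, _mm_and_si128(_mm_xor_si128(a, b), m));
}

// and/andnot/or lowering: with non-destructive VEX encodings the AND and
// ANDN are independent, so the critical path is only two ops.
static __m128i bitselect_andnot_shape(__m128i a, __m128i b, __m128i m) {
  return _mm_or_si128(_mm_and_si128(a, m), _mm_andnot_si128(m, b));
}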

signselect seems like a better bitselect but it's not

Crucially, bitselect is optimal for ARM; signselect has a 2-instruction lowering on ARM, and those two instructions are dependent on each other. With hit-or-miss NEON implementations in general, I would probably want to prioritize ARM performance.

So signselect can only really be used when it's replacing a comparison with 0. It's tempting to use it as a general bitselect replacement but oh-so-wrong; doing so will result in performance uplift on Intel systems (... which is where the code is probably tested...), but a small performance degradation on ARM systems.

bitselect can be fixed

During the call we talked about challenges with blendv on Goldmont, which has 4-cycle latency and 4-cycle rec. throughput - not great; this means that this instruction cannot be lowered to blendv directly on any system that has a Goldmont chip, so there's going to be a perf. disparity - on a Goldmont-carrying system we'd need to convert the sign bit to a mask (arithmetic right shift by 31) followed by the usual bitselect code, for a total latency of 4 cycles with the existing v8 lowering and rec. throughput of 2 (Goldmont only has 2 EUs for vectorized bitops, so 4 instructions with rec. throughput of 0.5 each add up to 2).

So assuming this needs to be figured out for any use of blendv, including the proposal above, bitselect codegen can be optimal on Intel.

To do this during codegen, the code generator needs to track whether the register originated from an instruction that can only produce all-1 or all-0 in a given lane; effectively each value grows an extra enum, something like not-a-mask | mask8 | mask16 | mask32 | mask64. Crucially, this propagates throughout the code in a forward fashion - you just need to have each Wasm SIMD value track this as the code is generated, and the type depends entirely on the instruction and the mask-mode of the instruction's arguments. For example, the basic implementation would result in not-a-mask for all instructions but comparisons; a slightly more involved implementation would have logical ops return maskN if both operands are maskN and not-a-mask otherwise.
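
A hypothetical sketch of that forward tracking (names invented for illustration; it only shows the propagation rule, not an actual engine):

// Each Wasm SIMD value carries a tag saying whether it is known to be
// all-0s/all-1s per lane, and at what lane width.
enum class MaskKind { NotAMask, Mask8, Mask16, Mask32, Mask64 };

// Comparisons produce a mask of their lane width.
inline MaskKind maskKindOfCompare(int lane_bits) {
  switch (lane_bits) {
    case 8:  return MaskKind::Mask8;
    case 16: return MaskKind::Mask16;
    case 32: return MaskKind::Mask32;
    case 64: return MaskKind::Mask64;
    default: return MaskKind::NotAMask;
  }
}

// Bitwise ops (and/or/xor/not) preserve mask-ness only when their inputs
// agree; everything else conservatively yields NotAMask, so the analysis is
// a cheap single forward pass over the instruction stream.
inline MaskKind maskKindOfBitwise(MaskKind a, MaskKind b) {
  return (a == b) ? a : MaskKind::NotAMask;
}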

I would argue this can be done with minimal codegen perf loss for any SIMD engine, including ones that do a single pass through Wasm code and generate resulting assembly directly.

I would also argue that any Wasm SIMD engine that wants to have decent performance has to do similar instruction analysis for constants, as shifts can't be lowered efficiently otherwise. In the simplest case, both analyses can be done directly on the arguments (effectively as a peep-hole matcher on the source Wasm instructions) at minimal cost.

If we don't want to fix bitselect, we should add qselect

If, hypothetically speaking, we decide that the above is impractical - which I don't believe, and if we were to decide that I'd ask that we also introduce immediate variants for some instructions like shifts - we could introduce a quasi-select instruction.

There's a behavior mismatch here between modern Intel and ARM systems: on Intel it's most efficient to do a select based on the top bit; on ARM it's most efficient to do a select based on all bits. Resolving the difference on either side creates perf. problems, but if we had an instruction that allowed using any bits from the source lanes when doing the select (and e.g. required this choice to be made consistently during codegen of a given instruction), this would allow us to shift the mask analysis to either the user or the compiler that generates Wasm code.

signselect is less interesting if we have better-bitselect or qselect

Given the importance of selects in SIMD code, I think we have to either fix bitselect in the implementations or introduce qselect; qselect has to wait for post-MVP as it will by necessity have different behaviors on different archs. I would strongly prefer fixing bitselect, but either way, all of this says that on systems that have an efficient blendv, the Wasm engine will be able to synthesize one.

Once we have this, we can compare signselect to explicit compare-to-0 + bitselect/qselect. Given a decent code generation engine, I'd expect:

  • On Intel compare+bitselect would generate compare+blendv, which is 3 cycles of latency and 1 cycle of rec throughput; the variant without compare would have 2 cycles of latency and 0.66 cycles of rec throughput (Skylake)
  • On ARM compare+bitselect and signselect would be equivalent

I suspect that if we had better bitselect lowering, the gains on the benchmarks posted would be smaller accordingly. signselect would still be better, but it would be better in a fairly special case.

Of course improving bitselect situation and adding signselect don't have to be mutually exclusive; signselect will never be worse (given that the codegen engine can detect Goldmont) than an equivalent replacement, but if I had to choose what to spend the resources on, I would focus on making bitselect better.

@omnisip

omnisip commented Dec 22, 2020

@zeux thanks for the comments, really.

I'm working on a design proposal for V8 that will help us perform significantly better optimization across instructions - first by constant folding and hoisting, then by peephole optimization. Would you mind commenting on that when it's ready for review? This is a good use case for the proposal.

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 22, 2020

There are several use-cases for v128.bitselect in XNNPACK right now:

  • In clamping the results (because f32x4.min/f32x4.max are too slow on x86). This use-case would be replaced with f32x4.pmin/f32x4.pmax once it has enough user coverage.
  • In rounding operators. This use-case would be replaced with f32x4.nearest/f32x4.floor/f32x4.ceil/f32x4.trunc once these instructions have enough user coverage.
  • In Leaky ReLU, ELU, PReLU, and Sigmoid operators. In ALL of these operators the mask for v128.bitselect is generated from the sign of the input.

In other words, XNNPACK doesn't really have a use-case for v128.bitselect that wouldn't be covered by Sign Select instructions.
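
To make the third bullet concrete, here is a simplified sketch (not XNNPACK's actual kernel; the helper name and slope argument are illustrative) of the Leaky ReLU pattern, where the bitselect mask comes straight from the sign of the input - exactly what signselect would replace:

#include <wasm_simd128.h>

// x >= 0 ? x : x * slope, with the selection mask derived from x's sign bit.
static v128_t leaky_relu_f32x4(v128_t x, v128_t slope) {
  const v128_t negative_part = wasm_f32x4_mul(x, slope);
  const v128_t sign_mask = wasm_i32x4_shr(x, 31);  // replicate the sign bit
  return wasm_v128_bitselect(negative_part, x, sign_mask);
}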

@omnisip

omnisip commented Dec 22, 2020

I don't have a particular opinion on the utility of this proposal, but I'm very curious about the code generation. Specifically, why is ARM64 performing worse when Cortex-A55/Cortex-A75 both have only one shift port, and signselect on ARM64 and bitselect with a signed shift right both yield two instructions?

Are you able to post the codegen before and after? Similarly, if I give you a couple of patches for ARM64 to try, can you see if it yields a difference in performance? We only need to pick one of the bad performers.

@lars-t-hansen
Contributor

FWIW, I think optimizers can and should deal with a compare-with-zero argument to bitselect and lower to a blend when that is appropriate; this should not be hard and would largely make signselect unnecessary. (Ties into our discussion about what optimizers should do.) We should not add signselect until we've seen the outcome of that work. In addition, it looks as if there's a broader discussion to be had here about the general utility of signselect.

@rrwinterton

rrwinterton commented Jan 15, 2021

The following are the performance numbers from running SignSelect with the XNNPACK elu and sigmoid tests. The table shows the scaling performance benefit between the baseline and the new signselect instruction. The Compute Core / Scaling Core is the ratio of the SignSelect performance difference; the Compute and Scaling columns are the performance ratios from the Baseline time to the SignSelect time on the compute and scaling cores, respectively.

[Image "wasmsignsel": performance table comparing Baseline and SignSelect]

@dtig
Member

dtig commented Jan 25, 2021

Adding a preliminary vote for the inclusion of Sign Select operations to the SIMD proposal below. Please vote with -

👍 For including Sign Select operations
👎 Against including Sign Select operations

@Maratyszcza
Contributor Author

To quote @lars-t-hansen from #430:

And given a low complexity limit, I think signselect ought to be a dedicated instruction, the tree required to express it being relatively large: Bitselect(compare(x, v128.const(0)), a, b).

@zeux
Contributor

zeux commented Jan 25, 2021

fwiw my take on signselect is that I would not expect engines to optimize the tree above into a single instruction, but I also wouldn't consider this instruction as being so fundamental as to warrant a special opcode, and would settle on a compare+select lowering - on ARM that happens to already be the optimal instruction pattern anyhow.

@ngzhian ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@jan-wassenberg

@zeux that means we would need i64.gt_s, right?

@zeux
Contributor

zeux commented Jan 26, 2021

@jan-wassenberg I don't see these as directly related. If we don't standardize i64.gt_s - which is essentially what's going to happen I suspect (I'd love to have these in the spec but the vote was split) - we're effectively making a statement that you can't efficiently use SIMD for 64-bit integer workloads that involve comparisons. Presence or absence of signselect is orthogonal to this decision - if you work with f64 you can still use f64.gt to feed bitselect.

If we got signselect then naturally you'd be able to emulate some i64.gt_s-based selects with it - but not all! For example, (a >= 0 && a < 10) ? b : c is going to be tricky to emulate with signselect. Since I believe the conclusion at the point when i64.gt_s was taken out of the spec a year or more ago was that in general we weren't able to find SIMD workloads that use i64 comparisons, whether we get signselect or not isn't material for i64 IMO.

And of course you can emulate signselect still with i64.shr(63) + bitselect although I'm not sure what the codegen looks like on v8 today.

@jan-wassenberg

@zeux Ah, I didn't realize we had i64.shr. That does allow some limited form of comparison. Its implementation is probably using something like signselect on SSE4; here's what Highway is doing:

template <size_t N>
HWY_API Vec128<int64_t, N> BroadcastSignBit(const Vec128<int64_t, N> v) {
#if HWY_TARGET == HWY_AVX3
  return Vec128<int64_t, N>{_mm_srai_epi64(v.raw, 63)};
#elif HWY_TARGET == HWY_AVX2
  return VecFromMask(v < Zero(Simd<int64_t, N>()));
#else
  // Efficient Gt() requires SSE4.2 but we only have SSE4.1, so it is faster to
  // generate constants and BLENDVPD, which only looks at the sign bit.
  const Simd<int64_t, N> di;
  const Simd<double, N> df;
  const auto zero = Zero(di);
  const auto all = BitCast(df, VecFromMask(zero == zero));
  const auto sign = MaskFromVec(BitCast(df, v));
  return BitCast(di, IfThenElse(sign, all, BitCast(df, zero)));
#endif
}

template <int kBits, size_t N>
HWY_API Vec128<int64_t, N> ShiftRight(const Vec128<int64_t, N> v) {
#if HWY_TARGET == HWY_AVX3
  return Vec128<int64_t, N>{_mm_srai_epi64(v.raw, kBits)};
#else
  const Simd<int64_t, N> di;
  const Simd<uint64_t, N> du;
  const auto right = BitCast(di, ShiftRight<kBits>(BitCast(du, v)));
  const auto sign = ShiftLeft<64 - kBits>(BroadcastSignBit(v));
  return right | sign;
#endif
}

Seems rather wasteful to use such a shr instead of signselect. If both signselect and gt are frowned upon, perhaps this BroadcastSignBit would be a useful compromise and low-level building block? It would allow engines to use shr (NEON/AVX512)/gt(SSE4.2)/blendv(SSE4), and users to call bitselect without all the extra wasted work that a general shr would require.

@lars-t-hansen
Contributor

To quote @lars-t-hansen from #430:

And given a low complexity limit, I think signselect ought to be a dedicated instruction, the tree required to express it being relatively large: Bitselect(compare(x, v128.const(0)), a, b).

I'm neutral on whether we want signselect or not, but I'm not committing to recognizing that tree as an instance of signselect :-)

@jan-wassenberg

We didn't have time in the meeting to cover this, but I'd like to propose a small building block that may avoid the need for signselect and possibly i64.gt and i64.abs.

Currently all we have for this is i64.shr + bitselect, or f64.gt (3+ cycle plus domain switch penalty).
i64.shr is expensive on x86: 3 ALU + 2 const + BLENDV.

Various platforms have fast ways to implement this that users cannot match/know:

  • AVX3: shr, or test+blend
  • ARM: shr + BSL
  • AVX2: i64.gt
  • SSE4: 2 const + BLENDV, 3 ops less than i64.shr

I have found this abstraction useful in multiple settings, perhaps it is less controversial than these other ops?

@Maratyszcza
Contributor Author

@jan-wassenberg Maybe I'm missing something, but what exactly is the proposal?

@jan-wassenberg

@Maratyszcza Ah, sorry to be unclear. The proposal is to add broadcastsign (or some better name), with the semantics of i##.shr (arithmetic shift right) by ##-1. It allows doing less work than an actual i64.shr on SSE4. I proposed blendv above but even better would be i32.gt(zero) + _mm_shuffle_epi32(gt, _MM_SHUFFLE(3,3,1,1)) to copy the upper 32 bits to the lower 32.

This operation would be useful for implementing various user code including 64-bit comparison or abs(), which then might not have to add as new instructions.
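
For illustration, a rough SSE2 sketch of the 64-bit lowering suggested above (the function name is made up; it just mirrors the i32.gt(zero) + shuffle idea):

#include <emmintrin.h>  // SSE2

// broadcastsign for i64x2 on plain SSE2: compare the 32-bit halves against
// zero, then copy each element's high-half result over its low half so every
// bit of the 64-bit lane equals the sign bit.
static __m128i broadcast_sign_i64x2(__m128i v) {
  const __m128i gt = _mm_cmpgt_epi32(_mm_setzero_si128(), v);  // 0 > v, per i32
  return _mm_shuffle_epi32(gt, _MM_SHUFFLE(3, 3, 1, 1));
}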

@Maratyszcza
Contributor Author

@jan-wassenberg I agree that this is an important operation, but I don't think we need an extra instruction for that. Probably the specification should emphasise that i##.shr by ##-1 is a special case that justifies additional optimization. I'll create a separate issue for that.

@dtig dtig added needs discussion Proposal with an unclear resolution and removed 2021-01-29 Agenda for sync meeting 1/29/21 labels Feb 2, 2021
@dtig
Member

dtig commented Mar 5, 2021

Closing as per #436.

@dtig dtig closed this Mar 5, 2021
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021