I think shift right by an immediate is more efficient for all archs (on v8 at least): it's a single instruction. Comparison with immediate 0 is not supported on Wasm, so you won't be able to get the fast behavior on ARM; it will be
Yea, bitselect is 4 instructions on x64 :( Looking at the lowering, this instruction mostly helps AVX, from 5 instructions to 1 (or 2). The ARM codegen looks exactly like what cmp/shift+bitselect would be (if emulated using cmp/shr_s+bitselect). From the guidelines:
This instruction will mostly be helpful for AVX, and is much more expensive without sign select.
It is well supported in the sense that codegen on ARM/ARM64 isn't awful, but those archs don't gain from sign select, so it isn't as useful. SSE4.1 sees a slight improvement, but I don't think it's sufficient. I haven't looked at SSE2 that much, but the codegen looks long (particularly for
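For reference, a rough sketch in SSE intrinsics of the difference being discussed (illustrative only, not the code V8 actually emits): the generic bitselect lowering versus a single blend when the mask has each lane's sign bit set appropriately.

```c
#include <smmintrin.h>  /* SSE4.1 */

/* Generic v128.bitselect lowering: (v1 & mask) | (v2 & ~mask).
   On a destructive two-operand ISA this becomes a register move plus
   pand/pandn/por, i.e. the 4 instructions mentioned above. */
static inline __m128i bitselect_generic(__m128i v1, __m128i v2, __m128i mask) {
  return _mm_or_si128(_mm_and_si128(v1, mask), _mm_andnot_si128(mask, v2));
}

/* If each byte of the mask is known to have its sign bit reflect the
   selection (e.g. the mask came from a comparison), SSE4.1 PBLENDVB
   does the whole select in a single instruction. */
static inline __m128i bitselect_blendv(__m128i v1, __m128i v2, __m128i mask) {
  return _mm_blendv_epi8(v2, v1, mask);  /* picks v1 where the sign bit is set */
}
```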
All of the references are machine learning (neural networks) related. I do not have the expertise in this field to distinguish how different they are - they all refer to *LU, and so look the same to my untrained eye. Thus, I'm not certain this counts as "2+". In comparison, #237 shows use in codecs/media, machine learning, molecular dynamics, etc. There were some discussions in #131 regarding mask/select related instructions, which led to
[^1] v128.const(0) will be a single instruction
@ngzhian One thing to keep in mind is that processors can typically handle integer comparisons on a larger number of SIMD units than shifts. E.g. AMD Zen 2 processors can do 3 integer compare instructions
@ngzhian One of the uses of signselect, for use cases where desktop performance is important, is to replace bitselect to get better codegen on x64 in cases where the mask is known to be all 0s/1s - most commonly as the result of a comparison instruction. How feasible do you think it is for TurboFan to generate pblend instead of the current instruction sequence when the input is a comparison mask?
@zeux I looked around, we have cases where we pattern match, like shift + and [0], mul + add (then use
[0] https://source.chromium.org/chromium/chromium/src/+/master:v8/src/compiler/backend/x64/instruction-selector-x64.cc;l=710?q=instruction-selector-x64.cc&ss=chromium%2Fchromium%2Fsrc:v8%2F
Prototyped in V8 https://crrev.com/c/2486235 for x64.
Thomas is aware of these opcodes.
As proposed in WebAssembly/simd#124, using the opcodes adopted by V8 in https://chromium-review.googlesource.com/c/v8/v8/+/2486235/2/src/wasm/wasm-opcodes.h. Uses new builtin functions and a new target intrinsic exclusively to ensure that the new instructions are only emitted when a user explicitly opts in to using them since they are still in the prototyping and evaluation phase. Differential Revision: https://reviews.llvm.org/D90357
This has landed in LLVM using the V8 opcodes and is usable via
Hi all! While looking at the AArch32 implementation in V8, we realized NEON does not support the i64x2 variant of the
I thought I'd comment on here too in case another VM needs to implement this, better suggestions welcome of course!
@planglois Good point. Although more efficient would be
This is now prototyped in V8 on ARM, ARM64, IA32 as well.
The experiment to enable these instructions in XNNPACK is in google/XNNPACK#1236, but fails to link:
- i64x2.eq (WebAssembly/simd#381)
- i64x2 widens (WebAssembly/simd#290)
- i64x2.bitmask (WebAssembly/simd#368)
- signselect ops (WebAssembly/simd#124)
I evaluated the performance impact on two neural network operators in XNNPACK: Sigmoid and ELU. Results for Sigmoid:
Results for ELU:
Code changes can be seen in google/XNNPACK#1236
Why does the Celeron N3060 go from +4 on one benchmark to -1 on the other?
There are two factors at play on Celeron:
Let me unpack my comments from the meeting a bit.
bitselect is not great
There's an existing problem with bitselect in that it results in optimal codegen on ARM, but - on CPUs with a fast blend instruction - it results in suboptimal codegen. v8 synthesizes 4 instructions for bitselect today.
signselect seems like a better bitselect but it's not
Crucially, bitselect is optimal for ARM; signselect has a 2-instruction lowering on ARM, and the two instructions are dependent on each other - with hit-or-miss NEON implementations in general I would probably want to prioritize ARM performance. So signselect can only really be used when it's replacing a comparison with 0. It's tempting to use it as a general bitselect replacement but oh-so-wrong; doing so will result in a performance uplift on Intel systems (... which is where the code is probably tested...), but a small performance degradation on ARM systems.
bitselect can be fixed
During the call we've talked about challenges with blendv on Goldmont: there it has 4 cycle latency and 4 cycle reciprocal throughput, which is not great. This means that this instruction can not be lowered to blendv directly on any system that has a Goldmont chip, so there's going to be a perf. disparity - on a Goldmont-carrying system we'd need to convert the sign bit to a mask (arithmetic right shift by 31) followed by the usual bitselect code, for a total latency of 4 cycles with the existing v8 lowering and reciprocal throughput of 2 (Goldmont only has 2 EUs for vectorized bitops, so 4 instructions with reciprocal throughput of 0.5 add up to 2). So assuming this needs to be figured out for any use of blendv, including the proposal above, bitselect codegen can be optimal on Intel. To do this during codegen, the code generator needs to track whether the register originated from an instruction that can only produce all-1 or all-0 in a given lane; effectively each value grows an extra enum, something like a known-mask/unknown flag. I would argue this can be done with minimal codegen perf loss for any SIMD engine, including ones that do a single pass through Wasm code and generate resulting assembly directly. I would also argue that any Wasm SIMD engine that wants to have decent performance has to do similar instruction analysis for constants, as shifts can't be lowered efficiently otherwise. In the simplest case, both analyses can be done directly on the arguments (effectively as a peep-hole matcher on the source Wasm instructions) at minimal cost.
If we don't want to fix bitselect, we should add qselect
If, hypothetically speaking, we decide that the above is impractical - which I don't believe, and if we were to decide that I'd ask that we also introduce immediate variants for some instructions like shifts - we could introduce a quasi-select instruction. There's a behavior mismatch here between modern Intel and ARM systems: on Intel it's most efficient to do a select based on the top bit; on ARM it's most efficient to do a select based on all bits. Resolving the difference on either side creates perf. problems, but if we had an instruction that allowed using any bits from the source lanes when doing the select (and e.g. required this choice to be made consistently during codegen of a given instruction), this would allow us to shift the mask-analysis to either the user, or the compiler that generates Wasm code.
signselect is less interesting if we have better-bitselect or qselect
Given the importance of selects in SIMD code, I think we have to either fix bitselect in the implementations or introduce qselect; qselect has to wait for post-MVP as it will by necessity have different behaviors on different archs. I would strongly prefer fixing bitselect, but either way, all of this says that on systems that have an efficient blendv, the Wasm engine will be able to synthesize one. Once we have this, we can compare
I suspect that if we had better bitselect lowering, the gains on the benchmarks posted would be smaller accordingly. signselect would still be better, but it would be better in a fairly special case. Of course, improving the bitselect situation and adding signselect don't have to be mutually exclusive; signselect will never be worse (given that the codegen engine can detect Goldmont) than an equivalent replacement, but if I had to choose what to spend the resources on, I would focus on making bitselect better.
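A minimal sketch of the mask-tracking idea described in the comment above, assuming a single-pass code generator; the enum and opcode names are illustrative, not V8's actual data structures:

```c
#include <stdint.h>

/* Illustrative opcode subset; a real engine would cover all SIMD opcodes. */
typedef enum { OP_I32X4_LT_S, OP_I32X4_EQ, OP_I32X4_SHR_S, OP_OTHER } simd_op;

typedef enum {
  VALUE_UNKNOWN,   /* arbitrary bits per lane */
  VALUE_LANE_MASK  /* every lane is known to be all-0 or all-1 */
} value_kind;

/* Tag the result of one SIMD op: comparisons always produce lane masks,
   and an arithmetic shift right by lane_bits-1 broadcasts the sign bit.
   shift_imm is only meaningful for shift opcodes (pass 0 otherwise). */
static value_kind classify(simd_op op, uint32_t shift_imm) {
  switch (op) {
    case OP_I32X4_LT_S:
    case OP_I32X4_EQ:
      return VALUE_LANE_MASK;
    case OP_I32X4_SHR_S:
      return shift_imm == 31 ? VALUE_LANE_MASK : VALUE_UNKNOWN;
    default:
      return VALUE_UNKNOWN;
  }
}

/* At v128.bitselect(v1, v2, mask): if the mask operand classifies as
   VALUE_LANE_MASK, a backend targeting SSE4.1/AVX can emit a single
   (V)PBLENDVB instead of the generic and/andnot/or sequence without
   changing observable results. */
```

In the simplest form this is exactly the peephole match on the source Wasm instructions that the comment describes.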
@zeux thanks for the comments, really. I'm working on a design proposal for V8 that will help us perform significantly better cross-instruction optimization, first by constant folding and hoisting, then by peephole optimization. Would you mind commenting on that when it's ready for review? This is a good use case for the proposal.
There are several use-cases for
In other words, XNNPACK doesn't really have a use-case for
I don't have a particular opinion on the utility of this proposal, but I'm very curious about the code generation. Specifically, why is ARM64 performing worse when Cortex-A55/Cortex-A75 both have only one shift port, and both signselect on ARM64 and bitselect with a signed shift right ultimately yield two instructions? Are you able to post the codegen before and after? Similarly, if I give you a couple of patches for ARM64 to try, can you see if they yield a difference in performance? We only need to pick one of the bad performers.
FWIW, I think optimizers can and should deal with a compare-with-zero argument to bitselect and lower to a blend when that is appropriate; this should not be hard and would largely make signselect unnecessary. (Ties into our discussion about what optimizers should do.) We should not add signselect until we've seen the outcome of that work. In addition, it looks as if there's a broader discussion to be had here about the general utility of signselect.
The following are the performance numbers for SignSelect when tested with XNNPACK's ELU and Sigmoid kernels. The table shows the performance benefit of the new signselect instruction over the baseline: the Compute and Scaling columns give the ratio of Baseline time to SignSelect time, measured on the compute cores and the scaling cores respectively.
Adding a preliminary vote for the inclusion of Sign Select operations to the SIMD proposal below. Please vote with:
- 👍 For including Sign Select operations
To quote @lars-t-hansen from #430:
fwiw my take on signselect is that I would not expect engines to optimize the tree above into a single instruction, but I also wouldn't consider this instruction so fundamental as to warrant a special opcode, and would settle on a compare+select lowering - on ARM that happens to already be the optimal instruction pattern anyhow.
@zeux that means we would need i64.gt_s, right?
@jan-wassenberg I don't see these as directly related. If we don't standardize i64.gt_s - which is essentially what's going to happen I suspect (I'd love to have these in the spec but the vote was split) - we're effectively making a statement that you can't efficiently use SIMD for 64-bit integer workloads that involve comparisons. Presence or absence of signselect is orthogonal to this decision - if you work with f64 you can still use f64.gt to feed bitselect. If we got signselect then naturally you'd be able to emulate some i64.gt_s-based selects with it - but not all! For example,
And of course you can still emulate signselect with i64.shr(63) + bitselect, although I'm not sure what the codegen looks like on v8 today.
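For concreteness, a sketch of the shr + bitselect emulation mentioned above, written with the existing clang wasm_simd128.h intrinsics (the helper name is made up for this write-up; compile with -msimd128):

```c
#include <wasm_simd128.h>

/* Per 64-bit lane: result = (c lane is negative) ? a : b.
   i64x2.shr_s by 63 broadcasts each lane's sign bit into a full-lane mask,
   which then drives v128.bitselect. */
static inline v128_t select_by_sign_i64x2(v128_t a, v128_t b, v128_t c) {
  v128_t mask = wasm_i64x2_shr(c, 63);     /* all-1 where c < 0, else all-0 */
  return wasm_v128_bitselect(a, b, mask);  /* take bits from a where mask is 1 */
}
```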
@zeux Ah, I didn't realize we had i64.shr. That does allow some limited form of comparison. Its implementation is probably using something like signselect on SSE4, here's what Highway is doing:
Seems rather wasteful to use such a shr instead of signselect. If both signselect and gt are frowned upon, perhaps this |
I'm neutral on whether we want signselect or not, but I'm not committing to recognizing that tree as an instance of signselect :-)
We didn't have time in the meeting to cover this, but I'd like to propose a small building block that may avoid the need for signselect and possibly i64.gt and i64.abs. Currently all we have for this is i64.shr + bitselect, or f64.gt (3+ cycle plus domain switch penalty). Various platforms have fast ways to implement this that users cannot match/know:
I have found this abstraction useful in multiple settings; perhaps it is less controversial than these other ops?
@jan-wassenberg Maybe I'm missing something, but what exactly is the proposal?
@Maratyszcza Ah, sorry to be unclear. The proposal is to add broadcastsign (or some better name), with the semantics of i##.shr (arithmetic shift right) by ##-1. It allows doing less work than an actual i64.shr on SSE4. I proposed blendv above, but even better would be i32.gt(zero) + _mm_shuffle_epi32(gt, _MM_SHUFFLE(3,3,1,1)) to copy the upper 32 bits to the lower 32. This operation would be useful for implementing various user code, including 64-bit comparison or abs(), which then might not have to be added as new instructions.
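A sketch of the shuffle-based lowering described above, in plain SSE2 intrinsics (illustrative only; not Highway's actual code):

```c
#include <emmintrin.h>  /* SSE2 */

/* Broadcast the sign bit of each 64-bit lane into all 64 bits, i.e. the
   effect of i64x2.shr_s(x, 63), without a 64-bit arithmetic shift (which
   SSE2 lacks): compare the 32-bit halves against zero, then copy each
   lane's high-dword result over its low dword. */
static inline __m128i broadcast_sign_i64x2(__m128i x) {
  __m128i sign32 = _mm_cmpgt_epi32(_mm_setzero_si128(), x);  /* 0 > x per dword */
  return _mm_shuffle_epi32(sign32, _MM_SHUFFLE(3, 3, 1, 1)); /* dup high dwords */
}
```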
@jan-wassenberg I agree that this is an important operation, but I don't think we need an extra instruction for that. Probably the specification should emphasise that |
Closing as per #436.
Introduction
Piece-wise approximations with a discontinuity at zero are common in numerical algorithms. Such approximations take the form y := x < 0 ? f(x) : g(x). E.g. in artificial neural networks, the ReLU [1], Leaky ReLU [2], Parametric ReLU [3], ELU [4], and SELU [5] functions involve such piece-wise approximations. Selection between two piece-wise approximations depending on the sign of the input can be implemented with two WebAssembly SIMD instructions from the current specification: one instruction to replicate the highest bit of the input elements into all bits of the elements so they can be used as a bit-selection mask, and a bitselect instruction to choose between the two values, f(x) and g(x), depending on the constructed selection mask. There are, however, two considerations which make such decomposition suboptimal in practice:
1. To replicate the highest bit of the elements, WebAssembly code must either compare the input with zero (e.g. i32x4.lt_s(x, v128.const(0)) for 32-bit elements) or do an integer arithmetic shift right by sizeof(element) * CHAR_BIT - 1 (e.g. i32x4.shr_s(x, 31) for 32-bit elements). The optimal way to do the replication depends on the target architecture: typically integer comparison is more efficient than integer shift, but if the WebAssembly engine has to explicitly instantiate v128.const(0), the trade-off skews the other way. ARM NEON and ARM64 provide instructions for integer comparison with zero (e.g. VCLT.S32 Qd, Qn, #0 / CMLT Vd.4S, Vn.4S, #0 for 32-bit elements), but on x86 comparison with zero would always require two instructions and an extra register, so arithmetic shift right (e.g. PSRAD xmm_x, 31) is often preferable.
2. x86 processors with the SSE4.1 instruction set provide PBLENDVB/BLENDVPS/BLENDVPD instructions which choose between two vectors of 8-/32-/64-bit elements depending on the sign of the elements in the selection mask. In lieu of WebAssembly instructions that directly map to these SSE4.1 instructions, WebAssembly SIMD engines have to generate 4-5 instructions (1-2 to construct the bit mask, and 3 to emulate v128.bitselect) where 1 would suffice.
To enable efficient implementation of piece-wise approximations with a discontinuity at zero, this PR introduces new Sign Select instructions which select between elements of two input vectors depending on the sign of the elements in the mask vector.
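As an illustration of the decomposition described above, here is a Leaky ReLU kernel written against the current instruction set using clang's wasm_simd128.h intrinsics (the function itself is a sketch for this write-up; the proposed instruction appears only in the comment, since no intrinsic exists for it):

```c
#include <wasm_simd128.h>

/* y = x < 0 ? alpha * x : x, per f32 lane.  The sign of x (including -0.0)
   is broadcast to a full-lane mask with i32x4.shr_s(x, 31), which then
   drives v128.bitselect.  With the proposed instruction, the last two
   steps would collapse into a single v32x4.signselect(scaled, x, x). */
v128_t leaky_relu_f32x4(v128_t x, float alpha) {
  v128_t scaled = wasm_f32x4_mul(x, wasm_f32x4_splat(alpha));
  v128_t mask   = wasm_i32x4_shr(x, 31);        /* all-1 where the sign bit is set */
  return wasm_v128_bitselect(scaled, x, mask);  /* scaled where x is negative */
}
```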
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
y = v8x16.signselect(v1, v2, c) is lowered to VPBLENDVB xmm_y, xmm_v2, xmm_v1, xmm_c
y = v16x8.signselect(v1, v2, c) is lowered to VPSRAW xmm_tmp, xmm_c, 15 + VPBLENDVB xmm_y, xmm_v2, xmm_v1, xmm_tmp
y = v32x4.signselect(v1, v2, c) is lowered to VBLENDVPS xmm_y, xmm_v2, xmm_v1, xmm_c
y = v64x2.signselect(v1, v2, c) is lowered to VBLENDVPD xmm_y, xmm_v2, xmm_v1, xmm_c
x86/x86-64 processors with SSE4.1 instruction set
y = v8x16.signselect(v1, v2, c) is lowered to:
MOVDQA xmm0, xmm_c
MOVDQA xmm_y, xmm_v2
PBLENDVB xmm_y, xmm_v1 [implicit xmm0 input]
y = v16x8.signselect(v1, v2, c) is lowered to:
PXOR xmm0, xmm0
PCMPGTW xmm0, xmm_c
MOVDQA xmm_y, xmm_v2
PBLENDVB xmm_y, xmm_v1 [implicit xmm0 input]
y = v32x4.signselect(v1, v2, c) is lowered to:
MOVAPS xmm0, xmm_c
MOVAPS xmm_y, xmm_v2
BLENDVPS xmm_y, xmm_v1 [implicit xmm0 input]
y = v64x2.signselect(v1, v2, c) is lowered to:
MOVAPD xmm0, xmm_c
MOVAPD xmm_y, xmm_v2
BLENDVPD xmm_y, xmm_v1 [implicit xmm0 input]
x86/x86-64 processors with SSE2 instruction set
y = v8x16.signselect(v1, v2, c) is lowered to:
PXOR xmm_tmp, xmm_tmp
PCMPGTB xmm_tmp, xmm_c
MOVDQA xmm_y, xmm_v1
PAND xmm_y, xmm_tmp
PANDN xmm_tmp, xmm_v2
POR xmm_y, xmm_tmp
y = v16x8.signselect(v1, v2, c) is lowered to:
PXOR xmm_tmp, xmm_tmp
PCMPGTW xmm_tmp, xmm_c
MOVDQA xmm_y, xmm_v1
PAND xmm_y, xmm_tmp
PANDN xmm_tmp, xmm_v2
POR xmm_y, xmm_tmp
y = v32x4.signselect(v1, v2, c) is lowered to:
PXOR xmm_tmp, xmm_tmp
PCMPGTD xmm_tmp, xmm_c
MOVDQA xmm_y, xmm_v1
PAND xmm_y, xmm_tmp
PANDN xmm_tmp, xmm_v2
POR xmm_y, xmm_tmp
y = v64x2.signselect(v1, v2, c) is lowered to:
PXOR xmm_tmp, xmm_tmp
PCMPGTD xmm_tmp, xmm_c
PSHUFD xmm_tmp, xmm_tmp, 0xF5
MOVDQA xmm_y, xmm_v1
PAND xmm_y, xmm_tmp
PANDN xmm_tmp, xmm_v2
POR xmm_y, xmm_tmp
ARM64 processors
y = v8x16.signselect(v1, v2, c) is lowered to CMLT Vy.16B, Vc.16B, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
y = v16x8.signselect(v1, v2, c) is lowered to CMLT Vy.8H, Vc.8H, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
y = v32x4.signselect(v1, v2, c) is lowered to CMLT Vy.4S, Vc.4S, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
y = v64x2.signselect(v1, v2, c) is lowered to CMLT Vy.2D, Vc.2D, #0 + BSL Vy.16B, Vv1.16B, Vv2.16B
ARMv7 processors with NEON extension
y = v8x16.signselect(v1, v2, c) is lowered to VCLT.S8 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
y = v16x8.signselect(v1, v2, c) is lowered to VCLT.S16 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
y = v32x4.signselect(v1, v2, c) is lowered to VCLT.S32 Qy, Qc, #0 + VBSL Qy, Qv1, Qv2
y = v64x2.signselect(v1, v2, c) is lowered to VSHR.S64 Qy, Qc, #63 + VBSL Qy, Qv1, Qv2
Other processors and instruction sets
y = v8x16.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i8x16.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i8x16.shr_s(c, 7))
y = v16x8.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i16x8.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i16x8.shr_s(c, 15))
y = v32x4.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i32x4.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i32x4.shr_s(c, 31))
y = v64x2.signselect(v1, v2, c) is lowered as either v128.bitselect(v1, v2, i64x2.lt_s(c, v128.const(0))) or v128.bitselect(v1, v2, i64x2.shr_s(c, 63))
References
[1] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning (ICML), 2010.
[2] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML), 2013.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv e-prints, February 2015.
[4] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289.
[5] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. arXiv e-prints, 2017.