I guess with #416 (likely to happen), this PR can be simplified to just i64x2.all_true.
5257ef3 to da1ea23
Removed
I think this PR is similar to #368: a reduction from many lanes to a scalar. Since its
da1ea23 to bc07d22
@abrown, it's a reduction, but my expectation here would be that all_true is frequently used in a control context, i.e. as the argument to a branch, not captured for its value. In that case there's no reason for the final setnz / movzxbl pair, and the final code doesn't look all that bad IMO.
Yes, under that assumption it is not as bad (still not great, though). I guess you are also assuming that engines would be looking for and applying this type of optimization? I think I would as well, but I'm not sure about everyone else.
Yes, I would assume engines would optimize boolean evaluation for control even for SIMD (SpiderMonkey does, anyway). Curious about the SSE4.1 lowering actually. Why not
@lars-t-hansen Both sequences use the same number of instructions, but the SSE4.1 sequence in the proposal has a shorter dependency chain. Also,
Ideally it would be great to test this, but that is just my opinion.
bc07d22 to dd965ae
Yes, the opcode looks good.
0xfdc3 collides with i16x8.extadd_pairwise_i8x16_u in v8 currently; I guess I should change ext add (since I picked that for prototyping).
Probably best to err on the side of caution and make the opcode TBD, then.
Think the last step of ARMv7 lowering is incorrect: IIUC Ry will either be -1 or 0, and the result we want is 0 and 1 respectively. Also the sequence doesn't seem to work for
Still 5 instructions, lmk if you figure out something better.
@Maratyszcza The ARM64 instruction sequence is wrong too - here's a corrected version:
However, the following is an improved sequence, since it avoids the reduction:
Finally, possibly the best alternative, which is, unfortunately, an instruction longer:
This instruction was accepted into the proposal in WebAssembly#415.
In reference to #415 (comment)
@Maratyszcza, I don't think that's right, if what you're getting at is correctness and not performance. My sequence compares the input to zero, meaning the resulting bitmask will be nonzero in any lane that is zero, hence the test succeeds if the bitmask is zero everywhere, and setz captures that.
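The reasoning in that comment can be checked with a small Python model. This is only a sketch: the function names are hypothetical, and since the sequence itself is not shown here, the model assumes a per-lane compare with zero followed by a test of the resulting mask against itself.

```python
MASK64 = (1 << 64) - 1


def cmpeq_zero_64(lanes):
    """Per-lane compare with zero: all-ones where the lane is zero, else 0."""
    return tuple(MASK64 if (x & MASK64) == 0 else 0 for x in lanes)


def zero_flag(mask_lanes):
    """Model of ZF after testing the mask against itself: 1 iff every bit is clear."""
    return int(all(m == 0 for m in mask_lanes))


def all_true_via_setz(lanes):
    """setz reads ZF, so the result is 1 exactly when no lane compared equal to zero."""
    return zero_flag(cmpeq_zero_64(lanes))


print(all_true_via_setz((3, 4)))  # both lanes nonzero
print(all_true_via_setz((0, 4)))  # one lane is zero
```

Under this model the mask is zero everywhere exactly when every lane is nonzero, which is the condition all_true is meant to report.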
* [spectext] Add i64x2.all_true
  This instruction was accepted into the proposal in #415.
* Simplify syntax/instruction bitmask
This was merged in WebAssembly#415.
Introduction
This is a proposal to add a 64-bit variant of the existing all_true instruction. This instruction complements the set of double-precision (f64) and 64-bit integer (i64) SIMD operations and allows checking for special conditions. It appears that these instructions were accidentally omitted when double-precision SIMD was added back to the spec in #101. Note that despite i64x2 in the instruction names, they are agnostic to the interpretation of SIMD lanes and work for f64x2 as well.
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instruction can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
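As a reference point for the lowerings, here is a minimal scalar sketch of the semantics being lowered, written in Python. The helper names are hypothetical and not part of the proposal; the sketch assumes all_true yields 1 when every 64-bit lane is nonzero and 0 otherwise, and it treats f64x2 inputs purely as bit patterns to illustrate the lane-agnostic behavior.

```python
import struct

MASK64 = (1 << 64) - 1


def i64x2_all_true(lanes):
    """Return 1 if both 64-bit lanes are nonzero, else 0."""
    lo, hi = lanes
    return int((lo & MASK64) != 0 and (hi & MASK64) != 0)


def f64x2_bits(a, b):
    """Reinterpret two doubles as two 64-bit integer lanes (bit patterns only)."""
    return struct.unpack("<2Q", struct.pack("<2d", a, b))


print(i64x2_all_true((1, -1)))                 # both lanes nonzero
print(i64x2_all_true((0, 7)))                  # low lane is zero
print(i64x2_all_true(f64x2_bits(-0.0, 1.5)))   # -0.0 has its sign bit set
print(i64x2_all_true(f64x2_bits(0.0, 1.5)))    # +0.0 is all-zero bits
```

Because only the bit pattern matters, a lane holding -0.0 counts as nonzero while +0.0 does not.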
x86/x86-64 processors with AVX instruction set
y = i64x2.all_true(x) is lowered to:
VPXOR xmm_tmp1, xmm_tmp1, xmm_tmp1
VPCMPEQD xmm_tmp2, xmm_tmp2, xmm_tmp2
VPCMPEQQ xmm_tmp1, xmm_tmp1, xmm_x
VPTEST xmm_tmp1, xmm_tmp2
SETZ reg_y
MOVZXBL reg_y, reg_y
x86/x86-64 processors with SSE4.1 instruction set
y = i64x2.all_true(x) is lowered to:
PXOR xmm_tmp1, xmm_tmp1
PCMPEQD xmm_tmp2, xmm_tmp2
PCMPEQQ xmm_tmp1, xmm_x
PTEST xmm_tmp1, xmm_tmp2
SETZ reg_y
MOVZXBL reg_y, reg_y
x86/x86-64 processors with SSE2 instruction set
y = i64x2.all_true(x) is lowered to:
PXOR xmm_tmp1, xmm_tmp1
PCMPEQD xmm_tmp1, xmm_x
PSHUFD xmm_tmp2, xmm_tmp1, 0xB1
PAND xmm_tmp1, xmm_tmp2
PMOVMSKB reg_y, xmm_tmp1
TEST reg_y, reg_y
SETZ reg_y
MOVZXBL reg_y, reg_y
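SSE2 has no 64-bit integer compare, so the first instructions of this sequence build one from 32-bit pieces. The Python sketch below models only that part (PCMPEQD against zero, PSHUFD with immediate 0xB1, then PAND); the function names are hypothetical, and lanes are modeled as plain integers rather than real vector registers.

```python
MASK32 = 0xFFFFFFFF


def split_i64x2(lanes):
    """View two 64-bit lanes as four 32-bit lanes, low half first."""
    out = []
    for x in lanes:
        out += [x & MASK32, (x >> 32) & MASK32]
    return out


def pcmpeqd_zero(v):
    """32-bit compare with zero: all-ones where the 32-bit lane is zero."""
    return [MASK32 if x == 0 else 0 for x in v]


def pshufd(v, imm8):
    """Select source lane i from bits [2i+1:2i] of imm8; 0xB1 swaps adjacent pairs."""
    return [v[(imm8 >> (2 * i)) & 3] for i in range(4)]


def pand(a, b):
    """Bitwise AND of corresponding lanes."""
    return [x & y for x, y in zip(a, b)]


def zero_lane_mask(lanes):
    """All-ones across a 64-bit lane only when both of its 32-bit halves are zero."""
    eq = pcmpeqd_zero(split_i64x2(lanes))
    return pand(eq, pshufd(eq, 0xB1))


print(zero_lane_mask((0, 1 << 32)))  # lane 0 is zero; lane 1 has a nonzero high half
```

The swap-and-AND step is what keeps a lane such as 1 << 32, whose low half is zero, from being misreported as a zero lane.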
ARM64 processors
y = i64x2.all_true(x) is lowered to:
CMEQ Vtmp.2D, Vx.2D, 0
UMINV Stmp, Vtmp.4S
UMOV Ry, Vtmp.S[0]
NEG Ry, Ry
ARMv7 processors with NEON instruction set
y = i64x2.all_true(x) is lowered to:
VCEQ.I32 Qtmp, Qx, 0
VPMIN.U32 Dtmp_lo, Dtmp_lo, Dtmp_hi
VPMIN.U32 Dtmp_lo, Dtmp_lo, Dtmp_lo
VMOV Ry, Dtmp_lo[0]
NEG Ry, Ry