This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.all_true instructions #415

Merged (4 commits) on Jan 30, 2021

Conversation

@Maratyszcza (Contributor) commented Dec 29, 2020

Introduction

This is a proposal to add a 64-bit variant of the existing all_true instruction. It complements the set of double-precision (f64) and 64-bit integer (i64) SIMD operations and allows checking for special conditions. It appears that this instruction was accidentally omitted when the double-precision SIMD operations were added back to the spec in #101. Note that despite the i64x2 prefix in the instruction name, it is agnostic to the interpretation of SIMD lanes and works for f64x2 values as well.
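For reference, here is a minimal scalar sketch of the intended semantics; the function name is hypothetical and not part of the spec. The instruction returns 1 only if every 64-bit lane is nonzero.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar reference for i64x2.all_true: the result is true only
   if both 64-bit lanes are nonzero. Because only "is the lane zero?" is
   observed, an f64x2 value can be reinterpreted and tested the same way. */
static bool i64x2_all_true_ref(const uint64_t lanes[2]) {
    return lanes[0] != 0 && lanes[1] != 0;
}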

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instruction can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • i64x2.all_true
    • y = i64x2.all_true(x) is lowered to:
      • VPXOR xmm_tmp1, xmm_tmp1, xmm_tmp1
      • VPCMPEQD xmm_tmp2, xmm_tmp2, xmm_tmp2
      • VPCMPEQQ xmm_tmp1, xmm_tmp1, xmm_x
      • VPTEST xmm_tmp1, xmm_tmp2
      • SETNZ reg_y
      • MOVZXBL reg_y, reg_y

x86/x86-64 processors with SSE4.1 instruction set

  • i64x2.all_true
    • y = i64x2.all_true(x) is lowered to:
      • PXOR xmm_tmp1, xmm_tmp1
      • PCMPEQD xmm_tmp2, xmm_tmp2
      • PCMPEQQ xmm_tmp1, xmm_x
      • PTEST xmm_tmp1, xmm_tmp2
      • SETNZ reg_y
      • MOVZXBL reg_y, reg_y
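For illustration only, here is a rough C intrinsics sketch of an equivalent check on SSE4.1 targets (a compiler emits the VEX-encoded forms on AVX targets). It is not a literal transcription of the sequences above, and the function name is made up.

#include <smmintrin.h>  /* SSE4.1 */
#include <stdbool.h>

/* Sketch: mark each 64-bit lane of x that equals zero, then use PTEST
   (_mm_testz_si128) to check that the mask is all zeros, i.e. that no
   lane of x was zero. */
static bool i64x2_all_true_sse41(__m128i x) {
    __m128i zero_lanes = _mm_cmpeq_epi64(x, _mm_setzero_si128());
    return _mm_testz_si128(zero_lanes, zero_lanes) != 0;
}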

x86/x86-64 processors with SSE2 instruction set

  • i64x2.all_true
    • y = i64x2.all_true(x) is lowered to:
      • PXOR xmm_tmp1, xmm_tmp1
      • PCMPEQD xmm_tmp1, xmm_x
      • PSHUFD xmm_tmp2, xmm_tmp1, 0xB1
      • PAND xmm_tmp1, xmm_tmp2
      • PMOVMSKB reg_y, xmm_tmp1
      • CMP reg_y, 0xFFFF
      • SETE reg_y
      • MOVZXBL reg_y, reg_y
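Again for illustration, an SSE2-only intrinsics sketch of an equivalent check (assumed function name, not a literal transcription of the sequence above):

#include <emmintrin.h>  /* SSE2 */
#include <stdbool.h>

/* Sketch: mark 32-bit lanes of x that are zero, AND each half with its
   swapped neighbour (the PSHUFD 0xB1 pattern) so a marker survives only
   where the whole 64-bit lane is zero, then require an empty byte mask. */
static bool i64x2_all_true_sse2(__m128i x) {
    __m128i zero32  = _mm_cmpeq_epi32(x, _mm_setzero_si128());
    __m128i swapped = _mm_shuffle_epi32(zero32, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i zero64  = _mm_and_si128(zero32, swapped);
    return _mm_movemask_epi8(zero64) == 0;
}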

ARM64 processors

  • i64x2.all_true
    • y = i64x2.all_true(x) is lowered to:
      • CMEQ Vtmp.2D, Vx.2D, 0
      • UMINV Stmp, Vtmp.4S
      • UMOV Ry, Vtmp.S[0]
      • NEG Ry, Ry
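A corresponding AArch64 NEON intrinsics sketch, illustrative only (see the corrections to the exact instruction selection discussed later in the thread):

#include <arm_neon.h>
#include <stdbool.h>

/* Sketch: vceqzq_u64 sets a lane to all-ones where the corresponding
   64-bit lane of x is zero; the horizontal unsigned maximum over the
   32-bit view is then nonzero exactly when some lane of x was zero. */
static bool i64x2_all_true_aarch64(uint64x2_t x) {
    uint64x2_t zero_lanes = vceqzq_u64(x);
    return vmaxvq_u32(vreinterpretq_u32_u64(zero_lanes)) == 0;
}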

ARMv7 processors with NEON instruction set

  • i64x2.all_true
    • y = i64x2.all_true(x) is lowered to:
      • VCEQ.I32 Qtmp, Qx, 0
      • VPMIN.U32 Dtmp_lo, Dtmp_lo, Dtmp_hi
      • VPMIN.U32 Dtmp_lo, Dtmp_lo, Dtmp_lo
      • VMOV Ry, Dtmp_lo[0]
      • NEG Ry, Ry
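And an ARMv7 NEON intrinsics sketch of an equivalent check (illustrative; it folds with pairwise max/min rather than mirroring the exact sequence above):

#include <arm_neon.h>
#include <stdbool.h>

/* Sketch: fold each 64-bit lane's two 32-bit halves with an unsigned
   pairwise max (nonzero iff the lane is nonzero), then require both
   folded values to be nonzero via an unsigned pairwise min. */
static bool i64x2_all_true_neon32(uint32x4_t x_as_u32x4) {
    uint32x2_t per_lane = vpmax_u32(vget_low_u32(x_as_u32x4), vget_high_u32(x_as_u32x4));
    uint32x2_t folded   = vpmin_u32(per_lane, per_lane);
    return vget_lane_u32(folded, 0) != 0;
}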

@ngzhian (Member) commented Jan 6, 2021

I guess with #416 (likely to happen), this PR can be simplified to just i64x2.all_true.

@Maratyszcza changed the title from "i64x2.any_true and i64x2.all_true instructions" to "i64x2.all_true instructions" on Jan 11, 2021
@Maratyszcza (Contributor, Author)

Removed i64x2.any_true

@abrown (Contributor) commented Jan 11, 2021

I think this PR is similar to #368: a reduction from many lanes to a scalar. Since the i32x4 version, i32x4.all_true, has a lowering as unfortunate on x86 as this one, why not consider both together? I would argue that the long sequences that *.all_true introduces are potholes in the spec--not giant cliffs, but still unexpected performance issues in a spec designed for near-native performance. But, doubting that we would agree to remove *.all_true altogether in favor of the *.bitmask instructions, this i64x2.all_true version makes sense.

@lars-t-hansen (Contributor)

@abrown, it's a reduction, but my expectation here would be that all_true is frequently used in a control context, i.e. as the argument to a branch, not captured for its value. In that case there's no reason for the final setnz / movzxbl pair, and the final code doesn't look all that bad IMO.

@abrown (Contributor) commented Jan 12, 2021

Yes, under that assumption it is not as bad (still not great, though). I guess you are also assuming that engines would be looking for and applying this type of optimization? I think I would as well, but I'm not sure about everyone else.

@lars-t-hansen (Contributor)

Yes, I would assume engines would optimize boolean evaluation for control even for SIMD (SpiderMonkey does, anyway).

Curious about the sse4.1 lowering actually. Why not

PXOR tmp, tmp
PCMPEQQ tmp, src
PMOVMSKB dest, tmp
TEST dest, dest
SETZ dest
MOVZXBL dest, dest

@Maratyszcza (Contributor, Author) commented Jan 12, 2021

@lars-t-hansen Both sequences use the same number of instructions, but the SSE4.1 sequence in the proposal has a shorter dependency chain. Also, TEST dest, dest should be CMP dest, 0xFFFF with a 4-byte immediate.

@penzn (Contributor) commented Jan 13, 2021

Ideally it would be great to test this, but that is just my opinion.

@dtig (Member) commented Jan 25, 2021

Approving, as this was voted on in #419. @tlively could you confirm that the assigned opcode looks good for the tools?

@tlively (Member) left a comment

Yes, the opcode looks good.

@ngzhian (Member) commented Jan 28, 2021

0xfdc3 currently collides with i16x8.extadd_pairwise_i8x16_u in V8; I guess I should change ext add (since I picked that opcode for prototyping).

@tlively (Member) commented Jan 28, 2021

Probably best to err on the side of caution and make the opcode TBD, then.

@ngzhian (Member) commented Jan 30, 2021

Think the last step of the ARMv7 lowering is incorrect:
NEG Ry, Ry should be ADD Ry, Ry, #1

IIUC Ry will either be -1 or 0, and the result we want is 0 and 1 respectively.

Also, the sequence doesn't seem to work for i64x2(0, -1); I think all the vpmin should be vpmax. Here's what I cooked up that seems to pass the tests:

vpmax.u32 tmp_low, src_low, src_high
vceq.i32 tmp_low, tmp_low, #0
vpmax.u32 tmp_low, tmp_low, tmp_low
vmov.32 dst, tmp_low[0]
add dst, dst, #1

Still 5 instructions, lmk if you figure out something better.

@tlively merged commit 3d8c870 into WebAssembly:master on Jan 30, 2021
@akirilov-arm

@Maratyszcza The ARM64 instruction sequence is wrong too - here's a corrected version:

CMEQ Vtmp.2d, Vx.2d, #0
UMAXV Stmp, Vtmp.4s
FMOV Wy, Stmp
ADD Wy, Wy, #1

However, the following is an improved sequence, since it avoids the reduction:

CMEQ Vtmp.2d, Vx.2d, #0
ADDP Dtmp, Vtmp.2d
FCMP Dtmp, Dtmp
CSET Xy, eq

Finally, possibly the best alternative, which is, unfortunately, an instruction longer:

FMOV Xy, Dx
FMOV Xtmp, Vx.d[1]
CMP Xy, #0
CCMP Xtmp, #0, #4, ne
CSET Xy, ne

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021
This instruction was accepted into the proposal in WebAssembly#415.
@lars-t-hansen (Contributor)

In reference to #415 (comment)

Also, TEST dest, dest should be CMP dest, 0xFFFF with 4-byte immediate.

@Maratyszcza, I don't think that's right, if what you're getting at is correctness and not performance. My sequence compares the input to zero, meaning the resulting bitmask will be nonzero in any lane that is zero; hence the test succeeds if the bitmask is zero everywhere, and setz captures that.
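For what it's worth, here is a small intrinsics sketch of the check lars-t-hansen describes (assumed names, illustrative only): the byte mask of the "lane equals zero" comparison must be empty for all_true to hold.

#include <smmintrin.h>  /* SSE4.1, for _mm_cmpeq_epi64 */
#include <stdbool.h>

/* Sketch of the PCMPEQQ + PMOVMSKB + TEST/SETZ idea: the mask is nonzero in
   any byte belonging to a zero lane, so all_true holds iff the mask is 0. */
static bool i64x2_all_true_movemask(__m128i x) {
    __m128i zero_lanes = _mm_cmpeq_epi64(x, _mm_setzero_si128());
    return _mm_movemask_epi8(zero_lanes) == 0;
}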

ngzhian added a commit that referenced this pull request Feb 3, 2021
* [spectext] Add i64x2.all_true

This instruction was accepted into the proposal in #415.

* Simplify syntax/instruction bitmask
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 12, 2021
ngzhian added a commit that referenced this pull request Feb 17, 2021