Codegen of i16x8.relaxed_laneselect #125
Comments
This is a good point! The intention is to allow |
I'm not intimately familiar with the motivations, but if the intention is to basically give users a choice of |
This may be a bug in the baseline compiler, thanks for flagging. Here's how the optimizing compiler handles it. |
Oh! Sorry I naively assumed they'd be similar, should have checked both! For your link @dtig though if I'm reading that right it's only different in using |
This is the right assumption, and should be true in most cases except for where we emit extra instructions because we can't specify register constraints in the baseline compiler. The basic functionality or the variants of the instructions should be the same. Hopefully there aren't many other cases where they don't match like they should. :)
You're correct that I was only responding to the i{32x4,64x2}.relaxed_laneselect bit. Working through the example in the OP, for the case where all the bits are set (deterministic output), everything works as expected, but for the case you outlined, |
Think this will get pretty ugly: we will have to special case i16x8, and have it check specifically for 0x0080 to 0x00FF, and return a mix of the top and bottom half. For reference, the spec for relaxed lane select is here: https://www.ngzhian.com/relaxed-simd/core/exec/numerics.html#xref-exec-numerics-op-relaxed-lane-mathrm-relaxed-lane-n-i-1-i-2-i-3 |
We could change the text as follows: "Relaxed lane selection is deterministic when all bits are set or unset in the mask. Otherwise depending on the host, either only the top bit of a lane is examined, only the top bit of each byte is examined, or all bits are examined (i.e. it becomes a bit select)." This change would allow using |
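The three host behaviors named in the proposed wording can be modeled on a single 16-bit lane. This is only a scalar sketch: the function names are hypothetical, and it assumes the bitselect convention `(a & m) | (b & ~m)`, with `a` chosen where the mask is set.

```python
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1

def top_bit_of_lane(a, b, m):
    """Only the top bit of the whole lane is examined."""
    return a if (m >> (LANE_BITS - 1)) & 1 else b

def top_bit_of_each_byte(a, b, m):
    """Only the top bit of each mask byte is examined (byte-wise blend)."""
    out = 0
    for shift in range(0, LANE_BITS, 8):
        src = a if (m >> (shift + 7)) & 1 else b
        out |= src & (0xFF << shift)
    return out

def bitselect(a, b, m):
    """All mask bits are examined."""
    return ((a & m) | (b & ~m)) & LANE_MASK

def allowed_results(a, b, m):
    """Every result a host could produce under the proposed wording."""
    return {f(a, b, m) for f in (top_bit_of_lane, top_bit_of_each_byte, bitselect)}

a, b = 0x1234, 0xABCD
print(sorted(hex(r) for r in allowed_results(a, b, 0xFFFF)))  # ['0x1234']
print(sorted(hex(r) for r in allowed_results(a, b, 0x0000)))  # ['0xabcd']
print(sorted(hex(r) for r in allowed_results(a, b, 0x0080)))  # ['0xab34', '0xab4d', '0xabcd']
```

With an all-ones or all-zeros mask the set collapses to one value, which is the deterministic case; the mixed mask 0x0080 gives three different answers in this model.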
Would it be possible to drop the |
I'm not opposed to removing
So maybe we want to change the wording, but keep all instructions, to allow for some implementation wiggle room. |
I don't disagree that symmetry is nice, but when it comes to wasm simd I think the ship might have already sailed on that? For example i64x2 unsigned comparisons are not present, there's no Alternatively why not go one step further? If the goal is to get access to |
We should probably discuss this at the next sync, unfortunately I'm OOO that day. (And Marat is also OOO). Okay I just checked, XNNPack has some code (not in production yet) that checks if i32x4.laneselect is implemented using blendvps (i.e. only checks top bit). So it is still useful (e.g. detecting the sign of floats) |
Sorry for the late reply. We should schedule a sync I think. |
Back to the main topic, I think there is a process angle to this. The feature has passed into phase 4 pending final spec review, and according to the 'phases' doc this should mean that only "minor cosmetic changes should occur". I am not sure where the line is for that, but changing semantics or dropping instructions may be pushing it. This also might be entirely up to the Working Group, who should be in control according to the doc. If we are considering either of the changes we should probably bring this up at a WG meeting. |
Personally at least I feel that there should still be room for changes. This proposal is not yet at phase 4, although I understand there was a provisional vote of sorts that everyone was ok with it moving to phase 4 with the spec text in place. It's perhaps worth noting, though, that I discovered this issue after said provisional vote. I personally feel that this is an issue that was discovered late in the process because this proposal progressed relatively quickly throughout the stages process. I think that's ok, but at the same time I would at least hope that the process won't be wielded in such a way to codify what might otherwise be considered a mistake in the official specification. |
Yes, I am also not sure how the provisional status of the vote plays into all of this. I don't think this would stop us making the change if the change is needed - I am just trying to understand what is the 'official' process for it right now. By the way, there might be a way to fix just the i16x8 lowering. I have an idea, but trying to see if there are better ways to do it, will share it today or tomorrow. To be super clear, I think it is important to resolve this issue. |
To use For SSE this would be
Two operand blend implicitly uses |
For me I still don't fully understand the motivation for the existence of this instruction. It seems to me like there are three possible motivations, and forgive me for possibly erroneously extrapolating here and please correct me if I'm wrong.
My personal take is that none of those motivations justify the existence of the instruction, but I recognize I may very well be missing something crucial! (and that I'm just one person and this is a pretty minor issue all things concerned) |
At the SIMD sync today (notes) we decided that a spec text tweak (#125 (comment)) will be made, and all instructions kept. Motivation is to keep the instruction lowering optimal (a single instruction for all 4 lane selects). |
Looks like the group is leaning towards changing the spec, though personally I would prefer (not super strongly) either removing the operation or changing the lowering, which have more or less the same effect. The argument in favor of changing semantics is that it would result in the lowest instruction count (not exactly a single instruction for SSE, but lowest nonetheless). However as you pointed out this allows for implementing everything via
Slightly or about the same. I see where you are going with this, I don't know if there is a restriction anywhere preventing one type of
Agree, the intention was to go with it if that is necessary. During the discussion we decided that removing is still OK since we are not in phase 4. Probably also worth mentioning that strict, sign based, |
It was proposed as WebAssembly/simd#124, but rejected by the WG. |
One more thought. In my view the tradeoffs of changing to byte-wise select are between these two alternatives:
vs
The tradeoff is really between handicapping three operations, but only sometimes, vs handicapping only one, but always. The effect is hard to quantify ahead of time - in the best case scenario there would be no overhead, but this requires both a runtime with the most efficient implementation and an application that would do all the testing to detect that. I think in practice the results are going to drift away from this best case scenario and we are probably not going to be a lot better off than only affecting |
I don't see why there're tradeoffs at all. The guarantee provided by relaxed laneselect is that if the mask is produced by a SIMD comparison (or boolean operations on results of SIMD comparisons), then bitselect on the mask can be replaced with relaxed laneselect of the corresponding element size. Note that this guarantee works regardless of how relaxed laneselect is implemented - as bitselect, checking top bit of each byte, or checking top bit of each lane - and allows for generation of the optimal single-instruction lowering across all element types on both x86 and ARM. |
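A small check of that guarantee, as a sketch only: the helpers below are hypothetical scalar models over lists of eight 16-bit lanes, not any engine's code. The mask comes from a lane-wise comparison, so every lane is all ones or all zeros, and the three candidate implementations all match bitselect.

```python
LANE_BITS = 16
ONES = (1 << LANE_BITS) - 1

def cmp_gt(xs, ys):
    """Lane-wise greater-than: all ones where true, all zeros where false."""
    return [ONES if x > y else 0 for x, y in zip(xs, ys)]

def bitselect(a, b, m):
    return [((x & k) | (y & ~k)) & ONES for x, y, k in zip(a, b, m)]

def laneselect_top_bit_of_lane(a, b, m):
    return [x if (k >> (LANE_BITS - 1)) & 1 else y for x, y, k in zip(a, b, m)]

def laneselect_top_bit_of_byte(a, b, m):
    out = []
    for x, y, k in zip(a, b, m):
        lane = 0
        for shift in range(0, LANE_BITS, 8):
            src = x if (k >> (shift + 7)) & 1 else y
            lane |= src & (0xFF << shift)
        out.append(lane)
    return out

# Lane-wise maximum of non-negative values, written as compare + select.
xs = [3, 700, 100, 0, 9, 42, 300, 8]
ys = [5, 9, 100, 2, 1, 41, 250, 8]
mask = cmp_gt(xs, ys)
expected = [max(x, y) for x, y in zip(xs, ys)]

results = [
    bitselect(xs, ys, mask),
    laneselect_top_bit_of_lane(xs, ys, mask),
    laneselect_top_bit_of_byte(xs, ys, mask),
]
print(all(r == expected for r in results))  # True
```

The example uses non-negative lane values so that Python integers stand in directly for the lane bit patterns.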
I'm a bit confused, is there an assumption that the input mask to laneselect is always the result of an element-wise comparison? Was that the original intention with this operation? I guess this might be the source of misunderstanding :) In the case of a mask produced via SIMD compare you are absolutely right - it would always be a single instruction, since the lane values are either all 0's or all 1's. However the comparison isn't necessary for sign-based select if we are comparing to zero (it would be necessary for 16-bit lanes since there is no such operation though). |
This allows using pblendvb for i16x8 lane select. See WebAssembly#125.
The input mask to laneselect doesn't have to be the result of element-wise comparison, but this was the use-case targeted by these instructions. Removing comparison for sign-based select is essentially a hack. It is made possible by some implementations of the relaxed laneselect, but it is not portable and not the target use-case. |
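To illustrate why skipping the comparison is not portable, here is a scalar sketch on one 32-bit lane. The function names are hypothetical; the "top bit of lane" model stands in for a blendvps-style lowering and the "all bits" model for a bitselect lowering. The raw bits of a float are used directly as the mask.

```python
import struct

def f32_bits(x):
    """Raw IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def select_top_bit_of_lane(a, b, m):
    """Model of a host that examines only bit 31 of the mask lane."""
    return a if m & 0x80000000 else b

def select_all_bits(a, b, m):
    """Model of a host that implements laneselect as bitselect."""
    return ((a & m) | (b & ~m)) & 0xFFFFFFFF

a, b = 0x11111111, 0x22222222

# Sign-based select without a comparison: the mask is the float itself.
raw_mask = f32_bits(-1.5)
print(hex(raw_mask))                                # 0xbfc00000
print(hex(select_top_bit_of_lane(a, b, raw_mask)))  # 0x11111111
print(hex(select_all_bits(a, b, raw_mask)))         # 0x11222222

# With an explicit "x < 0" comparison the mask is all ones and both agree.
cmp_mask = 0xFFFFFFFF if -1.5 < 0 else 0
print(select_top_bit_of_lane(a, b, cmp_mask) == select_all_bits(a, b, cmp_mask))  # True
```

In this model the raw-bits mask only gives the sign-based answer on the host that looks at the top bit alone, while the comparison-derived mask gives the same answer on both.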
One person's hack might be another person's compiler optimization 😄 In addition to comparison to zero, on x86 it is possible to produce all I've checked some compilers on this, it sounds like HPC oriented compilers do that (wouldn't vouch for all though). Edit: On a more conservative note, it should be very easy to detect that a mask to |
What would be the benefit of replacing compare with subtraction? On most microarchitectures, compares and subtractions have exactly the same execution characteristics (latency, throughput, pipe/port use). |
Sorry, edited the post above before seeing the reply. You might be right about comparison vs subtraction, let me dig into it a bit more. On the other hand, just producing a blend for |
This allows using pblendvb for i16x8 lane select. See #125.
I've implemented detecting compare+bitselect in V8 (it is out of date, but can be rebased). Probably some parts of the patch can be improved, but the machinery to detect the case is there and is well within 'limited optimizations' approach proposed as part of original SIMD proposal. Can we replace relaxed laneselect with this kind of optimization? |
Currently in #17 it's suggested that the x86_64 lowering of the i16x8.relaxed_laneselect instruction is pblendvb, and this appears to be what v8 does today. In #115 (plus the current Overview.md), however, the English prose for the definition of this instruction is:

I don't believe, though, that the pblendvb instruction correctly implements these semantics. Consider the lane selection mask 0x0080: it is neither 0x0000 nor 0xffff and its high bit is zero, meaning that according to the spec the output should be the element in the b vector. The pblendvb instruction works at the byte level, though, so one byte will be chosen from the a vector and one will be chosen from the b vector.

I think that this is also an issue with v8's lowering of the i{32x4,64x2}.relaxed_laneselect since they all go through pblendvb right now, although the suggestion in #17 I think would work with blendvp{s,d} instead.
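The mismatch for mask 0x0080 can be reproduced with a scalar model of one 16-bit lane. This is a sketch with hypothetical function names, not code from any engine; it assumes the lane-level rule picks a when the lane's top mask bit is set and b otherwise, and that the byte-level blend applies the same rule per byte.

```python
def select_by_lane_top_bit(a, b, mask):
    """16-bit lane: pick a if bit 15 of the mask is set, otherwise b."""
    return a if mask & 0x8000 else b

def select_by_byte_top_bit(a, b, mask):
    """Scalar model of a pblendvb-style blend applied to one 16-bit lane."""
    result = 0
    for shift in (0, 8):
        mask_byte = (mask >> shift) & 0xFF
        src = a if mask_byte & 0x80 else b  # decision is made per byte
        result |= src & (0xFF << shift)
    return result

a, b, mask = 0xAAAA, 0x5555, 0x0080
lane_result = select_by_lane_top_bit(a, b, mask)
byte_result = select_by_byte_top_bit(a, b, mask)
print(hex(lane_result))  # 0x5555
print(hex(byte_result))  # 0x55aa
```

The lane-level rule returns b unchanged, while the byte-level model takes the low byte from a and the high byte from b.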