[JIT] Add legacy extended EVEX encoding and EVEX.ND/NF feature to x64 emitter backend #108796
2. SuperPMI
Verification with SuperPMI:
asmdiffs:
MISSED contexts: base: 0 (0.00%), diff: 11 (0.00%)
Diff JIT options: JitBypassAPXCheck=1
Overall (+330,453 bytes)
MinOpts (+17,921 bytes)
FullOpts (+312,532 bytes)
tpdiff:
Diff JIT options: JitBypassAPXCheck=1
Overall (+0.27% to +0.60%)
MinOpts (+0.82% to +1.08%)
FullOpts (+0.22% to +0.38%)
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
4. Supplement files: To see detailed diffs, please refer to the following files (the files are too large to display on GitHub): asm: tpdiff:
0bd4680 to c4b162d
Update comments.
Merge the REX2 changes into the original legacy emit path
bug fix: Set REX2.W with correct mask code.
register encoding and prefix emitting logics.
Add REX2 prefix emit logic
bug fixes
Add Stress mode for REX2 encoding and some bug fixes
resolve comments: 1. add assertion check for UD opcodes. 2. add checks for EGPRs.
Add REX2 to emitOutputAM, and let LEA to be REX2 compatible.
Add REX2.X encoding for SIB byte
Bug fixes: add REX2 prefix on the path in RI where MOV is specially handled.
Enable REX2 encoding for `movups`
fixed bugs in REX2 prefix emitting logic when working with map 1 instructions, and enabled REX2 for POPCNT
legacy map index-er
bug fixes
some clean-up
Adding initial APX unit testing path.
Adding a coredistools dll that has LLVM APX disasm capability. It must be copied into a CORE_ROOT manually.
clean up work for REX2
narrow the REX2 scope to `sub` only
some clean up based on the comments.
bug fix
resolve comment
- SV path is mostly for debugging purposes
Added encoding unit tests for instructions with immediates
Code refactoring: AddX86PrefixIfNeeded.
… missing in JIT, may indicate these instructions are not being used in JIT, drop them for now.
…ob and will not be EVEX-encoded when JitStressEvexEncoding is set.
Generally looks good. I had a few questions and suggestions.
src/coreclr/jit/codegenxarch.cpp
Outdated
instruction ins = genGetInsForOper(tree->OperGet(), targetType);
inst_RV(ins, targetReg, targetType);
- instruction ins = genGetInsForOper(tree->OperGet(), targetType);
- inst_RV(ins, targetReg, targetType);
+ inst_RV(ins, targetReg, targetType);
`ins` is already defined?
src/coreclr/jit/codegenxarch.cpp
Outdated

inst_Mov(targetType, targetReg, operandReg, /* canSkip */ true);
if (JitConfig.JitEnableApxNDD() && GetEmitter()->IsApxNDDEncodableInstruction(ins) && (targetReg != operandReg))
Doesn't this mean that we will always use EVEX NDD if we are allowed to? Are there other tuning parameters that would play into this decision, like whether the registers are low/high (so whether the `mov` requires a prefix or not, and how large a prefix it requires)?
Will `JitConfig.JitEnableApxNDD() && GetEmitter()->IsApxNDDEncodableInstruction(ins)` be a common pattern (one that warrants a helper function)?
Hi Bruce, thanks for the feedback!
For now, we are focusing on feature enabling in these PRs, so what we did here is introduce the features and hide them behind a feature flag that is disabled by default. Performance tuning is a future target and is expected to come in separate PRs.
Will JitConfig.JitEnableApxNDD() && GetEmitter()->IsApxNDDEncodableInstruction(ins) be a common pattern? (that warrants a helper function)
Yes, I will work on this to combine them into a helper.
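To make the proposed helper concrete, here is a minimal, self-contained sketch of folding the config knob and the per-instruction check into one query. The name `DoJitUseApxNDD` matches the call that appears later in this thread; everything else (the `instruction` values, `JitConfigStub`, and the simplified encodability rule) is a hypothetical stand-in and not the JIT's real definition.

```cpp
#include <cassert>

// Hypothetical stand-ins for the JIT's instruction enum and config object.
enum instruction
{
    INS_add,
    INS_sub,
    INS_mov
};

struct JitConfigStub
{
    bool enableApxNDD = false; // mirrors a knob that defaults to disabled
};

struct EmitterSketch
{
    JitConfigStub config;

    // Assumed, simplified encodability rule used only for illustration.
    bool IsApxNDDEncodableInstruction(instruction ins) const
    {
        return (ins == INS_add) || (ins == INS_sub);
    }

    // Single query that callers use instead of repeating both checks.
    bool DoJitUseApxNDD(instruction ins) const
    {
        return config.enableApxNDD && IsApxNDDEncodableInstruction(ins);
    }
};
```

With a helper of this shape, call sites only need `DoJitUseApxNDD(ins) && (targetReg != operandReg)`.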
src/coreclr/jit/emitxarch.cpp
Outdated
{
// 3-byte
return true;
}

if ((code & 0xFF00FF00) == 0x0F000000)
if ((code & 0xFF000000) == 0x0F000000)
It looks like you're missing some changes in this function that were made in #106557 (comment) (bad merge?)
src/coreclr/jit/emitxarch.cpp
Outdated
if (id->idIsEvexNfContextSet() && IsBMIInstruction(ins))
{
// Only a few BMI instructions shall be promoted to APX-EVEX due to NF feature.
// TODO-Ruihan: convert the check into forms like Has* as above.
- // TODO-Ruihan: convert the check into forms like Has* as above.
+ // TODO-APX: convert the check into forms like Has* as above.
we don't put names in the comments
src/coreclr/jit/emitxarch.cpp
Outdated
@@ -1381,7 +1515,7 @@ bool emitter::TakesRex2Prefix(const instrDesc* id) const
  // TODO-xarch-apx:
  // At this stage, we are only using REX2 in the case that non-simd integer instructions
  // with EGPRs being used in its operands, it could be either direct register uses, or
- // memory addressing operands, i.e. index and base.
+ // memory addresssig operands, i.e. index and base.
Is this a bad merge? you are un-doing a typo fix
src/coreclr/jit/emitxarch.cpp
Outdated
{
// We can only have one memory operand and only src can be a constant operand
// However, the handling for a given operand type (mem, cns, or other) is fairly
// consistent regardless of whether they are src or dst. As such, we will find
// the type of each operand and only check them against src/dst where relevant.

bool useNDD = UsePromotedEVEXEncoding() && (targetReg != REG_NA);
nit
- bool useNDD = UsePromotedEVEXEncoding() && (targetReg != REG_NA);
+ const bool useNDD = UsePromotedEVEXEncoding() && (targetReg != REG_NA);
src/coreclr/jit/emitxarch.cpp
Outdated
@@ -15515,15 +16041,22 @@ BYTE* emitter::emitOutputRR(BYTE* dst, instrDesc* id)
#endif // FEATURE_HW_INTRINSICS
else
{
// TODO-XArch-APX:
// Ruihan:
remove name
src/coreclr/jit/emitxarch.cpp
Outdated
{
assert(IsApxExtendedEvexInstruction(ins));
assert(id->idInsFmt() == IF_RWR_RRD_RRD);
switch (id->idIns())
This looks like copy-and-paste code from `emitOutputRR`. Can we extract a function to handle both cases?
src/coreclr/jit/emitxarch.cpp
Outdated
@@ -17801,6 +18627,23 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
{
code = insCodeRM(ins);

if (id->idIsEvexNdContextSet() && TakesApxExtendedEvexPrefix(id))
{
// TODO-XArchh-apx:
- // TODO-XArchh-apx:
+ // TODO-XArch-apx:
src/coreclr/jit/emitxarch.cpp
Outdated
if (id->idIsEvexNdContextSet() && TakesApxExtendedEvexPrefix(id))
{
// TODO-XArchh-apx:
// Ruihan: I'm not sure why instructions on this path can be with instruction
remove name
Hi @tannergooding, this PR is ready for review, please take a look, thanks :)
@Ruihan-Yin Looks like you need to resolve a merge conflict.
Looks like there's still a merge conflict. I've started the secondary review however, so if that gets resolved I expect we can get this merged today |
if (GetEmitter()->DoJitUseApxNDD(ins) && (targetReg != operandReg))
{
    GetEmitter()->emitIns_R_R(ins, emitTypeSize(operand), targetReg, operandReg, INS_OPTS_EVEX_nd);
}
else
{
    inst_Mov(targetType, targetReg, operandReg, /* canSkip */ true);
    inst_RV(ins, targetReg, targetType);
}
This general pattern is repeated quite a lot (with some variations), so I wonder if we should have a helper like I added for SIMD.
For example, we have `emitIns_SIMD_R_R_R`, which looks like: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L8855-L8880 (other variations exist for handling things like memory operands or immediates, and higher-level helpers like `genHWIntrinsic_R_R_RM` exist for determining which of the variations to call among `emitIns_SIMD_R_R_R`, `emitIns_SIMD_R_R_A`, `emitIns_SIMD_R_R_C`, and `emitIns_SIMD_R_R_S`).
This lets us correctly represent any SIMD `dst = src1 op src2` operation given the raw registers and then internally handles the RMW consideration, so that the rest of codegen can remain simpler and more readable.
In this case, for example, it seems like we "could" have simplified this down to something like:
GetEmitter()->emitIns_BASE_R_R(ins, emitTypeSize(operand), targetReg, operandReg);
and then had this helper make the distinction of handling `APX`, `NDD`, inserting the `Mov` for the regular case, etc.
Presumably this would also make the diffs for other APX support much simpler as well, since we have fewer centralized helpers to update.
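As a rough illustration of that suggestion, the sketch below models a centralized helper that decides between the single NDD-form instruction and the `mov` + RMW pair, so call sites no longer carry the branch. `emitIns_BASE_R_R` is the hypothetical name used in the comment above; the string-based "emitter", the register enum, and the `useNDD` flag are assumptions made only to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical register numbering for illustration only.
enum regNumber
{
    REG_RAX,
    REG_RCX,
    REG_RDX
};

struct EmitterSketch
{
    bool                     useNDD = false; // stands in for the NDD knob + encodability check
    std::vector<std::string> emitted;        // textual log of what would be emitted

    static std::string RegName(regNumber reg)
    {
        switch (reg)
        {
            case REG_RAX:
                return "rax";
            case REG_RCX:
                return "rcx";
            default:
                return "rdx";
        }
    }

    // Represents "targetReg = ins operandReg" for a unary RMW instruction.
    void emitIns_BASE_R_R(const std::string& ins, regNumber targetReg, regNumber operandReg)
    {
        if (useNDD && (targetReg != operandReg))
        {
            // NDD form: one instruction writes a destination distinct from the source.
            emitted.push_back(ins + " " + RegName(targetReg) + ", " + RegName(operandReg));
            return;
        }

        // Regular form: copy first (skipped when the registers already match), then RMW.
        if (targetReg != operandReg)
        {
            emitted.push_back("mov " + RegName(targetReg) + ", " + RegName(operandReg));
        }
        emitted.push_back(ins + " " + RegName(targetReg));
    }
};
```

Codegen would then call the helper unconditionally, and any future tuning of when NDD pays off would live in one place.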
// reg3 = op1 op op2 without extra mov

// see if it can be optimized by inc/dec
if (oper == GT_ADD && op2->isContainedIntOrIImmed() && !treeNode->gtOverflowEx())
The handling here of `ADD` into `INC`/`DEC` is also repeated in multiple locations, so this is probably another place where having centralized helpers is beneficial and ensures we're not missing it anywhere.
It's much better as a peephole in emit than something codegen must directly consider, IMO.
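A minimal sketch of what such a centralized peephole could look like, assuming a simplified instruction enum. The function name `PeepholeAddImm` is hypothetical, and the `-1` to `dec` mapping is my assumption (the diff in this thread only shows the `imm == 1` to `inc` case); the overflow guard mirrors the `!treeNode->gtOverflowEx()` condition quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced instruction set for illustration.
enum instruction
{
    INS_add,
    INS_sub,
    INS_inc,
    INS_dec
};

// Returns the instruction to emit for "ins reg, imm".
// Overflow-checked adds are left alone, matching the guard in codegen.
instruction PeepholeAddImm(instruction ins, int64_t imm, bool needsOverflowCheck)
{
    if ((ins != INS_add) || needsOverflowCheck)
    {
        return ins;
    }

    if (imm == 1)
    {
        return INS_inc;
    }

    if (imm == -1)
    {
        return INS_dec; // assumption: symmetric handling of add reg, -1
    }

    return ins;
}
```

Keeping the decision in one function means every caller gets the same rewrite, and none of them can forget the overflow guard.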
@@ -4406,23 +4469,23 @@ void CodeGen::genCodeForLockAdd(GenTreeOp* node)
  if (imm == 1)
  {
  // inc [addr]
- GetEmitter()->emitIns_AR(INS_inc, size, addr->GetRegNum(), 0);
+ GetEmitter()->emitIns_AR(INS_inc_no_evex, size, addr->GetRegNum(), 0);
nit: We should probably keep the existing name since it's the baseline instruction. We should rather give the APX-specific variant a new name, like `INS_inc_apx` or similar, to help ensure other paths don't accidentally use the wrong one.
The changes here were made because the `LOCK` prefix cannot be used with an EVEX-prefixed instruction, but it is legal with a REX2-prefixed instruction. This happens in very limited cases, with `inc`, `dec`, `and`, and `or`.
I definitely agree with the idea that the new name variants should point to the instructions with new features and should only be used when new features such as EGPRs, NDD, and NF are needed. But I will probably need to preserve the REX2 functionality in the original `INS_inc` to get EGPR support. That might make the names `INS_inc`/`INS_inc_apx` a bit off semantically. Will that be acceptable?
But I will probably need to preserve the REX2 functionality in the original INS_inc to get EGPRs support.
It's definitely fine for an instruction like `INS_inc` to allow opportunistic lightup for the REX2 encoding; we have the same thing where `INS_addps` is used for `legacy`, `vex`, and `evex` all together, for example.
The main consideration is simply that we don't want the "good name" like `INS_inc` to be the thing that requires higher level checks (i.e. requires checking APX is supported). Such a case would inevitably cause issues down the road because someone thinks it is simply the `inc` instruction that's been around for 40+ years now.
If there must be two different entries for the same instruction because the opcodes conflict, then names like `INS_inc` and `INS_inc_apx` sound good to me. However, if it's just a restriction that something like `LOCK` can't use the EVEX encoding and the opcode and base information otherwise remains the same, that sounds like we don't actually need "two instructions" defined and is rather something that LSRA handles in the allowed registers and codegen handles in the `INS_OPTS` it passes down.
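One way to picture the "single instruction, options decide the encoding" approach is the sketch below. It only encodes the constraint stated in this thread (`LOCK` cannot be combined with EVEX, but is legal with REX2); the `Encoding` enum, the `InsOpts` struct, and `SelectEncoding` are hypothetical names and not part of the JIT.

```cpp
#include <cassert>

// Hypothetical encoding choices for a legacy integer instruction.
enum class Encoding
{
    Legacy,
    Rex2,
    ApxEvex
};

// Hypothetical per-instruction options passed down by codegen.
struct InsOpts
{
    bool lockPrefix;   // instruction carries a LOCK prefix
    bool usesEGPR;     // some operand is an extended GPR
    bool wantsNddOrNf; // caller asked for an EVEX-only feature (NDD or NF)
};

Encoding SelectEncoding(const InsOpts& opts)
{
    if (opts.lockPrefix)
    {
        // LOCK is not allowed with EVEX, so EVEX-only features must not be requested;
        // EGPRs remain reachable through REX2.
        assert(!opts.wantsNddOrNf);
        return opts.usesEGPR ? Encoding::Rex2 : Encoding::Legacy;
    }

    if (opts.wantsNddOrNf)
    {
        return Encoding::ApxEvex;
    }

    return opts.usesEGPR ? Encoding::Rex2 : Encoding::Legacy;
}
```

Under this model `INS_inc` stays a single table entry, and the locked paths simply never request NDD or NF.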
@@ -4449,7 +4512,7 @@ void CodeGen::genLockedInstructions(GenTreeOp* node)

  if (node->OperIs(GT_XORR, GT_XAND))
  {
- const instruction ins = node->OperIs(GT_XORR) ? INS_or : INS_and;
+ const instruction ins = node->OperIs(GT_XORR) ? INS_or_no_evex : INS_and_no_evex;
Same comment on `dec`, `add`, `or`, etc.
// TODO-Xarch-apx: we have special stress mode for REX2 on non-compatible machine, that will
// force UseRex2Encoding return true regardless of the CPUID results.
Is this comment necessary? I don't believe we have such a comment for the EVEX path.
I also thought we weren't enforcing it regardless of the CPUID; but rather were allowing it to be set where supported and using AltJit to get ISAs like APX enabled so disassembly can be gotten
No, I will remove it. This was for some internal testing at the initial stage, which has since been deprecated.
assert(hasEvexPrefix(code));
code = AddRexWPrefix(id, code);
}
if ((ins != INS_lzcnt_evex) && (ins != INS_tzcnt_evex) && (ins != INS_popcnt_evex))
These are rather `lzcnt_apx`, not strictly "evex", right? That is, there are other `lzcnt`/`popcnt` instructions for SIMD under a different name, so perhaps the base ISA is better than the encoding here?
This should come down to the same consideration as #108796 (comment)
If we have a unique opcode situation (like the manual calls out) then having a new instruction defined is reasonable. However, if the opcode is the same and its really just special handling of prefixes or features like embedded masking that is only available in a newer ISA, then that rather seems like something that is implicitly handled in emit or other places instead.
The screenshot of the manual you gave, for example, is referring to the opcode as being the actual single-byte opcode + additional information like prefixes. Where-as the JIT really just considers the main opcode byte and automatically extracts any required prefixes into the relevant bit positions.
`lzcnt` is a case that actually has a new main opcode byte, as it changes the "main opcode byte" from `0xBD` to instead be `0xF5`. The legacy encoding also has a required prefix of `0xF3` and `0x0F`, whereas the APX encoding uses `66 + MAP4`, so they are clearly incompatible and require separate instruction table entries to be handled.
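The incompatibility can be made explicit with a small table sketch. The byte values are taken from the comment above; the `InsEntry` layout, the use of map number 1 to stand for the `0x0F` escape, and the `CanShareEntry` check are assumptions for illustration and do not reflect the JIT's actual instruction table format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical instruction table entry: prefix byte, opcode map, main opcode byte.
struct InsEntry
{
    const char* name;
    uint8_t     prefix; // required prefix byte as quoted in the comment
    uint8_t     map;    // opcode map (1 is used here for the 0x0F escape, 4 for MAP4)
    uint8_t     opcode; // main opcode byte
};

// Values as described in the review comment above.
constexpr InsEntry kLzcntLegacy = {"lzcnt", 0xF3, 1, 0xBD};
constexpr InsEntry kLzcntApx    = {"lzcnt_apx", 0x66, 4, 0xF5};

// Two forms can share one table entry only if all encoding fields agree.
bool CanShareEntry(const InsEntry& a, const InsEntry& b)
{
    return (a.prefix == b.prefix) && (a.map == b.map) && (a.opcode == b.opcode);
}
```

Because every field differs for `lzcnt`, a separate entry is justified, whereas a form that differs only in prefix handling would pass this check and could stay a single entry.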
Changes overall LGTM. However, I have some concerns about the naming/renaming of some instructions that are liable to cause issues down the road. I'd like to see these handled before merging. There was then a separate suggestion about defining helpers to extract and hide the differences of
Thanks for the feedback! I will make the naming and helper changes together in this PR. I also left some thoughts on the naming issue and would appreciate it if we could discuss that further.
Overview
This PR is built based on #106557, and is the first one that covers APX-EXTENDED-EVEX encoding.
This PR adds extended EVEX encoding for legacy instructions that are promoted to the EVEX encoding space. Currently, only instructions with the new data destination (NDD, encoded via EVEX.ND) feature are covered in this PR.
We plan to cover the encoding and instructions for flag suppression (EVEX.NF) in follow-up PRs.
EVEX.ND covered instructions:
EVEX.NF covered instructions:
Specification
The EVEX extension of legacy instructions is one of the changes made to the original EVEX prefix to accommodate the ISA features and new instructions introduced by APX. This part of the extension focuses on promoting legacy instructions into the EVEX encoding space and providing them with features such as EGPR access, new data destination, zero upper, and flag suppression.
As shown in the figure, some bits in the original EVEX prefix have been repurposed: EVEX.b becomes EVEX.ND, the first bit of EVEX.aaa becomes EVEX.NF, and some bits have become reserved and must be 0. Also, the promoted legacy instructions take a new legacy map index, map 4: the EVEX.mmm field, at EVEX.bits[18:16], is 100b.
All promoted legacy instructions should follow this encoding scheme. For instructions that do not use these REX bits to access upper registers, the bits EVEX.R4, X4, B4, R3, X3, B3 should be kept at logical 0 (that is, 0, or 1 if the bit is defined in inverted form).
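The field placement described in this section can be sketched as a small prefix builder. This is not the JIT's emitter code: the struct and function names are hypothetical, and the exact bit positions and inversion rules (R3/X3/B3/R4, X4, vvvv and V4 stored inverted, B4 stored directly) follow my reading of the APX extended-EVEX layout, so treat them as assumptions to check against the specification. The sketch covers only the case where no extended register is used in ModRM/SIB.

```cpp
#include <cassert>
#include <cstdint>

// Four prefix bytes: the 0x62 escape followed by three payload bytes.
struct ApxEvexPrefix
{
    uint8_t escape;
    uint8_t p0;
    uint8_t p1;
    uint8_t p2;
};

// nddReg is the NDD destination register number (0..31); pass 0 when nd is false
// so that the inverted vvvv/V4 fields come out as all ones.
ApxEvexPrefix BuildPromotedLegacyPrefix(bool w, uint8_t pp, bool nd, bool nf, uint8_t nddReg)
{
    ApxEvexPrefix p{};
    p.escape = 0x62;

    // P0: R3 X3 B3 R4 stored inverted (1111 = logical 0), B4 = 0, mmm = 100b (map 4).
    p.p0 = static_cast<uint8_t>(0xF0 | 0x04);

    // P1: W, inverted low 4 bits of the NDD register, inverted X4 (1 = logical 0), pp.
    const uint8_t vvvv = static_cast<uint8_t>(~nddReg & 0x0F);
    p.p1 = static_cast<uint8_t>((w ? 0x80 : 0x00) | (vvvv << 3) | 0x04 | (pp & 0x03));

    // P2: three reserved zero bits, ND (the former EVEX.b), inverted V4,
    // NF (the top bit of the former EVEX.aaa), two reserved zero bits.
    const uint8_t v4 = ((nddReg >> 4) & 1) ? 0x00 : 0x08;
    p.p2 = static_cast<uint8_t>((nd ? 0x10 : 0x00) | v4 | (nf ? 0x04 : 0x00));

    return p;
}
```

For example, an 8-bit NDD operation whose destination is register 1 yields the payload bytes `F4 74 18` under these assumptions, and an NF-only form with no NDD register yields `F4 7C 0C`.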
Design
As stated above, this PR will cover the encoding changes needed for EVEX extension for legacy instructions and support for EVEX.ND.
The bulk of the changes occur in the backend emitter, and some changes are added to code generation as the entry point for the NDD-form optimization.
One part I need to call out in the design is that we separated the EVEX encoding path for legacy instructions from the original EVEX path, and the new emit path is guarded by `TakesApxExtendedEvexPrefix`. The main reason is that the legacy extension part of APX-EVEX breaks the assumption that EVEX is only for SIMD instructions and only appears on SIMD instruction emit paths, an assumption that the JIT verifies with many assertion checks. To let the original checks hold as much as possible, we chose to establish a stand-alone branch for extended legacy instructions on paths that do not have legacy encoding, or to re-use the existing legacy encoding path with some prefix work.
Optimization & Performance
In the asmdiff part below, a code size regression was observed: using the EVEX.ND feature increases code size. In detail, the NDD form introduces at most a 2-byte regression per instruction. This is expected, as we are using one instruction with a 4-byte prefix to replace 2 legacy instructions that are normally 2 bytes each. This creates a tradeoff between code size and instruction count, and we will contribute a series of follow-up tuning work to teach the JIT how to use this feature wisely, to get the maximum performance gain while controlling the code size regression.
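The byte accounting behind the "at most 2 bytes" figure can be written out as a small worked calculation. The 2-byte legacy size and the 4-byte prefix come from the paragraph above; the assumption that the NDD instruction needs one opcode byte and one ModRM byte after the prefix is mine, as are all the names.

```cpp
#include <cassert>

// Sizes taken from the description above.
constexpr int kLegacyInsBytes     = 2; // a typical reg-reg legacy instruction
constexpr int kApxEvexPrefixBytes = 4; // extended EVEX prefix

// Assumption: one opcode byte plus one ModRM byte follow the prefix.
constexpr int kOpcodeAndModRMBytes = 2;

// Legacy sequence: mov target, src followed by the RMW op on target.
constexpr int kLegacyPairBytes = 2 * kLegacyInsBytes;

// NDD sequence: a single prefixed instruction.
constexpr int kNddBytes = kApxEvexPrefixBytes + kOpcodeAndModRMBytes;

// Per-instruction code size change when the NDD form replaces the pair.
constexpr int NddSizeDelta()
{
    return kNddBytes - kLegacyPairBytes;
}

static_assert(NddSizeDelta() == 2, "4 + 2 bytes versus 2 + 2 bytes");
```

So under these assumptions each replaced pair costs 2 extra bytes while removing one instruction, which is the tradeoff the tuning work is meant to balance.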
To better tune the features, we added an optimization knob for NDD: `JitEnableAPXNDD`. Currently, the NDD optimization applies to a few binary and unary instructions when the target register is different from the source operands. To use this feature more wisely, we will need more tuning work in the future, so we plan to have an individual tuning knob for each feature APX provides, such as NDD and NF.
Testing
Results separately posted below.
Follow-up plans
After this PR, we will continue to complete the APX-EVEX support for EVEX.NF for legacy/VEX instructions, and further APX-EVEX support for VEX/EVEX instructions.
Edit:
We eventually decided to cover the EVEX.NF feature within this PR as well. This feature will be enabled at the encoding level only, and there will be no active surface for it until we have some related codegen work.
In summary, this PR covers all the changes needed to enable the EVEX.ND/NF features, plus the required register encoding; it is not intended to provide full coverage of this area.