Current GCC regressions in vector intrinsic optimization. #204

Open
munroesj52 opened this issue Dec 15, 2024 · 0 comments

munroesj52 commented Dec 15, 2024

There are a number of regressions in vector intrinsic generation and optimization that began with GCC-8 and persist in GCC-13/14. These seem to be associated with the integration of vector long long and POWER8 little endian.

  • Loading const vectors from .rodata (.TOC relative, medium model) is poorly optimized for POWER8/9. This is especially bad for const vector long long. See GCC PR 117718.
  • The generation of small integer splatted constants for vector long long and vector __int128 is especially difficult for POWER8 and not so great for POWER9. The compiler seems to go out of its way to undo any attempt to optimize this and usually replaces it with const vectors and loads from .rodata. See bullet above. See GCC PR 117007.
  • The intrinsic implementation and the intrinsic reference are inconsistent and overly restrictive about the parameter types used for vector shift counts. This forces additional vector element unpack-extend operations that are further de-optimized. See bullets above. For example:
    • Doubleword shifts/rotates (PowerISA 2.07): the intrinsic requires a vector unsigned long long (doubleword) shift count. The PowerISA only requires the shift count in the low-order 6 bits of each doubleword and ignores the high-order 58 bits.
    • Shift Left/Right Long/Octet (128-bit): the intrinsic requires a vector [un]signed char (byte) shift count. The PowerISA requires a 7-bit shift count that is composed of 4 bits (octet/byte shift) and 3 bits (bit shift).
      • vec_sll/vec_srl (vsl/vsr) require the bit shift count to be splatted to the low-order 3 bits of all 16 bytes of VRB. The intrinsic requires a vector unsigned char.
      • vec_slo/vec_sro (vslo/vsro) require the octet shift count to be in bits 121:124 (bits 1:4 of byte 15) of VRB. The intrinsic will accept signed or unsigned char for the shift count.
      • vec_slo/vec_sll and vec_sro/vec_srl are frequently used together to effect an arbitrary 1 to 127 bit shift from a 7-bit shift count.
    • Quadword shifts/rotates (PowerISA 3.1C) require a 7-bit shift count in bits 57:63 (DW 0) of VRB. The intrinsic requires the 7-bit shift count in the low-order 7 bits of a vector __int128.
    • Perhaps allowing vector unsigned char for all shift counts (VRB) would simplify programming and allow better optimization when generating shift counts from scalar constants.
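
To illustrate the 7-bit shift count split described in the bullets above, here is a minimal portable C sketch. It is a scalar model only, not the AltiVec intrinsics or anything GCC generates. The type `u128_model` and the functions `octet_count`, `bit_count`, `model_slo`, `model_sll` and `model_shift_left` are hypothetical names made up for this example. The model assumes a single byte holding the 7-bit count serves both steps: the octet step reads bits 1:4 of that byte and the bit step reads its low-order 3 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit vector register.
 * hi holds bits 0:63 and lo holds bits 64:127 (PowerISA bit numbering). */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} u128_model;

/* Octet count as vslo/vsro read it: bits 121:124 of VRB,
 * which are bits 1:4 of byte 15. */
static unsigned octet_count(uint8_t byte15)
{
    return (byte15 >> 3) & 0xF;
}

/* Bit count as vsl/vsr read it: the low-order 3 bits of a byte. */
static unsigned bit_count(uint8_t byte15)
{
    return byte15 & 0x7;
}

/* Models the octet step (vslo): shift left by 0..15 whole bytes. */
static u128_model model_slo(u128_model v, unsigned octets)
{
    unsigned bits = octets * 8; /* 0..120 */
    u128_model r = v;
    if (bits >= 64) {
        r.hi = v.lo << (bits - 64);
        r.lo = 0;
    } else if (bits != 0) {
        r.hi = (v.hi << bits) | (v.lo >> (64 - bits));
        r.lo = v.lo << bits;
    }
    return r;
}

/* Models the bit step (vsl): shift left by 0..7 bits. */
static u128_model model_sll(u128_model v, unsigned bits)
{
    u128_model r = v;
    if (bits != 0) {
        r.hi = (v.hi << bits) | (v.lo >> (64 - bits));
        r.lo = v.lo << bits;
    }
    return r;
}

/* Arbitrary 0..127 bit left shift from one 7-bit count:
 * octet step first, then bit step. */
static u128_model model_shift_left(u128_model v, uint8_t sh)
{
    assert(sh < 128);
    return model_sll(model_slo(v, octet_count(sh)), bit_count(sh));
}
```

In this model a count of 9 splits into an octet count of 1 and a bit count of 1, and a count of 127 splits into 15 and 7. Because both fields come from the same byte, a count held as a byte and splatted to all 16 bytes would feed both steps, which is the motivation behind the suggestion to accept vector unsigned char for all shift counts.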