Current GCC regressions in vector intrinsic optimization. #204

munroesj52 · 2024-12-15T22:08:50Z

There are a number of regressions in vector intrinsic generation and optimization that began with GCC-8 and persist in GCC-13/14. These seem to be associated with the integration of vector long long and POWER8 little endian.

Loading const vectors from .rodata (.TOC relative, medium model) is poorly optimized for POWER8/9. This is especially bad for const vector long long. See GCC PR 117718.
The generation of small integer splatted constants for vector long long and vector __int128 is especially difficult for POWER8 and not so great for POWER9. The compiler seems to go out of its way to undo any attempt to optimize this and usually replaces it with const vectors and loads from .rodata. See bullet above. See GCC PR 117007.
The intrinsic implementation and the intrinsic reference are inconsistent and overly restrictive about the parameter types used for vector shift counts. This forces additional vector element unpack-extend operations that are further de-optimized. See bullets above. For example:
- Doubleword Shift/Rotates (PowerISA 2.07) require a vector unsigned long long (doubleword) shift count. The PowerISA only requires a shift count in the low-order 6 bits and ignores the high-order 58 bits.
- Shift Left/Right Long/Octet (128-bit) require a vector [un]signed char (byte) shift count. The PowerISA requires a 7-bit shift count that is composed of 4 bits (octet/byte shift) and 3 bits (bit shift).
  - vec_sll/vec_srl (vsl/vsr) require the bit shift count to be splatted to the low-order 3 bits of all 16 bytes of VRB. The intrinsic requires a vector unsigned char.
  - vec_slo/vec_sro (vslo/vsro) require the octet shift count to be in bits 121:124 (bits 1:4 of byte 15) of VRB. The intrinsic will accept signed or unsigned char for the shift count.
  - vec_slo/vec_sll and vec_sro/vec_srl are frequently used together to effect an arbitrary 1 to 127 bit shift from a 7-bit shift count.
- Quadword Shift/Rotates (PowerISA 3.1C) require a 7-bit shift count in bits 57:63 (DW 0) of VRB. The intrinsic requires the 7-bit shift count in the low-order 7 bits of a vector __int128.
- Perhaps allowing vector unsigned char for all shift counts (VRB) would simplify programming and allow better optimization when generating shift counts from scalar constants.
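
To make the shift count layouts above concrete, here is a minimal portable C sketch that models them with scalar arithmetic. It does not use the AltiVec/VSX intrinsics and says nothing about the code GCC generates. The helper names (`octet_count`, `bit_count`, `model_shift_left_128`, `model_shift_left_64`) are hypothetical and exist only for illustration, and the sketch assumes a compiler that supports `unsigned __int128` (GCC does on 64-bit targets).

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Octet (byte) part of a 7-bit quadword shift count:
   the 4 bits that vslo/vsro read (bits 1:4 of byte 15 of VRB). */
static inline unsigned octet_count(unsigned n) { return (n >> 3) & 0xF; }

/* Bit part of the same count:
   the low-order 3 bits that vsl/vsr read from each byte of VRB. */
static inline unsigned bit_count(unsigned n) { return n & 0x7; }

/* Scalar model of vec_slo followed by vec_sll on a 128-bit value. */
static inline u128 model_shift_left_128(u128 v, unsigned n)
{
    assert(n < 128);            /* the count is 7 bits wide */
    v <<= 8 * octet_count(n);   /* octet step: whole bytes, 0..15 */
    v <<= bit_count(n);         /* bit step: remaining bits, 0..7 */
    return v;
}

/* Scalar model of one doubleword shift element: only the low-order
   6 bits of the count are used, the high-order 58 bits are ignored. */
static inline uint64_t model_shift_left_64(uint64_t v, uint64_t count)
{
    return v << (count & 63);
}
```

For example, a count of 77 splits into an octet count of 9 and a bit count of 5 (9 * 8 + 5 = 77). Because both fields live in the same byte, a single count byte splatted across all 16 bytes of VRB should satisfy both the vslo and vsl layouts described above, which is part of why a vector unsigned char shift count is convenient.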
