
Optimize vector normalize and dot IR #1504

Closed · wants to merge 3 commits

Conversation

green-real

Optimizes vector.normalize by using SIMD instructions, yielding a ~75% speed increase.

Optimizes vector.dot by reordering operations, yielding a ~20% speed increase.
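
For context, here is a rough sketch of the kind of SSE sequence the SIMD normalize corresponds to, written with intrinsics purely for illustration; the PR itself expresses this in Luau's IR lowering, and the exact instruction selection there may differ:

```cpp
#include <xmmintrin.h>

// Illustration only: normalizing a 3-component vector held in an XMM
// register, assuming the fourth lane is zero. The squared components are
// summed horizontally with shuffles, then the vector is scaled by the
// reciprocal of the length.
static inline __m128 normalize3(__m128 v)
{
    __m128 sq = _mm_mul_ps(v, v);                                // x*x, y*y, z*z, 0
    __m128 r1 = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(3, 0, 2, 1)); // rotated lanes
    __m128 r2 = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(3, 1, 0, 2)); // rotated lanes
    __m128 dot = _mm_add_ps(_mm_add_ps(sq, r1), r2);             // x*x + y*y + z*z in every lane
    __m128 invLen = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(dot));
    return _mm_mul_ps(v, invLen);                                // v / |v|
}
```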

@zeux
Collaborator

zeux commented Nov 8, 2024

How was the performance measured here - what hardware was used and what benchmarks were run?

@green-real
Author

> How was the performance measured here - what hardware was used and what benchmarks were run?

For both benchmarks, I timed a for loop whose body simply calls the function being benchmarked with non-constant arguments. Each benchmark was run multiple times with around 1e8 iterations and the results were averaged, which is where the percentages come from.
The benchmarks were run on an i7-11700.
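
As a sketch of that measurement pattern only (timed loop, fixed iteration count, results averaged over several runs), and not the actual harness, which is a Luau script calling the built-in directly:

```cpp
#include <chrono>

// Illustration only: time a workload over a fixed number of iterations,
// repeat for several runs, and report the average. `work` stands in for
// the loop body calling the function under test with non-constant arguments.
template<typename F>
double averageSeconds(F&& work, int runs, long long iterations)
{
    double total = 0.0;
    for (int r = 0; r < runs; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        for (long long i = 0; i < iterations; ++i)
            work(i);
        auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```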

Since it seems that your PR contains the same vector.normalize change as my PR and a way better vector.dot optimization, it would make sense to close this one in favour of yours.

aviralg pushed a commit that referenced this pull request on Nov 9, 2024 (#1512)

Instead of doing the dot product related math in scalar IR, we lift the
computation into a dedicated IR instruction.

On x64, we can use VDPPS, which was more or less tailor-made for this
purpose. This is better than the manual scalar lowering that requires
reloading components from memory; it's not always a strict improvement
over the shuffle+add version (which we never had), but this can now be
adjusted in the IR lowering in an optimal fashion (maybe even based on
CPU vendor, although that'd create issues for offline compilation).
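
For illustration, this is roughly what the VDPPS-based lowering computes for a 3-component dot product, expressed as an intrinsic; the PR emits the instruction from the IR lowering rather than from C++ source:

```cpp
#include <smmintrin.h> // SSE4.1; _mm_dp_ps maps to DPPS (VDPPS when VEX-encoded)

// Illustration only: dot product of the first three lanes of a and b.
// Immediate 0x7F = multiply lanes 0-2 (high nibble 0x7) and broadcast the
// sum to all result lanes (low nibble 0xF).
static inline float dot3(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x7F));
}
```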

On A64, we can either use naive adds or paired adds, as there is no
dedicated vector-wide horizontal instruction until SVE. Both run at
about the same performance on M2, but paired adds require fewer
instructions and temporaries.
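
A hedged NEON sketch of the paired-add variant (FADDP), again assuming the unused fourth lane is zero; the actual change lives in the A64 lowering rather than in intrinsics:

```cpp
#include <arm_neon.h>

// Illustration only: horizontal dot product via paired adds.
// Assumes lane 3 of both inputs is zero so it does not affect the sum.
static inline float dot3_paired(float32x4_t a, float32x4_t b)
{
    float32x4_t prod = vmulq_f32(a, b);        // x*x', y*y', z*z', 0
    float32x4_t sum1 = vpaddq_f32(prod, prod); // pairwise: p0+p1, p2+p3, p0+p1, p2+p3
    float32x4_t sum2 = vpaddq_f32(sum1, sum1); // full sum in every lane
    return vgetq_lane_f32(sum2, 0);
}
```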

I've measured this using the mesh-normal-vector benchmark, changing the
benchmark to just report the time of the second loop inside
`calculate_normals`, testing master vs #1504 vs this PR, also increasing
the grid size to 400 for more stable timings.

On Zen 4 (7950X), this PR is comfortably ~8% faster vs master, while I
see neutral to negative results in #1504.
On M2 (base), this PR is ~28% faster vs master, while #1504 is only ~10% faster.

If I measure the second loop in `calculate_tangent_space` instead, I
get:

On Zen 4 (7950X), this PR is ~12% faster vs master, while #1504 is ~3% faster.
On M2 (base), this PR is ~24% faster vs master, while #1504 is only ~13% faster.

Note that the loops in question are not quite optimal, as they store and
reload various vectors to dictionary values due to inappropriate use of
locals. The underlying gains in individual functions are thus larger
than the numbers above; for example, changing the `calculate_normals`
loop to use a local variable to store the normalized vector (but still
saving the result to the dictionary value), I get a ~24% performance
increase from this PR on Zen 4 vs master instead of just 8% (#1504 is
~15% slower in this setup).
@aviralg
Contributor

aviralg commented Nov 9, 2024

Merged PR #1512

aviralg closed this on Nov 9, 2024