
Optimize vector normalize and dot IR #1504

Closed · wants to merge 3 commits

Conversation

green-real

Optimizes vector.normalize by using SIMD instructions, yielding a ~75% speed increase.

Optimizes vector.dot by reordering operations, yielding a ~20% speed increase.
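
For context, here is a rough sketch of the kind of SSE sequence the SIMD normalize corresponds to, written with intrinsics purely for illustration; the PR itself expresses this in Luau's IR lowering, and the exact instruction selection there may differ:

```cpp
#include <xmmintrin.h>

// Illustration only: normalizing a 3-component vector held in an XMM
// register, assuming the fourth lane is zero. The squared components are
// summed horizontally with shuffles, then the vector is scaled by the
// reciprocal of the length.
static inline __m128 normalize3(__m128 v)
{
    __m128 sq = _mm_mul_ps(v, v);                                // x*x, y*y, z*z, 0
    __m128 r1 = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(3, 0, 2, 1)); // rotated lanes
    __m128 r2 = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(3, 1, 0, 2)); // rotated lanes
    __m128 dot = _mm_add_ps(_mm_add_ps(sq, r1), r2);             // x*x + y*y + z*z in every lane
    __m128 invLen = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(dot));
    return _mm_mul_ps(v, invLen);                                // v / |v|
}
```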

@zeux
Collaborator

zeux commented Nov 8, 2024

How was the performance measured here - what hardware was used and what benchmarks were run?

@green-real
Author

> How was the performance measured here - what hardware was used and what benchmarks were run?

For both benchmarks, I timed a for loop whose body simply calls the function being benchmarked with non-constant arguments. Each benchmark was run multiple times with around 1e8 iterations and the results were averaged, which is where the percentages come from.
The benchmarks were run on an i7-11700.
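
As a sketch of that measurement pattern only (timed loop, fixed iteration count, results averaged over several runs), and not the actual harness, which is a Luau script calling the built-in directly:

```cpp
#include <chrono>

// Illustration only: time a workload over a fixed number of iterations,
// repeat for several runs, and report the average. `work` stands in for
// the loop body calling the function under test with non-constant arguments.
template<typename F>
double averageSeconds(F&& work, int runs, long long iterations)
{
    double total = 0.0;
    for (int r = 0; r < runs; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        for (long long i = 0; i < iterations; ++i)
            work(i);
        auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```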

Since it seems that your PR contains the same vector.normalize change as my PR and a way better vector.dot optimization, it would make sense to close this one in favour of yours.

aviralg pushed a commit that referenced this pull request on Nov 9, 2024 (#1512)

Instead of doing the dot product related math in scalar IR, we lift the
computation into a dedicated IR instruction.

On x64, we can use VDPPS, which was more or less tailor-made for this
purpose. This is better than the manual scalar lowering that requires
reloading components from memory; it's not always a strict improvement
over the shuffle+add version (which we never had), but this can now be
adjusted in the IR lowering in an optimal fashion (maybe even based on
CPU vendor, although that'd create issues for offline compilation).
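
For illustration, this is roughly what the VDPPS-based lowering computes for a 3-component dot product, expressed as an intrinsic; the PR emits the instruction from the IR lowering rather than from C++ source:

```cpp
#include <smmintrin.h> // SSE4.1; _mm_dp_ps maps to DPPS (VDPPS when VEX-encoded)

// Illustration only: dot product of the first three lanes of a and b.
// Immediate 0x7F = multiply lanes 0-2 (high nibble 0x7) and broadcast the
// sum to all result lanes (low nibble 0xF).
static inline float dot3(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x7F));
}
```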

On A64, we can either use naive adds or paired adds, as there is no
dedicated vector-wide horizontal instruction until SVE. Both run at
about the same performance on M2, but paired adds require fewer
instructions and temporaries.
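
A hedged NEON sketch of the paired-add variant (FADDP), again assuming the unused fourth lane is zero; the actual change lives in the A64 lowering rather than in intrinsics:

```cpp
#include <arm_neon.h>

// Illustration only: horizontal dot product via paired adds.
// Assumes lane 3 of both inputs is zero so it does not affect the sum.
static inline float dot3_paired(float32x4_t a, float32x4_t b)
{
    float32x4_t prod = vmulq_f32(a, b);        // x*x', y*y', z*z', 0
    float32x4_t sum1 = vpaddq_f32(prod, prod); // pairwise: p0+p1, p2+p3, p0+p1, p2+p3
    float32x4_t sum2 = vpaddq_f32(sum1, sum1); // full sum in every lane
    return vgetq_lane_f32(sum2, 0);
}
```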

I've measured this using the mesh-normal-vector benchmark, changing the
benchmark to just report the time of the second loop inside
`calculate_normals`, testing master vs #1504 vs this PR, also increasing
the grid size to 400 for more stable timings.

On Zen 4 (7950X), this PR is comfortably ~8% faster vs master, while I
see neutral to negative results in #1504.
On M2 (base), this PR is ~28% faster vs master, while #1504 is only ~10% faster.

If I measure the second loop in `calculate_tangent_space` instead, I
get:

On Zen 4 (7950X), this PR is ~12% faster vs master, while #1504 is ~3% faster.
On M2 (base), this PR is ~24% faster vs master, while #1504 is only ~13% faster.

Note that the loops in question are not quite optimal, as they store and
reload various vectors to dictionary values due to inappropriate use of
locals. The underlying gains in individual functions are thus larger
than the numbers above; for example, changing the `calculate_normals`
loop to use a local variable to store the normalized vector (but still
saving the result to the dictionary value), I get a ~24% performance
increase from this PR on Zen 4 vs master instead of just 8% (#1504 is
~15% slower in this setup).
@aviralg
Contributor

aviralg commented Nov 9, 2024

Merged PR #1512

aviralg closed this on Nov 9, 2024