SIMD vectorization in the engine #4544
Replies: 13 comments 26 replies
-
Modern C/C++ compilers automatically perform SIMD – this is called autovectorization. Instead of adding explicit SIMD intrinsics (which usually aren't portable across architectures), we prefer writing code in a way that compilers can autovectorize. Godot runs on more than x86 – it supports ARM, WebAssembly, and soon RISC-V 🙂 Pull requests to improve autovectorization are welcome, but they must tackle existing bottlenecks that affect real-world projects (or at least a realistic use case). Also, optimization pull requests must be accompanied by benchmarks.
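As an illustration of what compilers can typically handle on their own, here is a minimal sketch (not engine code; the function name is made up) of the kind of loop that autovectorizes well: contiguous arrays, a simple trip count, and no aliasing between inputs and outputs.

```cpp
#include <cstddef>

// Simple per-element multiply-add over contiguous arrays.
// With -O2/-O3, GCC and Clang will usually emit SSE/NEON/WASM-SIMD
// instructions for this loop; __restrict tells the compiler the
// buffers don't overlap, which is often what unlocks vectorization.
void scale_add(float *__restrict dst, const float *__restrict src,
               float scale, size_t count) {
    for (size_t i = 0; i < count; i++) {
        dst[i] = dst[i] + src[i] * scale;
    }
}
```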
-
Autovectorization is not very effective. If you want good vector efficiency, it's essential to use intrinsics, and often to restructure your algorithms as well. Well-written vectorized code tends to be several times faster than anything a compiler can produce on its own. (As a side note, ARM's next-generation SVE vector extension supposedly allows for better autovectorization. That was achieved by analyzing the barriers that compilers run into and designing the instruction set specifically to work around them. For a fairly interesting discussion of the design, see https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf. It's not yet available in any consumer-level processors though.)

I agree that portability is very important. You don't want to scatter architecture-specific intrinsics throughout the code. With a thin wrapper type you can write code like this:

```cpp
fvec4 x(1, 2, 3, 4);
fvec4 y = x/2 - 3;
float z = y[3];
```

The vectorize.h header doesn't define anything directly. Here is the whole content.

```cpp
#if defined(__ARM__) || defined(__ARM64__)
#include "vectorize_neon.h"
#elif defined(__PPC__)
#include "vectorize_ppc.h"
#else
#include "vectorize_sse.h"
#endif
```

It just includes the implementation for whatever architecture is being compiled. For example, here is the version for SSE. All use of intrinsics is limited to a single file for each architecture. That makes it easy to add new architectures. You can even have a fallback implementation that uses clang/gcc portable vectors. That way you automatically get vectorization on future architectures, even if it's not as fast as native intrinsics.
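To illustrate that portable-vector fallback idea, here is a minimal sketch (hypothetical, not the actual OpenMM header) based on the GCC/Clang `vector_size` extension, which gives element-wise arithmetic on any target the compiler supports:

```cpp
// Hypothetical vectorize_generic.h fallback using the GCC/Clang
// vector_size extension. The compiler lowers the arithmetic to whatever
// SIMD the target offers, or to scalar code if there is none.
typedef float fvec4 __attribute__((vector_size(16)));

static inline fvec4 fvec4_set(float a, float b, float c, float d) {
    fvec4 r = {a, b, c, d};
    return r;
}

// Example usage, mirroring the snippet above:
//   fvec4 x = fvec4_set(1, 2, 3, 4);
//   fvec4 y = x / 2 - 3;   // scalars are broadcast element-wise
//   float z = y[3];        // element access works directly
```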
-
In the tests with etcpak, enabling AVX and AVX2 broke our build matrix because of Intel's decision not to support them on some devices. TL;DR: we can do it, but we need to avoid certain CPUs at runtime.
-
Yes, 256-bit vectors are more problematic. Every modern architecture supports 128-bit vectors, so you can safely assume they exist, but they don't all support 256-bit. You can still use AVX when it's available, but it takes a lot more care. You need to compile the relevant code with and without AVX support and decide at runtime which version to use (see the dispatch sketch after this comment). I'd start with only 128-bit vectors, then in the future consider adding 256-bit versions of only the most performance-critical routines.

Here are a few other libraries that provide portable SIMD APIs.

- VecCore: https://github.com/root-project/veccore

I haven't used any of those. I just turned them up in a web search.
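As a rough illustration of the compile-both-and-dispatch approach, here is a minimal GCC/Clang x86 sketch (function names are made up). The `target("avx2")` attribute enables AVX2 codegen for just one function; compiling the two variants in separate translation units with different flags works the same way.

```cpp
#include <cstddef>

// AVX2 codegen is enabled only for this function; the compiler may
// autovectorize the loop with 256-bit instructions.
__attribute__((target("avx2")))
static void scale_avx2(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] * s;
}

// Baseline variant, compiled with the default instruction set.
static void scale_baseline(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] * s;
}

// Runtime dispatch: check the CPU once, then call the matching variant.
void scale(float *dst, const float *src, float s, size_t n) {
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
    if (has_avx2) scale_avx2(dst, src, s, n);
    else scale_baseline(dst, src, s, n);
}
```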
-
If there's interest in this, the next question to consider is what to vectorize. There are two main approaches to vectorizing code, what you might describe as "broad" and "deep". They aren't mutually exclusive. You can do both. Really they're just opposite ends of a continuum.

The broad approach involves vectorizing low-level routines that get used in a lot of places. Any code that calls those routines immediately becomes faster with no changes. Classes like Vector3, Transform3D, and Basis have a lot of routines that would be good candidates. I think this is worth trying, and it might produce some benefit, but probably not a lot. The problem is that all those routines expect their inputs and outputs to be in memory, not SIMD registers. To get good performance, it's essential to keep things in registers as much as possible and only go to memory when absolutely necessary. So a more extreme version would be to reimplement Vector3 to store its data in a SIMD type. Even then, the potential benefit is still limited. If you use an eight-component AVX register to store a three-component vector, you're leaving a lot of performance on the table!

So that brings us to the deep approach, which is to take larger sections of code and more extensively rewrite them based around vectorization. This can give huge speedups. Can we enumerate particular algorithms that would be good candidates for this? Physics and geometry calculations often vectorize well (a small sketch of this style follows after this comment).

Before doing any kind of optimization, of course, the first step is to make sure you have good benchmarks. That way you know whether you're actually making it faster, and you can make sure you aren't inadvertently making something else slower at the same time! The benchmarks at https://github.com/godotengine/godot-benchmarks are a good start, but they're still very limited. Are there any other existing benchmarks? If not, the first step should probably be to create more of them to cover more of the code. I'd be happy to try doing that.
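To make the broad-vs-deep distinction concrete, here is a minimal sketch (my own illustration, not engine code; names and layout are assumptions) of the deep style: instead of calling a scalar Vector3 routine per element, the data is laid out as separate x/y/z arrays so four dot products are computed per SSE instruction.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Structure-of-arrays layout: all x components together, then y, then z.
// Computes dst[i] = dot(a[i], b[i]) for four vector pairs per iteration.
// Assumes count is a multiple of 4 and the arrays are 16-byte aligned.
void dot_products_soa(float *dst,
                      const float *ax, const float *ay, const float *az,
                      const float *bx, const float *by, const float *bz,
                      size_t count) {
    for (size_t i = 0; i < count; i += 4) {
        __m128 x = _mm_mul_ps(_mm_load_ps(ax + i), _mm_load_ps(bx + i));
        __m128 y = _mm_mul_ps(_mm_load_ps(ay + i), _mm_load_ps(by + i));
        __m128 z = _mm_mul_ps(_mm_load_ps(az + i), _mm_load_ps(bz + i));
        _mm_store_ps(dst + i, _mm_add_ps(x, _mm_add_ps(y, z)));
    }
}
```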
-
Just FYI, about SIMD vector math and the "4th component" idea: https://www.reedbeta.com/blog/on-vector-math-libraries/#how-to-simd-and-how-not-to

Personally I feel like adding that to the general-purpose Vector3 is probably too much. There are too many areas that don't benefit from it at all. As a random example, mesh generation APIs need to pass data as tightly packed arrays (a small illustration of the size difference follows below).

Another note: I have been using FastNoise2 in my project, which uses dynamic SIMD. It automatically picks the highest SIMD level to run noise generation at runtime. It also comes with its own abstraction of intrinsics. The author considered libsimdpp, but later preferred to improve their own library (FastSIMD) separately due to performance.
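For context on why padding matters, a tiny sketch (my own illustration; the type names are hypothetical) of the size difference a fourth component introduces:

```cpp
#include <cstdio>

// A plain 3-float vector vs. one padded to 16 bytes so it can live in a
// SIMD register.
struct Vector3Packed { float x, y, z; };
struct alignas(16) Vector3Padded { float x, y, z, pad; };

int main() {
    // 12 vs 16 bytes: an array of padded vectors is one third larger,
    // and its memory layout no longer matches a tightly packed float
    // triplet buffer of the kind a mesh API expects.
    printf("packed: %zu bytes, padded: %zu bytes\n",
           sizeof(Vector3Packed), sizeof(Vector3Padded));
    return 0;
}
```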
-
That could definitely be useful. I don't mean changing the representation to a SIMD vector, just changing from three components to four.
That makes a lot of sense. Sticking to four-component SSE/NEON avoids a lot of problems, and it could still give a big speedup.
That's good to know. It also looks to me like benchmarking of physics hasn't been implemented yet? Anyway, once it's ready I'd be happy to help by contributing benchmarks.
-
I added a few more ideas in #290 (comment). I'd actually written them a while ago but was too busy to do more on this.
-
Any preference about which SIMD API to use? Personally I'm rather partial to the one I created for OpenMM, and not just because I wrote it. 😀 Here is how I evaluate it. Advantages:
Disadvantages:
Of the other options, I think the one most worth considering is Highway. Advantages:
Disadvantages:
None of the others looks to me like a very good option.
-
We also need to decide what to do when compiling in double precision mode. I see a few options.
-
A lot has been said about auto-vectorization already, but to put the final nail in the coffin: yes, compilers can auto-vectorize, but they aren't very good at it. They can handle tons of simple cases, but throw an entire mathematical function at them (not just a simple dot product of vectors) and a lot of the time the results are up in the air.

As someone who has written a math library a few times, I think it's a better idea to wrap an existing, heavily SIMD-optimized library that provides higher-level mathematical abstractions. Libraries such as DirectXMath (which is cross-platform, despite the DirectX naming) and Eigen do this absolutely wonderfully, and can be configured for which AVX instruction set to target. The point is that a majority of our common mathematical operations will probably not be as efficient if we vectorize them ourselves as they would be with the mature optimizations in something like Eigen. Whether to support only AVX (CPUs from ~2011) or also AVX2 (256-bit support, CPUs from ~2013) is going to be up to the developer, and whether they want to repackage game binaries multiple times, which isn't that hard to do, if they even want to.

I had a discussion with @reduz about potentially making our math libraries an opaque wrapper around Eigen. He wasn't against it per se, but it had to show a clear performance improvement. That was also at a time when the engine was hard-baked to 32-bit floating point. Now with the addition of 64-bit floats, and recent issues/topics regarding camera-centric rendering for large worlds, such a setup would definitely increase performance throughout the scene tree. I've wrapped Eigen myself a few times now; it's not that difficult, and we could do it in Godot with a little bit of refactoring effort. The results would probably be very noticeable. On modern CPUs especially, with Eigen, a 4x4 32-bit matrix multiplication has the exact same performance as a 4x4 64-bit matrix multiplication, and that's a HUGE deal. I last benchmarked large batches of matrix multiplies on my 1920X Threadripper on a single core, and 1 million matrix multiplications completed within the margin of error of each other between the 32-bit and 64-bit matrices.
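A minimal sketch of that kind of comparison (my own illustration, not the original benchmark; the iteration count and timing setup are arbitrary):

```cpp
#include <chrono>
#include <cstdio>
#include <Eigen/Dense>

// Times 1 million 4x4 matrix multiplications at a given precision.
template <typename Scalar>
double bench_mat4() {
    using Mat4 = Eigen::Matrix<Scalar, 4, 4>;
    Mat4 a = Mat4::Random();
    Mat4 b = Mat4::Random();
    Mat4 acc = Mat4::Identity();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; i++) {
        acc += a * b; // accumulate so the loop isn't optimized away
    }
    auto end = std::chrono::steady_clock::now();

    volatile Scalar sink = acc.sum(); // keep the result observable
    (void)sink;
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    printf("float : %.1f ms\n", bench_mat4<float>());
    printf("double: %.1f ms\n", bench_mat4<double>());
    return 0;
}
```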
-
Eigen is a great piece of software. I can totally endorse it. It's a bit different, though. It's a linear algebra library which happens to be optimized with SIMD, which is not the same as a SIMD library. No one should ever write their own version of standard BLAS routines (unless they really know what they're doing). But if you're coding up low-level numerical routines for something more specialized, you want to get as close to the hardware as possible.
-
Has anyone considered that this could also be accelerated with OpenCL? OpenCL has very wide cross-platform support.
-
As far as I can tell, none of the code in the core engine is vectorized. On modern processors, most of the available compute capacity is found in the vector units. If your code isn't vectorized, you're missing out on most of the processor's available computing resources.
Has the possibility of adding SIMD vectorization been considered? The only thing I could find was #290, which is rather different. It's talking about adding a SIMD API for use in scripting, while I'm talking about vectorizing the engine itself.
I have a lot of ideas about how this could be implemented. Before I describe them, though, I want to check on whether there is interest in this, or whether it is already discussed somewhere else that I missed.