SIMD vectorization in the engine #4544
Replies: 13 comments 26 replies
-
Modern C/C++ compilers automatically perform SIMD – this is called autovectorization. Instead of adding explicit SIMD intrinsics (which usually aren't portable across architectures), we prefer writing code in a way that compilers can autovectorize. Godot runs on more than x86 – it supports ARM, WebAssembly, and soon RISC-V 🙂 Pull requests to improve autovectorization are welcome, but they must tackle existing bottlenecks that affect real-world projects (or at least a realistic use case). Also, optimization pull requests must be accompanied by benchmarks.
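As an illustration of what compilers can typically handle on their own, here is a minimal sketch (not engine code; the function name is made up) of the kind of loop that autovectorizes well: contiguous arrays, a simple trip count, and no aliasing between inputs and outputs.

```cpp
#include <cstddef>

// Simple per-element multiply-add over contiguous arrays.
// With -O2/-O3, GCC and Clang will usually emit SSE/NEON/WASM-SIMD
// instructions for this loop; __restrict tells the compiler the
// buffers don't overlap, which is often what unlocks vectorization.
void scale_add(float *__restrict dst, const float *__restrict src,
               float scale, size_t count) {
    for (size_t i = 0; i < count; i++) {
        dst[i] = dst[i] + src[i] * scale;
    }
}
```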
-
Autovectorization is not very effective. If you want good vector efficiency, it's essential to use intrinsics, and often to restructure your algorithms as well. Well-written vectorized code tends to be several times faster than anything a compiler can produce on its own. (As a side note, ARM's next-generation SVE vector extension supposedly allows for better autovectorization. That was achieved by analyzing the barriers that compilers run into and designing the instruction set specifically to work around them. For a fairly interesting discussion of the design, see https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf. It's not yet available in any consumer-level processors though.)

I agree that portability is very important. You don't want to scatter architecture-specific intrinsics throughout the code. With a thin wrapper type you can write code like this:

```cpp
fvec4 x(1, 2, 3, 4);
fvec4 y = x/2 - 3;
float z = y[3];
```

The vectorize.h header doesn't define anything directly. Here is the whole content.

```cpp
#if defined(__ARM__) || defined(__ARM64__)
#include "vectorize_neon.h"
#elif defined(__PPC__)
#include "vectorize_ppc.h"
#else
#include "vectorize_sse.h"
#endif
```

It just includes the implementation for whatever architecture is being compiled. For example, here is the version for SSE. All use of intrinsics is limited to a single file for each architecture. That makes it easy to add new architectures. You can even have a fallback implementation that uses clang/gcc portable vectors. That way you automatically get vectorization on future architectures, even if it's not as fast as native intrinsics.
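To illustrate that portable-vector fallback idea, here is a minimal sketch (hypothetical, not the actual OpenMM header) based on the GCC/Clang `vector_size` extension, which gives element-wise arithmetic on any target the compiler supports:

```cpp
// Hypothetical vectorize_generic.h fallback using the GCC/Clang
// vector_size extension. The compiler lowers the arithmetic to whatever
// SIMD the target offers, or to scalar code if there is none.
typedef float fvec4 __attribute__((vector_size(16)));

static inline fvec4 fvec4_set(float a, float b, float c, float d) {
    fvec4 r = {a, b, c, d};
    return r;
}

// Example usage, mirroring the snippet above:
//   fvec4 x = fvec4_set(1, 2, 3, 4);
//   fvec4 y = x / 2 - 3;   // scalars are broadcast element-wise
//   float z = y[3];        // element access works directly
```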
-
In the tests with etcpak, enabling AVX and AVX2 broke our build matrix because of Intel's decision not to support them on some devices. TL;DR: we can do it, but we need to avoid certain CPUs at runtime.
-
Yes, 256-bit vectors are more problematic. Every modern architecture supports 128-bit vectors, so you can safely assume they exist, but they don't all support 256-bit. You can still use AVX when it's available, but it takes a lot more care. You need to compile the relevant code with and without AVX support and decide at runtime which version to use (see the dispatch sketch after this comment). I'd start with only 128-bit vectors, then in the future consider adding 256-bit versions of only the most performance-critical routines.

Here are a few other libraries that provide portable SIMD APIs.

- VecCore: https://github.com/root-project/veccore

I haven't used any of those. I just turned them up in a web search.
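As a rough illustration of the compile-both-and-dispatch approach, here is a minimal GCC/Clang x86 sketch (function names are made up). The `target("avx2")` attribute enables AVX2 codegen for just one function; compiling the two variants in separate translation units with different flags works the same way.

```cpp
#include <cstddef>

// AVX2 codegen is enabled only for this function; the compiler may
// autovectorize the loop with 256-bit instructions.
__attribute__((target("avx2")))
static void scale_avx2(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] * s;
}

// Baseline variant, compiled with the default instruction set.
static void scale_baseline(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[i] * s;
}

// Runtime dispatch: check the CPU once, then call the matching variant.
void scale(float *dst, const float *src, float s, size_t n) {
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
    if (has_avx2) scale_avx2(dst, src, s, n);
    else scale_baseline(dst, src, s, n);
}
```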
-
If there's interest in this, the next question to consider is what to vectorize. There are two main approaches to vectorizing code, what you might describe as "broad" and "deep". They aren't mutually exclusive. You can do both. Really they're just opposite ends of a continuum.

The broad approach involves vectorizing low-level routines that get used in a lot of places. Any code that calls those routines immediately becomes faster with no changes. Classes like Vector3, Transform3D, and Basis have a lot of routines that would be good candidates. I think this is worth trying, and it might produce some benefit, but probably not a lot. The problem is that all those routines expect their inputs and outputs to be in memory, not SIMD registers. To get good performance, it's essential to keep things in registers as much as possible and only go to memory when absolutely necessary. So a more extreme version would be to reimplement Vector3 to store its data in a SIMD type. Even then, the potential benefit is still limited. If you use an eight-component AVX register to store a three-component vector, you're leaving a lot of performance on the table!

So that brings us to the deep approach, which is to take larger sections of code and more extensively rewrite them based around vectorization. This can give huge speedups. Can we enumerate particular algorithms that would be good candidates for this? Physics and geometry calculations often vectorize well (a small sketch of this style follows after this comment).

Before doing any kind of optimization, of course, the first step is to make sure you have good benchmarks. That way you know whether you're actually making it faster, and you can make sure you aren't inadvertently making something else slower at the same time! The benchmarks at https://github.com/godotengine/godot-benchmarks are a good start, but they're still very limited. Are there any other existing benchmarks? If not, the first step should probably be to create more of them to cover more of the code. I'd be happy to try doing that.
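To make the broad-vs-deep distinction concrete, here is a minimal sketch (my own illustration, not engine code; names and layout are assumptions) of the deep style: instead of calling a scalar Vector3 routine per element, the data is laid out as separate x/y/z arrays so four dot products are computed per SSE instruction.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Structure-of-arrays layout: all x components together, then y, then z.
// Computes dst[i] = dot(a[i], b[i]) for four vector pairs per iteration.
// Assumes count is a multiple of 4 and the arrays are 16-byte aligned.
void dot_products_soa(float *dst,
                      const float *ax, const float *ay, const float *az,
                      const float *bx, const float *by, const float *bz,
                      size_t count) {
    for (size_t i = 0; i < count; i += 4) {
        __m128 x = _mm_mul_ps(_mm_load_ps(ax + i), _mm_load_ps(bx + i));
        __m128 y = _mm_mul_ps(_mm_load_ps(ay + i), _mm_load_ps(by + i));
        __m128 z = _mm_mul_ps(_mm_load_ps(az + i), _mm_load_ps(bz + i));
        _mm_store_ps(dst + i, _mm_add_ps(x, _mm_add_ps(y, z)));
    }
}
```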
-
Just FYI, about SIMD vector math and the "4th component" idea: https://www.reedbeta.com/blog/on-vector-math-libraries/#how-to-simd-and-how-not-to

Personally I feel like adding that to the general-purpose Vector3 is probably too much. There are too many areas that don't benefit from it at all. As a random example, mesh generation APIs need to pass data as tightly packed arrays (a small illustration of the size difference follows below).

Another note: I have been using FastNoise2 in my project, which uses dynamic SIMD. It automatically picks the highest SIMD level to run noise generation at runtime. It also comes with its own abstraction of intrinsics. The author considered libsimdpp, but later preferred to improve their own library (FastSIMD) separately due to performance.
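For context on why padding matters, a tiny sketch (my own illustration; the type names are hypothetical) of the size difference a fourth component introduces:

```cpp
#include <cstdio>

// A plain 3-float vector vs. one padded to 16 bytes so it can live in a
// SIMD register.
struct Vector3Packed { float x, y, z; };
struct alignas(16) Vector3Padded { float x, y, z, pad; };

int main() {
    // 12 vs 16 bytes: an array of padded vectors is one third larger,
    // and its memory layout no longer matches a tightly packed float
    // triplet buffer of the kind a mesh API expects.
    printf("packed: %zu bytes, padded: %zu bytes\n",
           sizeof(Vector3Packed), sizeof(Vector3Padded));
    return 0;
}
```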
-
That could definitely be useful. I don't mean changing the representation to a SIMD vector, just changing from three components to four.
That makes a lot of sense. Sticking to four-component SSE/NEON avoids a lot of problems, and it could still give a big speedup.
That's good to know. It also looks to me like benchmarking of physics hasn't been implemented yet? Anyway, once it's ready I'd be happy to help by contributing benchmarks.
-
I added a few more ideas in #290 (comment). I'd actually written them a while ago but was too busy to do more on this.
-
Any preference about which SIMD API to use? Personally I'm rather partial to the one I created for OpenMM, and not just because I wrote it. 😀 Here is how I evaluate it. Advantages:
Disadvantages:
Of the other options, I think the one most worth considering is Highway. Advantages:
Disadvantages:
None of the others looks to me like a very good option.
-
We also need to decide what to do when compiling in double precision mode. I see a few options.
-
A lot has been said about auto-vectorization already, but to put the final nail in the coffin: yes, compilers can auto-vectorize, but they aren't very good at it. They can handle tons of simple cases, but throw an entire mathematical function at them (not just a simple dot product of vectors) and a lot of the time the results are up in the air.

As someone who has written a math library a few times, I think it's a better idea to wrap an existing, heavily SIMD-optimized library that provides higher-level mathematical abstractions. Libraries such as DirectXMath (which is cross-platform, despite the DirectX naming) and Eigen do this absolutely wonderfully, and can be configured for which AVX instruction set to target. The point is that a majority of our common mathematical operations will probably not be as efficient if we vectorize them ourselves as they would be with the mature optimizations in something like Eigen. Whether to support only AVX (CPUs from ~2011) or also AVX2 (256-bit support, CPUs from ~2013) is going to be up to the developer, and whether they want to repackage game binaries multiple times, which isn't that hard to do, if they even want to.

I had a discussion with @reduz about potentially making our math libraries an opaque wrapper around Eigen. He wasn't against it per se, but it had to show a clear performance improvement. That was also at a time when the engine was hard-baked to 32-bit floating point. Now with the addition of 64-bit floats, and recent issues/topics regarding camera-centric rendering for large worlds, such a setup would definitely increase performance throughout the scene tree. I've wrapped Eigen myself a few times now; it's not that difficult, and we could do it in Godot with a little bit of refactoring effort. The results would probably be very noticeable. On modern CPUs especially, with Eigen, a 4x4 32-bit matrix multiplication has the exact same performance as a 4x4 64-bit matrix multiplication, and that's a HUGE deal. I last benchmarked large batches of matrix multiplies on my 1920X Threadripper on a single core, and 1 million matrix multiplications completed within the margin of error of each other between the 32-bit and 64-bit matrices.
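A minimal sketch of that kind of comparison (my own illustration, not the original benchmark; the iteration count and timing setup are arbitrary):

```cpp
#include <chrono>
#include <cstdio>
#include <Eigen/Dense>

// Times 1 million 4x4 matrix multiplications at a given precision.
template <typename Scalar>
double bench_mat4() {
    using Mat4 = Eigen::Matrix<Scalar, 4, 4>;
    Mat4 a = Mat4::Random();
    Mat4 b = Mat4::Random();
    Mat4 acc = Mat4::Identity();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; i++) {
        acc += a * b; // accumulate so the loop isn't optimized away
    }
    auto end = std::chrono::steady_clock::now();

    volatile Scalar sink = acc.sum(); // keep the result observable
    (void)sink;
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    printf("float : %.1f ms\n", bench_mat4<float>());
    printf("double: %.1f ms\n", bench_mat4<double>());
    return 0;
}
```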
-
Eigen is a great piece of software. I can totally endorse it. It's a bit different, though. It's a linear algebra library which happens to be optimized with SIMD, which is not the same as a SIMD library. No one should ever write their own version of standard BLAS routines (unless they really know what they're doing). But if you're coding up low-level numerical routines for something more specialized, you want to get as close to the hardware as possible.
-
Has anyone considered that this could also be accelerated with OpenCL? OpenCL has very wide cross-platform support.
-
As far as I can tell, none of the code in the core engine is vectorized. On modern processors, most of the available compute capacity is found in the vector units. If your code isn't vectorized, you're missing out on most of the processor's available computing resources.
Has the possibility of adding SIMD vectorization been considered? The only thing I could find was #290, which is rather different. It's talking about adding a SIMD API for use in scripting, while I'm talking about vectorizing the engine itself.
I have a lot of ideas about how this could be implemented. Before I describe them, though, I want to check on whether there is interest in this, or whether it is already discussed somewhere else that I missed.