Turn Godot into an Async Programming Framework #9363

Zistack · 2024-03-23T17:36:13Z

Zistack
Mar 23, 2024

I've been sniffing my way through the Godot codebase a little bit lately, as I noticed that it supported a thread pool, coroutines, and signals, and the servers claimed thread safety. I come from a background in concurrency and distributed systems, and so I'm looking to see if I can do proper async programming in Godot.

Unfortunately, I find that I cannot. While the necessary components are superficially present, they do not compose in the necessary ways. As a side-effect of this investigation, I also see comments from big names like reduz and Calinou talking about how they'd like to make the whole engine thread-safe and add more multi-threading support in general.

Another side-effect of this investigation is that I noticed how Godot uses locks. Godot's use of OS mutexes more or less everywhere is a big problem for async programming. It is a bad idea for a task submitted to a thread pool to use an OS mutex, as it can put that thread to sleep while it blocks. A thread pool thread should never sleep unless there is literally no more work to be done and it is waiting for the next task to be submitted. I suspect that this issue is actively obstructing efforts to further multi-thread Godot already.

The Grand Vision

In an ideal world, after some basic setup code, all code is running as tasks in the thread pool. For this to be reasonable, these tasks would need to be more like user-threads in some sense than simple functions. Godot coroutines would work, though their semantics would have to change when it comes to awaiting on signals (specifically, what happens after the signal is emitted).

Besides making it possible/easy to write truly scalable code, such a transformation would make certain other things easier as well. In combination with async timers, a thread-pool-based model would simplify the scheduling of concurrent rendering and physics processes. It would also start to look reasonable to add async IO support to Godot for high-performance networking and disk access.

Thread Pool

The most fundamental changes that would have to be made would be to the thread pool. The current thread pool implementation is too basic, and does not allow for tasks to manage themselves/each other.

The thread pool should use a lock-free multi-producer multi-consumer queue internally for the task queue. Otherwise, processors with too many cores will lose too many valuable cycles on contention over the task queue. This means that enqueuing tasks is not only thread-safe but also lock-free, which is important since enqueuing tasks is an operation we might want to perform inside of a lock implementation.

It's a bit of an aside, as it isn't strictly necessary, but I would propose an alternate priority model for tasks on the thread pool. Instead of having two fixed priority levels, let the user set the number of priority levels. In general, work for a game will fall into a small number of categories as defined by their typical epoch/deadline cycle. These categories usually have a fairly obvious relative prioritization. We can default to 2-3 priority levels, and have properties that say what levels the rendering and physics servers should be running their code in. Internally, these levels will be represented by an array, each with a queue in it. The positions of the array correspond to the priority levels.

Mutexes

The only OS mutexes in the codebase should live in the thread pool, if they are even necessary at all. We would provide our own mutex implementation which yields the current task to the thread pool until the protected region is available again. This way, the thread pool can just do other work while a lock is being acquired. This will allow locks to be used pretty much anywhere without breaking things, so long as the usual dead-lock conditions are avoided.

Coroutines

Although this is superficially about coroutines, the actual implementation of Godot coroutines is more or less fine as-is. It's really how they interact with other parts of the system that is the problem, and the thread pool will need to explicitly support them. Also, I use coroutines to sometimes mean coroutines and functions, because ultimately we will want to be able to do with ordinary functions many of the same sorts of things we can do with coroutines.

If I understand correctly, In the present implementation, coroutines are initially executed by the calling thread, and then if they await on a signal, the remainder of that coroutine ends up being executed by the thread that emits that signal (or at least the remainder up until the coroutine awaits on another signal). This will need to change. Instead of the emitting thread executing the co-routine continuation, that emitting thread needs to just enqueue the resumption of those coroutines onto the thread pool. This way, coroutines can actually be treated a bit like user threads, maintaining parallelism throughout their execution.

A coroutine can be executed in one of two ways: synchronously, or asynchronously. A synchronous execution is what would be associated with the 'await' syntax now. The calling task does not continue execution until the coroutine finishes. Asynchronous execution would be like what happens when you submit the coroutine to the thread pool. Ideally, when executing a coroutine (or a function) asynchronously, a future would be returned that we can wait on and check for a return value.

There are potentially some GDScript syntax changes to consider, since functions and coroutines should both be asynchronously/synchronously callable, but right now their syntax is heterogeneous for the same semantics.

Sometimes it is useful to dispatch a whole group of tasks. Right now, Godot exposes a way of submitting an indexed set of otherwise identical tasks, but there are other groupings that are useful. Providing an object which can be used to group arbitrary tasks together would be a good idea.

var task_group = TaskGroup.new ()
task_group.submit (foo1)
task_group.submit (foo2)
task_group.submit (foo3)
await task_group.finish()

Concurrent Data Structures

In a fully-concurrent Godot, lacking some basic thread-safe datastructures will be painful. We'll want to add some.

I do not know which would make more sense: Making Dictionary thread-safe, or providing an alternate ThreadSafeDictionary implementation.

I do know that a thread-safe Dequeue type would likely be very useful (though this could sometimes be replaced by a thread-safe Array implementation instead). This datastructure would greatly simplify the issue of filtering/sorting data passed between phases of parallel computation, which is a thing that can happen a lot in games with many communicating agents.

What I Don't Know

I do not know how difficult it will be to integrate such changes with the rest of Godot. I am not super-familiar with the codebase, and I don't know what sorts of challenges will likely be encountered when trying to make things thread-safe and compatible with this overall paradigm.

I would really like to work with someone on this.

What I Can Contribute

I am an expert in concurrency, especially in C++. I am knowledgeable about the C++ relaxed memory model, and I can design and implement lock-free datastructures. I can provide implementations of all of the low-level concurrency primitives that I have described and more.

I am fairly motivated. I am trying to work on a networked game with heavy simulation elements. Although I am not far into developing it, I keep tripping myself over the lack of proper scalable multi-threading.

Zistack · 2024-03-29T21:43:11Z

Zistack
Mar 29, 2024
Author

A brief interaction on the Godot Contributors Chat suggested to me that part of the reason that this discussion isn't turning into much of an actual discussion is possibly because the members of the Godot core team are not sufficiently versed in multi-threaded programming and its paradigms to understand just what my suggestion entails.

As such, I have decided to go into more detail about both the motivation for and implementation of this idea.

Motivation

There are two principle reasons for using async programming frameworks in general, those being ergonomics and performance. These apply just as well to game development as they do to more typical applications developed with async, like web servers and other high-performance networked applications.

Performance

The primary reason for the development of the async/await programming paradigm was to gain performance in web servers. It ultimately grew out of the patterns that were required to effectively use async IO, but it turned out that the async/await pattern is actually broadly useful for dealing with event-driven systems of any kind. The primary performance benefits of async/await come from two properties: 1) it is much cheaper to create and manipulate coroutines than OS threads and 2) it is also much easier to compose coroutines in ways that don't even incur the overhead of treating them like individual user threads.

Because of this, a programmer can fearlessly create hundreds or thousands of small short-lived tasks and not worry about doing things like consuming all of the system memory for unused stack space (creating an OS thread always creates a stack), or drowning the application in syscall overhead (any operation on an OS thread incurs a system call).

Ergonomics

The ergonomics of async programming primarily come from the ergonomics of coroutines. Coroutines are an amazing tool for transforming linear code into a pausable/resumable state-machine. These state machines are readily composable, and that composition can be reasoned about in terms of functional composition, which is generally a familiar and proven-to-be-flexible paradigm. Writing code in this model is a lot like writing single-threaded blocking code, except that you have the option of spawning off and/or waiting for tasks that can run asynchronously while you do other things. The fundamental programming model will change surprisingly little from the user's perspective (assuming a single-threaded programming model as the starting point).

The ways that we can effectively use that programming model expand, however. Being able to write state machines in narrative-fashion is really useful for stuff like AIs or network protocols. These coroutines could direct operations across potentially many frames at a high level of abstraction and all be run concurrently. This is similar to spawning off a thread for each logical long-running task, but a lot less expensive, and with a simpler and better-integrated API for forking/joining and passing information around.

Some supporting work does need to be done for async programming to be truly pain-free, though. It is not enough to simply make the subsystems of Godot thread-safe. All APIs in the engine really should be truly concurrent, in that calling operations on a single subsystem from multiple threads/tasks will cause them to execute concurrently, rather than serializing them through a lock or something. Where this is not possible, it should be documented very clearly. (Related: call_deferred () is an anti-pattern that should be eliminated.)

If we are able to make such a transformation to the engine's code, a lot of the internal complexity around gathering and distributing signals during specific computational phases should be able to just go away. This will simplify the main loop, and also make it easier to give user code more control over the main loop and the operation of the various servers that Godot provides.

Implementation

This proposal is not a small undertaking. There are many different things to be done. I will describe them here in the approximate order in which they need to be completed.

Re-implement The ThreadPool

The first thing to do is re-implement the ThreadPool. As I described before, the actual work queue should be a lock-free (really wait-free) multi-producer multi-consumer queue. I have a specific implementation in mind that is tailored to the specific needs of Godot. We will need to use C++ atomics for this. I would like to implement the priority system that I described above as well. I would like feedback on that idea.

The worker threads themselves will be quite dumb. They just pull callable objects off of the task queue, and call them. When those callable objects finish, they grab the next one and repeat the process.

The objects we put on the queue are where all of the fancy logic for managing the structure of the workload lives. These objects wrap coroutines and functions and actually set up all of the signals and signal handlers that are required for coordinating all of the work.

The simplest type of task is just a function that doesn't yield. It simply runs to completion, and provides the result to the future object that represents that task. (Yes, we will need to have futures.)

The most common type of task will probably be the coroutine. A coroutine can pause itself when it wants to wait for a signal or for a task to complete. When a coroutine pauses, it enqueues the paused state of the coroutine as a task to be submitted to the thread pool when a particular signal is emitted, or when some future receives its result (whichever thing the coroutine is waiting for). When a coroutine finishes, it sets a future value, much like an ordinary function.

Then there will be the indexed group task. This is similar to the group task which Godot already supports. This could be used with functions or coroutines that haven't been started yet. This will need special support in the lock-free task queue implementation.

C++ Coroutines

If we are going to write engine code in an async/await programming model, then we need coroutine support at the C++ level. This basically requires C++20 coroutines, which may be an issue, given that Godot right now only admits a selection of features up to C++17.

Assuming that we incorporate C++ coroutines into the codebase, an important question must be raised: Do we re-base GDScript coroutines into C++ coroutines? Theoretically, we no longer need a special type for the state of a GDScript function. We can just have opcodes for the various yields and awaits, and pass them down to the C++ coroutine implementation.

Build Supporting Primitives

Once we have a foundation, we need to build some support tools, or we still won't be able to accomplish much.

Mutex

The most basic tool is a mutex which yields to the thread pool. I know how to build this. Much as it would be cool to make everything in the engine lock-free, it's not actually going to be possible or reasonable to do this, so we will need a mutex.

Concurrent Datastructures

After that, we really need some concurrent datastructures, preferably wait-free when possible. I'm pretty sure we want a thread-safe dictionary type. I can imagine that being useful all over the scene tree, among other things.

We almost certainly want a type that can concurrently accumulate a list/queue of items to be drained as part of a later phase of computation. This could be a queue or an array, though trying to resize memory allocations inside of lock-free datastructures is not a good time. A lock-free queue for user code should not be hard to provide.

Arbitrary Task Grouping

Although it isn't strictly necessary, providing an object for grouping tasks can make structured concurrency much cleaner to write.

This object would act as an interface for submitting tasks to the thread pool, but unlike direct submission, it accumulates the futures in an internal collection. It would provide convenience methods like waiting for all tasks to complete, or checking their collective status. Such an object is particularly useful for gathering up tasks that are submitted from the body of a loop.

Even though such an object could easily be implemented in a few lines of GDScript, I would recommend implementing such a thing in the engine because I anticipate that it will be useful for engine code and also broadly useful in user code.

Group Dispatch

One neat programming pattern that becomes possible in an async framework is basically spawning a coroutine for each object in a game, where that coroutine manages that object for the object's lifetime. This would be particularly useful for writing game AIs. One of the consequences of this pattern is that there might be many thousands of coroutines that end up awaiting on the physics process signal, and this is a problem.

Normally, a signal emitter in this system would just take all of the awaiting tasks and throw them onto the thread pool. The signal emitter would dequeue each task one by one and put them on the thread pool. This is fine when it's only a dozen tasks. This is not fine when it is hundreds or thousands.

It would be a good idea to provide an object which can accumulate large numbers of tasks which a signal emitter can basically dump onto the thread pool all at once in a single step. This object would likely re-use some of the internal structures of the worker thread pool (which will involve some pretty hefty chunks of memory), and so wouldn't be something you would want to use just everywhere. Hence the idea of providing it as an object that code can opt-in to using only when appropriate.

Convert Engine Subsystems

With a solid foundation and good tools, we can finally work on the rest of the engine. I don't know all of the things that are in the engine, but I can talk a bit about some of the parts that I do know about.

Object IDs

One major issue I see with Godot as it is now is that it assigns/maintains an object ID for every object. Given the dictionaries involved, this is potentially a huge bottleneck for performance. What are these IDs actually used for? Is there any way that we can not do this, or otherwise build a system so that the user opts-in to ID management for only a subset of objects?

If we must maintain object IDs for every object, then we will need a really high-performance ID management system in order to avoid defeating all of the work we've done up to this point.

Servers

The servers are already multi-threaded and thread-safe, but to the best of my knowledge they're still designed around a programming model where a main thread is carving up all work across all state into global computational phases.

For the physics server, in particular, this is a problem. We need to be able to manipulate different physics spaces completely independently of each other, changing their state and stepping them all concurrently with each other.

I know that some aspects of the rendering server might give us trouble, because not all graphics API implementations play nice with concurrency (or at least they didn't when I last looked at them a few years ago - maybe this has changed).

Scene Tree

There is some experimental support for multi-threading processing on the scene tree, but that's about as far as it goes. With concurrent dictionaries and a thread-safe signal system, it should be possible to make the whole scene-tree API thread-safe and concurrent.

Fix The Main Loop

Finally, we can fix the main loop to actually dump all of the work onto the thread pool, rather than rely on a main thread that literally loops continuously to try to coordinate everything.

We could simply have the main loop do the setup and teardown on the main thread, and then enqueue tasks onto the thread pool that represent the main loop of the game. This might even be the thing to start with. There is another way of fixing the main loop that I think should be considered, however.

Invert The Main Loop?

Fundamentally, the biggest issues that Godot has with supporting networked games right now ultimately stems from the fact that Godot takes away too much of the user's control. Godot provides the main loop, and user code just fills in callbacks. Godot tells the servers when to perform their processing steps, and on what state. Godot manages the task structure of the processing over the scene tree.

This is the wrong way to do things, because it means that there are parts of the engine that fundamentally cannot be replaced by user code. You may be able to effectively replace some of these things by using your own servers that expose more control, or by ignoring the scene tree, but this defeats much of the purpose of using Godot in the first place.

What we can do is invert Godot from being a framework into being a library (approximately). Godot is already trying to straddle this line, and the major pain points for networked games from from the places where Godot is being a framework. Meanwhile, Godot's best selling points come from the parts of the engine that are exposed more as libraries.

In my view, these are the server architecture, the ability to dynamically mess around with scripts at runtime (good for modding), all the actual libraries that Godot provides, and also how well GDExtension composes with everything, ultimately due to the fact that it's fundamentally just a mechanism for adding native libraries to an existing pile of well-integrated libraries.

As part of this inversion, the scene tree and servers would need to have their APIs adjusted to reflect the fact that they might be being controlled entirely from user code. This is a good idea anyway, just because it decouples the implementations of those things from the implementation Godot's main loop, and looser coupling is almost always a good thing.

Godot can perform setup and tear-down behind-the-scenes still, since there's no real value in exposing that to the user, but then after setting up the thread pool and starting it up, the entry-point for user code can just be some form of main (). Of course, the default implementation of this main task can provide coordination of rendering and physics processing in much the same way as Godot already does, but this task could also simply be overridden by user code. The internal logic for controlling rendering and physics could be exposed as an API that the user could use or not use as they see fit.

A Note About Platforms Without Multithreading

This all works just fine when there is only one thread. In the case of a platform which can't readily spawn threads, the main thread will have to become the singular worker thread in the thread pool, but otherwise everything should just work without any modification (including user code, so long as it doesn't attempt to spawn threads).

What I Need

This is a lot of stuff. I know how to do pretty much all of it in principle. I could really use some help figuring out how to actually do it in practice.

For one, I think trying to do this without at least some coordination between this project and other aspects of Godot development would probably just be a disaster. I don't even know where to begin on this. Who is coordinating Godot development? Who do I talk to?

For another, this project would touch basically every part of the engine. Ideally, when I am working on making some part of the engine concurrent/thread-safe, I would be working with someone who is familiar with that part of the engine and what it needs to do. Who are those people?

Last but not least, I need to know if people think that this is actually a worthwhile endeavor. How do I get concrete feedback on that? Do I have to submit an actual proposal issue for it to be discussed amongst the core team and given the green light? If so, how much of this should I include in that proposal?

1 reply

Calinou Apr 1, 2024
Maintainer

I think the proposal needs more qualified use cases: who would this benefit? Would it help games made with Godot run noticeably better? I'm seeing a lot of theory in this thread, but large sweeping changes need to have real world benefits that outweigh the downsides. Right now, I feel the benefit isn't worth the added work this would represent.

At the end of the day, most games released today rarely benefit from having 6 CPU physical cores accessible to improve performance (perhaps 8 for a few demanding games). It's not always the fault of the engine, it's just that you don't always have enough work that can be parallelized and spread across frames.

Also, in my experience, most bottlenecks in Godot are either in rendering or scripting (GDScript or C#/GDExtension marshalling costs). Rendering can be either CPU-bound or GPU-bound depending on the situation.

The baseline performance cost of this change also needs to be considered, as Godot aims to run on low-end mobile devices (and web browsers on these devices!). Threads are enabled on Android and iOS after all, even if they're not always enabled on the web platform.

If we are going to write engine code in an async/await programming model, then we need coroutine support at the C++ level. This basically requires C++20 coroutines, which may be an issue, given that Godot right now only admits a selection of features up to C++17.

We can't upgrade to C++20 just yet, as Godot runs on platforms that are limited to C++17 due to their compiler toolchains (namely, some consoles).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Turn Godot into an Async Programming Framework #9363

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Turn Godot into an Async Programming Framework #9363

Zistack Mar 23, 2024

The Grand Vision

Thread Pool

Mutexes

Coroutines

Concurrent Data Structures

What I Don't Know

What I Can Contribute

Replies: 1 comment · 1 reply

Zistack Mar 29, 2024 Author

Motivation

Performance

Ergonomics

Implementation

Re-implement The ThreadPool

C++ Coroutines

Build Supporting Primitives

Mutex

Concurrent Datastructures

Arbitrary Task Grouping

Group Dispatch

Convert Engine Subsystems

Object IDs

Servers

Scene Tree

Fix The Main Loop

Invert The Main Loop?

A Note About Platforms Without Multithreading

What I Need

Calinou Apr 1, 2024 Maintainer

Zistack
Mar 23, 2024

Replies: 1 comment 1 reply

Zistack
Mar 29, 2024
Author

Calinou Apr 1, 2024
Maintainer