Build systems and bundlers #24

ahabhgk · 2025-01-07T09:55:39Z

Recently, I've been researching the incremental implementation of Rspack. A paper titled "Build Systems à la Carte: Theory and Practice" has been mentioned in many materials about incremental builds implemented by other compilers. So I took some time to study it and found it quite interesting. It also has some relevance to bundlers. This article will briefly introduce the content of this paper and attempt to summarize bundlers from the perspective of build systems.

"à la carte": a French term for "according to the menu".

For brevity, this article omits many details. In particular, the paper devotes a whole chapter to the problems that real-world build systems encounter, and that chapter is not covered here. Real-world build systems face many practical engineering problems, so this article serves only as an introduction and offers a new perspective on those problems.

Build system

A build system is a software system that automatically executes a series of repeatable tasks. Common examples include Make, Shake, and Bazel. They take source files as input and execute tasks according to a task description file (such as a makefile) to build executable files.

There are also some less obvious examples. Excel takes cells as input, treats the formulas in cells as tasks, and executes them to produce the values of those cells. UI frameworks take props as input, treat components as tasks, and execute them to produce a new UI.

From this, we can identify some common concepts:

Task: A unit of work. Its actual logic is defined by the task descriptor, such as a makefile or an Excel formula.
Input: The input for a task.
Output: The output of a task. The output of a task may be the input for the next task.
Info: Build information that persists across builds and is available to the next build. For example, file modification times are Make's Info. In a bundler, Info can be understood as the cache.
Store: Storage. It is a place where the Input, Output, and Info of a Task are stored. For example, in Make, the file system is its Store.
Build: Based on the above concepts, a build can be described as follows: given the defined Tasks, the existing Store, and new Inputs, produce a new Store.
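
The concepts above can be modelled in a few lines. The following is a minimal sketch in Python, not code from the paper or from any real build system. The names Store, Tasks, fetch, and build are illustrative assumptions, and build deliberately has no Scheduler or Rebuilder, so it recomputes everything on every run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Key = str
Value = int
Fetch = Callable[[Key], Value]   # how a task asks for the value of another key
Task = Callable[[Fetch], Value]  # a task computes one value from fetched inputs
Tasks = Dict[Key, Task]          # keys without an entry are plain inputs


@dataclass
class Store:
    values: Dict[Key, Value] = field(default_factory=dict)  # Inputs and Outputs
    info: Dict[Key, object] = field(default_factory=dict)   # Info kept across builds


def build(tasks: Tasks, target: Key, store: Store) -> Store:
    """Naive build: recompute the target and everything it needs, every time."""
    def fetch(key: Key) -> Value:
        task = tasks.get(key)
        if task is not None:              # a computed key: run its task
            store.values[key] = task(fetch)
        return store.values[key]          # an input key: read it from the Store

    fetch(target)
    return store


# Spreadsheet-style example (hypothetical cells): B1 = A1 + A2, B2 = B1 * 2
tasks: Tasks = {
    "B1": lambda fetch: fetch("A1") + fetch("A2"),
    "B2": lambda fetch: fetch("B1") * 2,
}
store = build(tasks, "B2", Store(values={"A1": 10, "A2": 20}))
print(store.values["B2"])  # 60
```

A Scheduler and a Rebuilder are what turn this naive loop into something incremental: the first decides when fetch runs a task, the second decides whether it has to run at all.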

These concepts are quite universal, and their implementations are similar across build systems. The main differences among build systems come from the strategies they adopt for the following two points:

Whether a Task is re-executed or not.
The execution order of Tasks.

These two points correspond to two relatively important concepts respectively: Rebuilder and Scheduler. Different build systems can be regarded as combinations of different Rebuilders and different Schedulers.

Scheduler

The Scheduler holds a Rebuilder and drives a new Build, deciding the order in which Tasks are executed.

Topological: It performs a topological sort based on the dependencies of tasks and executes the tasks according to the result of the topological sort.
Restarting: It selects a task and starts executing it. If the task needs a dependency that has not been executed yet, the task is aborted and another task is selected. The aborted task is restarted later. This repeats until all tasks have been executed.
Suspending: It selects a task to execute. If the dependencies of the task have not been fully executed, it executes its dependencies first. After the dependencies have been executed, it continues to execute the task. This can be easily achieved through async/await.
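
As a rough illustration of two of these strategies, here is a sketch that runs the same small dependency graph through a topological scheduler and through a suspending scheduler built on async/await. The graph and the function names are made up for this example. It assumes Python 3.9+ (for graphlib) and an acyclic graph, since the suspending version would wait forever on a cycle.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical graph: each key lists the keys it depends on.
deps: Dict[str, List[str]] = {"app": ["util", "lib"], "lib": ["util"], "util": []}


def topological_order(deps: Dict[str, List[str]]) -> List[str]:
    """Topological scheduler: all dependencies must be known before anything runs."""
    return list(TopologicalSorter(deps).static_order())


async def suspending_build(target: str, deps: Dict[str, List[str]], log: List[str]) -> None:
    """Suspending scheduler: start a task and pause it while its dependencies build."""
    started: Dict[str, asyncio.Task] = {}

    async def run(key: str) -> None:
        for dep in deps[key]:      # dependencies are discovered while the task runs
            await fetch(dep)       # suspend until the dependency has finished
        log.append(key)            # the real task body would go here

    def fetch(key: str) -> asyncio.Task:
        if key not in started:     # each task is executed at most once per build
            started[key] = asyncio.create_task(run(key))
        return started[key]

    await fetch(target)


order = topological_order(deps)
log: List[str] = []
asyncio.run(suspending_build("app", deps, log))
print(order)  # ['util', 'lib', 'app']
print(log)    # ['util', 'lib', 'app']
```

Both produce the same order here. The difference is that the topological version needs the whole graph up front, while the suspending version only learns about a dependency when a running task asks for it.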

Rebuilder

The Rebuilder holds a Task and decides whether the Task needs to be re-executed, that is, whether to reuse the cached result or to use the result of a fresh execution.

Dirty bit: Each task records whether it is clean or dirty. After a build is completed, all tasks are clean. During the next build, the changed inputs are marked as dirty, and every task that depends on something dirty, directly or transitively, is re-executed. If a task and all of its dependencies are clean, the task does not need to be re-executed.
Verifying traces: It records information about a task's dependencies, such as hashes or timestamps. The next time the task is about to be executed, this information is used to verify whether the dependencies have changed. If they have changed, the task is re-executed; otherwise, the task's previous result is reused. It can be understood as a cache in which the recorded hashes or timestamps are the cache keys.
Constructive traces: Derived from Verifying traces, it is used to support cloud caching. The difference between Constructive traces and Verifying traces is that in addition to recording lightweight information such as hashes or timestamps, it also records the actual content. In this way, when the traces are transmitted over the network, the actual content can be transmitted, thus realizing cloud caching and remote task execution.
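
To make the verifying-traces idea concrete, here is a small sketch. It is a simplified model under stated assumptions, not the paper's vtRebuilder: values are strings, traces store SHA-256 hashes, and the class and task names (VerifyingTraceBuild, clean, out) are invented for the example.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], str]
Task = Callable[[Fetch], str]


def digest(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()


class VerifyingTraceBuild:
    def __init__(self, tasks: Dict[str, Task], inputs: Dict[str, str]):
        self.tasks = tasks
        self.values = dict(inputs)                          # Store: inputs and outputs
        self.traces: Dict[str, List[Tuple[str, str]]] = {}  # Info: key -> [(dep, hash)]
        self.executed: List[str] = []                       # tasks that actually ran

    def build(self, key: str) -> str:
        task = self.tasks.get(key)
        if task is None:                  # plain input
            return self.values[key]
        trace = self.traces.get(key)
        if trace is not None and all(digest(self.build(d)) == h for d, h in trace):
            return self.values[key]       # trace verified: reuse the previous output
        seen: List[Tuple[str, str]] = []

        def fetch(dep: str) -> str:       # record each dependency as it is used
            value = self.build(dep)
            seen.append((dep, digest(value)))
            return value

        self.values[key] = task(fetch)
        self.traces[key] = seen
        self.executed.append(key)
        return self.values[key]


def strip_comments(fetch: Fetch) -> str:
    return "\n".join(l for l in fetch("src").splitlines() if not l.startswith("#"))


tasks = {"clean": strip_comments, "out": lambda fetch: fetch("clean").upper()}
b = VerifyingTraceBuild(tasks, {"src": "a = 1"})
b.build("out")
first_run = list(b.executed)              # ['clean', 'out']

b.values["src"] = "# comment\na = 1"      # edit the input, but only add a comment
b.executed.clear()
result = b.build("out")
second_run = list(b.executed)             # ['clean']
```

In the second build, clean is re-executed because the hash of src changed, but its output is identical, so the trace of out still verifies and out is not re-executed.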

Build Systems

Build systems can be regarded as combinations of different Rebuilders and different Schedulers.

First, let's introduce several common features:

Dynamic dependencies: Whether the Tasks that a Task depends on are declared statically or computed dynamically. For example, a makefile statically declares the dependencies among its Tasks, whereas in Excel, if cell B2 contains a formula like IF(RANDBETWEEN(0, 1) > 0.5, A1, A2), the dependencies of B2 can only be determined by evaluating the formula.
Minimality: Execute only the tasks that are necessary to complete the build. A true minimum is quite difficult to achieve, so this feature is often relative.
Early cutoff: When a Task is re-executed but its Output has not changed, the Tasks that depend on it can be skipped, so the build completes ahead of time.
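
Dynamic dependencies are easiest to see when a task receives a fetch callback and the build system records what it asks for. The sketch below is illustrative only. To keep the result reproducible it replaces the random condition with a hypothetical control cell C1, and run_and_track is an assumed helper name.

```python
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], int]


def run_and_track(task: Callable[[Fetch], int], cells: Dict[str, int]) -> Tuple[int, List[str]]:
    """Run a task and return its value plus the cells it actually read."""
    used: List[str] = []

    def fetch(name: str) -> int:
        used.append(name)
        return cells[name]

    return task(fetch), used


# B2 = IF(C1 = 1, A1, A2): which of A1/A2 is a dependency depends on C1's value.
def b2(fetch: Fetch) -> int:
    return fetch("A1") if fetch("C1") == 1 else fetch("A2")


cells = {"A1": 10, "A2": 20, "C1": 1}
value1, deps1 = run_and_track(b2, cells)   # 10, ['C1', 'A1']
cells["C1"] = 0
value2, deps2 = run_and_track(b2, cells)   # 20, ['C1', 'A2']
```

The dependency set of B2 differs between the two runs, which is why it cannot be declared ahead of time the way a makefile declares its prerequisites.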

Make

make = topological modTimeRebuilder

Make uses a makefile to describe tasks. The dependencies among these tasks are declared explicitly, so they are static, and circular dependencies are not supported. Therefore, Make uses a topological scheduler to execute tasks in topological order.

The build information (Info) of Make is the file system itself, specifically its file modification times. Make decides whether a task needs to be re-executed by comparing them: if a target file is older than any of the files it depends on, the task needs to be re-executed. In effect, Make treats the file modification time as a dirty bit, so modTimeRebuilder is a kind of dirty bit rebuilder.

Of course, the file modification time is not always reliable. For example, some programs update a file's modification time without changing its content, which leads to unnecessary re-execution of tasks.

Make achieves Minimality through modTimeRebuilder by skipping tasks that do not need to be executed. However, modTimeRebuilder also prevents Early cutoff: when a task is re-executed and writes its output file, the file's modification time changes even if its content does not, so the tasks that depend on it are still re-executed. This also shows that a build without Early cutoff does not execute the fewest possible tasks, which is why Minimality is often relative.
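
The missing Early cutoff can be reproduced with a toy model. The sketch below is not how Make is implemented: it replaces the real file system with an in-memory dictionary and a logical clock, and MakeLike, the rules, and the file names are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[List[str], Callable[..., str]]   # (dependencies, recipe)


class MakeLike:
    def __init__(self, rules: Dict[str, Rule], sources: Dict[str, str]):
        self.rules = rules
        self.files = dict(sources)                   # path -> content
        self.mtime = {path: 0 for path in sources}   # path -> logical timestamp
        self.clock = 0
        self.executed: List[str] = []

    def write(self, path: str, content: str) -> None:
        self.clock += 1
        self.files[path] = content
        self.mtime[path] = self.clock                # every write bumps the timestamp

    def build(self, order: List[str]) -> List[str]:
        self.executed = []
        for target in order:                         # targets in topological order
            deps, recipe = self.rules[target]
            stale = target not in self.files or any(
                self.mtime[d] > self.mtime[target] for d in deps)
            if stale:
                self.write(target, recipe(*(self.files[d] for d in deps)))
                self.executed.append(target)
        return self.executed


def strip_comments(src: str) -> str:
    return "\n".join(l for l in src.splitlines() if not l.startswith("//"))


rules: Dict[str, Rule] = {
    "main.o": (["main.c"], strip_comments),
    "app": (["main.o"], lambda obj: "linked:" + obj),
}
make = MakeLike(rules, {"main.c": "int x;"})
first = list(make.build(["main.o", "app"]))    # ['main.o', 'app']
second = list(make.build(["main.o", "app"]))   # []
make.write("main.c", "// note\nint x;")        # comment-only edit
third = list(make.build(["main.o", "app"]))    # ['main.o', 'app']
```

After the comment-only edit, main.o is rewritten with exactly the same content, but its timestamp is newer than app, so app is rebuilt anyway.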

Excel

excel = restarting dirtyBitRebuilder

Excel describes tasks through formulas in cells. Some formulas have static dependency relationships, while others are dynamic. Therefore, it uses a restarting scheduler to execute tasks. It is worth noting that Excel records the final execution order for reference in the next build to reduce the overhead of restarting.
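
The idea of restarting and then remembering the final order can be sketched as follows. This is a simplified model rather than Excel's actual algorithm: it ignores dirty bits, assumes there are no circular references (a cycle would loop forever), and the names NeedsCell and restarting_build are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], int]


class NeedsCell(Exception):
    """Raised when a formula reads a formula cell that is not up to date yet."""
    def __init__(self, cell: str):
        super().__init__(cell)
        self.cell = cell


def restarting_build(chain: List[str],
                     formulas: Dict[str, Callable[[Fetch], int]],
                     values: Dict[str, int]) -> Tuple[List[str], int]:
    """Evaluate cells in chain order; return the order that worked and the restart count."""
    done = set()
    new_chain: List[str] = []
    restarts = 0
    queue = list(chain)

    def fetch(name: str) -> int:
        if name in formulas and name not in done:
            raise NeedsCell(name)          # dependency not ready: abort this task
        return values[name]

    while queue:
        cell = queue.pop(0)
        try:
            values[cell] = formulas[cell](fetch)
        except NeedsCell as need:
            restarts += 1
            queue.remove(need.cell)        # run the missing dependency first,
            queue[:0] = [need.cell, cell]  # then restart the aborted cell
            continue
        done.add(cell)
        new_chain.append(cell)
    return new_chain, restarts


formulas = {
    "B1": lambda fetch: fetch("A1") + 1,
    "B2": lambda fetch: fetch("B1") * 2,
    "B3": lambda fetch: fetch("B2") + fetch("B1"),
}
values = {"A1": 1}
chain1, restarts1 = restarting_build(["B3", "B2", "B1"], formulas, values)
chain2, restarts2 = restarting_build(chain1, formulas, values)
```

The first build starts from a bad order and restarts twice. The second build reuses the recorded order and restarts zero times.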

Excel uses a dirty bit rebuilder. Cells modified by users are marked as dirty, and tasks that depend on these cells are re-executed. For formulas with dynamic dependencies, Excel marks them as dirty in every build so that they are always recomputed, sacrificing some performance to guarantee correctness.

Excel achieves Minimality for static dependencies but does not achieve Minimality for dynamic dependencies.

Bazel

bazel = restarting ctRebuilder

Bazel also uses a restarting scheduler to execute tasks, and it has an optimization mechanism to avoid the overhead of restarting.

Bazel uses ctRebuilder to support cloud caching and remote task execution.

Shake

shake = suspending vtRebuilder

Shake uses vtRebuilder. While tasks are being executed, it tracks their dependencies and records them. The next time a task is about to be executed, if its dependencies haven't changed, the execution is skipped.

Moreover, if a task is skipped, or is re-executed but produces the same output, the tasks that depend on it see unchanged dependencies and don't need to be executed either. This is how Shake achieves both Minimality and Early cutoff.

Since Shake tracks dependencies when tasks are being executed and doesn't need to define them statically in advance, it also supports Dynamic dependencies.

Cloud Shake

cloudShake = suspending ctRebuilder

Cloud Shake supports cloud caching on the basis of Shake. The difference lies in that the Rebuilder is changed from vtRebuilder to ctRebuilder.
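
The step from verifying traces to constructive traces can be sketched by storing the output next to the trace. In this illustrative model a plain dictionary stands in for the cloud cache, and the class name, the task, and the file names are assumptions, not Cloud Shake's API.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], str]
Trace = List[Tuple[str, str]]                       # [(dependency, hash of its value)]
RemoteCache = Dict[str, List[Tuple[Trace, str]]]    # key -> [(trace, stored output)]


def digest(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()


class ConstructiveTraceBuild:
    def __init__(self, tasks: Dict[str, Callable[[Fetch], str]],
                 inputs: Dict[str, str], remote: RemoteCache):
        self.tasks = tasks
        self.values = dict(inputs)
        self.remote = remote                # shared between machines
        self.executed: List[str] = []

    def build(self, key: str) -> str:
        task = self.tasks.get(key)
        if task is None:
            return self.values[key]
        for trace, value in self.remote.get(key, []):
            if all(digest(self.build(d)) == h for d, h in trace):
                self.values[key] = value    # matching trace: take the stored output
                return value
        seen: Trace = []

        def fetch(dep: str) -> str:
            value = self.build(dep)
            seen.append((dep, digest(value)))
            return value

        value = task(fetch)
        self.values[key] = value
        self.remote.setdefault(key, []).append((seen, value))  # publish trace and output
        self.executed.append(key)
        return value


tasks = {"bundle": lambda fetch: fetch("a.js") + ";" + fetch("b.js")}
remote: RemoteCache = {}
machine_a = ConstructiveTraceBuild(tasks, {"a.js": "x", "b.js": "y"}, remote)
machine_b = ConstructiveTraceBuild(tasks, {"a.js": "x", "b.js": "y"}, remote)
machine_a.build("bundle")
out_b = machine_b.build("bundle")
```

Machine B has never run the bundle task, yet it obtains the output because the trace published by machine A matches its own inputs.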

Buck2

buck2 = suspending ctRebuilder

One of the core developers of Buck2 is the author of Shake and also one of the authors of the paper "Build Systems à la Carte: Theory and Practice".

Buck2 is similar to Cloud Shake. It supports dynamic dependencies and achieves Minimality and Early cutoff. It also supports cloud caching and natively supports remote task execution.

Buck2 has also implemented its own incremental computation engine: DICE.

Bundlers

A bundler can actually be understood as a build system plus a part of the task descriptor. In fact, the build system doesn't care about what specific tasks do. What specific tasks do is provided by users through task description files, and the build system only takes care of executing tasks. Early task runners like Gulp and Grunt were actually closer to build systems. Developers used these task runners to manually arrange the processing logic of files and took the task runners as build systems. Similarly, Turborepo doesn't care about the task logic but only executes tasks, and it also claims to be a build system.

The bundler itself describes a part of the task logic, such as how to build modules, how to split chunks, and how to perform optimizations, etc. Then the remaining parts are provided by user configurations and plugins, and they are combined to form a complete task descriptor.

There are also some differences between the tasks of the bundler and the build system:

Firstly, the dependencies of bundler tasks are highly dynamic (dynamic dependencies). The task logic itself is dynamic. For example, the code generation of a module may depend on the generation results of other modules, and the optimization of a module may depend on the optimization results of other modules. Moreover, user configurations and plugins will also affect the task logic. However, early build systems didn't support dynamic dependencies well and were basically based on static dependencies, like Make. It was not until later build systems that relatively good support was available, such as Shake, Buck2, etc.
Secondly, bundlers have to handle cycles. Modules often import each other circularly, which leads to circular dependencies among tasks, and these cycles must be handled explicitly. Most build systems don't support cycles, although a few do.

In addition, if we take the Build defined in the build system as the standard, the Build of the bundler is actually divided into two types:

The Build that doesn't interrupt the Compiler, that is, the rebuild under the watch mode.
The Build that interrupts the Compiler, that is, building again after the previous build is completed.

These two types of Build also result in two different kinds of Info, namely memory cache and persistent cache. These two kinds of Info can not only be used separately but also be mixed and used according to specific scenarios.

Webpack/Parcel/Rollup/esbuild

passBasedBundler = foreach ctRebuilder

In traditional pass-based bundlers, both the execution order (Scheduler) of tasks and whether to execute them (Rebuilder) are different in each pass. Each pass uses the task execution order and the execution strategy suitable for this stage according to the task logic of this stage. For example, in webpack:

The module graph and the chunk graph are not acyclic graphs, so the topological scheduler is not applicable in most stages.
SideEffectsFlagPlugin: When optimizing the incoming connections of a module, the incoming connections of its parent module must already have been optimized to achieve the best result, so this stage behaves like a suspending scheduler. Because it only updates the connections between modules and has little computational overhead, it has no logic for skipping tasks and corresponds to the "always true" rebuilder.
FlagProvidedExportsPlugin: Since a re-export affects the exported content of a module, a module that contains a re-export and the module it re-exports from are recorded as a dependency relationship. When the exported content of the re-exported module changes, the exported content of the module containing the re-export is recalculated, and this repeats until no module's exported content changes any more. This behaves like a restarting scheduler. Since calculating the exported content involves a certain amount of computation, a cache is introduced to skip some tasks, which corresponds to the vtRebuilder.
The task logic in most other stages doesn't care about the order of tasks, such as module build, module codegen, etc., and persistent caching is supported. So most other passes use the combination of "foreach" scheduler + ctRebuilder.
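
The "recalculate until nothing changes" behaviour described for FlagProvidedExportsPlugin is a fixpoint iteration. The sketch below is a deliberately simplified model and not webpack's code: it only handles `export * from` style re-exports, ignores naming conflicts and default exports, and every name in it is made up.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def provided_exports(own: Dict[str, Set[str]],
                     reexports: Dict[str, List[str]]) -> Tuple[Dict[str, Set[str]], int]:
    """own: module -> names it declares; reexports: module -> modules it re-exports from."""
    dependents: Dict[str, Set[str]] = {m: set() for m in own}
    for module, sources in reexports.items():
        for source in sources:
            dependents[source].add(module)     # module must be redone if source changes

    provided: Dict[str, Set[str]] = {m: set() for m in own}
    queue = deque(own)
    runs = 0
    while queue:
        module = queue.popleft()
        runs += 1
        new = set(own[module])
        for source in reexports.get(module, []):
            new |= provided[source]
        if new != provided[module]:            # exports changed: notify dependents only
            provided[module] = new
            queue.extend(sorted(dependents[module]))
    return provided, runs


# Three modules that re-export from each other in a cycle.
own = {"a": {"x"}, "b": {"y"}, "c": {"z"}}
reexports = {"a": ["b"], "b": ["c"], "c": ["a"]}
provided, runs = provided_exports(own, reexports)
```

Each export set can only grow and is bounded by the set of all names, so the loop terminates even though the module graph contains a cycle.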

In pass-based bundlers, the cache realizes Minimality for the bundler. However, since the tasks among different passes are unaware of each other, the tasks between passes cannot achieve Early cutoff, resulting in excessive tasks that still need cache verification. This is often the reason why pass-based bundlers are slow: the failure to achieve Early cutoff leads to a lack of Minimality.

Turbopack

turbopack = suspending ctRebuilder

Unlike traditional pass-based bundlers, Turbopack doesn't emphasize individual compilation stages (passes) from start to finish. Instead, it is closer to a query-based design: it defines tasks and obtains task results through queries. This is especially visible in a development (Dev) environment, for example, when compiling a web page with index.js as the entry point.

[Figure: the logic of Turbopack]

[Figure: the logic of the traditional pass-based bundler]

Compared to pass-based bundlers, Turbopack will only focus on the part of tasks that need to be executed to obtain the query results, and other irrelevant tasks will not be executed. Especially in the Dev environment, there will not be a complete ModuleGraph and ChunkGraph. In the Production environment, some methods will still be used to aggregate into a complete graph to perform global optimizations on the complete ModuleGraph / ChunkGraph.

The underlying incremental computation engine of Turbopack, namely turbo tasks, is the build system that drives Turbopack. Concepts of the build system such as task, scheduler, and rebuilder are all implemented in turbo tasks. The upper layer of Turbopack is equivalent to describing the specific tasks of the bundler on the basis of turbo tasks. From this perspective, the incremental computation engine itself is actually a kind of build system. Buck2 is structured in a similar way on top of its incremental computation engine, DICE: DICE covers the core functions of the build system, and Buck2 executes the tasks described by users as DICE tasks.

Turbopack is uniformly based on turbo tasks as a whole and uses the combination of suspending and ctRebuilder to achieve overall Minimality and Early cutoff.

Vite

vite = suspending vtRebuilder

Although Vite itself doesn't perform bundling, Vite will still continuously execute tasks during development, which conforms to the definition of a build system. Vite doesn't package multiple modules but compiles individual modules instead. So the task logic of Vite is actually quite simple, that is, to compile modules. Vite compiles a module only when the browser makes a request for it. A request will be initiated only when the browser doesn't hit the cache. The order of the requests is the order of module imports, which is also determined by the browser. So it can be seen that Vite utilizes the browser's ESM module system as part of its own build system, belonging to the combination of suspending and vtRebuilder.

Utilizing the browser's ESM module system will make its own implementation much simpler, but the browser's ESM module system itself isn't implemented with the goal of being a build system. Compared to a real build system, it will bring many limitations, such as:

The number of concurrent requests is limited by the browser ➡️ Besides the dependency relationships of tasks and machine resources, the number of concurrent tasks is additionally limited by the browser.
The browser cache can't be shared ➡️ Build information or task cache can't be shared, and because the cache lives in the browser, vtRebuilder can't be replaced with ctRebuilder.

Rspack

incrementalRspack = foreach dirtyBitAndCtRebuilder

Rspack is also a pass-based bundler. However, in order to improve the performance of Hot Module Replacement (HMR) from O(project) to O(change), Rspack has introduced affected-based incremental builds. Briefly speaking, each stage collects the changes it produces, and subsequent stages use the collected changes to calculate the tasks that may be affected, so that only these affected tasks are re-executed, reducing the number of task executions.

From the perspective of the build system, affected-based incremental introduces a new Rebuilder on top of the original build system of the pass-based bundler. The collected changes make tasks in different stages aware of each other, so that Early cutoff can be applied to tasks in subsequent stages. With Early cutoff added, Rspack gets closer to Minimality. This approach is closer to self-adjusting computation:

The fundamental idea is to track the control and data dependencies in a computation in such a way that changes to data can be propagated through the computation by identifying the affected pieces that depend on the changes and re-doing the affected pieces. —— Self-Adjusting Computation

In other words, the affected inputs are found from the changes and treated as dirty inputs, and the corresponding tasks are re-executed. This implementation is less intelligent than a full incremental computation engine, but it is a relatively simple and effective way.
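
A rough sketch of the affected-based idea is shown below, using two passes. It is a toy model, not Rspack's implementation: the module graph, the parse and codegen functions, and the choice to mark all transitive importers as affected are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

imports = {"index": ["a", "b"], "a": ["util"], "b": [], "util": []}
importers: Dict[str, Set[str]] = {m: set() for m in imports}
for module, deps in imports.items():
    for dep in deps:
        importers[dep].add(module)

sources = {"index": "I", "a": "A", "b": "B", "util": "U"}
parsed: Dict[str, str] = {}
generated: Dict[str, str] = {}


def parse(m: str) -> str:                       # pass 1: drop trailing comments
    return sources[m].split("//")[0].strip()


def codegen(m: str) -> str:                     # pass 2: uses info from direct imports
    return parsed[m] + "(" + ",".join(parsed[d] for d in imports[m]) + ")"


def affected(changed: Iterable[str]) -> Set[str]:
    """Changed modules plus everything that transitively imports them (conservative)."""
    result, stack = set(changed), list(changed)
    while stack:
        for parent in importers[stack.pop()]:
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def run_pass(todo: Iterable[str], compute: Callable[[str], str],
             outputs: Dict[str, str], log: List[str]) -> Set[str]:
    """Re-run the given tasks and return those whose output really changed."""
    changed = set()
    for m in sorted(todo):
        log.append(m)
        new = compute(m)
        if outputs.get(m) != new:
            outputs[m] = new
            changed.add(m)
    return changed


def rebuild(edited: Iterable[str]) -> Tuple[List[str], List[str]]:
    parse_log: List[str] = []
    codegen_log: List[str] = []
    changed = run_pass(edited, parse, parsed, parse_log)
    run_pass(affected(changed), codegen, generated, codegen_log)
    return parse_log, codegen_log


full = rebuild(imports)                          # first build: everything runs
sources["util"] = "U // comment"
comment_only = rebuild({"util"})                 # (['util'], [])
sources["util"] = "U2"
real_change = rebuild({"util"})                  # (['util'], ['a', 'index', 'util'])
```

The comment-only edit stops after the first pass because the collected change set is empty, and the real edit re-runs code generation only for util and its importers while b is left alone.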

Summary

Many bundlers have claimed to be the next-generation bundlers. However, from the perspective of the task execution of the underlying build systems, most of them are basically no different and lack many excellent features that have existed in build systems for a long time. Many of these excellent features can be incorporated into bundlers:

Minimality: It has a great impact on the performance of rebuilds.
Early cutoff: It affects Minimality. Bundlers that implement Early cutoff tend to get closer to Minimality than those that don't.
Parallelism: After clarifying the dependency relationships among tasks, tasks can be made concurrent as much as possible. Suspending often uses the runtime of async/await to start multiple workers for concurrency.
Remote Cache: Cloud caching. Furthermore, when the initial inputs are consistent, only the corresponding final output products are fetched for users to use. Only when users rebuild will the caches of various stages be fetched.
Remote Execution: Remote task execution (distributed). Remote Cache is equivalent to storing the inputs/outputs of tasks. Once the inputs/outputs of tasks can be cloud-cached, the tasks themselves can also be executed remotely, with more machines providing more concurrency/CPU resources.
...

hai-x · 2025-01-10T09:12:49Z

Learned a lot from the article. Meanwhile, I think FlagProvidedExportsPlugin in webpack is more like an optimized restarting scheduler. A restarting scheduler picks tasks arbitrarily, and its number of aborted tasks is nondeterministic. But FlagProvidedExportsPlugin is deterministic, since the connections between modules are deterministic and a module only notifies its dependencies to run their tasks again.
