Clean up "performance allocators" and "performance flate2" backends #7000

fasterthanlime · 2024-09-04T09:31:43Z

Summary

@charliermarsh has long suspected local builds could be made faster by disabling things like: tikv-jemalloc/mimalloc, zlibng etc.

I'm going through the cargo dep tree looking at things that can be disabled locally.

Methodology:

The production cargo feature enables all the production stuff (good allocators, fast compression libs, etc.)

I measure fresh cargo check runs, like so:

rm -rf /tmp/timings; CARGO_TARGET_DIR=/tmp/timings cargo check -F production-memory-allocator --timings

Varying the -F to enable/disable the features
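The measurement loop above could be scripted roughly as follows. This is a minimal sketch, not the tooling used for the numbers in this PR: the helper names `cold_check_command` and `measure_cold_check` are made up for illustration, and the wall-clock time it returns only approximates the "total time" that `--timings` reports.

```rust
use std::process::Command;
use std::time::{Duration, Instant};

/// Builds the `cargo check --timings` invocation for one cold-build measurement.
/// `features` is passed through `-F` as a comma-separated list.
fn cold_check_command(target_dir: &str, features: &[&str]) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.env("CARGO_TARGET_DIR", target_dir).arg("check");
    if !features.is_empty() {
        cmd.arg("-F").arg(features.join(","));
    }
    cmd.arg("--timings");
    cmd
}

/// Wipes the target directory so the build is cold, runs the check,
/// and returns the elapsed wall-clock time.
fn measure_cold_check(target_dir: &str, features: &[&str]) -> std::io::Result<Duration> {
    match std::fs::remove_dir_all(target_dir) {
        Ok(()) => {}
        // The directory does not exist before the first run; that is fine.
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    let start = Instant::now();
    let status = cold_check_command(target_dir, features).status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "cargo check failed",
        ));
    }
    Ok(start.elapsed())
}
```

Calling `measure_cold_check("/tmp/timings", &["production-memory-allocator"])` and then `measure_cold_check("/tmp/timings", &[])` from the workspace root would give one before/after pair.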

FAQ

Q: Why only measure cargo check?

A: The benefits will trickle down to other subcommands (including test/nextest etc.) — check/clippy are super common while iterating. We can do larger checks near the end.

Q: Why only check cold builds?

A: Warm builds depend a lot on which part of the code is touched; I'll optimize typical interactions later on.

fasterthanlime · 2024-09-04T09:48:35Z

Round 1: allocator

I introduced the uv-production-memory-allocator crate, which conditionally pulls in mimalloc (on Windows) and jemalloc (on other platforms, except OpenBSD). The extra crate works around limitations from cargo.

Before: 430 units, total time 26.6s

After: 425 units, total time 21s
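For readers unfamiliar with the pattern, the shim crate's job can be sketched like this. It is an illustration only, not the code in this PR: the feature name `performance-memory-allocator` is an assumption, the platform conditions follow only the description above (mimalloc on Windows, jemalloc elsewhere, nothing on OpenBSD), and the crate's Cargo.toml would have to declare `mimalloc` and `tikv-jemallocator` as optional, target-specific dependencies. `allocator_name` is a hypothetical helper added to make the selection visible.

```rust
// Assumed feature name: "performance-memory-allocator".
// With the feature off, neither static is compiled and the default
// system allocator is used.

#[cfg(all(feature = "performance-memory-allocator", target_os = "windows"))]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

#[cfg(all(
    feature = "performance-memory-allocator",
    not(target_os = "windows"),
    not(target_os = "openbsd")
))]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

/// Reports which allocator this build selected (hypothetical helper).
pub fn allocator_name() -> &'static str {
    if cfg!(not(feature = "performance-memory-allocator")) || cfg!(target_os = "openbsd") {
        "system"
    } else if cfg!(target_os = "windows") {
        "mimalloc"
    } else {
        "jemalloc"
    }
}
```

Binaries such as uv and uv-dev would then depend on the shim crate instead of each repeating the per-target dependency logic.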

fasterthanlime · 2024-09-04T09:56:35Z

Round 2: miette's `fancy-no-backtrace`

Getting rid of these:

❯ cargo tree -p backtrace-ext
backtrace-ext v0.2.1
└── backtrace v0.3.73
    ├── addr2line v0.22.0
    │   └── gimli v0.29.0
    ├── cfg-if v1.0.0
    ├── libc v0.2.158
    ├── miniz_oxide v0.7.4
    │   └── adler v1.0.2
    ├── object v0.36.4
    │   └── memchr v2.7.4
    └── rustc-demangle v0.1.24
    [build-dependencies]
    └── cc v1.1.15
        ├── jobserver v0.1.32
        │   └── libc v0.2.158
        ├── libc v0.2.158
        └── shlex v1.3.0

Which are pulled in by this:

❯ cargo tree -i backtrace-ext -e features
backtrace-ext v0.2.1
└── backtrace-ext feature "default"
    └── miette v7.2.0
        ├── miette feature "backtrace"
        │   └── miette feature "fancy"
        │       └── uv v0.4.4 (/Users/amos/bearcove/uv/crates/uv)
(cut)

Before: 425 units, total time 21s

(See round 1)

After: 415 units, total time 20.1s
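As a quick check on what the two rounds add up to, the relative reductions can be computed from the totals reported above. The sketch below uses only those three numbers; the function and constant names are illustrative.

```rust
/// Percentage by which `after` is lower than `before`.
fn reduction_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

// Cold `cargo check` totals from rounds 1 and 2, in seconds.
const BASELINE: f64 = 26.6;
const AFTER_ROUND_1: f64 = 21.0;
const AFTER_ROUND_2: f64 = 20.1;

/// Formats the per-round and combined reductions, rounded to one decimal.
fn summary() -> String {
    format!(
        "round 1: {:.1}%, round 2: {:.1}%, combined: {:.1}%",
        reduction_percent(BASELINE, AFTER_ROUND_1),
        reduction_percent(AFTER_ROUND_1, AFTER_ROUND_2),
        reduction_percent(BASELINE, AFTER_ROUND_2),
    )
}
```

`summary()` returns `round 1: 21.1%, round 2: 4.3%, combined: 24.4%`. These are single cold runs, so the figures are indicative rather than statistically robust.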

zanieb · 2024-09-05T00:31:49Z

Part of #5711

codspeed-hq · 2024-09-06T17:34:30Z

CodSpeed Performance Report

Merging #7000 will not alter performance

Comparing bearcove:fewer-deps (e89f275) with main (a541d6c)

Summary

✅ 14 untouched benchmarks

charliermarsh · 2024-09-09T14:24:17Z

crates/uv/Cargo.toml

-default = ["flate2/zlib-ng", "python", "pypi", "git"]
+default = ["python", "pypi", "git"]
+# Use better memory allocators, etc. — also turns-on self-update.
+production = ["self-update", "production-memory-allocator", "production-flate2-backend", "uv-distribution/production"]


I think we should keep self-update separate, since re-distributors will likely want to run with --features production, but won't want to enable self-update. self-update is only applicable when you install via our dedicated installers, not via brew, etc.

fasterthanlime: Good point! I just re-separated them.

zanieb: Are we going to need to include a note to redistributors in the changelog for the --features production flag? Does that suggest we should reserve this change for a breaking release?

charliermarsh · 2024-09-09T14:59:40Z

.github/workflows/build-binaries.yml

@@ -121,7 +121,7 @@ jobs:
        uses: PyO3/maturin-action@v1
        with:
          target: aarch64
-          args: --release --locked --out dist --features self-update
+          args: --release --locked --out dist --features production


I think this and the reference on line 82 also need self-update, unless I'm misreading.

fasterthanlime: Fixed! Also adjusted the comments.

charliermarsh

Makes sense to me! Probably good to get @BurntSushi eyes on it too since it will also affect profiling etc.

konstin · 2024-09-09T15:55:27Z

Nice!

BurntSushi · 2024-09-09T20:51:15Z

The decrease in build times here is considerable and a nice win. Nice work.

In terms of how this is set up (feature configuration and the extra crates), that all makes sense to me.

Charlie predicted my main concern here: the default build is "divorced" from the build we ship to users. We kind of already have this problem with our Cargo profiles: our release profile has LTO enabled, but our profiling profile does not (and also keeps debug info around). The intent is that one is supposed to use --profile profiling locally when iterating on perf improvements, because otherwise, with LTO enabled, the build times are astronomical (multiple minutes even on my beefy workstation). This introduces a potential footgun in our development workflow: we measure the performance of binaries under a different configuration than what we ship to users.

I think this PR probably exacerbates that, because building local binaries for performance improvements will now also require, I believe, --features production. And I think this will likely be critical given what is being swapped out here (since things like zlibng and jemalloc are used specifically for perf reasons).

But the build improvements here are considerable. However...

> Warm builds depend a lot on which part of the code is touched; I'll optimize typical interactions later on.

If the improvement here is ~23% on cold builds, do we have a sense of what kind of improvement we get on warm builds? My feeling is that for this class of improvement, the warm builds probably matter a lot more than cold builds. And for the dependencies removed here, I wouldn't expect them to be getting rebuilt all of the time. So I'd be curious if this change improves warm build times to the point of being worth the potential footgun here.

Separately, it's probably worth trying to find a different way of avoiding this footgun so that we can more confidently introduce a divergence between "dev builds" and "profiling builds" and "release builds."

fasterthanlime · 2024-09-09T21:15:28Z

> [...] because building local binaries for performance improvements will now also require, I believe, --features production.

I must admit I worked on this PR under the assumption that:

- Most work on uv is focused on correctness and/or adding features (as opposed to performance optimizations)
- Switching decompression backends / memory allocators is not a threat to correctness (since any deviation in behavior besides performance would be immediately spotted and reported as a bug to the respective maintainers)

Even if y'all decide you do not want to go down that road, there's something to salvage in that PR imho: the whole "shim crate to let cargo pick the dependencies/feature flags we want depending on the target platform" thing (and deduplicating the allocator setup between uv & uv-dev).

> do we have a sense of what kind of improvement we get on warm builds?

We don't! I guess let's measure that first, before deciding the fate of this PR?

My experience in speeding up builds is that cargo rebuilds a lot more than it should, a lot more often than it should. And even when it doesn't build something, just having a large dependency graph involves a lot of, well, fetching, hashing, etc. — as you're well aware, uv does that too!

I'll report back with data on incremental builds (including switching between check/clippy/build/running tests, some of which I suspect trashes the target dir and causes cargo to rebuild too much), but I am already mentally prepared to salvage this PR into a "mostly cleanups" one, as I may have underestimated just how much of astral's work focuses on performance alone.

BurntSushi · 2024-09-09T21:23:54Z

> Most work on uv is focused on correctness and/or adding features (as opposed to performance optimizations)

I think it comes in waves. I haven't done any perf work in a while since my focus has been on the functionality/correctness of the multi-platform resolver. But I have done a lot of perf work in the past and hope to do more in the future.

The cleanups/refactoring makes sense.

And looking at the impact on "warm" builds makes sense too. I know for me at least, I do a lot of work in uv-resolver. And I think the pep508_rs crate has seen a bit of activity lately because of markers. So those might be useful to test, speaking selfishly.

As per comments on astral-sh#7000, since a lot of uv work is focused on performance, it makes sense to keep those enabled by default. However, it's still nice to have everything in one place.

Instead of having conditional cargo dependencies in both uv-dev and uv, and different `--features` invocations in CI, this introduces a series of `uv-performance-*` crates that do the right thing. These are enabled by default, but can be disabled when working on correctness alone, locally.

fasterthanlime · 2024-09-23T15:12:47Z

This PR is now more about "simplifying the logic to pull in performance allocators/flate2 backends" (it still removes the 'backtrace' feature of miette), and less about removing dependencies by default.

The performance feature is now enabled by default. I suppose someone working on correctness only could use --no-default-features and get the build-time improvements I originally had in mind. But as per the comments here: no defaults are changed, the packaging folks (Linux distros etc.) don't have to be notified, and there shouldn't be any breaking change, just fewer Cargo.toml magicks duplicated across uv and uv-dev.

cc @BurntSushi for a second review

BurntSushi

LGTM!

fasterthanlime · 2024-09-25T13:00:54Z

(I’m out sick + traveling so if someone wants to take over this PR to get it rebased and merged, I would owe them a drink, at the very least!)

konstin · 2024-09-25T14:30:44Z

Rebase went through trivially (uv-publish addition), made a new PR due to permissions: #7686

fasterthanlime marked this pull request as draft September 4, 2024 09:31

fasterthanlime force-pushed the fewer-deps branch from c6436c1 to 8a18986 Compare September 6, 2024 17:16

fasterthanlime marked this pull request as ready for review September 6, 2024 17:19

fasterthanlime force-pushed the fewer-deps branch from b37f41a to a312719 Compare September 6, 2024 22:24

zanieb requested review from konstin and BurntSushi September 7, 2024 15:34

charliermarsh reviewed Sep 9, 2024

fasterthanlime force-pushed the fewer-deps branch from 3f9adf7 to da7021d Compare September 9, 2024 14:52

charliermarsh reviewed Sep 9, 2024

charliermarsh approved these changes Sep 9, 2024

konstin approved these changes Sep 9, 2024

fasterthanlime force-pushed the fewer-deps branch from f42d56b to 973ecf4 Compare September 23, 2024 14:13

fasterthanlime changed the title ~~Remove/disable deps in dev~~ Clean up "performance allocators" and "performance flate2" backends Sep 23, 2024

fasterthanlime force-pushed the fewer-deps branch from a77334b to fb3e837 Compare September 23, 2024 14:35

fasterthanlime force-pushed the fewer-deps branch from fb3e837 to e89f275 Compare September 23, 2024 14:40

BurntSushi approved these changes Sep 25, 2024

konstin mentioned this pull request Sep 25, 2024

Clean up "performance allocators" and "performance flate2" backends #7686

Merged

konstin closed this in #7686 Sep 25, 2024
