-
notably, you don't link to this so nobody else can cross check this :D the results seem about what i'd expect overall.
yea. if you benched something single-threaded (i.e. a process that never uses the pthread apis) all of these results would be a lot closer and malloc-ng is not even slow anymore. the reason it is so awful is that it's written with the most (performance-)pessimistic multithreaded hardening in mind (only active after a thread actually gets spawned), where any use of the allocator goes through a lock/unlock (iirc as a 'global', it's been a while). ..so yeah.
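a cheap outside view of that lock traffic, for what it's worth; a sketch, where the binary name is a placeholder:

```sh
# only contended lock acquisitions reach the kernel as futex syscalls,
# so a big futex count in a multithreaded run hints at allocator lock
# contention (strace slows the run a lot; use it for counting, not timing)
strace -f -c -e trace=futex ./mytool 2>&1 | tail
```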
this is something that would be interesting to look at a flamegraph of; on arch the libc should have frame pointers already, then just build whatever you preload with them too (-fno-omit-frame-pointer), set RUSTFLAGS=-Cforce-frame-pointers=yes, and you should be able to perf record --call-graph fp and get something usable. for musl, i'm not sure how you'd ideally bench that; the scudo here is not the same one as what is in the […]; you can see our configuration of scudo in main/musl/files/wrappers.cpp
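spelled out, that workflow might look like this (the binary and preload path are placeholders):

```sh
# build the rust tool with frame pointers forced on
RUSTFLAGS="-Cforce-frame-pointers=yes" cargo build --release

# build whatever you preload with -fno-omit-frame-pointer in its
# CFLAGS/CXXFLAGS, then record with frame-pointer unwinding
perf record --call-graph fp -- \
    env LD_PRELOAD=/usr/lib/libmimalloc.so ./target/release/mytool
perf report
```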
allocator wise, yea. especially for multithreaded stuff now, onto the slightly more interesting part..
yes, it's a hardened allocator that is technically meant to just be 'fast enough', as in not pathologically slow. so, unlike malloc-ng, it ends up being actually usable in MT processes. for a great time, you should measure how long it takes to e.g. thinlto link firefox with malloc-ng vs scudo :D what that means in practice is that it's not slow enough that things are noticeably 'way worse' than anywhere else. of course, in the end it's not really that fast.

the last time i remember having a discussion around this, i think the overall consensus was that the 'secure/hardened' settings of the other allocators were not really very 'hardened', i.e. they are more just a few mitigations thrown on top and overmarketed, rather than a secure-allocator-first design like scudo is. so it's not surprising that they're actually fast, since they're meant to be.

something you might notice from wrappers.cpp is that musl can't use an allocator that relies on tls, i.e. thread_local and friends. snmalloc at least requires this afaict, as do a lot of allocators. it's possible to use preloaded stuff (of course; like any library), but not to replace the allocator inside the libc itself with code that has a tls dependency.

and something that you actually forgot to measure: the actual (heap) memory use of all these allocators in comparison to each other. that's also an important data point next to plain 'wall time to do thing', and kinda has to be there as a relative point, since we're talking from the perspective of a generally-distro-usable allocator for everything at once, and not just something to use in one application
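for that last point, a minimal sketch of grabbing peak RSS per configuration (GNU time, not the shell builtin; the library path and binary are placeholders):

```sh
# "Maximum resident set size" from GNU time is peak RSS in KiB;
# run once per allocator and compare alongside the wall-time numbers
LD_PRELOAD=/usr/lib/libsnmalloc.so \
    /usr/bin/time -v ./mytool 2>&1 | grep 'Maximum resident set size'
```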
-
you could have just asked, i've already done far more extensive allocator benchmarking before and showing a bunch of numbers with your random application isn't all that interesting
-
@q66
Yeah. The problem is, it's non-sharable (requiring a subscription) data. I may try to re(pro|)duce it later with similar(ish) code and generated data.
May take a look later.
Not an area I'm knowledgeable in.
I'm guessing that's not enough/not relevant!
what do you mean... memory is there to be used... joking. With 32GiB RAM + zram, one tends to forget that's a concern. Operating from memory here, but I think tcmalloc in particular doesn't have an overhead.
-
scudo reaches ~90+% of the performance of the "fast" allocators on average over a span of benchmarks (e.g. mimalloc-bench etc), and it can be configured in a multitude of ways which all affect performance in one way or another. i'm not really sure what you want to discuss here; this has all been thoroughly investigated, and none of the other allocators are really relevant for our use anyway
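e.g. upstream standalone scudo reads its tunables from the SCUDO_OPTIONS environment variable; a sketch (these knobs exist upstream, the values are purely illustrative, and whether a given wrapper build keeps that parsing is another matter):

```sh
# colon-separated key=value pairs; values here are not recommendations
SCUDO_OPTIONS="release_to_os_interval_ms=2000:delete_size_mismatch=true" ./mytool
```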
-
Hopefully this makes it cross-checkable:
-
Added memory usage numbers for completeness:
-
this thread is obsolete now that dea4c74 has landed, so closing
-
Hi @MoSal, we recently discovered a significant performance degradation in the AppImage runtime; you can read the details and some benchmarks here: AppImage/type2-runtime#116. Do you think building the runtime on chimera with mimalloc would make it as fast as when glibc is used? Or shouldn't I bother lol?
-
Hello.
Reading Chimera's About page got me intrigued with the mention of using scudo as a supposedly performant allocator.
Coupling musl with a performant allocator system-wide is definitely an idea worth exploring, and it is indeed underexplored. I'm actually surprised there aren't more examples of this, or maybe there are and I just don't know about them.
Having not heard of scudo before, I decided to put it to the test.
The test case involves a CLI tool of mine written in Rust. The task here involves decompressing (lz4_flex) and deserializing (speedy) six files (total compressed size ~100MiB) (big allocations), then string formatting a lot of records (small allocations), before printing formatted text to stdout (not relevant here). This is all done using multiple threads of course.
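For the `#[global_allocator]` variants appearing in the results below, the allocator is compiled into the crate rather than preloaded. A minimal sketch of that wiring (a hypothetical crate layout; crate and feature names as published on crates.io):

```sh
# add the allocator crate, then route Rust's global allocator through it;
# the `secure` feature corresponds to the mimalloc-secure entries below
cargo add mimalloc                # or: cargo add mimalloc --features secure
cat >> src/main.rs <<'EOF'

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
EOF
```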
Anyway, here are the benchmarking results (using `hyperfine`). Ordered by `min`, not `mean`; this doesn't make a big difference, except for `glibc`'s allocator. Fastest first:

1. `glibc+snmalloc` (LD_PRELOAD)
2. `glibc+mimalloc` (LD_PRELOAD)
3. `glibc+tcmalloc` (LD_PRELOAD)
4. `chimera+tcmalloc` (LD_PRELOAD) (built with `libexecinfo` + `LDFLAGS='-Wl,--export-dynamic' ./configure --enable-frame-pointers --enable-libunwind`)
5. `chimera+snmalloc` (`#[global_allocator]` / no memcpy guards)
6. `chimera+snmalloc` (LD_PRELOAD)
7. `chimera+mimalloc` (`#[global_allocator]`)
8. `alpine+snmalloc-checks-memcpy-only` (LD_PRELOAD)
9. `glibc+snmalloc-checks` (LD_PRELOAD)
10. `chimera+snmalloc-checks` (LD_PRELOAD)
11. `glibc+jemalloc` (LD_PRELOAD)
12. `alpine+snmalloc-checks` (LD_PRELOAD)
13. `chimera+snmalloc-checks` (`#[global_allocator]`)
14. `chimera+jemalloc` (`#[global_allocator]`)
15. `chimera+jemalloc` (LD_PRELOAD)
16. `chimera+mimalloc` (LD_PRELOAD)
17. `alpine+snmalloc` (LD_PRELOAD)
18. `alpine+mimalloc` (LD_PRELOAD, mimalloc2-insecure package)
19. `musl+mimalloc` (`#[global_allocator]` / rustc target / bundled musl static v1.2.3)
20. `alpine+tcmalloc` (LD_PRELOAD)
21. `glibc+mimalloc-secure` (LD_PRELOAD)
22. `glibc` (native)
23. `chimera+mimalloc-secure` (`#[global_allocator]`)
24. `chimera+mimalloc-secure` (LD_PRELOAD)
25. `alpine+mimalloc-secure` (LD_PRELOAD, mimalloc2 package)
26. `musl+mimalloc-secure` (`#[global_allocator]` / rustc target / bundled musl static v1.2.3)
27. `glibc+scudo` (LD_PRELOAD)
28. `chimera` (default)
29. `alpine+scudo`
30. `alpine` (native)
31. `rustc-musl` (bundled static musl v1.2.3)

(`glibc` is Arch Linux. Chimera and Alpine were running in containers.)
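For the LD_PRELOAD variants, each measurement was a `hyperfine` run along these lines (library path, tool name, and arguments are placeholders, since the real invocation uses the non-sharable data):

```sh
# compare the platform allocator against a preloaded replacement;
# hyperfine runs each command repeatedly and reports mean/min/max
hyperfine --warmup 2 \
    './mytool <args>' \
    'LD_PRELOAD=/usr/lib/libmimalloc.so ./mytool <args>'
```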
## Conclusions

- `musl`'s allocator is a special kind of horrible, performance-wise.
- `scudo` is not a fast allocator. Even in their secure/hardened settings, other allocators can be orders of magnitude faster.
- `musl` with an actually fast allocator can be competitive with `glibc`, performance-wise.
- `musl`+`scudo` seems to cause extra performance degradation compared to `glibc`+`scudo`.

## Random Notes

- `checks-memcpy-only` was added for this reason.
- `glibc` and `jemalloc` behaved almost identically in a previous test of mine. While the different use-case may explain the difference here, my theory is that `glibc`'s v2.38 allocator-changes fiasco and/or the fixes that followed may have hampered performance a bit.
- `snmalloc` was also slightly, but measurably, faster than `mimalloc` in that other test. But `mimalloc` was still a strong second. I didn't test the secure/hardened variants then.