-
notably, you don't link to this so nobody else can cross check this :D the results seem about what i'd expect overall.
yea. if you benched something single-threaded (i.e. a process that never uses the pthread apis) all of these results would be a lot closer and malloc-ng is not even slow anymore. the reason it is so awful is that it's written with the most (performance-)pessimistic multithreaded hardening in mind (only active after a thread actually gets spawned), where any use of the allocator goes through a lock/unlock (iirc as a 'global', it's been a while). ..so yeah.
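a cheap outside view of that lock traffic, for what it's worth; a sketch, where the binary name is a placeholder:

```sh
# only contended lock acquisitions reach the kernel as futex syscalls,
# so a big futex count in a multithreaded run hints at allocator lock
# contention (strace slows the run a lot; use it for counting, not timing)
strace -f -c -e trace=futex ./mytool 2>&1 | tail
```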
this is something that would be interesting to look at a flamegraph of; on arch the libc should have frame pointers already, then just build whatever you preload with them too (-fno-omit-frame-pointer), set RUSTFLAGS=-Cforce-frame-pointers=yes, and you should be able to perf record --call-graph fp and get something usable. for musl, i'm not sure how you'd ideally bench that; the scudo here is not the same one as what is in the […]; you can see our configuration of scudo in main/musl/files/wrappers.cpp
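spelled out, that workflow might look like this (the binary and preload path are placeholders):

```sh
# build the rust tool with frame pointers forced on
RUSTFLAGS="-Cforce-frame-pointers=yes" cargo build --release

# build whatever you preload with -fno-omit-frame-pointer in its
# CFLAGS/CXXFLAGS, then record with frame-pointer unwinding
perf record --call-graph fp -- \
    env LD_PRELOAD=/usr/lib/libmimalloc.so ./target/release/mytool
perf report
```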
allocator wise, yea. especially for multithreaded stuff now, onto the slightly more interesting part..
yes, it's a hardened allocator that is technically meant to just be 'fast enough', as in not pathologically slow. so, unlike malloc-ng, it ends up being actually usable in MT processes. for a great time, you should measure how long it takes to e.g. thinlto link firefox with malloc-ng vs scudo :D what that means in practice is that it's not slow enough that things are noticeably 'way worse' than anywhere else. of course, in the end it's not really that fast.

the last time i remember having a discussion around this, i think the overall consensus was that the 'secure/hardened' settings of the other allocators were not really very 'hardened', i.e. they are more just a few mitigations thrown on top and overmarketed, rather than a secure-allocator-first design like scudo is. so it's not surprising that they're actually fast, since they're meant to be.

something you might notice from wrappers.cpp is that musl can't use an allocator that relies on tls, i.e. thread_local and friends. snmalloc at least requires this afaict, as do a lot of allocators. it's possible to use preloaded stuff (of course; like any library), but not to replace the allocator inside the libc itself with code that has a tls dependency.

and something that you actually forgot to measure: the actual (heap) memory use of all these allocators in comparison to each other. that's also an important data point next to plain 'wall time to do thing', and kinda has to be there as a relative point, since we're talking from the perspective of a generally-distro-usable allocator for everything at once, and not just something to use in one application
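for that last point, a minimal sketch of grabbing peak RSS per configuration (GNU time, not the shell builtin; the library path and binary are placeholders):

```sh
# "Maximum resident set size" from GNU time is peak RSS in KiB;
# run once per allocator and compare alongside the wall-time numbers
LD_PRELOAD=/usr/lib/libsnmalloc.so \
    /usr/bin/time -v ./mytool 2>&1 | grep 'Maximum resident set size'
```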
-
you could have just asked, i've already done far more extensive allocator benchmarking before and showing a bunch of numbers with your random application isn't all that interesting
-
@q66
Yeah. The problem is, it's non-sharable (requiring a subscription) data. I may try to re(pro|)duce it later with similar(ish) code and generated data.
May take a look later.
Not an area I'm knowledgeable in.
I'm guessing that's not enough/not relevant!
what do you mean... memory is there to be used... joking. With 32GiB RAM + zram, one tends to forget that's a concern. Operating from memory here, but I think tcmalloc in particular doesn't have an overhead.
-
scudo reaches ~90+% of the performance of the "fast" allocators on average over a span of benchmarks (e.g. mimalloc-bench etc), and it can be configured in a multitude of ways which all affect performance in one way or another. i'm not really sure what you want to discuss here; this has all been thoroughly investigated, and none of the other allocators are really relevant for our use anyway
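e.g. upstream standalone scudo reads its tunables from the SCUDO_OPTIONS environment variable; a sketch (these knobs exist upstream, the values are purely illustrative, and whether a given wrapper build keeps that parsing is another matter):

```sh
# colon-separated key=value pairs; values here are not recommendations
SCUDO_OPTIONS="release_to_os_interval_ms=2000:delete_size_mismatch=true" ./mytool
```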
-
Hopefully this makes it cross-checkable:
-
Added memory usage numbers for completeness:
-
this thread is obsolete now that dea4c74 has landed, so closing
-
Hi @MoSal, we recently discovered a significant performance degradation in the AppImage runtime; you can read the details and some benchmarks here: AppImage/type2-runtime#116. Do you think building the runtime on chimera with mimalloc would make it as fast as when glibc is used? Or shouldn't I bother lol?
-
Hello.
Reading Chimera's About page got me intrigued with the mention of using scudo as a supposedly performant allocator.
Coupling musl with a performant allocator system-wide is definitely an idea worth exploring, and it is indeed underexplored. I'm actually surprised there aren't more examples of this, or maybe there are and I just don't know about them.
Having not heard of scudo before, I decided to put it to the test.
The test case involves a CLI tool of mine written in Rust. The task here involves decompressing (lz4_flex) and deserializing (speedy) six files (total compressed size ~100MiB) (big allocations), then string formatting a lot of records (small allocations), before printing formatted text to stdout (not relevant here). This is all done using multiple threads of course.
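For the `#[global_allocator]` variants appearing in the results below, the allocator is compiled into the crate rather than preloaded. A minimal sketch of that wiring (a hypothetical crate layout; crate and feature names as published on crates.io):

```sh
# add the allocator crate, then route Rust's global allocator through it;
# the `secure` feature corresponds to the mimalloc-secure entries below
cargo add mimalloc                # or: cargo add mimalloc --features secure
cat >> src/main.rs <<'EOF'

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
EOF
```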
Anyway, here are the benchmarking results (using `hyperfine`). Ordered by `min`, not `mean`; this doesn't make a big difference, except for `glibc`'s allocator. Fastest first:

1. `glibc+snmalloc` (LD_PRELOAD)
2. `glibc+mimalloc` (LD_PRELOAD)
3. `glibc+tcmalloc` (LD_PRELOAD)
4. `chimera+tcmalloc` (LD_PRELOAD) (built with `libexecinfo` + `LDFLAGS='-Wl,--export-dynamic' ./configure --enable-frame-pointers --enable-libunwind`)
5. `chimera+snmalloc` (`#[global_allocator]` / no memcpy guards)
6. `chimera+snmalloc` (LD_PRELOAD)
7. `chimera+mimalloc` (`#[global_allocator]`)
8. `alpine+snmalloc-checks-memcpy-only` (LD_PRELOAD)
9. `glibc+snmalloc-checks` (LD_PRELOAD)
10. `chimera+snmalloc-checks` (LD_PRELOAD)
11. `glibc+jemalloc` (LD_PRELOAD)
12. `alpine+snmalloc-checks` (LD_PRELOAD)
13. `chimera+snmalloc-checks` (`#[global_allocator]`)
14. `chimera+jemalloc` (`#[global_allocator]`)
15. `chimera+jemalloc` (LD_PRELOAD)
16. `chimera+mimalloc` (LD_PRELOAD)
17. `alpine+snmalloc` (LD_PRELOAD)
18. `alpine+mimalloc` (LD_PRELOAD, mimalloc2-insecure package)
19. `musl+mimalloc` (`#[global_allocator]` / rustc target / bundled musl static v1.2.3)
20. `alpine+tcmalloc` (LD_PRELOAD)
21. `glibc+mimalloc-secure` (LD_PRELOAD)
22. `glibc` (native)
23. `chimera+mimalloc-secure` (`#[global_allocator]`)
24. `chimera+mimalloc-secure` (LD_PRELOAD)
25. `alpine+mimalloc-secure` (LD_PRELOAD, mimalloc2 package)
26. `musl+mimalloc-secure` (`#[global_allocator]` / rustc target / bundled musl static v1.2.3)
27. `glibc+scudo` (LD_PRELOAD)
28. `chimera` (default)
29. `alpine+scudo`
30. `alpine` (native)
31. `rustc-musl` (bundled static musl v1.2.3)

(`glibc` is Arch Linux. Chimera and Alpine were running in containers.)
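For the LD_PRELOAD variants, each measurement was a `hyperfine` run along these lines (library path, tool name, and arguments are placeholders, since the real invocation uses the non-sharable data):

```sh
# compare the platform allocator against a preloaded replacement;
# hyperfine runs each command repeatedly and reports mean/min/max
hyperfine --warmup 2 \
    './mytool <args>' \
    'LD_PRELOAD=/usr/lib/libmimalloc.so ./mytool <args>'
```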
## Conclusions

- `musl`'s allocator is a special kind of horrible, performance-wise.
- `scudo` is not a fast allocator. Even in their secure/hardened settings, other allocators can be orders of magnitude faster.
- `musl` with an actually fast allocator can be competitive with `glibc`, performance-wise.
- `musl`+`scudo` seems to cause extra performance degradation compared to `glibc`+`scudo`.

## Random Notes

- `checks-memcpy-only` was added for this reason.
- `glibc` and `jemalloc` behaved almost identically in a previous test of mine. While the different use-case may explain the difference here, my theory is that `glibc`'s v2.38 allocator-changes fiasco and/or the fixes that followed may have hampered performance a bit.
- `snmalloc` was also slightly, but measurably, faster than `mimalloc` in that other test. But `mimalloc` was still a strong second. I didn't test the secure/hardened variants then.