Evaluate using Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) #953

zamazan4ik · 2024-02-25T18:17:53Z

Hi!

Recently I measured the effect of Profile-Guided Optimization (PGO) on multiple projects. The results are available here. According to those tests, PGO improves performance in many cases, including for compilers and interpreters. Given that, I think optimizing the Candy tooling with PGO is worth trying.

I have already run some benchmarks on the Candy compiler/VM and want to share my results.

Test environment

Fedora 39
Linux kernel 6.7.4
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.78-nightly
Candy version: the latest commit on the main branch at the time of writing, 7f99bf4e4202967bf341286e1b2ed71e4f81afc1

Benchmark

For benchmarking, I use these benchmarks. For PGO, I use the cargo-pgo tool. The Release results come from the cargo bench command. The PGO training phase is done with cargo pgo bench, and the PGO optimization phase with cargo pgo optimize bench.
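The steps above can be sketched as a small shell script. This is a minimal sketch, not the exact script behind the numbers in this post: the `run` helper, the `pgo_bench_workflow` function, and the `DRY_RUN` switch are illustrative names, and the two setup commands assume a rustup-managed toolchain. By default it only prints the commands; set `DRY_RUN=0` to execute them from the directory that contains the benchmarks.

```shell
#!/usr/bin/env sh
# Sketch of the cargo-pgo benchmark workflow (illustrative, not the author's script).
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

pgo_bench_workflow() {
    # One-time setup: cargo-pgo relies on llvm-profdata from llvm-tools-preview.
    run rustup component add llvm-tools-preview || return 1
    run cargo install cargo-pgo || return 1
    # Baseline: benchmark the regular Release build.
    run cargo bench || return 1
    # Training phase: build instrumented benchmarks and run them to collect profiles.
    run cargo pgo bench || return 1
    # Optimization phase: rebuild with the collected profiles and benchmark again.
    run cargo pgo optimize bench || return 1
}

pgo_bench_workflow
```

Keeping the baseline `cargo bench` in the same script makes it easier to compare the Release and PGO-optimized numbers from one machine and one session.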

Results

I got the following results:

Release: https://gist.github.com/zamazan4ik/4640431a5b8d709423d550beade3daf4
PGO optimized compared to Release: https://gist.github.com/zamazan4ik/9b2a2c9a171f81ed0bd402ef40c575cb
(just for reference) PGO instrumentation compared to Release: https://gist.github.com/zamazan4ik/932e3dd15e70c69a700e365e84fda05d

According to the results, PGO measurably improves Candy VM performance in many cases.

Further steps

I can suggest the following action points:

Run more PGO benchmarks on other Candy tooling. If they show improvements, add a note to the documentation about the performance gains available from building Candy tools with PGO.
Provide an easier way (e.g. a build option) to build Candy with PGO. This would help end users and maintainers, since they could optimize the Candy tools for their own workloads.
Optimize the pre-built Candy binaries with PGO.

Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.
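For reference, a BOLT run with cargo-pgo could look roughly like the sketch below. The target triple, the `candy` binary name, and the `TRAINING_ARGS` variable are assumptions for illustration (a real run needs a representative workload), and `llvm-bolt` and `merge-fdata` must be available on the PATH. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env sh
# Sketch of a BOLT workflow via cargo-pgo (illustrative; names below are assumptions).
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
TARGET_TRIPLE="x86_64-unknown-linux-gnu"   # assumption: 64-bit Linux host
BINARY="candy"                             # assumption: name of the built binary
TRAINING_ARGS="${TRAINING_ARGS:-}"         # hypothetical: arguments for a representative workload

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

bolt_workflow() {
    # Build a BOLT-instrumented binary.
    run cargo pgo bolt build || return 1
    # Run the instrumented binary on a training workload to gather BOLT profiles.
    # TRAINING_ARGS is intentionally unquoted so it splits into separate arguments.
    run "target/$TARGET_TRIPLE/release/${BINARY}-bolt-instrumented" $TRAINING_ARGS || return 1
    # Produce the BOLT-optimized binary from the gathered profiles.
    run cargo pgo bolt optimize || return 1
}

bolt_workflow
```

cargo-pgo also documents a `--with-pgo` flag on both `bolt` subcommands for applying BOLT on top of an existing PGO profile, which matches the "PGO first, BOLT second" order recommended here.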

Here are some examples of how PGO is integrated into other projects:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag

I would be happy to answer your questions about PGO and PLO.

JonasWanke · 2024-02-25T19:31:14Z

Maintainer

Thanks a lot for the suggestion and all the information! I had heard about PGO before, but didn't know the results would be this impressive (in the results you linked, Fibonacci runs about 30 % faster on the VM). I will give your article a read soon.

So far, AFAICT most users of Candy are working on the Candy compiler itself, so they need frequent (and therefore fast) recompilations. From my understanding, PGO would significantly increase the compilation time (since the training and optimization phases are added), so it would be more beneficial for binary releases of the Candy compiler that people only writing Candy code can use. So we should definitely use PGO in our release workflows (as soon as we have them).

Or does it also make sense to run the training phase once locally and then reuse that data for multiple compilations with PGO of the Candy compiler while working on it?

2 replies

zamazan4ik Feb 25, 2024
Author

PGO would significantly increase the compilation time (since the training and optimization phases are added)

Yes. The compilation time will at least double (instrumentation build + training run + optimization build). So enabling PGO only for the binaries delivered to users is a good idea. All projects that I know of take the same approach.
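To make the extra cost concrete, here is a rough sketch of the same phases using rustc's own PGO flags instead of cargo-pgo. The profile directory and the helper names are assumptions for illustration, the benchmarks stand in for the training scenario, and `llvm-profdata` must come from an LLVM version compatible with the one rustc uses (the `llvm-tools-preview` rustup component ships one). The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env sh
# Sketch of the PGO phases with plain rustc flags (illustrative, paths are assumptions).
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
PROFILE_DIR="/tmp/candy-pgo"                   # assumption: where raw profiles are written
MERGED_PROFILE="$PROFILE_DIR/merged.profdata"  # assumption: merged profile location

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

manual_pgo() {
    # Phases 1 and 2: full instrumented build, then the benchmarks serve as the training run.
    run env RUSTFLAGS="-Cprofile-generate=$PROFILE_DIR" cargo bench || return 1
    # Merge the raw .profraw files into a single profile that rustc can consume.
    run llvm-profdata merge -o "$MERGED_PROFILE" "$PROFILE_DIR" || return 1
    # Phase 3: second full build, this time guided by the merged profile.
    run env RUSTFLAGS="-Cprofile-use=$MERGED_PROFILE" cargo build --release || return 1
}

manual_pgo
```

There are two full builds here plus the training run, which is where the "at least doubled" estimate comes from. The merged profile is an ordinary file, so the last step can be repeated without retraining, but the profile becomes less accurate as the code drifts from what was profiled.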

Or does it also make sense to run the training phase once locally and then reuse that data for multiple compilations with the PGO of the Candy compiler while working on it?

I don't think enabling PGO during the development process makes much sense. I suggest first applying PGO to the prebuilt release binaries. If that goes well and you are interested in Candy compiler speed during development, try enabling PGO for development builds as well.

By the way, PGO works well not only with compilers. I suggest you evaluate PGO with other Candy tools as well. E.g. I have multiple benchmarks for LSP servers and code formatters, where PGO improved performance too.

JonasWanke Feb 25, 2024
Maintainer

Okay, sounds good!

We only have a single executable that contains all Candy tools, including our language server (candy lsp) and formatter (soon; I just noticed we don't expose it here yet), so this would mostly mean running all tools in the training phase.

And we could also offer PGO to Candy users when generating binaries using Inkwell (LLVM).
