
R CRAN package binaries for Linux - 2.0 #36

Open
pat-s opened this issue Nov 1, 2024 · 12 comments

@pat-s

pat-s commented Nov 1, 2024

Dear Working Group,

For the past several months I have been building a new project that I internally call "R CRAN package binaries for Linux - 2.0" (2.0 because Posit PM was the 1.0 in my mind). With this issue I'd like to introduce it to the working group, gather feedback, and collect ideas for a sustainable future.

Background

CRAN does not provide package binaries for Linux. For many years, Posit has been doing so for several operating systems, but only for one architecture.

Given the rise of arm64 and Docker-based workflows in recent years, the need emerged to also have package binaries for arm64 and for the OS commonly used for Docker-based workloads, Alpine.

I've reached out to Posit a few times to ask whether they have plans to extend their build chain to arm64, with no success - I only got brief answers from multiple sources that there are "no plans to do this in the foreseeable future".

Concept

I've had the idea in mind for quite some time but was so far lacking the time and financial resources to get started.
Having started an Ltd. two months ago, I now have both, and my motivation was high to finally tackle this issue.

So I started to build R package binaries on Linux for the following matrix:

  • Arch: arm64 + amd64
  • OS: Ubuntu 22.04 + 24.04, RHEL 8 + 9, Alpine 3.20

This results in building CRAN twelve times over (including all historic package versions), so roughly 12 x 20k x 6 (6 being the average version count of CRAN packages) = 1.4 million binaries.
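For reference, the back-of-the-envelope arithmetic behind that figure:

```r
# build targets x packages x average versions per package
12 * 20000 * 6
#> [1] 1440000   # i.e. roughly 1.4 million binaries
```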

General Details

The binaries are already built and are distributed through a global CDN.
The CDN showed download speedups of up to 10x compared to downloading the same binary from Posit PM (from a European location).

While I initially only planned to build arm64 binaries, I realized that adding a new repo for arm64 binaries alone is cumbersome: one has to switch repos between arm64/amd64 builds and in addition needs a CRAN source repo to ensure all packages are available (as some binaries are missing due to build failures).
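To illustrate the layering problem (a minimal sketch; the binary repo URL is hypothetical):

```r
# Hypothetical arm64-only binary repo layered over CRAN sources.
# When the same package version is offered by several repos, the first
# repo listed wins; anything missing from the binary repo falls back to
# building from the CRAN source.
options(repos = c(
  arm64bin = "https://example.org/arm64/noble/latest",  # hypothetical
  CRAN     = "https://cloud.r-project.org"
))
install.packages("data.table")
```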

Binaries were built once in an initial run starting on day X. From day X+1 onwards, updated packages are processed daily. This includes:

  • Building binaries for updated packages
  • Archiving old packages
  • Removing packages removed from CRAN

This is then followed by a task that updates the PACKAGES index file, which provides the index of all available packages.
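In base R terms, this is the job that tools::write_PACKAGES() does for an on-disk repository - a minimal local sketch, not the actual S3-based pipeline described below:

```r
# Regenerate PACKAGES, PACKAGES.gz, and PACKAGES.rds for a repo directory.
# Linux binaries are distributed as source-type tarballs, hence type = "source".
tools::write_PACKAGES(
  dir     = "repo/src/contrib",
  type    = "source",
  verbose = TRUE
)
```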

Technical Details

Binaries are built on a k3s-based Kubernetes cluster with autoscaling. Recurring jobs execute the tasks mentioned above and automatically shut down the servers afterwards (this usually takes 20-50 min/day, depending on the number of packages).

The backend for all builds is a set of individual Docker images I've crafted that contain a robust compiler setup, including further components like an X server, BLAS, Chromium, and other tools needed to build many packages.

System dependencies are inferred via pak, which makes use of https://github.com/rstudio/r-system-requirements to automatically infer these from a package's DESCRIPTION file.
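A quick illustration, assuming a recent pak version that exports pkg_sysreqs():

```r
# Resolve the system requirements of a package for a given target platform.
# xml2's DESCRIPTION declares "SystemRequirements: libxml2", which pak maps
# to the distro-specific package name via the r-system-requirements rules.
pak::pkg_sysreqs("xml2", sysreqs_platform = "ubuntu-22.04")
```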

With respect to storage, I might have opted for a new approach: storing binaries in S3. While this sounds like a logical solution per se, the related tools and helpers (package tools such as cranlike and desc) currently don't work with S3. Hence I forked them and added S3 support.
For cranlike, for example, this means supporting updates of the PACKAGES file at a remote S3 location, using the ETag for the required md5sum hash value.
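A minimal sketch of that ETag idea using the paws S3 client (bucket and key are hypothetical; the ETag equals the MD5 checksum only for non-multipart uploads):

```r
# Fetch the ETag of an uploaded tarball without downloading it, and use it
# as the MD5sum value that the PACKAGES index requires.
s3   <- paws::s3()
head <- s3$head_object(
  Bucket = "cran-binaries",                 # hypothetical bucket
  Key    = "src/contrib/A3_1.0.0.tar.gz"    # hypothetical key
)
md5 <- gsub('"', "", head$ETag)  # the ETag comes wrapped in double quotes
```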

Sustainability & Quality

While I've built around 1.4 million binaries so far, including daily rebuild jobs, there is more to keeping such a project alive.

One point is of course to improve the underlying Containerfiles with respect to their compiler config, as there are still (too many) packages which fail to build due to C-code issues.
While my plan is to provide precise statistics for each package (I am storing the build metadata for every binary in a Postgres DB), Alpine is surely the most complicated target, due to the fact that it uses MUSL instead of GLIBC as the C library.

Besides the build and storage costs (which I don't want to share in this post), there's also the distribution cost. Distributing the binaries through a global CDN with local storage caches on different continents seems like a great solution to me. I don't want to store the assets in a single location/S3 bucket and then serve many requests with high latency and long overall travel times.

All of the above comes at a cost. I haven't mentioned the person hours yet, but so far I think I have invested somewhere around 300h into the project. Storage and server costs so far are between 1-2k.

My goal is not to maximize profits with this project, even though I am placing it within my recently founded company. I want to help the R community advance to the "present" of today's possibilities (with respect to architectures and asset delivery) and make use of the binaries myself. Placing it under the umbrella of my company helps me finance and justify it as a "professional" project. I am aware of the R Consortium funds, and applying for a grant is definitely in scope. However, I wanted to first share this project with the WG before proceeding with that.

Overall, I am looking for feedback and support with this project to make it sustainable, both in terms of technical and financial support.
The source code is not yet public as I still need to document it properly and "clean up" - but I am definitely planning to do so. In contrast to Posit PM, I would like to develop/maintain the project in the open and encourage everyone to contribute.


Patrick Schratz

@llrs
Collaborator

llrs commented Nov 1, 2024

I think building arm64 binaries is something the CRAN maintainers have in mind; they briefly mentioned serving arm64 binaries in previous meetings with them.
There is also interest from other Linux distributions in having Linux binaries built from CRAN for arm64 (Fedora, that I've heard of, but probably others too).

CRAN's recent presentation at useR!2024 includes a section "Help with core CRAN services". This looks like a core CRAN service, so you might be able to get support from CRAN.
Maybe at the next meeting, November 11th at 17:00 CET, you could present this and get more feedback. If you don't get the invitation, I'll send it to you.


More practical ideas: I wouldn't build old package versions by default. From what I've read from Posit, their servers get very few (<5%) requests for packages for old versions of R (it might be that users install from source, use Docker, or already have other ways to deal with it).
On the cost side, you could keep only the last n versions of packages to reduce the storage requirements.

@pat-s
Author

pat-s commented Nov 1, 2024

I think building arm64 binaries is something the CRAN maintainers have in mind; they briefly mentioned serving arm64 binaries in previous meetings with them.
There is also interest from other Linux distributions in having Linux binaries built from CRAN for arm64 (Fedora, that I've heard of, but probably others too).

This is great to hear, though I wonder how this would unfold in practice: to this day there are no CRAN binaries, not even for amd64 alone. Seeing an effort for multiple architectures across different distributions is highly welcome, though I don't have much hope of this being tackled in due time with a modern and open approach (judging by today's closed binary-building system).
Also, I think that when starting something like this, the most commonly used distributions should be supported (including Alpine). I could see this being approached only for Ubuntu (as a start), with others falling behind.

My proposal here is actually not about "getting help from CRAN" (again, this is not a proposal for a new project but a pre-release announcement of something already built) but rather about establishing a new approach of building package binaries in the open, using a modern, distributed underlying architecture with transparent statistics to which everyone can contribute.

More practical ideas: I wouldn't build old package versions by default. From what I've read from Posit, their servers get very few (<5%) requests for packages for old versions of R (it might be that users install from source, use Docker, or already have other ways to deal with it).
On the cost side, you could keep only the last n versions of packages to reduce the storage requirements.

I thought about this when starting out. However, it is hard to judge where the cutoff point should be. In addition, the automation in place simply scrapes all package versions and tries to build them. Many of the old ones error out anyhow due to missing/archived R packages, incompatible compilers, or missing sysdep entries in DESCRIPTION.

Then again, the storage size is actually not that much of an issue thanks to S3. Even a few TB are not that expensive (in contrast to storing this size on a cloud volume). More cost goes into building historic versions that compile for many minutes and then fail. However, I plan to address this using the logs acquired during the initial builds by implementing a "smart judge" feature that decides whether a specific tag should be skipped entirely for future builds - "future builds" meaning rebuilds for newer OS versions, e.g. when Ubuntu 26.04 comes out. While Ubuntu and RHEL only have releases every few years, this system is more important for Alpine, which releases a new version to build against every six months.

Maybe next meeting, November 11th at 17 CET, you could present this and get more feedback. If you don't get the invitation, I'll send it to you.

Sure, sounds like a good option.

@llrs
Collaborator

llrs commented Nov 1, 2024

My proposal here is actually not about "getting help from CRAN" (again, this is not a proposal for a new project but a pre-release announcement of something already built) but rather about establishing a new approach of building package binaries in the open, using a modern, distributed underlying architecture with transparent statistics to which everyone can contribute.

Thanks for clarifying my misunderstanding; looking forward to seeing how you do it.
In case it helps, most of CRAN's system is public at https://svn.r-project.org/R-dev-web/trunk/. I think there is an issue or a PR in this repo about how they build the binaries or how to use the pipeline locally.

I thought about this when starting out. However, it is hard to judge where the cutoff point should be. In addition, the automation in place simply scrapes all package versions and tries to build them. Many of the old ones error out anyhow due to missing/archived R packages, incompatible compilers, or missing sysdep entries in DESCRIPTION.

There is no need to scrape the data: R provides functionality to access the current packages with tools::CRAN_package_db() and all old packages with tools::CRAN_archive_db() (only on devel). The old versions are kept in CRAN's archive and could be used to rebuild those that fail.
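In case it helps, a quick illustration (CRAN_archive_db() is unexported in released R versions, hence the ::: below):

```r
# Current CRAN packages as a data frame:
cur <- tools::CRAN_package_db()
cur[cur$Package == "A3", c("Package", "Version")]

# Archived (historic) versions, one list entry per package; the row names
# are the tarball paths below src/contrib/Archive/ on the CRAN mirror:
arch <- tools:::CRAN_archive_db()
rownames(arch[["A3"]])
```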

Looking forward to learning more about the project.

@pat-s
Author

pat-s commented Nov 1, 2024

There is no need to scrape the data: R provides functionality to access the current packages with tools::CRAN_package_db() and all old packages with tools::CRAN_archive_db() (only on devel). The old versions are kept in CRAN's archive and could be used to rebuild those that fail.

Thanks. I already make use of tools::CRAN_package_db(), but not yet of tools::CRAN_archive_db() (I just saw that it is accessed via :::, which is likely why I didn't notice it before). I currently use cranberries to infer the updated packages on a daily basis. For the sources I am using the GitHub mirror of all packages at https://github.com/cran.
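For illustration, a specific historic version can be fetched from that mirror like this (assuming the mirror's tags follow the version-number convention):

```r
# The GitHub mirror tags each CRAN release with its version number,
# so a specific historic source tarball can be fetched directly:
download.file(
  "https://github.com/cran/A3/archive/refs/tags/1.0.0.tar.gz",
  destfile = "A3_1.0.0_github.tar.gz"
)
```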

With respect to build failures etc.: I envision a transparent UI where everyone can see the build status and failures of packages. The idea is that users can check those and provide suggestions/patches for the underlying Dockerfiles (or for their own sources) to solve the failures.
E.g. during the process of building Alpine packages I've already submitted a few issues to certain R packages containing C code, as these packages failed on Alpine. It turned out only a small change was needed due to MUSL/GLIBC differences, and now they build on both.
Given that so far nobody has focused on checking Alpine, it is not surprising that many packages fail to build.
The new system would provide a platform for such builds and encourage authors to get their packages built on Alpine.

@jameslamb

👋🏻 Hi, I'm a random person from the internet who helps maintain a couple of R packages; hope you don't mind the drive-by post.

Have you seen the r2u project (https://github.com/eddelbuettel/r2u)? It has some overlapping design goals to what you've described here, and might be worth looking at for inspiration.

It would also probably be useful to describe how what you're building differs from conda-forge (https://anaconda.org/r/repo), which is also building aarch64 binaries of many R packages (see the data.table variants here, for example), and has some of the features you've described (like "everyone can see the build status and failures of packages").

System dependencies are inferred via pak, which makes use of https://github.com/rstudio/r-system-requirements to automatically infer these from a package's DESCRIPTION file.

This, in particular, is something that r2u handles differently. Paraphrasing here from a talk I recently saw @eddelbuettel give... I believe that system does something like "build from source, then run ldd on the built library to determine which other shared libraries it'll need at runtime, then map backwards from those library filenames to package names".

@pat-s
Author

pat-s commented Nov 3, 2024

Hi James,

r2u

Sure - r2u isn't a new project and has been around for some time. I've been asked about it several times in my professional work (which has revolved around R infrastructure for several years).

I don't think r2u solves any issues related to R packaging within the R community and (hot take) even makes things more complicated. Here's why:

  • Only built for amd64
  • Only available for one Linux family (Debian)
  • Updates must be done on the admin/root side, as normal users can't use apt (users therefore don't even know when packages got updated)
  • You can't use project versioning approaches (like renv or others), as you can't install historic versions on your own
  • A well-maintained OS updates system packages on a regular basis (daily, weekly), which means R packages would change constantly

Therefore I always discourage the use of r2u when somebody asks for my opinion. In addition, given the full availability of R package binaries for Ubuntu through Posit PM, I don't see any benefit of r2u even for Ubuntu users.

anaconda

Anaconda's package compatibility is not well described: they provide a single "linux-64" build. However, this cannot work for all distros, as each needs different versions at runtime and must be linked against local sysdeps. That said, I haven't yet tried it in practice on different distros or inspected it in close detail.

In addition, they don't build the full CRAN set but only a subset - see also here:

Many Comprehensive R Archive Network (CRAN) packages are available as conda packages. Anaconda does not provide builds of the entire CRAN repository, so there are some packages in CRAN that are not available as conda packages.

has some of the features you've described (like "everyone can see the build status and failures of packages").

This is "just" some additional niceness of the overall approach, not something that stands out.
What matters to me is:

  • arch-agnostic packages
  • distribution-agnostic packages
  • historic versions
  • usage with install.packages() (see the sketch below)
  • a globally low latency for downloads and installation
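
As a hedged sketch of the install.packages() point above, using the repo URL structure that appears later in this thread (treat the exact path as illustrative):

```r
# Point install.packages() directly at the arm64/Ubuntu-noble binary repo:
install.packages(
  "data.table",
  repos = "https://cran.devxy.io/arm64/noble/latest"
)
```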

@pat-s
Author

pat-s commented Nov 6, 2024

@llrs Would you mind sending me an invitation for the upcoming meeting?

@pat-s
Author

pat-s commented Jan 1, 2025

The project was released on Dec 24th. More information can be found in the announcement blog post.

@llrs
Collaborator

llrs commented Jan 1, 2025

Nice, thank you for releasing it. Looking forward to testing it and to the more technical post.

I was checking the packages and I noticed that the tar.gz files provided by CRAN and by your repository are not the same:
https://cran.devxy.io/arm64/noble/latest/src/contrib/A3_1.0.0.tar.gz vs https://ftp.cixug.es/CRAN/src/contrib/A3_1.0.0.tar.gz. I believe this is because your repository provides the tar.gz files of packages already installed (with the HTML help pages and other metadata) instead of the R CMD build output that is submitted to CRAN. Is this on purpose?

I see that Alpine uses a different C library; do you think MUSL checks could be added as additional checks on CRAN?
CRAN mentioned something about distributing binaries for Linux and organizing them for arm64 and amd64, but this would require changing the tooling in base R to pick the different combinations. What would be your preference on folder structure: linux/arm64/musl and linux/amd64/glibc?

@pat-s
Author

pat-s commented Jan 2, 2025

instead of the R CMD build output that is submitted to CRAN. Is this on purpose?

Binary tarballs are produced via pkgbuild::build(binary = TRUE).
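
For reference, a minimal local reproduction of that build step (the path is a hypothetical source checkout):

```r
# Build a binary tarball from a package source directory; on Linux this
# produces e.g. A3_1.0.0_R_x86_64-pc-linux-gnu.tar.gz in dest_path.
pkgbuild::build(
  path      = "A3",
  dest_path = "binaries",
  binary    = TRUE
)
```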

are not the same:

Do you mean the hashes are different, or what exactly? I didn't compare them file-by-file to their respective sources, but I also don't have much interest in doing so. The package sources themselves are queried from the respective repos within https://github.com/cran.

I see that Alpine uses a different C library; do you think MUSL checks could be added as additional checks on CRAN?
CRAN mentioned something about distributing binaries for Linux and organizing them for arm64 and amd64, but this would require changing the tooling in base R to pick the different combinations. What would be your preference on folder structure: linux/arm64/musl and linux/amd64/glibc?

Yes, that would help a lot. There are a dozen packages with "easy"-to-fix compiler issues on MUSL, but these are not checked at the moment. This then causes these packages and their reverse dependencies to fail, which explains most of the failed builds on Alpine. Overall though, I am quite happy and somewhat surprised there aren't more failures.

I am currently preparing a full docs site for the project. This sneak peek shows that using Alpine for CI on GHA reduces the time to install dependencies (without cache) by ~40%.

@llrs
Collaborator

llrs commented Jan 2, 2025

instead of the R CMD build output that is submitted to CRAN. Is this on purpose?

Binary tarballs are produced via pkgbuild::build(binary = TRUE).

are not the same:

Do you mean the hashes are different, or what exactly? I didn't compare them file-by-file to their respective sources, but I also don't have much interest in doing so. The package sources themselves are queried from the respective repos within https://github.com/cran.

I mean the hashes are different (as reported by available.packages()) because the files inside are different. Check the names and file structure of the two linked tar.gz files: you will see they don't contain the same files.
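In case it's useful, a quick way to see the difference from R without unpacking anything:

```r
# Compare the file listings of the two linked tarballs.
u1 <- "https://cran.devxy.io/arm64/noble/latest/src/contrib/A3_1.0.0.tar.gz"
u2 <- "https://ftp.cixug.es/CRAN/src/contrib/A3_1.0.0.tar.gz"
f1 <- tempfile(fileext = ".tar.gz"); download.file(u1, f1)
f2 <- tempfile(fileext = ".tar.gz"); download.file(u2, f2)
setdiff(untar(f1, list = TRUE), untar(f2, list = TRUE))
# -> installed-package extras such as Meta/ and html/ appear only in the first
```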

I see that Alpine uses a different C library; do you think MUSL checks could be added as additional checks on CRAN?
CRAN mentioned something about distributing binaries for Linux and organizing them for arm64 and amd64, but this would require changing the tooling in base R to pick the different combinations. What would be your preference on folder structure: linux/arm64/musl and linux/amd64/glibc?

Yes, that would help a lot. There are a dozen packages with "easy"-to-fix compiler issues on MUSL, but these are not checked at the moment. This then causes these packages and their reverse dependencies to fail, which explains most of the failed builds on Alpine. Overall though, I am quite happy and somewhat surprised there aren't more failures.

I think this is due to the many checks and strict quality requirements for compiled code on CRAN, thanks to Brian Ripley. If he were approached with these issues, he might be interested in adding new checks for MUSL compatibility on CRAN itself, which might relieve you of doing the patches yourself.

I am currently preparing a full docs site for the project. This sneak peek shows that using Alpine for CI on GHA reduces the time to install dependencies (without cache) by ~40%.

Awesome! I'll read it with great interest.

@pat-s
Author

pat-s commented Jan 3, 2025

I mean the hashes are different (as reported by available.packages()) because the files inside are different. Check the names and file structure of the two linked tar.gz files: you will see they don't contain the same files.

Yes, I see it - likely due to the reason you mentioned. But I don't care too much about this difference, as long as the size is not substantially bigger because of it. For packages with C code, the binary will contain additional files anyhow.

I think this is due to the many checks and strict quality requirements of compiled code on CRAN thanks to Brian Ripley. I think if he were approached with these issues he might be interested to add new check for MUSL compatibility on CRAN itself. Which might unload your work to do the patches yourself.

Yes, sure - although I am not sure he/CRAN would be interested in enforcing these checks for an external project.

Let's continue additional discussions in the dedicated support repo I just created.

(I have now also published a documentation website alongside the project.)
