R CRAN package binaries for Linux - 2.0 #36
I think building arm64 binaries is something CRAN maintainers have in mind, as I think they briefly mentioned serving arm64 binaries in previous meetings with them. CRAN's recent presentation at useR! 2024 includes a section "Help with core CRAN services". This seems like a core CRAN service, so you could likely get support from CRAN. A more practical idea: I wouldn't build old package versions by default. From what I read from Posit, their servers get very few (<5%) requests for packages for old versions of R (it might be that users install from source, use Docker, or already have other ways to deal with it).
This is great to hear, though I am wondering how this would unfold in practice: to this day there are no CRAN binaries, not even for amd64 only. Seeing an effort for multiple architectures across different distributions is highly welcome, though I don't have much hope of this being tackled in due time with a modern and open approach (judging by the closed binary building system of today). My proposal here is actually not about "getting help from CRAN" (again, this is not a proposal for a new project but a pre-release announcement of something already built) but rather about establishing a new approach of building package binaries in the open, using a modern, distributed underlying architecture with transparent statistics to which everyone can contribute.
I thought about this when starting out. However, it is hard to judge where the cutoff point should be. In addition, the automation in place simply scrapes all package versions and tries to build them. Many of the old ones will error anyhow due to missing/archived R packages, incompatible compilers or missing sysdep entries in DESCRIPTION. Then again, the storage size is actually not that much of an issue due to the use of S3; even a few TB are not that expensive (in contrast to storing this size on a cloud volume). More costs are spent on building historic versions that compile for many minutes and then fail. However, I plan to address this using the logs acquired during the initial builds, by implementing a "smart judge" feature that decides whether a specific tag will be skipped entirely for future builds (with "future builds" meaning rebuilds for newer OS versions, e.g. when Ubuntu 26.04 comes out); see the sketch below. While Ubuntu and RHEL only have releases every few years, this system is more important for Alpine, which releases a new version to build against every 6 months.
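To illustrate, a rough sketch of how such a "smart judge" could look - the data layout, column names and failure categories here are only assumptions, not an actual schema:

```r
# Hypothetical helper: skip rebuilding a package version for a new OS release
# when it has consistently failed before for reasons a new OS will not fix.
# The build_log data frame and its columns are invented for illustration.
should_skip_tag <- function(build_log, pkg, version) {
  past <- build_log[build_log$package == pkg & build_log$tag == version, ]
  if (nrow(past) == 0) return(FALSE)  # never attempted: try the build
  hard_failures <- c("archived_dependency", "compiler_error", "missing_sysdep")
  all(past$status == "failure" & past$reason %in% hard_failures)
}
```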
Sure, sounds like a good option.
Thanks for clarifying my misunderstanding, looking forward to seeing how you do it.
There is no need to scrape the data; R provides functionality to access the current ones directly. Looking forward to learning more about the project.
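For example, something like the following lists current CRAN packages without scraping (assuming this is the kind of built-in functionality meant here; the original inline reference was lost in this thread):

```r
# Current CRAN metadata as a data frame, straight from base R's tools package.
db <- tools::CRAN_package_db()
head(db[, c("Package", "Version")])

# Package metadata from the currently configured repositories.
ap <- available.packages()
nrow(ap)
```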
Thanks, I already make use of that. With respect to build failures etc.: I envision a transparent UI where everyone can see the build status and failures of packages. The idea is that users can check those and provide suggestions/patches to the underlying Dockerfiles (or their own sources) to solve the failures.
👋🏻 Hi, I'm a random person from the internet who helps maintain a couple of R packages, hope you don't mind the drive-by post. Have you seen the r2u project? It would also probably be useful to describe how what you're building differs from existing offerings such as r2u or conda.
This, in particular, is something that
Hi James, regarding r2u: sure, though I don't think it covers the same matrix of architectures and distributions that this project targets.
Regarding "Therefore I always discourage the use of anaconda": Anaconda's package compatibility is not well described. I.e., they provide a single "linux-64" build. However, this cannot work for all distros, as each needs different versions during runtime and must be linked against local sysdeps. That said, I haven't tried it in practice on different distros or inspected it in close detail. In addition, they don't build full CRAN but only a subset - see also here:
This is "just" some additional niceness of the overall approach, not something that stands out.
@llrs Would you mind sending me an invitation for the upcoming meeting?
The project was released on December 24th. More information can be found in the announcement blog post.
Nice, thank you for releasing it. Looking forward to testing it and to the more technical post. I was checking the packages and I noticed that the tar.gz files provided by CRAN and on your repository are not the same. I see that Alpine uses a different C compiler; do you think that MUSL C library checks could be added?
Binary tarballs are produced via
Do you mean the hashes are different, or what exactly? I didn't compare them file-by-file to their respective sources, but I also don't have much interest in doing so. The source code itself is queried from the respective repo within https://github.com/cran.
Yes, that would help a lot. There are a dozen packages which have easy-to-fix compiler issues on MUSL, but these are not checked at the moment. This then causes these packages and their reverse deps to fail, which explains most of the failed builds on Alpine. Overall though I am quite happy and somewhat surprised there aren't more failed ones. I am currently preparing a full docs site for the project. This sneak peek shows that using Alpine for CI on GHA reduces the time to install deps (without cache) by ~40%.
I mean the hashes are different (as reported with md5sum).
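For example (with placeholder package and URLs, not the actual repository endpoints):

```r
# Hypothetical comparison of a CRAN source tarball against the one served by
# the new repository; the package name and mirror URL are made up.
cran_url   <- "https://cran.r-project.org/src/contrib/somepkg_1.0.0.tar.gz"   # placeholder
mirror_url <- "https://binaries.example.org/src/contrib/somepkg_1.0.0.tar.gz" # placeholder

download.file(cran_url,   "cran.tar.gz",   mode = "wb")
download.file(mirror_url, "mirror.tar.gz", mode = "wb")
tools::md5sum(c("cran.tar.gz", "mirror.tar.gz"))  # differing hashes => not byte-identical
```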
I think this is due to the many checks and strict quality requirements for compiled code on CRAN, thanks to Brian Ripley. I think if he were approached with these issues, he might be interested in adding new checks for MUSL compatibility on CRAN itself, which might offload the work of patching packages yourself.
Awesome! I'll read it with great interest.
Yes, I see it. Likely due to the reason you mentioned. But I don't care too much about this difference, as long as the size is not substantially bigger due to it. For packages with C code the binary will contain additional files anyhow.
Yes sure, although I am not sure he/CRAN would be interested in enforcing these checks for an external project. Let's continue additional discussions in the dedicated support repo I just created. (I have now also published a documentation website alongside the project.)
Dear Working Group,
For the past months I have been building a new project that I internally call "R CRAN package binaries for Linux - 2.0" (2.0 because Posit PM was the 1.0 in my mind). I'd hereby like to introduce it to the working group, gather feedback, and collect ideas for a sustainable future.
Background
CRAN does not provide package binaries for Linux. For many years, Posit has been doing so for several operating systems, but only for one architecture.
Given the rise of arm64 and Docker-based actions in recent years, the need emerged to also have package binaries for arm64 and for the OS most commonly used for Docker-based workloads, Alpine.
I've reached out to Posit a few times to ask whether they have plans to extend their build chain to arm64, without success - the only brief answer I got, from multiple sources, was that there are "no plans to do this in the foreseeable future".
Concept
I've had the idea in mind for quite some time but was so far lacking the time and financial resources to get started.
Having started an Ltd two months ago, I now have both, and my motivation was high to finally tackle this issue.
So I started to build R package binaries on Linux for the following matrix:
This results in a total of twelve times building CRAN (including all historic package versions), so roughly 12 x 20k x 6 (6 being the average version count of CRAN packages) = ~1.4 million binaries.
General Details
The binaries are already built and are distributed through a global CDN.
The CDN showed download speedups of up to 10x compared to downloading the same binary from Posit PM (measured from a European location).
While I initially only planned to build arm64 binaries, I realized that adding a new repo for only arm64 binaries is cumbersome: one has to switch repos for arm64/amd64 builds and in addition needs a CRAN src repo to ensure all packages are available (as some binaries are missing due to build failures).
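For illustration, a minimal sketch of how such a setup could look on the consumer side - the repository URL below is a placeholder, not the actual endpoint:

```r
# Point R at a (hypothetical) Linux binary repository, keeping CRAN as the
# source fallback for packages whose binaries are missing due to build failures.
options(repos = c(
  LINUXBIN = "https://binaries.example.org/ubuntu-24.04/arm64/4.4",  # placeholder URL
  CRAN     = "https://cloud.r-project.org"
))
install.packages("dplyr")  # resolved from the binary repo if present, else from CRAN sources
```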
Binaries were built as a one-off, starting on date X. Since "start day" + 1, updated packages are being processed daily. This includes:
This is then followed by a task that updates the PACKAGES index file, which provides an index of all available packages.
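For illustration, such an index update can look roughly like this with base R's tools package (the actual setup uses a forked cranlike with S3 support, so this is only a sketch; the path is a placeholder):

```r
# Regenerate the PACKAGES, PACKAGES.gz and PACKAGES.rds index files for a
# CRAN-like repository directory containing the built tarballs.
tools::write_PACKAGES(
  dir     = "/srv/repo/src/contrib",  # placeholder path
  type    = "source",                 # Linux binary tarballs are served in source layout
  verbose = TRUE
)
```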
Technical Details
Binaries are built on a k3s-based Kubernetes cluster with autoscaling. This means that recurring jobs execute the tasks mentioned above and automatically shut down the servers afterwards (this usually takes 20-50 mins/day, depending on the number of packages).
The backend for all builds consists of individual Docker images I've crafted that contain a robust compiler setup, including further components like an X server, BLAS, Chromium, and other tools needed to build many packages.
System dependencies are inferred via pak, which makes use of https://github.com/rstudio/r-system-requirements to automatically infer these from a package's DESCRIPTION file.

With respect to storage, I might have opted for a new approach: storing binaries in S3. While this sounds like a logical solution per se, all related tools and helpers (the packages tools, cranlike and desc) do not work with S3 at the moment. Hence I forked them and added S3 support. For cranlike, e.g., this means supporting updates of PACKAGES on a remote S3 location, using the etag for the required md5sum hash value.
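To sketch the idea (using the paws S3 client here purely for illustration - the actual implementation lives in the cranlike fork; bucket and key names are made up). For single-part uploads, the S3 ETag equals the object's MD5 sum, so the MD5sum field of a PACKAGES entry can be filled from a HEAD request without downloading the tarball:

```r
# Fetch the MD5 sum of a binary tarball stored in S3 from its ETag via a
# HEAD request, i.e. without downloading the object itself.
library(paws)

s3 <- paws::s3()

md5_from_etag <- function(bucket, key) {
  head <- s3$head_object(Bucket = bucket, Key = key)
  gsub('"', "", head$ETag)  # ETag is returned quoted; strip the quotes
}

# Hypothetical bucket/key; the result would go into the MD5sum field of PACKAGES.
md5_from_etag("my-cran-binaries", "alpine-3.20/arm64/4.4/src/contrib/somepkg_1.0.0.tar.gz")
```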
Sustainability & Quality
While so far I've built around 1.4 million binaries, including daily rebuild jobs, there's more to keeping such a project alive.
One point is of course improving the underlying Containerfiles with respect to their compiler config, as there are still (too many) packages which fail to build due to C-code issues.
While my plan is to provide precise statistics for each package (as I am storing the build metadata for every binary in a Postgres DB), Alpine is surely the most complicated one due to the fact that it uses MUSL instead of glibc as the C library.
Besides the build and storage costs (which I don't want to share in this post), there's also the distribution cost. Distributing the binaries through a global CDN with local storage caches on different continents seems like a great solution to me. I don't want to store the assets in a single location/S3 bucket and then serve many requests with high latency and overall travel time.
All of the above comes with some costs. I didn't mention the person hours yet, but so far I think I am somewhere around 300 h invested into the project. Storage and server costs so far are in the range of 1-2k.
My goal is not to maximize profits with this project, even though I am placing it within my recently founded company. I want to help the R community advance to the "present" of today's possibilities (with respect to architectures and asset delivery) and make use of the binaries myself. Placing it under the umbrella of my company helps me to finance and justify it as a "professional" project. I am aware of the R Consortium funds, and applying for a grant is definitely in scope. However, I wanted to first share this project with the WG before proceeding with that.
Overall, I am looking for feedback and support with this project to make it sustainable, both technically and financially.
The source code is not yet public as I still need to document it properly and "clean up" - but I am definitely planning to do so. In contrast to Posit PM, I would like to develop/maintain the project in the open and encourage everyone to contribute.
Patrick Schratz