ci-conda builds failing with GLIBCXX errors #185
Comments
I'm using #183 to investigate this. |
Worth noting that there was a new release of
I don't see anything obviously-relevant in those release notes though. |
Still debugging, dumping some notes.
Here's a recent full re-run: https://github.com/rapidsai/ci-imgs/actions/runs/10687166451?pr=183
What's succeeding:
✅ all Python 3.10 builds
What's failing:
❌ all Python 3.11 and 3.12 aarch64 builds
The failures are all happening at this step: Lines 178 to 183 in d63e1aa
|
I truncated the Dockerfile down to the following, saved as ci-conda-truncated.Dockerfile:
ARG CUDA_VER=notset
ARG LINUX_VER=notset
ARG PYTHON_VER=notset
ARG YQ_VER=notset
ARG AWS_CLI_VER=notset
FROM nvidia/cuda:${CUDA_VER}-base-${LINUX_VER} AS miniforge-cuda
ARG LINUX_VER
ARG PYTHON_VER
ARG DEBIAN_FRONTEND=noninteractive
ENV PATH=/opt/conda/bin:$PATH
ENV PYTHON_VERSION=${PYTHON_VER}
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
# Create a conda group and assign it as root's primary group
RUN <<EOF
groupadd conda
usermod -g conda root
EOF
# Ownership & permissions based on https://docs.anaconda.com/anaconda/install/multi-user/#multi-user-anaconda-installation-on-linux
COPY --from=condaforge/miniforge3:24.3.0-0 --chown=root:conda --chmod=770 /opt/conda /opt/conda
# Ensure new files are created with group write access & setgid. See https://unix.stackexchange.com/a/12845
RUN chmod g+ws /opt/conda
RUN <<EOF
# Ensure new files/dirs have group write permissions
umask 002
# install expected Python version
conda install -y -n base "python~=${PYTHON_VERSION}.0=*_cpython"
conda update --all -y -n base
if [[ "$LINUX_VER" == "rockylinux"* ]]; then
yum install -y findutils
yum clean all
fi
find /opt/conda -follow -type f -name '*.a' -delete
find /opt/conda -follow -type f -name '*.pyc' -delete
conda clean -afy
EOF
# Reassign root's primary group to root
RUN usermod -g root root
RUN <<EOF
# ensure conda environment is always activated
ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh
echo ". /opt/conda/etc/profile.d/conda.sh; conda activate base" >> /etc/skel/.bashrc
echo ". /opt/conda/etc/profile.d/conda.sh; conda activate base" >> ~/.bashrc
EOF
# tzdata is needed by the ORC library used by pyarrow, because it provides /etc/localtime
RUN <<EOF
case "${LINUX_VER}" in
"ubuntu"*)
apt-get update
apt-get upgrade -y
apt-get install -y --no-install-recommends \
tzdata
rm -rf /var/lib/apt/lists/*
;;
"rockylinux"*)
yum update -y
yum clean all
;;
*)
echo "Unsupported LINUX_VER: ${LINUX_VER}" && exit 1
;;
esac
EOF
FROM mikefarah/yq:${YQ_VER} AS yq
FROM amazon/aws-cli:${AWS_CLI_VER} AS aws-cli
FROM miniforge-cuda
ARG TARGETPLATFORM=notset
ARG CUDA_VER=notset
ARG LINUX_VER=notset
ARG PYTHON_VER=notset
ARG DEBIAN_FRONTEND
# Set RAPIDS versions env variables
ENV RAPIDS_CUDA_VERSION="${CUDA_VER}"
ENV RAPIDS_PY_VERSION="${PYTHON_VER}"
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
# Install system packages depending on the LINUX_VER
RUN <<EOF
case "${LINUX_VER}" in
"ubuntu"*)
echo 'APT::Update::Error-Mode "any";' > /etc/apt/apt.conf.d/warnings-as-errors
apt-get update
apt-get upgrade -y
apt-get install -y --no-install-recommends \
curl \
file \
unzip \
wget \
gcc \
g++
rm -rf /var/lib/apt/lists/*
;;
"rockylinux"*)
yum -y update
yum -y install --setopt=install_weak_deps=False \
file \
unzip \
wget \
which \
yum-utils \
gcc \
gcc-c++
yum clean all
;;
*)
echo "Unsupported LINUX_VER: ${LINUX_VER}"
exit 1
;;
esac
EOF
# Install CUDA packages, only for CUDA 11 (CUDA 12+ should fetch from conda)
RUN <<EOF
case "${CUDA_VER}" in
"11"*)
PKG_CUDA_VER="$(echo ${CUDA_VER} | cut -d '.' -f1,2 | tr '.' '-')"
echo "Attempting to install CUDA Toolkit ${PKG_CUDA_VER}"
case "${LINUX_VER}" in
"ubuntu"*)
apt-get update
apt-get upgrade -y
apt-get install -y --no-install-recommends \
cuda-gdb-${PKG_CUDA_VER} \
cuda-cudart-dev-${PKG_CUDA_VER} \
cuda-cupti-dev-${PKG_CUDA_VER}
# ignore the build-essential package since it installs dependencies like gcc/g++
# we don't need them since we use conda compilers, so this keeps our images smaller
apt-get download cuda-nvcc-${PKG_CUDA_VER}
dpkg -i --ignore-depends="build-essential" ./cuda-nvcc-*.deb
rm ./cuda-nvcc-*.deb
# apt will not work correctly if it thinks it needs the build-essential dependency
# so we patch it out with a sed command
sed -i 's/, build-essential//g' /var/lib/dpkg/status
rm -rf /var/lib/apt/lists/*
;;
"rockylinux"*)
yum -y update
yum -y install --setopt=install_weak_deps=False \
cuda-cudart-devel-${PKG_CUDA_VER} \
cuda-driver-devel-${PKG_CUDA_VER} \
cuda-gdb-${PKG_CUDA_VER} \
cuda-cupti-${PKG_CUDA_VER}
rpm -Uvh --nodeps $(repoquery --location cuda-nvcc-${PKG_CUDA_VER})
yum clean all
;;
*)
echo "Unsupported LINUX_VER: ${LINUX_VER}"
exit 1
;;
esac
;;
*)
echo "Skipping CUDA Toolkit installation for CUDA ${CUDA_VER}"
;;
esac
EOF
# Install gha-tools
RUN wget https://github.com/rapidsai/gha-tools/releases/latest/download/tools.tar.gz -O - \
| tar -xz -C /usr/local/bin
That's sufficient to reproduce the error locally on my mac (aarch64), with the latest versions of Ubuntu, Python, and CUDA supported in this repo.
docker buildx build \
--build-arg SCCACHE_VER=0.7.7 \
--build-arg GH_CLI_VER=2.54.0 \
--build-arg CODECOV_VER=0.7.3 \
--build-arg YQ_VER=4.44.2 \
--build-arg AWS_CLI_VER=2.17.20 \
--build-arg CUDA_VER=12.5.1 \
--build-arg LINUX_VER=ubuntu22.04 \
--build-arg PYTHON_VER=3.12 \
--file ci-conda-truncated.Dockerfile \
--tag delete-me:ci-conda-py3.12 \
./context
docker run \
--rm \
-it delete-me:ci-conda-py3.12 \
conda clean --yes --all
# Error while loading conda entry point: conda-libmamba-solver (/lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/conda/lib/python3.12/site-packages/libmambapy/bindings.cpython-312-aarch64-linux-gnu.so))
Checked that library's GLIBCXX symbols:
docker run \
--rm \
--env LIB_FILE=/opt/conda/lib/python3.12/site-packages/libmambapy/bindings.cpython-312-aarch64-linux-gnu.so \
-it delete-me:ci-conda-py3.12 \
bash -c 'objdump -T ${LIB_FILE} | grep -oP "(?<=GLIBCXX_)([0-9.]+)" | sort -u'
And sure enough, I see GLIBCXX_3.4.32 symbols there.
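One caveat with the `objdump ... | sort -u` pipeline above: plain `sort -u` orders lines lexically, so `3.4.9` sorts after `3.4.32` and the last line printed is not necessarily the newest version. Here is a minimal sketch of picking the highest version instead, assuming GNU coreutils (`sort -V`) is available. The function name `max_glibcxx` is made up for illustration, and the input lines are stand-ins for real `objdump -T` output. It uses `grep -oE` plus `sed` rather than `grep -oP` purely for portability.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ci-imgs): read `objdump -T` output on stdin
# and print the highest GLIBCXX_x.y.z version it mentions.
max_glibcxx() {
    grep -oE 'GLIBCXX_[0-9.]+' \
        | sed 's/^GLIBCXX_//' \
        | sort -u -V \
        | tail -n 1
}

# Made-up sample lines standing in for real objdump output.
printf '%s\n' \
    '0000 DF *UND* 0000 (GLIBCXX_3.4.9)  sym_a' \
    '0000 DF *UND* 0000 (GLIBCXX_3.4.32) sym_b' \
    '0000 DF *UND* 0000 (GLIBCXX_3.4.30) sym_c' \
    | max_glibcxx
# prints 3.4.32
```

Against a real image, the same function could be fed with `objdump -T "${LIB_FILE}" | max_glibcxx`.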
So it does look like the root cause for CI failures here is something like "
In this case, that image got the following versions of mamba things:
docker run \
--rm \
-it delete-me:ci-conda-py3.12 \
bash -c 'conda env export | grep -i mamba'
Linking one related issue: conda-forge/mamba-feedstock#201. Next, I'll try to reproduce this with a more minimal example. |
Getting a lot closer!
Short summary
In successfully-building environments, there exists a symlink in
In failing environments, that symlink is missing.
Details
It looks like on Python 3.12,
readelf -d /opt/conda/lib/python3.12/site-packages/libmambapy/bindings.cpython-312-aarch64-linux-gnu.so \
| grep -E 'NEEDED|RPATH'
At loading time it's finding the one in
Which only has GLIBCXX symbols up to 3.4.30
objdump -T /lib/aarch64-linux-gnu/libstdc++.so.6 \
| grep -oP '(?<=GLIBCXX_)([0-9.]+)' \
| sort -u
INSTEAD OF the one provided by conda and pointed to by that RPATH.
Which has GLIBCXX symbols up to 3.4.33.
objdump -T /opt/conda/lib/libstdc++.so.6.0.33 \
| grep -oP '(?<=GLIBCXX_)([0-9.]+)' \
| sort -u
Maybe conda is missing a symlink from
I rebuilt a Python 3.10 image using that "truncated" Dockerfile, and confirmed that I saw the symlink there.
docker buildx build \
--build-arg SCCACHE_VER=0.7.7 \
--build-arg GH_CLI_VER=2.54.0 \
--build-arg CODECOV_VER=0.7.3 \
--build-arg YQ_VER=4.44.2 \
--build-arg AWS_CLI_VER=2.17.20 \
--build-arg CUDA_VER=12.5.1 \
--build-arg LINUX_VER=ubuntu22.04 \
--build-arg PYTHON_VER=3.10 \
--file ci-conda-truncated.Dockerfile \
--tag delete-me:ci-conda-py3.10 \
./context
docker run \
--rm \
-it delete-me:ci-conda-py3.10 \
bash -c 'ls /opt/conda/lib/libstdc++*'
# /opt/conda/lib/libstdc++.so
# /opt/conda/lib/libstdc++.so.6
# /opt/conda/lib/libstdc++.so.6.0.33
docker run \
--rm \
-it delete-me:ci-conda-py3.10 \
bash -c 'stat /opt/conda/lib/libstdc++.so.6'
# File: /opt/conda/lib/libstdc++.so.6 -> libstdc++.so.6.0.33
# Size: 19 Blocks: 0 IO Block: 4096 symbolic link |
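The mismatch described in this comment boils down to a version comparison: the extension module needs `GLIBCXX_3.4.32`, the system `libstdc++` tops out at 3.4.30, and the conda-provided one goes up to 3.4.33. A minimal sketch of that comparison, assuming GNU `sort -V`; the helper name `glibcxx_satisfied` is hypothetical, and the version numbers are the ones reported in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the highest GLIBCXX version a libstdc++
# provides ($2) is at least the version an extension module requires ($1).
glibcxx_satisfied() {
    local required="$1" provided="$2"
    # `sort -V` orders version strings; if `required` sorts first (or both
    # are equal), then provided >= required.
    [ "$(printf '%s\n' "$required" "$provided" | sort -V | head -n 1)" = "$required" ]
}

required="3.4.32"   # needed by libmambapy's bindings .so, per the error message
for candidate in "system:3.4.30" "conda:3.4.33"; do
    name="${candidate%%:*}"
    provided="${candidate##*:}"
    if glibcxx_satisfied "$required" "$provided"; then
        echo "${name} libstdc++ (${provided}) satisfies GLIBCXX_${required}"
    else
        echo "${name} libstdc++ (${provided}) is too old for GLIBCXX_${required}"
    fi
done
```

With those inputs, the system library is reported as too old and the conda one as sufficient, which matches the observation that the failure only appears when the loader falls back to `/lib/aarch64-linux-gnu/libstdc++.so.6`.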
I think it's worth noting that we set up conda by copying the contents of Line 24 in d63e1aa
And then update Python and then all other dependencies in the base environment. Lines 33 to 34 in d63e1aa
The |
Ahhhh yes this is totally what's happening!!!
For Python 3.11 / 3.12 environments, that
Starting from the base image, the library and links are there.
docker run --rm \
-it condaforge/miniforge3:24.3.0-0 \
bash
ls -l /opt/conda/lib/*stdc++* | grep -oP '/opt.*'
# /opt/conda/lib/libstdc++.so -> libstdc++.so.6.0.32
# /opt/conda/lib/libstdc++.so.6 -> libstdc++.so.6.0.32
# /opt/conda/lib/libstdc++.so.6.0.32
After updating to Python 3.12, a new
conda install -y -n base "python~=3.12.0=*_cpython"
ls -l /opt/conda/lib/*stdc++* | grep -oP '/opt.*'
# /opt/conda/lib/libstdc++.so -> libstdc++.so.6.0.33
# /opt/conda/lib/libstdc++.so.6 -> libstdc++.so.6.0.33
# /opt/conda/lib/libstdc++.so.6.0.32
# /opt/conda/lib/libstdc++.so.6.0.33
But the
conda update --all -y -n base
ls -l /opt/conda/lib/*stdc++* | grep -oP '/opt.*'
# /opt/conda/lib/libstdc++.so.6.0.33
|
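Since the symlinks silently disappear after `conda update --all`, one option is a guard that fails loudly when `libstdc++.so.6` is missing or dangling. The sketch below is an assumption about how such a check could look, not something taken from ci-imgs; the function name `check_libstdcxx_link` is hypothetical, and the demonstration uses a throwaway directory instead of a real `/opt/conda/lib`.

```shell
#!/usr/bin/env bash
# Hypothetical build-time guard (not from ci-imgs): check that a lib directory
# still has a `libstdc++.so.6` symlink that resolves to a real file.
check_libstdcxx_link() {
    local lib_dir="$1"
    local link="${lib_dir}/libstdc++.so.6"
    if [ ! -L "$link" ]; then
        echo "missing symlink: ${link}" >&2
        return 1
    fi
    # `-e` follows the link, so this also catches dangling symlinks.
    if [ ! -e "$link" ]; then
        echo "dangling symlink: ${link} -> $(readlink "$link")" >&2
        return 1
    fi
    echo "ok: ${link} -> $(readlink "$link")"
}

# Demonstration in a temporary directory that mimics the two states seen above.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/libstdc++.so.6.0.33"
# State after `conda update --all`: only the versioned file exists.
check_libstdcxx_link "$demo_dir" || echo "guard failed, as in the broken images"
# State in working images: the SONAME symlink is present.
ln -s libstdc++.so.6.0.33 "${demo_dir}/libstdc++.so.6"
check_libstdcxx_link "$demo_dir"
rm -rf "$demo_dir"
```

In a Dockerfile `RUN` step that already uses `bash -euo pipefail`, calling `check_libstdcxx_link /opt/conda/lib` right after the update would stop the build at the point where the link goes missing, rather than at the later `conda clean` step.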
Really appreciate the detailed writing. Just approved the PR |
This is an awesome investigation James, thanks! I don't know exactly what is happening, but I can with reasonable confidence say that we are running afoul of conda-forge/ctng-compilers-feedstock#148 introducing incompatibilities with the old packages our images are copying from a version of miniforge prior to those changes. conda-forge/ctng-compilers-feedstock#148 introduced the unsuffixed
But then, when you run the update, we see this:
Note that in the second case the package has the
Here's my best guess for what is happening, although it has some pretty clear gaps that need to be filled in.
If I am correct, then what can we do to fix this? I would suggest updating |
I agree with Vyas' analysis. One possible alternative solution is to use micromamba to populate /opt/conda instead of copying from miniforge. This has the advantage of installing the desired python version directly. Here are some docs: https://micromamba-docker.readthedocs.io/en/latest/quick_start.html @vyasr what do you think about this alternative? |
Thank you all so much! I pushed @vyasr 's recommendation of doing an earlier
We do also need the
It looks to me from those docs like using |
If you want conda to be available in the environment afterwards, you just include it as one of the packages to install. The idea is to use micromamba as just a provisioner for the environment, and it doesn't stick around. In other words, instead of
you can have:
It's pretty similar either way, and it looks to me like neither is obviously advantageous over the other. The conda history will be simpler with the latter, which might reduce the chance of weird issues. |
I do think we should switch over to micromamba, see rapidsai/build-planning#50 🙂 but I would suggest we do that as a follow-up since that will require more rigorous testing to get right. It would be good to update shared workflows to support using custom images so that we could change the images to use micromamba and then run test workflows in a couple of repos to be sure that everything works as expected. |
Nightly builds of `rapidsai/raft-ann-bench` failed like this:
> ImportError: /lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/conda/lib/python3.11/site-packages/libmambapy/bindings.cpython-311-aarch64-linux-gnu.so)
([build link](https://github.com/rapidsai/docker/actions/runs/10739898324/job/29789780257))
I suspect that's because those images use the same pattern for initializing a conda environment that led to the issues described in rapidsai/ci-imgs#185.
This proposes the same fix that we applied in `ci-imgs` (rapidsai/ci-imgs#186).
Authors:
- James Lamb (https://github.com/jameslamb)
Approvers:
- Mike Sarahan (https://github.com/msarahan)
URL: #710
Description
48 of the ci-conda image build jobs are deterministically failing with GLIBCXX errors like this:
Reproducible Example
Observed this on multiple unrelated PRs, e.g. #179 and #183.
Example build link: https://github.com/rapidsai/ci-imgs/actions/runs/10686297019/job/29621203550?pr=183
Notes
N/A