Reduce ocrd/core-cuda #1041
Conversation
python3-pip can now be removed, and running pip instead of pip3 is sufficient. This also avoids the installation of python3-setuptools and python3-wheel (requirements of python3-pip). Signed-off-by: Stefan Weil <[email protected]>
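For context, a minimal sketch of why the Debian package becomes redundant (venv path hypothetical; assumes python3-venv is installed): once a virtualenv's bin directory is on PATH, pip resolves to the venv's own bootstrapped pip, so the system python3-pip and the packages it drags in are no longer needed.

```sh
# hypothetical venv location; the actual Dockerfile layout may differ
python3 -m venv /opt/venv
export PATH="/opt/venv/bin:$PATH"
pip --version                   # resolves to the venv's own pip now
apt-get remove -y python3-pip   # python3-setuptools and python3-wheel
apt-get autoremove -y           # become auto-removable as well
```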
… system-wide via ld.so.conf
… Dockerfile for that
- extra Dockerfile.cuda instead of guessing from base image name
- re-use non-CUDA ocrd/core instead of nvidia/cuda base image
- install CUDA toolkit and libraries via (micro)mamba instead of nvidia-pyindex
- CUDA libraries made available system-wide
- get cuDNN and CUDA libs from conda-forge and nvidia channels
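A hedged sketch of that (later reverted) approach; the channel names are from the commit message, but the exact package set and the /opt/conda prefix are assumptions, not the actual Dockerfile.cuda contents:

```sh
# illustrative only – install CUDA toolkit, cuDNN and nvcc via micromamba
micromamba install -y -n base \
    -c conda-forge -c nvidia \
    cudatoolkit cudnn cuda-nvcc
# make the env's libraries visible to the system's dynamic linker:
echo /opt/conda/lib > /etc/ld.so.conf.d/conda.conf && ldconfig
```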
- install CUDA libraries via nvidia-pyindex again (but not nvcc)
- ensure they can be compiled/linked against system-wide (with nvcc)
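A sketch of this final approach, under the assumption that the wheels land in site-packages/nvidia/*/lib (the usual layout of NVIDIA's pip packages; the chosen library wheels are examples):

```sh
pip install nvidia-pyindex                         # registers NVIDIA's package index
pip install nvidia-cudnn-cu11 nvidia-cublas-cu11   # libraries, but no nvcc
# expose them system-wide for linking and for the runtime loader:
site=$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')
echo "$site"/nvidia/*/lib | tr ' ' '\n' > /etc/ld.so.conf.d/python-nvidia.conf
ldconfig
```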
To elaborate...

pkg_resources vs fastentrypoints:
12e781c fixes #1050, which I had to include here.

Torch and Tensorflow compete over cuDNN:
Because of that competition, we currently have to break Kraken's explicit dependencies in order to satisfy all TF processors' implicit dependencies. (Kraken/Torch does work with the newer cuDNN version, but TF would crash on the older one.) Breaking dependencies like this is usually not a problem at runtime (only for things like …).

Mamba/conda-forge vs Nvidia base container:
As to the introduction of Mamba/Conda, and giving up the Nvidia base containers: the original idea was to avoid having multiple copies of the huge (several GB each) CUDA libraries (i.e. cudnn, cublas, cusparse, cusolver, curand, cufft, cuda_runtime, cuda_nvrtc) – once for Torch and once for Tensorflow, once for the outer and once for the inner venv. These are needed at runtime, but also at build time in ocrd_all (e.g. when compiling Detectron2) via nvcc. Torch and TF behave very differently regarding these dependencies, esp. (in their most recent versions) regarding libcudnn: …
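Concretely, "breaking Kraken's explicit dependencies" amounts to something like the following in pip terms (a sketch; the version specifier is hypothetical):

```sh
pip install kraken       # pins an older cuDNN (via its Torch dependency)
pip install tensorflow   # needs a newer libcudnn at runtime
# force the newer cuDNN; pip will flag kraken's requirement as broken,
# but Kraken/Torch still works with it, while TF would crash on the old one:
pip install --upgrade "nvidia-cudnn-cu11>=8.6"
```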
We initially used the … Since venvs cannot share their installed files (and cannot move them around or symlink them), Conda seemed to be the best choice (inter alia, it shares files across envs by hardlinking). Conda also seemed the optimal choice to properly encapsulate all system dependencies of our processor modules, replacing our poor …

In addition, once Conda takes care of Python dependencies, we would effectively be bypassing our outer venv (including the Python version itself). But even if we restrict ourselves to system dependencies only – since Torch will pull CUDA libs via pip anyway, they would still be installed twice (via conda for TF, via pip for Torch).

So I decided to downsize this a little: preinstall Conda (as Micromamba, for size and speed) into the Docker image, but only for nvcc (which you cannot get via pip). For the libraries themselves, I now preinstall what Torch would eventually pull in, but then make them available system-wide (both at runtime and for compilation) via symlinks and ld.so.conf.

Perspective:
The image size of ocrd/core-cuda is only 3.7 GB again now. And ocrd/all:maximum-cuda-git then comes out at 22 GB, which is acceptable. In the future, once ocrd_all adapts to the OCR-D network implementation and uses single-module containers instead of a single fat image, ocrd/core and ocrd/core-cuda can be used as base stages for module images. So even if we end up with many more copies of our Torch/TF and CUDA libraries across distinct Docker images/containers, due to layer sharing they will not actually require larger total storage: the gist of each module image will more or less still be the same shared ~4 GB base layer. We could still think about making more use of Conda/Mamba, ultimately replacing …

This review should have priority – next would be OCR-D/ocrd_all#362.
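A quick way to check the resulting image delivers on both counts (a sketch, not part of this PR; assumes Torch is installed in the image):

```sh
nvcc --version                               # nvcc from the micromamba env
ldconfig -p | grep -E 'libcudnn|libcublas'   # pip-installed libs, now system-wide
python3 -c 'import torch; print(torch.cuda.is_available())'
```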
Oh, and 47eff22 is needed because kraken just renamed master → main, so our URL for the blla.mlmodel resource does not work anymore (the correct model URL is already part of OCR-D/ocrd_kraken#38).
LGTM, testing now.
see here