
Reduce ocrd/core-cuda #1041

Merged (19 commits, Jun 7, 2023)

Conversation

@bertsky (Collaborator) commented Apr 1, 2023

see here

stweil and others added 5 commits March 15, 2023 20:34
python3-pip can now be removed, and running pip instead of pip3 is sufficient.

This also avoids the installation of python3-setuptools and python3-wheel
(requirements of python3-pip).

Signed-off-by: Stefan Weil <[email protected]>
@bertsky bertsky requested a review from kba April 1, 2023 09:16
bertsky added 13 commits April 15, 2023 22:48
- extra Dockerfile.cuda instead of guessing from base image name
- re-use non-CUDA ocrd/core instead of nvidia/cuda base image
- install CUDA toolkit and libraries via (micro)mamba
  instead of nvidia-pyindex CUDA libraries made available system-wide
- get cuDNN and CUDA libs from conda-forge and nvidia channels
- install CUDA libraries via nvidia-pyindex again (but not nvcc)
- ensure they can be compiled/linked against system-wide (with nvcc)
…to release-2.36.0"

This reverts commit 9a54ef6, reversing
changes made to a78d4c5.
@bertsky (Collaborator, Author) commented Jun 2, 2023

To elaborate...

pkg_resources vs fastentrypoints

12e781c fixes #1050, which I had to include here because Torch and Tensorflow compete over cuDNN. So we currently have to break Kraken's explicit dependencies in order to satisfy all TF processors' implicit dependencies. (Kraken/Torch does work with the newer version, but TF would crash on the older one.) Breaking dependencies is usually not a problem at runtime (only for tools like pip check), but as #1050 documents, the fastentrypoints enforce dependencies at startup, which does break our setup (in exchange for a small gain in speed). If there is no (prospect of a) conflict in the future, we can still undo the revert.
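The clash can be illustrated with a small, hedged sketch (the `satisfies` helper and the pin on cuDNN are made up for illustration; the point is that an entry-point script validating declared requirements at startup will refuse to run in a deliberately "inconsistent" environment, while a direct-import script never performs that check):

```python
# Hypothetical illustration, not OCR-D code: a startup-time dependency
# check vs. a deliberately broken pin. We install the cuDNN that
# Tensorflow needs, although the (hypothetical) explicit pin is older.

def satisfies(installed: str, op: str, pinned: str) -> bool:
    """Naive comparison for dot-separated numeric versions only."""
    key = lambda v: tuple(int(p) for p in v.split("."))
    i, p = key(installed), key(pinned)
    return {"==": i == p, ">=": i >= p, "<=": i <= p}[op]

installed = "8.6.0.163"          # required by TF (crashes on older)
pin = ("==", "8.5.0.96")         # hypothetical explicit pin via Torch/Kraken

if not satisfies(installed, *pin):
    # A require()-style check aborts here with a VersionConflict,
    # even though Kraken/Torch actually tolerates the newer library.
    print("VersionConflict: cudnn 8.6.0.163 vs ==8.5.0.96")
```

A direct-import entry point skips the check entirely, so the same environment keeps working at runtime.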

Mamba/conda-forge vs Nvidia base container

As to the introduction of Mamba/Conda and giving up the Nvidia base containers: the original idea was to avoid having multiple copies of the huge (several GB each) CUDA libraries (i.e. cudnn, cublas, cusparse, cusolver, curand, cufft, cuda_runtime, cuda_nvrtc) – for Torch and Tensorflow, for the outer and the inner venv. These are needed at runtime, but also at build time in ocrd_all (e.g. compiling Detectron2 with nvcc).

Torch and TF behave very differently regarding these dependencies, esp. (in the most recent versions) regarding libcudnn:

                Torch                           Tensorflow
dependency      explicit, Python                implicit, system
runtime model   dynamic linking                 dynamic loading
cuDNN version   8.5.0.96 (but tolerates newer)  8.6.0.163 (otherwise crashes)

We initially used the nvidia/cuda:* images as the base stage to get all the system dependencies. But not only does this give you multiple copies of the libraries – recently, Nvidia also stopped building their images and CUDA repositories in a way that ensures (recent versions of) Tensorflow can be installed on top (with matching CUDA and cuDNN). (This has been a problem all along, but it keeps getting more difficult. You could use their nvidia/tensorflow build, but we need to support Torch as well. Besides, Tensorflow's official documentation now even requires using Conda for the CUDA stuff...)

Since venvs cannot share their installed files (and cannot move them around or symlink them), Conda seemed to be the best choice (inter alia it shares files across envs by hardlinking).
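The hard-link sharing can be demonstrated in miniature (this is not conda itself, just the underlying filesystem mechanism; all paths are made up): two "environments" reference one inode, so the payload is stored once.

```python
# Minimal illustration of hard-link sharing: a shared package cache plus
# two environments, all pointing at the same inode on disk.
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    cache = os.path.join(root, "pkgs", "libcudnn.so.8")  # shared cache entry
    os.makedirs(os.path.dirname(cache))
    with open(cache, "wb") as f:
        f.write(b"pretend this is a multi-GB library")

    # "Install" into two environments by hard-linking, conda-style:
    for env in ("env-torch", "env-tf"):
        dst = os.path.join(root, env, "lib", "libcudnn.so.8")
        os.makedirs(os.path.dirname(dst))
        os.link(cache, dst)

    # cache + two env entries = 3 names, 1 inode, 1 copy on disk:
    print(os.stat(cache).st_nlink)  # → 3
```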

Conda also seemed the optimal choice to properly encapsulate all system dependencies of our processor modules, replacing our poor deps-ubuntu mechanism. But that of course would be a much bigger change, affecting many repositories at once, also native installation and documentation.

In addition, once Conda takes care of Python dependencies, we would effectively be bypassing our outer venv (including the Python version itself). But even if we restrict ourselves to system dependencies only – since Torch will pull CUDA libs via pip anyway, they would still be installed twice (via conda for TF, via pip for Torch).

So I decided to downsize this a little: preinstall Conda (as Micromamba for size and speed) into the Docker image, but only for nvcc (which you cannot get via pip). For the libraries themselves, I now preinstall what Torch would eventually pull in, but then make them available system-wide (both as runtime and for compilation) via symlinks and ld.so.conf.

Perspective

Image size of ocrd/core-cuda is only 3.7 GB again now. And ocrd/all:maximum-cuda-git then comes out with 22 GB, which is acceptable.

In the future, once ocrd_all adapts to the OCR-D network implementation and uses single-module containers instead of a single fat image, ocrd/core and ocrd/core-cuda can be used as base stages for module images. So even if we end up with many more copies of our Torch/TF and CUDA libraries across distinct Docker images/containers, layer sharing means they will not actually require more total storage. (That is, the gist of each module image will more or less still be the same shared ~4 GB base layer.)

We could still think about making more use of Conda/Mamba, ultimately replacing deps-ubuntu. But let's take it more slowly!

This review should have priority – next would be OCR-D/ocrd_all#362.

@bertsky (Collaborator, Author) commented Jun 2, 2023

Oh, and 47eff22 is needed because Kraken just renamed master → main, so our URL for the blla.mlmodel resource does not work anymore (the correct model URL is already part of OCR-D/ocrd_kraken#38).

@kba (Member) left a comment:

LGTM, testing now.

@bertsky bertsky linked an issue Jun 4, 2023 that may be closed by this pull request
@kba kba merged commit bac1a45 into OCR-D:master Jun 7, 2023
May close: Fix CUDA Docker image
3 participants