
Reduce ocrd/core-cuda #1041

Merged (19 commits, Jun 7, 2023)

Conversation

@bertsky (Collaborator) commented Apr 1, 2023

see here

stweil and others added 5 commits March 15, 2023 20:34
python3-pip can now be removed, and running pip instead of pip3 is sufficient.

This also avoids the installation of python3-setuptools and python3-wheel
(requirements of python3-pip).

Signed-off-by: Stefan Weil <[email protected]>
@bertsky bertsky requested a review from kba April 1, 2023 09:16
bertsky added 13 commits April 15, 2023 22:48
- extra Dockerfile.cuda instead of guessing from base image name
- re-use non-CUDA ocrd/core instead of nvidia/cuda base image
- install CUDA toolkit and libraries via (micro)mamba
  instead of nvidia-pyindex CUDA libraries made available system-wide
- get cuDNN and CUDA libs from conda-forge and nvidia channels
- install CUDA libraries via nvidia-pyindex again (but not nvcc)
- ensure they can be compiled/linked against system-wide (with nvcc)
…to release-2.36.0"

This reverts commit 9a54ef6, reversing
changes made to a78d4c5.
@bertsky (Collaborator, Author) commented Jun 2, 2023

To elaborate...

pkg_resources vs fastentrypoints

12e781c fixes #1050, which I had to include here because Torch and Tensorflow compete over cuDNN. So we currently have to break Kraken's explicit dependencies in order to satisfy all TF processors' implicit dependencies. (Kraken/Torch does work with the newer version, but TF would crash on the older one.) Breaking dependencies is usually not a problem at runtime (only for tools like pip check), but as #1050 documents, the fastentrypoints enforce dependencies at startup, which does break our setup (in exchange for a small gain in speed). If there is no (prospect of a) conflict in the future, we can still undo the revert.
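The clash can be illustrated with a small, hedged sketch (the `satisfies` helper and the pin on cuDNN are made up for illustration; the point is that an entry-point script validating declared requirements at startup will refuse to run in a deliberately "inconsistent" environment, while a direct-import script never performs that check):

```python
# Hypothetical illustration, not OCR-D code: a startup-time dependency
# check vs. a deliberately broken pin. We install the cuDNN that
# Tensorflow needs, although the (hypothetical) explicit pin is older.

def satisfies(installed: str, op: str, pinned: str) -> bool:
    """Naive comparison for dot-separated numeric versions only."""
    key = lambda v: tuple(int(p) for p in v.split("."))
    i, p = key(installed), key(pinned)
    return {"==": i == p, ">=": i >= p, "<=": i <= p}[op]

installed = "8.6.0.163"          # required by TF (crashes on older)
pin = ("==", "8.5.0.96")         # hypothetical explicit pin via Torch/Kraken

if not satisfies(installed, *pin):
    # A require()-style check aborts here with a VersionConflict,
    # even though Kraken/Torch actually tolerates the newer library.
    print("VersionConflict: cudnn 8.6.0.163 vs ==8.5.0.96")
```

A direct-import entry point skips the check entirely, so the same environment keeps working at runtime.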

Mamba/conda-forge vs Nvidia base container

As to the introduction of Mamba/Conda and giving up the Nvidia base containers: the original idea was to avoid having multiple copies of the huge (several GB each) CUDA libraries (i.e. cudnn, cublas, cusparse, cusolver, curand, cufft, cuda_runtime, cuda_nvrtc) – for Torch and Tensorflow, for the outer and the inner venv. These are needed at runtime, but also at build time in ocrd_all (e.g. compiling Detectron2 with nvcc).

Torch and TF behave very differently regarding these dependencies, esp. (in the most recent versions) regarding libcudnn:

                Torch                           Tensorflow
dependency      explicit, Python                implicit, system
runtime model   dynamic linking                 dynamic loading
cuDNN version   8.5.0.96 (but tolerates newer)  8.6.0.163 (otherwise crashes)

We initially used the nvidia/cuda:* images as the base stage to get all the system dependencies. But not only does this give you multiple copies of the libraries – recently, Nvidia also stopped building their images and CUDA repositories in a way that ensures (recent versions of) Tensorflow can be installed on top (with matching CUDA and cuDNN). (This has been a problem all along, but it keeps getting more difficult. You could use their nvidia/tensorflow build, but we need to support Torch as well. Besides, Tensorflow's official documentation now even requires using Conda for the CUDA stuff...)

Since venvs cannot share their installed files (and cannot move them around or symlink them), Conda seemed to be the best choice (inter alia it shares files across envs by hardlinking).
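The hard-link sharing can be demonstrated in miniature (this is not conda itself, just the underlying filesystem mechanism; all paths are made up): two "environments" reference one inode, so the payload is stored once.

```python
# Minimal illustration of hard-link sharing: a shared package cache plus
# two environments, all pointing at the same inode on disk.
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    cache = os.path.join(root, "pkgs", "libcudnn.so.8")  # shared cache entry
    os.makedirs(os.path.dirname(cache))
    with open(cache, "wb") as f:
        f.write(b"pretend this is a multi-GB library")

    # "Install" into two environments by hard-linking, conda-style:
    for env in ("env-torch", "env-tf"):
        dst = os.path.join(root, env, "lib", "libcudnn.so.8")
        os.makedirs(os.path.dirname(dst))
        os.link(cache, dst)

    # cache + two env entries = 3 names, 1 inode, 1 copy on disk:
    print(os.stat(cache).st_nlink)  # → 3
```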

Conda also seemed the optimal choice to properly encapsulate all system dependencies of our processor modules, replacing our poor deps-ubuntu mechanism. But that of course would be a much bigger change, affecting many repositories at once, also native installation and documentation.

In addition, once Conda takes care of Python dependencies, we would effectively be bypassing our outer venv (including the Python version itself). But even if we restrict ourselves to system dependencies only – since Torch will pull CUDA libs via pip anyway, they would still be installed twice (via conda for TF, via pip for Torch).

So I decided to downsize this a little: preinstall Conda (as Micromamba for size and speed) into the Docker image, but only for nvcc (which you cannot get via pip). For the libraries themselves, I now preinstall what Torch would eventually pull in, but then make them available system-wide (both as runtime and for compilation) via symlinks and ld.so.conf.

Perspective

Image size of ocrd/core-cuda is only 3.7 GB again now. And ocrd/all:maximum-cuda-git then comes out with 22 GB, which is acceptable.

In the future, once ocrd_all adapts to the OCR-D network implementation and uses single-module containers instead of a single fat image, ocrd/core and ocrd/core-cuda can be used as base stages for module images. So even if we end up with many more copies of our Torch/TF and CUDA libraries across distinct Docker images/containers, layer sharing means they will not actually require more total storage. (That is, the gist of each module image will more or less still be the same shared ~4 GB base layer.)

We could still think about making more use of Conda/Mamba, ultimately replacing deps-ubuntu. But let's take it more slowly!

This review should have priority – next would be OCR-D/ocrd_all#362.

@bertsky (Collaborator, Author) commented Jun 2, 2023

Oh, and 47eff22 is needed because Kraken just renamed master → main, so our URL for the blla.mlmodel resource does not work anymore (the correct model URL is already part of OCR-D/ocrd_kraken#38).

@kba (Member) left a comment:

LGTM, testing now.

@bertsky bertsky linked an issue Jun 4, 2023 that may be closed by this pull request
@kba kba merged commit bac1a45 into OCR-D:master Jun 7, 2023
May close: Fix CUDA Docker image
3 participants