- Date: Sunday, Dec 3, 2023
- Author: ChuNan Liu
- Email: [email protected]
You can pull images from the docker hub page at: https://hub.docker.com/r/biochunan/esmfold-image
Use the Dockerfiles provided in `./Dockerfiles` to build the desired images.
The following commands are provided in the script `./build-image.sh`:
# build image, add non-root user
docker build --no-cache -t $USER/esmfold:nonroot-devel -f Dockerfiles/Dockerfile.nonroot .
# build runtime image, add non-root user
docker build --no-cache -t $USER/esmfold:nonroot-runtime -f Dockerfiles/Dockerfile.nonroot.runtime .
# build image, root user only
docker build --no-cache -t $USER/esmfold:root-devel -f Dockerfiles/Dockerfile.root .
# build runtime image, root user only
docker build --no-cache -t $USER/esmfold:root-runtime -f Dockerfiles/Dockerfile.root.runtime .
- `-t $USER/esmfold:root-devel`: tag the image
  - `$USER`: your username
  - `esmfold`: image name
  - `root-devel`: image tag, see below for details
- `root`/`non-root`: the image runs as `root`, or as a non-root user (`vscode`, with `USER_UID` and `USER_GID` both set to `1000`).
- `devel`/`runtime`: a `devel` image includes the model checkpoints and the model itself; a `runtime` image does not, meaning the checkpoints need to be mounted at run time.
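The naming scheme above can be sketched as a small dry run that prints the four build commands without executing them. The helper names `dockerfile_for` and `image_tag` are hypothetical and are not part of `./build-image.sh`; the fallback name `user` is only an assumption for shells where `$USER` is unset.

```shell
#!/bin/sh
# Dry-run sketch: compose each image tag and pick the matching Dockerfile.
# dockerfile_for and image_tag are hypothetical helpers for illustration.

# dockerfile_for <root|nonroot> <devel|runtime>
dockerfile_for() {
    case "$2" in
        devel)   echo "Dockerfiles/Dockerfile.$1" ;;
        runtime) echo "Dockerfiles/Dockerfile.$1.runtime" ;;
        *)       echo "unknown flavour: $2" >&2; return 1 ;;
    esac
}

# image_tag <root|nonroot> <devel|runtime>
# Falls back to "user" when $USER is unset (assumption for illustration).
image_tag() {
    echo "${USER:-user}/esmfold:$1-$2"
}

# Print (do not run) one build command per user kind and flavour.
for user_kind in nonroot root; do
    for flavour in devel runtime; do
        echo "docker build --no-cache -t $(image_tag "$user_kind" "$flavour")" \
             "-f $(dockerfile_for "$user_kind" "$flavour") ."
    done
done
```

Piping the printed lines to `sh` would run the builds; keeping the sketch as a dry run makes it easy to check the tag and Dockerfile pairing first.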
This image is based on the nvidia/cuda:11.3.1-devel-ubuntu20.04 image.
You may have noticed that some packages installed in the Dockerfile are downloaded using `gdown`, a Python package that downloads files from Google Drive. These files are:
- openfold.tar.gz: the official release of OpenFold
  - My modifications: I commented out the flash-attn package in the default environment.yml file because it is not compatible with the latest version of ESM.
- esm-main.tar.gz: the official release of ESM.
- esm2_t36_3B_UR50D.pt: the pre-trained ESM2 model.
- esm2_t36_3B_UR50D-contact-regression.pt: the pre-trained ESM2 model with contact regression.
- esmfold_3B_v1.pt: the pre-trained ESMFold model.
Although the three `.pt` checkpoint files would otherwise be downloaded the first time the container runs, it is better to include them in the image so they are not downloaded again every time a container is started.
The Google Drive folder for the above files is esmfold.
If you use one of the runtime Dockerfiles (`Dockerfile.root.runtime` or `Dockerfile.nonroot.runtime`), you need to mount the checkpoint files into the container at run time. To download the checkpoint files, you can run the following commands:
cd /path/to/host/checkpoints
# esm2_t36_3B_UR50D-contact-regression (6.7KB)
gdown --fuzzy -O esm2_t36_3B_UR50D-contact-regression.pt 1lW8CVTSzX8bwLxbM8lAu_qXQkrPZuSxA
# esm2_t36_3B_UR50D (5.3GB)
curl https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt -o esm2_t36_3B_UR50D.pt
# esmfold_3B_v1 (2.6GB)
curl https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt -o esmfold_3B_v1.pt
The model links are taken from the esm repository.
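Before mounting a host directory into a runtime image, it can help to confirm that all three checkpoint files are present. The following is a minimal sketch; `check_checkpoints` is a hypothetical helper, not something shipped with the repository, and it only checks that each file exists and is non-empty, not that the download is complete or uncorrupted.

```shell
#!/bin/sh
# Sketch: verify that a host directory holds the three checkpoint files.
# check_checkpoints is a hypothetical helper for illustration.

# check_checkpoints <dir>
# Returns 0 if all three files exist and are non-empty, 1 otherwise.
check_checkpoints() {
    dir="$1"
    missing=0
    for f in esm2_t36_3B_UR50D.pt \
             esm2_t36_3B_UR50D-contact-regression.pt \
             esmfold_3B_v1.pt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the path is a placeholder):
# check_checkpoints /path/to/host/checkpoints && echo "ready to mount"
```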
Example scripts are provided in `./example/scripts` to run ESMFold with the built image.
NOTICE: these scripts assume your current working directory is the root of the repository.
The default entrypoint for the image, as specified in the Dockerfile, is
ENTRYPOINT ["zsh", "run-esm-fold.sh"]
content of `run-esm-fold.sh`:
#!/bin/zsh
# init conda
source $HOME/.zshrc
# activate py39-esmfold
conda activate py39-esmfold
# run esm-fold
esm-fold $@
Run the following command to see the help information of `esm-fold`:
docker run --rm $USER/esmfold:root-devel --help
stdout:
usage: esm-fold [-h] -i FASTA -o PDB [-m MODEL_DIR]
[--num-recycles NUM_RECYCLES]
[--max-tokens-per-batch MAX_TOKENS_PER_BATCH]
[--chunk-size CHUNK_SIZE] [--cpu-only] [--cpu-offload]
optional arguments:
-h, --help show this help message and exit
-i FASTA, --fasta FASTA
Path to input FASTA file
-o PDB, --pdb PDB Path to output PDB directory
-m MODEL_DIR, --model-dir MODEL_DIR
Parent path to the pre-trained ESM data directory.
--num-recycles NUM_RECYCLES
Number of recycles to run. Defaults to number used in
training (4).
--max-tokens-per-batch MAX_TOKENS_PER_BATCH
Maximum number of tokens per gpu forward-pass. This
will group shorter sequences together for batched
prediction. Lowering this can help with out of memory
issues, if these occur on short sequences.
--chunk-size CHUNK_SIZE
Chunks axial attention computation to reduce memory
usage from O(L^2) to O(L). Equivalent to running a for
loop over chunks of of each dimension. Lower values
will result in lower memory usage at the cost of
speed. Recommended values: 128, 64, 32. Default: None.
--cpu-only CPU only
--cpu-offload Enable CPU offloading
If GPUs are available:
cd /path/to/esmfold-docker-image # root of the repository
mkdir -p ./example/{input,output,logs}
########################
# run as root #
########################
# with a devel image, the checkpoints are already in the image
docker run --rm --gpus all \
-v ./example/input:/root/input \
-v ./example/output:/root/output \
$USER/esmfold:root-devel \
-i /root/input/1a2y-HLC.fasta \
-o /root/output \
> ./example/logs/pred-root-devel.log 2>./example/logs/pred-root-devel.err
# with a runtime image, mount the checkpoints from the host machine, e.g.
trainModelsDir=/mnt/Data/trained_models/ESM2
docker run --rm --gpus all \
-v ./example/input:/root/input \
-v ./example/output:/root/output \
-v "$trainModelsDir":/root/.cache/torch/hub/checkpoints \
$USER/esmfold:root-runtime \
-i /root/input/1a2y-HLC.fasta \
-o /root/output \
> ./example/logs/pred-root-runtime.log 2>./example/logs/pred-root-runtime.err
########################
# run as non-root user #
########################
# non-root user `vscode` with userID:groupID=1000:1000
# with a devel image, the checkpoints are already in the image
docker run --rm --gpus all \
-v ./example/input:/home/vscode/input \
-v ./example/output:/home/vscode/output \
$USER/esmfold:nonroot-devel \
-i /home/vscode/input/1a2y-HLC.fasta \
-o /home/vscode/output \
> ./example/logs/pred.log 2>./example/logs/pred.err
# with a runtime image, the checkpoints need to be mounted
docker run --rm --gpus all \
-v ./example/input:/home/vscode/input \
-v ./example/output:/home/vscode/output \
-v /path/to/host/checkpoints:/home/vscode/.cache/torch/hub/checkpoints \
$USER/esmfold:nonroot-runtime \
-i /home/vscode/input/1a2y-HLC.fasta \
-o /home/vscode/output \
> ./example/logs/pred.log 2>./example/logs/pred.err
If no GPUs are available, add the `--cpu-only` flag:
mkdir -p ./example/{input,output,logs}
docker run --rm \
-v ./example/input:/home/vscode/input \
-v ./example/output:/home/vscode/output \
$USER/esmfold:nonroot-devel \
--cpu-only \
-i /home/vscode/input/1a2y-HLC.fasta \
-o /home/vscode/output \
> ./example/logs/pred.log 2>./example/logs/pred.err
# with a runtime image, remember to mount the checkpoint files by adding
# -v /path/to/host/checkpoints:/home/vscode/.cache/torch/hub/checkpoints
- `-i /input/1a2y-HLC.fasta`: input FASTA file
- `-o /output`: path to the output directory for predicted structures
- `> ./example/logs/pred.log 2>./example/logs/pred.err`: redirect stdout and stderr to log files
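Choosing between the GPU and CPU invocations can be sketched as a dry run that prints the command it would use. This is an illustration under assumptions: `has_gpu` and `run_cmd` are hypothetical helpers, a working `nvidia-smi -L` on the host is treated as "GPU available" (it does not prove that Docker itself can reach the GPU), `$PWD` is used so the mounts are absolute paths, and `user` is a fallback for an unset `$USER`.

```shell
#!/bin/sh
# Dry-run sketch: pick docker and esm-fold flags based on GPU availability.
# has_gpu and run_cmd are hypothetical helpers for illustration.

# Assumption: a successful `nvidia-smi -L` means a GPU is usable on the host.
has_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1
}

# run_cmd <yes|no> prints the docker command for the GPU ("yes") or CPU case.
run_cmd() {
    if [ "$1" = "yes" ]; then
        docker_flags="--rm --gpus all"; esm_flags=""
    else
        docker_flags="--rm"; esm_flags="--cpu-only "
    fi
    echo "docker run $docker_flags" \
         "-v $PWD/example/input:/home/vscode/input" \
         "-v $PWD/example/output:/home/vscode/output" \
         "${USER:-user}/esmfold:nonroot-devel" \
         "${esm_flags}-i /home/vscode/input/1a2y-HLC.fasta -o /home/vscode/output"
}

# Print the command that matches this host.
if has_gpu; then run_cmd yes; else run_cmd no; fi
```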
For other ESMFold flags, refer to the documentation section of the ESMFold repo:
- `--num-recycles NUM_RECYCLES`: Number of recycles to run. Defaults to the number used in training (4).
- `--max-tokens-per-batch MAX_TOKENS_PER_BATCH`: Maximum number of tokens per GPU forward pass. This will group shorter sequences together for batched prediction. Lowering this can help with out-of-memory issues, if these occur on short sequences.
- `--chunk-size CHUNK_SIZE`: Chunks axial attention computation to reduce memory usage from O(L^2) to O(L). Equivalent to running a for loop over chunks of each dimension. Lower values will result in lower memory usage at the cost of speed. Recommended values: 128, 64, 32. Default: None.
- `--cpu-only`: CPU only
- `--cpu-offload`: Enable CPU offloading
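Since lower `--chunk-size` values trade speed for memory, one possible workflow is to try the recommended values from largest to smallest and keep the first one that succeeds. The sketch below only illustrates that loop: `predict` is a stand-in that pretends sizes above 64 fail, and `pick_chunk_size` is a hypothetical helper. In practice `predict` would wrap the `docker run` command and append `--chunk-size "$1"` to the `esm-fold` arguments; note that this loop retries on any non-zero exit, not only on out-of-memory errors.

```shell
#!/bin/sh
# Sketch: retry a prediction with progressively smaller --chunk-size values.
# predict and pick_chunk_size are hypothetical names for illustration.

# Stand-in for the real prediction command.
# Pretends that chunk sizes above 64 run out of memory.
predict() {
    [ "$1" -le 64 ]
}

# Try the recommended values in order (128, 64, 32) and print the first
# size for which predict succeeds; return 1 if none of them works.
pick_chunk_size() {
    for size in 128 64 32; do
        if predict "$size"; then
            echo "$size"
            return 0
        fi
        echo "chunk size $size failed, trying a smaller one" >&2
    done
    return 1
}

pick_chunk_size
```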
If you want to override the entrypoint, pass `--entrypoint` to the `docker run` command before the image name; anything after the image name is passed to the new entrypoint as arguments:
docker run --rm --gpus all --entrypoint "/bin/zsh" $USER/esmfold:nonroot-devel -c "echo 'hello world'"
docker run --rm --gpus all --entrypoint "nvidia-smi" $USER/esmfold:nonroot-devel
The image is also available on Docker Hub: biochunan/esmfold-image