TSAI-EMLO-4.0/03_Docker-II at main · ajithvcoder/TSAI-EMLO-4.0

You'll need to use this model and training technique (MNIST Hogwild): https://github.com/pytorch/examples/tree/main/mnist_hogwild
Set the number of processes to 2 for MNIST Hogwild.
Create three services in the Docker Compose file: train, evaluate, and infer.
Use a shared volume called mnist for sharing data between the services.
The train service should: Look for a checkpoint file in the volume. If found, resume training from that checkpoint. Train for ONLY 1 epoch and save the final checkpoint. Once done, exit.
The evaluate service should: Look for the final checkpoint file in the volume. Evaluate the model using the checkpoint and save the evaluation metrics in a json file. Once done, exit.
Share the model code by importing the model in eval.py instead of copy-pasting it.
The infer service should:
Run inference on any 5 random MNIST images and save the results (images with file name as predicted number) in the results folder in the volume. Then exit.
After running all the services, ensure that the model and the results are available in the mnist volume.
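The checkpoint lookup for the train service and the metrics file for the evaluate service could be sketched roughly as below. This is a minimal sketch using only the standard library: the file names `mnist_cnn.pt` and `eval_results.json`, and the helper names `find_checkpoint` and `save_metrics`, are assumptions for illustration and are not taken from the repository.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical file names: the requirements do not say what the files are called.
CHECKPOINT_NAME = "mnist_cnn.pt"
METRICS_NAME = "eval_results.json"


def find_checkpoint(volume_dir: Path) -> Optional[Path]:
    """Return the checkpoint path if it exists in the shared volume, else None."""
    candidate = volume_dir / CHECKPOINT_NAME
    return candidate if candidate.is_file() else None


def save_metrics(volume_dir: Path, test_loss: float, accuracy: float) -> Path:
    """Write the evaluation metrics as JSON into the shared volume."""
    volume_dir.mkdir(parents=True, exist_ok=True)
    out_path = volume_dir / METRICS_NAME
    metrics = {"test_loss": test_loss, "accuracy": accuracy}
    out_path.write_text(json.dumps(metrics, indent=2))
    return out_path
```

In train.py, a non-`None` result from `find_checkpoint` would typically be followed by something like `model.load_state_dict(torch.load(path))` before the single training epoch starts; when the result is `None`, training simply starts from scratch.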

Since we are going to use Docker Compose, it's better to create a common model folder in the root to hold the model code, and a separate folder for each service that contains that service's files.
Write train.py, eval.py, and infer.py and test them on their own first.
After the scripts are ready, we need to mount the shared volume properly. Here we have used mnist as the shared volume in Docker Compose. Make sure you name the volume explicitly; otherwise Docker Compose prefixes the volume name with the project name, which defaults to the name of the project directory.
Since each service needs both the model folder and the common mnist volume, we need to mount two volumes for each service.
Using Docker Compose, run the train service with the number of processes set to 2, then run the eval service, and then the infer service.
If the volumes are mounted properly, the output files will be available in the shared volume. You can verify this by starting a throwaway container with the volume mounted at /opt/mount/model and inspecting that location: docker run --rm -it -v mnist:/opt/mount/model alpine /bin/sh
The default code generated the inference images by class ID, so we need to change it to use index numbers in order to get 5 images as output.
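One way to read the last step: if output files are named only after the predicted digit, two samples with the same prediction overwrite each other and fewer than 5 files remain. The sketch below picks 5 distinct sample indices and builds file names that include both the index and the prediction. It is an illustration under that assumption, not the repository's code; `pick_indices`, `result_path`, and the `<index>_pred_<digit>.png` naming scheme are hypothetical.

```python
import random
from pathlib import Path
from typing import List, Optional


def pick_indices(dataset_size: int, count: int = 5,
                 seed: Optional[int] = None) -> List[int]:
    """Pick `count` distinct random sample indices from the dataset."""
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), count)


def result_path(results_dir: Path, sample_index: int, predicted: int) -> Path:
    """Build an output path that stays unique even when predictions repeat."""
    # The sample index prefix prevents collisions between images
    # that share the same predicted digit.
    return results_dir / f"{sample_index}_pred_{predicted}.png"
```

For example, samples 3 and 9 that are both predicted as 7 would be written to `3_pred_7.png` and `9_pred_7.png` rather than to a single `7.png`.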

Name the volume explicitly and mount it properly.
We can mount more than one volume to a service, and each service's Dockerfile can be placed in a separate folder for better readability and management.