Taking machine learning models to production, then maintaining & monitoring them.
You should have Microsoft VSCode and Docker Desktop installed and running on your local machine. To install Docker Desktop, follow the Docker installation guidelines for your operating system.
- MLOps
- Table of contents
- MLOps Workflow
- Coding Guidelines
- Git basics
- Pipelines in Machine Learning
- DVC - Data Version Control
- Automating pipelines with DVC 🛠️
- Adding stages to DVC pipeline - dvc stage add
- Running/Reproducing pipelines - dvc repro
- Versioning data and models with DVC
- DVC commands - dvc add, dvc push, dvc pull
- Tracking changes & switching between versions - dvc status & dvc checkout
- Data access in DVC - dvc list, dvc get, dvc import
- Hydra
- Weights and Biases
Steps involved in successfully creating an MLOps project.
Data management and analysis.
Solution development & testing.
Deployment & Serving.
Monitoring & maintenance.
Guidelines for writing code for the project.
Organize code into clean, reusable units 🔧 - functions, classes & modules. 💡
Use git for code versioning.
Follow style guidelines: write comments, docstrings, type annotations.
Keep requirements.txt and Dockerfile updated.
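As a small illustration of the guidelines above, here is a minimal sketch of a reusable function that carries type annotations, a docstring and a comment. The function name `split_indices` and its parameters are hypothetical, chosen only for this example.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int, test_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Split sample indices into train and test subsets.

    Args:
        n_samples: Total number of samples.
        test_ratio: Fraction of samples assigned to the test subset.
        seed: Seed that makes the shuffle reproducible.

    Returns:
        A tuple (train_indices, test_indices).
    """
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    indices = list(range(n_samples))
    # A local Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_ratio=0.2)
print(len(train_idx), len(test_idx))
```

Because the function is self-contained and documented, it can live in a module and be imported from both notebooks and scripts.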
You should have Git set up and running on your local machine.
Configuring user information used across all local repositories
git config --global user.name "[firstname lastname]"
git config --global user.email "[valid-email]"
Initializing and cloning repositories
git init
git clone [url]
Check current status
git status
Add files for versioning and tracking
git add <f_name>
Commit staged content
git commit -m "[description]"
List all branches. A * will appear next to the currently active branch.
git branch
Create a new branch and switch to it, checking it out into your working directory (omit -b to switch to an existing branch).
git checkout -b "[branch-name]"
Add a Git URL as a remote alias
git remote add "[alias]" <URL>
Fetch down all the branches from that Git remote.
git fetch "[alias]"
Merge a remote branch into your current branch to bring it up to date.
git merge "[alias]/[branch]"
Transmit local branch commits to the remote repository branch
git push "[alias]" "[branch]"
Fetch and merge any commits from the tracking remote branch
git pull
We are going to use the PyScaffold Cookiecutter Data Science project template.
Install PyScaffold with its data science extension, which provides the --dsproject option used below
pip install pyscaffoldext-dsproject
Install pre-commit
pip install pre-commit
Initialize an empty project with the cookiecutter data science project structure
putup --dsproject <Name of your project>
├── AUTHORS.md <- List of developers and maintainers.
├── CHANGELOG.md <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md <- Guidelines for contributing to this project.
├── Dockerfile <- Build a docker container with `docker build .`.
├── LICENSE.txt <- License as chosen on the command-line.
├── README.md <- The top-level README for developers.
├── configs <- Directory for configurations of model & application.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
├── docs <- Directory for Sphinx documentation in rst or md.
├── environment.yml <- The conda environment file for reproducibility.
├── models <- Trained and serialized models, model predictions,
│ or model summaries.
├── notebooks <- Jupyter notebooks. Naming convention is a number (for
│ ordering), the creator's initials and a description,
│ e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml <- Build configuration. Don't change! Use `pip install -e .`
│ to install for development or to build `tox -e build`.
├── references <- Data dictionaries, manuals, and all other materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated plots and figures for reports.
├── scripts <- Analysis and production scripts which import the
│ actual PYTHON_PKG, e.g. train_model.
├── setup.cfg <- Declarative configuration of your project.
├── setup.py <- [DEPRECATED] Use `python setup.py develop` to install for
│ development or `python setup.py bdist_wheel` to build.
├── src
│ └── classify_covid <- Actual Python package where the main functionality goes.
├── tests <- Unit tests which can be run with `pytest`.
├── .coveragerc <- Configuration for coverage reports of unit tests.
├── .isort.cfg <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
A pipeline is a series of successive, and sometimes parallel, steps in which we process data:
1. Extracting, transforming and loading data.
2. Creating a test/train split.
3. Model training.
4. Model evaluation.
Example: Suppose that you have two parameter settings, epochs = 10 and epochs = 20. You would probably want to run steps 1 & 2 only once for both settings, and steps 3 & 4 once per setting (twice in total). Having a pipeline makes this easy.
Simple ML pipeline 🔧
flowchart LR
    A(Load Data) --> B(Featurize)
    B --> C{Data Split}
    C -->|Train Data| D[Train Model]
    C -->|Test Data| E[Evaluate Model]
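The epoch example above can be sketched in plain Python. This is only an illustration: the stage functions are hypothetical stand-ins that return toy values, and `functools.lru_cache` plays the role that a pipeline tool would play in skipping work that has already been done.

```python
from functools import lru_cache

# Counts how many times each (hypothetical) stage actually runs.
calls = {"load": 0, "split": 0, "train": 0, "evaluate": 0}


@lru_cache(maxsize=None)
def load_data() -> tuple:
    calls["load"] += 1
    return tuple(range(10))  # toy dataset


@lru_cache(maxsize=None)
def split_data() -> tuple:
    calls["split"] += 1
    data = load_data()
    return data[:8], data[8:]  # (train, test)


def train_model(train_data: tuple, epochs: int) -> dict:
    calls["train"] += 1
    return {"epochs": epochs, "n_train": len(train_data)}


def evaluate_model(model: dict, test_data: tuple) -> dict:
    calls["evaluate"] += 1
    return {"epochs": model["epochs"], "n_test": len(test_data)}


def run_pipeline(epochs: int) -> dict:
    train_data, test_data = split_data()  # cached after the first call
    model = train_model(train_data, epochs)
    return evaluate_model(model, test_data)


results = [run_pipeline(epochs) for epochs in (10, 20)]
print(calls)  # {'load': 1, 'split': 1, 'train': 2, 'evaluate': 2}
```

Loading and splitting run once, while training and evaluation run once per epoch setting.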
To make a project production-ready, we need to convert Jupyter notebooks into .py modules. This makes the code easier to version and makes it possible to automate and build pipelines.
- Keep parameters in a configuration file (config.yaml). We will use Hydra.cc to load these configuration files. For example:
params:
  batch_size: 32
  learning_rate: 0.01
  training_epoch: 30
  num_gpus: 4
- Keep reusable code in .py modules. For example, create visualize.py to hold the visualization tasks.
- Create a module for each computation task (stage).
- Write modules so that they run in both modes: Jupyter & terminal.
Converting dataset loading in a Jupyter notebook to a Python script. We will use hydra.cc to load the configuration later. For now, we are using the yaml library.
This example uses argparse's ArgumentParser to pass arguments to the module from the terminal.
dataset_load.py
import argparse
from typing import Text

import yaml


def data_load(config_path: Text) -> None:
    # Read the YAML configuration file
    with open(config_path) as conf_file:
        cfg = yaml.safe_load(conf_file)
    raw_data_path = cfg['data_load']['raw_data.path']
    ...
    ...
    # Save the processed dataset to the path given in the configuration
    data.to_csv(cfg['dataset_processed_path'])


if __name__ == '__main__':
    # Read the path of the configuration file from the command line
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument('--config', dest='config', required=True)
    args = args_parser.parse_args()
    data_load(config_path=args.config)
- To import this function into a Jupyter notebook, use
from dataset_load import data_load
and pass the path of the configuration file to the function.
- To run it from the terminal, change directory to the project root folder and execute:
python -m src.stages.data_load --config=params.yaml
To build an ML pipeline, create a module for each stage as above. Then run those modules sequentially.
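Running the stage modules one after another could be sketched as follows. This is a minimal illustration, not a replacement for a pipeline tool: the helper names `build_stage_command` and `run_stages` are hypothetical, and the `runner` argument exists only so the loop can be tried without real stage modules.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def build_stage_command(module: str, config_path: str) -> List[str]:
    """Build the `python -m <module> --config=<path>` command for one stage."""
    return [sys.executable, "-m", module, f"--config={config_path}"]


def run_stages(
    modules: Sequence[str],
    config_path: str,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each stage module in order, stopping at the first failure."""
    for module in modules:
        command = build_stage_command(module, config_path)
        print("Running:", " ".join(command))
        # check=True makes subprocess.run raise if the stage exits with an error,
        # so later stages never run on incomplete outputs.
        runner(command, check=True)


print(build_stage_command("src.stages.data_load", "params.yaml")[1:])
```

From the project root, a call such as `run_stages(["src.stages.data_load"], "params.yaml")` would execute the stage shown earlier; any further module names are up to your project.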
DVC - Data Version Control
DVC is an open-source version control system for ML projects. It will be used for:
- Experiment management - creating pipelines, tracking metrics, parameters and dependencies.
- Data versioning - versioning data the same way we version code with Git.
Installing DVC -
pip install "dvc[all]"
- It is good to integrate logging.
Initializing DVC -
dvc init
This creates a .dvc folder containing all information about the directory. You must add DVC under Git control:
git add .
git commit -m "Init DVC"
- Running stages in sequence manually can be cumbersome and time-consuming. DVC helps organize stages into a pipeline.
- Stages might depend on parameters, outputs of other stages, and other dependencies. DVC tracks all of them and re-runs only the stages affected by a detected change. (💡Remember the example with 2 possible epoch values?)
DVC builds a dependency graph (a Directed Acyclic Graph) of stages to determine the order of execution. It saves the stages in a dvc.yaml file. To check the graph:
dvc dag
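How a dependency graph yields an execution order can be sketched with Python's standard `graphlib` module (Python 3.9+). This illustrates the idea only and is not DVC's implementation; the stage names and their dependencies are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (hypothetical names).
stage_dependencies = {
    "data_load": set(),
    "featurize": {"data_load"},
    "data_split": {"featurize"},
    "train": {"data_split"},
    "evaluate": {"data_split", "train"},
}

# static_order() yields every stage after all of its dependencies.
execution_order = list(TopologicalSorter(stage_dependencies).static_order())
print(execution_order)
# ['data_load', 'featurize', 'data_split', 'train', 'evaluate']
```

If the graph contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the stage graph has to be acyclic.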
To add a stage to the DVC pipeline, execute the following command.
dvc stage add -n <name> \
-d <dependencies> \
-o <outputs> \
-p <parameters> \
<command>
- -n <name>: name of the pipeline stage.
- -d <dependencies>: files (to track) on which the processing of the stage depends.
- -o <outputs>: outputs of the stage. DVC tracks them for any external change.
- -p <parameters>: parameters in the config file to track for changes.
- <command>: command to execute when this stage of the pipeline runs.
Example : Adding data_load.py module as a stage in DVC pipeline
dvc stage add -n data_load \
-d src/data_load.py \
-o data/iris.csv \
-p data_load \
python -m src.data_load --config=params.yaml
Structure of dvc.yaml
stages:
  stage1: # Name of the stage
    cmd: <command to execute>
    deps:
      - <dependency>
    params:
      - <parameter>
    outs:
      - <output>
You can manually add stages or make changes to stages in the dvc.yaml file.
After adding all stages to pipeline, execute:
dvc repro
git add .
git commit -m "Description"
DVC will run the pipeline and start to monitor all the specified parameters, dependencies and outputs. The next time you execute dvc repro:
- If a change in any stage dependency is detected, DVC runs the stages affected by this change. It won't run the unaffected stages.
- Before running a stage, it deletes all outputs of that stage.
- DVC then continues downstream, re-running the stages that depend on the regenerated outputs.
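The behaviour described above (re-run only what changed, plus everything downstream) can be sketched with content hashes. This is a simplified illustration under stated assumptions, not DVC's actual logic: the helpers `changed_stages` and `with_downstream`, the stage names and the use of SHA-256 are all choices made for the example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List, Set


def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_stages(
    stage_deps: Dict[str, List[Path]], recorded: Dict[str, Dict[str, str]]
) -> Set[str]:
    """Return stages whose dependency hashes differ from the recorded ones."""
    stale = set()
    for stage, deps in stage_deps.items():
        current = {str(p): file_hash(p) for p in deps}
        if current != recorded.get(stage):
            stale.add(stage)
    return stale


def with_downstream(stale: Set[str], downstream: Dict[str, List[str]]) -> Set[str]:
    """Add every stage reachable downstream of a stale stage."""
    affected, to_visit = set(stale), list(stale)
    while to_visit:
        for child in downstream.get(to_visit.pop(), []):
            if child not in affected:
                affected.add(child)
                to_visit.append(child)
    return affected


with tempfile.TemporaryDirectory() as tmp:
    raw, script = Path(tmp) / "raw.csv", Path(tmp) / "train.py"
    raw.write_text("a,b\n1,2\n")
    script.write_text("print('train')\n")
    stage_deps = {"data_load": [raw], "train": [script]}
    downstream = {"data_load": ["train"], "train": ["evaluate"]}
    # Hashes recorded after a first, complete run of the pipeline.
    recorded = {s: {str(p): file_hash(p) for p in d} for s, d in stage_deps.items()}
    nothing_changed = with_downstream(changed_stages(stage_deps, recorded), downstream)
    script.write_text("print('train v2')\n")  # edit one dependency
    after_edit = with_downstream(changed_stages(stage_deps, recorded), downstream)

print(nothing_changed, sorted(after_edit))  # set() ['evaluate', 'train']
```

Editing only the training script marks `train` and its downstream stage `evaluate` for re-execution, while `data_load` is left alone.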
To reproduce single stage:
dvc repro -s <stage_name> # add -f to force execution
Why data versioning is needed:
Reproducible ML experiments require versioned data, models & artifacts.
Meeting regulatory compliance & ethical AI requirements (e.g. in health & finance).
Data processing takes a long time and resources are expensive, so results need to be deterministic and reproducible.
We need not produce the same data repeatedly.
How does data versioning work? - reflinks
git to version code, dvc to version data.
Add file/folder to DVC:
dvc add <file/folder> # creates reflink to cache for file added
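The idea behind a hash-addressed cache with a small pointer file can be sketched as below. This is a loose illustration, not DVC's real file format or cache layout: the `.pointer` suffix, the helper names and the use of SHA-256 are assumptions, and the sketch copies files where DVC can use reflinks.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Store the file under its content hash and write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(f"hash: {digest}\npath: {data_file.name}\n")
    return pointer


def restore_from_cache(pointer: Path, cache_dir: Path) -> None:
    """Recreate the data file that the pointer file refers to."""
    fields = dict(line.split(": ", 1) for line in pointer.read_text().splitlines())
    digest = fields["hash"]
    shutil.copy2(cache_dir / digest[:2] / digest[2:], pointer.with_name(fields["path"]))


with tempfile.TemporaryDirectory() as tmp:
    workspace, cache = Path(tmp) / "ws", Path(tmp) / "cache"
    workspace.mkdir()
    data = workspace / "iris.csv"
    data.write_text("version 1\n")
    pointer = add_to_cache(data, cache)
    pointer_v1 = pointer.read_text()
    data.write_text("version 2\n")
    add_to_cache(data, cache)
    # Putting the old pointer back stands in for checking out an older commit.
    pointer.write_text(pointer_v1)
    restore_from_cache(pointer, cache)
    restored = data.read_text()
    n_cached = sum(1 for p in cache.rglob("*") if p.is_file())

print(repr(restored), n_cached)  # 'version 1\n' 2
```

The small pointer file is what gets committed to Git, while the large data stays in the cache; switching the pointer back and restoring brings the old data version into the workspace.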
Setting up remote storage: either create a local remote storage (dummy remote) or add S3, Google Drive, Blob storage, etc.
dvc remote add -f "<name of remote>" <url>
<url>: /tmp/dvc for local storage
<url>: gdrive://<folder_id> for Google Drive
To push data to the remote or pull it from the remote:
dvc push
dvc pull
To check the status of tracked files:
dvc status # reports any changes made to files tracked by DVC
To switch version: (check: https://dvc.org/doc/command-reference/checkout)
dvc checkout
To list project contents, including files, models, and directories tracked by DVC and by Git:
dvc list "<URL>"
To download data without keeping track of changes in the source repository:
dvc get "<URL>" <path>
To download data and keep track of changes:
dvc import "<URL>" <path>
Hydra is a configuration management framework for Machine Learning/ Data Science projects.
Installing Hydra:
pip install hydra-core --upgrade
You must keep all configurations in a folder named configs, as per our PyScaffold cookiecutter DS template.
To import configuration to a python file:
Method 1:
import hydra
from omegaconf import DictConfig, OmegaConf


# config_path is resolved relative to this Python file
@hydra.main(version_base=None, config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    my_app()
Method 2:
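from hydra import compose, initialize

# Loading the configuration file using Hydra
config_name = "config"  # assumed name of the YAML file (without extension); adjust to your project
initialize(version_base=None, config_path='../../configs')
config = compose(config_name=config_name)
To use the configuration: config.<section>.<key>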
Weights and Biases (W&B) is an experiment tracking utility for machine learning.
Installing Wandb: pip install wandb
To log in:
Open wandb.ai > Settings > Danger Zone > API
Copy your API key.
wandb login
and paste your key when prompted.
You should now be logged in to your Weights and Biases account.
To start a new run:
import wandb
wandb.init(project = '<Project_name>', config = config)
# Note that config has to be loaded using Hydra.cc before
# calling this command.
# This will upload training configurations to W&B portal.
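Integration with Keras:
# We use a Keras callback to integrate W&B with our model.
# This will log accuracy, loss, GPU & CPU usage.
# model, X_train, y_train, X_test and y_test are assumed to be defined earlier.
from wandb.keras import WandbCallback

# Pass the callback to model.fit
model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    callbacks=[WandbCallback()],
)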
To log any other metrics:
wandb.log({'parameter_name': parameter_value})