- Taking machine learning models to production, then maintaining & monitoring them.
- You should have Microsoft VSCode and Docker Desktop installed and running on your local machine. To install Docker Desktop, follow the Docker installation guidelines for your operating system.
- MLOps
- Table of contents
- MLOps Workflow
- Coding Guidelines
- Git basics
- Pipelining in Machine Learning
- DVC - Data Version Control
  - Automating pipelines with DVC 🛠️
    - Adding stages to DVC pipeline - `dvc stage add`
    - Running/Reproducing pipelines - `dvc repro`
  - Versioning data and models with DVC
    - DVC commands - `dvc add`, `dvc push`, `dvc pull`
    - Tracking changes & switching between versions - `dvc status` & `dvc checkout`
    - Data access in DVC - `dvc list`, `dvc get`, `dvc import`
- Hydra
- Weights and Biases
Steps included in the successful creation of an MLOps project:

- Data Management and analysis.
- Experimentation.
- Solution development & Testing.
- Deployment & Serving.
- Monitoring & maintenance.
Guidelines on writing code for the project:

- Organize code into clean, reusable units 🔧 - functions, classes & modules. 💡
- Use git for code versioning.
- Follow style guidelines: write comments, docstrings, type annotations.
- Keep requirements.txt and Dockerfile updated.
- Testing: write unit tests for your code (see the sketch below).
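As a small illustration of these guidelines, here is a sketch of a typed, documented function together with a matching pytest test. The file names and the function itself are hypothetical examples, not part of the project template.

```python
# src/features/scaling.py - hypothetical example following the guidelines above:
# a small, reusable function with type annotations and a docstring.
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values to the [0, 1] range; returns all zeros if the values are constant."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# tests/test_scaling.py - a matching unit test, runnable with `pytest`.
def test_min_max_scale():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
```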
You should have git set up and running on your local machine.

- Configure the user information used across all local repositories: `git config --global user.name "[firstname lastname]"`, `git config --global user.email "[valid-email]"`
- Initialize or clone a repository: `git init`, `git clone [url]`
- Check the current status: `git status`
- Add files for versioning and tracking: `git add <f_name>`
- Commit staged content: `git commit -m "[description]"`
- List all branches; a `*` will appear next to the active branch: `git branch`
- Switch to another branch and check it out in your working directory (`-b` creates the branch if it does not exist): `git checkout -b "[branch-name]"`
- Add a git remote URL: `git remote add "[alias]" <URL>`
- Fetch down all the branches from that Git remote: `git fetch "[alias]"`
- Merge a remote branch into your current branch to bring it up to date: `git merge "[alias]/[branch]"`
- Transmit local branch commits to the remote repository branch: `git push "[alias]" "[branch]"`
- Fetch and merge any commits from the tracking remote branch: `git pull`
We are going to use the PyScaffold Cookiecutter Data Science project template.

- Install the PyScaffold data science project extension: `pip install pyscaffoldext-dsproject`
- Install pre-commit: `pip install pre-commit`
- Initialize an empty project with the cookiecutter data science project structure: `putup --dsproject <Name of your project>`
├── AUTHORS.md <- List of developers and maintainers.
├── CHANGELOG.md <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md <- Guidelines for contributing to this project.
├── Dockerfile <- Build a docker container with `docker build .`.
├── LICENSE.txt <- License as chosen on the command-line.
├── README.md <- The top-level README for developers.
├── configs <- Directory for configurations of model & application.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
├── docs <- Directory for Sphinx documentation in rst or md.
├── environment.yml <- The conda environment file for reproducibility.
├── models <- Trained and serialized models, model predictions,
│ or model summaries.
├── notebooks <- Jupyter notebooks. Naming convention is a number (for
│ ordering), the creator's initials and a description,
│ e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml <- Build configuration. Don't change! Use `pip install -e .`
│ to install for development or to build `tox -e build`.
├── references <- Data dictionaries, manuals, and all other materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated plots and figures for reports.
├── scripts <- Analysis and production scripts which import the
│ actual PYTHON_PKG, e.g. train_model.
├── setup.cfg <- Declarative configuration of your project.
├── setup.py <- [DEPRECATED] Use `python setup.py develop` to install for
│ development or `python setup.py bdist_wheel` to build.
├── src
│ └── classify_covid <- Actual Python package where the main functionality goes.
├── tests <- Unit tests which can be run with `pytest`.
├── .coveragerc <- Configuration for coverage reports of unit tests.
├── .isort.cfg <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
A pipeline is a series of successive (and sometimes parallel) steps in which we process data:

- Extracting, transforming and loading data.
- Creating a test/train split.
- Model training.
- Model evaluation.

Example: suppose you have two parameter settings - Epoch = 10 and Epoch = 20. You would probably want to run steps 1 & 2 only once for both settings, and steps 3 & 4 once per setting. Having a pipeline makes this easy.
Simple ML pipeline 🔧

```mermaid
flowchart LR
    A(Load Data) --> B(Featurize)
    B --> C{Data Split}
    C -->|Train Data| D[Train Model]
    C -->|Test Data| E[Evaluate Model]
```
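For intuition, the same flow can be written as a few plain Python functions before it is split into pipeline stages. This is only a sketch: the function names and the use of pandas/scikit-learn are illustrative assumptions, not part of the template.

```python
# Sketch of the flowchart above as plain Python functions (illustrative only).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_data(path: str) -> pd.DataFrame:
    """Load the raw dataset from a CSV file."""
    return pd.read_csv(path)


def featurize(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder feature-engineering step."""
    return df.dropna()


def run_pipeline(path: str, target: str) -> float:
    """Load -> featurize -> split -> train -> evaluate, returning test accuracy."""
    df = featurize(load_data(path))
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```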
To make projects production-ready, we need to convert Jupyter Notebooks into `.py` modules.

- This makes versioning easy and lets us automate and build pipelines.
- Keep parameters in a config file (`config.yaml`). We will use Hydra.cc later to load these configuration files. For example:

  ```yaml
  params:
    batch_size: 32
    learning_rate: 0.01
    training_epoch: 30
    num_gpus: 4
  ```

- Keep reusable code in `.py` modules. For example, create `visualize.py` to contain visualization tasks.
- Create a `.py` module for each computation task (stage).
- Structure `.py` modules so they run in both modes - Jupyter & terminal.
- Converting dataset loading in a Jupyter notebook to a Python script. We will use hydra.cc for configuration loading later; for now, we are using the `yaml` library.
- This example showcases the use of `argparse.ArgumentParser` to pass arguments to the module from the terminal. `dataset_load.py`:

  ```python
  from typing import Text

  import argparse
  import yaml


  def data_load(config_path: Text) -> None:
      cfg = yaml.safe_load(open(config_path))
      raw_data_path = cfg['data_load']['raw_data.path']
      ...
      data.to_csv(cfg['dataset_processed_path'])


  if __name__ == '__main__':
      args_parser = argparse.ArgumentParser()
      args_parser.add_argument('--config', dest='config', required=True)
      args = args_parser.parse_args()

      data_load(config_path=args.config)
  ```

- To import this function into a Jupyter Notebook, use `from dataset_load import data_load` and pass the argument to the function.
- To run it from the terminal, change directory to the root folder and execute: `python -m src.stages.data_load --config=params.yaml`
To build an ML pipeline, create modules for each stage as above. Then, run those modules sequentially.
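As a minimal sketch of what "running the modules sequentially" can look like, the driver script below invokes each stage with `python -m`, exactly as done from the terminal. The stage module names are illustrative assumptions.

```python
# Sketch of a driver script that runs each stage module in order.
# The stage module names below are illustrative assumptions.
import subprocess

STAGES = [
    "src.stages.data_load",
    "src.stages.featurize",
    "src.stages.train",
    "src.stages.evaluate",
]

for stage in STAGES:
    # Equivalent to typing: python -m <stage> --config=params.yaml
    subprocess.run(
        ["python", "-m", stage, "--config=params.yaml"],
        check=True,  # stop the pipeline if a stage fails
    )
```

In the next section, DVC takes over this orchestration so that only the stages affected by a change are re-run.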
DVC - Data Version Control
DVC is an open source version control system for ML projects. It will be used for:

- Experiment management - creating pipelines, tracking metrics, parameters and dependencies.
- Data versioning - versioning data just as we version code with git.
Installing DVC - `pip install dvc[all]` (the `[all]` extra includes support for all remote storage types). It is also good to integrate logging.

Initializing DVC - `dvc init` creates a `.dvc` folder containing all information about the directory. You must add DVC under git control - `git add .` & `git commit -m "Init DVC"`.
- Running stages in sequence manually can be cumbersome and time-consuming. DVC helps organize stages into a pipeline.
- Stages might depend on parameters, outputs of other stages, and other dependencies. DVC tracks all of them and re-runs only the stages where a change is detected. (💡 Remember the example of two possible epoch values?)
DVC builds a dependency graph (a Directed Acyclic Graph) of stages to determine the order of execution, and saves the pipeline in a `dvc.yaml` file. To inspect the graph - `dvc dag`
To add a stage to the DVC pipeline, execute the following command:
dvc stage add -n <name> \ # Name: Name of the stage of pipeline
-d <dependencies> \ # Dependencies: files(to track) on which processing of stage depends.
-o <outputs> \ # Outputs: outputs of stage. DVC tracks them for any external change.
-p <parameters> \ # Parameters: parameters in the config file to track for changes.
command # Command to execute on execution of this stage of pipeline
Example: adding the `data_load.py` module as a stage in the DVC pipeline
dvc stage add -n data_load \
-d src/data_load.py \
-o data/iris.csv \
-p data_load \
python -m src.data_load --config=params.yaml
Structure of the `dvc.yaml` file:
stages:
stage1: #Name of stage
cmd: <Command to execute>
deps: <Dependencies>
params: <Parameters>
outs: <Outputs>
...
You can manually add stages or make changes to existing stages in the `dvc.yaml` file.
After adding all stages to the pipeline, execute:

- `dvc repro`
- `git add .`
- `git commit -m "Description"`

DVC will run the pipeline and start monitoring all the parameters, dependencies and outputs specified. The next time you execute `dvc repro`:
- If a change in any stage dependency is detected, DVC runs only the stages affected by that change; unaffected stages are skipped.
- Before running a stage, DVC deletes all outputs of that stage.
- DVC then continues downstream to reproduce the remaining stages.
To reproduce a single stage: `dvc repro -s <stage_name>` (add `-f` for forced execution).
Why data versioning is needed:

- Reproducible ML experiments require versioned data, models & artifacts.
- It helps meet regulatory compliance & ethical AI requirements (e.g. in healthcare & finance).
- Data processing takes a long time, resources are expensive, and processing needs to be deterministic and reproducible.
- We should not have to produce the same data repeatedly.
How does data versioning work? - reflinks

Use `git` to version code and `dvc` to version data.
- Add a file/folder to DVC (creates a reflink to the cache for the added file): `dvc add <file/folder>`
- Set up remote storage - either create a local remote storage (a dummy remote) or add S3, Google Drive, Azure Blob, etc.: `dvc remote add -f "<name of remote>" </>`
  - `</>` : `/tmp/dvc` for local storage
  - `</>` : `gdrive://<folder_id>` for Google Drive
- To push data to the remote or pull data from it: `dvc push` / `dvc pull`
- To track the status of staged files (reports any changes made to files tracked by DVC): `dvc status`
- To switch between versions (see: https://dvc.org/doc/command-reference/checkout): `dvc checkout`
- To list project contents, including files, models, and directories tracked by DVC and by Git: `dvc list "<URL>"`
- To download data without tracking changes against the remote: `dvc get "<URL>"`
- To download data and keep tracking changes: `dvc import "<URL>"`
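Alongside these CLI commands, DVC also exposes a Python API (`dvc.api`) for reading tracked files directly from code. Below is a minimal sketch; the repository URL, file path and revision are hypothetical placeholders.

```python
# Minimal sketch of DVC's Python API for data access.
# The repo URL, file path and revision below are hypothetical placeholders.
import dvc.api

# Read a DVC-tracked file from a given repo and revision (tag/branch/commit).
content = dvc.api.read(
    "data/raw/iris.csv",
    repo="https://github.com/<user>/<repo>",
    rev="v1.0",
)

# Or stream it like a regular file object.
with dvc.api.open(
    "data/raw/iris.csv",
    repo="https://github.com/<user>/<repo>",
    rev="v1.0",
) as f:
    first_line = f.readline()
```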
Hydra is a configuration management framework for Machine Learning / Data Science projects.

Installing Hydra: `pip install hydra-core --upgrade`

You must keep all configurations in the folder named `configs`, as per our PyScaffold cookiecutter DS template.

To import the configuration into a Python file:
- Method 1:

  ```python
  import hydra
  from omegaconf import DictConfig, OmegaConf


  @hydra.main(version_base=None, config_path="conf", config_name="config")
  def my_app(cfg: DictConfig) -> None:
      print(OmegaConf.to_yaml(cfg))


  if __name__ == "__main__":
      my_app()
  ```
- Method 2:

  ```python
  from hydra import compose, initialize

  # Loading configuration file using Hydra
  initialize(version_base=None, config_path='../../configs')
  config = compose(config_name=config_name)
  ```
To use the configuration, access values with dot notation: `config.<>.<>`
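For instance, assuming the `params` block shown earlier is saved as `configs/params.yaml` (a hypothetical file name and location), access looks like this sketch:

```python
from hydra import compose, initialize

# Assumes the params block shown earlier lives in configs/params.yaml.
initialize(version_base=None, config_path="../../configs")
config = compose(config_name="params")

# Access values with dot notation.
batch_size = config.params.batch_size          # 32
learning_rate = config.params.learning_rate    # 0.01
print(f"Training for {config.params.training_epoch} epochs on {config.params.num_gpus} GPUs")
```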
Weights & Biases (wandb) is an experiment tracking utility for machine learning.

Installing wandb: `pip install wandb`

To log in:

- Open wandb.ai > Settings > Danger Zone > API.
- Copy your API key.
- Execute `wandb login` and paste your key.

You should now be logged in to your Weights and Biases account.
To start a new run:
import wandb
wandb.init(project = '<Project_name>', config = config)
# Note that config has to be loaded using Hydra.cc before
# calling this command.
# This will upload training configurations to W&B portal.
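Note: when the configuration is loaded with Hydra, it is an OmegaConf `DictConfig`. A common approach is to convert it to a plain dictionary before passing it to `wandb.init`. A small sketch, assuming `config` was loaded as shown in the Hydra section:

```python
import wandb
from omegaconf import OmegaConf

# Convert the Hydra/OmegaConf DictConfig into a plain dict for W&B.
# `config` is assumed to have been loaded as shown in the Hydra section above.
wandb_config = OmegaConf.to_container(config, resolve=True)
run = wandb.init(project="<Project_name>", config=wandb_config)
```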
Integration with Keras:

from wandb.keras import WandbCallback

# We use a Keras callback to integrate W&B with our model.
# This will log accuracy, loss, and GPU & CPU usage.
# Pass the callback to model.fit
model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    callbacks=[WandbCallback()]
)
To log any other metrics:

wandb.log({'parameter_name': parameter_value})
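For example, a minimal sketch of logging metrics per epoch inside a custom training loop (the metric names and values are purely illustrative):

```python
import wandb

# Illustrative sketch: log metrics per epoch from a custom training loop.
wandb.init(project="<Project_name>")

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)      # placeholder value
    val_accuracy = 0.5 + 0.04 * epoch   # placeholder value

    # Each call appends a new step to the run's history on the W&B portal.
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_accuracy": val_accuracy})

wandb.finish()
```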