This repository contains a demo machine learning pipeline implemented using the Composable Logs framework.
This demo pipeline:
- trains a model that predicts the digits 0, ..., 9 from a handwritten image of the digit. The data is a small dataset included in the sklearn library (see the sketch after this list).
- runs daily using Github Actions, but does not require any other cloud infrastructure. Rather it uses:
  - Github Actions: orchestration and compute
  - Github Build Artifacts: to persist pipeline run results (using the OpenTelemetry open standard)
  - Github Pages: static website for model/experiment tracking and demo site, built using a custom fork of MLFlow
- development is supported by both CI automation and local development tools:
  - This repository is configured to run the pipeline for all pull requests; see the experiment tracking site linked above.
  - Local development with VS Code and/or Jupyter notebooks.
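For orientation, the sketch below shows the underlying learning problem using plain sklearn. It is not the pipeline's actual code; the model choice (a support vector classifier) and the train/test split parameters are assumptions for illustration only.

```python
# Minimal sketch (not the pipeline's actual code): train a digit classifier
# on the small digits dataset that ships with sklearn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ~1800 grayscale 8x8 images of handwritten digits 0..9
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# The SVM classifier here is an assumption for illustration.
model = SVC()
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```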
For more details, please see the main documentation site.
The below diagram (Mermaid source) shows the pipeline's task dependency graph.
graph LR
%% Mermaid input file for drawing task dependencies
%% See https://mermaid-js.github.io/mermaid
%%
TASK_SPAN_ID_0xce2e54cf83d7159b["<b>ingest (Python task)</b> <br />task.num_cpus=1<br />task.timeout_s=30.0"]
TASK_SPAN_ID_0x7d1ab0d78c082cf3["<b>eda (Jupytext task)</b> <br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x392c5ca53bf7d565["<b>split_train_test (Python task)</b> <br />task.num_cpus=1<br />task.timeout_s=30.0"]
TASK_SPAN_ID_0xd4aa81c209e8f07e["<b>train_model (Python task)</b> <br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x9ece260aad087f90["<b>train_model (Python task)</b> <br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0xf4da6fe70eb23cb8["<b>train_model (Python task)</b> <br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x4024e5fbedbce3bf["<b>train_model (Python task)</b> <br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x6a9445853c44a3a1["<b>benchmark-model (Jupytext task)</b> <br />task.nr_train_images=600<br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x90e9078dac9eb6b2["<b>benchmark-model (Jupytext task)</b> <br />task.nr_train_images=1200<br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x92e4064d9e9067f3["<b>benchmark-model (Jupytext task)</b> <br />task.nr_train_images=800<br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x3103566a8571ae5e["<b>benchmark-model (Jupytext task)</b> <br />task.nr_train_images=1000<br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x8985b6ded7f5fa8b["<b>summary (Jupytext task)</b> <br />task.num_cpus=1<br />task.timeout_s=60.0"]
TASK_SPAN_ID_0x392c5ca53bf7d565 --> TASK_SPAN_ID_0x9ece260aad087f90
TASK_SPAN_ID_0x392c5ca53bf7d565 --> TASK_SPAN_ID_0xf4da6fe70eb23cb8
TASK_SPAN_ID_0x3103566a8571ae5e --> TASK_SPAN_ID_0x8985b6ded7f5fa8b
TASK_SPAN_ID_0x90e9078dac9eb6b2 --> TASK_SPAN_ID_0x8985b6ded7f5fa8b
TASK_SPAN_ID_0xce2e54cf83d7159b --> TASK_SPAN_ID_0x7d1ab0d78c082cf3
TASK_SPAN_ID_0x392c5ca53bf7d565 --> TASK_SPAN_ID_0xd4aa81c209e8f07e
TASK_SPAN_ID_0xce2e54cf83d7159b --> TASK_SPAN_ID_0x392c5ca53bf7d565
TASK_SPAN_ID_0x92e4064d9e9067f3 --> TASK_SPAN_ID_0x8985b6ded7f5fa8b
TASK_SPAN_ID_0xf4da6fe70eb23cb8 --> TASK_SPAN_ID_0x90e9078dac9eb6b2
TASK_SPAN_ID_0x4024e5fbedbce3bf --> TASK_SPAN_ID_0x3103566a8571ae5e
TASK_SPAN_ID_0x392c5ca53bf7d565 --> TASK_SPAN_ID_0x4024e5fbedbce3bf
TASK_SPAN_ID_0x6a9445853c44a3a1 --> TASK_SPAN_ID_0x8985b6ded7f5fa8b
TASK_SPAN_ID_0xd4aa81c209e8f07e --> TASK_SPAN_ID_0x6a9445853c44a3a1
TASK_SPAN_ID_0x9ece260aad087f90 --> TASK_SPAN_ID_0x92e4064d9e9067f3
As seen from the descriptions, the tasks include both pure Python and Jupytext (notebook) steps.
Alternatively, a pipeline run's full output can be inspected by downloading the zip build artefact for a recent build (link). The zip file contains rendered notebooks, logged metrics and images, and the trained model(s) in ONNX format.
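As an illustration of how one might query a model exported in ONNX format, the sketch below loads it with onnxruntime and runs a single prediction. The file name model.onnx and the flattened 8x8 input shape are assumptions for illustration, not guaranteed by the artefact layout.

```python
# Sketch only: load a trained model from a downloaded artefact and run it.
# The path "model.onnx" and the flattened 8x8 input are assumptions.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

# One fake 8x8 grayscale digit image, flattened to 64 features
x = np.zeros((1, 64), dtype=np.float32)

# Run inference; the first output typically holds the predicted label
outputs = session.run(None, {input_name: x})
print("predicted digit:", outputs[0])
```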
This repository uses Github Actions automation to run the demo pipeline as part of the repo's CI pipeline. Each CI run stores the pipeline outputs (i.e., notebooks, models, and logged images and metrics) as build artefacts.
This means:
- The entire pipeline is run for all commits in pull requests to this repository, and for commits to the `main` branch.
- From the build artefacts one can inspect the pipeline's outputs (and, in particular, model performance) for each pull request and commit.
- The pipeline runs using (free) compute resources provided by Github. No other infrastructure is needed.
- Forking this repo in Github gives a new pipeline with its own experiment tracker that can be developed independently.
The below diagram shows a Gantt chart with runtimes of individual pipeline tasks.
gantt
%% Mermaid input file for drawing Gantt chart of runlog runtimes
%% See https://mermaid-js.github.io/mermaid/#/gantt
%%
axisFormat %H:%M
%%
%% Give timestamps as unix timestamps (ms)
dateFormat x
%%
section ingest (Python task)
1.34s - OK : , 1675605494 , 1675605496
section eda (Jupytext task)
15.86s - OK : , 1675605496 , 1675605512
section split_train_test (Python task)
1.39s - OK : , 1675605497 , 1675605499
section train_model (Python task)
5.16s - OK : , 1675605499 , 1675605504
section train_model (Python task)
7.09s - OK : , 1675605502 , 1675605509
section train_model (Python task)
4.24s - OK : , 1675605502 , 1675605506
section train_model (Python task)
20.77s - OK : , 1675605502 , 1675605523
section benchmark-model (Jupytext task)
31.1s - OK : , 1675605504 , 1675605535
section benchmark-model (Jupytext task)
14.51s - OK : , 1675605506 , 1675605521
section benchmark-model (Jupytext task)
15.05s - OK : , 1675605509 , 1675605524
section benchmark-model (Jupytext task)
12.7s - OK : , 1675605523 , 1675605536
section summary (Jupytext task)
5.35s - OK : , 1675605536 , 1675605541
Of note:
- Tasks are run in parallel using all available cores. On (free) Github-hosted runners there are two vCPUs. Parallel execution is implemented using the Ray framework; a minimal sketch of this execution model is shown below.
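The sketch below is not the pipeline's actual code; with a made-up train_model function, it only illustrates how Ray schedules tasks onto a two-vCPU runner when each task reserves one CPU.

```python
# Illustration only: Ray schedules remote tasks onto available CPUs.
# On a 2-vCPU runner, at most two of these tasks run at the same time.
import time
import ray

ray.init(num_cpus=2)

@ray.remote(num_cpus=1)
def train_model(nr_train_images: int) -> str:
    time.sleep(1)  # stand-in for actual training work
    return f"model trained on {nr_train_images} images"

# Launch four tasks; Ray runs them two at a time on this runner.
futures = [train_model.remote(n) for n in [600, 800, 1000, 1200]]
print(ray.get(futures))
```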
To run the pipeline (e.g., locally) one needs git, make, and Docker installed.
First, clone the demo pipeline repository:
git clone --recurse-submodules [email protected]:composable-logs/mnist-digits-demo-pipeline.git
Now the pipeline can be run as follows:
make build-all-docker-images
make clean
make RUN_ENVIRONMENT="dev" test-and-run-pipeline
Pipeline outputs (evaluated notebooks, models, logs, and images) are stored in the repo's pipeline-outputs directory.
The above steps are essentially what is run by the CI automation (although that runs with RUN_ENVIRONMENT="ci", which is slightly slower).
This repo is set up for pipeline development using Jupyter notebook via VS Code's remote containers. This is similar to the setup for developing the composable-logs library.
The development tasks available in VS Code are defined here. The key tasks:
- `mnist-demo-pipeline - watch and run all tasks`: Run the pipeline and static code analyses (mypy and Black) in watch mode.
- `common package: run all tests in watch mode`: Run unit tests and static code analyses on them (mypy and Black) in watch mode.
A motivation for this work is to make it easier to set up and work together on (open data) pipelines.
If you would like to discuss an idea or have a question, please raise an issue or contact me via email.
This is WIP and any ideas/feedback are welcome.
(c) Matias Dahl 2021-2022, MIT, see LICENSE.md.
The training data is a reduced version of the MNIST digits data included in sklearn; see the sklearn documentation.