This repository contains an ML training pipeline that can be re-run whenever new data arrives. It uses W&B (Weights & Biases) to track artifacts and experiments, and MLflow to orchestrate the different components of the pipeline.
To run the pipeline during development, change to the root of the starter kit and execute:
> mlflow run .
This will run the entire pipeline.
When developing, it is useful to be able to run one step at a time. Say you want to run only the download step. main.py is written so that the steps are defined at the top of the file, in the _steps list, and can be selected with the steps parameter on the command line:
> mlflow run . -P steps=download
If you want to run the download and basic_cleaning steps, you can similarly do:
> mlflow run . -P steps=download,basic_cleaning
You can override any other parameter in the configuration file using the Hydra syntax, by providing it as the hydra_options parameter. For example, say we want to set modeling -> random_forest -> n_estimators to 10 and etl -> min_price to 50:
> mlflow run . \
-P steps=download,basic_cleaning \
-P hydra_options="modeling.random_forest.n_estimators=10 etl.min_price=50"
You can also run a release of the pipeline directly from its GitHub repository, selecting the release tag with -v:
> mlflow run https://github.com/Gunnvant/modelling_pipeline.git \
-v 1.0.1 \
-P hydra_options="etl.sample='sample2.csv'"
The W&B project with the tracked runs and artifacts is available at:
https://wandb.ai/gunnvant/nyc_airbnb?workspace=user-gunnvant