[GH-15654] Introduce MLFlow flavor for working with mojos and pojos (#15849)

* [GH-15654] Introduce MLFlow flavors for working with mojos and pojos

* First version of mojo flavor by Eric Wolf

* New saving method

* Update loading mojos

* Fix upload mojos.

* Fix building

* Fix h2o_mlflow_flavor

* Update

* Update loader module

* Update h2o_mojo

* Add genmodel flavor

* Fix pojo

* Fix pojo

* Add extraction of metrics

* Fix metric extraction

* add input examples

* moved mlflow-flavor

* Add examples

* Add doc

* Add description.rst

* Update description

* description

* description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* update description

* Add flavor as self reference

* Update build definition

* Remove gitignore

* Just one doc

* Revert DRF_mojo.ipynb

---------

Co-authored-by: Eric Wolf <[email protected]>
mn-mikke and Eric Wolf authored Nov 14, 2023
1 parent 6498915 commit d6c889b
Showing 9 changed files with 897 additions and 1 deletion.
3 changes: 2 additions & 1 deletion build.gradle
@@ -155,7 +155,8 @@ ext {
 
     pythonProjects = [
             project(':h2o-py'),
-            project(':h2o-py-cloud-extensions')
+            project(':h2o-py-cloud-extensions'),
+            project(':h2o-py-mlflow-flavor')
     ]

// The project which need to be run under CI only
110 changes: 110 additions & 0 deletions h2o-py-mlflow-flavor/README.rst
@@ -0,0 +1,110 @@
H2O-3 MLFlow Flavor
===================

A tiny library containing an `MLFlow <https://mlflow.org/>`_ flavor for working with H2O-3 MOJO and POJO models.

Logging Models to MLFlow Registry
---------------------------------

A model trained with the H2O-3 runtime can be exported to the MLFlow registry with the `log_model` function:

.. code-block:: Python

    import mlflow
    import h2o_mlflow_flavor

    mlflow.set_tracking_uri("http://127.0.0.1:8080")
    h2o_model = ...  # training phase: train a model with the H2O-3 runtime

    with mlflow.start_run(run_name="myrun") as run:
        h2o_mlflow_flavor.log_model(h2o_model=h2o_model,
                                    artifact_path="folder",
                                    model_type="MOJO",
                                    extra_prediction_args=["--predictCalibrated"])

Compared to the `log_model` functions of the other flavors that are part of MLFlow, this function has two extra arguments:

* ``model_type`` - Indicates whether the model should be exported as `MOJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/mojo-quickstart.html#what-is-a-mojo>`_ or `POJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/pojo-quickstart.html#what-is-a-pojo>`_. The default value is `MOJO`.

* ``extra_prediction_args`` - A list of extra arguments for the Java scoring process. Possible values:

  * ``--setConvertInvalidNum`` - The scoring process will convert invalid numbers to NA.

  * ``--predictContributions`` - The scoring process will also return Shapley values along with the predictions. The model must support Shapley values, otherwise the scoring process will throw an error.

  * ``--predictCalibrated`` - The scoring process will also return calibrated prediction values.

The `save_model` function, which persists an H2O binary model as a MOJO or POJO, has the same signature as the `log_model` function.
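
The flags above are passed as plain strings. A minimal sketch of a check that could be run before calling `log_model` or `save_model`, assuming the three flags documented here are the complete set. `KNOWN_PREDICTION_ARGS` and `check_extra_prediction_args` are hypothetical names and are not part of `h2o_mlflow_flavor`:

```python
# Hypothetical helper, not part of h2o_mlflow_flavor.
# Assumption: the three flags documented above are the only accepted values.
KNOWN_PREDICTION_ARGS = frozenset({
    "--setConvertInvalidNum",
    "--predictContributions",
    "--predictCalibrated",
})


def check_extra_prediction_args(args):
    """Return args as a list, raising ValueError on any undocumented flag."""
    args = list(args or [])
    unknown = [arg for arg in args if arg not in KNOWN_PREDICTION_ARGS]
    if unknown:
        raise ValueError("Unknown prediction args: " + ", ".join(unknown))
    return args
```

The returned list could then be passed as ``extra_prediction_args``.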

Extracting Information about Model
----------------------------------

The flavor offers several functions to extract information about the model.

* ``get_metrics(h2o_model, metric_type=None)`` - Extracts metrics from the trained H2O binary model. It returns a dictionary and takes the following parameters:

  * ``h2o_model`` - An H2O binary model.

  * ``metric_type`` - The type of metrics. Possible values are "training", "validation" and "cross_validation". If the parameter is not specified, metrics for all types are returned.

* ``get_params(h2o_model)`` - Extracts training parameters for the H2O binary model. It returns a dictionary and expects one parameter:

  * ``h2o_model`` - An H2O binary model.

* ``get_input_example(h2o_model, number_of_records=5, relevant_columns_only=True)`` - Creates an example Pandas dataset from the training dataset of the H2O binary model. It takes the following parameters:

  * ``h2o_model`` - An H2O binary model.

  * ``number_of_records`` - The number of records that will be extracted from the training dataset. Defaults to ``5``.

  * ``relevant_columns_only`` - A flag indicating whether the output dataset should contain only columns required by the model. Defaults to ``True``.
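
The signatures listed above can be combined into one small wrapper. The sketch below is an illustration only: `log_run_summary` is a hypothetical helper, and the flavor and tracker are passed in as arguments (in practice `h2o_mlflow_flavor` and `mlflow`) so that the function does not depend on a running H2O-3 or MLFlow instance:

```python
def log_run_summary(flavor, tracker, h2o_model, metric_type=None):
    """Log the parameters and metrics of an H2O binary model.

    flavor  -- object exposing get_params and get_metrics (e.g. h2o_mlflow_flavor)
    tracker -- object exposing log_params and log_metrics (e.g. mlflow)
    """
    params = flavor.get_params(h2o_model)
    # metric_type=None asks for the metrics of all types, as documented above.
    metrics = flavor.get_metrics(h2o_model, metric_type=metric_type)
    tracker.log_params(params)
    tracker.log_metrics(metrics)
    return params, metrics
```

With the real modules, the call would look like ``log_run_summary(h2o_mlflow_flavor, mlflow, h2o_model, metric_type="validation")`` inside an active ``mlflow.start_run()`` block.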

The functions can be utilized as follows:

.. code-block:: Python

    import mlflow
    import h2o_mlflow_flavor

    mlflow.set_tracking_uri("http://127.0.0.1:8080")
    h2o_model = ...  # training phase: train a model with the H2O-3 runtime

    with mlflow.start_run(run_name="myrun") as run:
        mlflow.log_params(h2o_mlflow_flavor.get_params(h2o_model))
        mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(h2o_model))
        input_example = h2o_mlflow_flavor.get_input_example(h2o_model)
        h2o_mlflow_flavor.log_model(h2o_model=h2o_model,
                                    input_example=input_example,
                                    artifact_path="folder",
                                    model_type="MOJO",
                                    extra_prediction_args=["--predictCalibrated"])

Model Scoring
-------------

Once a model has been obtained from the model registry, it doesn't require the H2O runtime to score. The only thing
the model requires is ``h2o-genmodel.jar``, which was persisted with the model during the saving procedure.
The model can be loaded with the function ``load_model(model_uri, dst_path=None)``. It returns an object that makes
predictions on a Pandas dataframe and takes the following parameters:

* ``model_uri`` - A unique identifier of the model within the MLFlow registry.

* ``dst_path`` - (Optional) A local filesystem path for downloading the persisted form of the model.

The object for scoring can also be obtained via the `pyfunc` flavor as follows:

.. code-block:: Python

    import mlflow
    mlflow.set_tracking_uri("http://127.0.0.1:8080")
    logged_model = 'runs:/9a42265cf0ef484c905b02afb8fe6246/iris'
    loaded_model = mlflow.pyfunc.load_model(logged_model)
    import pandas as pd
    data = pd.read_csv("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
    loaded_model.predict(data)
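
The model URI in the example above follows MLFlow's ``runs:/<run_id>/<artifact_path>`` scheme. A small sketch that assembles such a URI from its parts. `runs_uri` is a hypothetical helper, and the run id is the one from the example, so it would need to be replaced with the id of an actual run:

```python
def runs_uri(run_id, artifact_path):
    """Build a 'runs:/<run_id>/<artifact_path>' model URI."""
    if not run_id or not artifact_path:
        raise ValueError("run_id and artifact_path must be non-empty")
    # Strip surrounding slashes so the parts join with exactly one separator.
    return "runs:/{}/{}".format(run_id, artifact_path.strip("/"))


logged_model = runs_uri("9a42265cf0ef484c905b02afb8fe6246", "iris")
```

The resulting string can be passed to ``load_model`` or ``mlflow.pyfunc.load_model``.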
63 changes: 63 additions & 0 deletions h2o-py-mlflow-flavor/build.gradle
@@ -0,0 +1,63 @@
description = "H2O-3 MLFlow Flavor"

dependencies {}

def buildVersion = new H2OBuildVersion(rootDir, version)

ext {
    PROJECT_VERSION = buildVersion.getProjectVersion()
    pythonexe = findProperty("pythonExec") ?: "python"
    pipexe = findProperty("pipExec") ?: "pip"
    if (System.env.VIRTUAL_ENV) {
        pythonexe = "${System.env.VIRTUAL_ENV}/bin/python".toString()
        pipexe = "${System.env.VIRTUAL_ENV}/bin/pip".toString()
    }
    testsPath = file("tests")
}

task copySrcFiles(type: Copy) {
    from ("${projectDir}") {
        include "setup.py"
        include "setup.cfg"
        include "h2o_mlflow_flavor/**"
        include "README.rst"
    }
    into "${buildDir}"
}

task buildDist(type: Exec, dependsOn: [copySrcFiles]) {
    workingDir buildDir
    doFirst {
        file("${buildDir}/tmp").mkdirs()
        standardOutput = new FileOutputStream(file("${buildDir}/tmp/h2o_mlflow_flavor_buildDist.out"))
    }
    commandLine getOsSpecificCommandLine([pythonexe, "setup.py", "bdist_wheel"])
}

task copyMainDist(type: Copy, dependsOn: [buildDist]) {
    from ("${buildDir}/main/") {
        include "dist/**"
    }
    into "${buildDir}"
}

task pythonVersion(type: Exec) {
    doFirst {
        println(System.env.VIRTUAL_ENV)
        println(environment)
    }
    commandLine getOsSpecificCommandLine([pythonexe, "--version"])
}

task cleanBuild(type: Delete) {
    doFirst {
        println "Cleaning..."
    }
    delete file("build/")
}

//
// Define the dependencies
//
clean.dependsOn cleanBuild
build.dependsOn copyMainDist
125 changes: 125 additions & 0 deletions h2o-py-mlflow-flavor/examples/DRF_mojo.ipynb
@@ -0,0 +1,125 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "3ded5553",
"metadata": {},
"outputs": [],
"source": [
"# Start H2O-3 runtime.\n",
"\n",
"import h2o\n",
"h2o.init(strict_version_check=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e746ad4",
"metadata": {},
"outputs": [],
"source": [
"# Configure DRF algorithm and train a model.\n",
"\n",
"from h2o.estimators import H2ORandomForestEstimator\n",
"\n",
"# Import the cars dataset into H2O:\n",
"cars = h2o.import_file(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n",
"\n",
"# Set the predictors and response;\n",
"# set the response as a factor:\n",
"cars[\"economy_20mpg\"] = cars[\"economy_20mpg\"].asfactor()\n",
"predictors = [\"displacement\",\"power\",\"weight\",\"acceleration\",\"year\"]\n",
"response = \"economy_20mpg\"\n",
"\n",
"# Split the dataset into a train and valid set:\n",
"train, valid = cars.split_frame(ratios=[.8], seed=1234)\n",
"drf = H2ORandomForestEstimator(ntrees=10,\n",
" max_depth=5,\n",
" min_rows=10,\n",
" calibrate_model=True,\n",
" calibration_frame=valid,\n",
" binomial_double_trees=True)\n",
"drf.train(x=predictors,\n",
" y=response,\n",
" training_frame=train,\n",
" validation_frame=valid)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29eb0722",
"metadata": {},
"outputs": [],
"source": [
"# Log the model to an MLFlow registry.\n",
"\n",
"import mlflow\n",
"import h2o_mlflow_flavor\n",
"mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n",
"\n",
"with mlflow.start_run(run_name=\"cars\") as run:\n",
" mlflow.log_params(h2o_mlflow_flavor.get_params(drf)) # Log training parameters of the model (optional).\n",
" mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(drf)) # Log performance metrics of the model (optional).\n",
" input_example = h2o_mlflow_flavor.get_input_example(drf) # Extract input example from training dataset (optional)\n",
" h2o_mlflow_flavor.log_model(drf, \"cars\", input_example=input_example,\n",
" model_type=\"MOJO\", # Specify whether the output model should be MOJO or POJO. (MOJO is default)\n",
" extra_prediction_args=[\"--predictCalibrated\"]) # Add extra prediction args if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bed1dafe",
"metadata": {},
"outputs": [],
"source": [
"# Load model from the MLFlow registry and score with the model.\n",
"\n",
"import mlflow\n",
"mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n",
"\n",
"logged_model = 'runs:/a9ff364f07fa499eb44e7c49e47fab11/cars' # Specify correct id of your run.\n",
"\n",
"# Load model as a PyFuncModel.\n",
"loaded_model = mlflow.pyfunc.load_model(logged_model)\n",
"\n",
"# Predict on a Pandas DataFrame.\n",
"import pandas as pd\n",
"data = pd.read_csv(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n",
"loaded_model.predict(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b0c4c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "mlflow",
"language": "python",
"name": "mlflow"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}