[GH-15654] Introduce MLFlow flavor for working with mojos and pojos #15849

Merged 43 commits on Nov 14, 2023
Commits
e13b514
[GH-15654] Introduce MLFlow flavors for working with mojos and pojos
mn-mikke Oct 19, 2023
680f528
First version of mojo flavor by Eric Wolf
Oct 19, 2023
6431595
New saving method
mn-mikke Oct 20, 2023
1bd9c02
Update loading mojos
mn-mikke Oct 23, 2023
5862564
Fix upload mojos.
mn-mikke Oct 23, 2023
b969f1f
Fix building
mn-mikke Oct 23, 2023
e038dbe
Fix h2o_mlflow_flavor
mn-mikke Oct 23, 2023
0a32fb8
Update
mn-mikke Oct 31, 2023
ae3336c
Update loader module
mn-mikke Oct 31, 2023
8c9bb69
Update h2o_mojo
mn-mikke Nov 1, 2023
6050e64
Add genmodel flavor
mn-mikke Nov 3, 2023
407c95e
Fix pojo
mn-mikke Nov 3, 2023
83e2420
Fix pojo
mn-mikke Nov 6, 2023
01ab6ab
Add extraction of metrics
mn-mikke Nov 6, 2023
65499e5
Fix metric extraction
mn-mikke Nov 7, 2023
bf5e77a
add input examples
mn-mikke Nov 8, 2023
d897548
moved mlflow-flavor
mn-mikke Nov 10, 2023
07443e6
Add examples
mn-mikke Nov 10, 2023
cbfdd3a
Add doc
mn-mikke Nov 13, 2023
38bdb5f
Add description.rst
mn-mikke Nov 13, 2023
68f5989
Update description
mn-mikke Nov 13, 2023
e7463f1
description
mn-mikke Nov 13, 2023
c742482
description
mn-mikke Nov 13, 2023
f288e55
update description
mn-mikke Nov 13, 2023
35e8194
update description
mn-mikke Nov 13, 2023
06cca19
update description
mn-mikke Nov 13, 2023
e2c6680
update description
mn-mikke Nov 13, 2023
31c8fb1
update description
mn-mikke Nov 13, 2023
1b3ed2d
update description
mn-mikke Nov 13, 2023
7c33c80
update description
mn-mikke Nov 13, 2023
bd4a547
update description
mn-mikke Nov 13, 2023
c816cc7
update description
mn-mikke Nov 13, 2023
9191265
update description
mn-mikke Nov 13, 2023
8c89df6
update description
mn-mikke Nov 13, 2023
76de6bb
update description
mn-mikke Nov 13, 2023
ad5be38
update description
mn-mikke Nov 13, 2023
df98292
update description
mn-mikke Nov 13, 2023
aad61ae
update description
mn-mikke Nov 13, 2023
77c4391
Add flavor as self reference
mn-mikke Nov 13, 2023
b9be04a
Update build definition
mn-mikke Nov 13, 2023
fe76b53
Remove gitignore
mn-mikke Nov 14, 2023
7d39ab4
Just one doc
mn-mikke Nov 14, 2023
237069b
Revert DRF_mojo.ipynb
mn-mikke Nov 14, 2023
3 changes: 2 additions & 1 deletion build.gradle
@@ -155,7 +155,8 @@ ext {

pythonProjects = [
project(':h2o-py'),
- project(':h2o-py-cloud-extensions')
+ project(':h2o-py-cloud-extensions'),
+ project(':h2o-py-mlflow-flavor')
]

// The project which need to be run under CI only
110 changes: 110 additions & 0 deletions h2o-py-mlflow-flavor/README.rst
@@ -0,0 +1,110 @@
H2O-3 MLFlow Flavor
===================

A tiny library containing an `MLFlow <https://mlflow.org/>`_ flavor for working with H2O-3 MOJO and POJO models.

Logging Models to MLFlow Registry
---------------------------------

A model trained with the H2O-3 runtime can be exported to the MLFlow registry with the `log_model` function:

.. code-block:: Python

    import mlflow
    import h2o_mlflow_flavor

    mlflow.set_tracking_uri("http://127.0.0.1:8080")

    h2o_model = ...  # placeholder for the training phase

    with mlflow.start_run(run_name="myrun") as run:
        h2o_mlflow_flavor.log_model(h2o_model=h2o_model,
                                    artifact_path="folder",
                                    model_type="MOJO",
                                    extra_prediction_args=["--predictCalibrated"])


Compared to the `log_model` functions of the flavors that ship with MLFlow, this function has two extra arguments:

* ``model_type`` - It indicates whether the model should be exported as `MOJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/mojo-quickstart.html#what-is-a-mojo>`_ or `POJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/pojo-quickstart.html#what-is-a-pojo>`_. The default value is `MOJO`.

* ``extra_prediction_args`` - A list of extra arguments for the Java scoring process. Possible values:

* ``--setConvertInvalidNum`` - The scoring process will convert invalid numbers to NA.

* ``--predictContributions`` - The scoring process will also return Shapley values along with the predictions. The model must support Shapley values, otherwise the scoring process will throw an error.

* ``--predictCalibrated`` - The scoring process will also return calibrated prediction values.

The `save_model` function, which persists an H2O binary model as a MOJO or POJO, has the same signature as the `log_model` function.
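
Because ``extra_prediction_args`` is a plain list of strings, a misspelled flag is easy to pass by accident. The sketch below checks a list against the three flags documented above before it is handed to `log_model`. The helper name ``validate_extra_prediction_args`` and the constant ``KNOWN_PREDICTION_ARGS`` are hypothetical and not part of the flavor, which may handle unknown flags differently.

```python
from typing import Iterable, List

# Flags documented in this README for the Java scoring process.
KNOWN_PREDICTION_ARGS = frozenset({
    "--setConvertInvalidNum",
    "--predictContributions",
    "--predictCalibrated",
})


def validate_extra_prediction_args(args: Iterable[str]) -> List[str]:
    """Return the args as a list, raising ValueError on any undocumented flag.

    Hypothetical helper for illustration only; it is not part of h2o_mlflow_flavor.
    """
    checked = list(args)
    unknown = [arg for arg in checked if arg not in KNOWN_PREDICTION_ARGS]
    if unknown:
        raise ValueError(f"Unsupported prediction args: {unknown}")
    return checked
```

The returned list could then be passed as ``extra_prediction_args`` to `log_model` or `save_model`.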

Extracting Information about Model
----------------------------------

The flavor offers several functions to extract information about the model.

* ``get_metrics(h2o_model, metric_type=None)`` - Extracts metrics from the trained H2O binary model. It returns a dictionary and takes the following parameters:

* ``h2o_model`` - An H2O binary model.

* ``metric_type`` - The type of metrics. Possible values are "training", "validation", and "cross_validation". If the parameter is not specified, metrics for all types are returned.

* ``get_params(h2o_model)`` - Extracts training parameters for the H2O binary model. It returns a dictionary and expects one parameter:

* ``h2o_model`` - An H2O binary model.

* ``get_input_example(h2o_model, number_of_records=5, relevant_columns_only=True)`` - Creates an example Pandas dataset from the training dataset of the H2O binary model. It takes the following parameters:

* ``h2o_model`` - An H2O binary model.

* ``number_of_records`` - The number of records that will be extracted from the training dataset. Defaults to ``5``.

* ``relevant_columns_only`` - A flag indicating whether the output dataset should contain only columns required by the model. Defaults to ``True``.

The functions can be utilized as follows:

.. code-block:: Python

    import mlflow
    import h2o_mlflow_flavor

    mlflow.set_tracking_uri("http://127.0.0.1:8080")

    h2o_model = ...  # placeholder for the training phase

    with mlflow.start_run(run_name="myrun") as run:
        mlflow.log_params(h2o_mlflow_flavor.get_params(h2o_model))
        mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(h2o_model))
        input_example = h2o_mlflow_flavor.get_input_example(h2o_model)
        h2o_mlflow_flavor.log_model(h2o_model=h2o_model,
                                    input_example=input_example,
                                    artifact_path="folder",
                                    model_type="MOJO",
                                    extra_prediction_args=["--predictCalibrated"])
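
To illustrate what ``number_of_records`` and ``relevant_columns_only`` control in ``get_input_example``, the sketch below mimics their effect on a plain list of records. It is not the flavor's implementation: ``take_input_example``, the ``model_columns`` argument, and the toy rows are hypothetical, and the sketch assumes that the first rows of the dataset are the ones taken.

```python
from typing import Any, Dict, List, Sequence


def take_input_example(
    records: Sequence[Dict[str, Any]],
    model_columns: Sequence[str],
    number_of_records: int = 5,
    relevant_columns_only: bool = True,
) -> List[Dict[str, Any]]:
    """Return the first rows of `records`, optionally restricted to `model_columns`.

    Hypothetical stand-in for illustration; it does not touch H2O or Pandas.
    """
    head = list(records[:number_of_records])
    if not relevant_columns_only:
        # Keep every column of the selected rows.
        return [dict(row) for row in head]
    # Keep only the columns the (hypothetical) model uses.
    return [{col: row[col] for col in model_columns} for row in head]


# Made-up rows: "x1" and "x2" play the role of model inputs, "id" is unused.
rows = [
    {"x1": 1.0, "x2": 10.0, "id": "a"},
    {"x1": 2.0, "x2": 20.0, "id": "b"},
    {"x1": 3.0, "x2": 30.0, "id": "c"},
]
example = take_input_example(rows, ["x1", "x2"], number_of_records=2)
```

Here ``example`` holds two rows without the ``id`` column; passing ``relevant_columns_only=False`` would keep it.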


Model Scoring
-------------

Once a model has been obtained from the model registry, it does not need the H2O runtime in order to score. The only thing
the model requires is ``h2o-genmodel.jar``, which was persisted with the model when it was saved.
The model can be loaded with the function ``load_model(model_uri, dst_path=None)``. It returns an object that makes
predictions on a Pandas DataFrame and takes the following parameters:

* ``model_uri`` - A unique identifier of the model within the MLFlow registry.

* ``dst_path`` - (Optional) A local filesystem path for downloading the persisted form of the model.

The scoring object can also be obtained via the `pyfunc` flavor as follows:

.. code-block:: Python

    import mlflow
    import pandas as pd

    mlflow.set_tracking_uri("http://127.0.0.1:8080")

    logged_model = 'runs:/9a42265cf0ef484c905b02afb8fe6246/iris'
    loaded_model = mlflow.pyfunc.load_model(logged_model)

    data = pd.read_csv("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
    loaded_model.predict(data)
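
The ``runs:/<run_id>/<artifact_path>`` URI used above combines the id of the MLFlow run with the ``artifact_path`` given when the model was logged. A minimal sketch of assembling it follows; the helper ``build_run_model_uri`` is hypothetical and not part of the flavor or of MLFlow.

```python
def build_run_model_uri(run_id: str, artifact_path: str) -> str:
    """Build a 'runs:/<run_id>/<artifact_path>' model URI.

    Hypothetical helper for illustration only.
    """
    run_id = run_id.strip()
    # Drop leading/trailing slashes so the URI has exactly one separator.
    artifact_path = artifact_path.strip("/")
    if not run_id or not artifact_path:
        raise ValueError("Both run_id and artifact_path must be non-empty")
    return f"runs:/{run_id}/{artifact_path}"
```

For example, the run id ``9a42265cf0ef484c905b02afb8fe6246`` and the artifact path ``iris`` give the URI shown in the snippet above. Inside a ``with mlflow.start_run() as run:`` block, the id is available as ``run.info.run_id``.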
63 changes: 63 additions & 0 deletions h2o-py-mlflow-flavor/build.gradle
@@ -0,0 +1,63 @@
description = "H2O-3 MLFlow Flavor"

dependencies {}

def buildVersion = new H2OBuildVersion(rootDir, version)

ext {
PROJECT_VERSION = buildVersion.getProjectVersion()
pythonexe = findProperty("pythonExec") ?: "python"
pipexe = findProperty("pipExec") ?: "pip"
if (System.env.VIRTUAL_ENV) {
pythonexe = "${System.env.VIRTUAL_ENV}/bin/python".toString()
pipexe = "${System.env.VIRTUAL_ENV}/bin/pip".toString()
}
testsPath = file("tests")
}

task copySrcFiles(type: Copy) {
from ("${projectDir}") {
include "setup.py"
include "setup.cfg"
include "h2o_mlflow_flavor/**"
include "README.rst"
}
into "${buildDir}"
}

task buildDist(type: Exec, dependsOn: [copySrcFiles]) {
workingDir buildDir
doFirst {
file("${buildDir}/tmp").mkdirs()
standardOutput = new FileOutputStream(file("${buildDir}/tmp/h2o_mlflow_flavor_buildDist.out"))
}
commandLine getOsSpecificCommandLine([pythonexe, "setup.py", "bdist_wheel"])
}

task copyMainDist(type: Copy, dependsOn: [buildDist]) {
from ("${buildDir}/main/") {
include "dist/**"
}
into "${buildDir}"
}

task pythonVersion(type: Exec) {
doFirst {
println(System.env.VIRTUAL_ENV)
println(environment)
}
commandLine getOsSpecificCommandLine([pythonexe, "--version"])
}

task cleanBuild(type: Delete) {
doFirst {
println "Cleaning..."
}
delete file("build/")
}

//
// Define the dependencies
//
clean.dependsOn cleanBuild
build.dependsOn copyMainDist
125 changes: 125 additions & 0 deletions h2o-py-mlflow-flavor/examples/DRF_mojo.ipynb
@@ -0,0 +1,125 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "3ded5553",
"metadata": {},
"outputs": [],
"source": [
"# Start H2O-3 runtime.\n",
"\n",
"import h2o\n",
"h2o.init(strict_version_check=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e746ad4",
"metadata": {},
"outputs": [],
"source": [
"# Configure DRF algorithm and train a model.\n",
"\n",
"from h2o.estimators import H2ORandomForestEstimator\n",
"\n",
"# Import the cars dataset into H2O:\n",
"cars = h2o.import_file(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n",
"\n",
"# Set the predictors and response;\n",
"# set the response as a factor:\n",
"cars[\"economy_20mpg\"] = cars[\"economy_20mpg\"].asfactor()\n",
"predictors = [\"displacement\",\"power\",\"weight\",\"acceleration\",\"year\"]\n",
"response = \"economy_20mpg\"\n",
"\n",
"# Split the dataset into a train and valid set:\n",
"train, valid = cars.split_frame(ratios=[.8], seed=1234)\n",
"drf = H2ORandomForestEstimator(ntrees=10,\n",
" max_depth=5,\n",
" min_rows=10,\n",
" calibrate_model=True,\n",
" calibration_frame=valid,\n",
" binomial_double_trees=True)\n",
"drf.train(x=predictors,\n",
" y=response,\n",
" training_frame=train,\n",
" validation_frame=valid)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29eb0722",
"metadata": {},
"outputs": [],
"source": [
"# Log the model to an MLFlow registry.\n",
"\n",
"import mlflow\n",
"import h2o_mlflow_flavor\n",
"mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n",
"\n",
"with mlflow.start_run(run_name=\"cars\") as run:\n",
" mlflow.log_params(h2o_mlflow_flavor.get_params(drf)) # Log training parameters of the model (optional).\n",
" mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(drf)) # Log performance metrics of the model (optional).\n",
" input_example = h2o_mlflow_flavor.get_input_example(drf) # Extract input example from training dataset (optional)\n",
" h2o_mlflow_flavor.log_model(drf, \"cars\", input_example=input_example,\n",
" model_type=\"MOJO\", # Specify whether the output model should be MOJO or POJO. (MOJO is default)\n",
" extra_prediction_args=[\"--predictCalibrated\"]) # Add extra prediction args if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bed1dafe",
"metadata": {},
"outputs": [],
"source": [
"# Load model from the MLFlow registry and score with the model.\n",
"\n",
"import mlflow\n",
"mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n",
"\n",
"logged_model = 'runs:/a9ff364f07fa499eb44e7c49e47fab11/cars' # Specify correct id of your run.\n",
"\n",
"# Load model as a PyFuncModel.\n",
"loaded_model = mlflow.pyfunc.load_model(logged_model)\n",
"\n",
"# Predict on a Pandas DataFrame.\n",
"import pandas as pd\n",
"data = pd.read_csv(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n",
"loaded_model.predict(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b0c4c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "mlflow",
"language": "python",
"name": "mlflow"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}