…15849) * [GH-15654] Introduce MLFlow flavors for working with mojos and pojos * First version of mojo flavor by Eric Wolf * New saving method * Update loading mojos * Fix upload mojos. * Fix building * Fix h2o_mlflow_flavor * Update * Update loader module * Update h2o_mojo * Add genmodel flavor * Fix pojo * Fix pojo * Add extraction of metrics * Fix metric extraction * add input examples * moved mlflow-flavor * Add examples * Add doc * Add description.rst * Update description * description * description * update description * update description * update description * update description * update description * update description * update description * update description * update description * update description * update description * update description * update description * update description * update description * Add flavor as self reference * Update build definition * Remove gitignore * Just one doc * Revert DRF_mojo.ipynb --------- Co-authored-by: Eric Wolf <[email protected]>
Showing 9 changed files with 897 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
H2O-3 MLFlow Flavor | ||
=================== | ||
|
||
A tiny library containing a `MLFlow <https://mlflow.org/>`_ flavor for working with H2O-3 MOJO and POJO models. | ||
|
||
Logging Models to MLFlow Registry | ||
--------------------------------- | ||
|
||
A model trained with the H2O-3 runtime can be exported to the MLFlow registry with the `log_model` function: | ||
|
||
.. code-block:: Python | ||
|
||
    import mlflow | ||
    import h2o_mlflow_flavor | ||
    mlflow.set_tracking_uri("http://127.0.0.1:8080") | ||
    h2o_model = ...  # training phase | ||
    with mlflow.start_run(run_name="myrun") as run: | ||
        h2o_mlflow_flavor.log_model(h2o_model=h2o_model, | ||
                                    artifact_path="folder", | ||
                                    model_type="MOJO", | ||
                                    extra_prediction_args=["--predictCalibrated"]) | ||
|
||
Compared to the `log_model` functions of the flavors that ship with MLFlow, this function has two extra arguments: | ||
|
||
* ``model_type`` - It indicates whether the model should be exported as `MOJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/mojo-quickstart.html#what-is-a-mojo>`_ or `POJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/pojo-quickstart.html#what-is-a-pojo>`_. The default value is `MOJO`. | ||
|
||
* ``extra_prediction_args`` - A list of extra arguments for the Java scoring process. Possible values: | ||
|
||
* ``--setConvertInvalidNum`` - The scoring process will convert invalid numbers to NA. | ||
|
||
* ``--predictContributions`` - The scoring process will also return Shapley values along with the predictions. The model must support Shapley values; otherwise, the scoring process will throw an error. | ||
|
||
* ``--predictCalibrated`` - The scoring process will also return calibrated prediction values. | ||
|
||
The `save_model` function, which persists an H2O binary model as a MOJO or POJO, has the same signature as the `log_model` function. | ||
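As a minimal sketch of how a caller might catch typos in ``extra_prediction_args`` before logging or saving a model, the check below accepts only the three flags documented above. The helper name ``validate_prediction_args`` is hypothetical and is not part of ``h2o_mlflow_flavor``; the flavor itself may accept flags that this README does not list.

```python
# Hypothetical helper; not part of h2o_mlflow_flavor.
# The flag set is copied from the list documented in this README.
DOCUMENTED_PREDICTION_ARGS = frozenset({
    "--setConvertInvalidNum",
    "--predictContributions",
    "--predictCalibrated",
})


def validate_prediction_args(args):
    """Return args as a list, raising ValueError on any undocumented flag."""
    args = list(args or [])
    unknown = [a for a in args if a not in DOCUMENTED_PREDICTION_ARGS]
    if unknown:
        raise ValueError(f"Undocumented prediction args: {unknown}")
    return args


checked = validate_prediction_args(["--predictCalibrated"])
print(checked)
```

The validated list could then be passed as ``extra_prediction_args`` to ``log_model`` or ``save_model``.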
|
||
Extracting Information about Model | ||
---------------------------------- | ||
|
||
The flavor offers several functions to extract information about the model. | ||
|
||
* ``get_metrics(h2o_model, metric_type=None)`` - Extracts metrics from the trained H2O binary model. It returns a dictionary and takes the following parameters: | ||
|
||
* ``h2o_model`` - An H2O binary model. | ||
|
||
* ``metric_type`` - The type of metrics. Possible values are "training", "validation", and "cross_validation". If the parameter is not specified, metrics for all types are returned. | ||
|
||
* ``get_params(h2o_model)`` - Extracts training parameters of the H2O binary model. It returns a dictionary and expects one parameter: | ||
|
||
* ``h2o_model`` - An H2O binary model. | ||
|
||
* ``get_input_example(h2o_model, number_of_records=5, relevant_columns_only=True)`` - Creates an example Pandas dataset from the training dataset of the H2O binary model. It takes the following parameters: | ||
|
||
* ``h2o_model`` - An H2O binary model. | ||
|
||
* ``number_of_records`` - The number of records that will be extracted from the training dataset. Defaults to ``5``. | ||
|
||
* ``relevant_columns_only`` - A flag indicating whether the output dataset should contain only columns required by the model. Defaults to ``True``. | ||
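The ``metric_type`` behavior described above can be pictured with a small, self-contained sketch. It is not the flavor's implementation: the function ``select_metrics``, the nested input dictionary, and the ``<type>_<name>`` key scheme are all assumptions made for illustration, and the numbers are placeholders.

```python
# Illustrative only: mimics the documented metric_type semantics on plain dicts.
METRIC_TYPES = ("training", "validation", "cross_validation")


def select_metrics(metrics_by_type, metric_type=None):
    """Flatten {type: {name: value}} into a single dict.

    With metric_type=None every available type is included; otherwise only
    the requested one. The "<type>_<name>" key format is an assumption.
    """
    if metric_type is not None and metric_type not in METRIC_TYPES:
        raise ValueError(f"Unknown metric_type: {metric_type!r}")
    wanted = METRIC_TYPES if metric_type is None else (metric_type,)
    return {
        f"{kind}_{name}": float(value)
        for kind in wanted
        for name, value in metrics_by_type.get(kind, {}).items()
    }


# Placeholder values, not results from any real model.
example = {"training": {"auc": 0.5}, "validation": {"auc": 0.25}}
print(select_metrics(example, "validation"))
print(select_metrics(example))
```

A flat dictionary of float values is the shape that ``mlflow.log_metrics`` accepts.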
|
||
The functions can be utilized as follows: | ||
|
||
.. code-block:: Python | ||
|
||
    import mlflow | ||
    import h2o_mlflow_flavor | ||
    mlflow.set_tracking_uri("http://127.0.0.1:8080") | ||
    h2o_model = ... training phase ... | ||
    with mlflow.start_run(run_name="myrun") as run: | ||
        mlflow.log_params(h2o_mlflow_flavor.get_params(h2o_model)) | ||
        mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(h2o_model)) | ||
        input_example = h2o_mlflow_flavor.get_input_example(h2o_model) | ||
        h2o_mlflow_flavor.log_model(h2o_model=h2o_model, | ||
                                    input_example=input_example, | ||
                                    artifact_path="folder", | ||
                                    model_type="MOJO", | ||
                                    extra_prediction_args=["--predictCalibrated"]) | ||
|
||
Model Scoring | ||
------------- | ||
|
||
After a model is obtained from the model registry, it does not require the H2O runtime in order to score. The only thing | ||
the model requires is ``h2o-genmodel.jar``, which was persisted with the model during saving. | ||
The model can be loaded with the function ``load_model(model_uri, dst_path=None)``. It returns an object that makes | ||
predictions on a Pandas dataframe and takes the following parameters: | ||
|
||
* ``model_uri`` - A unique identifier of the model within the MLFlow registry. | ||
|
||
* ``dst_path`` - (Optional) A local filesystem path for downloading the persisted form of the model. | ||
|
||
The object for scoring can also be obtained via the `pyfunc` flavor as follows: | ||
|
||
.. code-block:: Python | ||
import mlflow | ||
mlflow.set_tracking_uri("http://127.0.0.1:8080") | ||
logged_model = 'runs:/9a42265cf0ef484c905b02afb8fe6246/iris' | ||
loaded_model = mlflow.pyfunc.load_model(logged_model) | ||
import pandas as pd | ||
data = pd.read_csv("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") | ||
loaded_model.predict(data) |
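The ``model_uri`` in the example above follows MLflow's ``runs:/<run_id>/<artifact_path>`` scheme, where the last segment matches the ``artifact_path`` passed to ``log_model``. A minimal sketch of assembling such a URI follows; the helper name ``build_runs_uri`` is hypothetical and exists only for illustration.

```python
def build_runs_uri(run_id, artifact_path):
    """Compose an MLflow 'runs:/' URI from a run id and an artifact path."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    # Strip surrounding slashes so the URI has exactly one separator.
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


uri = build_runs_uri("9a42265cf0ef484c905b02afb8fe6246", "iris")
print(uri)
```

Inside the logging run, the id is available as ``run.info.run_id``, so the URI does not have to be copied by hand from the MLflow UI.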
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
description = "H2O-3 MLFlow Flavor" | ||
|
||
dependencies {} | ||
|
||
def buildVersion = new H2OBuildVersion(rootDir, version) | ||
|
||
ext { | ||
PROJECT_VERSION = buildVersion.getProjectVersion() | ||
pythonexe = findProperty("pythonExec") ?: "python" | ||
pipexe = findProperty("pipExec") ?: "pip" | ||
if (System.env.VIRTUAL_ENV) { | ||
pythonexe = "${System.env.VIRTUAL_ENV}/bin/python".toString() | ||
pipexe = "${System.env.VIRTUAL_ENV}/bin/pip".toString() | ||
} | ||
testsPath = file("tests") | ||
} | ||
|
||
task copySrcFiles(type: Copy) { | ||
from ("${projectDir}") { | ||
include "setup.py" | ||
include "setup.cfg" | ||
include "h2o_mlflow_flavor/**" | ||
include "README.rst" | ||
} | ||
into "${buildDir}" | ||
} | ||
|
||
task buildDist(type: Exec, dependsOn: [copySrcFiles]) { | ||
workingDir buildDir | ||
doFirst { | ||
file("${buildDir}/tmp").mkdirs() | ||
standardOutput = new FileOutputStream(file("${buildDir}/tmp/h2o_mlflow_flavor_buildDist.out")) | ||
} | ||
commandLine getOsSpecificCommandLine([pythonexe, "setup.py", "bdist_wheel"]) | ||
} | ||
|
||
task copyMainDist(type: Copy, dependsOn: [buildDist]) { | ||
from ("${buildDir}/main/") { | ||
include "dist/**" | ||
} | ||
into "${buildDir}" | ||
} | ||
|
||
task pythonVersion(type: Exec) { | ||
doFirst { | ||
println(System.env.VIRTUAL_ENV) | ||
println(environment) | ||
} | ||
commandLine getOsSpecificCommandLine([pythonexe, "--version"]) | ||
} | ||
|
||
task cleanBuild(type: Delete) { | ||
doFirst { | ||
println "Cleaning..." | ||
} | ||
delete file("build/") | ||
} | ||
|
||
// | ||
// Define the dependencies | ||
// | ||
clean.dependsOn cleanBuild | ||
build.dependsOn copyMainDist |
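The `ext` block in this build file resolves the Python and pip executables in a fixed order: the `pythonExec` and `pipExec` project properties, then the defaults `python` and `pip`, with an active `VIRTUAL_ENV` overriding both. The sketch below restates that precedence in Python so it can be checked in isolation. The function name `resolve_executables` and the use of a plain dict in place of Gradle project properties are assumptions for illustration.

```python
import os


def resolve_executables(properties, env=None):
    """Mirror the precedence used in the build script's ext block.

    properties: dict standing in for Gradle project properties.
    env: mapping standing in for the process environment.
    """
    env = os.environ if env is None else env
    python_exe = properties.get("pythonExec") or "python"
    pip_exe = properties.get("pipExec") or "pip"
    venv = env.get("VIRTUAL_ENV")
    if venv:
        # An active virtualenv wins over both the properties and the defaults.
        python_exe = f"{venv}/bin/python"
        pip_exe = f"{venv}/bin/pip"
    return python_exe, pip_exe


print(resolve_executables({"pythonExec": "python3"}, env={}))
print(resolve_executables({"pythonExec": "python3"}, env={"VIRTUAL_ENV": "/opt/venv"}))
```

Note that the `bin/` layout is the POSIX virtualenv convention; a Windows virtualenv places its executables under `Scripts` instead.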
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "3ded5553", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Start H2O-3 runtime.\n", | ||
"\n", | ||
"import h2o\n", | ||
"h2o.init(strict_version_check=False)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "5e746ad4", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Configure DRF algorithm and train a model.\n", | ||
"\n", | ||
"from h2o.estimators import H2ORandomForestEstimator\n", | ||
"\n", | ||
"# Import the cars dataset into H2O:\n", | ||
"cars = h2o.import_file(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n", | ||
"\n", | ||
"# Set the predictors and response;\n", | ||
"# set the response as a factor:\n", | ||
"cars[\"economy_20mpg\"] = cars[\"economy_20mpg\"].asfactor()\n", | ||
"predictors = [\"displacement\",\"power\",\"weight\",\"acceleration\",\"year\"]\n", | ||
"response = \"economy_20mpg\"\n", | ||
"\n", | ||
"# Split the dataset into a train and valid set:\n", | ||
"train, valid = cars.split_frame(ratios=[.8], seed=1234)\n", | ||
"drf = H2ORandomForestEstimator(ntrees=10,\n", | ||
" max_depth=5,\n", | ||
" min_rows=10,\n", | ||
" calibrate_model=True,\n", | ||
" calibration_frame=valid,\n", | ||
" binomial_double_trees=True)\n", | ||
"drf.train(x=predictors,\n", | ||
" y=response,\n", | ||
" training_frame=train,\n", | ||
" validation_frame=valid)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "29eb0722", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Log the model to an MLFlow registry.\n", | ||
"\n", | ||
"import mlflow\n", | ||
"import h2o_mlflow_flavor\n", | ||
"mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n", | ||
"\n", | ||
"with mlflow.start_run(run_name=\"cars\") as run:\n", | ||
" mlflow.log_params(h2o_mlflow_flavor.get_params(drf)) # Log training parameters of the model (optional).\n", | ||
" mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(drf)) # Log performance metrics of the model (optional).\n", | ||
" input_example = h2o_mlflow_flavor.get_input_example(drf) # Extract input example from training dataset (optional)\n", | ||
" h2o_mlflow_flavor.log_model(drf, \"cars\", input_example=input_example,\n", | ||
" model_type=\"MOJO\", # Specify whether the output model should be MOJO or POJO. (MOJO is default)\n", | ||
" extra_prediction_args=[\"--predictCalibrated\"]) # Add extra prediction args if needed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "bed1dafe", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Load model from the MLFlow registry and score with the model.\n", | ||
"\n", | ||
"import mlflow\n", | ||
"mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n", | ||
"\n", | ||
"logged_model = 'runs:/a9ff364f07fa499eb44e7c49e47fab11/cars' # Specify correct id of your run.\n", | ||
"\n", | ||
"# Load model as a PyFuncModel.\n", | ||
"loaded_model = mlflow.pyfunc.load_model(logged_model)\n", | ||
"\n", | ||
"# Predict on a Pandas DataFrame.\n", | ||
"import pandas as pd\n", | ||
"data = pd.read_csv(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n", | ||
"loaded_model.predict(data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "905b0c4c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "mlflow", | ||
"language": "python", | ||
"name": "mlflow" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |