[GH-15654] Introduce MLFlow flavor for working with mojos and pojos (#15849)

* [GH-15654] Introduce MLFlow flavors for working with mojos and pojos
* First version of mojo flavor by Eric Wolf
* New saving method
* Update loading mojos
* Fix upload mojos.
* Fix building
* Fix h2o_mlflow_flavor
* Update
* Update loader module
* Update h2o_mojo
* Add genmodel flavor
* Fix pojo
* Fix pojo
* Add extraction of metrics
* Fix metric extraction
* add input examples
* moved mlflow-flavor
* Add examples
* Add doc
* Add description.rst
* Update description
* description
* description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* update description
* Add flavor as self reference
* Update build definition
* Remove gitignore
* Just one doc
* Revert DRF_mojo.ipynb

---------

Co-authored-by: Eric Wolf <[email protected]>
h2oai · Nov 14, 2023 · d6c889b
1 parent 6498915
commit d6c889b
Showing 9 changed files with 897 additions and 1 deletion.
diff --git a/build.gradle b/build.gradle
@@ -155,7 +155,8 @@ ext {
 
     pythonProjects = [
       project(':h2o-py'),
-      project(':h2o-py-cloud-extensions')
+      project(':h2o-py-cloud-extensions'),
+      project(':h2o-py-mlflow-flavor')
     ]
 
     // The project which need to be run under CI only

diff --git a/h2o-py-mlflow-flavor/README.rst b/h2o-py-mlflow-flavor/README.rst
@@ -0,0 +1,110 @@
+H2O-3 MLFlow Flavor
+===================
+
+A tiny library containing an `MLFlow <https://mlflow.org/>`_ flavor for working with H2O-3 MOJO and POJO models.
+
+Logging Models to MLFlow Registry
+---------------------------------
+
+A model trained with the H2O-3 runtime can be exported to the MLFlow registry with the `log_model` function:
+
+.. code-block:: Python
+
+    import mlflow
+    import h2o_mlflow_flavor
+
+    mlflow.set_tracking_uri("http://127.0.0.1:8080")
+
+    h2o_model = ...  # train an H2O-3 model here
+
+    with mlflow.start_run(run_name="myrun") as run:
+        h2o_mlflow_flavor.log_model(h2o_model=h2o_model,
+                                    artifact_path="folder",
+                                    model_type="MOJO",
+                                    extra_prediction_args=["--predictCalibrated"])
+
+
+Compared to the `log_model` functions of the flavors that ship with MLFlow, this function takes two extra arguments:
+
+* ``model_type`` - It indicates whether the model should be exported as `MOJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/mojo-quickstart.html#what-is-a-mojo>`_ or `POJO <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/pojo-quickstart.html#what-is-a-pojo>`_. The default value is `MOJO`.
+
+* ``extra_prediction_args`` - A list of extra arguments for the Java scoring process. Possible values:
+
+  * ``--setConvertInvalidNum`` - The scoring process will convert invalid numbers to NA.
+
+  * ``--predictContributions`` - The scoring process will also return Shapley values along with the predictions. The model must support Shapley values; otherwise, the scoring process will throw an error.
+
+  * ``--predictCalibrated`` - The scoring process will also return calibrated prediction values.
+
+The `save_model` function, which persists an H2O binary model as a MOJO or POJO, has the same signature as the `log_model` function.
+
+Extracting Information about Model
+----------------------------------
+
+The flavor offers several functions to extract information about the model.
+
+* ``get_metrics(h2o_model, metric_type=None)`` - Extracts metrics from the trained H2O binary model. It returns a dictionary and takes the following parameters:
+
+  * ``h2o_model`` - An H2O binary model.
+
+  * ``metric_type`` - The type of metrics. Possible values are "training", "validation", and "cross_validation". If the parameter is not specified, metrics for all types are returned.
+
+* ``get_params(h2o_model)`` - Extracts the training parameters of the H2O binary model. It returns a dictionary and expects one parameter:
+
+  * ``h2o_model`` - An H2O binary model.
+
+* ``get_input_example(h2o_model, number_of_records=5, relevant_columns_only=True)`` - Creates an example Pandas dataset from the training dataset of the H2O binary model. It takes the following parameters:
+
+  * ``h2o_model`` - An H2O binary model.
+
+  * ``number_of_records`` - The number of records to extract from the training dataset.
+
+  * ``relevant_columns_only`` - A flag indicating whether the output dataset should contain only columns required by the model. Defaults to ``True``.
+
+The functions can be utilized as follows:
+
+.. code-block:: Python
+
+    import mlflow
+    import h2o_mlflow_flavor
+
+    mlflow.set_tracking_uri("http://127.0.0.1:8080")
+
+    h2o_model = ...  # train an H2O-3 model here
+
+    with mlflow.start_run(run_name="myrun") as run:
+        mlflow.log_params(h2o_mlflow_flavor.get_params(h2o_model))
+        mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(h2o_model))
+        input_example = h2o_mlflow_flavor.get_input_example(h2o_model)
+        h2o_mlflow_flavor.log_model(h2o_model=h2o_model,
+                                    input_example=input_example,
+                                    artifact_path="folder",
+                                    model_type="MOJO",
+                                    extra_prediction_args=["--predictCalibrated"])
+
+
+Model Scoring
+-------------
+
+After a model is obtained from the model registry, it doesn't require the H2O runtime to score. The only thing
+the model requires is the ``h2o-genmodel.jar`` that was persisted with the model during the saving procedure.
+The model can be loaded with the function ``load_model(model_uri, dst_path=None)``. It returns an object that makes
+predictions on a Pandas dataframe and takes the following parameters:
+
+* ``model_uri`` - A unique identifier of the model within the MLFlow registry.
+
+* ``dst_path`` - (Optional) A local filesystem path for downloading the persisted form of the model.
+
+The scoring object can also be obtained via the `pyfunc` flavor as follows:
+
+.. code-block:: Python
+
+    import mlflow
+    mlflow.set_tracking_uri("http://127.0.0.1:8080")
+
+    logged_model = 'runs:/9a42265cf0ef484c905b02afb8fe6246/iris'
+    loaded_model = mlflow.pyfunc.load_model(logged_model)
+
+    import pandas as pd
+    data = pd.read_csv("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
+    loaded_model.predict(data)
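
Note on ``extra_prediction_args``: the README above documents three flags for the Java scoring process. A minimal sketch of checking a list against those flags before calling `log_model` might look like the following. The helper name `check_prediction_args` and the idea of validating on the client side are assumptions made for illustration; they are not part of `h2o_mlflow_flavor`, and the scoring process may accept flags that the README does not list.

```python
# Flags documented in the README above; the real scorer may accept others.
DOCUMENTED_FLAGS = frozenset({
    "--setConvertInvalidNum",
    "--predictContributions",
    "--predictCalibrated",
})


def check_prediction_args(args):
    """Return a copy of args, or raise ValueError naming undocumented flags.

    Hypothetical helper: it only compares against the README's list.
    """
    unknown = [arg for arg in args if arg not in DOCUMENTED_FLAGS]
    if unknown:
        raise ValueError(f"Undocumented prediction args: {unknown}")
    return list(args)


if __name__ == "__main__":
    # A typo such as "--predictCalibratd" would raise here instead of
    # surfacing later, when the logged model is scored.
    print(check_prediction_args(["--predictCalibrated"]))
```

The checked list could then be passed as `extra_prediction_args` to `log_model` or `save_model`.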
diff --git a/h2o-py-mlflow-flavor/build.gradle b/h2o-py-mlflow-flavor/build.gradle
@@ -0,0 +1,63 @@
+description = "H2O-3 MLFlow Flavor"
+
+dependencies {}
+
+def buildVersion = new H2OBuildVersion(rootDir, version)
+
+ext {
+    PROJECT_VERSION = buildVersion.getProjectVersion()
+    pythonexe = findProperty("pythonExec") ?: "python"
+    pipexe = findProperty("pipExec") ?: "pip"
+    if (System.env.VIRTUAL_ENV) {
+        pythonexe = "${System.env.VIRTUAL_ENV}/bin/python".toString()
+        pipexe = "${System.env.VIRTUAL_ENV}/bin/pip".toString()
+    }
+    testsPath = file("tests")
+}
+
+task copySrcFiles(type: Copy) {
+    from ("${projectDir}") {
+        include "setup.py"
+        include "setup.cfg"
+        include "h2o_mlflow_flavor/**"
+        include "README.rst"
+    }
+    into "${buildDir}"
+}
+
+task buildDist(type: Exec, dependsOn: [copySrcFiles]) {
+    workingDir buildDir
+    doFirst {
+        file("${buildDir}/tmp").mkdirs()
+        standardOutput = new FileOutputStream(file("${buildDir}/tmp/h2o_mlflow_flavor_buildDist.out"))
+    }
+    commandLine getOsSpecificCommandLine([pythonexe, "setup.py", "bdist_wheel"])
+}
+
+task copyMainDist(type: Copy, dependsOn: [buildDist]) {
+    from ("${buildDir}/main/") {
+        include "dist/**"
+    }
+    into "${buildDir}"
+}
+
+task pythonVersion(type: Exec) {
+    doFirst {
+        println(System.env.VIRTUAL_ENV)
+        println(environment)
+    }
+    commandLine getOsSpecificCommandLine([pythonexe, "--version"])
+}
+
+task cleanBuild(type: Delete) {
+    doFirst {
+        println "Cleaning..."
+    }
+    delete file("build/")
+}
+
+//
+// Define the dependencies
+//
+clean.dependsOn cleanBuild
+build.dependsOn copyMainDist
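
Note on the `ext` block above: the Python and pip executables are resolved in two steps, first from the `pythonExec` / `pipExec` project properties (falling back to `python` / `pip`), then overridden whenever `VIRTUAL_ENV` is set. The precedence can be sketched in Python as below. The function name `resolve_python_tools` and the dictionary-based inputs are illustrative assumptions, not part of the build.

```python
import os


def resolve_python_tools(project_props, env):
    """Mirror the Gradle ext block: property or default, then venv override."""
    # Groovy's `findProperty(...) ?: "python"` falls back on null or empty values.
    python_exe = project_props.get("pythonExec") or "python"
    pip_exe = project_props.get("pipExec") or "pip"
    # An active virtual environment wins over any project property.
    venv = env.get("VIRTUAL_ENV")
    if venv:
        python_exe = f"{venv}/bin/python"
        pip_exe = f"{venv}/bin/pip"
    return python_exe, pip_exe


if __name__ == "__main__":
    print(resolve_python_tools({"pythonExec": "python3"}, dict(os.environ)))
```

One consequence worth noting is that a `pythonExec` property passed on the command line is ignored inside an activated virtual environment. The `bin/` layout also matches POSIX virtual environments; Windows virtual environments place their executables under `Scripts`.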
diff --git a/h2o-py-mlflow-flavor/examples/DRF_mojo.ipynb b/h2o-py-mlflow-flavor/examples/DRF_mojo.ipynb
@@ -0,0 +1,125 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3ded5553",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Start H2O-3 runtime.\n",
+    "\n",
+    "import h2o\n",
+    "h2o.init(strict_version_check=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e746ad4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Configure DRF algorithm and train a model.\n",
+    "\n",
+    "from h2o.estimators import H2ORandomForestEstimator\n",
+    "\n",
+    "# Import the cars dataset into H2O:\n",
+    "cars = h2o.import_file(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n",
+    "\n",
+    "# Set the predictors and response;\n",
+    "# set the response as a factor:\n",
+    "cars[\"economy_20mpg\"] = cars[\"economy_20mpg\"].asfactor()\n",
+    "predictors = [\"displacement\",\"power\",\"weight\",\"acceleration\",\"year\"]\n",
+    "response = \"economy_20mpg\"\n",
+    "\n",
+    "# Split the dataset into a train and valid set:\n",
+    "train, valid = cars.split_frame(ratios=[.8], seed=1234)\n",
+    "drf = H2ORandomForestEstimator(ntrees=10,\n",
+    "                                    max_depth=5,\n",
+    "                                    min_rows=10,\n",
+    "                                    calibrate_model=True,\n",
+    "                                    calibration_frame=valid,\n",
+    "                                    binomial_double_trees=True)\n",
+    "drf.train(x=predictors,\n",
+    "          y=response,\n",
+    "          training_frame=train,\n",
+    "          validation_frame=valid)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "29eb0722",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Log the model to an MLFlow registry.\n",
+    "\n",
+    "import mlflow\n",
+    "import h2o_mlflow_flavor\n",
+    "mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n",
+    "\n",
+    "with mlflow.start_run(run_name=\"cars\") as run:\n",
+    "    mlflow.log_params(h2o_mlflow_flavor.get_params(drf)) # Log training parameters of the model (optional).\n",
+    "    mlflow.log_metrics(h2o_mlflow_flavor.get_metrics(drf)) # Log performance metrics of the model (optional).\n",
+    "    input_example = h2o_mlflow_flavor.get_input_example(drf) # Extract input example from training dataset (optional)\n",
+    "    h2o_mlflow_flavor.log_model(drf, \"cars\", input_example=input_example,\n",
+    "                                model_type=\"MOJO\", # Specify whether the output model should be MOJO or POJO. (MOJO is default)\n",
+    "                                extra_prediction_args=[\"--predictCalibrated\"]) # Add extra prediction args if needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bed1dafe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load model from the MLFlow registry and score with the model.\n",
+    "\n",
+    "import mlflow\n",
+    "mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n",
+    "\n",
+    "logged_model = 'runs:/a9ff364f07fa499eb44e7c49e47fab11/cars' # Specify correct id of your run.\n",
+    "\n",
+    "# Load model as a PyFuncModel.\n",
+    "loaded_model = mlflow.pyfunc.load_model(logged_model)\n",
+    "\n",
+    "# Predict on a Pandas DataFrame.\n",
+    "import pandas as pd\n",
+    "data = pd.read_csv(\"https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv\")\n",
+    "loaded_model.predict(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "905b0c4c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "mlflow",
+   "language": "python",
+   "name": "mlflow"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
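
Note on the last notebook cell: the run id in `logged_model` is hard-coded and has to be edited by hand after each logging run. MLflow model URIs for run artifacts follow the `runs:/<run_id>/<artifact_path>` scheme, so the URI can be assembled from the run object returned by `mlflow.start_run`. The helper below is a small sketch; the name `runs_uri` is an assumption for illustration and is not part of MLflow or `h2o_mlflow_flavor`.

```python
def runs_uri(run_id, artifact_path):
    """Build an MLflow ``runs:/<run_id>/<artifact_path>`` model URI."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    # Strip stray slashes so "cars" and "/cars/" give the same URI.
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


if __name__ == "__main__":
    # In the notebook this would be runs_uri(run.info.run_id, "cars"),
    # using the `run` object bound by `with mlflow.start_run(...) as run`.
    print(runs_uri("a9ff364f07fa499eb44e7c49e47fab11", "cars"))
```

The resulting string can be passed to `mlflow.pyfunc.load_model` in place of the literal.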