Merge pull request #18 from james94/mojo-nifi

Integrates H2O.ai Driverless AI MOJO Scoring Pipeline into Apache NiFi using - NiFi custom Processor built for NiFi version 1.11.4: ExecuteDaiMojoScoringPipeline - Java MOJO2 Runtime API version 2.4.8 - pipeline.mojo built with DAI version 1.9.0
h2oai · Oct 16, 2020 · b6bb0ad · b6bb0ad
2 parents 92407a4 + 35c2bfb
commit b6bb0ad
Show file tree

Hide file tree

Showing 18 changed files with 3,379 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -32,5 +32,8 @@ deployment templates.
       - [EC2 Ubuntu HTTP Server](./python-scoring-pipeline/http_ec2_ubuntu.md)
 - [Flink Custom RichMapFunction for Running the MOJO in Flink Data Pipeline](./mojo-flink)
    - This example will walk through how to use a Flink custom RichMapFunction to execute the MOJO Scoring Pipeline within a Flink Data Pipeline to do batch scoring and real-time scoring.
+- [Deploy Driverless AI MOJO Scoring Pipeline within a NiFi Data Flow](./mojo-nifi)
+   - This example will walk through how to use a NiFi custom processor to execute the MOJO Scoring Pipeline within a NiFi Data Flow to do batch scoring and real-time scoring.
 - [Deploy Driverless AI MOJO Scoring Pipeline in a MiNiFi C++ Data Flow with CEM](./mojo-py-minificpp)
    - This example will walk through how to install Cloudera Edge Management, which includes Edge Flow Manager, NiFi Registry and one MiNiFi C++ Agent and Driverless AI MOJO2 Python Runtime on an EC2 instance. It will then go through how to use EFM to build a data flow with Driverless AI MOJO Scoring Pipeline for a MiNiFi C++ Agent and publish that data flow to that agent. With Cloudera Ege Management and Driverless AI MOJO Scoring Pipeline integration, pushing ML models to an edge device is much easier.
+
diff --git a/mojo-nifi/.gitignore b/mojo-nifi/.gitignore
@@ -0,0 +1,19 @@
+target
+.project
+.settings
+.classpath
+nbactions.xml
+nb-configuration.xml
+.DS_Store
+.metadata
+.recommenders
+
+# Intellij
+.idea/
+*.iml
+*.iws
+*~
+
+.vscode/
+
+predData/
diff --git a/mojo-nifi/README.md b/mojo-nifi/README.md
@@ -0,0 +1,296 @@
+# Deploy Driverless AI MOJO within a NiFi Data Flow
+
+## Cloudera Integration Point for CDF
+
+Deploy the Driverless AI MOJO Scoring Pipeline to Apache NiFi by using the MOJO2 Java Runtime API and a custom NiFi processor. This will be a Cloudera Integration point for Cloudera Data Flow (CDF), particulary Cloudera Flow Management (CFM). CFM is powered by Apache NiFi.
+
+## Video Walkthrough
+
+The following link is a YouTube video that shows how to deploy the Driverless AI MOJO to NiFi to do batch and real-time scoring on Hydraulic System data to classify for Hydraulic Cooling Condition: [NiFi Custom Processor for Running the Driverless AI MOJO in NiFi DataFlow](https://youtu.be/c401tSqySS0)
+
+## Prerequisites
+
+- Driverless AI Environment (Tested with Driverless AI 1.9.0, MOJO Scoring Pipeline 2.4.8)
+
+- Launch Ubuntu 18.04 Linux EC2 instance
+    - Instance Type: t2.2xlarge
+    - Storage: 256GB
+    - Open custom TCP port 8080 and source on 0.0.0.0/0
+
+- Download the Driverless AI Deployment Repo to your local machine since we will be using the NiFi Data Flow xml templates that come with mojo-nifi/ folder.
+
+~~~bash
+git clone https://github.com/h2oai/dai-deployment-examples
+~~~
+
+## Task 1: Set Up Environment
+
+### Connect to EC2 from Local Machine
+
+1\. Move the EC2 Pivate Key File (Pem Key) to the .ssh folder
+
+~~~bash
+mv $HOME/Downloads/{private-key-filename}.pem $HOME/.ssh/
+chmod 400 $HOME/.ssh/{private-key-filename}.pem
+~~~
+
+2\. Set EC2 Public DNS and EC2 Pem Key as permanent environment variables
+
+~~~bash
+# For Mac OS X, set permanent environment variables 
+tee -a $HOME/.bash_profile << EOF
+# Set EC2 Public DNS
+export DAI_MOJO_NIFI_INSTANCE={EC2 Public DNS}.compute.amazon.com
+# Set EC2 Pem Key
+export DAI_MOJO_NIFI_PEM=$HOME/.ssh/{private-key-filename}.pem
+EOF
+
+# For Linux, set permanent environment variables
+tee -a $HOME/.profile << EOF
+# Set EC2 Public DNS
+export DAI_MOJO_NIFI_INSTANCE={EC2 Public DNS}.compute.amazon.com
+# Set EC2 Pem Key
+export DAI_MOJO_NIFI_PEM=$HOME/.ssh/{private-key-filename}.pem
+EOF
+
+source $HOME/.bash_profile
+~~~
+
+3\. Connect to EC2 via SSH
+
+~~~bash
+ssh -i $DAI_MOJO_NIFI_PEM ubuntu@$DAI_MOJO_NIFI_INSTANCE
+~~~
+
+### Create Environment Directory Structure
+
+1\. Run the following commands that will create the directories where you could store the **input data**, **mojo-pipeline/** folder.
+
+~~~bash
+# Create directory structure for DAI MOJO NiFi Projects
+mkdir $HOME/daimojo-nifi/
+
+mkdir -p $HOME/daimojo-nifi/testData/{test-batch-data,test-real-time-data}
+~~~
+
+### Set Up Driverless AI MOJO Scoring Pipeline in EC2
+
+1\. Build a **Driverless AI Experiment**
+
+- 1a\. Upload your dataset or use the following **Data Recipe URL** to import the **UCI Hydraulic System Condition Monitoring Dataset**:
+
+~~~bash
+# Data Recipe URL
+https://raw.githubusercontent.com/james94/driverlessai-recipes/master/data/hydraulic-data.py
+~~~
+
+- 1b\. Split the data **75% for training** and **25% for testing**.
+
+- 1c\. Run predict on your **training data**.
+
+- 1d\. Name the experiment **model_deployment**. Choose the **target column** for scoring. Choose the **test data**. Launch the experiment.
+
+2\. Click **Download MOJO Scoring Pipeline** in Driverless AI Experiment Dashboard
+
+- 2a\. Select **Java**, click **Download MOJO Scoring Pipeline** and send **mojo.zip** to EC2.
+
+~~~bash
+# Move Driverless AI MOJO Scoring Pipeline to EC2 instance
+scp -i $DAI_MOJO_NIFI_PEM $HOME/Downloads/mojo.zip ubuntu@$DAI_MOJO_NIFI_INSTANCE:/home/ubuntu/daimojo-nifi/
+~~~
+
+- 2b\. Unzip **mojo.zip**.
+
+~~~bash
+sudo apt -y install unzip
+cd $HOME/daimojo-nifi/
+unzip mojo.zip
+~~~
+
+3\. Install **MOJO2 Java Runtime Dependencies** in EC2
+
+- 3a\. Download and install Anaconda.
+
+~~~bash
+# Download, then install Anaconda
+wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
+
+bash Anaconda3-2020.02-Linux-x86_64.sh
+~~~
+
+- 3b\. Create **model-deployment** virtual environment
+
+~~~bash
+conda create -y -n model-deployment python=3.6
+conda activate model-deployment
+~~~
+
+- 3c\. Install the **required packages**:
+
+~~~bash
+# Install Java
+conda install -y -c conda-forge openjdk=8.0.192
+
+# Install Maven
+conda install -y -c conda-forge maven
+~~~
+
+4\. Set the **Driverless AI License Key** as a **temporary environment variable**
+
+~~~bash
+# Set Driverless AI License Key
+export DRIVERLESS_AI_LICENSE_KEY="{license-key}"
+~~~
+
+### Prepare Hydraulic Test Data For Mojo NiFi Scoring
+
+Make sure there is **input test data** in the input directory NiFi will be pulling data from.
+
+1\. For **batch scoring**, you should make sure there is one or more files with multiple rows of csv data in the following directory:
+
+~~~bash
+# go to mojo-pipeline/ directory with batch data example.csv
+cd $HOME/daimojo-nifi/mojo-pipeline/
+
+# copy this batch data to the input dir where NiFi pulls the batch data
+cp example.csv $HOME/daimojo-nifi/testData/test-batch-data/
+~~~
+
+2\. For **real-time scoring**, you should make sure there are files with a single row of csv data in the following directory:
+
+~~~bash
+# go to real-time input dir where we will store real-time data
+cd $HOME/daimojo-nifi/testData/test-real-time-data/
+
+# copy example.csv to the input dir where NiFi pulls the real-time data
+cp $HOME/daimojo-nifi/mojo-pipeline/example.csv .
+
+# remove file's 1st line, the header
+echo -e "$(sed '1d' example.csv)\n" > example.csv
+
+# split file into multiple files having 1 row of data with numeric suffix and .csv extension
+split -dl 1 --additional-suffix=.csv example.csv test_
+
+# remove example.csv from real-time input dir
+rm -rf example.csv
+~~~
+
+### Set Up NiFi in EC2
+
+1\. Download **NiFi**
+
+~~~bash
+cd $HOME
+# Download NiFi
+wget https://archive.apache.org/dist/nifi/1.11.4/nifi-1.11.4-bin.tar.gz
+# Extract NiFi tar.gz
+tar -xvf nifi-1.11.4-bin.tar.gz
+~~~
+
+### Compile Custom MOJO NiFi Processor
+
+1\. Download **Driverless AI Deployment Examples** Repo for **NiFi** assets
+
+~~~bash
+cd $HOME
+git clone https://github.com/h2oai/dai-deployment-examples
+~~~
+
+2\. Compile the Java code for the NiFi processor into a **NAR package**:
+
+~~~bash
+cd $HOME/dai-deployment-examples/mojo-nifi/nifi-nar-bundles/nifi-daimojo-record-bundle/
+mvn clean install
+~~~
+
+### Add Custom NiFi Processor to NiFi
+
+1\. Copy over the NiFi NAR file **nifi-h2o-record-nar-1.11.4.nar**, which contains the Java MOJO NiFi processor to the NiFi **lib/** folder:
+
+~~~bash
+cd $HOME/nifi-1.11.4/lib/
+
+# copy nifi h2o nar to current folder
+cp $HOME/dai-deployment-examples/mojo-nifi/nifi-nar-bundles/nifi-daimojo-record-bundle/nifi-daimojo-record-nar/target/nifi-daimojo-record-nar-1.11.4.nar .
+~~~
+
+### Start the NiFi Server in EC2
+
+1\. Start the NiFi server where we import NiFi data flows to do batch scoring or real-time scoring
+
+~~~bash
+cd $HOME/nifi-1.11.4/
+
+# start nifi server
+./bin/nifi.sh start
+
+# stop nifi server
+# ./bin/nifi.sh stop
+~~~
+
+2\. Access the NiFi UI: http://localhost:8080/nifi/
+
+## Task 2: Deploy MOJO Scoring Pipeline to NiFi
+
+### Import the NiFi Data Flow Template into NiFi in EC2
+
+1\. On the left side of the NiFi canvas there is an Operate Panel. Click on the **upload button** to upload a NiFi Data Flow Template from your local machine:
+
+<img src="images/nifi-operate-panel-upload.jpg" width="50%" height="50%" alt="nifi operate panel">
+
+2\. Choose one of the following NiFi Data Flow xml templates from your local machine to upload to NiFi:
+
+~~~bash
+# NiFi Data Flow Template executes MOJO for batch scoring
+$HOME/dai-deployment-examples/mojo-nifi/nifi-dataflow-templates/BatchPredHydCoolCond.xml
+
+# NiFi Data Flow Template executes MOJO for real-time scoring
+$HOME/dai-deployment-examples/mojo-nifi/nifi-dataflow-templates/RealTimePredHydCoolCond.xml
+~~~
+
+3\. Drag and drop the NiFi template component onto the NiFi canvas.
+
+![NiFi Flow Template Component](./images/nifi-flow-template-component.jpg)
+
+4\. Select the NiFi Data Flow template you just uploaded. 
+
+- If you uploaded **BatchPredHydCoolCond** template, then select it. 
+- Else if you uploaded **RealTimePredHydCoolCond** template, then select it.
+
+### Start the NiFi Data Flow
+
+Start the NiFi Flow to do **batch scoring** or **real-time scoring**
+
+1\. In the Operate Panel, click the **start button** to run the NiFi Flow. 
+
+<img src="images/nifi-operate-panel-start.jpg" width="50%" height="50%" alt="nifi operate panel">
+
+You should see all processors red stop icon change to a green play icon in their left corner.
+
+2\. Once the NiFi Flow has pulled in the Hydraulic data, performed predictions on the data using the **ExecuteDaiMojoScoringPipeline** processor, then click the **stop button** in the Operate Panel.
+
+### Batch Scoring
+
+If you uploaded the **BatchPredHydCoolCond** flow template, ran the flow and then stopped it, you should see the following NiFi Flow:
+
+![nifi-flow-run-mojo-batch-scores](images/nifi-flow-run-mojo-batch-scores.jpg)
+
+Here we look at a provenance event from PutFile processor for when NiFi executed the MOJO on some batch data (multiple rows of data) to do batch scoring. 
+
+1\. Right click on the PutFile processor, choose Data Provenance.
+
+2\. Choose the first event in the list and click on the **i** on the left corner of the first row.
+
+3\. A provenance event window should appear, then in the output side of the window, click on view to see the batch scores.
+
+![nifi-flow-batch-scores](images/nifi-flow-batch-scores.jpg)
+
+### Interactive Scoring
+
+If you uploaded the **RealTimePredHydCoolCond** flow template, ran the flow and then stopped it, you should see the following NiFi Flow:
+
+![nifi-flow-run-mojo-scores](images/nifi-flow-run-mojo-interactive-scores.jpg)
+
+Here we look at a provenance event from PutFile processor for when NiFi executed the MOJO on some real-time data (one row of data) to do real-time scoring. 
+
+![nifi-flow-real-time-scores](images/nifi-flow-real-time-scores.jpg)
diff --git a/mojo-nifi/images/nifi-flow-batch-scores.jpg b/mojo-nifi/images/nifi-flow-batch-scores.jpg
diff --git a/mojo-nifi/images/nifi-flow-real-time-scores.jpg b/mojo-nifi/images/nifi-flow-real-time-scores.jpg
diff --git a/mojo-nifi/images/nifi-flow-run-mojo-batch-scores.jpg b/mojo-nifi/images/nifi-flow-run-mojo-batch-scores.jpg
diff --git a/mojo-nifi/images/nifi-flow-run-mojo-interactive-scores.jpg b/mojo-nifi/images/nifi-flow-run-mojo-interactive-scores.jpg
diff --git a/mojo-nifi/images/nifi-flow-template-component.jpg b/mojo-nifi/images/nifi-flow-template-component.jpg
diff --git a/mojo-nifi/images/nifi-operate-panel-start.jpg b/mojo-nifi/images/nifi-operate-panel-start.jpg
diff --git a/mojo-nifi/images/nifi-operate-panel-upload.jpg b/mojo-nifi/images/nifi-operate-panel-upload.jpg