In this section we will serve our model in our RHOAI instance using GitOps, and we will use a custom serving runtime (NVIDIA Triton) to deploy the model.
To do this we need a data connection to the S3 instance where our model is stored, and we need to create the custom serving runtime and an Inference Service.
RHOAI supports adding your own serving runtimes, but it does not support the runtimes themselves; it is up to you to configure, adjust, and maintain your custom runtimes.
In this tutorial we will set up the Triton runtime (NVIDIA Triton Inference Server) and serve a model with it.
- In the `parasol-insurance` tenant, create a directory named `multi-model-serving`.
- Create the `base` and `overlays` directories inside the `multi-model-serving` directory.
- Create a directory named `parasol-insurance-dev` under the `multi-model-serving/overlays` directory.
- Create a `kustomization.yaml` inside the `multi-model-serving/overlays/parasol-insurance-dev` directory and point it to the base folder of the `multi-model-serving` directory.

Solution: `multi-model-serving/overlays/parasol-insurance-dev/kustomization.yaml`

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
```
- Create a `kustomization.yaml` inside the `multi-model-serving/base` directory. This should have the `parasol-insurance` namespace, as well as data-connection, inference-service, and serving-runtime as resources.

`multi-model-serving/base/kustomization.yaml`

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  <add resources here>
```

Solution: `multi-model-serving/base/kustomization.yaml`

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: parasol-insurance
resources:
  - data-connection.yaml
  - inference-service.yaml
  - serving-runtime.yaml
```
- Create a data connection with the MinIO details: create a file named `data-connection.yaml` inside the `multi-model-serving/base` directory. Make sure to add the RHOAI labels so it will show up in the RHOAI dashboard.

Solution: `multi-model-serving/base/data-connection.yaml`

```yaml
kind: Secret
apiVersion: v1
metadata:
  name: accident-model-data-conn
  labels:
    opendatahub.io/dashboard: 'true'
    opendatahub.io/managed: 'true'
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: multi-model
    # argocd.argoproj.io/sync-wave: "-100"
stringData:
  AWS_ACCESS_KEY_ID: minio
  AWS_S3_BUCKET: models
  AWS_S3_ENDPOINT: http://minio.object-datastore.svc.cluster.local:9000
  AWS_SECRET_ACCESS_KEY: minio123
  AWS_DEFAULT_REGION: east-1
type: Opaque
```
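Before moving on, you can optionally sanity-check that these connection details actually point at the model object. The sketch below is an illustration, not part of the lab: it assumes you run it from a workbench inside the cluster (the endpoint is a cluster-internal service address), that `boto3` is available in the image, and it uses the model path that the Inference Service will reference later.

```python
# Optional sanity check (illustrative): list the model object using the same
# values as the data connection above. Run from a workbench inside the cluster,
# since the endpoint is a cluster-internal service address; boto3 is assumed
# to be available in the workbench image.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.object-datastore.svc.cluster.local:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
    region_name="east-1",
)

# The Inference Service created later points at accident_model/accident_detect.onnx
resp = s3.list_objects_v2(Bucket="models", Prefix="accident_model/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```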
- To create the custom Triton serving runtime, create a file named `serving-runtime.yaml` inside the `multi-model-serving/base` directory with the following content:

`multi-model-serving/base/serving-runtime.yaml`

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: multi-model-server
  labels:
    opendatahub.io/dashboard: 'true'
  annotations:
    maxLoadingConcurrency: "2"
    openshift.io/template-display-name: "Triton Model Server"
    openshift.io/display-name: multi-model-server
    opendatahub.io/apiProtocol: REST
    enable-route: 'true'
    opendatahub.io/template-name: triton
spec:
  supportedModelFormats:
    - name: keras
      version: "2" # 2.6.0
      autoSelect: true
    - name: onnx
      version: "1" # 1.5.3
      autoSelect: true
    - name: pytorch
      version: "1" # 1.8.0a0+17f8c32
      autoSelect: true
    - name: tensorflow
      version: "1" # 1.15.4
      autoSelect: true
    - name: tensorflow
      version: "2" # 2.3.1
      autoSelect: true
    - name: tensorrt
      version: "7" # 7.2.1
      autoSelect: true
  protocolVersions:
    - grpc-v2
  multiModel: true
  grpcEndpoint: "port:8085"
  grpcDataEndpoint: "port:8001"
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:23.05-py3
      command: [/bin/sh]
      args:
        - -c
        - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver "--model-repository=/models/_triton_models" "--model-control-mode=explicit" "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true" "--allow-sagemaker=false" '
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "5"
          memory: 1Gi
      livenessProbe:
        exec:
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "9"
            - http://localhost:8000/v2/health/live
        initialDelaySeconds: 5
        periodSeconds: 30
        timeoutSeconds: 10
  builtInAdapter:
    serverType: triton
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
```
Once we have our serving runtime, we can use it as the runtime for our Inference Service.
- To create the Inference Service, create a file named `inference-service.yaml` inside the `multi-model-serving/base` directory. Make sure to add the RHOAI labels so we can view it in the RHOAI dashboard.

`multi-model-serving/base/inference-service.yaml`

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: accident-detect-model
    serving.kserve.io/deploymentMode: ModelMesh
  name: accident-detect-model
  labels:
    <add RHOAI dashboard labels here>
spec:
  predictor:
    model:
      modelFormat:
        name: <add format name here>
        version: '1'
      name: ''
      resources: {}
      runtime: <add runtime>
      storage:
        key: <add data connection here>
        path: <Add model path here>
```

Solution: `multi-model-serving/base/inference-service.yaml`

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: accident-detect-model
    serving.kserve.io/deploymentMode: ModelMesh
  name: accident-detect-model
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: '1'
      name: ''
      resources: {}
      runtime: multi-model-server
      storage:
        key: accident-model-data-conn
        path: accident_model/accident_detect.onnx
```
- Push the changes to your ai-accelerator fork.
- Wait for the application to sync in Argo CD.
- Navigate to RHOAI and validate that the new model is being served under the Models tab, and check that its status is green.
To test that the served model works as expected, go back to the RHOAI `parasol-insurance` project and open the Workbenches tab.
Stop the `standard-workbench` and start the `custom-workbench`.
Once the `custom-workbench` is running, navigate to `parasol-insurance/lab-materials/04` and open the `04-05-model-serving` notebook. We need to change the `RestURL`/`infer_url` value; we can get it from the model that we just deployed.
Make sure to change these values in the notebook when testing.
After making these changes, run the notebook; we should see the model's output for the image that we pass to it.
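The exact request code lives in the `04-05-model-serving` notebook; as an illustration of the kind of call it makes, here is a minimal sketch of a KServe v2 REST inference request against the deployed model. The route placeholder, the `images` input-tensor name, and the 640x640 float32 preprocessing are assumptions for illustration only; use the actual `infer_url` and input details from the notebook and from your deployed model.

```python
# Minimal sketch of a KServe v2 REST inference call (illustrative assumptions:
# the route placeholder, the "images" tensor name, and the 640x640 preprocessing).
import numpy as np
import requests
from PIL import Image

# Assumption: the inference endpoint copied from the Models tab in the RHOAI dashboard
infer_url = "https://<model-route>/v2/models/accident-detect-model/infer"

# Assumption: the model expects a 1x3x640x640 float32 tensor scaled to [0, 1]
img = Image.open("sample_accident.jpg").convert("RGB").resize((640, 640))
tensor = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)[np.newaxis] / 255.0

payload = {
    "inputs": [
        {
            "name": "images",  # assumption: input tensor name defined by the ONNX model
            "shape": list(tensor.shape),
            "datatype": "FP32",
            "data": tensor.flatten().tolist(),
        }
    ]
}

response = requests.post(infer_url, json=payload, verify=False)  # verify=False for self-signed certs
response.raise_for_status()
for output in response.json()["outputs"]:
    print(output["name"], output["shape"])
```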
Caution: You have now entered the CHALLENGE PHASE of your project! You are now enabled! Your team lead has died! You must deploy the model to prod, ideally using GitOps.