# face-hugger

This repository is a minimal example of converting a Hugging Face model to ONNX and then hosting it on a Triton Inference Server deployed to Kubernetes.

## ONNX Export

This repository uses Hugging Face Optimum to convert a transformer model to the ONNX format. To avoid using the same pod resources for both exporting and serving, I use a Helm chart hook to save the exported graph to a persistent volume, which the serving pod then mounts to load the model for inference.
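As a minimal sketch of the export step (the kind of script the Helm hook job might run), the Optimum Python API can export and save the graph in one call. The model name and output path below are illustrative placeholders, not necessarily the ones this repository uses:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Hypothetical model and output path; adjust to your model repository layout.
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
OUTPUT_DIR = "/models/face_hugger/1"  # Triton expects <model_name>/<version>/

# export=True tells Optimum to run the PyTorch -> ONNX conversion on load.
model = ORTModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)
model.save_pretrained(OUTPUT_DIR)  # writes model.onnx plus config files

# Save the tokenizer alongside so clients can reproduce the preprocessing.
AutoTokenizer.from_pretrained(MODEL_ID).save_pretrained(OUTPUT_DIR)
```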

## TensorRT Conversion (In Progress)

Since TensorRT generally outperforms even a fully optimized ONNX graph (optimization level 99) on GPU, the plan is to convert the ONNX graph to a TensorRT engine and host that instead.
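One way to do this conversion is with the TensorRT Python API, which parses the ONNX file and builds a serialized engine. This is a sketch under assumed paths, input names, and shapes (transformer inputs such as `input_ids` need an optimization profile because batch size and sequence length are dynamic); the eventual conversion step in this repository may differ:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Hypothetical paths; Triton loads TensorRT engines as model.plan.
with open("/models/face_hugger/1/model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

# Dynamic-shape profile: (min, opt, max) for (batch, sequence length).
# Input names are assumptions based on a typical BERT-style export.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, (1, 1), (1, 128), (8, 256))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("/models/face_hugger/1/model.plan", "wb") as f:
    f.write(engine)
```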

## Triton

The exported graph is then hosted for inference on a Triton Inference Server deployed in Kubernetes. The server is exposed through a Kubernetes LoadBalancer Service, so requests from outside the cluster can reach the model in Triton.
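For reference, a client outside the cluster could query the model over Triton's HTTP endpoint with the `tritonclient` package. The URL, model name, and tensor names here are placeholders, assuming a BERT-style classification model:

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

# Placeholder endpoint and names; substitute the LoadBalancer address,
# model name, and tensor names from your Triton model configuration.
client = httpclient.InferenceServerClient(url="triton.example.com:8000")
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
enc = tokenizer("Triton makes model serving straightforward.", return_tensors="np")

inputs = []
for name in ("input_ids", "attention_mask"):
    tensor = httpclient.InferInput(name, list(enc[name].shape), "INT64")
    tensor.set_data_from_numpy(enc[name].astype(np.int64))
    inputs.append(tensor)

result = client.infer(model_name="face_hugger", inputs=inputs)
print(result.as_numpy("logits"))  # output name assumed from the ONNX export
```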

I wanted to use Triton to see whether it is a better MLOps solution for inference, and to learn more about TensorRT along the way.