What are the advantages of running a model with Triton Inference Server compared to running directly using the model's framework API?
When using Triton Inference Server, the inference result will be the same as when using the model's framework directly. However, with Triton you get benefits such as concurrent model execution (the ability to run multiple models at the same time on the same GPU) and dynamic batching for better throughput. You can also replace or upgrade models while Triton and the client application are running. Another benefit is that Triton can be deployed as a Docker container anywhere, on premises and on public clouds. Triton Inference Server also supports multiple frameworks such as TensorRT, TensorFlow, PyTorch, and ONNX on both GPUs and CPUs, leading to a streamlined deployment.
Yes, Triton can run on systems without a GPU. The QuickStart guide describes how to run Triton on a CPU-only system.
Yes, Triton can also be used outside of Docker. Triton Inference Server can be built from source on your "bare metal" system.
We provide C++ and Python client libraries to make it easy for users to write client applications that communicate with Triton. We chose those languages because they were likely to be popular and performant in the ML inference space, but we may add other languages in the future if there is a need.
We provide the GRPC API as a way to generate your own client library for a large number of languages. By following the official GRPC documentation and using grpc_service.proto you can generate language bindings for all the languages supported by GRPC. We provide two examples of this for Go and Python.
In general the client libraries (and client examples) are meant to be just that, examples. We feel the client libraries are well written and well tested, but they are not meant to serve every possible use case. In some cases you may want to develop your own customized library to suit your specific needs.
In an AWS environment, the Triton Inference Server Docker container can run on CPU-only instances or GPU compute instances. Triton can run directly on the compute instance or inside Elastic Kubernetes Service (EKS). In addition, other AWS services such as Elastic Load Balancer (ELB) can be used to balance traffic among multiple Triton instances. Elastic Block Store (EBS) or S3 can be used to store the deep-learning models loaded by the inference server.
The Triton Inference Server exposes performance information in two ways: by Prometheus metrics and by the statistics available through the HTTP/REST, GRPC, and C APIs.
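As a rough illustration of the Prometheus route, the sketch below parses metrics text in the Prometheus exposition format and sums one counter per model. The sample text is made up for illustration, and the metric name `nv_inference_request_success` and its labels are assumed to match what your Triton version reports. In a default setup the text would be fetched from Triton's metrics endpoint (assumed here to be port 8002, path `/metrics`) rather than taken from a hard-coded string.

```python
import re
from collections import defaultdict

# Hypothetical sample of Prometheus text output; real values would come
# from Triton's metrics endpoint, not from this string.
SAMPLE_METRICS = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet50",version="1"} 120
nv_inference_request_success{model="resnet50",version="2"} 30
nv_inference_request_success{model="bert",version="1"} 45
"""

# One sample line: metric name, optional {labels}, then the value.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_metric_by_model(text, metric_name):
    """Sum the values of `metric_name` across versions, grouped by the `model` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get("model", "<unknown>")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    print(sum_metric_by_model(SAMPLE_METRICS, "nv_inference_request_success"))
```

Sampling the same counter at two points in time and dividing the difference by the elapsed seconds gives an approximate request rate per model.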
A client application, perf_analyzer, allows you to measure the performance of an individual model using a synthetic load. The perf_analyzer application is designed to show you the tradeoff of latency vs. throughput.
Triton Inference Server has several features designed to increase GPU utilization:
- Triton can simultaneously perform inference for multiple models (using either the same or different frameworks) using the same GPU.
- Triton can increase inference throughput by using multiple instances of the same model to handle multiple simultaneous inference requests to that model. Triton chooses reasonable defaults but you can also control the exact level of concurrency on a model-by-model basis.
- Triton can batch together multiple inference requests into a single inference execution. Typically, batching inference requests leads to much higher throughput with only a relatively small increase in latency.
As a general rule, batching is the most beneficial way to increase GPU utilization. So you should always try enabling the dynamic batcher with your models. Using multiple instances of a model can also provide some benefit but is typically most useful for models that have small compute requirements. Most models will benefit from using two instances but more than that is often not useful.
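A minimal sketch of how the two settings discussed above might appear in a model configuration. The helper `render_model_config` is hypothetical and not part of Triton; it only renders text using the `max_batch_size`, `dynamic_batching`, and `instance_group` fields of a `config.pbtxt` file. The model name and the numeric values are placeholders, and a complete configuration typically needs further fields (backend, inputs, outputs) that are omitted here.

```python
def render_model_config(name, max_batch_size=8, instance_count=2,
                        max_queue_delay_us=100):
    """Render a config.pbtxt fragment with dynamic batching and N GPU instances.

    All default values are illustrative assumptions, not recommendations.
    """
    if max_batch_size < 1:
        raise ValueError("dynamic batching needs max_batch_size >= 1")
    if instance_count < 1:
        raise ValueError("instance_count must be >= 1")
    return "\n".join([
        f'name: "{name}"',
        f"max_batch_size: {max_batch_size}",
        "dynamic_batching {",
        # How long a request may wait in the queue for a batch to form.
        f"  max_queue_delay_microseconds: {max_queue_delay_us}",
        "}",
        "instance_group [",
        "  {",
        f"    count: {instance_count}",
        "    kind: KIND_GPU",
        "  }",
        "]",
    ])


if __name__ == "__main__":
    print(render_model_config("my_model"))
```

Starting with the dynamic batcher enabled and two instances, then measuring with perf_analyzer before changing either value, follows the rule of thumb given above.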
If I have a server with multiple GPUs should I use one Triton Inference Server to manage all GPUs or should I use multiple inference servers, one for each GPU?
Triton Inference Server will take advantage of all GPUs that it has access to on the server. You can limit the GPUs available to Triton by using the CUDA_VISIBLE_DEVICES environment variable (or, with Docker, you can also use NVIDIA_VISIBLE_DEVICES or the --gpus flag when launching the container). When using multiple GPUs, Triton will distribute inference requests across the GPUs to keep them all equally utilized. You can also control more explicitly which models are running on which GPUs.
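As a sketch of the environment-variable approach, the snippet below builds the command and environment for launching Triton with a restricted set of GPUs. The helper `triton_launch_spec` and the `/models` path are hypothetical; the sketch assumes the `tritonserver` binary is on the PATH and is started with its `--model-repository` option. Nothing is launched unless you pass the result to `subprocess` yourself.

```python
import os


def triton_launch_spec(gpu_ids, model_repository="/models", base_env=None):
    """Return (cmd, env) for starting Triton so it sees only the given GPUs.

    `model_repository` is a placeholder path; `base_env` defaults to a copy
    of the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA enumerates only the listed devices, so Triton cannot use the others.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    cmd = ["tritonserver", f"--model-repository={model_repository}"]
    return cmd, env


if __name__ == "__main__":
    cmd, env = triton_launch_spec([0, 2])
    print(cmd, env["CUDA_VISIBLE_DEVICES"])
    # To actually start the server (assumes tritonserver is installed):
    # import subprocess; subprocess.Popen(cmd, env=env)
```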
In some deployment and orchestration environments (for example, Kubernetes) it may be more desirable to partition a single multi-GPU server into multiple nodes, each with one GPU. In this case the orchestration environment will run a different Triton for each GPU and a load balancer will be used to divide inference requests across the available Triton instances.
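To make the one-Triton-per-GPU arrangement concrete, here is a toy round-robin selector over a set of Triton endpoints. It is only an illustration of how requests might be divided: the class name and the endpoint addresses are hypothetical, and a real deployment would normally rely on the orchestration environment's own load balancer rather than client-side code like this.

```python
import itertools


class RoundRobinBalancer:
    """Cycle through a fixed list of endpoints, one per request."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        """Return the endpoint that should receive the next request."""
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical addresses: one Triton instance per GPU.
    balancer = RoundRobinBalancer(["triton-gpu0:8000", "triton-gpu1:8000"])
    for _ in range(4):
        print(balancer.next_endpoint())
```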
The NGC build is a Release build and does not contain Debug symbols. The build.py script also defaults to a Release build. Refer to the instructions in build.md to create a Debug build of Triton. With Debug symbols available, the gdb backtrace for the segfault is much more useful for finding the cause of the segmentation fault.
When opening a GitHub issue for the segfault with Triton, please include the backtrace to better help us resolve the problem.