🕹️ Benchmarks

A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models

Check out our release blog to learn more.

Table of Contents
  1. Quick glance towards performance metrics
  2. ML Engines
  3. Why Benchmarks
  4. Usage and workflow
  5. Contribute

🥽 Quick glance towards performance benchmarks

Take a first glance at the performance metrics of Mistral 7B v0.1 Instruct and Llama 2 7B Chat across different precisions and inference engines. Here is the run specification that generated these benchmark reports.

Environment:

  • Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
  • CUDA Version: 12.1
  • Batch size: 1

Command:

./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'

Mistral 7B v0.1 Instruct

Performance Metrics: (unit: Tokens/second)

Engine float32 float16 int8 int4
transformers (pytorch) 39.61 ± 0.65 37.05 ± 0.49 5.08 ± 0.01 19.58 ± 0.38
AutoAWQ - - - 63.12 ± 2.19
AutoGPTQ 39.11 ± 0.42 42.94 ± 0.80
DeepSpeed 79.88 ± 0.32
ctransformers - - 86.14 ± 1.40 87.22 ± 1.54
llama.cpp - - 88.27 ± 0.72 95.33 ± 5.54
ctranslate 43.17 ± 2.97 68.03 ± 0.27 45.14 ± 0.24 -
PyTorch Lightning 32.79 ± 2.74 43.01 ± 2.90 7.75 ± 0.12 -
Nvidia TensorRT-LLM 117.04 ± 2.16 206.59 ± 6.93 390.49 ± 4.86 427.40 ± 4.84
vllm 84.91 ± 0.27 84.89 ± 0.28 - 106.03 ± 0.53
exllamav2 - - 114.81 ± 1.47 126.29 ± 3.05
onnx 15.75 ± 0.15 22.39 ± 0.14 - -
Optimum Nvidia 50.77 ± 0.85 50.91 ± 0.19 - -

Performance Metrics: GPU Memory Consumption (unit: MB)

Engine float32 float16 int8 int4
transformers (pytorch) 31071.4 15976.1 10963.91 5681.18
AutoGPTQ 13400.80 6633.29
AutoAWQ - - - 6572.47
DeepSpeed 80097.34
ctransformers - - 10255.07 6966.74
llama.cpp - - 9141.49 5880.41
ctranslate 32602.32 17523.8 10074.72 -
PyTorch Lightning 48783.95 18738.05 10680.32 -
Nvidia TensorRT-LLM 79536.59 78341.21 77689.0 77311.51
vllm 73568.09 73790.39 - 74016.88
exllamav2 - - 21483.23 9460.25
onnx 33629.93 19537.07 - -
Optimum Nvidia 79563.85 79496.74 - -

*(Data updated: 30th April 2024)

Llama 2 7B Chat

Performance Metrics: (unit: Tokens / second)

Engine float32 float16 int8 int4
transformers (pytorch) 36.65 ± 0.61 34.20 ± 0.51 6.91 ± 0.14 17.83 ± 0.40
AutoAWQ - - - 63.59 ± 1.86
AutoGPTQ 34.36 ± 0.51 36.63 ± 0.61
DeepSpeed 84.60 ± 0.25
ctransformers - - 85.50 ± 1.00 86.66 ± 1.06
llama.cpp - - 89.90 ± 2.26 97.35 ± 4.71
ctranslate 46.26 ± 1.59 79.41 ± 0.37 48.20 ± 0.14 -
PyTorch Lightning 38.01 ± 0.09 48.09 ± 1.12 10.68 ± 0.43 -
Nvidia TensorRT-LLM 104.07 ± 1.61 191.00 ± 4.60 316.77 ± 2.14 358.49 ± 2.38
vllm 89.40 ± 0.22 89.43 ± 0.19 - 115.52 ± 0.49
exllamav2 - - 125.58 ± 1.23 159.68 ± 1.85
onnx 14.28 ± 0.12 19.42 ± 0.08 - -
Optimum Nvidia 53.64 ± 0.78 53.82 ± 0.11 - -

Performance Metrics: GPU Memory Consumption (unit: MB)

Engine float32 float16 int8 int4
transformers (pytorch) 29114.76 14931.72 8596.23 5643.44
AutoAWQ - - - 7149.19
AutoGPTQ 10718.54 5706.35
DeepSpeed 80105.13
ctransformers - - 9774.83 6889.14
llama.cpp - - 8797.55 5783.95
ctranslate 29951.52 16282.29 9470.74 -
PyTorch Lightning 42748.35 14736.69 8028.16 -
Nvidia TensorRT-LLM 79421.24 78295.07 77642.86 77256.98
vllm 77928.07 77928.07 - 77768.69
exllamav2 - - 16582.18 7201.62
onnx 33072.09 19180.55 - -
Optimum Nvidia 79429.63 79295.41 - -

*(Data updated: 30th April 2024)

Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct. It runs only on an A100 80GB GPU, because our primary focus is enterprise use. Our previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal. You can find those results in the archive.md file. Please note that those numbers may be outdated, since all the engines are actively maintained and continuously improved.
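Each throughput cell above is written as a value ± a spread over the repetitions. As a minimal sketch, assuming the two numbers are the mean and the population standard deviation of the per-repetition tokens/second (the README does not spell this out), such a summary could be computed as follows. The helper name `summarize_tps` and the sample readings are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one tokens/second value per line on stdin and
# prints "mean ± std". Using the population standard deviation is an assumption.
summarize_tps() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }   # skip blank lines
    END {
      if (n == 0) { print "no data"; exit 1 }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0                    # guard against rounding noise
      printf "%.2f ± %.2f\n", mean, sqrt(var)
    }'
}

# Made-up readings from four repetitions, for illustration only.
printf '%s\n' 40.0 39.0 41.0 40.0 | summarize_tps   # prints: 40.00 ± 0.71
```

A small spread relative to the mean suggests the repetitions were stable; a large one (as in some int4 cells) is a hint to look at the individual runs.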

🛳 ML Engines

In the current market, there are several ML Engines. Here is a quick glance at all the engines used for the benchmark and a quick summary of their support matrix. You can find the details about the nuances here.

Engine Float32 Float16 Int8 Int4 CUDA ROCM Mac M1/M2 Training
candle ⚠️ ✅ ⚠️ ⚠️ ✅ ❌ 🚧 ❌
llama.cpp ❌ ❌ ✅ ✅ ✅ 🚧 🚧 ❌
ctranslate ✅ ✅ ✅ ❌ ✅ ❌ 🚧 ❌
onnx ✅ ✅ ❌ ❌ ✅ ⚠️ ❌ ❌
transformers (pytorch) ✅ ✅ ✅ ✅ ✅ 🚧 ✅ ✅
vllm ✅ ✅ ❌ ✅ ✅ 🚧 ❌ ❌
exllamav2 ❌ ❌ ✅ ✅ ✅ 🚧 ❌ ❌
ctransformers ❌ ❌ ✅ ✅ ✅ 🚧 🚧 ❌
AutoGPTQ ✅ ✅ ⚠️ ⚠️ ✅ ❌ ❌ ❌
AutoAWQ ❌ ❌ ❌ ✅ ✅ ❌ ❌ ❌
DeepSpeed-MII ❌ ✅ ❌ ❌ ✅ ❌ ❌ ⚠️
PyTorch Lightning ✅ ✅ ✅ ✅ ✅ ⚠️ ⚠️ ✅
Optimum Nvidia ✅ ✅ ❌ ❌ ✅ ❌ ❌ ❌
Nvidia TensorRT-LLM ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌

Legend:

  • ✅ Supported
  • ❌ Not Supported
  • ⚠️ There is a catch related to this
  • 🚧 It is supported but not implemented in this current version

You can check out the nuances related to ⚠️ and 🚧 in detail here.

🤔 Why Benchmarks

A common question is what benefits you can expect from this repository. Here are some quick pointers:

  1. With several engines and precisions to choose from, it is often unclear which to use for an LLM inference workflow, especially under compute constraints or other requirements. This repository gives you a quick idea of what to use based on your requirements.

  2. There is sometimes a quality vs. speed tradeoff between engines and precisions. This repository keeps track of those tradeoffs so that you can weigh them against your priorities.

  3. A fully reproducible and hackable script. The latest benchmarks follow a number of best practices so that they run robustly on GPU devices. You can also reference and extend the implementations to build your own workflows.
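To illustrate the first point, here is a minimal sketch of picking an engine under a GPU memory budget. The data lines are the Mistral 7B v0.1 Instruct int4 throughput and memory values copied from the tables in this README; rows whose int4 cell is empty or cannot be placed unambiguously (AutoGPTQ, DeepSpeed) are left out. The helper name `fastest_within_budget` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Mistral 7B v0.1 Instruct, int4: engine,tokens_per_second,gpu_memory_mb
# (values copied from the README tables)
int4_results='transformers (pytorch),19.58,5681.18
AutoAWQ,63.12,6572.47
ctransformers,87.22,6966.74
llama.cpp,95.33,5880.41
Nvidia TensorRT-LLM,427.40,77311.51
vllm,106.03,74016.88
exllamav2,126.29,9460.25'

# Hypothetical helper: print the fastest engine whose measured GPU memory
# consumption fits within the given budget (in MB).
fastest_within_budget() {
  local budget_mb="$1"
  printf '%s\n' "$int4_results" |
    awk -F, -v budget="$budget_mb" '
      BEGIN { best = -1; name = "" }
      $3 + 0 <= budget + 0 && $2 + 0 > best { best = $2 + 0; name = $1; mem = $3 + 0 }
      END {
        if (name == "") { print "no engine fits"; exit 1 }
        printf "%s: %.2f tokens/s, %.2f MB\n", name, best, mem
      }'
}

fastest_within_budget 16000   # prints: exllamav2: 126.29 tokens/s, 9460.25 MB
```

With a tighter budget such as `fastest_within_budget 6000`, the same data selects llama.cpp instead, which shows how the choice shifts with the constraint.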

🚀 Usage and workflow

Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Each benchmark runs an inference engine that provides some form of optimization, either through quantization alone or through device-specific optimizations such as custom CUDA kernels.

To get started, you need to download the models first. The following command downloads both models, Llama2 7B Chat and Mistral-7B v0.1 Instruct:

./download.sh

Please note that to use the Llama2-7B Chat weights, we assume you have already agreed to the required terms and conditions and have been verified to download them.

A Benchmark workflow

When you run a benchmark, the following set of events occurs:

  • Automatically setting up the environments and installing the required dependencies.

  • Converting the models to some specific format (if required) and saving them.

  • Running the benchmarks and storing the results inside the logs folder. Each log folder has the following structure:

      • performance.log: Tracks the model run performance. You can see the tokens/second and memory consumption (MB) here.

      • quality.md: An automatically generated readme file containing qualitative comparisons of the different precisions of an engine. We take 5 prompts and run them for each precision that engine supports, then put the results side by side. Our ground truth is the output of the Hugging Face PyTorch model with raw float32 weights.

      • quality.json: Same as quality.md, but in a rawer format.

Inside each benchmark folder, you will also see a readme.md file which contains all the information and the qualitative comparison of the engine. For example: bench_tensorrtllm.
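After a run, it can be handy to confirm that a log folder contains the three files described above. The following is a minimal sketch of such a check. The helper name `check_log_dir` is hypothetical, and the path you pass is whatever log folder your run produced; the exact folder naming is not assumed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the expected files exist in a log folder.
# Returns 0 if all three are present, 1 otherwise.
check_log_dir() {
  local dir="$1" missing=0 f
  for f in performance.log quality.md quality.json; do
    if [ -f "$dir/$f" ]; then
      echo "found   $f"
    else
      echo "missing $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the path is a placeholder):
#   check_log_dir path/to/log/folder
```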

Running a Benchmark

Here is how we run benchmarks for an inference engine.

# --prompt:      the prompt string
# --max_tokens:  maximum number of tokens to output
# --repetitions: number of repetitions to run for the prompt
# --device:      the device to benchmark on
# --model_name:  'llama' for Llama2 or 'mistral' for Mistral-7B-v0.1
./bench_<engine-name>/bench.sh \
 --prompt <value> \
 --max_tokens <value> \
 --repetitions <value> \
 --device <cpu/cuda/metal> \
 --model_name <name-of-the-model>

Here is an example. Let's say we want to benchmark Nvidia TensorRT-LLM. Here is what the command looks like:

./bench_tensorrtllm/bench.sh -d cuda -n llama -r 10
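To benchmark both supported models on the same engine, the command above can be wrapped in a loop. This is a minimal sketch: the helper name `run_both_models` and the `DRY_RUN` switch are hypothetical conveniences, not part of the repository. By default it only prints the commands, so you can inspect them before running anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (or, with DRY_RUN=0, execute) the benchmark
# command for both supported models on one engine folder.
run_both_models() {
  local engine="$1" repetitions="${2:-10}" model
  for model in llama mistral; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "./bench_${engine}/bench.sh -d cuda -n $model -r $repetitions"
    else
      "./bench_${engine}/bench.sh" -d cuda -n "$model" -r "$repetitions"
    fi
  done
}

run_both_models tensorrtllm 10
```

Setting `DRY_RUN=0` before the call runs the two benchmarks one after the other from the repository root.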

Here is more detail on each command-line argument.

 -p, --prompt Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')
 -r, --repetitions Number of repetitions for benchmarks (default: 10)
 -m, --max_tokens Maximum number of tokens for benchmarks (default: 512)
 -d, --device Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')
 -n, --model_name The name of the model to benchmark (possible values: 'llama' for using Llama2, 'mistral' for using Mistral 7B v0.1)
 -lf, --log_file Logging file name.
 -h, --help Show this help message

🤝 Contribute

We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:

Creating a New Benchmark

1. Create a New Folder

Start by creating a new folder for your benchmark. Name it bench_{new_bench_name} for consistency.

mkdir bench_{new_bench_name}

2. Folder Structure

Inside the new benchmark folder, include the following structure:

bench_{new_bench_name}
├── bench.sh # Benchmark script for setup and execution
├── requirements.txt # Dependencies required for the benchmark
└── ... # Any additional files needed for the benchmark
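Steps 1 and 2 can be combined into a small scaffolding helper. This is a minimal sketch: the function name `scaffold_bench` is hypothetical, and the placeholder files it writes are empty stubs for you to fill in, not the repository's actual templates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create bench_<name>/ with an empty bench.sh stub and
# an empty requirements.txt. Existing files are left untouched.
scaffold_bench() {
  local name="${1:-}" dir
  if [ -z "$name" ]; then
    echo "usage: scaffold_bench <new_bench_name>" >&2
    return 1
  fi
  dir="bench_${name}"
  mkdir -p "$dir" || return 1
  [ -e "$dir/bench.sh" ] || printf '#!/bin/bash\n' > "$dir/bench.sh"
  [ -e "$dir/requirements.txt" ] || : > "$dir/requirements.txt"
  chmod +x "$dir/bench.sh"
  echo "created $dir"
}

# Usage, from the repository root (the name is a placeholder):
#   scaffold_bench my_engine
```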

3. Benchmark Script (bench.sh):

The bench.sh script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters listed under Running a Benchmark.
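As a starting point for a new bench.sh, here is a minimal sketch of argument handling for the documented flags. The flag names and defaults come from this README; the parsing style, the function name `parse_bench_args`, and the variable names are assumptions and may differ from how the existing benchmark scripts do it.

```shell
#!/usr/bin/env bash
# Sketch only: parses the documented flags into shell variables.
parse_bench_args() {
  PROMPT='Write an essay about the transformer model architecture'
  REPETITIONS=10
  MAX_TOKENS=512
  DEVICE=cuda
  MODEL_NAME=''
  LOG_FILE=''
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -h|--help)
        echo "usage: bench.sh [-p prompt] [-r n] [-m n] [-d device] [-n model] [-lf file]"
        return 0 ;;
    esac
    # Every remaining flag takes a value, so make sure one is present.
    if [ "$#" -lt 2 ]; then
      echo "missing value for $1" >&2
      return 1
    fi
    case "$1" in
      -p|--prompt)      PROMPT="$2" ;;
      -r|--repetitions) REPETITIONS="$2" ;;
      -m|--max_tokens)  MAX_TOKENS="$2" ;;
      -d|--device)      DEVICE="$2" ;;
      -n|--model_name)  MODEL_NAME="$2" ;;
      -lf|--log_file)   LOG_FILE="$2" ;;
      *) echo "unknown option: $1" >&2; return 1 ;;
    esac
    shift 2
  done
  # Accept either capitalization of the CPU value.
  case "$DEVICE" in
    cuda|metal|cpu|CPU) ;;
    *) echo "unsupported device: $DEVICE" >&2; return 1 ;;
  esac
  case "$MODEL_NAME" in
    llama|mistral) ;;
    *) echo "model_name must be 'llama' or 'mistral'" >&2; return 1 ;;
  esac
}

# Usage inside a bench.sh: parse_bench_args "$@" || exit 1
```

Validating the device and model name up front means the script can fail before any environment setup or model conversion starts.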

Pre-commit Hooks

We use pre-commit hooks to maintain code quality and consistency.

1. Install Pre-commit: Ensure you have pre-commit installed

pip install pre-commit

2. Install Hooks: Run the following command to install the pre-commit hooks

pre-commit install

The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards.