
README.md: Add 3rd Party Inference Speed Dashboard #2244

Open — wants to merge 1 commit into `main`
Commits on Sep 22, 2024

  1. README.md: Add 3rd Party Inference Speed Dashboard

    Hi TensorRT-LLM team,
    
    As an NVIDIA Inception startup, I would like to add a community resource link to an inference speed dashboard.
    
    The inference speed dashboard includes:
    
    - Comprehensive benchmarks across several optimization techniques: FP16, FP8, INT8-weight-only, and INT4-weight-only.
    - Comprehensive benchmarks across model architectures: Llama3-8b, Gemma2-27b, RecurrentGemma-9b, and Mamba2-2.7b.
    - Comprehensive coverage of batch sizes, input lengths, and output lengths: batch sizes range from 1 to 32 (128 for Mamba), and input and output lengths range from 32 to 4096.
    
    Additionally, I plan to publish the source code (website) and benchmark script by the end of this year.
    matichon-vultureprime authored Sep 22, 2024
    Commit d7bbc7b
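
The benchmark dimensions described in the commit message can be pictured as a configuration grid. Below is a minimal Python sketch of that grid; it assumes power-of-two batch sizes and a sampled set of sequence lengths between 32 and 4096, since the PR text gives only the ranges, not the exact values used by the dashboard:

```python
# Hypothetical enumeration of the benchmark grid described in the PR text.
# The precision/model names come from the commit message; the batch-size
# progression and sequence-length sampling are assumptions for illustration.
from itertools import product

PRECISIONS = ["FP16", "FP8", "INT8-weight-only", "INT4-weight-only"]
MODELS = ["Llama3-8b", "Gemma2-27b", "RecurrentGemma-9b", "Mamba2-2.7b"]
SEQ_LENS = [32, 128, 512, 1024, 2048, 4096]  # assumed sampling of 32..4096


def batch_sizes(model: str) -> list[int]:
    """Power-of-two batch sizes up to 32, or up to 128 for Mamba models."""
    limit = 128 if "Mamba" in model else 32
    sizes, bs = [], 1
    while bs <= limit:
        sizes.append(bs)
        bs *= 2
    return sizes


def benchmark_grid():
    """Yield every (precision, model, batch, input_len, output_len) combo."""
    for precision, model in product(PRECISIONS, MODELS):
        for batch in batch_sizes(model):
            for in_len, out_len in product(SEQ_LENS, SEQ_LENS):
                yield precision, model, batch, in_len, out_len


configs = list(benchmark_grid())
```

Even with this coarse sampling, the grid yields a few thousand configurations, which is why a dashboard is more practical than a static results table.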