- Benchmarking
- Overview
- Before you begin
- Use the starter kit
- Batching vs non-batching benchmarking
- Third-party tools and data sources
This AI Starter Kit evaluates the performance of different LLMs hosted in SambaStudio or SambaNova Cloud. It allows users to configure various LLMs with diverse parameters, enabling experiments that not only generate different outputs but also capture performance metrics simultaneously. The Kit includes:
- Configurable SambaStudio and SambaNova Cloud connectors. The connectors generate answers from a deployed model.
- An app with three functionalities:
  - A synthetic performance evaluation process with configurable options that users can use to obtain and compare different metrics over synthetic data generated by the app.
  - A custom performance evaluation process with configurable options that users can use to obtain and compare different metrics over their own custom prompts.
  - A chat interface with configurable options that users can set to interact with the model and get performance metrics.
- A couple of bash scripts that are the core of the performance evaluations and provide more flexibility to users.
This sample is ready-to-use. We provide:
- Instructions for setup with SambaStudio or SambaNova Cloud
- Instructions for running the model as-is
- Instructions for customizing the model
To perform this setup, you must be a SambaNova customer with a SambaStudio account or have a SambaNova Cloud API key (more details in the following sections). You also have to set up your environment before you can run or customize the starter kit.
These steps assume a Mac/Linux/Unix shell environment. If using Windows, you will need to adjust some commands for navigating folders, activating virtual environments, etc.
Clone the starter kit repo.
git clone https://github.com/sambanova/ai-starter-kit.git
The next step is to set up your environment variables to use one of the models available from SambaNova. If you're a current SambaNova customer, you can deploy your models with SambaStudio. If you are not a SambaNova customer, you can self-service provision API endpoints using SambaNova Cloud.
- If using SambaNova Cloud: please follow the instructions here for setting up your environment variables.
- If using SambaStudio: please follow the instructions here for setting up your endpoint and environment variables.
- (Recommended) Create a virtual environment and activate it (Python version 3.11 recommended):

  python<version> -m venv <virtual-environment-name>
  source <virtual-environment-name>/bin/activate
- Install the required dependencies:

  cd benchmarking  # if not already in the benchmarking folder
  pip install -r requirements.txt
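Once the environment variables are set and the dependencies are installed, you can optionally sanity-check that they are visible to Python. The sketch below is only an illustrative example: the exact variable names depend on the setup option you followed above, so substitute the names from the linked instructions.

```python
# Optional sanity check: confirm the environment variables are visible to Python.
# The variable names below are assumptions; use the names from the linked setup instructions.
import os

from dotenv import load_dotenv  # python-dotenv is listed in requirements.txt

load_dotenv()  # loads variables from a .env file in the current directory, if one exists

for var in ("SAMBANOVA_API_KEY", "SAMBASTUDIO_URL", "SAMBASTUDIO_API_KEY"):
    print(f"{var}: {'set' if os.getenv(var) else 'missing'}")
```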
When using the benchmarking starter kit, you have two options for running the program:
- GUI Option: This option provides plots and configuration options in a web browser.
- CLI Option: This option allows you to run the program from the command line and provides more flexibility.
The GUI for this starter kit uses Streamlit, a Python framework for building web applications. This method is useful for analyzing outputs in a graphical manner since the results are shown via plots in the UI.
Ensure you are in the `benchmarking` folder and run the following command:
streamlit run streamlit/app.py --browser.gatherUsageStats false
After deploying the starter kit, you will see its user interface in your browser and can start using it. More details follow in later sections, but the general usage is described in the bullets below:
- In the left sidebar, select one of the three app functionalities (click on each section to go to the full details):
  - `Synthetic Performance Evaluation`: Evaluate the performance of the selected LLM on data generated by this benchmarking tool.
  - `Custom Performance Evaluation`: Evaluate the performance of the selected LLM on custom data specified by you.
  - `Performance on Chat`: Evaluate the performance of the selected LLM in a chat interface.
- If the deployed LLM is a Composition of Experts (CoE), specify the desired expert in the corresponding text box and then set the configuration parameters. If the deployed LLM is not a CoE, simply set the configuration parameters.
- If the deployed LLM is a SambaNova Cloud endpoint, choose the `sncloud` option in the API type dropdown.
- After pressing the `Run` button, the program performs inference on the data and produces results in the middle of the screen. In the case of the `Performance on Chat` functionality, users can interact with the LLM in a multi-turn chat interface.
There are three options in the left sidebar for running the benchmarking tool. Pick the walkthrough that best suits your needs.
This option allows you to evaluate the performance of the selected LLM on synthetic data generated by this benchmarking tool.
- Enter a model name and choose the right API type.
  Note: Currently we have specific prompting support for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not be close to the ones specified.
  - If the model specified is a CoE, specify the desired expert in the Model Name text box.
    - The model name should mirror the name shown in SambaStudio, preceded by `COE/`.
      - For example, the Samba-1 Turbo Llama-3-8B expert in SambaStudio is titled `Meta-Llama-3-8B-Instruct`, so the model name would be `COE/Meta-Llama-3-8B-Instruct`.
  - If the model is a standalone model, enter the full model name shown on the model card, e.g. `Llama-2-70b-chat-hf`.
  - If the model is a SambaNova Cloud model, make sure to use the exact model name listed by SambaNova Cloud, and choose `sncloud` in the API type dropdown.
    - For example, the Llama-3-8B model in SambaNova Cloud is titled `llama3-8b`, so that will be the model name.
- Set the configuration parameters:
  - Number of input tokens: The number of input tokens in the generated prompt. Default: 1000.
  - Number of output tokens: The number of output tokens the LLM can generate. Default: 1000.
  - Number of total requests: The number of requests sent. Default: 32. Note: the program can time out before all requests are sent; configure the Timeout parameter accordingly (see the back-of-envelope check after this list).
  - Number of concurrent workers: The number of concurrent workers. Default: 1. For testing batching-enabled models, this value should be greater than the largest batch size you need to test. The typical supported batch sizes are 1, 4, 8, and 16.
  - Timeout: The number of seconds before the program times out. Default: 600.
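As a rough sanity check for the Timeout value, you can estimate the wall-clock time of a run from the number of requests, the number of workers, and an assumed per-request latency. This is a back-of-envelope heuristic, not logic taken from the kit:

```python
# Back-of-envelope estimate: with num_workers requests in flight at a time, a run takes
# roughly (num_requests / num_workers) * per-request latency. The latency value below is
# an assumed figure for illustration; measure your own endpoint for a realistic number.
num_requests = 32
num_workers = 1
assumed_request_latency_s = 15.0

estimated_run_time_s = (num_requests / num_workers) * assumed_request_latency_s
print(f"Estimated run time: {estimated_run_time_s:.0f} s")  # 480 s, so the 600 s default timeout leaves headroom
```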
- Run the performance evaluation
  - Click the `Run!` button. This starts the program, and a spinning indicator in the UI confirms that the program is executing.
  - Depending on the parameter configuration, it should take between 1 and 10 minutes. Diagnostic/progress information is displayed in the terminal shell.
- Analyze results
  Note: Not all model endpoints currently support the calculation of server-side statistics. Depending on your choice of endpoint, you may see both client- and server-side information, or only the client-side information.
Bar plots
The plots compare (if available) the following:
- Server metrics: These are performance metrics from the API.
- Client metrics: These are performance metrics computed on the client side. Additionally, if the endpoint supports dynamic batching, the plots will show per-batch metrics.
The results are composed of four bar plots:
- `ttft_s` bar plot: This plot shows the median Time to First Token (TTFT) as the height of each colored bar, with a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client side due to the added latency of the API call to the client computer.
- `end_to_end_latency_s` bar plot: This plot shows the median end-to-end latency as the height of each colored bar, with a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is also mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client side due to the added latency of the API call to the client computer.
- `output_token_per_s_per_request` bar plot: This plot shows the median number of output tokens per second per request as the height of each colored bar, with a small black distribution bar. One should see good agreement between the client- and server-side metrics. For endpoints that support dynamic batching, one should see a decreasing trend in this metric as the batch size increases.
- `throughput_token_per_s` bar plot: This plot shows the median total tokens generated per second per batch as the height of each colored bar, with a small black distribution bar. One should see good agreement between the client- and server-side metrics. This metric represents the total number of tokens generated per second, which is the same as the previous metric for batch size = 1. However, for batch size > 1, it is estimated as the average of `output_token_per_s_per_request * batch_size_used` for each batch, to account for more tokens being generated when concurrent requests are served in batch mode.
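For intuition, here is a minimal sketch of how the client-side versions of these four metrics could be computed from per-request measurements. The field names and values are illustrative placeholders, not the kit's actual output schema:

```python
# Sketch: aggregating client-side metrics from per-request measurements.
# The record fields and values below are illustrative placeholders.
from statistics import mean, median

requests = [
    {"ttft_s": 0.21, "end_to_end_latency_s": 3.4, "num_output_tokens": 1000, "batch_size_used": 4},
    {"ttft_s": 0.26, "end_to_end_latency_s": 3.7, "num_output_tokens": 1000, "batch_size_used": 4},
    {"ttft_s": 0.19, "end_to_end_latency_s": 3.2, "num_output_tokens": 1000, "batch_size_used": 4},
]

ttft_s = median(r["ttft_s"] for r in requests)
end_to_end_latency_s = median(r["end_to_end_latency_s"] for r in requests)

# Per-request generation speed (one common definition: output tokens over end-to-end latency).
per_request_speed = [r["num_output_tokens"] / r["end_to_end_latency_s"] for r in requests]
output_token_per_s_per_request = median(per_request_speed)

# Batch throughput: per-request speed scaled by the batch size each request ran in, then
# averaged. For batch size 1 this reduces to the previous metric.
throughput_token_per_s = mean(
    speed * r["batch_size_used"] for r, speed in zip(requests, per_request_speed)
)

print(ttft_s, end_to_end_latency_s, output_token_per_s_per_request, throughput_token_per_s)
```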
This option allows you to evaluate the performance of the selected LLM on your own custom dataset. The interface should look like this:
- Prep your dataset
  - The dataset needs to be in `.jsonl` format; this means a file with one JSON object per line (see the example sketch at the end of this walkthrough).
  - Each JSON object should have a `prompt` key whose value is the prompt you want to pass to the LLM.
    - You can use a different keyword instead of `prompt`, but it's important that all your JSON objects use the same key.
- Enter the dataset path
  - The entered path should be an absolute path to your dataset.
    - For example: `/Users/johndoe/Documents/my_dataset.jsonl`
- Enter a model name and choose the right API type.
  Note: Currently we have specific prompting support for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not be close to the ones specified.
  - If the model specified is a CoE, specify the desired expert in the Model Name text box.
    - The model name should mirror the name shown in SambaStudio, preceded by `COE/`.
      - For example, the Samba-1 Turbo Llama-3-8B expert in SambaStudio is titled `Meta-Llama-3-8B-Instruct`, so the model name would be `COE/Meta-Llama-3-8B-Instruct`.
  - If the model is a standalone model, enter the full model name shown on the model card, e.g. `Llama-2-70b-chat-hf`.
  - If the model is a SambaNova Cloud model, make sure to use the exact model name listed by SambaNova Cloud, and choose `sncloud` in the API type dropdown.
    - For example, the Llama-3-8B model in SambaNova Cloud is titled `llama3-8b`, so that will be the model name.
- Set the configuration and tuning parameters:
  - Number of concurrent workers: The number of concurrent workers. Default: 1. For testing batching-enabled models, this value should be greater than the largest batch size you need to test. The typical supported batch sizes are 1, 4, 8, and 16.
  - Timeout: The number of seconds before the program times out. Default: 600.
  - Max Output Tokens: The maximum number of tokens to generate. Default: 256.
  - Save LLM Responses: Whether to save the actual outputs of the LLM to an output file. The output file name will contain the `response_texts` suffix.
- Analyze results
  Note: Not all model endpoints currently support the calculation of server-side statistics. Depending on your choice of endpoint, you may see both client- and server-side information, or only the client-side information.
Bar plots
The plots compare (if available) the following:
- Server metrics: These are performance metrics from the API.
- Client metrics: These are performance metrics computed on the client side. Additionally, if the endpoint supports dynamic batching, the plots will show per-batch metrics.
The results are composed of four bar plots:
- `ttft_s` bar plot: This plot shows the median Time to First Token (TTFT) as the height of each colored bar, with a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client side due to the added latency of the API call to the client computer.
- `end_to_end_latency_s` bar plot: This plot shows the median end-to-end latency as the height of each colored bar, with a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is also mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client side due to the added latency of the API call to the client computer.
- `output_token_per_s_per_request` bar plot: This plot shows the median number of output tokens per second per request as the height of each colored bar, with a small black distribution bar. One should see good agreement between the client- and server-side metrics. For endpoints that support dynamic batching, one should see a decreasing trend in this metric as the batch size increases.
- `throughput_token_per_s` bar plot: This plot shows the median total tokens generated per second per batch as the height of each colored bar, with a small black distribution bar. One should see good agreement between the client- and server-side metrics. This metric represents the total number of tokens generated per second, which is the same as the previous metric for batch size = 1. However, for batch size > 1, it is estimated as the average of `output_token_per_s_per_request * batch_size_used` for each batch, to account for more tokens being generated when concurrent requests are served in batch mode.
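For step 1 of this walkthrough (Prep your dataset), here is a minimal sketch of how a valid `.jsonl` dataset could be created. The file name and prompts are placeholders; what matters is one JSON object per line, with every object using the same key:

```python
# Sketch: writing a minimal .jsonl dataset (one JSON object per line).
# The file name and prompts are placeholders; every line must use the same key (here, "prompt").
import json

rows = [
    {"prompt": "Summarize the benefits of unit testing in two sentences."},
    {"prompt": "Explain what a context window is in a large language model."},
]

with open("my_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```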
This option allows you to measure performance during a multi-turn conversation with an LLM. The interface should look like this:
- Enter a model name and choose the right API type.
  - If the model specified is a CoE, specify the desired expert in the Model Name text box.
    - The model name should mirror the name shown in SambaStudio, preceded by `COE/`.
      - For example, the Samba-1 Turbo Llama-3-8B expert in SambaStudio is titled `Meta-Llama-3-8B-Instruct`, so the model name would be `COE/Meta-Llama-3-8B-Instruct`.
  - If the model is a standalone model, enter the full model name shown on the model card, e.g. `Llama-2-70b-chat-hf`.
  - If the model is a SambaNova Cloud model, make sure to use the exact model name listed by SambaNova Cloud, and choose `sncloud` in the API type dropdown.
    - For example, the Llama-3-8B model in SambaNova Cloud is titled `llama3-8b`, so that will be the model name.
- Set the configuration parameters
  - Max tokens to generate: Maximum number of tokens to generate. Default: 256.
- Start the chat session
  After entering the model name and configuring the parameters, press `Run!` to activate the chat session.
- Ask anything and see results
  Users can ask anything and get a generated answer to their questions. In addition to the back-and-forth conversation between the user and the LLM, there is an expander option that users can click to see the following metrics for each LLM response:
- Latency (s)
- Throughput (tokens/s)
- Time to first token (s)
This method can be run from a terminal session. Users have this option if they want to experiment with values beyond the limits specified in the Streamlit app parameters. You have two options for running the program from the terminal:
- Run with a custom dataset via `run_custom_dataset.sh`
- Run with a synthetic dataset via `run_synthetic_dataset.sh`
Note: Currently we have specific prompting support for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not be close to the ones specified.
- Open the file `run_custom_dataset.sh` and configure the following parameters:
  - model-name: Model name to be used. If it's a CoE model, add the "COE/" prefix to the name. Example: "COE/Meta-Llama-3-8B-Instruct"
  - llm-api: API type to be used. If it's a SambaNova Cloud model, double-check the model name spelling, since it's shorter than the corresponding SambaStudio model names.
  - results-dir: Path to the results directory. Default: "./data/results/llmperf"
  - num-workers: Number of concurrent workers. Default: 1
  - timeout: Timeout in seconds. Default: 600
  - input-file-path: The location of the custom dataset you want to evaluate with.
  - save-llm-responses: Whether to save the actual outputs of the LLM to an output file. The output file name will contain the `response_texts` suffix.

  Note: You should leave the `--mode` parameter untouched; it indicates which dataset mode to use.
- Run the script
  - Run the following command in your terminal:
    sh run_custom_dataset.sh
  - The evaluation process will start, and a progress bar will be shown until it completes.
- Analyze results
  - Results will be saved at the location specified in `results-dir`.
  - The names of the output files depend on the input file name, mode name, and number of workers. You should see files that follow a format similar to the following: `<MODEL_NAME>_{FILE_NAME}_{NUM_CONCURRENT_WORKERS}_{MODE}`
  - For each run, two files are generated, with the `_individual_responses` and `_summary` suffixes in the output file names:
    - Individual responses file
      - This output file contains the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the server (if available) and the client side, for each individual request sent to the LLM. Users can use this data for further analysis; we provide the notebook `notebooks/analyze-token-benchmark-results.ipynb` with some charts to get started.
    - Summary file
      - This file includes various statistics, such as percentiles, mean, and standard deviation, describing the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the client side. It also provides additional data points about the overall run, such as the inputs used, the number of errors, and the number of completed requests per minute.
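If you want to start your own analysis outside the provided notebook, a minimal sketch like the following could load one of the `_individual_responses` files. The file extension and column names here are assumptions; inspect your actual output files, or the bundled notebook, for the exact schema:

```python
# Sketch: loading an *_individual_responses output file for further analysis.
# The file extension and column names are assumptions; check your generated files or
# notebooks/analyze-token-benchmark-results.ipynb for the exact schema.
import pandas as pd

# Replace with the path of one of your generated *_individual_responses files.
results_path = "./data/results/llmperf/my_run_individual_responses.json"

df = pd.read_json(results_path)
print(df.head())

# Once you know the latency column's actual name, you can compute percentiles, e.g.:
# print(df["end_to_end_latency_s"].describe(percentiles=[0.5, 0.9, 0.95]))
```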
Note: Currently we have specific prompting support for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not be close to the ones specified.
- Open the file `run_synthetic_dataset.sh` and configure the following parameters:
  - model-name: Model name to be used. If it's a CoE model, add the "COE/" prefix to the name. Example: "COE/Meta-Llama-3-8B-Instruct"
  - llm-api: API type to be used. If it's a SambaNova Cloud model, double-check the model name spelling, since it's shorter than the corresponding SambaStudio model names.
  - results-dir: Path to the results directory. Default: "./data/results/llmperf"
  - num-workers: Number of concurrent workers. Default: 1
  - timeout: Timeout in seconds. Default: 600
  - num-input-tokens: Number of input tokens to include in the request prompts. It's recommended to choose no more than 2000 tokens to avoid long wait times. Default: 1000
  - num-output-tokens: Number of output tokens in the generation. It's recommended to choose no more than 2000 tokens to avoid long wait times. Default: 1000
  - num-requests: Number of requests sent. Default: 32. Note: the program can time out before all requests are sent; configure the timeout parameter accordingly.

  Note: You should leave the `--mode` parameter untouched; it indicates which dataset mode to use.
- Run the script
  - Run the following command in your terminal:
    sh run_synthetic_dataset.sh
  - The evaluation process will start, and a progress bar will be shown until it completes.
- Analyze results
  - Results will be saved at the location specified in `results-dir`.
  - The names of the output files depend on the model name, the number of input and output tokens, the number of workers, and the mode. You should see files that follow a format similar to the following: `<MODEL_NAME>_{NUM_INPUT_TOKENS}_{NUM_OUTPUT_TOKENS}_{NUM_CONCURRENT_WORKERS}_{MODE}`
  - For each run, two files are generated, with the `_individual_responses` and `_summary` suffixes in the output file names:
    - Individual responses file
      - This output file contains the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the server (if available) and the client side, for each individual request sent to the LLM. Users can use this data for further analysis; we provide the notebook `notebooks/analyze-token-benchmark-results.ipynb` with some charts to get started.
    - Summary file
      - This file includes various statistics, such as percentiles, mean, and standard deviation, describing the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the client side. It also provides additional data points about the overall run, such as the inputs used, the number of errors, and the number of completed requests per minute.
- There's an additional notebook, `notebooks/multiple-models-benchmark.ipynb`, that helps users run multiple benchmarks with different experts and gather the performance results in a single table. A CoE endpoint is meant to be used for this analysis.
This kit also supports SambaStudio models with dynamic batch size, which improves model performance significantly.
To use a batching model, users first need to set up an endpoint that supports this feature; please look at this section for reference. Additionally, users need to specify a number of workers greater than 1, either in the Streamlit app or from the terminal. Since the current maximum batch size is 16, it's recommended to choose a number of workers equal to or greater than that to test different batch sizes.
Here are some example parameter settings for using an endpoint with and without dynamic batching.
Non-batching setup:
If the user wants to send 32 requests to be processed sequentially, here are the parameter values that can work as an example:
- Parameters:
- Number of requests: 32
- Number of concurrent workers: 1
We can see in the following Gantt chart how the 32 requests are executed one after the other. (SambaNova Cloud with Llama3-8b was used for this example.)
Batching setup:
If the user wants to send 60 requests to be processed in batch, it's important to consider the number of workers chosen.
For example, with the following parameter values:
- Parameters:
- Number of requests: 60
- Number of concurrent workers: 21
We can see from the Gantt chart that the requests are batched and processed as 1-16-4, because there are 21 workers sending requests in parallel. This setup took approximately 4 minutes 30 seconds.
Another example is the following:
- Parameters:
- Number of requests: 60
- Number of concurrent workers: 60
We can see from the Gantt chart that the requests are batched and processed as 1-16-16-16-8-1-1-1, because there are 60 workers sending all requests in parallel. This setup took approximately 3 minutes.
All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:
- streamlit (version 1.37.0)
- st-pages (version 0.5.0)
- transformers (version 4.41.1)
- python-dotenv (version 1.0.0)
- Requests (version 2.31.0)
- seaborn (version 0.12.2)