by Felix Krumme, Ray Wan, Dimple Amitha Garuadapuri, Manan Dhanuka
12/18/2024
Contains further data analysis scripts.
The datasets folder contains all raw datasets in CSV format.
The questions directory contains all question collection sets in JSON format.
The question sets with the scoring from the evaluator are stored here for every tested model.
This script contains the code for the test_agent and the evaluator.
This notebook contains the code to run the benchmark on all questions for a specific LLM.
This script contains the code to create general information for the benchmark results.
This notebook contains the code to create graphs to visualize the benchmark results.
To integrate and use the LangChain agent with Large Language Models (LLMs), you need to provide the necessary API keys for the LLM services.
Create a .env file in the root of the project directory (if one doesn't exist already).
Add your API keys for the respective LLMs into the .env file. The format should be as follows:
OPENAI_API_KEY="your_api_key_here"
ANTHROPIC_API_KEY="your_api_key_here"
GEMINI_API_KEY="your_api_key_here"
MISTRAL_API_KEY="your_api_key_here"
If you do NOT have all API keys yet, obtain them from the respective providers (OpenAI, Anthropic, Google, Mistral).
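To confirm the keys are picked up before running the benchmark, you can use a quick sanity check like the sketch below. It assumes python-dotenv is installed (a common choice for loading .env files, possibly already covered by the requirements file); the key names match the .env format above.

import os
from dotenv import load_dotenv

# Read the .env file in the project root and check that every expected key is present.
load_dotenv()
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "MISTRAL_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")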
Ensure that you have the necessary Python packages installed for working with environment variables and the LangChain agent. Locate the requirements.txt file within this repository to see the required packages and versions.
To install the requirements, use the following command:
pip install -r requirements.txt
If these steps did NOT work, install the dependencies manually by going through the files and installing any packages that throw import errors.
This file is where the test agent execution and the evaluation happen. Open the langchain_agent.py file and navigate to the main function at the bottom of the script.
It should look something like this:
if __name__ == "__main__":
    # DO NOT EDIT
    questions_folder = "./questions/"
    output_folder = "./results_folder/"

    # Edit this to the model you want to test
    processor = DataFrameAgentProcessor(
        model_type="anthropic",
        questions_path="",
        model="claude-3-5-haiku-latest"
    )

    # Edit this path to the question collection you want to execute
    processor.process_questions_list(
        ["./questions/statistics_4_hedge_fund_questions.json"],
        output_folder
    )
- Change the model_type to the model you want to test (openai, gemini, mistral, or anthropic)
- Change the model to the specific model you want to test
Edit the file path and set it to the question set you want to evaluate. All question sets are contained within the questions directory.
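For example, to test an OpenAI model instead of the Anthropic default, the edited section might look like the sketch below. The model string gpt-4o-mini is only an assumption for illustration; substitute whichever model identifier your API key supports.

    # Edit this to the model you want to test
    processor = DataFrameAgentProcessor(
        model_type="openai",
        questions_path="",
        model="gpt-4o-mini"  # assumed model name, replace as needed
    )

    # Edit this path to the question collection you want to execute
    processor.process_questions_list(
        ["./questions/statistics_4_hedge_fund_questions.json"],
        output_folder
    )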
After editing the main function to use the model and question collection you want to test, run it.
The results of the model you just ran should appear in the results_folder directory.
This Jupyter notebook can be executed to run all questions at once.
Change the model parameters to the model you want to run.
Execute the notebook and find all answered questions in the results_folder directory.
We provide a script to calculate key metrics to evaluate the benchmark results.
This file contains the code to create the comprehensive model comparison.
Navigate to the main method and run it to generate the comprehensive_model_comparison.txt report.
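The metrics themselves are defined in that script. Purely as an illustration of how per-model scores could be aggregated, the sketch below assumes each results file in results_folder is a JSON list of question entries carrying a numeric "score" field; that file layout is an assumption, not a documented format.

import glob
import json

# Hypothetical aggregation: average score per results file (assumed layout).
for path in sorted(glob.glob("./results_folder/*.json")):
    with open(path) as f:
        entries = json.load(f)
    scores = [entry["score"] for entry in entries if "score" in entry]
    average = sum(scores) / len(scores) if scores else 0.0
    print(f"{path}: {len(scores)} scored questions, average score {average:.2f}")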
We include a Jupyter notebook to create different graphs that help with visualizing the benchmark results.
This file contains the code to visualize the benchmark data.
Run the notebook to create graphs that visualize the benchmark results.
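The notebook defines the actual plots. As a rough illustration only, a bar chart of average score per results file could be generated as below, reusing the same assumed results-file layout as the metrics sketch above.

import glob
import json
import matplotlib.pyplot as plt

labels, averages = [], []
for path in sorted(glob.glob("./results_folder/*.json")):
    with open(path) as f:
        entries = json.load(f)
    scores = [entry["score"] for entry in entries if "score" in entry]
    if scores:
        labels.append(path.split("/")[-1].removesuffix(".json"))
        averages.append(sum(scores) / len(scores))

# Simple bar chart of average scores (placeholder visualization, not the notebook's plots)
plt.bar(labels, averages)
plt.ylabel("Average score")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("average_scores.png")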
This concludes the DataSense Benchmark by Felix Krumme, Ray Wan, Dimple Amitha Garuadapuri, and Manan Dhanuka.