The evaluation structure can be seen in the Jupyter Notebook (embedding_evaluation.ipynb).
Retrieval Augmented Generation (RAG) is the method used here to evaluate the various embedding models.
RAG provides the grounding for today's chatbots and digital assistants, so it shows clearly how well the different models generate embeddings and how reliably the relevant content can be retrieved from them.
RAG for enterprises is sometimes called RAGE and is currently highly relevant for digital assistants that are grounded in the specific business data of the company in which they are deployed.
The results of the evaluation are based on the retrieval step that was implemented for each model. In the first part, only the single nearest answer is retrieved (k-nearest neighbors in the vector space of the embeddings, with k = 1), and we measure, over a data set of 100 questions on SAP-specific knowledge, how often this retrieved answer is correct and how often it is wrong. A minimal sketch of how the embeddings could be generated for the different models is shown below.
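The following is a minimal sketch, not the notebook's exact code, of how the question and answer texts could be embedded with the two providers. The helper names (`embed_openai`, `embed_mistral`) are illustrative; the OpenAI call uses the official `openai` v1 client, and the Mistral call assumes the v1 `mistralai` Python client.

```python
import os

from openai import OpenAI
from mistralai import Mistral

# Clients for the two embedding providers (API keys via environment variables).
openai_client = OpenAI()  # reads OPENAI_API_KEY
mistral_client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])


def embed_openai(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    """Embed a batch of texts with one of the OpenAI embedding models."""
    response = openai_client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def embed_mistral(texts: list[str], model: str = "mistral-embed") -> list[list[float]]:
    """Embed a batch of texts with the Mistral embedding model (v1 client assumed)."""
    response = mistral_client.embeddings.create(model=model, inputs=texts)
    return [item.embedding for item in response.data]
```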
In the second part of the evaluation, k is set to 3, so that the 3 nearest embeddings are considered for each model, showing how well the models perform when several neighboring embeddings are taken into account. A sketch of the retrieval and failure-score calculation follows below.
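As a rough illustration of the evaluation logic described above, here is a hedged sketch of the k-nearest-neighbor retrieval and the resulting failure score; the function names and the `correct_idx` mapping are assumptions, not the notebook's actual variables.

```python
import numpy as np


def top_k_indices(query_vec: np.ndarray, knowledge_vecs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most similar knowledge embeddings by cosine similarity."""
    sims = knowledge_vecs @ query_vec / (
        np.linalg.norm(knowledge_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(sims)[::-1][:k]


def failure_score(question_vecs: np.ndarray, knowledge_vecs: np.ndarray,
                  correct_idx: list[int], k: int = 1) -> int:
    """Count the questions whose correct entry is NOT among the top-k neighbors."""
    failures = 0
    for i, q in enumerate(question_vecs):
        if correct_idx[i] not in top_k_indices(q, knowledge_vecs, k):
            failures += 1
    return failures


# With 100 questions, e.g. 7 failures at k = 1 correspond to 93 % accuracy:
# accuracy = 100 - failure_score(question_vecs, knowledge_vecs, correct_idx, k=1)
```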
- Failure score, Mistral embedding model (mistral-embed): 7 failures => 93 % accuracy
- Failure score, OpenAI embedding model (text-embedding-ada-002): 13 failures => 87 % accuracy
- Failure score, OpenAI embedding model (text-embedding-3-small): 15 failures => 85 % accuracy
- Failure score, OpenAI embedding model (text-embedding-3-large): 7 failures => 93 % accuracy
Best model (for this dataset in German): mistral-embed, text-embedding-3-large
- Failure score, Mistral embedding model (mistral-embed): 4 failures => 96 % accuracy
- Failure score, OpenAI embedding model (text-embedding-ada-002): 3 failures => 97 % accuracy
- Failure score, OpenAI embedding model (text-embedding-3-small): 4 failures => 96 % accuracy
- Failure score, OpenAI embedding model (text-embedding-3-large): 2 failures => 98 % accuracy
Best model (for this dataset in German): text-embedding-3-large
- Failure score, Mistral embedding model (mistral-embed): 2 failures => 98 % accuracy
- Failure score, OpenAI embedding model (text-embedding-ada-002): 2 failures => 98 % accuracy
- Failure score, OpenAI embedding model (text-embedding-3-small): 3 failures => 97 % accuracy
- Failure score, OpenAI embedding model (text-embedding-3-large): 4 failures => 96 % accuracy
Best model (for this dataset in German): mistral-embed, text-embedding-ada-002
Here you can see very clearly that the models perform differently as soon as several neighbors are used as comparison vectors for questions and answers. If an LLM is connected downstream and is given the 3 retrieved answers together with the input prompt so that it can decide which answer fits best, the accuracy of the answers could be improved considerably; a sketch of such a setup follows below.
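Such an LLM selection step is not part of the evaluation itself; the following is only a hedged sketch of how it could look with the OpenAI chat API. The model name and the prompt wording are assumptions.

```python
from openai import OpenAI

client = OpenAI()


def pick_best_answer(question: str, candidate_answers: list[str]) -> str:
    """Ask a chat model to choose the most suitable of the retrieved candidate answers."""
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(candidate_answers))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model could be used here
        messages=[
            {"role": "system",
             "content": ("You are given a question and several candidate answers. "
                         "Reply only with the candidate answer that best answers the question.")},
            {"role": "user", "content": f"Question: {question}\n\nCandidates:\n{numbered}"},
        ],
    )
    return response.choices[0].message.content
```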
Given the large number of variables and methods, it is quite possible that some of them were mixed up and that the results are distorted as a result. However, I have checked the notebook several times, and it should now perform the calculations correctly in most cases.