German Embedding Dataset Generation

Following intfloat/e5-mistral-7b-instruct and the paper "Improving Text Embeddings with Large Language Models" (arXiv), a synthetic dataset is generated. For now, the dataset only covers retrieval tasks and is produced in the following steps:

Brainstorming Topics for Retrieval Tasks

Instead of asking the LLM to be creative on its own, the tasks are generated from five topics sampled at random from the Quora dataset. An example of such a topic and the tasks generated for it:

topic: Heißwassererhitzer
questions:
["Suche nach Anleitungen zur Installation von Boilern",
"Finde Produktbeschreibungen von Heißwassererhitzern mit Energieeffizienzklasse A.",
...
]
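
The sampling step described above could be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the function name `build_brainstorm_prompt`, the prompt wording, and the placeholder topic pool are all assumptions. In the actual pipeline the topics come from the Quora dataset.

```python
import random


def build_brainstorm_prompt(topics, k=5, seed=None):
    """Sample k distinct topics and embed them in a brainstorming prompt.

    The prompt wording is hypothetical and only illustrates the idea of
    steering the LLM with sampled topics instead of asking it to be creative.
    """
    rng = random.Random(seed)
    sampled = rng.sample(list(topics), k)  # sampling without replacement
    topic_lines = "\n".join(f"- {topic}" for topic in sampled)
    prompt = (
        "Brainstorm German retrieval tasks for each of the following topics.\n"
        f"{topic_lines}\n"
        "Return one list of task strings per topic."
    )
    return sampled, prompt


# Placeholder pool; assumed to stand in for topics drawn from the Quora dataset.
topic_pool = [
    "Heißwassererhitzer",
    "Medizinforschung",
    "Bildung",
    "Netzwerktechnik",
    "Ernährung",
    "Reisen",
    "Steuerrecht",
]

sampled_topics, prompt = build_brainstorm_prompt(topic_pool, k=5, seed=0)
print(prompt)
```

Passing a fixed `seed` makes a run reproducible. Leaving it as `None` gives a different topic mix on every call, which is the source of variety this step relies on.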

Generating Questions

Retrieval quality can differ depending on the style of the question. Short search strings performed better on short text chunks, while full questions performed better on longer chunks. The following question styles are generated:

Search String: short keywords that mimic search engine queries, e.g. "example of IPv4 address in CIDR notation"
Imperative Question: a question phrased in the imperative form, e.g. "Describe how to specify an IP address range using the IP-address and IP-mask fields."
Question: a standard question, e.g. "What is the difference between specifying an IP address range using the CIDR notation and using the IP-address and IP-mask fields?"

Examples:

Identifiziere Studien zur Bedeutung von Bildung für die persönliche Entwicklung
Imperative Form:  "Suche nach Studien, die die Bedeutung von Bildung für die persönliche Entwicklung untersuchen."
Question:  "Welche Studien gibt es zur Bedeutung von Bildung für die persönliche Entwicklung?"
Search String:  "Studien Bedeutung Bildung persönliche Entwicklung"
-----------------------------
Berichte über innovative Ansätze in der Medizinforschung und -praxis
Imperative Form:  "Finde Berichte über neue Ansätze in der Medizinforschung und -praxis."
Question:  "Wo kann ich Berichte über innovative Ansätze in der Medizinforschung und -praxis finden?"
Search String:  "Berichte innovative Ansätze Medizinforschung Praxis"
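
Output in the layout shown above can be turned into structured records with a small parser. The sketch below assumes exactly that layout (a task line, followed by `Imperative Form:`, `Question:`, and `Search String:` lines with quoted values, and blocks separated by a line of dashes). The function name `parse_question_blocks` is hypothetical and not taken from the repository.

```python
import re

# Matches lines such as: Question:  "Welche Studien gibt es ...?"
FIELD_PATTERN = re.compile(
    r'^(Imperative Form|Question|Search String):\s*"(.*)"\s*$'
)
# Matches a separator line made of three or more dashes.
SEPARATOR_PATTERN = re.compile(r"^-{3,}\s*$")


def parse_question_blocks(raw):
    """Parse task blocks into dicts holding the task and its question styles."""
    records = []
    current = None
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if SEPARATOR_PATTERN.match(line):
            current = None  # the next non-field line starts a new record
            continue
        match = FIELD_PATTERN.match(line)
        if match:
            if current is not None:
                current[match.group(1)] = match.group(2)
        else:
            # Any other line is treated as the task that opens a new block.
            current = {"task": line}
            records.append(current)
    return records


raw_output = '''
Identifiziere Studien zur Bedeutung von Bildung für die persönliche Entwicklung
Imperative Form:  "Suche nach Studien, die die Bedeutung von Bildung für die persönliche Entwicklung untersuchen."
Question:  "Welche Studien gibt es zur Bedeutung von Bildung für die persönliche Entwicklung?"
Search String:  "Studien Bedeutung Bildung persönliche Entwicklung"
-----------------------------
Berichte über innovative Ansätze in der Medizinforschung und -praxis
Imperative Form:  "Finde Berichte über neue Ansätze in der Medizinforschung und -praxis."
Question:  "Wo kann ich Berichte über innovative Ansätze in der Medizinforschung und -praxis finden?"
Search String:  "Berichte innovative Ansätze Medizinforschung Praxis"
'''

records = parse_question_blocks(raw_output)
for record in records:
    print(record["task"], "->", record["Search String"])
```

Field lines that appear before any task line are ignored, so a malformed generation does not get attached to the wrong record.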

Generating the Hard Positive and Hard Negative Examples

Variance also matters in this step. Rather than always producing text with the same structure, the generation samples from predefined categories.
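
One way to sample from predefined categories is sketched below. The category names and values (`text_type`, `length`, `tone`) and the prompt wording are purely hypothetical placeholders, since the actual categories used in the notebooks are not listed here.

```python
import random

# Hypothetical categories; the real ones are defined in the generation notebooks.
TEXT_CATEGORIES = {
    "text_type": ["news article", "forum post", "product description", "manual excerpt"],
    "length": ["one short paragraph", "two to three paragraphs"],
    "tone": ["formal", "casual"],
}


def sample_generation_prompt(question, rng):
    """Pick one value per category and build a prompt for both example texts."""
    choice = {name: rng.choice(options) for name, options in TEXT_CATEGORIES.items()}
    prompt = (
        f"Write two German texts as a {choice['text_type']} "
        f"({choice['length']}, {choice['tone']} tone).\n"
        f"Query: {question}\n"
        "Text 1 (hard positive): answers the query.\n"
        "Text 2 (hard negative): stays on the same topic but does not answer the query."
    )
    return choice, prompt


rng = random.Random(42)
choice, prompt = sample_generation_prompt(
    "Welche Studien gibt es zur Bedeutung von Bildung für die persönliche Entwicklung?",
    rng,
)
print(choice)
print(prompt)
```

Because every call draws a fresh combination, repeated calls for the same query produce prompts that differ in structure, which is the kind of variance this step aims for.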

SebastianBodza/Embedding_Training

Folders and files

Latest commit

History

Repository files navigation

German Embedding Dataset Generation

Brainstorming Topics for such Retreival Tasks

Generating Questions

Generating the hard positive and hard negatie example

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages