GitHub

Tune the Rag

Convert any Huggingface dataset to a retrieval-augmented dataset
Read the blog post »

About The Project

Use this repo to augment prompts with semantically similar prompts from the training set.
Choose the embedding model to perform the semantic search.
Customise the prompts.

For example, the data point from an SQL dataset {"question": "How many singers do we have?", "context": "CREATE TABLE singer (Id VARCHAR)"} gets prompted to be:

You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables. You must output the SQL query that answers the question.

Given the following example:
### Input:
How many artists do we have?

### Context:
CREATE TABLE artist (Id VARCHAR)

### Response:
SELECT count(*) FROM artist

Please generate the SQL query that answers the following:
### Input:
How many singers do we have?

### Context:
CREATE TABLE singer (Id VARCHAR)

### Response:

And these are the results of using retrieval-augmented prompts vs few-shot prompts:

Getting Started

Prerequisites

Clone repo

git clone https://github.com/samlhuillier/tunetherag.git

Install requirements

pip install -r requirements.txt

Creating a RAG dataset

Open tunetherag.ipynb
In the third cell, modify the dataset loading arguments:
```
embedding_feature = "question"
dataset_parameters = {"dataset_name": "gsm8k", "config_name": "main"}
```
(embedding_feature is what will be embedded and dataset_parameters will be plugged into load_dataset)

Setup prompts:

def format_math_example(example):
 inference_prompt = f"""### Problem:
 {example["question"]}

 ### Answer:"""

 full_prompt = f"{inference_prompt}\n{example['answer']}"

 return full_prompt, inference_prompt

math_prompt = "Solve the following math problem thinking step-by-step:"

Run cells to generate the retrieval augmented dataset which will be saved as a set of json files per split!

(back to top)

License

Distributed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
datasets		datasets
embeddings		embeddings
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
augment_dataset.py		augment_dataset.py
prompt_setup.py		prompt_setup.py
requirements.txt		requirements.txt
tunetherag.ipynb		tunetherag.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tune the Rag

About The Project

Getting Started

Prerequisites

Creating a RAG dataset

License

About

Releases

Packages

Languages

samlhuillier/tunetherag

Folders and files

Latest commit

History

Repository files navigation

Tune the Rag

About The Project

Getting Started

Prerequisites

Creating a RAG dataset

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages