Multiple Prompts
- allows multiple prompts and output files in a single run. this saves model loading time, especially when testing multiple prompts with the hf and vllm runners
- we ensure early in main.py that the number of prompt files and output files match, since this check applies to all runners
wongjingping committed Nov 7, 2023
1 parent fbab8ad commit 1e98620
Showing 6 changed files with 455 additions and 383 deletions.
49 changes: 40 additions & 9 deletions README.md
@@ -101,13 +101,12 @@ Having implemented the query generator, the next piece of abstraction would be t

## Running the Test

### OpenAI
Remember to have your OpenAI API key (`OPENAI_API_KEY="sk-..."`) set as an environment variable before running the test if you plan to call the OpenAI API (or Anthropic/other LLM API's accordingly).
### OpenAI / Anthropic
Remember to have your API key (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`) set as an environment variable before running the test if you plan to call the OpenAI or Anthropic APIs (or other LLM APIs, as applicable).

To test it out with just 10 questions (instead of all 175), parallelized across 5 :
To test it out with just 10 questions (instead of all 175) using the gpt-3.5-turbo model, parallelized across 5 workers:

```bash
mkdir results # create directory for storing results
python main.py \
-q data/questions_gen.csv \
-o results/my_query_generator.csv \
@@ -118,21 +117,53 @@ python main.py \
-p 5
```

To test out the full suite of questions for claude-2:
```bash
python main.py \
-q data/questions_gen.csv \
-o results/claude-2.csv \
-g anthropic \
-f prompts/prompt_anthropic.md \
-m claude-2
```

### Hugging Face
To test our fine-tuned SQL model with just 10 questions (instead of all 175):

```bash
mkdir results # create directory for storing results

# use the -W option to ignore warnings about sequential use of transformers pipeline
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results.csv \
-g hf \
-f prompts/prompt.md \
-m defog/starcoder-finetune-v3 \
-m defog/sqlcoder2 \
-n 10
```
We also support loading a PEFT adapter here via the `-a` flag; see the example below. Note that loading the adapter together with the base model will take slightly longer than usual.
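A minimal sketch of what this might look like (the adapter path and output file name below are placeholders, not files in this repo):
```bash
# hypothetical adapter path; replace with the directory containing your PEFT adapter
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results_adapter.csv \
-g hf \
-f prompts/prompt.md \
-m defog/sqlcoder2 \
-a path/to/adapter \
-n 10
```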

### VLLM

We also have a [vllm](vllm.ai) runner, which uses the vLLM engine to run inference as a single batch. This is much faster, especially when `num_beams` > 1. You will need to pass in a single set of merged model weights, and the model architecture must be supported by vLLM. Here's a sample command:
```bash
python -W ignore main.py \
-q data/questions_gen.csv \
-o "results/results.csv" \
-g vllm \
-f "prompts/prompt.md" \
-m defog/sqlcoder2
```

If you'd like to test out a few prompts in a single run (to save the few minutes spent loading the model onto the GPU at the start of each run), you can specify a list of prompt files in `--prompt_file` (e.g. `-f prompts/prompt-1.md prompts/prompt-2.md prompts/prompt-3.md`), as well as a corresponding list of output files in `--output_file` (e.g. `-o results/results-1.csv results/results-2.csv results/results-3.csv`). The number of prompt files and output files must be the same. Here's a sample command:
```bash
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results_1.csv results/results_2.csv \
-g vllm \
-f prompts/prompt_1.md prompts/prompt_2.md \
-m defog/sqlcoder2
```
While you can do the same for the other runners, the time savings are most significant when loading a large model locally, as opposed to calling an always-on API.
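For instance, here is a hedged sketch of the same multi-prompt pattern with the `hf` runner (the prompt and output file names are placeholders):
```bash
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results_1.csv results/results_2.csv \
-g hf \
-f prompts/prompt_1.md prompts/prompt_2.md \
-m defog/sqlcoder2 \
-n 10
```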


### CLI Flags
@@ -143,9 +174,9 @@ You can use the following flags in the command line to change the configurations
| -n, --num_questions | Use this to limit the total number of questions you want to test. |
| -g, --model_type | Model type used. Make sure this matches the model used. Currently defined options in `main.py` include `oa` for OpenAI models, `anthropic` for Anthropic models, `hf` for Hugging Face models, and `vllm` for the vLLM runner. |
| -m, --model | Model that will be tested and used to generate the queries. Currently defined options for OpenAI models are chat models `gpt-3.5-turbo-0613` and `gpt-4-0613`, and non-chat model `text-davinci-003`. For Hugging Face models, simply use the path of your chosen model (e.g. `defog/sqlcoder`). |
| -f, --prompt_file | Markdown file with the prompt used for query generation. |
| -f, --prompt_file | Markdown file with the prompt used for query generation. You can pass in a list of prompt files to test them sequentially in a single run, without reloading the model each time. |
| -d, --use_private_data | Use this to read from your own private data library. |
| -o, --output_file | Output CSV file that will store your results. |
| -o, --output_file | Output CSV file that will store your results. You must pass the same number of output file paths as prompt files. |
| -p, --parallel_threads | The default no. of parallel threads is 5. Decrease this to 1 for gpt-4 to avoid the rate limit error. Parallelization is currently supported for the OpenAI and Anthropic runners. |
| -t, --timeout_gen | No. of seconds before timeout occurs for query generation. The default is 30.0s. |
| -u, --timeout_exec | No. of seconds before timeout occurs for query execution on the database. The default is 10.0s. |
230 changes: 120 additions & 110 deletions eval/anthropic_runner.py
@@ -1,5 +1,6 @@
from concurrent.futures import ThreadPoolExecutor, as_completed
import copy
import os
from eval.eval import compare_query_results
import pandas as pd
from psycopg2.extensions import QueryCanceledError
@@ -9,125 +10,134 @@


def run_anthropic_eval(args):
print("preparing questions...")
# get questions
question_query_df = prepare_questions_df(args.questions_file, args.num_questions)
qg_class = AnthropicQueryGenerator
# add columns for generated query and metrics
question_query_df["generated_query"] = ""
question_query_df["reason"] = ""
question_query_df["error_msg"] = ""
question_query_df["exact_match"] = 0
question_query_df["correct"] = 0
question_query_df["error_query_gen"] = 0
question_query_df["error_db_exec"] = 0
question_query_df["timeout"] = 0
# add custom metrics below:
question_query_df["latency_seconds"] = 0.0 # latency of query generation in seconds
question_query_df["tokens_used"] = 0 # number of tokens used in query generation
for prompt_file, output_file in zip(args.prompt_file, args.output_file):
print("preparing questions...")
# get questions
question_query_df = prepare_questions_df(
args.questions_file, args.num_questions
)
qg_class = AnthropicQueryGenerator
# add columns for generated query and metrics
question_query_df["generated_query"] = ""
question_query_df["reason"] = ""
question_query_df["error_msg"] = ""
question_query_df["exact_match"] = 0
question_query_df["correct"] = 0
question_query_df["error_query_gen"] = 0
question_query_df["error_db_exec"] = 0
question_query_df["timeout"] = 0
# add custom metrics below:
# latency of query generation in seconds
question_query_df["latency_seconds"] = 0.0
# number of tokens used in query generation
question_query_df["tokens_used"] = 0

question_query_df.reset_index(inplace=True, drop=True)
question_query_df.reset_index(inplace=True, drop=True)

input_rows = question_query_df.to_dict("records")
output_rows = []
with ThreadPoolExecutor(args.parallel_threads) as executor:
# for each query in the csv, generate a query using the generator asynchronously
futures = []
for row in input_rows:
# get db creds for each row's db_name
db_name = row["db_name"]
db_creds = {
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "postgres",
"database": db_name,
}

qg = qg_class(
db_creds=copy.deepcopy(db_creds),
model=args.model,
prompt_file=args.prompt_file,
timeout=args.timeout_gen,
verbose=args.verbose,
)

generated_query_fut = executor.submit(
qg.generate_query, question=row["question"]
)
futures.append(generated_query_fut)

total_tried = 0
total_correct = 0
for f in (pbar := tqdm(as_completed(futures), total=len(futures))):
total_tried += 1
i = futures.index(f)
row = input_rows[i]
result_dict = f.result()
query_gen = result_dict["query"]
reason = result_dict["reason"]
err = result_dict["err"]
# save custom metrics
if "latency_seconds" in result_dict:
row["latency_seconds"] = result_dict["latency_seconds"]
if "tokens_used" in result_dict:
row["tokens_used"] = result_dict["tokens_used"]
row["generated_query"] = query_gen
row["reason"] = reason
row["error_msg"] = err
# save failures into relevant columns in the dataframe
if "GENERATION ERROR" in err:
row["error_query_gen"] = 1
elif "EXECUTION ERROR" in err:
row["error_db_exec"] = 1
elif "TIMEOUT" in err:
row["timeout"] = 1
else:
expected_query = row["query"]
input_rows = question_query_df.to_dict("records")
output_rows = []
with ThreadPoolExecutor(args.parallel_threads) as executor:
# for each query in the csv, generate a query using the generator asynchronously
futures = []
for row in input_rows:
# get db creds for each row's db_name
db_name = row["db_name"]
question = row["question"]
query_category = row["query_category"]
exact_match = correct = 0
db_creds = {
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "postgres",
"database": db_name,
}
# try executing the queries and compare the results if they succeed
try:
exact_match, correct = compare_query_results(
query_gold=expected_query,
query_gen=query_gen,
db_name=db_name,
db_creds=db_creds,
timeout=args.timeout_exec,
question=question,
query_category=query_category,
)
row["exact_match"] = int(exact_match)
row["correct"] = int(correct)
row["error_msg"] = ""
if correct:
total_correct += 1
except QueryCanceledError as e:
row["timeout"] = 1
row["error_msg"] = f"QUERY EXECUTION TIMEOUT: {e}"
except Exception as e:

qg = qg_class(
db_creds=copy.deepcopy(db_creds),
model=args.model,
prompt_file=prompt_file,
timeout=args.timeout_gen,
verbose=args.verbose,
)

generated_query_fut = executor.submit(
qg.generate_query, question=row["question"]
)
futures.append(generated_query_fut)

total_tried = 0
total_correct = 0
for f in (pbar := tqdm(as_completed(futures), total=len(futures))):
total_tried += 1
i = futures.index(f)
row = input_rows[i]
result_dict = f.result()
query_gen = result_dict["query"]
reason = result_dict["reason"]
err = result_dict["err"]
# save custom metrics
if "latency_seconds" in result_dict:
row["latency_seconds"] = result_dict["latency_seconds"]
if "tokens_used" in result_dict:
row["tokens_used"] = result_dict["tokens_used"]
row["generated_query"] = query_gen
row["reason"] = reason
row["error_msg"] = err
# save failures into relevant columns in the dataframe
if "GENERATION ERROR" in err:
row["error_query_gen"] = 1
elif "EXECUTION ERROR" in err:
row["error_db_exec"] = 1
row["error_msg"] = f"QUERY EXECUTION ERROR: {e}"
output_rows.append(row)
pbar.set_description(
f"Correct so far: {total_correct}/{total_tried} ({100*total_correct/total_tried:.2f}%)"
)
output_df = pd.DataFrame(output_rows)
output_df = output_df.sort_values(by=["db_name", "query_category", "question"])
output_df.to_csv(args.output_file, index=False, float_format="%.2f")
elif "TIMEOUT" in err:
row["timeout"] = 1
else:
expected_query = row["query"]
db_name = row["db_name"]
question = row["question"]
query_category = row["query_category"]
exact_match = correct = 0
db_creds = {
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "postgres",
"database": db_name,
}
# try executing the queries and compare the results if they succeed
try:
exact_match, correct = compare_query_results(
query_gold=expected_query,
query_gen=query_gen,
db_name=db_name,
db_creds=db_creds,
timeout=args.timeout_exec,
question=question,
query_category=query_category,
)
row["exact_match"] = int(exact_match)
row["correct"] = int(correct)
row["error_msg"] = ""
if correct:
total_correct += 1
except QueryCanceledError as e:
row["timeout"] = 1
row["error_msg"] = f"QUERY EXECUTION TIMEOUT: {e}"
except Exception as e:
row["error_db_exec"] = 1
row["error_msg"] = f"QUERY EXECUTION ERROR: {e}"
output_rows.append(row)
pbar.set_description(
f"Correct so far: {total_correct}/{total_tried} ({100*total_correct/total_tried:.2f}%)"
)
output_df = pd.DataFrame(output_rows)
output_df = output_df.sort_values(by=["db_name", "query_category", "question"])
# get directory of output_file and create if not exist
output_dir = os.path.dirname(output_file)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_df.to_csv(output_file, index=False, float_format="%.2f")

# get average rate of exact matches
avg_acc = output_df["exact_match"].sum() / len(output_df)
print(f"Average rate of exact match: {avg_acc:.2f}")
# get average rate of correct results
avg_subset = output_df["correct"].sum() / len(output_df)
print(f"Average correct rate: {avg_subset:.2f}")
# get average rate of exact matches
avg_acc = output_df["exact_match"].sum() / len(output_df)
print(f"Average rate of exact match: {avg_acc:.2f}")
# get average rate of correct results
avg_subset = output_df["correct"].sum() / len(output_df)
print(f"Average correct rate: {avg_subset:.2f}")