Multiple Prompts
- allows multiple prompts and output files in a single run. this saves model loading time, especially when testing multiple prompts with the hf and vllm runners
- we ensure early in main.py that the number of prompt files and output files match, since this check applies to all runners
wongjingping committed Nov 7, 2023
1 parent fbab8ad commit 1e98620
Showing 6 changed files with 455 additions and 383 deletions.
49 changes: 40 additions & 9 deletions README.md
@@ -101,13 +101,12 @@ Having implemented the query generator, the next piece of abstraction would be t

## Running the Test

### OpenAI
Remember to have your OpenAI API key (`OPENAI_API_KEY="sk-..."`) set as an environment variable before running the test if you plan to call the OpenAI API (or Anthropic/other LLM API's accordingly).
### OpenAI / Anthropic
Remember to have your API key (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`) set as an environment variable before running the test if you plan to call the OpenAI or Anthropic APIs (or other LLM APIs, as applicable).

To test it out with just 10 questions (instead of all 175), parallelized across 5 :
To test it out with just 10 questions (instead of all 175) using the gpt-3.5-turbo model, parallelized across 5 workers:

```bash
mkdir results # create directory for storing results
python main.py \
-q data/questions_gen.csv \
-o results/my_query_generator.csv \
@@ -118,21 +117,53 @@ python main.py \
-p 5
```

To test out the full suite of questions for claude-2:
```bash
python main.py \
-q data/questions_gen.csv \
-o results/claude-2.csv \
-g anthropic \
-f prompts/prompt_anthropic.md \
-m claude-2
```

### Hugging Face
To test our fine-tuned SQL model with just 10 questions (instead of all 175):

```bash
mkdir results # create directory for storing results

# use the -W option to ignore warnings about sequential use of transformers pipeline
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results.csv \
-g hf \
-f prompts/prompt.md \
-m defog/starcoder-finetune-v3 \
-m defog/sqlcoder2 \
-n 10
```
We also support loading a PEFT adapter here via the `-a` flag; see the example below. Note that loading the adapter together with the base model will take slightly longer than usual.
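A minimal sketch of what this might look like (the adapter path and output file name below are placeholders, not files in this repo):
```bash
# hypothetical adapter path; replace with the directory containing your PEFT adapter
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results_adapter.csv \
-g hf \
-f prompts/prompt.md \
-m defog/sqlcoder2 \
-a path/to/adapter \
-n 10
```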

### VLLM

We also have a [vllm](vllm.ai) runner, which uses the vLLM engine to run inference as a single batch. This is much faster, especially when `num_beams` > 1. You will need to pass in a single set of merged model weights, and the model architecture must be supported by vLLM. Here's a sample command:
```bash
python -W ignore main.py \
-q data/questions_gen.csv \
-o "results/results.csv" \
-g vllm \
-f "prompts/prompt.md" \
-m defog/sqlcoder2
```

If you'd like to test out a few prompts in a single run (to save the few minutes spent loading the model onto the GPU at the start of each run), you can specify a list of prompt files in `--prompt_file` (e.g. `-f prompts/prompt-1.md prompts/prompt-2.md prompts/prompt-3.md`), as well as a corresponding list of output files in `--output_file` (e.g. `-o results/results-1.csv results/results-2.csv results/results-3.csv`). The number of prompt files and output files must be the same. Here's a sample command:
```bash
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results_1.csv results/results_2.csv \
-g vllm \
-f prompts/prompt_1.md prompts/prompt_2.md \
-m defog/sqlcoder2
```
While you can do the same for the other runners, the time savings are most significant when loading a large model locally, as opposed to calling an always-on API.
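For instance, here is a hedged sketch of the same multi-prompt pattern with the `hf` runner (the prompt and output file names are placeholders):
```bash
python -W ignore main.py \
-q data/questions_gen.csv \
-o results/results_1.csv results/results_2.csv \
-g hf \
-f prompts/prompt_1.md prompts/prompt_2.md \
-m defog/sqlcoder2 \
-n 10
```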


### CLI Flags
@@ -143,9 +174,9 @@ You can use the following flags in the command line to change the configurations
| -n, --num_questions | Use this to limit the total number of questions you want to test. |
| -g, --model_type | Model type used. Make sure this matches the model used. Currently defined options in `main.py` include `oa` for OpenAI models, `anthropic` for Anthropic models, `hf` for Hugging Face models, and `vllm` for the vLLM runner. |
| -m, --model | Model that will be tested and used to generate the queries. Currently defined options for OpenAI models are chat models `gpt-3.5-turbo-0613` and `gpt-4-0613`, and non-chat model `text-davinci-003`. For Hugging Face models, simply use the path of your chosen model (e.g. `defog/sqlcoder`). |
| -f, --prompt_file | Markdown file with the prompt used for query generation. |
| -f, --prompt_file | Markdown file with the prompt used for query generation. You can pass in a list of prompt files to test them sequentially in a single run, without reloading the model each time. |
| -d, --use_private_data | Use this to read from your own private data library. |
| -o, --output_file | Output CSV file that will store your results. |
| -o, --output_file | Output CSV file that will store your results. You must pass the same number of output file paths as prompt files. |
| -p, --parallel_threads | The default no. of parallel threads is 5. Decrease this to 1 for gpt-4 to avoid the rate limit error. Parallelization is currently supported for the OpenAI and Anthropic runners. |
| -t, --timeout_gen | No. of seconds before timeout occurs for query generation. The default is 30.0s. |
| -u, --timeout_exec | No. of seconds before timeout occurs for query execution on the database. The default is 10.0s. |
230 changes: 120 additions & 110 deletions eval/anthropic_runner.py
@@ -1,5 +1,6 @@
from concurrent.futures import ThreadPoolExecutor, as_completed
import copy
import os
from eval.eval import compare_query_results
import pandas as pd
from psycopg2.extensions import QueryCanceledError
@@ -9,125 +10,134 @@


def run_anthropic_eval(args):
print("preparing questions...")
# get questions
question_query_df = prepare_questions_df(args.questions_file, args.num_questions)
qg_class = AnthropicQueryGenerator
# add columns for generated query and metrics
question_query_df["generated_query"] = ""
question_query_df["reason"] = ""
question_query_df["error_msg"] = ""
question_query_df["exact_match"] = 0
question_query_df["correct"] = 0
question_query_df["error_query_gen"] = 0
question_query_df["error_db_exec"] = 0
question_query_df["timeout"] = 0
# add custom metrics below:
question_query_df["latency_seconds"] = 0.0 # latency of query generation in seconds
question_query_df["tokens_used"] = 0 # number of tokens used in query generation
for prompt_file, output_file in zip(args.prompt_file, args.output_file):
print("preparing questions...")
# get questions
question_query_df = prepare_questions_df(
args.questions_file, args.num_questions
)
qg_class = AnthropicQueryGenerator
# add columns for generated query and metrics
question_query_df["generated_query"] = ""
question_query_df["reason"] = ""
question_query_df["error_msg"] = ""
question_query_df["exact_match"] = 0
question_query_df["correct"] = 0
question_query_df["error_query_gen"] = 0
question_query_df["error_db_exec"] = 0
question_query_df["timeout"] = 0
# add custom metrics below:
# latency of query generation in seconds
question_query_df["latency_seconds"] = 0.0
# number of tokens used in query generation
question_query_df["tokens_used"] = 0

question_query_df.reset_index(inplace=True, drop=True)
question_query_df.reset_index(inplace=True, drop=True)

input_rows = question_query_df.to_dict("records")
output_rows = []
with ThreadPoolExecutor(args.parallel_threads) as executor:
# for each query in the csv, generate a query using the generator asynchronously
futures = []
for row in input_rows:
# get db creds for each row's db_name
db_name = row["db_name"]
db_creds = {
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "postgres",
"database": db_name,
}

qg = qg_class(
db_creds=copy.deepcopy(db_creds),
model=args.model,
prompt_file=args.prompt_file,
timeout=args.timeout_gen,
verbose=args.verbose,
)

generated_query_fut = executor.submit(
qg.generate_query, question=row["question"]
)
futures.append(generated_query_fut)

total_tried = 0
total_correct = 0
for f in (pbar := tqdm(as_completed(futures), total=len(futures))):
total_tried += 1
i = futures.index(f)
row = input_rows[i]
result_dict = f.result()
query_gen = result_dict["query"]
reason = result_dict["reason"]
err = result_dict["err"]
# save custom metrics
if "latency_seconds" in result_dict:
row["latency_seconds"] = result_dict["latency_seconds"]
if "tokens_used" in result_dict:
row["tokens_used"] = result_dict["tokens_used"]
row["generated_query"] = query_gen
row["reason"] = reason
row["error_msg"] = err
# save failures into relevant columns in the dataframe
if "GENERATION ERROR" in err:
row["error_query_gen"] = 1
elif "EXECUTION ERROR" in err:
row["error_db_exec"] = 1
elif "TIMEOUT" in err:
row["timeout"] = 1
else:
expected_query = row["query"]
input_rows = question_query_df.to_dict("records")
output_rows = []
with ThreadPoolExecutor(args.parallel_threads) as executor:
# for each query in the csv, generate a query using the generator asynchronously
futures = []
for row in input_rows:
# get db creds for each row's db_name
db_name = row["db_name"]
question = row["question"]
query_category = row["query_category"]
exact_match = correct = 0
db_creds = {
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "postgres",
"database": db_name,
}
# try executing the queries and compare the results if they succeed
try:
exact_match, correct = compare_query_results(
query_gold=expected_query,
query_gen=query_gen,
db_name=db_name,
db_creds=db_creds,
timeout=args.timeout_exec,
question=question,
query_category=query_category,
)
row["exact_match"] = int(exact_match)
row["correct"] = int(correct)
row["error_msg"] = ""
if correct:
total_correct += 1
except QueryCanceledError as e:
row["timeout"] = 1
row["error_msg"] = f"QUERY EXECUTION TIMEOUT: {e}"
except Exception as e:

qg = qg_class(
db_creds=copy.deepcopy(db_creds),
model=args.model,
prompt_file=prompt_file,
timeout=args.timeout_gen,
verbose=args.verbose,
)

generated_query_fut = executor.submit(
qg.generate_query, question=row["question"]
)
futures.append(generated_query_fut)

total_tried = 0
total_correct = 0
for f in (pbar := tqdm(as_completed(futures), total=len(futures))):
total_tried += 1
i = futures.index(f)
row = input_rows[i]
result_dict = f.result()
query_gen = result_dict["query"]
reason = result_dict["reason"]
err = result_dict["err"]
# save custom metrics
if "latency_seconds" in result_dict:
row["latency_seconds"] = result_dict["latency_seconds"]
if "tokens_used" in result_dict:
row["tokens_used"] = result_dict["tokens_used"]
row["generated_query"] = query_gen
row["reason"] = reason
row["error_msg"] = err
# save failures into relevant columns in the dataframe
if "GENERATION ERROR" in err:
row["error_query_gen"] = 1
elif "EXECUTION ERROR" in err:
row["error_db_exec"] = 1
row["error_msg"] = f"QUERY EXECUTION ERROR: {e}"
output_rows.append(row)
pbar.set_description(
f"Correct so far: {total_correct}/{total_tried} ({100*total_correct/total_tried:.2f}%)"
)
output_df = pd.DataFrame(output_rows)
output_df = output_df.sort_values(by=["db_name", "query_category", "question"])
output_df.to_csv(args.output_file, index=False, float_format="%.2f")
elif "TIMEOUT" in err:
row["timeout"] = 1
else:
expected_query = row["query"]
db_name = row["db_name"]
question = row["question"]
query_category = row["query_category"]
exact_match = correct = 0
db_creds = {
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "postgres",
"database": db_name,
}
# try executing the queries and compare the results if they succeed
try:
exact_match, correct = compare_query_results(
query_gold=expected_query,
query_gen=query_gen,
db_name=db_name,
db_creds=db_creds,
timeout=args.timeout_exec,
question=question,
query_category=query_category,
)
row["exact_match"] = int(exact_match)
row["correct"] = int(correct)
row["error_msg"] = ""
if correct:
total_correct += 1
except QueryCanceledError as e:
row["timeout"] = 1
row["error_msg"] = f"QUERY EXECUTION TIMEOUT: {e}"
except Exception as e:
row["error_db_exec"] = 1
row["error_msg"] = f"QUERY EXECUTION ERROR: {e}"
output_rows.append(row)
pbar.set_description(
f"Correct so far: {total_correct}/{total_tried} ({100*total_correct/total_tried:.2f}%)"
)
output_df = pd.DataFrame(output_rows)
output_df = output_df.sort_values(by=["db_name", "query_category", "question"])
# get directory of output_file and create if not exist
output_dir = os.path.dirname(output_file)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
output_df.to_csv(output_file, index=False, float_format="%.2f")

# get average rate of exact matches
avg_acc = output_df["exact_match"].sum() / len(output_df)
print(f"Average rate of exact match: {avg_acc:.2f}")
# get average rate of correct results
avg_subset = output_df["correct"].sum() / len(output_df)
print(f"Average correct rate: {avg_subset:.2f}")
# get average rate of exact matches
avg_acc = output_df["exact_match"].sum() / len(output_df)
print(f"Average rate of exact match: {avg_acc:.2f}")
# get average rate of correct results
avg_subset = output_df["correct"].sum() / len(output_df)
print(f"Average correct rate: {avg_subset:.2f}")