Added GitHub Action to evaluate on private dataset
hetulvp committed Apr 20, 2024
1 parent 002f879 commit 22a74d7
Showing 12 changed files with 274 additions and 143 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/github_pages.yaml
@@ -1,4 +1,4 @@
-name: ci
+name: Deploy to github pages
 on:
   push:
     branches:
61 changes: 61 additions & 0 deletions .github/workflows/update_leaderboard.yaml
@@ -0,0 +1,61 @@
name: Update leaderboard.

on:
  pull_request:
    types: [opened, reopened, synchronize]

jobs:
  private_evaluation:
    runs-on: ubuntu-latest
    steps:
      - name: Check if there are any changes in submissions dir
        uses: dorny/[email protected]
        id: changes
        with:
          filters: |
            src:
              - 'session_2/challenge/submissions/**'
          list-files: "shell"

      # Exit early if no changes in the submissions directory
      - name: Print changed files
        run: |
          echo '${{ toJSON(steps.changes.outputs) }}'
      # Install evaluation dependencies and run the evals.
      - name: Checkout code
        if: ${{ (steps.changes.outputs.src == 'true') }}
        uses: actions/checkout@v4

      - name: Install Python
        if: ${{ (steps.changes.outputs.src == 'true') }}
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        if: ${{ (steps.changes.outputs.src == 'true') }}
        run: |
          python -m pip install --upgrade pip
          pip install -r session_2/challenge/requirements.txt
      - name: Run leaderboard update script
        id: leaderboard-update
        if: ${{ (steps.changes.outputs.src == 'true') }}
        run: |
          cd session_2/challenge
          python -m scripts.leaderboard --github_user="${{ github.actor }}" --prompt="${{ steps.changes.outputs.src_files }}"
      # # Commit the updated leaderboard
      # - name: Commit Updated Leaderboard
      #   id: commit-leaderboard
      #   run: |
      #     git config --global user.name "GitHub Actions"
      #     git config --global user.email "[email protected]"
      #     git add leaderboard.md
      #     git commit -m "Update leaderboard"
      #     git push origin HEAD:${{ github.ref }}

      # # Print the commit SHA for reference
      # - name: Print Commit SHA
      #   run: echo "Commit SHA: ${{ steps.commit-leaderboard.outputs.commit_sha }}"
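The `scripts.leaderboard` module invoked by the last active step is not shown in this excerpt. A minimal sketch of what such an entry point could look like, assuming it reuses `evaluate_lib.evaluate` and receives exactly the flags the workflow passes (`--github_user`, `--prompt`), is given below; the private-dataset directory and the prompt-name derivation are illustrative assumptions only.

```python
"""Hypothetical sketch of scripts/leaderboard.py (not part of the diff shown here)."""

from collections.abc import Sequence

from absl import app, flags
from scripts import dataset, evaluate_lib

_GITHUB_USER = flags.DEFINE_string("github_user", None, "PR author to credit.")
_PROMPT = flags.DEFINE_string("prompt", None, "Path of the changed submission file.")


def main(argv: Sequence[str]) -> None:
    if len(argv) > 1:
        raise app.UsageError("Too many command-line arguments.")
    # Assumption: the registered prompt name is derived from the changed file
    # path, e.g. "submissions/my_prompt.py" -> "my_prompt".
    prompt_name = _PROMPT.value.rsplit("/", 1)[-1].removesuffix(".py")
    # Assumption: the private dataset lives in a directory with the same
    # *_yes.txt / *_no.txt layout as sample_inputs; its real location is not
    # shown in this diff.
    private_inputs = dataset.load_sample_test_set(samples_dir="private_inputs")
    acc = evaluate_lib.evaluate(dataset=private_inputs, prompt_name=prompt_name)
    print("%s: %s scored %.3f%%" % (_GITHUB_USER.value, prompt_name, acc))
    # Rewriting leaderboard.md between the begin/end markers is omitted here.


if __name__ == "__main__":
    app.run(main)
```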
16 changes: 4 additions & 12 deletions session_2/challenge/how_to_participate.md
@@ -20,19 +20,18 @@
    ```
 
 3. To submit your own prompt, make a copy of `submissions/baseline.py` and
-   change the name of the prompt from `baseline` to something else which
+   change the name of the file from `baseline` to something else which
    describes your prompt. E.g,
 
    ```python
    # file: submissions/name_of_your_prompt.py
-   @registry.register("name_of_your_prompt")
+   @registry.register()
    class NameOfYourPrompt(base.PromptSubmission):
      ...
    ```
 
-   Also change the class name and register it with a new name (can be same as the
-   filename.)
+   Also change the class name.
 
 4. Update the `build_prompt` and `parse_response` method.
@@ -62,11 +61,4 @@
    your prompt.
 
 8. Congratulations 🎉, once a repo maintainer approves your submission and merges
-   your PR, your rank based on a private test set will be published on the
-   public leader board.
-
-!!! note
-    You can test your prompt on your own samples by adding new files under
-    `sample_inputs` dir. The file name must ends with `"yes.txt"` if the JD is
-    for a fresher, otherwise it should end with `"no.txt"`. Do not commit
-    these files.
+   your PR, your rank will be published on the public leader board.
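Steps 3 and 4 above describe the submission interface only in outline. A minimal end-to-end sketch of a renamed submission file follows; the import path for `base`/`registry` and the exact prompt wording are assumptions, but the `build_prompt`/`parse_response` signatures match the ones used by the evaluation scripts in this commit.

```python
# file: submissions/name_of_your_prompt.py -- illustrative sketch, not the baseline.
# Assumption: base and registry live in the scripts package, as in baseline.py.
from scripts import base, registry


@registry.register()
class NameOfYourPrompt(base.PromptSubmission):
    """Asks the model for a single yes/no verdict on fresher suitability."""

    def build_prompt(self, job_description: str) -> str:
        # Keep the instruction explicit so parse_response stays trivial.
        return (
            "You are screening job descriptions.\n"
            f"Job description:\n{job_description}\n\n"
            "Can a fresher (no prior experience) apply for this job? "
            "Answer with a single word: YES or NO."
        )

    def parse_response(self, model_response: str) -> bool:
        # Map the model's free-form answer back to the boolean target.
        return "yes" in model_response.strip().lower()
```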
16 changes: 9 additions & 7 deletions session_2/challenge/leaderboard.md
@@ -12,13 +12,15 @@ Check [participation guide](how_to_participate.md).
<center>

<!-- leader-board-begins -->
-| Rank | Profile Image | GitHub Username | Solution | Accuracy % |
-|-----:|:--------------|:----------------|:---------|-----------:|
-| 1 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [New User](https://github.com/new_user) | [New Solution](https://github.com/new_solution) | 99.5 |
-| 2 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 2](https://github.com/username2) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
-| 3 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 4](https://github.com/username4) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
-| 4 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 3](https://github.com/username3) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 10 |
-| 5 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 1](https://github.com/username1) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 0 |
+| Rank | Profile Image | GitHub Username | Solution | Accuracy % |
+|-----:|:--------------|:----------------|:---------|-----------:|
+| 1 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [New User](https://github.com/new_user) | [New Solution](https://github.com/new_solution) | 99.5 |
+| 2 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 2](https://github.com/username2) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
+| 3 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 4](https://github.com/username4) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
+| 4 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [hetul-patel](https://github.com/hetul-patel) | [baseline](https://github.com/infocusp/llm_seminar_series/blob/main/session_2/challenge/submissions/baseline.py) | 50 |
+| 5 | <img src="https://github.com/hetulvp.png" width="50px" height="50px" class="profile-image"> | [hetulvp](https://github.com/hetulvp) | [baseline](https://github.com/infocusp/llm_seminar_series/blob/main/session_2/challenge/submissions/baseline.py) | 50 |
+| 6 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 3](https://github.com/username3) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 10 |
+| 7 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 1](https://github.com/username1) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 0 |
<!-- leader-board-ends -->

</center>
24 changes: 24 additions & 0 deletions session_2/challenge/scripts/dataset.py
@@ -0,0 +1,24 @@
"""Utilities to load evaluation datasets."""

import glob
import os


def load_sample_test_set(samples_dir: str) -> list[tuple[str, bool]]:
    """Loads sample job descriptions and answers for local testing."""
    sample_files = glob.glob(os.path.join(samples_dir, "*.txt"))
    sample_inputs = []
    for filepath in sample_files:
        content = open(filepath, "r").read()
        filename = os.path.basename(filepath).lower()
        if filename.endswith("_yes.txt"):
            target = True
        elif filename.endswith("_no.txt"):
            target = False
        else:
            raise ValueError(
                "File %s must end with yes.txt or no.txt" % filepath
            )
        sample_inputs.append((content, target))
    return sample_inputs
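For context, the loader above is driven purely by the file-name convention (`*_yes.txt` for fresher-suitable job descriptions, `*_no.txt` otherwise). A small usage sketch, with made-up file names under the existing `sample_inputs` directory, could look like this:

```python
# Illustrative usage of load_sample_test_set; the file names are hypothetical.
from scripts import dataset

# sample_inputs/
# ├── intern_role_yes.txt    -> expected target True
# └── staff_engineer_no.txt  -> expected target False
samples = dataset.load_sample_test_set(samples_dir="sample_inputs")
for job_description, is_fresher_friendly in samples:
    print(is_fresher_friendly, job_description[:60])
```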
61 changes: 9 additions & 52 deletions session_2/challenge/scripts/evaluate.py
@@ -21,15 +21,11 @@ def build_prompt(self, job_description: str) -> str:
     python3 -m scripts.evaluate --prompt=baseline
 """
 
-import glob
 import logging
-import os
 from collections.abc import Sequence
 
-import tqdm
 from absl import app, flags
-from scripts import model, registry
-from submissions import baseline  # noqa: F401
+from scripts import dataset, evaluate_lib
 
 _PROMPT = flags.DEFINE_string(
     "prompt", None, "Name of the prompt to evaluate."
@@ -39,52 +35,12 @@
     "debug", True, "Prints prompt and response if true."
 )
 
-_SAMPLES_DIR = "sample_inputs"
-
-
-def load_sample_test_set() -> list[tuple[str, bool]]:
-    """Loads sample job descriptions and answers for local testing."""
-    sample_files = glob.glob(os.path.join(_SAMPLES_DIR, "*.txt"))
-    sample_inputs = []
-    for filepath in sample_files:
-        content = open(filepath, "r").read()
-        filename = os.path.basename(filepath).lower()
-        if filename.endswith("_yes.txt"):
-            target = True
-        elif filename.endswith("_no.txt"):
-            target = False
-        else:
-            raise ValueError(
-                "File %s must end with yes.txt or no.txt" % filepath
-            )
-        target = True if "yes" in filename.lower() else False
-        sample_inputs.append((content, target))
-    return sample_inputs
-
-
-def evaluate(prompt_name: str):
-    """Evaluates the prompt submission."""
-    # Loads a free gpt4 model.
-    llm = model.G4fModel()
-
-    # Loads a prompt submission.
-    prompt_handler = registry.get(name=prompt_name)
-
-    # Generate results for the dataset.
-    dataset = load_sample_test_set()
-    correct_pred = 0
-    for idx, (job_description, target) in enumerate(tqdm.tqdm(dataset)):
-        prompt = prompt_handler.build_prompt(job_description=job_description)
-        logging.debug("[prompt %d]\n%s", idx, prompt)
-        response = llm.generate(prompt=prompt)
-        logging.debug("[response %d]\n%s", idx, response)
-        output = prompt_handler.parse_response(model_response=response)
-        logging.debug("[target %d]\n%s", idx, target)
-        logging.debug("[prediction %d]\n%s", idx, output)
-        if output == target:
-            correct_pred += 1
-
-    print("Accuracy: [%.3f] %%" % (correct_pred / len(dataset) * 100))  # noqa: T201
-
+def evaluate_on_sample_dataset(prompt_name: str):
+    """Evaluates the prompt on a sample_dataset."""
+    sample_inputs = dataset.load_sample_test_set(samples_dir="sample_inputs")
+    acc = evaluate_lib.evaluate(dataset=sample_inputs, prompt_name=prompt_name)
+    print("Accuracy: [%.3f] %%" % acc)  # noqa: T201
 
 
 def main(argv: Sequence[str]) -> None:
@@ -95,8 +51,9 @@ def main(argv: Sequence[str]) -> None:
         logging.getLogger().setLevel(logging.DEBUG)
     else:
         logging.getLogger().setLevel(logging.INFO)
-    evaluate(prompt_name=_PROMPT.value)
+    evaluate_on_sample_dataset(prompt_name=_PROMPT.value)
 
 
 if __name__ == "__main__":
     flags.mark_flag_as_required("prompt")
     app.run(main)
36 changes: 36 additions & 0 deletions session_2/challenge/scripts/evaluate_lib.py
@@ -0,0 +1,36 @@
"""Library function for evaluating a prompt on a particular dataset."""

import logging

import tqdm
from scripts import model, registry
from submissions import *  # noqa: F401, F403
from submissions import baseline  # noqa: F401


def evaluate(dataset: list[tuple[str, bool]], prompt_name: str):
    """Evaluates the prompt submission."""
    # Loads a free gpt4 model.
    llm = model.G4fModel()

    # Loads a prompt submission.
    prompt_handler = registry.get(name=prompt_name)

    # Generate results for the dataset.
    correct_pred = 0
    for idx, (job_description, target) in enumerate(tqdm.tqdm(dataset)):
        prompt = prompt_handler.build_prompt(job_description=job_description)
        response = llm.generate(prompt=prompt)
        prediction = prompt_handler.parse_response(model_response=response)
        if prediction == target:
            correct_pred += 1
            result = "[PASS]"
        else:
            result = "[FAIL]"

        logging.debug(
            "No=%d. target=%s prediction=%s %s\n[prompt]\n%s\n[response]\n%s"
            % (idx, target, prediction, result, prompt, response)
        )
    acc = correct_pred / len(dataset) * 100
    return acc
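As a quick way to exercise `evaluate_lib.evaluate` without any sample files, an in-memory dataset can be passed directly. A minimal sketch is below; the two job-description strings are invented, `"baseline"` refers to the submission registered in `submissions/baseline.py`, and a live g4f model call is still performed.

```python
# Illustrative local check of evaluate_lib.evaluate; run from session_2/challenge.
from scripts import evaluate_lib

# (job_description, is_fresher_friendly) pairs; the contents are made up.
tiny_dataset = [
    ("Hiring interns. No prior experience required, we will train you.", True),
    ("Staff engineer role requiring 10+ years of distributed-systems work.", False),
]

accuracy = evaluate_lib.evaluate(dataset=tiny_dataset, prompt_name="baseline")
print("Accuracy: [%.3f] %%" % accuracy)
```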