Added github action to evaluate on private dataset
hetulvp committed Apr 20, 2024
1 parent 002f879 commit 2461e6a
Showing 12 changed files with 295 additions and 144 deletions.
@@ -20,6 +20,6 @@ jobs:
id: check-star

- if: ${{ (steps.changes.outputs.src == 'true') && (steps.check-star.outputs.is-stargazer != 'true') }}
uses: actions/github-script@v6
uses: actions/github-script@v7
with:
script: core.setFailed('⭐ Please, star this repository!')
2 changes: 1 addition & 1 deletion .github/workflows/github_pages.yaml
@@ -1,4 +1,4 @@
name: ci
name: Deploy to github pages
on:
push:
branches:
82 changes: 82 additions & 0 deletions .github/workflows/update_leaderboard.yaml
@@ -0,0 +1,82 @@
name: Update leaderboard.

on:
pull_request:
types: [opened, reopened, synchronize]

jobs:
leaderboard_evaluation:
runs-on: ubuntu-latest
steps:
- name: Check if there are any changes in submissions dir
uses: dorny/[email protected]
id: changes
with:
filters: |
src:
- 'session_2/challenge/submissions/**'
list-files: "shell"

- name: Print changed files
run: |
echo '${{ toJSON(steps.changes.outputs) }}'
- if: ${{ (steps.changes.outputs.src_count > 1) }}
uses: actions/github-script@v7
with:
script: core.setFailed('Only one submission is allowed at a time.')

# Update leaderboard only if single file is changed in submission dir
- if: ${{ (steps.changes.outputs.src == 'true') && (steps.changes.outputs.src_count == 1) }}
name: Checkout code
uses: actions/checkout@v4
with:
repository: ${{ github.event.pull_request.head.repo.full_name }}
ref: ${{ github.event.pull_request.head.ref }}

- if: ${{ (steps.changes.outputs.src == 'true') && (steps.changes.outputs.src_count == 1) }}
name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.10"

- if: ${{ (steps.changes.outputs.src == 'true') && (steps.changes.outputs.src_count == 1) }}
name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r session_2/challenge/requirements.txt
- if: ${{ (steps.changes.outputs.src == 'true') && (steps.changes.outputs.src_count == 1) }}
name: Run leaderboard update script
id: leaderboard-update
run: |
cd session_2/challenge
filename=$(basename "${{ steps.changes.outputs.src_files }}")
filename_without_extension="${filename%.*}" # Remove extension
python -m scripts.leaderboard --github_user="${{ github.actor }}" --prompt="$filename_without_extension"
- name: Commit changes
uses: EndBug/add-and-commit@v9
with:
author_name: GitHub Actions
author_email: [email protected]
message: 'Updated leader board'
add: 'session_2/challenge/leaderboard.md'

# # Commit the updated leaderboard
# - if: ${{ (steps.changes.outputs.src == 'true') && (steps.changes.outputs.src_count == 1) }}
# name: Commit updated leaderboard
# id: commit-leaderboard
# run: |
# git config --global user.name "GitHub Actions"
# git config --global user.email "[email protected]"
# git add session_2/challenge/leaderboard.md
# git commit -m "Update leaderboard"
# git push -f origin HEAD:${{ github.ref }}


# # Print the commit SHA for reference
# - if: ${{ (steps.changes.outputs.src == 'true') && (steps.changes.outputs.src_count == 1) }}
# name: Print Commit SHA
# run: |
# echo "Commit SHA: ${{ steps.commit-leaderboard.outputs.commit_sha }}"
16 changes: 4 additions & 12 deletions session_2/challenge/how_to_participate.md
@@ -20,19 +20,18 @@
```

3. To submit your own prompt, make a copy of `submissions/baseline.py` and
change the name of the prompt from `baseline` to something else which
change the name of the file from `baseline` to something else which
describes your prompt. E.g.,

```python
# file: submissions/name_of_your_prompt.py
@registry.register("name_of_your_prompt")
@registry.register()
class NameOfYourPrompt(base.PromptSubmission):
...
```

Also change the class name and register it with a new name (can be same as the
filename.)
Also change the class name.

4. Update the `build_prompt` and `parse_response` methods, for example as sketched below.
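
   A minimal sketch of what the finished file might look like (the import line and the prompt/parsing logic below are illustrative assumptions, not the repository's exact code; mirror whatever `submissions/baseline.py` actually does):

   ```python
   # file: submissions/name_of_your_prompt.py -- hypothetical example
   from scripts import base, registry  # assumed import path; copy the real one from baseline.py


   @registry.register()
   class NameOfYourPrompt(base.PromptSubmission):
       """Illustrative submission; replace the prompt and parsing with your own."""

       def build_prompt(self, job_description: str) -> str:
           # Wrap the job description in your own instructions to the model.
           return (
               "Read the job description below and answer YES if it is suitable "
               "for a fresher, otherwise answer NO.\n\n" + job_description
           )

       def parse_response(self, model_response: str) -> bool:
           # Map the raw model output to the boolean the evaluator expects.
           return "yes" in model_response.lower()
   ```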

@@ -62,11 +61,4 @@
your prompt.

8. Congratulations 🎉, once a repo maintainer approves your submission and merges
your PR, your rank based on a private test set will be published on the
public leader board.

!!! note
You can test your prompt on your own samples by adding new files under
`sample_inputs` dir. The file name must end with `"yes.txt"` if the JD is
for a fresher, otherwise it should end with `"no.txt"`. Do not commit
these files.
your PR, your rank will be published on the public leader board.
15 changes: 8 additions & 7 deletions session_2/challenge/leaderboard.md
@@ -12,13 +12,14 @@ Check [participation guide](how_to_participate.md).
<center>

<!-- leader-board-begins -->
| Rank | Profile Image | GitHub Username | Solution | Accuracy % |
|-------:|:------------------------------------------------------------------------------------------------|:-------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------|-------------:|
| 1 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [New User](https://github.com/new_user) | [New Solution](https://github.com/new_solution) | 99.5 |
| 2 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 2](https://github.com/username2) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
| 3 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 4](https://github.com/username4) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
| 4 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 3](https://github.com/username3) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 10 |
| 5 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 1](https://github.com/username1) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 0 |
| Rank | Profile Image | GitHub Username | Solution | Accuracy % |
|-------:|:------------------------------------------------------------------------------------------------|:----------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------|-------------:|
| 1 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [New User](https://github.com/new_user) | [New Solution](https://github.com/new_solution) | 99.5 |
| 2 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 2](https://github.com/username2) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
| 3 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 4](https://github.com/username4) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 95 |
| 4 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [hetul-patel](https://github.com/hetul-patel) | [baseline](https://github.com/infocusp/llm_seminar_series/blob/main/session_2/challenge/submissions/baseline.py) | 50 |
| 6 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 3](https://github.com/username3) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 10 |
| 7 | <img src="https://github.com/hetul-patel.png" width="50px" height="50px" class="profile-image"> | [Username 1](https://github.com/username1) | [Baseline](https://github.com/infocusp/llm_seminar_series/blob/hetul/prompting-leader-board/session_2/challenge/submissions/baseline.py) | 0 |
<!-- leader-board-ends -->

</center>
24 changes: 24 additions & 0 deletions session_2/challenge/scripts/dataset.py
@@ -0,0 +1,24 @@
"""Utilities to load evaluation datasets."""

import glob
import os


def load_sample_test_set(samples_dir: str) -> list[tuple[str, bool]]:
"""Loads sample job descriptions and answers for local testing."""
sample_files = glob.glob(os.path.join(samples_dir, "*.txt"))
sample_inputs = []
for filepath in sample_files:
content = open(filepath, "r").read()
filename = os.path.basename(filepath).lower()
if filename.endswith("_yes.txt"):
target = True
elif filename.endswith("_no.txt"):
target = False
else:
raise ValueError(
    "File %s must end with _yes.txt or _no.txt" % filepath
)
sample_inputs.append((content, target))
return sample_inputs
61 changes: 9 additions & 52 deletions session_2/challenge/scripts/evaluate.py
@@ -21,15 +21,11 @@ def build_prompt(self, job_description: str) -> str:
python3 -m scripts.evaluate --prompt=baseline
"""

import glob
import logging
import os
from collections.abc import Sequence

import tqdm
from absl import app, flags
from scripts import model, registry
from submissions import baseline # noqa: F401
from scripts import dataset, evaluate_lib

_PROMPT = flags.DEFINE_string(
"prompt", None, "Name of the prompt to evaluate."
@@ -39,52 +35,12 @@
"debug", True, "Prints prompt and response if true."
)

_SAMPLES_DIR = "sample_inputs"


def load_sample_test_set() -> list[tuple[str, bool]]:
"""Loads sample job descriptions and answers for local testing."""
sample_files = glob.glob(os.path.join(_SAMPLES_DIR, "*.txt"))
sample_inputs = []
for filepath in sample_files:
content = open(filepath, "r").read()
filename = os.path.basename(filepath).lower()
if filename.endswith("_yes.txt"):
target = True
elif filename.endswith("_no.txt"):
target = False
else:
raise ValueError(
"File %s must end with yes.txt or no.txt" % filepath
)
target = True if "yes" in filename.lower() else False
sample_inputs.append((content, target))
return sample_inputs


def evaluate(prompt_name: str):
"""Evaluates the prompt submission."""
# Loads a free gpt4 model.
llm = model.G4fModel()

# Loads a prompt submission.
prompt_handler = registry.get(name=prompt_name)

# Generate results for the dataset.
dataset = load_sample_test_set()
correct_pred = 0
for idx, (job_description, target) in enumerate(tqdm.tqdm(dataset)):
prompt = prompt_handler.build_prompt(job_description=job_description)
logging.debug("[prompt %d]\n%s", idx, prompt)
response = llm.generate(prompt=prompt)
logging.debug("[response %d]\n%s", idx, response)
output = prompt_handler.parse_response(model_response=response)
logging.debug("[target %d]\n%s", idx, target)
logging.debug("[prediction %d]\n%s", idx, output)
if output == target:
correct_pred += 1

print("Accuracy: [%.3f] %%" % (correct_pred / len(dataset) * 100)) # noqa: T201

def evaluate_on_sample_dataset(prompt_name: str):
"""Evaluates the prompt on a sample_dataset."""
sample_inputs = dataset.load_sample_test_set(samples_dir="sample_inputs")
acc = evaluate_lib.evaluate(dataset=sample_inputs, prompt_name=prompt_name)
print("Accuracy: [%.3f] %%" % acc) # noqa: T201


def main(argv: Sequence[str]) -> None:
Expand All @@ -95,8 +51,9 @@ def main(argv: Sequence[str]) -> None:
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.getLogger().setLevel(logging.INFO)
evaluate(prompt_name=_PROMPT.value)
evaluate_on_sample_dataset(prompt_name=_PROMPT.value)


if __name__ == "__main__":
flags.mark_flag_as_required("prompt")
app.run(main)
36 changes: 36 additions & 0 deletions session_2/challenge/scripts/evaluate_lib.py
@@ -0,0 +1,36 @@
"""Library function for evaluating a prompt on a particular dataset."""

import logging

import tqdm
from scripts import model, registry
from submissions import * # noqa: F401, F403
from submissions import baseline # noqa: F401


def evaluate(dataset: list[tuple[str, bool]], prompt_name: str) -> float:
    """Evaluates a prompt submission and returns its accuracy as a percentage."""
# Loads a free gpt4 model.
llm = model.G4fModel()

# Loads a prompt submission.
prompt_handler = registry.get(name=prompt_name)

# Generate results for the dataset.
correct_pred = 0
for idx, (job_description, target) in enumerate(tqdm.tqdm(dataset)):
prompt = prompt_handler.build_prompt(job_description=job_description)
response = llm.generate(prompt=prompt)
prediction = prompt_handler.parse_response(model_response=response)
if prediction == target:
correct_pred += 1
result = "[PASS]"
else:
result = "[FAIL]"

logging.debug(
"No=%d. target=%s prediction=%s %s\n[prompt]\n%s\n[response]\n%s"
% (idx, target, prediction, result, prompt, response)
)
acc = correct_pred / len(dataset) * 100
return acc
