Rename package to llmebench (#174)
This commit renames the top-level package to `llmebench` to highlight the
multilingual nature of the framework. All assets have been modified to use
the new package name as well.
fdalvi authored Aug 23, 2023
1 parent 92ff6c4 commit cfa4504
Showing 310 changed files with 721 additions and 721 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/code-formatting.yml
@@ -33,7 +33,7 @@ jobs:
pip install '.[dev]'
- name: Run ufmt check on framework code
run: |
-ufmt check arabic_llm_benchmark
+ufmt check llmebench
- name: Run ufmt check on test code
run: |
ufmt check tests
26 changes: 13 additions & 13 deletions README.md
@@ -5,14 +5,14 @@

Clone this repository:
```bash
-git clone https://github.com/qcri/Arabic_LLM_Benchmark.git
-cd Arabic_LLM_Benchmark
+git clone https://github.com/qcri/LLMeBench.git
+cd LLMeBench
```

Create a virtual environment:
```bash
-python -m venv .envs/arabic_llm_benchmark
-source .envs/arabic_llm_benchmark/bin/activate
+python -m venv .envs/llmebench
+source .envs/llmebench/bin/activate
```

Install the dependencies and benchmarking package:
@@ -21,7 +21,7 @@ pip install -e '.[dev,fewshot]'
```

## Get the benchmark data
-Download the benchmark from [here](https://neurox.qcri.org/projects/arabic_llm_benchmark/arabic_llm_benchmark_data.zip), and unzip it into the `Arabic_LLM_Benchmark` folder. After this process, there should be a `data` directory inside the top-level folder of the repository, with roughly the following contents:
+Download the benchmark from [here](https://neurox.qcri.org/projects/llmebench/arabic_llm_benchmark_data.zip), and unzip it into the `Arabic_LLM_Benchmark` folder. After this process, there should be a `data` directory inside the top-level folder of the repository, with roughly the following contents:

```bash
$ ls data/
@@ -39,7 +39,7 @@ speech
A sample benchmark is available in `assets/benchmark_v1`. To run the benchmark,

```bash
-python -m arabic_llm_benchmark <benchmark-dir> <results-dir>
+python -m llmebench <benchmark-dir> <results-dir>
```

where `<benchmark-dir>` can point to `assets/benchmark_v1` for example. The
@@ -58,7 +58,7 @@ git checkout -b feat/sarcasm_task
```

### Dataset
-Check if the dataset used by your task already has an implementation in `arabic_llm_benchmark/datasets`. If not, implement a new dataset module (e.g. `arabic_llm_benchmark/datasets/SemEval23.py`) that implements a class (e.g. `SemEval23Dataset`) subclassing `DatasetBase`. See an existing dataset module for inspiration. Each new dataset class requires implementing three functions:
+Check if the dataset used by your task already has an implementation in `llmebench/datasets`. If not, implement a new dataset module (e.g. `llmebench/datasets/SemEval23.py`) that implements a class (e.g. `SemEval23Dataset`) subclassing `DatasetBase`. See an existing dataset module for inspiration. Each new dataset class requires implementing three functions:

```python
class NewDataset(DatasetBase):
@@ -78,10 +78,10 @@ class NewDataset(DatasetBase):
# "label": this will be used for evaluation
```

-Once the `Dataset` is implemented, export it in `arabic_llm_benchmark/datasets/__init__.py`.
+Once the `Dataset` is implemented, export it in `llmebench/datasets/__init__.py`.
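
For reference, the collapsed block above shows only the class declaration and one comment, so a minimal sketch of a complete dataset module is given below. The method names (`citation`, `get_data_sample`, `load_data`), the `DatasetBase` import path, and the tab-separated input format are illustrative assumptions, not the framework's confirmed interface:

```python
from llmebench.datasets.dataset_base import DatasetBase  # assumed import path


class NewDataset(DatasetBase):
    def __init__(self, **kwargs):
        super(NewDataset, self).__init__(**kwargs)

    def citation(self):
        # BibTeX entry for the paper/resource the dataset comes from
        return """@article{placeholder2023newdataset, title={...}}"""

    def get_data_sample(self):
        # A single dummy sample in the same format load_data() returns
        return {"input": "Some input text", "label": "positive"}

    def load_data(self, data_path):
        # Read the raw file and return one dict per sample:
        #   "input": fed to the asset's prompt function
        #   "label": this will be used for evaluation
        data = []
        with open(data_path, "r", encoding="utf-8") as reader:
            for line in reader:
                text, label = line.rstrip("\n").split("\t")
                data.append({"input": text, "label": label})
        return data
```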

### Task
-Check if the task you are adding to the benchmark already has an implementation in `arabic_llm_benchmark/tasks`. If not, implement a new task module (e.g. `arabic_llm_benchmark/tasks/Sarcasm.py`) that implements a class (e.g. `SarcasmTask`) subclassing `TaskBase`. See an existing task module for inspiration. Each new task class requires implementing two functions:
+Check if the task you are adding to the benchmark already has an implementation in `llmebench/tasks`. If not, implement a new task module (e.g. `llmebench/tasks/Sarcasm.py`) that implements a class (e.g. `SarcasmTask`) subclassing `TaskBase`. See an existing task module for inspiration. Each new task class requires implementing two functions:

```python
class NewTask(TaskBase):
@@ -97,10 +97,10 @@ class NewTask(TaskBase):
# post_process function
```

-Once the `Task` is implemented, export it in `arabic_llm_benchmark/tasks/__init__.py`.
+Once the `Task` is implemented, export it in `llmebench/tasks/__init__.py`.
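
Along the same lines, a minimal task module might look like the sketch below; it assumes the two required methods are `__init__` and an `evaluate` method that compares gold labels against the post-processed predictions (the `TaskBase` import path and the accuracy metric are likewise illustrative):

```python
from sklearn.metrics import accuracy_score

from llmebench.tasks.task_base import TaskBase  # assumed import path


class NewTask(TaskBase):
    def __init__(self, **kwargs):
        super(NewTask, self).__init__(**kwargs)

    def evaluate(self, true_labels, predicted_labels):
        # predicted_labels are the outputs of the asset's post_process
        # function, one per input sample
        return {"Accuracy": accuracy_score(true_labels, predicted_labels)}
```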

### Model
-Next, check if the model you are trying to run the benchmark for has an implementation in `arabic_llm_benchmark/models`. If not, implement a new model module (e.g. `arabic_llm_benchmark/models/QARiB.py`) that implements a class (e.g. `QARiBModel`) subclassing `ModelBase`. See an existing model module for inspiration. Each new model class requires implementing two functions:
+Next, check if the model you are trying to run the benchmark for has an implementation in `llmebench/models`. If not, implement a new model module (e.g. `llmebench/models/QARiB.py`) that implements a class (e.g. `QARiBModel`) subclassing `ModelBase`. See an existing model module for inspiration. Each new model class requires implementing two functions:

```python
class NewModel(ModelBase):
@@ -115,7 +115,7 @@ class NewModel(ModelBase):
# run the actual model and return model outputs
```

-Once the `Model` is implemented, export it in `arabic_llm_benchmark/models/__init__.py`.
+Once the `Model` is implemented, export it in `llmebench/models/__init__.py`.
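
A model module sketch under the same caveats: the truncated snippet above only indicates that the class must run the actual model and return its outputs, so the `prompt` method name, the `ModelBase` import path, and the HTTP request shape below are all assumptions for illustration:

```python
import requests

from llmebench.models.model_base import ModelBase  # assumed import path


class NewModel(ModelBase):
    def __init__(self, api_url, timeout=60, **kwargs):
        # api_url/timeout are hypothetical arguments for a hosted endpoint
        self.api_url = api_url
        self.timeout = timeout
        super(NewModel, self).__init__(**kwargs)

    def prompt(self, processed_input):
        # Run the actual model on the prepared input and return the raw
        # output; the asset's post_process function maps it to a label
        response = requests.post(
            self.api_url,
            json={"prompt": processed_input},
            timeout=self.timeout,
        )
        return response.json()
```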

### Benchmark Asset
Now that the Dataset, Task and Model are defined, the framework expects a given benchmark asset (e.g. "ArabGender" dataset, "GenderClassification" task, "GPT" model and "ZeroShot" prompting setting) to have a `*.py` file with three functions:
@@ -145,7 +145,7 @@ def post_process(response):
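
The collapsed hunk reveals only `def post_process(response):`; for orientation, a zero-shot asset might look roughly as follows, assuming the three functions are `config`, `prompt`, and `post_process` (consistent with the `def config():` and `def post_process(response):` lines visible in the diffs). The class names, config keys, and response shape are illustrative assumptions modeled on the "ArabGender"/"GenderClassification"/"GPT" example above:

```python
from llmebench.datasets import ArabGenderDataset  # assumed class names
from llmebench.models import GPTModel
from llmebench.tasks import GenderClassificationTask


def config():
    # Wire the dataset, task, and model together for this asset
    return {
        "dataset": ArabGenderDataset,
        "dataset_args": {},
        "task": GenderClassificationTask,
        "task_args": {},
        "model": GPTModel,
        "model_args": {"max_tries": 3},
        "general_args": {"data_path": "data/demography/gender/gender.txt"},
    }


def prompt(input_sample):
    # Build the zero-shot prompt the model will receive
    return {"prompt": f"Is the following name male or female?\n{input_sample}"}


def post_process(response):
    # Map the raw model response back to the task's label space
    return response["choices"][0]["text"].strip().lower()
```
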
The benchmarking module allows one to run a specific asset instead of the entire benchmark using the `--filter` option. It is also a good idea to use the `--limit` option to limit the run to a few samples (e.g. 5). Sample command below:

```bash
-python -m arabic_llm_benchmark --filter 'demography/gender/AraGend_ChatGPT_ZeroShot' --limit 5 --ignore_cache <benchmark-dir> <results-dir>
+python -m llmebench --filter 'demography/gender/AraGend_ChatGPT_ZeroShot' --limit 5 --ignore_cache <benchmark-dir> <results-dir>
```

Make sure to also run `scripts/run_tests.sh` before submitting your code, and once you are ready, you can commit your changes locally and push them to a remote branch:
6 changes: 3 additions & 3 deletions assets/benchmark_v1/MT/AraBench_Ara2Eng_BLOOMZ_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import AraBenchDataset
-from arabic_llm_benchmark.models import BLOOMPetalModel
-from arabic_llm_benchmark.tasks import MachineTranslationTask
+from llmebench.datasets import AraBenchDataset
+from llmebench.models import BLOOMPetalModel
+from llmebench.tasks import MachineTranslationTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/MT/AraBench_Ara2Eng_ChatGPT4_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import AraBenchDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import MachineTranslationTask
+from llmebench.datasets import AraBenchDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import MachineTranslationTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/MT/AraBench_Ara2Eng_ChatGPT_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import AraBenchDataset
-from arabic_llm_benchmark.models import GPTModel
-from arabic_llm_benchmark.tasks import MachineTranslationTask
+from llmebench.datasets import AraBenchDataset
+from llmebench.models import GPTModel
+from llmebench.tasks import MachineTranslationTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/NER/MGBWords_ChatGPT_ZeroShot.py
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import MGBWordsDataset
-from arabic_llm_benchmark.models import GPTModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import MGBWordsDataset
+from llmebench.models import GPTModel
+from llmebench.tasks import NERTask


def config():
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import MGBWordsDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import MGBWordsDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import NERTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/NER/NERANERcorp_ChatGPT_ZeroShot.py
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import ANERcorpDataset
-from arabic_llm_benchmark.models import GPTModel, RandomGPTModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import ANERcorpDataset
+from llmebench.models import GPTModel, RandomGPTModel
+from llmebench.tasks import NERTask


def config():
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import ANERcorpDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import ANERcorpDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import NERTask


def config():
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import ANERcorpDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import ANERcorpDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import NERTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/NER/NERAqmar_ChatGPT_ZeroShot.py
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import AqmarDataset
-from arabic_llm_benchmark.models import GPTModel, RandomGPTModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import AqmarDataset
+from llmebench.models import GPTModel, RandomGPTModel
+from llmebench.tasks import NERTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/NER/NERAqmar_GPTChatCompletion_FewShot.py
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import AqmarDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import AqmarDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import NERTask


def config():
@@ -1,9 +1,9 @@
import os
import re

-from arabic_llm_benchmark.datasets import AqmarDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import NERTask
+from llmebench.datasets import AqmarDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import NERTask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/ARCD_BLOOMZ_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import ARCDDataset
-from arabic_llm_benchmark.models import BLOOMPetalModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import ARCDDataset
+from llmebench.models import BLOOMPetalModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/ARCD_ChatGPT_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import ARCDDataset
-from arabic_llm_benchmark.models import GPTModel, RandomGPTModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import ARCDDataset
+from llmebench.models import GPTModel, RandomGPTModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/ARCD_GPTChatCompletion_FewShot.py
@@ -1,9 +1,9 @@
import os
import random

-from arabic_llm_benchmark.datasets import ARCDDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import ARCDDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import QATask

random.seed(3333)

6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/ARCD_GPTChatCompletion_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import ARCDDataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import ARCDDataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/MLQA_BLOOMZ_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import MLQADataset
-from arabic_llm_benchmark.models import BLOOMPetalModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import MLQADataset
+from llmebench.models import BLOOMPetalModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/MLQA_ChatGPT_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import MLQADataset
-from arabic_llm_benchmark.models import GPTModel, RandomGPTModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import MLQADataset
+from llmebench.models import GPTModel, RandomGPTModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/MLQA_GPTChatCompletion_FewShot.py
@@ -1,9 +1,9 @@
import os
import random

-from arabic_llm_benchmark.datasets import MLQADataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import MLQADataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import QATask

random.seed(3333)

6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/MLQA_GPTChatCompletion_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import MLQADataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import MLQADataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/TyDiQA_BLOOMZ_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import TyDiQADataset
-from arabic_llm_benchmark.models import BLOOMPetalModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import TyDiQADataset
+from llmebench.models import BLOOMPetalModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/TyDiQA_ChatGPT_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import TyDiQADataset
-from arabic_llm_benchmark.models import GPTModel, RandomGPTModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import TyDiQADataset
+from llmebench.models import GPTModel, RandomGPTModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/TyDiQA_GPTChatCompletion_FewShot.py
@@ -1,9 +1,9 @@
import os
import random

-from arabic_llm_benchmark.datasets import TyDiQADataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import TyDiQADataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import QATask

random.seed(3333)

6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/TydiQA_GPTChatCompletion_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import TyDiQADataset
-from arabic_llm_benchmark.models import GPTChatCompletionModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import TyDiQADataset
+from llmebench.models import GPTChatCompletionModel
+from llmebench.tasks import QATask


def config():
6 changes: 3 additions & 3 deletions assets/benchmark_v1/QA/XQuAD_BLOOMZ_ZeroShot.py
@@ -1,8 +1,8 @@
import os

-from arabic_llm_benchmark.datasets import XQuADDataset
-from arabic_llm_benchmark.models import BLOOMPetalModel
-from arabic_llm_benchmark.tasks import QATask
+from llmebench.datasets import XQuADDataset
+from llmebench.models import BLOOMPetalModel
+from llmebench.tasks import QATask


def config():