Splitting the tutorial.md into multiple smaller tutorials - adding new dataset tutorial
1 parent 9c66a9f, commit 47abb0e
Showing 1 changed file with 31 additions and 0 deletions.
@@ -0,0 +1,31 @@
# Adding Dataset ([See Demo](https://youtu.be/_sO2PhKhKGA?feature=shared))
Check whether the dataset used by your task already has an implementation in `llmebench/datasets`. If not, add a new dataset module (e.g. `llmebench/datasets/SemEval23.py`) that defines a class (e.g. `SemEval23Dataset`) subclassing `DatasetBase`. See the [existing dataset modules](llmebench/datasets) for inspiration. Each new dataset class must implement four methods:

```python
# Import path assumed from the repository layout
from llmebench.datasets.dataset_base import DatasetBase


class NewDataset(DatasetBase):
    def __init__(self, custom_param_1, custom_param_2, **kwargs):
        # custom_param_1/2 are passed from `dataset_args` in the benchmark
        # config
        ...
        super().__init__(**kwargs)

    @staticmethod
    def metadata():
        # This method should return a dictionary that defines metadata describing
        # the dataset, like: citation or reference, download_url, language, etc.
        ...

    def get_data_sample(self):
        # This method should return a dictionary that represents the structure of
        # a single sample of the data, for the purpose of testing and viewing
        # the NewDataset representation
        ...

    def load_data(self, data_path):
        # This function loads the data and must return a list of
        # dictionaries, where each dictionary has at least two keys:
        #   "input": this will be sent to the prompt generator
        #   "label": this will be used for evaluation
        # and may additionally define:
        #   "input_id": this optional key will be used for deduplication
        ...
```
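
As a concrete illustration of the skeleton above, here is a minimal sketch of a dataset that reads a tab-separated file. Everything in it is hypothetical: `ToySentimentDataset`, the `id`/`text`/`label` column names, the `delimiter` parameter, and the metadata values are made up for the example, and `DatasetBase` is replaced by a tiny stand-in so the snippet runs without the framework installed. In a real module you would subclass the framework's `DatasetBase` instead.

```python
import csv
import os
import tempfile


class DatasetBase:
    # Stand-in for the framework's DatasetBase so this sketch is self-contained
    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ToySentimentDataset(DatasetBase):
    # Hypothetical dataset backed by a TSV file with columns: id, text, label
    def __init__(self, delimiter="\t", **kwargs):
        # `delimiter` plays the role of a custom parameter from `dataset_args`
        self.delimiter = delimiter
        super().__init__(**kwargs)

    @staticmethod
    def metadata():
        # Placeholder values; a real dataset would cite its actual source
        return {"language": "en", "citation": "(placeholder)", "download_url": None}

    def get_data_sample(self):
        # One sample in the same shape that load_data produces
        return {"input": "A sample sentence", "label": "positive"}

    def load_data(self, data_path):
        data = []
        with open(data_path, encoding="utf-8", newline="") as fp:
            reader = csv.DictReader(fp, delimiter=self.delimiter)
            for row in reader:
                data.append(
                    {
                        "input": row["text"],  # sent to the prompt generator
                        "label": row["label"],  # used for evaluation
                        "input_id": row["id"],  # optional, used for deduplication
                    }
                )
        return data


# Usage: write a tiny TSV file to a temporary directory and load it
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.tsv")
    with open(path, "w", encoding="utf-8", newline="") as fp:
        fp.write("id\ttext\tlabel\n1\tGreat movie\tpositive\n2\tToo long\tnegative\n")
    samples = ToySentimentDataset().load_data(path)

print(samples[0])
```

Each returned dictionary carries the two required keys plus the optional `"input_id"`, which is the shape the comments in `load_data` describe.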

**Note:** For few-shot assets, the framework can deduplicate the training examples, from which the few-shot examples are drawn, against the evaluation dataset based on sample IDs. To enable this functionality, `load_data` should also define `"input_id"` for each input sample.
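
To make the idea of ID-based deduplication concrete, the following is a minimal sketch of how training samples that also appear in the evaluation set could be filtered out. It is not the framework's actual implementation; the function name `drop_overlapping_train_samples` and the sample values are assumptions made for illustration.

```python
def drop_overlapping_train_samples(train_samples, eval_samples):
    # Collect the IDs present in the evaluation set; samples that do not
    # define "input_id" cannot be matched and are skipped here
    eval_ids = {s["input_id"] for s in eval_samples if "input_id" in s}
    # Keep only training samples whose ID is absent from the evaluation set;
    # training samples without an "input_id" are kept as-is
    return [s for s in train_samples if s.get("input_id") not in eval_ids]


# Usage with hypothetical samples in the shape returned by load_data
train = [
    {"input": "first text", "label": "x", "input_id": "1"},
    {"input": "second text", "label": "y", "input_id": "2"},
]
evaluation = [{"input": "second text", "label": "y", "input_id": "2"}]

few_shot_pool = drop_overlapping_train_samples(train, evaluation)
print(few_shot_pool)
```

Without `"input_id"` there is nothing to match on, which is why the key has to be present in the output of `load_data` for deduplication to have any effect.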

**Once the `Dataset` is implemented, export it in `llmebench/datasets/__init__.py`.**