From 760d94d6af123c7488da48042c4e5cf6ce8ebbc4 Mon Sep 17 00:00:00 2001
From: Maram Hasanain
Date: Sun, 17 Sep 2023 12:52:19 +0300
Subject: [PATCH] Update adding_dataset.md

Removed youtube link, added note on further details location.
---
 docs/tutorials/adding_dataset.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/docs/tutorials/adding_dataset.md b/docs/tutorials/adding_dataset.md
index 10299a1c..838e02a3 100644
--- a/docs/tutorials/adding_dataset.md
+++ b/docs/tutorials/adding_dataset.md
@@ -1,4 +1,6 @@
-# Adding Dataset ([See Demo](https://youtu.be/_sO2PhKhKGA?feature=shared))
+
+# Adding Dataset
+
 Check if the dataset used by your task already has an implementation in `llmebench/datasets`. If not, implement a new dataset module (e.g. `llmebench/datasets/SemEval23.py`), which implements a class (e.g. `SemEval23Dataset`) which subclasses `DatasetBase`. See [existing dataset modules](llmebench/datasets) for inspiration. Each new dataset class requires implementing four functions:
 
 ```python
@@ -26,6 +28,8 @@ class NewDataset(DatasetBase):
         # "input_id": this optional key will be used for deduplication
 ```
 
-**Note:** in case of few shots assets, the framework provides the functionality of deduplicating the training examples, from which few shots are being extracted, against the evaluatin dataset, based on sample IDs. To enable this functionality, `load_data` should also define `"input_id"` per input sample.
+**Notes:**
+- In case of few shots assets, the framework provides the functionality of deduplicating the training examples, from which few shots are being extracted, against the evaluatin dataset, based on sample IDs. To enable this functionality, `load_data` should also define `"input_id"` per input sample.
+- Further details on how to implement each function for a dataset can be found [here](https://github.com/qcri/LLMeBench/blob/main/llmebench/datasets/dataset_base.py).
 
 **Once the `Dataset` is implemented, export it in `llmebench/datasets/__init__.py`.**
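
Reviewer note: the deduplication described in the first bullet of the new **Notes** list can be pictured with the minimal sketch below. It is not the framework's implementation. The function name `deduplicate_train_samples` and the sample values are hypothetical; the only thing taken from the patch is that `load_data` returns a list of dictionaries that may carry an `"input_id"` key.

```python
def deduplicate_train_samples(train_samples, eval_samples):
    """Drop training samples whose "input_id" also appears in the evaluation set.

    Hypothetical helper for illustration only; samples without an
    "input_id" key are kept, since they cannot be matched.
    """
    # Collect the IDs present in the evaluation data.
    eval_ids = {s["input_id"] for s in eval_samples if "input_id" in s}
    # Keep a training sample unless its ID is one of the evaluation IDs.
    return [
        s for s in train_samples
        if "input_id" not in s or s["input_id"] not in eval_ids
    ]


# Made-up samples in the shape load_data is described as returning.
train = [
    {"input": "a", "label": 0, "input_id": "1"},
    {"input": "b", "label": 1, "input_id": "2"},
    {"input": "c", "label": 0},  # no "input_id": cannot be deduplicated
]
evaluation = [{"input": "b", "label": 1, "input_id": "2"}]

kept = deduplicate_train_samples(train, evaluation)
```

Here only the sample with `"input_id": "2"` is dropped, which is why defining `"input_id"` per sample in `load_data` matters when few-shot examples are drawn from the training data.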