Add function to tokenize text stories and split into batches #55

Merged · 8 commits · Mar 30, 2024

Conversation

@siwei-li (Collaborator) commented Mar 10, 2024

The test currently covers two cases, text input and tokenized input; let me know if there's anything to add to it.

@siwei-li linked an issue on Mar 10, 2024 that may be closed by this pull request
@siwei-li marked this pull request as ready for review on March 10, 2024 05:06
@siwei-li requested review from @joshuawe and removed the request for @jaidhyani on March 15, 2024 17:02
@joshuawe (Collaborator) left a comment

  • add docstrings
  • update test to not download the validation dataset from huggingface

src/delphi/train/dataset_tokenization.py (outdated review comments, resolved)
return samples


def get_tokenized_batches(
Collaborator:
Please, please add a docstring 😄
text_stories is a list of stories? Each story is of arbitrary length and is not tokenized? E.g.: "There once was a dog called Barky. Barky lived .... "?
And in the end we split each of the stories into chunks that are of max size context_size? Is that the main task of the get_tokenized_batches function?

Collaborator:

@siwei-li, a reply to my questions would have been much appreciated, instead of just resolving them. They might be simple or even stupid questions, but answering them would have helped me understand your code.

Contributor:

There are no batches, so I think the function name should be changed, to something like "tokenize_dataset".

tests/train/test_tokenizer.py (outdated review comment, resolved)
@siwei-li force-pushed the 49-dataset-tokenization-script branch from 1de7467 to 64fdd4c on March 18, 2024 01:37
@siwei-li (Collaborator, Author) commented:

@joshuawe I added docstrings to the functions; the first two, extend_deque() and make_new_samples(), serve as helper functions.

I noticed that the failing check is the pytest failure in:

```
tests/eval/test_token_labelling.py:2: in
import spacy
E ModuleNotFoundError: No module named 'spacy'
```

@joshuawe (Collaborator) commented:

Hi @siwei-li, I will check your updated code.
The error that we currently see (Link) is not because of the missing spacy module (I assume that error message was from your local machine).
Following the link, we see that beartype raised a type error in extend_deque(). It seems the parameter text_stories expected the type list[str] but received something like list[list[int]].

```
E   beartype.roar.BeartypeCallHintParamViolation: Function delphi.train.dataset_tokenization.extend_deque() parameter
text_stories=[[1581, 1327, 61, 300, 3003, 1427, 1029, 2570, 1736, 3552, 3170, 635, 3551, 91, 3052, 2560,...]] violates type hint
list[str] under non-default configuration BeartypeConf(claw_is_pep526=True, is_color=None, is_debug=False,
is_pep484_tower=False, strategy=<BeartypeStrategy.O1: 2>, warning_cls_on_decorator_exception=<class
'beartype.roar._roarwarn.BeartypeClawDecorWarning'>), as list index 11 item list [3638, 2316, 723, 505, 3953, 325, 742, 2799,
3320, 4063, 159, 3931, 3300, 3072, 3636, 2544, ...] not instance of str.
```
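In other words, the test is passing already-tokenized stories (a list of lists of ints) where the function expects raw strings. A minimal, self-contained illustration of how beartype reports this kind of mismatch; the function below is a hypothetical stand-in, not the actual extend_deque():

```python
from beartype import beartype
from beartype.roar import BeartypeCallHintParamViolation


@beartype
def expects_text(text_stories: list[str]) -> int:
    """Hypothetical stand-in for a function annotated like extend_deque()."""
    return len(text_stories)


expects_text(["There once was a dog called Barky."])  # ok: raw strings

try:
    expects_text([[1581, 1327, 61, 300]])  # already tokenized -> wrong type
except BeartypeCallHintParamViolation as err:
    print(err)  # reports that a list item is a list, not a str
```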

@jettjaniak (Contributor) commented:

We need a script in scripts/ that takes an input dataset name, an output dataset name, a tokenizer name, and HF credentials, then tokenizes the dataset and pushes it to HF. The column name in the output dataset should be "tokens". The column name in the input dataset should be an optional argument that defaults to the only available column, or fails if there is more than one column. Please check (on a subset of the dataset) how slow the tokenization is; if it takes more than a few minutes, we might want to use tokenizer.batch_encode.
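For reference, a minimal sketch of what such a script could look like; the argument names, the split handling, and the use of a batched map for speed are assumptions here, not the script that was merged in this PR:

```python
# scripts/tokenize_dataset.py -- hypothetical sketch, not the merged implementation
import argparse

from datasets import load_dataset
from transformers import AutoTokenizer

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tokenize a text dataset and push it to HF")
    parser.add_argument("--input-dataset", required=True)
    parser.add_argument("--output-dataset", required=True)
    parser.add_argument("--tokenizer", required=True)
    parser.add_argument("--hf-token", required=True)
    parser.add_argument("--column", default=None,
                        help="input text column; defaults to the only available column")
    args = parser.parse_args()

    ds = load_dataset(args.input_dataset, split="train")

    # Default to the only column, fail if the choice is ambiguous.
    column = args.column
    if column is None:
        if len(ds.column_names) != 1:
            raise ValueError(f"Ambiguous columns {ds.column_names}; pass --column explicitly")
        column = ds.column_names[0]

    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)

    # A batched map encodes many stories per call, which is much faster
    # than tokenizing story by story.
    tokenized = ds.map(
        lambda batch: {"tokens": tokenizer(batch[column])["input_ids"]},
        batched=True,
        remove_columns=ds.column_names,
    )
    tokenized.push_to_hub(args.output_dataset, token=args.hf_token)
```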

@siwei-li force-pushed the 49-dataset-tokenization-script branch 3 times, most recently from c00f5b1 to eccad9d on March 20, 2024 18:55
@siwei-li (Collaborator, Author) commented:

The HF script is added, @jettjaniak.
(I uploaded the tokenized 'stories' dataset to https://huggingface.co/datasets/delphi-suite/batched-tokenized-stories)
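For anyone who wants to spot-check the upload, something like the following should work (assuming the dataset is public and has a train split):

```python
from datasets import load_dataset

ds = load_dataset("delphi-suite/batched-tokenized-stories", split="train")
print(ds.column_names)       # expected to contain the "tokens" column requested above
print(len(ds[0]["tokens"]))  # length of the first tokenized sample
```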

@joshuawe (Collaborator) left a comment

can be merged, if no rebase is necessary



def test_make_new_sample(tokenizer):
    for _ in range(100):
Collaborator:

Why is this done one hundred times?

dq: deque[int], context_size: int, bos_token_id: int
) -> list[list[int]]:
"""
Generates new samples for training by creating sequences of tokens
Collaborator:

I find the explanation a bit confusing (or I am confused).
This function does not generate entirely new content, correct?
It reduces the length of pre-existing samples by clipping them to context_size, correct?
A different wording would make this easier to understand. You could also consider renaming make_new_samples to something such as clip_samples or similar. But this is up to you.

Collaborator:

I see, my previous assumption was wrong. This function does not clip the samples, but rather splits them, prepending a BOS token and carrying over the final token from the previous split.
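To make that behavior concrete, here is a toy, self-contained re-implementation of the splitting as described above (the function name, sample-length convention, and token IDs are made up for illustration; this is not the code from the PR):

```python
from collections import deque


def split_with_bos(tokens: list[int], context_size: int, bos_token_id: int) -> list[list[int]]:
    """Split a token stream into samples of at most context_size tokens,
    each starting with BOS and repeating the last token of the previous sample."""
    dq = deque(tokens)
    samples = []
    while dq:
        sample = [bos_token_id]
        while dq and len(sample) < context_size:
            sample.append(dq.popleft())
        if dq:
            # the next sample starts again from this token
            dq.appendleft(sample[-1])
        samples.append(sample)
    return samples


print(split_with_bos(list(range(1, 8)), context_size=5, bos_token_id=0))
# [[0, 1, 2, 3, 4], [0, 4, 5, 6, 7]]
```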

@joshuawe force-pushed the 49-dataset-tokenization-script branch from a4c0889 to fbda7d0 on March 30, 2024 17:47
@joshuawe force-pushed the 49-dataset-tokenization-script branch from fbda7d0 to ba1b109 on March 30, 2024 19:41
@joshuawe merged commit 5b7ec89 into main on Mar 30, 2024 (1 check passed)
@joshuawe deleted the 49-dataset-tokenization-script branch on March 30, 2024 19:47