Add function to tokenize text stories and split into batches #55

Merged · 8 commits · Mar 30, 2024

Conversation

@siwei-li (Collaborator) commented Mar 10, 2024

The test currently covers two cases, text input and tokenized input; let me know if there's anything to add to it.

@siwei-li linked an issue on Mar 10, 2024 that may be closed by this pull request
@siwei-li marked this pull request as ready for review on March 10, 2024 05:06
@siwei-li requested review from @joshuawe and removed the request for @jaidhyani on March 15, 2024 17:02
@joshuawe (Collaborator) left a comment

  • add docstrings
  • update test to not download the validation dataset from huggingface

src/delphi/train/dataset_tokenization.py (outdated review comments, resolved)
return samples


def get_tokenized_batches(
Collaborator:
Please, please add a docstring 😄
text_stories is a list of stories? Each story is of arbitrary length and is not tokenized? E.g.: "There once was a dog called Barky. Barky lived .... "?
And in the end we split each of the stories into chunks that are of max size context_size? Is that the main task of the get_tokenized_batches function?

Collaborator:

@siwei-li, a reply to my questions would have been much appreciated, instead of just resolving them. They might be simple or even stupid questions, but answering them would have helped me understand your code.

Contributor:

There are no batches, so I think the function name should be changed, to something like "tokenize_dataset".

tests/train/test_tokenizer.py (outdated review comment, resolved)
@siwei-li force-pushed the 49-dataset-tokenization-script branch from 1de7467 to 64fdd4c on March 18, 2024 01:37
@siwei-li (Collaborator, Author) commented:

@joshuawe I added docstrings to the functions; the first two, extend_deque() and make_new_samples(), serve as helper functions.

I noticed that the failing check is the pytest failure in:

```
tests/eval/test_token_labelling.py:2: in
import spacy
E ModuleNotFoundError: No module named 'spacy'
```

@joshuawe (Collaborator) commented:

Hi @siwei-li, I will check your updated code.
The error that we currently see (Link) is not because of the missing spacy module (I assume that error message was from your local machine).
Following the link, we see that beartype raised a type error in extend_deque(). It seems the parameter text_stories expected the type list[str] but received something like list[list[int]].

```
E   beartype.roar.BeartypeCallHintParamViolation: Function delphi.train.dataset_tokenization.extend_deque() parameter
text_stories=[[1581, 1327, 61, 300, 3003, 1427, 1029, 2570, 1736, 3552, 3170, 635, 3551, 91, 3052, 2560,...]] violates type hint
list[str] under non-default configuration BeartypeConf(claw_is_pep526=True, is_color=None, is_debug=False,
is_pep484_tower=False, strategy=<BeartypeStrategy.O1: 2>, warning_cls_on_decorator_exception=<class
'beartype.roar._roarwarn.BeartypeClawDecorWarning'>), as list index 11 item list [3638, 2316, 723, 505, 3953, 325, 742, 2799,
3320, 4063, 159, 3931, 3300, 3072, 3636, 2544, ...] not instance of str.
```
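In other words, the test is passing already-tokenized stories (a list of lists of ints) where the function expects raw strings. A minimal, self-contained illustration of how beartype reports this kind of mismatch; the function below is a hypothetical stand-in, not the actual extend_deque():

```python
from beartype import beartype
from beartype.roar import BeartypeCallHintParamViolation


@beartype
def expects_text(text_stories: list[str]) -> int:
    """Hypothetical stand-in for a function annotated like extend_deque()."""
    return len(text_stories)


expects_text(["There once was a dog called Barky."])  # ok: raw strings

try:
    expects_text([[1581, 1327, 61, 300]])  # already tokenized -> wrong type
except BeartypeCallHintParamViolation as err:
    print(err)  # reports that a list item is a list, not a str
```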

@jettjaniak (Contributor) commented:

We need a script in scripts/ that takes an input dataset name, an output dataset name, a tokenizer name, and HF credentials, then tokenizes the dataset and pushes it to HF. The column name in the output dataset should be "tokens". The column name in the input dataset should be an optional argument that defaults to the only available column, or fails if there is more than one column. Please check (on a subset of the dataset) how slow the tokenization is; if it takes more than a few minutes, we might want to use tokenizer.batch_encode.
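For reference, a minimal sketch of what such a script could look like; the argument names, the split handling, and the use of a batched map for speed are assumptions here, not the script that was merged in this PR:

```python
# scripts/tokenize_dataset.py -- hypothetical sketch, not the merged implementation
import argparse

from datasets import load_dataset
from transformers import AutoTokenizer

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tokenize a text dataset and push it to HF")
    parser.add_argument("--input-dataset", required=True)
    parser.add_argument("--output-dataset", required=True)
    parser.add_argument("--tokenizer", required=True)
    parser.add_argument("--hf-token", required=True)
    parser.add_argument("--column", default=None,
                        help="input text column; defaults to the only available column")
    args = parser.parse_args()

    ds = load_dataset(args.input_dataset, split="train")

    # Default to the only column, fail if the choice is ambiguous.
    column = args.column
    if column is None:
        if len(ds.column_names) != 1:
            raise ValueError(f"Ambiguous columns {ds.column_names}; pass --column explicitly")
        column = ds.column_names[0]

    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)

    # A batched map encodes many stories per call, which is much faster
    # than tokenizing story by story.
    tokenized = ds.map(
        lambda batch: {"tokens": tokenizer(batch[column])["input_ids"]},
        batched=True,
        remove_columns=ds.column_names,
    )
    tokenized.push_to_hub(args.output_dataset, token=args.hf_token)
```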

@siwei-li force-pushed the 49-dataset-tokenization-script branch 3 times, most recently from c00f5b1 to eccad9d on March 20, 2024 18:55
@siwei-li (Collaborator, Author) commented:

The HF script is added, @jettjaniak.
(I uploaded the tokenized 'stories' dataset to https://huggingface.co/datasets/delphi-suite/batched-tokenized-stories)
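For anyone who wants to spot-check the upload, something like the following should work (assuming the dataset is public and has a train split):

```python
from datasets import load_dataset

ds = load_dataset("delphi-suite/batched-tokenized-stories", split="train")
print(ds.column_names)       # expected to contain the "tokens" column requested above
print(len(ds[0]["tokens"]))  # length of the first tokenized sample
```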

@joshuawe (Collaborator) left a comment

can be merged, if no rebase is necessary



def test_make_new_sample(tokenizer):
    for _ in range(100):
Collaborator:

Why is this done one hundred times?

dq: deque[int], context_size: int, bos_token_id: int
) -> list[list[int]]:
"""
Generates new samples for training by creating sequences of tokens
Collaborator:

I find the explanation a bit confusing (or I am confused).
This function does not generate entirely new content, correct?
It reduces the length of pre-existing samples by clipping them to context_size, correct?
A different wording would make this easier to understand. You could also consider renaming make_new_samples to something such as clip_samples or similar. But this is up to you.

Collaborator:

I see, my previous assumption was wrong. This function does not clip the samples, but rather splits them, prepending a BOS token and carrying over the final token from the previous split.
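To make that behavior concrete, here is a toy, self-contained re-implementation of the splitting as described above (the function name, sample-length convention, and token IDs are made up for illustration; this is not the code from the PR):

```python
from collections import deque


def split_with_bos(tokens: list[int], context_size: int, bos_token_id: int) -> list[list[int]]:
    """Split a token stream into samples of at most context_size tokens,
    each starting with BOS and repeating the last token of the previous sample."""
    dq = deque(tokens)
    samples = []
    while dq:
        sample = [bos_token_id]
        while dq and len(sample) < context_size:
            sample.append(dq.popleft())
        if dq:
            # the next sample starts again from this token
            dq.appendleft(sample[-1])
        samples.append(sample)
    return samples


print(split_with_bos(list(range(1, 8)), context_size=5, bos_token_id=0))
# [[0, 1, 2, 3, 4], [0, 4, 5, 6, 7]]
```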

@joshuawe force-pushed the 49-dataset-tokenization-script branch from a4c0889 to fbda7d0 on March 30, 2024 17:47
@joshuawe force-pushed the 49-dataset-tokenization-script branch from fbda7d0 to ba1b109 on March 30, 2024 19:41
@joshuawe merged commit 5b7ec89 into main on Mar 30, 2024 (1 check passed)
@joshuawe deleted the 49-dataset-tokenization-script branch on March 30, 2024 19:47