Add function to tokenize text stories and split into batches #55
Conversation
- add docstrings
- update test to not download the validation dataset from huggingface
return samples

def get_tokenized_batches(
Please add a docstring 😄
text_stories is a list of stories? Each story is of arbitrary length and is not tokenized? E.g.: "There once was a dog called Barky. Barky lived .... "? And in the end we split each of the stories into chunks that are of max size context_size? Is that the main task of the get_tokenized_batches function?
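If that reading is right, the core of the splitting step could be sketched like this. This is a hedged guess at the behavior under discussion, not the PR's actual code; the function and parameter names are stand-ins:

```python
# Sketch of the presumed splitting behavior: take already-tokenized stories
# and cut each token list into chunks of at most context_size tokens.
# Names here are illustrative, not taken from the PR.
def chunk_tokenized_stories(
    tokenized_stories: list[list[int]], context_size: int
) -> list[list[int]]:
    chunks = []
    for tokens in tokenized_stories:
        for start in range(0, len(tokens), context_size):
            chunks.append(tokens[start : start + context_size])
    return chunks
```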
@siwei-li , a reply to my questions would have been much appreciated, instead of just resolving them. They might be simple or even stupid questions, but it would have helped me understand your code.
There are no batches, so I think the function name should be changed to something like "tokenize_dataset"?
Force-pushed from 1de7467 to 64fdd4c
@joshuawe I added the docstrings to the functions. I noticed that the failing check is for the pytest in
Hi @siwei-li, will check your updated code.
We need a script in scripts/ that takes an input dataset name, an output dataset name, a tokenizer name, and HF credentials, then tokenizes the dataset and pushes it to HF. The column name in the output dataset should be "tokens". The column name in the input dataset should be an optional argument that defaults to the only available column, or fails if there is more than one column. Please check (on a subset of the dataset) how slow the tokenization is. If it takes more than a few minutes, we might want to use tokenizer.batch_encode.
Force-pushed from c00f5b1 to eccad9d
The script for HF is added, @jettjaniak.
can be merged, if no rebase is necessary
def test_make_new_sample(tokenizer):
    for _ in range(100):
Why is this done one hundred times?
dq: deque[int], context_size: int, bos_token_id: int
) -> list[list[int]]:
    """
    Generates new samples for training by creating sequences of tokens
I find the explanation a bit confusing (or I am confused). This function does not generate entirely new content, correct? It reduces the length of pre-existing samples by clipping them to context_size, correct? A different wording would make it easier to understand. You could also consider renaming the function make_new_samples to something such as clip_samples or similar. But this is up to you.
I see, my previous assumption was wrong. This function does not clip the samples, but rather splits them, prepending a BOS token and also carrying over the final token from the previous split.
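Based on that reading, here is a hedged sketch of the splitting behavior, reconstructed from the signature quoted above. It is not the PR's actual implementation; in particular, how leftover tokens at the end of the deque are handled is a guess:

```python
from collections import deque


def make_new_samples(
    dq: deque[int], context_size: int, bos_token_id: int
) -> list[list[int]]:
    """Split a deque of tokens into samples of at most context_size tokens.

    Each sample starts with the BOS token, and every sample after the first
    repeats the last token of the previous sample, so no next-token
    prediction pair is lost at a split boundary.
    """
    samples: list[list[int]] = []
    while dq:
        sample = [bos_token_id]
        if samples:
            # Carry the previous sample's final token across the split.
            sample.append(samples[-1][-1])
        while len(sample) < context_size and dq:
            sample.append(dq.popleft())
        samples.append(sample)
    return samples
```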
Force-pushed from a4c0889 to fbda7d0
Force-pushed from fbda7d0 to ba1b109
The test right now covers two cases, for text input and tokenized input; lmk if there's anything to add to it.