tokenizer training script #103
Conversation
Thanks, looks good!
There is a lot of file juggling that I don't think we need. Let's use the tempfile module to deal with that and remove the files automatically after we're done.
Claude says:
In Python, you can use the tempfile module to create temporary files. The tempfile module provides a secure and convenient way to create temporary files and directories that are automatically deleted when they are no longer needed.
Here are a few ways to create temporary files using the tempfile module:
- Using tempfile.NamedTemporaryFile():

import tempfile

with tempfile.NamedTemporaryFile(mode='w+') as temp_file:
    temp_file.write('Some temporary data')
    temp_file.flush()
    # Do something with the temporary file

# The file will be automatically deleted when the 'with' block ends
- Using tempfile.TemporaryFile():

import tempfile

with tempfile.TemporaryFile(mode='w+') as temp_file:
    temp_file.write('Some temporary data')
    temp_file.seek(0)  # Move the file pointer to the beginning
    # Do something with the temporary file

# The file will be automatically deleted when the 'with' block ends
- Using tempfile.mkstemp():

import tempfile
import os

fd, temp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'w') as temp_file:
        temp_file.write('Some temporary data')
        # Do something with the temporary file
finally:
    os.remove(temp_path)  # Explicitly delete the temporary file
In the first two examples, the temporary files are created using a context manager (with block), which ensures that the files are automatically deleted when the block ends, even if an exception occurs.
In the third example, tempfile.mkstemp() returns a file descriptor and the path to the temporary file. You need to explicitly delete the temporary file using os.remove() when you're done with it.
By default, the temporary files are created in the default temporary directory of the operating system. You can specify a different directory by passing the dir parameter to the tempfile functions.
Using the tempfile module is recommended for creating temporary files because it handles file creation and deletion securely and avoids common pitfalls associated with manual temporary file management.
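As a quick illustration of the dir parameter mentioned above, a minimal sketch (the directory handling and suffix are arbitrary choices, not from the PR):

import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Files created with dir=tmp_dir are removed together with the directory.
    with tempfile.NamedTemporaryFile(mode='w+', dir=tmp_dir, suffix='.txt') as temp_file:
        temp_file.write('Some temporary data')
        temp_file.flush()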
scripts/train_tokenizer.py
Outdated
# push tokenizer to the hub
tokenizer.push_to_hub(
    repo_id="jbrinkma/tokenizer_test",
should use arguments
Yes. Other scripts use a fixed one, e.g. repo_id = f"{args.username}/v0-token-map" (in map_tokens.py), but I think a specific parameter for the repo_id might make more sense. I will add one, but please let me know if you prefer it the other way.
Regarding the use of tempfile: previously, I created three local files:
1. the text file for training the tokenizer,
2. the original SentencePiece tokenizer, and
3. a file created when converting the tokenizers.
I removed (1) and (3), but eliminating (2) doesn't seem possible: SentencePieceTrainer seems to always create a local file which has to be handled (https://snyk.io/advisor/python/sentencepiece/functions/sentencepiece.SentencePieceTrainer).
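A sketch of one way to contain that file, assuming the trained .model only needs to be read back once (the input path and vocab size below are placeholders):

import os
import tempfile

import sentencepiece as spm

with tempfile.TemporaryDirectory() as tmp_dir:
    model_prefix = os.path.join(tmp_dir, "spm")
    # SentencePieceTrainer writes <model_prefix>.model and <model_prefix>.vocab to disk;
    # keeping them inside a TemporaryDirectory removes them automatically afterwards.
    spm.SentencePieceTrainer.train(
        input="text.txt",          # placeholder path to the training text
        model_prefix=model_prefix,
        vocab_size=4096,           # placeholder vocab size
    )
    with open(model_prefix + ".model", "rb") as f:
        model_bytes = f.read()     # read into memory before the directory is deleted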
scripts/train_tokenizer.py
Outdated
    funct_test: bool = False,
):
    """
    Trains a SentencePiece tokenizer on a dataset.
What's the difference between SentencePiece and LlamaTokenizerFast?
LlamaTokenizer is just a simple wrapper for SentencePiece with the option to add <eos> and <bos> tokens (see https://github.com/meta-llama/llama/blob/main/llama/tokenizer.py). The ...Fast tokenizers are functionally the same but implemented in Rust, and presumably significantly faster (although I never tested this, tbh).
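For illustration, a trained SentencePiece .model file can be wrapped either way (the file name is a placeholder, and this assumes the standard transformers constructors):

from transformers import LlamaTokenizer, LlamaTokenizerFast

# Slow tokenizer: a thin Python wrapper around the SentencePiece .model file.
slow_tok = LlamaTokenizer(vocab_file="spm.model", add_bos_token=True, add_eos_token=False)

# Fast tokenizer: same vocabulary, Rust-backed, converted from the slow tokenizer.
fast_tok = LlamaTokenizerFast(vocab_file="spm.model", from_slow=True)

# Both should produce the same pieces for typical inputs.
print(slow_tok.tokenize("hello world"))
print(fast_tok.tokenize("hello world"))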
scripts/train_tokenizer.py
Outdated
train_ds = load_dataset(dataset_name)["train"]
if train_size < 1.0:
    train_ds = train_ds.train_test_split(train_size=train_size)["train"]
This is randomized, so not good for reproducibility. Can we just take a str split argument that defaults to "train"? If you want to train on a subset, you can then specify "train[:1000]" or "train[:10%]".
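A sketch of how the split-string approach would look with datasets (the dataset name is a placeholder):

from datasets import load_dataset

# Deterministic subset selection via the split string (no shuffling involved).
full_ds = load_dataset("user/dataset", split="train")          # whole train split
head_ds = load_dataset("user/dataset", split="train[:1000]")   # first 1000 rows
pct_ds = load_dataset("user/dataset", split="train[:10%]")     # first 10% of rows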
train_test_split takes an optional seed argument.
I think I prefer the seed option, but I also like the absolute selection option. I will add a seed parameter and remove the type hint for the train_size argument, as train_test_split allows both int and float, for absolute and relative selections respectively.
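A sketch of both selection modes with a fixed seed (the dataset name is a placeholder):

from datasets import load_dataset

ds = load_dataset("user/dataset", split="train")

# Relative selection: keep 10% of the rows, reproducibly.
subset = ds.train_test_split(train_size=0.1, seed=42)["train"]

# Absolute selection: keep exactly 1000 rows, reproducibly.
subset = ds.train_test_split(train_size=1000, seed=42)["train"]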
scripts/train_tokenizer.py
Outdated
tokenizer_model_path = get_tokenizer_model_path(
    vocab_size=vocab_size,
)
I don't think this should be a separate function
removed
src/delphi/train/tokenizer.py
Outdated
text_file = os.path.join(cache_dir, "text.txt")
with open(text_file, 'w', encoding='utf-8') as file:
    for item in dataset:
        text = item['story']
The feature needs to be configurable. If not set, it should default to the only feature there is, or fail if there is more than one.
added as a parameter
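A minimal sketch of that default-or-fail behavior (the helper name and signature are hypothetical, not the PR's code):

from typing import Optional

def resolve_feature(dataset, feature: Optional[str] = None) -> str:
    # Pick the text column to read from the dataset.
    if feature is not None:
        return feature
    columns = dataset.column_names
    if len(columns) == 1:
        return columns[0]  # default to the only feature there is
    raise ValueError(f"Dataset has multiple features {columns}; please specify one.")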
scripts/train_tokenizer.py
Outdated
vocab = {sp_model.id_to_piece(index): index for index in trange(sp_model.GetPieceSize())}
merges = []
for piece_l in tqdm(vocab.keys(), total=sp_model.GetPieceSize()):
    for piece_r in vocab.keys():
        merge = f"{piece_l}{piece_r}"
        piece_id = vocab.get(merge, None)
        if piece_id:
            merges += [(piece_l, piece_r, piece_id)]
merges = sorted(merges, key=lambda val: val[2])
merges = [(val[0], val[1]) for val in merges]
any idea what this is doing?
This generates the components HF needs to initialise the SentencePieceBPETokenizer: vocab contains all tokens, and merges contains all pairs of tokens whose concatenation exists as another token in the vocabulary (e.g. "hel" + "lo" = "hello").
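For context, a sketch of how those two structures feed into the HF tokenizers constructor (assuming vocab and merges are built as in the snippet above):

from tokenizers import SentencePieceBPETokenizer

# vocab: {piece: id}; merges: [(left, right), ...] ordered by the merged piece's id.
hf_tokenizer = SentencePieceBPETokenizer(vocab=vocab, merges=merges)
print(hf_tokenizer.encode("hello world").tokens)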
Looks like you need to set up black, CI is failing on that.
Also, could you think about some localized unit tests? Like we have a pre-defined string as a text to train on and we check if the resulting tokenizer has the same vocab and tokenizes text the way we expect.
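A rough sketch of such a test (train_tokenizer and its signature are assumed here for illustration; the script's actual entry point may differ):

def test_train_tokenizer_on_fixed_text():
    text = "the quick brown fox jumps over the lazy dog"
    tokenizer = train_tokenizer(texts=[text], vocab_size=64)

    # The trained vocabulary should be deterministic for a fixed input and seed.
    assert "fox" in tokenizer.get_vocab() or "▁fox" in tokenizer.get_vocab()

    # Encoding and decoding a known string should round-trip exactly.
    ids = tokenizer.encode("the quick brown fox", add_special_tokens=False)
    assert tokenizer.decode(ids) == "the quick brown fox"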
Force-pushed from 6385892 to 4fc7757.
No description provided.