
tokenizer training script #103

Merged: 8 commits merged into main on Apr 19, 2024
Conversation

jettjaniak (Contributor)

No description provided.

@jettjaniak (Contributor, Author) left a comment:

Thanks, looks good!
There is a lot of file juggling that I don't think we need. Let's use the tempfile module to deal with that and remove the files automatically after we're done.

Claude says:

In Python, you can use the tempfile module to create temporary files. The tempfile module provides a secure and convenient way to create temporary files and directories that are automatically deleted when they are no longer needed.

Here are a few ways to create temporary files using the tempfile module:

  1. Using tempfile.NamedTemporaryFile():
import tempfile

with tempfile.NamedTemporaryFile(mode='w+') as temp_file:
    temp_file.write('Some temporary data')
    temp_file.flush()
    # Do something with the temporary file
    # The file will be automatically deleted when the 'with' block ends
  2. Using tempfile.TemporaryFile():
import tempfile

with tempfile.TemporaryFile(mode='w+') as temp_file:
    temp_file.write('Some temporary data')
    temp_file.seek(0)  # Move the file pointer to the beginning
    # Do something with the temporary file
    # The file will be automatically deleted when the 'with' block ends
  3. Using tempfile.mkstemp():
import tempfile
import os

fd, temp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'w') as temp_file:
        temp_file.write('Some temporary data')
    # Do something with the temporary file
finally:
    os.remove(temp_path)  # Explicitly delete the temporary file

In the first two examples, the temporary files are created using a context manager (with block), which ensures that the files are automatically deleted when the block ends, even if an exception occurs.

In the third example, tempfile.mkstemp() returns a file descriptor and the path to the temporary file. You need to explicitly delete the temporary file using os.remove() when you're done with it.

By default, the temporary files are created in the default temporary directory of the operating system. You can specify a different directory by passing the dir parameter to the tempfile functions.

Using the tempfile module is recommended for creating temporary files because it handles the file creation and deletion securely and avoids common pitfalls associated with manual temporary file management.


# push tokenizer to the hub
tokenizer.push_to_hub(
    repo_id="jbrinkma/tokenizer_test",
jettjaniak (Contributor, Author):

should use arguments

Contributor:

Yes. Other scripts use a fixed string, e.g. repo_id = f"{args.username}/v0-token-map" (in map_tokens.py), but I think a specific parameter for the repo_id might make more sense. I will add one, but please let me know if you prefer it the other way.
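
As a rough illustration of that parameter (the flag name and help text below are hypothetical, not necessarily what the PR ends up using):

import argparse

parser = argparse.ArgumentParser(description="train a tokenizer and push it to the Hugging Face hub")
# hypothetical flag; the actual script may name it differently or give it a default
parser.add_argument(
    "--repo-id",
    type=str,
    required=True,
    help="hub repo to push the trained tokenizer to, e.g. 'username/my-tokenizer'",
)
args = parser.parse_args()

# ... train the tokenizer, then:
# tokenizer.push_to_hub(repo_id=args.repo_id)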

Contributor:

Regarding the use of tempfile: previously, I created three local files:

  1. the text file used for training the tokenizer,
  2. the original SentencePiece tokenizer model,
  3. an intermediate file when converting the tokenizers.

I removed (1) and (3), but eliminating (2) doesn't seem possible: SentencePieceTrainer always writes a local model file that has to be handled (https://snyk.io/advisor/python/sentencepiece/functions/sentencepiece.SentencePieceTrainer).
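
One way to keep (2) from leaving anything behind is to point SentencePieceTrainer's model_prefix into a tempfile.TemporaryDirectory, so the .model/.vocab files are removed together with the directory. A minimal sketch, with the corpus path and vocab size as placeholders:

import os
import tempfile

import sentencepiece as spm

with tempfile.TemporaryDirectory() as tmp_dir:
    model_prefix = os.path.join(tmp_dir, "sp")
    # writes sp.model and sp.vocab inside tmp_dir
    spm.SentencePieceTrainer.train(
        input="corpus.txt",  # placeholder: the text file built from the dataset
        model_prefix=model_prefix,
        vocab_size=4096,  # placeholder
    )
    sp_model = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    # ... convert / inspect the tokenizer here, while the files still exist ...
# tmp_dir and everything inside it is deleted when the 'with' block exits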

scripts/train_tokenizer.py (outdated; resolved)
    funct_test: bool = False,
):
    """
    Trains a SentencePiece tokenizer on a dataset.
jettjaniak (Contributor, Author):

What's the difference between SentencePiece and LlamaTokenizerFast?

@jannik-brinkmann (Contributor), Apr 5, 2024:

LlamaTokenizer is just a thin wrapper around SentencePiece with the option to add <eos> and <bos> tokens (see https://github.com/meta-llama/llama/blob/main/llama/tokenizer.py). The ...Fast tokenizers are functionally the same but implemented in Rust, and presumably significantly faster (although I never tested this, tbh).
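
To make the relationship concrete, a small sketch (not code from this PR; "tokenizer.model" is a placeholder for whatever the training script produces):

import sentencepiece as spm
from transformers import LlamaTokenizer

# raw SentencePiece: plain piece ids, no special tokens added
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Once upon a time"))

# LlamaTokenizer wraps the same .model file and can prepend <bos> / append <eos>
tok = LlamaTokenizer(vocab_file="tokenizer.model", add_bos_token=True, add_eos_token=False)
print(tok("Once upon a time")["input_ids"])  # roughly the same ids, with the <bos> id in front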

Comment on lines 31 to 33
train_ds = load_dataset(dataset_name)["train"]
if train_size < 1.0:
    train_ds = train_ds.train_test_split(train_size=train_size)["train"]
jettjaniak (Contributor, Author):

This is randomized, so not good for reproducibility. Can we just take a str split argument that defaults to "train"? If you want to train on a subset, you can then specify "train[:1000]" or "train[:10%]".

Collaborator:

train_test_split takes an optional seed argument

Contributor:

I think I prefer the seed option, but I also like the absolute selection option. I will add a seed parameter and remove the type hint for the train_size argument, since train_test_split accepts both int and float, for absolute and relative selections respectively.
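
For reference, a small sketch of the two options being discussed (the dataset name and sizes are placeholders):

from datasets import load_dataset

# option A: deterministic slicing via the split string
train_ds = load_dataset("roneneldan/TinyStories", split="train[:10%]")

# option B: random subsampling, made reproducible with an explicit seed;
# train_size can be a float (fraction) or an int (absolute number of rows)
full_ds = load_dataset("roneneldan/TinyStories", split="train")
train_ds = full_ds.train_test_split(train_size=0.1, seed=42)["train"]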

scripts/train_tokenizer.py (outdated; resolved)
Comment on lines 35 to 37
tokenizer_model_path = get_tokenizer_model_path(
    vocab_size=vocab_size,
)
jettjaniak (Contributor, Author):

I don't think this should be a separate function

Contributor:

removed

scripts/train_tokenizer.py (outdated; resolved)
text_file = os.path.join(cache_dir, "text.txt")
with open(text_file, 'w', encoding='utf-8') as file:
    for item in dataset:
        text = item['story']
jettjaniak (Contributor, Author):

The feature needs to be configurable. If not set, it should default to the only feature there is, or fail if there is more than one.
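
One way that default could be implemented (a sketch, not the code that ended up in the PR; the names resolve_feature, dataset, and feature are just for illustration):

from typing import Optional

from datasets import Dataset


def resolve_feature(dataset: Dataset, feature: Optional[str]) -> str:
    """Return the column to read text from, defaulting to the dataset's only column."""
    if feature is not None:
        return feature
    if len(dataset.column_names) != 1:
        raise ValueError(
            f"dataset has multiple columns {dataset.column_names}; please specify which one to use"
        )
    return dataset.column_names[0]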

Contributor:

added as a parameter

Comment on lines 49 to 64
vocab = {sp_model.id_to_piece(index): index for index in trange(sp_model.GetPieceSize())}
merges = []
for piece_l in tqdm(vocab.keys(), total=sp_model.GetPieceSize()):
    for piece_r in vocab.keys():
        merge = f"{piece_l}{piece_r}"
        piece_id = vocab.get(merge, None)
        if piece_id:
            merges += [(piece_l, piece_r, piece_id)]
merges = sorted(merges, key=lambda val: val[2])
merges = [(val[0], val[1]) for val in merges]
jettjaniak (Contributor, Author):

Any idea what this is doing?

Contributor:

This generates the components HF needs to initialise the SentencePieceBPETokenizer: vocab maps every token to its id, and merges lists all pairs of tokens whose concatenation also exists as a token in the vocabulary (e.g. "hel" + "lo" = "hello").
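
For context, a minimal sketch of how those two pieces are typically handed to the HF tokenizers library (not code from this PR; the toy vocab and merges are made up):

from tokenizers import SentencePieceBPETokenizer

# toy vocab (piece -> id) and merges (left, right), in the same format built above
vocab = {"<unk>": 0, "▁": 1, "h": 2, "e": 3, "l": 4, "o": 5, "he": 6, "ll": 7, "hell": 8, "hello": 9}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

tokenizer = SentencePieceBPETokenizer(vocab=vocab, merges=merges, unk_token="<unk>")
print(tokenizer.encode("hello").tokens)  # inspect how the toy tokenizer splits the text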

@jettjaniak (Contributor, Author):

Looks like you need to set up black; CI is failing on that.

@jettjaniak (Contributor, Author):

Also, could you think about some localized unit tests? Like, we have a pre-defined string as the text to train on, and we check that the resulting tokenizer has the expected vocab and tokenizes text the way we expect.
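
A rough sketch of what such a test could look like (train_tokenizer and its signature are hypothetical stand-ins for whatever scripts/train_tokenizer.py ends up exposing, and the concrete expectations would have to be filled in from an actual run):

# hypothetical import; the real module/function names may differ
from scripts.train_tokenizer import train_tokenizer


def test_tokenizer_on_fixed_text():
    text = "Once upon a time there was a tiny robot."
    tokenizer = train_tokenizer(texts=[text], vocab_size=32)  # hypothetical signature

    # the vocab should be fully determined by the fixed input and vocab size
    assert len(tokenizer.get_vocab()) == 32

    # encoding then decoding should round-trip the training text
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids, skip_special_tokens=True) == text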

@jettjaniak jettjaniak requested a review from joshuawe April 2, 2024 16:49
@jettjaniak jettjaniak marked this pull request as ready for review April 19, 2024 18:28
@jettjaniak jettjaniak merged commit 51a8e57 into main Apr 19, 2024
1 check passed
@jettjaniak jettjaniak deleted the tokenizer_training branch April 19, 2024 18:41