
tokenizer training script #103

Merged: 8 commits merged into main on Apr 19, 2024
Conversation

jettjaniak (Contributor)

No description provided.

@jettjaniak (Contributor, Author) left a comment:

Thanks, looks good!
There is a lot of file juggling that I don't think we need. Let's use the tempfile module to deal with that and remove the files automatically after we're done.

Claude says:

In Python, you can use the tempfile module to create temporary files. The tempfile module provides a secure and convenient way to create temporary files and directories that are automatically deleted when they are no longer needed.

Here are a few ways to create temporary files using the tempfile module:

  1. Using tempfile.NamedTemporaryFile():
import tempfile

with tempfile.NamedTemporaryFile(mode='w+') as temp_file:
    temp_file.write('Some temporary data')
    temp_file.flush()
    # Do something with the temporary file
    # The file will be automatically deleted when the 'with' block ends
  2. Using tempfile.TemporaryFile():
import tempfile

with tempfile.TemporaryFile(mode='w+') as temp_file:
    temp_file.write('Some temporary data')
    temp_file.seek(0)  # Move the file pointer to the beginning
    # Do something with the temporary file
    # The file will be automatically deleted when the 'with' block ends
  3. Using tempfile.mkstemp():
import tempfile
import os

fd, temp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'w') as temp_file:
        temp_file.write('Some temporary data')
    # Do something with the temporary file
finally:
    os.remove(temp_path)  # Explicitly delete the temporary file

In the first two examples, the temporary files are created using a context manager (with block), which ensures that the files are automatically deleted when the block ends, even if an exception occurs.

In the third example, tempfile.mkstemp() returns a file descriptor and the path to the temporary file. You need to explicitly delete the temporary file using os.remove() when you're done with it.

By default, the temporary files are created in the default temporary directory of the operating system. You can specify a different directory by passing the dir parameter to the tempfile functions.

Using the tempfile module is recommended for creating temporary files because it handles the file creation and deletion securely and avoids common pitfalls associated with manual temporary file management.


# push tokenizer to the hub
tokenizer.push_to_hub(
    repo_id="jbrinkma/tokenizer_test",
jettjaniak (Contributor, Author):

should use arguments

Contributor:

Yes. Other scripts use a fixed string, e.g. repo_id = f"{args.username}/v0-token-map" (in map_tokens.py), but I think a specific parameter for the repo_id might make more sense. I will add one, but please let me know if you prefer it the other way.
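
As a rough illustration of that parameter (the flag name and help text below are hypothetical, not necessarily what the PR ends up using):

import argparse

parser = argparse.ArgumentParser(description="train a tokenizer and push it to the Hugging Face hub")
# hypothetical flag; the actual script may name it differently or give it a default
parser.add_argument(
    "--repo-id",
    type=str,
    required=True,
    help="hub repo to push the trained tokenizer to, e.g. 'username/my-tokenizer'",
)
args = parser.parse_args()

# ... train the tokenizer, then:
# tokenizer.push_to_hub(repo_id=args.repo_id)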

Contributor:

Regarding the use of tempfile: previously, I created three local files:

  1. the text file used for training the tokenizer,
  2. the original SentencePiece tokenizer model,
  3. an intermediate file when converting the tokenizers.

I removed (1) and (3), but eliminating (2) doesn't seem possible: SentencePieceTrainer always writes a local model file that has to be handled (https://snyk.io/advisor/python/sentencepiece/functions/sentencepiece.SentencePieceTrainer).
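
One way to keep (2) from leaving anything behind is to point SentencePieceTrainer's model_prefix into a tempfile.TemporaryDirectory, so the .model/.vocab files are removed together with the directory. A minimal sketch, with the corpus path and vocab size as placeholders:

import os
import tempfile

import sentencepiece as spm

with tempfile.TemporaryDirectory() as tmp_dir:
    model_prefix = os.path.join(tmp_dir, "sp")
    # writes sp.model and sp.vocab inside tmp_dir
    spm.SentencePieceTrainer.train(
        input="corpus.txt",  # placeholder: the text file built from the dataset
        model_prefix=model_prefix,
        vocab_size=4096,  # placeholder
    )
    sp_model = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    # ... convert / inspect the tokenizer here, while the files still exist ...
# tmp_dir and everything inside it is deleted when the 'with' block exits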

scripts/train_tokenizer.py (outdated; resolved)
    funct_test: bool = False,
):
    """
    Trains a SentencePiece tokenizer on a dataset.
jettjaniak (Contributor, Author):

What's the difference between SentencePiece and LlamaTokenizerFast?

@jannik-brinkmann (Contributor), Apr 5, 2024:

LlamaTokenizer is just a thin wrapper around SentencePiece with the option to add <eos> and <bos> tokens (see https://github.com/meta-llama/llama/blob/main/llama/tokenizer.py). The ...Fast tokenizers are functionally the same but implemented in Rust, and presumably significantly faster (although I never tested this, tbh).
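
To make the relationship concrete, a small sketch (not code from this PR; "tokenizer.model" is a placeholder for whatever the training script produces):

import sentencepiece as spm
from transformers import LlamaTokenizer

# raw SentencePiece: plain piece ids, no special tokens added
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Once upon a time"))

# LlamaTokenizer wraps the same .model file and can prepend <bos> / append <eos>
tok = LlamaTokenizer(vocab_file="tokenizer.model", add_bos_token=True, add_eos_token=False)
print(tok("Once upon a time")["input_ids"])  # roughly the same ids, with the <bos> id in front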

Comment on lines 31 to 33
train_ds = load_dataset(dataset_name)["train"]
if train_size < 1.0:
    train_ds = train_ds.train_test_split(train_size=train_size)["train"]
jettjaniak (Contributor, Author):

This is randomized, so not good for reproducibility. Can we just take a str split argument that defaults to "train"? If you want to train on a subset, you can then specify "train[:1000]" or "train[:10%]".

Collaborator:

train_test_split takes an optional seed argument

Contributor:

I think I prefer the seed option, but I also like the absolute selection option. I will add a seed parameter and remove the type hint for the train_size argument, since train_test_split accepts both int and float, for absolute and relative selections respectively.
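
For reference, a small sketch of the two options being discussed (the dataset name and sizes are placeholders):

from datasets import load_dataset

# option A: deterministic slicing via the split string
train_ds = load_dataset("roneneldan/TinyStories", split="train[:10%]")

# option B: random subsampling, made reproducible with an explicit seed;
# train_size can be a float (fraction) or an int (absolute number of rows)
full_ds = load_dataset("roneneldan/TinyStories", split="train")
train_ds = full_ds.train_test_split(train_size=0.1, seed=42)["train"]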

scripts/train_tokenizer.py (outdated; resolved)
Comment on lines 35 to 37
tokenizer_model_path = get_tokenizer_model_path(
    vocab_size=vocab_size,
)
jettjaniak (Contributor, Author):

I don't think this should be a separate function

Contributor:

removed

scripts/train_tokenizer.py (outdated; resolved)
text_file = os.path.join(cache_dir, "text.txt")
with open(text_file, 'w', encoding='utf-8') as file:
    for item in dataset:
        text = item['story']
jettjaniak (Contributor, Author):

The feature needs to be configurable. If not set, it should default to the only feature there is, or fail if there is more than one.
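
One way that default could be implemented (a sketch, not the code that ended up in the PR; the names resolve_feature, dataset, and feature are just for illustration):

from typing import Optional

from datasets import Dataset


def resolve_feature(dataset: Dataset, feature: Optional[str]) -> str:
    """Return the column to read text from, defaulting to the dataset's only column."""
    if feature is not None:
        return feature
    if len(dataset.column_names) != 1:
        raise ValueError(
            f"dataset has multiple columns {dataset.column_names}; please specify which one to use"
        )
    return dataset.column_names[0]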

Contributor:

added as a parameter

Comment on lines 49 to 64
vocab = {sp_model.id_to_piece(index): index for index in trange(sp_model.GetPieceSize())}
merges = []
for piece_l in tqdm(vocab.keys(), total=sp_model.GetPieceSize()):
    for piece_r in vocab.keys():
        merge = f"{piece_l}{piece_r}"
        piece_id = vocab.get(merge, None)
        if piece_id:
            merges += [(piece_l, piece_r, piece_id)]
merges = sorted(merges, key=lambda val: val[2])
merges = [(val[0], val[1]) for val in merges]
jettjaniak (Contributor, Author):

Any idea what this is doing?

Contributor:

This generates the components HF needs to initialise the SentencePieceBPETokenizer: vocab maps every token to its id, and merges lists all pairs of tokens whose concatenation also exists as a token in the vocabulary (e.g. "hel" + "lo" = "hello").
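
For context, a minimal sketch of how those two pieces are typically handed to the HF tokenizers library (not code from this PR; the toy vocab and merges are made up):

from tokenizers import SentencePieceBPETokenizer

# toy vocab (piece -> id) and merges (left, right), in the same format built above
vocab = {"<unk>": 0, "▁": 1, "h": 2, "e": 3, "l": 4, "o": 5, "he": 6, "ll": 7, "hell": 8, "hello": 9}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

tokenizer = SentencePieceBPETokenizer(vocab=vocab, merges=merges, unk_token="<unk>")
print(tokenizer.encode("hello").tokens)  # inspect how the toy tokenizer splits the text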

@jettjaniak (Contributor, Author):

Looks like you need to set up black; CI is failing on that.

@jettjaniak (Contributor, Author):

Also, could you think about some localized unit tests? Like, we have a pre-defined string as the text to train on, and we check that the resulting tokenizer has the expected vocab and tokenizes text the way we expect.
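
A rough sketch of what such a test could look like (train_tokenizer and its signature are hypothetical stand-ins for whatever scripts/train_tokenizer.py ends up exposing, and the concrete expectations would have to be filled in from an actual run):

# hypothetical import; the real module/function names may differ
from scripts.train_tokenizer import train_tokenizer


def test_tokenizer_on_fixed_text():
    text = "Once upon a time there was a tiny robot."
    tokenizer = train_tokenizer(texts=[text], vocab_size=32)  # hypothetical signature

    # the vocab should be fully determined by the fixed input and vocab size
    assert len(tokenizer.get_vocab()) == 32

    # encoding then decoding should round-trip the training text
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids, skip_special_tokens=True) == text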

@jettjaniak jettjaniak requested a review from joshuawe April 2, 2024 16:49
@jettjaniak jettjaniak marked this pull request as ready for review April 19, 2024 18:28
@jettjaniak jettjaniak merged commit 51a8e57 into main Apr 19, 2024
1 check passed
@jettjaniak jettjaniak deleted the tokenizer_training branch April 19, 2024 18:41