dataset tokenization script #7

jettjaniak · 2024-01-26T12:07:05Z

similar to "training script for mamba & llama" (#3): I imagine this will import some utils from llama2.c and build around them
the short script itself should be in scripts/
waiting for @woog97 to confirm what we should do with BOS/EOS
I would like it to pull our v2 text dataset from HF, instead of building it on the fly
- reduces chance of errors
either save to file or upload to hf (depending on cli args)
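
A minimal sketch of the CLI surface described above, assuming the two output modes are mutually exclusive. The flag names (`--dataset`, `--output-file`, `--hf-repo`) are hypothetical and not taken from the repo; loading, tokenizing and uploading are deliberately left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser for a tokenization script (illustrative only)."""
    parser = argparse.ArgumentParser(description="Tokenize a text dataset (sketch).")
    # Hypothetical flag: which pre-built text dataset to pull from HF.
    parser.add_argument("--dataset", required=True, help="name of the HF text dataset")
    # Assumption: exactly one destination is chosen per run.
    dest = parser.add_mutually_exclusive_group(required=True)
    dest.add_argument("--output-file", help="save tokenized data to this local path")
    dest.add_argument("--hf-repo", help="upload tokenized data to this HF repo")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Making the destination group required means the script fails early if neither output is given, instead of tokenizing everything and then having nowhere to write it.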


woog97 · 2024-01-26T14:23:58Z

Checked with eleuther. We should train with BOS to start every sample. In practice this would look like:

```
<BOS>tinystory1<EOS>tinystory2<EOS>tiny
<BOS>story3<EOS>tinystory4<EOS>tinystor
<BOS>y5<EOS>tinystory6<EOS>...
```
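
The layout above can be sketched as a small packing function: every story is followed by EOS, the resulting token stream is cut into fixed-size pieces, and each piece is prefixed with BOS, so stories may be split across sequence boundaries. This is only an illustration under assumptions: `pack_sequences`, `seq_len`, `bos_id` and `eos_id` are hypothetical names, stories are assumed to be already tokenized into lists of ints, and dropping the incomplete tail is a choice made here, not something decided in this thread.

```python
from typing import Iterable, Iterator


def pack_sequences(
    stories: Iterable[list[int]], seq_len: int, bos_id: int, eos_id: int
) -> Iterator[list[int]]:
    """Yield sequences of exactly seq_len tokens, each starting with BOS."""
    if seq_len < 2:
        raise ValueError("seq_len must be at least 2")
    body_len = seq_len - 1  # one slot per sequence is reserved for BOS
    buffer: list[int] = []
    for story in stories:
        # Append the story followed by EOS to the running token stream.
        buffer.extend(story)
        buffer.append(eos_id)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= body_len:
            yield [bos_id] + buffer[:body_len]
            buffer = buffer[body_len:]
    # Assumption: leftover tokens that do not fill a sequence are dropped.


if __name__ == "__main__":
    toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for seq in pack_sequences(toy, seq_len=5, bos_id=100, eos_id=200):
        print(seq)
```

With the toy input, the second story's EOS and the start of the third story end up in the same sequence, matching the way `tinystory3` is split across two lines in the example.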

jettjaniak · 2024-02-17T04:28:33Z

duplicate, taken care of in #31

jettjaniak added the feature and dataset labels Jan 26, 2024

jettjaniak assigned woog97 and jannik-brinkmann Jan 26, 2024

jettjaniak closed this as not planned Feb 17, 2024
