dataset tokenization script #7

Closed
jettjaniak opened this issue Jan 26, 2024 · 2 comments
Labels
dataset feature New feature or request

Comments

@jettjaniak

  • similar to the training script for mamba & llama (#3), I imagine this will import some utils from llama2.c and build around them
  • the short script itself should be in scripts/
  • waiting for @woog97 to confirm what we should do with BOS/EOS
  • I would like it to pull our v2 text dataset from HF, instead of building it on the fly
    • reduces chance of errors
  • either save to a file or upload to HF (depending on CLI args)
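
A rough sketch of how the CLI described above could be shaped, assuming the Hugging Face `datasets` library is used for both pulling and uploading. The flag names (`--dataset`, `--output-dir`, `--hf-repo`) and the dataset id passed on the command line are placeholders, not names from this repo, and the tokenization step itself is left out.

```python
import argparse


def parse_args(argv=None):
    """Parse CLI args; exactly one output target (local dir or HF repo) is required."""
    parser = argparse.ArgumentParser(
        description="Tokenize a text dataset pulled from the HF Hub."
    )
    # Hypothetical flag: the Hub id of the prebuilt text dataset.
    parser.add_argument("--dataset", required=True, help="HF Hub dataset id")
    parser.add_argument("--split", default="train", help="dataset split to load")
    # Hypothetical flags: save locally or upload, never both.
    out = parser.add_mutually_exclusive_group(required=True)
    out.add_argument("--output-dir", help="save the result to this local directory")
    out.add_argument("--hf-repo", help="upload the result to this HF Hub repo")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported lazily so argument parsing works without the dependency installed.
    from datasets import load_dataset

    ds = load_dataset(args.dataset, split=args.split)
    # The tokenization step would go here (for example via ds.map); omitted in this sketch.
    if args.output_dir:
        ds.save_to_disk(args.output_dir)
    else:
        ds.push_to_hub(args.hf_repo)
```

Making the two output flags mutually exclusive keeps "save to file" and "upload to HF" as an explicit either/or choice, so a run cannot silently do neither.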
@woog97

woog97 commented Jan 26, 2024

Checked with Eleuther. We should train with BOS at the start of every sample and EOS after each story, so stories can run over from one sample into the next. In practice this would look like:

```
<BOS>tinystory1<EOS>tinystory2<EOS>tiny
<BOS>story3<EOS>tinystory4<EOS>tinystor
<BOS>y5<EOS>tinystory6<EOS>...
```
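
The layout above could be produced roughly as follows. This is a minimal sketch, assuming the stories are already tokenized into lists of integer ids; `pack_samples` and its parameters are hypothetical names, and dropping the incomplete final sample is an assumption rather than something decided in this thread.

```python
from typing import Iterable, List


def pack_samples(
    stories: Iterable[List[int]], bos_id: int, eos_id: int, context_len: int
) -> List[List[int]]:
    """Pack tokenized stories into fixed-length samples that each start with BOS."""
    # Build one continuous token stream: every story is followed by EOS.
    stream: List[int] = []
    for story in stories:
        stream.extend(story)
        stream.append(eos_id)

    # Each sample is BOS plus the next (context_len - 1) tokens of the stream,
    # so a story may be cut at a sample boundary and continue in the next one.
    chunk = context_len - 1
    samples: List[List[int]] = []
    for start in range(0, len(stream), chunk):
        piece = stream[start:start + chunk]
        if len(piece) < chunk:
            break  # assumption: drop the incomplete final sample
        samples.append([bos_id] + piece)
    return samples
```

For example, with `bos_id=1`, `eos_id=2` and `context_len=5`, the stories `[[10, 11, 12], [13, 14], [15, 16, 17, 18]]` pack into `[1, 10, 11, 12, 2]`, `[1, 13, 14, 2, 15]` and `[1, 16, 17, 18, 2]`: the third story starts in the second sample and finishes in the third.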

@jettjaniak

duplicate, taken care of in #31

@jettjaniak closed this as not planned on Feb 17, 2024