This repository contains code to filter data from existing corpora based on child vocabulary and to train small language models on the filtered data.
git clone [email protected]:text-machine-lab/mini_gpt.git
cd mini_gpt
pip install -r requirements.txt
The tokenizer used for filtering comes from filter_vocab_cpp.
Compile the C++-based filtration code and copy the resulting object file to the src directory.
The vocabulary used for simplification of the pre-training data can be found in data/AOChildes_word_frequency.csv.
This vocabulary is based on child-directed speech transcripts, which can be found here.
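As a rough illustration of how this vocabulary can be used for filtering (a sketch, not the repository's own code; the "word" column name is an assumption and may need to be adjusted to the actual CSV header):

```python
# Illustrative sketch: build a vocabulary set from the AO-CHILDES
# word-frequency file and check whether a text stays within it.
import pandas as pd

freq = pd.read_csv("data/AOChildes_word_frequency.csv")
vocab = set(freq["word"].astype(str).str.lower())  # "word" column is an assumption

def in_child_vocab(text: str) -> bool:
    """Return True if every alphabetic token of `text` is in the child vocabulary."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return all(t in vocab for t in tokens)

print(in_child_vocab("the cat sat on the mat"))
```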
To download the SlimPajama dataset:
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
Chunks 1-10 are downloaded when this command is run.
For gathering the unfiltered dataset:
python SlimPajama_unfiltered.py
For filtering the dataset, one chunk at a time:
python SlimPajama_filtering.py --chunk_id 1
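To process all ten chunks, the filtering script can be run in a loop, for example (a sketch; only the script name and the --chunk_id flag are taken from the command above):

```python
# Sketch: run the filtering script over SlimPajama chunks 1-10 sequentially.
import subprocess

for chunk_id in range(1, 11):
    subprocess.run(
        ["python", "SlimPajama_filtering.py", "--chunk_id", str(chunk_id)],
        check=True,
    )
```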
The pre-training data, which consists of the vocabulary-filtered SlimPajama dataset, can be found here: 22B and 2.1B.
To create a tokenizer:
python create_tokenizer.py --dataset_path ./dataset \
--vocab_size 15_000 \
--save_dir ./tokenizer
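As a quick sanity check, the saved tokenizer can be loaded and applied to a sample sentence. The sketch below assumes create_tokenizer.py writes a Hugging Face tokenizers JSON file named tokenizer.json, which is an assumption about the save format:

```python
# Load the trained tokenizer and encode a sample sentence.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizer/tokenizer.json")
enc = tok.encode("the cat sat on the mat")
print(enc.tokens)
print(enc.ids)
```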
To pre-train a model:
python -u -m accelerate.commands.launch main.py \
--lr 2.8e-3 --num_warmup_steps 1000 --num_layers 8 \
--hidden_size 32 --use_tokenizer filtered \
--chkpt_dir ../models/SlimPajama_Nov23_context128_vocab_21k/filtered/hidden_32_num_layer_8_int_128 \
--int_size 128 --rope_theta 20
For creating the minigpt dataset, counting the number of tokens in the filtered dataset, and analyzing the filtered dataset, use the notebook notebooks/2.0-dataset-statistics.ipynb
For applying position interpolation to pre-trained models, use the notebook
notebooks/1.0-rope-pi.ipynb
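For reference, position interpolation extends a model trained with context length L_train to a longer context L_target by rescaling positions by L_train / L_target before computing the rotary angles. The sketch below is illustrative, not the notebook's code; the function name is made up, and theta=20 is taken from the training command above:

```python
# Sketch of RoPE angles with position interpolation (PI).
import torch

def rope_angles(positions, dim, theta=10000.0, pi_scale=1.0):
    """Rotary angles for `positions`; pi_scale < 1 applies position interpolation."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    scaled = positions.float() * pi_scale      # position interpolation step
    return torch.outer(scaled, inv_freq)       # shape: (seq_len, dim // 2)

L_train, L_target = 128, 512
angles = rope_angles(torch.arange(L_target), dim=32, theta=20.0,
                     pi_scale=L_train / L_target)
print(angles.shape)
```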
To filter downstream evaluation datasets based on the AO-Childes vocabulary, use the notebook
notebooks/4.0-dataset_filtering.ipynb
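The filtering criterion mirrors the pre-training filter: keep only examples whose words all appear in the AO-Childes vocabulary. A rough sketch with the Hugging Face datasets library follows; the dataset, its text field, and the CSV column name are placeholders, not the notebook's actual configuration:

```python
# Sketch: drop evaluation examples containing out-of-vocabulary words.
import pandas as pd
from datasets import load_dataset

freq = pd.read_csv("data/AOChildes_word_frequency.csv")
vocab = set(freq["word"].astype(str).str.lower())  # "word" column is an assumption

ds = load_dataset("glue", "sst2", split="validation")  # placeholder dataset

def keep(example):
    tokens = [t.lower() for t in example["sentence"].split() if t.isalpha()]
    return all(t in vocab for t in tokens)

filtered = ds.filter(keep)
print(len(ds), "->", len(filtered))
```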
To get generations from the pre-trained and baseline models, use
notebooks/3.0-model-generations.ipynb
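Outside the notebook, generations can also be sampled directly with the transformers API, assuming the checkpoints are saved in Hugging Face format (an assumption; the path below reuses the --chkpt_dir from the training command):

```python
# Sketch: sample a continuation from a trained checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "../models/SlimPajama_Nov23_context128_vocab_21k/filtered/hidden_32_num_layer_8_int_128"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

inputs = tok("The little dog", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```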
If you use this code, please cite:
@misc{muckatira2024emergent,
title={Emergent Abilities in Reduced-Scale Generative Language Models},
author={Sherin Muckatira and Vijeta Deshpande and Vladislav Lialin and Anna Rumshisky},
year={2024},
eprint={2404.02204},
archivePrefix={arXiv},
primaryClass={cs.CL}
}