This repository contains code to filter data from existing corpora based on child vocabulary and to train small language models on the filtered data.
git clone [email protected]:text-machine-lab/mini_gpt.git
cd mini_gpt
pip install -r requirements.txt
The tokenizer used for filtering comes from filter_vocab_cpp.
Compile the C++-based filtration code and copy the resulting object file to the src directory.
The vocabulary used for simplification of the pre-training data can be found in data/AOChildes_word_frequency.csv.
This vocabulary is based on child-directed speech transcripts, which can be found here.
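As a rough illustration of how this vocabulary can be used for filtering (a sketch, not the repository's own code; the "word" column name is an assumption and may need to be adjusted to the actual CSV header):

```python
# Illustrative sketch: build a vocabulary set from the AO-CHILDES
# word-frequency file and check whether a text stays within it.
import pandas as pd

freq = pd.read_csv("data/AOChildes_word_frequency.csv")
vocab = set(freq["word"].astype(str).str.lower())  # "word" column is an assumption

def in_child_vocab(text: str) -> bool:
    """Return True if every alphabetic token of `text` is in the child vocabulary."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return all(t in vocab for t in tokens)

print(in_child_vocab("the cat sat on the mat"))
```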
To download the SlimPajama dataset:
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
Chunks 1-10 are downloaded when this command is run.
For gathering the unfiltered dataset:
python SlimPajama_unfiltered.py
For filtering the dataset, one chunk at a time:
python SlimPajama_filtering.py --chunk_id 1
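To process all ten chunks, the filtering script can be run in a loop, for example (a sketch; only the script name and the --chunk_id flag are taken from the command above):

```python
# Sketch: run the filtering script over SlimPajama chunks 1-10 sequentially.
import subprocess

for chunk_id in range(1, 11):
    subprocess.run(
        ["python", "SlimPajama_filtering.py", "--chunk_id", str(chunk_id)],
        check=True,
    )
```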
The pre-training data, which consists of the vocabulary-filtered SlimPajama dataset, can be found here: 22B and 2.1B.
To create a tokenizer:
python create_tokenizer.py --dataset_path ./dataset \
--vocab_size 15_000 \
--save_dir ./tokenizer
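As a quick sanity check, the saved tokenizer can be loaded and applied to a sample sentence. The sketch below assumes create_tokenizer.py writes a Hugging Face tokenizers JSON file named tokenizer.json, which is an assumption about the save format:

```python
# Load the trained tokenizer and encode a sample sentence.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizer/tokenizer.json")
enc = tok.encode("the cat sat on the mat")
print(enc.tokens)
print(enc.ids)
```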
To pre-train a model:
python -u -m accelerate.commands.launch main.py \
--lr 2.8e-3 --num_warmup_steps 1000 --num_layers 8 \
--hidden_size 32 --use_tokenizer filtered \
--chkpt_dir ../models/SlimPajama_Nov23_context128_vocab_21k/filtered/hidden_32_num_layer_8_int_128 \
--int_size 128 --rope_theta 20
For creating the minigpt dataset, counting the number of tokens in the filtered dataset, and analyzing the filtered dataset, use the notebook notebooks/2.0-dataset-statistics.ipynb
For applying position interpolation to pre-trained models, use the notebook
notebooks/1.0-rope-pi.ipynb
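For reference, position interpolation extends a model trained with context length L_train to a longer context L_target by rescaling positions by L_train / L_target before computing the rotary angles. The sketch below is illustrative, not the notebook's code; the function name is made up, and theta=20 is taken from the training command above:

```python
# Sketch of RoPE angles with position interpolation (PI).
import torch

def rope_angles(positions, dim, theta=10000.0, pi_scale=1.0):
    """Rotary angles for `positions`; pi_scale < 1 applies position interpolation."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    scaled = positions.float() * pi_scale      # position interpolation step
    return torch.outer(scaled, inv_freq)       # shape: (seq_len, dim // 2)

L_train, L_target = 128, 512
angles = rope_angles(torch.arange(L_target), dim=32, theta=20.0,
                     pi_scale=L_train / L_target)
print(angles.shape)
```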
To filter downstream evaluation datasets based on the AO-Childes vocabulary, use the notebook
notebooks/4.0-dataset_filtering.ipynb
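The filtering criterion mirrors the pre-training filter: keep only examples whose words all appear in the AO-Childes vocabulary. A rough sketch with the Hugging Face datasets library follows; the dataset, its text field, and the CSV column name are placeholders, not the notebook's actual configuration:

```python
# Sketch: drop evaluation examples containing out-of-vocabulary words.
import pandas as pd
from datasets import load_dataset

freq = pd.read_csv("data/AOChildes_word_frequency.csv")
vocab = set(freq["word"].astype(str).str.lower())  # "word" column is an assumption

ds = load_dataset("glue", "sst2", split="validation")  # placeholder dataset

def keep(example):
    tokens = [t.lower() for t in example["sentence"].split() if t.isalpha()]
    return all(t in vocab for t in tokens)

filtered = ds.filter(keep)
print(len(ds), "->", len(filtered))
```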
To get generations from the pre-trained and baseline models, use
notebooks/3.0-model-generations.ipynb
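Outside the notebook, generations can also be sampled directly with the transformers API, assuming the checkpoints are saved in Hugging Face format (an assumption; the path below reuses the --chkpt_dir from the training command):

```python
# Sketch: sample a continuation from a trained checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "../models/SlimPajama_Nov23_context128_vocab_21k/filtered/hidden_32_num_layer_8_int_128"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

inputs = tok("The little dog", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```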
If you use this code, please cite:
@misc{muckatira2024emergent,
title={Emergent Abilities in Reduced-Scale Generative Language Models},
author={Sherin Muckatira and Vijeta Deshpande and Vladislav Lialin and Anna Rumshisky},
year={2024},
eprint={2404.02204},
archivePrefix={arXiv},
primaryClass={cs.CL}
}