Extend context length with good contenders:
- TinyLLaMA 1B (4096 ctx length)
- OpenELM 1B (1024 ctx length)
- Danube 1B (8192 ctx length)
Generate a chain-of-thought (CoT) dataset leveraging the high context window.
- Mamba and Recurrent Memory Transformers (RMT) are also worth considering for extending context length.
- Mistral-7B and GPT-3.5-Turbo have context lengths of 32K and 16K tokens respectively.
- Generating the CoT dataset can take advantage of these long context windows (rough sketch below).
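A minimal sketch of what that generation loop could look like, assuming we bootstrap rationales from a stronger instruction-tuned model. The model id, prompt template, and record fields below are my own placeholders, not choices from these notes.

```python
# Hypothetical CoT bootstrapping loop; model id and prompt template are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

questions = [
    "A train covers 60 km in 45 minutes. What is its average speed in km/h?",
]

cot_dataset = []
for q in questions:
    prompt = f"Question: {q}\nLet's think step by step."
    out = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
    # Keep the full question + rationale pair; long rationales are the point,
    # since the extended context window lets us train on them unchunked.
    cot_dataset.append({"question": q, "rationale": out[0]["generated_text"]})
```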
Huge inspiration comes from this blog: kaiokendev's RoPE implementation.
- Actually, scrap the models above; they already use rotary position embeddings by default. We'll do it for GPT-2 instead.
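For reference, here's a minimal sketch of the piece the monkey patch has to add: rotary embeddings with kaiokendev-style linear position interpolation, to be applied to the query/key tensors inside GPT-2's attention. The class and function names, the scaling factor, and the 4096-position target are my own placeholders, not code from the repo.

```python
import torch

class ScaledRotaryEmbedding(torch.nn.Module):
    """RoPE cos/sin table with linear position interpolation."""
    def __init__(self, head_dim, max_positions=4096, base=10000, scaling_factor=4.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        # kaiokendev's trick: squeeze the longer position range back into the
        # range the base frequencies were designed for (e.g. 4096 -> 1024).
        positions = torch.arange(max_positions).float() / scaling_factor
        freqs = torch.outer(positions, inv_freq)   # (max_positions, head_dim / 2)
        emb = torch.cat((freqs, freqs), dim=-1)    # (max_positions, head_dim)
        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())

    def forward(self, seq_len):
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (batch, heads, seq_len, head_dim); cos/sin broadcast over batch and heads.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```

In a patched GPT2Attention.forward, apply_rope would be called on the query/key tensors right before the attention scores are computed; whether to keep or zero out GPT-2's learned absolute position embeddings alongside it is a design choice the experiments would have to settle.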
- Source: We'll build our dataset from this.
- Colab Notebook: Current work.
- Using LIMA to train.
- After 5 tries at full fine-tuning, all ending in OOM, I'm going to try QLoRA.
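A hedged sketch of the QLoRA setup I'd reach for here: 4-bit NF4 quantization via bitsandbytes plus LoRA adapters on GPT-2's attention projection. The rank, alpha, dropout, and target-module choice are assumptions, not tuned values from these runs.

```python
# QLoRA setup sketch; hyperparameters are assumptions, not the values actually used.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused q/k/v projection
    fan_in_fan_out=True,        # GPT-2 uses Conv1D layers, not nn.Linear
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```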
The model fine-tuned, but the outputs are incoherent; maybe I should try LLaMA instead of GPT-2.
- Monkey Patch Script
- GitHub Repository
- Extending GPT-2's context length via RoPE scaling:
- Training Run: This supposedly failed for obvious reasons.
- Should try with the 8B-parameter model: Source.
- Apply the monkey patch for RoPE embeddings.
- Add a pad token (the GPT-2 tokenizer doesn't have one by default) and increase the lm_head size by one, since the vocab grows by 1.
- Format and tokenize LongAlpaca for training (see the sketch after these steps).
- Pass model and data to the trainer.
- Train?
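A rough end-to-end sketch of the last four steps (pad token, formatting/tokenization, Trainer), assuming the RoPE patch has already been applied to the model. The dataset id, column names, prompt format, and hyperparameters are placeholders.

```python
# Sketch of the pad-token / tokenize / Trainer steps; dataset id, column names,
# and hyperparameters are assumptions, not the actual run configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # RoPE monkey patch applied beforehand

# GPT-2 ships no pad token; adding one grows the vocab by 1, so the embeddings
# (and the tied lm_head) have to be resized to match.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

def format_and_tokenize(example):
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=4096)

dataset = load_dataset("Yukang/LongAlpaca-12k", split="train")  # assumed dataset id
tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-rope-longctx",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```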
- Monkey-patch, update the config, and upload the model to Hugging Face (override the PretrainedConfig class in config.py; rough sketch below).
- Directly download it and train with the Trainer API.
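A loose sketch of the config-override-and-upload idea: a GPT2Config subclass (the config.py that would ship with the checkpoint) carrying the RoPE-scaling fields so the patched model can be reloaded later. Class name, field names, and the repo id are placeholders.

```python
# Hypothetical config.py for the patched model; names and values are placeholders.
from transformers import GPT2Config, GPT2LMHeadModel

class RopeGPT2Config(GPT2Config):
    """GPT2Config plus the fields the RoPE monkey patch needs at load time."""
    def __init__(self, rope_scaling_factor=4.0, rope_theta=10000, **kwargs):
        self.rope_scaling_factor = rope_scaling_factor
        self.rope_theta = rope_theta
        super().__init__(**kwargs)

config = RopeGPT2Config.from_pretrained("gpt2", n_positions=4096)
model = GPT2LMHeadModel.from_pretrained(
    "gpt2", config=config, ignore_mismatched_sizes=True  # position table grows to 4096
)
# ...apply the RoPE monkey patch and fine-tune before uploading...
model.push_to_hub("your-username/gpt2-rope-4096")  # placeholder repo id
```

The extra fields end up in config.json, so the downloaded checkpoint should carry enough information for the patch to reconstruct the scaled RoPE when it's trained with the Trainer API.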