DAY 1

Project TODO

  1. Extend the context length of a small model, starting from a few good contenders (see Key Points below).

  2. Generate a chain-of-thought (CoT) dataset leveraging the high context window.

Key Points:

  • TinyLLaMA 1B, OpenELM 1B, Danube 1B, Mistral-7B, GPT-3.5-Turbo, Mamba, and Recurrent Memory Transformers (RMT) are good candidates for extending context length.
  • OpenELM 1B has a maximum context length of 1024 tokens, while Danube 1B supports up to 8192 tokens.
  • Mistral-7B and GPT-3.5-Turbo have context lengths of 32K and 16K tokens respectively.
  • Generating a chain-of-thought dataset can leverage the high context window capabilities of these models.

Huge inspiration comes from this blog: kaiokendev's RoPE scaling implementation.

Note:

  • Actually, scrap those models; they already ship with rotary position embeddings (RoPE) enabled by default. We will add RoPE to GPT-2 instead (a rough sketch follows this list).
  • Source: We'll build our dataset from this.
  • Colab Notebook: Current work.
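GPT-2 uses learned absolute position embeddings, so RoPE has to be bolted onto its attention via a monkey patch. Below is a minimal sketch of what such a patch could look like; it is not the notebook's actual code. It assumes an older transformers version where GPT2Attention still exposes the `_attn` method, assumes training without a KV cache, and the `rope_scale` value and helper names are illustrative.

```python
import torch
from transformers.models.gpt2.modeling_gpt2 import GPT2Attention

def build_rope_cache(seq_len, head_dim, base=10000.0, rope_scale=1.0, device="cpu"):
    # Pairwise frequencies; rope_scale > 1 stretches positions
    # (kaiokendev-style linear interpolation) to cover longer contexts.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    t = torch.arange(seq_len, device=device).float() / rope_scale
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim / 2)
    emb = torch.cat((freqs, freqs), dim=-1)   # (seq_len, head_dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (batch, heads, seq_len, head_dim)
    cos, sin = cos[None, None, :, :], sin[None, None, :, :]
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

_orig_attn = GPT2Attention._attn  # keep a handle on the original implementation

def _attn_with_rope(self, query, key, value, attention_mask=None, head_mask=None):
    # Assumes no KV cache, so query and key share the same sequence length.
    seq_len, head_dim = query.size(-2), query.size(-1)
    cos, sin = build_rope_cache(seq_len, head_dim, rope_scale=4.0, device=query.device)
    query, key = apply_rope(query, key, cos, sin)
    return _orig_attn(self, query, key, value, attention_mask, head_mask)

GPT2Attention._attn = _attn_with_rope  # the monkey patch
```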

Current Training:

  • Training on the LIMA dataset.
  • After five attempts at full fine-tuning, all of which ran out of memory (OOM), I'm going to try QLoRA (see the sketch below).
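For reference, a minimal QLoRA setup with bitsandbytes and peft could look like the following sketch. The base model, LoRA rank, and other hyperparameters are assumptions for illustration, not the exact values used in the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization so the base weights fit in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on GPT-2's fused QKV projection; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```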

The model did fine-tune, but its outputs are incoherent; maybe I should try LLaMA instead of GPT-2.

DAY 2

References:

DAY 3

Clear Steps:

  1. Apply monkey patch for rope embeddings.
  2. Add the EOS token as the pad token (the GPT-2 tokenizer doesn't have one by default) and increase the lm_head size by one because the vocabulary grew by 1 (see the sketch after this list).
  3. Format and tokenize long Alpaca for training.
  4. Pass model and data to the trainer.
  5. Train?
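A minimal sketch of steps 2-5, assuming the RoPE patch from Day 1 has already been applied and that "long Alpaca" refers to a LongAlpaca-style dataset such as Yukang/LongAlpaca-12k; the dataset name, prompt format, and max_length are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # step 2: reuse EOS as pad

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))        # only changes size if a new token was added

def format_and_tokenize(example):                    # step 3: format + tokenize
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}")
    # max_length assumes the patched model's position limit was raised past GPT-2's default 1024
    return tokenizer(text, truncation=True, max_length=4096)

dataset = load_dataset("Yukang/LongAlpaca-12k", split="train").map(format_and_tokenize)

trainer = Trainer(                                   # step 4: hand model + data to the trainer
    model=model,
    args=TrainingArguments(output_dir="gpt2-long", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                                      # step 5: train
```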

Alternative Method:

  1. Monkey-patch, change the config, and upload the model to Hugging Face (override the PretrainedConfig class in config.py); see the sketch below.
  2. Directly download and train with Trainer API.
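A minimal sketch of that config override, assuming a subclass of GPT2Config is enough to carry the RoPE settings; the class name, the rope_scale attribute, and the Hub repo name are illustrative assumptions.

```python
from transformers import GPT2Config, GPT2LMHeadModel

class RopeGPT2Config(GPT2Config):
    # Custom config type so the RoPE settings travel with the checkpoint.
    model_type = "rope-gpt2"

    def __init__(self, rope_scale=4.0, max_position_embeddings=4096, **kwargs):
        super().__init__(max_position_embeddings=max_position_embeddings, **kwargs)
        self.rope_scale = rope_scale

config = RopeGPT2Config.from_pretrained("gpt2", rope_scale=4.0)
model = GPT2LMHeadModel(config)   # or load pretrained GPT-2 weights and patch them

# Push both to the Hub so step 2 becomes a plain from_pretrained + Trainer run
# ("your-username/rope-gpt2" is a placeholder repo name):
# config.push_to_hub("your-username/rope-gpt2")
# model.push_to_hub("your-username/rope-gpt2")
```

With the config and weights on the Hub, step 2 reduces to from_pretrained plus the Trainer setup sketched under Clear Steps above.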