- This is a 4-week course that helps you understand Hugging Face's rich ecosystem and develop LLMs.
- Slides
- Notebook
- Loading datasets in different formats
- Understand the Structure of the loaded dataset
- Access and manipulate samples in the dataset
- Concatenate
- Interleave
- Map
- Filter
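
The dataset operations listed above can be sketched with the Hugging Face `datasets` library roughly as follows; the file names, the `text` column, and the mixing probabilities are illustrative assumptions, not the course notebook's actual inputs.

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Loading datasets in different formats (CSV and JSON shown as examples).
csv_ds = load_dataset("csv", data_files="train.csv", split="train")
json_ds = load_dataset("json", data_files="train.json", split="train")

# Understand the structure: features (column names/types) and number of rows.
print(csv_ds.features, csv_ds.num_rows)

# Access and manipulate samples.
print(csv_ds[0])           # first sample as a dict
print(csv_ds["text"][:3])  # first three values of the "text" column

# Concatenate datasets that share identical features.
combined = concatenate_datasets([csv_ds, json_ds])

# Interleave datasets, optionally with sampling probabilities.
mixed = interleave_datasets([csv_ds, json_ds], probabilities=[0.7, 0.3], seed=42)

# Map: apply a transformation to every sample (batched for speed).
lowercased = combined.map(
    lambda batch: {"text": [t.lower() for t in batch["text"]]}, batched=True
)

# Filter: keep only samples that satisfy a predicate.
long_only = lowercased.filter(lambda sample: len(sample["text"]) > 100)
```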
- Slides
- Notebook
- Set up a tokenization pipeline
- Train the tokenizer
- Encode the input samples (single or batch)
- Test the implementation
- Save and load the tokenizer
- Decode the token_ids
- Wrap the tokenizer with the `PreTrainedTokenizer` class
- Save the pre-trained tokenizer
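
A minimal sketch of this tokenizer pipeline using the `tokenizers` library with a BPE model; the corpus path, vocabulary size, and special tokens are placeholder assumptions, and the wrapper shown is `PreTrainedTokenizerFast`, the fast variant of the `PreTrainedTokenizer` interface.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: BPE model with byte-level pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on a plain-text corpus (path is a placeholder).
trainer = trainers.BpeTrainer(
    vocab_size=30_000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode input samples (single or batch) and test the implementation.
single = tokenizer.encode("Hello, Hugging Face!")
batch = tokenizer.encode_batch(["first sample", "second sample"])
print(single.tokens, single.ids)

# Decode the token_ids back to text.
print(tokenizer.decode(single.ids))

# Save and load the raw tokenizer.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

# Wrap the tokenizer for use with transformers and save the pre-trained tokenizer.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=reloaded,
    unk_token="[UNK]",
    pad_token="[PAD]",
    bos_token="[BOS]",
    eos_token="[EOS]",
)
hf_tokenizer.save_pretrained("my-tokenizer")
```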
- Slides
- Notebook
- Download the model checkpoints: Here
- Set up the training pipeline
- Dataset: BookCorpus
- Number of tokens: 1.08 billion
- Tokenizer: GPT-2 tokenizer
- Model: GPT-2 with a CLM (causal language modeling) head
- Optimizer: AdamW
- Parallelism: DDP (with L4 GPUs)
- Train the model on
- A100 80 GB single GPU
- L4 48 GB single node, multiple GPUs
- V100 32 GB single GPU
- Training report on wandb
- Text Generation
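
A condensed sketch of the pretraining pipeline above (GPT-2 tokenizer, GPT-2 with a CLM head trained from scratch on BookCorpus, AdamW via the default `Trainer` optimizer, wandb reporting); the dataset slice, sequence length, and hyperparameters are illustrative assumptions, not the course's actual run configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Tokenizer: the pretrained GPT-2 tokenizer (no pad token by default).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Dataset: BookCorpus, tokenized into fixed-length sequences (small slice shown here).
raw = load_dataset("bookcorpus", split="train[:1%]")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Model: GPT-2 architecture with a CLM head, initialized from scratch.
model = GPT2LMHeadModel(GPT2Config(vocab_size=len(tokenizer)))

# Optimizer: Trainer uses AdamW by default; report training metrics to wandb.
args = TrainingArguments(
    output_dir="gpt2-bookcorpus",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-4,
    report_to="wandb",
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```

Launching the same script with `torchrun --nproc_per_node=4 pretrain.py` on a multi-GPU node enables DDP without code changes, since `Trainer` picks up the distributed environment automatically.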
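Text generation from a trained checkpoint is then only a few lines; the checkpoint path and the sampling parameters below are placeholders.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the tokenizer and a trained checkpoint (path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-bookcorpus/checkpoint-final")

# Generate a continuation with sampling.
inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```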
- Slides
- Concepts:
- PEFT and Quantization
- Continued Pretraining of Llama 3.2 1B Notebook
- Task-specific fine-tuning (classification) Notebook
- Task-specific fine-tuning with LoRA Notebook
- Instruction tuning Notebook from Unsloth
- Preference tuning
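
A minimal sketch of task-specific fine-tuning with LoRA using `peft` and `transformers`; the base model, the IMDB dataset, and the LoRA hyperparameters are illustrative assumptions, not the exact settings used in the course notebooks.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    Trainer, TrainingArguments,
)

# Base model and tokenizer (names are placeholders).
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Attach LoRA adapters: only the low-rank adapter weights are trained.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# A small classification dataset (IMDB used as an illustrative example).
ds = load_dataset("imdb", split="train[:1%]")
ds = ds.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-classifier",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=ds,
)
trainer.train()
model.save_pretrained("lora-classifier/adapter")  # saves only the adapter weights
```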