Modern NLP with Hugging Face

  • This 4-week course helps you understand Hugging Face's rich ecosystem and develop LLMs.

Week 1: Datasets

  • Slides
  • Notebook
  • Load datasets in different formats
  • Understand the structure of the loaded dataset
  • Access and manipulate samples in the dataset (see the sketch after this list)
    • Concatenate
    • Interleave
    • Map
    • Filter
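
A minimal sketch of these dataset operations using the 🤗 `datasets` library; the file names and the shared `text` column are illustrative assumptions, not the course's actual data:

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Load datasets in different formats (CSV and JSON shown as examples).
csv_ds = load_dataset("csv", data_files="train.csv", split="train")
json_ds = load_dataset("json", data_files="train.json", split="train")

# Inspect the structure: features (column schema) and number of rows.
print(csv_ds)
print(csv_ds.features)

# Access individual samples or slices.
first = csv_ds[0]    # dict of column -> value
batch = csv_ds[:8]   # dict of column -> list of values

# Concatenate: stack datasets with identical columns end to end.
combined = concatenate_datasets([csv_ds, json_ds])

# Interleave: alternate samples drawn from several datasets.
mixed = interleave_datasets([csv_ds, json_ds])

# Map: apply a transformation to every sample.
combined = combined.map(lambda ex: {"text": ex["text"].lower()})

# Filter: keep only samples that satisfy a predicate.
short = combined.filter(lambda ex: len(ex["text"]) < 512)
```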

Week 2: Tokenizers

  • Slides
  • Notebook
  • Set up a tokenization pipeline (see the sketch after this list)
  • Train the tokenizer
  • Encode the input samples (single or batch)
  • Test the implementation
  • Save and load the tokenizer
  • Decode the token_ids
  • Wrap the tokenizer with the PreTrainedTokenizer class
  • Save the pre-trained tokenizer
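
A condensed sketch of this workflow, assuming the 🤗 `tokenizers` library with a byte-level BPE model; the training file, vocabulary size, and special tokens are placeholders. Note that the wrapper class for a trained `tokenizers` object is `PreTrainedTokenizerFast`, the fast counterpart of `PreTrainedTokenizer`:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: BPE model + byte-level pre-tokenizer/decoder.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on raw text files.
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode a single sample or a batch.
enc = tokenizer.encode("Hello, Hugging Face!")
batch = tokenizer.encode_batch(["first sample", "second sample"])

# Test the implementation: inspect tokens and round-trip through decode.
print(enc.tokens, enc.ids)
print(tokenizer.decode(enc.ids))

# Save and load the standalone tokenizer.
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap with the PreTrainedTokenizerFast class and save the result.
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer, pad_token="[PAD]")
wrapped.save_pretrained("my-tokenizer")
```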

Week 3: Pre-training GPT-2

  • Slides
  • Notebook
  • Download the model checkpoints: Here
  • Set up the training pipeline (see the sketch after this list)
    • Dataset: BookCorpus
    • Number of tokens: 1.08 billion
    • Tokenizer: GPT-2 tokenizer
    • Model: GPT-2 with CLM head
    • Optimizer: AdamW
    • Parallelism: DDP (with L4 GPUs)
  • Train the model on
    • A100 80 GB, single GPU
    • L4 48 GB, single node with multiple GPUs
    • V100 32 GB, single GPU
  • Training report at wandb
  • Text Generation
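
A minimal sketch of this pipeline with 🤗 `transformers`: a randomly initialized GPT-2 with a CLM head, the GPT-2 tokenizer, BookCorpus, and AdamW. The DDP launch, W&B logging, and data loader are omitted for brevity, and the hyperparameters are illustrative, not the course's settings:

```python
import torch
from torch.optim import AdamW
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 tokenizer and a freshly initialized GPT-2 with a CLM head.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config())  # random weights: pre-training from scratch

# BookCorpus as the pre-training corpus (~1.08B tokens after tokenization).
dataset = load_dataset("bookcorpus", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# AdamW optimizer; for multi-GPU runs, wrap the model in DistributedDataParallel.
optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# One illustrative training step (labels == input_ids for causal LM).
sample = tokenized[0]
input_ids = torch.tensor([sample["input_ids"]], device=device)
outputs = model(input_ids=input_ids, labels=input_ids)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# Text generation with the (partially) trained model.
prompt = tokenizer("Once upon a time", return_tensors="pt").input_ids.to(device)
generated = model.generate(prompt, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(generated[0]))
```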

Week 4: Fine-tuning Llama 3.2 1B