- This is a 4-week course that helps you understand Hugging Face's rich ecosystem and develop LLMs.
- Slides
- Notebook
- Loading datasets in different formats
- Understand the Structure of the loaded dataset
- Access and manipulate samples in the dataset
- Concatenate
- Interleave
- Map
- Filter
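
The dataset operations listed above can be sketched with the Hugging Face `datasets` library roughly as follows; the file names, the `text` column, and the mixing probabilities are illustrative assumptions, not the course notebook's actual inputs.

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Loading datasets in different formats (CSV and JSON shown as examples).
csv_ds = load_dataset("csv", data_files="train.csv", split="train")
json_ds = load_dataset("json", data_files="train.json", split="train")

# Understand the structure: features (column names/types) and number of rows.
print(csv_ds.features, csv_ds.num_rows)

# Access and manipulate samples.
print(csv_ds[0])           # first sample as a dict
print(csv_ds["text"][:3])  # first three values of the "text" column

# Concatenate datasets that share identical features.
combined = concatenate_datasets([csv_ds, json_ds])

# Interleave datasets, optionally with sampling probabilities.
mixed = interleave_datasets([csv_ds, json_ds], probabilities=[0.7, 0.3], seed=42)

# Map: apply a transformation to every sample (batched for speed).
lowercased = combined.map(
    lambda batch: {"text": [t.lower() for t in batch["text"]]}, batched=True
)

# Filter: keep only samples that satisfy a predicate.
long_only = lowercased.filter(lambda sample: len(sample["text"]) > 100)
```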
- Slides
- Notebook
- Set up a tokenization pipeline
- Train the tokenizer
- Encode the input samples (single or batch)
- Test the implementation
- Save and load the tokenizer
- Decode the token_ids
- Wrap the tokenizer with the `PreTrainedTokenizer` class
- Save the pre-trained tokenizer
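
A minimal sketch of this tokenizer pipeline using the `tokenizers` library with a BPE model; the corpus path, vocabulary size, and special tokens are placeholder assumptions, and the wrapper shown is `PreTrainedTokenizerFast`, the fast variant of the `PreTrainedTokenizer` interface.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: BPE model with byte-level pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on a plain-text corpus (path is a placeholder).
trainer = trainers.BpeTrainer(
    vocab_size=30_000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode input samples (single or batch) and test the implementation.
single = tokenizer.encode("Hello, Hugging Face!")
batch = tokenizer.encode_batch(["first sample", "second sample"])
print(single.tokens, single.ids)

# Decode the token_ids back to text.
print(tokenizer.decode(single.ids))

# Save and load the raw tokenizer.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

# Wrap the tokenizer for use with transformers and save the pre-trained tokenizer.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=reloaded,
    unk_token="[UNK]",
    pad_token="[PAD]",
    bos_token="[BOS]",
    eos_token="[EOS]",
)
hf_tokenizer.save_pretrained("my-tokenizer")
```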
- Slides
- Notebook
- Download the model checkpoints: Here
- Set up the training pipeline
- Dataset: BookCorpus
- Number of tokens: 1.08 billion
- Tokenizer: GPT-2 tokenizer
- Model: GPT-2 with a CLM (causal language modeling) head
- Optimizer: AdamW
- Parallelism: DDP (with L4 GPUs)
- Train the model on
- A100 80 GB single GPU
- L4 48 GB single node, multiple GPUs
- V100 32 GB single GPU
- Training report on wandb
- Text Generation
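
A condensed sketch of the pretraining pipeline above (GPT-2 tokenizer, GPT-2 with a CLM head trained from scratch on BookCorpus, AdamW via the default `Trainer` optimizer, wandb reporting); the dataset slice, sequence length, and hyperparameters are illustrative assumptions, not the course's actual run configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Tokenizer: the pretrained GPT-2 tokenizer (no pad token by default).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Dataset: BookCorpus, tokenized into fixed-length sequences (small slice shown here).
raw = load_dataset("bookcorpus", split="train[:1%]")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Model: GPT-2 architecture with a CLM head, initialized from scratch.
model = GPT2LMHeadModel(GPT2Config(vocab_size=len(tokenizer)))

# Optimizer: Trainer uses AdamW by default; report training metrics to wandb.
args = TrainingArguments(
    output_dir="gpt2-bookcorpus",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-4,
    report_to="wandb",
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```

Launching the same script with `torchrun --nproc_per_node=4 pretrain.py` on a multi-GPU node enables DDP without code changes, since `Trainer` picks up the distributed environment automatically.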
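Text generation from a trained checkpoint is then only a few lines; the checkpoint path and the sampling parameters below are placeholders.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the tokenizer and a trained checkpoint (path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-bookcorpus/checkpoint-final")

# Generate a continuation with sampling.
inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```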
- Slides
- Concepts:
- PEFT and Quantization
- Continued Pretraining of Llama 3.2 1B Notebook
- Task-specific fine-tuning (classification) Notebook
- Task-specific fine-tuning with LoRA Notebook
- Instruction tuning Notebook from Unsloth
- Preference tuning
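
A minimal sketch of task-specific fine-tuning with LoRA using `peft` and `transformers`; the base model, the IMDB dataset, and the LoRA hyperparameters are illustrative assumptions, not the exact settings used in the course notebooks.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    Trainer, TrainingArguments,
)

# Base model and tokenizer (names are placeholders).
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Attach LoRA adapters: only the low-rank adapter weights are trained.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# A small classification dataset (IMDB used as an illustrative example).
ds = load_dataset("imdb", split="train[:1%]")
ds = ds.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-classifier",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=ds,
)
trainer.train()
model.save_pretrained("lora-classifier/adapter")  # saves only the adapter weights
```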