Modern NLP with Hugging Face

  • This 4-week course helps you understand Hugging Face's rich ecosystem and develop LLMs.

Week 1: Datasets

  • Slides
  • Notebook
  • Load datasets in different formats
  • Understand the structure of the loaded dataset
  • Access and manipulate samples in the dataset (see the sketch after this list)
    • Concatenate
    • Interleave
    • Map
    • Filter
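
A minimal sketch of these dataset operations using the 🤗 `datasets` library; the file names and the shared `text` column are illustrative assumptions, not the course's actual data:

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Load datasets in different formats (CSV and JSON shown as examples).
csv_ds = load_dataset("csv", data_files="train.csv", split="train")
json_ds = load_dataset("json", data_files="train.json", split="train")

# Inspect the structure: features (column schema) and number of rows.
print(csv_ds)
print(csv_ds.features)

# Access individual samples or slices.
first = csv_ds[0]    # dict of column -> value
batch = csv_ds[:8]   # dict of column -> list of values

# Concatenate: stack datasets with identical columns end to end.
combined = concatenate_datasets([csv_ds, json_ds])

# Interleave: alternate samples drawn from several datasets.
mixed = interleave_datasets([csv_ds, json_ds])

# Map: apply a transformation to every sample.
combined = combined.map(lambda ex: {"text": ex["text"].lower()})

# Filter: keep only samples that satisfy a predicate.
short = combined.filter(lambda ex: len(ex["text"]) < 512)
```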

Week 2: Tokenizers

  • Slides
  • Notebook
  • Set up a tokenization pipeline (see the sketch after this list)
  • Train the tokenizer
  • Encode the input samples (single or batch)
  • Test the implementation
  • Save and load the tokenizer
  • Decode the token_ids
  • Wrap the tokenizer with the PreTrainedTokenizer class
  • Save the pre-trained tokenizer
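
A condensed sketch of this workflow, assuming the 🤗 `tokenizers` library with a byte-level BPE model; the training file, vocabulary size, and special tokens are placeholders. Note that the wrapper class for a trained `tokenizers` object is `PreTrainedTokenizerFast`, the fast counterpart of `PreTrainedTokenizer`:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: BPE model + byte-level pre-tokenizer/decoder.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on raw text files.
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode a single sample or a batch.
enc = tokenizer.encode("Hello, Hugging Face!")
batch = tokenizer.encode_batch(["first sample", "second sample"])

# Test the implementation: inspect tokens and round-trip through decode.
print(enc.tokens, enc.ids)
print(tokenizer.decode(enc.ids))

# Save and load the standalone tokenizer.
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap with the PreTrainedTokenizerFast class and save the result.
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer, pad_token="[PAD]")
wrapped.save_pretrained("my-tokenizer")
```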

Week 3: Pre-training GPT-2

  • Slides
  • Notebook
  • Download the model checkpoints: Here
  • Set up the training pipeline (see the sketch after this list)
    • Dataset: BookCorpus
    • Number of tokens: 1.08 billion
    • Tokenizer: GPT-2 tokenizer
    • Model: GPT-2 with CLM head
    • Optimizer: AdamW
    • Parallelism: DDP (with L4 GPUs)
  • Train the model on
    • A100 80 GB, single GPU
    • L4 48 GB, single node with multiple GPUs
    • V100 32 GB, single GPU
  • Training report at wandb
  • Text Generation
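
A minimal sketch of this pipeline with 🤗 `transformers`: a randomly initialized GPT-2 with a CLM head, the GPT-2 tokenizer, BookCorpus, and AdamW. The DDP launch, W&B logging, and data loader are omitted for brevity, and the hyperparameters are illustrative, not the course's settings:

```python
import torch
from torch.optim import AdamW
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 tokenizer and a freshly initialized GPT-2 with a CLM head.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config())  # random weights: pre-training from scratch

# BookCorpus as the pre-training corpus (~1.08B tokens after tokenization).
dataset = load_dataset("bookcorpus", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# AdamW optimizer; for multi-GPU runs, wrap the model in DistributedDataParallel.
optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# One illustrative training step (labels == input_ids for causal LM).
sample = tokenized[0]
input_ids = torch.tensor([sample["input_ids"]], device=device)
outputs = model(input_ids=input_ids, labels=input_ids)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# Text generation with the (partially) trained model.
prompt = tokenizer("Once upon a time", return_tensors="pt").input_ids.to(device)
generated = model.generate(prompt, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(generated[0]))
```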

Week 4: Fine-tuning Llama 3.2 1B