Modern NLP with Hugging Face

  • This 4-week course helps you understand Hugging Face's rich ecosystem and develop LLMs.

Week 1: Datasets

  • Slides
  • Notebook
  • Loading datasets in different formats
  • Understand the structure of the loaded dataset
  • Access and manipulate samples in the dataset (a minimal sketch follows this list)
    • Concatenate
    • Interleave
    • Map
    • Filter
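
A minimal sketch of the Week 1 operations using the `datasets` library. The dataset name ("imdb") and the map/filter rules below are illustrative placeholders, not necessarily the ones used in the course notebook.

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Load a dataset from the Hub (local CSV/JSON files can be loaded the same way,
# e.g. load_dataset("csv", data_files="reviews.csv"))
ds = load_dataset("imdb", split="train")

# Inspect the structure of the loaded dataset
print(ds)           # number of rows and column names
print(ds.features)  # column types
print(ds[0])        # access a single sample

# Manipulate samples
lower_ds = ds.map(lambda ex: {"text": ex["text"].lower()})   # map
short_ds = ds.filter(lambda ex: len(ex["text"]) < 500)       # filter

# Combine datasets with matching features
combined    = concatenate_datasets([lower_ds, short_ds])      # concatenate
interleaved = interleave_datasets([lower_ds, short_ds])       # interleave
```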

Week 2: Tokenizers

  • Slides
  • Notebook
  • Set up a tokenization pipeline (a minimal sketch follows this list)
  • Train the tokenizer
  • Encode the input samples (single or batch)
  • Test the implementation
  • Save and load the tokenizer
  • Decode the token_ids
  • Wrap the tokenizer with the PreTrainedTokenizer class
  • Save the pre-trained tokenizer
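
A minimal sketch of the Week 2 workflow with the `tokenizers` library, wrapped here with PreTrainedTokenizerFast (the fast-tokenizer counterpart of PreTrainedTokenizer). The corpus file, vocabulary size, and special tokens are placeholders, not the course's actual settings.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline (BPE here; the notebook may use a different model)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train the tokenizer on a plain-text corpus (hypothetical file name)
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode single and batched inputs, then decode the token ids
enc  = tokenizer.encode("Hugging Face makes NLP easy.")
encs = tokenizer.encode_batch(["first sample", "second sample"])
print(enc.ids, enc.tokens)
print(tokenizer.decode(enc.ids))

# Save and reload the raw tokenizer
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap it as a pre-trained tokenizer and save it in the Transformers format
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer,
                                  unk_token="[UNK]", pad_token="[PAD]")
wrapped.save_pretrained("my-tokenizer")
```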

Week 3: Pre-training GPT-2

  • Slides
  • Notebook
  • Download the model checkpoints: Here
  • Set up the training pipeline (a simplified sketch follows this list)
    • Dataset: BookCorpus
    • Number of tokens: 1.08 billion
    • Tokenizer: GPT-2 tokenizer
    • Model: GPT-2 with a CLM head
    • Optimizer: AdamW
    • Parallelism: DDP (with L4 GPUs)
  • Train the model on
    • A100 80 GB, single GPU
    • L4 48 GB, single node with multiple GPUs
    • V100 32 GB, single GPU
  • Training report on wandb
  • Text Generation
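
A simplified single-GPU sketch of the Week 3 pipeline: a randomly initialised GPT-2 with a causal-LM head, one AdamW training step, and sampling-based generation. The learning rate and toy inputs are placeholders; the actual notebook trains on BookCorpus and uses DDP across multiple GPUs.

```python
import torch
from torch.optim import AdamW
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained GPT-2 tokenizer; the model itself starts from random weights
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size)).to(device)
optimizer = AdamW(model.parameters(), lr=3e-4)

# One CLM training step on a toy sentence: the labels are the input ids themselves,
# and the model shifts them internally to predict the next token
batch = tokenizer("a tiny example sentence for one training step",
                  return_tensors="pt").to(device)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Text generation (sampling) from the model
prompt = tokenizer("Once upon a time", return_tensors="pt").to(device)
out = model.generate(**prompt, max_new_tokens=30, do_sample=True,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```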

Week 4: Fine-tuning Llama 3.2 1B
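
A hedged sketch of what fine-tuning Llama 3.2 1B can look like with the Trainer API; the WikiText dataset, hyperparameters, and output directory below are illustrative placeholders rather than the course's actual setup, and the model is gated on the Hub, so access must be requested and a `huggingface_hub` login done first.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id  = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder corpus: tokenize a small slice of a text dataset
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Causal-LM fine-tuning: the collator pads batches and builds labels from input ids
args = TrainingArguments(output_dir="llama-finetune",
                         per_device_train_batch_size=1,
                         num_train_epochs=1,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```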

About

Notebooks for the DLP course by Prof. Mitesh Khapra and me, offered to IITM BS students.
