sail-sg/VocabularyParallelism
This repository is a fork of Megatron-LM. The original README can be found here.

Balancing Pipeline Parallelism with Vocabulary Parallelism

Vocabulary Parallelism is a novel technique that balances the computation and memory of vocabulary layers in pipeline parallelism.

Check out our paper at arxiv.

Quick Start

Run

VOCAB_PARALLEL=1 VOCAB_SIZE=256k pretrain_gpt.sh

The script ships with datasets at four vocabulary sizes: 32k, 64k, 128k, and 256k. Set VOCAB_SIZE to any of these values to change the vocabulary size.

Alternatively, include the argument --enable-vocab-parallel when training the GPT model. Vocabulary Parallelism is not yet supported for other models.
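For instance, a sweep over the bundled vocabulary sizes could look like the following sketch. It assumes pretrain_gpt.sh is on your PATH and only echoes each command as a dry run:

```shell
# Dry-run sweep over the bundled vocabulary sizes.
# Drop the leading "echo" to actually launch each run.
for VS in 32k 64k 128k 256k; do
  echo "VOCAB_PARALLEL=1 VOCAB_SIZE=${VS} pretrain_gpt.sh"
done
```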

Methodology

Vocabulary Parallelism partitions the vocabulary layers evenly across pipeline devices and groups the computation into two pipeline passes, $S$ and $T$. The all-reduce communication barriers $C_0$ and $C_1$ are handled in separate streams so that they overlap with transformer layer computation.
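To see why two all-reduces suffice for a vocabulary sharded across devices, consider a numerically stable softmax cross-entropy: one all-reduce of the local maxima and one of the local exp-sums recover the exact global loss. The NumPy sketch below simulates this on one process; the function name is ours, and the mapping of the two reductions to $C_0$ and $C_1$ is illustrative, not the repository's actual implementation:

```python
import numpy as np

def sharded_softmax_loss(h, W_shards, target):
    """Cross-entropy over a vocabulary sharded across p 'devices'.

    Each shard computes logits for its slice of the vocabulary; two
    reductions (a max and a sum, standing in for the all-reduce
    barriers C_0 and C_1) recover the exact global softmax loss.
    """
    # Each device computes logits for its vocabulary slice.
    logits = [h @ W for W in W_shards]
    # First reduction: global max for numerical stability (~ C_0).
    global_max = max(l.max() for l in logits)
    # Second reduction: global sum of exponentials (~ C_1).
    global_sum = sum(np.exp(l - global_max).sum() for l in logits)

    # Locate the target logit in whichever shard owns it.
    offset = 0
    for l in logits:
        if target < offset + l.size:
            target_logit = l[target - offset]
            break
        offset += l.size
    return -(target_logit - global_max - np.log(global_sum))

rng = np.random.default_rng(0)
h = rng.standard_normal(16)             # hidden state
W = rng.standard_normal((16, 12))       # full output embedding (vocab = 12)
shards = np.split(W, 4, axis=1)         # 4 pipeline devices, 3 columns each

loss = sharded_softmax_loss(h, shards, target=7)
# Reference: unsharded log-softmax loss for the same target.
logits = h @ W
ref = -(logits[7] - logits.max() - np.log(np.exp(logits - logits.max()).sum()))
```

The sharded and unsharded losses agree exactly, which is what lets the real schedule overlap these reductions with transformer computation instead of recomputing anything.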

Schedules

We propose a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. This repository supports Vocabulary Parallelism with non-interleaved 1F1B, with the following two variations:

  • Vocabulary Parallelism with 1 all-reduce communication barrier (default)

  • Vocabulary Parallelism with 2 all-reduce communication barriers (enable using --disable-backward-fusion)

An implementation of Vocabulary Parallelism on the V-Half schedule can be found at this branch of the Zero Bubble Pipeline Parallelism repository.

For comparison, we also implement the interlaced pipeline (Lin et al., 2024) which uses a tensor parallel style to handle the vocabulary layers.

  • Interlaced Pipeline (enable using --use-interlaced-schedule)

Comparison of Schedules

| | 1F1B | Vocab-1 | Vocab-2 | Interlaced |
|---|---|---|---|---|
| Bubble Rate | $B$ | $B$ | $B$ | $B$ |
| Activation Memory (number of microbatches) | $p$ | $p + 2$ | $p + 1$ | $1.5p$ |
| Vocabulary | Imbalanced | Balanced | Balanced | Balanced |
| Overlapped All-Reduce Communication | N.A. | Yes | Yes | No |

$p$ is the number of pipeline stages.
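To make the activation-memory row concrete, the following sketch evaluates those formulas at a given pipeline depth (the formulas come from the table; the helper name is ours):

```python
def activation_memory_microbatches(p):
    """Activation memory, in number of in-flight microbatches,
    for each schedule at pipeline depth p (per the table above)."""
    return {
        "1F1B": p,
        "Vocab-1": p + 2,
        "Vocab-2": p + 1,
        "Interlaced": 1.5 * p,
    }

# e.g. at p = 8, Vocab-2 holds one extra microbatch over 1F1B,
# while the interlaced schedule holds four extra.
print(activation_memory_microbatches(8))
```

The gap matters because the interlaced schedule's overhead grows with $p$, whereas the Vocab-1/Vocab-2 overhead is a constant one or two microbatches.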

Evaluation

Vocabulary Parallelism results in a 5% to 51% improvement in throughput over naive approaches, while significantly reducing peak memory usage, especially in large-vocabulary settings.
