This repository is a fork of Megatron-LM. The original README can be found here.
Vocabulary Parallelism is a novel technique that balances the computation and memory of vocabulary layers in pipeline parallelism.
Check out our paper on arXiv.
## Quick Start
Run:

`VOCAB_PARALLEL=1 VOCAB_SIZE=256k pretrain_gpt.sh`
This script comes with a dataset in vocabulary sizes of 32k, 64k, 128k and 256k. Change the vocabulary size by setting `VOCAB_SIZE` to either `32k`, `64k`, `128k` or `256k`.
Alternatively, include the argument `--enable-vocab-parallel` when training with the GPT model. Vocabulary Parallelism is not yet supported for the other models.
Vocabulary Parallelism partitions the vocabulary layers evenly across pipeline devices and groups the computation into two pipeline passes.
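As a rough illustration of this partitioning (a minimal sketch only, not this repository's implementation; the helper names are hypothetical), each pipeline rank can own a contiguous slice of the vocabulary and compute output-layer logits only for that slice:

```python
# Minimal sketch of even vocabulary partitioning across pipeline ranks.
# Illustrative only: `vocab_shard_range` and `partial_output_layer` are
# hypothetical helpers, not functions from this repository.
import torch


def vocab_shard_range(vocab_size: int, rank: int, world_size: int):
    """Return the [start, end) slice of the vocabulary owned by `rank`."""
    shard = vocab_size // world_size
    start = rank * shard
    # The last rank absorbs any remainder so every token id is covered.
    end = vocab_size if rank == world_size - 1 else start + shard
    return start, end


def partial_output_layer(hidden, full_weight, rank, world_size):
    """Compute logits only for this rank's vocabulary shard.

    hidden:      [tokens, hidden_dim] activations entering the output layer
    full_weight: [vocab_size, hidden_dim] output embedding weight
    """
    start, end = vocab_shard_range(full_weight.size(0), rank, world_size)
    local_logits = hidden @ full_weight[start:end].t()  # [tokens, shard_size]
    return local_logits, (start, end)
```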
We propose a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. This repository supports Vocabulary Parallelism with non-interleaved 1F1B, with the following two variations (the all-reduce communication that these barriers synchronize is sketched after the list):

- Vocabulary Parallelism with 1 all-reduce communication barrier (default)
- Vocabulary Parallelism with 2 all-reduce communication barriers (enable using `--disable-backward-fusion`)
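The all-reduce barriers above synchronize softmax statistics across the vocabulary shards. Below is a minimal sketch of that reduction, assuming each rank holds the logits for its shard `[start, end)` and a process group `group` spanning the vocabulary partitions; the function name and signature are hypothetical, not this repository's API.

```python
# Sketch of a vocabulary-parallel cross-entropy loss, showing the all-reduces
# over vocabulary shards that the barriers above synchronize.
# Illustrative only: `vocab_parallel_cross_entropy` and `group` are hypothetical
# names, not this repository's API.
import torch
import torch.distributed as dist


def vocab_parallel_cross_entropy(local_logits, targets, start, end, group):
    """local_logits: [tokens, shard_size] logits for this rank's shard [start, end).
    targets: [tokens] global token ids. Returns the per-token loss."""
    # 1) Global max over the full vocabulary, for numerical stability.
    global_max = local_logits.max(dim=-1).values
    dist.all_reduce(global_max, op=dist.ReduceOp.MAX, group=group)

    # 2) Global sum of exponentials over the full vocabulary.
    shifted = local_logits - global_max.unsqueeze(-1)
    sum_exp = shifted.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)

    # 3) Only the rank whose shard contains the target id contributes its logit.
    in_shard = (targets >= start) & (targets < end)
    local_idx = (targets - start).clamp(min=0, max=end - start - 1)
    target_logit = torch.where(
        in_shard,
        shifted.gather(-1, local_idx.unsqueeze(-1)).squeeze(-1),
        torch.zeros_like(sum_exp),
    )
    dist.all_reduce(target_logit, op=dist.ReduceOp.SUM, group=group)

    # -log softmax(target), with the max shift already applied to both terms.
    return sum_exp.log() - target_logit
```

Roughly, the two variations above differ in how many of these synchronization points remain exposed as barriers in the 1F1B schedule; see the paper for the exact placement.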
An implementation of Vocabulary Parallelism on the V-Half schedule can be found at this branch of the Zero Bubble Pipeline Parallelism repository.
For comparison, we also implement the interlaced pipeline (Lin et al., 2024), which uses a tensor parallel style to handle the vocabulary layers.
## Comparison of Schedules
| | 1F1B | Vocab-1 | Vocab-2 | Interlaced |
|---|---|---|---|---|
| Bubble Rate | | | | |
| Activation Memory (number of microbatches) | | | | |
| Vocabulary | Imbalanced | Balanced | Balanced | Balanced |
| Overlapped All-Reduce Communication | N.A. | Yes | Yes | No |
Vocabulary Parallelism results in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially in large-vocabulary scenarios.