This repository is a fork of Megatron-LM. The original README can be found here.
Vocabulary Parallelism is a novel technique that balances the computation and memory of vocabulary layers in pipeline parallelism.
Check out our paper on arXiv.
## Quick Start
Run:

`VOCAB_PARALLEL=1 VOCAB_SIZE=256k pretrain_gpt.sh`
This script comes with a dataset in vocabulary sizes of 32k, 64k, 128k and 256k. Change the vocabulary size by setting `VOCAB_SIZE` to either `32k`, `64k`, `128k` or `256k`.
Alternatively, include the argument `--enable-vocab-parallel` when training with the GPT model. Vocabulary Parallelism is not yet supported for the other models.
Vocabulary Parallelism partitions the vocabulary layers evenly across pipeline devices and groups the computation into two pipeline passes.
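As a rough illustration of this partitioning (a minimal sketch only, not this repository's implementation; the helper names are hypothetical), each pipeline rank can own a contiguous slice of the vocabulary and compute output-layer logits only for that slice:

```python
# Minimal sketch of even vocabulary partitioning across pipeline ranks.
# Illustrative only: `vocab_shard_range` and `partial_output_layer` are
# hypothetical helpers, not functions from this repository.
import torch


def vocab_shard_range(vocab_size: int, rank: int, world_size: int):
    """Return the [start, end) slice of the vocabulary owned by `rank`."""
    shard = vocab_size // world_size
    start = rank * shard
    # The last rank absorbs any remainder so every token id is covered.
    end = vocab_size if rank == world_size - 1 else start + shard
    return start, end


def partial_output_layer(hidden, full_weight, rank, world_size):
    """Compute logits only for this rank's vocabulary shard.

    hidden:      [tokens, hidden_dim] activations entering the output layer
    full_weight: [vocab_size, hidden_dim] output embedding weight
    """
    start, end = vocab_shard_range(full_weight.size(0), rank, world_size)
    local_logits = hidden @ full_weight[start:end].t()  # [tokens, shard_size]
    return local_logits, (start, end)
```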
We propose a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. This repository supports Vocabulary Parallelism with non-interleaved 1F1B, with the following two variations (the all-reduce communication that these barriers synchronize is sketched after the list):

- Vocabulary Parallelism with 1 all-reduce communication barrier (default)
- Vocabulary Parallelism with 2 all-reduce communication barriers (enable using `--disable-backward-fusion`)
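The all-reduce barriers above synchronize softmax statistics across the vocabulary shards. Below is a minimal sketch of that reduction, assuming each rank holds the logits for its shard `[start, end)` and a process group `group` spanning the vocabulary partitions; the function name and signature are hypothetical, not this repository's API.

```python
# Sketch of a vocabulary-parallel cross-entropy loss, showing the all-reduces
# over vocabulary shards that the barriers above synchronize.
# Illustrative only: `vocab_parallel_cross_entropy` and `group` are hypothetical
# names, not this repository's API.
import torch
import torch.distributed as dist


def vocab_parallel_cross_entropy(local_logits, targets, start, end, group):
    """local_logits: [tokens, shard_size] logits for this rank's shard [start, end).
    targets: [tokens] global token ids. Returns the per-token loss."""
    # 1) Global max over the full vocabulary, for numerical stability.
    global_max = local_logits.max(dim=-1).values
    dist.all_reduce(global_max, op=dist.ReduceOp.MAX, group=group)

    # 2) Global sum of exponentials over the full vocabulary.
    shifted = local_logits - global_max.unsqueeze(-1)
    sum_exp = shifted.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)

    # 3) Only the rank whose shard contains the target id contributes its logit.
    in_shard = (targets >= start) & (targets < end)
    local_idx = (targets - start).clamp(min=0, max=end - start - 1)
    target_logit = torch.where(
        in_shard,
        shifted.gather(-1, local_idx.unsqueeze(-1)).squeeze(-1),
        torch.zeros_like(sum_exp),
    )
    dist.all_reduce(target_logit, op=dist.ReduceOp.SUM, group=group)

    # -log softmax(target), with the max shift already applied to both terms.
    return sum_exp.log() - target_logit
```

Roughly, the two variations above differ in how many of these synchronization points remain exposed as barriers in the 1F1B schedule; see the paper for the exact placement.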
An implementation of Vocabulary Parallelism on the V-Half schedule can be found at this branch of the Zero Bubble Pipeline Parallelism repository.
For comparison, we also implement the interlaced pipeline (Lin et al., 2024), which uses a tensor parallel style to handle the vocabulary layers.
## Comparison of Schedules
| | 1F1B | Vocab-1 | Vocab-2 | Interlaced |
|---|---|---|---|---|
| Bubble Rate | | | | |
| Activation Memory (number of microbatches) | | | | |
| Vocabulary | Imbalanced | Balanced | Balanced | Balanced |
| Overlapped All-Reduce Communication | N.A. | Yes | Yes | No |
Vocabulary Parallelism results in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially in large-vocabulary scenarios.