[Feature] Support pipeline parallelism model wrapper #1355

fanqiNO1 · 2023-09-12T03:58:58Z

Background

As model inference requires more and more CUDA memory, we need a way to run inference under different CUDA memory conditions, mainly the following two cases:

Insufficient CUDA memory
Inference is completed by offloading weights to CPU memory or to disk.
Sufficient CUDA memory
The model can be partitioned across multiple GPUs, in which case inference should run as efficiently as possible.
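
The two cases can be illustrated with a small placement sketch. It is not the logic of this PR: `assign_devices`, the layer names, and the sizes are hypothetical, and it simply fills devices greedily in order, assuming the budgets are listed from fastest to slowest.

```python
from typing import Dict, List, Tuple


def assign_devices(
    layer_sizes: List[Tuple[str, int]],
    budgets: Dict[str, int],
) -> Dict[str, str]:
    """Greedily place layers in order, filling each device before moving on.

    `budgets` is ordered from fastest to slowest (GPUs, then 'cpu', then
    'disk'). Sizes and budgets share one unit (hypothetical MiB values).
    """
    devices = list(budgets)
    remaining = dict(budgets)
    device_map: Dict[str, str] = {}
    idx = 0
    for name, size in layer_sizes:
        # Move to the next device once the current one cannot hold the layer.
        while idx < len(devices) and remaining[devices[idx]] < size:
            idx += 1
        if idx == len(devices):
            raise MemoryError(f'no device can hold layer {name!r}')
        device_map[name] = devices[idx]
        remaining[devices[idx]] -= size
    return device_map


layers = [('stem', 10), ('layer1', 40), ('layer2', 60),
          ('layer3', 90), ('head', 20)]
# Sufficient CUDA memory: the model is partitioned across two GPUs.
print(assign_devices(layers, {'cuda:0': 120, 'cuda:1': 120}))
# Insufficient CUDA memory: the tail of the model spills to CPU, then disk.
print(assign_devices(layers, {'cuda:0': 60, 'cpu': 100, 'disk': 10**6}))
```

Because layers are placed strictly in order, each device holds a contiguous slice of the model, which is what a pipeline stage needs.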

Hugging Face's accelerate library also lets users run inference when CUDA memory is insufficient, but it uses the available resources inefficiently (see the Accelerate column in the Experiment section).

Design

To speed up inference by using the available resources as fully as possible, we will implement a model wrapper based on pipeline parallelism.

The wrapper is primarily responsible for:

building the model, then loading and dispatching its weights
running inference with pipeline parallelism
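
As a rough illustration of the second responsibility, the sketch below prints a GPipe-style schedule: the batch is split into chunks, and at each clock tick stage `s` works on chunk `tick - s`. The function name `pipeline_schedule` is hypothetical, and whether MMPipelineParallel orders its work exactly this way is an assumption.

```python
from typing import List, Tuple


def pipeline_schedule(num_chunks: int,
                      num_stages: int) -> List[List[Tuple[int, int]]]:
    """Return, per clock tick, the (chunk, stage) pairs that run concurrently."""
    ticks = []
    for tick in range(num_chunks + num_stages - 1):
        # At a given tick, stage s works on chunk (tick - s), if it exists.
        running = [(tick - stage, stage)
                   for stage in range(num_stages)
                   if 0 <= tick - stage < num_chunks]
        ticks.append(running)
    return ticks


for tick, running in enumerate(pipeline_schedule(num_chunks=4, num_stages=2)):
    print(tick, running)
```

With more chunks, the ticks at the start and end where some stages sit idle make up a smaller share of the total, which is why the chunk count matters for throughput.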

This PR adds MMPipelineParallel.

Environment

PyTorch: 2.0.0
CUDA: 11.8
GPU: 8 × A100 (80 GB)

Validation

init_device_map
offload policy
pipeline parallelism

Experiment

ResNet-152

Setting	Accelerate (samples/sec)	Torchgpipe (samples/sec)	Ours (samples/sec)
pipeline-1	1478.576	1476.419	1482.065
pipeline-2	935.565	1327.254	1733.871
pipeline-4	1023.315	1908.557	2757.441
pipeline-8	1051.154	2874.286	3742.485
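
To make the scaling behaviour easier to read, the numbers in the table above can be turned into speedups over each method's own pipeline-1 run. This is only arithmetic on the reported values; it assumes that pipeline-N means N pipeline stages, and the helper names are illustrative.

```python
# Throughput in samples/sec, copied from the table above.
results = {
    'Accelerate': {1: 1478.576, 2: 935.565, 4: 1023.315, 8: 1051.154},
    'Torchgpipe': {1: 1476.419, 2: 1327.254, 4: 1908.557, 8: 2874.286},
    'Ours': {1: 1482.065, 2: 1733.871, 4: 2757.441, 8: 3742.485},
}


def scaling(method: str) -> dict:
    """Speedup of each setting over the same method's pipeline-1 run."""
    base = results[method][1]
    return {n: tput / base for n, tput in results[method].items()}


def efficiency(method: str) -> dict:
    """Speedup divided by the number of stages (1.0 would be linear scaling)."""
    return {n: s / n for n, s in scaling(method).items()}


for method in results:
    row = ', '.join(f'{n}: {s:.2f}x' for n, s in scaling(method).items())
    print(f'{method:<10} {row}')
```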

Scripts

import torch
from torch import nn
from torchvision.models import resnet152


class MMResnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = resnet152()
        
    def forward(self, x):
        return self.model(x)
    
    def data_preprocessor(self, x, training=False):
        return x
        
    def test_step(self, data):
        data = data['input']
        return self.model(data)


if __name__ == '__main__':
    import time
    from mmengine.model import MMPipelineParallel
    from tqdm import tqdm
    from colorama import Fore

    SEED = 0x66ccff
    torch.manual_seed(SEED)
    torch.cuda.manual_seed(SEED)

    model = MMResnet()
    model.eval()

    SETTINGS = {
        1: {
            'num_chunks': 2,
            'batch_size': 220
        },
        2: {
            'num_chunks': 1667,
            'batch_size': 25000
        },
        4: {
            'num_chunks': 256,
            'batch_size': 5632
        },
        8: {
            'num_chunks': 150,
            'batch_size': 5400
        }
    }

    num_pipelines = 1
    num_chunks = SETTINGS[num_pipelines]['num_chunks']
    num_samples = 50000
    batch_size = SETTINGS[num_pipelines]['batch_size']
    # generate data
    dataset = []
    all_cuda_data = True
    num_batches = num_samples // batch_size
    others = num_samples % batch_size
    if all_cuda_data:
        data = torch.randn(batch_size, 3, 224, 224).to('cuda:0')
        for i in tqdm(range(num_batches)):
            dataset.append(data)
        if others > 0:
            dataset.append(data[:others])
    else:
        for i in range(num_batches):
            dataset.append(torch.randn(batch_size, 3, 224, 224))
        if others > 0:
            dataset.append(torch.randn(others, 3, 224, 224))
    print(f'{Fore.GREEN}Data generated{Fore.RESET}')
    # init inferencer
    inferencer = MMPipelineParallel(
        module=model,
        num_pipelines=num_pipelines,
        num_chunks=num_chunks,
        no_split_module_classes='Bottleneck',
        input_key='input'
    )
    # run: time EPOCHS passes over the dataset; the first SKIP_EPOCHS passes
    # are treated as warm-up and excluded from the reported averages
    EPOCHS = 10
    SKIP_EPOCHS = 1
    throughputs = []
    elapseds = []
    for i in range(EPOCHS):
        # wait for pending CUDA work so the timer covers only this epoch
        torch.cuda.synchronize()
        tick = time.time()
        # output = inferencer(dataset)
        for data in dataset:
            data = {'input': data}
            out = inferencer(data)
        torch.cuda.synchronize()
        tock = time.time()
        # calculate throughput
        elapsed = tock - tick
        throughput = num_samples / elapsed
        if i >= SKIP_EPOCHS:
            throughputs.append(throughput)
            elapseds.append(elapsed)
        print(f'{Fore.BLUE}Epoch {i+1} throughput: ' +
              f'{throughput:.3f} samples/sec with ' +
              f'{elapsed:.3f} secs{Fore.RESET}')
    # average over the measured (non-warm-up) epochs
    throughput = sum(throughputs) / len(throughputs)
    elapsed = sum(elapseds) / len(elapseds)
    print(f'{Fore.RED}Pipeline {num_pipelines}, ' +
          f'Chunks {num_chunks}, Batchsize {batch_size} ' +
          f'Throughput: {throughput:.3f} samples/sec ' +
          f'with {elapsed:.3f} secs{Fore.RESET}')
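
The SETTINGS in the script above imply the following batch arithmetic. The sketch assumes that `num_chunks` is the number of micro-batches each batch is split into and that the split is an even ceiling division; neither is confirmed by the script itself, and `describe` is an illustrative helper.

```python
import math

# (num_chunks, batch_size) per pipeline count, copied from SETTINGS above.
SETTINGS = {1: (2, 220), 2: (1667, 25000), 4: (256, 5632), 8: (150, 5400)}
NUM_SAMPLES = 50000


def describe(num_pipelines: int) -> dict:
    """Derive micro-batch size and batch counts for one benchmark setting."""
    num_chunks, batch_size = SETTINGS[num_pipelines]
    # Same split as the script: full batches plus one smaller trailing batch.
    full_batches, last_batch = divmod(NUM_SAMPLES, batch_size)
    return {
        # Assumption: each batch is cut into num_chunks near-equal pieces.
        'micro_batch': math.ceil(batch_size / num_chunks),
        'full_batches': full_batches,
        'last_batch': last_batch,
    }


for n in SETTINGS:
    print(n, describe(n))
```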

CLAassistant · 2023-09-12T03:59:02Z

All committers have signed the CLA.

mmengine/model/wrappers/pipeline_distributed.py

HAOCHENYE · 2023-10-09T12:50:21Z

+
+    def __init__(self,
+                 model: Union[dict, nn.Module],
+                 weights: Optional[str] = None,


Accepting and loading weights in the model wrapper is inconsistent with other model wrappers. We should consider combining this with BaseInferencer to see if there's a better approach.

mmengine/model/wrappers/pipeline_distributed.py

HAOCHENYE · 2023-10-11T17:02:21Z

+                }
+        # handle tied weights
+        tied_weights = self.model_tree['tied_parameters']
+        for source, targets in tied_weights.items():


The keys of tied_weights are parameter names, and the values are lists of module names. So why do we use device_map[source] and device_map[target] here?

fanqiNO1 added 2 commits September 5, 2023 15:46

[Feature] Support pipeline parallelism model wrapper

ae13282

[Feature] Support pipeline parallelism model wrapper

b3bdcc0

fanqiNO1 requested review from HAOCHENYE and C1rN09 as code owners September 12, 2023 03:58

fanqiNO1 added 12 commits September 12, 2023 22:13

[Fix] Implement infer_device_map

005bfd1

[Fix] Implement forward

8c80e77

[Fix] Fix __init__

1f5fafe

[Fix] Fix type error

14588fc

[Fix] Fix init device map

db96560

[Fix] Fix type

231f45c

[Refactor] Refactor with yapf and isort

f6862dc

[Enhancement] Support move part

300aabc

[Fix] Fix __init__

114d89e

[Fix] Fix some bugs

01caa07

[Enhancement] Support offload

b51d2fd

[Enhancement] Add tests

2fa28d3

fanqiNO1 requested a review from zhouzaida as a code owner September 18, 2023 06:07

fanqiNO1 added 10 commits September 18, 2023 14:10

[Fix] Fix typo

5e6e8d8

[Enhancement] Add docs

af9f732

[Enhancement] Add example in docs

35ce3c3

[Refactor] Refactor with docformatter

e5fcc07

[Refactor] Refactor with yapf

eff9913

[Fix] Fix import logic

8f9467d

[Fix] Add docs

003bfca

[Fix] Add docs

688453e

[Refactor] Refactor with yapf and docformatter

0befbb4

[Fix] Add docs

fd19abd

zhouzaida reviewed Sep 24, 2023

mmengine/model/wrappers/pipeline_distributed.py (outdated)

zhouzaida reviewed Sep 24, 2023

mmengine/model/wrappers/pipeline_distributed.py (outdated)

fanqiNO1 and others added 11 commits September 24, 2023 21:46

[Refactor] Refactor no_split_module_classes declaration

a2840a2

[Fix] Fix typo

94bb1f4

[Refactor] Move some functions out

464f352

[Refactor] For the lint!

bdc151e

[Refactor] Refactor with yapf-pep8

b72a661

[Refactor] Refactor with yapf

8dffd49

[Refactor] Refactor with yapf

f898c98

[Fix] Fix unit test

f23c06b

[Refactor] Rollback unit test

b488cdc

[Fix] Fix stream context to avoid init error

b1d2354

Merge branch 'open-mmlab:main' into pipeline

d6b4f5a

HAOCHENYE reviewed Oct 11, 2023

fanqiNO1 and others added 13 commits October 24, 2023 10:43

Merge branch 'open-mmlab:main' into pipeline

7562cba

[Fix] Remove unit test temporarily

c5305d3

[Fix] Fix PR comments

2e1017f

[Fix] Fix lint

bb60834

[Fix] Fix some bugs

6f821bd

[Fix] Fix datasample

c9a37bf

[Fix] change typeddict to dataclass

02a35c3

[Fix] Fix import

4146b4d

[Fix] Remove tied_weights

98e8232

[Fix] Fix exec_order

8048a15

[Fix] Fix disk offload

1ff6cd1

[Fix] Fix offload

144eba3

[Enhancement] Add unit test

e7f0d91
