Performance issue when batch_size is 32 #3

Open
letheantest opened this issue Oct 22, 2024 · 0 comments

Hello, we are trying to use SpecInfer to accelerate model inference, but we have run into a performance issue. As the batch size increases from 1 to 16, system throughput improves steadily; however, when the batch size reaches 32, throughput drops sharply, which is confusing. Our execution configuration is as follows:

Environment Setup

We use the provided Docker image (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and build from source following the docs (https://flexflow.readthedocs.io/en/latest/).
We test two supported models, Llama2-70B and OPT-13B, on this dataset: https://huggingface.co/datasets/gbharti/finance-alpaca.

Test Script

We run model inference following the Quickstart guide in the repo.

import flexflow.serve as ff
import argparse
import json

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', type=int)
    parser.add_argument('--memory_per_gpu', type=int)
    parser.add_argument('--zero_copy_memory_per_node', type=int)
    parser.add_argument('--tensor_parallelism_degree', type=int)
    parser.add_argument('--pipeline_parallelism_degree', type=int)
    parser.add_argument('--llm', type=str)
    parser.add_argument('--ssm', type=str, default='')  # empty string means no SSM is used
    parser.add_argument('--prompts_file', type=str)
    parser.add_argument('--max_requests_per_batch', type=int)
    parser.add_argument('--max_seq_length', type=int)
    parser.add_argument('--max_tokens_per_batch', type=int)
    args = parser.parse_args()

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
        )
    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config, 
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
               )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
    # Shut the background server down once generation completes
    llm.stop_server()
Test Results

We run the evaluation on 4 NVIDIA A100 80-GB GPUs connected over NVLink and record the throughput as the batch size increases from 1 to 32. The results are as follows:

Throughput (tokens/s)    Llama2-70B      OPT-13B
BS=1                     28.709671931    97.12122162
BS=2                     52.22124339     189.1327599
BS=4                     106.9214668     362.0640686
BS=8                     182.9473744     680.4388029
BS=16                    322.7966769     1188.828348
BS=32                    298.8251763     437.7545888
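
For reference, a minimal sketch of one way to obtain such numbers: time the generate call and divide the total number of generated tokens by the elapsed wall-clock time. The output_tokens attribute on the returned result objects is an assumption about the API; substitute whatever field your build exposes for generated tokens.

import time

llm.start_server()
start = time.perf_counter()
results = llm.generate(prompts=prompts)
elapsed = time.perf_counter() - start
llm.stop_server()

# Assumption: each result exposes its generated token IDs as `output_tokens`;
# adjust the attribute name to match your FlexFlow build.
total_tokens = sum(len(r.output_tokens) for r in results)
print(f"throughput: {total_tokens / elapsed:.2f} tokens/s")
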

Any help with this issue would be appreciated!

@lockshaw lockshaw transferred this issue from flexflow/flexflow-train Dec 16, 2024