Performance Issue #15

Open
lethean287 opened this issue Apr 20, 2024 · 1 comment
Labels: enhancement (New feature or request)

@lethean287

Hi, we have tried to run the speculative inference process on OPT-13B and Llama2-70B-chat, but encountered some issues. Specifically, for Llama2-70B-chat we obtained performance worse than vLLM, which seems abnormal. For OPT-13B, we hit a core dump error on several inference datasets.
Our execution process is as follows:
We first set up the environment by directly using the Docker image you provided (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and built from source following your instructions.

We then attempted to run FlexFlow inference with the following command, but encountered a core dump.

python -u run.py \
--num_gpus 4 \
--memory_per_gpu 78000 \
--zero_copy_memory_per_node 200000 \
--tensor_parallelism_degree 4 \
--pipeline_parallelism_degree 1 \
--max_requests_per_batch 8 \
--max_seq_length 128 \
--max_tokens_per_batch 1024 \
--llm facebook/opt-13b \
--ssm facebook/opt-125m \
--prompts_file prompts/dialogue.json

Specifically, run.py is the script we wrote following the Quickstart guide in the repo:

import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', default=2, type=int)
    parser.add_argument('--memory_per_gpu', default=38000, type=int)
    parser.add_argument('--zero_copy_memory_per_node', default=30000, type=int)
    parser.add_argument('--tensor_parallelism_degree', default=2, type=int)
    parser.add_argument('--pipeline_parallelism_degree', default=1, type=int)
    parser.add_argument('--llm', default='facebook/opt-125m', type=str)
    parser.add_argument('--ssm', default='facebook/opt-125m', type=str)
    parser.add_argument('--prompts_file', default='prompts/Alpaca.json', type=str)
    parser.add_argument('--max_requests_per_batch', default=16, type=int)
    parser.add_argument('--max_seq_length', default=128, type=int)
    parser.add_argument('--max_tokens_per_batch', default=128, type=int)
    args = parser.parse_args()

    os.environ['TRANSFORMERS_OFFLINE'] = '1'

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
        )

    #pdb.set_trace()

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms=[]
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config, 
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
               )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
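
The "total inference time" below refers to the end-to-end time to process all requests. One minimal way to capture it, reusing llm and prompts from run.py above, is the following sketch (the timing wrapper itself is our illustration and not part of the script):

import time

# Illustrative only (not part of run.py above): wrap the generate() call to
# record end-to-end wall-clock time for all prompts in the file.
start = time.perf_counter()
result = llm.generate(prompts=prompts)
elapsed = time.perf_counter() - start
print(f"Processed {len(prompts)} prompts in {elapsed:.3f} s")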

We ran the evaluation on 4 NVIDIA A100 80 GB GPUs connected over NVLink, and recorded the total inference time to process all requests in the chatbot dataset with vLLM and SpecInfer respectively. We first tested the Llama2-70B-chat model with the llama-160M you provided as the SSM. The results are as follows:

Batch size    vLLM (s)          SpecInfer (s)
BS=1          1022.952185869    1550.611874
BS=2          529.516379833     800.023607
BS=4          275.700631380     408.75528
BS=8          144.448794603     236.409383
BS=16         76.175143718      133.675686
BS=32         42.816745996      95.503888
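
For reference, a vLLM baseline for this comparison can be set up with vLLM's offline LLM API. The sketch below is illustrative only: the Hugging Face model id, the prompts file name, max_num_seqs, and the sampling settings are assumptions, not necessarily the exact configuration behind the numbers above.

# Illustrative vLLM baseline sketch. The model id, prompts path
# ("prompts/chatbot.json" is a hypothetical name), max_num_seqs, and
# sampling settings are assumptions, not the exact configuration used.
import json
import time

from vllm import LLM, SamplingParams

with open("prompts/chatbot.json", "r") as f:
    prompts = json.load(f)

# Greedy decoding, to roughly match the SpecInfer GenerationConfig
# (do_sample=False, temperature=0) in run.py.
sampling_params = SamplingParams(temperature=0, max_tokens=128)

# max_num_seqs caps how many requests are batched together, which is how
# we read "BS" in the table above (assumption).
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    max_num_seqs=8,
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start
print(f"vLLM processed {len(prompts)} requests in {elapsed:.2f} s")

Here tensor_parallel_size=4 matches the 4x A100 setup, and temperature=0 gives greedy decoding comparable to the SpecInfer run.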

It seems that vLLM's performance is better than SpecInfer's.
Moreover, we have also run OPT-13B with OPT-125M as the SSM on several datasets, including the dialogue dataset, but hit the core dump error:
(core_dump screenshot attached)

All the datasets mentioned above are here: https://github.com/lethean287/dataset_0421
Any help with this issue is appreciated!

@lockshaw added the enhancement (New feature or request) label on Jun 3, 2024
@QAZWSX0827

Hello, have you successfully reproduced the results of SpecInfer?
