Hi, we have tried to run speculative inference on OPT-13B and Llama2-70B-chat, but ran into some issues. Specifically, for Llama2-70B-chat we observed performance worse than vLLM, which seems abnormal. For OPT-13B, we hit a core dump error on several inference datasets.
Our execution process is as follows:
We first set up the environment by directly using the Docker image you provided (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and built FlexFlow from source following your instructions.
We then attempted to run FlexFlow inference with the following command, but encountered a core dump.
python -u run.py \
    --num_gpus 4 \
    --memory_per_gpu 78000 \
    --zero_copy_memory_per_node 200000 \
    --tensor_parallelism_degree 4 \
    --pipeline_parallelism_degree 1 \
    --max_requests_per_batch 8 \
    --max_seq_length 128 \
    --max_tokens_per_batch 1024 \
    --llm facebook/opt-13b \
    --ssm facebook/opt-125m \
    --prompts_file prompts/dialogue.json
Specifically, run.py is a script we wrote following the Quickstart guidance in the repo:
import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', default=2, type=int)
    parser.add_argument('--memory_per_gpu', default=38000, type=int)
    parser.add_argument('--zero_copy_memory_per_node', default=30000, type=int)
    parser.add_argument('--tensor_parallelism_degree', default=2, type=int)
    parser.add_argument('--pipeline_parallelism_degree', default=1, type=int)
    parser.add_argument('--llm', default='facebook/opt-125m', type=str)
    parser.add_argument('--ssm', default='facebook/opt-125m', type=str)
    parser.add_argument('--prompts_file', default='prompts/Alpaca.json', type=str)
    parser.add_argument('--max_requests_per_batch', default=16, type=int)
    parser.add_argument('--max_seq_length', default=128, type=int)
    parser.add_argument('--max_tokens_per_batch', default=128, type=int)
    args = parser.parse_args()

    os.environ['TRANSFORMERS_OFFLINE'] = '1'

    ff.init(
        num_gpus=args.num_gpus,
        memory_per_gpu=args.memory_per_gpu,
        zero_copy_memory_per_node=args.zero_copy_memory_per_node,
        tensor_parallelism_degree=args.tensor_parallelism_degree,
        pipeline_parallelism_degree=args.pipeline_parallelism_degree
    )

    # pdb.set_trace()
    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config,
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch)

    # Load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
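For reference, the prompts files we pass in are plain JSON arrays of prompt strings (that is our assumption of the expected format, based on how json.load feeds llm.generate above), and we time only the generation step itself. A minimal sketch, where the file name, prompt contents, and the timed_generate helper are purely illustrative and not part of our actual datasets or benchmark script:

import json
import time

# A tiny prompts file in the format run.py expects: a plain JSON
# list of prompt strings (illustrative content, not our datasets).
example_prompts = [
    "What is speculative inference?",
    "Summarize the plot of Hamlet in two sentences.",
]
with open("example_prompts.json", "w") as f:
    json.dump(example_prompts, f)


def timed_generate(llm, prompts):
    # How we measure the total inference time reported below:
    # wrap only llm.generate() with a wall-clock timer.
    start = time.perf_counter()
    result = llm.generate(prompts=prompts)
    print(f"Total inference time: {time.perf_counter() - start:.3f} s")
    return result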
We ran the evaluation on 4 NVIDIA 80-GB A100 GPUs connected over NVLink, and recorded the total inference time to process all requests in the chatbot dataset with vLLM and SpecInfer respectively. We first tested the Llama2-70B-chat model with the llama-160M you provided as the SSM (a sketch of how we measured the vLLM baseline is included after the table). The results are as follows:
| Batch size | vLLM (s) | SpecInfer (s) |
| --- | --- | --- |
| BS=1 | 1022.952185869 | 1550.611874 |
| BS=2 | 529.516379833 | 800.023607 |
| BS=4 | 275.700631380 | 408.75528 |
| BS=8 | 144.448794603 | 236.409383 |
| BS=16 | 76.175143718 | 133.675686 |
| BS=32 | 42.816745996 | 95.503888 |
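For the vLLM baseline we used its offline batching API, roughly as sketched below. The model name, prompts path, max_num_seqs value, and sampling settings are representative placeholders rather than a verbatim copy of our benchmark script; max_num_seqs is how we map to the BS column above.

import json
import time

from vllm import LLM, SamplingParams

# Load the chatbot prompts (illustrative path, same JSON-list format as above).
with open("prompts/chatbot.json", "r") as f:
    prompts = json.load(f)

# 4-way tensor parallelism to match the SpecInfer run;
# max_num_seqs corresponds to the BS column in the table.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    max_num_seqs=8,
)
sampling_params = SamplingParams(temperature=0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
print(f"Total inference time: {time.perf_counter() - start:.3f} s")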
At every batch size we tested, vLLM comes out faster than SpecInfer, which seems abnormal to us.
Moreover, we also ran OPT-13B with OPT-125M as the SSM on several datasets, including the dialogue dataset, but hit the core dump error described above.
All the datasets mentioned above are here: https://github.com/lethean287/dataset_0421
Any help in resolving these issues would be appreciated!