Hello, we are trying to use SpecInfer to accelerate model inference, but we have run into a performance issue. Specifically, as the batch size grows from 1 to 16, the system throughput improves steadily; however, at batch size 32 the throughput drops sharply, which is confusing. Our setup and configuration are as follows:
Environment Setup
We use the provided Docker image (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and build FlexFlow from source following the docs (https://flexflow.readthedocs.io/en/latest/). We test two supported models, Llama2-70B and OPT-13B, on the finance-alpaca dataset (https://huggingface.co/datasets/gbharti/finance-alpaca).
Test Script
We run model inference following the Quickstart guidance in the repo.
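The --prompts_file argument expects a JSON file containing a list of prompt strings (it is read with json.load in the script below). A hypothetical sketch of how such a file can be built from the dataset mentioned above; the "instruction" column name is an assumption about its alpaca-style schema, and the 128-prompt subset is just an example:

# Hypothetical helper (not part of the Quickstart): dump a subset of the
# dataset prompts into the JSON file passed via --prompts_file.
# The "instruction" column name is an assumption about the alpaca-style schema.
import json
from datasets import load_dataset

ds = load_dataset("gbharti/finance-alpaca", split="train")
prompts = [row["instruction"] for row in ds.select(range(128))]

with open("prompts.json", "w") as f:
    json.dump(prompts, f)

The inference script itself is: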
import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', type=int)
    parser.add_argument('--memory_per_gpu', type=int)
    parser.add_argument('--zero_copy_memory_per_node', type=int)
    parser.add_argument('--tensor_parallelism_degree', type=int)
    parser.add_argument('--pipeline_parallelism_degree', type=int)
    parser.add_argument('--llm', type=str)
    parser.add_argument('--ssm', type=str)
    parser.add_argument('--prompts_file', type=str)
    parser.add_argument('--max_requests_per_batch', type=int)
    parser.add_argument('--max_seq_length', type=int)
    parser.add_argument('--max_tokens_per_batch', type=int)
    args = parser.parse_args()

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree)

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config,
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch)

    # Load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
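The script above does not print timings itself; the throughput numbers below come from timing the llm.generate call. A minimal sketch of one way to do that is shown here (not our exact harness; the output_tokens field name is an assumption about GenerationResult and may differ across FlexFlow versions):

# Minimal timing sketch around the generate call from the script above.
# Assumes each GenerationResult exposes an `output_tokens` list; adjust the
# field name if your FlexFlow version differs.
import time

llm.start_server()
start = time.time()
results = llm.generate(prompts=prompts)
elapsed = time.time() - start
llm.stop_server()

total_tokens = sum(len(r.output_tokens) for r in results)
print(f"{total_tokens} tokens in {elapsed:.2f}s -> {total_tokens / elapsed:.2f} tokens/s")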
Test Results
We run the evaluation on 4 NVIDIA A100 80GB GPUs connected over NVLink and record the throughput as the batch size increases from 1 to 32 (a sketch of how each batch size is launched follows the table). The results are as follows:
Throughput (tokens/s)

            Llama2-70B      OPT-13B
BS=1        28.709671931    97.12122162
BS=2        52.22124339     189.1327599
BS=4        106.9214668     362.0640686
BS=8        182.9473744     680.4388029
BS=16       322.7966769     1188.828348
BS=32       298.8251763     437.7545888
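For reference, the batch size in the table is controlled by the --max_requests_per_batch flag of the script above. A hypothetical sketch of the sweep loop is below; the script name, model ids, memory sizes, and sequence/token limits are placeholders rather than our exact configuration:

# Hypothetical sweep driver (not our exact harness): every value below
# (script name, models, memory sizes, sequence and token limits) is a placeholder.
import subprocess

for bs in [1, 2, 4, 8, 16, 32]:
    subprocess.run([
        "python", "spec_infer.py",                # placeholder script name
        "--num_gpus", "4",
        "--memory_per_gpu", "70000",              # placeholder
        "--zero_copy_memory_per_node", "200000",  # placeholder
        "--tensor_parallelism_degree", "4",
        "--pipeline_parallelism_degree", "1",
        "--llm", "meta-llama/Llama-2-70b-hf",     # placeholder model id
        "--ssm", "JackFram/llama-160m",           # placeholder draft model
        "--prompts_file", "prompts.json",
        "--max_requests_per_batch", str(bs),
        "--max_seq_length", "256",                # placeholder
        "--max_tokens_per_batch", "128",          # placeholder
    ], check=True)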
Any help in resolving this issue would be appreciated!