Questions about the measurement of latency #1454
Hi, is there an answer to the question?

Additionally, I tested the difference between incremental decoding and speculative decoding. For incremental decoding, I used the following code:

```python
import flexflow.serve as ff
ff.init(
num_gpus=1,
memory_per_gpu=56000,
zero_copy_memory_per_node=120000,
tensor_parallelism_degree=1,
pipeline_parallelism_degree=1
)
# Specify the LLM
# llm = ff.LLM("meta-llama/Llama-2-7b-hf")
llm = ff.LLM("/public/home/wutong/meta-llama/Llama-2-7b-hf")
# Specify a list of SSMs (just one in this case)
ssms=[]
# ssm = ff.SSM("JackFram/llama-68m")
ssm = ff.SSM("/public/home/wutong/JackFram/llama-68m")
ssms.append(ssm)
# Create the sampling configs
generation_config = ff.GenerationConfig(
do_sample=False, temperature=0.9, topp=0.8, topk=1
)
# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
ssm.compile(generation_config)
# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config,
max_requests_per_batch = 16,
max_seq_length = 256,
max_tokens_per_batch = 128,
ssms=ssms)
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
# result = llm.generate("Give three tips for staying healthy.")
llm.stop_server()  # This invocation is optional
```

For speculative decoding, I used the following code:

```python
import flexflow.serve as ff
# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
num_gpus=1,
memory_per_gpu=56000,
zero_copy_memory_per_node=120000,
tensor_parallelism_degree=1,
pipeline_parallelism_degree=1
)
# Create the FlexFlow LLM
# llm = ff.LLM("meta-llama/Llama-2-7b-hf")
llm = ff.LLM("/public/home/wutong/meta-llama/Llama-2-7b-hf")
# Create the sampling configs
generation_config = ff.GenerationConfig(
do_sample=True, temperature=0.9, topp=0.8, topk=1
)
# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config,
max_requests_per_batch = 16,
max_seq_length = 256,
max_tokens_per_batch = 128)
# Generation begins!
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
# result = llm.generate("Give three tips for staying healthy.")
llm.stop_server()  # This invocation is optional
```

When testing with the prompt "Here are some travel tips for Tokyo:\n", I obtained the same result. However, when testing with the prompt "Give three tips for staying healthy.", I received different results. The result for "Incremental decoding" was:
The result for "Speculative decoding" was:
Is this normal?
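One difference between the two runs above is the sampling configuration: the first uses `do_sample=False` and the second uses `do_sample=True`, which by itself can change the generated text for a given prompt. A minimal sketch of aligning the two runs on the same (greedy) sampling settings, reusing the `GenerationConfig` arguments shown above, would be:

```python
import flexflow.serve as ff

# Use identical, greedy sampling in both the incremental and the speculative run,
# so that any remaining difference in output comes from the decoding strategy
# rather than from random sampling. The argument values are taken from the
# snippets above; this is an illustrative sketch, not a verified reference setup.
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)
```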
Hello, FlexFlow team!
Thank you for your outstanding work! I am attempting to reproduce the experimental results from the paper "SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification" on a single H100. However, I encountered some issues and would like to understand how these results compare with the vllm framework. The details are as follows:
Dataset:
We used the first ten prompts from alpaca.json, one of the five datasets provided by the team.
Model:
LLM: meta-llama/Llama-2-7b-hf
SSM: jackfram/llama-68m
(As I am unable to access Hugging Face directly, I downloaded the model parameters locally.)
Parameter Settings:
For SpecInfer: max_requests_per_batch = 16, max_seq_length = 256, max_tokens_per_batch = 128, temperature = 0.8, top_p = 0.95
For vllm: temperature = 0.8, top_p = 0.95, max_tokens = 256
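For reference, a hypothetical sketch of how these vllm settings map onto the vLLM Python API is shown below; the model path and prompt are placeholders, not the actual script used in this issue.

```python
from vllm import LLM, SamplingParams

# Sampling settings matching the parameters listed above.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# Placeholder local model path (the real path is site-specific).
llm = LLM(model="/path/to/Llama-2-7b-hf")

# Generate completions for a sample prompt and print the text.
outputs = llm.generate(["Here are some travel tips for Tokyo:\n"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```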
Environment Configuration:
For SpecInfer, I installed version v24.1.0 from source.
For vllm, I used pip install vllm.
During the testing of SpecInfer, I referred to the code in issue flexflow/flexflow-serve#15. My run_specinfer.py script is as follows:
Command-line execution:
For testing vllm, I referred to the code in issue flexflow/flexflow-serve#37. My run_vllm.py script is as follows:
Command-line execution:
python3 run_vllm.py > resultOfvllm.txt
The logs obtained from SpecInfer are as follows:
resultOfSpec.txt
The logs obtained from vllm are as follows:
resultOfvllm.txt
According to the team's previous issues, the latency (in microseconds) for each prompt represents the computation time. Therefore, I summed the latencies of the ten prompts, and the result is:
1189722.0 + 1190138.0 + 1318237.0 + 1598564.0 + 1734440.0 + 2855074.0 + 2855302.0 + 3304062.0 + 3902707.0 + 4895604.0 = 24,843,850 microseconds = 24.84385 s
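A quick way to check the sum and the unit conversion, using the per-prompt latencies reported in the SpecInfer log above:

```python
# Sum the per-prompt latencies (in microseconds) and convert to seconds.
latencies_us = [1189722.0, 1190138.0, 1318237.0, 1598564.0, 1734440.0,
                2855074.0, 2855302.0, 3304062.0, 3902707.0, 4895604.0]
total_us = sum(latencies_us)
print(total_us, total_us / 1e6)  # 24843850.0 microseconds -> 24.84385 s
```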
(I also used time.time() in Python to measure the time required for vllm, and the result is 3.26208758354187 s.)
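For an apples-to-apples comparison, the same wall-clock measurement could be taken around the FlexFlow run. A minimal sketch, assuming the `llm` object from the scripts above and a `prompts` list holding the ten alpaca.json prompts, is:

```python
import time

# Wall-clock timing around the whole batch of prompts. Note that this measures
# end-to-end serving time, which is not necessarily the same quantity as the
# per-request latency counters reported in the SpecInfer logs.
start = time.time()
results = llm.generate(prompts)  # `prompts`: placeholder for the ten prompts used above
elapsed = time.time() - start
print(f"wall-clock time: {elapsed:.3f} s")
```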
My test results seem unusual. Could you please advise if there are any errors in my testing method? Additionally, any further details on reproducing the paper's results would be greatly appreciated.