Update on request manager #14

Open · 3 tasks
zwang86 opened this issue Apr 9, 2024 · 4 comments

Comments


zwang86 commented Apr 9, 2024

A few changes will be applied to the new request manager:

  • Split pre-filling iterations and decoding iterations. A pre-filling iteration runs SSM -> LLM; a speculation-verification decoding iteration runs SSM_1 -> … -> SSM_D -> LLM.
  • Apply padding to the pre-filling batch (chunked prefilling) and the SSM commit batch (the first batch in each spec-verify iteration). For the first SSM iteration, the BatchConfig includes all previously verified tokens (0 ~ BUDGET_L); for all remaining SSM iterations, the BatchConfig includes all tokens we want to speculate.
  • Remove the old_bc dependency and convert the request manager to a state machine.
| Description | Requests in Batch | Tokens in Batch |
| --- | --- | --- |
| Prefilling | 1 | chunk_size (MAX_NUM_TOKENS_IN_BATCH) |
| SSM speculation | n | n * k |
| LLM verification | n | budget_L |
| Commit (LLM) | n | 0 ~ budget_L |
| First SSM speculation | n | 0 ~ budget_S |
| Incremental | n | n |

Assumption: MAX_NUM_TOKENS_IN_BATCH = BUDGET_L = BUDGET_S
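To make the table concrete, here is a minimal sketch of the per-phase token budgets under the assumption above. `BatchPhase`, `max_tokens_in_batch`, and the constant values are hypothetical illustrations, not the actual FlexFlow identifiers.

```cpp
// Hypothetical sketch of the per-phase batch shapes from the table above.
// Names and values are illustrative, not the real FlexFlow identifiers.
constexpr int MAX_NUM_TOKENS_IN_BATCH = 64;
constexpr int BUDGET_L = MAX_NUM_TOKENS_IN_BATCH; // per the assumption above
constexpr int BUDGET_S = MAX_NUM_TOKENS_IN_BATCH; // per the assumption above

enum class BatchPhase {
  PREFILLING,            // 1 request, chunk_size tokens
  SSM_SPECULATION,       // n requests, n * k tokens
  LLM_VERIFICATION,      // n requests, budget_L tokens
  LLM_COMMIT,            // n requests, 0 ~ budget_L tokens
  FIRST_SSM_SPECULATION, // n requests, 0 ~ budget_S tokens (padded)
  INCREMENTAL            // n requests, n tokens
};

// Upper bound on the number of tokens in a batch for each phase,
// mirroring the "Tokens in Batch" column.
int max_tokens_in_batch(BatchPhase phase, int n, int k) {
  switch (phase) {
    case BatchPhase::PREFILLING:            return MAX_NUM_TOKENS_IN_BATCH;
    case BatchPhase::SSM_SPECULATION:       return n * k;
    case BatchPhase::LLM_VERIFICATION:      return BUDGET_L;
    case BatchPhase::LLM_COMMIT:            return BUDGET_L;
    case BatchPhase::FIRST_SSM_SPECULATION: return BUDGET_S;
    case BatchPhase::INCREMENTAL:           return n;
  }
  return 0;
}
```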


zwang86 commented Apr 9, 2024

@xinhaoc @goliaro @jiazhihao @zikun-li
Please review and let me know if you have any thoughts or suggestions!


zikun-li commented Apr 9, 2024

It seems that for SSM speculation there are also up to budget_S tokens in the batch for each small-model inference. We also need to apply padding to SSM speculation, because the number of tokens is less than budget_S most of the time.
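A minimal sketch of that padding, assuming a hypothetical `TokenSlot` layout; the kernels would mask out the padding slots:

```cpp
#include <vector>

// Hypothetical token-slot layout for illustration only; the real
// BatchConfig layout in FlexFlow differs.
struct TokenSlot {
  int request_idx;
  int token_id;
  bool is_padding;
};

// Pad the SSM speculation batch up to budget_S with inert slots, since
// most iterations carry fewer than budget_S real tokens.
std::vector<TokenSlot> pad_to_budget(std::vector<TokenSlot> slots,
                                     int budget_S, int pad_token_id) {
  while (static_cast<int>(slots.size()) < budget_S) {
    slots.push_back(TokenSlot{/*request_idx=*/-1, pad_token_id,
                              /*is_padding=*/true});
  }
  return slots;
}
```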


zwang86 commented Apr 11, 2024

Request Manager states:

  1. Prefilling (SSM + LLM) -> update_ir (update RM state) -> 2. SSM decoding iteration -> update_ir (update RM state) -> 3. LLM verify iteration -> update_ir (update RM state) -> back to (1) or (2) -> ...
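A minimal sketch of this loop as a state machine. The state names and the transition conditions (`prefill_requests_pending`, `speculation_done`) are my hypothetical reading of the comment above, not the actual FlexFlow implementation:

```cpp
// Hypothetical request-manager states mirroring steps 1-3 above.
enum class RMState { PREFILLING, SSM_DECODING, LLM_VERIFICATION };

RMState next_state(RMState cur, bool prefill_requests_pending,
                   bool speculation_done) {
  switch (cur) {
    case RMState::PREFILLING:
      // After a prefilling iteration (SSM + LLM) and update_ir,
      // move to SSM decoding.
      return RMState::SSM_DECODING;
    case RMState::SSM_DECODING:
      // Run SSM_1 ... SSM_D; once speculation is done, verify with the LLM.
      return speculation_done ? RMState::LLM_VERIFICATION
                              : RMState::SSM_DECODING;
    case RMState::LLM_VERIFICATION:
      // After verification and update_ir, either prefill newly arrived
      // requests or start the next speculation round ("back to (1) or (2)").
      return prefill_requests_pending ? RMState::PREFILLING
                                      : RMState::SSM_DECODING;
  }
  return cur;
}
```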


zwang86 commented Apr 11, 2024

Old version API:

```cpp
BatchConfig RequestManager::prepare_next_batch(BatchConfig const &old_bc, InferenceResult const &result);
```

New version API:

```cpp
void RequestManager::update_inference_result(InferenceResult const &ir);
BatchConfig RequestManager::get_next_batch();
```
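A minimal sketch of how a serving loop might use the new two-call API. Only `update_inference_result` and `get_next_batch` come from the comment above; `run_inference`, `serve_loop`, and the stub types are hypothetical stand-ins:

```cpp
// Stand-in declarations for the FlexFlow types named in the comment;
// the real definitions live in the request manager.
struct InferenceResult {};
struct BatchConfig {};
struct RequestManager {
  void update_inference_result(InferenceResult const &ir);
  BatchConfig get_next_batch();
};

// Hypothetical model step; not a real FlexFlow call.
InferenceResult run_inference(BatchConfig const &bc);

// Sketch of a serving loop under the new two-call API: the request
// manager advances its own state from each InferenceResult, so no
// old_bc is threaded through the loop.
void serve_loop(RequestManager &rm) {
  while (true) {
    BatchConfig bc = rm.get_next_batch();    // built from RM state alone
    InferenceResult ir = run_inference(bc);  // hypothetical model step
    rm.update_inference_result(ir);          // advance the state machine
  }
}
```

Splitting `prepare_next_batch` into a result-update step and a batch-construction step removes the `old_bc` dependency: the request manager derives the next batch purely from its own state, which is what makes the state-machine conversion possible.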

@lockshaw lockshaw transferred this issue from flexflow/flexflow-train Dec 16, 2024