Update on request manager #14

Open · 3 tasks
zwang86 opened this issue Apr 9, 2024 · 4 comments

Comments


zwang86 commented Apr 9, 2024

A few changes will be applied to the new request manager:

  • Split pre-filling iterations and decoding iterations. A pre-filling iteration runs SSM -> LLM; a speculation-verification decoding iteration runs SSM_1 -> … -> SSM_D -> LLM.
  • Apply padding to the pre-filling batch (chunked prefilling) and the SSM commit batch (the first batch in each spec-verify iteration). For the first SSM iteration, the BatchConfig includes all previously verified tokens (0 ~ BUDGET_L); for all remaining SSM iterations, the BatchConfig includes all tokens we want to speculate.
  • Remove the old_bc dependency and convert the request manager to a state machine.
| Description | Requests in Batch | Tokens in Batch |
| --- | --- | --- |
| Prefilling | 1 | chunk_size (MAX_NUM_TOKENS_IN_BATCH) |
| SSM speculation | n | n * k |
| LLM verification | n | budget_L |
| Commit (LLM) | n | 0 ~ budget_L |
| First SSM speculation | n | 0 ~ budget_S |
| Incremental | n | n |

Assumption: MAX_NUM_TOKENS_IN_BATCH = BUDGET_L = BUDGET_S
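To make the table concrete, here is a minimal sketch of the per-phase token budgets under the assumption above. `BatchPhase`, `max_tokens_in_batch`, and the constant values are hypothetical illustrations, not the actual FlexFlow identifiers.

```cpp
// Hypothetical sketch of the per-phase batch shapes from the table above.
// Names and values are illustrative, not the real FlexFlow identifiers.
constexpr int MAX_NUM_TOKENS_IN_BATCH = 64;
constexpr int BUDGET_L = MAX_NUM_TOKENS_IN_BATCH; // per the assumption above
constexpr int BUDGET_S = MAX_NUM_TOKENS_IN_BATCH; // per the assumption above

enum class BatchPhase {
  PREFILLING,            // 1 request, chunk_size tokens
  SSM_SPECULATION,       // n requests, n * k tokens
  LLM_VERIFICATION,      // n requests, budget_L tokens
  LLM_COMMIT,            // n requests, 0 ~ budget_L tokens
  FIRST_SSM_SPECULATION, // n requests, 0 ~ budget_S tokens (padded)
  INCREMENTAL            // n requests, n tokens
};

// Upper bound on the number of tokens in a batch for each phase,
// mirroring the "Tokens in Batch" column.
int max_tokens_in_batch(BatchPhase phase, int n, int k) {
  switch (phase) {
    case BatchPhase::PREFILLING:            return MAX_NUM_TOKENS_IN_BATCH;
    case BatchPhase::SSM_SPECULATION:       return n * k;
    case BatchPhase::LLM_VERIFICATION:      return BUDGET_L;
    case BatchPhase::LLM_COMMIT:            return BUDGET_L;
    case BatchPhase::FIRST_SSM_SPECULATION: return BUDGET_S;
    case BatchPhase::INCREMENTAL:           return n;
  }
  return 0;
}
```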


zwang86 commented Apr 9, 2024

@xinhaoc @goliaro @jiazhihao @zikun-li
Please review and let me know if you have any thoughts or suggestions!


zikun-li commented Apr 9, 2024

It seems that for SSM speculation there are also up to budget_S tokens in the batch for each small-model inference. We also need to apply padding to SSM speculation, because the number of tokens is less than budget_S most of the time.
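A minimal sketch of that padding, assuming a hypothetical `TokenSlot` layout; the kernels would mask out the padding slots:

```cpp
#include <vector>

// Hypothetical token-slot layout for illustration only; the real
// BatchConfig layout in FlexFlow differs.
struct TokenSlot {
  int request_idx;
  int token_id;
  bool is_padding;
};

// Pad the SSM speculation batch up to budget_S with inert slots, since
// most iterations carry fewer than budget_S real tokens.
std::vector<TokenSlot> pad_to_budget(std::vector<TokenSlot> slots,
                                     int budget_S, int pad_token_id) {
  while (static_cast<int>(slots.size()) < budget_S) {
    slots.push_back(TokenSlot{/*request_idx=*/-1, pad_token_id,
                              /*is_padding=*/true});
  }
  return slots;
}
```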


zwang86 commented Apr 11, 2024

Request Manager states:

  1. Prefilling (SSM + LLM) -> update_ir (update RM state) -> 2. SSM decoding iteration -> update_ir (update RM state) -> 3. LLM verify iteration -> update_ir (update RM state) -> back to (1) or (2) -> ...
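A minimal sketch of this loop as a state machine. The state names and the transition conditions (`prefill_requests_pending`, `speculation_done`) are my hypothetical reading of the comment above, not the actual FlexFlow implementation:

```cpp
// Hypothetical request-manager states mirroring steps 1-3 above.
enum class RMState { PREFILLING, SSM_DECODING, LLM_VERIFICATION };

RMState next_state(RMState cur, bool prefill_requests_pending,
                   bool speculation_done) {
  switch (cur) {
    case RMState::PREFILLING:
      // After a prefilling iteration (SSM + LLM) and update_ir,
      // move to SSM decoding.
      return RMState::SSM_DECODING;
    case RMState::SSM_DECODING:
      // Run SSM_1 ... SSM_D; once speculation is done, verify with the LLM.
      return speculation_done ? RMState::LLM_VERIFICATION
                              : RMState::SSM_DECODING;
    case RMState::LLM_VERIFICATION:
      // After verification and update_ir, either prefill newly arrived
      // requests or start the next speculation round ("back to (1) or (2)").
      return prefill_requests_pending ? RMState::PREFILLING
                                      : RMState::SSM_DECODING;
  }
  return cur;
}
```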


zwang86 commented Apr 11, 2024

Old version API:

```cpp
BatchConfig RequestManager::prepare_next_batch(BatchConfig const &old_bc, InferenceResult const &result);
```

New version API:

```cpp
void RequestManager::update_inference_result(InferenceResult const &ir);
BatchConfig RequestManager::get_next_batch();
```
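A minimal sketch of how a serving loop might use the new two-call API. Only `update_inference_result` and `get_next_batch` come from the comment above; `run_inference`, `serve_loop`, and the stub types are hypothetical stand-ins:

```cpp
// Stand-in declarations for the FlexFlow types named in the comment;
// the real definitions live in the request manager.
struct InferenceResult {};
struct BatchConfig {};
struct RequestManager {
  void update_inference_result(InferenceResult const &ir);
  BatchConfig get_next_batch();
};

// Hypothetical model step; not a real FlexFlow call.
InferenceResult run_inference(BatchConfig const &bc);

// Sketch of a serving loop under the new two-call API: the request
// manager advances its own state from each InferenceResult, so no
// old_bc is threaded through the loop.
void serve_loop(RequestManager &rm) {
  while (true) {
    BatchConfig bc = rm.get_next_batch();    // built from RM state alone
    InferenceResult ir = run_inference(bc);  // hypothetical model step
    rm.update_inference_result(ir);          // advance the state machine
  }
}
```

Splitting `prepare_next_batch` into a result-update step and a batch-construction step removes the `old_bc` dependency: the request manager derives the next batch purely from its own state, which is what makes the state-machine conversion possible.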

@lockshaw lockshaw transferred this issue from flexflow/flexflow-train Dec 16, 2024