From 26c03eee24a1773e7102afefbe6a9ad0478ef528 Mon Sep 17 00:00:00 2001
From: Insu Jang
Date: Wed, 10 Jan 2024 09:43:59 -0500
Subject: [PATCH] rebuilding site Wed Jan 10 09:43:59 EST 2024

---
 .../index.html | 25 +++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/2024-01-07/llm-inference-continuous-batching-and-pagedattention/index.html b/2024-01-07/llm-inference-continuous-batching-and-pagedattention/index.html
index a804dbc..4f3d66f 100644
--- a/2024-01-07/llm-inference-continuous-batching-and-pagedattention/index.html
+++ b/2024-01-07/llm-inference-continuous-batching-and-pagedattention/index.html
@@ -112,7 +112,7 @@
     "keywords": ["dl","inference","attention"],
     "mainEntityOfPage": "true",
-    "wordCount": "1201"
+    "wordCount": "1338"
 }]
@@ -410,6 +410,7 @@

  • PagedAttention
+ • Prompt Handling
@@ -580,7 +581,27 @@

    Preemption with Page M

    self._append_slot(seq_group, blocks_to_copy)
    num_curr_seqs += num_new_seqs
    self.running.append(seq_group)
-

    +Prompt Handling

    +Unlike the illustration above, PagedAttention does not seem to coalesce prompt and decode requests in the same iteration.

    +In the PagedAttention implementation, forward() checks whether the input is a prompt:

    +if input_metadata.is_prompt:
    +    # Prompt run.
    +    ...
    +else:
    +    # Decoding run.
    +    ...
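
The two branches run different computations. Below is a minimal sketch of what each branch amounts to, using plain scaled-dot-product attention in PyTorch rather than vLLM's paged kernels; sketch_attention_forward and the dict-based kv_cache are illustrative names, not vLLM's API:

    import torch

    def sketch_attention_forward(query, key, value, kv_cache, is_prompt):
        scale = query.shape[-1] ** -0.5
        if is_prompt:
            # Prompt run: full causal self-attention over all prompt tokens,
            # then populate the KV cache for later decode iterations.
            kv_cache["k"], kv_cache["v"] = key, value
            scores = (query @ key.transpose(-2, -1)) * scale
            causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
            return scores.masked_fill(causal, float("-inf")).softmax(-1) @ value
        else:
            # Decoding run: one new token per sequence; append its key/value to
            # the cache and attend against everything cached so far.
            kv_cache["k"] = torch.cat([kv_cache["k"], key], dim=-2)
            kv_cache["v"] = torch.cat([kv_cache["v"], value], dim=-2)
            scores = (query @ kv_cache["k"].transpose(-2, -1)) * scale
            return scores.softmax(-1) @ kv_cache["v"]

Either branch consumes the whole batch at once, which is why a single call cannot serve prompts and decodes simultaneously.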
    +Because the query, key, and cache arguments contain a single batched input, all inputs in a batch must be either prompts or decodes; the two cannot be coalesced.
    +This is also verified in the Model Runner:

    +def prepare_input_tensors(...):
    +    # NOTE: We assume that all sequences in the group are all prompts or
    +    # all decodes.
    +    is_prompt = seq_group_metadata_list[0].is_prompt
    +    if is_prompt:
    +        (input_tokens, input_positions, input_metadata, prompt_lens) = self._prepare_prompt(seq_group_metadata_list)
    +    else:
    +        (input_tokens, input_positions, input_metadata) = self._prepare_decode(seq_group_metadata_list)
    +        prompt_lens = []
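
This invariant presumably keeps the runner simple: with a homogeneous batch it can build one flat set of input tensors and dispatch a single kernel path per iteration, rather than handling mixed prompt/decode shapes in one step.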
    +However, the scheduler may take several iterations to finish all pending prompts before resuming decoding; prompts can either be grouped into one padded batch or executed separately across iterations, as sketched below.

    +
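
A hypothetical illustration of that scheduling policy follows. This is not vLLM's scheduler; step, run_prompt_batch, and run_decode_batch are made-up names standing in for one engine iteration:

    from collections import deque

    def run_prompt_batch(batch):
        # Placeholder: one homogeneous prompt iteration (fills the KV caches).
        ...

    def run_decode_batch(batch):
        # Placeholder: one homogeneous decode iteration (one token per sequence).
        ...

    def step(pending_prompts: deque, running_decodes: list):
        """One scheduling iteration: drain pending prompts before decoding."""
        if pending_prompts:
            # Each iteration admits one prompt batch, so several iterations
            # may pass before all pending prompts have run.
            batch = pending_prompts.popleft()
            run_prompt_batch(batch)
            running_decodes.extend(batch)  # prompted sequences join the decode pool
        else:
            # Decoding resumes only once no prompts are waiting.
            run_decode_batch(running_decodes)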