
[Core] Reduce TTFT with concurrent partial prefills #10235

Open · wants to merge 59 commits into base: main

Conversation

@joerunde (Contributor) commented Nov 11, 2024

Replaces #10061, as inspired by @njhill and @comaniac's comments. Co-authored by @prashantgupta24

Context: our customers running large multi-tenanted SaaS deployments of vLLM have a problem where high volumes of small-prompt requests are usually processed smoothly, but quickly pile up in a giant queue when a small number of large-prompt requests are submitted. We see the decoding throughput drop to zero on multiple replicas when this happens.

The current chunked prefill implementation only allows a single sequence to be partially prefilled at a time. This has a few limitations:

  • Multiple medium-sized prompts must wait to be prefilled serially, increasing TTFT for those at the back of the queue
  • A single very large prompt will block all other prompts from prefilling for many iterations. This can eventually starve decoding: for example, a 130k token prompt with --max-num-batched-tokens=512 will take about 250 iterations to prefill, in which time the currently decoding sequences may all finish. Send a few of these requests at once and very quickly nothing will be decoding.

This PR implements both:

  • An explicit setting for the number of sequences that can be partially prefilled concurrently. This can be configured with --max-num-partial-prefills=N
  • A limit on the number of “very long prompt” sequences that can be prefilled concurrently. This can be configured with
    • --max-long-partial-prefills=N to set the limit on the number of long sequences that can be concurrently prefilled. This defaults to 1 sequence.
    • --long-prefill-threshold=x% to set the percentage of the context length above which a sequence is considered “long”. This defaults to 4%.
This is implemented in the v0 scheduler. We’re aware that the v1 implementation is underway and will later become the default, but we need a fix for our customers soon and we hope that what we discover here may help inform a different, better solution in the v1 scheduler.
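
For illustration, a hypothetical server launch combining these options (assuming the flag names above, together with vLLM's existing --enable-chunked-prefill flag) could look like:

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --enable-chunked-prefill --max-num-partial-prefills=4 --max-long-partial-prefills=1 --long-prefill-threshold=4%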

To test this, we created three scenarios: a “medium request” case, a “large request” case, and a “mixed” case.

For the medium request case, we created a subset of the sharegpt dataset with 900 small requests (<50 prompt characters) and 100 of the largest requests (typically between 10k and 20k prompt characters, which we call “medium” sized). We modified the benchmark_serving.py test to not filter out any of the small or large requests, and ran it with this dataset. What we expect to find is similar throughput compared to the main branch, but much lower TTFT on the small requests. Since 10% of the requests are larger than the rest, we should see better TTFT at p90 and below, with comparable TTFT above p90.

For the large request case, we took 990 of the smallest requests from the sharegpt dataset, and then took 10 of the largest requests and duplicated the prompts until they were around 100k characters in length. We ran this in the same way as the medium request case, and here we expect to see smaller TTFT across the board since the small requests will no longer be blocked from prefilling by the few very large requests.

For the mixed case, we used 850 “small” and 140 “medium” requests, as well as 10 “large” requests whose prompts we duplicated up to 200k characters.
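
As a rough illustration of how such subsets can be carved out of sharegpt by prompt length (the file name, thresholds, and sampling below are assumptions, not the exact script we used):

import json
import random

# Bucket ShareGPT conversations by the character length of their first prompt.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:  # file name is an assumption
    data = [c for c in json.load(f) if c.get("conversations")]

def prompt_chars(conv):
    return len(conv["conversations"][0]["value"])

small = [c for c in data if prompt_chars(c) < 50]             # "small" prompts
largest = sorted(data, key=prompt_chars, reverse=True)[:100]  # the 100 largest prompts

# e.g. the "medium request" case: 900 small prompts plus the 100 largest ones
subset = random.sample(small, 900) + largest
random.shuffle(subset)

with open("sharegpt_medium_case.json", "w") as f:
    json.dump(subset, f)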

All tests were run on a single 80GB A100, with the command:

python benchmarks/benchmark_serving.py --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-path ${test_case} --metric-percentiles 80,85,90,95,99 --request-rate 12

We ran the tests against the main branch (commit 874f551b3626321f6bf9a902b8fd9fc1fa7c7f2e), as well as this PR with the new optimization disabled (--max-num-partial-prefills=1) and enabled (--max-num-partial-prefills=4).

The results are shown in the benchmark results image attached to the PR.

The TTFT improvements are very easy to see: in the medium case we cut the p90 TTFT in half, and in the large case we cut it by nearly 30x. In both cases we did not measure a throughput drop when running with --max-num-partial-prefills=1, and the throughput drop with --max-num-partial-prefills=4 is minimal.

Surprisingly, along with the massive TTFT improvements in the "mixed" test case, we also see a 4% throughput improvement (3506 tokens/s, up from 3368 tokens/s). Since ITL still looks a little slower, the higher throughput appears to come simply from more requests being scheduled at the same time.

cc @rickyyx


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and might not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

assert len(sampling_metadata.seq_groups) \
== len(maybe_deferred_sample_results) \
== len(prompt_logprobs) \
== len(sample_logprobs)
Contributor Author

I'm not really familiar with the sampler code at all, but this assert at least does not trigger under any of our tests so far.

@comaniac @rickyyx Do y'all have pointers to any other places we should dig into where there might be an implicit assumption that the number of partial prefills is <= 1?

Contributor

I think the assertions currently would only happen if:

  • All running sequences are chunked continuous prefills
  • No decode sequences in the scheduler batch.

Contributor Author

Sweet, we will try to add some unit tests to cover that case and see if we can make it crash. Thanks!

vllm/core/scheduler.py (outdated review thread, resolved)
vllm/config.py Outdated
@@ -1085,6 +1085,7 @@ def __init__(self,
max_num_batched_tokens: Optional[int],
max_num_seqs: int,
max_model_len: int,
num_prefill_slots: int = 1,
Collaborator

Is this actually the "maximum number of prefill sequences in a batch"? If so, could we name it something more informative, like max_num_batched_prefill_seqs?

Contributor Author

It's technically only the number of partial prefills allowed in a batch. You could still have, say, 100 sequence groups with 5 prompt tokens each all scheduled in a single step here.

max_num_partial_prefills?

Comment on lines 406 to 407
# Requests with more than (4% max context length) tokens to prefill
# are "big".
Collaborator

Why this definition and threshold?

Contributor Author

The entire goal here is to not allow decode to be starved by the prefill phase blocking on long requests. See this part of the PR description:

A single very large prompt will block all other prompts from prefilling for many iterations. This can eventually starve decoding: for example, a 130k token prompt with --max-num-batched-tokens=512 will take about 250 iterations to prefill, in which time the currently decoding sequences may all finish. Send a few of these requests at once and very quickly nothing will be decoding.

Just allowing concurrent partial prefills doesn't solve the problem by itself, because multiple long requests could still clog the prefill budget. So we only allow a single long request to prefill at a time, and pull smaller requests from the waiting queue instead of more long ones.
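
To make the policy concrete, here is a minimal sketch of the admission rule being described (the function and argument names are illustrative, not the PR's actual code):

def can_start_partial_prefill(tokens_left_to_prefill: int,
                              max_model_len: int,
                              active_partial_prefills: int,
                              active_long_partial_prefills: int,
                              max_num_partial_prefills: int = 4,
                              max_long_partial_prefills: int = 1,
                              long_prefill_threshold: float = 0.04) -> bool:
    """Decide whether another waiting prompt may begin (partial) prefill."""
    if active_partial_prefills >= max_num_partial_prefills:
        return False  # all concurrent partial-prefill slots are taken
    # A prompt is "long" when it exceeds a percentage of the context length.
    is_long = tokens_left_to_prefill > long_prefill_threshold * max_model_len
    if is_long and active_long_partial_prefills >= max_long_partial_prefills:
        return False  # by default only one long prompt may prefill at a time
    return True

Under this rule a second very long prompt waits, while shorter prompts keep getting pulled from the waiting queue, which is the starvation-avoidance behavior described above.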

vllm/core/scheduler.py (multiple outdated review threads, resolved)
@mergify mergify bot added the frontend label Nov 13, 2024
@joerunde joerunde marked this pull request as ready for review November 14, 2024 20:36

@pytest.mark.parametrize("model", ["facebook/opt-125m"])
@pytest.mark.parametrize("max_num_partial_prefills", [2, 4, 8])
def test_chunked_prefill_with_actual_engine(model: str,
Contributor Author

cc @rickyyx here's what we tried in order to test that the sampler doesn't throw any assertions: we put multiple prompts into an engine and manually step it forward with them all partially prefilled.

@ywang96 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 14, 2024
Comment on lines +305 to +312
seq_group_meta, out = schedule_and_update_computed_tokens(scheduler)
# large req gets 63 tokens (minus 1 for decode)
assert seq_group_meta[0].token_chunk_size == 63
assert seq_group_meta[1].token_chunk_size == 1 # decode
@prashantgupta24 (Contributor) commented Nov 15, 2024

Not sure if this is a bug, but at this stage, request#3 should be decoding, but it didn't get any budget. Request#2 got budget for 1 decode token, and request#0 got the remaining budget for prefilling 63 tokens. Is that expected?

Based on this comment,

        # Update new running requests.
        # By default, vLLM scheduler prioritizes prefills.
        # Once chunked prefill is enabled,
        # the policy is changed to prioritize decode requests.

vLLM should have prioritized decode requests and given both request#2 and request#3 a budget of 1 token each, and request#0 the remaining 62?

@prashantgupta24 (Contributor) commented Nov 15, 2024

This does happen in the next iteration though


mergify bot commented Nov 20, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joerunde.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 20, 2024
Signed-off-by: Prashant Gupta <[email protected]>
@prashantgupta24
Contributor

@prashantgupta24 it looks like the regression test is hanging for some reason, after prompt processing starts.

[2024-11-21T18:53:40Z] Running 4 items in this shard: tests/test_regression.py::test_duplicated_ignored_sequence_group, tests/test_regression.py::test_max_tokens_none, tests/test_regression.py::test_gc, tests/test_regression.py::test_model_from_modelscope
...
[2024-11-21T18:53:51Z] INFO 11-21 10:53:51 model_runner.py:1399] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
[2024-11-21T18:53:51Z] INFO 11-21 10:53:51 model_runner.py:1403] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
[2024-11-21T18:54:02Z] INFO 11-21 10:54:02 model_runner.py:1517] Graph capturing finished in 10 secs, took 0.12 GiB
Processed prompts:  50% 1/2 [00:00<00:00,  1.70it/s, est. speed input: 10.17 toks/s, output: 434.01 toks/s]
# Received cancellation signal, interrupting
[2024-11-21T20:52:13Z] 🚨 Error: The command exited with status -1

Whoops fixed that!


mergify bot commented Nov 23, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joerunde.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 23, 2024
@prashantgupta24 (Contributor)

ugh this is a tough merge conflict

@mergify mergify bot removed the needs-rebase label Nov 26, 2024
Signed-off-by: Prashant Gupta <[email protected]>
@prashantgupta24 (Contributor)

@comaniac @njhill any chance we could get this merged? All tests passed

@@ -1755,6 +1935,8 @@ def _get_num_new_uncached_and_cached_tokens(
budget,
self._get_prompt_limit(seq_group),
num_uncached_new_tokens,
self.partial_prefill_budget_lookup_list,
partial_prefill_metadata,
Contributor Author

nit: I think _chunk_new_tokens_to_schedule should not be static if this many of its arguments are instance attributes.

But doesn't have to block this PR


mergify bot commented Dec 10, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joerunde.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 10, 2024
@rickyyx (Contributor) left a comment

This is definitely a great improvement for workloads that might have long prefill requests. Also, great job on the merge (I think I had a PR that also touched very similar components).

W.r.t. the PR itself, I mainly had one question:

  1. What's the overhead of recomputing partial prefill metadata on each scheduler step? If it's not significant, could we avoid it? I feel the information we need for the partial metadata could be accumulated as some very limited amount of internal state of the scheduler.

I am still a bit concerned about the potential consequences of having multiple concurrent prefills in the system, but if nothing breaks, then we are good. I will let folks with more knowledge of the other parts of the system chime in.

cc @comaniac

vllm/core/scheduler.py (multiple review threads, resolved)
self.partial_prefill_budget_lookup_list = [0] * (
self.scheduler_config.max_num_partial_prefills + 1)
self.partial_prefill_budget_lookup_list[0] = (
scheduler_config.max_num_batched_tokens)
Contributor

Just so I am understanding it correctly:
partial_prefill_budget_lookup_list is a list where num partial prefills -> token budget for each partial prefill?

We only need to compute it once for an integer division when we schedule prefills, right? That seems like not too big an overhead to me compared with what we are already computing each scheduler step for all the partial prefill metadata.

If we want to keep this, I think moving this into the metadata class might be better?

Contributor Author

partial_prefill_budget_lookup_list is a list where num partial prefills -> token budget for each partial prefill?

Yup! We calculate this once up-front when the scheduler is created so that we don't have to do a division for every sequence in every scheduling step. A list access measures slightly faster than an integer division, at least on the machines we have running in IBM cloud.
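
For illustration, here is a sketch of how such a precomputed budget table can be built and consulted; the values and surrounding names are illustrative rather than the PR's exact code:

max_num_batched_tokens = 2048   # illustrative engine setting
max_num_partial_prefills = 4    # illustrative engine setting

# Index i -> per-sequence token budget when i partial prefills are in flight.
# Index 0 is never used as a divisor, so it simply holds the full batch budget.
partial_prefill_budget_lookup_list = [max_num_batched_tokens] + [
    max_num_batched_tokens // i for i in range(1, max_num_partial_prefills + 1)
]

# Per scheduling step, a single list access replaces an integer division:
num_partial_prefills = 3
token_budget = partial_prefill_budget_lookup_list[num_partial_prefills]  # 2048 // 3 == 682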

If we want to keep this, I think moving this into the metadata class might be better?

Yeah we could do that, I just wasn't quite sure how to make that properly static and cached on the metadata class, since we wouldn't want to re-create this list on every scheduler step

@joerunde (Contributor Author) commented Dec 19, 2024

I spent a few minutes looking into this and I'm not sure there's a clean way to add a class attribute to a @dataclass, unfortunately. I'd rather not make the code even more hacky than this is!

What do you think between either:

  1. Leaving as is, or
  2. Doing the division in-line like scheduler_config.max_num_batched_tokens // partial_prefill_metadata.schedulable_prefills ?

vllm/model_executor/layers/sampler.py (review thread, resolved)
@mergify mergify bot removed the needs-rebase label Dec 18, 2024
@joerunde (Contributor Author)

  1. What's the overhead of recomputing partial prefill metadata on each scheduler step? If it's not significant, could we avoid it? I feel the information we need for the partial metadata could be accumulated as some very limited amount of internal state of the scheduler.

@rickyyx good question! I kinda already answered why we went this way in this comment but another reason we need to do some amount of up-front processing is that we have to first peek through the waiting queue at the beginning of each step to see if there are prefills that we can schedule so that we can then properly budget the tokens when we schedule the running queue. So, we could keep info around in the scheduler state about the number of prefills currently running, but we still have to peek at the waiting queue each step.

As for exactly how long the from_queues method takes, I don't know. The results we measured from the serving benchmark suggest that there's no significant overhead, but if you want we could instrument this method with some timing and measure exactly how long it takes.
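
For reference, the kind of lightweight timing instrumentation being discussed could look roughly like this; the decorator, the stats printout, and the PartialPrefillMetadata/from_queues names are assumptions based on the discussion, not the code actually used:

import statistics
import time
from functools import wraps

durations_ms = []

def timed(fn):
    """Record the wall-clock duration of each call, in milliseconds."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            durations_ms.append((time.perf_counter() - start) * 1000)
    return wrapper

# Hypothetical usage: wrap the metadata constructor in the scheduler, e.g.
#   metadata = timed(PartialPrefillMetadata.from_queues)(...)
# and after a benchmark run report something like:
#   print(f"mean={statistics.mean(durations_ms):.3f} ms, "
#         f"p95={statistics.quantiles(durations_ms, n=20)[18]:.3f} ms, "
#         f"max={max(durations_ms):.3f} ms")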

@rickyyx (Contributor) commented Dec 18, 2024

(quoting the question and @joerunde's reply above)

Thanks for the response! Ah, I see the dependency between the number of waiting requests and the running-queue scheduling now.

If possible, I think profiling from_queues should tell us what the overhead is. Based on that, we could better understand how much of the degradation in TPOT could be recovered. I think the improvement in TTFT is itself very impactful already, regardless.

@joerunde (Contributor Author)

Related recent issue that I think would be improved by this: #11286

@noooop (Contributor) commented Dec 19, 2024

maybe related to #10774

@joerunde (Contributor Author)

@rickyyx Here are some numbers from running the serving benchmark and measuring the time of from_queues

Metadata Overhead (mean/median/p95/max): 0.005/0.005/0.010/0.027 ms

This took at most 27 microseconds, so I don't think we need to worry about this overhead. What do you think?

@rickyyx (Contributor) commented Dec 19, 2024

(quoting @joerunde's timing numbers above)

Oh nice, this is rather negligible. Thanks for collecting that. Do you know roughly how many elements are in the queue, or what the load is when serving?

@joerunde (Contributor Author)

@rickyyx It was the "medium case" from the PR description at 12qps, but I didn't grab any logs from the server about the size of the queues while it was running. I can run the other cases at higher qps to verify it doesn't get out of hand.

@joerunde (Contributor Author)

@rickyyx Some more numbers, with the "mixed case" under 24qps load:

INFO 12-19 21:15:37 metrics.py:467] Avg prompt throughput: 6380.4 tokens/s, Avg generation throughput: 532.1 tokens/s, Running: 91 reqs, Swapped: 0 reqs, Pending: 230 reqs, GPU KV cache usage: 98.7%, CPU KV cache usage: 0.0%.
Metadata Overhead (mean/median/p95/max): 0.031/0.025/0.070/0.106 ms

With many requests running and a large waiting queue it maxes out at around 100 microseconds. So it's a little worse, but still sub-millisecond, which should be fine here.

Labels: frontend, ready (ONLY add when PR is ready to merge/full CI is needed)

7 participants