[llama] Store KV Cache on CPU and Use PyTorch `SPDA` for Next token generation #1182

zhentaoyu · 2024-08-02T02:23:57Z

What does this PR do?

Results

python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --max_new_tokens 4096 --bf16 --use_kv_cache --attn_softmax_bf16 --reuse_cache --do_sample --prompt "Tell me somethings about Intel"

with --kv_cache_on_host

```bash
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 2.132539697795915 tokens/second
Number of HPU graphs                = 14
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.77 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 5842.699780527037 seconds
--------------------------------------------------------------------------------------------------------------
```

update 4b0fa1a

Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 12.22449896564133 tokens/second
Number of HPU graphs                = 0
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 1010.5770402610069 seconds
--------------------------------------------------------------------------------------------------------------

without --kv_cache_on_host

Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 31.41817953959749 tokens/second
Number of HPU graphs                = 11
Memory allocated                    = 14.68 GB
Max memory allocated                = 14.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 397.36551256105304 seconds
--------------------------------------------------------------------------------------------------------------
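
The roughly 2 GB gap in allocated HPU memory between the two runs (14.68 GB vs 12.68 GB) is about what a back-of-the-envelope KV cache estimate predicts. The sketch below assumes the published Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) and 2-byte bf16 entries; `kv_cache_bytes` is a hypothetical helper, not part of optimum-habana, and the estimate ignores the prompt tokens and allocator overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold the keys and values for seq_len tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed Llama-2-7B shape; bf16 uses 2 bytes per element.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"{per_token / 2**20:.2f} MiB per token, "
      f"{total / 2**30:.2f} GiB for 4096 tokens")
```

Under these assumptions the cache grows by 0.5 MiB per token, so 4096 generated tokens need about 2 GiB.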

Limitations

It cannot generate correct results with --use_hpu_graphs, because the self-attention forward pass includes host-device memory transfers.

cc @airMeng and @luoyu-intel

Update

Yi-34B-Chat on Gaudi2 with ~11k input + 5k output tokens
command:

python run_generation.py \
--model_name_or_path 01-ai/Yi-34B-Chat \
--use_kv_cache \
--bf16 \
--attn_softmax_bf16 \
--reuse_cache \
--do_sample \
--dataset_name emozilla/pg19-test \
--batch_size 1 \
--max_input_tokens 11200 \
--column_name "text" \
--dataset_max_samples 1 \
--warmup 0 \
--n_iterations 1 \
--max_new_tokens 5000 \
--kv_cache_on_host

without kv_cache_on_host:

 09/18/2024 05:28:11 - INFO - __main__ - Graph compilation...
Traceback (most recent call last):
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 707, in <module>
    main()
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 655, in main
    generate_dataset(batch)
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 633, in generate_dataset
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1299, in generate
    result = self._sample(
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2239, in _sample
    self.htcore_generation.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/utils/internal.py", line 26, in wrapper
    func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/step_closure.py", line 66, in mark_step
    htcore._mark_step(device_str, sync)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER workspace Allocation of size ::28127918336 failed!

with kv_cache_on_host:

Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------
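
As a sanity check on these numbers, throughput multiplied by total runtime should recover roughly the number of generated tokens. A small sketch, with `generated_tokens` as a hypothetical helper and the values copied from the stats above:

```python
def generated_tokens(throughput_tok_per_s, runtime_s):
    """Estimate the generated token count from throughput and total runtime."""
    return throughput_tok_per_s * runtime_s


tokens = generated_tokens(1.2790787964372536, 3909.073683977127)
print(round(tokens))
```

The result is close to 5000, which matches `--max_new_tokens 5000` in the command.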

Larger output token count with kv_cache_on_host:
--max_input_tokens 11200 --max_new_tokens 10000

Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------

airMeng · 2024-08-02T02:25:27Z

@hshen14 @luoyu-intel for awareness

airMeng · 2024-08-07T01:04:24Z

@mandy-li @libinta @dvarshney-habana This is the first system-optimization PR from the Intel Neural Compressor (INC) team. Could you review it?

Experiments with Llama2 on a single Gaudi2 card with a Xeon 8380 host: by offloading the KV cache and SDPA to the CPU, we raise the context limit from 26k (input: 10k + output: 16k) to 310k (input: 10k + output: 300k).

| Config | Context | HPU Memory (GB, steady/peak) | CPU Memory (GB) |
|---|---|---|---|
| KV cache on HPU | 10k+16k | ~90 | N/A |
| KV cache on HPU | 10k+100 | 83.36/84.11 | 4.4 |
| KV cache on HPU | 12k+100 | 91.78/92.72 | 5.03 |
| KV cache on HPU | 12k+10k | 92.06/93.0 | 7.68 |
| KV cache on HPU | 12k+100k | OOM | N/A |
| KV cache on HPU | 10k+100k | 86.22/86.97 | 31 |
| KV cache on HPU | 10k+300k | 91.94/92.70 | 85 |

emascarenhas · 2024-09-03T14:51:04Z

Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

zhentaoyu · 2024-09-04T07:10:01Z

> Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

done.

imangohari1 · 2024-09-10T19:39:44Z

@zhentaoyu
Thanks for the PR and the results in description.
Do I read this correctly that using the KV cache on host degrades throughput and also does not generate correct answers with HPU graphs? If so, what is the use of this option?

This PR also has a merge conflict with main; could you please take a look at the differences?
We need to test this PR with the CI system to make sure it does not break anything and does not impact performance.

zhentaoyu · 2024-09-11T02:24:51Z

> @zhentaoyu Thanks for the PR and the results in description. Do I read this correctly that using the KV cache on host degrades throughput and also does not generate correct answers with HPU graphs? If so, what is the use of this option?
>
> This PR also has a merge conflict with main; could you please take a look at the differences? We need to test this PR with the CI system to make sure it does not break anything and does not impact performance.

Yes. It's an option for long-context inference or generation when a single HPU card runs out of memory. In this PR, I just use `torch.Tensor.to` to transfer the KV-cache-related tensors between the CPU and Gaudi2, and run next-token SDPA on the CPU to save data transfer time. However, it cannot generate the right answer with --use_hpu_graphs. I'm not familiar with Habana Synapse graphs, so please tell me if you have any insights; I'm happy to try to fix it.
OK, I have rebased the PR.
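
To make the transfer-time argument concrete: during next-token generation, running attention on the HPU would mean copying a layer's whole cached K/V across the host-device link, whereas running SDPA on the CPU presumably only moves the new token's query, key and value to the host and the attention output back. A rough sketch, assuming a Llama-2-7B-like layer (hidden size 4096, no grouped-query attention, bf16) and hypothetical helper names; it counts bytes only and ignores link latency and CPU compute time.

```python
BYTES_BF16 = 2


def bytes_moved_cache_to_device(seq_len, hidden_size, batch_size=1):
    """Per-layer bytes if the cached keys and values are copied to the accelerator."""
    return 2 * seq_len * hidden_size * batch_size * BYTES_BF16


def bytes_moved_cpu_sdpa(hidden_size, batch_size=1):
    """Per-layer bytes if only the new token's q, k, v go to the host
    and the attention output comes back."""
    return 4 * hidden_size * batch_size * BYTES_BF16


seq_len, hidden = 10_000, 4096  # assumed context length and hidden size
ratio = bytes_moved_cache_to_device(seq_len, hidden) / bytes_moved_cpu_sdpa(hidden)
print(f"cache-to-device moves {ratio:.0f}x more bytes per layer per step")
```

The ratio grows linearly with context length, which is why the trade-off favors CPU SDPA only for long contexts.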

zhentaoyu · 2024-09-12T08:45:37Z

Hi @imangohari1, I have updated the PR (see the description). Could you please take another look when you have time? Please let me know if you have more comments or need more tests. Thanks a lot.
cc @hshen14

yeonsily · 2024-09-17T22:00:27Z

optimum/habana/transformers/generation/utils.py

+                    else:
+                        unwrap_deepspeed_model(self).allocate_kv_cache(
+                            bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens
+                        )


I would like to suggest changing lines 1096 to 1107 like this:

    if not is_greedy_or_beam_and_bucket:
        cache_device = "hpu"
        if generation_config.kv_cache_on_host and self.config.model_type in ["llama"]:
            print("Allocate KV Cache on CPU...")
            cache_device = "cpu"
        unwrap_deepspeed_model(self).allocate_kv_cache(
            bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens,
            device=cache_device
        )

zhentaoyu · 2024-09-18

Thanks, I have updated it in 74e94ff. However, I cannot remove the else branch, because I only modified modeling_llama.py for this experimental feature.
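
The device choice discussed in this thread can be isolated into a small pure function, which makes the fallback behavior easy to check. This is only a sketch: `select_cache_device` and `SUPPORTED_MODEL_TYPES` are hypothetical names, not part of optimum-habana.

```python
# Assumption for illustration: only Llama supports the host-resident cache,
# since only modeling_llama.py was modified for this feature.
SUPPORTED_MODEL_TYPES = ("llama",)


def select_cache_device(kv_cache_on_host, model_type):
    """Return "cpu" only when host offloading is requested and supported."""
    if kv_cache_on_host and model_type in SUPPORTED_MODEL_TYPES:
        return "cpu"
    return "hpu"


print(select_cache_device(True, "llama"))   # cpu
print(select_cache_device(True, "falcon"))  # hpu
print(select_cache_device(False, "llama"))  # hpu
```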

yeonsily · 2024-09-17T22:03:38Z

@zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM." ?
The README example is Llama 7B and we don't see an advantage for this run. It would be good to add a real example.

zhentaoyu · 2024-09-18T08:37:45Z

> @zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM." ? The README example is llama 7b and we don't see advantage for this run. If we can put a real example it would be good.

Hi @yeonsily, thanks for your comment. Yes, I added a case to the README and updated the results in the PR description.

yeonsily · 2024-09-18T21:10:35Z

optimum/habana/transformers/models/llama/modeling_llama.py

-                else:
-                    with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
+        else:
+            if kv_cache_on_host:


Can you please explain in which case the kv_cache device is switched? I thought line 656 is the case only when line 658.

zhentaoyu · 2024-09-19

In this PR, we store the KV cache on the CPU and run SDPA on the CPU only when generating the next token. The first-token (prefill) stage still runs on the HPU because of its high compute throughput in long-context scenarios (a long prompt in most cases). The full pipeline diagram is shown in the PR description.
So line 658 tells the model to run PyTorch CPU SDPA (flash attention) only when kv_cache_on_host is set, during next-token generation, and at inference. Otherwise, it transfers the KV cache to the HPU if needed for the original operations.
Please let me know if you need more explanation or have suggestions. Thanks.
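
The condition described in this reply can be summarized as a small decision function. This is a sketch with hypothetical names (`use_cpu_sdpa`, `attention_device`, `q_len`), not the code in `modeling_llama.py`; it assumes next-token generation is identified by a query length of 1.

```python
def use_cpu_sdpa(kv_cache_on_host, q_len, training):
    """True when attention should run on the CPU against the host-resident cache."""
    is_next_token = q_len == 1  # prefill processes the whole prompt, so q_len > 1
    return kv_cache_on_host and is_next_token and not training


def attention_device(kv_cache_on_host, q_len, training=False):
    # In every other case the cache is moved to the HPU if needed
    # and the original attention path runs there.
    return "cpu" if use_cpu_sdpa(kv_cache_on_host, q_len, training) else "hpu"


print(attention_device(True, q_len=11200))  # prefill -> hpu
print(attention_device(True, q_len=1))      # next token -> cpu
print(attention_device(False, q_len=1))     # flag off -> hpu
```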

airMeng · 2024-10-29T05:06:44Z

> @zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM." ? The README example is llama 7b and we don't see advantage for this run. If we can put a real example it would be good.

@yeonsily A similar feature is already available in TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/kv_cache_reuse.html#offloading-to-host-memory

yeonsily · 2024-11-22T00:41:54Z

@zhentaoyu Can you please rebase your change? We can merge this change after that.

zhentaoyu · 2024-11-22T05:08:43Z

> @zhentaoyu Can you please rebase your change? We can merge this change after that.

Yes, rebased. Thanks a lot.

yeonsily

Please resolve conflicts also.

airMeng · 2024-11-24T14:41:07Z

@yeonsily Anything else needed before merging?

yeonsily · 2024-11-25T17:02:32Z

@airMeng Can you please run all of the Llama CI tests to make sure this doesn't impact the current numbers? You can find the Llama CI cases in the https://github.com/huggingface/optimum-habana/tree/main/tests folder. Thanks.

airMeng · 2024-11-26T02:18:12Z

Hi @yeonsily, can't the CI/CD configured on GitHub be triggered?

regisss · 2024-11-26T11:44:44Z

> Hi @yeonsily, can't the CI/CD configured on GitHub be triggered?

I can trigger it but the PR CI won't test what @yeonsily is suggesting, you'll have to run these tests manually.

zhentaoyu · 2024-12-03T09:03:20Z

Hi @yeonsily, I have run the Llama model test cases in tests/test_text_generation_example.py with this PR. The test configuration is below:

if os.environ.get("GAUDI2_CI", "0") == "1":
    # Gaudi2 CI baselines
    MODELS_TO_TEST = {
        "bf16_1x": [
            ("meta-llama/Llama-2-7b-hf", 1, True, 141.25776956002076, True),
            ("meta-llama/Meta-Llama-3-8B", 1, True, 129, False),
            ("meta-llama/Llama-2-7b-hf", 512, True, 12808, False),
            ("meta-llama/Llama-2-7b-hf", 512, False, 8711, False),  # in some cases like TGI, reuse_cache isnt used
        ],
        "fp8": [
            ("meta-llama/Llama-2-7b-hf", 1, 1230, False, 128, 128, 13152.7),
            ("meta-llama/Llama-2-7b-hf", 1, 163, False, 128, 2048, 4774.7),
            ("meta-llama/Llama-2-7b-hf", 1, 94, False, 2048, 128, 1293.3),
            ("meta-llama/Llama-2-7b-hf", 1, 81, False, 2048, 2048, 1942.9),
        ],
        "load_quantized_model_with_autogptq": [
            ("TheBloke/Llama-2-7b-Chat-GPTQ", 1, 10, False, 128, 2048, 456.7),
        ],
        "torch_compile": [
            ("meta-llama/Llama-2-7b-hf", 102.27823420713148),
        ],
        "torch_compile_distributed": [
            ("meta-llama/Llama-2-7b-hf", 39.72973199515235),
        ],
        "distributed_tp": [
            ("meta-llama/Llama-2-7b-hf", 1345.2369318328463),
        ],
    }

The command is `GAUDI2_CI=1 RUN_SLOW=true python test_text_generation_example.py 2>&1 | tee pytest_log.txt`.
My local driver version is 1.18.0-ee698fb and the Docker image is vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest.
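
Each entry in `MODELS_TO_TEST` includes a baseline throughput that the measured value is compared against. A minimal sketch of such a regression check, assuming a 5% tolerance; `passes_baseline` and the tolerance value are illustrative assumptions, not the logic in `test_text_generation_example.py`.

```python
def passes_baseline(measured, baseline, tolerance=0.05):
    """True if measured throughput is at most `tolerance` below the baseline."""
    return measured >= (1.0 - tolerance) * baseline


baseline = 141.25776956002076  # Llama-2-7b-hf bf16_1x entry from the list above
print(passes_baseline(140.0, baseline))  # within 5% of the baseline
print(passes_baseline(120.0, baseline))  # more than 5% below the baseline
```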

results:

the whole test log is here:
pytest_log.txt

HuggingFaceDocBuilderDev · 2024-12-03T22:17:08Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss · 2024-12-03T22:18:45Z

@zhentaoyu CI didn't pass, can you take a look at this error? https://github.com/huggingface/optimum-habana/actions/runs/12149024129/job/33878873264?pr=1182#step:4:370

yeonsily · 2024-12-03T23:11:57Z

@zhentaoyu You will need this change to fix the CI failure.

diff --git a/optimum/habana/transformers/models/llama/modeling_llama.py b/optimum/habana/transformers/models/llama/modeling_llama.py
index 075edc8..d8866d7 100755
--- a/optimum/habana/transformers/models/llama/modeling_llama.py
+++ b/optimum/habana/transformers/models/llama/modeling_llama.py
@@ -443,7 +443,8 @@ class KVCache(torch.nn.Module):
     @staticmethod
     def update(prev, cur, dim, idx, inp_seq_len):
         cur = cur.to(prev.device)
-        idx = idx.to(prev.device)
+        if idx is not None:
+            idx = idx.to(prev.device)
         orig_cur = cur
         if prev.shape == cur.shape:
             prev.copy_(cur)

Meanwhile, I think you should also run the Llama training cases to check that your change doesn't affect the perf numbers. It seems you ran only the inference cases.
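
The `idx is not None` guard in the diff above matters because `idx` can be `None`, and calling `.to` on `None` raises `AttributeError`. A dependency-free sketch of the pattern, using a stand-in `FakeTensor` class (hypothetical, in place of `torch.Tensor`) and a hypothetical helper `move_index`:

```python
class FakeTensor:
    """Minimal stand-in for a tensor that only tracks its device."""

    def __init__(self, device):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


def move_index(idx, device):
    """Move idx to device, passing None through unchanged."""
    if idx is not None:
        idx = idx.to(device)
    return idx


print(move_index(FakeTensor("hpu"), "cpu").device)  # cpu
print(move_index(None, "cpu"))                      # None
```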

zhentaoyu · 2024-12-04T03:13:31Z

> @zhentaoyu You will need this change to fix the CI failure.
>
> diff --git a/optimum/habana/transformers/models/llama/modeling_llama.py b/optimum/habana/transformers/models/llama/modeling_llama.py
> index 075edc8..d8866d7 100755
> --- a/optimum/habana/transformers/models/llama/modeling_llama.py
> +++ b/optimum/habana/transformers/models/llama/modeling_llama.py
> @@ -443,7 +443,8 @@ class KVCache(torch.nn.Module):
>      @staticmethod
>      def update(prev, cur, dim, idx, inp_seq_len):
>          cur = cur.to(prev.device)
> -        idx = idx.to(prev.device)
> +        if idx is not None:
> +            idx = idx.to(prev.device)
>          orig_cur = cur
>          if prev.shape == cur.shape:
>              prev.copy_(cur)
>
> Meanwhile, I think you should also try llama training cases if your change doesn't affect the perf number. It seems you ran only inference case.

Fixed, thanks.

As for the training tests, I ran test_multiple_peft_adapters from test_trainer.py locally, since it uses the Llama model. Here is the result:

regisss · 2024-12-04T08:38:19Z

@zhentaoyu Can you also run the Llama training regression tests with

GAUDI2_CI=1 RUN_SLOW=1 pytest tests/test_examples.py -v -s -k "llama"

please?

optimum/habana/transformers/generation/utils.py

zhentaoyu · 2024-12-04T08:55:24Z

> @zhentaoyu Can you also run the Llama training regression tests with
> GAUDI2_CI=1 RUN_SLOW=1 pytest tests/test_examples.py -v -s -k "llama"
> please?

OK. I will update here when I get the result.

zhentaoyu force-pushed the cpu_sdpa branch from 554b8ac to d5c06c7 Compare August 6, 2024 07:52

zhentaoyu marked this pull request as ready for review August 8, 2024 01:20

zhentaoyu requested review from ssarkar2, bhargaveede, vivekgoe, mandy-li and libinta as code owners August 8, 2024 01:20

zhentaoyu requested a review from a user August 8, 2024 01:20

zhentaoyu requested a review from regisss as a code owner August 8, 2024 01:20

zhentaoyu force-pushed the cpu_sdpa branch from d5c06c7 to 928ab58 Compare August 9, 2024 02:13

zhentaoyu force-pushed the cpu_sdpa branch from 928ab58 to 7ca2c8f Compare September 4, 2024 02:18

zhentaoyu force-pushed the cpu_sdpa branch from 7ca2c8f to ff3c54f Compare September 11, 2024 02:15

zhentaoyu force-pushed the cpu_sdpa branch from ff3c54f to 4b0fa1a Compare September 12, 2024 06:13

yeonsily reviewed Sep 17, 2024

zhentaoyu force-pushed the cpu_sdpa branch from 4b0fa1a to 74e94ff Compare September 18, 2024 08:17

yeonsily reviewed Sep 18, 2024

zhentaoyu force-pushed the cpu_sdpa branch from 74e94ff to c5c8723 Compare November 22, 2024 03:28

yeonsily approved these changes Nov 22, 2024

zhentaoyu force-pushed the cpu_sdpa branch from c5c8723 to 314bb37 Compare December 3, 2024 08:53

libinta added the run-test Run CI for PRs from external contributors label Dec 3, 2024

zhentaoyu and others added 5 commits December 4, 2024 01:57

cpu_kv and cpu_sdpa on llama

1551c71

Signed-off-by: Yu Zhentao <[email protected]>

refact code and add README

34bf592

Signed-off-by: Yu Zhentao <[email protected]>

add long-context example in README

c021bd5

Signed-off-by: Yu Zhentao <[email protected]>

fix rebase

be7723b

Signed-off-by: Yu Zhentao <[email protected]>

Typos and minor improvements

1a70b12

zhentaoyu force-pushed the cpu_sdpa branch from f43c38e to 6f12c92 Compare December 4, 2024 02:44

fix rebase and CI

6f12c92

Signed-off-by: Yu Zhentao <[email protected]>

regisss reviewed Dec 4, 2024


remove logging.set_verbosity_info()

0f3ece0

Signed-off-by: Yu Zhentao <[email protected]>
