
Which of the open-source models is the 100K-context version? #178

Open
xiaoyuer2019 opened this issue Apr 11, 2024 · 8 comments

Comments

@xiaoyuer2019

Which of the open-source models is the 100K-context version? As far as I can tell, the latest v6 release only supports 8K, doesn't it?

@chentigerye
Contributor

Both v6 models support 100K; the values in the config are only placeholders. In our tests, 8x 40GB GPUs handle a total length of 100K without problems, and with 8x 80GB you can go up to 200K. You can test it with the command below.

Adjust max_input_length/max_generate_length to match your actual hardware (the two values below sum to the 100K total):

export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-70b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 37888 --max_generate_length 62112
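
As a quick arithmetic note on the flags above (my own check, not from the maintainers): the two length flags partition the total context budget, so for a 100K window they must sum to 100,000.

# Sanity check for the flag values in the command above
max_input_length = 37888
max_generate_length = 62112
assert max_input_length + max_generate_length == 100_000  # the full 100K budget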

@xiaoyuer2019
Author

I only have nine 3090 GPUs here. Is there a quantized build of v6? And can the quantized version still support 100K?

@Vivicai1005
Contributor

The quantized 70B chat v6 is here: https://huggingface.co/TigerResearch/tigerbot-70b-chat-v6-4bit-exl2
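
For reference, a minimal loading sketch using the exllamav2 library; the local path and the max_seq_len value are assumptions to adjust, not values confirmed in this thread:

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/data/model/tigerbot-70b-chat-v6-4bit-exl2"  # assumed local path
config.prepare()
config.max_seq_len = 100_000  # must cover prompt + generation (see the cache error later in this thread)

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # KV cache is sized from config.max_seq_len
model.load_autosplit(cache)               # split layers across the visible GPUs
tokenizer = ExLlamaV2Tokenizer(config)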

@xiaoyuer2019
Author

Thanks. The 4-bit quantized model can also handle 100K-context inference, right?

@chentigerye
Contributor

Yes, it can.

@xiaoyuer2019
Author

Which inference framework can serve an exllama2-quantized model behind an API?

@xiaoyuer2019
Author

I start the quantized model with the following parameters:
export PYTHONPATH='./' ; CUDA_VISIBLE_DEVICES=1,2,3,4 streamlit run apps/exllamav2_web_demo.py -- --model_path /data/model/tigerbot-70b-chat-v6-4bit-exl2/tigerbot --max_input_length 37888 --max_generate_length 62112

During long-context inference it throws the following error:
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
Both max_new_tokens (=62112) and max_length(=100000) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Exception in thread Thread-10 (eval_generate):
Traceback (most recent call last):
File "/data/anaconda3/envs/exl2/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/anaconda3/envs/exl2/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/tigerbot/other_infer/exllamav2_hf_infer.py", line 214, in eval_generate
model.generate(**args)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
result = self._greedy_search(
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
outputs = self(
File "/data/tigerbot/other_infer/exllamav2_hf_infer.py", line 112, in call
self.ex_model.forward(
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/exl2/lib/python3.10/site-packages/exllamav2/model.py", line 662, in forward
assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
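
The assertion says the prompt plus the generated tokens exceeded the KV-cache size. A possible fix sketch, continuing the exllamav2 loading example above (my assumption, not a confirmed resolution): size the cache to the full max_input_length + max_generate_length budget before loading.

# Hypothetical fix: make the cache cover the full context budget before loading
config.max_seq_len = 37888 + 62112  # prompt + generation = 100,000 tokens
cache = ExLlamaV2Cache(model, max_seq_len=config.max_seq_len, lazy=True)
model.load_autosplit(cache)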

@xiaoyuer2019
Author

I start the 13B model with the following parameters:
export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8 ; streamlit run apps/web_demo.py -- --model_path /data/model/tigerbot-13b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 10240 --max_generate_length 10240

During long-context inference it throws the following error:

Namespace(model_path='/data/model/tigerbot-13b-chat-v6', rope_scaling='yarn', rope_factor=8.0, max_input_length=10240, max_generate_length=10240)
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
{'input_ids': tensor([[65107, 9134, 421, 3317, 29918, 1482, ..., 4738, 362, 29897, 13, 65108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1]], device='cuda:0')}
Both max_new_tokens (=10240) and max_length(=20480) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [82,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
(the same assertion repeats for threads [65,0,0] through [95,0,0])
Exception in thread Thread-9 (generate):
Traceback (most recent call last):
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/generation/utils.py", line 1479, in generate
return self.greedy_search(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/generation/utils.py", line 2340, in greedy_search
outputs = self(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
outputs = self.model(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1070, in forward
layer_outputs = decoder_layer(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 798, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 706, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
File "/data/anaconda3/envs/tigerbotdemo/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 208, in rotate_half
return torch.cat((-x2, x1), dim=-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
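
The device-side assert fires in rotate_half/apply_rotary_pos_emb, which usually means a position index ran past the rotary-embedding table, i.e. the effective window is smaller than the requested 20,480 tokens; as the log itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 gives a more accurate trace. A first check worth trying (my suggestion, not from the maintainers) is to inspect what the loaded config actually advertises:

from transformers import AutoConfig

# Sanity-check sketch: confirm the context window the model config exposes
cfg = AutoConfig.from_pretrained("/data/model/tigerbot-13b-chat-v6")
print(cfg.max_position_embeddings)  # must cover max_input_length + max_generate_length
print(cfg.rope_scaling)             # should reflect the yarn / rope_factor=8 settings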
