Which of the open-source models supports a 100K context? #178

Which of the open-source models supports a 100K context? Looking at the latest v6 release, the config seems to show only 8k.

Comments
Both v6 models support 100k; the value in the config is just a placeholder. In our tests, 8x 40G cards handled a total length of 100k with no problems; with 8x 80G cards, 200k is possible. You can test with the command below, adjusting max_input_length/max_generate_length to your actual hardware (the two values sum to the 100k total):

```
export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-70b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 37888 --max_generate_length 62112
```
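As a hypothetical illustration of that adjustment (the helper below is not part of the TigerBot repo), the two flags simply split one total context budget:

```python
# Hypothetical helper (not from the TigerBot repo): split a total context
# budget between --max_input_length and --max_generate_length.
def split_budget(total: int, generate_fraction: float = 0.62) -> tuple[int, int]:
    """Return (max_input_length, max_generate_length) summing to `total`."""
    max_generate = int(total * generate_fraction)
    return total - max_generate, max_generate

# 100k is the budget the maintainer tested on 8x 40G cards; the command
# above uses the nearby split 37888/62112.
print(split_budget(100_000))   # -> (38000, 62000)
print(split_budget(200_000))   # -> (76000, 124000), for 8x 80G cards
```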
I only have nine 3090 cards here. Is there a quantized build of v6? And does the quantized build also support 100K?
The quantized build of 70b chat v6 is here: https://huggingface.co/TigerResearch/tigerbot-70b-chat-v6-4bit-exl2
Thanks. The 4-bit quantized model can also run inference over a 100K context, right?
Yes, it can.
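A minimal sketch of loading that 4-bit exl2 checkpoint with an extended context window, based on exllamav2's documented Python API; the model path and the 100k sequence length come from this thread, and exact call names may vary between exllamav2 versions:

```python
# Sketch: load the exl2 checkpoint with an extended context window and
# split it across all visible GPUs. Verify against your exllamav2 version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "tigerbot-70b-chat-v6-4bit-exl2"  # local download of the HF repo
config.prepare()
config.max_seq_len = 100_000  # total budget (input + generation) discussed above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated while the model loads
model.load_autosplit(cache)               # spread layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
print(generator.generate_simple("Hello", settings, num_tokens=128))
```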
Which inference framework can serve an exllama2-quantized model behind an API?
Starting the quantized model with the parameters below, the following error is reported during long-text inference:
Starting the 13B model with the parameters below, the following error is reported during long-text inference: Namespace(model_path='/data/model/tigerbot-13b-chat-v6', rope_scaling='yarn', rope_factor=8.0, max_input_length=10240, max_generate_length=10240)