"llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' not found" after computing AWQ scales and applying them to the gguf model
#655
Open
Autism-al opened this issue Nov 25, 2024 · 1 comment
When I quantized the Qwen2.5-1.5B-Instruct model following the "GGUF Export" section of examples.md in the docs, the run reported that quantization was complete and I obtained the GGUF model. But when I load that model through llama-cpp-python, the following error is reported:
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' not found
After multiple attempts, I have confirmed that a model quantized directly with llama.cpp, without AutoAWQ, is fine and loads correctly in llama.cpp. The original Qwen2.5-1.5B-Instruct model also runs inference normally with transformers.
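For reference, the load that triggers the error is just the minimal llama-cpp-python call reconstructed from the traceback below; nothing else is passed:

from llama_cpp import Llama

# Minimal load of the converted GGUF file (same call as in the traceback below)
llm = Llama(model_path="Qwen2.5-1.5B-Instruct-awq/slu-int4.gguf")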
This is my quantization code:
import os
import subprocess
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'Qwen2.5-1.5B-Instruct'
quant_path = 'Qwen2.5-1.5B-Instruct-awq'
llama_cpp_path = 'llama.cpp'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
# NOTE: We avoid packing weights, so you cannot use this model in AutoAWQ
# after quantizing. The saved model is FP16 but has the AWQ scales applied.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    export_compatible=True
)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')

# GGUF conversion
print('Converting model to GGUF...')
llama_cpp_method = "q4_K_M"
convert_cmd_path = os.path.join(llama_cpp_path, "convert_hf_to_gguf.py")
quantize_cmd_path = os.path.join(llama_cpp_path, "llama-quantize")

if not os.path.exists(llama_cpp_path):
    cmd = f"git clone https://github.com/ggerganov/llama.cpp.git {llama_cpp_path} && cd {llama_cpp_path} && make LLAMA_CUBLAS=1 LLAMA_CUDA_F16=1"
    subprocess.run([cmd], shell=True, check=True)

subprocess.run([
    f"python {convert_cmd_path} {quant_path} --outfile {quant_path}/slu-int4.gguf"
], shell=True, check=True)

subprocess.run([
    f"{quantize_cmd_path} {quant_path}/slu-int4.gguf {quant_path}/slu_{llama_cpp_method}.gguf {llama_cpp_method}"
], shell=True, check=True)
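To rule out the saved checkpoint itself, a minimal sanity check is to list the tensor names in the AWQ-scaled FP16 export before converting. This is only a sketch; the file name assumes an unsharded model.safetensors save, which may not hold:

from safetensors import safe_open

# List the tensor names stored in the exported FP16 checkpoint
# (path is an assumption; adjust if the save is sharded)
with safe_open("Qwen2.5-1.5B-Instruct-awq/model.safetensors", framework="pt") as f:
    keys = list(f.keys())
print("model.embed_tokens.weight" in keys)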
Interestingly, I had been using this same workflow (AWQ scales + convert to Q4_K GGUF) before, and the resulting models always loaded correctly in llama.cpp. The problem only started recently, after I changed my conda environment. Could it be caused by a change in package versions? This has been troubling me for several days, and I cannot restore the original conda environment.
I ran into the same problem with Qwen2-1.5B and Qwen2.5-3B: in both cases the saved checkpoint contains the "embed_tokens" weight, but the tensor cannot be found after exporting to GGUF.
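A quick way to confirm on which side the tensor goes missing is to dump the tensor names from the converted file with the gguf Python package that ships with llama.cpp (a sketch, assuming the gguf-py reader is installed):

from gguf import GGUFReader

# Dump all tensor names from the converted GGUF file
reader = GGUFReader("Qwen2.5-1.5B-Instruct-awq/slu-int4.gguf")
names = [t.name for t in reader.tensors]
print(len(names))                    # the loader below reports 338 tensors
print("token_embd.weight" in names)  # the tensor llama.cpp fails to find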
The complete error log is as follows:
llama_model_loader: loaded meta data with 26 key-value pairs and 338 tensors from Qwen2.5-1.5B-Instruct-awq/slu-int4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 1.5B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 1.5B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 1
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type f16: 197 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8960
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.54 B
llm_load_print_meta: model size = 2.88 GiB (16.00 BPW)
llm_load_print_meta: general.name = Qwen2.5 1.5B Instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' not found
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
File "/home/lmf/llm/Qwen2.5-finetuning/llama-cpp-instruct.py", line 2, in
llm = Llama(model_path="Qwen2.5-1.5B-Instruct-awq/slu-int4.gguf")
File "/home/lmf/anaconda3/envs/qwenslu/lib/python3.10/site-packages/llama_cpp/llama.py", line 371, in init
_LlamaModel(
File "/home/lmf/anaconda3/envs/qwenslu/lib/python3.10/site-packages/llama_cpp/_internals.py", line 55, in init
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: Qwen2.5-1.5B-Instruct-awq/slu-int4.gguf
ERROR conda.cli.main_run:execute(124): `conda run python /home/lmf/llm/Qwen2.5-finetuning/llama-cpp-instruct.py` failed. (See above for error)