Qwen model issues & embedding and loss has nan #52

Open
lylcst opened this issue Nov 3, 2023 · 5 comments
Comments

@lylcst

lylcst commented Nov 3, 2023

After one loss.backward() and optimizer.step(), the next forward pass produces inf hidden states from the embedding layer and the loss becomes nan.
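For anyone debugging this, here is a minimal sketch (assuming a standard PyTorch training loop; `model` is whatever nn.Module you train, and `register_nan_checks` is a hypothetical helper name) that raises on the first module whose activations go non-finite, so you can tell whether the embedding weights themselves or a later layer diverges first. Gradient clipping and training in bf16 rather than fp16 are the commonly reported mitigations for Qwen-family models.

```python
import torch

def register_nan_checks(model: torch.nn.Module) -> None:
    """Raise on the first module whose forward output is non-finite."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite activation in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usual mitigation in the training loop, before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```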

@LMXKO

LMXKO commented Nov 28, 2023

+1

@kar9999

kar9999 commented Dec 14, 2023

Is this happening in the SFT stage or the DPO stage? Under the author's framework, the loss also goes nan for me when fine-tuning chatglm3 with SFT.

@lylcst
Author

lylcst commented Dec 14, 2023

Is this happening in the SFT stage or the DPO stage? Under the author's framework, the loss also goes nan for me when fine-tuning chatglm3 with SFT.

dpo
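For reference, one plausible failure mode in the DPO stage is fp16 overflow in the log-ratio term: the sigmoid saturates, log(0) gives -inf, and the nan propagates through the backward pass. A minimal sketch of the DPO loss upcast to float32 (not this repo's exact implementation; the logp arguments are hypothetical per-sequence sums of token log-probabilities):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss computed in float32 to reduce fp16 overflow risk."""
    logits = beta * (
        (policy_chosen_logps.float() - ref_chosen_logps.float())
        - (policy_rejected_logps.float() - ref_rejected_logps.float())
    )
    # logsigmoid keeps intermediates finite where a naive
    # log(sigmoid(x)) chain can underflow to log(0) = -inf.
    return -F.logsigmoid(logits).mean()
```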

@akshayraghavan21

Hi, any update on this? Were you able to fix this issue?

@John-Watson123

I've got this problem too when using the model Qwen2.5-7B.
Python Output:
Computing eval metrics: 100%|██████████| 16/16 [00:20<00:00, 1.26s/it]
Generating samples...: 0%| | 0/1 [00:00<?, ?it/s]
Both max_new_tokens (=2048) and max_length (=512) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Error executing job with overrides: []
Traceback (most recent call last):
  File "/direct-preference-optimization/train.py", line 114, in main
    worker_main(0, 1, config, policy, reference_model)
  File "/direct-preference-optimization/train.py", line 44, in worker_main
    trainer.train()
  File "/direct-preference-optimization/trainers.py", line 320, in train
    policy_samples, reference_samples = self.get_batch_samples(local_eval_batch)
  File "/direct-preference-optimization/trainers.py", line 188, in get_batch_samples
    policy_output = self.policy.generate(
  File "/anaconda3/envs/dpo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/anaconda3/envs/dpo/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "~/anaconda3/envs/dpo/lib/python3.10/site-packages/transformers/generation/utils.py", line 3020, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
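The multinomial crash is most likely the training nan surfacing at sampling time: once the policy's weights contain non-finite values, the logits do too. As a stopgap, transformers' generate() accepts remove_invalid_values=True, which wires in InfNanRemoveLogitsProcessor to replace inf/nan logits before sampling; note this masks rather than fixes the divergence. A minimal sketch with a placeholder prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # the checkpoint reported above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("Hello", return_tensors="pt")
# remove_invalid_values=True strips inf/nan from the logits before sampling,
# so torch.multinomial no longer sees a non-finite probability tensor.
output = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=32,
    remove_invalid_values=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```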
