Error while training LLM judge #66

Open
WJ44 opened this issue Jul 25, 2024 · 4 comments

WJ44 commented Jul 25, 2024

When attempting to train an LLM judge I get the following error.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 16
      3 classifier_config = {
      4     "training_dataset": ["nq_synthetic_queries.tsv"],
      5     "validation_set": ["datasets/example_files/nq_labeled_output.tsv"],
   (...)
     12     "model_choice": "microsoft/deberta-v3-xsmall",
     13 }
     15 ares = ARES(classifier_model=classifier_config)
---> 16 results = ares.train_classifier()
     17 print(results)

File ~/projects/ARES/ares/ares.py:134, in ARES.train_classifier(self)
    132     print("Skipping binary classifier configuration due to missing parameters.")
    133 else:
--> 134     binary_classifer_config(**self.classifier_model_config)

File ~/projects/ARES/ares/binary_classifier.py:164, in binary_classifer_config(training_dataset, validation_set, label_column, num_epochs, patience_value, learning_rate, training_dataset_path, validation_dataset_path, model_choice, validation_set_scoring, assigned_batch_size, gradient_accumulation_multiplier, number_of_runs, num_warmup_steps, training_row_limit, validation_row_limit)
    147 tokenized_datasets = initalize_dataset_for_tokenization(tokenizer, training_dataset_arrow, validation_dataset_arrow, test_dataset_arrow)
    149 train_and_eval_settings = {
    150     "number_of_runs": number_of_runs,
    151     "tokenized_datasets": tokenized_datasets,
   (...)
    161     "gradient_accumulation_multiplier": gradient_accumulation_multiplier
    162 }
--> 164 model, avg_train_losses, avg_valid_losses, eval_dataloader, inference_times = train_and_evaluate_model(train_and_eval_settings)
    166 total_predictions, total_references, metric = evaluate_model(model, model_choice, checkpoint_path, device, eval_dataloader, inference_times)
    168 print_and_save_model(total_predictions, total_references, checkpoint_path, metric)

File ~/projects/ARES/ares/LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py:740, in train_and_evaluate_model(params)
    738 outputs = model(**new_batch)
    739 loss = criterion(outputs, batch['labels'].to(device))
--> 740 loss.backward()
    742 # Gradient accumulation
    743 gradient_accumulation_count += 1

File ~/miniconda3/envs/ares/lib/python3.11/site-packages/torch/_tensor.py:522, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    512 if has_torch_function_unary(self):
    513     return handle_torch_function(
    514         Tensor.backward,
    515         (self,),
   (...)
    520         inputs=inputs,
    521     )
--> 522 torch.autograd.backward(
    523     self, gradient, retain_graph, create_graph, inputs=inputs
    524 )

File ~/miniconda3/envs/ares/lib/python3.11/site-packages/torch/autograd/__init__.py:266, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    261     retain_graph = create_graph
    263 # The reason we repeat the same comment below is that
    264 # some Python versions print out the first line of a multi-line function
    265 # calls in the traceback and some print out the last line
--> 266 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    267     tensors,
    268     grad_tensors_,
    269     retain_graph,
    270     create_graph,
    271     inputs,
    272     allow_unreachable=True,
    273     accumulate_grad=True,
    274 )

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am using the xsmall model to speed up testing and made the necessary change to the embedding size in the CustomBERTModel class. The same error occurs with the (default) large model. I am also using a shortened synthetic queries file, but the same error occurs with the example file provided.

I am somewhat at a loss, since I am sure it was working earlier.
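
A note for anyone debugging a similar report: CUDA kernels are launched asynchronously, so an error such as `CUBLAS_STATUS_EXECUTION_FAILED` can be raised at a later call than the one that actually failed. The following is a minimal sketch, not part of ARES, of forcing synchronous launches so the traceback lands closer to the failing operation. The helper name `enable_synchronous_cuda_errors` is hypothetical; `CUDA_LAUNCH_BLOCKING` is the standard CUDA environment variable, and it has to be set before CUDA is initialized.

```python
import os


def enable_synchronous_cuda_errors(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    Hypothetical helper for illustration. With this variable set, CUDA
    kernel launches block until they finish, so an error is reported at
    the call that caused it rather than at some later call.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    # Call this first, then import torch and the training code afterwards,
    # so the setting is in place before CUDA is initialized.
    enable_synchronous_cuda_errors()
    print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Running the same script on the CPU instead of the GPU is another way to get a more direct error message, at the cost of speed.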

WJ44 commented Jul 25, 2024

This also happens when running outside a notebook.

/opt/conda/conda-bld/pytorch_1708025845868/work/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "/home/wesley/projects/ARES/reproduce.py", line 16, in <module>
    results = ares.train_classifier()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wesley/projects/ARES/ares/ares.py", line 139, in train_classifier
    binary_classifer_config(**self.classifier_model_config)
  File "/home/wesley/projects/ARES/ares/binary_classifier.py", line 164, in binary_classifer_config
    model, avg_train_losses, avg_valid_losses, eval_dataloader, inference_times = train_and_evaluate_model(train_and_eval_settings)
                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wesley/projects/ARES/ares/LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py", line 800, in train_and_evaluate_model
    loss.backward()
  File "/home/wesley/miniconda3/envs/ares/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/wesley/miniconda3/envs/ares/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
  0%|         
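
The first line of this output is worth a closer look: the assertion `t >= 0 && t < n_classes` in `nll_loss_forward_reduce_cuda_kernel_2d` fails when a target label passed to the loss is not a valid class index. One way to check for that before training is to scan the label column of the input file. The sketch below is an assumption-laden illustration, not ARES code: the function name `find_out_of_range_labels`, the column name `label`, and the sample rows are all hypothetical, and it assumes labels are stored as integers in a tab-separated file.

```python
import csv
import io


def find_out_of_range_labels(tsv_file, label_column, n_classes):
    """Return (row_number, raw_value) pairs for labels outside [0, n_classes).

    Hypothetical helper. Empty, non-numeric, non-integer, and NaN values
    are reported as well, since none of them is a valid class index.
    """
    bad_rows = []
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        raw = (row.get(label_column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            bad_rows.append((row_number, raw))
            continue
        if not value.is_integer() or not 0 <= value < n_classes:
            bad_rows.append((row_number, raw))
    return bad_rows


if __name__ == "__main__":
    # Made-up sample data: one negative label, one empty label, one label
    # that is too large for a two-class problem.
    sample = io.StringIO("Query\tlabel\nq1\t1\nq2\t0\nq3\t-1\nq4\t\nq5\t2\n")
    for row_number, raw in find_out_of_range_labels(sample, "label", n_classes=2):
        print(f"row {row_number}: invalid label {raw!r}")
```

If this reports any rows for the training or validation file, the labels (or the number of output classes in the model) would be the first thing to examine.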

WJ44 commented Aug 14, 2024

I have tried it on different hardware and operating systems and run into the same problem everywhere.

WJ44 commented Aug 20, 2024

This happens even with a clean install in a clean VM when running the example code for training a classifier.

WJ44 commented Aug 29, 2024

Solved by #71
