
LightingModule optimizer issue - Thermostability fine-tuning #71

Open

dc2211 opened this issue Nov 13, 2024 · 5 comments


dc2211 commented Nov 13, 2024

Hi all, I have been trying to run the example to fine-tune the 650M model with the provided thermostability data. Unfortunately, I’m getting the following error:

[rank 0]: TypeError: LightingModule.optimizer_step() takes from 4 to 5 positional arguments but 9 were given

I’m using exactly the same script provided, changing only the number of GPUs from 4 to 1 and CUDA_VISIBLE_DEVICES to 0.

Any help is greatly appreciated. Thank you


LTEnjoy commented Nov 13, 2024

Hi,

I think it's due to an incompatibility with the version of pytorch-lightning. Could you downgrade your pytorch-lightning to 1.8.3?
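
For reference, pinning to that version in a pip-managed environment would look something like this (a minimal sketch; adapt to however the environment was created):

```bash
# Pin pytorch-lightning to the version suggested above
pip install pytorch-lightning==1.8.3
```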


dc2211 commented Nov 13, 2024

Thanks, that solved the previous issue, and training started just fine.

Also, is there any way to automatically select the option to not visualize the results (option 3) from the interactive prompt? I’m submitting my job through Slurm, and I assume the error I’m getting is because of that:

wandb.errors.UsageError: api_key not configured (no-tty)

I also tried setting WANDB_MODE: dryrun in the config file, as well as wandb disabled, but it did not work.

Moreover, by setting logger: False I got the error No supported gpu backend found!

Thanks
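
For a batch job with no TTY, wandb can also be silenced at the shell level before the script starts, rather than through the config file. A minimal sketch, assuming the standard wandb environment variables are respected (whether this also skips the repository's interactive prompt is untested):

```bash
# Disable wandb entirely for this job ("offline" would log locally
# without needing an API key)
export WANDB_MODE=disabled

# Alternatively, once a wandb account exists, supply the key non-interactively:
# export WANDB_API_KEY=<your-api-key>

python scripts/training.py
```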


LTEnjoy commented Nov 14, 2024

If you don't want to record your training, then setting logger to False should work. The error No supported gpu backend found! seems to be caused by your hardware configuration. How did you run it normally before, given that you said "that solved the previous issue, and training started just fine"?


dc2211 commented Nov 14, 2024

To run it normally, I requested the necessary resources through srun in the terminal, then ran scripts/training.py, interactively chose option 3, and everything started just fine.

The issue appeared when I submitted the training through sbatch. Currently it is not giving any error, but it is stuck at:

All distributed processes registered. Starting with 1 processes

I decided to create an account on wandb, but the problem with sbatch persists, and the job is still stuck.


LTEnjoy commented Nov 14, 2024

The problem is more likely due to the sbatch command than to wandb. I'm not familiar with Slurm. Perhaps you could check whether sbatch does some additional operations that conflict with the Python script?
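
In case it helps with that comparison, a stripped-down sbatch script for a single-GPU run generally looks like the sketch below. This is a generic SLURM baseline rather than anything taken from this project's documentation, and the directive values are placeholders to adapt to your cluster and config:

```bash
#!/bin/bash
#SBATCH --job-name=finetune_650m       # placeholder name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1            # Lightning generally expects one task per GPU under SLURM
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00

# Silence wandb for the non-interactive job (see the note above)
export WANDB_MODE=disabled

# Launch through srun so the task inherits the SLURM-provided environment
srun python scripts/training.py
```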
