
LightingModule optimizer issue - Thermostability fine-tuning #71

Open

dc2211 opened this issue Nov 13, 2024 · 5 comments


dc2211 commented Nov 13, 2024

Hi all, I have been trying to run the example to fine-tune the 650M model with the provided thermostability data. Unfortunately, I’m getting the following error:

[rank 0]: TypeError: LightingModule.optimizer_step() takes from 4 to 5 positional arguments but 9 were given

I’m using exactly the same script provided, changing only the number of GPUs from 4 to 1 and CUDA_VISIBLE_DEVICES to 0.

Any help is greatly appreciated. Thank you


LTEnjoy commented Nov 13, 2024

Hi,

I think it's due to an incompatibility with the version of pytorch-lightning. Could you downgrade your pytorch-lightning to 1.8.3?
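
For reference, pinning to that version in a pip-managed environment would look something like this (a minimal sketch; adapt to however the environment was created):

```bash
# Pin pytorch-lightning to the version suggested above
pip install pytorch-lightning==1.8.3
```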


dc2211 commented Nov 13, 2024

Thanks, that solved the previous issue, and training started just fine.

Also, is there any way to automatically select the option to not visualize the results (option 3) from the interactive prompt? I’m submitting my job through Slurm, and I assume the error I’m getting is because of that:

wandb.errors.UsageError: api_key not configured (no-tty)

I also tried setting WANDB_MODE: dryrun in the config file, as well as wandb disabled, but it did not work.

Moreover, by setting logger: False I got the error No supported gpu backend found!

Thanks
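
For a batch job with no TTY, wandb can also be silenced at the shell level before the script starts, rather than through the config file. A minimal sketch, assuming the standard wandb environment variables are respected (whether this also skips the repository's interactive prompt is untested):

```bash
# Disable wandb entirely for this job ("offline" would log locally
# without needing an API key)
export WANDB_MODE=disabled

# Alternatively, once a wandb account exists, supply the key non-interactively:
# export WANDB_API_KEY=<your-api-key>

python scripts/training.py
```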


LTEnjoy commented Nov 14, 2024

If you don't want to record your training, then setting logger to False should work. The error No supported gpu backend found! seems to be caused by your hardware configuration. How did you run it normally before, given that you said "that solved the previous issue, and training started just fine"?


dc2211 commented Nov 14, 2024

To run it normally, I requested the necessary resources through srun in the terminal, then ran scripts/training.py, interactively chose option 3, and everything started just fine.

The issue appeared when I submitted the training through sbatch. Currently it is not giving any error, but it is stuck at:

All distributed processes registered. Starting with 1 processes

I decided to create an account on wandb, but the problem with sbatch persists, and the job is still stuck.


LTEnjoy commented Nov 14, 2024

The problem is more likely due to the sbatch command than to wandb. I'm not familiar with Slurm. Perhaps you could check whether sbatch does some additional operations that conflict with the Python script?
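
In case it helps with that comparison, a stripped-down sbatch script for a single-GPU run generally looks like the sketch below. This is a generic SLURM baseline rather than anything taken from this project's documentation, and the directive values are placeholders to adapt to your cluster and config:

```bash
#!/bin/bash
#SBATCH --job-name=finetune_650m       # placeholder name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1            # Lightning generally expects one task per GPU under SLURM
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00

# Silence wandb for the non-interactive job (see the note above)
export WANDB_MODE=disabled

# Launch through srun so the task inherits the SLURM-provided environment
srun python scripts/training.py
```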
