Unable to start finetuning #724

Open · cecilia-hong opened this issue Nov 29, 2024 · 19 comments
Labels: enhancement (New feature or request)
@cecilia-hong

Hello,

I hope you are all doing well. Thank you for your help with my previous issue; unfortunately I have run into another, this time when trying to fine-tune from a foundation model.

So for my first try, I followed the instructions here: https://mace-docs.readthedocs.io/en/latest/guide/finetuning.html, and my training input is as follows:

```bash
mace_run_train \
    --name="MACE" \
    --foundation_model="small" \
    --multiheads_finetuning=False \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --energy_weight=1.0 \
    --forces_weight=1.0 \
    --E0s="average" \
    --energy_weight=100 \
    --forces_weight=1 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --max_num_epochs=6 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --default_dtype="float64" \
    --device=cuda \
    --seed=3
```

However, my training will not start, and I cannot work out the source of the error from the output files. I have attached my log file; can you please help me with this?
MACE_run-3_debug.log

@ilyes319
Contributor

ilyes319 commented Dec 3, 2024

Does your node have internet access? Can you share your output log? There must be an error message somewhere.

@cecilia-hong
Author

Hi, I see where the problem may be now, as the computing nodes I use do not have internet access. Is internet access still required even if I download the models/checkpoints myself from https://github.com/ACEsuit/mace-mp/releases/tag/mace_mp_0b before finetuning?

Other than the log file I previously attached, the job output had the following:

```
Traceback (most recent call last):
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/cli/run_train.py", line 63, in main
    run(args)
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/cli/run_train.py", line 126, in run
    calc = mace_mp(
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/calculators/foundations_models.py", line 122, in mace_mp
    mace_calc = MACECalculator(
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/calculators/mace.py", line 129, in __init__
    self.models = [
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/calculators/mace.py", line 130, in <listcomp>
    torch.load(f=model_path, map_location=device)
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/torch/serialization.py", line 1258, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
```

@ilyes319
Contributor

ilyes319 commented Dec 4, 2024

You can download the model from GitHub and put it in `~/.cache/mace/`, or just run the initial process on your login node.
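(Editorial note, an assumption based on standard torch/pickle behaviour rather than anything stated in the thread: `EOFError: Ran out of input` from `torch.load` usually means the file being loaded is empty or truncated, as happens when a download is cut off. Inspecting the cache will show this:)

```bash
# A 0-byte model file here would explain the EOFError above;
# delete it and re-download before retrying.
ls -l ~/.cache/mace/
```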

@cecilia-hong
Author

I had previously tried downloading the model and specifying the path to it in my training input like this:

```bash
mace_run_train \
    --name="MACE" \
    --foundation_model="/work/ec225/ec225/cchong/from_eddie/fine_tune/mace_agnesi_small.model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="valid.xyz" \
    --loss="universal" \
    --energy_weight=1.0 \
    --forces_weight=1.0 \
    --forces_key="REF_forces" \
    --energy_key="REF_energy" \
    --E0s='{1:-13.6131, 6:-1006.8939, 8:-2042.5049, 31:-52313.1072 }' \
    --eval_interval=1 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --max_num_epochs=6 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --default_dtype="float64" \
    --device=cuda \
    --seed=3
```
But that also failed with this output:

```
Matplotlib created a temporary cache directory at /dev/shm/cchong_6174208/matplotlib-39h7ig7_ because the default path (/home/ec225/ec225/cchong/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-10-21 20:30:31.266 INFO: ===========VERIFYING SETTINGS===========
2024-10-21 20:30:31.280 INFO: MACE version: 0.3.7
2024-10-21 20:30:32.271 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-21 20:30:34.520 INFO: Using foundation model /work/ec225/ec225/cchong/from_eddie/fine_tune/mace_agnesi_small.model as initial checkpoint.
2024-10-21 20:30:34.561 INFO: ===========LOADING INPUT DATA===========
2024-10-21 20:30:34.561 INFO: Using heads: ['default']
2024-10-21 20:30:34.561 INFO: ============= Processing head default ===========
2024-10-21 20:30:35.451 INFO: Training set [841 configs, 841 energy, 131196 forces] loaded from 'train.xyz'
2024-10-21 20:30:35.478 INFO: Using random 5% of training set for validation with indices saved in: ./valid_indices_3.txt
2024-10-21 20:30:35.478 INFO: Validaton set contains 42 configurations [42 energy, 6552 forces]
2024-10-21 20:30:35.792 INFO: Test set (841 configs) loaded from 'valid.xyz':
2024-10-21 20:30:35.792 INFO: Default_Default: 841 configs, 841 energy, 131196 forces
2024-10-21 20:30:35.792 INFO: Total number of configurations: train=799, valid=42, tests=[Default_Default: 841],
2024-10-21 20:30:35.792 INFO: ==================Using multiheads finetuning mode==================
2024-10-21 20:30:35.793 INFO: Using foundation model for multiheads finetuning with Materials Project data

mace_run_train 8
sys.exit(main())

run_train.py 63 main
run(args)

run_train.py 261 run
collections = assemble_mp_data(args, tag, head_configs)

multihead_tools.py 183 assemble_mp_data
raise RuntimeError(

RuntimeError:
Model or descriptors download failed and no local model found
```
I am pretty certain that the path is correct, though. Or does the model have to be in the `~/.cache/mace/` directory?

@ilyes319
Contributor

ilyes319 commented Dec 4, 2024

Sorry, it is not just the model you need to download: you also need https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/mp_traj_combined.xyz and https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/descriptors.npy in your `~/.cache/mace/`. The easiest is to do a warmup run on your login node, killing it just after it says it has downloaded the data.
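(In shell form, the manual route is roughly the following sketch, using the exact URLs from this comment:)

```bash
# Pre-seed the default MACE cache from a node with internet access.
mkdir -p ~/.cache/mace
wget -P ~/.cache/mace \
    https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/mp_traj_combined.xyz
wget -P ~/.cache/mace \
    https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/descriptors.npy
```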

@bernstei
Collaborator

bernstei commented Dec 4, 2024

It might be nice to have a `mace_run_train ... --exit_after_pretraining_download` command-line option, for those of us who use non-internet-connected compute nodes.

@alinelena
Contributor

Or maybe something more like a `--cache-files` option that only downloads.

@gabor1
Collaborator

gabor1 commented Dec 5, 2024

Or how about a `--dry-run` option which does various tasks up to, but not including, training? We could have dry-run levels: the first just checks argument validity, the second downloads (and saves) files if needed, the third evaluates the loss once without updating any weights.
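(A hypothetical sketch of how such levels could be exposed; nothing here exists in mace, and the flag name and semantics are just the proposal above:)

```python
import argparse

# Hypothetical --dry-run flag mirroring the three proposed levels.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--dry-run",
    type=int,
    choices=[1, 2, 3],
    help=(
        "1: check argument validity only; "
        "2: also download and cache any required files; "
        "3: also evaluate the loss once without updating weights"
    ),
)
args = parser.parse_args(["--dry-run", "2"])
print(args.dry_run)  # -> 2
```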

@cecilia-hong
Author

I was wondering if there is a way to specify the path of the `~/.cache/mace` directory, or to redefine it elsewhere?
I think the issue I am facing now is that I am working on CIRRUS, where batch jobs are run on /work/ and cannot access /home/.

@ilyes319
Contributor

ilyes319 commented Dec 5, 2024

I see. Currently there is no way to do that from mace; maybe you can hack it via your environment variables, like a path link. We should add an option to provide a path for these files.

@cecilia-hong
Author

Got it, I will play around with that, thank you!

@alinelena
Contributor

I have not tried this, but setting XDG_CACHE_HOME in your script to point to a .cache folder on /work may help.

@alinelena
Contributor

OK, to answer myself: this will not work, since the path is hardcoded...

```
mace/calculators/foundations_models.py:    cache_dir = os.path.expanduser("~/.cache/mace")
mace/calculators/foundations_models.py:            cache_dir = os.path.expanduser("~/.cache/mace")
mace/tools/multihead_tools.py:        cache_dir = os.path.expanduser("~/.cache/mace")
```

You can edit the three lines to your path, or maybe have something like

```python
cache_dir = os.path.expanduser(
    os.path.join(os.environ.get("XDG_CACHE_HOME", "~/.cache"), "mace")
)
```

@ilyes319 if you are happy I can PR this change.
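(Consolidated into a helper, the change discussed here might look like the following sketch; the function name `mace_cache_dir` is hypothetical and the eventual PR may differ:)

```python
import os

def mace_cache_dir() -> str:
    # Honour XDG_CACHE_HOME when set (e.g. pointed at /work on CIRRUS);
    # otherwise fall back to the existing default, ~/.cache/mace.
    base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return os.path.join(base, "mace")
```

With XDG_CACHE_HOME unset this still resolves to `~/.cache/mace`, so the default behaviour is unchanged, which answers the question below.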

@ilyes319
Contributor

ilyes319 commented Dec 9, 2024

Sure, happy to merge. Will it change anything about the default?

ilyes319 added the enhancement (New feature or request) label on Dec 12, 2024
@cecilia-hong
Author

Hi all, thank you for the help and suggestions.

I tried changing the cache_dir lines in both multihead_tools.py and foundation_models.py to point to a directory under my working directory:

```python
cache_dir = os.path.expanduser("/work/ec225/ec225/cchong/mace_cache")
```

Into that directory, I downloaded the following files:

```
mace_agnesi_small.model
descriptors.npy
mp_traj_combined.xyz
```

Unfortunately, I am still getting the error when I try to start the finetuning:

```
2024-12-17 15:05:15.726 INFO: ===========VERIFYING SETTINGS===========
2024-12-17 15:05:15.727 INFO: MACE version: 0.3.7
2024-12-17 15:05:16.710 INFO: CUDA version: 11.8, CUDA device: 0
2024-12-17 15:05:17.376 INFO: Using foundation model /work/ec225/ec225/cchong/mace_cache/mace_agnesi_small.model as initial checkpoint.
2024-12-17 15:05:17.376 INFO: ===========LOADING INPUT DATA===========
2024-12-17 15:05:17.376 INFO: Using heads: ['default']
2024-12-17 15:05:17.376 INFO: ============= Processing head default ===========
2024-12-17 15:05:20.126 INFO: Training set [7000 configs, 0 energy, 1092000 forces] loaded from 'train.xyz'
2024-12-17 15:05:20.129 INFO: Using random 5% of training set for validation with indices saved in: ./valid_indices_3.txt
2024-12-17 15:05:20.130 INFO: Validaton set contains 350 configurations [0 energy, 54600 forces]
2024-12-17 15:05:22.907 INFO: Test set (7000 configs) loaded from 'test.xyz':
2024-12-17 15:05:22.910 INFO: Default_Default: 7000 configs, 0 energy, 1092000 forces
2024-12-17 15:05:22.910 INFO: Total number of configurations: train=6650, valid=350, tests=[Default_Default: 7000],
2024-12-17 15:05:22.910 INFO: ==================Using multiheads finetuning mode==================
2024-12-17 15:05:22.910 INFO: Using foundation model for multiheads finetuning with Materials Project data
2024-12-17 15:05:22.911 INFO: Downloading MP structures for finetuning

mace_run_train 8
sys.exit(main())

run_train.py 63 main
run(args)

run_train.py 261 run
collections = assemble_mp_data(args, tag, head_configs)

multihead_tools.py 183 assemble_mp_data
raise RuntimeError(

RuntimeError:
Model or descriptors download failed and no local model found
```

Sorry for being a pain, but can you tell me where I might be going wrong with this, please?

@alinelena
Contributor

Did you reinstall the changed version?

@alinelena
Contributor

@cecilia-hong can you try this: #755

You can install it with:

```bash
python3 -m pip install -U git+https://github.com/alinelena/mace@custom_cache
```

But I suggest testing in a clean environment. All you need to do then is:

```bash
XDG_CACHE_HOME=yourpath mace_run_train ...
```
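(For a CIRRUS batch job, that could look like the following sketch; the /work path is illustrative, taken from earlier in the thread:)

```bash
# Point the cache at /work, which the batch nodes can write to and read from.
export XDG_CACHE_HOME=/work/ec225/ec225/cchong/.cache
mkdir -p "$XDG_CACHE_HOME/mace"   # pre-seed with the files downloaded earlier
mace_run_train --name="MACE" ...  # remaining flags as in the commands above
```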

@cecilia-hong
Author

Hi Alin,

Sorry for taking so long to get back to you. I made a new Python virtual environment to install the changed version of MACE, but have been having trouble getting the right modules into the env, so I have not been able to test it yet. I will definitely let you know once I get it working!

@cecilia-hong
Author

Hello, I can confirm the version you sent is working now, thank you!
Hope you all have a lovely festive season!

ilyes319 added a commit that referenced this issue Dec 20, 2024
allow custom cache based on XDG_CACHE_HOME env variable, addresses #724