Unable to start finetuning #724

Open · cecilia-hong opened this issue Nov 29, 2024 · 19 comments
Labels: enhancement (New feature or request)
@cecilia-hong

Hello,

I hope you are all doing well. Thank you for your help with my previous issue; unfortunately I have run into another, this time when trying to fine-tune from a foundation model.

So for my first try, I followed the instructions here: https://mace-docs.readthedocs.io/en/latest/guide/finetuning.html, and my training input is as follows:

```bash
mace_run_train \
    --name="MACE" \
    --foundation_model="small" \
    --multiheads_finetuning=False \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --energy_weight=1.0 \
    --forces_weight=1.0 \
    --E0s="average" \
    --energy_weight=100 \
    --forces_weight=1 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --max_num_epochs=6 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --default_dtype="float64" \
    --device=cuda \
    --seed=3
```

However, my training will not start, and I cannot work out the source of the error from the output files. I have attached my log file; can you please help me with this?
MACE_run-3_debug.log

@ilyes319
Contributor

ilyes319 commented Dec 3, 2024

Does your node have internet access? Can you share your output log? There must be an error message somewhere.

@cecilia-hong
Author

Hi, I see where the problem may be now, as the computing nodes I use do not have internet access. Is internet access still required even if I download the models/checkpoints myself from https://github.com/ACEsuit/mace-mp/releases/tag/mace_mp_0b before finetuning?

Other than the log file I previously attached, the job output had the following:

```
Traceback (most recent call last):
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/cli/run_train.py", line 63, in main
    run(args)
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/cli/run_train.py", line 126, in run
    calc = mace_mp(
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/calculators/foundations_models.py", line 122, in mace_mp
    mace_calc = MACECalculator(
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/calculators/mace.py", line 129, in __init__
    self.models = [
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/mace/calculators/mace.py", line 130, in <listcomp>
    torch.load(f=model_path, map_location=device)
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/exports/csce/eddie/chem/groups/Hobday/Cecilia/anaconda3/envs/mace_new/lib/python3.10/site-packages/torch/serialization.py", line 1258, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
```

@ilyes319
Contributor

ilyes319 commented Dec 4, 2024

You can download the model from GitHub and put it in `~/.cache/mace/`, or just run the initial process on your login node.
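(Editorial note, an assumption based on standard torch/pickle behaviour rather than anything stated in the thread: `EOFError: Ran out of input` from `torch.load` usually means the file being loaded is empty or truncated, as happens when a download is cut off. Inspecting the cache will show this:)

```bash
# A 0-byte model file here would explain the EOFError above;
# delete it and re-download before retrying.
ls -l ~/.cache/mace/
```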

@cecilia-hong
Author

I had previously tried downloading the model and specifying the path to it in my training input like this:

```bash
mace_run_train \
    --name="MACE" \
    --foundation_model="/work/ec225/ec225/cchong/from_eddie/fine_tune/mace_agnesi_small.model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="valid.xyz" \
    --loss="universal" \
    --energy_weight=1.0 \
    --forces_weight=1.0 \
    --forces_key="REF_forces" \
    --energy_key="REF_energy" \
    --E0s='{1:-13.6131, 6:-1006.8939, 8:-2042.5049, 31:-52313.1072 }' \
    --eval_interval=1 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --max_num_epochs=6 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --default_dtype="float64" \
    --device=cuda \
    --seed=3
```
But that also failed with this output:

```
Matplotlib created a temporary cache directory at /dev/shm/cchong_6174208/matplotlib-39h7ig7_ because the default path (/home/ec225/ec225/cchong/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-10-21 20:30:31.266 INFO: ===========VERIFYING SETTINGS===========
2024-10-21 20:30:31.280 INFO: MACE version: 0.3.7
2024-10-21 20:30:32.271 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-21 20:30:34.520 INFO: Using foundation model /work/ec225/ec225/cchong/from_eddie/fine_tune/mace_agnesi_small.model as initial checkpoint.
2024-10-21 20:30:34.561 INFO: ===========LOADING INPUT DATA===========
2024-10-21 20:30:34.561 INFO: Using heads: ['default']
2024-10-21 20:30:34.561 INFO: ============= Processing head default ===========
2024-10-21 20:30:35.451 INFO: Training set [841 configs, 841 energy, 131196 forces] loaded from 'train.xyz'
2024-10-21 20:30:35.478 INFO: Using random 5% of training set for validation with indices saved in: ./valid_indices_3.txt
2024-10-21 20:30:35.478 INFO: Validaton set contains 42 configurations [42 energy, 6552 forces]
2024-10-21 20:30:35.792 INFO: Test set (841 configs) loaded from 'valid.xyz':
2024-10-21 20:30:35.792 INFO: Default_Default: 841 configs, 841 energy, 131196 forces
2024-10-21 20:30:35.792 INFO: Total number of configurations: train=799, valid=42, tests=[Default_Default: 841],
2024-10-21 20:30:35.792 INFO: ==================Using multiheads finetuning mode==================
2024-10-21 20:30:35.793 INFO: Using foundation model for multiheads finetuning with Materials Project data

mace_run_train 8
sys.exit(main())

run_train.py 63 main
run(args)

run_train.py 261 run
collections = assemble_mp_data(args, tag, head_configs)

multihead_tools.py 183 assemble_mp_data
raise RuntimeError(

RuntimeError:
Model or descriptors download failed and no local model found
```
I am pretty certain that the path is correct, though. Or does the model have to be in the `~/.cache/mace/` directory?

@ilyes319
Contributor

ilyes319 commented Dec 4, 2024

Sorry, it is not just the model you need to download: you also need https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/mp_traj_combined.xyz and https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/descriptors.npy in your `~/.cache/mace/`. The easiest is to do a warmup run on your login node, killing it just after it says it has downloaded the data.
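(In shell form, the manual route is roughly the following sketch, using the exact URLs from this comment:)

```bash
# Pre-seed the default MACE cache from a node with internet access.
mkdir -p ~/.cache/mace
wget -P ~/.cache/mace \
    https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/mp_traj_combined.xyz
wget -P ~/.cache/mace \
    https://github.com/ACEsuit/mace-mp/releases/download/mace_mp_0b/descriptors.npy
```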

@bernstei
Collaborator

bernstei commented Dec 4, 2024

It might be nice to have a `mace_run_train ... --exit_after_pretraining_download` command-line option, for those of us who use non-internet-connected compute nodes.

@alinelena
Contributor

Or maybe something more like a `--cache-files` option that only downloads.

@gabor1
Collaborator

gabor1 commented Dec 5, 2024

Or how about a `--dry-run` option which does various tasks up to, but not including, training? We could have dry-run levels: the first just checks argument validity, the second downloads (and saves) files if needed, the third evaluates the loss once without updating any weights.
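(A hypothetical sketch of how such levels could be exposed; nothing here exists in mace, and the flag name and semantics are just the proposal above:)

```python
import argparse

# Hypothetical --dry-run flag mirroring the three proposed levels.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--dry-run",
    type=int,
    choices=[1, 2, 3],
    help=(
        "1: check argument validity only; "
        "2: also download and cache any required files; "
        "3: also evaluate the loss once without updating weights"
    ),
)
args = parser.parse_args(["--dry-run", "2"])
print(args.dry_run)  # -> 2
```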

@cecilia-hong
Author

I was wondering if there is a way to specify the path of the `~/.cache/mace` directory, or to redefine it elsewhere?
I think the issue I am facing now is that I am working on CIRRUS, where batch jobs are run on /work/ and cannot access /home/.

@ilyes319
Contributor

ilyes319 commented Dec 5, 2024

I see. Currently there is no way to do that from mace; maybe you can hack it via your environment variables, like a path link. We should add an option to provide a path for these files.

@cecilia-hong
Author

Got it, I will play around with that, thank you!

@alinelena
Contributor

I have not tried this, but setting XDG_CACHE_HOME in your script to point to a .cache folder on /work may help.

@alinelena
Contributor

OK, to answer myself: this will not work, since the path is hardcoded...

```
mace/calculators/foundations_models.py:    cache_dir = os.path.expanduser("~/.cache/mace")
mace/calculators/foundations_models.py:            cache_dir = os.path.expanduser("~/.cache/mace")
mace/tools/multihead_tools.py:        cache_dir = os.path.expanduser("~/.cache/mace")
```

You can edit the three lines to your path, or maybe have something like

```python
cache_dir = os.path.expanduser(
    os.path.join(os.environ.get("XDG_CACHE_HOME", "~/.cache"), "mace")
)
```

@ilyes319 if you are happy I can PR this change.
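(Consolidated into a helper, the change discussed here might look like the following sketch; the function name `mace_cache_dir` is hypothetical and the eventual PR may differ:)

```python
import os

def mace_cache_dir() -> str:
    # Honour XDG_CACHE_HOME when set (e.g. pointed at /work on CIRRUS);
    # otherwise fall back to the existing default, ~/.cache/mace.
    base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return os.path.join(base, "mace")
```

With XDG_CACHE_HOME unset this still resolves to `~/.cache/mace`, so the default behaviour is unchanged, which answers the question below.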

@ilyes319
Contributor

ilyes319 commented Dec 9, 2024

Sure, happy to merge. Will it change anything about the default?

ilyes319 added the enhancement (New feature or request) label on Dec 12, 2024
@cecilia-hong
Author

Hi all, thank you for the help and suggestions.

I tried changing the cache_dir lines in both multihead_tools.py and foundation_models.py to point to a directory under my working directory:

```python
cache_dir = os.path.expanduser("/work/ec225/ec225/cchong/mace_cache")
```

Into that directory, I downloaded the following files:

```
mace_agnesi_small.model
descriptors.npy
mp_traj_combined.xyz
```

Unfortunately, I am still getting the error when I try to start the finetuning:

```
2024-12-17 15:05:15.726 INFO: ===========VERIFYING SETTINGS===========
2024-12-17 15:05:15.727 INFO: MACE version: 0.3.7
2024-12-17 15:05:16.710 INFO: CUDA version: 11.8, CUDA device: 0
2024-12-17 15:05:17.376 INFO: Using foundation model /work/ec225/ec225/cchong/mace_cache/mace_agnesi_small.model as initial checkpoint.
2024-12-17 15:05:17.376 INFO: ===========LOADING INPUT DATA===========
2024-12-17 15:05:17.376 INFO: Using heads: ['default']
2024-12-17 15:05:17.376 INFO: ============= Processing head default ===========
2024-12-17 15:05:20.126 INFO: Training set [7000 configs, 0 energy, 1092000 forces] loaded from 'train.xyz'
2024-12-17 15:05:20.129 INFO: Using random 5% of training set for validation with indices saved in: ./valid_indices_3.txt
2024-12-17 15:05:20.130 INFO: Validaton set contains 350 configurations [0 energy, 54600 forces]
2024-12-17 15:05:22.907 INFO: Test set (7000 configs) loaded from 'test.xyz':
2024-12-17 15:05:22.910 INFO: Default_Default: 7000 configs, 0 energy, 1092000 forces
2024-12-17 15:05:22.910 INFO: Total number of configurations: train=6650, valid=350, tests=[Default_Default: 7000],
2024-12-17 15:05:22.910 INFO: ==================Using multiheads finetuning mode==================
2024-12-17 15:05:22.910 INFO: Using foundation model for multiheads finetuning with Materials Project data
2024-12-17 15:05:22.911 INFO: Downloading MP structures for finetuning

mace_run_train 8
sys.exit(main())

run_train.py 63 main
run(args)

run_train.py 261 run
collections = assemble_mp_data(args, tag, head_configs)

multihead_tools.py 183 assemble_mp_data
raise RuntimeError(

RuntimeError:
Model or descriptors download failed and no local model found
```

Sorry for being a pain, but can you tell me where I might be going wrong with this, please?

@alinelena
Contributor

Did you reinstall the changed version?

@alinelena
Contributor

@cecilia-hong can you try this: #755

You can install it with:

```bash
python3 -m pip install -U git+https://github.com/alinelena/mace@custom_cache
```

But I suggest testing in a clean environment. All you need to do then is:

```bash
XDG_CACHE_HOME=yourpath mace_run_train ...
```
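(For a CIRRUS batch job, that could look like the following sketch; the /work path is illustrative, taken from earlier in the thread:)

```bash
# Point the cache at /work, which the batch nodes can write to and read from.
export XDG_CACHE_HOME=/work/ec225/ec225/cchong/.cache
mkdir -p "$XDG_CACHE_HOME/mace"   # pre-seed with the files downloaded earlier
mace_run_train --name="MACE" ...  # remaining flags as in the commands above
```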

@cecilia-hong
Author

Hi Alin,

Sorry for taking so long to get back to you. I made a new Python virtual environment to install the changed version of MACE, but have been having trouble getting the right modules into the env, so I have not been able to test it yet. I will definitely let you know once I get it working!

@cecilia-hong
Author

Hello, I can confirm the version you sent is working now, thank you!
Hope you all have a lovely festive season!

ilyes319 added a commit that referenced this issue Dec 20, 2024
allow custom cache based on XDG_CACHE_HOME env variable, addresses #724