Fine-Tuning vs. From-Scratch Training #341
Unanswered · naoki-titech asked this question in Q&A · Replies: 2 comments 3 replies
-
Hey, fine-tuning is very much an ongoing area of research for us. To help you, I would need to know more about your training set, your test set, and how you do the fine-tuning.
-
One thing is that your number of channels is very small, so there is inherently limited flexibility in your model.
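To see why a small `num_channels` limits capacity, note that the weight count of a channel-mixing layer grows roughly quadratically with the channel width. The helper below is a hypothetical illustration (not MACE code), using a plain dense layer as a stand-in:

```python
# Illustration (not MACE internals): the parameter count of a single
# C_in -> C_out dense layer is C_in * C_out weights plus C_out biases,
# so doubling the channel width roughly quadruples per-layer capacity.
def linear_params(c_in: int, c_out: int, bias: bool = True) -> int:
    return c_in * c_out + (c_out if bias else 0)

for c in (64, 128, 256):
    print(c, linear_params(c, c))
# 64  ->  4160
# 128 -> 16512
# 256 -> 65792
```

With `num_channels: 64` as in the configs below, the model has comparatively little headroom, which can matter when a single set of weights must cover both the foundation domain and the new compound.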
-
Hello everyone,
As someone relatively new to MACE, I've encountered an intriguing situation regarding fine-tuning a foundation model and would greatly appreciate any insights or suggestions from this knowledgeable community.
Background:
I've developed a foundation model trained from scratch on DFT results for 100,000 structures taken from structural-relaxation trajectories. The results were encouraging: MAEs of about 20 meV/atom for energy, 40 meV/Å for forces, and 1 meV/Å³ for stress.
I then attempted to fine-tune this model on a dataset of about 1,000 DFT-MD trajectories of a specific compound at 2000 K (NVT ensemble). While fine-tuning improved the test-set MAE from 140 to 80 meV/atom, training a new model from scratch on this smaller dataset yielded a significantly better MAE of 40 meV/atom. Both approaches used the same architecture and number of parameters.
In attempts to improve the fine-tuned MAE, I experimented with the following, but none of them improved the score:
- Increasing the learning rate to 0.01
- Setting E0s to their average
- Reducing the batch size from 10 to 2
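For what it's worth, a common fine-tuning tactic that points the opposite way from raising the global learning rate is to use per-group learning rates: a small rate for the pretrained body and a larger one for the readout, or freezing the body outright. The sketch below is generic PyTorch, not MACE's fine-tuning interface; the `Sequential` layers are stand-ins for the interaction blocks and readout head:

```python
import torch
import torch.nn as nn

# Hedged sketch (plain PyTorch, not MACE-specific): a tiny stand-in
# model where model[0] plays the pretrained interaction blocks and
# model[2] plays the readout head.
model = nn.Sequential(
    nn.Linear(64, 64),   # stand-in for pretrained blocks
    nn.SiLU(),
    nn.Linear(64, 1),    # stand-in for the readout
)
pretrained, readout = model[0], model[2]

# Option A: freeze the pretrained part entirely.
for p in pretrained.parameters():
    p.requires_grad = False

# Option B: keep it trainable, but 100x slower than the readout,
# using optimizer parameter groups.
for p in pretrained.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": pretrained.parameters(), "lr": 1e-4},
    {"params": readout.parameters(), "lr": 1e-2},
])
print([g["lr"] for g in optimizer.param_groups])  # [0.0001, 0.01]
```

Whether MACE exposes this directly I can't say, but the trade-off it targets, adapting the readout quickly while perturbing the pretrained representation slowly, is exactly the over/underfitting balance described below.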
Input parameters for fine-tuning:
2024-03-10 13:32:56.104 INFO: Configuration: {'num_workers': 1, 'batch_size': 10, 'valid_batch_size': 10, 'foundation_model': 'fmace1.1', 'restart_latest': False, 'device': 'cuda', 'name': 'fine_tuning', 'train_file': './data/train.xyz', 'test_file': './data/test.xyz', 'valid_fraction': 0.1, 'max_num_epochs': 100, 'patience': 10, 'default_dtype': 'float32', 'test_dir': None, 'statistics_file': None, 'valid_file': None, 'seed': 123, 'log_dir': 'logs', 'model_dir': '.', 'checkpoints_dir': 'checkpoints', 'results_dir': 'results', 'log_level': 'INFO', 'error_table': 'PerAtomMAE', 'model': 'ScaleShiftMACE', 'r_max': 6.0, 'radial_type': 'bessel', 'num_radial_basis': 10, 'num_cutoff_basis': 5, 'interaction': 'RealAgnosticResidualInteractionBlock', 'interaction_first': 'RealAgnosticResidualInteractionBlock', 'max_ell': 3, 'correlation': 3, 'num_interactions': 2, 'MLP_irreps': '16x0e', 'radial_MLP': [64, 64, 64], 'hidden_irreps': None, 'num_channels': 64, 'max_L': 1, 'gate': 'silu', 'scaling': 'rms_forces_scaling', 'avg_num_neighbors': 1, 'compute_avg_num_neighbors': True, 'compute_stress': True, 'compute_forces': True, 'multi_processed_test': True, 'pin_memory': True, 'atomic_numbers': None, 'mean': None, 'std': None, 'E0s': 'average', 'energy_key': 'energy', 'forces_key': 'forces', 'stress_key': 'stress', 'loss': 'weighted', 'energy_weight': 1.0, 'forces_weight': 1.0, 'stress_weight': 50.0, 'swa_energy_weight': 1.0, 'swa_forces_weight': 10.0, 'swa_stress_weight': 50.0, 'huber_delta': 0.01, 'optimizer': 'adam', 'lr': 0.01, 'swa': False, 'swa_lr': 0.001, 'weight_decay': 1e-08, 'amsgrad': True, 'scheduler': 'ReduceLROnPlateau', 'lr_factor': 0.8, 'scheduler_patience': 20, 'lr_scheduler_gamma': 0.9993, 'start_swa': None, 'ema': True, 'ema_decay': 0.995, 'eval_interval': 1, 'keep_checkpoints': True, 'save_cpu': True, 'clip_grad': 100.0, 'config_type_weights': '{"Default":1.0}', 'wandb': False, 'restart_lr': None, 'foundation_model_readout': True}
Input parameters for From-scratch training:
2024-03-11 14:03:26.265 INFO: Configuration: {'num_workers': 1, 'batch_size': 10, 'valid_batch_size': 10, 'foundation_model': None, 'restart_latest': False, 'device': 'cuda', 'name': 'scratch', 'train_file': './data/train.xyz', 'test_file': './data/test.xyz', 'valid_fraction': 0.1, 'max_num_epochs': 200, 'patience': 10, 'default_dtype': 'float32', 'test_dir': None, 'statistics_file': None, 'valid_file': None, 'seed': 123, 'log_dir': 'logs', 'model_dir': '.', 'checkpoints_dir': 'checkpoints', 'results_dir': 'results', 'log_level': 'INFO', 'error_table': 'PerAtomMAE', 'model': 'ScaleShiftMACE', 'r_max': 6.0, 'radial_type': 'bessel', 'num_radial_basis': 10, 'num_cutoff_basis': 5, 'interaction': 'RealAgnosticResidualInteractionBlock', 'interaction_first': 'RealAgnosticResidualInteractionBlock', 'max_ell': 3, 'correlation': 3, 'num_interactions': 2, 'MLP_irreps': '16x0e', 'radial_MLP': [64, 64, 64], 'hidden_irreps': None, 'num_channels': 64, 'max_L': 1, 'gate': 'silu', 'scaling': 'rms_forces_scaling', 'avg_num_neighbors': 1, 'compute_avg_num_neighbors': True, 'compute_stress': True, 'compute_forces': True, 'multi_processed_test': True, 'pin_memory': True, 'atomic_numbers': None, 'mean': None, 'std': None, 'E0s': 'average', 'energy_key': 'energy', 'forces_key': 'forces', 'stress_key': 'stress', 'loss': 'weighted', 'energy_weight': 1.0, 'forces_weight': 1.0, 'stress_weight': 50.0, 'swa_energy_weight': 1.0, 'swa_forces_weight': 10.0, 'swa_stress_weight': 50.0, 'huber_delta': 0.01, 'optimizer': 'adam', 'lr': 0.01, 'swa': False, 'swa_lr': 0.001, 'weight_decay': 1e-08, 'amsgrad': True, 'scheduler': 'ReduceLROnPlateau', 'lr_factor': 0.8, 'scheduler_patience': 20, 'lr_scheduler_gamma': 0.9993, 'start_swa': None, 'ema': True, 'ema_decay': 0.995, 'eval_interval': 1, 'keep_checkpoints': True, 'save_cpu': True, 'clip_grad': 100.0, 'config_type_weights': '{"Default":1.0}', 'wandb': False, 'restart_lr': None, 'foundation_model_readout': False}
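Since the two logged `Configuration` dicts are long, a quick programmatic diff confirms which settings actually differ. The dicts below are a hand-copied subset of the keys from the logs above, not the full configurations:

```python
# Sketch: diff two config dicts to find the keys whose values differ.
def diff_configs(a: dict, b: dict) -> dict:
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b)
            if a.get(k) != b.get(k)}

# Subset of the logged keys (copied from the two runs above).
fine_tune = {'foundation_model': 'fmace1.1', 'max_num_epochs': 100,
             'foundation_model_readout': True, 'lr': 0.01,
             'num_channels': 64, 'max_L': 1}
scratch   = {'foundation_model': None, 'max_num_epochs': 200,
             'foundation_model_readout': False, 'lr': 0.01,
             'num_channels': 64, 'max_L': 1}

print(sorted(diff_configs(fine_tune, scratch)))
# ['foundation_model', 'foundation_model_readout', 'max_num_epochs']
```

So, apart from the run names and timestamps, the two runs differ only in `foundation_model`, `foundation_model_readout`, and `max_num_epochs` (100 vs. 200); architecture and optimizer settings are identical.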
Concern:
It appears that fine-tuning may be restricting the model's ability to adapt, leading to underfitting on this out-of-domain data. Conversely, training from scratch on the smaller dataset tends to overfit.
Question:
When loading the foundation model for fine-tuning, are certain parameters inherently fixed to their values in the foundation model, potentially limiting flexibility? How might I better balance the trade-offs between overfitting and underfitting in this context?
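One generic way to balance this trade-off (hedged sketch, not a MACE feature I can confirm) is "replay": mix a small fraction of the original foundation training set into the fine-tuning data, so the model adapts to the new compound without drifting far from the pretrained solution. The function and dataset sizes below are hypothetical stand-ins:

```python
import random

# Hedged sketch: build a fine-tuning set that is mostly new-domain
# frames plus a small replayed sample of the original training set.
def mixed_dataset(new_data, old_data, replay_fraction=0.1, seed=0):
    rng = random.Random(seed)
    n_replay = int(len(new_data) * replay_fraction)
    replay = rng.sample(old_data, min(n_replay, len(old_data)))
    return list(new_data) + replay

new = list(range(1_000))     # stand-in for the ~1,000 DFT-MD configs
old = list(range(100_000))   # stand-in for the 100,000-structure set
mixed = mixed_dataset(new, old)
print(len(mixed))  # 1100
```

In MACE terms this would just mean concatenating the sampled frames into one `train.xyz` before launching the fine-tuning run.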