Fine-Tuning vs. From-Scratch Training #341
Unanswered · naoki-titech asked this question in Q&A · Replies: 2 comments 3 replies
-
Hey, fine-tuning is very much an ongoing area of research for us. To help you, I would need to know more about your training set, your test set, and how you do the fine-tuning.
-
One thing is that your number of channels is very small, so there is inherently limited flexibility in your model.
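To see why a small `num_channels` limits capacity, note that the weight count of a channel-mixing layer grows roughly quadratically with the channel width. The helper below is a hypothetical illustration (not MACE code), using a plain dense layer as a stand-in:

```python
# Illustration (not MACE internals): the parameter count of a single
# C_in -> C_out dense layer is C_in * C_out weights plus C_out biases,
# so doubling the channel width roughly quadruples per-layer capacity.
def linear_params(c_in: int, c_out: int, bias: bool = True) -> int:
    return c_in * c_out + (c_out if bias else 0)

for c in (64, 128, 256):
    print(c, linear_params(c, c))
# 64  ->  4160
# 128 -> 16512
# 256 -> 65792
```

With `num_channels: 64` as in the configs below, the model has comparatively little headroom, which can matter when a single set of weights must cover both the foundation domain and the new compound.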
-
Hello everyone,
As someone relatively new to MACE, I've encountered an intriguing situation regarding fine-tuning a foundation model and would greatly appreciate any insights or suggestions from this knowledgeable community.
Background:
I've developed a foundation model trained from scratch on DFT results for 100,000 structures taken from structural-relaxation trajectories. The results were encouraging: MAEs of about 20 meV/atom for energy, 40 meV/Å for forces, and 1 meV/Å³ for stress.
I then attempted to fine-tune this model on a dataset of about 1,000 DFT-MD trajectories of a specific compound at 2000 K (NVT ensemble). While fine-tuning improved the test-set MAE from 140 to 80 meV/atom, training a new model from scratch on this smaller dataset yielded a significantly better MAE of 40 meV/atom. Both approaches used the same architecture and number of parameters.
In attempts to improve the fine-tuned MAE, I experimented with the following, but none of them improved the score:
- Increasing the learning rate to 0.01
- Setting E0s to their average
- Reducing the batch size from 10 to 2
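For what it's worth, a common fine-tuning tactic that points the opposite way from raising the global learning rate is to use per-group learning rates: a small rate for the pretrained body and a larger one for the readout, or freezing the body outright. The sketch below is generic PyTorch, not MACE's fine-tuning interface; the `Sequential` layers are stand-ins for the interaction blocks and readout head:

```python
import torch
import torch.nn as nn

# Hedged sketch (plain PyTorch, not MACE-specific): a tiny stand-in
# model where model[0] plays the pretrained interaction blocks and
# model[2] plays the readout head.
model = nn.Sequential(
    nn.Linear(64, 64),   # stand-in for pretrained blocks
    nn.SiLU(),
    nn.Linear(64, 1),    # stand-in for the readout
)
pretrained, readout = model[0], model[2]

# Option A: freeze the pretrained part entirely.
for p in pretrained.parameters():
    p.requires_grad = False

# Option B: keep it trainable, but 100x slower than the readout,
# using optimizer parameter groups.
for p in pretrained.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": pretrained.parameters(), "lr": 1e-4},
    {"params": readout.parameters(), "lr": 1e-2},
])
print([g["lr"] for g in optimizer.param_groups])  # [0.0001, 0.01]
```

Whether MACE exposes this directly I can't say, but the trade-off it targets, adapting the readout quickly while perturbing the pretrained representation slowly, is exactly the over/underfitting balance described below.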
Input parameters for fine-tuning:
2024-03-10 13:32:56.104 INFO: Configuration: {'num_workers': 1, 'batch_size': 10, 'valid_batch_size': 10, 'foundation_model': 'fmace1.1', 'restart_latest': False, 'device': 'cuda', 'name': 'fine_tuning', 'train_file': './data/train.xyz', 'test_file': './data/test.xyz', 'valid_fraction': 0.1, 'max_num_epochs': 100, 'patience': 10, 'default_dtype': 'float32', 'test_dir': None, 'statistics_file': None, 'valid_file': None, 'seed': 123, 'log_dir': 'logs', 'model_dir': '.', 'checkpoints_dir': 'checkpoints', 'results_dir': 'results', 'log_level': 'INFO', 'error_table': 'PerAtomMAE', 'model': 'ScaleShiftMACE', 'r_max': 6.0, 'radial_type': 'bessel', 'num_radial_basis': 10, 'num_cutoff_basis': 5, 'interaction': 'RealAgnosticResidualInteractionBlock', 'interaction_first': 'RealAgnosticResidualInteractionBlock', 'max_ell': 3, 'correlation': 3, 'num_interactions': 2, 'MLP_irreps': '16x0e', 'radial_MLP': [64, 64, 64], 'hidden_irreps': None, 'num_channels': 64, 'max_L': 1, 'gate': 'silu', 'scaling': 'rms_forces_scaling', 'avg_num_neighbors': 1, 'compute_avg_num_neighbors': True, 'compute_stress': True, 'compute_forces': True, 'multi_processed_test': True, 'pin_memory': True, 'atomic_numbers': None, 'mean': None, 'std': None, 'E0s': 'average', 'energy_key': 'energy', 'forces_key': 'forces', 'stress_key': 'stress', 'loss': 'weighted', 'energy_weight': 1.0, 'forces_weight': 1.0, 'stress_weight': 50.0, 'swa_energy_weight': 1.0, 'swa_forces_weight': 10.0, 'swa_stress_weight': 50.0, 'huber_delta': 0.01, 'optimizer': 'adam', 'lr': 0.01, 'swa': False, 'swa_lr': 0.001, 'weight_decay': 1e-08, 'amsgrad': True, 'scheduler': 'ReduceLROnPlateau', 'lr_factor': 0.8, 'scheduler_patience': 20, 'lr_scheduler_gamma': 0.9993, 'start_swa': None, 'ema': True, 'ema_decay': 0.995, 'eval_interval': 1, 'keep_checkpoints': True, 'save_cpu': True, 'clip_grad': 100.0, 'config_type_weights': '{"Default":1.0}', 'wandb': False, 'restart_lr': None, 'foundation_model_readout': True}
Input parameters for From-scratch training:
2024-03-11 14:03:26.265 INFO: Configuration: {'num_workers': 1, 'batch_size': 10, 'valid_batch_size': 10, 'foundation_model': None, 'restart_latest': False, 'device': 'cuda', 'name': 'scratch', 'train_file': './data/train.xyz', 'test_file': './data/test.xyz', 'valid_fraction': 0.1, 'max_num_epochs': 200, 'patience': 10, 'default_dtype': 'float32', 'test_dir': None, 'statistics_file': None, 'valid_file': None, 'seed': 123, 'log_dir': 'logs', 'model_dir': '.', 'checkpoints_dir': 'checkpoints', 'results_dir': 'results', 'log_level': 'INFO', 'error_table': 'PerAtomMAE', 'model': 'ScaleShiftMACE', 'r_max': 6.0, 'radial_type': 'bessel', 'num_radial_basis': 10, 'num_cutoff_basis': 5, 'interaction': 'RealAgnosticResidualInteractionBlock', 'interaction_first': 'RealAgnosticResidualInteractionBlock', 'max_ell': 3, 'correlation': 3, 'num_interactions': 2, 'MLP_irreps': '16x0e', 'radial_MLP': [64, 64, 64], 'hidden_irreps': None, 'num_channels': 64, 'max_L': 1, 'gate': 'silu', 'scaling': 'rms_forces_scaling', 'avg_num_neighbors': 1, 'compute_avg_num_neighbors': True, 'compute_stress': True, 'compute_forces': True, 'multi_processed_test': True, 'pin_memory': True, 'atomic_numbers': None, 'mean': None, 'std': None, 'E0s': 'average', 'energy_key': 'energy', 'forces_key': 'forces', 'stress_key': 'stress', 'loss': 'weighted', 'energy_weight': 1.0, 'forces_weight': 1.0, 'stress_weight': 50.0, 'swa_energy_weight': 1.0, 'swa_forces_weight': 10.0, 'swa_stress_weight': 50.0, 'huber_delta': 0.01, 'optimizer': 'adam', 'lr': 0.01, 'swa': False, 'swa_lr': 0.001, 'weight_decay': 1e-08, 'amsgrad': True, 'scheduler': 'ReduceLROnPlateau', 'lr_factor': 0.8, 'scheduler_patience': 20, 'lr_scheduler_gamma': 0.9993, 'start_swa': None, 'ema': True, 'ema_decay': 0.995, 'eval_interval': 1, 'keep_checkpoints': True, 'save_cpu': True, 'clip_grad': 100.0, 'config_type_weights': '{"Default":1.0}', 'wandb': False, 'restart_lr': None, 'foundation_model_readout': False}
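Since the two logged `Configuration` dicts are long, a quick programmatic diff confirms which settings actually differ. The dicts below are a hand-copied subset of the keys from the logs above, not the full configurations:

```python
# Sketch: diff two config dicts to find the keys whose values differ.
def diff_configs(a: dict, b: dict) -> dict:
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b)
            if a.get(k) != b.get(k)}

# Subset of the logged keys (copied from the two runs above).
fine_tune = {'foundation_model': 'fmace1.1', 'max_num_epochs': 100,
             'foundation_model_readout': True, 'lr': 0.01,
             'num_channels': 64, 'max_L': 1}
scratch   = {'foundation_model': None, 'max_num_epochs': 200,
             'foundation_model_readout': False, 'lr': 0.01,
             'num_channels': 64, 'max_L': 1}

print(sorted(diff_configs(fine_tune, scratch)))
# ['foundation_model', 'foundation_model_readout', 'max_num_epochs']
```

So, apart from the run names and timestamps, the two runs differ only in `foundation_model`, `foundation_model_readout`, and `max_num_epochs` (100 vs. 200); architecture and optimizer settings are identical.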
Concern:
It appears that fine-tuning may be restricting the model's ability to adapt, leading to underfitting on this out-of-domain data. Conversely, training from scratch on the smaller dataset tends to overfit.
Question:
When loading the foundation model for fine-tuning, are certain parameters inherently fixed to their values in the foundation model, potentially limiting flexibility? How might I better balance the trade-offs between overfitting and underfitting in this context?
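One generic way to balance this trade-off (hedged sketch, not a MACE feature I can confirm) is "replay": mix a small fraction of the original foundation training set into the fine-tuning data, so the model adapts to the new compound without drifting far from the pretrained solution. The function and dataset sizes below are hypothetical stand-ins:

```python
import random

# Hedged sketch: build a fine-tuning set that is mostly new-domain
# frames plus a small replayed sample of the original training set.
def mixed_dataset(new_data, old_data, replay_fraction=0.1, seed=0):
    rng = random.Random(seed)
    n_replay = int(len(new_data) * replay_fraction)
    replay = rng.sample(old_data, min(n_replay, len(old_data)))
    return list(new_data) + replay

new = list(range(1_000))     # stand-in for the ~1,000 DFT-MD configs
old = list(range(100_000))   # stand-in for the 100,000-structure set
mixed = mixed_dataset(new, old)
print(len(mixed))  # 1100
```

In MACE terms this would just mean concatenating the sampled frames into one `train.xyz` before launching the fine-tuning run.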