Source code for our paper Leveraging Large-scale Computational Database and Deep Learning for Accurate Prediction of Material Properties
The code is built on CMPNN and adapted for crystal graphs. Many thanks to the authors for sharing their code!
- cuda == 10.1
- cudnn >= 7.4.1
- pymatgen == 2020.12.18
- torch == 1.5.0
- numpy == 1.20.2
- tqdm == 4.50.0
- scikit-learn == 0.24.1
Dataset | Crystals | Properties | Metric
---|---|---|---
Materials Project | 69,239 | band_gap, formation_energy | MAE
DComPET | 74,939 | band_gap, total_energy, per_atom_energy, formation_energy, efermi, magnetization | MAE
DComPET | 1,716 | band_gap (experimental) | MAE
DComPET | 54 | band_gap (experimental) | MAE
The Materials Project dataset is available on figshare; see The Materials Project website for more details. Many thanks for their contribution in releasing the calculated crystal data.
The key to preprocessing a new dataset for our model is generating `graph_cache.pickle` and `property.csv`.
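As a starting point, here is a minimal sketch of how such a cache could be built with pymatgen, assuming `structures.tar.gz` has been extracted to a `structures/` directory as in the layout below; `preprocess.py` is the authoritative implementation:

```python
# Hedged sketch: build graph_cache.pickle (<crystal_name>: <poscar_dict>)
# from the .poscar files. preprocess.py is the real implementation.
import glob
import os
import pickle

from pymatgen.io.vasp import Poscar

graph_cache = {}
for path in glob.glob("./data/material_project/structures/*.poscar"):
    name = os.path.splitext(os.path.basename(path))[0]
    structure = Poscar.from_file(path).structure
    graph_cache[name] = structure.as_dict()  # a serializable per-crystal dict

with open("./data/material_project/graph_cache.pickle", "wb") as f:
    pickle.dump(graph_cache, f)
```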
The structure of `data/material_project`:

You can download a zip file from the Google Drive link.

- `seed_0/`: a 9-fold cross-validation and independent test splitting example generated by `preprocess.py`.
  - `train_fold_[0-9].csv`: the training set for the corresponding fold number.
  - `valid_fold_[0-9].csv`: the validation set for the corresponding fold number.
  - `test.csv`: the independent test set.
- `atom_init.json`: the element fingerprint vectors, in the format `<atom_index>: <vector>`.
- `graph_cache.pickle`: the crystal graph dict, in the format `<crystal_name>: <poscar_dict>`.
- `hubbard_u.yaml`: the atom radius dict.
- `preprocess.py`: the data preprocessing code.
- `property.csv`: the table of crystal names with their corresponding properties (band gap, total energy, per-atom energy, formation energy, efermi, magnetization); a column of all 0's means the property was not recorded.
- `structures.tar.gz`: all crystal structure files in `.poscar` format.
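To sanity-check a new `property.csv`, something like the following works (a hedged sketch; pandas is not among the pinned dependencies, and the exact column names are assumptions based on the property list above):

```python
# Hedged sketch: flag property columns that are all 0's, i.e. not recorded.
import pandas as pd  # not in the pinned dependencies; install separately

props = pd.read_csv("./data/material_project/property.csv")
for col in ["band_gap", "total_energy", "per_atom_energy",
            "formation_energy", "efermi", "magnetization"]:
    if col in props.columns and (props[col] == 0).all():
        print(f"{col}: all zeros, property not recorded")
```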
The structure of `data/DComPET`:

(Note, 2021.7.19: the dataset will be released later!)
Tip: the most time-consuming step is converting the crystal structure files into the crystal graph dict, so we generate `graph_cache.pickle` once and cache the whole dict object instead of rebuilding it during training.
For the band_gap property, run:

```bash
nohup python -u train.py --gpu 0 --seed 0 --data_path ./data/material_project --train_path ./data/material_project --dataset_name band_gap --dataset_type regression --run_fold 1 --metric mae --save_dir ./ckpt/ensemble_band_gap --epochs 200 --init_lr 1e-4 --max_lr 3e-4 --final_lr 1e-4 --no_features_scaling --show_individual_scores > ./log/fold_1_band_gap.log 2>&1
```
where the model will be stored in `ckpt/ensemble_band_gap` and the training log in `log`.
We also provide bash scripts to run all training folds in parallel; see `train_band_gap_*.sh`.
Some tips when training the model:

1. The first time you run the code, it will generate `train_fold_{args.run_fold}_crystalnet.pickle`, `valid_fold_{args.run_fold}_crystalnet.pickle`, and `test_crystalnet.pickle`, which takes some time. On later runs the pickle files are reloaded, which reduces the training time.
2. Hyperparameters can be found in `chemprop/parsing.py`; some key hyperparameters are listed below:
   - `--radius`: the crystal neighbor radius, which affects the number of neighbor atoms. If you change this parameter, please regenerate the `.pickle` files from step 1.
   - `--rbf_parameters`: the key parameters for generating the Gaussian basis vectors. If you change this parameter, please regenerate the `.pickle` files from step 1.
   - `--max_num_neighbors`: the maximum number of neighbors per atom. If you change this parameter, please regenerate the `.pickle` files from step 1.
   - `--depth`: the number of stacked CMPNN blocks. Unlike molecular graphs, this strongly affects the final prediction for crystal graphs because of over-smoothing.
3. Although we load 6 properties for each crystal graph, only 1 of them is used during training; see `train.py`, lines 72-88, for how to adapt the code to use only 1 property.
4. You may notice that training time drops after the first epoch. This is because we cache part of the batched crystals in memory; see `chemprop/features/featurization.py`, lines 206-224, and adjust `len(CRYSTAL_TO_GRAPH) <= 10000`. The more crystals you cache, the shorter the training time and the more memory you use; a minimal sketch of this caching pattern appears after this list.
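Tip 4 describes a simple bounded in-memory cache. A minimal sketch of that pattern, with a hypothetical `build_graph` helper (the real code lives in `chemprop/features/featurization.py`, lines 206-224):

```python
# Minimal sketch of the bounded cache described in tip 4 (illustrative only).
CRYSTAL_TO_GRAPH = {}  # maps crystal name -> featurized graph

def get_crystal_graph(name, build_graph):
    """Return a cached graph, featurizing with build_graph() on a miss."""
    if name in CRYSTAL_TO_GRAPH:
        return CRYSTAL_TO_GRAPH[name]
    graph = build_graph(name)  # build_graph is a hypothetical featurizer
    if len(CRYSTAL_TO_GRAPH) <= 10000:  # the limit you can raise or lower
        CRYSTAL_TO_GRAPH[name] = graph
    return graph
```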
For the band_gap property, run:

```bash
nohup python -u predict.py --gpu 0 --seed 0 --data_path ./data/material_project --test_path ./data/material_project --dataset_name band_gap --checkpoint_dir ./ckpt/ensemble_band_gap/ --no_features_scaling > ./log/predict_band_gap.log 2>&1 &
```
where the code will write each fold's prediction to `{args.test_path}/seed_{args.seed}/predict_{args.dataset_name}_fold_{fold_num}_crystalnet.csv` and ensemble all fold predictions into the final prediction at `{args.test_path}/seed_{args.seed}/predict_{args.dataset_name}_crystalnet.csv`.
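Conceptually, the ensemble is formed from the per-fold CSVs. A hedged sketch of simple averaging (the `prediction` column name and the use of a plain mean are assumptions; `predict.py` is authoritative):

```python
# Hedged sketch: average per-fold predictions into one ensemble CSV.
# File pattern follows the paths above; the column name is an assumption.
import glob

import pandas as pd

fold_files = sorted(glob.glob(
    "./data/material_project/seed_0/predict_band_gap_fold_*_crystalnet.csv"))
folds = [pd.read_csv(path) for path in fold_files]

ensemble = folds[0].copy()  # assumes identical row order across folds
ensemble["prediction"] = sum(df["prediction"] for df in folds) / len(folds)
ensemble.to_csv(
    "./data/material_project/seed_0/predict_band_gap_crystalnet.csv",
    index=False)
```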
We also provide a bash script to run prediction over all folds in parallel; see `predict.sh`.
- Clean up unused functions and add more comments.
- Add and clean up the transfer learning code.
- Integrate other comparison methods into this repository.
- Reduce the training time and memory usage, especially for large datasets (long term).
Visit our website for more related applications!
Please cite the following paper if you use this code in your work.
```bibtex
@article{chen2021leveraging,
  title={Leveraging Large-scale Computational Database and Deep Learning for Accurate Prediction of Material Properties},
  author={Chen, Pin and Chen, Jianwen and Yan, Hui and Mo, Qing and Xu, Zexin and Liu, Jinyu and Zhang, Wenqing and Yang, Yuedong and Lu, Yutong},
  journal={arXiv preprint arXiv:2112.14429},
  year={2021}
}
```