Skip to content

Latest commit

 

History

History
105 lines (78 loc) · 7.42 KB

README.md

File metadata and controls

105 lines (78 loc) · 7.42 KB

Wasserstein Weisfeiler-Lehman graph kernel with Gaussian Process Regression (WWL-GPR)

igraph ray ray

Table of Contents

Overview

This software package provides the implementation of the WWL-GPR model, a data-efficient, physics-inspired machine learning (ML) model for the prediction of binding motifs and associated adsorption enthalpies of complex adsorbates at transition metals (TMs) and their alloys based on a customized Wasserstein Weisfeiler-Lehman graph kernel and Gaussian Process Regression. The task that is solved is to directly predict the relaxed adsorption enthalpies corresponding to a range of plausible initial guesses of the adsorption motif based on graph representation. Thereby, for a given surface/adsorbate combination of interest, both the most stable and all meta-stable adsorption motifs as well as their associated adsorption enthalpies can be predicted. Apart from a graph representation of the intial geometry, the model uses input features in the form of node attributes, which represent physically motivated properties, e.g. d-band moments (surfaces), HOMO/LUMO energy levels (adsorbate molecules) and features of the local geometry, all derived from either the clean surfaces or the adsorbates in the gas phase. Optimization of the hyperparameters in the model is done with Bayesian optimization implemented with scikit-optimize.

A case study predicting adsorption enthalpies of complex adsorbates involved in ethanol synthesis is provided.

Please refer to our manuscript for further details (link to be inserted upon publication).

System Requirements

The WWL-GPR package requires only a standard computer with enough RAM to support the training and application of the ML model through the required conda environment (see below). To the benefit of computational scientists who would like to accelerate the ML process or interact with other computationally intensive codes on High-Performance Computing (HPC) facilities, we also provide the possibility for interfacing with a standard SLURM Workload Manager.

All software dependencies (including version numbers the software has been tested on) are specified in self-contained env.yml

Installation

The easiest way to install the prerequisites is via conda. All the dependencies are given in env.yml. The expected installation time is around 2 minutes.

Firstly, download or clone this repository via:

git clone https://github.com/Wenbintum/WWL-GPR.git

Ensure you have installed conda, step into the WWL-GPR directory, and then run the following command to create a new environment named wwl-gpr.

conda env create -f env.yml

Activate the conda environment with:

conda activate wwl-gpr

Install the package with:

pip install -e .

Usage

This package allows parallel computing at a local computer or a supercomputing facility. We implement this functionality via Ray, a simple and universal API for building distributed applications. Ray, originally developed by the computer science community, has many benefits, not least providing simple primitives for building and running distributed applications. Readers are referred to the webpage of Ray for more details.

Python-interface SLURM scripts:

We provide a helper utility to auto-generate SLURM scripts and launch. There are some options you can use to submit your job in the SLURM system by running:

  python launch.py -h

Available machine learning tasks:

We provide four machine learning tasks as showcases in compliance with our manuscript, that are:

  • a) 5-fold Cross-validation applied to in-domain prediction for the complex adsorbates database (termed as "CV5")
  • b) 5-fold Cross-validation applied to in-domain prediction for the simple adsorbates database (termed as "CV5_simpleads")
  • c) Extrapolation to out-of-domain samples, an alloy (CuCo) and a new metal (Pt), when only training on the complex adsorbates database containing the elemental metals Cu, Co, Rh and Pd as well as additionally the atomic species (H, O, and C) calculated at Pt (termed as "Extrapolation_t1")
  • d) Extrapolation to out-of-domain samples, an alloy (PdRh) and a new metal (Ru), when only training on the complex adsorbates database containing the elemental metals Cu, Co, Rh and Pd as well as additionally the atomic species (H, O, and C) calculated at Ru (termed as "Extrapolation_t2")

All tasks can be viewed by running:

  python main.py -h

Run task on HPC facility:

By coupling python-interface SLURM scripts and self-contained ML tasks, now you can run these four tasks on High Performance Computing (HPC) facility. For instance, running task a) by given computational resources of 40 CPUs and 3 hours with the title "test".

  python launch.py --num-cpus 40 -t 03:00:00  --exp-name test --command "python -u main.py --task CV5 --uuid \$redis_password"

   Run extrapolation task d) via:

  python launch.py --num-cpus 40 -t 03:00:00  --exp-name test --command "python -u main.py --task Extrapolation_t2 --uuid \$redis_password"

Run task on local computer:

We also provide an example for running 5-fold cross-validation within the complex adsorbates database on a local desktop or laptop with fixed hyperparameters (FHP). In this case, the ML learning task will be run on 8 CPUs as given in input.yml.

  • Run task on local desktop or laptop:
  python main.py --task CV5_FHP

We use Bayesian optimization to optimize hyperparameters. You may want to change the settings via this function

Expected outpout and run time:

The output file consists of ground truth and ML predicted values, which is located in the "Results" directory for further analysis, and resulting Root Mean Square Error (RMSE) is printed. The run time of CV5_FHP on local computer with 8 CPUs is around 7 minutes, for which the expected RMSE is around 0.18 eV.

Authors

This software was primarily written by Wenbin Xu who was advised by Prof. Mie Andersen.

Acknowledgements

The simple adsorbates database is taken from our previous papers Deimel et al., ACS Catal. 2020, 10, 22, 13729–13736 and Andersen et al., ACS Catal. 2019, 9, 4, 2752–2759. The complex adsorbates database is constructed via AIIDA and CatKit The WWL-GPR implementation is based on the Wasserstein Weisfeiler-Lehman Graph Kernels

License

WWL-GPR is released under the MIT License.