This is the implementation of our Modof model: https://arxiv.org/abs/2012.04231. This paper has already been accepted by the journal "Nature Machine Intelligence".
Note: This repository has been moved to https://github.com/ninglab/Modof. Please check this link for the most recent updates we have.
copyright:
Copyright (C) 2021 The Ohio State University
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Operating systems: Red Hat Enterprise Linux (RHEL) 7.7
- python==3.6.12
- scikit-learn==0.22.1
- networkx==2.4
- pytorch==1.5.1
- rdkit==2020.03.5
- scipy==1.4.1
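To compare an existing environment against the pinned versions, a small check like the following may help. It is only a sketch: the import names ("sklearn" for scikit-learn, "torch" for pytorch) are the usual ones, the helper names are hypothetical, and the comparison is an exact string match, so a build suffix in an installed version string would be reported as a mismatch.

```python
import importlib

# Pinned versions from the list above, keyed by importable module name.
REQUIRED = {
    "sklearn": "0.22.1",
    "networkx": "2.4",
    "torch": "1.5.1",
    "rdkit": "2020.03.5",
    "scipy": "1.4.1",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def find_mismatches(required, installed):
    """Return {name: (wanted, found)} for every version that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in required.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    installed = {name: installed_version(name) for name in REQUIRED}
    for name, (wanted, found) in find_mismatches(REQUIRED, installed).items():
        print(f"{name}: expected {wanted}, found {found}")
```

A package that is not installed shows up with a found value of None.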
Download the code and dataset with the command:
git clone https://github.com/ziqi92/Modof.git
The download can take several minutes.
If you want to use our provided processed dataset, please check the directories below:
- ./data/logp06/: the dataset of pairs of molecules with 0.6 similarity that differ in the penalized logP property.
- ./data/drd2_25: the dataset of pairs of molecules with 0.6 similarity that differ in the DRD2 property. The property difference between each pair of molecules is greater than 0.25.
- ./data/qed_1: the dataset of pairs of molecules with 0.6 similarity that differ in the QED property. The property difference between each pair of molecules is greater than 0.1. The QED properties of these molecules are greater than 0.7.
- ./data/drd2_25_qed6: the dataset of pairs of molecules with 0.6 similarity that differ in both the QED property and the DRD2 property. The DRD2 difference between each pair of molecules is at least 0.25 (i.e., $\text{DRD2}(Y) - \text{DRD2}(X) \geq 0.25$). The QED property of each pair of molecules should satisfy
In each directory, you will see the following files:
- Multiple zipped tensors-*.pkl files. These binary files contain the processed data, including pairs of molecules and their edit paths. The data in these *.pkl files should be used for model training: all the tensors-*.pkl files will be read into Modof as training data. If you are using your own training data rather than the provided data, you can generate such tensors-*.pkl files using the data processing tools described below.
Note: Due to the file size limit, we provide only part of the processed files here. To use the whole training dataset, please preprocess the dataset with the provided data preprocessing script. Please decompress the zipped files before using them to train the model.
- train_pairs.txt file in the logp06 dataset. This file contains all pairs of molecules used in Jin’s paper. It is identical to the train_pairs.txt file in https://github.com/wengong-jin/iclr19-graph2graph/tree/master/data/logp06. Please note that the molecule pairs contained in the tensors-*.pkl files are a subset of all the molecule pairs in train_pairs.txt.
  - File format: each line in train_pairs.txt has two SMILES strings, separated by a space. The first SMILES string represents the molecule with worse properties, and the second SMILES string represents the molecule with better properties.
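As an illustration of the pair file format, the following minimal sketch reads such a file into a list of (worse, better) tuples. The function names are hypothetical and are not part of the Modof code. The sketch assumes each non-blank line holds exactly two whitespace-separated SMILES strings.

```python
def parse_pair_line(line):
    """Split one line into (worse_smiles, better_smiles)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(
            f"expected two SMILES strings, got {len(fields)}: {line!r}"
        )
    return fields[0], fields[1]


def read_pairs(path):
    """Read all molecule pairs from a file, skipping blank lines."""
    with open(path) as handle:
        return [parse_pair_line(line) for line in handle if line.strip()]
```

For example, `read_pairs("./data/logp06/train_pairs.txt")` would return one tuple per line of the provided file.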
- one_ds_pairs.txt file. This file contains the pairs of molecules used in Modof.
  - File format: each line in one_ds_pairs.txt has two SMILES strings, separated by a space.
- test.txt. This file contains the SMILES strings of single molecules that are used as the testing molecules in XXX’s paper. These molecules are also the testing molecules used in Modof.
  - File format: each line in test.txt is the SMILES string of a testing molecule.
- vocab.txt. This file contains all the substructures of the training molecules in the tensors-*.pkl files. These substructures are represented as SMILES strings.
  - File format: each line in vocab.txt is the SMILES string of a substructure. The i-th row represents the i-th substructure (i.e., ‘i’ here is the substructure ID).
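The row-to-ID convention of vocab.txt can be sketched as follows. The function names are hypothetical. Whether Modof counts IDs from 0 or from 1 is not stated here, so the 0-based indexing below is an assumption, as is the absence of blank lines in the middle of the file.

```python
def build_vocab(lines):
    """Map each substructure SMILES to its row index (used as the ID).

    Assumes 0-based IDs and no blank lines between entries.
    """
    substructures = [line.strip() for line in lines if line.strip()]
    smiles_to_id = {
        smiles: index for index, smiles in enumerate(substructures)
    }
    return substructures, smiles_to_id


def load_vocab(path):
    """Read a vocab file with one substructure SMILES per line."""
    with open(path) as handle:
        return build_vocab(handle)
```

The list gives ID-to-substructure lookup, and the dictionary gives the reverse direction.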
If you want to train Modof using your own training data, you will need to process your data into the same format as the provided processed data. All the code for data processing is provided under data_preprocessing.
To process your own data, run
python ./data_preprocessing/preprocess.py --train train_file_name --output out_file_name --batch_size NNN --batch_id ZZZ
where train_file_name is the name of the file that contains all the molecule pairs to be used for Modof training. This file should have the same format as train_pairs.txt described above.
The above command generates n = (number of pairs) / NNN out_file_name-ZZZ.pkl files in the same directory as train_file_name. These files will be used in Modof training. For the other options of this command, please check --help.
The batch_size and batch_id options are recommended for large training datasets. If your training dataset is large, you can process batches of training data in parallel by running the above command multiple times with different batch_id values. These two arguments are designed solely to speed up data preprocessing for large datasets. If your training dataset is small, you can leave batch_size and batch_id unspecified, and the entire training data will be processed in one pass.
Note that the training pairs of molecules for the Modof model are required to differ at only one disconnection site. Training pairs that differ at multiple disconnection sites will be filtered out by the above command. To get enough training pairs for the Modof model, the molecules in your own training data are expected to be very similar.
Example:
python ./data_preprocessing/preprocess.py --train ./data/logp06/train_pairs.txt --output new_tensors --batch_size 10000 --batch_id 0
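When a large dataset is split into batches, one command per batch_id is needed. The sketch below builds those commands as argument lists that could be passed to a job scheduler or to `subprocess.run`. The function name and the pair count used in the example are hypothetical. The sketch also assumes that batch_id values run from 0 to ceil(number of pairs / batch_size) - 1.

```python
import math


def preprocess_commands(train_file, output_name, num_pairs, batch_size):
    """Build one preprocess.py command (as an argument list) per batch_id."""
    # Assumption: batch IDs start at 0 and the last batch may be partial.
    num_batches = math.ceil(num_pairs / batch_size)
    return [
        [
            "python", "./data_preprocessing/preprocess.py",
            "--train", train_file,
            "--output", output_name,
            "--batch_size", str(batch_size),
            "--batch_id", str(batch_id),
        ]
        for batch_id in range(num_batches)
    ]


if __name__ == "__main__":
    # Hypothetical dataset of 25,000 pairs split into batches of 10,000.
    commands = preprocess_commands(
        "./data/logp06/train_pairs.txt", "new_tensors", 25000, 10000
    )
    for command in commands:
        print(" ".join(command))
```

With these example numbers the sketch prints three commands, with batch_id 0, 1 and 2, which can then be run independently.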
To train our Modof model, run
python ./model/train.py --depthT 3 --depthG 5 --hidden_size 64 --latent_size 8 --add_ds --beta 0.1 --warmup 2000 --beta_anneal_iter 500 --step_beta 0.05 --max_beta 0.5 --save_iter 3000 --print_iter 20 --train train_path --vocab vocab_path --save_dir model_path
depthT specifies the depth of tree message passing neural networks.
depthG specifies the depth of graph message passing neural networks.
hidden_size specifies the dimension of all hidden layers.
latent_size specifies the dimension of latent embeddings.
add_ds specifies whether to add the embedding of the disconnection site to the latent embedding. This parameter is a boolean flag and defaults to False when "--add_ds" is not present.
beta specifies the initial weight of the KL loss in the total loss.
warmup specifies the number of steps at the beginning of training during which beta remains unchanged. (Each step represents one update of the model on a single batch.)
beta_anneal_iter specifies the interval, in steps, at which beta is increased once the warmup steps have passed.
step_beta specifies the amount by which beta is increased at each such interval.
max_beta specifies the maximum value of beta.
save_iter controls how often the model is saved. In the above example, the model is saved every 3,000 steps, at model_path/model.iter-*.pt.
print_iter controls how often intermediate results are displayed (e.g., the accuracy of each prediction, the loss of each function).
train specifies the directory of the training data. The program will extract the training pairs from all "pkl" files under this directory. The train path defaults to the path of our provided dataset if not specified.
vocab specifies the path of the vocabulary of all substructures in the training data. You can generate the vocab file for your own training data with the provided code, as described below. The vocab path defaults to the path of our provided vocab file if not specified.
save_dir specifies the path in which to save the trained model. The model path defaults to "./result" if not specified.
Use the command python ./model/train.py -h to check the meaning of the other parameters.
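The interplay of beta, warmup, beta_anneal_iter, step_beta and max_beta can be illustrated with a small simulation. This is a sketch of a typical KL-weight annealing schedule, not a copy of the code in train.py. It assumes that beta is increased by step_beta at every step that is a multiple of beta_anneal_iter once the step count has reached warmup, and that it is capped at max_beta.

```python
def beta_schedule(total_steps, beta=0.1, warmup=2000, beta_anneal_iter=500,
                  step_beta=0.05, max_beta=0.5):
    """Return the assumed KL weight after each training step (1-based)."""
    betas = []
    for step in range(1, total_steps + 1):
        # Assumption: adjust only after warmup, every beta_anneal_iter steps.
        if step >= warmup and step % beta_anneal_iter == 0:
            beta = min(max_beta, beta + step_beta)
        betas.append(beta)
    return betas


if __name__ == "__main__":
    schedule = beta_schedule(6000)
    for step in (1, 2000, 2500, 6000):
        print(step, round(schedule[step - 1], 2))
```

Under these assumptions and the values from the training command above, beta stays at 0.1 for the first 1,999 steps, rises in increments of 0.05, and has reached the cap of 0.5 by step 6,000.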
Generating the vocabulary file: In the above command, training the Modof model requires a vocabulary file that contains all the substructures of the molecules in all the training files under train_path. This file should have the same format as vocab.txt described above.
To generate the vocab file for your own training data, run
python ./model/mol_tree.py --train train_path --out vocab_path
Running time: Training a Modof model for 6,000 steps on our provided training data takes no more than 4 hours on a GPU. Typically, our model can produce decent results with 6,000 steps of training.
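After training, the saved checkpoints under model_path follow the pattern model.iter-*.pt. The sketch below picks the most recent one. It assumes that the `*` part of the file name is the integer training step (for example, model.iter-6000.pt), which is not confirmed here; the function names are hypothetical.

```python
import os
import re

CHECKPOINT_PATTERN = re.compile(r"^model\.iter-(\d+)\.pt$")


def pick_latest(file_names):
    """Return (step, file_name) with the largest step, or None if none match."""
    candidates = []
    for name in file_names:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    return max(candidates, key=lambda item: item[0], default=None)


def latest_checkpoint(model_dir):
    """Return the path of the newest checkpoint in model_dir, or None."""
    latest = pick_latest(os.listdir(model_dir))
    if latest is None:
        return None
    return os.path.join(model_dir, latest[1])
```

Comparing the parsed integers rather than the raw file names avoids ordering model.iter-9000.pt after model.iter-12000.pt.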
To test a trained model, you can run the file ./model/optimize.py with the following command:
python ./model/optimize.py --test test_path --vocab vocab_path --model model_path --save_dir test_result_path --hidden_size 64 --latent_size 8 --depthT 3 --depthG 5 --iternum 5 --num N -s 0.6
Note that the options "hidden_size", "latent_size", "depthT" and "depthG" must be the same as those used to train the model.
-s specifies the similarity threshold of the generated molecules.
save_dir specifies the path of the results. All result files will be saved into the test_result_path directory.
test_path specifies the path of the test data. The test path defaults to the provided test data (i.e., ./data/logp06/test.txt) if not specified.
vocab_path specifies the path of the vocab file. The vocab path defaults to the provided vocab file (i.e., ./data/logp06/vocab.txt) if not specified.
num specifies the number of latent embedding samples for each molecule at each iteration.
iternum specifies the number of iterations for optimization.
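Because the architecture options must match between training and testing, one way to avoid a mismatch is to keep them in a single place and build the optimize.py command from them. The following is only a sketch: the helper names are hypothetical, and the paths and the value of num in the example are placeholders.

```python
# Architecture options that must agree between train.py and optimize.py.
SHARED_OPTIONS = {"hidden_size": 64, "latent_size": 8, "depthT": 3, "depthG": 5}


def shared_flags(options):
    """Turn {"hidden_size": 64, ...} into ["--hidden_size", "64", ...]."""
    flags = []
    for name, value in options.items():
        flags += [f"--{name}", str(value)]
    return flags


def optimize_command(test_path, vocab_path, model_path, save_dir,
                     num, iternum=5, similarity=0.6):
    """Build the optimize.py argument list using the shared options."""
    return (
        ["python", "./model/optimize.py",
         "--test", test_path, "--vocab", vocab_path,
         "--model", model_path, "--save_dir", save_dir]
        + shared_flags(SHARED_OPTIONS)
        + ["--iternum", str(iternum), "--num", str(num), "-s", str(similarity)]
    )


if __name__ == "__main__":
    # Placeholder paths and a placeholder sample count of 20.
    command = optimize_command(
        "./data/logp06/test.txt", "./data/logp06/vocab.txt",
        "model_path", "test_result_path", num=20,
    )
    print(" ".join(command))
```

The same `shared_flags(SHARED_OPTIONS)` list could be reused when assembling the train.py command, so both commands always carry identical values.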
The outputs of the above command include:
- *-iter[0-N].txt files include the optimization results of each input molecule with all the latent embedding samples.
- *-iter[N]_results.txt files include the optimization results of each input molecule at all [num] iterations.
- *-smiles.txt files include the best optimized molecules among all iterations, the property scores of these optimized molecules, and the similarities of these optimized molecules with the input molecules.
You can also run the file optimize1.py with a similar command to enable Modof to optimize multiple best molecules at each iteration, using the option --iter_size as below.
python ./model/optimize1.py --test test_path --vocab vocab_path --model model_path --save_dir test_result_path --hidden_size 64 --latent_size 8 --depthT 3 --depthG 5 --iternum 5 --num N -s 0.6 --iter_size 5