Our paper is at https://arxiv.org/pdf/2002.03230.pdf
First install the dependencies via conda:
- PyTorch >= 1.0.0
- networkx
- RDKit >= 2019.03
- numpy
- Python >= 3.6
And then run pip install .
. Additional dependency for property-guided finetuning:
- Chemprop >= 1.2.0
- For graph generation, each line of a training file is a SMILES string of a molecule
- For graph translation, each line of a training file is a pair of molecules (molA, molB) that are similar to each other but molB has better chemical properties. Please see
data/qed/train_pairs.txt
. The test file is a list of molecules to be optimized. Please seedata/qed/test.txt
.
The original vocab list does not work with the provided pre-trained model. It generates this issue: wengong-jin#47. I included the motifs that were causing the issue in the provided vocab list and replaced 27 less used motif pairs (seen after 800 million times.)
python generate.py --vocab data/chembl/recovered_vocab_2000.txt --model ckpt/chembl-pretrained/model.ckpt --nsamples 1000
We can train a molecular language model on a large corpus of unlabeled molecules. We have uploaded a model checkpoint pre-trained on ChEMBL dataset in ckpt/chembl-pretrained/model.ckpt
. If you wish to train your own language model, please follow the steps below:
- Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt
- Preprocess training data:
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/all.txt --ncpu 16 --mode single
mkdir train_processed
mv tensor* train_processed/
- Train graph generation model
mkdir ckpt/chembl-pretrained
python train_generator.py --train train_processed/ --vocab data/chembl/vocab.txt --save_dir ckpt/chembl-pretrained
- Sample molecules from a model checkpoint
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000 --output_dir results_in_here/
The following script loads a trained Chemprop model and finetunes a pre-trained molecule language model to generate molecules with specific chemical properties.
mkdir ckpt/finetune
python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune
Here ${ACTIVE_MOLECULES}
should contain a list of experimentally verified active molecules.
${YOUR_PROPERTY_PREDICTOR}
should be a directory containing saved chemprop model checkpoint.
--max_similarity 0.5
means any novel molecule should have nearest neighbor similarity lower than 0.5 to any known active molecules in ${ACTIVE_MOLECULES}` file.
--nsample 10000
means to sample 10000 molecules in each epoch.
--threshold 0.5
is the activity threshold. A molecule is considered as active if its predicted chemprop score is greater than 0.5.
In each epoch, generated active molecules are saved in ckpt/finetune/good_molecules.${epoch}
. All the novel active molecules are saved in ckpt/finetune/new_molecules.${epoch}
Molecule translation is often useful for lead optimization (i.e., modifying a given molecule to improve its properties)
- Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/qed/mols.txt > vocab.txt
Please replace data/qed/mols.txt
with your molecules.
- Preprocess training data:
python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16
mkdir train_processed
mv tensor* train_processed/
- Train the model:
mkdir ckpt/translation
python train_translator.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir ckpt/translation
- Make prediction on your lead compounds (you can use any model checkpoint, here we use model.5 for illustration)
python translate.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model ckpt/translation/model.5 --num_decode 20 > results.csv
The polymer generation code is in the polymer/
folder. The polymer generation code is similar to train_generator.py
, but the substructures are tailored for polymers.
For generating regular drug like molecules, we recommend to use train_generator.py
in the root directory.