This code has been merged into a more comprehensive collection of ML code for the COVID Moonshot, so this repo will be archived.
Genetic algorithm search for molecules with high similarity to known COVID-19 protease inhibitors. A recap of the protease can be found here
Visualise the protease, as well as the ligands which inhibit specific sites, here
3D coordinates of all the ligands (need to filter to get the relevant ones) can be found here in the .pdb files which are formatted like this
Background on the graph-based genetic algorithm (GB-GA) that I used can be found in this paper. I co-opted the GB-GA code from this GitHub repo
- dscribe (make sure you get the latest version, which is much quicker at calculating SOAP descriptors)
- RDKit
- pandas
Fragment data was preprocessed using transform.sh, which uses openbabel to convert the .mol files into .xyz files. These are then fed into concat_ligand.py, which concatenates the atom coordinates into one file.
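The concatenation step can be illustrated with a minimal sketch. This is not the contents of concat_ligand.py; the function names `read_xyz` and `concat_xyz` are hypothetical, and it assumes standard .xyz files (atom count on the first line, a comment on the second, then one `element x y z` line per atom).

```python
def read_xyz(text):
    """Parse the text of one .xyz file into a list of (element, x, y, z) tuples."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    # Line 0 is the atom count and line 1 is a free-form comment, so atoms start at line 2.
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    return atoms


def concat_xyz(texts, comment="concatenated ligands"):
    """Merge the contents of several .xyz files into a single .xyz string."""
    atoms = [atom for text in texts for atom in read_xyz(text)]
    body = [f"{el} {x:.4f} {y:.4f} {z:.4f}" for el, x, y, z in atoms]
    return "\n".join([str(len(atoms)), comment, *body]) + "\n"


# Small in-memory demonstration with two made-up fragments.
frag_a = "2\nfragment a\nC 0.0 0.0 0.0\nO 1.2 0.0 0.0\n"
frag_b = "1\nfragment b\nN 0.0 1.0 0.0\n"
print(concat_xyz([frag_a, frag_b]))
```

In practice the input strings would come from reading each converted .xyz file from disk; the directory layout is left open here because the source does not specify it.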
Candidates were taken from this Google Sheet; its 'SMILES' column is saved in data/covid_submissions.csv
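Reading the candidates back in could look like the sketch below. It uses only the standard library and assumes the CSV has a header row with a column named `SMILES`. The helper name `load_smiles` and the de-duplication step are illustrative choices, not something the repo is known to do.

```python
import csv
import io


def load_smiles(handle, column="SMILES"):
    """Return the non-empty, de-duplicated entries of one CSV column, in file order."""
    seen = set()
    smiles = []
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        if value and value not in seen:
            seen.add(value)
            smiles.append(value)
    return smiles


# In-memory demonstration with made-up rows (one duplicate, one blank line).
demo = io.StringIO("SMILES\nCCO\n\nc1ccccc1\nCCO\n")
print(load_smiles(demo))

# With the file named above, usage would be along the lines of:
# with open("data/covid_submissions.csv", newline="") as f:
#     candidates = load_smiles(f)
```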
Running python GA-soap.py will start the genetic algorithm. It uses crossover.py and mutate.py from Jensen's GitHub. The docstrings and comments in GA-soap.py should be enough to help you understand what's going on.
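For readers new to genetic algorithms, the overall loop (score, select, cross over, mutate, repeat) can be sketched with a toy problem. This is not the code in GA-soap.py: here the "molecules" are plain strings, the fitness is a character match against a made-up target, and the crossover and mutation operators are simple string operations. The actual run operates on molecular graphs and scores candidates by similarity to the target ligands. All names and parameters below are hypothetical.

```python
import random

TARGET = "CCOCCNCC"   # toy stand-in for a target; not real chemistry
ALPHABET = "CNO"


def score(candidate):
    """Toy fitness: fraction of positions that match TARGET."""
    return sum(a == b for a, b in zip(candidate, TARGET)) / len(TARGET)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two equal-length strings."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(child, rate, rng):
    """Replace each character by a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c for c in child)


def run_ga(pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Run the toy GA and return (best candidate, best score per generation)."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[: pop_size // 2]   # truncation selection: keep the top half
        children = [population[0]]              # elitism: the best candidate always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), mutation_rate, rng))
        population = children
    return max(population, key=score), history


best, history = run_ga()
print(best, score(best))
```

Because the best candidate is carried over unchanged each generation, the best score can never decrease from one generation to the next, which makes progress easy to monitor.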
- parallelize conformer generation / similarity calculation with MPI? It takes ~1 minute per generation right now, which isn't terrible but could be better
- play with GA and SOAP parameters to find optimal candidate(s); set -tgt_size to the average size of the molecules in the submission?
- I think we should only use submission candidates that include the fragments from sites 2 and 11 (which are the only ones I selected); not done yet
- possibly include fragments in the initial population also
- improved way of writing best candidates from each generation to a file for visualisation
- some form of synthesizability scoring?
- may need a better objective function, as the target ligand field has a LOT of atoms (892)
- more conformers?
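One possible approach to the item on writing the best candidates from each generation to a file is to append the top few (score, SMILES) pairs to a single CSV, which can then be loaded into any plotting or molecule-viewing tool. This is a sketch only: the function name `append_generation`, the column layout, and the `top_k` parameter are hypothetical and do not come from the repo.

```python
import csv
import os


def append_generation(path, generation, scored, top_k=5):
    """Append the top_k (score, smiles) pairs of one generation to a CSV file."""
    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    # Write the header only when the file is new or empty.
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if needs_header:
            writer.writerow(["generation", "rank", "score", "smiles"])
        for rank, (score, smiles) in enumerate(best, start=1):
            writer.writerow([generation, rank, f"{score:.4f}", smiles])
```

Calling this once per generation inside the GA loop, for example `append_generation("best_candidates.csv", gen, list(zip(scores, smiles_list)))`, keeps a running record without holding every generation in memory. The file name and variable names in that call are placeholders.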