This repository has been archived by the owner on Jul 13, 2020. It is now read-only.
Genetic algorithm search for molecules with high similarities to known COVID-19 protease inhibitors

wjm41/CovidGenetic

This code has been merged into a more comprehensive collection of ML code for the covid moonshot. This repo will therefore be archived.

CovidGenetic

Genetic algorithm search for molecules with high similarity to known COVID-19 protease inhibitors. A recap on the protease can be found here

Visualise the protease, as well as the ligands that inhibit specific sites, here

3D coordinates of all the ligands can be found here in the .pdb files, which are formatted like this. The ligands need to be filtered to get the relevant ones.

Background on the graph-based genetic algorithm (GB-GA) that I used can be found in this paper. I adapted the GB-GA code from this GitHub repository

Dependencies (for running the GA)

  • dscribe: make sure you get the latest version, which is much quicker at calculating SOAP descriptors
  • RDKit
  • pandas

Data

Fragment data was preprocessed using transform.sh, which uses openbabel to convert the .mol files into .xyz files. These are then fed into concat_ligand.py, which concatenates the atom coordinates into one file.
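The concatenation step can be sketched as follows. This is a minimal illustration, not the contents of concat_ligand.py: the function names `read_xyz_atoms` and `concat_xyz` are hypothetical, and it assumes each input is a single-frame .xyz file (atom count on the first line, a comment on the second, then one `element x y z` line per atom).

```python
from pathlib import Path
from typing import Iterable, List


def read_xyz_atoms(path: Path) -> List[str]:
    """Return the atom lines ('El x y z') of a single-frame .xyz file."""
    lines = Path(path).read_text().splitlines()
    # Line 0 holds the atom count, line 1 is a free-form comment.
    n_atoms = int(lines[0].split()[0])
    atoms = [ln.strip() for ln in lines[2:2 + n_atoms]]
    if len(atoms) != n_atoms:
        raise ValueError(f"{path}: expected {n_atoms} atoms, found {len(atoms)}")
    return atoms


def concat_xyz(paths: Iterable[Path], out_path: Path) -> int:
    """Merge the atoms of several .xyz files into one .xyz file.

    Returns the total number of atoms written.
    """
    atoms: List[str] = []
    for path in paths:
        atoms.extend(read_xyz_atoms(path))
    # Write a valid .xyz header: total atom count, then a comment line.
    body = [str(len(atoms)), "concatenated ligand field"] + atoms
    Path(out_path).write_text("\n".join(body) + "\n")
    return len(atoms)
```

Calling `concat_xyz(sorted(Path("some_dir").glob("*.xyz")), "all_ligands.xyz")` would merge every fragment in a directory; both paths here are placeholders.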

Candidates were taken from this Google Sheet; its 'SMILES' column is saved in data/covid_submissions.csv
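For reference, here is one way the saved candidates could be read back in. It is a sketch under the assumption that data/covid_submissions.csv has a header row containing a `SMILES` column; the helper name `load_smiles` is hypothetical. It uses the standard-library `csv` module to stay dependency-free, although pandas would work just as well.

```python
import csv
from typing import List


def load_smiles(csv_path: str, column: str = "SMILES") -> List[str]:
    """Read unique, non-empty SMILES strings from a CSV file, keeping file order."""
    seen = set()
    smiles: List[str] = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        for row in reader:
            # Short rows yield None for missing fields, so fall back to "".
            value = (row[column] or "").strip()
            if value and value not in seen:
                seen.add(value)
                smiles.append(value)
    return smiles
```

Dropping duplicates and blanks up front is a design choice for this sketch, so that the same submission is not scored twice.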

How to use

Running python GA-soap.py starts the genetic algorithm. It uses crossover.py and mutate.py from Jensen's GitHub. The docstrings and comments in GA-soap.py should be enough to help you understand what's going on.
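For readers new to genetic algorithms, the overall loop has roughly the following shape. This is a generic skeleton, not the code in GA-soap.py: `run_ga` and its parameters are hypothetical, the selection scheme (score-weighted parent sampling, then keeping the top scorers) is one common choice, and `score`, `crossover` and `mutate` stand in for the SOAP similarity and the molecule operators.

```python
import random
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_ga(
    population: List[T],
    score: Callable[[T], float],
    crossover: Callable[[T, T, random.Random], Optional[T]],
    mutate: Callable[[T, random.Random], Optional[T]],
    n_generations: int = 20,
    mutation_rate: float = 0.5,
    seed: Optional[int] = None,
) -> Tuple[List[Tuple[float, T]], List[float]]:
    """Run a simple elitist GA; return the final scored population and
    the best score seen after each generation (higher is better)."""
    rng = random.Random(seed)
    pop_size = len(population)
    scored = sorted(((score(ind), ind) for ind in population),
                    key=lambda pair: pair[0], reverse=True)
    best_per_generation = [scored[0][0]]

    for _generation in range(n_generations):
        # Score-weighted parent selection; shift so all weights are positive.
        scores = [s for s, _ in scored]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        parents = [ind for _, ind in scored]

        children: List[T] = []
        for _ in range(pop_size):
            a, b = rng.choices(parents, weights=weights, k=2)
            child = crossover(a, b, rng)
            if child is not None and rng.random() < mutation_rate:
                child = mutate(child, rng)
            # Operators may fail (return None); such children are dropped.
            if child is not None:
                children.append(child)

        # Elitism: merge parents and children, keep the top pop_size.
        merged = scored + [(score(c), c) for c in children]
        merged.sort(key=lambda pair: pair[0], reverse=True)
        scored = merged[:pop_size]
        best_per_generation.append(scored[0][0])

    return scored, best_per_generation
```

Because the previous generation competes with its children for survival, the best score can never decrease from one generation to the next.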

To-try

  • parallelize conformer generation / similarity calculation with MPI? It takes ~1 minute per generation right now, which isn't terrible but could be better
  • play with GA and SOAP parameters to find optimal candidate(s); set -tgt_size to the average size of the molecules in the submission?
  • I think we should only use submission candidates that include the fragments from sites 2 and 11 (which are the only ones I selected); not done yet
  • possibly include fragments in the initial population as well
  • improve the way the best candidates from each generation are written to a file for visualisation
  • some form of synthesizability scoring?
  • may need a better objective function, as the target ligand field has a LOT of atoms (892)
  • more conformers?
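
As a starting point for the parallelization item, per-molecule scoring can be spread across local processes. The sketch below swaps MPI for Python's standard-library process pool, which is simpler on a single machine; `score_population` is a hypothetical helper, and it assumes each molecule's conformer generation and similarity calculation is independent of the others.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def score_population(
    items: Sequence[T],
    score_fn: Callable[[T], float],
    max_workers: Optional[int] = None,
    chunksize: int = 1,
    executor_cls=ProcessPoolExecutor,
) -> List[float]:
    """Apply score_fn to every item in parallel, preserving input order.

    With the default process pool, score_fn must be picklable, i.e. a
    function defined at module level rather than a lambda or closure.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(score_fn, items, chunksize=chunksize))
```

On platforms that spawn rather than fork worker processes, the call should sit under an `if __name__ == "__main__":` guard. A larger `chunksize` reduces inter-process overhead when each individual score is cheap.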
