Merge branch 'dev' into 'master'

Version 3.4.5

See merge request cdd/DrugEx!114
CDDLeiden · Sep 20, 2023 · eddf3e2
2 parents b61d31a + 63a3f89

Showing 60 changed files with 35,583 additions and 21,555 deletions.
diff --git a/.gitignore b/.gitignore
@@ -4,19 +4,27 @@
 !.gitignore
 /build/
 /base_test/
+/drugex/_version.py
 /tutorial/data
+/tutorial/advanced/data
 /tutorial/CLI/examples
 /tutorial/download.json
 /testing/clitest/data/*.txt
 /testing/clitest/data/*.vocab
 /testing/clitest/data/backup*
 /testing/clitest/data/dataset.json
 /testing/clitest/generators/
+/testing/clitest/new_molecules/
 /docs/_build
+/qspr/
+/tmp/
+tmp*
 *.pkg
 *.tgz
 *.gzip
 *.tar.gz
+**/*.cv.tsv
+**/*.ind.tsv
 
 ### Python template
 # Byte-compiled / optimized / DLL files

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,18 +1,30 @@
 # Change Log
-From v3.4.3 to v3.4.4
+From v3.4.4 to v3.4.5
 
 ## Fixes
 
-- Fixed a bug that may have caused the standardizer to return molecules failing in standardization in their original form instead of removing them (14fd58dc758cb882c2a24e4a481a9064318927f1).
+- Fixed a bug in calculation of the Pareto fronts (fronts are now calculated for maximization of objectives instead of objective minimization).
+- Patched a bug that caused a crash when an invalid SMILES was encountered in the fragment generation step. This
+  bug was introduced in v3.4.4; invalid SMILES are now skipped and a warning is printed to the log.
 
 ## Changes
 
-None.
+- Installation of pip package with pyproject.toml instead of setup.cfg.
+- Methods `cpu_non_dominated_sort` and `gpu_non_dominated_sort` have been replaced by `get_Pareto_fronts`.
+- Improved calculation of crowding distance.
+- The rewards module was refactored and the `RankingStrategy` class was replaced by the `ParetoRankingScheme` class.
+    - The final reward calculation for `ParetoRankingScheme`-based methods is now directly the scaled rank of the molecules.
+    - The `ParetoTanimotoDistance` now has an attribute `distance_metric`, which can be "min", "mean" or "mutual", instead of the attribute `ranking`.
+- DrugEx is now compatible with qsprpred v2.0.1; previous versions of qsprpred are no longer supported.
+- `drugex.generate` CLI environment arguments are no longer overwritten by environment variables from the generator.
 
 ## Removed Features
 
-None.
+None. 
 
 ## New Features
 
-None.
+- When installing the package with pip, the commit hash and date of the installation are saved into `qsprpred._version`.
+- Added an automated Docker runner for tests that can run on GPUs. See [testing/runner/README.md](testing/runner/README.md) for more information.
+- When installing the package with pip, the commit hash and date of the installation are saved into `drugex._version`. This information is also used as the basis of a new dynamic versioning scheme for the package. The version number is generated automatically upon installation of the package and saved to `drugex.__version__`.
+- QSPRPred is now available as an optional dependency that can be installed with DrugEx using the `[qsprpred]` option.
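
Note on the Pareto front fix listed in the changelog above: the difference between maximization and minimization comes down to the direction of the dominance test. The following is a minimal sketch in plain Python, not the `get_Pareto_fronts` implementation from DrugEx. The names `dominates` and `pareto_fronts_max` and the example scores are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` dominates `b` when every objective is maximized.

    `a` must be at least as good as `b` in all objectives and strictly
    better in at least one of them.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_fronts_max(scores: List[Sequence[float]]) -> List[List[int]]:
    """Split the indices of `scores` into successive non-dominated fronts.

    Hypothetical helper: a simple quadratic-time peeling procedure, kept
    short for readability rather than speed.
    """
    remaining = list(range(len(scores)))
    fronts = []
    while remaining:
        # A point belongs to the current front if no other remaining point dominates it.
        front = [
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        ]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Made-up scores for four molecules on two objectives (higher is better).
scores = [(0.9, 0.8), (0.5, 0.5), (0.8, 0.9), (0.1, 0.2)]
fronts = pareto_fronts_max(scores)
print(fronts)
```

Flipping the comparisons in `dominates` (`<=` and `<`) would give fronts for minimization instead, which is the behaviour the changelog describes as the bug.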
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,4 @@
+recursive-include drugex * = test_files/*.*
+recursive-include drugex test_data/*.*
+recursive-include drugex test_data/A2AR_RandomForestClassifier/*.*
+recursive-include drugex *.pkl.gz
diff --git a/README.md b/README.md
@@ -1,13 +1,35 @@
-DrugEx
-==================== 
+# DrugEx
+
 <img src='figures/logo.png' width=20% align=right>
 <p align=left width=70%>
-DrugEx is open-source software library for <i>de novo</i> design of small molecules with deep learning generative models in a multi-objective reinforcement learning framework. This toolkit is a continuation of the original and incremental work of Liu et al.'s DrugEx [<a href="liu_drugex1">1</a>, <a href="liu_drugex2">2</a>, <a href="liu_drugex3">3</a>] and is currently developed by Gerard van Westen's Computational Drug Discovery group. 
+DrugEx is an open-source software library for <i>de novo</i> design of small molecules with deep learning generative models in a multi-objective reinforcement learning framework. The package contains multiple generator architectures and a variety of scoring tools and multi-objective optimisation methods. It has a flexible application programming interface and can readily be used via the command line interface [<a href="sicho_drugex">4</a>] (see [Quick Start](#quick-start) to get to work right away).
 
-The package contains multiple generator architectures and a variety of scoring tools and multi-objective optimisation methods. It has a flexible application programming interface and can readily be used via the command line interface [<a href="sicho_drugex">4</a>].
+## History
 
-Quick Start
-===========
+This software is a continuation of the original and incremental work of Liu et al.'s DrugEx [<a href="liu_drugex1">1</a>, <a href="liu_drugex2">2</a>, <a href="liu_drugex3">3</a>] and is currently developed by [Gerard van Westen's Computational Drug Discovery](https://twitter.com/cddleiden) group in Leiden, Netherlands. The first version of DrugEx [<a href="liu_drugex1">1</a>] consisted of a recurrent neural network (RNN) single-task agent of gated recurrent units (GRU) which were updated to long short-term memory (LSTM) units in the second version [<a href="liu_drugex2">2</a>], also introducing MOO-based RL and an updated exploitation-exploration strategy. In its third version, [<a href="liu_drugex3">3</a>] generators based on a variant of the transformer and a novel graph-based encoding allowing for the sampling of molecules with specific substructures were introduced. This package builds on these works and provides a unified API with increased usability that is flexible enough for customization. New features are being added as well [<a href="sicho_drugex">4</a>]. Furthermore, the development and training of QSAR models, used to score molecules during reinforcement learning, has been moved to a separate [QSPRpred](https://github.com/CDDLeiden/QSPRPred)-package, which became a useful library in its own right.
+
+
+## Workflow
+
+The DrugEx package provides classes to standardize, clean and encode molecules for the various deep learning algorithms provided in the package as well as features to set up and monitor training and optimization. The resulting models can be used readily for generation of focused libraries and are easily transferable.
+
+![Fig1](figures/TOC_figure.png)
+
+<!-- Introduction
+=============
+Due to the large drug-like chemical space available to search for feasible drug-like molecules, rational drug design often starts from specific scaffolds to which side chains/substituents are added or modified. With the rapid growth of the application of deep learning in drug discovery, a variety of effective approaches have been developed for de novo drug design. In previous work, we proposed a method named DrugEx, which can be applied in polypharmacology based on multi-objective deep reinforcement learning. However, the previous version is trained under fixed objectives similar to other known methods and does not allow users to input any prior information (i.e. a desired scaffold). In order to improve the general applicability, we updated DrugEx to design drug molecules based on scaffolds which consist of multiple fragments provided by users. In this work, the Transformer model was employed to generate molecular structures. The Transformer is a multi-head self-attention deep learning model containing an encoder to receive scaffolds as input and a decoder to generate molecules as output. In order to deal with the graph representation of molecules we proposed a novel positional encoding for each atom and bond based on an adjacency matrix to extend the architecture of the Transformer. Each molecule was generated by growing and connecting procedures for the fragments in the given scaffold that were unified into one model. Moreover, we trained this generator under a reinforcement learning framework to increase the number of desired ligands. As a proof of concept, our proposed method was applied to design ligands for the adenosine A2A receptor (A2AAR) and compared with SMILES-based methods. The results demonstrated the effectiveness of our method in that 100% of the generated molecules are valid and most of them had a high predicted affinity value towards A2AAR with given scaffolds.  -->
+
+<!-- <b>Keywords</b>: deep learning, reinforcement learning, policy gradient, drug design, Transformer, multi-objective optimization</p> -->
+
+<!-- Deep learning Architectures
+====================
+![Fig2](figures/fig_2.png)
+
+Examples
+=========
+![Fig3](figures/fig_3.png) -->
+
+# Quick Start
 
 > A small step for exploring the drug space in need, a giant leap for exploiting a healthy state indeed.
 
@@ -22,9 +44,11 @@ pip install git+https://github.com/CDDLeiden/DrugEx.git@master
 
 ### Optional Dependencies
 
-**[QSPRPred](https://github.com/CDDLeiden/QSPRPred.git)** - Optional package to install if you want to use the command line interface of DrugEx, which requires the models to be serialized with this package. It is also used by some examples in the tutorial.
+<<<<<<< HEAD
+**[QSPRPred](https://github.com/CDDLeiden/QSPRPred.git)** - Optional package to install if you want to use the command line interface of DrugEx, which requires the models to be serialized with this package. It is also used by some examples in the tutorial. Install DrugEx with the following command if you want these features:
+
 ```bash
-pip install git+https://github.com/CDDLeiden/QSPRPred.git@v1.3.1
+pip install "drugex[qsprpred] @ git+https://github.com/CDDLeiden/DrugEx.git@master"
 ```
 
 **[RAscore](https://github.com/reymond-group/RAscore)** - If you want to use the Retrosynthesis Accessibility Score in the desirability function.
@@ -95,50 +119,25 @@ The DrugEx toolkit offers a variety of models with varying complexities, each wi
 
 It is noteworthy, however, that even on a suboptimal configuration, it should be possible to fine-tune and optimize the basic sequential RNN model using reinforcement learning techniques if a pretrained model is used. Regarding the two transformers, we recommend leveraging multiple GPUs to increase throughput via parallelization, automated by the DrugEx package. This technique divides the model's workload across multiple GPUs, enabling the system to handle more significant volumes of data at a faster rate than when using a single GPU.
 
-History
-=======
-
-The first version of DrugEx [<a href="liu_drugex1">1</a>] consisted of a recurrent neural network (RNN) single-task agent of gated recurrent units (GRU) which were updated to long short-term memory (LSTM) units in the second version [<a href="liu_drugex2">2</a>], also introducing MOO-based RL and an updated exploitation-exploration strategy. In its third version, [<a href="liu_drugex3">3</a>] generators based on a variant of the transformer and a novel graph-based encoding allowing for the sampling of molecules with specific substructures were introduced.  This package builds on these works to have a user-friendly but also easily customisable toolkit for DNDD with a development of an API and a command line interface, and the addition of new features [<a href="sicho_drugex">4</a>]. Furthermore, the development and traning of QSAR models, used to score molecules during reinforcement learning has been moved to a separate [QSPRpred](https://github.com/CDDLeiden/QSPRPred)-package.
-
-<!-- Introduction
-=============
-Due to the large drug-like chemical space available to search for feasible drug-like molecules, rational drug design often starts from specific scaffolds to which side chains/substituents are added or modified. With the rapid growth of the application of deep learning in drug discovery, a variety of effective approaches have been developed for de novo drug design. In previous work, we proposed a method named DrugEx, which can be applied in polypharmacology based on multi-objective deep reinforcement learning. However, the previous version is trained under fixed objectives similar to other known methods and does not allow users to input any prior information (i.e. a desired scaffold). In order to improve the general applicability, we updated DrugEx to design drug molecules based on scaffolds which consist of multiple fragments provided by users. In this work, the Transformer model was employed to generate molecular structures. The Transformer is a multi-head self-attention deep learning model containing an encoder to receive scaffolds as input and a decoder to generate molecules as output. In order to deal with the graph representation of molecules we proposed a novel positional encoding for each atom and bond based on an adjacency matrix to extend the architecture of the Transformer. Each molecule was generated by growing and connecting procedures for the fragments in the given scaffold that were unified into one model. Moreover, we trained this generator under a reinforcement learning framework to increase the number of desired ligands. As a proof of concept, our proposed method was applied to design ligands for the adenosine A2A receptor (A2AAR) and compared with SMILES-based methods. The results demonstrated the effectiveness of our method in that 100% of the generated molecules are valid and most of them had a high predicted affinity value towards A2AAR with given scaffolds.  -->
-
-<!-- <b>Keywords</b>: deep learning, reinforcement learning, policy gradient, drug design, Transformer, multi-objective optimization</p> -->
-
-Workflow
-========
-![Fig1](figures/TOC_figure.png)
-
-<!-- Deep learning Archietectures
-====================
-![Fig2](figures/fig_2.png)
-
-Examples
-=========
-![Fig3](figures/fig_3.png) -->
+# License
 
-License
-=======
-Please see the LICENSE file for the license terms for the software. Basically it's free to academic users. If you do wish to sell the software or use it in a commercial product, then please contact Gerard J.P. van Westen:
+The software is licensed under the standard MIT license, which means it is also free to use in commercial applications as long as the copyright terms of the license are preserved. You can view the [LICENSE](./LICENSE) file for the full terms. If you have questions about the license or the use of the software in your organization, please contact Gerard J.P. van Westen:
 
    [Gerard J.P. van Westen](mailto:[email protected]): [email protected] 
 
-Current Development Team
-========================
+# Current Development Team
+
 - [M. Sicho](https://github.com/martin-sicho)
 - [S. Luukkonen](https://github.com/sohviluukkonen)
 - [H. van den Maagdenberg](https://github.com/HellevdM)
 - [L. Schoenmaker](https://github.com/LindeSchoenmaker)
 - [O. Béquignon](https://github.com/OlivierBeq)
 
-Contributions
-=============
+# Contributions
 
 If you find that there is something missing, have a question, or you just want to contribute a new model or feature, please, feel free to open an issue to initiate a discussion. We are more than happy to improve the package with your contributions, bug reports and ideas. After the feature is discussed in its designated issue, the best way to contribute is to fork the repository, make your changes and then create a pull request. We will then review your changes and merge them into the main repository. Alternatively, you can contact us directly via [email](mailto:[email protected]).
 
-Acknowledgements
-================
+# Acknowledgements
 
 We would like to thank the following people for significant contributions:
 
@@ -151,8 +150,7 @@ We also thank the following Git repositories that gave Xuhan a lot of inspiratio
 2. [ORGAN](https://github.com/gablg1/ORGAN)
 3. [SeqGAN](https://github.com/LantaoYu/SeqGAN)
 
-References
-==========
+# References
 
 <a name="liu_drugex1"></a> [1] [Liu X., Ye K., van Vlijmen H.W.T, IJzerman A.P., van Westen G.J.P. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. Journal of cheminformatics. 2019;11(1):35.](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0355-6)
 

diff --git a/docs/api/drugex.rst b/docs/api/drugex.rst
@@ -33,18 +33,18 @@ drugex.dataset module
    :undoc-members:
    :show-inheritance:
 
-drugex.designer module
+drugex.download module
 ----------------------
 
-.. automodule:: drugex.designer
+.. automodule:: drugex.download
    :members:
    :undoc-members:
    :show-inheritance:
 
-drugex.download module
+drugex.generate module
 ----------------------
 
-.. automodule:: drugex.download
+.. automodule:: drugex.generate
    :members:
    :undoc-members:
    :show-inheritance:

diff --git a/docs/api/drugex.utils.rst b/docs/api/drugex.utils.rst
@@ -28,18 +28,18 @@ drugex.utils.gcmol module
    :undoc-members:
    :show-inheritance:
 
-drugex.utils.nsgaii module
---------------------------
+drugex.utils.optim module
+-------------------------
 
-.. automodule:: drugex.utils.nsgaii
+.. automodule:: drugex.utils.optim
    :members:
    :undoc-members:
    :show-inheritance:
 
-drugex.utils.optim module
--------------------------
+drugex.utils.pareto module
+--------------------------
 
-.. automodule:: drugex.utils.optim
+.. automodule:: drugex.utils.pareto
    :members:
    :undoc-members:
    :show-inheritance:

diff --git a/docs/api/modules.rst b/docs/api/modules.rst
@@ -1,7 +1,7 @@
 ..  _api-docs:
 
-DrugEx Python API
-=================
+DrugEx Package API Documentation
+================================
 
 .. toctree::
    :maxdepth: 4

diff --git a/docs/conf.py b/docs/conf.py
@@ -26,8 +26,9 @@
 spec = importlib.util.spec_from_file_location("drugex.about", "../drugex/about.py")
 about = importlib.util.module_from_spec(spec)
 spec.loader.exec_module(about)
-release = about.VERSION
-version = f'v{release}'
+from importlib.metadata import version
+release = f"v{version('drugex')}"
+version = release
 
 
 # -- General configuration ---------------------------------------------------
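
Note on the `docs/conf.py` change above: `importlib.metadata.version` raises `PackageNotFoundError` when the named distribution is not installed, so the new lookup presumably requires `drugex` to be installed in the environment that builds the docs. A minimal sketch of a guarded lookup follows. The helper `resolve_version`, the fallback string and the placeholder package name are hypothetical and not part of DrugEx.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(package: str, fallback: str) -> str:
    """Return the installed version of `package`, or `fallback` if it is not installed.

    Hypothetical helper for illustration; DrugEx's docs/conf.py calls
    `version('drugex')` directly.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return fallback


# A distribution name that is assumed not to exist, to exercise the fallback path.
release = f"v{resolve_version('surely-not-an-installed-package-xyz', '3.4.5')}"
print(release)
```

Whether a silent fallback is desirable is a design choice: failing loudly, as the diff does, guarantees that the documented version always matches an installed package.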

diff --git a/docs/use.rst b/docs/use.rst
@@ -42,7 +42,7 @@ Basics
 ------
 
 Fine-tuning a Pretrained Generator
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 In this example, we will use the DrugEx CLI to fine-tune a pretrained graph transformer (trained on the latest version of the Papyrus data set).
 This pretrained model has been trained on a diverse set of molecules.

diff --git a/drugex/about.py b/drugex/about.py
@@ -4,5 +4,10 @@
 Created by: Martin Sicho
 On: 24.06.22, 10:36
 """
+import os
 
-VERSION = "3.4.4"
+VERSION = "3.4.5"
+
+if os.path.exists(os.path.join(os.path.dirname(__file__), '_version.py')):
+    from ._version import version
+    VERSION = version
diff --git a/drugex/dataset.py b/drugex/dataset.py
@@ -5,7 +5,7 @@
 
 import pandas as pd
 
-from drugex.logs.utils import enable_file_logger, commit_hash, backUpFiles
+from drugex.logs.utils import enable_file_logger, backUpFiles
 
 from drugex.molecules.converters.fragmenters import Fragmenter, FragmenterWithSelectedFragment
 from drugex.molecules.converters.dummy_molecules import dummyMolsFromFragments
@@ -63,8 +63,6 @@ def DatasetArgParser():
                         help="Number of parallel processes to use for multi-core tasks.")
     parser.add_argument('-cs', '--chunk_size', type=int, default=512,
                         help="Number of items to be given to each process for multi-core tasks. If not specified, this number is set to 512.")
-    parser.add_argument('-ng', '--no_git', action='store_true',
-                        help="If on, git hash is not retrieved")
 
     args = parser.parse_args()
 
@@ -273,7 +271,6 @@ def __call__(self, smiles_list):
         'dataset.log',
         args.debug,
         __name__,
-        commit_hash(os.path.dirname(os.path.realpath(__file__))) if not args.no_git else None,
         vars(args)
     )
     log = logSettings.log