diff --git a/docs/databases_klifs_statement_of_need.rst b/docs/databases_klifs_statement_of_need.rst index 95fb15a3..1ae26cd8 100644 --- a/docs/databases_klifs_statement_of_need.rst +++ b/docs/databases_klifs_statement_of_need.rst @@ -1,24 +1,25 @@ Statement of need ================= -OpenCADD-KLIFS is aimed at current and future users of the KLIFS database who seek to -integrate kinase resources into Python-based research projects. -This module offers access to KLIFS data [Kanev_2021]_ such as information about kinases, -structures, ligands, +The KLIFS resource [Kanev_2021]_ contains information about kinases, structures, ligands, interaction fingerprints, and bioactivities. KLIFS thereby focuses especially on the ATP binding site, defined as a set of 85 residues and -aligned across all structures using a multiple sequence alignment (MSA) [vanLinden_2014]_. -With OpenCADD-KLIFS, KLIFS data can be queried either locally from a KLIFS download or remotely -from the KLIFS webserver. -The presented module provides identical APIs for the remote and local queries for KLIFS data and -streamlines all output into -standardized `Pandas `_ DataFrames to allow for easy and -quick downstream data analyses (Figure 1). This Pandas-focused setup is ideal to work with in -Jupyter notebooks [Kluyver_2016]_. +aligned across all structures using a multiple sequence alignment [vanLinden_2014]_. +Fetching, filtering, and integrating the KLIFS content on a larger scale into Python-based +pipelines is currently not straightforward, especially for users without a background in +online queries. +Furthermore, switching between data queries from a *local* KLIFS download and +the *remote* KLIFS database is not readily possible. -`OpenCADD-KLIFS `_ -(``opencadd.databases.klifs``) is a part of the `OpenCADD `_ -package, a collection of Python modules for structural cheminformatics. 
+OpenCADD-KLIFS is aimed at current and future users of the KLIFS database who seek to +integrate kinase resources into Python-based research projects. +With OpenCADD-KLIFS, KLIFS data can be queried either locally from a KLIFS download or +remotely from the KLIFS webserver. +The presented module provides identical APIs for the remote and local queries and +streamlines all output into standardized Pandas DataFrames +`Pandas `_ to allow for easy and quick +downstream data analyses (Figure 1). +This Pandas-focused setup is ideal if you work with Jupyter notebooks [Kluyver_2016]_. .. raw:: html @@ -29,45 +30,6 @@ package, a collection of Python modules for structural cheminformatics. *Figure 1*: OpenCADD-KLIFS fetches KLIFS data offline from a KLIFS download or online from the KLIFS database and formats the output as user-friendly Pandas DataFrames. -The KLIFS database offers a REST API compliant with the OpenAPI specification -(`KLIFS OpenAPI `_). -Our module OpenCADD-KLIFS uses `bravado `_ to dynamically -generate a Python client based on the OpenAPI definitions and adds wrappers to enable the -following functionalities: - -- A session is set up, which allows access to various KLIFS *data sources* by different - *identifiers* with the API ``session.data_source.by_identifier``. *Data sources* currently - include kinases, structures and annotated conformations, modified residues, pockets, ligands, - drugs, and bioactivities; *identifiers* refer to kinase names, PDB IDs, KLIFS IDs, and more. - For example, ``session.structures.by_kinase_name`` fetches information on all structures for a - query kinase. -- The same API is used for local and remote sessions. -- The returned data follows the same schema regardless of the session type (local/remote); all - results obtained with bravado are formatted as Pandas DataFrames with standardized column names, - data types, and handling of missing data. 
-- Files with the structural 3D coordinates deposited on KLIFS include full complexes or selections - such as proteins, pockets, ligands, and more. These files can be downloaded to disc or loaded - via biopandas [Raschka_2017]_ or `RDKit `_. - -OpenCADD-KLIFS is especially convenient whenever users are interested in multiple or more -complex queries such as "fetching all structures for the kinase EGFR in the DFG-in conformation" -or "fetching the measured bioactivity profiles for all ligands that are structurally resolved in -complex with EGFR". Formatting the output as DataFrames facilitates subsequent filtering steps -and DataFrame merges in case multiple KLIFS datasets need to be combined. -OpenCADD-KLIFS is currently used in several projects -from the `Volkamer Lab `_ -including -`TeachOpenCADD `_, -`OpenCADD-pocket `_, -`KiSSim `_, -`KinoML `_, and -`PLIPify `_. -For example, OpenCADD-KLIFS is applied in a -`TeachOpenCADD tutorial `_ -to demonstrate how to fetch all kinase-ligand interaction profiles for all available EGFR kinase -structures to visualize the per-residue interaction types and frequencies with only a few -lines of code. - .. [Kanev_2021] Kanev et al., (2021), KLIFS: an overhaul after the first 5 years of supporting kinase research, Nucleic Acids Research, @@ -80,7 +42,4 @@ lines of code. .. [Kluyver_2016] Kluyver et al., (2016), Jupyter Notebooks – a publishing format for reproducible computational workflows, In Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press. pp. 87-90, - doi:10.3233/978-1-61499-649-1-87. -.. [Raschka_2017] Raschka, (2017), - BioPandas: Working with molecular structures in pandas DataFrames, Journal of Open Source Software, - 2(14), 279, doi:10.21105/joss.00279. \ No newline at end of file + doi:10.3233/978-1-61499-649-1-87. 
\ No newline at end of file diff --git a/paper/paper.bib b/paper/paper.bib index d1ea7459..7b03117e 100644 --- a/paper/paper.bib +++ b/paper/paper.bib @@ -9,14 +9,28 @@ @article{Cohen:2021 doi={10.1038/s41573-021-00195-4}, } -@article{Kooistra:2017, - author = {Kooistra, Albert J. and Volkamer, Andrea}, - title = {{Kinase-Centric Computational Drug Development}}, - journal = {Annu. Rep. Med. Chem.}, +@incollection{Kooistra:2017, + booktitle = {Platform Technologies in Drug Discovery and Validation}, + series = {Annual Reports in Medicinal Chemistry}, + editor = {Robert A. Goodnow}, + title = {Chapter Six - Kinase-Centric Computational Drug Development}, + author = {Kooistra, {A. J.} and Volkamer, A.}, + publisher = {Academic Press}, volume = {50}, - pages = {197--236}, + pages = {197-236}, year = {2017}, - doi = {10.1016/BS.ARMC.2017.08.001}, + doi = {10.1016/bs.armc.2017.08.001}, +} + +@inproceedings{Kluyver:2016, + booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas}, + editor = {Fernando Loizides and Birgit Schmidt}, + title = {Jupyter Notebooks - a publishing format for reproducible computational workflows}, + author = {Thomas Kluyver and Benjamin Ragan-Kelley and Fernando P{\'e}rez and Brian Granger and Matthias Bussonnier and Jonathan Frederic and Kyle Kelley and Jessica Hamrick and Jason Grout and Sylvain Corlay and Paul Ivanov and Dami{\'a}n Avila and Safia Abdalla and Carol Willing and Jupyter development team}, + publisher = {IOS Press}, + year = {2016}, + pages = {87--90}, + url = {https://eprints.soton.ac.uk/403913/}, } @article{Kanev:2021, @@ -31,7 +45,7 @@ @article{Kanev:2021 } @article{vanLinden:2014, - author={van Linden, Oscar P. J. and Kooistra, Albert J. and Leurs, Rob and de Esch, Iwan J. P. and de Graaf, Chris}, + author={{van Linden}, Oscar P. J. and Kooistra, Albert J. and Leurs, Rob and de Esch, Iwan J. P. 
and de Graaf, Chris}, title={KLIFS: A Knowledge-Based Structural Database To Navigate Kinase--Ligand Interaction Space}, journal={Journal of Medicinal Chemistry}, volume={57}, @@ -51,15 +65,48 @@ @article{Raschka:2017 doi = {10.21105/joss.00279}, } -@inproceedings{Kluyver:2016, - booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas}, - editor = {Fernando Loizides and Birgit Scmidt}, - title = {Jupyter Notebooks - a publishing format for reproducible computational workflows}, - author = {Thomas Kluyver and Benjamin Ragan-Kelley and Fernando P{\'e}rez and Brian Granger and Matthias Bussonnier and Jonathan Frederic and Kyle Kelley and Jessica Hamrick and Jason Grout and Sylvain Corlay and Paul Ivanov and Dami{\'a}n Avila and Safia Abdalla and Carol Willing and Jupyter development team}, - publisher = {IOS Press}, - year = {2016}, - pages = {87--90}, - url = {https://eprints.soton.ac.uk/403913/}, +@article{Mendez:2018, + author = {Mendez, David and Gaulton, Anna and Bento, A Patrícia and Chambers, Jon and De Veij, Marleen and Félix, Eloy and Magariños, María Paula and Mosquera, Juan F and Mutowo, Prudence and Nowotka, Michał and Gordillo-Marañón, María and Hunter, Fiona and Junco, Laura and Mugumbate, Grace and Rodriguez-Lopez, Milagros and Atkinson, Francis and Bosc, Nicolas and Radoux, Chris J and Segura-Cabrera, Aldo and Hersey, Anne and Leach, Andrew R}, + title = "{ChEMBL: towards direct deposition of bioassay data}", + journal = {Nucleic Acids Research}, + volume = {47}, + number = {D1}, + pages = {D930-D940}, + year = {2018}, + doi = {10.1093/nar/gky1075}, +} + +@article{Carles:2018, + author = {Carles, Fabrice and Bourg, St{\'{e}}phane and Meyer, Christophe and Bonnet, Pascal}, + title = {{PKIDB: A Curated, Annotated and Updated Database of Protein Kinase Inhibitors in Clinical Trials}}, + journal = {Molecules}, + volume = {23}, + number = {4}, + pages = {908}, + year = {2018}, + doi = {10.3390/molecules23040908}, +} + 
+@article{McGuire:2017, + author = {McGuire, Ross and Verhoeven, Stefan and Vass, Márton and Vriend, Gerrit and de Esch, Iwan J. P. and Lusher, Scott J. and Leurs, Rob and Ridder, Lars and Kooistra, Albert J. and Ritschel, Tina and de Graaf, Chris}, + title = {3D-e-Chem-VM: Structural Cheminformatics Research Infrastructure in a Freely Available Virtual Machine}, + journal = {Journal of Chemical Information and Modeling}, + volume = {57}, + number = {2}, + pages = {115-121}, + year = {2017}, + doi = {10.1021/acs.jcim.6b00686}, +} + +@article{Kooistra:2018, + author = {Kooistra, {A. J.} and Vass, M. and McGuire, R. and Leurs, R. and de Esch, I. J. P. and Vriend, G. and Verhoeven, S. and de Graaf, C. }, + title = {{3{D}-e-{C}hem: {S}tructural {C}heminformatics {W}orkflows for {C}omputer-{A}ided {D}rug {D}iscovery}}, + journal = {ChemMedChem}, + volume = {13}, + number = {6}, + pages = {614--626}, + year = {2018}, + doi = {10.1002/cmdc.201700754}, } @misc{klifsswagger, @@ -70,6 +117,15 @@ @misc{klifsswagger url = {https://dev.klifs.net/swagger_v2/}, } +@misc{requests, + author = {requests}, + title = {{requests}}, + year = 2021, + publisher = {GitHub}, + journal = {GitHub repository}, + url = {https://github.com/psf/requests}, +} + @misc{bravado, author = {bravado}, title = {{bravado}}, diff --git a/paper/paper.md b/paper/paper.md index 401d7279..b48cdaaf 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -11,7 +11,7 @@ authors: - name: Jaime Rodríguez-Guerra orcid: 0000-0001-8974-1566 affiliation: 1 - - name: Andrea Volkamer + - name: Andrea Volkamer^[corresponding author] affiliation: 1 orcid: 0000-0002-3760-580X affiliations: @@ -23,34 +23,56 @@ bibliography: paper.bib # Summary -Protein kinases are involved in most aspects of cell life due to their role in signal transduction. 
Dysregulated kinases can cause severe diseases such as cancer, inflammatory and neurodegenerative diseases, which has made them a frequent target in drug discovery for the last decades [@Cohen:2021]. +Protein kinases are involved in most aspects of cell life due to their role in signal transduction. Dysregulated kinases can cause severe diseases such as cancer, inflammation, and neurodegeneration, which has made them a frequent target in drug discovery for the last decades [@Cohen:2021]. The immense research on kinases has led to an increasing amount of kinase resources [@Kooistra:2017]. -Among them is the KLIFS database, which focuses on storing and analyzing structural data on kinases and interacting drugs and other small molecules [@Kanev:2021]. +Among them is the KLIFS database, which focuses on storing and analyzing structural data on kinases and interacting ligands [@Kanev:2021]. The OpenCADD-KLIFS Python module offers a convenient integration of the KLIFS data into workflows to facilitate computational kinase research. +[OpenCADD-KLIFS](https://opencadd.readthedocs.io/en/latest/databases_klifs.html) (``opencadd.databases.klifs``) is part of the [OpenCADD](https://opencadd.readthedocs.io/) package, a collection of Python modules for structural cheminformatics. + # Statement of need +The KLIFS resource [@Kanev:2021] contains information about kinases, structures, ligands, interaction fingerprints, and bioactivities. +KLIFS thereby focuses especially on the ATP binding site, defined as a set of 85 residues and aligned across all structures using a multiple sequence alignment [@vanLinden:2014]. +Fetching, filtering, and integrating the KLIFS content on a larger scale into Python-based pipelines is currently not straightforward, especially for users without a background in online queries. Furthermore, switching between data queries from a _local_ KLIFS download and the _remote_ KLIFS database is not readily possible. 
+ OpenCADD-KLIFS is aimed at current and future users of the KLIFS database who seek to integrate kinase resources into Python-based research projects. -This module offers access to KLIFS data [@Kanev:2021] such as information about kinases, structures, ligands, -interaction fingerprints, and bioactivities. -KLIFS thereby focuses especially on the ATP binding site, defined as a set of 85 residues and aligned across all structures using a multiple sequence alignment (MSA) [@vanLinden:2014]. With OpenCADD-KLIFS, KLIFS data can be queried either locally from a KLIFS download or remotely from the KLIFS webserver. -The presented module provides identical APIs for the remote and local queries for KLIFS data and streamlines all output into -standardized Pandas DataFrames [@pandas] to allow for easy and quick downstream data analyses (\autoref{fig:opencadd_klifs_toc}). This Pandas-focused setup is ideal to work with in Jupyter notebooks [@Kluyver:2016]. +The presented module provides identical APIs for the remote and local queries and streamlines all output into +standardized Pandas DataFrames [@pandas] to allow for easy and quick downstream data analyses (\autoref{fig:opencadd_klifs_toc}). This Pandas-focused setup is ideal if you work with Jupyter notebooks [@Kluyver:2016]. + +![OpenCADD-KLIFS fetches KLIFS data [@Kanev:2021] offline from a local KLIFS download or online from the KLIFS database and formats the output as user-friendly Pandas DataFrames [@pandas].\label{fig:opencadd_klifs_toc}](opencadd_klifs_toc.png) + +# State of the field + +The KLIFS database is unique in the structure-based kinase field in terms of integrating and annotating different data resources in a kinase- and pocket-focused manner. Kinases, structures, and ligands have unique identifiers in KLIFS, which makes it possible to fetch and filter cross-referenced information for a query kinase, structure, or ligand. 
-[OpenCADD-KLIFS](https://opencadd.readthedocs.io/en/latest/databases_klifs.html) (``opencadd.databases.klifs``) is a part of the [OpenCADD](https://opencadd.readthedocs.io/) package, a collection of Python modules for structural cheminformatics. +- Kinase structures are fetched from the PDB, split by chains and alternate models, annotated with the KLIFS pocket of 85 residues, and aligned across the fully structurally covered kinome. +- Kinase-ligand interactions seen in experimental structures are annotated for the 85 pocket residues in the form of the KLIFS interaction fingerprint (KLIFS IFP). +- Bioactivity data measured against kinases are fetched from ChEMBL [@Mendez:2018] and linked to kinases, structures, and ligands available in KLIFS. +- Kinase inhibitor metadata are fetched from the PKIDB [@Carles:2018] and linked to co-crystallized ligands available in KLIFS. -![OpenCADD-KLIFS fetches KLIFS data [@Kanev:2021] offline from a KLIFS download or online from the KLIFS database and formats the output as user-friendly Pandas DataFrames [@pandas].\label{fig:opencadd_klifs_toc}](opencadd_klifs_toc.png) +The KLIFS data integrations and annotations can be accessed in different ways, which are all open source: -The KLIFS database offers a REST API compliant with the OpenAPI specification [@klifsswagger]. Our module OpenCADD-KLIFS uses bravado [@bravado] to dynamically generate a Python client based on the OpenAPI definitions and adds wrappers to enable the following functionalities: +- Manually via the [KLIFS website](https://klifs.net/) interface: This mode is preferable when searching for information on a specific structure or smaller set of structures. +- Automated via the [KLIFS KNIME](https://github.com/3D-e-Chem/knime-klifs) nodes [@McGuire:2017; @Kooistra:2018]: This mode is extremely useful if the users' projects are embedded in KNIME workflows; programming is not needed. 
+- Programmatically using the REST API and KLIFS OpenAPI specifications: This mode is needed for users who seek to perform larger-scale queries or to integrate different queries into programmatic workflows. In the following, we will discuss this mode in the context of Python-based projects and explain how OpenCADD-KLIFS improves the user experience. -- A session is set up, which allows access to various KLIFS *data sources* by different *identifiers* with the API ``session.data_source.by_identifier``. *Data sources* currently include kinases, structures and annotated conformations, modified residues, pockets, ligands, drugs, and bioactivities; *identifiers* refer to kinase names, PDB IDs, KLIFS IDs, and more. For example, ``session.structures.by_kinase_name`` fetches information on all structures for a query kinase. -- The same API is used for local and remote sessions. +The KLIFS database offers standardized URL schemes (REST API), which allow users to query data by defined URLs, using e.g. the Python package requests [@requests]. Instead of writing customized scripts to generate such KLIFS URLs, the KLIFS OpenAPI specification (a document that defines the KLIFS REST API scheme) can be used to generate a Python client, using e.g. the Python package bravado [@bravado]. This client offers a Python API to send requests and receive responses. +This setup is already extremely useful; however, it has a few drawbacks: the setup is technical, the output is not easily readable for humans and not ready for immediate downstream integrations (requiring similar but not identical reformatting functions for different query results), and switching from remote requests to local KLIFS download queries is not possible. Facilitating and streamlining these tasks is the purpose of OpenCADD-KLIFS as discussed in more detail in the next section. + +# Key Features + +The KLIFS database offers a REST API compliant with the OpenAPI specification [@klifsswagger]. 
Our module OpenCADD-KLIFS uses bravado to dynamically generate a Python client based on the OpenAPI definitions and adds wrappers to enable the following functionalities: + +- A session is set up automatically, which allows access to various KLIFS *data sources* by different *identifiers* with the API ``session.data_source.by_identifier``. *Data sources* currently include kinases, structures and annotated conformations, modified residues, pockets, ligands, drugs, and bioactivities; *identifiers* refer to kinase names, PDB IDs, KLIFS IDs, and more. For example, ``session.structures.by_kinase_name`` fetches information on all structures for a query kinase. +- The same API is used for local and remote sessions, i.e. interacting with data from a KLIFS download folder and from the KLIFS website, respectively. - The returned data follows the same schema regardless of the session type (local/remote); all results obtained with bravado are formatted as Pandas DataFrames with standardized column names, data types, and handling of missing data. -- Files with the structural 3D coordinates deposited on KLIFS include full complexes or selections such as proteins, pockets, ligands, and more. These files can be downloaded to disc or loaded via biopandas [Raschka:2017] or RDKit [@rdkit]. +- Files with the structural 3D coordinates deposited on KLIFS include full complexes or selections such as proteins, pockets, ligands, and more. These files can be downloaded to disc or loaded via biopandas [@Raschka:2017] or RDKit [@rdkit]. OpenCADD-KLIFS is especially convenient whenever users are interested in multiple or more complex queries such as "fetching all structures for the kinase EGFR in the DFG-in conformation" or "fetching the measured bioactivity profiles for all ligands that are structurally resolved in complex with EGFR". Formatting the output as DataFrames facilitates subsequent filtering steps and DataFrame merges in case multiple KLIFS datasets need to be combined. 
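The design described above (one ``session.data_source.by_identifier`` API, identical for local and remote sessions, returning one standardized schema) can be illustrated with a toy sketch. This is not the OpenCADD-KLIFS implementation: the classes, the functions `toy_remote_session` and `toy_local_session`, the column names, and the records are all hypothetical, and plain dictionaries stand in for Pandas DataFrame rows so that the example runs with the standard library only.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical standardized column names shared by both session types.
COLUMNS = ("structure.klifs_id", "kinase.klifs_name", "structure.dfg")

# Made-up records: the "remote" source yields dicts, the "local" source CSV text.
_REMOTE_RECORDS = [
    {"structure_ID": 1, "kinase": "EGFR", "DFG": "in"},
    {"structure_ID": 2, "kinase": "EGFR", "DFG": "out"},
    {"structure_ID": 3, "kinase": "ABL1", "DFG": "in"},
]
_LOCAL_CSV = "1,EGFR,in\n2,EGFR,out\n3,ABL1,in\n"


def _standardize(klifs_id, kinase, dfg):
    """Map one raw record onto the shared schema (names and data types)."""
    return dict(zip(COLUMNS, (int(klifs_id), str(kinase), dfg or None)))


class ToyRemoteStructures:
    """Hypothetical provider that reads dict-shaped web responses."""

    def by_kinase_name(self, kinase_name):
        return [
            _standardize(r["structure_ID"], r["kinase"], r["DFG"])
            for r in _REMOTE_RECORDS
            if r["kinase"] == kinase_name
        ]


class ToyLocalStructures:
    """Hypothetical provider that reads rows from a downloaded CSV file."""

    def by_kinase_name(self, kinase_name):
        rows = csv.reader(io.StringIO(_LOCAL_CSV))
        return [_standardize(*row) for row in rows if row[1] == kinase_name]


@dataclass
class ToySession:
    """Bundles data sources; only 'structures' is sketched here."""

    structures: object


def toy_remote_session():
    return ToySession(structures=ToyRemoteStructures())


def toy_local_session():
    return ToySession(structures=ToyLocalStructures())


# The calling code is identical for both session types.
egfr_remote = toy_remote_session().structures.by_kinase_name("EGFR")
egfr_local = toy_local_session().structures.by_kinase_name("EGFR")

# Downstream filtering, e.g. "all EGFR structures in the DFG-in conformation".
egfr_dfg_in = [row for row in egfr_remote if row["structure.dfg"] == "in"]
print(egfr_dfg_in)
```

Because both providers pass their raw records through the same standardization step, downstream filtering code does not need to know which session type produced the data.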
+ OpenCADD-KLIFS is currently used in several projects from the Volkamer Lab [@volkamerlab] including TeachOpenCADD [@teachopencadd], OpenCADD-pocket [@opencadd_pocket], KiSSim [@kissim], KinoML [@kinoml], and PLIPify [@plipify]. For example, OpenCADD-KLIFS is applied in a [TeachOpenCADD tutorial](https://projects.volkamerlab.org/teachopencadd/talktorials/T012_query_klifs.html) to demonstrate how to fetch all kinase-ligand interaction profiles for all available EGFR kinase structures to visualize the per-residue interaction types and frequencies with only a few lines of code.