Generative_Protein_Design_Datasets

A repository highlighting and linking existing resources and datasets for generative protein modeling.

Sections

Protein Sequence Datasets

  • Suzek, Baris E et al. “UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.”
    Bioinformatics (Oxford, England) vol. 31,6 (2015): 926-32. doi:10.1093/bioinformatics/btu739
    Description: A collection of protein sequence clusters from the UniProt Knowledgebase.
    [Paper], [Dataset/Server]

  • Bryant, D.H., Bashir, A., Sinai, S. et al. "Deep diversification of an AAV capsid protein by machine learning."
    Nat Biotechnol 39, 691–696 (2021). https://doi.org/10.1038/s41587-020-00793-4
    Description: DL-engineered adeno-associated virus 2 (AAV2) capsid protein variants.
    [Paper], [Dataset/Server]

  • Jaina Mistry et al. "Pfam: The protein families database in 2021"
    Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D412–D419, https://doi.org/10.1093/nar/gkaa913
    Description: A collection of protein families, each represented by multiple sequence alignments and hidden Markov models.
    [Paper], [Dataset/Server (Legacy Host)], [Dataset/Server (Hosted at InterPro)]

  • Trinquier, J., Uguzzoni, G., Pagnani, A. et al. "Efficient generative modeling of protein sequences using simple autoregressive models." Nat Commun 12, 5800 (2021). https://doi.org/10.1038/s41467-021-25756-4
    Description: Uses data for five protein families (PF00014, PF00072, PF00076, PF00595, PF13354) from Pfam.
    [Paper], [Dataset/Server]

  • Elodie Laine et al. "GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects"
    Molecular Biology and Evolution, Volume 36, Issue 11, November 2019, Pages 2604–2619, https://doi.org/10.1093/molbev/msz179
    Description: Generates the mutational landscape of a protein given an input alignment
    [Paper], [Dataset/Server]

  • Shin, JE., Riesselman, A.J., Kollasch, A.W. et al. "Protein design and variant prediction using autoregressive generative models."
    Nat Commun 12, 2403 (2021). https://doi.org/10.1038/s41467-021-22732-w
    Description: PTEN phosphatase deletions & IGP dehydratase insertions and deletions
    [Paper], [Dataset/Server]

  • Typhaine Paysan-Lafosse et al. "InterPro in 2022"
    Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D418–D427, https://doi.org/10.1093/nar/gkac993
    Description: A database that integrates diverse information about protein families, domains and functional sites, providing a unified view of protein sequence classification and annotation.
    [Paper], [Dataset/Server]

  • Milot Mirdita et al. "Uniclust databases of clustered and deeply annotated protein sequences and alignments"
    Nucleic Acids Research, Volume 45, Issue D1, January 2017, Pages D170–D176, https://doi.org/10.1093/nar/gkw1081
    Description: A database of protein sequences clustered at different sequence similarity thresholds, providing a resource to reduce redundancy and improve the efficiency of various bioinformatics analyses
    [Paper], [Dataset/Server]

  • Gilson, Michael K et al. “BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology.”
    Nucleic acids research vol. 44, D1 (2016): D1045-53. doi:10.1093/nar/gkv1072
    Description: A database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug-targets with small, drug-like molecules.
    [Paper], [Dataset/Server]

  • Antje Chang et al. "BRENDA, the ELIXIR core data resource in 2021: new developments and updates"
    Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D498–D508, https://doi.org/10.1093/nar/gkaa1025
    Description: A database providing detailed information on enzymes such as function, structure, localization, and application, derived from scientific literature and manual curation.
    [Paper], [Dataset/Server]

  • Schmitt, L.T., Paszkowski-Rogacz, M., Jug, F. et al. "Prediction of designer-recombinases for DNA editing with generative deep learning."
    Nat Commun 13, 7966 (2022). https://doi.org/10.1038/s41467-022-35614-6
    Description: 89 recombinase libraries for loxA-1 to loxX-6 target sites
    [Paper], [Dataset/Server]

  • M. AlQuraishi, “ProteinNet: a standardized data set for machine learning of protein structure,”
    BMC Bioinformatics, vol. 20, no. 1, p. 311, Dec. 2019, doi: 10.1186/s12859-019-2932-0.
    Description: A standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training/validation/test splits
    [Paper], [Dataset/Server]

  • Dallago, Christian, et al. “FLIP: Benchmark Tasks in Fitness Landscape Inference for Proteins.”
    bioRxiv (Cold Spring Harbor Laboratory), Cold Spring Harbor Laboratory, Nov. 2021, https://doi.org/10.1101/2021.11.09.467890.
    Description: An open-source benchmark for function prediction, designed to enhance machine learning applications in protein engineering via representation learning scoring.
    [Paper], [Dataset]

Protein Structure Datasets

  • Ian Sillitoe et al. "CATH: expanding the horizons of structure-based functional annotations for genome sequences"
    Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D280–D284, https://doi.org/10.1093/nar/gky1097
    Description: Hierarchical classification of protein domain structures based on structural and evolutionary relationships.
    [Paper], [Dataset/Server]

  • Cuff, J A, and G J Barton. “Evaluation and improvement of multiple sequence methods for protein secondary structure prediction.”
    Proteins vol. 34,4 (1999): 508-19. doi:10.1002/(sici)1097-0134(19990301)34:4<508::aid-prot10>3.0.co;2-4
    Description: Dataset for Secondary Structure Prediction
    [Paper], [CB513 Dataset]

  • Kryshtafovych, A, Schwede, T, Topf, M, Fidelis, K, Moult, J. "Critical assessment of methods of protein structure prediction (CASP)—Round XIII."
    Proteins. 2019; 87: 1011– 1020. https://doi.org/10.1002/prot.25823
    Description: Targets and submitted structure predictions from the CASP13 blind prediction experiment.
    [Paper], [Dataset/Server]

  • Protein Data Bank

  • John-Marc Chandonia et al. "SCOPe: classification of large macromolecular structures in the structural classification of proteins—extended database."
    Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D475–D481, https://doi.org/10.1093/nar/gky1134
    Description: Database of known protein structures classified based on their structural and evolutionary relationships
    [Paper], [Dataset/Server]

  • Ferdous S, Martin ACR. "AbDb: antibody structure database-a database of PDB-derived antibody structures."
    Database (Oxford). 2018;2018:bay040. doi:10.1093/database/bay040
    Description: A database of antibody structures.
    [Paper], [Dataset/Server]

  • S. Axelrod and R. Gomez-Bombarelli, “GEOM: Energy-annotated molecular conformations for property prediction and molecular generation.”
    arXiv, Feb. 09, 2022. doi: 10.48550/arXiv.2006.05531.
    Description: A dataset of 37 million molecular conformations annotated by energy and statistical weight for over 450,000 molecules.
    [Paper], [Dataset/Server]

  • M. AlQuraishi, “ProteinNet: a standardized data set for machine learning of protein structure,”
    BMC Bioinformatics, vol. 20, no. 1, p. 311, Dec. 2019, doi: 10.1186/s12859-019-2932-0.
    Description: A standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training/validation/test splits
    [Paper], [Dataset/Server]

  • Leonardo V Castorina et al. "PDBench: evaluating computational methods for protein-sequence design."
    Bioinformatics, Volume 39, Issue 1, January 2023, btad027, https://doi.org/10.1093/bioinformatics/btad027
    Description: Diverse dataset of proteins and tests, intended to evaluate and optimize protein-sequence design
    [Paper], [Dataset/Server]

  • Amelia Villegas-Morcillo et al. "ManyFold: an efficient and flexible library for training and validating protein folding models."
    Bioinformatics, Volume 39, Issue 1, January 2023, btac773, https://doi.org/10.1093/bioinformatics/btac773
    Description: A flexible, JAX-based library for protein structure prediction using deep learning, capable of fine-tuning and training models from scratch.
    [Paper], [Dataset]

  • Thomas, Neil, et al. “Tuned Fitness Landscapes for Benchmarking Model-Guided Protein Design.”
    bioRxiv, 1 Jan. 2022, www.biorxiv.org/content/10.1101/2022.10.28.514293v1.full.
    Description: An open-source framework for creating synthetic, biologically-inspired protein landscapes to test and benchmark machine learning-directed evolution (MLDE) strategies.
    [Paper], [Dataset]

  • Dallago, Christian, et al. “FLIP: Benchmark Tasks in Fitness Landscape Inference for Proteins.”
    bioRxiv (Cold Spring Harbor Laboratory), Cold Spring Harbor Laboratory, Nov. 2021, https://doi.org/10.1101/2021.11.09.467890.
    Description: An open-source benchmark for function prediction, designed to enhance machine learning applications in protein engineering via representation learning scoring.
    [Paper], [Dataset]

Protein-Protein Interaction Datasets

  • P. Bryant, G. Pozzati, and A. Elofsson, “Improved prediction of protein-protein interactions using AlphaFold2,”
    Nat. Commun., vol. 13, no. 1, Art. no. 1, Mar. 2022, doi: 10.1038/s41467-022-28865-w.
    Description: AlphaFold2-predicted protein-protein interactions.
    [Paper], [Input Dataset], [Predictions]

  • A. G. Green, H. Elhabashy, K. P. Brock, R. Maddamsetti, O. Kohlbacher, and D. S. Marks, “Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences,”
    Nat. Commun., vol. 12, no. 1, Art. no. 1, Mar. 2021, doi: 10.1038/s41467-021-21636-z.
    Description: De novo protein interactions in the E. coli proteome.
    [Paper], [Dataset]

  • D. F. Burke et al., “Towards a structurally resolved human protein interaction network,”
    Nat. Struct. Mol. Biol., vol. 30, no. 2, Art. no. 2, Feb. 2023, doi: 10.1038/s41594-022-00910-8.
    Description: AlphaFold2-predicted structures for human protein interactions.
    [Paper], [Dataset]

Protein-Ligand Datasets

  • J. Desaphy, G. Bret, D. Rognan, and E. Kellenberger, “sc-PDB: a 3D-database of ligandable binding sites—10 years on”
    Nucleic Acids Res., vol. 43, no. D1, pp. D399–D404, Jan. 2015, doi: 10.1093/nar/gku928.
    Description: A database of ligandable sites from the PDB (all-atom description of the protein, its ligand, their binding site and their binding mode)
    [Paper], [Dataset/Server]

  • M. Naderi, R. G. Govindaraj, and M. Brylinski, “eModel-BDB: a database of comparative structure models of drug-target interactions from the Binding Database,”
    GigaScience, vol. 7, no. 8, p. giy091, Aug. 2018, doi: 10.1093/gigascience/giy091.
    Description: Database of atomic-level models of drug-protein assemblies
    [Paper], [Dataset/Server]
    PS: Read the paper to see how to extract the relevant structures/models from BDB.

  • A. Gaulton et al., “ChEMBL: a large-scale bioactivity database for drug discovery,”
    Nucleic Acids Res., vol. 40, no. D1, pp. D1100–D1107, Jan. 2012, doi: 10.1093/nar/gkr777.
    Description: A database of bioactive molecules with drug-like properties (chemical, bioactivity and genomic data)
    [Paper], [Dataset/Server]

  • P. G. Francoeur et al., “3D Convolutional Neural Networks and a CrossDocked Dataset for Structure-Based Drug Design,”
    J. Chem. Inf. Model., vol. 60, no. 9, pp. 4200–4215, Sep. 2020, doi: 10.1021/acs.jcim.0c00411.
    Description: Benchmark dataset for protein-ligand binding affinity prediction
    [Paper], [Dataset/Server]