This repository is the official implementation of ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning.
Quickly try our online server here.
If you have any questions about the paper or the code, feel free to raise an issue!
News
- 2024/09/27: We added the OMG database, which contains 200M protein sequences from metagenomic sequencing.
- 2024/09/04: We built ColabProTrek. ColabProTrek has joined OPMC. If you find it useful for your research, please consider also citing the original OPMC paper.
ProTrek is a tri-modal protein language model that jointly models protein sequence, structure and function (SSF). It employs contrastive learning with three core alignment strategies: (1) using structure as the supervision signal for AA sequences and vice versa, (2) mutual supervision between sequences and functions, and (3) mutual supervision between structures and functions. This tri-modal alignment training enables ProTrek to tightly associate SSF by bringing genuine sample pairs (sequence-structure, sequence-function, and structure-function) closer together while pushing negative samples farther apart in the latent space.
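To make the training objective concrete, below is a minimal sketch of a symmetric InfoNCE-style contrastive loss for one modality pair; ProTrek applies a loss of this form to all three pairs (sequence-structure, sequence-function, structure-function). The function name and normalization details are illustrative assumptions, not the repository's exact implementation.

import torch
import torch.nn.functional as F

def symmetric_infonce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float) -> torch.Tensor:
    # emb_a, emb_b: [batch, dim] embeddings of paired samples; row i of emb_a
    # matches row i of emb_b, and all off-diagonal pairs act as negatives.
    emb_a = F.normalize(emb_a, dim=-1)  # illustrative: cosine-normalize embeddings
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature  # [batch, batch] similarity matrix
    targets = torch.arange(len(emb_a), device=emb_a.device)
    # Pull genuine pairs together and push mismatched pairs apart, in both directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Tri-modal training would sum a loss of this form over the three modality pairs:
# loss = symmetric_infonce(seq, struc, t) + symmetric_infonce(seq, text, t) + symmetric_infonce(struc, text, t)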
ProTrek achieves over 30x and 60x improvements in sequence-function and function-sequence retrieval, is 100x faster than Foldseek and MMseqs2 in protein-protein search, and outperforms ESM-2 in 9 of 11 downstream prediction tasks.
conda create -n protrek python=3.10 --yes
conda activate protrek
bash environment.sh
ProTrek provides pre-trained models of different sizes (35M and 650M), as shown below. For each pre-trained model, please download all files and put them in the weights directory, e.g. weights/ProTrek_35M_UniRef50/...
| Name | Size (protein sequence encoder) | Size (protein structure encoder) | Size (text encoder) | Dataset |
|---|---|---|---|---|
| ProTrek_35M_UniRef50 | 35M parameters | 35M parameters | 130M parameters | Swiss-Prot + UniRef50 |
| ProTrek_650M_UniRef50 | 650M parameters | 150M parameters | 130M parameters | Swiss-Prot + UniRef50 |
We provide an example of downloading the pre-trained model weights:
huggingface-cli download westlake-repl/ProTrek_650M_UniRef50 \
--repo-type model \
--local-dir weights/ProTrek_650M_UniRef50
Note: if you cannot access the Hugging Face website, you can connect to a mirror site by running `export HF_ENDPOINT=https://hf-mirror.com`.
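Alternatively, you can download the weights from Python. A minimal sketch assuming the huggingface_hub package is installed:

from huggingface_hub import snapshot_download

# Download all files of the 650M checkpoint into the expected weights directory
snapshot_download(
    repo_id="westlake-repl/ProTrek_650M_UniRef50",
    repo_type="model",
    local_dir="weights/ProTrek_650M_UniRef50",
)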
To run the examples correctly and deploy your demo locally, please first download the Foldseek binary file from here and place it into the bin folder. Then add execute permission to the binary file:
chmod +x bin/foldseek
Below is an example of how to obtain embeddings and calculate similarity scores using the pre-trained ProTrek model.
import torch
from model.ProTrek.protrek_trimodal_model import ProTrekTrimodalModel
from utils.foldseek_util import get_struc_seq
# Load model
config = {
    "protein_config": "weights/ProTrek_650M_UniRef50/esm2_t33_650M_UR50D",
    "text_config": "weights/ProTrek_650M_UniRef50/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "structure_config": "weights/ProTrek_650M_UniRef50/foldseek_t30_150M",
    "load_protein_pretrained": False,
    "load_text_pretrained": False,
    "from_checkpoint": "weights/ProTrek_650M_UniRef50/ProTrek_650M_UniRef50.pt"
}
device = "cuda"
model = ProTrekTrimodalModel(**config).eval().to(device)
# Load protein and text
pdb_path = "example/8ac8.cif"

# get_struc_seq returns, for each requested chain, the amino-acid sequence and
# the Foldseek 3Di structure sequence extracted from the structure file
seqs = get_struc_seq("bin/foldseek", pdb_path, ["A"])["A"]
aa_seq = seqs[0]
foldseek_seq = seqs[1].lower()
text = "Replication initiator in the monomeric form, and autogenous repressor in the dimeric form."
with torch.no_grad():
    # Obtain protein sequence embedding
    seq_embedding = model.get_protein_repr([aa_seq])
    print("Protein sequence embedding shape:", seq_embedding.shape)

    # Obtain protein structure embedding
    struc_embedding = model.get_structure_repr([foldseek_seq])
    print("Protein structure embedding shape:", struc_embedding.shape)

    # Obtain text embedding
    text_embedding = model.get_text_repr([text])
    print("Text embedding shape:", text_embedding.shape)

    # Calculate similarity score between protein sequence and structure
    seq_struc_score = seq_embedding @ struc_embedding.T / model.temperature
    print("Similarity score between protein sequence and structure:", seq_struc_score.item())

    # Calculate similarity score between protein sequence and text
    seq_text_score = seq_embedding @ text_embedding.T / model.temperature
    print("Similarity score between protein sequence and text:", seq_text_score.item())

    # Calculate similarity score between protein structure and text
    struc_text_score = struc_embedding @ text_embedding.T / model.temperature
    print("Similarity score between protein structure and text:", struc_text_score.item())
"""
Protein sequence embedding shape: torch.Size([1, 1024])
Protein structure embedding shape: torch.Size([1, 1024])
Text embedding shape: torch.Size([1, 1024])
Similarity score between protein sequence and structure: 28.506675720214844
Similarity score between protein sequence and text: 17.842409133911133
Similarity score between protein structure and text: 11.866174697875977
"""
We provide an online server for using ProTrek. If you want to deploy the server locally, please follow the steps below:
Please follow the instructions in the Environment installation section.
Please follow the instructions in the Download Foldseek binary file section.
Please download the weights of ProTrek_650M_UniRef50 and put them into the weights directory, i.e. weights/ProTrek_650M_UniRef50/..., following the instructions in the Download model weights section.
We have built a faiss index for fast retrieval using embeddings computed by ProTrek_650M_UniRef50. Please download the faiss index from here and put it into the faiss_index directory, i.e. faiss_index/SwissProt/... You can download it with the command below.
huggingface-cli download westlake-repl/faiss_index --repo-type dataset --local-dir faiss_index/
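The downloaded index follows the repository's own directory layout, but the underlying mechanism is standard faiss inner-product search over ProTrek embeddings. Below is a minimal self-contained sketch of that idea; the use of IndexFlatIP and the placeholder data are illustrative assumptions, not necessarily the exact index type built by the repository's scripts.

import faiss
import numpy as np

dim = 1024  # ProTrek_650M embedding dimension, as shown in the example above

# Placeholder database embeddings; in practice these would come from model.get_protein_repr
db_embeddings = np.random.rand(1000, dim).astype("float32")
index = faiss.IndexFlatIP(dim)  # exact inner-product (similarity) search
index.add(db_embeddings)

# Query with, e.g., a text embedding to retrieve the top-5 most similar proteins
query = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query, 5)
print(ids[0], scores[0])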
After all data and files are prepared, you can run the server by executing the following command. Once you see the prompt "All servers are active! You can now visit http://127.0.0.1:7860/ to start to use.", you can open the specified URL to use the server.
# Important: The server will occupy the ports 7860 to 7863, please make sure these ports are available!
python demo/run_pipeline.py
You can add your custom database to the server for retrieval. Please follow the instructions below:
You can build a faiss index from a .fasta file:
python scripts/generate_database.py --fasta example/custom_db.fasta --save_dir faiss_index/Custom/ProTrek_650M_UniRef50/sequence
Next, add the index to demo/config.yaml:
...
sequence_index_dir:
  - name: Swiss-Prot
    index_dir: faiss_index/SwissProt/ProTrek_650M_UniRef50/sequence

  # Add your custom database here
  - name: Custom
    index_dir: faiss_index/Custom/ProTrek_650M_UniRef50/sequence
...

frontend:
  sequence: [
    'Swiss-Prot',
    # Add your custom database here
    'Custom',
  ]
...
Finally, you can run the server to use the custom database.
If you find ProTrek useful for your research, please consider citing the following paper:
@article{su2024protrek,
  title={ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning},
  author={Su, Jin and Zhou, Xibin and Zhang, Xuting and Yuan, Fajie},
  journal={bioRxiv},
  pages={2024--05},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}