A4Bio/FoldToken_open

FoldToken: A generative protein structure language!

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Downstream Tasks
  5. Dataset & Model
  6. License
  7. Contact
  8. Citation

About The Project

This project aims to create a protein structure language that unifies the modalities of protein sequence and structure. Here we provide the open-source code of FoldToken4 for research purposes. Researchers are welcome to use or contribute to this project. The structure language is not yet ideal for every downstream task, and we need more feedback to improve it further.

Here's why we introduce a structure language:

  • Unification: If we can convert data of any modality to a sequence representation, we can use a single unified transformer model to address problems across protein modeling.
  • Simplification: Structure modeling generally requires complex and inefficient model design. In comparison, the highly optimized transformer is more suitable for scaling up.

Getting Started

conda env create -f environment.yml

Usage

Reconstruct Protein Structures
export PYTHONPATH=project_path

CUDA_VISIBLE_DEVICES=0 python foldtoken/reconstruct.py --path_in ./N128 --path_out ./N128_pred --level 8

One can use this script to validate the reconstruction performance of FoldToken4. The model encodes the input PDB files in path_in, reconstructs them, and saves the reconstructed structures to path_out. Users can specify config and checkpoint to select the appropriate model. The codebook size is $2^{level}$, i.e., $2^{8}$ in the example.

[Figure: example structures 8ybxR and 8vy8E]
8ybxR: [222, 120, 78, 191, 184, 3, 190, 182, 182, 4, 51, 254, 210, 252, 72, 188, 121, 86, 188, 236, 236, 237, 24, 195, 47, 248, 247, 192, 74, 79, 82, 27, 199, 167, 170, 70, 45, 32, 215, 14, 14, 254, 254, 59, 38, 166, 115, 98, 53, 1, 106, 79, 79, 79, 166, 240, 181, 162, 179, 96, 16, 69, 211, 112, 113, 197, 49, 56, 246, 122, 214, 119, 50, 252, 51, 51, 171, 151, 41, 185, 207, 216, 153, 243]
8vy8E: [13, 186, 190, 211, 51, 178, 252, 119, 50, 103, 112, 6, 185, 190, 228, 3, 81, 139, 139, 116, 127, 242, 242, 182, 251, 38, 195, 195, 244, 86, 225, 44, 250, 180, 227, 39, 57, 142, 237, 49, 251, 51, 190, 26, 88, 139, 218, 2, 239, 43, 43, 215, 124, 60, 205, 195, 98, 166, 1, 242, 127, 191, 102, 41, 240, 211, 54, 19, 219, 194, 113, 16, 179, 162]
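
The relation between --level and the codebook size can be checked with a few lines of Python. This is only a minimal sketch, not part of the FoldToken code: the helper names `codebook_size` and `ids_fit_codebook` are hypothetical, and it assumes the IDs are zero-based indices into the codebook, which the values listed above are consistent with but which is not stated here.

```python
def codebook_size(level: int) -> int:
    """Number of codes available at a given --level, i.e. 2**level."""
    if level < 0:
        raise ValueError("level must be non-negative")
    return 2 ** level


def ids_fit_codebook(ids, level: int) -> bool:
    """True if every ID lies in [0, 2**level).

    ASSUMPTION: IDs are zero-based codebook indices.
    """
    size = codebook_size(level)
    return all(0 <= i < size for i in ids)


# First ten IDs of 8ybxR, copied from the listing above.
ids_8ybxR = [222, 120, 78, 191, 184, 3, 190, 182, 182, 4]

print(codebook_size(8))                # 256
print(ids_fit_codebook(ids_8ybxR, 8))  # True
print(ids_fit_codebook(ids_8ybxR, 7))  # False, since 222 >= 128
```

A check like this can catch a mismatch between the level used for tokenizing and the level expected by a downstream model.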
Batch tokenizing structures
export PYTHONPATH=project_path

CUDA_VISIBLE_DEVICES=0 python extract_vq_ids.py --path_in ./N128 --save_vqid_path ./N128_vqid.jsonl --level 8

CUDA_VISIBLE_DEVICES=0 python extract_vq_ids_jsonl.py --path_in ./pdb.jsonl --save_vqid_path ./N128_vqid.jsonl --level 8

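One can use the scripts above to extract VQ IDs from the PDB files given by path_in and save them to save_vqid_path. The first command reads a directory (./N128 in the example), while the second reads a JSONL file (./pdb.jsonl in the example). Users can specify config and checkpoint to select the appropriate model. The codebook size is $2^{level}$, i.e., $2^{8}$ in the example.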
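
Once the IDs have been saved, they can be loaded for downstream use. The sketch below is illustrative only: the schema of the saved file is not documented in this README, so it assumes that each non-empty line is a JSON object mapping a structure name to its list of integer IDs. The function name `load_vq_ids` is hypothetical and is not part of the repository. Adjust the parsing if the actual file is laid out differently.

```python
import json
from pathlib import Path


def load_vq_ids(jsonl_path):
    """Read a JSONL file of VQ IDs into a dict {name: [ids]}.

    ASSUMPTION: each non-empty line is a JSON object whose keys are
    structure names and whose values are lists of integer IDs, e.g.
    {"8ybxR": [222, 120, 78]}. The real output format may differ.
    """
    tokens = {}
    with Path(jsonl_path).open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            for name, ids in record.items():
                tokens[name] = [int(i) for i in ids]
    return tokens
```

For example, `load_vq_ids("./N128_vqid.jsonl")` would return a dictionary keyed by structure name, provided the file follows the assumed layout.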

Downstream Tasks

  • Structure Generation (struct->struct)
    • Unconditional Generation
    • Inpainting & Scaffolding
    • Binder Design
  • Inverse Folding (struct->seq)
  • Protein Folding (seq->struct)
    • Single-chain Folding
    • MSA Folding
  • Function Prediction (struct->func)

Dataset & Model

| Dataset | Link | Samples | Comments |
|---|---|---|---|
| PDB | Download | 162,118 | Used for pretraining, multi-chain data |
| CATH4.3 | Download | 22,508 | Single-chain data |
| N128 | Download | 128 | Single-chain data, for evaluation |
| T116 | Download | 493 | Single-chain data, for evaluation |
| T493 | Download | 493 | Single-chain data, for evaluation |
| M1031 | Download | 1,031 | Protein complex data, for evaluation |

| Model | Link |
|---|---|
| FoldToken4 | ckpt |

License

Distributed under the Apache 2.0 License. See LICENSE.txt for more information.

Contact

Zhangyang Gao - [email protected]

Citation

If you are interested in our repository or our papers, please cite the following papers:

@article{gao2023vqpl,
  title={Vqpl: Vector quantized protein language},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={arXiv preprint arXiv:2310.04985},
  year={2023}
}

@article{gao2024foldtoken,
  title={Foldtoken: Learning protein language via vector quantization and beyond},
  author={Gao, Zhangyang and Tan, Cheng and Wang, Jue and Huang, Yufei and Wu, Lirong and Li, Stan Z},
  journal={arXiv preprint arXiv:2403.09673},
  year={2024}
}
@article{gao2024foldtoken2,
  title={FoldToken2: Learning compact, invariant and generative protein structure language},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={bioRxiv},
  pages={2024--06},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
@article{gao2024foldtoken3,
  title={FoldToken3: Fold Structures Worth 256 Words or Less},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={bioRxiv},
  pages={2024--07},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
@article{gao2024foldtoken4,
  title={FoldToken4: Consistent \& Hierarchical Fold Language},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={bioRxiv},
  pages={2024--08},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
