This project generates protein structures with a GPT model that operates on FoldLanguage. FoldGPT introduces three kinds of tokens:
- Condition Token: encoding full information (seq, struct, and func) of the known residues.
- Prompt Token: encoding partial information (seq or func) of residues.
- Mask Token: used for learning the features of unknown residues.
Currently, only the structural vq_id is encoded as a conditional feature; sequence and function conditions are left as future work toward a multimodal generative model.
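As a rough illustration of the structure-only setup described above, the sketch below keeps the structural vq_id of each known residue and substitutes a mask token for each unknown residue. The names `MASK_ID` and `build_condition_tokens`, the value `-1`, and the 0-based positions are assumptions made for this example and are not taken from the FoldGPT code.

```python
from typing import List, Sequence

# Hypothetical placeholder id for the mask token (assumption, not FoldGPT's value).
MASK_ID = -1


def build_condition_tokens(vq_ids: Sequence[int],
                           masked_positions: Sequence[int]) -> List[int]:
    """Keep the structural vq_id of known residues, mask the unknown ones.

    `masked_positions` holds 0-based indices into `vq_ids` (an assumption
    of this sketch).
    """
    masked = set(masked_positions)
    return [MASK_ID if i in masked else v for i, v in enumerate(vq_ids)]


# Residues 1 and 2 are unknown; residues 0 and 3 keep their vq_id.
tokens = build_condition_tokens([5, 9, 2, 7], masked_positions=[1, 2])
```

Sequence and function information would need additional token types (the Prompt Token above), which this sketch does not model.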
conda env create -f environment.yml
export PYTHONPATH=project_path
CUDA_VISIBLE_DEVICES=0 python sampling.py --save_path results/unconditional --config model_zoom/unconditional/config.yaml --checkpoint model_zoom/unconditional/params.ckpt --temperature 0.5 --length 150 --nums 20 --mask_mode unconditional
This script generates protein structures from noise. The model saves `nums` generated PDB files in `save_path`; in the command above, the sampling temperature is 0.5 and each protein contains 150 residues. The maximum generation length is `PAD//2=256`. Thanks to RoPE, the length can be extended slightly beyond this limit.
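To see why rotary position embeddings (RoPE) allow some extension past the nominal length, here is a minimal pure-Python sketch. It is not FoldGPT's implementation: the interleaved pairing, the base of 10000, and the helper names `rope` and `dot` are assumptions for illustration. It shows that the attention score between a rotated query and key depends only on their relative offset, so positions beyond 256 still produce well-defined scores.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, d, 2):
        # Pair j = i // 2 uses frequency base ** (-2j / d) == base ** (-i / d).
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (3) inside and outside the nominal 256-residue range.
score_inside = dot(rope(q, 10), rope(k, 7))
score_outside = dot(rope(q, 310), rope(k, 307))
```

The two scores agree up to floating-point error. This only shows that the positional encoding itself remains well defined; how far generation quality holds at longer lengths is an empirical matter.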
Length | Fig1 | Fig2 | Fig3 |
---|---|---|---|
50 | |||
100 | |||
200 | |||
300 | |||
CUDA_VISIBLE_DEVICES=0 python sampling.py --save_path results/conditional --config model_zoom/conditional/config.yaml --checkpoint model_zoom/conditional/params.ckpt --temperature 0.5 --length 150 --nums 20 --mask_mode conditional --template 8vrwB.pdb --mask 39-51,85-98
This script generates protein structures conditioned on a structure template. The model saves `nums` generated PDB files in `save_path`, with a sampling temperature of 0.5. The structure template is passed via `--template` (`8vrwB.pdb` in the command above), and the residues in `39-51` and `85-98` are masked.
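The `--mask` value is a comma-separated list of residue ranges. The sketch below shows one way such a string could be expanded into residue indices. The function name `parse_mask` is hypothetical, and treating both range ends as inclusive is an assumption; `sampling.py` may parse the argument differently.

```python
from typing import List


def parse_mask(spec: str) -> List[int]:
    """Expand a spec such as "39-51,85-98" into a list of residue indices.

    Both ends of each range are treated as inclusive (an assumption).
    """
    indices: List[int] = []
    for part in spec.split(","):
        start_text, end_text = part.strip().split("-")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"invalid range: {part!r}")
        indices.extend(range(start, end + 1))
    return indices


masked_residues = parse_mask("39-51,85-98")
```

Under the inclusive reading, the example spec covers 13 + 14 = 27 residues of the template.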
TODO
Distributed under the Apache 2.0 License. See `LICENSE.txt` for more information.
Zhangyang Gao - [email protected]