- Jump to Previous Updates
- 15Oct2022 - v1.1.0
  - integrating proteinMPNN!
  - bugfix for sidechain loss
- 17Nov2022
  - updating pae/plddt loss calculation to be consistent with pae/plddt outputs
- 24Dec2022 - v1.1.1
  - adding af_pseudo_diffusion examples
  - updating to alphafold-multimer v2.3.0
  - enabling fused_triangle_multiplication by default
- 21Jan2023
  - adding support for bfloat16 (enabled by default)
- 01Mar2023
  - adding support for RFdiffusion
First, install Jax (with GPU support):
pip install "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
Second, install ColabDesign:
pip install git+https://github.com/sokrypton/ColabDesign.git@v1.1.1
# download alphafold weights
mkdir params
curl -fsSL https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar | tar x -C params
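Optionally, as a quick sanity check (not part of the official instructions), you can confirm that Jax sees the GPU before starting a design run:
import jax
print(jax.devices())  # should list a GPU/CUDA device, not only CPU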
By default mk_afdesign_model() assumes the alphafold "params" are saved in the run directory (data_dir="."). To override:
model = mk_afdesign_model(..., data_dir="/location/of")
import numpy as np
from IPython.display import HTML
from colabdesign import mk_afdesign_model, clear_mem
For a given protein backbone, generate/design a new sequence that AlphaFold thinks folds into that conformation
model = mk_afdesign_model(protocol="fixbb")
model.prep_inputs(pdb_filename="1TEN.pdb", chain="A")
model.design_3stage()
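Once the run finishes, the result can be inspected and saved; a minimal sketch, assuming the model helpers get_seqs() and save_pdb():
print(model.get_seqs())            # designed sequence(s)
model.save_pdb("1TEN_design.pdb")  # save the predicted structure of the design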
For a given length, generate/hallucinate a protein sequence that AlphaFold thinks folds into a well structured protein (high plddt, low pae, many contacts).
model = mk_afdesign_model(protocol="hallucination")
model.prep_inputs(length=100)
model.set_seq(mode="gumbel")               # initialize sequence logits with gumbel noise
model.design_soft(50)                      # pre-design: optimize softmax(logits)
model.set_seq(model.aux["seq"]["pseudo"])  # start 3-stage design from the pre-designed sequence
model.design_3stage(50,50,10)
For a given protein target and protein binder length, generate/hallucinate a protein binder sequence AlphaFold thinks will bind to the target structure. To do this, we minimize PAE and maximize number of contacts at the interface and within the binder, and we maximize pLDDT of the binder.
model = mk_afdesign_model(protocol="binder")
model.prep_inputs(pdb_filename="4MZK.pdb", chain="A", binder_len=19)
model.design_3stage(100, 100, 10)
Instead of hallucination, you can redesign an existing binder:
model.prep_inputs(pdb_filename="4MZK.pdb", chain="A", binder_chain="T")
If you have a motif (a binding motif or a functional motif) and you want to hallucinate a new scaffold around it, you can use partial hallucination. The same protocol can also be used if you have a protein and want to extend one of its loops.
af_model = mk_afdesign_model(protocol="partial")
af_model.prep_inputs(pdb_filename="6MRR.pdb", chain="A", pos="3-30,33-68", length=100)
af_model.rewire(loops=[36])  # place the remaining 36 residues as a loop between the two fixed segments
af_model.restart()
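After restarting, the scaffold can be generated with any of the design protocols; for example (illustrative, reusing the 3-stage schedule from above):
af_model.design_3stage(100, 100, 10)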
Loss weights can be set using the provided function:
model.set_weights(pae=0.0, plddt=1.0)
or the dictionary directly:
model.opt["weights"]["pae"] = 0.0
model = mk_afdesign_model(num_recycles=1, recycle_mode="average")
# if recycle_mode in ["average","last","sample","first"] the number of recycles can change during optimization
model.set_opt(num_recycles=1)
- num_recycles - number of recycles to use during design (for de novo proteins we find 0 is often enough)
- recycle_mode - optimizing across all recycles can be tricky; we experimented with a couple of ways (see the sketch after this list):
  - last - use loss from the last recycle. (Default)
  - average - compute loss at each recycle and average gradients. (Previous default from v1.0.5)
  - sample - same as last, but each iteration a different number of recycles is used.
  - first - use loss from the first recycle.
  - add_prev - average the outputs (dgram, plddt, pae) across all recycles before computing loss.
  - backprop - use loss from the last recycle, but backprop through all recycles.
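Because the number of recycles can be changed mid-run for these modes, one possible schedule is to design cheaply with few recycles and add recycles later; a minimal sketch (illustrative settings, not recommendations):
model = mk_afdesign_model(protocol="hallucination", num_recycles=0, recycle_mode="sample")
model.prep_inputs(length=100)
model.design_soft(50)            # early iterations with 0 recycles (cheap)
model.set_opt(num_recycles=3)    # allow more recycles for the later stages
model.design_3stage(50, 50, 10)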
By default all five models are used during optimization. If num_models > 1, then multiple model params are evaluated at each iteration and the gradients/losses are averaged. Each iteration a random set of model params is used unless sample_models=False.
model = mk_afdesign_model(num_models=1, sample_models=True)
# or
model.set_opt(num_models=1, sample_models=True)
- num_models - number of model params to use at each iteration.
- sample_models
  - True - randomly select model params to use. (Recommended)
  - False - use the same model params each iteration.

You can also specify exactly which models are used during any of the design protocols:
model.design_(num_models=1, sample_models=True, models=[0,2,3])
# or
model.design_(num_models=2, sample_models=False, models=["model_1_ptm","model_3_ptm"])
You may need to download the OpenFold weights first:
for W in openfold_model_ptm_1 openfold_model_ptm_2 openfold_model_no_templ_ptm_1
do wget -qnc https://files.ipd.uw.edu/krypton/openfold/${W}.npz -P params; done
Once downloaded:
model = mk_afdesign_model(use_openfold=True, use_alphafold=False)
model.prep_inputs(..., hotspot="1-10,15,3")
model.prep_inputs(..., chain="A,B")
model.prep_inputs(pdb_filename="6Q40.pdb", chain="A,B", copies=2, homooligomer=True)
WARNING, this functionality assumes the input chains are of equal length.
model.restart(rm_aa="C,W")
model.restart(seed=0)
- For design we provide 5 different functions:
  - design_logits() - optimize logits inputs (continuous)
  - design_soft() - optimize softmax(logits) inputs (probabilities)
  - design_hard() - optimize one_hot(logits) inputs (discrete)
- For complex topologies, we find directly optimizing the one_hot encoded sequence with design_hard() to be very challenging. To get around this problem, we propose optimizing in 3 stages, or first learning logits and then switching to semigreedy optimization (see the sketch after this list):
  - design_3stage() - gradient based optimization (GD) (logits → soft → hard)
  - design_semigreedy(tries=X) - tries X random mutations, accepts those that decrease loss
  - design_pssm_semigreedy(tries=X) - uses GD to get a sequence profile (PSSM), then uses the PSSM to bias semigreedy optimization. (Recommended)
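For example, a sketch of the two routes above (iteration counts and tries are arbitrary, not recommendations):
# gradient-based only: logits → soft → hard
model.design_3stage(100, 100, 10)
# or: learn a PSSM with gradient descent, then run PSSM-biased semigreedy optimization
model.design_pssm_semigreedy(tries=20)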
- general losses
  - pae - minimizes the predicted alignment error
  - plddt - maximizes the predicted LDDT
  - pae and plddt values are between 0 and 1 (where lower is better for both)
- fixbb specific losses
  - dgram_cce - minimizes the categorical-crossentropy between the predicted distogram and one extracted from the pdb
  - fape - minimizes the difference between coordinates (see AlphaFold paper)
  - we find the dgram_cce loss to be more stable for design (compared to fape)
- hallucination specific losses
  - con - maximize 1 contact per position: model.set_opt("con", num=1)
- binder specific losses
  - pae - minimize PAE at interface and within binder
  - con - maximize 2 contacts per binder position, within the binder: model.set_opt("con", num=2)
  - i_con - maximize 1 contact per binder position, at the interface: model.set_opt("i_con", num=1)
- partial hallucination specific losses
  - sc_fape - sidechain-specific fape
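The relative weight of any of the losses above can be adjusted with set_weights() or by editing model.opt["weights"] directly; an illustrative sketch (the values are arbitrary, not recommendations):
# e.g. for the binder protocol: emphasize interface contacts, de-emphasize plddt
model.set_weights(i_con=1.0, plddt=0.05)
print(model.opt["weights"])  # inspect all current loss weights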
By default, 2 [con]tacts per position are optimized to be within cβ-cβ < 14.0Å and sequence separation ≥ 9. This can be changed with:
model.set_opt(con=dict(cutoff=8, seqsep=5, num=1))
For interface:
model.set_opt(i_con=dict(...))
By default, we use stochastic gradient descent, set_optimizer(optimizer="sgd", learning_rate=0.1, norm_seq_grad=True), for optimization. This seems to work quite well for the default problems, but if you want to try other optimizers, ColabDesign is now fully integrated with all Optax optimizers.
Example of how to change the optimizer. Note that the default learning_rate of 0.1 was calibrated for sgd with gradient normalization; a different learning_rate may be more optimal for other optimization settings.
model = mk_afdesign_model(optimizer="adam", learning_rate=0.01)
Or for more control (or to change settings after model initialization), use:
model.set_optimizer(optimizer="adam", learning_rate=0.01, b1=0.9, b2=0.999)
By default, the gradients for the sequence parameters are normalized. We find this helps with convergence. To disable:
model.set_opt(norm_seq_grad=False)
Gradient descent updates multiple positions each iteration, which can be a little too aggressive during hard (discrete) mode. Instead, one can try (tries) a few random mutations and accept the one with the lowest loss. If use_plddt=True, the random mutations will be biased towards positions with low pLDDT.
model.design_3stage(hard_iters=0)  # run the gradient-based stages only, skip the discrete (hard) stage
# set number of model params to evaluate at each iteration
num_models = 2 if model.args["use_templates"] else 5
model.design_semigreedy(iters=10, tries=20, num_models=num_models, use_plddt=True)
I was getting better results before the major update (19June2022), how do I revert back to the old settings?
We are actively trying to find the best weights (model.opt["weights"]) and settings (model.opt) for each protocol.
Please send us a note if you find something better! To revert back to the old settings, do the following after prepping the model:
- fixbb:
model.set_weights(dgram_cce=1, pae=0.1, plddt=0.1)
model.design_3stage()
- hallucination:
model.set_seq(mode="gumbel")
model.set_weights(pae=1, plddt=1, con=0.5)
model.set_opt("con", binary=True, cutoff=21.6875, num=model._len, seqsep=0)
model.design_2stage(100, 100, 10)
- binder hallucination:
model.set_weights(plddt=0.1, pae=0.1, i_pae=1.0, con=0.1, i_con=0.5)
model.set_opt("con", binary=True, cutoff=21.6875, num=model._binder_len, seqsep=0)
model.set_opt("i_con", binary=True, cutoff=21.6875, num=model._target_len)
model.design_3stage(100, 100, 10)
def design_custom(self):
# set options
self.set_opt(dropout=True, soft=False, hard=False)
# set number of recycles
self.set_opt(num_recycles=0)
# take 100 steps
for _ in range(100): self.step()
# increase weight for plddt
self.set_weights(plddt=2.0)
# take another 100 steps
for _ in range(100): self.step()
# increase number of recycles
self.set_opt(num_recycles=1)
# take another 100 steps
for _ in range(100): self.step()
# etc...
model = mk_afdesign_model()
design_custom(model)
Looking for more control over afdesign? The callback functions have gotten much smarter. Based on your input arguments, they will automatically fetch the variables of interest. You can now define your own custom losses and params to optimize, modify inputs before alphafold is run, and modify auxiliary outputs.
import jax
import jax.numpy as jnp

def custom_pre_callback(inputs, aux, opt, key):
  inputs["aatype"] = inputs["aatype"].at[:].set(0)
  aux["pre"] = opt["pre"] + jax.random.randint(key, [], 0, 10)
def custom_post_callback(outputs, aux):
aux["post"] = outputs["structure_module"]
def custom_loss_callback(outputs, params):
loss = jnp.square(outputs["structure_module"]["final_atom14_positions"] + params["custom_param"]).mean()
return {"custom_loss":loss}
af_model = mk_afdesign_model(protocol="fixbb",
pre_callback=custom_pre_callback,
post_callback=custom_post_callback,
loss_callback=custom_loss_callback)
af_model._params["custom_param"] = 1.0
af_model.opt["weights"]["custom_loss"] = 0.1
af_model.opt["pre"] = 100
af_model.prep_inputs(pdb_filename=get_pdb("1TEN"), chain="A")
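With the callbacks registered, any design protocol will call them at every step; a minimal sketch to exercise them (assuming the aux dictionary written by the callbacks is exposed as af_model.aux, as in the hallucination example above):
af_model.design_soft(10)     # run a few iterations with the custom callbacks active
print(af_model.aux["post"])  # structure-module output added by custom_post_callback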
- 24Feb2022 - "Beta" branch started. Refactoring code to allow homooligomeric hallucination/design and averaging gradients across recycles (which is now the default). Minor changes include renaming intra_pae/intra_con to pae/con and inter_pae/inter_con to i_pae/i_con for clarity.
- 28Feb2022 - We find backprop through structure module to be unstable, all functions have been updated to only use distogram by default. The definition of contact has changed to minimize entropy within distance cutoff.
- 02May2022 - The design.py code has been split up into multiple python files under src/
- 14May2022 - Adding support for partial hallucination (if you want to constrain one part and generate structure/sequence for rest).
- 19June2022 - "Beta" branch is now the "Main" branch. WARNING: Lots of default settings and weights were changed. Click here for info on how to revert back to old settings.
- 28June2022 - v1.0.1 - Major code reorganization/refactoring to add support for callbacks (to allow integration w/ other tools during design) and to avoid clashes with existing trrosetta/alphafold installations (e.g. af → colabdesign, af.src → colabdesign.af and alphafold → colabdesign.af.alphafold).
- 05July2022 - v1.0.2 - Major code cleanup, removing duplicate code. Adding support for custom loss functions.
- 11July2022 - v1.0.3 - Improved homo-oligomeric support. RMSD and dgram losses have been refactored to automatically save aligned coordinates. Multimeric coordinates now saved with chain identifiers.
- 23July2022 - v1.0.4 - Adding support for openfold weights. To enable, set mk_afdesign_model(..., use_openfold=True).
- 31July2022 - v1.0.5 - Refactoring to add support for swapping batch features without recompile. Allowing for implementation of AF2Rank!
- 09Sept2022 - v1.0.6
  - support for alphafold-multimer: model = mk_afdesign_model(..., use_multimer=True)
  - support for experimentally resolved loss: model.set_weights(exp_res=1)
  - support for multichain design/hallucination for fixbb, hallucination and partial protocols: model.prep_inputs(..., copies=2)
  - support to fix the sequence for certain positions: model.prep_inputs(..., fix_pos="1-10") (supported in protocols "fixbb" and "partial")
  - binder protocol improved; the prior protocol would try to optimize the number of contacts per target, the new default is to optimize the number of contacts per binder position. The number of contacts per binder position can be controlled with model.set_opt("i_con", num=1) and the number of positions that should be in contact with model.set_opt("i_con", num_pos=5)
  - implementing a David Jones-like protocol for semi-greedy optimization, where positions are selected based on plddt and, after 20 tries, the mutation that decreases the loss the most is accepted: model.design_semigreedy()
  - WARNING: the returned pLDDT is now in the "correct" direction (higher is better)
  - removing recycle dimension from the input features (to standardize with multimer inputs)
  - removing all dependence on TensorFlow
- 14Sept2022 - v1.0.7
  - refactoring design.py to add the design_pssm_semigreedy() protocol, which is a wrapper around design_semigreedy(seq_logits=) and can be used to input/learn a PSSM for biased optimization.
  - adding example peptide_binder_design.ipynb targeted for peptide binder hallucination/design.
  - adding finer control over what models are used during optimization.
  - fixing RAM memory leaks, clear_mem() now also does garbage collection
  - fixing integration with TrDesign that got broken in v1.0.6
- 22Sept2022 - v1.0.8
  - custom callback functions ([pre|loss|post]_callback) have been refactored to be more flexible.
    - Supported input arguments include: ["inputs", "outputs", "params", "opt", "seq", "aux", "key"].
    - The pre_callback function can be used to modify inputs before prediction, loss_callback to add a custom loss.
  - adding support for Optax optimizers
- 24Sept2022 - v1.0.9
  - adding contrib section where user contributed modifications and protocols will go.