v0.2.0

@gbouras13 released this 13 Jul 06:20
14e916c

You will need to re-install the updated phold database for v0.2.0 using phold install.
You will also need to upgrade Foldseek to v9.427df8a.

v0.2.0 is a very large update adding:

  • Improved sensitivity and faster runtime for the Foldseek search. The Phold database is now clustered at --min-seq-id 0.3 -c 0.8 and a cluster database is created before the search, which significantly reduces runtime
    • Overall, just over 1.1M structures are clustered into around 372k clusters
  • The --cluster-search 1 parameter is now passed to foldseek search, so the search runs against the cluster representatives first and then within each cluster. This increases sensitivity and reduces resource usage compared to phold v0.1.4
  • Changed the default --max_seqs from 1000 to 10000 to improve sensitivity at little cost in resource usage
  • The Phold database is expanded, adding:
    • Extremely conservative, high-confidence efam proteins with hits to PHROGs.
    • 95% dereplicated diversity-generating retroelements (DGRs) from Roux et al.
    • 7153 netflax toxin-antitoxin system proteins from Ernits et al.
  • Adds the --ultra_sensitive flag, which turns off Foldseek prefiltering for maximum sensitivity. Recommended for small datasets or single phages only.
    • This passes the --exhaustive-search parameter to foldseek search
  • Adds the ability to save ProstT5 embeddings with --save_per_residue_embeddings and --save_per_protein_embeddings
  • Adds support for .cif structure files (e.g. from the AlphaFold3 server) in addition to the .pdb format, and renames the relevant CLI parameters to reflect this
  • Removes some experimental parameters from v0.1.4 (--split etc.)
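
To make the search-related changes above concrete, here is a minimal Python sketch of how a foldseek search argument list might be assembled with the new defaults. It is an illustration under stated assumptions, not phold's actual implementation. The function name build_foldseek_search_args and its arguments are hypothetical, the Foldseek option is assumed to be spelled --max-seqs, and whether phold keeps --cluster-search 1 when --exhaustive-search is on is not stated in these notes.

```python
from typing import List


def build_foldseek_search_args(
    query_db: str,
    target_db: str,
    result_db: str,
    tmp_dir: str,
    max_seqs: int = 10000,          # new default in v0.2.0 (was 1000)
    ultra_sensitive: bool = False,  # mirrors phold's --ultra_sensitive flag
) -> List[str]:
    """Hypothetical helper: assemble a `foldseek search` command line.

    Not taken from the phold source; it only mirrors the options
    described in the release notes.
    """
    args = [
        "foldseek", "search",
        query_db, target_db, result_db, tmp_dir,
        # Assumption: Foldseek spells this option with a hyphen.
        "--max-seqs", str(max_seqs),
        # Search cluster representatives first, then within each cluster.
        "--cluster-search", "1",
    ]
    if ultra_sensitive:
        # Turns off prefiltering; suited to small inputs only.
        # Assumption: the option takes an explicit 1 as its value.
        args += ["--exhaustive-search", "1"]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so no Foldseek install is needed.
    print(" ".join(build_foldseek_search_args("queryDB", "pholdDB", "resultDB", "tmp")))
```

The list form can be handed to subprocess.run without shell quoting; the database names in the usage line are placeholders.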

Breaking CLI parameter changes

  • --pdb has changed to --structures
  • --pdb_dir has changed to --structure_dir
  • --filter_pdbs has changed to --filter_structures
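
For scripts written against v0.1.4, the renames above can be applied mechanically. The following Python sketch rewrites an old argument list to the v0.2.0 names. The helper migrate_args and the example command are hypothetical and only illustrate the mapping listed above; they are not part of phold.

```python
from typing import Dict, List

# Old v0.1.4 flag -> new v0.2.0 flag, as listed in the release notes.
RENAMED_FLAGS: Dict[str, str] = {
    "--pdb": "--structures",
    "--pdb_dir": "--structure_dir",
    "--filter_pdbs": "--filter_structures",
}


def migrate_args(argv: List[str]) -> List[str]:
    """Return a copy of argv with renamed flags replaced.

    Handles both `--flag value` and `--flag=value` forms. Flags are
    matched exactly, so `--pdb_dir` is never mistaken for `--pdb`.
    """
    migrated = []
    for token in argv:
        flag, sep, value = token.partition("=")
        migrated.append(RENAMED_FLAGS.get(flag, flag) + sep + value)
    return migrated


if __name__ == "__main__":
    # Illustrative command line only; the paths are placeholders.
    old = ["phold", "compare", "-i", "input.gbk", "--pdb_dir", "structures/", "--filter_pdbs"]
    print(" ".join(migrate_args(old)))
```

Matching whole tokens rather than substrings keeps values such as file paths untouched even if they happen to contain "pdb".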