Questions regarding the proper use of phold #33
These are awesome questions - and I will consider them (especially 1) as I write up a manuscript. To answer one by one:
My feeling is that phold is probably a bit more sensitive than a TM-score of 0.6 (perhaps with more false positives), but I can't be sure until I benchmark it. Regarding the stricter e-value cutoff for CARD, that is to reduce false positives for these types of genes, based on this paper (https://www.nature.com/articles/ismej201690). I also consider reducing false positives for AMR and virulence factor genes very important for phage therapy users, hence the stricter threshold.
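To make the idea of a database-specific cutoff concrete, here is a minimal sketch of filtering hits with a default e-value threshold and a stricter one for CARD. It is not phold's code: the hit dictionaries, the `passes_cutoff` and `filter_hits` helpers, and the `1e-10` CARD value are all hypothetical placeholders (only the `1e-3` default comes from this thread).

```python
# Illustrative only: not phold's implementation.
DEFAULT_EVALUE = 1e-3   # default e-value mentioned in this thread
STRICT_EVALUE = {
    "CARD": 1e-10,      # placeholder value, NOT phold's actual CARD cutoff
}


def passes_cutoff(evalue, database):
    """Return True if a hit's e-value is at or below the cutoff for its database."""
    cutoff = STRICT_EVALUE.get(database, DEFAULT_EVALUE)
    return evalue <= cutoff


def filter_hits(hits):
    """Keep only hits that pass the cutoff of the database they come from."""
    return [h for h in hits if passes_cutoff(h["evalue"], h["database"])]


# Hypothetical hits: the same e-value is accepted for PHROG but rejected for CARD.
example_hits = [
    {"target": "phrog_1", "database": "PHROG", "evalue": 1e-5},
    {"target": "card_1", "database": "CARD", "evalue": 1e-5},
    {"target": "card_2", "database": "CARD", "evalue": 1e-20},
]
kept = [h["target"] for h in filter_hits(example_hits)]
print(kept)
```

The smaller the e-value cutoff, the fewer borderline hits survive, which trades some sensitivity for fewer false positives in that database.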
i. Generate a per-residue pLM embedding for the query protein (ProstT5 encoder).
There is potentially scope to use the embeddings in another way, but I think I'd go along the lines of comparing them directly (e.g. https://github.com/Rostlab/EAT) rather than constructing a 10-fold classifier on PHROGs like that paper, mostly because I like the extra specificity of annotation rather than the broad PHROG category. Unknown genes are hard!
George
Thank you for the detailed answers!
Follow-up questions about performance:
Thanks for the link @bhagavadgitadu22, please let me know how you go with the overlapping gene question. To answer the performance questions: Phold's runtime scales approximately linearly with input size, so I would try 5 or 10 viruses first and see whether you have the compute to scale from there. Without knowing your setup, if you have access to anything above a laptop, 2000 viruses should be doable. Certainly split the run into 2 steps. To give you some context, on a machine with an RTX 4090 and an Intel i9-13900 (32 threads), phold took 59 minutes on 249 phages with 22k CDS. If you don't have a GPU, well,
George
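As a rough illustration of the linear-scaling estimate above, the sketch below extrapolates from the reference run quoted in this comment (59 minutes for 249 phages). The function name is hypothetical, and the estimate assumes comparable hardware and similar genome sizes, so treat the numbers as a ballpark only.

```python
def estimate_runtime_minutes(n_genomes, ref_minutes=59.0, ref_genomes=249):
    """Linearly extrapolate runtime from a reference run.

    Defaults come from the run quoted in the thread (RTX 4090, i9-13900).
    Assumes similar hardware and a similar number of CDS per genome.
    """
    if n_genomes < 0 or ref_genomes <= 0:
        raise ValueError("genome counts must be non-negative (reference > 0)")
    return ref_minutes * n_genomes / ref_genomes


# Pilot sizes suggested in the thread, then the full 2000-virus dataset.
for n in (5, 10, 2000):
    minutes = estimate_runtime_minutes(n)
    print(f"{n:>5} genomes: ~{minutes:.0f} min (~{minutes / 60:.1f} h)")
```

Timing a pilot run of 5 or 10 genomes on your own machine and passing those values as `ref_minutes` and `ref_genomes` should give a more relevant estimate than the defaults.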
Thanks for the tool, it sounds very useful! I want to use it to annotate the viruses I find in my metagenomes, but I have a few questions concerning its use:
1. In the literature (for example, the article "Structure-guided discovery of anti-CRISPR and anti-phage defense proteins" from last month), a TM-score > 0.6 between an unknown protein and a known one is used to predict the function of the unknown one. The default thresholds in Phold are an e-value of 1e-3 and a sensitivity of 9.5 for Foldseek. How do the default thresholds of phold compare to this score? My guess is that phold is less sensitive, because the aim is to get true annotations rather than extreme novelty. Also, you use a stricter e-value cutoff for CARD hits and I am not sure why.
2. Phold makes great use of sequence and structure alignments to maximise the number of protein annotations. Do you feel that large language models might improve Phold's results by providing at least the PHROG category of some unknown genes? The results obtained in "Large language models improve annotation of prokaryotic viral proteins" in 2023 sounded promising.
3. To further improve the annotations, I feel that using the colocalization of viral genes might work. PHROG incorporates a network of colocalized genes: do you think it might be leveraged to decide between several hits that would otherwise be equally likely?
4. Overlapping genes are not reported by any viral annotation tool I know. What I do so far is look for potential overlapping genes by running blastp searches within the viral genes to find potential additional genes. Would there be a way to look for likely overlapping genes and add them to the Phold output?
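For the overlapping-gene question, one small piece can be sketched independently of any tool: once candidate genes have coordinates, flag which ones overlap existing CDS. This is a minimal sketch under stated assumptions, not a Phold feature; the `(name, start, end)` tuples (1-based, inclusive, same contig) and the function name are hypothetical, and strand and reading frame are ignored.

```python
def find_overlapping_pairs(cds, min_overlap=1):
    """Return (name_a, name_b, overlap_bp) for every pair of overlapping intervals.

    `cds` is a list of (name, start, end) tuples with 1-based inclusive
    coordinates on the same contig. Strand and frame are not considered.
    """
    ordered = sorted(cds, key=lambda c: (c[1], c[2]))
    pairs = []
    for i, (name_a, start_a, end_a) in enumerate(ordered):
        for name_b, start_b, end_b in ordered[i + 1:]:
            if start_b > end_a:
                # Sorted by start, so no later interval can overlap interval a.
                break
            overlap = min(end_a, end_b) - start_b + 1
            if overlap >= min_overlap:
                pairs.append((name_a, name_b, overlap))
    return pairs


# Hypothetical coordinates: three called genes plus one candidate from a blastp search.
genes = [("gp1", 1, 300), ("gp2", 280, 600), ("gp3", 700, 900), ("cand1", 100, 250)]
overlaps = find_overlapping_pairs(genes)
print(overlaps)
```

Raising `min_overlap` filters out the short overlaps between adjacent genes that are common in compact phage genomes, leaving mostly nested or substantially overlapping candidates.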