Inquiry on protein annotation #54

songmj86 · 2024-07-16T03:01:55Z

phold version: 0.2.0
Python version: 3.11.9
Operating System: Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-101-generic x86_64)

Description

Hi. I am trying to annotate viral proteins.

I do not have the expected input format (a GenBank file), because my Pharokka output was generated with the command "pharokka_proteins.py".

Is there any way to use the Pharokka output obtained from "pharokka_proteins.py" as input for phold?

Thanks !

gbouras13 · 2024-07-16T05:27:04Z

Hi @songmj86 ,

Yes. You can use these two commands:

phold proteins-predict and phold proteins-compare.

This is equivalent to running phold predict and phold compare on a GenBank/nucleotide FASTA file.

George
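As a rough illustration of the two-step protein workflow described above, the Python sketch below only assembles the two command lines and prints them. The flag names (-i, -o, --predictions_dir) are assumptions to check against "phold proteins-predict -h" and "phold proteins-compare -h", and the file and directory names are hypothetical.

```python
import shlex


def phold_protein_commands(faa, predict_dir, compare_dir):
    """Build argv lists for the two-step protein workflow.

    Flag names are assumptions; check them with "phold proteins-predict -h"
    and "phold proteins-compare -h" before running anything.
    """
    predict = ["phold", "proteins-predict", "-i", faa, "-o", predict_dir]
    compare = [
        "phold", "proteins-compare",
        "-i", faa,
        "--predictions_dir", predict_dir,
        "-o", compare_dir,
    ]
    return predict, compare


# Hypothetical file and directory names, for illustration only.
predict_cmd, compare_cmd = phold_protein_commands(
    "proteins.faa", "phold_predict_out", "phold_compare_out"
)
print(shlex.join(predict_cmd))
print(shlex.join(compare_cmd))
```

The same amino acid FASTA file is passed to both steps, and the compare step is pointed at the directory written by the predict step.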

songmj86 · 2024-07-16T06:07:50Z

Thanks !

songmj86 · 2024-07-16T06:44:31Z

I missed another question.

I used protein FASTA files as input to run Pharokka.

Are the same protein FASTA files not a suitable input format for Phold?
If so, do I need to use the nucleotide sequences of the CDSs as input to run Phold predict & compare?

Thanks!

gbouras13 · 2024-08-01T03:03:30Z

Hi @songmj86 ,

I am not sure what you mean by this question. Phold will accept amino acid FASTA (aka .faa) files using phold proteins-predict and phold proteins-compare.

George

shiraz-shah · 2024-10-02T06:39:55Z

George,
We have viral proteins instead of viral genomes, and I've tried to run proteins-predict followed by proteins-compare. Why is it so much slower than run? It's stuck on the foldseek step for days. I want to avoid run because we've already clustered our viral proteins, so doing run on the genomes would mean we have to go several steps back.

gbouras13 · 2024-10-02T06:56:29Z

Hi @shiraz-shah

In theory, if you run the same set of proteins via run or proteins-compare, the foldseek step should be identical, so this is surprising to me. To confirm that it is the foldseek step, do you have the foldseek logs?

George

shiraz-shah · 2024-10-02T10:49:50Z

Yes. Here's a top showing that foldseek has been running for 26 hours:

Here's an ls -ltrh of the log folder:

Here's what the bottom of that log looks like:

Query database size: 629827 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 372645 type: Aminoacid
Index table k-mer threshold: 78 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 372.64K 4s 795ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 372.64K 10s 329ms
Index statistics
Entries:          91326864
DB size:          1010 MB
Avg k-mer size:   1.426982
Top 10 k-mers
    DDDDDD	137912
    VVLVVV	128606
    SVVVVV	121636
    SVSVVV	114664
    VVSVVV	106457
    DPVVVV	105783
    LVVVVV	96573
    DDVVVV	77979
    PPVVVV	77605
    CVVVVV	75319
Time for index table init: 0h 0m 16s 267ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 78
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 629827
Target db start 1 to 372645
[=============================================================

And here's an ls -trh of the temp_db/latest folder. This huge file is being continuously written to, it seems:

...output_phold_compare/temp_db/latest$ lt
total 78G
-rwx------ 1 bing data 4.6K Oct  1 10:42 structuresearch.sh
-rw-r--r-- 1 bing data  78G Oct  2 12:46 pref.0
-rw-r--r-- 1 bing data  15M Oct  2 12:47 pref.index.0

What do you think? Is it because foldseek is only using a single core maybe? Foldseek can't use GPU, can it? Because it isn't.

gbouras13 · 2024-10-02T12:08:41Z

If Foldseek is using only 1 core, that would certainly explain why it is taking a while. Nothing seems wrong with Foldseek per se: it is running, and for a massive job like yours, with 630k proteins, a prefilter file of 78G and counting is not unexpected. Are you intentionally using only 1 core with -t 1 (assuming your machine has more)?

George
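To pick a value for -t, it can help to check how many cores the process is actually allowed to use. The sketch below is a minimal Python check using only the standard library; the helper name usable_cores is hypothetical, and the printed value is only a suggestion for the -t option mentioned above.

```python
import os


def usable_cores():
    """Return the number of CPU cores this process is allowed to use."""
    try:
        # Linux only: respects CPU affinity (e.g. taskset or scheduler pinning).
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total core count.
        return os.cpu_count() or 1


threads = usable_cores()
# "-t" is the thread option mentioned in the comment above.
print(f"-t {threads}")
```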

gbouras13 · 2024-10-02T12:12:55Z

Note that if you want to reduce the size of the file generated by the prefilter, change --max_seqs from its default of 10000 to e.g. 1000 or 500. This will slightly reduce Phold's sensitivity but also shorten its runtime.

Honestly, in this case, I'd just wait for it to finish; my guess is that it should only be a few hours away.

George
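To get a feel for why the prefilter file grows so large, and how --max_seqs affects it, here is a back-of-the-envelope Python sketch. It assumes that --max_seqs caps the number of prefilter hits kept per query, as the corresponding Foldseek option does, so the figures are upper bounds on hit counts rather than measured file sizes. The query and target counts are taken from the log above; the function name is hypothetical.

```python
def max_prefilter_hits(n_queries, n_targets, max_seqs):
    """Upper bound on hits the prefilter may pass on, capped per query."""
    return n_queries * min(max_seqs, n_targets)


N_QUERIES = 629_827   # "Query database size" in the log above
N_TARGETS = 372_645   # "Target database size" in the log above

for max_seqs in (10_000, 1_000, 500):
    hits = max_prefilter_hits(N_QUERIES, N_TARGETS, max_seqs)
    print(f"--max_seqs {max_seqs:>6}: at most {hits:,} hits")
```

Under that assumption the bound scales linearly with --max_seqs, so going from 10000 to 1000 lowers the worst case tenfold. Real output is usually smaller, since not every query reaches the cap.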

shiraz-shah · 2024-10-02T12:49:35Z

Thanks for this input, George. It's very useful and appreciated. I'll wait and see!

The 1 core was not intentional. I think it's the default behavior. But maybe for run the default was different?

songmj86 · 2024-10-08T10:51:43Z

@gbouras13

Hi. I have a fairly fundamental question.

What is the reason to run both "phold proteins-predict" and "phold proteins-compare" ?

Is it because the two commands annotate against different databases?

Thanks

gbouras13 · 2024-10-08T11:20:11Z

@songmj86

It is because predict requires GPU compute and compare requires CPU (ideally with many cores). In a cluster environment, it is much more resource-efficient to split phold into 2 commands, sending predict to the GPU partition and compare to the CPU partition.

George

gbouras13 added the question label ("Further information is requested") Jul 16, 2024

songmj86 closed this as completed Jul 16, 2024

songmj86 reopened this Jul 16, 2024
