Inquiry on protein annotation #54

Open
songmj86 opened this issue Jul 16, 2024 · 12 comments
Labels
question Further information is requested

Comments

@songmj86

  • phold version: 0.2.0
  • Python version: 3.11.9
  • Operating System: Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-101-generic x86_64)

Description

Hi. I am trying to annotate viral proteins.

I do not have the usual input format (a GenBank file), because my Pharokka output was generated with the command "pharokka_proteins.py".

Is there any way to use Pharokka output obtained from "pharokka_proteins.py" as input for phold?

Thanks !

@gbouras13 added the question (Further information is requested) label Jul 16, 2024
@gbouras13
Owner

Hi @songmj86 ,

Yes. You can use these 2 commands:

phold proteins-predict and phold proteins-compare.

This is equivalent to running phold predict and phold compare on a GenBank or nucleotide FASTA file.

George
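For readers who want to script the two steps, here is a minimal sketch that builds both command lines for a protein FASTA input. The file and directory names (proteins.faa, predictions, phold_out) are placeholders, and the flags used (-i, -o, --predictions_dir, -t) are assumed from phold's command-line help; check phold proteins-predict --help and phold proteins-compare --help for your installed version.

```python
import shlex


def build_protein_commands(faa, predictions_dir, output_dir, threads=8):
    """Return the two phold command lines for a protein FASTA input.

    Flag names are assumptions taken from phold's CLI help; all paths
    are hypothetical placeholders.
    """
    # Step 1: predict 3Di tokens from the amino acid sequences.
    predict = ["phold", "proteins-predict", "-i", faa, "-o", predictions_dir]
    # Step 2: run the comparison using the predictions from step 1.
    compare = [
        "phold", "proteins-compare",
        "-i", faa,
        "--predictions_dir", predictions_dir,
        "-o", output_dir,
        "-t", str(threads),
    ]
    return predict, compare


predict_cmd, compare_cmd = build_protein_commands(
    "proteins.faa", "predictions", "phold_out"
)
print(shlex.join(predict_cmd))
print(shlex.join(compare_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.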

@songmj86
Author

Thanks !

@songmj86 reopened this Jul 16, 2024
@songmj86
Author

I forgot to ask another question.

I used protein FASTA files as input to run Pharokka.

Are the same protein FASTA files not a suitable input format for Phold?
If not, do I need to use the CDS nucleotide sequences as input for Phold predict and compare?

Thanks!

@gbouras13
Owner

gbouras13 commented Aug 1, 2024

Hi @songmj86 ,

I am not sure what you mean by this question. Phold will accept amino acid FASTA (aka .faa) files using phold proteins-predict and phold proteins-compare.

George

@shiraz-shah

George,
We have viral proteins instead of viral genomes, and I've tried to run proteins-predict followed by proteins-compare. Why is it so much slower than run? It's stuck on the foldseek step for days. I want to avoid run because we've already clustered our viral proteins, so doing run on the genomes would mean we have to go several steps back.

@gbouras13
Owner

Hi @shiraz-shah

In theory, if you run the same set of proteins through run or proteins-compare, the foldseek step should be identical, so this is surprising to me. To confirm that it is the foldseek step, do you have the foldseek logs?

George

@shiraz-shah

Yes. Here's a top showing that foldseek has been running for 26 hours:
[image: screenshot of top output]

Here's an ls -ltrh of the log folder:
[image: screenshot of the log folder listing]

Here's what the bottom of that log looks like:

Query database size: 629827 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 372645 type: Aminoacid
Index table k-mer threshold: 78 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 372.64K 4s 795ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 372.64K 10s 329ms
Index statistics
Entries:          91326864
DB size:          1010 MB
Avg k-mer size:   1.426982
Top 10 k-mers
    DDDDDD	137912
    VVLVVV	128606
    SVVVVV	121636
    SVSVVV	114664
    VVSVVV	106457
    DPVVVV	105783
    LVVVVV	96573
    DDVVVV	77979
    PPVVVV	77605
    CVVVVV	75319
Time for index table init: 0h 0m 16s 267ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 78
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 629827
Target db start 1 to 372645
[============================================================= 

And here's an ls -trh of the temp_db/latest folder. This huge file is being continuously written to, it seems:

...output_phold_compare/temp_db/latest$ lt
total 78G
-rwx------ 1 bing data 4.6K Oct  1 10:42 structuresearch.sh
-rw-r--r-- 1 bing data  78G Oct  2 12:46 pref.0
-rw-r--r-- 1 bing data  15M Oct  2 12:47 pref.index.0

What do you think? Is it because foldseek is only using a single core maybe? Foldseek can't use GPU, can it? Because it isn't.

@gbouras13
Owner

If Foldseek is using only 1 core, that would certainly explain why it is taking a while. Nothing seems wrong with Foldseek per se: it is running, and for a massive job like yours with 630k proteins, a prefilter file of 78G and counting is not unexpected. Are you intentionally using only 1 core with -t 1 (assuming your machine has more)?

George
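As an illustration of setting the thread count explicitly rather than relying on a default, the sketch below detects how many cores the current process may use and appends -t to a command list. The helper names are hypothetical, and the -t flag is the one mentioned above; the base command shown is only a placeholder.

```python
import os


def available_cores():
    """Return the number of CPU cores this process is allowed to use."""
    try:
        # Respects CPU affinity and cgroup pinning on Linux.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity.
        return os.cpu_count() or 1


def with_threads(cmd, threads=None):
    """Return a copy of cmd with an explicit -t <threads> appended."""
    if threads is None:
        threads = available_cores()
    return [*cmd, "-t", str(threads)]


base_cmd = ["phold", "proteins-compare", "-i", "proteins.faa"]  # placeholder
print(with_threads(base_cmd))
```

On a shared cluster node it is usually safer to pass the core count granted by the scheduler than the machine total.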

@gbouras13
Owner

Note that if you want to reduce the size of the file generated by the prefilter, you can change --max_seqs from its default of 10000 to e.g. 1000 or 500. This will slightly reduce Phold's sensitivity but also shorten its runtime.

Honestly, in this case, I'd just wait for it to finish; my guess is that it is only a few hours away.

George
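To make the effect of --max_seqs concrete, here is a rough back-of-the-envelope sketch using the database sizes from the log above. It only computes an upper bound on the number of query-target pairs the prefilter can pass on; it assumes each query keeps at most max_seqs targets, and it makes no claim about bytes per hit or about how many hits Foldseek actually keeps.

```python
N_QUERIES = 629_827  # "Query database size" from the log
N_TARGETS = 372_645  # "Target database size" from the log


def max_prefilter_hits(n_queries, n_targets, max_seqs):
    """Upper bound on prefilter hits: each query keeps at most max_seqs targets."""
    return n_queries * min(max_seqs, n_targets)


for max_seqs in (10_000, 1_000, 500):
    hits = max_prefilter_hits(N_QUERIES, N_TARGETS, max_seqs)
    print(f"--max_seqs {max_seqs:>6}: at most {hits:,} prefilter hits")
```

Under that assumption the bound falls from about 6.3 billion pairs at the default to about 630 million at 1000, which is why a smaller --max_seqs would be expected to shrink the prefilter file.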

@shiraz-shah

Thanks for this input, George. It's very useful and appreciated. I'll wait and see!

The 1 core was not intentional. I think it's the default behavior. But maybe for run the default was different?

@songmj86
Author

songmj86 commented Oct 8, 2024

@gbouras13

Hi. I have a rather fundamental question.

What is the reason for running both "phold proteins-predict" and "phold proteins-compare"?

Is it because the two tools annotate against different databases?

Thanks

@gbouras13
Owner

@songmj86

It is because predict requires GPU compute, while compare runs on CPU (ideally with many cores). In a cluster environment, it is much more resource-efficient to split phold into 2 commands, sending predict to the GPU partition and compare to the CPU partition.

George
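As a sketch of what that split could look like, the snippet below renders two batch scripts, one requesting a GPU for the predict step and one requesting many CPU cores for the compare step. It assumes a SLURM scheduler, which the thread does not specify; the partition names (gpu, cpu), job names, paths, and core count are all hypothetical, and the phold flags are assumed from its command-line help.

```python
def slurm_script(job_name, partition, command, cpus=1, gpus=0):
    """Render a minimal SLURM batch script as a string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --cpus-per-task={cpus}",
    ]
    if gpus:
        # Only the predict step asks for a GPU.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", command]
    return "\n".join(lines) + "\n"


# Hypothetical partition names and paths; adjust to your cluster.
predict_job = slurm_script(
    "phold_predict", "gpu",
    "phold proteins-predict -i proteins.faa -o predictions",
    cpus=1, gpus=1,
)
compare_job = slurm_script(
    "phold_compare", "cpu",
    "phold proteins-compare -i proteins.faa "
    "--predictions_dir predictions -o phold_out -t 32",
    cpus=32,
)
print(predict_job)
print(compare_job)
```

On SLURM, the compare script could then be submitted with sbatch --dependency=afterok:<predict job ID> so that it starts only after the predict job has finished successfully.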


3 participants