Assistance needed for running 10,500 Salmonella genomes through fastANI without running into a time out error. #142

Open
brennenohunt opened this issue Dec 4, 2024 · 4 comments

Comments

@brennenohunt

I am attempting to run ~10,500 Salmonella genomes through fastANI on the Texas A&M HPRC, but the job timed out at 21 days. Per the output file, only about 2,300 genomes had been processed by then, averaging 700-800 seconds each. Another student ran 8,000 Bacteroides genomes on the same partition of the same cluster and completed the run in nine days, averaging 200-300 seconds per genome. Does anyone have recommendations for running my 10,500 genomes to completion, preferably in less than 21 days? The only difference between my script and hers is that I allotted 200 GB of memory and she allotted 350 GB. Any help would be much appreciated. I can be reached at [email protected]. Thank you!
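For scale, the figures above can be turned into a rough estimate of how long a single serial job would need and how many independent jobs it would take to fit under the wall-time limit. This is only a back-of-the-envelope sketch: it takes 750 seconds as the midpoint of the reported 700-800 seconds per genome, and it assumes that per-genome time stays constant and that the work divides evenly across jobs, neither of which is guaranteed. The function names are illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def serial_days(n_genomes: int, seconds_per_genome: float) -> float:
    """Days needed if every genome is processed one after another."""
    return n_genomes * seconds_per_genome / SECONDS_PER_DAY


def jobs_needed(n_genomes: int, seconds_per_genome: float, wall_limit_days: float) -> int:
    """Smallest number of equal-sized jobs that each finish within the limit.

    Assumes (optimistically) that work divides evenly across jobs.
    """
    return math.ceil(serial_days(n_genomes, seconds_per_genome) / wall_limit_days)


if __name__ == "__main__":
    # Figures reported in the post; 750 s is the assumed midpoint of 700-800 s.
    print(f"2,300 genomes:  {serial_days(2300, 750):.1f} days")
    print(f"10,500 genomes: {serial_days(10500, 750):.1f} days")
    print(f"jobs needed under a 21-day limit: {jobs_needed(10500, 750, 21)}")
```

Under these assumptions, 2,300 genomes take about 20 days, which matches the observed timeout, and the full set would need about 91 days serially, or at least 5 jobs running side by side.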

@cjain7
Member

cjain7 commented Dec 4, 2024

FastANI's runtime depends on several factors, including how similar the given genomes are to each other. For example, distant genomes are compared much faster because there are fewer k-mer matches to process.

Does this help?
https://github.com/ParBLiSS/FastANI?tab=readme-ov-file#parallelization
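As a rough illustration of the split-and-run idea, the sketch below divides a list of reference genome paths into chunks, writes one list file per chunk, and builds one fastANI command per chunk so that each can be submitted as a separate cluster job. It is not the script from the FastANI repository. The file names, the chunk count, and the thread count are assumptions; the `--ql`, `--rl`, `-o`, and `-t` options are the ones described in the FastANI README, and it is worth confirming them with `fastANI -h` for the installed version.

```python
from pathlib import Path


def split_list(paths, n_chunks):
    """Distribute paths round-robin into n_chunks lists."""
    return [paths[i::n_chunks] for i in range(n_chunks)]


def build_commands(query_list, genome_paths, n_chunks, workdir, threads=8):
    """Write one reference list per chunk and return one command per chunk.

    query_list   : path to a text file with one query genome path per line
    genome_paths : all reference genome paths (hypothetical example input)
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    commands = []
    for i, chunk in enumerate(split_list(genome_paths, n_chunks)):
        if not chunk:  # more chunks requested than genomes available
            continue
        ref_list = workdir / f"ref_chunk_{i}.txt"
        ref_list.write_text("\n".join(chunk) + "\n")
        out_file = workdir / f"ani_chunk_{i}.tsv"
        commands.append([
            "fastANI",
            "--ql", str(query_list),   # all queries, unchanged in every job
            "--rl", str(ref_list),     # only this chunk of the references
            "-o", str(out_file),
            "-t", str(threads),
        ])
    return commands


if __name__ == "__main__":
    genomes = [f"genomes/sample_{i}.fna" for i in range(10)]  # placeholder paths
    for cmd in build_commands("all_genomes.txt", genomes, 3, "chunks"):
        print(" ".join(cmd))
```

Each printed command could go into its own batch job. Every job sees the full query list, so every query-reference pair is still covered by exactly one job.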

@brennenohunt
Author

I will be reviewing this with my professor. If we have any other issues, I will be sure to leave another comment. Thank you!

@brennenohunt
Author

We have a question about the script provided for parallelization. Our concern is that once the database is split, the pieces are run independently, but we want to be able to compare genomes across all of them. Is there a way to combine the results? Thanks!

@cjain7
Member

cjain7 commented Dec 10, 2024

The output won't be affected; you can check this first with a smaller set (e.g., 100 genomes).
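One way to run that small-scale check is to compare the concatenated per-chunk outputs against the output of a single unsplit run on the same genomes. The sketch below assumes the tab-delimited output described in the FastANI README, where the first three columns are query, reference, and ANI. The function names and the file names in the usage note are hypothetical.

```python
def read_ani(path):
    """Map (query, reference) -> ANI string from one FastANI output file."""
    pairs = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:  # skip blank or malformed lines
                pairs[(fields[0], fields[1])] = fields[2]
    return pairs


def merge_outputs(paths):
    """Combine several per-chunk output files into one mapping."""
    merged = {}
    for path in paths:
        merged.update(read_ani(path))
    return merged


def differences(full, merged):
    """Return pairs whose ANI differs or that appear in only one result."""
    keys = set(full) | set(merged)
    return {k: (full.get(k), merged.get(k))
            for k in keys if full.get(k) != merged.get(k)}
```

For example, `differences(read_ani("ani_full.tsv"), merge_outputs(["ani_chunk_0.tsv", "ani_chunk_1.tsv"]))` should return an empty dictionary if splitting did not change the results. The comparison is only meaningful if the genome paths are written the same way in both runs.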


2 participants