Improve pgmap performance #13

Open
marissafujimoto opened this issue Dec 17, 2024 · 0 comments

marissafujimoto commented Dec 17, 2024

Issue Description

Based on initial benchmarks on the Fred Hutch cluster, processing a ~10 GB read file takes around 2.5 hours. Since files can exceed 100 GB depending on read depth, it's worth thinking about how performance might be improved. Here I'll briefly document some approaches and their pros and cons.

  1. Optimize the single-threaded Python implementation: profiling the code and rearranging / optimizing some of the functions might improve the speed. One example I tried was caching some of the alignment steps, which gave a modest boost in performance; however, caching done carelessly can greatly increase memory usage, which is not ideal. Another optimization is to check for perfect matches and skip the alignment when possible (see the first sketch after this list), though the speedup depends on the quality of the reads. Overall these changes are not too difficult, but no major speed improvements should be expected here.
  2. Use a faster string library. I have no POC of this, but there are fast string libraries that might implement some of the alignment steps better, e.g. https://github.com/ashvardanian/StringZilla. The performance boost is unknown, and this would add modest to significant complexity.
  3. Rewrite in a compiled language. C and Rust have good ecosystems for genomics code, but the code would become much less maintainable for the Berger Lab. Roughly a 10x improvement could be expected from the same single-threaded algorithm.
  4. Parallelize. Python can parallelize code, but we would have to spawn multiple processes to avoid the GIL, and we would want to avoid creating large intermediate files or using lots of memory. This adds significant complexity but could offer >10x speedups (or more, depending on the compute environment). It also adds user complexity, since we would likely expose optional parameters for the number of workers and other non-functional settings (see the second sketch after this list).
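
To make option 1 concrete, here is a minimal sketch of the perfect-match short-circuit plus caching idea. The function and parameter names (`align_with_mismatches`, `match_read`, `references`) are hypothetical stand-ins, not pgmap's actual API; the point is the control flow and the memory trade-off of the cache.

```python
from functools import lru_cache
from typing import Dict, Optional

# Hypothetical stand-in for the expensive alignment step; the real pgmap
# signature will differ. The cache trades memory for speed, which is why
# careless caching can blow up memory usage on large read files.
@lru_cache(maxsize=100_000)
def align_with_mismatches(read: str, reference: str) -> int:
    # Placeholder scoring: count mismatches position by position.
    return sum(a != b for a, b in zip(read, reference))

def match_read(read: str, references: Dict[str, str]) -> Optional[str]:
    """Return the name of the best-matching reference sequence."""
    for name, reference in references.items():
        # Perfect-match short-circuit: skip the alignment entirely when the
        # read is an exact substring of the reference. How often this fires
        # depends on read quality.
        if read in reference:
            return name
    if not references:
        return None
    # Fall back to the (cached) approximate alignment.
    return min(references, key=lambda name: align_with_mismatches(read, references[name]))
```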
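
And a minimal sketch of option 4, assuming gzipped FASTQ input and chunked work farmed out to a process pool. `count_chunk`, the chunk size, and the `Counter` aggregation are illustrative assumptions, not how pgmap is structured today.

```python
import gzip
import itertools
from collections import Counter
from multiprocessing import Pool

def read_sequences(fastq_path):
    """Yield the sequence line of each FASTQ record (assumes gzipped input)."""
    with gzip.open(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence lines are every 4th line, offset by 1
                yield line.strip()

def chunked(iterable, size):
    """Yield lists of up to `size` items so workers receive reads in batches."""
    iterator = iter(iterable)
    while chunk := list(itertools.islice(iterator, size)):
        yield chunk

def count_chunk(reads):
    """Hypothetical per-chunk work: map each read and tally guide counts."""
    counts = Counter()
    for read in reads:
        # ... the alignment / lookup logic would be called here ...
        counts[read[:20]] += 1  # placeholder: bucket by the first 20 bases
    return counts

def count_file(fastq_path, processes=8, chunk_size=100_000):
    """Stream partial counts back from workers and merge them in the parent."""
    totals = Counter()
    with Pool(processes) as pool:
        # imap_unordered streams results as chunks finish, so we merge small
        # Counter objects instead of writing large intermediate files.
        reads = read_sequences(fastq_path)
        for partial in pool.imap_unordered(count_chunk, chunked(reads, chunk_size)):
            totals.update(partial)
    return totals

if __name__ == "__main__":
    print(count_file("reads.fastq.gz").most_common(10))
```

The number of processes and the chunk size here would be the user-facing knobs mentioned above.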
marissafujimoto changed the title from "pgmap performance" to "Improve pgmap performance" on Dec 17, 2024