Based on initial benchmarks on the Fred Hutch cluster, processing a ~10 GB read file takes around 2.5 hours. Since files can exceed 100 GB depending on read depth, it's worth thinking about how performance might be improved. Here I'll briefly document some approaches and their pros and cons.
Optimize the single-threaded Python implementation. Profiling the Python code and rearranging or optimizing some of the functions might improve speed. One example I tried was caching some of the alignment steps, which gave a modest boost. However, caching done carelessly can greatly increase memory usage, which is not ideal. Another optimization is checking for perfect matches so the alignment can be skipped when possible, though the speedup from this depends on read quality. Overall these changes are not too difficult, but no major speed improvements should be expected here; a sketch of both ideas is below.
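A minimal sketch of the caching and perfect-match ideas. The names here (`align_read`, `REFERENCE`, `slow_alignment`) are hypothetical stand-ins for whatever the pipeline actually uses, not the real functions:

```python
from functools import lru_cache

REFERENCE = "ACGTACGTTAGCACGT"  # placeholder reference sequence


def slow_alignment(read: str, ref: str) -> int:
    # Stand-in for the real (expensive) alignment step.
    return -1


@lru_cache(maxsize=100_000)  # bounded cache: repeated reads are aligned only once
def align_read(read: str) -> int:
    # Perfect-match short-circuit: an exact substring hit needs no alignment.
    pos = REFERENCE.find(read)
    if pos != -1:
        return pos
    return slow_alignment(read, REFERENCE)


print(align_read("TTAGC"))  # exact match, skips the alignment entirely
print(align_read("TTGGC"))  # mismatch, falls back to the full alignment
```

Bounding the cache size (`maxsize`) is what keeps the memory-usage concern above in check.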
Use a faster string library. I have no proof of concept for this, but there are fast string libraries that might implement some of the alignment steps more efficiently, e.g. https://github.com/ashvardanian/StringZilla. The performance boost is unknown, and this would add modest to significant complexity.
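A hypothetical sketch only: this assumes the StringZilla Python bindings expose an `edit_distance` helper usable as a drop-in for the distance step; the actual API should be checked against the project's README before relying on it.

```python
import stringzilla as sz  # pip install stringzilla; API below is an assumption

read = "ACGTACGTTAGC"
candidate = "ACGTACCTTAGC"

# Edit distance between a read and a candidate region, computed in native code
# rather than pure Python.
distance = sz.edit_distance(read, candidate)
print(distance)
```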
Rewrite in a compiled language. C and Rust have good ecosystems for genomics code, but the code would become much less maintainable by the Berger Lab. A roughly 10x improvement could be expected while following the same single-threaded algorithm.
Parallelize. Python can parallelize code, but we would have to spawn multiple processes to avoid the GIL, and we would want to avoid creating large intermediate files or using lots of memory. This adds significant implementation complexity, but it offers maybe >10x speedups (or more, depending on the compute environment). It also adds user-facing complexity, since we would likely expose optional parameters for the number of worker processes and other non-functional settings. A sketch of the approach is below.
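A minimal sketch of the multiprocessing approach using only the standard library, assuming reads can be processed independently in streamed chunks. `process_chunk`, the chunk size, and the placeholder path `reads.fastq` are illustrative, not the real pipeline:

```python
from itertools import islice
from multiprocessing import Pool


def process_chunk(lines):
    # Stand-in for the real per-read alignment/counting work.
    return len(lines)


def read_chunks(path, chunk_size=10_000):
    # Stream the file in fixed-size line chunks so the whole input is never
    # held in memory; keep chunk_size a multiple of 4 if the input is FASTQ
    # so records are not split across chunks.
    with open(path) as handle:
        while True:
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                return
            yield chunk


if __name__ == "__main__":
    # The number of worker processes is the kind of optional parameter we
    # would need to expose to users.
    with Pool(processes=4) as pool:
        for result in pool.imap_unordered(process_chunk, read_chunks("reads.fastq")):
            pass  # merge per-chunk results here instead of writing intermediates
```

Streaming chunks through `imap_unordered` avoids both large intermediate files and loading the full read file into memory, at the cost of having to merge per-chunk results at the end.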