Based on initial benchmarks on the Fred Hutch cluster, processing a ~10 GB read file takes around 2.5 hours. Since files can exceed 100 GB depending on read depth, it's worth thinking about how performance might be improved. Here I'll briefly document some approaches and their pros and cons.
Optimize the single-threaded Python implementation. Profiling the Python code and rearranging or optimizing some of the functions might improve speed. One example I tried was caching some of the alignment steps, which gave a modest boost. However, caching done carelessly can greatly increase memory usage, which is not ideal. Another optimization is checking for perfect matches so the alignment can be skipped when possible, though the speedup from this depends on read quality. Overall these changes are not too difficult, but no major speed improvements should be expected here; a sketch of both ideas is below.
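A minimal sketch of the caching and perfect-match ideas. The names here (`align_read`, `REFERENCE`, `slow_alignment`) are hypothetical stand-ins for whatever the pipeline actually uses, not the real functions:

```python
from functools import lru_cache

REFERENCE = "ACGTACGTTAGCACGT"  # placeholder reference sequence


def slow_alignment(read: str, ref: str) -> int:
    # Stand-in for the real (expensive) alignment step.
    return -1


@lru_cache(maxsize=100_000)  # bounded cache: repeated reads are aligned only once
def align_read(read: str) -> int:
    # Perfect-match short-circuit: an exact substring hit needs no alignment.
    pos = REFERENCE.find(read)
    if pos != -1:
        return pos
    return slow_alignment(read, REFERENCE)


print(align_read("TTAGC"))  # exact match, skips the alignment entirely
print(align_read("TTGGC"))  # mismatch, falls back to the full alignment
```

Bounding the cache size (`maxsize`) is what keeps the memory-usage concern above in check.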
Use a faster string library. I have no proof of concept for this, but there are fast string libraries that might implement some of the alignment steps more efficiently, e.g. https://github.com/ashvardanian/StringZilla. The performance boost is unknown, and this would add modest to significant complexity.
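A hypothetical sketch only: this assumes the StringZilla Python bindings expose an `edit_distance` helper usable as a drop-in for the distance step; the actual API should be checked against the project's README before relying on it.

```python
import stringzilla as sz  # pip install stringzilla; API below is an assumption

read = "ACGTACGTTAGC"
candidate = "ACGTACCTTAGC"

# Edit distance between a read and a candidate region, computed in native code
# rather than pure Python.
distance = sz.edit_distance(read, candidate)
print(distance)
```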
Rewrite in a compiled language. C and Rust have good ecosystems for genomics code, but the code would become much less maintainable by the Berger Lab. A roughly 10x improvement could be expected while following the same single-threaded algorithm.
Parallelize. Python can parallelize code, but we would have to spawn multiple processes to avoid the GIL, and we would want to avoid creating large intermediate files or using lots of memory. This adds significant implementation complexity, but it offers maybe >10x speedups (or more, depending on the compute environment). It also adds user-facing complexity, since we would likely expose optional parameters for the number of worker processes and other non-functional settings. A sketch of the approach is below.
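A minimal sketch of the multiprocessing approach using only the standard library, assuming reads can be processed independently in streamed chunks. `process_chunk`, the chunk size, and the placeholder path `reads.fastq` are illustrative, not the real pipeline:

```python
from itertools import islice
from multiprocessing import Pool


def process_chunk(lines):
    # Stand-in for the real per-read alignment/counting work.
    return len(lines)


def read_chunks(path, chunk_size=10_000):
    # Stream the file in fixed-size line chunks so the whole input is never
    # held in memory; keep chunk_size a multiple of 4 if the input is FASTQ
    # so records are not split across chunks.
    with open(path) as handle:
        while True:
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                return
            yield chunk


if __name__ == "__main__":
    # The number of worker processes is the kind of optional parameter we
    # would need to expose to users.
    with Pool(processes=4) as pool:
        for result in pool.imap_unordered(process_chunk, read_chunks("reads.fastq")):
            pass  # merge per-chunk results here instead of writing intermediates
```

Streaming chunks through `imap_unordered` avoids both large intermediate files and loading the full read file into memory, at the cost of having to merge per-chunk results at the end.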