Bam group by speedup #164
Conversation
Codecov Report
Patch coverage:

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master     #164   +/-  ##
=======================================
  Coverage   79.53%   79.54%
=======================================
  Files          89       89
  Lines        7931     7948    +17
=======================================
+ Hits         6308     6322    +14
- Misses       1623     1626     +3
```

☔ View full report in Codecov by Sentry.
For me, this PR does not provide a speedup.
It's been a hot minute since I've worked on this, so I'm having a hard time providing any additional context, but I believe the speedup was achieved for very large BAMs. But given my limited recollection, I'd trust your experience more than whatever I can remember, so I'm fine if you decide this PR is no longer needed.
I tested with a 15 GB BAM file and see a 2x speedup. MD5 checksums of the outputs are the same. Nice work @Lioscro!
Looks good to me!
Using `hits.utilities.group_by` happens to be very slow for some reason, so I implemented our own `group_bam_by_key` function, which achieves approx. a 100x speedup.

Branches from remove-intbc-logging (#163).
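The PR does not show the implementation here, but the core idea of grouping BAM records by a shared key can be sketched with a single streaming pass over key-sorted records. The sketch below is a hypothetical illustration, not the PR's actual `group_bam_by_key` code; the record format and function name are assumptions for the example.

```python
from itertools import groupby

def group_records_by_key(records, key):
    """Yield (key, group) pairs for consecutive records sharing a key.

    Assumes the input iterable is already sorted by `key`, so grouping
    takes one streaming pass with no intermediate dict -- one plausible
    way to stay fast on very large BAMs.
    """
    for k, group in groupby(records, key=key):
        yield k, list(group)

# Toy stand-in for BAM records: (read_name, cell_barcode) tuples,
# sorted by barcode so consecutive grouping is valid.
records = [
    ("r1", "ACGT"),
    ("r2", "ACGT"),
    ("r3", "TTTT"),
]
grouped = dict(group_records_by_key(records, key=lambda r: r[1]))
```

Here `grouped` maps each barcode to its list of records; a real implementation would iterate over alignments from a BAM reader instead of tuples.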