Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bam group by speedup #164

Merged
merged 8 commits into from
Sep 14, 2023
Merged

Bam group by speedup #164

merged 8 commits into from
Sep 14, 2023

Conversation

Lioscro
Copy link
Collaborator

@Lioscro Lioscro commented Jan 18, 2022

Using hits.utilities.group_by happens to be very slow for some reason, so I implemented our own group_bam_by_key function, which achieves approx. 100x speedup.

Branches from remove-intbc-logging #163

@Lioscro Lioscro self-assigned this Jan 18, 2022
@codecov
Copy link

codecov bot commented Jan 20, 2022

Codecov Report

Patch coverage: 93.87% and no project coverage change.

Comparison is base (013f615) 79.53% compared to head (fe28c8b) 79.54%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #164   +/-   ##
=======================================
  Coverage   79.53%   79.54%           
=======================================
  Files          89       89           
  Lines        7931     7948   +17     
=======================================
+ Hits         6308     6322   +14     
- Misses       1623     1626    +3     
Files Changed Coverage Δ
cassiopeia/preprocess/cassiopeia_preprocess.py 21.81% <0.00%> (-1.26%) ⬇️
cassiopeia/preprocess/pipeline.py 91.15% <ø> (ø)
cassiopeia/preprocess/UMI_utils.py 94.71% <100.00%> (+0.34%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Lioscro Lioscro changed the base branch from master to remove-intbc-logging January 24, 2022 17:27
@Lioscro Lioscro marked this pull request as ready for review January 31, 2022 23:05
@Lioscro Lioscro requested a review from mattjones315 January 31, 2022 23:05
@Lioscro Lioscro changed the base branch from remove-intbc-logging to master May 18, 2022 14:08
@colganwi
Copy link
Collaborator

For me this PR does not speedup cas.pp.collapse at all. @Lioscro can you provide a test case or more details about the compute environment? Maybe this is faster with memory constraints?

@Lioscro
Copy link
Collaborator Author

Lioscro commented Sep 13, 2023

It's been a hot minute since I've worked on this, so I'm having a hard time providing any additional context, but I believe the speedup was achieved for very large BAMs.

But given my limited recollection, I'd trust your experience more than whatever I can remember, so I'm fine if you decide this PR is no longer needed.

@colganwi colganwi self-requested a review September 14, 2023 20:53
@colganwi
Copy link
Collaborator

I tested with a 15GB bam file and see a 2X speed up. MD5 checksums of the outputs are the same. Nice work @Lioscro!

Copy link
Collaborator

@colganwi colganwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@colganwi colganwi merged commit 9f272fc into master Sep 14, 2023
@colganwi colganwi deleted the bam-group-by-speedup branch September 14, 2023 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants