Ambiguous greedy & parallel dissimilarity computation #226
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   79.54%   79.49%   -0.06%
==========================================
  Files          89       89
  Lines        7948     8050     +102
==========================================
+ Hits         6322     6399      +77
- Misses       1626     1651      +25

☔ View full report in Codecov by Sentry.
Great work @mattjones315! The only major comment I have is about using shared memory for the dissimilarity map computation. With large trees I suspect that copying the character matrix for each thread will be slow and may lead to memory overflow. Based on some quick research I think this could be fixed using multiprocessing.shared_memory, but I admit I haven't thought about it too deeply. If you think this is too much work and out of scope, I'm happy to table it for now.
I also left a few minor comments re: formatting and the threads parameter. Happy to clarify anything if needed. Once these are addressed and the tests are fixed, we can merge.
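For reference, a minimal sketch of the shared-memory idea, under stated assumptions: pairwise_in_shared_memory and _worker are hypothetical names rather than anything in Cassiopeia, and the per-row distance is a plain mismatch count standing in for the real dissimilarity function. The point is that each worker attaches to one shared buffer instead of receiving a pickled copy of the character matrix.

```python
import numpy as np
from multiprocessing import Pool, shared_memory

def _worker(args):
    shm_name, shape, dtype, row_batch = args
    # Attach to the existing shared block; no copy of the matrix is made.
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        cm = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        # Hypothetical per-batch work: mismatch counts for these rows.
        return [(i, j, int(np.sum(cm[i] != cm[j])))
                for i in row_batch for j in range(i + 1, shape[0])]
    finally:
        shm.close()

def pairwise_in_shared_memory(character_matrix, n_workers=4):
    cm = np.ascontiguousarray(character_matrix)
    shm = shared_memory.SharedMemory(create=True, size=cm.nbytes)
    try:
        buf = np.ndarray(cm.shape, dtype=cm.dtype, buffer=shm.buf)
        buf[:] = cm  # one copy into shared memory, instead of one per worker
        batches = np.array_split(np.arange(cm.shape[0]), n_workers)
        with Pool(n_workers) as pool:
            results = pool.map(
                _worker,
                [(shm.name, cm.shape, cm.dtype, b) for b in batches],
            )
        return [item for batch in results for item in batch]
    finally:
        shm.close()
        shm.unlink()
```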
cassiopeia/config.ini
Outdated
Should be removed from the commit. I've also had issues with cassiopeia/config.ini changes being included even though it's in the .gitignore. Do you know how to fix this?
Hmm, that is weird. I agree, this should be removed from the commit.
I think the issue here is that the config is tracked, but only as a default. So .gitignore gets confused because the file does exist. I propose we put the default config in ./data and specify in the README how to utilize it.
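A likely fix for the tracked-but-ignored behavior, assuming config.ini was committed before the .gitignore entry was added: git rm --cached cassiopeia/config.ini untracks the file while keeping the local copy, since .gitignore only applies to untracked files.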
That solution sounds good to me
Thanks @colganwi for a great review! I've made several of your requested changes (your comments were very insightful), and I believe I'm now ready for a second review. One small comment is on the …
Looks good! Maybe just add some tests for cluster_dissimilarity_weighted_hamming_distance_min_linkage and add a note about config.ini to the README.
This PR implements two important changes.
Supporting ambiguous alleles in the Cassiopeia Greedy algorithm
Specific changes:
cassiopeia.mixins.utilities
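For concreteness, here is a small sketch of how ambiguous alleles can be represented and detected. The helper name mirrors the style of cassiopeia.mixins.utilities, but the body below is an assumption for illustration rather than the PR's code:

```python
# Sketch: an ambiguous character state is modeled as a tuple of candidate
# states, while unambiguous states stay plain integers. Illustrative only.
from typing import Tuple, Union

def is_ambiguous_state(state: Union[int, Tuple[int, ...]]) -> bool:
    """Return True if `state` is ambiguous (a tuple of candidate states)."""
    return isinstance(state, tuple)

# Example: character 2 of this cell could be allele 4 or allele 7.
cell_states = [1, 0, (4, 7), 3]
print([is_ambiguous_state(s) for s in cell_states])  # [False, False, True, False]
```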
Supporting parallel dissimilarity matrix computation
We implement a parallel dissimilarity matrix computation. Due to compatibility issues with numba, I introduce a wrapper function around the main bones of the dissimilarity map computation and allow it to operate on batches. I notice a slight slowdown for computations that would be numba jit-compatible (on the order of seconds), but I find drastic runtime improvements for computations that are not jit-compatible. This becomes particularly important when dealing with ambiguous alleles, as the cluster dissimilarity function currently cannot be compiled in nopython mode. The speedups are roughly proportional to the number of threads: ~10x with 10 threads, as one would expect.

I also find that implementing more prescriptive cluster dissimilarity functions (e.g., a specific function for linkage=np.min and dissimilarity_function=weighted_hamming_distance) allows the function to be compiled with nopython=False, forceobj=True, which speeds up the computation noticeably. I retain the original cluster dissimilarity computation to keep backwards compatibility and to allow users to experiment with various linkages and dissimilarity functions in cases where performance is not an issue.
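To make the batching scheme concrete, here is a minimal sketch under stated assumptions: the names compute_dissimilarity_map and _distance_batch are illustrative rather than the PR's API, and the kernel is a simplified stand-in for the specialized weighted-hamming/min-linkage function (it takes the min per character over ambiguous candidates; the real linkage may aggregate differently). Pairs of rows are chunked into batches and farmed out to a worker pool:

```python
# Illustrative sketch of a batched, parallel dissimilarity map; names and
# structure are assumptions, not the PR's exact code.
import itertools
import numpy as np
from multiprocessing import Pool  # processes; the `threads` name below
                                  # mirrors the PR's parameter in name only

def weighted_hamming_distance_min_linkage(s1, s2):
    """Simplified min-linkage mismatch count over possibly ambiguous states.

    Ambiguous entries are tuples of candidate states; each character
    contributes the minimum mismatch over all candidate pairings.
    """
    d = 0.0
    for a, b in zip(s1, s2):
        a_states = a if isinstance(a, tuple) else (a,)
        b_states = b if isinstance(b, tuple) else (b,)
        d += min(float(x != y) for x in a_states for y in b_states)
    return d

def _distance_batch(args):
    # For brevity each worker receives a copy of `rows`; see the
    # shared-memory sketch earlier in this thread for avoiding that.
    pairs, rows = args
    return [(i, j, weighted_hamming_distance_min_linkage(rows[i], rows[j]))
            for i, j in pairs]

def compute_dissimilarity_map(rows, threads=1, batch_size=10_000):
    """Fill a symmetric dissimilarity map by dispatching batches of pairs."""
    n = len(rows)
    all_pairs = list(itertools.combinations(range(n), 2))
    batches = [all_pairs[k:k + batch_size]
               for k in range(0, len(all_pairs), batch_size)]
    out = np.zeros((n, n))
    with Pool(threads) as pool:
        for result in pool.map(_distance_batch, [(b, rows) for b in batches]):
            for i, j, d in result:
                out[i, j] = out[j, i] = d
    return out
```

On the numba side, the pattern the PR describes is decorating such a prescriptive kernel with @numba.jit(nopython=False, forceobj=True): object mode tolerates the tuple-valued ambiguous states that nopython mode rejects, at the cost of smaller gains than a true nopython compile.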