Ambiguous greedy & parallel dissimilarity computation #226

Merged
merged 28 commits into from
Oct 11, 2023

Collaborator

@mattjones315 mattjones315 commented Sep 26, 2023

This PR implements two important changes.

Supporting ambiguous alleles in Cassiopeia Greedy algorithm.

Specific changes:

  • Enabled greedy splits to be found when states are ambiguous (by flattening ambiguous states)
  • Enabled missing data assignment in greedy splits when ambiguous states are present
  • A few new utilities around handling ambiguous states, implemented in cassiopeia.mixins.utilities
  • A few miscellaneous changes, not related to the greedy algorithm, especially around running tests when software is not installed.
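As a rough illustration of what "flattening" ambiguous states can mean when counting state frequencies for a greedy split, here is a minimal sketch. It is not the code added in this PR: the helper names `flatten_states` and `state_frequencies`, the use of tuples for ambiguous states, `-1` for missing data, and `0` for the uncut state are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

# Assumption: an ambiguous state is a tuple of candidate states.
State = Union[int, Tuple[int, ...]]
MISSING = -1  # assumed missing-data indicator
UNCUT = 0     # assumed uncut (unmutated) state


def flatten_states(states: List[State]) -> List[int]:
    """Expand every ambiguous (tuple) state into its individual states."""
    flat: List[int] = []
    for state in states:
        if isinstance(state, tuple):
            flat.extend(state)
        else:
            flat.append(state)
    return flat


def state_frequencies(column: List[State]) -> Dict[int, int]:
    """Count mutated states in one character, ignoring missing and uncut."""
    counts = Counter(flatten_states(column))
    counts.pop(MISSING, None)
    counts.pop(UNCUT, None)
    return dict(counts)


# One character observed across six cells, two of them ambiguous.
column = [1, (1, 2), 0, -1, (2, 3), 2]
print(state_frequencies(column))
```

With counts like these, a greedy split can still pick the most frequent state even when some cells carry more than one candidate state.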

Supporting parallel dissimilarity matrix

We implement a parallel dissimilarity matrix computation. Due to compatibility issues with numba, I introduce a wrapper function around the main bones of the dissimilarity map computation and allow it to operate on batches. I notice a slight slowdown for computations that would be numba jit-compatible (on the order of seconds), but I find drastic runtime improvements for computations that are not jit-compatible. This becomes particularly important for cases dealing with ambiguous alleles, as the cluster dissimilarity function currently cannot be compiled in nopython mode. The result is a dramatic speedup, roughly proportional to the number of threads: ~10x with 10 threads (as one would expect).
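The batching idea can be sketched roughly as follows. This is a simplified stand-in, not the implementation in this PR: `compute_batch`, `parallel_dissimilarity_map`, and the toy `hamming` function are hypothetical names, and the sketch uses a thread pool only to stay self-contained. For pure-Python dissimilarity functions, a process pool would likely be needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

import numpy as np

Row = Sequence[int]


def hamming(a: Row, b: Row) -> float:
    """Toy dissimilarity: number of positions at which two rows differ."""
    return float(sum(x != y for x, y in zip(a, b)))


def compute_batch(
    rows: Sequence[Row],
    pairs: List[Tuple[int, int]],
    delta: Callable[[Row, Row], float],
) -> List[Tuple[int, int, float]]:
    """Wrapper that evaluates `delta` on one batch of index pairs."""
    return [(i, j, delta(rows[i], rows[j])) for i, j in pairs]


def parallel_dissimilarity_map(
    rows: Sequence[Row], delta: Callable[[Row, Row], float], threads: int = 2
) -> np.ndarray:
    """Fill a symmetric dissimilarity matrix, one batch of pairs per worker."""
    n = len(rows)
    pairs = list(combinations(range(n), 2))
    # Deal the pairs out round-robin so the batches are roughly equal in size.
    batches = [pairs[k::threads] for k in range(threads)]
    dmap = np.zeros((n, n))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda batch: compute_batch(rows, batch, delta), batches)
        for result in results:
            for i, j, d in result:
                dmap[i, j] = dmap[j, i] = d
    return dmap


rows = [[1, 2, 0], [1, 0, 0], [3, 2, 1]]
print(parallel_dissimilarity_map(rows, hamming, threads=2))
```

Because the wrapper only receives index pairs, the per-pair function itself does not have to be jit-compatible.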

I also find that implementing more prescriptive cluster dissimilarity functions (e.g., specific function for linkage=np.min and dissimilarity_function=weighted_hamming_distance) allows the function to be compiled with nopython=False, forceobj=True, which does speed up the computation noticeably. I retain the original cluster computation to keep it backwards compatible and to allow users to experiment with various linkages and dissimilarity functions on cases where performance is not such an issue.
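To make the "more prescriptive" idea concrete, here is a minimal sketch of a dissimilarity function with the linkage (minimum) and the per-state distance hard-coded rather than passed in as arguments. It is an assumption-laden illustration, not the function added in this PR: the name `min_linkage_dissimilarity`, the unweighted 0/1 per-state distance, the tuple encoding of ambiguous states, and `-1` as the missing indicator are all chosen for the example, and the numba decorator is left out to keep the sketch dependency-free.

```python
from itertools import product
from typing import Sequence, Tuple, Union

# Assumption: an ambiguous state is a tuple of candidate states.
State = Union[int, Tuple[int, ...]]
MISSING = -1  # assumed missing-data indicator


def simple_state_distance(a: int, b: int) -> float:
    """Toy per-state distance: 0 if equal, 1 otherwise (no weighting)."""
    return 0.0 if a == b else 1.0


def min_linkage_dissimilarity(s1: Sequence[State], s2: Sequence[State]) -> float:
    """Average, over shared characters, of the minimum pairwise state distance."""
    total, compared = 0.0, 0
    for a, b in zip(s1, s2):
        options_a = a if isinstance(a, tuple) else (a,)
        options_b = b if isinstance(b, tuple) else (b,)
        # Skip characters where either sample is missing.
        if MISSING in options_a or MISSING in options_b:
            continue
        # Min linkage: the closest pair of candidate states decides the distance.
        total += min(
            simple_state_distance(x, y) for x, y in product(options_a, options_b)
        )
        compared += 1
    return total / compared if compared else 0.0


print(min_linkage_dissimilarity([(1, 2), 0, -1, 3], [2, 0, 4, (4, 5)]))
```

Fixing the linkage and distance in the function body removes the callable arguments, which is what tends to make such a function easier for a compiler to handle.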

codecov bot commented Sep 27, 2023

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (9f272fc) 79.54% compared to head (60a914e) 79.49%.
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   79.54%   79.49%   -0.06%     
==========================================
  Files          89       89              
  Lines        7948     8050     +102     
==========================================
+ Hits         6322     6399      +77     
- Misses       1626     1651      +25     
Files Coverage Δ
cassiopeia/data/CassiopeiaTree.py 91.78% <100.00%> (+0.02%) ⬆️
cassiopeia/solver/GreedySolver.py 98.64% <100.00%> (+0.07%) ⬆️
cassiopeia/solver/HybridSolver.py 74.78% <100.00%> (-0.22%) ⬇️
cassiopeia/solver/NeighborJoiningSolver.py 70.88% <ø> (ø)
cassiopeia/solver/UPGMASolver.py 73.21% <ø> (ø)
cassiopeia/solver/VanillaGreedySolver.py 100.00% <100.00%> (ø)
cassiopeia/mixins/utilities.py 91.66% <93.75%> (+11.66%) ⬆️
cassiopeia/solver/DistanceSolver.py 59.32% <16.66%> (-1.21%) ⬇️
cassiopeia/solver/dissimilarity_functions.py 84.82% <84.37%> (-0.13%) ⬇️
cassiopeia/plotting/local.py 88.01% <40.00%> (-1.95%) ⬇️
... and 2 more


@mattjones315 mattjones315 marked this pull request as ready for review September 27, 2023 19:36
@mattjones315 mattjones315 changed the title Ambiguous greedy Ambiguous greedy & parallel dissimilarity computation Oct 5, 2023
Collaborator

@colganwi colganwi left a comment

Great work @mattjones315! The only major comment I have is about using shared memory for the dissimilarity map computation. With large trees I suspect that copying the character matrix for each thread will be slow and may lead to overflow. Based on some quick research I think this could be fixed using multiprocessing.shared_memory, but I admit I haven't thought about it too deeply. If you think this is too much work and out of scope I'm happy to table for now.

I also left a few minor comments regarding formatting and the threads parameter. Happy to clarify anything if needed. Once these are addressed and the tests are fixed, we can merge.
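For reference, the shared-memory idea raised above might look roughly like the sketch below, which places a numeric character matrix in a `multiprocessing.shared_memory` block and attaches to it by name. This is only an illustration under assumptions: the matrix values are made up, both the "producer" and the "worker" steps run in one process for brevity, and it presumes an integer-encoded matrix.

```python
from multiprocessing import shared_memory

import numpy as np

# Hypothetical integer-encoded character matrix (cells x characters).
character_matrix = np.array([[1, 2, 0], [1, 0, -1], [3, 2, 1]], dtype=np.int64)

# Producer: allocate one shared block and copy the matrix into it once.
shm = shared_memory.SharedMemory(create=True, size=character_matrix.nbytes)
shared = np.ndarray(
    character_matrix.shape, dtype=character_matrix.dtype, buffer=shm.buf
)
shared[:] = character_matrix

# Worker: attach to the block by name; the data itself is not copied.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(
    character_matrix.shape, dtype=character_matrix.dtype, buffer=attached.buf
)
row_sum = int(view[0].sum())
print(row_sum)

# Clean up: drop the array views before closing, then release the block.
del view
attached.close()
del shared
shm.close()
shm.unlink()
```

One caveat: ambiguous states stored as Python tuples would make the matrix an object array, which cannot be shared this way, so some integer encoding of ambiguous states would probably be needed first.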

Collaborator

Should be removed from the commit. I've also had issues with cassiopeia/config.ini changes being included even though it's in the .gitignore. Do you know how to fix this?

Collaborator Author

@mattjones315 mattjones315 Oct 10, 2023

Hmm, that is weird. I agree, this should be removed from the commit.

I think the issue here is that the config is tracked, but only as a default. So .gitignore gets confused because the file does exist. I propose we put the default config in ./data and specify in the README how to utilize it.

Collaborator

That solution sounds good to me

@mattjones315
Collaborator Author

Thanks @colganwi for a great review! I've made several of your requested changes, which were very insightful, and I believe I'm now ready for a second review.

One small comment on the config.ini issue in .gitignore: I found that if you specify it in the .gitignore, it doesn't get packaged. It's quite a tricky problem. So I removed it from tracking, added a dummy version to ./data, and re-added a cassiopeia/config.ini to my own personal distribution. Let me know if you have a better idea.

@mattjones315 mattjones315 requested a review from colganwi October 10, 2023 19:35
Collaborator

@colganwi colganwi left a comment

Looks good! Maybe just add some tests for cluster_dissimilarity_weighted_hamming_distance_min_linkage and add a note about config.ini to the README.

@mattjones315 mattjones315 merged commit a520f64 into master Oct 11, 2023
2 of 3 checks passed
@mattjones315 mattjones315 deleted the ambiguous_greedy branch October 11, 2023 03:33