Ambiguous greedy & parallel dissimilarity computation #226
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   79.54%   79.49%   -0.06%
==========================================
  Files          89       89
  Lines        7948     8050     +102
==========================================
+ Hits         6322     6399      +77
- Misses       1626     1651      +25

☔ View full report in Codecov by Sentry.
Great work @mattjones315! The only major comment I have is about using shared memory for the dissimilarity map computation. With large trees I suspect that copying the character matrix for each thread will be slow and may lead to memory overflow. Based on some quick research I think this could be fixed using multiprocessing.shared_memory, but I admit I haven't thought about it too deeply. If you think this is too much work and out of scope, I'm happy to table it for now.
I also left a few minor comments re: formatting and the threads parameter. Happy to clarify anything if needed. Once these are addressed and the tests are fixed, we can merge.
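For reference, a minimal sketch of the shared-memory idea, under stated assumptions: pairwise_in_shared_memory and _worker are hypothetical names rather than anything in Cassiopeia, and the per-row distance is a plain mismatch count standing in for the real dissimilarity function. The point is that each worker attaches to one shared buffer instead of receiving a pickled copy of the character matrix.

```python
import numpy as np
from multiprocessing import Pool, shared_memory

def _worker(args):
    shm_name, shape, dtype, row_batch = args
    # Attach to the existing shared block; no copy of the matrix is made.
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        cm = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        # Hypothetical per-batch work: mismatch counts for these rows.
        return [(i, j, int(np.sum(cm[i] != cm[j])))
                for i in row_batch for j in range(i + 1, shape[0])]
    finally:
        shm.close()

def pairwise_in_shared_memory(character_matrix, n_workers=4):
    cm = np.ascontiguousarray(character_matrix)
    shm = shared_memory.SharedMemory(create=True, size=cm.nbytes)
    try:
        buf = np.ndarray(cm.shape, dtype=cm.dtype, buffer=shm.buf)
        buf[:] = cm  # one copy into shared memory, instead of one per worker
        batches = np.array_split(np.arange(cm.shape[0]), n_workers)
        with Pool(n_workers) as pool:
            results = pool.map(
                _worker,
                [(shm.name, cm.shape, cm.dtype, b) for b in batches],
            )
        return [item for batch in results for item in batch]
    finally:
        shm.close()
        shm.unlink()
```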
cassiopeia/config.ini
Outdated
Should be removed from the commit. I've also had issues with cassiopeia/config.ini changes being included even though it's in the .gitignore. Do you know how to fix this?
Hmm, that is weird. I agree, this should be removed from the commit.
I think the issue here is that the config is tracked, but only as a default. So .gitignore gets confused because the file does exist. I propose we put the default config in ./data and specify in the README how to utilize it.
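A likely fix for the tracked-but-ignored behavior, assuming config.ini was committed before the .gitignore entry was added: git rm --cached cassiopeia/config.ini untracks the file while keeping the local copy, since .gitignore only applies to untracked files.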
That solution sounds good to me
Thanks @colganwi for a great review! I've made several of your requested changes (your comments were very insightful), and I believe I'm now ready for a second review. One small comment is on the …
Looks good! Maybe just add some tests for cluster_dissimilarity_weighted_hamming_distance_min_linkage and add a note about config.ini to the README.
This PR implements two important changes.
Supporting ambiguous alleles in the Cassiopeia Greedy algorithm
Specific changes:
cassiopeia.mixins.utilities
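For concreteness, here is a small sketch of how ambiguous alleles can be represented and detected. The helper name mirrors the style of cassiopeia.mixins.utilities, but the body below is an assumption for illustration rather than the PR's code:

```python
# Sketch: an ambiguous character state is modeled as a tuple of candidate
# states, while unambiguous states stay plain integers. Illustrative only.
from typing import Tuple, Union

def is_ambiguous_state(state: Union[int, Tuple[int, ...]]) -> bool:
    """Return True if `state` is ambiguous (a tuple of candidate states)."""
    return isinstance(state, tuple)

# Example: character 2 of this cell could be allele 4 or allele 7.
cell_states = [1, 0, (4, 7), 3]
print([is_ambiguous_state(s) for s in cell_states])  # [False, False, True, False]
```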
Supporting parallel dissimilarity matrix computation
We implement a parallel dissimilarity matrix computation. Due to compatibility issues with numba, I introduce a wrapper function around the main bones of the dissimilarity map computation and allow it to operate on batches. I notice a slight slowdown for computations that would be numba jit-compatible (on the order of seconds), but I find drastic runtime improvements for computations that are not jit-compatible. This becomes particularly important when dealing with ambiguous alleles, as the cluster dissimilarity function currently cannot be compiled in nopython mode. The speedups are roughly proportional to the number of threads: ~10x with 10 threads, as one would expect.

I also find that implementing more prescriptive cluster dissimilarity functions (e.g., a specific function for linkage=np.min and dissimilarity_function=weighted_hamming_distance) allows the function to be compiled with nopython=False, forceobj=True, which speeds up the computation noticeably. I retain the original cluster dissimilarity computation to keep backwards compatibility and to allow users to experiment with various linkages and dissimilarity functions in cases where performance is not an issue.
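To make the batching scheme concrete, here is a minimal sketch under stated assumptions: the names compute_dissimilarity_map and _distance_batch are illustrative rather than the PR's API, and the kernel is a simplified stand-in for the specialized weighted-hamming/min-linkage function (it takes the min per character over ambiguous candidates; the real linkage may aggregate differently). Pairs of rows are chunked into batches and farmed out to a worker pool:

```python
# Illustrative sketch of a batched, parallel dissimilarity map; names and
# structure are assumptions, not the PR's exact code.
import itertools
import numpy as np
from multiprocessing import Pool  # processes; the `threads` name below
                                  # mirrors the PR's parameter in name only

def weighted_hamming_distance_min_linkage(s1, s2):
    """Simplified min-linkage mismatch count over possibly ambiguous states.

    Ambiguous entries are tuples of candidate states; each character
    contributes the minimum mismatch over all candidate pairings.
    """
    d = 0.0
    for a, b in zip(s1, s2):
        a_states = a if isinstance(a, tuple) else (a,)
        b_states = b if isinstance(b, tuple) else (b,)
        d += min(float(x != y) for x in a_states for y in b_states)
    return d

def _distance_batch(args):
    # For brevity each worker receives a copy of `rows`; see the
    # shared-memory sketch earlier in this thread for avoiding that.
    pairs, rows = args
    return [(i, j, weighted_hamming_distance_min_linkage(rows[i], rows[j]))
            for i, j in pairs]

def compute_dissimilarity_map(rows, threads=1, batch_size=10_000):
    """Fill a symmetric dissimilarity map by dispatching batches of pairs."""
    n = len(rows)
    all_pairs = list(itertools.combinations(range(n), 2))
    batches = [all_pairs[k:k + batch_size]
               for k in range(0, len(all_pairs), batch_size)]
    out = np.zeros((n, n))
    with Pool(threads) as pool:
        for result in pool.map(_distance_batch, [(b, rows) for b in batches]):
            for i, j, d in result:
                out[i, j] = out[j, i] = d
    return out
```

On the numba side, the pattern the PR describes is decorating such a prescriptive kernel with @numba.jit(nopython=False, forceobj=True): object mode tolerates the tuple-valued ambiguous states that nopython mode rejects, at the cost of smaller gains than a true nopython compile.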