Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast UPGMA and NJ with CCPhylo #205

Merged
merged 9 commits into from
Aug 18, 2023
Merged

Conversation

colganwi
Copy link
Collaborator

The C implementations of UPGMA and NJ in CCPhylo are significantly faster than the current Cassiopeia implementations of these algorithms. This update adds a CCPhyloSolver subclass of DistanceSolver which writes the dissimilarity matrix to a temp file and then calls ccphlyo tree to solve the tree with the specified method. NeighborJoiningSolver and UPGMASolver are now subclasses of CCPhyloSolver so when fast is set to true they use CCPhyloSolver._fast_solve() instead of DistanceSolver.solve(). CCPhylo also includes two new algorithms Dynamic and Heuristic NJ which are significantly faster than standard NJ. DynamicNeighborJoiningSolver and HeuristicNeighborJoiningSolver are subclasses of NeighborJoiningSolver which implement these algorithms. In most cases DNJ should be used since it is guaranteed to generate an exact NJ tree.

@colganwi
Copy link
Collaborator Author

Benchmarking results

benchmarking

@mattjones315 mattjones315 self-requested a review August 9, 2023 01:34
Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fantastic PR @colganwi! It's looking really good, and I appreciate you learning our code structure and style guide so quickly as to make contributions.

Most of the code looks great, minus some nitpicky issues here and there around documentation and style (sorry, I know it's annoying).

One major issue that I'd like to get hammered out, and perhaps you can just convince me of it, is that I'm not sure we need to have a separate CCPhyloSolver class. I feel like there are some quirks about it that make it feel half-completed, such as the fact that you run the risk of throwing an error with solve. I think there are two potential solutions, if you are open to them:

  1. I think we can keep solve as an abstractmethod (marked with the @abc.abstractmethod decorator) in CCPhyloSolver. We can keep the _fast_solve method implemented. Then, in any of the subclasses (e.g., NeighborJoiningSolver), we can override solve as to take into account the fast variable. In other words, if fast=True, we call super()._fast_solve; else we call super().super().solve().
  2. Second, I think we can get rid of CCPhyloSolver altogether and move the logic to DistanceSolver. In fact, that would make the first suggestion more coherent as well. We'd just have to move _fast_solve to DistanceSolver. Then, all subclasses would be invoked with solve, and we can encode the decision between _fast_solve and the traditional solve logic within DistanceSolver.

Do either of these suggestions make sense? Happy to discuss more in depth, and happy to be convinced that we need CCPhyloSolver if that's really the case.

cassiopeia/solver/CCPhyloSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/CCPhyloSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/CCPhyloSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/CCPhyloSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/CCPhyloSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/HeuristicNeighborJoiningSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/solver_utilities.py Show resolved Hide resolved
@mattjones315
Copy link
Collaborator

Oh and one more comment -- we should make sure to merge in the latest changes to the master branch, as the pandas typing/ILP error appears to still be a problem in running the tests.

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Patch coverage: 39.74% and project coverage change: -0.44% ⚠️

Comparison is base (f895301) 79.05% compared to head (7f96e86) 78.62%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #205      +/-   ##
==========================================
- Coverage   79.05%   78.62%   -0.44%     
==========================================
  Files          85       85              
  Lines        7080     7157      +77     
==========================================
+ Hits         5597     5627      +30     
- Misses       1483     1530      +47     
Files Changed Coverage Δ
cassiopeia/solver/DistanceSolver.py 60.52% <25.45%> (-32.81%) ⬇️
cassiopeia/solver/NeighborJoiningSolver.py 70.88% <50.00%> (-1.72%) ⬇️
cassiopeia/solver/UPGMASolver.py 73.21% <50.00%> (-2.79%) ⬇️
cassiopeia/solver/SpectralNeighborJoiningSolver.py 98.92% <100.00%> (+0.01%) ⬆️
cassiopeia/solver/solver_utilities.py 86.00% <100.00%> (+3.50%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@sprillo sprillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @colganwi , this is awesome, I've always wanted a faster NJ implementation!

All the pieces seem to be there, but I want to leave some thoughts on the current design.

I think using inheritance is not the ideal way to implement the "fast" functionality in the distance solvers. Using inheritance (with NeighborJoiningSolver inheriting from CCPhyloSolver) couples NeighborJoiningSolver to the CCPhyloSolver, and in particular to its "fast" implementation. Here's an example of why this is problematic: Suppose that in the future another even faster implementation of NJ comes up, and we want to expose it, but we still want to support the current CCPhyloSolver implementation. What do we do?

I think a superior way to implement the new "fast" functionality is to use composition over inheritance. This requires minimal changes to the current NeighborJoiningSolver class (and none to the DistanceSolver class). The pseudocode would look like this:

class NeighborJoiningSolver(DistanceSolver.DistanceSolver):
    def __init__(
        ...
        implementation: str = "ccphylo"
    ):
        self._implementation = implementation
        ...

    def solve(self, ...):
        if self._implementation == "ccphylo":
            solver = CCPhyloSolver(...)  # (Note: Could be done in __init__ instead to raise errors earlier.)
            solver.solve(tree)
        else:
            super().solve(...)

This way, NeighborJoiningSolver essentially wraps the CCPhyloSolver, instead of inheriting from it. Also, CCPhyloSolver would implement "solve" directly and not have a "_fast_solve" method. With this approach, if other faster implementations of NJ come up in the future, they can just be registered in the NeighborJoiningSolver's solve method, e.g.:

def solve(self, ...):
    if self._implementation == "ccphylo":
        solver = CCPhyloSolver(...)
        solver.solve(tree)
    elif self._implementation == "even_faster_implementation":
        solver = EvenFasterNJSolver(...)
        solver.solve(tree)
    else:
        super().solve(...)

I think it's fine to have a concrete CCPhyloSolver class. In fact, one could argue there is no need for modifying NeighborJoiningSolver at all, since one could just instantiate and use the CCPhyloSolver instead. However, as I see it, NeighborJoiningSolver is a "centralized" place to create NJ solvers with slightly different implementations. In this sense, one could even argue that the DynamicNeighborJoiningSolver and HeuristicNeighborJoiningSolver classes are unnecessary and could be realized through the NeighborJoiningSolver directly, e.g.:

def solve(self, ...):
    if self._implementation == "ccphylo_nj":
        solver = CCPhyloSolver(...)
        solver.solve(tree)
    elif self._implementation == "ccphylo_hnj":
        solver = CCPhyloSolver(...)  # Use HNJ
        solver.solve(tree)
    elif self._implementation == "ccphylo_dnj":
        solver = CCPhyloSolver(...)  # Use DNJ
        solver.solve(tree)
    else:
        super().solve(...)

Regardless, the main point is that I am not sure going down the inheritance path is ideal because it creates deep, coupled hierarchies and code that is harder to read. @mattjones315 what do you think?

Some resources:
Composition over inheritance: https://youtu.be/hxGOiiR9ZKg
Strategy pattern: https://en.wikipedia.org/wiki/Strategy_pattern

@colganwi
Copy link
Collaborator Author

colganwi commented Aug 16, 2023

Unfortunately, I implemented @mattjones315's suggestion of removing the CCPhyloSolver class (c0c5ca9) before seeing @sprillo's comments. This commit addresses the issue of a future "even_faster_implementation" by changing CCPhyloSolver._fast_solve to DistanceSolver._ccphylo_solve. The fast solver is then selected in DistanceSolver.__init__ which would allow other fast solver implementations.

But now that I've read @sprillo's comment I think his solution is cleaner since it doesn't require modifications to the DistanceSolver class and would allow for CCPhylo methods such as K-means Closest First to be run without creating a dedicated KMeansClosestFirstSolver subclass of the DistanceSolver. However, one issue with this approach is rooting the tree. Currently, the CCPhyloSolver versions of UPGMA and NJ use different rooting strategies to ensure that the results are the same as the existing implementations. It would be redundant to also put these rooting strategies into the CCPhyloSolver class. Of course, the obvious solution to this is to make the rooting functions compositional but this will require some work/reorganization. Do we want to make this change to how the solvers are organized?

Copy link
Collaborator

@sprillo sprillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colganwi given the current design used for solvers in Cassiopeia I think the solution you have currently implemented (diverting to a "_fast_solve" variant in DistanceSolver) is okay.

I think refactoring (or, completely redesigning the solver package) isn't worth the effort right now, but I agree that the current design using inheritance has a few issues. Generally, the solver classes take on too many responsibilities / make too many assumptions. For example, the abstract base class CassiopeiaSolver is not only responsible for estimating a tree topology but also for collapsing mutationless edges. Partly as a results you see a lot of code duplication in the concrete subclasses, e.g. in DistanceSolver , GreedySolver , ILPSolver. (This code could be moved to the parent class somehow, but that's not the point, the point is solvers probably should not have to bother about collapsing mutationless edges at all).

In the particular case of the DistanceSolver class, subclasses are expected to conform to a specific algorithm structure, providing hooks for find_cherry, update_dissimilarity_map, setup_root_finder, root_tree that are used in DistanceSolver's solve method. This avoids duplicating the DistanceSolver's solve logic in the subclasses, but it limits the kind of DistanceSolver's that are expected. In this sense, the CCPhyloSolver class is not a DistanceSolver, since a CCPhyloSolver class wrapping the C implementation actually reuses none of DistanceSolver's "solve" method and implements none of the hooks. It is simply a CassiopeiaSolver. For this reason, in a PR I recently opened ( #211 - feel free to leave comments ) where I implement what is quite intuitively a distance-based method, I don't inherit from DistanceSolver at all.

To summarize, a distance solver wrapping a C++ implementation shouldn't really be thought of as a DistanceSolver given what the current expectations of the DistanceSolver class are. Inheriting from DistanceSolver and then overriding solve to divert to a _fast_solve (as in the original PR) is misleading from this perspective. The current solution, where DistanceSolver allows for a _fast_solve seems more readable to me. However, note that DistanceSolver is now basically doing two different things: (1) providing a skeleton solve as originally which you can hook to, and (2) knowing how to run ccphylo or other programs. So it's basically responsible for two completely unrelated things. I don't mind it too much as it is, but you can see how it could hamper readability and make things a bit confusing for a newcomer.

I'll unblock the PR on my end now that we are aware of the pros and cons of the current design. I'll let Matt have the final word on what shape this should have.

@mattjones315 mattjones315 self-requested a review August 16, 2023 21:31
Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @colganwi and @sprillo --

First, thank you both for a very thoughtful discussion on this PR. It has certainly sparked a lot of good ideas on how to improve functionality and readability, and it is evident to me that we have outgrown our initial design/vision from a few years ago when we executed the first refactor (trust me, this is a much improved version of the codebase back in 2019...).

I like several aspects of Sebastian's proposed refactor towards a compositional structure in general. I certainly see how it would benefit the current implementation of our DistanceSolver's but I also see that these themes can be extended to apply to several of our solvers. In fact, it sounds like given our recent work implementing several new algorithms that the solver module is due for a refactor in general. For this reason, I don't want to put the burden on this PR; it deserves a larger effort and more conversation.

For the time being, this PR implements well the suggestions I had earlier this week and I think it will be ready to merge in once my nit-picky comments on formatting and additional tests are addressed. I also want to commend William for making a very solid contribution to the Cassiopeia codebase!

I'm going to make an issue linking this PR around reformatting the solver module.

cassiopeia/solver/DistanceSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/DistanceSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/DistanceSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/DistanceSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/HeuristicNeighborJoiningSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/NeighborJoiningSolver.py Outdated Show resolved Hide resolved
cassiopeia/solver/solver_utilities.py Show resolved Hide resolved
@colganwi
Copy link
Collaborator Author

@sprillo I completely agree that the current architecture of the solvers, particularly the DistanceSolver, is not ideal. The main reason I built the CCPhyloSolver on top the DistanceSolver was to take advantage of the logic for rooting trees and calculating dissimilarity maps. Let's plan to address this when we refactor the solvers. We could either create a more generic distance DistanceSolver, or with a sufficiently compositional framework, write a CCPhyloSolver that does not inherent from DistanceSolver. When @mattjones315 opens the issue we can have a more in depth conversation there.

@mattjones315 thanks for the additional comments. I'm going to add two more commits, one addressing the comments and adding tests, and one with a few new features including multithreading and subclasses for the other solvers implemented by CCPhylo. Once those commits are reviewed we can merge.

@mattjones315
Copy link
Collaborator

Hi @colganwi, sounds like a great plan. Looking forward to those commits.

I've now opened an issue referencing our thoughts on refactoring the DistanceSolver and the solver module in general which you can find in #214. William, once you accept the invitation to join the Cassiopeia repo as a collaborator I'll add you to the list of assignees (currently just me and Sebastian).

@colganwi
Copy link
Collaborator Author

Should be good to go. Since we are planning to refactor I've minimized changes to the solver class structure. The DNJ and HNJ classes have been removed in favor of an implementation parameter in the NeighborJoiningSolver constructor and the DistanceSolver constructor is now unchanged. One advantage of this approach is that CCPhylo DNJ is now the default fast implementation of NeighborJoining.

I've also decided to table adding multithreading and classes for other CCPhylo solvers for now to minimize work when we refactor. If we bring back the CCPhyloSolver class these options can be built into it.

@mattjones315 mattjones315 self-requested a review August 18, 2023 20:42
Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Fantastic work putting together this PR, and I'm glad this was also an opportunity to think about how we can refactor the logic of the solver module altogether. I'm just going to make one small commit re-ordering imports to abide by my nitpicky ordering preferences and then we can merge this in.

@mattjones315 mattjones315 merged commit 7dfa403 into YosefLab:master Aug 18, 2023
colganwi added a commit to colganwi/Cassiopeia that referenced this pull request Aug 18, 2023
Merge pull request YosefLab#205 from colganwi/CCPhylo
@colganwi colganwi deleted the CCPhylo branch August 18, 2023 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants