-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast UPGMA and NJ with CCPhylo #205
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this fantastic PR @colganwi! It's looking really good, and I appreciate you learning our code structure and style guide so quickly as to make contributions.
Most of the code looks great, minus some nitpicky issues here and there around documentation and style (sorry, I know it's annoying).
One major issue that I'd like to get hammered out, and perhaps you can just convince me of it, is that I'm not sure we need to have a separate CCPhyloSolver
class. I feel like there are some quirks about it that make it feel half-completed, such as the fact that you run the risk of throwing an error with solve
. I think there are two potential solutions, if you are open to them:
- I think we can keep
solve
as an abstractmethod (marked with the@abc.abstractmethod
decorator) inCCPhyloSolver
. We can keep the_fast_solve
method implemented. Then, in any of the subclasses (e.g.,NeighborJoiningSolver
), we can overridesolve
as to take into account thefast
variable. In other words, iffast=True
, we callsuper()._fast_solve
; else we callsuper().super().solve()
. - Second, I think we can get rid of
CCPhyloSolver
altogether and move the logic toDistanceSolver
. In fact, that would make the first suggestion more coherent as well. We'd just have to move_fast_solve
toDistanceSolver
. Then, all subclasses would be invoked withsolve
, and we can encode the decision between_fast_solve
and the traditionalsolve
logic withinDistanceSolver
.
Do either of these suggestions make sense? Happy to discuss more in depth, and happy to be convinced that we need CCPhyloSolver
if that's really the case.
Oh and one more comment -- we should make sure to merge in the latest changes to the |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #205 +/- ##
==========================================
- Coverage 79.05% 78.62% -0.44%
==========================================
Files 85 85
Lines 7080 7157 +77
==========================================
+ Hits 5597 5627 +30
- Misses 1483 1530 +47
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @colganwi , this is awesome, I've always wanted a faster NJ implementation!
All the pieces seem to be there, but I want to leave some thoughts on the current design.
I think using inheritance is not the ideal way to implement the "fast" functionality in the distance solvers. Using inheritance (with NeighborJoiningSolver inheriting from CCPhyloSolver) couples NeighborJoiningSolver to the CCPhyloSolver, and in particular to its "fast" implementation. Here's an example of why this is problematic: Suppose that in the future another even faster implementation of NJ comes up, and we want to expose it, but we still want to support the current CCPhyloSolver implementation. What do we do?
I think a superior way to implement the new "fast" functionality is to use composition over inheritance. This requires minimal changes to the current NeighborJoiningSolver class (and none to the DistanceSolver class). The pseudocode would look like this:
class NeighborJoiningSolver(DistanceSolver.DistanceSolver):
def __init__(
...
implementation: str = "ccphylo"
):
self._implementation = implementation
...
def solve(self, ...):
if self._implementation == "ccphylo":
solver = CCPhyloSolver(...) # (Note: Could be done in __init__ instead to raise errors earlier.)
solver.solve(tree)
else:
super().solve(...)
This way, NeighborJoiningSolver essentially wraps the CCPhyloSolver, instead of inheriting from it. Also, CCPhyloSolver would implement "solve" directly and not have a "_fast_solve" method. With this approach, if other faster implementations of NJ come up in the future, they can just be registered in the NeighborJoiningSolver's solve method, e.g.:
def solve(self, ...):
if self._implementation == "ccphylo":
solver = CCPhyloSolver(...)
solver.solve(tree)
elif self._implementation == "even_faster_implementation":
solver = EvenFasterNJSolver(...)
solver.solve(tree)
else:
super().solve(...)
I think it's fine to have a concrete CCPhyloSolver class. In fact, one could argue there is no need for modifying NeighborJoiningSolver at all, since one could just instantiate and use the CCPhyloSolver instead. However, as I see it, NeighborJoiningSolver is a "centralized" place to create NJ solvers with slightly different implementations. In this sense, one could even argue that the DynamicNeighborJoiningSolver and HeuristicNeighborJoiningSolver classes are unnecessary and could be realized through the NeighborJoiningSolver directly, e.g.:
def solve(self, ...):
if self._implementation == "ccphylo_nj":
solver = CCPhyloSolver(...)
solver.solve(tree)
elif self._implementation == "ccphylo_hnj":
solver = CCPhyloSolver(...) # Use HNJ
solver.solve(tree)
elif self._implementation == "ccphylo_dnj":
solver = CCPhyloSolver(...) # Use DNJ
solver.solve(tree)
else:
super().solve(...)
Regardless, the main point is that I am not sure going down the inheritance path is ideal because it creates deep, coupled hierarchies and code that is harder to read. @mattjones315 what do you think?
Some resources:
Composition over inheritance: https://youtu.be/hxGOiiR9ZKg
Strategy pattern: https://en.wikipedia.org/wiki/Strategy_pattern
Unfortunately, I implemented @mattjones315's suggestion of removing the CCPhyloSolver class (c0c5ca9) before seeing @sprillo's comments. This commit addresses the issue of a future "even_faster_implementation" by changing But now that I've read @sprillo's comment I think his solution is cleaner since it doesn't require modifications to the DistanceSolver class and would allow for CCPhylo methods such as K-means Closest First to be run without creating a dedicated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@colganwi given the current design used for solvers in Cassiopeia I think the solution you have currently implemented (diverting to a "_fast_solve" variant in DistanceSolver
) is okay.
I think refactoring (or, completely redesigning the solver package) isn't worth the effort right now, but I agree that the current design using inheritance has a few issues. Generally, the solver classes take on too many responsibilities / make too many assumptions. For example, the abstract base class CassiopeiaSolver
is not only responsible for estimating a tree topology but also for collapsing mutationless edges. Partly as a results you see a lot of code duplication in the concrete subclasses, e.g. in DistanceSolver , GreedySolver , ILPSolver. (This code could be moved to the parent class somehow, but that's not the point, the point is solvers probably should not have to bother about collapsing mutationless edges at all).
In the particular case of the DistanceSolver
class, subclasses are expected to conform to a specific algorithm structure, providing hooks for find_cherry
, update_dissimilarity_map
, setup_root_finder
, root_tree
that are used in DistanceSolver
's solve
method. This avoids duplicating the DistanceSolver
's solve
logic in the subclasses, but it limits the kind of DistanceSolver
's that are expected. In this sense, the CCPhyloSolver
class is not a DistanceSolver
, since a CCPhyloSolver
class wrapping the C implementation actually reuses none of DistanceSolver
's "solve" method and implements none of the hooks. It is simply a CassiopeiaSolver
. For this reason, in a PR I recently opened ( #211 - feel free to leave comments ) where I implement what is quite intuitively a distance-based method, I don't inherit from DistanceSolver
at all.
To summarize, a distance solver wrapping a C++ implementation shouldn't really be thought of as a DistanceSolver
given what the current expectations of the DistanceSolver
class are. Inheriting from DistanceSolver
and then overriding solve
to divert to a _fast_solve
(as in the original PR) is misleading from this perspective. The current solution, where DistanceSolver
allows for a _fast_solve
seems more readable to me. However, note that DistanceSolver
is now basically doing two different things: (1) providing a skeleton solve
as originally which you can hook to, and (2) knowing how to run ccphylo or other programs. So it's basically responsible for two completely unrelated things. I don't mind it too much as it is, but you can see how it could hamper readability and make things a bit confusing for a newcomer.
I'll unblock the PR on my end now that we are aware of the pros and cons of the current design. I'll let Matt have the final word on what shape this should have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, thank you both for a very thoughtful discussion on this PR. It has certainly sparked a lot of good ideas on how to improve functionality and readability, and it is evident to me that we have outgrown our initial design/vision from a few years ago when we executed the first refactor (trust me, this is a much improved version of the codebase back in 2019...).
I like several aspects of Sebastian's proposed refactor towards a compositional structure in general. I certainly see how it would benefit the current implementation of our DistanceSolver
's but I also see that these themes can be extended to apply to several of our solvers. In fact, it sounds like given our recent work implementing several new algorithms that the solver
module is due for a refactor in general. For this reason, I don't want to put the burden on this PR; it deserves a larger effort and more conversation.
For the time being, this PR implements well the suggestions I had earlier this week and I think it will be ready to merge in once my nit-picky comments on formatting and additional tests are addressed. I also want to commend William for making a very solid contribution to the Cassiopeia codebase!
I'm going to make an issue linking this PR around reformatting the solver
module.
@sprillo I completely agree that the current architecture of the solvers, particularly the @mattjones315 thanks for the additional comments. I'm going to add two more commits, one addressing the comments and adding tests, and one with a few new features including multithreading and subclasses for the other solvers implemented by CCPhylo. Once those commits are reviewed we can merge. |
Hi @colganwi, sounds like a great plan. Looking forward to those commits. I've now opened an issue referencing our thoughts on refactoring the |
Should be good to go. Since we are planning to refactor I've minimized changes to the solver class structure. The DNJ and HNJ classes have been removed in favor of an implementation parameter in the I've also decided to table adding multithreading and classes for other CCPhylo solvers for now to minimize work when we refactor. If we bring back the CCPhyloSolver class these options can be built into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Fantastic work putting together this PR, and I'm glad this was also an opportunity to think about how we can refactor the logic of the solver
module altogether. I'm just going to make one small commit re-ordering imports to abide by my nitpicky ordering preferences and then we can merge this in.
Merge pull request YosefLab#205 from colganwi/CCPhylo
The C implementations of UPGMA and NJ in CCPhylo are significantly faster than the current Cassiopeia implementations of these algorithms. This update adds a
CCPhyloSolver
subclass ofDistanceSolver
which writes the dissimilarity matrix to a temp file and then callsccphlyo tree
to solve the tree with the specified method.NeighborJoiningSolver
andUPGMASolver
are now subclasses ofCCPhyloSolver
so when fast is set to true they useCCPhyloSolver._fast_solve()
instead ofDistanceSolver.solve()
. CCPhylo also includes two new algorithms Dynamic and Heuristic NJ which are significantly faster than standard NJ.DynamicNeighborJoiningSolver
andHeuristicNeighborJoiningSolver
are subclasses ofNeighborJoiningSolver
which implement these algorithms. In most cases DNJ should be used since it is guaranteed to generate an exact NJ tree.