Rework are_isomorphic method of Molecule class for polymers #1734
Comments
Any idea what causes the poor scaling? i.e. graph conversion vs. the actual comparisons Note the use of |
Graph conversion is fast. The problem is at |
Last time I came across a similar problem (years back, on a different project, but still dealing with a sort of graph-matching issue) we wanted to use
I worry that continuing to put performance shortcuts on top of a slow algorithm will only get us so far and leave us with something that's brittle (we already have a ton of shortcuts that make it hard to change). But refactoring it might also be more work. |
If I see it correctly, the solution that I propose won't change much in the code. Inside
mol1.perceive_residues()
mol2.perceive_residues()
if len(mol1.residues) > 1 and len(mol2.residues) > 1:
    return Molecule._compare_polymers(mol1, mol2)
# otherwise we follow the previous solution
From a global perspective, the only thing that will change in this case is that |
I don't want to get in the way of you adding functionality - if you can get a new use case from glacial to viable without breaking other behavior that's a pure win - I just wanted to provide some context for why it's slow. The vast majority of the |
I'm quite open to ideas about speeding this up, and coarsening the whole graph into repeating units seems like a promising approach. My biggest concern here is the reliance on residues specifically, for a few reasons:
I agree with the idea of simply swapping in a faster library like graph-tool (especially now that it's deployable from conda-forge). And if we want to try optimizations on top of that, the two general routes I see would be:
|
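Actually, fast testing with
# imports inferred from the names used below
import time
import igraph
from graph_tool import Graph
from graph_tool.topology import isomorphism
from networkx.algorithms.isomorphism import GraphMatcher

# p1 and p2 are the two lists of Molecule objects being compared (built earlier, not shown)
k = 3
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
# networkx
def node_match_func(x, y):
    # always match by at least atomic number
    is_equal = x["atomic_number"] == y["atomic_number"]
    return is_equal
GM = GraphMatcher(p1[k].to_networkx(), p2[k].to_networkx(), node_match=node_match_func)
print(f"Graphs are isomorphic: {GM.is_isomorphic()}")
print(f"Networkx took {time.time() - s}")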
# networkx, no node_match function
s = time.time()
GM = GraphMatcher(p1[k].to_networkx(), p2[k].to_networkx())
print(f"Graphs are isomorphic: {GM.is_isomorphic()}")
print(f"Networkx without node_match function took {time.time() - s}")
# igraph bliss
s = time.time()
g = igraph.Graph(len(list(p1[k].to_networkx().nodes)), [list(x) for x in list(p1[k].to_networkx().edges)])
h = igraph.Graph(len(list(p2[k].to_networkx().nodes)), [list(x) for x in list(p2[k].to_networkx().edges)])
print(f"Graphs are isomorphic: {g.isomorphic(h)}")
print(f"Igraph (bliss) took {time.time() - s}")
# igraph vf2
s = time.time()
g = igraph.Graph(len(list(p1[k].to_networkx().nodes)), [list(x) for x in list(p1[k].to_networkx().edges)])
h = igraph.Graph(len(list(p2[k].to_networkx().nodes)), [list(x) for x in list(p2[k].to_networkx().edges)])
print(f"Graphs are isomorphic: {g.isomorphic_vf2(h)}")
print(f"Igraph (VF2) took {time.time() - s}")
# graph_tool
s = time.time()
g = Graph([list(x) for x in list(p1[k].to_networkx().edges)])
h = Graph([list(x) for x in list(p2[k].to_networkx().edges)])
print(f"Graphs are isomorphic: {isomorphism(g, h)}")
print(f"Graph_tool took {time.time() - s}")
So the fastest method for now is
I tried passing attributes to
# graph tool with vertex map
import numpy as np  # needed for the property arrays below

print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
g = Graph([list(x) for x in list(p1[k].to_networkx().edges)])
prop1 = g.new_vertex_property("int")
prop1.a = np.array([p1[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p1[k].to_networkx().nodes))])
h = Graph([list(x) for x in list(p2[k].to_networkx().edges)])
prop2 = h.new_vertex_property("int")
prop2.a = np.array([p2[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p2[k].to_networkx().nodes))])
print(f"Graphs are isomorphic: {isomorphism(g, h, prop1, prop2)}")
print(f"Graph_tool with vertex invariants took {time.time() - s}")
# igraph with node matching
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
g = igraph.Graph(len(list(p1[k].to_networkx().nodes)), [list(x) for x in list(p1[k].to_networkx().edges)])
g["a"] = [p1[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p1[k].to_networkx().nodes))]
h = igraph.Graph(len(list(p2[k].to_networkx().nodes)), [list(x) for x in list(p2[k].to_networkx().edges)])
h["a"] = [p2[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p2[k].to_networkx().nodes))]
def node_match_func(g1, g2, i1, i2):
    # always match by at least atomic number; i1 indexes g1 and i2 indexes g2
    is_equal = g1["a"][i1] == g2["a"][i2]
    return is_equal
print(f"Graphs are isomorphic: {g.isomorphic_vf2(h, node_compat_fn=node_match_func)}")
print(f"Igraph with attributes took {time.time() - s}")
So my current opinion is that using another library is the proper solution.
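The snippets above start the clock before calling to_networkx(), so each reported time includes graph conversion as well as the isomorphism check. A minimal sketch of a harness that records the two phases separately is shown below. The helper names (timed, build_edges, same_degree_sequence) are hypothetical stand-ins written for this illustration and use only the standard library; in a real comparison the two timed blocks would wrap the conversion call and the library's isomorphism call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Hypothetical stand-in for graph conversion: a linear chain of n nodes.
def build_edges(n):
    return [(i, i + 1) for i in range(n - 1)]


# Hypothetical stand-in for the comparison step: a cheap degree-sequence check.
def same_degree_sequence(edges_a, edges_b):
    def degrees(edges):
        counts = {}
        for u, v in edges:
            counts[u] = counts.get(u, 0) + 1
            counts[v] = counts.get(v, 0) + 1
        return sorted(counts.values())

    return degrees(edges_a) == degrees(edges_b)


results = {}
with timed("conversion", results):
    a, b = build_edges(1000), build_edges(1000)
with timed("comparison", results):
    outcome = same_degree_sequence(a, b)

for label, seconds in results.items():
    print(f"{label}: {seconds:.6f} s")
```

time.perf_counter() is used instead of time.time() because it is a monotonic, high-resolution clock intended for measuring short intervals.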
I think that splitting into residues can be done when the molecule is created (at least, residues are already there if the molecule is created with |
If the isomorphism check for residue-coarsened molecules can be made to work correctly (that is, it is always correct and never an approximation if the template matching succeeded in matching the entire molecule to some residue templates), this sounds like a great approach. If we provide templates for the majority of the common use cases, we speed up nearly all the common use cases without losing correctness. We could still fall back to slow graph matching when there are things that don't match templates. |
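The fast-path-with-fallback idea described above could look roughly like the following sketch. Everything here is hypothetical: CoarseMolecule, bond, and are_isomorphic_coarse are illustrative names, not part of the toolkit, and the sketch assumes that a template name fully determines a residue's internal graph. The fast path is only a sufficient condition: identical template sequences and identical inter-residue bonds imply isomorphism, but a mismatch proves nothing (the residues could simply be ordered differently), so any other case is handed to the slow check.

```python
from dataclasses import dataclass


def bond(a, b):
    """Direction-agnostic inter-residue bond between two (residue_index, atom_name) ends."""
    return frozenset({a, b})


@dataclass(frozen=True)
class CoarseMolecule:
    """Hypothetical residue-level view of a molecule (not a toolkit class)."""
    residue_names: tuple            # template name per residue, None if unmatched
    inter_residue_bonds: frozenset  # set of bond(...) values


def fully_templated(mol):
    return all(name is not None for name in mol.residue_names)


def are_isomorphic_coarse(mol1, mol2, slow_check):
    """Return True on the fast path, otherwise defer to slow_check(mol1, mol2)."""
    if fully_templated(mol1) and fully_templated(mol2):
        if (mol1.residue_names == mol2.residue_names
                and mol1.inter_residue_bonds == mol2.inter_residue_bonds):
            return True
    # Unmatched residues or differing coarse views: no conclusion, fall back.
    return slow_check(mol1, mol2)


# Usage with a placeholder slow check that records whether it was called.
calls = []

def slow(m1, m2):
    calls.append((m1, m2))
    return False

pep = CoarseMolecule(("ACE", "ALA", "NME"),
                     frozenset({bond((0, "C"), (1, "N")), bond((1, "C"), (2, "N"))}))
same = CoarseMolecule(("ACE", "ALA", "NME"),
                      frozenset({bond((1, "N"), (0, "C")), bond((2, "N"), (1, "C"))}))
unknown = CoarseMolecule(("ACE", None, "NME"), frozenset())

fast_result = are_isomorphic_coarse(pep, same, slow)         # fast path, slow() not called
fallback_result = are_isomorphic_coarse(pep, unknown, slow)  # falls back to slow()
```

Keeping the fallback as an injected callable means the existing full-graph comparison can stay untouched while the fast path is added in front of it.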
A slightly different approach can be to
|
Is your feature request related to a problem? Please describe.
The are_isomorphic method of the Molecule class is slow for large polymers. This can be a problem when importing a Topology from openmm. To check this, I generated capped ALA peptides with one randomly positioned ASP using tleap. For example:
Then I loaded the generated pdb files into Molecule objects and called
as in the from_openmm method and checked how long the method takes to run. The results are as follows:
For even larger protein systems the method can take even longer, which can stop users from using the from_openmm method. The complexity is also worse than $O(n^2)$.
Describe the solution you'd like
I think that, for polymers, the solution could be to test whether the molecules are isomorphic on a per-residue basis. We can use perceive_residues to split the polymer into residues and then generate and compare graphs for each residue. This should be faster, since amino acids are small and we will need at most $n$ comparisons, where $n$ is the length of the protein. This will require passing a Molecule object to are_isomorphic, but this should be a minor issue.
If we agree on the suggested solution, I can start working on it.
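To make the per-residue idea concrete, here is a rough sketch using only the standard library. It is not the toolkit's implementation: small_graph_isomorphic and polymers_isomorphic_by_residue are hypothetical names, residues are represented as plain (atomic_numbers, bonds) pairs, and the sketch assumes both chains list their residues in the same order with identical inter-residue linkage. Under those assumptions it performs at most $n$ small comparisons; it is a screening step and does not by itself prove whole-molecule isomorphism.

```python
def small_graph_isomorphic(elems1, bonds1, elems2, bonds2):
    """Backtracking isomorphism test for small graphs, matching atoms by atomic number.

    elems: list of atomic numbers; bonds: iterable of (i, j) atom-index pairs.
    """
    n = len(elems1)
    if n != len(elems2) or sorted(elems1) != sorted(elems2):
        return False
    adj1 = [set() for _ in range(n)]
    adj2 = [set() for _ in range(n)]
    for i, j in bonds1:
        adj1[i].add(j)
        adj1[j].add(i)
    for i, j in bonds2:
        adj2[i].add(j)
        adj2[j].add(i)

    mapping = {}  # atom index in graph 1 -> atom index in graph 2
    used = set()  # atom indices of graph 2 already assigned

    def extend(i):
        if i == n:
            return True
        for j in range(n):
            # Candidates must be unused and agree in element and degree.
            if j in used or elems1[i] != elems2[j] or len(adj1[i]) != len(adj2[j]):
                continue
            # Every already-mapped neighbour of i must map to a neighbour of j.
            if all(mapping[k] in adj2[j] for k in adj1[i] if k in mapping):
                mapping[i] = j
                used.add(j)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(j)
        return False

    return extend(0)


def polymers_isomorphic_by_residue(residues1, residues2):
    """Compare two chains residue by residue (same order and linkage assumed)."""
    if len(residues1) != len(residues2):
        return False
    return all(
        small_graph_isomorphic(e1, b1, e2, b2)
        for (e1, b1), (e2, b2) in zip(residues1, residues2)
    )


# Toy fragments: a linear N-C-C-O chain, the same chain with atoms listed in
# reverse order, and a branched variant with the same atoms.
linear = ([7, 6, 6, 8], [(0, 1), (1, 2), (2, 3)])
relabeled = ([8, 6, 6, 7], [(0, 1), (1, 2), (2, 3)])
branched = ([7, 6, 6, 8], [(0, 1), (1, 2), (1, 3)])

match = polymers_isomorphic_by_residue([linear, linear], [relabeled, linear])
mismatch = polymers_isomorphic_by_residue([linear, linear], [linear, branched])
```

Because each residue has only a handful of atoms, the backtracking search stays cheap even though it is exponential in the worst case; a library matcher could be substituted for small_graph_isomorphic without changing the outer loop.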