Rework are_isomorphic method of Molecule class for polymers #1734

Open
pbuslaev opened this issue Oct 4, 2023 · 9 comments

Comments

@pbuslaev
Contributor

pbuslaev commented Oct 4, 2023

Is your feature request related to a problem? Please describe.
The are_isomorphic method of the Molecule class is slow for large polymers. This can be a problem when importing a Topology from OpenMM. To check this, I generated capped ALA peptides with one randomly positioned ASP using tleap. For example:

sequence 1: ACE ALA ASP ALA ALA ALA NME
sequence 2: ACE ALA ALA ASP ALA ALA NME

Then I loaded generated pdb files to Molecule objects and called

Molecule.are_isomorphic(
    x.to_networkx(),
    y.to_networkx(),
    aromatic_matching=False,
    formal_charge_matching=False,
    bond_order_matching=False,
    atom_stereochemistry_matching=False,
    bond_stereochemistry_matching=False,
    strip_pyrimidal_n_atom_stereo=False,
)

as is done in the from_openmm method, and checked how long the call takes.
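A minimal sketch of the measurement (the file names and the from_polymer_pdb loading route are illustrative assumptions, not the exact script used):

import time

from openff.toolkit.topology import Molecule

# hypothetical file names for the two tleap-generated peptides of a given length
x = Molecule.from_polymer_pdb("seq1.pdb")
y = Molecule.from_polymer_pdb("seq2.pdb")

start = time.perf_counter()
Molecule.are_isomorphic(
    x.to_networkx(),
    y.to_networkx(),
    aromatic_matching=False,
    formal_charge_matching=False,
    bond_order_matching=False,
    atom_stereochemistry_matching=False,
    bond_stereochemistry_matching=False,
    strip_pyrimidal_n_atom_stereo=False,
)
print(f"{x.n_atoms} atoms: {time.perf_counter() - start:.2f} s")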
The results are as follows:

# amino acids    time (s)
5                0.03
9                0.21
13               1.52
17               11.3
21               79.9
25               552

For even larger protein systems the method can take even longer, which may deter users from using the from_openmm method. And the scaling is worse than $O(n^2)$.

Describe the solution you'd like
I think that a solution, in the case of polymers, could be to test whether the molecules are isomorphic on a per-residue basis. We can use perceive_residues to split the polymer into residues and then generate and compare graphs for each residue. This should be faster, since amino acids are small and we will need at most $n$ comparisons, where $n$ is the length of the protein. This will require passing Molecule objects to are_isomorphic, but this should be a minor issue.

If we agree on the suggested solution, I can start working on it.

@mattwthompson
Member

Any idea what causes the poor scaling? i.e. graph conversion vs. the actual comparisons

Note the use of networkx is already known to be horrible for performance - see some related discussion #617 #1143 #353. There might be ways to speed it up while keeping the base structure the same. (For example, the notion of residues is weak in non-biological polymers, so for some use cases it'd be nice to not need to specify the subunits.)

@pbuslaev
Contributor Author

pbuslaev commented Oct 4, 2023

Graph conversion is fast. The problem is at the GraphMatcher stage.

@mattwthompson
Member

Last time I came across a similar problem (years back, on a different project, but still dealing with a sort of graph-matching issue) we wanted to use graph-tool which was fast but hard to install - I think that's changed.

I worry that continuing to put performance shortcuts on top of a slow algorithm will only get us so far and leave us with something that's brittle (already we have a ton of shortcuts that make it hard to change). But refactoring it might also be more work.

@pbuslaev
Contributor Author

pbuslaev commented Oct 4, 2023

If I see it correctly, the solution I propose won't change much in the code. Inside are_isomorphic, we could add something like this:

mol1.perceive_residues()
mol2.perceive_residues()
if len(mol1.residues) > 1 and len(mol2.residues) > 1:
    return Molecule._compare_polymers(mol1, mol2)
# otherwise, fall back to the previous behavior

_compare_polymers would simply iterate over residues, comparing residue pairs, again based on the networkx approach.

From a global perspective, the only thing that would change is that unique_molecules are stored as Molecule objects, and all calls to are_isomorphic would use Molecule objects.
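For illustration, a minimal sketch of what such a hypothetical _compare_polymers could look like. It assumes residues have already been perceived on both molecules, compares residues pairwise in sequence order by atomic number only, and ignores inter-residue bonds, so it is only a sketch of the idea, not a complete check:

from networkx.algorithms.isomorphism import GraphMatcher


def _compare_polymers(mol1, mol2) -> bool:
    """Sketch: residue-wise isomorphism check for two Molecule objects."""
    res1, res2 = list(mol1.residues), list(mol2.residues)
    if len(res1) != len(res2):
        return False
    g1, g2 = mol1.to_networkx(), mol2.to_networkx()
    for r1, r2 in zip(res1, res2):
        # subgraphs induced by the atoms of each residue
        sub1 = g1.subgraph(a.molecule_atom_index for a in r1.atoms)
        sub2 = g2.subgraph(a.molecule_atom_index for a in r2.atoms)
        matcher = GraphMatcher(
            sub1,
            sub2,
            node_match=lambda n1, n2: n1["atomic_number"] == n2["atomic_number"],
        )
        if not matcher.is_isomorphic():
            return False
    return True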

@mattwthompson
Member

I don't want to get in the way of you adding functionality - if you can get a new use case from glacial to viable without breaking other behavior that's a pure win - I just wanted to provide some context for why it's slow. The vast majority of the Molecule API has never been used with molecules larger than about 100 atoms until this year, so things like GraphMatcher being unworkably slow for polymers is an expected growing pain, and the sort of thing that's tempting to refactor without a small molecule assumption.

@j-wags
Member

j-wags commented Oct 4, 2023

I'm quite open to ideas about speeding this up, and coarsening the whole graph into repeating units seems like a promising approach. My biggest concern here is the reliance on residues specifically, for a few reasons:

  • (not a big problem) The residue names aren't totally informative (by the CCD's rules, the residue name "ASP" would be assigned to both aspartic acid and aspartate, but those aren't isomorphic)
  • (a big problem) Like Matt said, dividing on residues is a bit arbitrary, and we can't assume that perceive_residues will always be suitable for the user's polymer. Unconditionally adding perceive_residues to every are_isomorphic call will make all non-protein comparisons slower.
  • (not a big problem) Residue ordering outside proteins is arbitrary. And even within proteins, there are cases where insertions or modifications can make it unclear what "order" the residues go in.

I agree with the idea of simply swapping in a faster library like graph-tool (especially now that it's deployable from conda-forge). And if we want to try optimizations on top of that, the two general routes I see would be:

  • Some sort of "coarsening" approach, like preprocessing the structure to find repeating units (a more general form of "residues" from @pbuslaev's original post), and then comparing the coarsened graph.
  • Some sort of "canonical ordering" approach, where we find some O(1) or O(N) algorithm that assigns node indices identically for isomorphic graphs. So we'd run that canonical-ordering-algorithm on both inputs to assign them new node indices, and then just walk through all the nodes of both in index order and make sure they're all identical (O(N) time)
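Not a full canonical ordering, but in the same spirit, networkx's Weisfeiler-Lehman graph hash gives a cheap necessary-condition screen: equal hashes do not prove isomorphism, but different hashes rule it out in roughly linear time. A rough sketch, for illustration only:

import networkx as nx


def definitely_not_isomorphic(g1: nx.Graph, g2: nx.Graph) -> bool:
    """Cheap screen: True means the graphs cannot be isomorphic."""
    h1 = nx.weisfeiler_lehman_graph_hash(g1, node_attr="atomic_number")
    h2 = nx.weisfeiler_lehman_graph_hash(g2, node_attr="atomic_number")
    return h1 != h2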

@pbuslaev
Contributor Author

pbuslaev commented Oct 4, 2023

Actually, quick testing with networkx, igraph, and graph-tool gave the following results:

import time

import igraph
from graph_tool.all import Graph, isomorphism
from networkx.algorithms.isomorphism import GraphMatcher

k = 3
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
# networkx
def node_match_func(x, y):
    # always match by at least atomic number
    is_equal = x["atomic_number"] == y["atomic_number"]
    return is_equal
GM = GraphMatcher(p1[k].to_networkx(), p2[k].to_networkx(), node_match=node_match_func)
print(f"Graphs are isomorphic: {GM.is_isomorphic()}")
print(f"Networkx took {time.time() - s}")
# networkx, no node_match function
s = time.time()
GM = GraphMatcher(p1[k].to_networkx(), p2[k].to_networkx())
print(f"Graphs are isomorphic: {GM.is_isomorphic()}")
print(f"Networkx without node_match function took {time.time() - s}")
# igraph bliss
s = time.time()
g = igraph.Graph(len(list(p1[k].to_networkx().nodes)), [list(x) for x in list(p1[k].to_networkx().edges)])
h = igraph.Graph(len(list(p2[k].to_networkx().nodes)), [list(x) for x in list(p2[k].to_networkx().edges)])
print(f"Graphs are isomorphic: {g.isomorphic(h)}")
print(f"Igraph took {time.time() - s}")
# igraph vf2
s = time.time()
g = igraph.Graph(len(list(p1[k].to_networkx().nodes)), [list(x) for x in list(p1[k].to_networkx().edges)])
h = igraph.Graph(len(list(p2[k].to_networkx().nodes)), [list(x) for x in list(p2[k].to_networkx().edges)])
print(f"Graphs are isomorphic: {g.isomorphic_vf2(h)}")
print(f"Igraph took {time.time() - s}")
# graph_tool
s = time.time()
g = Graph([list(x) for x in list(p1[k].to_networkx().edges)])
h = Graph([list(x) for x in list(p2[k].to_networkx().edges)])
print(f"Graphs are isomorphic: {isomorphism(g, h)}")
print(f"Graph_tool took {time.time() - s}")

Test for a system with 184 atoms and 183 bonds
Graphs are isomorphic: False
Networkx took 11.790411710739136
Graphs are isomorphic: False
Igraph took 0.005591154098510742
Graphs are isomorphic: False
Igraph took 95.27721905708313
Graphs are isomorphic: False
Graph_tool took 104.98597836494446

So the fastest method for now is igraph.isomorphic, which uses the bliss algorithm. The problem is, if I understand correctly, that we cannot pass any vertex attributes to this method. We can pass attributes to igraph.isomorphic_vf2. Networkx also uses the vf2 algorithm. And for smaller systems, without passing a node_match function to networkx, igraph was several-fold faster:

Test for a system with 64 atoms and 63 bonds
Graphs are isomorphic: False
Networkx without node_match function took 0.3168954849243164
Graphs are isomorphic: False
Igraph took 0.0030617713928222656
Test for a system with 104 atoms and 103 bonds
Graphs are isomorphic: False
Networkx without node_match function took 15.669962644577026
Graphs are isomorphic: False
Igraph took 0.05946803092956543

I tried passing attributes to igraph and graph-tool. With igraph I did not get any performance gain, while graph-tool worked faster than without passing properties, but still slower than networkx:

import numpy as np  # in addition to the imports from the previous snippet

# graph_tool with vertex map
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
g = Graph([list(x) for x in list(p1[k].to_networkx().edges)])
prop1 = g.new_vertex_property("int")
prop1.a = np.array([p1[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p1[k].to_networkx().nodes))])
h = Graph([list(x) for x in list(p2[k].to_networkx().edges)])
prop2 = h.new_vertex_property("int")
prop2.a = np.array([p2[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p2[k].to_networkx().nodes))])
print(f"Graphs are isomorphic: {isomorphism(g, h, prop1, prop2)}")
print(f"Graph_tool with vertex invariants took {time.time() - s}")
# igraph with node matching
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
g = igraph.Graph(len(list(p1[k].to_networkx().nodes)), [list(x) for x in list(p1[k].to_networkx().edges)])
g["a"] = [p1[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p1[k].to_networkx().nodes))]
h = igraph.Graph(len(list(p2[k].to_networkx().nodes)), [list(x) for x in list(p2[k].to_networkx().edges)])
h["a"] = [p2[k].to_networkx().nodes[x]["atomic_number"] for x in range(len(p2[k].to_networkx().nodes))]
def node_match_func(g1, g2, i1, i2):
    # always match by at least atomic number
    is_equal = g1["a"][i1] == g2["a"][i2]
    return is_equal
print(f"Graphs are isomorphic: {g.isomorphic_vf2(h, node_compat_fn=node_match_func)}")
print(f"Igraph with attributes took {time.time() - s}")

Test for a system with 184 atoms and 183 bonds
Graphs are isomorphic: False
Graph_tool with vertex invariants took 27.565279006958008
Test for a system with 184 atoms and 183 bonds
Graphs are isomorphic: False
Igraph with attributes took 88.5980315208435

So my current opinion is that using another library is the proper solution.
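One more option worth checking, as an illustration only: recent python-igraph versions accept integer vertex colors directly via the color1/color2 arguments of isomorphic_vf2, which avoids the per-pair Python callback. A sketch, reusing p1[k] and p2[k] from the benchmark above:

import igraph

nx1, nx2 = p1[k].to_networkx(), p2[k].to_networkx()
g = igraph.Graph(nx1.number_of_nodes(), list(nx1.edges))
h = igraph.Graph(nx2.number_of_nodes(), list(nx2.edges))

# integer vertex colors (atomic numbers) passed natively instead of node_compat_fn
colors_g = [nx1.nodes[i]["atomic_number"] for i in range(nx1.number_of_nodes())]
colors_h = [nx2.nodes[i]["atomic_number"] for i in range(nx2.number_of_nodes())]

print(g.isomorphic_vf2(h, color1=colors_g, color2=colors_h))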

> (a big problem) Like Matt said, dividing on residues is a bit arbitrary, and we can't assume that perceive_residues will always be suitable for the user's polymer. Unconditionally adding perceive_residues to every are_isomorphic call will make all non-protein comparisons slower.

> Some sort of "coarsening" approach, like preprocessing the structure to find repeating units (a more general form of "residues" from @pbuslaev's original post), and then comparing the coarsened graph.

I think that splitting into residues can be done when the molecule is created (at least, residues are already there if the molecule is created with Molecule.from_polymer_pdb and some other methods). Then splitting into residues with perceive_residues is not needed. All we need to do is check whether residues are present for a molecule and, if so, run are_isomorphic on a per-residue basis.

@jchodera
Member

jchodera commented Oct 4, 2023

If the isomorphism check for residue-coarsened molecules can be made to work correctly (that is, it is always correct and never an approximation if the template matching succeeded in matching the entire molecule to some residue templates), this sounds like a great approach. If we provide templates for the majority of the common use cases, we speed up nearly all the common use cases without losing correctness. We could still fall back to slow graph matching when there are things that don't match templates.

@pbuslaev
Contributor Author

pbuslaev commented Oct 4, 2023

A slightly different approach could be to:

  1. Match residues first (do graph matching on a residue basis). This should be fast.
  2. If residues can be mapped, run additional graph matching for the atoms of the residues (a sketch of step 1 follows below).
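For illustration, a sketch of step 1 under that idea, assuming mol1 and mol2 are Molecule objects that already carry residue metadata (coarse_graph is a hypothetical helper): build a coarse graph with one node per residue, labeled by residue name, with edges for inter-residue bonds, and match the coarse graphs before descending to atoms.

import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher, categorical_node_match


def coarse_graph(mol) -> nx.Graph:
    """Sketch: one node per residue, edges for inter-residue bonds."""
    atom_to_res = {}
    g = nx.Graph()
    for i, res in enumerate(mol.residues):
        # label residues by name; a per-residue formula would be stricter
        g.add_node(i, label=res.residue_name)
        for atom in res.atoms:
            atom_to_res[atom.molecule_atom_index] = i
    for bond in mol.bonds:
        r1, r2 = atom_to_res[bond.atom1_index], atom_to_res[bond.atom2_index]
        if r1 != r2:
            g.add_edge(r1, r2)
    return g


# step 1: match residues; step 2 (not shown) would match atoms within mapped residues
matcher = GraphMatcher(
    coarse_graph(mol1),
    coarse_graph(mol2),
    node_match=categorical_node_match("label", None),
)
print(matcher.is_isomorphic())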
