-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve speed of OFF molecule/network molecule comparisons and isomorphisms #353
Comments
I had to do something similar (identify copies of small molecules in a system) and found that "serializing" all connected molecules to canonical SMILES (as provided by |
It looks like you also do the expensive graph matching for all molecules, then try to correct for having different mappings afterwards. We really need to do some fast hashing / caching to avoid this expense. What if we used an algorithm like this:
( (element_1, element_2, ..., element_N), ((bond1_atomindex1, bond1_atomindex2), ..., (bondN_atomindex1, bond1_atomindex2))
If the |
In the case where many candidate molecules are provided (which is not this case, but could be a future use case), we would also want to speed up the determination of which unique molecule the From there, imperfect matches can also be rapidly rejected by first calling Later on, we'll have to think about optimizing this for the case where one of the molecules in the system is a biological macromolecule. |
In the past when solving this problem I've always done very similar to what @jchodera suggests and find that in general it works well. The only slight difference is that before even considering the You would then do the graph isomorphism matching on the first molecule in each of the lists in the dictionary map. In doing this you replace a lot of graph matching with simple hashing, and can safely only consider one molecule of each type when doing the matching (provided the hash is sensible). The added benefit is that you can do some nice sanity checks and provide some feedback to the user. Namely, if you find that two of your hashes map to the same unique molecule something most likely is going wrong (e.g think of a pure system where for some reason one of the molecules has a slightly different atom ordering than the others in the system), and additionally if a unique molecule has not been matched to anything in the system. This could also be used to easily check / enforce that all molecules of the same type have the same atom ordering and are ordered contiguously which in turn could yield performance improvements and code simplification when parameterising the system. This could also require (for openmm systems I think?) code to turn lists of atoms and bonds into individual molecules, but this isn't too bad to do and is pretty quick to run (you can use a flood-fill like algorithm). The pseudocode would be something like:
|
@jaimergp This would be sufficient for identification, but unfortunately we need to get an atom index mapping from this process. |
@jaimergp @jchodera @SimonBoothroyd Thanks for the really thoughtful feedback here. I agree that these ideas should net us a close to 1000x speedup in this case. This is super helpful :-) |
Alas! A critical facet of our implementation is that we don't require detailed molecule info (bond orders and stereochemistry) from OpenMM. If we had a reliable source of these, we wouldn't need |
So just to clarify - when I say into individual molecules, I mean each atom needs to be assigned a parent molecule index (i.e atoms [0-19] belong are assigned a molecule index of 0, atoms [20-39] are assigned a molecule index of 1) as opposed to we need to create actual
If the bond order is there, awesome, if not, a bond order of None still works in the hash! |
@jthorton and I have been discussing consolidating/deduplicating molecule identity operations. I think that making two highly-performant methods can cover all of our needs:
The arguments for both methods can be
If the user specifies I'm in favor of keeping these two functions separate, because I don't like when one function can return different types (bool vs. dict). This tends to confuse static code analysis and users. The question remains of "can we do this in a way to efficiently handle batch operations?". Caching data on the |
Or you could just have a
where
They would - they would just replace the
I would advise against caching unless essential - the required bookkeeping associated with caching is never fun, and avoiding caching would ensure that the methods to work well with |
Ahh, yes. I agree -- We should make a single function that returns
(Reading through our DM conversation, I now see this may be what @jthorton was suggesting all along) I agree that caching can be a pain to manage. On the other hand, the most efficient isomorphism checks in some cases overlap with other OFFTK functionality -- eg. SMILES are effectively molecule hashes for certain comparisons. Specifically, One of the other concerns is that the proposed function will only do 1 vs. 1 comparisons, meaning that matching a dataset of N molecules to M unique mols is an O(N*M) operation. However, if we could extract the hashes, then we could make a hash table for either M or N, and complete the matching in O(M+N) time using John/Simon's algorithm. Since we already have the algorithm in mind, and it is a significant "big-O-notation" speed-up for dataset matching, maybe we should have |
Yes, this was what I had in mind I think due to the number of optional arguments we have it makes sense to just have it as one function, with the option to get the mapping as well. Also, are there any arguments listed that we would always want to search by? I would imagine we would always compare the |
There's one quick change that can be made to greatly accelerate If we add a simple method to return the Hill formula, this could be checked for equality before proceeding further. It could even be used as a hash if you want to get super fancy with a couple of additional lines of code. This presumes, of course, that you want to check atomic numbers are the same ( Edit: Just realized I suggested this earlier in this thread. But it's still an incredibly simple and inexpensive way of achieving a big speedup for now. |
This could easily be added to the |
Weird, it seems like we only use that for forming error messages right now! If It would be good to consider the whole If I recall correctly, there is one other issue regarding the desire to pick out the same mapping each time if multiple mappings are possible---for example in matching waters. By re-using the same mapping each time (when possible), this allows for a great deal of compression when exporting to tools like gromacs. I'm not sure if the current code does this or not---looks like it may not.
+1 for useful methods like this, though I'd make it a property called |
Linking to #461 on this point
Yes, this was a "there-are-20-things-on-my-to-do list, either-this-gets-committed-now-or-never" feature. Since it was not heavily tested, I kept it as a private method only for formatting error messages. But I think it's behaved well enough (and there are enough eyes on it now) that it should be promoted into the public API. @jthorton and I spoke this morning about the API for hill formulas. It seems best to make a single Also, Hill formula comparisons are isomorphism checks if we don't care about stereochemistry, aromaticity, bond orders, or bond atom indices. I know that |
@jchodera And I fully agree about using heuristic checks to speed up comparisons. Hill formula comparison is a fast way to return |
One quick note here: Currently, If you want to make a static method called I would suggest keeping the static method private and instead using it as appropriate in methods that are consistently named with the rest of the API. Or perhaps establishing a new standard for static methods, like |
After discussion with @j-wags today, I'll be taking a crack at a PR on this issue. Just performed a skim of the existing discussion and will reach out if I hit points of confusion. Thanks! |
Awesome. Thanks, @dotsdl -- Please do reach out if you have any questions :-) |
This may have been solved in 0ea045d by @jthorton. I'm seeing topology creation take between 65 and 130 seconds on my machine, whereas a run from e.g. e3a8856 (before @jthorton's changes) goes for far longer (would probably be close to @j-wags' original ~1600 seconds if I was willing to wait that long). As for creating the system with start_time = time.time()
system = ff.create_openmm_system(top)
print("--- Ssytem creation finished in %s seconds ---" % (time.time() - start_time)) I currently get the following traceback from
It may not necessarily be related to this issue, but any thoughts on what's going on with the |
I have been testing Jeffs example system above of a box of 1000 hexadecanes, and with the current master I get the following times:
However, I think I can get this a bit faster by being more efficient, as I think on cases where I do check full isomorphism I do it twice so with a small code change this becomes:
I'll check if the tests still all pass before making any other changes. Not too sure about the error above, however. |
Awesome, this is great @jthorton! I imagine this is sufficient performance to close the issue once it's in? |
Also, looking at this with fresh eyes this morning, it helps to turn off the profiler magic ( Also, realized I had |
Great! Thanks for checking into this. I agree that this is a huge performance improvement, and a minute or two is much better than an hour. That said, we might reasonably need to scale from 1000 hexadecanes to 100,000 waters at some point (or nastier combos for mixed-solvent/membrane sims). For M molecules in the PDB files and U unique molecules, our current algorithm scales as O(U*M). Simon's proposal above would make it O(U^2), which is a large improvement in the vast majority of cases, and no worse in the worst-case scenario. So I'd like to leave this issue open until Simon's algorithm is implemented. I could be wrong about the big-O notation of the current and proposed solutions -- Please correct me if so! |
I've seen the error above a few times in my We will be changing the OFFTK recipe to use AmberTools19 instead of AmberMini for the next release, so AT19 compatibility would be the highest-priority factor here. Of course, it's be better if they both worked (not all users will upgrade). But if it's strictly one-or-the-other, AmberTools is the more important one. @dotsdl Can you confirm that it's a AmberMini vs. AmberTools issue? If so, we may need to find a command line syntax that's shared between both versions of antechamber, or have OFFTK detect which AmberX version it's talking to and change its behavior accordingly. |
This doesn't seem to be the only AmberTools19-migration-related problem. We've opened a discussion about AT19-related issues here. #532 |
There's still a slowdown(s) somewhere In [1]: import openmm.app
In [2]: from openff.toolkit import Molecule, ForceField, Topology
/Users/mattthompson/micromamba/envs/openff-toolkit-test/lib/python3.10/site-packages/openff/nagl_models/openff_nagl_models.py:6: UserWarning: Module openff was already imported from None, but /Users/mattthompson/software/openff-interchange is being added to sys.path
from pkg_resources import resource_filename
In [3]: sage = ForceField("openff-2.0.0.offxml")
In [4]: hexadecane = Molecule.from_smiles('CCCCCCCCCCCCCCCC')
...: openmm_topology = openmm.app.PDBFile('output.pdb').topology
In [5]: topology = Topology.from_openmm(openmm_topology, unique_molecules=[hexadecane])
In [6]: %%timeit
...: Topology.from_openmm(openmm_topology, unique_molecules=[hexadecane])
...:
...:
4.84 s ± 44.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %%timeit
...: sage.create_openmm_system(topology)
...:
...:
1min 3s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: topology.molecule(0).assign_partial_charges(partial_charge_method="am1bccelf10")
...:
...:
7.44 s ± 48.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
Description
System is a box of 1000 hexadecanes.
output.pdb.gz
Thoughts
I suspect that this slowdown occurs in the graph matching stage of topology creation. Hexadecane has a huge number of self-symmetries. While the function only takes the first molecular topology match that's found, it may be generating all of them unnecessarily.
Relevant code is here: https://github.com/openforcefield/openforcefield/blob/master/openforcefield/topology/topology.py#L1635-L1664
We should poke around NetworkX to see if there's a faster algorithm that this code could use.
The text was updated successfully, but these errors were encountered: