Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mol fix and xyz #529

Merged
merged 16 commits into from
Mar 3, 2020
Merged

Mol fix and xyz #529

merged 16 commits into from
Mar 3, 2020

Conversation

jthorton
Copy link
Collaborator

@jthorton jthorton commented Feb 26, 2020

Here I aim to add support for writing multiframe and single frame XYZ files and fixing the MOL file type writer and adding more tests for each of the file output types. We also have another speed improvement for #353.

This is a new branch for the fix as the old one was not from master.

@jthorton
Copy link
Collaborator Author

As I did the branch wrong I have copied the comments across from the old PR

Could you specify a few things about the intended behavior of OFFTK with respect to XYZ?
Namely:

  • Will we only ever write conformers of the same molecule to a single XYZ file, or could multiple distinct molecules be written to a single XYZ?

My intention was that only one molecule would be written to an XYZ file but multipule conformers could be, however there is a way around this for users by doing something like this:

from openforcefield.topology import Molecule
methane = Molecule.from_smiles('C')
ethane = Molecule.from_smiles('CC')

xyz = open('molecules.xyz', 'w')
methane.to_file(xyz, 'xyz')
xyz = open('molecules.xyz', 'a')
ethane.to_file(xyz, 'xyz')

would give :

5
CH4
C 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
8
C2H6
C 0.0000000000 0.0000000000 0.0000000000
C 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000
H 0.0000000000 0.0000000000 0.0000000000

As we will not support the reading of this file type this is not an issue for us but could be for users when interfacing with other programs so am not sure if we should stop this behaviour or not.

  • Is atom indexing guaranteed to be preserved when writing XYZ?

Yes the file is generated based on the current molecule atom ordering, as this is done by the toolkit directly not some back end toolkit the order should never change.

  • What is the behaviour if offmol.conformers is None?

If there are no conformers then we set all of the positions of the atoms to [0,0,0] and write out the file copying the behaviour of writing a PDB with no conformers.

  • Do we expect XYZ to be loadable back into OFFTK in any form? If so...

No there is even less information in here than in a PDB file so we should never load from a XYZ file.

* Do we expect a molecule's connectivity and/or formal charges to survive a round trip to XYZ?
* Do we expect a molecule's `name` to survive a round trip to XYZ?
* Do we expect a molecule's `partial_charges` to survive a round trip to XYZ?
* Do we expect a molecule's `properties` dict to survive a round trip to XYZ?

@jthorton
Copy link
Collaborator Author

jthorton commented Feb 26, 2020

Tests are failing due to the Mol writer fix I have put in, due to the list of file writers being a class level object and we require the import of RDKit to link the file sufix to the file writer in this part:

class RDKitToolkitWrapper(ToolkitWrapper):
    """
    RDKit toolkit wrapper

    .. warning :: This API is experimental and subject to change.
    """

    from rdkit import Chem

    _toolkit_name = 'The RDKit'
    _toolkit_installation_instructions = 'A conda-installable version of the free and open source RDKit cheminformatics ' \
                                         'toolkit can be found at: https://anaconda.org/rdkit/rdkit'
    _toolkit_file_read_formats = ['SDF', 'MOL', 'SMI']  #TODO: Add TDT support
    _toolkit_file_write_formats = {'SDF': Chem.SDWriter, 'MOL': Chem.SDWriter,
                                   'SMI': Chem.SmilesWriter, 'PDB': Chem.PDBWriter, 'TDT': Chem.TDTWriter}

any environment that does not have RDKit installed will fail at this point as it is always run. This could be fixed by possibly having this dict be filled during an __init__ call which could import rdkit something like:

class RDKitToolkitWrapper(ToolkitWrapper):

    _toolkit_file_write_formats = {'SDF': None, 'MOL': None,
                                   'SMI': None, 'PDB': None, 'TDT': None}

    def __init__(self):
        try:
            from rdkit import Chem
            self._toolkit_file_write_formats = {'SDF': Chem.SDWriter, 'MOL': Chem.SDWriter,
                                                'SMI': Chem.SmilesWriter, 'PDB': Chem.PDBWriter, 'TDT': Chem.TDTWriter}
        except ImportError:
            pass

However, we still have the same problem of having to update the dictionary in two locations which is what caused the issue in the first place.

One other solution I have is to just have strings of the names of the class/functions that will do the file writing be stored in the dictionary that can then be executed using eval.

@codecov-io
Copy link

codecov-io commented Feb 26, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@53e7e95). Click here to learn what that means.
The diff coverage is 93.33%.

@j-wags
Copy link
Member

j-wags commented Feb 26, 2020

Excellent.

so am not sure if we should stop this behaviour or not.

I don't think we need to police the one-molecule-per-XYZ behavior if a user opens a file in append mode. If that messes up their workflow, it's their own fault :-)

I agree with all of your answers on the behavior and think this will be a great addition. Thanks, Josh!

@j-wags
Copy link
Member

j-wags commented Feb 26, 2020

Tests are failing due to the Mol writer fix I have put in, due to the list of file writers being a class level object

I wonder if these should be instance-level? I've (occasionally) debugged problems by making two RDKitToolkitWrappers, and overridden the behavior of one of them by redefining its attributes. But it gets annoying when that attribute is class-level, since then it affects the other one as well.

Is there a good reason to keep the read and write formats at the class level? If not, I'd be fine with defining them on an instance-level (in __init__) and avoiding all this eval business.

@jthorton
Copy link
Collaborator Author

jthorton commented Feb 27, 2020

Yeah, that should be fine as we can do the import in the init, would we want to change this behaviour for both of the toolkit wrappers to make them consistent? Also, what should happen if RDKit is not installed and someone tried to create the wrapper should we fail loud with an import error?

@jthorton jthorton requested a review from j-wags March 2, 2020 10:07
@jthorton
Copy link
Collaborator Author

jthorton commented Mar 2, 2020

Just need one final check that I have RDKit file writers set up correctly so that Mol and SDF write the same file format. Also, do we need more tests for each of the file writing formats similar to the XYZ testing I have put in?

Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, Josh. Especially nice tests. I commented on a few spelling things, and requested a reference spec for our XYZ implementation, but otherwise this is excellent work.

One note (kinda to myself) -- MOL is technically different from SDF, in that it doesn't contain property information. https://en.wikipedia.org/wiki/Chemical_table_file#SDF. I don't think there's anything to do here for it (since we don't tell RDKit to write any properties anyway), however I'll probably need to override this behavior in #281

docs/releasehistory.rst Outdated Show resolved Hide resolved
docs/releasehistory.rst Show resolved Hide resolved
openforcefield/tests/test_toolkits.py Outdated Show resolved Hide resolved
@@ -38,6 +38,7 @@ def get_mini_drug_bank(toolkit):
# This is a work around a weird error where even though the test is skipted due to a missing toolkit
# we still try and read the file with the toolkit
if toolkit.is_available():
toolkit = toolkit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confused me for a minute -- Could you change it to toolkit = toolkit_class(), and have the input parameter above be toolkit_class as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah no problem!

Comment on lines +2181 to +2184
isomorphic = GM.is_isomorphic()

if isomorphic and return_atom_map:
# now generate the sorted mapping between the molecules
GM = GraphMatcher(
mol1_netx,
mol2_netx,
node_match=node_match_func,
edge_match=edge_match_func)
for mapping in GM.isomorphisms_iter():
topology_atom_map = mapping
break
topology_atom_map = GM.mapping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slick. I like it.

Comment on lines +1663 to +1668
@property
def toolkit_file_write_formats(self):
"""
List of file formats that this toolkit can write.
"""
return list(self._toolkit_file_write_formats.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughtful!

@@ -1841,7 +1870,6 @@ def from_file_obj(self,
a list of Molecule objects is returned.

"""
from openforcefield.topology import Molecule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

openforcefield/utils/toolkits.py Outdated Show resolved Hide resolved
writer = rdkit_writers[file_format](file_obj)
writer.write(rdmol)
writer.close()
self.to_file_obj(molecule=molecule, file_obj=file_obj, file_format=file_format)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slick!

docs/releasehistory.rst Outdated Show resolved Hide resolved
@jthorton
Copy link
Collaborator Author

jthorton commented Mar 3, 2020

Thanks for the feedback Jeff I have made the changes you suggested and will merge once tests pass.

@jthorton jthorton merged commit 0e5ff41 into master Mar 3, 2020
@jthorton jthorton deleted the mol_fix_and_xyz branch March 3, 2020 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing to MOL error
3 participants