-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mol fix and xyz #529
Mol fix and xyz #529
Conversation
As I did the branch wrong I have copied the comments across from the old PR
My intention was that only one molecule would be written to an XYZ file but multipule conformers could be, however there is a way around this for users by doing something like this: from openforcefield.topology import Molecule
methane = Molecule.from_smiles('C')
ethane = Molecule.from_smiles('CC')
xyz = open('molecules.xyz', 'w')
methane.to_file(xyz, 'xyz')
xyz = open('molecules.xyz', 'a')
ethane.to_file(xyz, 'xyz') would give :
As we will not support the reading of this file type this is not an issue for us but could be for users when interfacing with other programs so am not sure if we should stop this behaviour or not.
Yes the file is generated based on the current molecule atom ordering, as this is done by the toolkit directly not some back end toolkit the order should never change.
If there are no conformers then we set all of the positions of the atoms to
No there is even less information in here than in a PDB file so we should never load from a XYZ file.
|
Tests are failing due to the Mol writer fix I have put in, due to the list of file writers being a class level object and we require the import of RDKit to link the file sufix to the file writer in this part: class RDKitToolkitWrapper(ToolkitWrapper):
"""
RDKit toolkit wrapper
.. warning :: This API is experimental and subject to change.
"""
from rdkit import Chem
_toolkit_name = 'The RDKit'
_toolkit_installation_instructions = 'A conda-installable version of the free and open source RDKit cheminformatics ' \
'toolkit can be found at: https://anaconda.org/rdkit/rdkit'
_toolkit_file_read_formats = ['SDF', 'MOL', 'SMI'] #TODO: Add TDT support
_toolkit_file_write_formats = {'SDF': Chem.SDWriter, 'MOL': Chem.SDWriter,
'SMI': Chem.SmilesWriter, 'PDB': Chem.PDBWriter, 'TDT': Chem.TDTWriter} any environment that does not have RDKit installed will fail at this point as it is always run. This could be fixed by possibly having this dict be filled during an class RDKitToolkitWrapper(ToolkitWrapper):
_toolkit_file_write_formats = {'SDF': None, 'MOL': None,
'SMI': None, 'PDB': None, 'TDT': None}
def __init__(self):
try:
from rdkit import Chem
self._toolkit_file_write_formats = {'SDF': Chem.SDWriter, 'MOL': Chem.SDWriter,
'SMI': Chem.SmilesWriter, 'PDB': Chem.PDBWriter, 'TDT': Chem.TDTWriter}
except ImportError:
pass However, we still have the same problem of having to update the dictionary in two locations which is what caused the issue in the first place. One other solution I have is to just have strings of the names of the class/functions that will do the file writing be stored in the dictionary that can then be executed using |
Codecov Report
|
Excellent.
I don't think we need to police the one-molecule-per-XYZ behavior if a user opens a file in append mode. If that messes up their workflow, it's their own fault :-) I agree with all of your answers on the behavior and think this will be a great addition. Thanks, Josh! |
I wonder if these should be instance-level? I've (occasionally) debugged problems by making two RDKitToolkitWrappers, and overridden the behavior of one of them by redefining its attributes. But it gets annoying when that attribute is class-level, since then it affects the other one as well. Is there a good reason to keep the read and write formats at the class level? If not, I'd be fine with defining them on an instance-level (in |
Yeah, that should be fine as we can do the import in the init, would we want to change this behaviour for both of the toolkit wrappers to make them consistent? Also, what should happen if RDKit is not installed and someone tried to create the wrapper should we fail loud with an import error? |
…ield into mol_fix_and_xyz
Just need one final check that I have RDKit file writers set up correctly so that Mol and SDF write the same file format. Also, do we need more tests for each of the file writing formats similar to the XYZ testing I have put in? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, Josh. Especially nice tests. I commented on a few spelling things, and requested a reference spec for our XYZ implementation, but otherwise this is excellent work.
One note (kinda to myself) -- MOL is technically different from SDF, in that it doesn't contain property information. https://en.wikipedia.org/wiki/Chemical_table_file#SDF. I don't think there's anything to do here for it (since we don't tell RDKit to write any properties anyway), however I'll probably need to override this behavior in #281
@@ -38,6 +38,7 @@ def get_mini_drug_bank(toolkit): | |||
# This is a work around a weird error where even though the test is skipted due to a missing toolkit | |||
# we still try and read the file with the toolkit | |||
if toolkit.is_available(): | |||
toolkit = toolkit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This confused me for a minute -- Could you change it to toolkit = toolkit_class()
, and have the input parameter above be toolkit_class
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah no problem!
isomorphic = GM.is_isomorphic() | ||
|
||
if isomorphic and return_atom_map: | ||
# now generate the sorted mapping between the molecules | ||
GM = GraphMatcher( | ||
mol1_netx, | ||
mol2_netx, | ||
node_match=node_match_func, | ||
edge_match=edge_match_func) | ||
for mapping in GM.isomorphisms_iter(): | ||
topology_atom_map = mapping | ||
break | ||
topology_atom_map = GM.mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slick. I like it.
@property | ||
def toolkit_file_write_formats(self): | ||
""" | ||
List of file formats that this toolkit can write. | ||
""" | ||
return list(self._toolkit_file_write_formats.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughtful!
@@ -1841,7 +1870,6 @@ def from_file_obj(self, | |||
a list of Molecule objects is returned. | |||
|
|||
""" | |||
from openforcefield.topology import Molecule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
writer = rdkit_writers[file_format](file_obj) | ||
writer.write(rdmol) | ||
writer.close() | ||
self.to_file_obj(molecule=molecule, file_obj=file_obj, file_format=file_format) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slick!
Thanks for the feedback Jeff I have made the changes you suggested and will merge once tests pass. |
Here I aim to add support for writing multiframe and single frame XYZ files and fixing the MOL file type writer and adding more tests for each of the file output types. We also have another speed improvement for #353.
This is a new branch for the fix as the old one was not from master.