-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Qcarchive update #187
base: main
Are you sure you want to change the base?
Qcarchive update #187
Conversation
…to be compatible with qcportal >=0.5
Awesome! This is good timing with #186 Once we get both in, we should cut a new release. |
This PR implements the logic in effectively the same way as the old code, which is on a per-record basis (i.e., a function operates on a single record name). The new version of qcportal has iterators on records, which are substantially faster (like orders of magnitude, due to prefetching and caching). The next commit will include functions that operate on the entire record sets to avoid slow performance. |
Codecov Report
Additional details and impacted files |
This line will need to get changed @chrisiacovella https://github.com/choderalab/espaloma/pull/187/files#diff-ba5d22563299549a389183418fe5786b83275382be592bf1ed06fae673b7d086L23 Sorry about that! |
We can probably remove that line since https://github.com/choderalab/espaloma/pull/187/files#diff-ba5d22563299549a389183418fe5786b83275382be592bf1ed06fae673b7d086R33 will pull in what we need (I think, I am not sure what the "main" qcarchive package is) |
…rds and iterates_entries functions
Good catch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, had two non-blocking notes
espaloma/data/qcarchive_utils.py
Outdated
mol = final_molecules[angle] | ||
# NOTE: this is calling the first index of the optimization array | ||
# this gives the same value as the prior implementation, but I wonder if it | ||
# should be molecule_optimization[angle][-1] in both cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kntkb or @yuanqing-wang thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've been trying to figure out the structure of the torsion drive datasets (since I have not looked at them really yet prior to this). Considering the example dataset I used in test (I'll put the code below), each angle has n-number of unique initial conformations that are then optimized. In this case, there are 4 configurations (each that has their own trajectory). So I suppose choosing the first vs the last is somewhat irrelevant (I was initially thinking this was a set of chained optimizations, hence my comment...don't ask why I was thinking that).
Should each of these conformations be considered and added to the datasets rather than just arbitrarily picking one?
from espaloma.data import qcarchive_utils
import numpy as np
record_name = "[h]c1c(c(c(c([c:1]1[n:2]([c:3](=[o:4])c(=c([h])[h])[h])c([h])([h])[h])[h])[h])n(=o)=o)[h]"
name = "OpenFF Amide Torsion Set v1.0"
collection_type = "torsiondrive"
collection, record_names = qcarchive_utils.get_collection(qcarchive_utils.get_client(), collection_type, name)
record_info = collection.get_record(record_name, specification_name="default")
molecule_optimization = record_info.optimizations
angle_keys = list(molecule_optimization.keys())
angle = angle_keys[0]
mol = molecule_optimization[angle][0].final_molecule
result = molecule_optimization[angle][0].trajectory[-1].properties
looking at the actual configurations:
for i in range(len(molecule_optimization[angle])):
init = molecule_optimization[angle][i].initial_molecule.geometry
final = molecule_optimization[angle][i].final_molecule.geometry
print(init,"\n-\n", final, "\n--\n")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kntkb or @yuanqing-wang thoughts?
I don't know off the top of my head, but I've played around with different QCArchive workflows in the past. I may have some notes left somewhere, so I'll catch up shortly (tomorrow?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the api and the way you access the data changed using qcportal v0.5...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the older version of qcportal, "get_final_molecule()" just picked the first one in the array. The full array was still part of the data record, just you had to dig through the qcvars or something to access. From conversations with Ben, there was a lot of trying to force records into a very rigid schema in the old version; he opted to break the schema in a lot of cases to just make it easier to access the relevant information (and make it clearer what information is available).
As I mentioned in an early comment, it seems that for each angle, multiple (in this case 4) independent starting configurations were used. It seems like it would be better to have the code return data for each replicate, but I'm not sure how this would impact any workflows that use this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is geat. I'm glad that we are now testing the behavior and have some documentation for these utils. I agree with the comments that have been made. Looks good to be merged, just a single non-blocking comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @jchodera
There are apparently some additional issues with the object model such that datasets beyond
OptimizationDataset
are not supported
Yes. the |
@chrisiacovella I remember when fetching the results from the |
@kntkb This is something I started looking at when switching from the old to the new version, but I can't seem to find my notes; for some reason I think one of the specifications does include the sum, but don't quote me on that. I'm currently trying to figure that out right now actually. |
… dataset has the smiles encoded for converting to openff.molecule
… dataset has the smiles encoded for converting to openff.molecule
…d so that it will raise the desired exception rather than failing.
…rse the singlepoint records properly at this point. Other issues need to be resolved with singlepoint energy beyond this (i.e., summation of dispersion corrections).
…rse the singlepoint records properly at this point. Other issues need to be resolved with singlepoint energy beyond this (i.e., summation of dispersion corrections). This PR should sufficiently reproduce the prior behavior, but with new qcportal.
@chrisiacovella Is this PR good to go? I know its a year old now BUT is it good to go? |
This updates qcarchive_utils.py to be compatible with v0.5 of qcportal. Relates to issue #185
This code reproduces the same behavior as the prior implementation.