-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SPICE 2.0.1, Fix bug in MemmappedDataset #333
Conversation
There is one bogus molecule in the zenodo hosted 2.0.1 hdf5 file. >>> ds = SPICE("~/data/", version="2.0.1")
Downloading https://zenodo.org/records/10975225/files/SPICE-2.0.1.hdf5
Processing...
Gathering statistics...
Molecules: 40119it [01:46, 661.97it/s]WARNING:root:Bogus molecule with id 54X VAL
WARNING:root:Found torch.Size([0]) positions, torch.Size([0]) energies and torch.Size([0]) gradients I downloaded the file a couple times to discard a corruption on my end. Also confirmed the md5 hash coincides with the one in zenodo. |
@stefdoerr could you please review? this is an important one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah nice. I was aware of the overwriting and I was just working around it by changing the root dir for each Ace version. But this is of course the much better way of doing it. Thanks!
Thanks! It's only "bogus" in that there are no conformations for it. It's a dimer that only had one conformation to begin with, and it looks like that one was excluded because the forces were too large. |
Adds compatibility for the latest SPICE version https://arxiv.org/pdf/2406.13112
In doing so I uncovered a couple of issues related to MemmapedDataset.
I fixed MemmapedDataset overwriting the names of processed datasets.
In particular, SPICE sets self.name as:
Which gets overwritten in the MemmappedDataset constructor:
Causing any subsample/gradient configurations to be stored as just "SPICE.*.mmap". Probably ignoring subsample/gradient if the dataset has been already processed.
This in turn uncovered a nasty bug in the test for the ACE Dataset.
Since all instances of the Dataset are named with just
self.__class__.__name__
, creating two instances of ACE will simply ignore the data in the second and use what was processed in the first to a group of files just called "Ace.*.mmap".This test should not pass but it was passing until now:
torchmd-net/tests/test_datasets.py
Lines 246 to 299 in e908988
In Ace v2 the field is called
dipole_moments
. But MemmappedDataset was not even trying to process the second one.I also made some changes
{root}/raw/spice/
or{root}/processed/spice
. Before it was storing things under{root}/raw/{version}
, which could collide with other datasets.Note that this PR will invalidate many preprocessed datasets, prompting for redownloading and reprocessing. I recommend you delete the dataset storage folder and let it redownload things.