Add SPICE 2.0.1, Fix bug in MemmappedDataset #333

Merged
merged 7 commits into from
Jun 28, 2024

RaulPPelaez
Collaborator

@RaulPPelaez RaulPPelaez commented Jun 28, 2024

Adds compatibility for the latest SPICE version https://arxiv.org/pdf/2406.13112

In doing so I uncovered a couple of issues related to MemmappedDataset.

I fixed MemmappedDataset overwriting the names of processed datasets.

In particular, SPICE sets self.name as:

        arg_hash = f"{version}{subsets}{max_gradient}{subsample_molecules}"
        arg_hash = hashlib.md5(arg_hash.encode()).hexdigest()
        self.name = f"{self.__class__.__name__}-{arg_hash}"

Which gets overwritten in the MemmappedDataset constructor:

self.name = self.__class__.__name__

As a result, every subsample/gradient configuration is stored as just "SPICE.*.mmap", so the subsample/gradient options are probably ignored if the dataset has already been processed.
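The overwrite described above can be reproduced with a minimal sketch. The classes `BuggyBase`, `FixedBase` and the helper `make_dataset` are hypothetical stand-ins, not the actual torchmd-net classes, and the `hasattr` guard is only one possible way to preserve a subclass-provided name; it is not necessarily how this PR fixes it.

```python
import hashlib


class BuggyBase:
    def __init__(self):
        # Unconditionally resets the name, discarding whatever a subclass set.
        self.name = self.__class__.__name__


class FixedBase:
    def __init__(self):
        # Fall back to the class name only when no name was set beforehand.
        if not hasattr(self, "name"):
            self.name = self.__class__.__name__


def make_dataset(base):
    """Build a toy dataset class (hypothetical) on top of the given base."""

    class Spice(base):
        def __init__(self, version, max_gradient=None):
            arg_hash = hashlib.md5(f"{version}{max_gradient}".encode()).hexdigest()
            self.name = f"{self.__class__.__name__}-{arg_hash}"
            super().__init__()  # the buggy base overwrites self.name here

    return Spice


BuggySpice = make_dataset(BuggyBase)
FixedSpice = make_dataset(FixedBase)
```

With the buggy base, `BuggySpice("2.0.1", 50).name` is just the class name, while `FixedSpice` keeps a distinct hashed name per configuration.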

This in turn uncovered a nasty bug in the test for the ACE Dataset.

Since every instance of a Dataset is named with just self.__class__.__name__, creating two instances of Ace ignores the data given to the second one and reuses what the first one processed into a group of files just called "Ace.*.mmap".
The following test should not pass, but it was passing until now:

def test_ace(tmpdir):
    # Test Version 1.0
    tmpfilename = join(tmpdir, "molecule.h5")
    f = h5py.File(tmpfilename, "w")
    f.attrs["layout"] = "Ace"
    f.attrs["layout_version"] = "1.0"
    f.attrs["name"] = "sample_molecule_data"
    for m in range(3):  # Three molecules
        mol = f.create_group(f"mol_{m+1}")
        mol["atomic_numbers"] = [1, 6, 8]  # H, C, O
        mol["formal_charges"] = [0, 0, 0]  # Neutral charges
        confs = mol.create_group("conformations")
        for i in range(2):  # Two conformations
            conf = confs.create_group(f"conf_{i+1}")
            conf["positions"] = np.random.random((3, 3))
            conf["positions"].attrs["units"] = "Å"
            conf["formation_energy"] = np.random.random()
            conf["formation_energy"].attrs["units"] = "eV"
            conf["forces"] = np.random.random((3, 3))
            conf["forces"].attrs["units"] = "eV/Å"
            conf["partial_charges"] = np.random.random(3)
            conf["partial_charges"].attrs["units"] = "e"
            conf["dipole_moment"] = np.random.random(3)
            conf["dipole_moment"].attrs["units"] = "e*Å"
    dataset = Ace(root=tmpdir, paths=tmpfilename)
    assert len(dataset) == 6
    f.flush()
    f.close()
    # Test Version 2.0
    tmpfilename_v2 = join(tmpdir, "molecule_v2.h5")
    f2 = h5py.File(tmpfilename_v2, "w")
    f2.attrs["layout"] = "Ace"
    f2.attrs["layout_version"] = "2.0"
    f2.attrs["name"] = "sample_molecule_data_v2"
    master_mol_group = f2.create_group("master_molecule_group")
    for m in range(3):  # Three molecules
        mol = master_mol_group.create_group(f"mol_{m+1}")
        mol["atomic_numbers"] = [1, 6, 8]  # H, C, O
        mol["formal_charges"] = [0, 0, 0]  # Neutral charges
        mol["positions"] = np.random.random((2, 3, 3))  # Two conformations
        mol["positions"].attrs["units"] = "Å"
        mol["formation_energies"] = np.random.random(2)
        mol["formation_energies"].attrs["units"] = "eV"
        mol["forces"] = np.random.random((2, 3, 3))
        mol["forces"].attrs["units"] = "eV/Å"
        mol["partial_charges"] = np.random.random((2, 3))
        mol["partial_charges"].attrs["units"] = "e"
        mol["dipole_moment"] = np.random.random((2, 3))
        mol["dipole_moment"].attrs["units"] = "e*Å"
    dataset_v2 = Ace(root=tmpdir, paths=tmpfilename_v2)
    assert len(dataset_v2) == 6
    f2.flush()
    f2.close()

In Ace v2 the field is called dipole_moments, not dipole_moment as the test writes it, so the second half of the test should fail. But MemmappedDataset was not even trying to process the second file.
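Why the second dataset was never processed can be illustrated with a toy cache keyed only on the dataset name. This is a sketch under assumptions: `CachedDataset` and its JSON cache file are hypothetical and only mimic the "reuse processed files if they exist" behaviour described above, not the real memory-mapped implementation.

```python
import json
import os
import tempfile


class CachedDataset:
    """Toy dataset that caches its processed data in '<root>/<name>.json'."""

    def __init__(self, root, data, name=None):
        self.name = name or self.__class__.__name__
        cache = os.path.join(root, f"{self.name}.json")
        if not os.path.exists(cache):
            # "Process" the input only when no cache file exists yet.
            with open(cache, "w") as f:
                json.dump(data, f)
        with open(cache) as f:
            self.data = json.load(f)

    def __len__(self):
        return len(self.data)


with tempfile.TemporaryDirectory() as root:
    first = CachedDataset(root, data=[1, 2, 3])
    # Same default name, so the second input is silently ignored.
    second = CachedDataset(root, data=[10, 20])
    collided = (len(first), len(second))

    # A distinct name gives the second input its own cache file.
    third = CachedDataset(root, data=[10, 20], name="CachedDataset-v2")
    separate = len(third)
```

Here `second` reports the length of the first input because it loads the first instance's cache, which is the same mechanism that let the Ace test pass.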

I also made some other changes:

  • SPICE now checks the hash of the downloaded files.
  • SPICE now stores files under {root}/raw/spice/ or {root}/processed/spice. Previously it stored things under {root}/raw/{version}, which could collide with other datasets.
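A hash check on a downloaded file could look roughly like the following. This is a minimal sketch using MD5 from the standard library; the function names `md5_of_file` and `verify_download` are hypothetical and are not taken from this PR.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large HDF5 files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_download(path, expected_md5):
    """Raise if the file at `path` does not match the expected digest."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise RuntimeError(
            f"MD5 mismatch for {path}: expected {expected_md5}, got {actual}"
        )
```

The expected digest would come from the hosting page of the file; a mismatch is a signal to delete the file and download it again.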

Note that this PR will invalidate many preprocessed datasets, prompting for redownloading and reprocessing. I recommend you delete the dataset storage folder and let it redownload things.

@RaulPPelaez
Collaborator Author

There is one bogus molecule in the zenodo hosted 2.0.1 hdf5 file.

>>> ds = SPICE("~/data/", version="2.0.1")
Downloading https://zenodo.org/records/10975225/files/SPICE-2.0.1.hdf5
Processing...
Gathering statistics...
Molecules: 40119it [01:46, 661.97it/s]WARNING:root:Bogus molecule with id 54X VAL
WARNING:root:Found torch.Size([0]) positions, torch.Size([0]) energies and torch.Size([0]) gradients

I downloaded the file a couple of times to rule out corruption on my end. I also confirmed that the md5 hash matches the one on Zenodo.
cc @peastman

@RaulPPelaez RaulPPelaez changed the title Add SPICE 2.0.1 Add SPICE 2.0.1, Fix bug in MemmappedDataset Jun 28, 2024
@RaulPPelaez
Collaborator Author

@stefdoerr could you please review? this is an important one.

@RaulPPelaez RaulPPelaez marked this pull request as ready for review June 28, 2024 09:32
Collaborator

@stefdoerr stefdoerr left a comment

ah nice. I was aware of the overwriting and I was just working around it by changing the root dir for each Ace version. But this is of course the much better way of doing it. Thanks!

@stefdoerr stefdoerr merged commit c800af1 into torchmd:main Jun 28, 2024
2 checks passed
@peastman
Collaborator

> There is one bogus molecule in the zenodo hosted 2.0.1 hdf5 file.

Thanks! It's only "bogus" in that there are no conformations for it. It's a dimer that only had one conformation to begin with, and it looks like that one was excluded because the forces were too large.
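A loader can tolerate such an entry by skipping molecules that have no conformations. The following is a sketch under assumptions: the generator `iter_valid_molecules` and the dict-of-lists layout are hypothetical and much simpler than the real HDF5 structure.

```python
import logging


def iter_valid_molecules(molecules):
    """Yield (mol_id, conformations) pairs, skipping molecules with none."""
    for mol_id, conformations in molecules.items():
        if len(conformations) == 0:
            logging.warning("Skipping molecule %s: no conformations", mol_id)
            continue
        yield mol_id, conformations


# Hypothetical data: one molecule with a single conformation, one with none.
example = {"mol_a": [[0.0, 0.0, 0.0]], "mol_b": []}
kept = [mol_id for mol_id, _ in iter_valid_molecules(example)]
```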

3 participants