Add SPICE 2.0.1, Fix bug in MemmappedDataset #333

RaulPPelaez · 2024-06-28T07:19:18Z

Adds compatibility for the latest SPICE version https://arxiv.org/pdf/2406.13112

In doing so I uncovered a couple of issues related to MemmappedDataset.

I fixed MemmappedDataset overwriting the names of processed datasets.

In particular, SPICE sets self.name as:

        arg_hash = f"{version}{subsets}{max_gradient}{subsample_molecules}"
        arg_hash = hashlib.md5(arg_hash.encode()).hexdigest()
        self.name = f"{self.__class__.__name__}-{arg_hash}"

This gets overwritten in the MemmappedDataset constructor:

        self.name = self.__class__.__name__

As a result, every subsample/gradient configuration is stored as just "SPICE.*.mmap", and the subsample/gradient settings are probably ignored if the dataset has already been processed.
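To illustrate the overwrite described above, here is a minimal sketch with hypothetical stand-in classes (BuggyBase, FixedBase, config_name and the two dataset classes are made up for this example and are not the torchmd-net code). The "fixed" variant shows only one possible way to preserve a name set by the subclass; the actual change in this PR may differ.

```python
import hashlib


class BuggyBase:
    """Hypothetical stand-in for a base dataset that always resets the name."""

    def __init__(self):
        # Unconditionally overwrites whatever the subclass set earlier.
        self.name = self.__class__.__name__


class FixedBase:
    """Hypothetical variant that keeps a name chosen by the subclass."""

    def __init__(self):
        # Only fall back to the class name when no name was set beforehand.
        if getattr(self, "name", None) is None:
            self.name = self.__class__.__name__


def config_name(cls_name, **config):
    # Hash the configuration so different settings map to different names.
    digest = hashlib.md5(repr(sorted(config.items())).encode()).hexdigest()
    return f"{cls_name}-{digest}"


class BuggyDataset(BuggyBase):
    def __init__(self, max_gradient=None):
        self.name = config_name(type(self).__name__, max_gradient=max_gradient)
        super().__init__()  # the configured name is lost here


class FixedDataset(FixedBase):
    def __init__(self, max_gradient=None):
        self.name = config_name(type(self).__name__, max_gradient=max_gradient)
        super().__init__()  # the configured name survives


a, b = BuggyDataset(max_gradient=1.0), BuggyDataset(max_gradient=2.0)
c, d = FixedDataset(max_gradient=1.0), FixedDataset(max_gradient=2.0)
print(a.name == b.name)  # True: both collapse to the bare class name
print(c.name == d.name)  # False: the configuration hash is kept
```

With the buggy base, two differently configured datasets end up with the same name and would therefore share the same processed files.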

This in turn uncovered a nasty bug in the test for the ACE Dataset.

Since all instances of the Dataset are named with just self.__class__.__name__, creating two instances of Ace in the same root ignores the data passed to the second one and reuses what the first one processed into a group of files called just "Ace.*.mmap".
This test should not pass, but it was passing until now:

torchmd-net/tests/test_datasets.py

Lines 246 to 299 in e908988

    
        def test_ace(tmpdir):
            # Test Version 1.0
            tmpfilename = join(tmpdir, "molecule.h5")
            f = h5py.File(tmpfilename, "w")
            f.attrs["layout"] = "Ace"
            f.attrs["layout_version"] = "1.0"
            f.attrs["name"] = "sample_molecule_data"
            for m in range(3):  # Three molecules
                mol = f.create_group(f"mol_{m+1}")
                mol["atomic_numbers"] = [1, 6, 8]  # H, C, O
                mol["formal_charges"] = [0, 0, 0]  # Neutral charges
                confs = mol.create_group("conformations")
                for i in range(2):  # Two conformations
                    conf = confs.create_group(f"conf_{i+1}")
                    conf["positions"] = np.random.random((3, 3))
                    conf["positions"].attrs["units"] = "Å"
                    conf["formation_energy"] = np.random.random()
                    conf["formation_energy"].attrs["units"] = "eV"
                    conf["forces"] = np.random.random((3, 3))
                    conf["forces"].attrs["units"] = "eV/Å"
                    conf["partial_charges"] = np.random.random(3)
                    conf["partial_charges"].attrs["units"] = "e"
                    conf["dipole_moment"] = np.random.random(3)
                    conf["dipole_moment"].attrs["units"] = "e*Å"

            dataset = Ace(root=tmpdir, paths=tmpfilename)
            assert len(dataset) == 6
            f.flush()
            f.close()
            # Test Version 2.0
            tmpfilename_v2 = join(tmpdir, "molecule_v2.h5")
            f2 = h5py.File(tmpfilename_v2, "w")
            f2.attrs["layout"] = "Ace"
            f2.attrs["layout_version"] = "2.0"
            f2.attrs["name"] = "sample_molecule_data_v2"
            master_mol_group = f2.create_group("master_molecule_group")
            for m in range(3):  # Three molecules
                mol = master_mol_group.create_group(f"mol_{m+1}")
                mol["atomic_numbers"] = [1, 6, 8]  # H, C, O
                mol["formal_charges"] = [0, 0, 0]  # Neutral charges
                mol["positions"] = np.random.random((2, 3, 3))  # Two conformations
                mol["positions"].attrs["units"] = "Å"
                mol["formation_energies"] = np.random.random(2)
                mol["formation_energies"].attrs["units"] = "eV"
                mol["forces"] = np.random.random((2, 3, 3))
                mol["forces"].attrs["units"] = "eV/Å"
                mol["partial_charges"] = np.random.random((2, 3))
                mol["partial_charges"].attrs["units"] = "e"
                mol["dipole_moment"] = np.random.random((2, 3))
                mol["dipole_moment"].attrs["units"] = "e*Å"
            dataset_v2 = Ace(root=tmpdir, paths=tmpfilename_v2)
            assert len(dataset_v2) == 6
            f2.flush()
            f2.close()

In Ace v2 the field is called dipole_moments, not dipole_moment as the test writes it, so processing the second file should have failed. But MemmappedDataset was not even trying to process the second one.
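The stale-cache effect can be reproduced with a toy cache. This is only a sketch: load_or_process is a hypothetical helper that caches by name in a text file, and it is not how MemmappedDataset is implemented.

```python
import os
import tempfile


def load_or_process(root, name, records):
    """Hypothetical cache: process `records` once per `name`, then reuse the file."""
    path = os.path.join(root, f"{name}.txt")
    if not os.path.exists(path):
        # "Processing" here just writes one record per line.
        with open(path, "w") as fh:
            fh.write("\n".join(records))
    with open(path) as fh:
        return fh.read().splitlines()


with tempfile.TemporaryDirectory() as root:
    first = load_or_process(root, "Ace", ["a", "b", "c"])
    # Same cache name, different input: the new records are never read.
    second = load_or_process(root, "Ace", ["x", "y"])
    # A name that depends on the input avoids the collision.
    third = load_or_process(root, "Ace-v2", ["x", "y"])

print(second == first)  # True: the second call silently returns the first data
print(third)            # ['x', 'y']
```

This is why a length assertion on the second dataset could pass even though its input file was never parsed: the length came from the first dataset's processed files.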

I also made some other changes:

Now SPICE checks the hash of the downloaded files.
Now SPICE stores files under {root}/raw/spice/ or {root}/processed/spice. Before it was storing things under {root}/raw/{version}, which could collide with other datasets.
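As a rough illustration of the hash check, a downloaded file can be verified against a known MD5 digest as sketched below. The helper names (md5_of_file, verify_download) and the sample file are assumptions made for this example; the code added to SPICE in this PR may be organized differently.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Raise ValueError if the file on disk does not match the expected checksum."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch for {path}: {actual} != {expected_md5}")


# Self-contained demonstration with a throwaway file instead of a real download.
with tempfile.TemporaryDirectory() as root:
    raw_dir = os.path.join(root, "raw", "spice")  # mirrors the {root}/raw/spice layout
    os.makedirs(raw_dir)
    sample = os.path.join(raw_dir, "sample.hdf5")
    with open(sample, "wb") as fh:
        fh.write(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    verify_download(sample, expected)  # matching digest: returns without error
    try:
        verify_download(sample, "0" * 32)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Reading in chunks keeps memory use bounded, which matters for multi-gigabyte HDF5 files.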

Note that this PR will invalidate many preprocessed datasets, prompting for redownloading and reprocessing. I recommend you delete the dataset storage folder and let it redownload things.

RaulPPelaez · 2024-06-28T08:51:55Z

There is one bogus molecule in the zenodo hosted 2.0.1 hdf5 file.

>>> ds = SPICE("~/data/", version="2.0.1")
Downloading https://zenodo.org/records/10975225/files/SPICE-2.0.1.hdf5
Processing...
Gathering statistics...
Molecules: 40119it [01:46, 661.97it/s]WARNING:root:Bogus molecule with id 54X VAL
WARNING:root:Found torch.Size([0]) positions, torch.Size([0]) energies and torch.Size([0]) gradients

I downloaded the file a couple of times to rule out corruption on my end. I also confirmed that the md5 hash matches the one on Zenodo.
cc @peastman

RaulPPelaez · 2024-06-28T09:32:05Z

@stefdoerr could you please review? this is an important one.

stefdoerr

ah nice. I was aware of the overwriting and I was just working around it by changing the root dir for each Ace version. But this is of course the much better way of doing it. Thanks!

peastman · 2024-06-28T20:14:06Z

> There is one bogus molecule in the zenodo hosted 2.0.1 hdf5 file.

Thanks! It's only "bogus" in that there are no conformations for it. It's a dimer that only had one conformation to begin with, and it looks like that one was excluded because the forces were too large.
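For reference, a check that skips molecules without conformations could look roughly like the sketch below. The function name and the dictionary layout ("positions", "energies", "gradients" as plain lists) are assumptions for illustration only and do not reproduce the check added in this PR.

```python
import logging


def drop_empty_molecules(molecules):
    """Return only the molecules that have at least one consistent conformation."""
    kept = {}
    for mol_id, data in molecules.items():
        counts = [len(data[key]) for key in ("positions", "energies", "gradients")]
        # Skip entries with no conformations or with mismatched array lengths.
        if counts[0] == 0 or len(set(counts)) != 1:
            logging.warning(
                "Skipping molecule %s: found %d positions, %d energies and %d gradients",
                mol_id,
                *counts,
            )
            continue
        kept[mol_id] = data
    return kept


sample = {
    "mol_a": {
        "positions": [[[0.0, 0.0, 0.0]]],
        "energies": [-1.0],
        "gradients": [[[0.0, 0.0, 0.0]]],
    },
    # Mimics the entry reported above: present in the file, zero conformations.
    "54X VAL": {"positions": [], "energies": [], "gradients": []},
}
valid = drop_empty_molecules(sample)
print(list(valid))  # ['mol_a']
```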

RaulPPelaez added 3 commits June 28, 2024 09:18

Add SPICE 2.0.1

8036aa5

Place raw spice data in a subfolder

b68788d

Add check for bogus conformations

610e33c

RaulPPelaez added 4 commits June 28, 2024 10:58

Fix processed name in MemmappedDataset

cb7c50e

Add hash checking of downloaded files in SPICE

9b5da8b

Fix typo in ACE dataset test

9a5ddf1

Update to Ace dataset test

66a84d2

RaulPPelaez changed the title ~~Add SPICE 2.0.1~~ Add SPICE 2.0.1, Fix bug in MemmappedDataset Jun 28, 2024

RaulPPelaez marked this pull request as ready for review June 28, 2024 09:32

stefdoerr approved these changes Jun 28, 2024


stefdoerr merged commit c800af1 into torchmd:main Jun 28, 2024
2 checks passed
