MACE-OFF dataset #332

Merged: 9 commits, Jul 9, 2024

Conversation

RaulPPelaez
Collaborator

I added a Dataset class for the dataset used in the paper "MACE-OFF23: Transferable Machine Learning Force Fields for Organic Molecules": https://arxiv.org/pdf/2312.15211

@RaulPPelaez
Collaborator Author

This is ready for review. Preprocessing is really slow, about 40 minutes. The dataset comes as an XYZ file, which I am processing with ase, and I do not know how to speed it up.

@RaulPPelaez RaulPPelaez marked this pull request as ready for review June 28, 2024 12:18
@RaulPPelaez RaulPPelaez requested a review from stefdoerr June 28, 2024 12:18
@stefdoerr
Collaborator

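import re
import tarfile

from moleculekit.periodictable import periodictable

# Raw string so that \S is read as a regex escape, not a string escape.
energy_re = re.compile(r"energy=(\S+)")


def parse_xyz(xyz_file):
    """Yield (energy, numbers, positions, forces) for every frame in a .tar.gz of XYZ files."""
    with tarfile.open(xyz_file, "r:gz") as tar:
        for member in tar.getmembers():
            f = tar.extractfile(member)
            if f is None:  # Directories and other non-file members
                continue

            n_atoms = None  # None means the next line is a frame header
            counter = 0  # Lines consumed in the current frame

            for line in f:
                line = line.decode("utf-8").strip()
                if n_atoms is None:
                    if not line:  # Tolerate blank lines between frames
                        continue
                    # First line of a frame: the atom count
                    n_atoms = int(line)
                    positions = []
                    numbers = []
                    forces = []
                    counter = 1
                    continue
                if counter == 1:
                    # Second line of a frame: the comment line carrying the energy
                    energy = float(energy_re.search(line).group(1))
                    counter = 2
                    continue

                # Atom line: element, position, force, and three trailing columns that are ignored
                el, x, y, z, fx, fy, fz, _, _, _ = line.split()
                numbers.append(periodictable[el].number)
                positions.append([float(x), float(y), float(z)])
                forces.append([float(fx), float(fy), float(fz)])
                counter += 1
                if counter == n_atoms + 2:
                    # Frame complete: reset so the next line is read as a header
                    n_atoms = None
                    yield energy, numbers, positions, forces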

I wrote an xyz parser for the MACE dataset. You can use it with:

gen = parse_xyz("./train_large_neut_no_bad_clean.tar.gz")
x = next(gen)
x = next(gen)
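
To try the parser without downloading the full dataset, a tiny archive in the layout the loop above expects can be written with the standard library. This is only a sketch: `write_demo_archive`, `FRAME`, and `demo.xyz` are hypothetical names, the numbers are made up, and the three trailing zeros merely stand in for the per-atom columns that `parse_xyz` discards.

```python
import io
import tarfile

# Hypothetical two-atom frame: atom count, a comment line with energy=...,
# then one line per atom with element, x y z, fx fy fz, and three
# placeholder columns (ignored by parse_xyz). All values are made up.
FRAME = (
    "2\n"
    "energy=-1.5\n"
    "H 0.0 0.0 0.0 0.1 0.0 0.0 0 0 0\n"
    "H 0.0 0.0 0.74 -0.1 0.0 0.0 0 0 0\n"
)


def write_demo_archive(path, n_frames=3):
    """Write a gzipped tar holding one XYZ file with n_frames copies of FRAME."""
    payload = (FRAME * n_frames).encode("utf-8")
    info = tarfile.TarInfo(name="demo.xyz")
    info.size = len(payload)  # tarfile needs the size before adding the file
    with tarfile.open(path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))
    return path
```

With moleculekit installed, `parse_xyz(write_demo_archive("demo.tar.gz"))` should then yield one tuple per frame.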

The first call takes a little while because the archive has to be extracted; after that it is very fast (around 60 μs per call for me).
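
A per-call figure like this can be reproduced roughly as follows. This is a minimal sketch, not the measurement used here: `time_per_item` is a hypothetical helper that works on any iterator, and it drops the first item so the one-off extraction cost is not counted.

```python
import time
from itertools import islice


def time_per_item(iterator, n_items, skip_first=True):
    """Return the mean seconds per item over up to n_items pulled from iterator."""
    if skip_first:
        # Discard the first item so one-off setup cost is excluded
        next(iterator, None)
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(iterator, n_items))
    elapsed = time.perf_counter() - start
    # NaN signals that the iterator was already exhausted
    return elapsed / consumed if consumed else float("nan")
```

For example, `time_per_item(parse_xyz("./train_large_neut_no_bad_clean.tar.gz"), 10_000)` would average over the next 10,000 frames after the first.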

@stefdoerr
Collaborator

Parsing the whole file takes about 1 minute in total, excluding the initial extraction cost of roughly 10-20 s.

@RaulPPelaez
Collaborator Author

Works great @stefdoerr, thanks. Please review again!

@stefdoerr stefdoerr merged commit 6c42c8b into torchmd:main Jul 9, 2024
2 checks passed
@RaulPPelaez RaulPPelaez deleted the maceds branch July 9, 2024 10:36