Authors: Alexander Mathiasen and Hatem Helal
Organization: Graphcore Research
We documented QM1B and its intended uses through the Datasheets for Datasets framework[^1]. The goal of dataset datasheets, as outlined in [^1], is to provide a standardized process for documenting datasets. The authors of [^1] present a list of carefully selected questions which dataset authors should answer. We hope our answers to these questions will facilitate better communication between us (the dataset creators) and future users of QM1B.
Purpose
Prior Gaussian-based Density Functional Theory (DFT) datasets contained fewer than 20 million training examples. We suspected the comparatively small dataset sizes held back the performance of neural networks. The purpose of QM1B was thus to present a dataset with a substantially larger number of training examples. We believe this will facilitate novel research directions that train quantum chemical neural networks on more data, e.g., neural scaling laws for quantum chemistry.
Non-goals
QM1B should not be used to benchmark deep learning architectures. It remains unknown whether the ranking of neural architectures on QM1B will agree with the ranking on experimental datasets or higher-resolution DFT datasets. Without evidence for such transferability, benchmarking on QM1B may bias the selection of neural architectures towards our lower-resolution DFT. We instead encourage researchers to pretrain on QM1B and finetune/benchmark on higher-resolution DFT datasets like QM9 or PCQ.
Alexander Mathiasen and Hatem Helal created the dataset as part of their employment as Research Scientists at Graphcore. The dataset generation process proceeded iteratively, with Mathiasen generating a dataset and Helal training a model. During this process Mathiasen would alter generation parameters (which SMILES strings from GDB to include, the number of conformers, the number of heavy atoms, the basis set) and also optimize the DFT implementation, $\text{PySCF}_\text{IPU}$.
The costs associated with the dataset (compute and salaries) were funded by Graphcore.
1. What do the instances that comprise the dataset represent (e.g. documents, photos, people, countries)?
The instances (rows) of the dataset represent molecules. Each row contains one conformer of a molecule together with the inputs and outputs of a single DFT computation (see the schema below).
The dataset contains 1.07B training examples covering 1.09M unique molecules. Each unique molecule has up to 1000 different atom positions (conformers) generated using the ETKDG algorithm[^2] as implemented in RDKit[^3].
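For readers unfamiliar with ETKDG, the sketch below shows what this generation step looks like in RDKit. The molecule, random seed, and ETKDG variant are illustrative assumptions, not the exact settings used to build QM1B.

```python
# Illustrative sketch of ETKDG conformer generation with RDKit.
# The molecule, seed, and ETKDG variant are assumptions, not the exact
# settings used to build QM1B.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("c1ccccc1O"))  # phenol, as an example
params = AllChem.ETKDGv3()
params.randomSeed = 0  # reproducible embedding
conformer_ids = AllChem.EmbedMultipleConfs(mol, numConfs=1000, params=params)

# Each conformer carries one set of 3D atom positions (RDKit uses Angstrom;
# note that QM1B stores positions in Bohr).
positions = mol.GetConformer(conformer_ids[0]).GetPositions()
print(len(conformer_ids), positions.shape)
```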
3. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
The molecules used to create QM1B are a subset of a particular version of the Generated Database (GDB).
There are three published versions of GDB, each named by the maximum number of heavy atoms included: GDB11, GDB13 and GDB17.
QM1B contains 1.09M molecules from GDB11 with {9, 10, 11} heavy atoms.
Each row/instance of QM1B contains the input and output of a DFT evaluation. The QM1B dataset is stored in the open-source columnar Apache Parquet format, with the following schema:
- `smile`: The SMILES string taken from GDB11. There are up to 1000 rows (i.e. conformers) with the same SMILES string.
- `atoms`: A string representing the atom symbols of the molecule, e.g. "COOH".
- `z`: An integer representation of `atoms` as used by SchNet (the atomic numbers).
- `energy`: The energy of the molecule computed by $\text{PySCF}_\text{IPU}$ (unit eV).
- `homo`: The energy of the Highest Occupied Molecular Orbital (HOMO) (unit eV).
- `lumo`: The energy of the Lowest Unoccupied Molecular Orbital (LUMO) (unit eV).
- `N`: The number of atomic orbitals for the specific DFT computation (depends on the basis set STO-3G).
- `std`: The standard deviation of the energy over the last five iterations of $\text{PySCF}_\text{IPU}$, used as the convergence criterion $std < 0.01$ (unit eV).
- `y`: The HOMO-LUMO gap (unit eV).
- `pos`: The atom positions (unit Bohr).
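Since the dataset ships as Parquet, this schema can be verified directly. A minimal sketch follows; the shard file name is hypothetical, and the sanity check assumes $y = lumo - homo$.

```python
# Minimal sketch: inspect the QM1B schema from a local Parquet shard.
# The file name is hypothetical; substitute a shard you downloaded.
import pyarrow.parquet as pq

table = pq.read_table("qm1b_train_000.parquet")
print(table.schema)  # expected columns: smile, atoms, z, energy, homo, lumo, N, std, y, pos

df = table.to_pandas()
# Sanity check, assuming y = lumo - homo (both in eV).
print((df["lumo"] - df["homo"] - df["y"]).abs().max())
```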
See above. Each row/instance has three potential labels/prediction targets: (`homo`, `lumo`, `energy`). We analyzed the numerical error of each and found `energy` to have the largest numerical errors.
No.
Yes. Each row corresponds to a conformer of a molecule, and each molecule has up to 1000 conformers. This relationship is made explicit by including the SMILES string for each instance (row); that is, up to 1000 rows share the same SMILES string. The validation split we include is made on SMILES strings, ensuring that the same molecule does not appear in both the training and validation sets.
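A minimal sketch of such a leakage-free split is shown below; the shard name, validation fraction, and seed are illustrative, not the values behind the released split.

```python
# Sketch of a split on unique SMILES strings, so that all conformers of a
# molecule land in exactly one of train/validation. Shard name, fraction,
# and seed are illustrative.
import numpy as np
import pandas as pd

df = pd.read_parquet("qm1b_train_000.parquet")  # hypothetical shard name

rng = np.random.default_rng(0)
unique_smiles = df["smile"].unique()
val_smiles = set(rng.choice(unique_smiles, size=int(0.05 * len(unique_smiles)), replace=False))

val_df = df[df["smile"].isin(val_smiles)]
train_df = df[~df["smile"].isin(val_smiles)]
assert set(train_df["smile"]).isdisjoint(val_df["smile"])  # no molecule overlap
```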
Training and validation: yes. Test: no.
We include the training/validation split used in our paper to train SchNet with 9M parameters. We advise against using QM1B as a benchmark for neural networks due to its low resolution, and thus do not publish a test set.
The main reason for excluding a test set is to discourage using QM1B to benchmark machine learning models. It remains unclear whether ranking model performance on our "low resolution DFT" will translate to "higher resolution DFT" and experimental measurements. Benchmarking deep learning models on QM1B could thus lead to a ranking that biases models towards "low resolution DFT" and away from experimental measurements. We encourage research into the utility of QM1B for such benchmarking; however, until sufficient evidence is provided, we discourage the use of QM1B as a benchmark to rank machine learning models. Instead, we encourage pretraining on QM1B and fine-tuning/benchmarking on downstream tasks like QM9 or PCQ.
We will communicate the state of such evidence and evolving best practices through our project repository.
QM1B contains numerical errors due to the use of float32 instead of float64. For an analysis of the numerical errors of (HOMO, LUMO, Energy, HLGap) please see Figure 2 of the main paper.
While the errors of (HOMO, LUMO, HLGap) were below those of neural networks, the numerical errors on energy were similar to those of well-trained neural networks.
We thus advise caution when using QM1B to train neural networks for energy prediction.
We plan to further optimize the numerical accuracy of $\text{PySCF}_\text{IPU}$ in future releases.
10. Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g. websites, tweets, other datasets)?
The QM1B dataset is entirely self-contained.
11. Does the dataset contain data that might be considered confidential (e.g. data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)?
No. The data contains molecules which are not considered confidential.
12. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
No.
Directly computed using our DFT library $\text{PySCF}_\text{IPU}$, which we open-source. The molecules were obtained from GDB11, with atom positions generated by RDKit. We validated $\text{PySCF}_\text{IPU}$ against PySCF on 10k molecules from the validation set of QM1B; see Figure 2.
2. What mechanisms or procedures were used to collect the data (e.g. hardware apparatus or sensor, manual human curation, software program, software API)?
The data was collected using computational chemistry tools run on CPUs and IPUs. In particular, we relied on RDKit and our DFT implementation $\text{PySCF}_\text{IPU}$.
3. If the dataset is a sample from a larger set, what was the sampling strategy (e.g. deterministic, probabilistic with specific sampling probabilities)?
There was no randomness in the sampling process from GDB11. We explicitly biased the molecules towards fewer hydrogen atoms to increase the size of the dataset. Users of the dataset may see exactly which SMILES strings we used by comparing the SMILES strings of GDB11 against QM1B, as sketched below.
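A minimal sketch of that comparison, assuming a local QM1B shard and a GDB11 SMILES file (both file names are hypothetical; GDB11 is distributed as plain-text SMILES files):

```python
# Sketch: which GDB11 SMILES strings appear in QM1B? File names are
# hypothetical. Note that SMILES may need canonicalization (e.g. via RDKit)
# before comparison if the two sources use different conventions.
import pandas as pd

qm1b_smiles = set(pd.read_parquet("qm1b_train_000.parquet")["smile"])
with open("gdb11_size11.smi") as f:
    gdb11_smiles = {line.split()[0] for line in f if line.strip()}

used = qm1b_smiles & gdb11_smiles
print(f"{len(used)} of {len(gdb11_smiles)} GDB11 SMILES strings appear in QM1B")
```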
4. Who was involved in the data collection process (e.g. students, crowdworkers, contractors) and how were they compensated (e.g. how much were crowdworkers paid)?
Alexander Mathiasen and Hatem Helal were involved in the data collection process. Both were paid through their employment at Graphcore.
The DFT calculations were started and stopped at different times due to hardware allocation. In particular, the 10 heavy atom computations were started on May 23, 2023 and stopped on May 26, 2023. The 11 heavy atom computations were started on May 25, 2023 and, due to an error, continued longer than planned (until June 1, 2023), generating 3-4x too many training examples for 11 heavy atoms.* Finally, the 9 heavy atom calculations started on May 26, 2023 and finished on May 29, 2023.
The 5 days generation time reported in the main article was calculated as follows: compute the total IPU hours by summing up log files ($<$40,000 IPU hours), then divide by 320 IPUs.
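For concreteness, the arithmetic is: $40\,000\ \text{IPU hours} \,/\, 320\ \text{IPUs} = 125\ \text{hours} \approx 5.2\ \text{days}$.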
*For example, IPU number 27 generated 5.84M training examples and we only used the first 1.5M for QM1B, of which 1.47M converged.
No.
1. Was any preprocessing/cleaning/labeling of the data done (e.g. discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
Yes. Hatem Helal and Alexander Mathiasen performed the following manual preprocessing/cleaning.
- Remove molecules with $HLGap = 0$.
- Remove unconverged molecules as described above.
- Due to a software error in $\text{PySCF}_\text{IPU}$, some molecules were run twice; such duplicates were manually removed.
2. Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g. to support unanticipated future uses)?
For convenience, QM1B does not include the removed molecules (unconverged and $HLGap = 0$).
Our postprocessing code is based on the open-source libraries: pandas, pyarrow and pqdm. We intend to improve/rewrite our postprocessing pipeline and thus decided against making it available in the first release.
Yes. We used QM1B to train the neural network SchNet 9M to predict the HOMO-LUMO gap.
No.
The dataset could be used for pretraining large models that are subsequently fine-tuned on downstream tasks. The possible prediction tasks are any combination of HOMO, LUMO, and total energy. The scale of QM1B might allow inferring forces from energies; however, the larger numerical errors on energy make this seem unlikely.
4. Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
The lower DFT resolution. We currently advise against using QM1B to benchmark machine learning models. Instead, we encourage pretraining on QM1B and subsequently fine-tuning on downstream tasks like PCQ and QM9.
Currently we advise against using QM1B to benchmark machine learning models. It remains unclear whether ranking machine learning models on QM1B would translate to experimental measurements. We instead encourage pretraining on QM1B and benchmarking models on downstream tasks like QM9 or PCQ.
1. Will the dataset be distributed to third parties outside of the entity (e.g. company, institution, organization) on behalf of which the dataset was created?
Yes. The dataset was created to be shared openly. We hope to iteratively refine the current dataset, and invite the entire research community to publicly share any critiques and participate in the iterative improvement of future versions of QM1B. The discussion will be facilitated through GitHub issues on github.com/graphcore-research/qm1b.
The dataset is stored in multiple Apache Parquet files. These are initially available to download from Amazon S3, and longer-term storage is being arranged on the \href{https://figshare.com}{Figshare} platform. We will also provide a programming API to automate downloading the dataset and facilitate training neural networks.
Autumn 2023
4. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
QM1B will be released under the same license as GDB11 \href{https://zenodo.org/record/5172018}{Creative Commons Attribution 4.0 International}.
5. Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
No.
6. Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
No.
Support: The authors will provide support for QM1B through GitHub issues.
Hosting: QM1B will be hosted on Figshare and the long-term hosting fee will be paid by Graphcore.
Maintenance: The maintenance of dataset updates and corrections will be facilitated through the GitHub repository qm1b-dataset. We invite users to supply any feedback, questions, and bug reports as GitHub Issues. We will document corrections and updates through the CHANGELOG and provide versioned releases for any major updates.
Raise a GitHub issue through qm1b-dataset.
See CHANGELOG
4. Will the dataset be updated (e.g. to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g. mailing list, GitHub)?
QM1B will be updated in case of any human error in the postprocessing steps, otherwise, we have no plans to update QM1B.
Our main focus for updates will be on extending the capabilities of $\text{PySCF}_\text{IPU}$.
5. If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g. were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
QM1B contains molecules and does not relate to people.
6. Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers
In case of updates, we will only release "diffs", ensuring the initial version of the dataset remains available.
7. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.
Yes, we accept pull requests to qm1b-dataset.
Footnotes

[^1]: Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. Communications of the ACM, 2021.
[^2]: Sereina Riniker and Gregory A. Landrum. Better Informed Distance Geometry: Using What We Know to Improve Conformation Generation. Journal of Chemical Information and Modeling, 2015.
[^3]: Greg Landrum et al. RDKit: A Software Suite for Cheminformatics, Computational Chemistry, and Predictive Modeling. 2013.