RealNest - Nested Data from Real-World Datasets

This repository contains the details of the RealNest dataset, a collection of nested data derived from real-world datasets. The dataset is designed to help computer science researchers benchmark and evaluate data systems and data formats supporting nested data types.

RealNest is provided as a script that downloads and generates the data. For convenience, and to facilitate standardized comparisons, we also host two static datasets on the CWI website (https://event.cwi.nl/da/RealNest), outside of this repository. They contain data in .jsonl.gz format with 64 * 1024 and 10 * 64 * 1024 rows, respectively. These sample datasets were downloaded and generated by our script in mid-May 2024.

Furthermore, the sample-data directory inside this repository contains a small sample of the datasets mentioned above (the first 1024 rows and 100 MiB of each table) as a preview.

Because we provide the script that downloads the original datasets and processes them into a common format, one can regenerate the dataset from newer versions of the underlying data. One can also generate datasets larger than the static ones, since even the larger of the two static datasets contains only a small part of each original data source. Please note that the availability of the original datasets is outside our control, and over time some of them may become unavailable. The download script attempts to download the data from each source and skips the sources that are not available.

Please refer to the README in the scripts directory for more details.

All materials in this GitHub repository, except the files under the sample-data folder, are released under the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). Hence, this repository is open-source, its use requires attribution to this page (which includes the Attribution section below), and commercial exploitation is not allowed.

Note that the sample datasets inside this repository and the two static datasets hosted at CWI (https://event.cwi.nl/da/RealNest) remain under the same licenses and terms of use as the original datasets they are generated from. If you are the owner of an original dataset and object to the inclusion of your data in the RealNest static datasets hosted at CWI or in the samples hosted in this repository, please contact Peter Boncz ([email protected]), and we will take action.

Please note that below we attempt to properly attribute the individual datasets as required by their various open-source licenses and terms of usage.

Dataset Structure

The dataset contains a directory for each table with the following files:

  • schema.json: The schema of the table. The schema is a JSON object with a single key, columns, containing a list of columns. Each column is a JSON object with 2 or 3 keys:
    • name - The name of the column as a string.
    • type - The type of the column as a string.
    • children - Optional, only exists for nested types (list, struct, map). Describes the child types of the nested type as a list of column objects. The list type always has a single child column with the name child. The map type always has two child columns with the names key and value.
  • data.jsonl or data.jsonl.gz: The data of the table in JSON Lines format (optionally Gzip compressed).
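The layout described above can be illustrated with a minimal Python sketch. It is not part of the RealNest scripts. The example schema is hypothetical, the leaf type names (BIGINT, DOUBLE) and the helper names render_type and read_rows are assumptions made for illustration, and nested type names are compared case-insensitively because the exact spelling in schema.json is not stated here.

```python
import gzip
import json
from pathlib import Path


def render_type(column):
    """Render one column object from schema.json as a compact type string."""
    children = column.get("children", [])
    kind = column["type"]
    if not children:
        return kind  # leaf type, e.g. VARCHAR
    if kind.lower() == "list":
        # A list always has a single child column named "child".
        return f"{kind}<{render_type(children[0])}>"
    if kind.lower() == "map":
        # A map always has two child columns named "key" and "value".
        by_name = {child["name"]: child for child in children}
        return f"{kind}<{render_type(by_name['key'])}, {render_type(by_name['value'])}>"
    # Otherwise treat it as a struct: one named field per child column.
    fields = ", ".join(f"{child['name']}: {render_type(child)}" for child in children)
    return f"{kind}<{fields}>"


def read_rows(table_dir):
    """Yield one parsed JSON object per line of data.jsonl.gz or data.jsonl."""
    table_dir = Path(table_dir)
    gz_path = table_dir / "data.jsonl.gz"
    if gz_path.exists():
        handle = gzip.open(gz_path, "rt", encoding="utf-8")
    else:
        handle = open(table_dir / "data.jsonl", "rt", encoding="utf-8")
    with handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Hypothetical schema, shaped like the description above (not a real RealNest table).
example_schema = {
    "columns": [
        {"name": "id", "type": "BIGINT"},
        {"name": "tags", "type": "list",
         "children": [{"name": "child", "type": "VARCHAR"}]},
        {"name": "counts", "type": "map",
         "children": [{"name": "key", "type": "VARCHAR"},
                      {"name": "value", "type": "BIGINT"}]},
        {"name": "position", "type": "struct",
         "children": [{"name": "x", "type": "DOUBLE"},
                      {"name": "y", "type": "DOUBLE"}]},
    ]
}

for column in example_schema["columns"]:
    print(column["name"], render_type(column))
```

For a real table, the schema would instead be loaded with json.load from the table's schema.json, and read_rows would be pointed at the same table directory.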

The schema might contain a JSON type, which may happen for empty JSON objects in the data ({}) or when DuckDB's schema inference detects incompatible types. The columns of this type can be ignored since they are not typical for structured data, or they can be handled as VARCHAR columns, where the value is the JSON string.
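Both ways of handling JSON-typed columns can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the type name is assumed to be spelled JSON (compared case-insensitively), and only top-level columns are considered, so a JSON type nested inside a list, struct, or map would need a recursive variant.

```python
import json


def json_column_names(schema):
    """Return the names of top-level columns whose type is JSON."""
    return [col["name"] for col in schema["columns"] if col["type"].upper() == "JSON"]


def drop_json_columns(row, names):
    """Option 1: ignore JSON-typed columns by removing them from the row."""
    return {key: value for key, value in row.items() if key not in names}


def json_columns_as_text(row, names):
    """Option 2: keep JSON-typed columns as VARCHAR-like JSON strings."""
    out = dict(row)
    for name in names:
        if out.get(name) is not None:
            out[name] = json.dumps(out[name])
    return out


# Hypothetical schema and row, for illustration only.
schema = {"columns": [{"name": "id", "type": "BIGINT"},
                      {"name": "extra", "type": "JSON"}]}
row = {"id": 1, "extra": {}}

names = json_column_names(schema)
print(drop_json_columns(row, names))
print(json_columns_as_text(row, names))
```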

Attribution

The data has been downloaded from various public sources and converted to a common format. We note that the real-world datasets from which RealNest is derived are released under varying open-source licenses and terms of usage.

The sources of the original datasets are:

  1. Amazon Berkeley Objects (LICENSE)
    • J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Yago Vicente, T. Dideriksen, H. Arora, M. Guillaumin, and J. Malik, "Abo: Dataset and benchmarks for real-world 3d object understanding," CVPR, 2022.
  2. AWS Public Blockchain Data (LICENSE)
  3. Data Lake as Code (ATTRIBUTIONS)
  4. CORD-19 (LICENSE)
    • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier, "Cord-19: The covid-19 open research dataset," ArXiv, 2020.
  5. Daylight Map Distribution of OpenStreetMap (Open Database License (ODbL))
  6. GitHub Archive
  7. CERN Open Data
    • CMS collaboration (2017). SingleMu primary dataset in AOD format from Run of 2012 (/SingleMu/Run2012B-22Jan2013-v1/AOD). CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.IYVQ.1J0W
  8. Overture Maps Foundation Open Map Data
    • Overture data is licensed under the Community Database License Agreement Permissive v2 (CDLA) unless derived from a source that requires publishing under a different license, such as data derived from OpenStreetMap, that constitutes a 'Derivative Database' (as defined under ODbL v1.0), which will be licensed under ODbL v1.0.
  9. Twitter Stream Archive