analysis
.gitignore
README.md
config.py
download.py
json_table_handler.py
parquet_metadata.json
parquet_table_handler.py
requirements.txt
schema.py
util.py

README.md

RealNest Dataset Downloader

This script downloads the RealNest dataset, which consists of the nested fields of various Parquet and JSON datasets.

If a table has more rows than requested, only the last n rows are downloaded so that the latest available data is retrieved. As a result, separate runs of the script may download different data.
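The "last n rows" rule can be illustrated with a small sketch. The helper name `tail_window` and its parameters are hypothetical and are not taken from download.py; the sketch only shows how an offset and a row count could be derived from a table's total row count.

```python
def tail_window(total_rows: int, requested_rows: int) -> tuple[int, int]:
    """Return (offset, count) that selects the last `requested_rows` rows.

    Hypothetical helper for illustration only; the actual script may
    select the trailing rows in a different way.
    """
    if total_rows < 0 or requested_rows < 0:
        raise ValueError("row counts must be non-negative")
    # Never ask for more rows than the table holds.
    count = min(total_rows, requested_rows)
    # Skip everything before the trailing window.
    offset = total_rows - count
    return offset, count


if __name__ == "__main__":
    print(tail_window(1_000, 100))  # table larger than the request
    print(tail_window(50, 100))     # table smaller than the request
```

Because the offset depends on the total row count at download time, a table that grows between runs yields a different window, which matches the note above that runs may download different data.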

Requirements

Python >= 3.9
Enough disk space to store the downloaded data. Note that the dataset is downloaded in parallel, and the downloaded data can be much larger than the compressed size of the dataset.
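A quick pre-flight check for both requirements might look like the sketch below. The function names and the idea of a `required_gib` threshold are assumptions for illustration; the README does not state how much space is needed, so that value has to be chosen by the user.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # taken from the Requirements list above


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def enough_space(path: str, required_gib: float) -> bool:
    """True if at least `required_gib` GiB are free (threshold is user-chosen)."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print(f"Free space here: {free_gib():.1f} GiB")
```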

Install Dependencies

Install the Python dependencies using the following command:

pip3 install -r requirements.txt

DuckDB

This script requires the Map type inference functionality of the DuckDB JSON reader, which is scheduled for release in DuckDB v1.1.0. Until then, DuckDB v0.10.2 with the required feature can be built and installed from source as follows:

Follow the DuckDB Build Prerequisites page to install the required DuckDB build dependencies.

Clone the patched DuckDB repository:

git clone -b v0.10.2-with-json-map --single-branch https://github.com/ZiyaZa/duckdb.git

If using a Python virtual environment, make sure it is activated.

Execute the following command in the root folder of the DuckDB repository to build and install the DuckDB Python package:

EXTENSION_STATIC_BUILD=1 GEN=ninja BUILD_PYTHON=1 OVERRIDE_GIT_DESCRIBE=v0.10.2 ENABLE_EXTENSION_AUTOLOADING=1 ENABLE_EXTENSION_AUTOINSTALL=1 make
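After the build, it can help to confirm which DuckDB version Python actually imports. The sketch below is only a version check under the assumptions stated in this section: releases from v1.1.0 onward are assumed to include the feature, and v0.10.2 may or may not be the patched build, since the build command above sets the reported version to v0.10.2 and a version string alone cannot distinguish it from the stock release. The function name and the returned labels are hypothetical.

```python
import re


def classify_duckdb_version(version: str) -> str:
    """Classify a DuckDB version string against this README's requirements.

    Returns "release" for >= 1.1.0 (assumed to ship the feature),
    "maybe-patched" for exactly 0.10.2, and "unsupported" otherwise.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parsed = tuple(int(part) for part in match.groups())
    if parsed >= (1, 1, 0):
        return "release"
    if parsed == (0, 10, 2):
        return "maybe-patched"
    return "unsupported"


if __name__ == "__main__":
    try:
        import duckdb
    except ImportError:
        print("duckdb is not installed in this environment")
    else:
        print(duckdb.__version__, classify_duckdb_version(duckdb.__version__))
```

A "maybe-patched" result only means the version number matches; whether Map type inference is available still depends on the package having been built from the patched repository.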

Configuration

See the comments in config.py for the available configuration options and modify them as needed. The options can also be overridden by setting the corresponding environment variables.
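The environment-variable override described above commonly follows the pattern sketched below. This is an illustration, not the contents of config.py: the `option` helper and the option names `NUM_ROWS` and `OUTPUT_DIR` are hypothetical, so check config.py for the real names.

```python
import os
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def option(
    name: str,
    default: T,
    cast: Callable[[str], T],
    env: Mapping[str, str] = os.environ,
) -> T:
    """Return the environment value for `name` if set, else `default`.

    Environment values are strings, so `cast` converts them to the
    type of the option (for example `int` or `str`).
    """
    raw = env.get(name)
    return default if raw is None else cast(raw)


if __name__ == "__main__":
    # Hypothetical option names, for illustration only.
    num_rows = option("NUM_ROWS", 1000, int)
    output_dir = option("OUTPUT_DIR", "data", str)
    print(num_rows, output_dir)
```

With a pattern like this, a value set in the shell for one run takes precedence over the default in the file, without editing the file itself.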

Download Dataset

Run the download.py script to download the RealNest dataset:

python3 download.py
