Release v0.4.0 · ldbc/ldbc_snb_datagen_spark

This is the first Datagen release with Spark.

Execution environments

Both Spark 2 and 3 are supported.
The generator can be run in a Docker container (for tests and small data sets), on a Spark cluster, and in cloud-based Spark implementations.
We provide scripts for AWS EMR. We used these to generate data sets up to scale factor 30,000.

The generator produces a temporal graph where entities can be both inserted (creationDate) and deleted (deletionDate). It supports three serialization modes:
- Raw mode: generates the entire temporal graph with the creationDate and deletionDate properties included for each dynamic entity. (Not intended for a benchmark but to be used for experiments where custom data sets are required.)
- BI mode: generates an initial data set and daily batches of deletions and insertions. To be used with the LDBC SNB Business Intelligence workload.
- Interactive mode (incomplete): does not take deletions into account. Generates an initial data set. Does not yet generate update streams. See ldbc/ldbc_snb_interactive_v1_impls#173 for the plans to use the new Datagen for SNB Interactive.
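
To illustrate how the Raw and BI modes relate, here is a minimal Python sketch that splits raw temporal records into an initial data set plus daily insert and delete batches. It is not Datagen's actual implementation: the function name `split_temporal_graph`, the `cutoff` parameter, the dict-based record layout, and the use of `None` for "never deleted" are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def split_temporal_graph(entities, cutoff):
    """Split raw temporal records into a snapshot and daily batches.

    entities: iterable of dicts with 'id', 'creationDate', 'deletionDate'
              (deletionDate is None if the entity is never deleted; this
              encoding is an assumption of the sketch).
    cutoff:   hypothetical datetime separating the initial data set
              from the batched updates.
    """
    snapshot = []                # entities alive at the cutoff
    inserts = defaultdict(list)  # day -> entities created on that day
    deletes = defaultdict(list)  # day -> ids deleted on that day

    for e in entities:
        created, deleted = e["creationDate"], e["deletionDate"]
        if created < cutoff:
            if deleted is not None and deleted < cutoff:
                continue  # created and deleted before the cutoff: never visible
            snapshot.append(e)
        else:
            inserts[created.date()].append(e)
        if deleted is not None:
            deletes[deleted.date()].append(e["id"])

    return snapshot, dict(inserts), dict(deletes)


# Hypothetical sample records, not real Datagen output.
raw = [
    {"id": "a", "creationDate": datetime(2012, 1, 1), "deletionDate": None},
    {"id": "b", "creationDate": datetime(2012, 1, 2), "deletionDate": datetime(2012, 6, 1)},
    {"id": "c", "creationDate": datetime(2012, 3, 1), "deletionDate": datetime(2012, 7, 2)},
    {"id": "d", "creationDate": datetime(2012, 7, 2, 10), "deletionDate": None},
    {"id": "e", "creationDate": datetime(2012, 7, 3), "deletionDate": datetime(2012, 7, 5)},
]
snapshot, inserts, deletes = split_temporal_graph(raw, cutoff=datetime(2012, 7, 1))
```

In this sketch, ignoring the `deletes` batches and the early-deletion check would correspond roughly to the Interactive mode's behavior of not taking deletions into account.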
The generator also supports producing factor tables.
This release does not yet include a parameter generator; one will be added in a later release.