PG Datagen - PostgreSQL Random Data Generator

A schema-driven Python random data generator for PostgreSQL.

⚠️ PG Datagen is currently very much alpha, any definition may change. Consider it having a non-stable interface.

ℹ️ Name suggestions welcome 😊

The purpose of this tool is to provide a random data generator for each table in an (annotated) PostgreSQL schema. For example, when loading the schema:

CREATE TABLE a(
  id CHAR(32) PRIMARY KEY, -- gen: md5
  value TEXT
);

The tool would automatically generate (and ingest) values for the columns id and value in table a. Dependencies between tables are supported as well (see this example).

Current Limitations

Annotations cannot be mixed
Each table has to be defined in a single file
IDs based on INT/BIGINT can have collisions
No automated tests

Installation

Clone this repository
Install dependencies: pip3 install -r requirements.txt

Example Usage

We'll be using ./examples/simple for a demonstration:

Create a database and load examples/simple/{a,b}.sql

Run the generator:

./generator.py \
  --dsn postgresql://postgres@localhost/pydatagen \
  --batch-size 100 \
  --rows 1000 \
  --truncate \
  --target examples/simple/two-tables.py

Details

Python Control File

You will have to provide a Python file as entrypoint. The minimum definition of such file for two tables a and b would look like this:

from lib.base_object import Table

TABLES = {
  'public.a': Table(schema_path='...', scaler=1),
  'public.b': Table(schema_path='...', scaler=2)
}

GRAPH = {
  'public.a': [],
  'public.b': []
}

ENTRYPOINT = 'public.a'

Here, a and b are independent and b has twice as many rows as a. That is, if you declare --rows=100, then 100 rows will be generated for a and 200 rows will be generated for b.

Supported Annotations

Currently, there can be only one annotation per column. An annotation must be placed on the same line as the column. Two examples below. This will work:

CREATE TABLE a(
    foo BIGINT -- none_prob: 0.9
  , bar BIGINT
)

This won't work:

CREATE TABLE a(
    foo BIGINT
    -- none_prob: 0.9
  , bar BIGINT
)

Annotation	Description
none_prob: <0.0..1.0>	Sets probability of generating a `NULL` value (if allowed)
gen:	Hardcodes the generator to use. Methods in ./lib/random.py are supported (not all!). For example: `-- gen: md5` would use the `md5` method. There is a special generator `choose_from_list` to inject dependencies (see below).

Inject Dependencies

If you have tables a and b and they are defined like:

-- a.sql
CREATE TABLE a(
    id CHAR(32) PRIMARY KEY -- gen: md5
  , value BIGINT
);
    
-- b.sql
CREATE TABLE b(
    id CHAR(32) PRIMARY KEY -- gen: md5
  , id_a CHAR(32) NOT NULL -- gen: choose_from_list public.a.id
  , value BIGINT
)

Then, the data generator will take random values from a and use them for the id_a column of table b. Always supply a complete "path", that is, the path must be of format: <schema>.<table>.<column>. Use public as <schema> if you do not use custom schemas.

Modify the Python target file to ensure, that a is generated before b:

GRAPH = {
  'public.a': ['public.b'],
  'public.b': []
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

PG Datagen - PostgreSQL Random Data Generator

Current Limitations

Installation

Example Usage

Details

Python Control File

Supported Annotations

Inject Dependencies

Files

README.md

Latest commit

History

README.md

File metadata and controls

PG Datagen - PostgreSQL Random Data Generator

Current Limitations

Installation

Example Usage

Details

Python Control File

Supported Annotations

Inject Dependencies