A schema-driven Python random data generator for PostgreSQL.
ℹ️ Name suggestions welcome 😊
The purpose of this tool is to provide a random data generator for each table in an (annotated) PostgreSQL schema. For example, when loading the schema:
CREATE TABLE a(
id CHAR(32) PRIMARY KEY, -- gen: md5
value TEXT
);
The tool would automatically generate (and ingest) values for the columns `id` and `value` in table `a`. Dependencies between tables are supported as well (see this example).
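To make the `-- gen: md5` annotation concrete, here is a minimal sketch of a generator that produces values fitting a `CHAR(32)` column. The function name `random_md5` is hypothetical and is not taken from `./lib/random.py`; the actual implementation may differ.

```python
import hashlib
import os


def random_md5() -> str:
    """Return the hex MD5 digest of 16 random bytes.

    A hex MD5 digest is always 32 characters long, so it fits CHAR(32).
    """
    return hashlib.md5(os.urandom(16)).hexdigest()


# One generated row for table `a` (the `value` content is a placeholder).
row = {"id": random_md5(), "value": "some text"}
```

Because the digest is derived from random bytes, collisions are very unlikely, which makes this kind of value convenient for primary keys.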
- Annotations cannot be mixed (only one annotation per column)
- Each table has to be defined in a single file
- IDs based on `INT`/`BIGINT` can have collisions
- No automated tests
- Clone this repository
- Install dependencies:
pip3 install -r requirements.txt
We'll be using ./examples/simple for a demonstration:
- Create a database and load `examples/simple/{a,b}.sql`
- Run the generator:
./generator.py \
  --dsn postgresql://postgres@localhost/pydatagen \
  --batch-size 100 \
  --rows 1000 \
  --truncate \
  --target examples/simple/two-tables.py
You have to provide a Python file as the entry point. A minimal definition of such a file for two tables `a` and `b` looks like this:
from lib.base_object import Table
TABLES = {
'public.a': Table(schema_path='...', scaler=1),
'public.b': Table(schema_path='...', scaler=2)
}
GRAPH = {
'public.a': [],
'public.b': []
}
ENTRYPOINT = 'public.a'
Here, `a` and `b` are independent, and `b` has twice as many rows as `a`. That is, if you declare `--rows=100`, then 100 rows will be generated for `a` and 200 rows will be generated for `b`.
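The row arithmetic can be sketched as follows. This is an illustration only: `rows_per_table` is a hypothetical helper, and it assumes the per-table count is simply `--rows` multiplied by the table's `scaler`, which matches the numbers given here.

```python
from typing import Dict


def rows_per_table(rows: int, scalers: Dict[str, int]) -> Dict[str, int]:
    """Multiply the base row count by each table's scaler (assumed semantics)."""
    return {table: rows * scaler for table, scaler in scalers.items()}


counts = rows_per_table(100, {"public.a": 1, "public.b": 2})
print(counts)  # {'public.a': 100, 'public.b': 200}
```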
Currently, there can be only one annotation per column, and it must be placed on the same line as the column definition. This will work:
CREATE TABLE a(
foo BIGINT -- none_prob: 0.9
, bar BIGINT
)
This won't work:
CREATE TABLE a(
foo BIGINT
-- none_prob: 0.9
, bar BIGINT
)
| Annotation | Description |
|---|---|
| `none_prob: <0.0..1.0>` | Sets the probability of generating a NULL value (if the column allows it) |
| `gen:` | Hardcodes the generator to use. Some (not all) methods in `./lib/random.py` are supported. For example, `-- gen: md5` would use the `md5` method. There is a special generator, `choose_from_list`, to inject dependencies (see below). |
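The same-line rule is easier to see with a small line-based parser. The sketch below is not the tool's actual parser; `parse_column_line` is a hypothetical function that assumes the schema is read one line at a time and that an annotation has the form `-- key: value`.

```python
from typing import Optional, Tuple


def parse_column_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (column, annotation key, annotation value), or None.

    None is returned when the line has no comment, no column definition
    before the comment, or a comment that is not of the form "key: value".
    """
    definition, sep, comment = line.partition("--")
    # A leading comma ("leading comma" SQL style) is not part of the name.
    tokens = definition.replace(",", " ").split()
    if not sep or not tokens:
        return None
    key, colon, value = comment.partition(":")
    if not colon:
        return None
    return tokens[0], key.strip(), value.strip()


print(parse_column_line("foo BIGINT -- none_prob: 0.9"))  # ('foo', 'none_prob', '0.9')
print(parse_column_line("-- none_prob: 0.9"))  # None
```

Under this assumption, an annotation on its own line cannot be attributed to any column, because the line contains no column definition.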
If you have tables `a` and `b` and they are defined like:
-- a.sql
CREATE TABLE a(
id CHAR(32) PRIMARY KEY -- gen: md5
, value BIGINT
);
-- b.sql
CREATE TABLE b(
id CHAR(32) PRIMARY KEY -- gen: md5
, id_a CHAR(32) NOT NULL -- gen: choose_from_list public.a.id
, value BIGINT
)
Then the data generator will take random values from `a.id` and use them for the `id_a` column of table `b`. Always supply a complete path of the form `<schema>.<table>.<column>`. Use `public` as `<schema>` if you do not use custom schemas.
Modify the Python target file to ensure that `a` is generated before `b`:
GRAPH = {
'public.a': ['public.b'],
'public.b': []
}
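One way to read this `GRAPH` is as a mapping from each table to the tables that depend on it. The sketch below derives a generation order from such a mapping with Kahn's topological sort. It is an illustration of the ordering idea, not the tool's own traversal, and `generation_order` is a hypothetical name.

```python
from collections import deque
from typing import Dict, List


def generation_order(graph: Dict[str, List[str]]) -> List[str]:
    """Order tables so that each one comes before its dependents."""
    # Count, for every table, how many tables must be generated before it.
    indegree = {table: 0 for table in graph}
    for dependents in graph.values():
        for dependent in dependents:
            indegree[dependent] = indegree.get(dependent, 0) + 1

    # Start with tables that depend on nothing.
    queue = deque(table for table, degree in indegree.items() if degree == 0)
    order: List[str] = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for dependent in graph.get(table, []):
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                queue.append(dependent)

    if len(order) != len(indegree):
        raise ValueError("dependency cycle detected")
    return order


GRAPH = {"public.a": ["public.b"], "public.b": []}
print(generation_order(GRAPH))  # ['public.a', 'public.b']
```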