A synthetic data generation framework for Python.
Creating synthetic data is domain-specific, but there are often common requirements. For example, you may want
- numbers or dates that fall anywhere within a range, even if you don't care exactly where
- some values to be randomly null
- a whole data frame with which to test your production code
PySynic facilitates all of that.
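For instance, here is a minimal sketch of the individual helpers, reusing the calls from the test further down (the seed arguments and our reading of random_date's day-window and start-date arguments come from that test, not from a definitive API reference):

from pysynic.synthetic_data import (
    random_from, randomly_null, random_integer_in_range, random_date
)

# An integer somewhere between 0 and 100; the final argument is a seed.
patient_id = random_integer_in_range(0, 100, 42)

# One of the listed values, possibly replaced by None.
disease = randomly_null(random_from(["cancer", "heart attack", "stroke"]))

# A date near 1/Jul/2021, derived from the seed (here 42) and a 31-day window.
admission = random_date(42, 31, "1/Jul/2021")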
However, what we're really advocating is the philosophy of creating synthetic data to test your pipelines, machine learning models, and so on.
If you find the code here helpful, then great.
It's available on PyPI, so just add pysynic to your dependency list.
It has no dependencies of its own, so it won't introduce any transitive dependency issues.
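For example, with pip:

pip install pysynic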
You can use this framework to generate data for your tests. For instance, if you want to test PySpark code with 1000 rows of bespoke data, you could write something similar to:
from pyspark.sql import SparkSession
from pysynic.synthetic_data import random_from, randomly_null, random_integer_in_range, random_date

def test_first_diagnosis(spark_session: SparkSession):
    # Build 1000 rows of semi-random patient data. The loop index i is used
    # as a seed, so the generated values are reproducible between runs.
    data = []
    for i in range(1000):
        data.append([random_integer_in_range(0, 100, i),
                     randomly_null(random_from(["cancer", "heart attack", "stroke"])),
                     random_date(i, 31, "1/Jul/2021")])
    df = spark_session.createDataFrame(data,
                                       ["patient_id", "disease_code", "admission_date"])
    # Hand the synthetic DataFrame to your production code and assert on the result.
    results = YOUR_PRODUCTION_METHOD(df)
    assert results.count() > 0  # etc, etc
In this pytest snippet, we create a Spark DataFrame that contains synthetic data.
We're not too interested in exactly what the data is, just that it is representative.
Then we use it to call our production code, which presumably does something interesting, and finally we make some sensible assertions.
These assertions will be domain-specific and we can't tell you what they should be, but hopefully you can see that with just a few lines of Python we can build large, semi-random test data sets.
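One thing the snippet glosses over: spark_session is a pytest fixture rather than part of PySynic. If you're not already using a plugin that provides one (the pytest-spark plugin does, for example), a minimal conftest.py sketch might look like this:

# conftest.py: a session-scoped SparkSession fixture for local tests.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("pysynic-tests")
             .getOrCreate())
    yield spark
    spark.stop()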
Note that in this example, the data is the same every time we run it.
If you want it to be unpredictable, don't provide a seed to the PySynic methods (in the example above, pass None instead of i).
Whether you want an element of determinism or true randomness is up to you; there are arguments for both.
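For example, a quick sketch of the difference, assuming the seed fully determines the value as the reproducibility note above implies:

from pysynic.synthetic_data import random_integer_in_range

# Seeded: the same seed should give the same value every time.
assert random_integer_in_range(0, 100, 7) == random_integer_in_range(0, 100, 7)

# Unseeded: pass None and each call is free to return something different.
value = random_integer_in_range(0, 100, None)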
If we were to run the same code in a PySpark shell, we would see output that looks something like:
>>> df.show()
+----------+------------+-------------------+
|patient_id|disease_code| admission_date|
+----------+------------+-------------------+
| 0| null|2021-07-01 00:00:00|
| 1| null|2021-07-02 00:00:00|
| 2| null|2021-07-03 00:00:00|
| 3| cancer|2021-07-04 00:00:00|
| 4| stroke|2021-07-05 00:00:00|
| 5| null|2021-07-06 00:00:00|
| 6| null|2021-07-07 00:00:00|
| 7|heart attack|2021-07-08 00:00:00|
| 8| null|2021-07-09 00:00:00|
| 9| null|2021-07-10 00:00:00|
| 10| null|2021-07-11 00:00:00|
| 11| null|2021-07-12 00:00:00|
| 12| cancer|2021-07-13 00:00:00|
| 13| stroke|2021-07-14 00:00:00|
| 14| null|2021-07-15 00:00:00|
| 15|heart attack|2021-07-16 00:00:00|
| 16|heart attack|2021-07-17 00:00:00|
| 17| stroke|2021-07-18 00:00:00|
| 18| cancer|2021-07-19 00:00:00|
| 19| stroke|2021-07-20 00:00:00|
+----------+------------+-------------------+
only showing top 20 rows