# Best Practices: Beware the flatfile & embrace working with entities

In DataFrame libraries, joining different tables is often either cumbersome or slow. As a consequence, many data
pipelines bring their main pieces of information together in one big table called a flatfile. While this might be nice
for quick exploration of the data, it causes several problems for long-term maintenance and for the speed of adding new
features:
1. The number of columns grows very large and may become hard to overlook for users who don't know all the prefixes
   and suffixes by heart.
2. Associated information with a 1:n relationship is either duplicated (wasting space) or written to an array column
   (reducing flexibility for further joins), or it simply becomes prohibitively hard to add features at a certain
   granularity.
3. If a table is historized, storing a row for each version of a data field, the table size grows quadratically with
   the number of columns.

The alternative is to keep column groups with similar subject-matter meaning or similar data sources together in
separate tables called entities. Especially when data transformation code is created programmatically with a nice
syntax, working with typical groups of entities can be made quite easy, with code in the background joining the
underlying tables.

Flatfiles are often created before feature engineering. Due to the large number of features (columns), it becomes
necessary to build automatic tools that execute the code for each feature in the correct order and avoid wasteful
execution. With entity granularity (column groups of similar origin), however, it is manageable to wire all feature
engineering computations manually. The resulting code is even quite valuable in itself, because it shows how the
different computation steps / entities build on each other. This makes tracking down problems much easier during
debugging and gives new joiners a chance to step through the code.
# Best Practices: start SQL, finish Polars

At the beginning of a data pipeline, the largest amount of data is typically touched with rather simple
operations: data is combined, encodings are converted/harmonized, simple aggregations and computations are performed,
and data is heavily filtered. These operations lend themselves very well to a powerful database, with
transformations converted to SQL `CREATE TABLE ... AS SELECT ...` statements. This way, the data stays within the
database, and the communication-heavy operations can be performed efficiently (i.e. in parallel) right where the data
is stored.
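A minimal sketch of the pattern, using the stdlib's in-memory SQLite as a stand-in for a real warehouse (table and column names are made up for illustration):

```python
import sqlite3

# In-memory SQLite stands in for a real database here; the pattern is the
# same CREATE TABLE ... AS SELECT statement any warehouse would run in place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id INTEGER, kind TEXT, amount REAL);
    INSERT INTO raw_events VALUES
        (1, 'buy', 10.0), (1, 'buy', 5.0), (2, 'view', 0.0);
""")

# Combine, aggregate, and filter right where the data lives, materializing
# the result as a new table instead of pulling rows into the client.
conn.execute("""
    CREATE TABLE purchases AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    WHERE kind = 'buy'
    GROUP BY user_id
""")

rows = conn.execute("SELECT user_id, total FROM purchases ORDER BY user_id").fetchall()
print(rows)
```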

Towards the end of the pipeline, the vast open-source ecosystem of training, evaluation, and
visualization tools is needed; these are best interfaced with classical Polars / Pandas DataFrames in memory.

In the middle, with feature engineering, there is still a large amount of logic that is predominantly simple enough
for typical SQL expressiveness, with some exceptions. Thus it is very helpful to be able to jump between SQL and
Polars for performance reasons while staying, for the most part, within the same pydiverse.transform syntax for
describing transformations.

When moving code to production, prediction calls are often made with much less data than during
training, so it might not be worth setting up a sophisticated database technology there. Pydiverse.transform
allows code written for SQL execution during training to be used unchanged for execution on Polars in
production. In the long run, we also want to be able to generate ONNX graphs from transform code to make reliable
long-term deployments even easier.

The aim of pydiverse.transform is not feature completeness but rather versatility, ease of use, and very predictable
and reliable behavior. Thus it should always integrate nicely with other ways of writing data transformations.
Together with [pydiverse.pipedag](https://pydiversepipedag.readthedocs.io/en/latest/), this interoperability becomes
even easier.