DECISIONS.md

Decisions

This document explains the decisions made within Datamill's Target SQL project for Singer, and the rationale behind them.

Principles

The guiding principles we try to adhere to when reaching a conclusion are:

  1. When possible, make the resulting data/schema in the remote target consistent, no matter the order in which messages arrive
    • ie, if a decision would produce a schema in the remote target that varies with message ordering, for no reasonable benefit, it violates this principle
  2. Do right by the common majority of users
  3. Make a best effort to prevent a user from having to intervene
  4. Use Stitch’s offering and documentation as best practice guidance

Schema

De-nesting

What

  • JSON Schema allows for complex schemas which have non-literal (ie, compositional) elements
    • examples include:
      • objects (ie, {'a': 1, 'b': 2 ...})
      • arrays (ie, [1, 'a', 2, {4: False}])
      • anyOf
  • Standard SQL does not support compositional elements; instead, data is highly structured, potentially across many related tables
  • To overcome this, target-sql provides tooling which unpacks:
    • json objects into their parent record
    • json arrays as sub tables
```python
# Stream `FOO`
[
  {'nested_object': {
    'a': 1,
    'b': 2
   },
   'nested_array': [
     {'c': False, 'd': 'abc'},
     {'c': True, 'd': 'xyz'}
   ]
  }
]


# Results in:
## Table `foo`
[
  {'nested_object__a': 1,
   'nested_object__b': 2}
]

## Table `foo__nested_array`
[
  {'c': False, 'd': 'abc'},
  {'c': True, 'd': 'xyz'}
]
```

Why

  • This approach is inspired by the one Stitch Data takes for object/array de-nesting.
  • Flat tables provide a better experience for users querying with SQL
  • Data warehouses tend to prefer denormalized structures, while operational databases prefer normalized structures. We normalize the incoming structure so the user can choose what to do with the normalized raw data. Flat tables are also easier to access and transform later than JSON blobs.
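The de-nesting described above can be sketched roughly as follows. This is a minimal illustration with hypothetical helper names (`flatten`, `denest_stream`), not the actual target-sql implementation, which also handles primary keys, schemas, and deeper edge cases:

```python
# Minimal sketch of object/array de-nesting (hypothetical helpers,
# not the actual target-sql tooling).

def flatten(record, table, tables, prefix=''):
    """Flatten one record into a single row; arrays spawn child tables."""
    row = {}
    for field, value in record.items():
        name = prefix + field
        if isinstance(value, dict):
            # objects are unpacked into the parent row, prefixed with `field__`
            row.update(flatten(value, table, tables, name + '__'))
        elif isinstance(value, list):
            # arrays become rows of a `parent__field` child table
            child = '{}__{}'.format(table, name)
            for item in value:
                item = item if isinstance(item, dict) else {'value': item}
                tables.setdefault(child, []).append(flatten(item, child, tables))
        else:
            row[name] = value
    return row

def denest_stream(records, table):
    """De-nest a stream of records into a dict of table -> rows."""
    tables = {}
    tables[table] = [flatten(r, table, tables) for r in records]
    return tables
```

Running this over the `FOO` example above yields the `foo` and `foo__nested_array` tables shown.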

Column Type Mismatch

What

  1. A field has been streamed to the remote target with type integer
  2. A new field with the same raw name as the remote column has been streamed, but with type boolean
    • Data of type boolean cannot be placed into a column of type integer
  3. target-sql has tooling which will:
    1. rename the original column to original_field_name__i (the suffix encodes the column's type; here i for integer)
    2. make the renamed column nullable
    3. create a new column named original_field_name__b (b for boolean)
    4. stream new data to original_field_name__b
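In outline, the migration amounts to something like this. It is a simplified in-memory sketch: the suffix map and `split_column` are hypothetical names, and the real tooling issues DDL against the remote target rather than mutating a dict:

```python
# Simplified sketch of the rename-and-split migration. The real
# implementation runs ALTER TABLE statements against the remote target.

TYPE_SUFFIXES = {'integer': 'i', 'boolean': 'b'}  # assumed suffix scheme

def split_column(schema, field, new_type):
    """schema maps column name -> {'type': ..., 'nullable': ...}."""
    old = schema.pop(field)
    # steps 1 & 2: rename the original column and make it nullable
    schema['{}__{}'.format(field, TYPE_SUFFIXES[old['type']])] = {
        'type': old['type'], 'nullable': True}
    # steps 3 & 4: create the new typed column; new data streams into it
    schema['{}__{}'.format(field, TYPE_SUFFIXES[new_type])] = {
        'type': new_type, 'nullable': True}
    return schema
```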

Why

TL;DR: Instead of throwing a hard error and forcing users to do some manual transformation before streaming data through target-sql, we chose a "best effort" approach to resolving the underlying error.

By renaming and migrating the column we:

  • the resulting structure in the database is the same whether column integer is uploaded before column boolean or vice versa
  • users learn early of dependent views/columns that would block a type change

Column Name Collision

What

  1. Field of name foo is streamed
  2. Field of name FOO is then streamed
  3. Since both of these names canonicalize to the same result (ie, foo), we have a name collision
  4. When attempting to upsert_table, SQLInterface has to handle name collisions. To do this, it attaches a unique suffix to the name which caused the collision, not the original
    • The suffix is an auto-incrementing numerical value
```python
# Field `foo` is streamed
# Field `FOO` is streamed

[
  {'foo': 1,
   'FOO': False,
   'fOo': 4.0}
]

# The resulting table will be:

[
  {'foo': 1,
   'foo__1': False,
   'foo__2': 4.0}
]
```

Why

TL;DR: Instead of throwing a hard error and forcing users to do some manual transformation before streaming data through target-sql, we chose a "best effort" approach to resolving the underlying error.

  • While this means that the ordering of fields/actions affects the final remote structure, users can easily inspect the resulting structure
  • Hashes were used as suffixes in the past, but were determined to be too confusing for end users. So while they allowed us to adhere to principle (1), principle (2) was being ignored.
    • Additionally, we chose not to append a numerical suffix to all columns for the same reason. Most users will not have name collisions, so instead of making the overall user experience worse, we chose a targeted solution to this particular edge case

Column Name Length

What

  • SQLInterface provides a single field called IDENTIFIER_FIELD_LENGTH which is to be overridden by the implementing class
  • Any column name found to exceed IDENTIFIER_FIELD_LENGTH is truncated so that it is no longer than IDENTIFIER_FIELD_LENGTH
  • All collision and type information is preserved in the truncation
    • ie, any values which are suffixed onto the name as __...
  • All original field/column names are preserved as a column_mapping

Why

TL;DR: Instead of throwing a hard error and forcing users to do some manual transformation before streaming data through target-sql, we chose a "best effort" approach to resolving the underlying error.

Most (all?) SQL targets we have encountered have length restrictions for identifiers in their schema. Since arbitrary JSON does not have this same restriction, we needed a best effort mechanism for handling names which were either auto-generated and are too long, or user input fields which physically cannot fit into the remote target.

As such, we chose the simplest method here for clarity: truncate the original/generated name, then proceed with collision support as normal.

The implementing class is tasked with providing canonicalize_identifier, a method which transforms a string identifier into another string identifier containing only characters allowed by the remote target.
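Put together, the truncation behaves roughly like this. It is a simplified sketch that preserves only the final `__` suffix; the 63-character limit is PostgreSQL's default identifier limit, standing in for a concrete IDENTIFIER_FIELD_LENGTH:

```python
# Simplified sketch: truncate the base name but keep the trailing
# `__...` collision/type suffix intact. Assumes at most one suffix;
# the real implementation composes this with collision handling.

IDENTIFIER_FIELD_LENGTH = 63  # e.g. PostgreSQL's identifier limit

def truncate_identifier(name, limit=IDENTIFIER_FIELD_LENGTH):
    if len(name) <= limit:
        return name
    base, sep, tail = name.rpartition('__')
    if sep:
        suffix = sep + tail  # preserve e.g. '__i' or '__1'
        return base[:limit - len(suffix)] + suffix
    return name[:limit]
```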

Data De-nesting

Objects

What

  • Objects are unpacked into their parent table.
  • The unpacked fields are prefixed with the name of the field which originally contained the object.

Why

  • This approach is inspired by the one Stitch Data takes for object de-nesting.
  • Flat tables provide a better experience for users querying with SQL

Arrays

What

  • Arrays are unrolled as individual rows into a child table
  • The table name is constructed as parent_table__field

Why

  • This approach is inspired by the one Stitch Data takes for array de-nesting.
  • Flat tables provide a better experience for users querying with SQL

Queries

What

  • When we write SQL at any given point, we have the option to use "latest" PostgreSQL features
  • We opt for features available from PostgreSQL 8.4.22 forward
  • We DO NOT support PostgreSQL 8.4.22
    • any feature requests or bug reports stemming from this will be weighed against this decision in terms of effort versus benefit

Why

  • Supporting multiple versions of PostgreSQL has thus far been fairly straightforward by adhering to only query support available in the oldest version of supported PostgreSQL
  • By doing this, we only have one main code base, instead of many fractured versions which all employ the latest/greatest system functions/methods/tables/information schemas available
  • By using 8.4.22, supporting Redshift is made simpler
    • Redshift was originally split from PostgreSQL 8.0.2
    • At some point, a lot of work was done by AWS to make Redshift a "simple fork" of PostgreSQL 8.4
  • We do not support PostgreSQL 8.4 simply because PostgreSQL does not support it anymore
    • The only benefit of targeting the 8.4 query language is Redshift
    • When a new supported version of PostgreSQL comes along and we undertake the effort to support it here, if doing so is simpler by breaking 8.4 compatibility, we will move the necessary logic to target-redshift