Save properties not found in the jsonschema in a JSONB column #129

Open
airhorns opened this issue Jul 22, 2019 · 3 comments

@airhorns
Contributor

It makes a lot of sense that target-postgres tries to load data in a structured way into Postgres. The more that the database knows about the structure of the data, the faster and more efficient the database can be when querying it.

That said, lots of the singer taps don't or can't know the exact, specific structure of the data they are extracting beforehand. Stitch itself deals with this in a variety of ways, but it does indeed deal with it somehow. In my experience feeding web analytics event data from something like Segment into Redshift using Stitch, the event properties can be named whatever the heck developers want, and Stitch somewhat magically adds columns each time a new property is discovered. This is nice because no schema has to be updated anywhere and the pipeline doesn't break while someone figures out where to update it.

target-postgres doesn't need to be that smart -- after all, you could just use Stitch -- but I think it could do a bit better than just dropping the data for unknown columns (see #127).

I think a nice middle ground would be to keep using the structured smarts that create nice tables with good relational structure for things specified by the JSONSchema, but then load unrecognized columns into a JSONB column called additional_properties or something like that. This would allow us to at least get a handle on the data from Postgres land, and if you want to break it out into a real column or do something useful with it you could do so with a transform or SQL view or whatever. JSONB support is really quite nice in recent Postgres with all the partial indexing support and whatnot, and while it wouldn't be as fast as "real" columns, it can get close.
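
To make that concrete, here's a rough sketch (not existing target-postgres code; the events table, column name, and view are all illustrative) of what the JSONB column, a GIN index, and a view that breaks one property out into a "real" column could look like, driven from Python with psycopg2:

    import psycopg2

    conn = psycopg2.connect("dbname=example")
    with conn, conn.cursor() as cur:
        # Hypothetical: an existing table gets a catch-all JSONB column.
        cur.execute("""
            ALTER TABLE events
            ADD COLUMN additional_properties JSONB DEFAULT '{}'::jsonb
        """)
        # A GIN index keeps containment/key-existence queries on the blob fast.
        cur.execute("""
            CREATE INDEX events_additional_properties_idx
            ON events USING GIN (additional_properties)
        """)
        # Break an unanticipated property out into a queryable column via a view.
        cur.execute("""
            CREATE OR REPLACE VIEW events_with_utm AS
            SELECT *, additional_properties->>'utm_source' AS utm_source
            FROM events
        """)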

I think this may be undesirable behaviour for some users as well though -- they may be purposefully ignoring bad-quality data in some columns, or discarding it because it's simply way too big. That's ok: Singer's catalog functionality has mechanisms to ignore stuff that those people could use, and we should likely add a configuration option to disable this functionality for folks who don't want it. But if I've learned anything as a pipeline developer, it's that the data is probably there because someone thought it was important, so being able to capture it and save it somehow would be great.

Sadly, JSONB support only landed in Postgres 9.4, which is much more recent than the 8.4.22 version mentioned in DECISIONS.md. 8.4.22 is 5 years old and unsupported by the Postgres community at this point, whereas 9.4 is supported. So we could consider upgrading the minimum required Postgres version, but that seems like a bit much for a net-new feature that people have been getting by without so far. I'd say it'd be best to just detect JSONB support, and warn if the config option is enabled but JSONB isn't available, or something like that. Is there precedent for that kind of behaviour in the project?
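
For what it's worth, a minimal sketch of the detect-and-warn idea, assuming a hypothetical persist_additional_properties config key (psycopg2's server_version is the real API; e.g. 9.4.2 shows up as 90402):

    import logging
    import psycopg2

    def jsonb_supported(conn):
        # JSONB landed in Postgres 9.4, i.e. server_version >= 90400.
        return conn.server_version >= 90400

    config = {'persist_additional_properties': True}  # hypothetical option
    conn = psycopg2.connect("dbname=example")
    if config.get('persist_additional_properties') and not jsonb_supported(conn):
        logging.warning('persist_additional_properties is enabled but this '
                        'Postgres predates 9.4, so there is no JSONB column '
                        'to persist unknown properties into')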

I will prepare a PR if y'all are into it!

@awm33
Member

awm33 commented Jul 23, 2019

@airhorns Stitch does handle pretty much anything you can place in JSON. The engine behind that is actually pretty complex (think of handling type changes on nested records, etc.) and closed source.

Mixpanel and some other SaaS sources come to mind where you can't always know the schema.

A JSONB column on objects with additionalProperties marked true in the schema is interesting. I think we'd want to make it an option to turn on in the config. Maybe make the column __sdc_additional_properties to follow the __sdc convention.
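
Roughly, the value for that column would be whatever the record carries beyond the declared properties. An illustrative sketch (the function name is made up):

    # Split a record into the fields the schema knows about and the leftovers
    # that would land in a hypothetical __sdc_additional_properties column.
    def split_additional_properties(record, schema):
        known = set(schema.get('properties', {}))
        structured = {k: v for k, v in record.items() if k in known}
        additional = {k: v for k, v in record.items() if k not in known}
        return structured, additional

    schema = {'type': 'object',
              'properties': {'id': {'type': 'integer'}},
              'additionalProperties': True}
    record = {'id': 1, 'utm_source': 'newsletter'}
    structured, additional = split_additional_properties(record, schema)
    # structured -> {'id': 1}; additional -> {'utm_source': 'newsletter'}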

The complications come from:

  • Pre-9.4 Postgres, as you mentioned. We actually somewhat target 8.4 to test Redshift compatibility.
  • We use the target as a basis for target-redshift, so we'd need a way of bypassing the logic in that target.

@AlexanderMann
Collaborator

Weighing in: maybe a simpler way to approach this is to ask what we ultimately want this to do.

We have some desire to persist data forward into our target so as to make sure nothing gets lost along the way, right? __sdc_additional_properties seems quite useful, and I'm thinking that the simple solution is to have it default to TEXT and then use JSONB if it's available (etc.).
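
A minimal sketch of that TEXT/JSONB choice, assuming a psycopg2 connection and the __sdc_additional_properties naming from above (none of this exists in the target today):

    import json

    def additional_properties_column_type(conn):
        # JSONB arrived in Postgres 9.4 (server_version 90400); fall back to TEXT.
        return 'JSONB' if conn.server_version >= 90400 else 'TEXT'

    def serialize_additional_properties(extras):
        # Either way the value is written as a JSON string; JSONB just lets
        # Postgres index and query it natively.
        return json.dumps(extras)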

My questions would mainly be around:

  • updates to a particular row in the db
  • updates to a particular table's schema

Upsert for a row

In this instance, do we want to try and deep merge with the value which is present in the remote? If we're unable to (due to something like a list vs. an object vs. a scalar), do we overwrite with the latest?

Do we want to wipe out the column on each update and simply use the latest payload each time?
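
For illustration only, the deep-merge-with-overwrite-fallback option could look something like this (not current behaviour):

    def merge_additional_properties(existing, incoming):
        if not (isinstance(existing, dict) and isinstance(incoming, dict)):
            # list vs object vs scalar mismatch: give up and take the latest value.
            return incoming
        merged = dict(existing)
        for key, value in incoming.items():
            merged[key] = (merge_additional_properties(merged[key], value)
                           if key in merged else value)
        return merged

    # The "wipe and use the latest payload" option is simply: merged = incoming.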

Upserting the Schema

Dropping a column

Our schema declared a field foo and now doesn't: do we move the data over to our JSONB column?

Our schema declared a field foo and now doesn't, but our payloads still include it. Do we want to drop the column and then start persisting the values into the JSONB column, or do we want to leave the column around?

Adding a column

Schemas did not contain field foo but our payloads did. Our schemas now contain field foo. Do we want to "migrate" over the values present in the JSONB column?
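
If we did migrate, the backfill itself is a one-shot UPDATE per newly added column, something like this sketch (events, foo, and additional_properties are illustrative names; whether to also remove the key from the blob is part of the same question):

    import psycopg2

    conn = psycopg2.connect("dbname=example")
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE events
            SET foo = additional_properties->>'foo'
            WHERE additional_properties ? 'foo'
        """)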

What does Stitch do?

Well, aside from some of the "magic" which finds missing fields etc., it does also have an "errors" table it uses for debugging. This table ends up having a link to the source table, the source pk, and then the payload.

We could do something like this with a table which has text columns etc.

Proposal

I think it'd be pretty straightforward to add a flag to the config to allow persisting things into a "bonus" column.

target-postgres and target-redshift have a good deal of logic in place for something like this. The main "trick" will be getting CSV uploads to not fight with the JSON payloads, etc. (I think this will be fine).

As far as logic goes, I'd vote for a dumb "latest-write-wins" scenario, where whatever comes along last gets put into the bonus column. If this means that there actually aren't any additional properties on a payload, then it'd be an empty dict.

@AlexanderMann
Collaborator

So @airhorns, in my refactoring/future thinking today, I started piecing together what a PR for your proposal above would look like, and then I went a step further and started thinking through schema inspection.

Interestingly, that functionality is something which relies upon our batching logic herein, but not really a lot else. You could also make the argument that it relies upon our denesting logic.

Both of those things are distinctly different from the role of the target itself; i.e., they make life simpler for the code which is ultimately concerned with persisting data to the remote. Due to this and some ideas I've had brewing for a while, I started to think of transformations to the messages being output by taps as pipes.

Ideally, what we're talking about here with schema inspection is something which would transform a table_batch (a denested set of records the target should persist) into another table_batch, updating only the associated schemas.

To do this, I created a draft PR AlexanderMann#29 which starts by ripping out a bunch of the stream transformation logic we already have and turning it into a series of Python generators, to make the stream-processing nature of everything as apparent as possible. (This already sort of happens when we represent the input stream as an iterator using io.TextIOWrapper.)
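
To give a feel for the shape (the names here are made up; see the draft PR for the real code), the stream processing ends up as generators chained end to end:

    import io
    import json
    import sys

    def parse_messages(lines):
        for line in lines:
            yield json.loads(line)

    def inspect_schema(messages):
        # A "pipe" stage: pass messages through, optionally widening SCHEMA
        # messages based on what the RECORD messages actually contain.
        for message in messages:
            yield message

    def persist(messages):
        for message in messages:
            if message.get('type') == 'RECORD':
                pass  # denest, batch, and write to the remote here

    lines = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
    persist(inspect_schema(parse_messages(lines)))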

If you get a chance, I'd appreciate other eyes on the general concept of the code changes.

If this is more or less 👍, then adding in a function which takes table_batches and inspects each record becomes pretty easy, and also becomes something which we could possibly turn into another command, i.e.:

tap-hubspot | pipe-inspect-schema | target-postgres >> state.json
