Save properties not found in the jsonschema in a JSONB column #129
Comments
@airhorns Stitch does handle pretty much anything you can place in JSON. The engine behind that is actually pretty complex (think of handling type changes on nested records, etc.) and closed source. Mixpanel and some other SaaS sources come to mind; you can't always know the schema. A JSONB column on objects could work, but the complications come from:
Weighing in: maybe a simpler way to approach this is to ask what we want this to do ultimately. We have some desire to persist data forward into our target so that nothing gets lost along the way, right? My questions would mainly be around:
**Upsert for a row**

In this instance do we want to try and deep merge the value which is present in the remote? If we are unable to, due to something like a list vs. an object vs. a scalar, do we overwrite with the latest? Or do we want to wipe out the column on each update and simply use the latest payload each time?

**Upserting the schema**

- Dropping a column
  - Our schema declared value
  - Our schema declared value
- Adding a column
  - Schemas did not contain field

**What does Stitch do?**

Well, aside from some of the "magic" which finds missing fields etc., it does also have an "errors" table it uses for debugging. This table ends up having a link to the source table, the source pk, and then the payload. We could do something like this with a table which has text columns etc.

**Proposal**

I think it'd be pretty straightforward to add a flag to the config to allow persisting things into a "bonus" column.
As far as logic, I'd vote for a dumb "latest-write-wins" scenario, where whatever comes along last gets put into the bonus column. If a payload doesn't actually have any additional properties, then the column would just be an empty dict.
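For concreteness, here is a minimal sketch (not the actual target-postgres code) of what that latest-write-wins bonus column could look like: the record is split against its JSON Schema, the declared fields go to real columns, and everything else is serialized into a single JSONB column that the upsert simply overwrites. The names `separate_extras` and `_sdc_additional_properties` are assumptions for illustration.

```python
import json

BONUS_COLUMN = "_sdc_additional_properties"  # assumed column name, not the real one


def separate_extras(record, schema):
    """Split a record into schema-declared fields and everything else."""
    declared = set(schema.get("properties", {}))
    known = {k: v for k, v in record.items() if k in declared}
    extras = {k: v for k, v in record.items() if k not in declared}
    return known, extras


def upsert_sql(table, known_columns, key_columns):
    """Build a latest-write-wins upsert: the bonus column is overwritten wholesale."""
    cols = list(known_columns) + [BONUS_COLUMN]
    placeholders = ", ".join(["%s"] * len(cols))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_columns)}) DO UPDATE SET {updates}"
    )


record = {"id": 1, "email": "a@example.com", "utm_campaign": "spring"}
schema = {"properties": {"id": {"type": "integer"}, "email": {"type": "string"}}}

known, extras = separate_extras(record, schema)
params = list(known.values()) + [json.dumps(extras)]  # extras become the JSONB payload
print(upsert_sql("users", known.keys(), ["id"]))
print(params)
```

If a later record carries no extra properties, `extras` is an empty dict and the bonus column is overwritten with `{}`, which matches the "dumb" semantics described above.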
So @airhorns, in my refactoring/future thinking today, I started piecing together what a PR for your proposal above would look like, and then I went a step further and started thinking through schema inspection.

Interestingly, that functionality is something which relies upon our … For both of those things, they are distinctly different from the role of the … Ideally, what we're talking about here with schema inspection is something which would transform a …

To do this, I created a draft PR AlexanderMann#29 which starts by ripping out a bunch of the stream transformation logic we already have and turns it into a series of Python generators, to make the stream-processing nature of everything as apparent as possible (this already sort of happens when we represent the input stream as an iterator using …).

If you get a chance, I'd appreciate other eyes on the general concept of the code changes. If this is more or less 👍, then adding in a function which takes … would let you run something like:

`tap-hubspot | pipe-inspect-schema | target-postgres >> state.json`
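To make the generator idea concrete, here is a rough sketch, assuming simplified Singer message shapes rather than the actual code in the draft PR, of a pipe stage that passes messages through untouched while noting which fields actually appear in each stream:

```python
import json
import sys


def read_messages(lines):
    """Parse one Singer message per line from the tap's stdout."""
    for line in lines:
        yield json.loads(line)


def inspect_schema(messages, inferred):
    """Pass messages through unchanged while recording which fields show up."""
    for message in messages:
        if message.get("type") == "RECORD":
            stream = message["stream"]
            inferred.setdefault(stream, set()).update(message["record"].keys())
        yield message


def emit(messages, out=sys.stdout):
    """Terminal stage: re-serialize messages for the next process in the pipe."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    inferred = {}
    emit(inspect_schema(read_messages(sys.stdin), inferred))
    sys.stderr.write("observed fields: %r\n" % inferred)
```

Because each stage is a generator, nothing is buffered: records flow from the tap to the target one at a time, which is the stream-processing property the comment above is after.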
It makes a lot of sense that `target-postgres` tries to load data in a structured way into Postgres. The more the database knows about the structure of the data, the faster and more efficient it can be when querying it.

That said, lots of the Singer taps don't or can't know the exact, specific structure of the data they are extracting beforehand. Stitch itself deals with this in a variety of ways, but it does indeed deal with it somehow. In my experience feeding web analytics event data from something like Segment into Redshift using Stitch, the event properties can be named whatever the heck developers want, and Stitch somewhat magically adds columns each time a new property is discovered. This is nice because no schema has to be updated anywhere and the pipeline doesn't break while someone figures out where to update it.

`target-postgres` doesn't need to be that smart (after all, you could just use Stitch), but I think it could do a bit better than just dropping unknown columns' data (see #127). I think a nice middle ground would be to keep using the structured smarts that create nice tables with good relational structure for things specified by the JSONSchema, but then load unrecognized columns into a JSONB column called `additional_properties` or something like that. This would allow us to at least get a handle on the data from Postgres land, and if you want to break it out into a real column or do something useful with it, you could do so with a transform or SQL view or whatever. JSONB support is really quite nice in recent Postgres with all the partial indexing support and whatnot, and while it wouldn't be as fast as "real" columns, it can get close.

I think this may be undesirable behaviour for some users as well, though: they may be purposefully ignoring bad-quality data in some columns, or discarding it because it's simply way too big. That's OK; Singer's catalog functionality has mechanisms to ignore stuff that those people could use, and we should likely add a configuration option to disable this functionality in the event that folks don't want it. But if I've learned anything as a pipeline developer, it's that that data is probably there because someone thought it was important, so being able to capture it and save it somehow would be great.
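To illustrate "getting a handle on the data from Postgres land", here is a small sketch, assuming a hypothetical `events` table with an `additional_properties` JSONB column and an assumed psycopg2 connection string, of how a partial expression index plus a plain query could make one frequently used extra key cheap to work with:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # hypothetical database
cur = conn.cursor()

# A partial expression index keeps lookups on one hot key fast even though the
# column itself is schemaless; rows without that key are excluded from the index.
cur.execute(
    """
    CREATE INDEX events_utm_campaign_idx
    ON events ((additional_properties ->> 'utm_campaign'))
    WHERE additional_properties ? 'utm_campaign'
    """
)

# Pull the unstructured value back out; ->> returns text, so cast if you need more.
cur.execute(
    """
    SELECT id, additional_properties ->> 'utm_campaign' AS utm_campaign
    FROM events
    WHERE additional_properties ? 'utm_campaign'
    """
)
for row in cur.fetchall():
    print(row)

conn.commit()
cur.close()
conn.close()
```

Promoting such a key to a "real" column later is then just a one-off `ALTER TABLE` plus backfill, or a SQL view over the same expression, as the comment suggests.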
Sadly, JSONB support only landed in Postgres 9.4, which is much more recent than the 8.4.22 version mentioned in DECISIONS.md. 8.4.22 is five years old and unsupported by the Postgres community at this point, whereas 9.4 is still supported. So we could consider raising the minimum required Postgres version, but that seems like a bit much for a net-new feature that people have been getting by without so far. Instead, I'd say it'd be best to just detect JSONB support, and warn if the config option is enabled but the server can't support it, or something like that. Is there precedent for that kind of behaviour in the project?
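A minimal sketch of that detect-and-warn behaviour, assuming psycopg2 (whose connections expose a numeric `server_version`, e.g. 90422 for 9.4.22); the connection string and logger name are placeholders:

```python
import logging

import psycopg2

logger = logging.getLogger("target_postgres")


def jsonb_supported(conn):
    """JSONB first shipped in Postgres 9.4, i.e. server_version >= 90400."""
    return conn.server_version >= 90400


conn = psycopg2.connect("dbname=warehouse")  # hypothetical connection
if jsonb_supported(conn):
    pass  # safe to add the additional-properties JSONB column
else:
    logger.warning(
        "Postgres %s predates JSONB; skipping the additional-properties column",
        conn.server_version,
    )
```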
I will prepare a PR if y'all are into it!