Sometimes it's useful to run an EL pipeline with zero rows in order to create the target tables but not yet load data into them.
In the SDK for taps, we recently added --test=schema to emit only the SCHEMA messages without emitting any RECORD messages.
For non-SDK taps, it would be helpful to have a similar option that excludes all rows or all rows past the first n records.
Note:
For safety, we should probably also not pass along any STATE messages when running in this mode. Since we'd be dropping records intentionally, we would not want to pass a bookmark that implied records had been written which were not actually passed.
To the extent that the tap is still having to process records, there's still a performance hit from having to read all the records from source. However, most pipelines are constrained on target performance, so by skipping the write process, there should still be significant gains for many/most use cases.
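As a rough illustration of the behaviour described above, here is a minimal sketch of a filter that sits between a tap and a target. It assumes Singer messages arrive as one JSON object per line with a `type` field of `SCHEMA`, `RECORD`, or `STATE`. The function name `limit_records`, the `max_records` parameter, and the choice to count records per stream are all hypothetical and not part of Meltano or the SDK.

```python
import json
from typing import Dict, Iterable, Iterator


def limit_records(lines: Iterable[str], max_records: int = 0) -> Iterator[str]:
    """Forward SCHEMA messages, cap RECORD messages per stream, drop STATE.

    Hypothetical helper: name, parameter, and per-stream counting are
    assumptions made for illustration only.
    """
    seen: Dict[str, int] = {}  # RECORD messages forwarded so far, keyed by stream
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        message = json.loads(line)
        msg_type = message.get("type")
        if msg_type == "STATE":
            # Never forward bookmarks: records were dropped on purpose, so a
            # bookmark would claim progress that the target never saw.
            continue
        if msg_type == "RECORD":
            stream = message.get("stream", "")
            count = seen.get(stream, 0)
            if count >= max_records:
                continue  # past the first max_records rows for this stream
            seen[stream] = count + 1
        yield line
```

With `max_records=0` only the non-RECORD, non-STATE messages survive, which is enough for a target to create its tables. Note that the tap still reads every row from the source in this arrangement; only the write side is skipped.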
When trying out Meltano for the first time for a new source, I want to ensure it works for all the data types I use in that source, and that it maps those data types to something reasonable in my destination.
For my workloads – Postgres tables with 10B+ rows representing 5+ TBs of data – Meltano does not replicate in a reasonable amount of time. As a workaround, I want Meltano to create the table, I manually backfill in a performant way, and then I let Meltano take over for ongoing replication.
We are running into this issue as well. For tables with INCREMENTAL replication, workarounds are available. Some taps support a start date, which can be used to artificially just fetch a few recent rows. It's also possible to use state postfixes and state manipulation to set state to a recent value for the purposes of testing. None of these are great, because they are fiddly, and because they do not achieve the goal of getting N records-- each table will of course need a different state to guarantee getting some rows.
The big issue, however, is with FULL_TABLE replication. We have not found any workarounds for this case. State is ignored.
We tried using filters in meltano-map-transformer, but this doesn't prevent selecting all of the rows in the source-- it just prevents them from going to the target.
One solution is to handle this in the tap: if the Singer spec had contemplated testing, a --test mode that always retrieves the top N rows would have been great, but that ship has sailed.
Meltano could handle this by providing a configuration option to quit after N rows. I imagine that would go somewhere about here, as a different future waiting on a given number of rows from the tap:
That'd be a change to Meltano core, not meltano-map-transform, so I realize this post is somewhat in the wrong place. But I figured since others had run into this issue here, I'd post here first.
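To make the "quit after N rows" idea concrete, here is a minimal sketch of a reader that stops consuming tap output once N RECORD messages have been passed along. This is not Meltano's implementation; the function name `take_until_n_records` is hypothetical, and it assumes one JSON Singer message per line. Unlike a filter, it stops pulling from the source iterator, so a caller could then terminate the tap process instead of letting it read the whole table.

```python
import json
from typing import Iterable, Iterator


def take_until_n_records(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield messages until n RECORD messages have been passed, then stop.

    Hypothetical helper for illustration. STATE messages are dropped so that
    no bookmark is written for a deliberately truncated run.
    """
    records = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        msg_type = json.loads(line).get("type")
        if msg_type == "STATE":
            continue
        if msg_type == "RECORD":
            if records >= n:
                # Stop reading here; the caller can now shut the tap down.
                return
            records += 1
        yield line
```

One caveat with a single global count: streams that the tap would have emitted later never get their SCHEMA message, so their tables would not be created. A per-stream limit avoids that but cannot stop the tap early.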