Idea: Destinations for relational databases should operate in smaller batches. #44375
evantahler started this conversation in Connector Ideas and Features
Replies: 2 comments
- Reference issues:
- +1 on this. Unfortunately, not everyone has a data lake.
---
Today, Airbyte destinations for relational databases (e.g. Postgres) operate similarly to data warehouse and data lake destinations: we stage the data as soon as it is ready, in reasonably sized upload batches, and then at the end of the sync/stream we type and dedupe (T&D) the staged data. This copies data from the raw table into the final table in a single transaction. That is the ideal flow for warehouse destinations, which prefer to operate in large batches and are cost-sensitive to the number of full-table queries performed (e.g. for deduplication).
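For concreteness, here is a minimal sketch of that single-transaction flow, assuming a Postgres destination driven via psycopg2 and hypothetical, already-typed table/column names (`airbyte_raw_users`, final table `users`); this is illustrative, not the actual connector code:

```python
# Illustrative sketch only: the whole T&D step runs as one big merge in a
# single transaction at the end of the sync. Table and column names are
# hypothetical, and typing is elided for brevity.
import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN

with conn:  # one transaction wrapping the entire T&D
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO users (id, name, updated_at)
            SELECT DISTINCT ON (id) id, name, updated_at
            FROM airbyte_raw_users          -- all staged rows at once
            ORDER BY id, updated_at DESC    -- keep the newest row per id
            ON CONFLICT (id) DO UPDATE
                SET name = EXCLUDED.name,
                    updated_at = EXCLUDED.updated_at;
        """)
```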
However, as relational databases are not good warehouses...
...they often do quite poorly at handling large batches of changes (due to the need to maintain transactionality, write a CDC log for every change, etc.). While the staging inserts into the raw tables generally complete in a reasonable timeframe, the final T&D often does not (and sometimes fails to complete at all).
I think we can do better for relational databases if we T&D in chunks. We still want to keep the T&D of each chunk within a single transaction (so there are never duplicate records in the final table), but as long as we build the final table incrementally in cursor order, we can build it up slowly.
E.g.:
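A minimal sketch of what chunked T&D could look like, continuing the hypothetical schema from the sketch above (`updated_at` stands in for the cursor column; the chunk size, resume logic, and all names are assumptions, not the actual connector design):

```python
# Hypothetical chunked T&D: merge the raw table into the final table one
# cursor-ordered slice at a time, each slice in its own transaction. The
# upsert keeps the final table duplicate-free even though it is built
# incrementally; DISTINCT ON dedupes within a chunk so the upsert never
# touches the same id twice in one statement.
import psycopg2

CHUNK_SIZE = 10_000  # illustrative; real tuning would be per-database

conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN
last_cursor = None  # high-water mark of the cursor column

while True:
    with conn:  # one transaction per chunk, not per sync
        with conn.cursor() as cur:
            cur.execute(
                """
                WITH chunk AS (
                    SELECT id, name, updated_at
                    FROM airbyte_raw_users
                    WHERE %(last)s::timestamptz IS NULL
                       OR updated_at > %(last)s::timestamptz
                    ORDER BY updated_at   -- cursor order
                    LIMIT %(limit)s
                )
                INSERT INTO users (id, name, updated_at)
                SELECT DISTINCT ON (id) id, name, updated_at
                FROM chunk
                ORDER BY id, updated_at DESC
                ON CONFLICT (id) DO UPDATE
                    SET name = EXCLUDED.name,
                        updated_at = EXCLUDED.updated_at
                RETURNING updated_at;
                """,
                {"last": last_cursor, "limit": CHUNK_SIZE},
            )
            merged = cur.fetchall()
    if not merged:
        break  # raw table fully merged into the final table
    # Caveat: ties on the cursor value across a chunk boundary would need a
    # compound cursor (e.g. (updated_at, id)) to avoid skipping rows.
    last_cursor = max(row[0] for row in merged)
```

A side benefit of this shape: because each chunk commits independently in cursor order, a sync that dies partway through T&D could resume from the high-water mark instead of redoing the whole merge.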
This issue is not discussing: