Idea: Destinations for relational databases should operate in smaller batches. #44375
evantahler started this conversation in Connector Ideas and Features
Replies: 2 comments
- Reference issues:
- +1 on this. Unfortunately, not everyone has a data lake.
---
Today, Airbyte destinations for relational databases (e.g. Postgres) operate similarly to data warehouse and data lake destinations: we stage the data as soon as it is ready, in reasonably sized upload batches, and then at the end of the sync/stream we type and dedupe (T&D) the staged data. This copies data from the raw table into the final table in a single transaction. That is the ideal flow for warehouse destinations, which prefer to operate in large batches and are cost-sensitive to the number of full-table queries performed (e.g. for deduplication).
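For concreteness, here is a minimal sketch of that single-transaction flow, assuming a Postgres destination driven via psycopg2 and hypothetical, already-typed table/column names (`airbyte_raw_users`, final table `users`); this is illustrative, not the actual connector code:

```python
# Illustrative sketch only: the whole T&D step runs as one big merge in a
# single transaction at the end of the sync. Table and column names are
# hypothetical, and typing is elided for brevity.
import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN

with conn:  # one transaction wrapping the entire T&D
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO users (id, name, updated_at)
            SELECT DISTINCT ON (id) id, name, updated_at
            FROM airbyte_raw_users          -- all staged rows at once
            ORDER BY id, updated_at DESC    -- keep the newest row per id
            ON CONFLICT (id) DO UPDATE
                SET name = EXCLUDED.name,
                    updated_at = EXCLUDED.updated_at;
        """)
```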
However, as relational databases are not good warehouses...
...they often do quite poorly at handling large batches of changes (due to the need to maintain transactionality, write a CDC log for every change, etc.). While the staging inserts into the raw tables generally complete in a reasonable timeframe, the final T&D often does not (and sometimes fails to complete at all).
I think we can do better for relational databases if we T&D in chunks. We still want to keep the T&D of each chunk within a single transaction (so there are never duplicate records in the final table), but as long as we build the final table incrementally in cursor order, we can build it up slowly.
E.g.:
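A minimal sketch of what chunked T&D could look like, continuing the hypothetical schema from the sketch above (`updated_at` stands in for the cursor column; the chunk size, resume logic, and all names are assumptions, not the actual connector design):

```python
# Hypothetical chunked T&D: merge the raw table into the final table one
# cursor-ordered slice at a time, each slice in its own transaction. The
# upsert keeps the final table duplicate-free even though it is built
# incrementally; DISTINCT ON dedupes within a chunk so the upsert never
# touches the same id twice in one statement.
import psycopg2

CHUNK_SIZE = 10_000  # illustrative; real tuning would be per-database

conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN
last_cursor = None  # high-water mark of the cursor column

while True:
    with conn:  # one transaction per chunk, not per sync
        with conn.cursor() as cur:
            cur.execute(
                """
                WITH chunk AS (
                    SELECT id, name, updated_at
                    FROM airbyte_raw_users
                    WHERE %(last)s::timestamptz IS NULL
                       OR updated_at > %(last)s::timestamptz
                    ORDER BY updated_at   -- cursor order
                    LIMIT %(limit)s
                )
                INSERT INTO users (id, name, updated_at)
                SELECT DISTINCT ON (id) id, name, updated_at
                FROM chunk
                ORDER BY id, updated_at DESC
                ON CONFLICT (id) DO UPDATE
                    SET name = EXCLUDED.name,
                        updated_at = EXCLUDED.updated_at
                RETURNING updated_at;
                """,
                {"last": last_cursor, "limit": CHUNK_SIZE},
            )
            merged = cur.fetchall()
    if not merged:
        break  # raw table fully merged into the final table
    # Caveat: ties on the cursor value across a chunk boundary would need a
    # compound cursor (e.g. (updated_at, id)) to avoid skipping rows.
    last_cursor = max(row[0] for row in merged)
```

A side benefit of this shape: because each chunk commits independently in cursor order, a sync that dies partway through T&D could resume from the high-water mark instead of redoing the whole merge.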
This issue is not discussing: