Rotate over tables not sources for extract #81

tvogels01 · 2018-01-04T20:43:45Z

Summary

Let's switch to rotating (round-robin style) over tables not sources during extracts.

The current implementation of extract runs one thread per source and then serially dumps data from the tables in that source. The suggested implementation would create an initial list of tables in an interleaved fashion (see details) and then work off that list with a pre-set level of concurrency.

Details

We have observed problems where we issue too many extracts at once and the master node, which has to handle the many concurrent Sqoop runs, runs out of memory. The number of concurrent Sqoop runs cannot currently be limited and may go as high as the number of sources defined in the ETL configuration. So the more sources are defined, the higher the probability of a failure on the master node (and the higher the pressure on resource allocation for the containers).

To avoid this situation and to be able to have a deterministically limited number of concurrent Sqoop runs, we should switch to an implementation that orients itself along a list of tables, not a list of sources.

In order to preserve the current advantage of having only one extract running against any upstream source at a time, we can start with an interleaved list of tables and move on to some locking mechanisms if ever needed.

Currently

Thread 1: source1.table1, source1.table2, source1.table3
Thread 2: source2.table1, source2.table2
Thread 3: source3.table1, source3.table2, source3.table3, source3.table4

Proposed

Initial list: source1.table1, source2.table1, source3.table1, source1.table2, source2.table2, source3.table2, source1.table3, source3.table3, source3.table4
Then:
Thread 1: source1.table1, ...
Thread 2: source2.table1, ...
Thread 3: source3.table1, ...

Exact execution order in the threads then depends on how long the extracts take.
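
To make the proposal concrete, here is a minimal sketch in Python of building the interleaved list and working it off with a fixed number of threads. The names `interleave_tables`, `extract_all`, `extract_table`, and `max_concurrency` are hypothetical and not taken from the existing code base; `extract_table` stands in for whatever actually launches the Sqoop run. For the three sources in the example above, `interleave_tables` should yield the same order as the "Initial list".

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest


def interleave_tables(tables_by_source):
    """Round-robin over sources: first table of each source, then the second, and so on."""
    missing = object()  # marks sources that have run out of tables
    rounds = zip_longest(*tables_by_source.values(), fillvalue=missing)
    return [table for round_ in rounds for table in round_ if table is not missing]


def extract_all(tables_by_source, extract_table, max_concurrency):
    """Run extract_table (hypothetical callable) over the interleaved list
    with at most max_concurrency extracts in flight."""
    ordered = interleave_tables(tables_by_source)
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Tasks are started in list order; results come back in list order.
        return list(pool.map(extract_table, ordered))


# Example input mirroring the sources and tables used in this issue.
tables_by_source = {
    "source1": ["source1.table1", "source1.table2", "source1.table3"],
    "source2": ["source2.table1", "source2.table2"],
    "source3": ["source3.table1", "source3.table2", "source3.table3", "source3.table4"],
}
```

Note that interleaving only makes it likely, not certain, that a single extract runs against any one upstream source at a time: if extracts for one source finish quickly, two tables of another source may overlap. That is the case where a locking mechanism would come in, if ever needed.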

An additional benefit of switching to processing a list is that we can address failed extracts by re-queueing the table.
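
Re-queueing could look roughly like the following sketch, which uses a shared work queue and a fixed number of worker threads. Everything here is an assumption for illustration: `extract_with_requeue`, `extract_table`, and `max_attempts` are hypothetical names, and the retry limit is added only so that a permanently failing table cannot loop forever.

```python
import queue
import threading


def extract_with_requeue(ordered_tables, extract_table, max_concurrency, max_attempts=3):
    """Work off a list of tables; put a failed table back at the end of the queue.

    Returns the tables that still failed after max_attempts tries.
    """
    work = queue.Queue()
    for table in ordered_tables:
        work.put((table, 1))  # (table name, attempt number)
    failed = []
    failed_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                work.task_done()
                return
            table, attempt = item
            try:
                extract_table(table)
            except Exception:
                if attempt < max_attempts:
                    work.put((table, attempt + 1))  # re-queue behind pending tables
                else:
                    with failed_lock:
                        failed.append(table)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_concurrency)]
    for thread in threads:
        thread.start()
    work.join()  # returns once every table has succeeded or exhausted its attempts
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return failed
```

Putting the failed table at the end of the queue gives the upstream source some time to recover before the retry, and the number of concurrent extracts stays bounded by `max_concurrency` throughout.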

tvogels01 added component: extract feature labels Jan 4, 2018

tvogels01 mentioned this issue Jan 5, 2018

Extract parallelism might exceed capacity and fail #73

Open
