Rotate over tables not sources for extract #81

Open
tvogels01 opened this issue Jan 4, 2018 · 0 comments

Summary

Let's switch to rotating (round-robin style) over tables, not sources, during extracts.

The current implementation of extract runs one thread per source and then dumps the tables in that source serially. The suggested implementation would create an initial list of tables in an interleaved fashion (see details) and then work off that list with a preset level of concurrency.

Details

We have observed problems where we issue too many extracts at once and the master node, which has to handle all of the concurrent Sqoop runs, runs out of memory. The number of concurrent Sqoop runs cannot currently be limited and may go as high as the number of sources defined in the ETL configuration. So the more sources are defined, the higher the probability of a failure on the master node (and the higher the pressure on resource allocation for the containers).

To avoid this situation and to have a deterministic limit on the number of concurrent Sqoop runs, we should switch to an implementation that iterates over a list of tables, not a list of sources.

To preserve the current advantage of having only one extract running against any upstream source at a time, we can start with an interleaved list of tables and add a locking mechanism later if it is ever needed.

Currently

  • Thread 1: source1.table1, source1.table2, source1.table3
  • Thread 2: source2.table1, source2.table2
  • Thread 3: source3.table1, source3.table2, source3.table3, source3.table4

Proposed

  • Initial list: source1.table1, source2.table1, source3.table1, source1.table2, source2.table2, source3.table2, source1.table3, source3.table3, source3.table4

Then:

  • Thread 1: source1.table1, ...
  • Thread 2: source2.table1, ...
  • Thread 3: source3.table1, ...

The exact execution order across the threads then depends on how long each extract takes.
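
The interleaving and the bounded concurrency could look roughly like the sketch below. It is only an illustration of the idea, not the project's code: `interleave_tables`, `extract_all`, `extract_table`, and `max_concurrency` are hypothetical names, and the table names are the ones from the example above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain, zip_longest


def interleave_tables(tables_by_source):
    """Round-robin over sources: first table of each source, then the second, and so on."""
    missing = object()  # placeholder for sources that have run out of tables
    rounds = zip_longest(*tables_by_source.values(), fillvalue=missing)
    return [table for table in chain.from_iterable(rounds) if table is not missing]


def extract_all(tables, extract_table, max_concurrency):
    """Work off the list in order with at most max_concurrency extracts in flight."""
    # extract_table is a hypothetical callable standing in for one Sqoop run.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(extract_table, tables))


tables_by_source = {
    "source1": ["source1.table1", "source1.table2", "source1.table3"],
    "source2": ["source2.table1", "source2.table2"],
    "source3": ["source3.table1", "source3.table2", "source3.table3", "source3.table4"],
}
initial_list = interleave_tables(tables_by_source)
```

With this shape, the number of concurrent extracts is set by `max_concurrency` alone and no longer grows with the number of sources. Note that interleaving only makes it unlikely, not impossible, for two extracts to hit the same source at once.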

An additional benefit of switching to processing a list is that we can address failed extracts by re-queueing the table.
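
Re-queueing could be sketched as follows, assuming a shared queue that a fixed number of worker threads drain. The names `run_with_requeue` and `extract_table` are hypothetical, and the `max_attempts` cap is an assumption added here so that a permanently failing table cannot loop forever; the issue itself does not specify a retry policy.

```python
import queue
import threading


def run_with_requeue(tables, extract_table, max_concurrency=3, max_attempts=2):
    """Drain a queue of tables; a failed table goes back to the end of the queue."""
    work = queue.Queue()
    for table in tables:
        work.put((table, 1))  # (table name, attempt number)
    succeeded, failed = [], []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: nothing left to do
                work.task_done()
                return
            table, attempt = item
            try:
                extract_table(table)  # hypothetical callable for one extract
            except Exception:
                if attempt < max_attempts:
                    # Re-queue before task_done() so that join() keeps waiting.
                    work.put((table, attempt + 1))
                else:
                    with lock:
                        failed.append(table)
            else:
                with lock:
                    succeeded.append(table)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_concurrency)]
    for thread in threads:
        thread.start()
    work.join()  # returns once every table has succeeded or exhausted its attempts
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return succeeded, failed
```

Because a retried table goes to the back of the queue, other tables get a turn before it is attempted again.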
