You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Let's switch to rotating (round-robin style) over tables not sources during extracts.
The current implementation of extract runs one thread per source and then serially dumps data from tables in that source. The suggested implementation would create an initial list of tables in an interleaved fashion (see details) and then work off that list with pre-set concurrency.
Details
We have observed problems where we're issuing too many extracts and the master node which needs to handle the many concurrent Sqoops runs out of memory. The number of concurrent Sqoops cannot be limited and may go as high as the number of sources defined the ETL configuration. So the more sources are defined, the higher the probability of a failure on the master node (and also the higher the pressure on the resource allocation against the containers).
To avoid this situation and to be able to have a deterministically limited number of concurrent Sqoop runs, we should switch to an implementation that orients itself along a list of tables, not a list of sources.
In order to preserve the current advantage of having only one extract running against any upstream source at a time, we can start with an interleaved list of tables and move on to some locking mechanisms if ever needed.
Summary
Let's switch to rotating (round-robin style) over tables not sources during extracts.
The current implementation of extract runs one thread per source and then serially dumps data from tables in that source. The suggested implementation would create an initial list of tables in an interleaved fashion (see details) and then work off that list with pre-set concurrency.
Details
We have observed problems where we're issuing too many extracts and the master node which needs to handle the many concurrent Sqoops runs out of memory. The number of concurrent Sqoops cannot be limited and may go as high as the number of sources defined the ETL configuration. So the more sources are defined, the higher the probability of a failure on the master node (and also the higher the pressure on the resource allocation against the containers).
To avoid this situation and to be able to have a deterministically limited number of concurrent Sqoop runs, we should switch to an implementation that orients itself along a list of tables, not a list of sources.
In order to preserve the current advantage of having only one extract running against any upstream source at a time, we can start with an interleaved list of tables and move on to some locking mechanisms if ever needed.
Currently
Proposed
Then:
Exact execution order in the threads then depends on how long the extracts take.
An additional benefit of switching to processing a list is that we can address failed extracts by re-queueing the table.
The text was updated successfully, but these errors were encountered: