-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix RemoteFileUtil to download in parallel as expected. #5515
Conversation
|
||
/** | ||
* Download a batch of remote {@link URI}s in parallel, using at most numThreads to do so. | ||
* `numThreads` may not be larger than the number of available processors * 4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `numThreads` may not be larger than the number of available processors * 4. | |
* Specifying a `numThreads` greater than number of available processors * 4 will have the same effect as specifying `numThreads` equal to the available processors * 4. |
Or maybe
* `numThreads` may not be larger than the number of available processors * 4. | |
* `numThreads` should at maximum be equal to the number of available processors * 4. |
and then actually clamp the size of the thread pool
* @return {@link Path}s to the downloaded local files. | ||
*/ | ||
public List<Path> download(List<URI> srcs, int numThreads) { | ||
final ExecutorService executor = Executors.newFixedThreadPool(numThreads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably move this as transient member, initializing it in the constructor.
The class can implement AutoClosable
to shutdown the thread pool.
This is mainly relevant for the FileDownloadDoFn
where we don't want to create a new thread pool for every url batch.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5515 +/- ##
=======================================
Coverage 61.43% 61.43%
=======================================
Files 312 312
Lines 11103 11103
Branches 762 762
=======================================
Hits 6821 6821
Misses 4282 4282 ☔ View full report in Codecov by Sentry. |
Thanks @psobot ! |
My pleasure - thanks for reworking this to get it merged @RustedBones and @kellen! |
The docs for
RemoteFileUtil::download
say:...however, no parallel downloading happens. The concurrency arguments passed to
LoadingCache
control how many concurrent writes to the cache are allowed concurrently from callers on different threads, but callingLoadingCache::get
triggers a load to happen serially.This PR fixes this by creating a new
ExecutorService
to download the provided URIs in parallel within thedownload
function. Single-file downloads (i.e.:download(URI)
instead ofdownload(List<URI>)
) are not affected.