Fix RemoteFileUtil to download in parallel as expected. #5515

Merged
6 commits merged into spotify:main on Nov 5, 2024

Conversation

@psobot (Member) commented on Oct 18, 2024

The docs for RemoteFileUtil::download say:

Download a batch of remote URIs in parallel.

...however, no parallel downloading actually happens. The concurrency arguments passed to LoadingCache only control how many callers on different threads may write to the cache at once; each call to LoadingCache::get still performs its load serially.

This PR fixes that by creating a new ExecutorService inside the download function to fetch the provided URIs in parallel. Single-file downloads (i.e. download(URI) rather than download(List<URI>)) are not affected.
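For illustration only, a minimal, self-contained sketch of that approach (names such as ParallelDownloadSketch and downloadSingle are hypothetical and not the code in this PR): one task per URI is submitted to a fixed-size pool, and the results are collected back in input order.

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDownloadSketch {

  // Submit one download task per URI, then block on each Future so the
  // returned paths line up with the input order.
  public static List<Path> download(List<URI> srcs, int numThreads) throws Exception {
    final ExecutorService executor = Executors.newFixedThreadPool(numThreads);
    try {
      final List<Future<Path>> futures = new ArrayList<>(srcs.size());
      for (URI src : srcs) {
        futures.add(executor.submit(() -> downloadSingle(src)));
      }
      final List<Path> paths = new ArrayList<>(srcs.size());
      for (Future<Path> future : futures) {
        paths.add(future.get()); // rethrows any download failure
      }
      return paths;
    } finally {
      executor.shutdown();
    }
  }

  // Placeholder for the per-URI download; copies the remote stream to a temp file.
  private static Path downloadSingle(URI src) throws IOException {
    final Path dst = Files.createTempFile("download-", ".tmp");
    try (InputStream in = src.toURL().openStream()) {
      Files.copy(in, dst, StandardCopyOption.REPLACE_EXISTING);
    }
    return dst;
  }
}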

@psobot added the enhancement label on Oct 18, 2024
@psobot requested a review from kellen on October 18, 2024 13:38

/**
* Download a batch of remote {@link URI}s in parallel, using at most numThreads to do so.
* `numThreads` may not be larger than the number of available processors * 4.
Contributor
Suggested change:
- * `numThreads` may not be larger than the number of available processors * 4.
+ * Specifying a `numThreads` greater than the number of available processors * 4 will have the same effect as specifying `numThreads` equal to the number of available processors * 4.

Or maybe

Suggested change:
- * `numThreads` may not be larger than the number of available processors * 4.
+ * `numThreads` should at maximum be equal to the number of available processors * 4.

and then actually clamp the size of the thread pool
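For example, a small sketch of the clamping variant might look like this (illustrative only; the merged change may differ):

int maxThreads = Runtime.getRuntime().availableProcessors() * 4;
// Cap the pool at 4x the available processors and never drop below one thread.
int poolSize = Math.max(1, Math.min(numThreads, maxThreads));
final ExecutorService executor = Executors.newFixedThreadPool(poolSize);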

* @return {@link Path}s to the downloaded local files.
*/
public List<Path> download(List<URI> srcs, int numThreads) {
final ExecutorService executor = Executors.newFixedThreadPool(numThreads);
Contributor

We should probably move this to a transient member, initializing it in the constructor.
The class can implement AutoCloseable to shut down the thread pool.

This is mainly relevant for the FileDownloadDoFn, where we don't want to create a new thread pool for every URL batch.
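A rough sketch of that shape (the class name and the omitted download body are illustrative, not the merged implementation): the pool becomes a long-lived member that is shut down via close().

import java.net.URI;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RemoteFileUtilSketch implements AutoCloseable {
  // transient: the pool itself is never serialized with the enclosing object
  // (relevant when the utility is captured by a Beam DoFn); after
  // deserialization it would need to be re-created, which is not shown here.
  private transient ExecutorService executor;

  public RemoteFileUtilSketch(int numThreads) {
    this.executor = Executors.newFixedThreadPool(numThreads);
  }

  public List<Path> download(List<URI> srcs) {
    // Submit per-URI tasks to `executor` as in the sketch above.
    throw new UnsupportedOperationException("omitted for brevity");
  }

  @Override
  public void close() {
    executor.shutdown();
  }
}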

codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.43%. Comparing base (489bd7a) to head (9ef2b56).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5515   +/-   ##
=======================================
  Coverage   61.43%   61.43%           
=======================================
  Files         312      312           
  Lines       11103    11103           
  Branches      762      762           
=======================================
  Hits         6821     6821           
  Misses       4282     4282           


@RustedBones RustedBones merged commit f4a8cc5 into spotify:main Nov 5, 2024
11 checks passed
@RustedBones (Contributor)

Thanks @psobot!

@psobot (Member, Author) commented on Nov 5, 2024

My pleasure - thanks for reworking this to get it merged @RustedBones and @kellen!

@psobot psobot deleted the rfu-parallel branch November 5, 2024 15:36