Should the adapter be able to connect to a remote data store? #22

Open
slobentanzer opened this issue Oct 30, 2024 · 1 comment
Labels
question Further information is requested

Comments

@slobentanzer
Collaborator

It is within the realm of possibility that the adapter solution we provide will be used by the community in the compute environment of Open Targets (Google Cloud). Thus, we may benefit from an adapter design (data ingestion package) that can connect to a cloud storage location holding the Open Targets pipeline data, as opposed to downloading it and processing it locally. Streaming data from its original cloud location could bring nice performance improvements if we deploy in this way, so this is worth keeping in mind when choosing a package for ingesting the .parquet files.
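
As an illustration only, a minimal sketch of what such streaming ingestion could look like with Polars; the bucket path, release layout, and column names below are hypothetical placeholders, not the actual Open Targets structure:

```python
import polars as pl

# Lazily scan Parquet files directly from a cloud bucket instead of downloading
# them first; data is only fetched when .collect() is called, and only for the
# columns/rows actually needed. The gs:// path and column names are placeholders.
lf = pl.scan_parquet(
    "gs://example-open-targets-bucket/release/targets/*.parquet",
    # storage_options={...},  # credentials, if the bucket is not public
)

targets = lf.select(["id", "approvedSymbol"]).head(5).collect()
print(targets)
```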

@slobentanzer slobentanzer moved this to Todo in OTAR3088 Oct 30, 2024
@slobentanzer slobentanzer added the question Further information is requested label Oct 30, 2024
@kpto
Collaborator

kpto commented Oct 30, 2024

Both duckdb and Polars support HTTPS, S3, and Google Cloud Storage.
duckdb additionally supports Cloudflare R2, while Polars additionally supports Azure.

References:
duckdb: https://duckdb.org/docs/guides/network_cloud_storage/overview
Polars: https://docs.pola.rs/api/python/dev/reference/api/polars.read_parquet.html

Given that a lot of data science packages are designed with cloud storage in mind, this shouldn't be difficult.
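
As a rough sketch (not tested against the actual data location), the duckdb equivalent over HTTPS could look like this; the URL is a placeholder, and the httpfs extension provides the remote access:

```python
import duckdb

con = duckdb.connect()
# httpfs adds HTTP(S)/S3 support; newer duckdb versions can autoload it,
# but installing/loading explicitly keeps the sketch self-contained.
con.install_extension("httpfs")
con.load_extension("httpfs")

# The URL below is a hypothetical placeholder for a single Parquet part file.
df = con.execute(
    "SELECT * FROM read_parquet('https://example.org/targets/part-00000.parquet') LIMIT 5"
).fetchdf()
print(df)
```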
