BigQuery scripts to import/export GeoParquet files #113
Conversation
```diff
@@ -6,12 +6,14 @@ authors = []
 license = "MIT"

 [tool.poetry.dependencies]
-python = "^3.8"
+python = ">=3.8,<3.11"
```
Google BigQuery requires this Python version range.
```python
if mode.upper() == 'FOLDER':
    # We need to export to multiple files, because a single file might hit
    # BigQuery limits (UDF out of memory).
    # https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
    pq.write_to_dataset(arrow_table, root_path=output, partition_cols=['__partition__'], compression=compression)
```
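For context, a minimal sketch of how such a partitioned export can be driven end to end with pyarrow, assuming a synthetic `__partition__` column is added beforehand; the placeholder table, chunk size, and output path here are illustrative assumptions, not taken from the PR:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

ROWS_PER_FILE = 500_000  # assumed chunk size; tune to stay under BigQuery/UDF limits

# Placeholder table standing in for the real query result
arrow_table = pa.table({"geom": ["POINT(0 0)", "POINT(1 1)"], "name": ["a", "b"]})

# Bucket rows so write_to_dataset splits the output into one file per bucket
buckets = np.arange(arrow_table.num_rows) // ROWS_PER_FILE
arrow_table = arrow_table.append_column("__partition__", pa.array(buckets))

# Writes <root>/__partition__=0/..., <root>/__partition__=1/..., etc.
pq.write_to_dataset(
    arrow_table,
    root_path="geography_usa_blockgroup_2019",
    partition_cols=["__partition__"],
    compression="snappy",
)
```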
We should still have a discussion on best practices for partitioned datasets (#79), specifically around whether the `_metadata` file is required, suggested, etc.
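For reference, one way a dataset-level `_metadata` file can be produced with pyarrow, should the discussion in #79 land on recommending it. This is a sketch under assumed paths and reuses the `arrow_table` from the export step above; it is not something this PR does:

```python
import pyarrow.parquet as pq

# Collect each written file's footer metadata during the partitioned write
metadata_collector = []
pq.write_to_dataset(
    arrow_table,
    root_path="geography_usa_blockgroup_2019",
    partition_cols=["__partition__"],
    metadata_collector=metadata_collector,
)

# Merge the collected footers into a single _metadata file at the dataset root.
# The partition column is dropped from the schema because partitioned files
# don't physically contain it.
pq.write_metadata(
    arrow_table.drop(["__partition__"]).schema,
    "geography_usa_blockgroup_2019/_metadata",
    metadata_collector=metadata_collector,
)
```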
Good point, will follow the conversation there
```toml
pyarrow = "^7.0.0"
geopandas = "^0.10.2"
pygeos = "^0.12.0"
pandas = "^1.4.2"
click = "^8.1.2"
google-cloud-bigquery = "^3.2.0"
db-dtypes = "^1.0.2"
```
Is `db-dtypes` used? It doesn't appear to be imported.
I added it because google-cloud-bigquery complained and asked me to install it, so I did.
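For context (a reconstruction, not verified against the scripts): google-cloud-bigquery only needs db-dtypes when query results are materialized as a pandas DataFrame, which is presumably where the complaint came from:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT CURRENT_DATE() AS d")

# to_dataframe() needs db-dtypes to map BigQuery DATE/TIME columns onto
# pandas extension dtypes; without the package installed it raises at runtime
df = job.to_dataframe()
print(df.dtypes)  # "d" comes back as the dbdate dtype provided by db-dtypes
```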
Closing this, as we'll move it over to https://github.com/geoparquet/bigquery-converter and keep the main geoparquet repo more focused on the spec.
This PR allows:

- Converting the result of a BigQuery SQL query to GeoParquet
- Uploading a GeoParquet file or folder to BigQuery

It doesn't aim to be a production-ready solution; in the future this will be supported natively by BigQuery. In the meantime, however, it provides a way to work with GeoParquet and BigQuery.
Convert a SQL query to parquet:

```sh
poetry run python bigquery_to_parquet.py \
    --input-query "SELECT * FROM carto-do-public-data.carto.geography_usa_blockgroup_2019" \
    --primary-column geom \
    --output geography_usa_blockgroup_2019
```
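Under the hood this is roughly the flow one would expect from such a script: run the query, parse the primary geography column, and write Parquet. A hedged sketch only; the actual internals of bigquery_to_parquet.py are not reproduced here, and the WKT parsing and CRS are assumptions:

```python
import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()
df = client.query(
    "SELECT * FROM `carto-do-public-data.carto.geography_usa_blockgroup_2019`"
).to_dataframe()

# BigQuery GEOGRAPHY columns arrive as WKT strings; parse the primary
# column into real geometries (BigQuery geographies are WGS84)
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkt(df["geom"]), crs="EPSG:4326")
gdf.to_parquet("geography_usa_blockgroup_2019.parquet")
```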
Upload a parquet file or folder to BigQuery:

```sh
poetry run python parquet_to_bigquery.py \
    --input geography_usa_blockgroup_2019 \
    --output "cartodb-gcp-backend-data-team.alasarr.geography_usa_blockgroup_2019"
```
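And a sketch of the single-file upload path, using the standard BigQuery load-job API; again, this mirrors what a script like parquet_to_bigquery.py would likely do rather than its verified contents:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Stream the local Parquet file into the destination table via a load job
with open("geography_usa_blockgroup_2019.parquet", "rb") as f:
    job = client.load_table_from_file(
        f,
        "cartodb-gcp-backend-data-team.alasarr.geography_usa_blockgroup_2019",
        job_config=job_config,
    )
job.result()  # block until the load job completes
```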
I've extracted some of the code generated in #87 into a shared module.