During the publish phase we have everything we need to create an external schema in Redshift or register the metadata for Athena. Since we know the data is in AWS, this would be a hugely powerful addition to the current functionality.
Pseudocode
parq.register(target="Redshift")
Why?
Registering a schema at publish makes the written data immediately queryable via any SQL workbench tool. We should standardize that the external schema is everything in the path leading up to the dataset name, and the table is the dataset name. So for a path
s3://bananabucket/this/is/a/prefix/dataset/id=123/name=steve/asf809dg8jkljsd12.parquet
the external schema to register would be bananabucket_this_is_a_prefix and the table would be dataset. So querying it via Spectrum / Athena would be
SELECT * FROM bananabucket_this_is_a_prefix.dataset WHERE id > 122
... WOAH.
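To make the naming convention concrete, here is a minimal sketch of how the schema and table names could be derived from an S3 path. The function name `names_from_s3_path` and the `ExternalTable` type are hypothetical and not part of the library. It also assumes the dataset name is the last path segment before the first `key=value` partition directory, which the proposal implies but does not spell out.

```python
from typing import Dict, NamedTuple
from urllib.parse import urlparse


class ExternalTable(NamedTuple):
    """Hypothetical container for the names derived from an S3 path."""
    schema: str
    table: str
    partitions: Dict[str, str]


def names_from_s3_path(path: str) -> ExternalTable:
    """Derive external schema, table and partition values from an S3 path.

    Assumption: the dataset name is the last segment before the first
    key=value partition directory (or the last directory if there are
    no partitions).
    """
    parsed = urlparse(path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 path: {path!r}")

    segments = [s for s in parsed.path.split("/") if s]
    # Drop the trailing object name so only directories remain.
    if segments and segments[-1].endswith(".parquet"):
        segments = segments[:-1]

    # Everything from the first key=value segment onward is a partition.
    first_partition = next(
        (i for i, s in enumerate(segments) if "=" in s), len(segments)
    )
    prefix = segments[:first_partition]
    partition_segments = segments[first_partition:]
    if not prefix:
        raise ValueError(f"no dataset name found in: {path!r}")

    # Bucket plus every prefix segment except the last forms the schema;
    # the last prefix segment is the table.
    *schema_parts, table = prefix
    schema = "_".join([parsed.netloc, *schema_parts])
    partitions = dict(s.split("=", 1) for s in partition_segments)
    return ExternalTable(schema, table, partitions)


result = names_from_s3_path(
    "s3://bananabucket/this/is/a/prefix/dataset/id=123/name=steve/asf809dg8jkljsd12.parquet"
)
print(result.schema, result.table, result.partitions)
```

Note that the sketch does not sanitize characters such as hyphens or dots, which are legal in bucket names but not in unquoted SQL identifiers, so a real implementation would need a rule for those.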
Hi @norton120, I'm very interested in this feature. Did you find any Python-focused solutions for managing Parquet files that can register metadata in AWS Athena?
Update: never mind, it seems awswrangler can deal with this.