Spectrum / Athena Support #22

norton120 · 2019-04-27T16:45:57Z

Description

During the publish phase we have everything we need to create an external schema in Redshift or register the metadata for Athena. Since we know the data lives in AWS, this would be a hugely powerful addition to the current functionality.

Pseudocode

parq.register(target="Redshift")

Why?

Registering a schema at publish makes the written data immediately queryable via any SQL workbench tool. We should standardize that the external schema is everything in the path leading up to the dataset name, and the table is the dataset name. So for a path
s3://bananabucket/this/is/a/prefix/dataset/id=123/name=steve/asf809dg8jkljsd12.parquet
the external schema to register would be bananabucket_this_is_a_prefix and the table would be dataset. So querying it via Spectrum / Athena would be
SELECT * FROM bananabucket_this_is_a_prefix.dataset WHERE id > 122
... WOAH.
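To make the naming convention concrete, here is a minimal Python sketch of how a path could be split into schema and table names. The helper name `schema_and_table` is hypothetical and not part of s3parq. It also assumes the dataset directory is the one directly above the first Hive-style `key=value` partition segment, or the file's parent directory when there are no partitions.

```python
from typing import Tuple
from urllib.parse import urlparse


def schema_and_table(s3_path: str) -> Tuple[str, str]:
    """Derive (external_schema, table) from the S3 path of a parquet file.

    Hypothetical helper illustrating the convention proposed above:
    schema = bucket plus every prefix segment before the dataset name,
    joined with underscores; table = the dataset name.
    """
    parsed = urlparse(s3_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 path: {s3_path!r}")

    # Directory segments only; the last segment is the parquet file itself.
    dirs = [seg for seg in parsed.path.split("/") if seg][:-1]

    # Assumption: the dataset directory sits directly above the first
    # key=value partition segment (or is the last directory if unpartitioned).
    first_partition = next(
        (i for i, seg in enumerate(dirs) if "=" in seg), len(dirs)
    )
    if first_partition == 0:
        raise ValueError(f"no dataset directory in: {s3_path!r}")

    table = dirs[first_partition - 1]
    schema = "_".join([parsed.netloc] + dirs[: first_partition - 1])
    return schema, table


path = (
    "s3://bananabucket/this/is/a/prefix/dataset/"
    "id=123/name=steve/asf809dg8jkljsd12.parquet"
)
schema, table = schema_and_table(path)
query = f"SELECT * FROM {schema}.{table} WHERE id > 122"
```

Bucket names may contain hyphens and dots, so a real implementation would probably also need to sanitize or quote the resulting identifiers.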


arogozhnikov · 2022-01-31T08:50:01Z

~~Hi @norton120, I'm very interested in this feature. Did you find any Python-focused solutions for managing Parquet files that can register metadata in AWS Athena?~~

Update: never mind, it seems awswrangler can handle this.

norton120 added the enhancement New feature or request label Apr 27, 2019

RyanAdalbert pushed a commit to RyanAdalbert/s3parq that referenced this issue Sep 16, 2019

Merge pull request IntegriChain1#22 from norton120/ci-test-fixes

5ff1f86

fixed the docker CI test
