diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 9c97fcc..d85fa0a 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1,8 +1,9 @@
 name: CI lints and tests
 on:
   push:
-    branches:
-      - "*"
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
 
 concurrency:
   group: ${{ github.ref }}
diff --git a/README.md b/README.md
index b411d45..6fbf0cc 100644
--- a/README.md
+++ b/README.md
@@ -98,13 +98,91 @@ SELECT * FROM product_example;
 ### Inspect Parquet schema
 You can call `SELECT * FROM parquet.schema()` to discover the schema of the Parquet file at given uri.
 
+```sql
+SELECT * FROM parquet.schema('/tmp/product_example.parquet')
+             uri              |     name     | type_name  | type_length | repetition_type | num_children |  converted_type  | scale | precision | field_id | logical_type
+------------------------------+--------------+------------+-------------+-----------------+--------------+------------------+-------+-----------+----------+--------------
+ /tmp/product_example.parquet | arrow_schema |            |             |                 |            5 |                  |       |           |          |
+ /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                  |       |           |        0 |
+ /tmp/product_example.parquet | product      |            |             | OPTIONAL        |            3 |                  |       |           |        1 |
+ /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                  |       |           |        2 |
+ /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8             |       |           |        3 | STRING
+ /tmp/product_example.parquet | items        |            |             | OPTIONAL        |            1 | LIST             |       |           |        4 | LIST
+ /tmp/product_example.parquet | list         |            |             | REPEATED        |            1 |                  |       |           |          |
+ /tmp/product_example.parquet | items        |            |             | OPTIONAL        |            3 |                  |       |           |        5 |
+ /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                  |       |           |        6 |
+ /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8             |       |           |        7 | STRING
+ /tmp/product_example.parquet | price        | FLOAT      |             | OPTIONAL        |              |                  |       |           |        8 |
+ /tmp/product_example.parquet | products     |            |             | OPTIONAL        |            1 | LIST             |       |           |        9 | LIST
+ /tmp/product_example.parquet | list         |            |             | REPEATED        |            1 |                  |       |           |          |
+ /tmp/product_example.parquet | products     |            |             | OPTIONAL        |            3 |                  |       |           |       10 |
+ /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                  |       |           |       11 |
+ /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8             |       |           |       12 | STRING
+ /tmp/product_example.parquet | items        |            |             | OPTIONAL        |            1 | LIST             |       |           |       13 | LIST
+ /tmp/product_example.parquet | list         |            |             | REPEATED        |            1 |                  |       |           |          |
+ /tmp/product_example.parquet | items        |            |             | OPTIONAL        |            3 |                  |       |           |       14 |
+ /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                  |       |           |       15 |
+ /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8             |       |           |       16 | STRING
+ /tmp/product_example.parquet | price        | FLOAT      |             | OPTIONAL        |              |                  |       |           |       17 |
+ /tmp/product_example.parquet | created_at   | INT64      |             | OPTIONAL        |              | TIMESTAMP_MICROS |       |           |       18 | TIMESTAMP
+ /tmp/product_example.parquet | updated_at   | INT64      |             | OPTIONAL        |              | TIMESTAMP_MICROS |       |           |       19 | TIMESTAMP
+(24 rows)
+```
+
 ### Inspect Parquet metadata
 You can call `SELECT * FROM parquet.metadata()` to discover the detailed metadata of the Parquet file, such as column statistics, at given uri.
 
+```sql
+SELECT uri, row_group_id, row_group_num_rows, row_group_num_columns, row_group_bytes, column_id, file_offset, num_values, path_in_schema, type_name FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;
+             uri              | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type_name
+------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-----------
+ /tmp/product_example.parquet |            0 |                  1 |                    13 |             842 |         0 |           0 |          1 | id             | INT32
+(1 row)
+```
+
+```sql
+SELECT stats_null_count, stats_distinct_count, stats_min, stats_max, compression, encodings, index_page_offset, dictionary_page_offset, data_page_offset, total_compressed_size, total_uncompressed_size FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;
+ stats_null_count | stats_distinct_count | stats_min | stats_max |    compression     |        encodings         | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size
+------------------+----------------------+-----------+-----------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------
+                0 |                      | 1         | 1         | GZIP(GzipLevel(6)) | PLAIN,RLE,RLE_DICTIONARY |                   |                      4 |               42 |                   101 |                      61
+(1 row)
+```
+
 You can call `SELECT * FROM parquet.file_metadata()` to discover file level metadata of the Parquet file, such as format version, at given uri.
 
+```sql
+SELECT * FROM parquet.file_metadata('/tmp/product_example.parquet')
+             uri              | created_by | num_rows | num_row_groups | format_version
+------------------------------+------------+----------+----------------+----------------
+ /tmp/product_example.parquet | pg_parquet |        1 |              1 | 1
+(1 row)
+```
+
 You can call `SELECT * FROM parquet.kv_metadata()` to query custom key-value metadata of the Parquet file at given uri.
 
+```sql
+SELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value FROM parquet.kv_metadata('/tmp/product_example.parquet');
+             uri              |     key      |                                                                                            value
+------------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ /tmp/product_example.parquet | ARROW:schema | /////5gIAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAUAAAD0BwAAlAQAAPwAAACIAAAABAAAADL4//9IAAAAHAAAAAwAAAAAAAEKKAAAAAAAAAAIAAwACgAEAAgAAAAIAAAAAAACAAYAAAArMDA6MDAAAAoAAAB.
+                              |              |.1cGRhdGVkX2F0AAABAAAABAAAADT4//8IAAAADAAAAAIAAAAxOQAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAAsvj//zgAAAAUAAAADAAAAAAAAQoYAAAAAAAAAGr7//8AAAIAAAAAAAAAAAAKAAAAY3JlYXRlZF9hdAAAAQAAAAQAAACk+P//CAAAAA.
+                              |              |.wAAAACAAAAMTgAABAAAABQQVJRVUVUOmZpZWxkX2lkAAAAACL5//9cAwAAGAAAAAwAAAAAAAEMPAMAAAEAAAAIAAAALPr//0b5///0AgAAIAAAAAwAAAAAAAEN1AIAAAMAAABoAgAABAIAAAgAAABY+v//cvn//8ABAAAYAAAADAAAAAAAAQykAQAAA.
+                              |              |.QAAAAgAAAB8+v//lvn//1wBAAAgAAAADAAAAAAAAQ1AAQAAAwAAANQAAABwAAAACAAAAKj6///C+f//LAAAABAAAAAUAAAAAAABAxAAAAB2/P//AAABAAAAAAAFAAAAcHJpY2UAAAABAAAABAAAAKj5//8IAAAADAAAAAIAAAAxNwAAEAAAAFBBUlFV.
+                              |              |.RVQ6ZmllbGRfaWQAAAAAJvr//ygAAAAUAAAADAAAAAAAAQUMAAAAAAAAACz7//8EAAAAbmFtZQAAAAABAAAABAAAAAj6//8IAAAADAAAAAIAAAAxNgAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAAhvr//ywAAAAQAAAAGAAAAAAAAQIUAAAAdPr//yA.
+                              |              |.AAAAAAAABAAAAAAIAAABpZAAAAQAAAAQAAABs+v//CAAAAAwAAAACAAAAMTUAABAAAABQQVJRVUVUOmZpZWxkX2lkAAAAAAUAAABpdGVtcwAAAAEAAAAEAAAArPr//wgAAAAMAAAAAgAAADE0AAAQAAAAUEFSUVVFVDpmaWVsZF9pZAAAAAAFAAAAaX.
+                              |              |.RlbXMAAAABAAAABAAAAOz6//8IAAAADAAAAAIAAAAxMwAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAAavv//ygAAAAUAAAADAAAAAAAAQUMAAAAAAAAAHD8//8EAAAAbmFtZQAAAAABAAAABAAAAEz7//8IAAAADAAAAAIAAAAxMgAAEAAAAFBBUlFVR.
+                              |              |.VQ6ZmllbGRfaWQAAAAAyvv//ywAAAAQAAAAGAAAAAAAAQIUAAAAuPv//yAAAAAAAAABAAAAAAIAAABpZAAAAQAAAAQAAACw+///CAAAAAwAAAACAAAAMTEAABAAAABQQVJRVUVUOmZpZWxkX2lkAAAAAAgAAABwcm9kdWN0cwAAAAABAAAABAAAAPT7.
+                              |              |.//8IAAAADAAAAAIAAAAxMAAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAACAAAAHByb2R1Y3RzAAAAAAEAAAAEAAAAOPz//wgAAAAMAAAAAQAAADkAAAAQAAAAUEFSUVVFVDpmaWVsZF9pZAAAAAC2/P//FAMAACAAAAAMAAAAAAABDfgCAAADAAAAjAI.
+                              |              |.AACQCAAAIAAAAyP3//+L8///gAQAAGAAAAAwAAAAAAAEMxAEAAAEAAAAIAAAA7P3//wb9//98AQAAJAAAAAwAAAAAAAENYAEAAAMAAAD0AAAAkAAAACAAAAAEAAYABAAAAAAAEgAaABQAEgATAAgAAAAMAAQAEgAAADQAAAAYAAAAHAAAAAAAAQMYAA.
+                              |              |.AAAAAGAAgABgAGAAAAAAABAAAAAAAFAAAAcHJpY2UAAAABAAAABAAAADj9//8IAAAADAAAAAEAAAA4AAAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAAtv3//ygAAAAUAAAADAAAAAAAAQUMAAAAAAAAALz+//8EAAAAbmFtZQAAAAABAAAABAAAAJj9/.
+                              |              |./8IAAAADAAAAAEAAAA3AAAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAAFv7//ywAAAAQAAAAGAAAAAAAAQIUAAAABP7//yAAAAAAAAABAAAAAAIAAABpZAAAAQAAAAQAAAD8/f//CAAAAAwAAAABAAAANgAAABAAAABQQVJRVUVUOmZpZWxkX2lkAAAA.
+                              |              |.AAUAAABpdGVtcwAAAAEAAAAEAAAAPP7//wgAAAAMAAAAAQAAADUAAAAQAAAAUEFSUVVFVDpmaWVsZF9pZAAAAAAFAAAAaXRlbXMAAAABAAAABAAAAHz+//8IAAAADAAAAAEAAAA0AAAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAA+v7//ywAAAAYAAA.
+                              |              |.ADAAAAAAAAQUQAAAAAAAAAAQABAAEAAAABAAAAG5hbWUAAAAAAQAAAAQAAADg/v//CAAAAAwAAAABAAAAMwAAABAAAABQQVJRVUVUOmZpZWxkX2lkAAAAAF7///8sAAAAEAAAABgAAAAAAAECFAAAAEz///8gAAAAAAAAAQAAAAACAAAAaWQAAAEAAA.
+                              |              |.AEAAAARP///wgAAAAMAAAAAQAAADIAAAAQAAAAUEFSUVVFVDpmaWVsZF9pZAAAAAAHAAAAcHJvZHVjdAABAAAABAAAAIT///8IAAAADAAAAAEAAAAxAAAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAABIAGAAUABIAEwAIAAAADAAEABIAAAA0AAAAGAAAA.
+                              |              |.CAAAAAAAAECHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAIAAABpZAAAAQAAAAwAAAAIAAwACAAEAAgAAAAIAAAADAAAAAEAAAAwAAAAEAAAAFBBUlFVRVQ6ZmllbGRfaWQAAAAA
+(1 row)
+```
+
 ## Object Store Support
 `pg_parquet` supports reading and writing Parquet files from/to `S3` object store. Only the uris with `s3://` scheme is supported.