Adds README.md and LICENSE #35

Closed
wants to merge 2 commits
5 changes: 5 additions & 0 deletions .env_sample
@@ -0,0 +1,5 @@
AWS_S3_TEST_BUCKET=testbucket
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=admin
AWS_SECRET_ACCESS_KEY=admin123
PG_PARQUET_TEST=true
1 change: 1 addition & 0 deletions Cargo.toml
@@ -2,6 +2,7 @@
name = "pg_parquet"
version = "0.1.0"
edition = "2021"
license-file = "LICENSE"

[lib]
crate-type = ["cdylib","lib"]
19 changes: 19 additions & 0 deletions LICENSE
@@ -0,0 +1,19 @@
PostgreSQL License

Copyright (c) 2024, Crunchy Data Solutions, Inc.

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose, without fee, and without a written agreement is
hereby granted, provided that the above copyright notice and this paragraph
and the following two paragraphs appear in all copies.

IN NO EVENT SHALL CRUNCHY DATA SOLUTIONS, INC. BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT,
SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING
OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF CRUNCHY DATA SOLUTIONS, INC.
HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

CRUNCHY DATA SOLUTIONS, INC. SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS,
AND CRUNCHY DATA SOLUTIONS, INC. HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
ENHANCEMENTS, OR MODIFICATIONS.
176 changes: 176 additions & 0 deletions README.md
@@ -0,0 +1,176 @@
![Logo](logo.png)

# pg_parquet

> Copy from/to Parquet files in PostgreSQL!

[![CI lints and tests](https://github.com/aykut-bozkurt/pg_parquet/actions/workflows/ci.yml/badge.svg)](https://github.com/aykut-bozkurt/pg_parquet/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/aykut-bozkurt/pg_parquet/graph/badge.svg?token=SVDGPEAP51)](https://codecov.io/gh/aykut-bozkurt/pg_parquet)
Collaborator
should be updated to crunchydata

Collaborator Author
codecov will probably break, but let's resolve this afterwards.

`pg_parquet` is a PostgreSQL extension that allows you to read and write Parquet files, located in `S3` or the local file system, from PostgreSQL via the `COPY TO/FROM` commands. It relies heavily on the [Apache Arrow](https://arrow.apache.org/rust/arrow/) project to read and write Parquet files and on the [pgrx](https://github.com/pgcentralfoundation/pgrx) project to extend PostgreSQL's `COPY` command.

```sql
-- Copy a query result into Parquet in S3
COPY (SELECT * FROM my_table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');

-- Load data from Parquet in S3
COPY my_table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
```

## Quick Reference
- [Installation From Source](#installation-from-source)
- [Usage](#usage)
- [Copy FROM/TO Parquet files TO/FROM Postgres tables](#copy-tofrom-parquet-files-fromto-postgres-tables)
- [Inspect Parquet schema](#inspect-parquet-schema)
- [Inspect Parquet metadata](#inspect-parquet-metadata)
- [Object Store Support](#object-store-support)
- [Copy Options](#copy-options)
- [Configuration](#configuration)
- [Supported Types](#supported-types)
- [Nested Types](#nested-types)
- [Postgres Support Matrix](#postgres-support-matrix)

## Installation From Source
After installing `Postgres`, you need to set up `rustup` and `cargo-pgrx` to build the extension.

```bash
# install rustup
> curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install cargo-pgrx
> cargo install cargo-pgrx

# configure pgrx
> cargo pgrx init --pg17 $(which pg_config)

# append the extension to shared_preload_libraries in ~/.pgrx/data-17/postgresql.conf
> echo "shared_preload_libraries = 'pg_parquet'" >> ~/.pgrx/data-17/postgresql.conf

# run cargo-pgrx to build and install the extension
> cargo pgrx run

# create the extension in the database
psql> CREATE EXTENSION pg_parquet;
```

## Usage
There are mainly three things you can do with `pg_parquet`:
1. You can export Postgres tables/queries to Parquet files,
2. You can ingest data from Parquet files to Postgres tables,
3. You can inspect the schema and metadata of Parquet files.

### COPY to/from Parquet files from/to Postgres tables
You can use PostgreSQL's `COPY` command to read and write Parquet files. Below is an example of how to write a PostgreSQL table, with complex types, into a Parquet file and then read the Parquet file content back into the same table.

```sql
-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

-- create a table with complex types
CREATE TABLE product_example (
    id int,
    product product,
    products product[],
    created_at TIMESTAMP,
    updated_at TIMESTAMPTZ
);

-- insert some rows into the table
insert into product_example values (
1,
ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
ARRAY[ROW(1, NULL, NULL)::product, NULL],
now(),
'2022-05-01 12:00:00-04'
);

-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (FORMAT 'parquet', COMPRESSION 'gzip');

-- show table
SELECT * FROM product_example;

-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';

-- show table
SELECT * FROM product_example;
```

### Inspect Parquet schema
You can call `SELECT * FROM parquet.schema(<uri>)` to discover the schema of the Parquet file at the given uri.
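
For example, to inspect the file written in the earlier example:

```sql
-- discover the column names and types of the Parquet file
SELECT * FROM parquet.schema('/tmp/product_example.parquet');
```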

### Inspect Parquet metadata
You can call `SELECT * FROM parquet.metadata(<uri>)` to discover the detailed metadata of the Parquet file, such as column statistics, at the given uri.

You can call `SELECT * FROM parquet.file_metadata(<uri>)` to discover file-level metadata of the Parquet file, such as the format version, at the given uri.

You can call `SELECT * FROM parquet.kv_metadata(<uri>)` to query custom key-value metadata of the Parquet file at the given uri.
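
Using the same example file:

```sql
-- row group and column-level metadata, including statistics
SELECT * FROM parquet.metadata('/tmp/product_example.parquet');

-- file-level metadata, such as the format version
SELECT * FROM parquet.file_metadata('/tmp/product_example.parquet');

-- custom key-value metadata
SELECT * FROM parquet.kv_metadata('/tmp/product_example.parquet');
```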

## Object Store Support
`pg_parquet` supports reading and writing Parquet files from/to the `S3` object store. Only uris with the `s3://` scheme are supported.

You can either set the following environment variables or use shared configuration files to access the object store:
- `AWS_ACCESS_KEY_ID`: the access key ID of the AWS account,
- `AWS_SECRET_ACCESS_KEY`: the secret access key of the AWS account,
- `AWS_REGION`: the default region of the AWS account.

You can set the config and credentials file paths with the `AWS_CONFIG_FILE` and `AWS_SHARED_CREDENTIALS_FILE` environment variables. The default config and credentials file paths are `~/.aws/config` and `~/.aws/credentials`. You can also set the profile name with the `AWS_PROFILE` environment variable. The default profile name is `default`.
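
A minimal sketch, reusing the values from `.env_sample` above and a hypothetical bucket named `mybucket`:

```bash
# export credentials (they must be visible to the Postgres server process)
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=admin123
export AWS_REGION=us-east-1
```

```sql
-- then copy to/from the bucket via the s3:// scheme
COPY product_example TO 's3://mybucket/product_example.parquet';
COPY product_example FROM 's3://mybucket/product_example.parquet';
```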

## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command:
- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with a `.parquet[.<compression>]` extension (this is the only option that the `COPY FROM` command supports),
- `row_group_size <size>`: the number of rows in each row group while writing Parquet files. The default row group size is `100000`,
- `compression <compression>`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw`, and `zstd`. If not specified, the compression format is determined by the file extension, falling back to `uncompressed`.
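
For example (paths and option values are illustrative):

```sql
-- the .parquet extension implies the format; tune writer options explicitly
COPY product_example TO '/tmp/example.parquet' (row_group_size 50000, compression 'zstd');

-- 'format parquet' is required when the extension does not imply it
COPY product_example TO '/tmp/example.data' (format 'parquet');
```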

## Configuration
There is currently only one GUC parameter to enable/disable `pg_parquet`:
- `pg_parquet.enable_copy_hooks`: you can set this parameter to `on` or `off` to enable or disable the `pg_parquet` extension. The default value is `on`.
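
For example, assuming the GUC can be changed at the session level:

```sql
-- disable pg_parquet's COPY hooks for the current session
SET pg_parquet.enable_copy_hooks TO off;
```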

## Supported Types
`pg_parquet` has rich type support, including PostgreSQL's primitive, array, and composite types. Below is the table of the supported types in PostgreSQL and their corresponding Parquet types.

| PostgreSQL Type    | Parquet Physical Type    | Logical Type     |
|--------------------|--------------------------|------------------|
| `bool`             | BOOLEAN                  |                  |
| `smallint`         | INT16                    |                  |
| `integer`          | INT32                    |                  |
| `bigint`           | INT64                    |                  |
| `real`             | FLOAT                    |                  |
| `oid`              | INT32                    |                  |
| `double precision` | DOUBLE                   |                  |
| `numeric` (1)      | FIXED_LEN_BYTE_ARRAY(16) | DECIMAL(128)     |
| `text`             | BYTE_ARRAY               | STRING           |
| `json`             | BYTE_ARRAY               | STRING           |
| `bytea`            | BYTE_ARRAY               |                  |
| `date` (2)         | INT32                    | DATE             |
| `timestamp`        | INT64                    | TIMESTAMP_MICROS |
| `timestamptz` (3)  | INT64                    | TIMESTAMP_MICROS |
| `time`             | INT64                    | TIME_MICROS      |
| `timetz` (3)       | INT64                    | TIME_MICROS      |
| `geometry` (4)     | BYTE_ARRAY               |                  |

### Nested Types
| PostgreSQL Type | Parquet Physical Type | Logical Type |
|-------------------|---------------------------|------------------|
| `composite` | GROUP | STRUCT |
| `array` | element's physical type | LIST |
| `crunchy_map` (5) | GROUP                     | MAP              |

> [!WARNING]
> - (1) `numeric` types with precision <= `38` are represented as `FIXED_LEN_BYTE_ARRAY(16)` with the `DECIMAL(128)` logical type. `numeric` types with precision > `38` are represented as `BYTE_ARRAY` with the `STRING` logical type.
> - (2) The `date` type is represented according to `Unix epoch` when writing to Parquet files. It is converted back according to `PostgreSQL epoch` when reading from Parquet files.
> - (3) The `timestamptz` and `timetz` types are adjusted to `UTC` when writing to Parquet files. They are converted back with `UTC` timezone when reading from Parquet files.
> - (4) The `geometry` type is represented as `BYTE_ARRAY` encoded as `WKB` when `postgis` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with `STRING` logical type.
> - (5) The `crunchy_map` type is represented as `GROUP` with `MAP` logical type when `crunchy_map` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with `STRING` logical type. This extension is only available at [Crunchy Bridge for Analytics](https://www.crunchydata.com/products/crunchy-bridge-for-analytics).
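
A quick sketch of the `numeric` behavior from note (1); the table and values are illustrative:

```sql
-- precision 10 maps to DECIMAL(128); precision 50 falls back to STRING
CREATE TABLE prices (small numeric(10, 2), big numeric(50, 10));
COPY prices TO '/tmp/prices.parquet';
```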

> [!WARNING]
> Any type that does not have a corresponding Parquet type is represented, as a fallback mechanism, as `BYTE_ARRAY` with the `STRING` logical type (e.g. `enum`).
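
For instance, with a hypothetical enum type:

```sql
-- 'm' has no native Parquet counterpart, so it is written as
-- BYTE_ARRAY with the STRING logical type
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE moods (m mood);
INSERT INTO moods VALUES ('happy');
COPY moods TO '/tmp/moods.parquet';
```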

## Postgres Support Matrix
`pg_parquet` is tested with the following PostgreSQL versions:
| PostgreSQL Major Version | Supported |
|--------------------------|-----------|
| 17 | ✅ |
| 16 | ✅ |
Binary file added logo.png