Cast types on read
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the Parquet file schema.
e.g. an `INT32` column in the Parquet schema cannot be read into a Postgres column of type `int64`.
We can avoid this by casting the Arrow array to the array type expected by the tupledesc schema, when the
cast is possible. For this we can use the `arrow-cast` crate, which lives in the same project as `arrow`.
Its public API lets us check whether a cast between two Arrow types is possible and perform the cast.
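
As a rough illustration (not the extension's actual code), the relevant `arrow-cast` entry points are `can_cast_types` and `cast`:

```rust
// Minimal sketch of the arrow-cast API: check that a cast between two
// Arrow types is possible, then perform it.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};
use arrow_cast::cast::{can_cast_types, cast};
use arrow_schema::DataType;

fn main() {
    // The Parquet file yields INT32, but the tupledesc expects INT64.
    let parquet_array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    if can_cast_types(parquet_array.data_type(), &DataType::Int64) {
        let casted = cast(&parquet_array, &DataType::Int64).expect("cast failed");
        assert_eq!(casted.data_type(), &DataType::Int64);
    }
}
```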

To make sure the cast is possible, we need two checks (see the sketch after this list):
1. `arrow-cast` allows the cast from the Arrow type in the Parquet file to the Arrow type in the schema
   that is generated for the tupledesc,
2. the cast is meaningful in Postgres. We check whether there is an explicit cast from the Postgres type
   that corresponds to the Arrow type in the Parquet file to the Postgres type in the tupledesc.
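
A sketch of how these two checks might compose (`pg_cast_exists` is a hypothetical stand-in for the Postgres-side catalog lookup, not pg_parquet's real API):

```rust
use arrow_cast::cast::can_cast_types;
use arrow_schema::DataType;
use pgrx::pg_sys::Oid;

// Hypothetical stand-in: would consult the Postgres catalog
// (pg_catalog.pg_cast) for an explicit cast between the two types.
fn pg_cast_exists(_from: Oid, _to: Oid) -> bool {
    unimplemented!()
}

// A cast is taken only if both check 1 (Arrow level) and
// check 2 (Postgres level) succeed.
fn cast_is_allowed(
    parquet_arrow_type: &DataType,
    tupledesc_arrow_type: &DataType,
    parquet_pg_type: Oid,
    tupledesc_pg_type: Oid,
) -> bool {
    can_cast_types(parquet_arrow_type, tupledesc_arrow_type)
        && pg_cast_exists(parquet_pg_type, tupledesc_pg_type)
}
```

Both checks are needed: `arrow-cast` may permit a conversion that has no explicit cast on the Postgres side, and vice versa.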

With that we can cast between many castable types as shown below:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

**NOTE**: Struct fields must match by name and position to be cast.
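
For instance, a struct cast along these lines should work (again just a sketch, assuming arrow-cast's struct-to-struct casting with field names and positions lining up):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StructArray, UInt16Array};
use arrow_cast::cast::cast;
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Source: struct {'x': UINT16} with a single row.
    let x: ArrayRef = Arc::new(UInt16Array::from(vec![42u16]));
    let source = StructArray::from(vec![(
        Arc::new(Field::new("x", DataType::UInt16, false)),
        x,
    )]);

    // Target: struct {'x': INT64}, same field name and position.
    let target = DataType::Struct(Fields::from(vec![Field::new(
        "x",
        DataType::Int64,
        false,
    )]));

    let casted = cast(&source, &target).expect("struct cast failed");
    assert_eq!(casted.data_type(), &target);
}
```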

Closes #67.
aykut-bozkurt committed Nov 21, 2024
1 parent 518a5ac commit ed3e766
Showing 17 changed files with 1,725 additions and 301 deletions.
1 change: 1 addition & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -21,6 +21,7 @@ pg_test = []
 
 [dependencies]
 arrow = {version = "53", default-features = false}
+arrow-cast = {version = "53", default-features = false}
 arrow-schema = {version = "53", default-features = false}
 aws-config = { version = "1.5", default-features = false, features = ["rustls"]}
 aws-credential-types = {version = "1.2", default-features = false}
9 changes: 7 additions & 2 deletions README.md
@@ -185,12 +185,17 @@ Alternatively, you can use the following environment variables when starting pos
 ## Copy Options
 `pg_parquet` supports the following options in the `COPY TO` command:
-- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension. (This is the only option that `COPY FROM` command supports.),
+- `format parquet`: you need to specify this option to read or write Parquet files which do not end with the `.parquet[.<compression>]` extension,
 - `row_group_size <int>`: the number of rows in each row group while writing Parquet files. The default row group size is `122880`,
 - `row_group_size_bytes <int>`: the total byte size of rows in each row group while writing Parquet files. The default row group size bytes is `row_group_size * 1024`,
-- `compression <string>`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw` and `zstd`. The default compression format is `snappy`. If not specified, the compression format is determined by the file extension.
+- `compression <string>`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw` and `zstd`. The default compression format is `snappy`. If not specified, the compression format is determined by the file extension,
 - `compression_level <int>`: the compression level to use while writing Parquet files. Compression levels are only supported for the `gzip`, `zstd` and `brotli` compression formats. The default compression level is `6` for `gzip (0-10)`, `1` for `zstd (1-22)` and `1` for `brotli (0-11)`.
+
+`pg_parquet` supports the following options in the `COPY FROM` command:
+- `format parquet`: you need to specify this option to read or write Parquet files which do not end with the `.parquet[.<compression>]` extension,
+- `cast_mode <string>`: either `strict` or `relaxed`, determining whether lossy conversions are allowed. The default is `strict`, which rejects lossy conversions up front with a schema mismatch error at schema creation time, e.g. `int64 => int32`. Set it to `relaxed` to allow lossy conversions, which may instead throw an error at runtime if a value cannot be converted properly.
 
 ## Configuration
 There is currently only one GUC parameter to enable/disable the `pg_parquet`:
 - `pg_parquet.enable_copy_hooks`: you can set this parameter to `on` or `off` to enable or disable the `pg_parquet` extension. The default value is `on`.
1 change: 1 addition & 0 deletions src/arrow_parquet.rs
@@ -1,5 +1,6 @@
 pub(crate) mod arrow_to_pg;
 pub(crate) mod arrow_utils;
+pub(crate) mod cast_mode;
 pub(crate) mod compression;
 pub(crate) mod parquet_reader;
 pub(crate) mod parquet_writer;
