
Coerce types on read #76

Merged: 3 commits merged into main from aykut/coerce-types-on-read on Nov 26, 2024
Conversation

@aykut-bozkurt (Collaborator) commented Nov 11, 2024

`COPY FROM parquet` is too strict when matching the Parquet file schema to the Postgres tupledesc schema:
an `INT32` column in the Parquet schema, for example, cannot be read into a Postgres column of type `int64`.
We can avoid this by casting the Arrow array to the array type that the tupledesc schema expects, whenever
such a cast is possible. For that we can use the `arrow-cast` crate, which lives in the same project as
`arrow`; its public API lets us check whether a cast between two Arrow types is possible and then perform it.

To make sure the cast is possible, we need to do 2 checks:

  1. arrow-cast allows the cast from the "arrow type in the parquet file" to the "arrow type in the schema
     generated for the tupledesc" (custom cast functions created by the user in Postgres are not visible to
     arrow-cast),
  2. the cast is meaningful in Postgres: we check that a cast exists from the "Postgres type that corresponds
     to the arrow type in the Parquet file" to the "Postgres type in the tupledesc".

With that we can implicitly cast between many types as shown below:

  • INT16 => INT32
  • UINT32 => INT64
  • FLOAT32 => FLOAT64
  • LargeUtf8 => UTF8
  • LargeBinary => Binary
  • Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}
    NOTE: Struct fields must always strictly match by name and position.

The casts below are also allowed, but they may fail at runtime, e.g. on value overflow:

  • INT64 => INT32
  • TIMESTAMPTZ => TIMESTAMP

Closes #67.
Closes #79.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch from d0f1554 to ef6f07d Compare November 11, 2024 18:01
codecov bot commented Nov 11, 2024

Codecov Report

Attention: Patch coverage is 87.82313% with 179 lines in your changes missing coverage. Please review.

Project coverage is 91.94%. Comparing base (518a5ac) to head (d48996c).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/lib.rs 88.25% 123 Missing ⚠️
src/arrow_parquet/schema_parser.rs 75.12% 51 Missing ⚠️
src/arrow_parquet/arrow_to_pg.rs 96.32% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
- Coverage   92.83%   91.94%   -0.89%     
==========================================
  Files          62       62              
  Lines        7645     8795    +1150     
==========================================
+ Hits         7097     8087     +990     
- Misses        548      708     +160     


@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch 15 times, most recently from a343a8a to c69fd25 Compare November 17, 2024 13:44
@aykut-bozkurt aykut-bozkurt marked this pull request as draft November 17, 2024 13:45
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch from c69fd25 to c8af064 Compare November 21, 2024 08:10
@aykut-bozkurt aykut-bozkurt marked this pull request as ready for review November 21, 2024 08:11
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch 2 times, most recently from 78fc3c9 to ed3e766 Compare November 21, 2024 10:42
pgrx::PgLogLevel::ERROR,
PgSqlErrorCode::ERRCODE_CANNOT_COERCE,
type_mismatch_message,
"Try COPY FROM '..' WITH (cast_mode = 'relaxed') to allow lossy casts with runtime checks."

Review comment on the snippet above: it should be `(cast_mode 'relaxed')`, so drop the `=`.

@onderkalaci left a comment

I have been testing this PR for a while, and everything just works flawlessly.

As we discussed internally, some CREATE CASTs may break, such as the following, so it would be nice to document that somewhere:

CREATE FUNCTION float_to_text(float) RETURNS text AS $$
BEGIN
    RETURN $1::text;
END;
$$ LANGUAGE plpgsql;


CREATE CAST (float AS text)
WITH FUNCTION float_to_text(float)
AS ASSIGNMENT;



COPY (SELECT 123.456::float as a UNION SELECT 789.01) TO '/tmp/float.parquet';
CREATE TABLE parquet_floats (a text);

-- Relaxed mode: uses the custom cast
COPY parquet_floats FROM '/tmp/float.parquet' WITH (cast_mode 'relaxed');

-- Strict mode: rejected at schema validation time
COPY parquet_floats FROM '/tmp/float.parquet' WITH (cast_mode 'strict');
ERROR:  type mismatch for column "a" between table and parquet file.
table has "Utf8"
parquet file has "Float64"
DETAIL:  Try COPY FROM '..' WITH (cast_mode = 'relaxed') to allow lossy casts with runtime checks.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch from ced1479 to 7fa0477 Compare November 22, 2024 15:14
README.md Outdated
- `compression_level <int>`: the compression level to use while writing Parquet files. Compression levels are only supported for the `gzip`, `zstd` and `brotli` compression formats. The default level is `6` for `gzip (0-10)`, `1` for `zstd (1-22)` and `1` for `brotli (0-11)`.

`pg_parquet` supports the following options in the `COPY FROM` command:
- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with the `.parquet[.<compression>]` extension,
- `cast_mode <string>`: Specifies the casting behavior, which can be set to either `strict` or `relaxed`. This determines whether lossy conversions are allowed. By default, the mode is `strict`, which does not permit lossy conversions (e.g., `bigint => int` causes a schema mismatch error during schema validation). When set to `relaxed`, lossy conversions are allowed, and errors will only be raised at runtime if a value cannot be properly converted. This option provides flexibility to handle schema mismatches by deferring error checks to runtime.
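A hedged usage sketch of the two options (the table name, file path, and level value are illustrative; the `cast_mode` syntax follows the form shown elsewhere in this thread, and zstd is selected via the file extension per the `format parquet` note above):

```sql
-- write with a non-default zstd compression level (valid range 1-22)
COPY my_table TO '/tmp/my_table.parquet.zst' WITH (compression_level 3);

-- read back into a table whose column types only loosely match,
-- deferring overflow checks to runtime
COPY my_table FROM '/tmp/my_table.parquet.zst' WITH (cast_mode 'relaxed');
```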
A collaborator asked about the README wording:

> does not permit lossy conversions (e.g., `bigint => int`)

why is that a lossy conversion? Does it overflow?

@aykut-bozkurt (author) replied:
When cast_mode = strict, it is not allowed at schema comparison time.
When cast_mode = relaxed, it is allowed but might fail at runtime if overflow occurs; Postgres throws an error, e.g.

pg_parquet=# select 123123123123::bigint::int;
ERROR:  integer out of range

Follow-up from the author: maybe the word "lossy" is confusing. If the error were not thrown, the cast would be lossy, but Postgres does throw the error.

Follow-up from the author: got rid of cast_mode; we now always cast in the relaxed way.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch from 66ee73a to d48996c Compare November 25, 2024 14:17
@aykut-bozkurt aykut-bozkurt enabled auto-merge (squash) November 25, 2024 14:26
@marcoslot left a comment:
It could be nice to relax it further through coerce via IO, but lots of nice improvements here.

)
.with_metadata(key_field.metadata().clone());

let value_nullable = true;
Review comment: seems like it's missing some code coverage.

Author reply: yeah, that's because crunchy_map is missing in CI, so we simply skip these tests.

@@ -340,6 +354,14 @@ mod tests {
Spi::get_one(&query).unwrap().unwrap()
}

fn write_record_batch_to_parquet(schema: SchemaRef, record_batch: RecordBatch) {
let file = File::create("/tmp/test.parquet").unwrap();
Review comment: it would be good to use a filename that's less likely to conflict with random stuff users do.

Author reply: will be handled in #78.

@aykut-bozkurt aykut-bozkurt merged commit e3b4476 into main Nov 26, 2024
4 of 6 checks passed
@aykut-bozkurt aykut-bozkurt deleted the aykut/coerce-types-on-read branch November 26, 2024 12:36