Numeric improvement and fix #65

Merged 1 commit into main on Nov 9, 2024
Conversation

aykut-bozkurt
Collaborator

@aykut-bozkurt aykut-bozkurt commented Oct 30, 2024

Problem
Previously, we were writing unbounded numerics, which do not specify precision and scale (i.e. numeric), as text, since they can be too large to represent as a parquet decimal. Most of the time users omit the precision for numeric columns, so those columns were written as text. That prevented execution engines from pushing down some operators on the numeric type.

Improvement
We now read/write unbounded numerics as numeric(38, 16) in the parquet file. We throw a runtime error if an unbounded numeric value exceeds 22 digits before the decimal point or 16 digits after it.

For users who run into the error, we give a hint to change the column type to a numeric(p,s) with precision and scale specified, to get rid of the error.
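The digit-limit check described above can be sketched as follows. This is an illustrative sketch, not the extension's actual implementation: the function name `fits_default_typmod` and the string-based check are assumptions, but the limits (38 total digits, 16 after the decimal point, hence 22 before it) match the defaults described here.

```rust
// Default typmod applied to unbounded numerics when writing parquet.
const DEFAULT_PRECISION: u32 = 38;
const DEFAULT_SCALE: u32 = 16;
// 38 - 16 = 22 digits allowed before the decimal point.
const MAX_INTEGRAL_DIGITS: u32 = DEFAULT_PRECISION - DEFAULT_SCALE;

/// Returns true if the textual numeric value fits numeric(38, 16).
/// (Hypothetical helper, for illustration only.)
fn fits_default_typmod(value: &str) -> bool {
    let digits = value.trim_start_matches('-');
    let (integral, fractional) = match digits.split_once('.') {
        Some((i, f)) => (i, f),
        None => (digits, ""),
    };
    integral.trim_start_matches('0').len() as u32 <= MAX_INTEGRAL_DIGITS
        && fractional.len() as u32 <= DEFAULT_SCALE
}

fn main() {
    assert!(fits_default_typmod("1234.56"));
    // 23 integral digits: rejected at runtime; the hint is to declare
    // an explicit numeric(p, s) on the column instead.
    assert!(!fits_default_typmod("10000000000000000000000.0"));
}
```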

Fixes
Arrow to pg conversions were not correct for some cases:

  • when there is no decimal point, e.g. 1234 (fixed by relying on Arrow's pg-compatible Decimal128Type::format_decimal)
  • when the scale is negative, e.g. numeric(5,-2) (Arrow does not allow negative scale; fixed by adding abs(scale) to the precision and setting the scale to 0, i.e. numeric(5,-2) => numeric(7,0) in the parquet file. COPY FROM can convert it back to numeric(5,-2))

These cases are fixed and covered by tests.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-improvement branch from 8eaab00 to da43062 on October 30, 2024 23:44
> * `numeric(9 < P <= 18, S)` is represented as `INT64` with `DECIMAL` logical type
> * `numeric(18 < P <= 38, S)` is represented as `FIXED_LEN_BYTE_ARRAY(9-16)` with `DECIMAL` logical type
> * `numeric(38 < P, S)` is represented as `BYTE_ARRAY` with `STRING` logical type
> * `numeric` without precision and scale is allowed by Postgres. These are written with a default precision (38) and scale (16) instead of as strings. You get a runtime error if your table tries to read or write a numeric value that the default precision and scale cannot represent (22 integral digits before the decimal point, 16 digits after it).
Contributor

I believe this is true, but just being explicit in my understanding: So effectively if you want numeric with larger precision than 38 you need to explicitly define it there, it just will not end up being an actual stored numeric, but reads/writes will work to convert transparently from the Postgres side?

Collaborator Author

True. An example:

pg_parquet=# create table test(x numeric(40,2));
CREATE TABLE

pg_parquet=# insert into test values (1.23);
INSERT 0 1

pg_parquet=# copy test to '/tmp/test.parquet';
COPY 1

pg_parquet=# select * from parquet.metadata('/tmp/test.parquet');
-[ RECORD 1 ]-----------+-------------------------
uri                     | /tmp/test.parquet
row_group_id            | 0
row_group_num_rows      | 1
row_group_num_columns   | 1
row_group_bytes         | 65
column_id               | 0
file_offset             | 0
num_values              | 1
path_in_schema          | x
type_name               | BYTE_ARRAY
stats_null_count        | 0
stats_distinct_count    | 
stats_min               | 1.23
stats_max               | 1.23
compression             | SNAPPY
encodings               | PLAIN,RLE,RLE_DICTIONARY
index_page_offset       | 
dictionary_page_offset  | 4
data_page_offset        | 28
total_compressed_size   | 69
total_uncompressed_size | 65

pg_parquet=# copy test from '/tmp/test.parquet';
COPY 1

@@ -65,8 +65,8 @@ pub(crate) struct PgToArrowAttributeContext {
is_geometry: bool,
is_map: bool,
attribute_contexts: Option<Vec<PgToArrowAttributeContext>>,
scale: Option<usize>,
Contributor

What are the changes here for? Related to the arm support?

Collaborator Author
@aykut-bozkurt aykut-bozkurt commented Nov 9, 2024

Postgres returns scale and precision as i32, since negative scales are allowed.

But we adjust negative scales, e.g. (5, -2) => (7, 0). Hence we can make it u32 (not usize, since pgrx::Numeric<P, S> requires P and S to be u32).
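The adjustment described in this comment can be sketched as below. The function name `adjust_typmod` is hypothetical; the widening rule (add the absolute value of a negative scale to the precision, then set the scale to 0) follows what the PR describes.

```rust
/// Widens a Postgres (precision, scale) pair so the scale is
/// non-negative, as Arrow requires (illustrative sketch).
fn adjust_typmod(precision: i32, scale: i32) -> (u32, u32) {
    if scale < 0 {
        // e.g. numeric(5, -2) is written as numeric(7, 0) in parquet;
        // COPY FROM can convert it back to numeric(5, -2).
        ((precision + scale.abs()) as u32, 0)
    } else {
        (precision as u32, scale as u32)
    }
}

fn main() {
    assert_eq!(adjust_typmod(5, -2), (7, 0));
    // Non-negative scales pass through unchanged.
    assert_eq!(adjust_typmod(10, 2), (10, 2));
}
```

After this adjustment both values are guaranteed non-negative, which is why the context field can hold u32 rather than i32.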

Comment on lines +356 to +359
parquet_dest.copy_options.row_group_size = row_group_size;
parquet_dest.copy_options.row_group_size_bytes = row_group_size_bytes;
parquet_dest.copy_options.compression = compression;
parquet_dest.copy_options.compression_level = compression_level;
Contributor

Is this part of this change, or just an independent refactor?

Collaborator Author

Sorry for the confusion, it's independent.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-improvement branch 3 times, most recently from 25ff037 to d97f8ad on November 9, 2024 13:56
**Problem**
Previously, we were writing unbounded numerics, which do not specify precision and scale (i.e. `numeric`), as text, since they can be too large to represent
as a parquet decimal. Most of the time users omit the precision for numeric columns, so those columns were written as text. That prevented execution engines
from pushing down some operators on the numeric type.

**Improvement**
We now read/write unbounded numerics as numeric(38, 16) in the parquet file. We throw a runtime error if an unbounded numeric value exceeds 22 digits before
the decimal point or 16 digits after it. For users who run into the error, we give a hint to change the column type to a numeric(p,s) with precision and
scale specified, to get rid of the error.

**Fix**
Arrow to pg conversions were not correct for some cases, e.g. when there is no decimal point. These cases are fixed and covered by tests.
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-improvement branch from d97f8ad to 62cf7e3 on November 9, 2024 14:14
@aykut-bozkurt aykut-bozkurt merged commit 451f347 into main Nov 9, 2024
4 checks passed
@aykut-bozkurt aykut-bozkurt deleted the aykut/numeric-improvement branch November 9, 2024 14:23