Numeric improvement and fix #65

Merged 1 commit into main on Nov 9, 2024
Conversation

aykut-bozkurt
Collaborator

@aykut-bozkurt aykut-bozkurt commented Oct 30, 2024

Problem
Previously, we were writing unbounded numerics, which do not specify precision and scale (i.e. numeric), as text, since they can be too large to represent as a parquet decimal. Most of the time users omit the precision for numeric columns, so those columns were written as text. That prevented execution engines from pushing down some operators on the numeric type.

Improvement
We now read/write unbounded numerics as numeric(38, 16) in the parquet file. We throw a runtime error if an unbounded numeric value exceeds 22 digits before the decimal point or 16 digits after it.

For users who run into the error, we give a hint to change the column type to a numeric(p,s) with precision and scale specified, to get rid of the error.
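The digit-limit check described above can be sketched as follows. This is an illustrative sketch, not the extension's actual implementation: the function name `fits_default_typmod` and the string-based check are assumptions, but the limits (38 total digits, 16 after the decimal point, hence 22 before it) match the defaults described here.

```rust
// Default typmod applied to unbounded numerics when writing parquet.
const DEFAULT_PRECISION: u32 = 38;
const DEFAULT_SCALE: u32 = 16;
// 38 - 16 = 22 digits allowed before the decimal point.
const MAX_INTEGRAL_DIGITS: u32 = DEFAULT_PRECISION - DEFAULT_SCALE;

/// Returns true if the textual numeric value fits numeric(38, 16).
/// (Hypothetical helper, for illustration only.)
fn fits_default_typmod(value: &str) -> bool {
    let digits = value.trim_start_matches('-');
    let (integral, fractional) = match digits.split_once('.') {
        Some((i, f)) => (i, f),
        None => (digits, ""),
    };
    integral.trim_start_matches('0').len() as u32 <= MAX_INTEGRAL_DIGITS
        && fractional.len() as u32 <= DEFAULT_SCALE
}

fn main() {
    assert!(fits_default_typmod("1234.56"));
    // 23 integral digits: rejected at runtime; the hint is to declare
    // an explicit numeric(p, s) on the column instead.
    assert!(!fits_default_typmod("10000000000000000000000.0"));
}
```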

Fixes
Arrow to pg conversions were not correct for some cases:

  • when there is no decimal point, e.g. 1234 (fixed by relying on Arrow's pg-compatible Decimal128Type::format_decimal)
  • when the scale is negative, e.g. numeric(5,-2) (Arrow does not allow negative scale; fixed by adding abs(scale) to the precision and setting the scale to 0, i.e. numeric(5,-2) => numeric(7,0) in the parquet file. COPY FROM can convert it back to numeric(5,-2))

These cases are fixed and covered by tests.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-improvement branch from 8eaab00 to da43062 on October 30, 2024 23:44
> * `numeric(9 < P <= 18, S)` is represented as `INT64` with `DECIMAL` logical type
> * `numeric(18 < P <= 38, S)` is represented as `FIXED_LEN_BYTE_ARRAY(9-16)` with `DECIMAL` logical type
> * `numeric(38 < P, S)` is represented as `BYTE_ARRAY` with `STRING` logical type
> * `numeric` without precision and scale is allowed by Postgres. These are written with a default precision (38) and scale (16) instead of as strings. You get a runtime error if your table tries to read or write a numeric value that the default precision and scale cannot represent (22 integral digits before the decimal point, 16 digits after it).
Contributor

I believe this is true, but just being explicit in my understanding: So effectively if you want numeric with larger precision than 38 you need to explicitly define it there, it just will not end up being an actual stored numeric, but reads/writes will work to convert transparently from the Postgres side?

Collaborator Author

True. An example:

pg_parquet=# create table test(x numeric(40,2));
CREATE TABLE

pg_parquet=# insert into test values (1.23);
INSERT 0 1

pg_parquet=# copy test to '/tmp/test.parquet';
COPY 1

pg_parquet=# select * from parquet.metadata('/tmp/test.parquet');
-[ RECORD 1 ]-----------+-------------------------
uri                     | /tmp/test.parquet
row_group_id            | 0
row_group_num_rows      | 1
row_group_num_columns   | 1
row_group_bytes         | 65
column_id               | 0
file_offset             | 0
num_values              | 1
path_in_schema          | x
type_name               | BYTE_ARRAY
stats_null_count        | 0
stats_distinct_count    | 
stats_min               | 1.23
stats_max               | 1.23
compression             | SNAPPY
encodings               | PLAIN,RLE,RLE_DICTIONARY
index_page_offset       | 
dictionary_page_offset  | 4
data_page_offset        | 28
total_compressed_size   | 69
total_uncompressed_size | 65

pg_parquet=# copy test from '/tmp/test.parquet';
COPY 1

@@ -65,8 +65,8 @@ pub(crate) struct PgToArrowAttributeContext {
is_geometry: bool,
is_map: bool,
attribute_contexts: Option<Vec<PgToArrowAttributeContext>>,
scale: Option<usize>,
Contributor

What are the changes here for? Related to the arm support?

Collaborator Author
@aykut-bozkurt aykut-bozkurt commented Nov 9, 2024

Postgres returns scale and precision as i32, since negative scales are allowed.

But we adjust negative scales, e.g. (5, -2) => (7, 0). Hence we can make it u32 (not usize, since pgrx::Numeric<P, S> requires P and S to be u32).
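The adjustment described in this comment can be sketched as below. The function name `adjust_typmod` is hypothetical; the widening rule (add the absolute value of a negative scale to the precision, then set the scale to 0) follows what the PR describes.

```rust
/// Widens a Postgres (precision, scale) pair so the scale is
/// non-negative, as Arrow requires (illustrative sketch).
fn adjust_typmod(precision: i32, scale: i32) -> (u32, u32) {
    if scale < 0 {
        // e.g. numeric(5, -2) is written as numeric(7, 0) in parquet;
        // COPY FROM can convert it back to numeric(5, -2).
        ((precision + scale.abs()) as u32, 0)
    } else {
        (precision as u32, scale as u32)
    }
}

fn main() {
    assert_eq!(adjust_typmod(5, -2), (7, 0));
    // Non-negative scales pass through unchanged.
    assert_eq!(adjust_typmod(10, 2), (10, 2));
}
```

After this adjustment both values are guaranteed non-negative, which is why the context field can hold u32 rather than i32.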

Comment on lines +356 to +359
parquet_dest.copy_options.row_group_size = row_group_size;
parquet_dest.copy_options.row_group_size_bytes = row_group_size_bytes;
parquet_dest.copy_options.compression = compression;
parquet_dest.copy_options.compression_level = compression_level;
Contributor

Is this part of this change, or just an independent refactor?

Collaborator Author

Sorry for the confusion, it's independent.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-improvement branch 3 times, most recently from 25ff037 to d97f8ad on November 9, 2024 13:56
**Problem**
Previously, we were writing unbounded numerics, which do not specify precision and scale (i.e. `numeric`), as text, since they can be too large to represent
as a parquet decimal. Most of the time users omit the precision for numeric columns, so those columns were written as text. That prevented execution engines
from pushing down some operators on the numeric type.

**Improvement**
We now read/write unbounded numerics as numeric(38, 16) in the parquet file. We throw a runtime error if an unbounded numeric value exceeds 22 digits before
the decimal point or 16 digits after it. For users who run into the error, we give a hint to change the column type to a numeric(p,s) with precision and
scale specified, to get rid of the error.

**Fix**
Arrow to pg conversions were not correct for some cases, e.g. when there is no decimal point. These cases are fixed and covered by tests.
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-improvement branch from d97f8ad to 62cf7e3 on November 9, 2024 14:14
@aykut-bozkurt aykut-bozkurt merged commit 451f347 into main Nov 9, 2024
4 checks passed
@aykut-bozkurt aykut-bozkurt deleted the aykut/numeric-improvement branch November 9, 2024 14:23