Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importing Array(String) from parquet file does not work with text[] column #79

Closed
aseigo opened this issue Nov 21, 2024 · 7 comments · Fixed by #76
Closed

Importing Array(String) from parquet file does not work with text[] column #79

aseigo opened this issue Nov 21, 2024 · 7 comments · Fixed by #76
Labels
bug Something isn't working

Comments

@aseigo
Copy link

aseigo commented Nov 21, 2024

Using the file at s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-00000.snappy.parquet (part of https://opensource.foursquare.com/os-places/ .. ~434MB in size), pg_parquet shows the following schema:

fspoi=# select * from parquet.schema('places-00000.snappy.parquet');
                               uri                                |        name         | type_name  | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type
------------------------------------------------------------------+---------------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------
 places-00000.snappy.parquet | arrow_schema        |            |             |                 |           23 |                |       |           |          |
 places-00000.snappy.parquet | fsq_place_id        | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | name                | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | latitude            | DOUBLE     |             | OPTIONAL        |              |                |       |           |          |
 places-00000.snappy.parquet | longitude           | DOUBLE     |             | OPTIONAL        |              |                |       |           |          |
 places-00000.snappy.parquet | address             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | locality            | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | region              | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | postcode            | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | admin_region        | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | post_town           | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | po_box              | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | country             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | date_created        | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | date_refreshed      | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | date_closed         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | tel                 | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | website             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | email               | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | facebook_id         | INT64      |             | OPTIONAL        |              |                |       |           |          |
 places-00000.snappy.parquet | instagram           | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | twitter             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | fsq_category_ids    |            |             | OPTIONAL        |            1 | LIST           |       |           |          | LIST
 places-00000.snappy.parquet | list                |            |             | REPEATED        |            1 |                |       |           |          |
 places-00000.snappy.parquet | element             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
 places-00000.snappy.parquet | fsq_category_labels |            |             | OPTIONAL        |            1 | LIST           |       |           |          | LIST
 places-00000.snappy.parquet | list                |            |             | REPEATED        |            1 |                |       |           |          |
 places-00000.snappy.parquet | element             | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |          | STRING
(28 rows)

and creating a table in postgresql as:

create table places (
  fsq_place_id        text primary key,
  name                text,
  latitude            double precision,
  longitude           double precision, 
  address             text,
  locality            text,
  region              text,
  postcode            text,
  admin_region        text,
  post_town           text,
  po_box              text,
  date_created        text,
  country             text, 
  date_refreshed      text,
  date_closed         text, 
  tel                 text, 
  website             text,
  email               text,
  facebook_id         bigint,
  instagram           text,
  twitter             text,
  fsq_category_ids     text[],
  fsq_category_labels text[]
);

then attempting to copy the file into the table with:

COPY places FROM 'places-00000.snappy.parquet'  WITH (format 'parquet');

results in the following error:

ERROR:  type mismatch for column "fsq_category_ids" between table and parquet file. table expected "List(Field { name: "fsq_category_ids", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "22"} })" but file had "List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })"

I also tried with a custom type such as CREATE TYPE element AS (elment text[]); but that ends up creating a list of records, which also does not match the parquet file.

I'm not sure if I'm doing something wrong here (the documentation is a bit light on how arrays are intended to work?), or if this is a bug in pg_parquet.

@aseigo
Copy link
Author

aseigo commented Nov 21, 2024

Idle wondering: is this because the names do not match? Is this another symptom of #39 ?

@aykut-bozkurt
Copy link
Collaborator

aykut-bozkurt commented Nov 21, 2024

Field names seem correct but COPY FROM is currently not interop well (requires exact match) with parquet files written by other tools. Hopefully, this PR should resolve it by allowing more relaxed schema match.

I also plan to introduce match_by_position option for COPY FROM to resolve #39.

EDIT: looks like field name also mismatch for the list element. So yes, another symptom of #39.

@aykut-bozkurt aykut-bozkurt added the bug Something isn't working label Nov 21, 2024
@aseigo
Copy link
Author

aseigo commented Nov 21, 2024

looks like field name also mismatch for the list element. So yes, another symptom of #39.

To make it even more tricky, I'm not even sure how to make the field name for the list element match. If the field in the table is changed from fsq_category_ids to element, then the error becomes that there is no element field in the parquet file (which is correct, of course!).

It is nice to match on name where possible ... perhaps a smaller useful change here would be to just ignore the name of the field on list elements, as they aren't nameable (afaik, anyways) in postgresql.

p.s. thank-you for this tool. Being able to drag data into pgsql from parquet files is very snazzy.

@aykut-bozkurt
Copy link
Collaborator

thanks for the feedback, hopefully we will improve COPY FROM experience in a few weeks. There will be a few more PRs coming after the #39. The issue will also be fixed.

@aykut-bozkurt
Copy link
Collaborator

you can try checkout to #39 btw. It has good chance to resolve the element name mismatch (cast will allow it), which will merged the next week.

@aseigo
Copy link
Author

aseigo commented Nov 26, 2024

Cool! I will check this out over the next day and report back how this addresses the issue. Thanks!

@aseigo
Copy link
Author

aseigo commented Nov 26, 2024

I can confirm that this has resolved the issue I was seeing. Thanks again! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants