feat: Experimental unity catalog client #20798

Merged
merged 5 commits into from
Jan 20, 2025


Collaborator

@nameexhaustion nameexhaustion commented Jan 20, 2025

Introduces an experimental unity catalog client. Note that the API is unstable and subject to change.

The initial version in this PR supports:

  • Listing catalogs, schemas and tables
  • Retrieving table information
  • Reading a table as a LazyFrame for the following data_source_formats:
    • DELTA
    • PARQUET
    • CSV
import polars as pl
from pprint import pprint

# See https://github.com/unitycatalog/unitycatalog for the unity catalog server.
catalog = pl.Catalog("http://localhost:8080")

pprint(catalog.list_catalogs())
# [{"comment": "Main catalog", "name": "unity"}]
pprint(catalog.list_schemas("unity"))
# [{"comment": "Default schema", "name": "default"}]
pprint(catalog.list_tables("unity", "default"))
# [
#     {
#         "columns": [
#             {
#                 "comment": "ID primary key",
#                 "name": "id",
#                 "partition_index": None,
#                 "position": 0,
#                 "type_interval_type": None,
#                 "type_text": "int",
#             },
#             ...,
#         ],
#         "comment": "Managed table",
#         "data_source_format": "DELTA",
#         "name": "marksheet",
#         "storage_location": "file:///Users/nxs/git/unitycatalog/etc/data/managed/unity/default/tables/marksheet/",
#         "table_id": "c389adfa-5c8f-497b-8f70-26c2cca4976d",
#         "table_type": "MANAGED",
#     },
#     ...,
# ]
pprint(catalog.get_table_info("unity", "default", "numbers"))
# {
#     "columns": [
#         {
#             "comment": "Int column",
#             "name": "as_int",
#             "partition_index": None,
#             "position": 0,
#             "type_interval_type": None,
#             "type_text": "int",
#         },
#         ...,
#     ],
#     "comment": "External table",
#     "data_source_format": "DELTA",
#     "name": "numbers",
#     "storage_location": "file:///Users/nxs/git/unitycatalog/etc/data/external/unity/default/tables/numbers/",
#     "table_id": "32025924-be53-4d67-ac39-501a86046c01",
#     "table_type": "EXTERNAL",
# }
print(q := catalog.scan_table("unity", "default", "numbers"))
# naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

# Parquet SCAN [/Users/nxs/git/unitycatalog/etc/data/external/unity/default/tables/numbers/d1df15d1-33d8-45ab-ad77-465476e2d5cd-000.parquet]
# PROJECT */2 COLUMNS
print(q.collect())
# shape: (15, 2)
# ┌────────┬────────────┐
# │ as_int ┆ as_double  │
# │ ---    ┆ ---        │
# │ i32    ┆ f64        │
# ╞════════╪════════════╡
# │ 564    ┆ 188.755356 │
# │ 755    ┆ 883.610563 │
# │ 644    ┆ 203.439559 │
# │ 75     ┆ 277.880219 │
# │ 42     ┆ 403.857969 │
# │ …      ┆ …          │
# │ 294    ┆ 209.322436 │
# │ 150    ┆ 329.197303 │
# │ 539    ┆ 425.661029 │
# │ 247    ┆ 477.742227 │
# │ 958    ┆ 509.371273 │
# └────────┴────────────┘
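
The three listing calls above compose naturally into a walk over everything the server exposes. Below is a minimal sketch, assuming the return shapes shown in the example output (lists of dicts with a "name" key). `StubCatalog` is a hypothetical in-memory stand-in so the snippet runs without a server, and `qualified_table_names` is an illustrative helper, not part of this PR.

```python
from typing import Any


class StubCatalog:
    """Hypothetical stand-in exposing the same listing methods as the example."""

    def list_catalogs(self) -> list[dict[str, Any]]:
        return [{"comment": "Main catalog", "name": "unity"}]

    def list_schemas(self, catalog_name: str) -> list[dict[str, Any]]:
        return [{"comment": "Default schema", "name": "default"}]

    def list_tables(self, catalog_name: str, schema_name: str) -> list[dict[str, Any]]:
        # Trimmed to the keys this sketch needs.
        return [
            {"name": "marksheet", "data_source_format": "DELTA"},
            {"name": "numbers", "data_source_format": "DELTA"},
        ]


def qualified_table_names(catalog) -> list[str]:
    """Return 'catalog.schema.table' for every table reachable from the catalog."""
    names = []
    for cat in catalog.list_catalogs():
        for schema in catalog.list_schemas(cat["name"]):
            for table in catalog.list_tables(cat["name"], schema["name"]):
                names.append(f'{cat["name"]}.{schema["name"]}.{table["name"]}')
    return names


print(qualified_table_names(StubCatalog()))
# ['unity.default.marksheet', 'unity.default.numbers']
```

Because the API is marked unstable, the return types may change; the helper relies only on the "name" key.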

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 20, 2025

codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 25.19201% with 487 lines in your changes missing coverage. Please review.

Project coverage is 79.62%. Comparing base (3696e53) to head (db564e0).
Report is 9 commits behind head on main.

Files with missing lines                           Patch %   Lines
crates/polars-python/src/catalog/mod.rs            0.56%     177 Missing ⚠️
crates/polars-io/src/catalog/unity/client.rs       0.00%     102 Missing ⚠️
crates/polars-io/src/catalog/schema.rs             68.75%    55 Missing ⚠️
crates/polars-io/src/catalog/unity/utils.rs        0.00%     53 Missing ⚠️
py-polars/polars/catalog.py                        48.75%    41 Missing ⚠️
crates/polars-lazy/src/scan/catalog.rs             0.00%     31 Missing ⚠️
crates/polars-io/src/utils/other.rs                0.00%     21 Missing ⚠️
crates/polars-io/src/path_utils/hugging_face.rs    0.00%     4 Missing ⚠️
crates/polars-python/src/utils.rs                  0.00%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20798      +/-   ##
==========================================
- Coverage   79.78%   79.62%   -0.17%     
==========================================
  Files        1561     1568       +7     
  Lines      222015   222669     +654     
  Branches     2533     2543      +10     
==========================================
+ Hits       177135   177295     +160     
- Misses      44296    44790     +494     
  Partials      584      584              


@ion-elgreco
Contributor

Ah, we actually also have a PR open for this: delta-io/delta-rs#3078. We could have shared components of the client.


let args = ScanArgsParquet {
schema,
allow_missing_columns: matches!(data_source_format, DataSourceFormat::Delta),
Contributor


Plainly reading Delta Parquet files is not a safe operation; you have to check the protocol versions to determine whether you are allowed to read the table.
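
To illustrate the kind of check meant here: in the Delta transaction log, commit files under `_delta_log/` are newline-delimited JSON actions, and a `protocol` action carries `minReaderVersion`. The sketch below is a simplified illustration, not what this PR or delta-rs does. The helper names and the sample commit are made up, and a real reader must use the latest protocol action across the whole log and, for reader version 3, also check `readerFeatures`.

```python
import json
from typing import Iterable, Optional


def find_min_reader_version(commit_lines: Iterable[str]) -> Optional[int]:
    """Return minReaderVersion from the first protocol action, if any."""
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        protocol = action.get("protocol")
        if protocol is not None:
            return protocol["minReaderVersion"]
    return None


def can_read(commit_lines: Iterable[str], supported_reader_version: int) -> bool:
    """True if the table's required reader version is one we support."""
    required = find_min_reader_version(commit_lines)
    if required is None:
        # This commit alone does not say; a real reader would look further back.
        raise ValueError("no protocol action found in this commit")
    return required <= supported_reader_version


# Made-up commit content, for illustration only.
sample_commit = [
    '{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}',
    '{"metaData": {"id": "example"}}',
]
print(can_read(sample_commit, supported_reader_version=1))
# True
```

Skipping this check and scanning the Parquet files directly can silently return wrong data when the table uses features the reader does not understand.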

Collaborator Author


Hello, thanks for the review!

This branch is only hit if data_source_format=PARQUET - are there still protocol version checks that apply in this case?

For data_source_format=DELTA I am using the Python-side scan_delta.

Contributor


In that case it should be fine! :)
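
The routing discussed in this thread (DELTA tables go through `scan_delta`, PARQUET tables through a plain Parquet scan) can be sketched in Python as a dispatch on `data_source_format`. This is an illustrative approximation only: the mapping and both helper names are hypothetical, the PR itself splits this logic between Python and Rust, and whether each scan function accepts a given `storage_location` URI as-is is an assumption.

```python
# Hypothetical mapping from Unity Catalog format to a top-level polars scan
# function name; for illustration only, not the PR's actual code.
SCAN_FUNCTION_BY_FORMAT = {
    "DELTA": "scan_delta",      # Delta-aware reader handles the transaction log
    "PARQUET": "scan_parquet",  # plain Parquet files, no Delta log involved
    "CSV": "scan_csv",
}


def scan_function_name(table_info: dict) -> str:
    """Pick the scan function name for a table-info dict."""
    fmt = table_info["data_source_format"]
    if fmt not in SCAN_FUNCTION_BY_FORMAT:
        raise NotImplementedError(f"unsupported data_source_format: {fmt!r}")
    return SCAN_FUNCTION_BY_FORMAT[fmt]


def scan_from_table_info(table_info: dict):
    """Build a LazyFrame for the table (assumes polars is installed)."""
    import polars as pl  # imported lazily so the mapping works without polars

    scan = getattr(pl, scan_function_name(table_info))
    return scan(table_info["storage_location"])


print(scan_function_name({"data_source_format": "DELTA"}))
# scan_delta
```

Keeping the format check separate from the actual scan makes the unsupported-format path easy to test without a catalog server.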

@ritchie46 ritchie46 merged commit bf57bde into pola-rs:main Jan 20, 2025
28 checks passed
@nameexhaustion nameexhaustion deleted the catalog branch January 24, 2025 13:02
@nrccua-timr

y'all are amazing!!!! been waiting for this for over a year!
