Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Extension Types] Add support for cross-lang extension types. #899

Merged
merged 9 commits into from
May 18, 2023

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented May 5, 2023

This PR adds support for cross-language extension types, i.e. extension types that use a language-agnostic serialization method.

TODOs

  • Add non-tensor test coverage for Arrow < 12.0.0.

@clarkzinzow clarkzinzow force-pushed the clark/extension-types branch from ede88bf to 6011c52 Compare May 5, 2023 21:03
@@ -5,8 +5,11 @@ prettytable-rs = "^0.10"
rand = "^0.8"

[dependencies.arrow2]
branch = "clark/expand-casting-support"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we merge it into main in our fork before merging this PR?

Copy link
Contributor Author

@clarkzinzow clarkzinzow May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main con with working off of main is that we'd be making the divergence from upstream Arrow2 implicit, and if we keep merging upstream Arrow2 main into our fork's main, our differing commits might be lost in the git history. Instead, I think it would be better to anchor to a branch where all of the diverging commits are at the head and try to keep the branching small/short-lived.

The flow could be:

  1. Create a branch on Arrow2 fork containing requisite changes to Arrow2.
  2. Update Daft to point to that branch.
  3. Submit a PR from Arrow2 fork branch to upstream Arrow2.
  4. If the PR is updated during the review process, we can update the locked Daft dependency with a cargo update PR.
  5. When the PR is merged and is included in an Arrow2 release, switch Daft's Arrow2 dependency to point to that crates.io release.
  6. If we need another Arrow2 change stacked on the existing Arrow2 branch/PR, we could create a nightly snapshot branch containing both changes, similar to what Polars does: https://github.com/pola-rs/polars/blob/528590cfa57e48f5bd902ad027f5e35318644110/Cargo.toml#L47

I think this should help keep the difference with upstream Arrow2 explicit, which should be nicer? Let me know what you think about that flow.

@clarkzinzow clarkzinzow force-pushed the clark/extension-types branch from 59b18a7 to 262b9cc Compare May 13, 2023 21:55
@clarkzinzow clarkzinzow requested review from jaychia and samster25 May 13, 2023 21:56
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me, but might require changes after the logical types refactor?

src/datatypes/matching.rs Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the clark/extension-types branch from 8e80005 to 9bf180a Compare May 17, 2023 22:42
src/datatypes/dtype.rs Show resolved Hide resolved
src/datatypes/field.rs Outdated Show resolved Hide resolved
src/datatypes/field.rs Outdated Show resolved Hide resolved
src/datatypes/matching.rs Show resolved Hide resolved
src/series/mod.rs Outdated Show resolved Hide resolved
// TODO(Clark): Refactor to GILOnceCell in order to avoid deadlock between the below mutex and the Python GIL.
lazy_static! {
static ref REGISTRY: Mutex<HashMap<std::string::String, arrow2::datatypes::DataType>> =
Mutex::new(HashMap::new());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I can see how this could lead to a deadlock with the GIL. Could be worthwhile to Fix with the GILOnceCell.

@clarkzinzow clarkzinzow force-pushed the clark/extension-types branch 2 times, most recently from 88ddc40 to b46c42a Compare May 18, 2023 00:31
@clarkzinzow clarkzinzow force-pushed the clark/extension-types branch from b46c42a to 7cd2205 Compare May 18, 2023 00:41
@clarkzinzow clarkzinzow force-pushed the clark/extension-types branch from d5c6ea8 to efaab36 Compare May 18, 2023 01:10
@codecov
Copy link

codecov bot commented May 18, 2023

Codecov Report

Merging #899 (c5ea133) into main (b7810a4) will decrease coverage by 0.20%.
The diff coverage is 82.80%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #899      +/-   ##
==========================================
- Coverage   86.37%   86.18%   -0.20%     
==========================================
  Files         178      178              
  Lines       14415    14538     +123     
==========================================
+ Hits        12451    12529      +78     
- Misses       1964     2009      +45     
Impacted Files Coverage Δ
daft/arrow_utils.py 95.53% <ø> (ø)
src/array/ops/compare_agg.rs 58.11% <ø> (ø)
src/array/ops/sort.rs 96.82% <0.00%> (-0.45%) ⬇️
src/array/pseudo_arrow/mod.rs 70.76% <0.00%> (-2.25%) ⬇️
src/datatypes/mod.rs 33.33% <ø> (ø)
src/lib.rs 100.00% <ø> (ø)
src/series/array_impl/data_array.rs 91.30% <ø> (ø)
src/array/ops/take.rs 73.76% <30.76%> (-3.53%) ⬇️
src/array/ops/if_else.rs 90.39% <68.88%> (-2.29%) ⬇️
src/array/ops/broadcast.rs 73.87% <71.42%> (-0.17%) ⬇️
... and 14 more

... and 1 file with indirect coverage changes

@samster25 samster25 merged commit 19e3881 into main May 18, 2023
@samster25 samster25 deleted the clark/extension-types branch May 18, 2023 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants