Migrate from `re_arrow2` to `arrow` #3741

teh-cmc · 2023-10-09T10:42:29Z

Blockers

Soft-blocked on Add shrink_to_fit to Array apache/arrow-rs#6360 (for lowering memory use)
Semi-blocked on finding a work-around for Add an ExtensionType to DataType enum apache/arrow-rs#4472 (we use DataType::Extension for Tuid)
Blocked by Add unsafe/unchecked slice functions apache/arrow-rs#6901

Multiple end-goals:

Use same arrow lib as the rest of the ecosystem, which is where all the bug & perf fixes are actually happening
Use infinitely less space to store Arrow metadata (schema deduplication)
- Implement minimal datatype registry for cross-batch deduplication #1809
Make it possible to send raw Arrow data to Rerun and have it just work (RERUN:component_name)
- Clean up Arrow extension hell, implement RERUN:component_name #3360
- Also frees up usage of Arrow extensions for actual native extensions (e.g. Compatability with arrow fixed-shape-tensor extension #3004)
Native integration with half for f16
Etc etc

TODO (split into sub-issues as needed):

Move SizeBytes to own crate
Move fn arrow_ui to re_ui
Remove all direct uses of arrow2 (codegen, data{cell,row,table}, ArrowBuffer, etc)
- Related: Migrate ArrowBuffer wrapper to expose arrow-rs buffers rather than arrow2 #2978
Migrate serde-based components (i.e. blueprint stuff) to arrow1
- https://docs.rs/arrow-json/47.0.0/arrow_json/reader/struct.Decoder.html#method.serialize might be all we need
Get rid of Arrow extensions everywhere, introduce RERUN:component_name (Clean up Arrow extension hell, implement RERUN:component_name #3360)
Runtime schema registry / dedupe datatypes (Implement minimal datatype registry for cross-batch deduplication #1809)
Remove DataCell::component_name
Replace TransportChunk with RecordBatch?
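
The "runtime schema registry / dedupe datatypes" item can be pictured as plain interning: every distinct datatype is stored once and batches hold a cheap shared handle to it. The following is only a minimal sketch of that idea, not the design tracked in #1809. `Datatype` and `DatatypeRegistry` are hypothetical stand-ins; a real registry would key on the Arrow datatype itself.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow datatype (assumption, not a real Arrow type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Datatype {
    Float32,
    Utf8,
    List(Box<Datatype>),
    Struct(Vec<(String, Datatype)>),
}

/// Hypothetical registry that keeps exactly one shared copy of each distinct datatype.
#[derive(Default)]
pub struct DatatypeRegistry {
    interned: HashSet<Arc<Datatype>>,
}

impl DatatypeRegistry {
    /// Returns the shared handle for `datatype`, storing it the first time it is seen.
    pub fn intern(&mut self, datatype: Datatype) -> Arc<Datatype> {
        // `Arc<Datatype>: Borrow<Datatype>`, so the set can be queried by value.
        if let Some(existing) = self.interned.get(&datatype) {
            return Arc::clone(existing);
        }
        let shared = Arc::new(datatype);
        self.interned.insert(Arc::clone(&shared));
        shared
    }

    /// Number of distinct datatypes stored so far.
    pub fn num_datatypes(&self) -> usize {
        self.interned.len()
    }
}
```

Interning the same `List(Float32)` twice returns two handles to the same allocation, so the per-batch cost is one refcount increment rather than a full copy of the schema metadata.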

On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with polars (which is built on top of arrow2).

The solution to this is to make sure we only integrate with polars in one single place: the Data{Cell,Row,Table} layer (#1692).
Once that's done, we can remove all ad-hoc polars code everywhere and just build a Data{Row,Cell,Table} anytime we want a polars::Series/polars::DataFrame (#1759).

Internally, the conversion from DataTable to polars::DataFrame will require a zero-copy tri-stage conversion from arrow1->arrow2->polars.
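
What "zero-copy" means for such a chain can be shown without any Arrow dependency: each stage only rewraps a shared buffer, so the data pointer never changes. This is a sketch under that assumption. `Arrow1Array`, `Arrow2Array` and `PolarsSeries` are hypothetical stand-ins, not the real types.

```rust
use std::sync::Arc;

/// Hypothetical stand-ins for the three array types; each one only wraps a shared buffer.
pub struct Arrow1Array {
    pub values: Arc<[f32]>,
}
pub struct Arrow2Array {
    pub values: Arc<[f32]>,
}
pub struct PolarsSeries {
    pub values: Arc<[f32]>,
}

impl From<Arrow1Array> for Arrow2Array {
    fn from(array: Arrow1Array) -> Self {
        // Moves the handle; the underlying buffer is not copied.
        Self { values: array.values }
    }
}

impl From<Arrow2Array> for PolarsSeries {
    fn from(array: Arrow2Array) -> Self {
        Self { values: array.values }
    }
}

/// The three stages chained together: arrow1 -> arrow2 -> polars.
pub fn to_series(array: Arrow1Array) -> PolarsSeries {
    PolarsSeries::from(Arrow2Array::from(array))
}
```

Comparing `values.as_ptr()` before and after the conversion is a cheap way to check that no stage silently copied the buffer.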

Supersedes arrow2 does _not_ refcount schema metadata #1805
Supersedes Switch to arrow-rs #2354


This PR introduces a new crate: `re_types_core`. `re_types_core` only contains the fundamental traits and types that make up Rerun's data model. It is split off from the existing `re_types`.

This makes it possible to work with our data model abstractions without having to depend on the `re_types` behemoth. This is more than a DX improvement: since so many things depend directly or indirectly on `re_types`, it is very easy to end up with unsolvable dependency cycles. This helps with that in some cases (though certainly not all). In particular, `re_tuid` (and by extension `re_format`) are now completely free of `re_types`.

For convenience, `re_types` reexports all of `re_types_core`, so the public API looks unchanged.

In a handful of instances (`re_arrow_store`, `re_data_store`, `re_log_types`, `re_query`), I've gone the extra mile and started porting these crates towards raw `re_types_core` rather than relying on the reexports. The reason is that, upon closer inspection, these crates are very close to being able to live free of `re_types`. In the future, the custom crate and custom module attributes coming with #3741 might allow us to make these independent.

Similarly, the codegen now uses `re_types_core` directly, as that makes the life of the upcoming "serde-codegen" work much easier.

**Commit by commit**

This is necessary refactoring work for the upcoming `attr.rust.custom_crate` attribute, itself necessary for the upcoming serde-codegen support, itself necessary for the upcoming blueprint experimentations as well as #3741.

### Changes

1. The `CodeGenerator` trait as well as all post-processing helpers (gitattributes, orphan detection...) are now I/O-free.

   ```rust
   pub type GeneratedFiles = std::collections::BTreeMap<camino::Utf8PathBuf, String>;

   pub trait CodeGenerator {
       fn generate(
           &mut self,
           reporter: &crate::Reporter,
           objects: &crate::Objects,
           arrow_registry: &crate::ArrowRegistry,
       ) -> GeneratedFiles;
   }
   ```

2. All post-processing helpers are now agnostic to the output location. This is very important as it makes it possible to generate e.g. Rust code outside of the `re_types` crate without everything crumbling down. A side-effect is that gitattributes files are now finer-grained.

3. The Rust codegen pass is now crate agnostic: it is driven by the workspace path rather than a specific crate path. Necessary for the upcoming `attr.rust.custom_crate`.

4. All codegen passes now follow the exact same 4-step structure:

   ```rust
   // 1. Generate in-memory code files.
   let mut gen = MyGenerator::new();
   let mut files = gen.generate(reporter, objects, arrow_registry);
   // 2. Generate in-memory attribute files.
   generate_gitattributes_for_generated_files(&mut files);
   // 3. Write all in-memory files to disk.
   write_files(&gen.pkg_path, &gen.testing_pkg_path, &files);
   // 4. Remove orphaned files.
   crate::codegen::common::remove_orphaned_files(reporter, &files);
   ```

5. The documentation codegen pass now removes its orphans, which is why some `md` files were removed in this PR.

---

- Unblocks #3741
- Unblocks #3495
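
To illustrate why an I/O-free generator makes the post-processing steps easy to reason about, here is a small std-only sketch. It is not the code from the PR: `generate` and `find_orphans` are hypothetical helpers, and `PathBuf` replaces `camino::Utf8PathBuf` only to avoid an external dependency.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::path::PathBuf;

/// In-memory output of a codegen pass: path -> file contents.
pub type GeneratedFiles = BTreeMap<PathBuf, String>;

/// Hypothetical I/O-free generator: it only builds strings and never touches the disk.
pub fn generate(out_dir: &str, object_names: &[&str]) -> GeneratedFiles {
    object_names
        .iter()
        .map(|name| {
            let path = PathBuf::from(out_dir).join(format!("{name}.rs"));
            let code = format!("// DO NOT EDIT! Generated code.\npub struct {name};\n");
            (path, code)
        })
        .collect()
}

/// Hypothetical orphan detection: files that exist already but that this run did not produce.
pub fn find_orphans(existing: &BTreeSet<PathBuf>, generated: &GeneratedFiles) -> Vec<PathBuf> {
    existing
        .iter()
        .filter(|path| !generated.contains_key(*path))
        .cloned()
        .collect()
}
```

Because both functions work on plain maps and sets, they can be unit-tested without a temporary directory, and the only step that needs filesystem access is the final write.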

emilk · 2024-07-08T12:29:35Z

re_arrow2 has an arrow feature, with glue for converting data between arrow and re_arrow2: https://docs.rs/re_arrow2/0.17.4/re_arrow2/array/trait.Arrow2Arrow.html

Using that we can start this migration piece-wise. It would double the dependencies for a transitional period, leading to longer compilation times and a bigger .wasm binary, but I think that is an OK tradeoff.

Potential roadmap:

Verify that Arrow2Arrow is zero-copy
- Benchmark arrow2arrow re_arrow2#6
Remove support for nullable components #6819
Move SizeBytes to own crate, with separate arrow and arrow2 feature flags
Rename to_arrow/from_arrow/… to to_arrow2/from_arrow2/…
Add poly-filled to_arrow/from_arrow using the glue
Migrate codegenned serialization
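
The rename-then-polyfill steps of this roadmap could look roughly like the sketch below. Everything here is a stand-in: `Arrow2Array`, `ArrowArray`, the two glue functions and `LoggableSketch` are hypothetical and only mimic the shape of the idea. The real glue is the `Arrow2Arrow` trait linked above, whose exact signatures are not reproduced here.

```rust
/// Hypothetical stand-ins for an `re_arrow2` array and an `arrow` array.
#[derive(Debug, PartialEq)]
pub struct Arrow2Array(pub Vec<f32>);
#[derive(Debug, PartialEq)]
pub struct ArrowArray(pub Vec<f32>);

/// Hypothetical glue: moves the buffer across without copying it.
pub fn arrow2_to_arrow(array: Arrow2Array) -> ArrowArray {
    ArrowArray(array.0)
}
pub fn arrow_to_arrow2(array: ArrowArray) -> Arrow2Array {
    Arrow2Array(array.0)
}

pub trait LoggableSketch: Sized {
    // The existing implementations, renamed with an explicit `2` suffix.
    fn to_arrow2(items: &[Self]) -> Arrow2Array;
    fn from_arrow2(array: &Arrow2Array) -> Vec<Self>;

    // The old names come back as polyfills: arrow2 does the work, the glue converts.
    fn to_arrow(items: &[Self]) -> ArrowArray {
        arrow2_to_arrow(Self::to_arrow2(items))
    }
    fn from_arrow(array: ArrowArray) -> Vec<Self> {
        Self::from_arrow2(&arrow_to_arrow2(array))
    }
}

/// Example component used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Radius(pub f32);

impl LoggableSketch for Radius {
    fn to_arrow2(items: &[Self]) -> Arrow2Array {
        Arrow2Array(items.iter().map(|radius| radius.0).collect())
    }
    fn from_arrow2(array: &Arrow2Array) -> Vec<Self> {
        array.0.iter().copied().map(Radius).collect()
    }
}
```

With this shape, call sites can move to `to_arrow`/`from_arrow` first, and the default methods can later be replaced by native `arrow` implementations without touching those call sites again.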

After de-chunkification:

Migrate codegenned deserialization
Migrate everything else

As of 2024-07-08, there are only around 300 lines of Rust referencing the string arrow2 directly, when one ignores generated code.

ignored paths

crates/re_types/**, crates/re_types_core/src/archetypes/**, crates/re_types_core/src/datatypes/**, crates/re_types_core/src/components/**, crates/re_types_blueprint/src/blueprint/components/**, crates/re_types_blueprint/src/blueprint/archetypes/**

jleibs · 2024-07-10T17:11:23Z

I believe Experimental DataFusion integration #6807 also requires bringing in a dependency on arrow

It doesn't make any sense for a `ComponentBatch` to have any say in what the final `ArrowField` should look like. An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that only makes sense during IO/transport/FFI/storage/etc, and which requires external context that a single `ComponentBatch` on its own has no idea of.

---

Part of a lot of clean-up I want to do while we head towards:
* #7245
* #3741

teh-cmc · 2024-08-31T18:19:40Z

Blocked on:

Fix MutableBuffer::into_buffer leaking its extra capacity into the final buffer apache/arrow-rs#6300

teh-cmc · 2024-09-05T13:44:09Z

New blocker:

Add shrink_to_fit to Array apache/arrow-rs#6360
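
Both blockers are about spare capacity surviving into long-lived buffers. The effect is easy to reproduce with a plain `Vec`, used here as an analogy only: this sketch says nothing about how arrow-rs buffers are implemented, and `unused_bytes` and `shrink` are hypothetical helpers.

```rust
use std::mem::size_of;

/// Bytes that are allocated by the buffer but not used by any element.
pub fn unused_bytes(buffer: &Vec<u64>) -> usize {
    (buffer.capacity() - buffer.len()) * size_of::<u64>()
}

/// Releases spare capacity before the buffer is kept around long-term.
pub fn shrink(mut buffer: Vec<u64>) -> Vec<u64> {
    // The allocator may keep a little slack, but the bulk of the spare capacity is returned.
    buffer.shrink_to_fit();
    buffer
}
```

A buffer that was grown to 1024 elements and then filled with 3 keeps paying for the other 1021 until something like `shrink` is called, which is the kind of overhead that matters when many small arrays are stored.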

### What

* Waiting for a proper fix in apache/arrow-rs#6401
* Should be fixed before #3741 is considered finished

### What

* Part of #3741
* Rename `arrow_datatype` to `arrow2_datatype`
* Rename `to_arrow` to `to_arrow2`
* Rename `from_arrow` to `from_arrow2`

The next step is adding back the previous function names, but with an actual `arrow-rs` interface, then starting to port code to use them.

* Follows #8206
* Part of #3741

## Changes

To implement nullable unions, we have a `_null_marker: Null` variant in all our unions. This means all our unions are nullable.

Previously we would only mark a struct field as nullable if it was declared as such in the `.fbs` file, but `arrow-rs` complains about this.

So with this PR, if a struct field refers to a union type, that struct field will be marked as `nullable: true` in the datatype (in Rust, Python and C++).
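
The nullability rule described in this PR can be stated as a one-line predicate. The sketch below uses hypothetical `DataType` and `Field` stand-ins rather than the real Arrow types, and `struct_field` is an illustrative helper, not the codegen function.

```rust
/// Hypothetical, heavily simplified stand-ins for Arrow's datatype and field.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float32,
    Union(Vec<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Builds a struct field: nullable if declared so in the schema, or if its type is a union
/// (every union carries a null-marker variant, so it can always hold a null).
pub fn struct_field(name: &str, data_type: DataType, declared_nullable: bool) -> Field {
    let nullable = declared_nullable || matches!(data_type, DataType::Union(_));
    Field {
        name: name.to_owned(),
        data_type,
        nullable,
    }
}
```

A field of a plain type keeps whatever the schema declared, while a field of a union type is always emitted as nullable, regardless of the declaration.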

teh-cmc added the 🏹 arrow concerning arrow label Oct 9, 2023

teh-cmc mentioned this issue Oct 16, 2023

Introduce re_types_core #3878

Merged

4 tasks

teh-cmc mentioned this issue Oct 17, 2023

Make codegen I/O-free and agnostic to output location #3888

Merged

4 tasks

teh-cmc mentioned this issue Jan 9, 2024

Dataframe extension for Chunk #1692

Closed

emilk mentioned this issue Jan 11, 2024

Fork arrow2 and get rid of polars #4789

Closed

emilk mentioned this issue Jan 23, 2024

We need a better data slicing mechanism than Box<dyn Array> #4884

Closed

teh-cmc mentioned this issue May 31, 2024

Client-side chunks 1: introduce Chunk and its suffle/sort routines #6438

Merged

5 tasks

emilk self-assigned this Jul 8, 2024

emilk changed the title ~~Tracking issue: arrow cleanup & migration away from arrow2{-convert}~~ Tracking issue: Migrate from re_arrow2 to arrow Jul 8, 2024

emilk added dependencies concerning crates, pip packages etc 🦀 Rust API Rust logging API labels Jul 9, 2024

emilk removed their assignment Jul 9, 2024

teh-cmc mentioned this issue Aug 23, 2024

Remove unused Datatype and DatatypeBatch #7256

Merged

6 tasks

teh-cmc added a commit that referenced this issue Aug 23, 2024

Remove Datatype and DatatypeBatch (#7256)

2e2a988

Remove unused old traits.

Part of a lot of clean-up I want to do while we head towards:
* #7245
* #3741

teh-cmc mentioned this issue Aug 23, 2024

Remove unused Loggable{Batch}::arrow_field #7257

Merged

6 tasks

This was referenced Aug 23, 2024

Remove unused LoggableBatch::num_instances #7258

Merged

Remove unused Loggable::extended_arrow_datatype #7260

Merged

teh-cmc added the blocked can't make progress right now label Aug 31, 2024

emilk mentioned this issue Sep 16, 2024

Silence RUSTSEC-2023-0086 #7426

Merged

6 tasks

emilk self-assigned this Nov 21, 2024

emilk mentioned this issue Nov 21, 2024

Rust API: be explicit about when we're using the arrow2 crate #8194

Merged

1 task

emilk mentioned this issue Nov 21, 2024

Add arrow(1)-interface on top of Loggable and ArrowBuffer #8197

Merged

1 task

emilk added a commit that referenced this issue Nov 22, 2024

Add arrow(1)-interface on top of Loggable and ArrowBuffer (#8197)

7b32378

* [x] I checked this checkbox
* Part of #3741

This was referenced Nov 22, 2024

Use arrow-rs in ArrowBuffer #8201

Merged

Port codegen of arrow datatype to arrow1 #8206

Merged

Port codegened arrow serialization to arrow1 #8208

Merged

emilk added a commit that referenced this issue Nov 25, 2024

Port codegen of arrow datatype to arrow1 (#8206)

a060d06

* Part of #3741

emilk added a commit that referenced this issue Nov 25, 2024

Use arrow-rs in ArrowBuffer (#8201)

11df7bb

* Part of #3741
* Closes #2978

emilk added the project Tracking issues for so-called "Projects" label Dec 4, 2024

emilk changed the title ~~Tracking issue: Migrate from re_arrow2 to arrow~~ Migrate from re_arrow2 to arrow Dec 4, 2024

emilk removed the blocked can't make progress right now label Dec 4, 2024

This was referenced Dec 9, 2024

Make ArrowString an opaque type #8365

Merged

Move as_array_ref into re_types_code/arrow_helpers #8362

Merged

emilk added a commit that referenced this issue Dec 9, 2024

Make ArrowString an opaque type (#8365)

04125bb

* Part of #3741

emilk mentioned this issue Dec 9, 2024

Prepare to port codegen deserialization from arrow2 #8372

Merged

emilk added a commit that referenced this issue Dec 9, 2024

Prepare to port codegen deserialization from arrow2 (#8372)

59a8440

### Related

* Part of #3741

### What

Some preparatory work for migrating the codegen deserializer from `re_arrow2`

emilk mentioned this issue Dec 9, 2024

Port codegen arrow deserialization to arrow-rs #8375

Draft

1 task

emilk mentioned this issue Dec 18, 2024

Port parts of viewer to arrow-rs #8534

Merged

1 task

emilk added a commit that referenced this issue Dec 19, 2024

Port parts of viewer to arrow-rs (#8534)

662e404

### Related

* Part of #3741

### TODO

* [x] `@rerun-bot full-check`
