-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate from re_arrow2
to arrow
#3741
Comments
This PR introduces a new crate: `re_types_core`. `re_types_core` only contains the fundamental traits and types that make up Rerun's data model. It is split off from the existing `re_types`. This makes it possible to work with our data model abstractions without having to depend on the `re_types` behemoth. This is more than a DX improvement: since so many things depend directly or indirectly on `re_types`, it is very easy to end-up with unsolvable dependency cycles. This helps with that in some cases (though certainly not all). In particular, `re_tuid` (and by extension `re_format`) are now completely free of `re_types`. For convenience, `re_types` reexports all of `re_types_core`, so the public API looks unchanged. In a handful of instances (`re_arrow_store`, `re_data_store`, `re_log_types`, `re_query`), I've went the extra mile and started porting these crates towards raw `re_types_core` rather than relying on the reexports. The reason is that, upon closer inspection, these crates are very close to being able to live free of `re_types`. In the future, the custom crate and custom module attributes coming with #3741 might allow us to make these independent. Similarly, the codegen now uses `re_types_core` directly, as that makes the life of the upcoming "serde-codegen" work much easier.
**Commit by commit** This is necessary refactoring work for the upcoming `attr.rust.custom_crate` attribute, itself necessary for the upcoming serde-codegen support, itself necessary for the upcoming blueprint experimentations as well as #3741. ### Changes 1. The `CodeGenerator` trait as well as all post-processing helpers (gitattributes, orphan detection...) are now I/O-free. ```rust pub type GeneratedFiles = std::collections::BTreeMap<camino::Utf8PathBuf, String>; pub trait CodeGenerator { fn generate( &mut self, reporter: &crate::Reporter, objects: &crate::Objects, arrow_registry: &crate::ArrowRegistry, ) -> GeneratedFiles; } ``` 2. All post-processing helpers are now agnostic to the location output. This is very important as it makes it possible to generate e.g. rust code out of the `re_types` crate without everything crumbling down. A side-effect is that gitattributes files are now finer-grained. 3. The Rust codegen pass is now crate agnostic: it is driven by the workspace path rather than a specific crate path. Necessary for the upcoming `attr.rust.custom_crate`. 4. All codegen passes now follow the exact same 4-step structure: ``` // 1. Generate in-memory code files. let mut gen = MyGenerator::new(); let mut files = gen.generate(reporter, objects, arrow_registry); // 2. Generate in-memory attribute files. generate_gitattributes_for_generated_files(&mut files); // 3. Write all in-memory files to disk. write_files(&gen.pkg_path, &gen.testing_pkg_path, &files); // 4. Remove orphaned files. crate::codegen::common::remove_orphaned_files(reporter, &files); ``` 5. The documentation codegen pass now removes its orphans, which is why some `md` files were removed in this PR. --- - Unblocks #3741 - Unblocks #3495
Using that we can start this migration piece-wise. It would have double the dependencies for a transitionary period, leading to longer compilation times and bigger .wasm binary, but I think that is an ok tradeoff. Potential roadmap:
After de-chunkfification:
As of 2024-07-08, there are only around 300 lines of Rust referencing the string ignored pathscrates/re_types/**, crates/re_types_core/src/archetypes/**, crates/re_types_core/src/datatypes/**, crates/re_types_core/src/components/**, crates/re_types_blueprint/src/blueprint/components/**, crates/re_types_blueprint/src/blueprint/archetypes/** |
arrow2{-convert}
re_arrow2
to arrow
|
It doesn't make any sense for a `ComponentBatch` to have any say in what the final `ArrowField` should look like. An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that only makes sense during IO/transport/FFI/storage/etc, and which requires external context that a single `ComponentBatch` on its own has no idea of. --- Part of a lot of clean up I want to while we head towards: * #7245 * #3741
New blocker: |
### What * Waiting for a proper fix in apache/arrow-rs#6401 * Should be fixed before #3741 is considered finished ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested the web demo (if applicable): * Using examples from latest `main` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7426?manifest_url=https://app.rerun.io/version/main/examples_manifest.json) * Using full set of examples from `nightly` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7426?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG * [x] If applicable, add a new check to the [release checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)! * [x] If have noted any breaking changes to the log API in `CHANGELOG.md` and the migration guide - [PR Build Summary](https://build.rerun.io/pr/7426) - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html) To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.
### What * Part of #3741 * Rename `arrow_datatype` to `arrow2_datatype` * Rename `to_arrow` to `to_arrow2` * Rename `from_arrow` to `from_arrow2` Next step is adding back the previous function names, but with actual `arrow-rs` interface, then start porting code to using them. ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) To run all checks from `main`, comment on the PR with `@rerun-bot full-check`. To deploy documentation changes immediately after merging this PR, add the `deploy docs` label.
* [x] I checked this checkbox * Part of #3741
* Follows #8206 * Part of #3741 ## Changes To implement nullable unions, we have a `_null_marker: Null` variants in all our unions. This means all our unions are nullable. Previously we would only mark a struct field as nullable if it was declared as such in the `.fbs` file, but `arrow-rs` complains about this. So with this PR, if a struct field refers to a union type, that struct field will be marked as `nullable: true` in the datatype (in Rust, Python and C++).
re_arrow2
to arrow
re_arrow2
to arrow
### Related * Part of #3741 ### What Some preparatory work for migrating the codegen deserializer from `re_arrow2`
Blockers
shrink_to_fit
toArray
apache/arrow-rs#6360 (for lowering memory use)DataType::Extension
for Tuid)Multiple end-goals:
RERUN:component_name
)RERUN:component_name
#3360half
forf16
TODO (split into sub-issues as needed):
SizeBytes
to own cratefn arrow_ui
tore_ui
arrow2
(codegen, data{cell,row,table},ArrowBuffer
, etc)ArrowBuffer
wrapper to exposearrow-rs
buffers rather thanarrow2
#2978arrow1
RERUN:component_name
(Clean up Arrow extension hell, implementRERUN:component_name
#3360)DataCell::component_name
TransportChunk
withRecordBatch
?On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with
polars
(which is built on top ofarrow2
).The solution to this is to make sure we only integrate with
polars
in one single place: theData{Cell,Row,Table}
layer (#1692).Once that's done, we can remove all ad-hoc polars code everywhere and just build a
Data{Row,Cell,Table}
anytime we want apolars::Series
/polars::DataFrame
(#1759).Internally, the conversion from
DataTable
topolars::DataFrame
will require a zero-copy tri-stage conversion fromarrow1
->arrow2
->polars
.arrow2
does _not_ refcount schema metadata #1805The text was updated successfully, but these errors were encountered: