Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate from re_arrow2 to arrow #3741

Open
1 of 8 tasks
teh-cmc opened this issue Oct 9, 2023 · 4 comments
Open
1 of 8 tasks

Migrate from re_arrow2 to arrow #3741

teh-cmc opened this issue Oct 9, 2023 · 4 comments
Assignees
Labels
🏹 arrow concerning arrow dependencies concerning crates, pip packages etc project Tracking issues for so-called "Projects" 🦀 Rust API Rust logging API

Comments

@teh-cmc
Copy link
Member

teh-cmc commented Oct 9, 2023

Blockers


Multiple end-goals:


TODO (split into sub-issues as needed):

On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with polars (which is built on top of arrow2).

The solution to this is to make sure we only integrate with polars in one single place: the Data{Cell,Row,Table} layer (#1692).
Once that's done, we can remove all ad-hoc polars code everywhere and just build a Data{Row,Cell,Table} anytime we want a polars::Series/polars::DataFrame (#1759).

Internally, the conversion from DataTable to polars::DataFrame will require a zero-copy tri-stage conversion from arrow1->arrow2->polars.


@teh-cmc teh-cmc added the 🏹 arrow concerning arrow label Oct 9, 2023
@teh-cmc teh-cmc mentioned this issue Oct 16, 2023
4 tasks
teh-cmc added a commit that referenced this issue Oct 16, 2023
This PR introduces a new crate: `re_types_core`.

`re_types_core` only contains the fundamental traits and types that make
up Rerun's data model.
It is split off from the existing `re_types`.

This makes it possible to work with our data model abstractions without
having to depend on the `re_types` behemoth.
This is more than a DX improvement: since so many things depend directly
or indirectly on `re_types`, it is very easy to end-up with unsolvable
dependency cycles. This helps with that in some cases (though certainly
not all).

In particular, `re_tuid` (and by extension `re_format`) are now
completely free of `re_types`.

For convenience, `re_types` reexports all of `re_types_core`, so the
public API looks unchanged.
In a handful of instances (`re_arrow_store`, `re_data_store`,
`re_log_types`, `re_query`), I've went the extra mile and started
porting these crates towards raw `re_types_core` rather than relying on
the reexports.
The reason is that, upon closer inspection, these crates are very close
to being able to live free of `re_types`. In the future, the custom
crate and custom module attributes coming with #3741 might allow us to
make these independent.

Similarly, the codegen now uses `re_types_core` directly, as that makes
the life of the upcoming "serde-codegen" work much easier.
teh-cmc added a commit that referenced this issue Oct 17, 2023
**Commit by commit**

This is necessary refactoring work for the upcoming
`attr.rust.custom_crate` attribute, itself necessary for the upcoming
serde-codegen support, itself necessary for the upcoming blueprint
experimentations as well as #3741.

### Changes

1. The `CodeGenerator` trait as well as all post-processing helpers
(gitattributes, orphan detection...) are now I/O-free.
  ```rust
pub type GeneratedFiles =
std::collections::BTreeMap<camino::Utf8PathBuf, String>;
  
  pub trait CodeGenerator {
      fn generate(
          &mut self,
          reporter: &crate::Reporter,
          objects: &crate::Objects,
          arrow_registry: &crate::ArrowRegistry,
      ) -> GeneratedFiles;
  }
  ```
2. All post-processing helpers are now agnostic to the location output.
This is very important as it makes it possible to generate e.g. rust
code out of the `re_types` crate without everything crumbling down.
A side-effect is that gitattributes files are now finer-grained.
3. The Rust codegen pass is now crate agnostic: it is driven by the
workspace path rather than a specific crate path.
Necessary for the upcoming `attr.rust.custom_crate`.
4. All codegen passes now follow the exact same 4-step structure:
  ```
  // 1. Generate in-memory code files.
  let mut gen = MyGenerator::new();
  let mut files = gen.generate(reporter, objects, arrow_registry);
  // 2. Generate in-memory attribute files.
  generate_gitattributes_for_generated_files(&mut files);
  // 3. Write all in-memory files to disk.
  write_files(&gen.pkg_path, &gen.testing_pkg_path, &files);
  // 4. Remove orphaned files.
  crate::codegen::common::remove_orphaned_files(reporter, &files);
  ```
5. The documentation codegen pass now removes its orphans, which is why
some `md` files were removed in this PR.

---

- Unblocks #3741 
- Unblocks #3495
@emilk
Copy link
Member

emilk commented Jul 8, 2024

re_arrow2 has an arrow feature, with glue for converting data between arrow and re_arrow2: https://docs.rs/re_arrow2/0.17.4/re_arrow2/array/trait.Arrow2Arrow.html

Using that we can start this migration piece-wise. It would have double the dependencies for a transitionary period, leading to longer compilation times and bigger .wasm binary, but I think that is an ok tradeoff.

Potential roadmap:

After de-chunkfification:

  • Migrate codegenned deserialization
  • Migrate everything else

As of 2024-07-08, there are only around 300 lines of Rust referencing the string arrow2 directly, when one ignores generated code.

ignored paths crates/re_types/**, crates/re_types_core/src/archetypes/**, crates/re_types_core/src/datatypes/**, crates/re_types_core/src/components/**, crates/re_types_blueprint/src/blueprint/components/**, crates/re_types_blueprint/src/blueprint/archetypes/**

@emilk emilk self-assigned this Jul 8, 2024
@emilk emilk changed the title Tracking issue: arrow cleanup & migration away from arrow2{-convert} Tracking issue: Migrate from re_arrow2 to arrow Jul 8, 2024
@emilk emilk added dependencies concerning crates, pip packages etc 🦀 Rust API Rust logging API labels Jul 9, 2024
@emilk emilk removed their assignment Jul 9, 2024
@jleibs
Copy link
Member

jleibs commented Jul 10, 2024

teh-cmc added a commit that referenced this issue Aug 23, 2024
Remove unused old traits.

Part of a lot of clean up I want to while we head towards:
* #7245
* #3741
teh-cmc added a commit that referenced this issue Aug 23, 2024
It doesn't make any sense for a `ComponentBatch` to have any say in what
the final `ArrowField` should look like.

An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that
only makes sense during IO/transport/FFI/storage/etc, and which requires
external context that a single `ComponentBatch` on its own has no idea
of.

---

Part of a lot of clean up I want to while we head towards:
* #7245
* #3741
@teh-cmc
Copy link
Member Author

teh-cmc commented Aug 31, 2024

@teh-cmc teh-cmc added the blocked can't make progress right now label Aug 31, 2024
@teh-cmc
Copy link
Member Author

teh-cmc commented Sep 5, 2024

@emilk emilk mentioned this issue Sep 16, 2024
6 tasks
Wumpf pushed a commit that referenced this issue Sep 16, 2024
### What
* Waiting for a proper fix in
apache/arrow-rs#6401
* Should be fixed before #3741
is considered finished

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested the web demo (if applicable):
* Using examples from latest `main` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7426?manifest_url=https://app.rerun.io/version/main/examples_manifest.json)
* Using full set of examples from `nightly` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7426?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG
* [x] If applicable, add a new check to the [release
checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)!
* [x] If have noted any breaking changes to the log API in
`CHANGELOG.md` and the migration guide

- [PR Build Summary](https://build.rerun.io/pr/7426)
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

To run all checks from `main`, comment on the PR with `@rerun-bot
full-check`.
@emilk emilk self-assigned this Nov 21, 2024
emilk added a commit that referenced this issue Nov 21, 2024
### What
* Part of #3741

* Rename `arrow_datatype` to `arrow2_datatype`
* Rename `to_arrow` to `to_arrow2`
* Rename `from_arrow` to `from_arrow2`

Next step is adding back the previous function names, but with actual
`arrow-rs` interface, then start porting code to using them.

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)

To run all checks from `main`, comment on the PR with `@rerun-bot
full-check`.

To deploy documentation changes immediately after merging this PR, add
the `deploy docs` label.
emilk added a commit that referenced this issue Nov 22, 2024
* [x] I checked this checkbox
* Part of #3741
emilk added a commit that referenced this issue Nov 25, 2024
emilk added a commit that referenced this issue Nov 25, 2024
emilk added a commit that referenced this issue Nov 25, 2024
* Follows #8206
* Part of #3741

## Changes
To implement nullable unions, we have a `_null_marker: Null` variants in
all our unions. This means all our unions are nullable.

Previously we would only mark a struct field as nullable if it was
declared as such in the `.fbs` file, but `arrow-rs` complains about
this. So with this PR, if a struct field refers to a union type, that
struct field will be marked as `nullable: true` in the datatype (in
Rust, Python and C++).
@emilk emilk added the project Tracking issues for so-called "Projects" label Dec 4, 2024
@emilk emilk changed the title Tracking issue: Migrate from re_arrow2 to arrow Migrate from re_arrow2 to arrow Dec 4, 2024
@emilk emilk removed the blocked can't make progress right now label Dec 4, 2024
emilk added a commit that referenced this issue Dec 9, 2024
emilk added a commit that referenced this issue Dec 9, 2024
### Related
* Part of #3741
 
### What
Some preparatory work for migrating the codegen deserializer from
`re_arrow2`
emilk added a commit that referenced this issue Dec 19, 2024
### Related
* Part if #3741

### TODO
* [x] `@rerun-bot full-check`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏹 arrow concerning arrow dependencies concerning crates, pip packages etc project Tracking issues for so-called "Projects" 🦀 Rust API Rust logging API
Projects
None yet
Development

No branches or pull requests

3 participants