Use of string arrays in the schema (especially unicode) #294

Open
sstansill opened this issue Oct 22, 2024 · 0 comments

For next-generation telescopes (SKA, ngVLA), the zarr-python library likely won't provide a fast enough interface to MSv4 datasets. Instead, libraries written in lower-level languages will be used, and the MSv4 schema should be compatible with them. In particular, the SKAO has used Google's TensorStore to prototype MSv4 support in WSClean.

The problem is that arrays with Unicode data types aren't supported by any of the C/C++ Zarr implementations listed at https://zarr.dev/implementations/. So I propose that null-terminated byte sequences "<S*" be used in place of Unicode "<U*" data types for arrays (there are 59 instances of Unicode dtypes in v4.0.0 of the schema).
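As a rough illustration of the proposed change, here is a minimal NumPy-only sketch that converts a Unicode coordinate to fixed-width byte strings. The polarization labels are made-up sample values, and the sketch assumes the labels are pure ASCII; it does not touch any Zarr store or the actual schema code.

```python
import numpy as np

# Sample polarization labels stored as a unicode array (illustrative values only).
pol_unicode = np.array(["XX", "XY", "YX", "YY"], dtype="<U2")

# Encode each element to bytes; the result is a fixed-width byte-string array.
# Assumes every label is ASCII, otherwise np.char.encode raises UnicodeEncodeError.
pol_bytes = np.char.encode(pol_unicode, "ascii")

print(pol_unicode.dtype, pol_unicode.itemsize)  # 4 bytes per character
print(pol_bytes.dtype, pol_bytes.itemsize)      # 1 byte per character

# Decoding recovers the original unicode labels when a reader needs them.
pol_roundtrip = np.char.decode(pol_bytes, "ascii")
```

NumPy stores each "U" character as 4 bytes, so for ASCII-only labels the byte-string form is also a quarter of the size.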

Additionally, variable- or unknown-length strings ("<U0" and "<S0") should be avoided wherever possible, both to reduce the amount of data stored on disk and to speed up opening a dataset: all coordinates are read eagerly, and variable-length strings are slower to parse. For example, the polarization coordinate should have dtype "<S2". For the coordinates baseline_antenna1_name and baseline_antenna2_name, it may be best to revert to integer arrays. The name corresponding to an antenna index can be any length, which leads to larger metadata and more verbose code, so the long-format antenna names should be reserved for AntennaXds.
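To make the integer-array suggestion concrete, here is a small sketch of replacing per-baseline antenna names with indices into a single name list. The antenna names and the variable names `baseline_antenna1` and `baseline_antenna2` are hypothetical, and the sketch assumes the name list is kept once elsewhere (for example in AntennaXds); it is not the schema's actual layout.

```python
import numpy as np

# Hypothetical antenna names, stored once (e.g. alongside the antenna dataset).
antenna_names = np.array(["ant_A", "ant_B", "ant_C"])

# Per-baseline name coordinates as they might appear today (sample values).
baseline_antenna1_name = np.array(["ant_A", "ant_A", "ant_B"])
baseline_antenna2_name = np.array(["ant_B", "ant_C", "ant_C"])

# Map each name to its position in the antenna list.
name_to_index = {str(name): i for i, name in enumerate(antenna_names)}

# Replace the string coordinates with compact fixed-size integer arrays.
baseline_antenna1 = np.array(
    [name_to_index[str(n)] for n in baseline_antenna1_name], dtype=np.int32
)
baseline_antenna2 = np.array(
    [name_to_index[str(n)] for n in baseline_antenna2_name], dtype=np.int32
)

# The long-format names stay recoverable by indexing the name list.
recovered_antenna1 = antenna_names[baseline_antenna1]
```

Each baseline entry then costs a fixed 4 bytes regardless of how long the antenna names are.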
