
Add support for loading data partially #13

Open
Illviljan opened this issue Apr 5, 2024 · 3 comments

@Illviljan

This is a cool package! I'm excited!

It would be really nice if it was possible to load a subset of the data using for example

  • integer arrays, np.array([0, 1, 2, 8, 9], dtype=int).
  • boolean arrays, np.array([0, 1, 0, 1], dtype=bool).

This would allow quite advanced filtering without having to load all the data to RAM.
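For concreteness, a minimal NumPy sketch of the two indexing modes, on a plain in-memory array (the channel data here is made up; the point of the feature request is that a reader could apply the same indexing without materialising the full channel first):

```python
import numpy as np

# Stand-in for a channel's full data; the requested feature would index
# directly into the file instead of loading all of it.
data = np.arange(10) * 10

# Integer (fancy) indexing: pick arbitrary positions.
picked = data[np.array([0, 1, 2, 8, 9], dtype=int)]

# Boolean indexing: the mask must match the channel length.
mask = np.zeros(10, dtype=bool)
mask[[1, 3]] = True
filtered = data[mask]
```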

For more inspiration and reading:

@ratal
Owner

ratal commented Apr 7, 2024

It would be relatively difficult, and not so generic: it could only work for one channel at a time, because the requested indexing will most likely not be applicable to another channel group or to the rest of the data.
Proposed method signature then: get_channel_sliced_data_in_memory(). To use this API efficiently, I think you also need to know the channel length before loading, and it would be too complex to keep the sliced data in the structure, so it would only exist in the returned value.
By the way, did you notice you can load only one channel at a time into memory with the load_channels_data_in_memory() method, providing a set of channel names?

@Illviljan
Author

Illviljan commented Apr 7, 2024

Handling one channel at a time was the scope I had in mind; the channels that could use the indexing array can be found through .get_master_channel_names_set.

Yes, I saw load_channels_data_in_memory(), but I'm not sure it will scale well with a lot of files. My idea is basically to use dask in a similar fashion to the h5py example: concatenate many files together and go from there.
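A rough sketch of that dask pattern, using plain NumPy arrays as stand-ins for the same channel read from several files (a real setup would wrap each file's channel lazily, e.g. via dask.delayed, rather than pre-loading it):

```python
import numpy as np
import dask.array as da

# Stand-ins for one channel read from three files; in practice each of
# these would be a lazy wrapper around a file, not an in-memory array.
per_file = [np.arange(0, 5), np.arange(5, 10), np.arange(10, 15)]
lazy = [da.from_array(a, chunks=5) for a in per_file]

# Concatenate the files into one logical channel, then filter it; dask
# only materialises the chunks needed for the result on compute().
channel = da.concatenate(lazy)
evens = channel[channel % 2 == 0].compute()
```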

@ratal
Owner

ratal commented Apr 8, 2024

As you have to concatenate the same channel from different files, you should not need parallelism, so only one channel per file has to be stored in memory. Can one channel not fit in memory?
If it fits, you could try using load_channels_data_in_memory(channel_name) together with clear_channel_data_from_memory(channel_name).
Otherwise, yes, .get_channel_sliced_data_in_memory(channel_name, starting_index, ending_index), in combination with get_channel_dtype() to get the length, would be needed.
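A sketch of how such a sliced getter could be consumed. The proposed get_channel_sliced_data_in_memory() does not exist yet, so it is replaced here by a plain-array stand-in; the signature and the way the length is obtained are both assumptions:

```python
import numpy as np

channel = np.arange(100)  # stand-in for a channel stored in a file

def get_channel_sliced_data_in_memory(start, end):
    # Hypothetical stand-in for the proposed method: the real API would
    # read only rows [start, end) of the channel from disk.
    return channel[start:end]

# Stream the channel in fixed-size slices and filter each slice, so peak
# RAM usage stays at one slice plus the accumulated filtered values.
chunk = 32
kept = []
for start in range(0, len(channel), chunk):
    block = get_channel_sliced_data_in_memory(start, start + chunk)
    kept.append(block[block % 10 == 0])
result = np.concatenate(kept)
```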
