
Merge LH5 files from multiple threads #190

Open
gipert opened this issue Dec 9, 2024 · 4 comments
Labels: discussion (Further information is requested), output (Output Schemes)

Comments

@gipert
Member

gipert commented Dec 9, 2024

Is it possible to handle this in pure C++ in a reasonable way? If not, we should provide some Python remage wrapper that performs the concatenation.

gipert added the discussion and output labels Dec 9, 2024
@ManuelHu
Collaborator

There are "hyperslabs" available in the C++ API to read and write partial datasets, but their usage is very verbose compared to equivalent Python slices. You also need to take care of all the low-level details (chunking, ...) yourself.

I would really not like to maintain such code (I found an implementation of dataset merging - certainly much more sophisticated and feature-rich - that has >2k lines of code).
That is much more than just the "name and attribute juggling" we are currently doing...

@Yurivanderburg

Not sure if this helps, but I found that loading the LH5 files as awkward arrays and using ak.concatenate works very nicely. I can share some code if you want.

@tdixon97
Collaborator

This is exactly the approach we suggest people use.
However, combining the LH5 files directly in C++ / with an extra Python script is a bit trickier, since you do not want to just concatenate but also sort by g4_evtid (each file contains a random subset of g4_evtids). This also has to be done in a memory-efficient way (without reading the full data into memory), and it should be fast.

@gipert
Member Author

gipert commented Dec 12, 2024

We also want to do some merging in the simulation production workflow to avoid cluttering the filesystem with a huge amount of files.


4 participants