Update of HyperSpy Markers API changes for the hspy/zspy format #164

Merged · 8 commits · Oct 5, 2023

Conversation

ericpre (Member) commented Sep 28, 2023

Follow up of hyperspy/hyperspy#3148.

Progress of the PR

  • Update to the new markers API (see the sketch after this list),
  • fix parsing of tuples of strings,
  • bump the minimum matplotlib requirement to 3.5, which added matplotlib.collections.Collection.set_offset_transform,
  • various fixes for ragged arrays:
    • the ragged attribute wasn't set properly in a save/load cycle,
    • the dtype of ragged arrays was saved incorrectly in zspy,
    • fix an issue with several ragged array datasets in the same group,
  • update the user guide: bump the file version,
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • check the formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:rosettasciio build of this PR (link in GitHub checks),
  • add tests,
  • ready for review.
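
For context, a minimal sketch of the save/load round trip this PR targets. This is illustrative only and assumes the HyperSpy 2.0-style `hs.plot.markers.Points` class with an `offsets` array; the file name and values are made up.

```python
# Illustrative sketch, assuming the new HyperSpy 2.0-style markers API
# (hs.plot.markers.Points with an ``offsets`` array); not code from this PR.
import hyperspy.api as hs
import numpy as np

s = hs.signals.Signal2D(np.zeros((32, 32)))
m = hs.plot.markers.Points(offsets=np.array([[10.0, 20.0], [15.0, 5.0]]), sizes=10)
# Permanent markers are stored in the signal's metadata and serialized on save.
s.add_marker(m, permanent=True, plot_marker=False)
s.save("signal_with_markers.hspy", overwrite=True)
s2 = hs.load("signal_with_markers.hspy")  # markers round-trip via the bumped file version
```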

codecov bot commented Sep 28, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (e4e71ad) 85.56% compared to head (0255321) 85.59%.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
+ Coverage   85.56%   85.59%   +0.03%
==========================================
  Files          76       76
  Lines       10132    10148      +16
  Branches     2210     2216       +6
==========================================
+ Hits         8669     8686      +17
+ Misses        945      944       -1
  Partials      518      518
```
| File | Coverage | Δ |
| --- | --- | --- |
| rsciio/hspy/_api.py | 93.18% <ø> | (ø) |
| rsciio/zspy/_api.py | 95.77% <ø> | (ø) |
| rsciio/_hierarchical.py | 76.30% <93.75%> | (+0.40%) ⬆️ |

... and 2 files with indirect coverage changes


ericpre (Member, Author) commented Oct 4, 2023

@CSSFrancis, do you want to review this PR?

CSSFrancis (Member) commented
@ericpre I can do that now! Sorry, I've been trying to finish writing some papers the last couple of days so I haven't been the best at responding :)

CSSFrancis (Member) commented
@ericpre I looked over this briefly and it looks like a really good improvement overall! I've been meaning to go back to the idea of saving and loading ragged arrays as I think that it is a bit problematic.

One thing that is a bit concerning is that loading ragged arrays into memory is very slow for the zspy format. It's not so much an issue when saving the ragged arrays, but the loading time appears to be much slower than the time to save the data. It also grows much faster than linearly with the number of markers, which makes things extra bad :).

For example:

```python
import time

import matplotlib.pyplot as plt
import hyperspy.api as hs
import numpy as np

save_times_z = []
load_times_z = []
save_times_h = []
load_times_h = []

num_pos = [100, 500, 1000, 2000, 4000, 10000]
for i in num_pos:
    test = np.empty((i,), dtype=object)
    for j in np.ndindex(test.shape):
        test[j] = np.array([[1, 1], [2, 2]])

    s = hs.signals.BaseSignal(test)

    tic = time.time()
    s.save("data.zspy", overwrite=True)
    toc = time.time()
    save_times_z.append(toc - tic)

    tic = time.time()
    hs.load("data.zspy")
    toc = time.time()
    load_times_z.append(toc - tic)

    tic = time.time()
    s.save("data.hspy", overwrite=True)
    toc = time.time()
    save_times_h.append(toc - tic)

    tic = time.time()
    hs.load("data.hspy")
    toc = time.time()
    load_times_h.append(toc - tic)

plt.plot(num_pos, load_times_z, label="loading time (zspy)")
plt.plot(num_pos, save_times_z, label="saving time (zspy)")
plt.plot(num_pos, load_times_h, label="loading time (hspy)")
plt.plot(num_pos, save_times_h, label="saving time (hspy)")
plt.xlabel("number of positions")
plt.ylabel("time in sec")
plt.legend()
```
[Plot: saving/loading time vs. number of positions for hspy and zspy]

I realize that this is probably something upstream in zarr and not related to this PR, but I think that it will cause problems with saving markers like this.

CSSFrancis (Member) left a review comment
This looks like a good change! It also cleans up some stuff. The only thing I am worried about is the slow loading for ragged .zarr arrays.

This has the problem of potentially "bricking" the loading for a 4-D STEM dataset if you save a large ragged array alongside it and then can't access the data because the loading time for the ragged array is very slow.

(Two review threads on rsciio/_hierarchical.py, both resolved.)
CSSFrancis (Member) commented
@ericpre Let me know if you have any thoughts on why this might be; otherwise, I can try to look into this more and figure out some way to save/load efficiently.

ericpre (Member, Author) commented Oct 4, 2023

> @ericpre Let me know if you have any thoughts on why this might be; otherwise, I can try to look into this more and figure out some way to save/load efficiently.

I don't think there is much to be done here; as you said, it must come from zarr/numcodecs.

CSSFrancis (Member) commented

So it seems like the VLenArray numcodec isn't the problem, as the following benchmark shows:

```python
# Run in IPython/Jupyter (%timeit is an IPython magic).
import numcodecs
import numpy as np

# Assumed codec instance; its definition was not shown in the original snippet.
vlen_arr_codec = numcodecs.VLenArray("f8")

def benchmark_codec(codec, a):
    print(codec)
    print("encode")
    %timeit codec.encode(a)
    enc = codec.encode(a)
    print("decode")
    %timeit codec.decode(enc)
    print("size         : {:,}".format(len(enc)))

np.random.seed(42)
data4 = np.array(
    [np.random.random(size=np.random.randint(0, 20)).astype(np.float64)
     for i in range(200000)],
    dtype=object,
)
benchmark_codec(vlen_arr_codec, data4)
```

This seems to work just fine and scales nicely. The issue instead seems to be that each array is treated as a separate chunk, so the _decode_chunk function is called for every position.

If we set the number of chunks here to 1:

```python
@staticmethod
def _get_object_dset(group, data, key, chunks, **kwds):
    """Creates a Zarr Array object for saving ragged data"""
    these_kwds = kwds.copy()
    these_kwds.update(dict(dtype=object, exact=True, chunks=chunks))
    dset = group.require_dataset(
        key, data.shape, object_codec=numcodecs.VLenArray(int), **these_kwds
    )
    return dset
```

Then things work a little better :)
[Plot: timing results after forcing a single chunk]
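
For reference, a minimal standalone sketch of the chunking effect described above, using plain zarr/numcodecs rather than the rosettasciio code path; the file names and sizes are illustrative.

```python
# Standalone illustration (assumed setup, not the rosettasciio code path):
# one chunk per element vs. a single chunk for a ragged object array.
import numcodecs
import numpy as np
import zarr

n = 10_000
data = np.empty(n, dtype=object)
for i in range(n):
    data[i] = np.arange(i % 20, dtype=np.int64)

# One chunk per element: reading triggers a decode call per position.
z_many = zarr.open(
    "many_chunks.zarr", mode="w", shape=data.shape, chunks=(1,),
    dtype=object, object_codec=numcodecs.VLenArray(int),
)
z_many[:] = data

# A single chunk for the whole array: one decode call on read.
z_one = zarr.open(
    "one_chunk.zarr", mode="w", shape=data.shape, chunks=data.shape,
    dtype=object, object_codec=numcodecs.VLenArray(int),
)
z_one[:] = data
```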

CSSFrancis (Member) commented Oct 4, 2023

@ericpre What do you think?

I would vote that we force any numpy array with dtype=object into one chunk for compression/loading.

We could also try to guess the ideal chunk size by looking at the underlying data.

For a dask array we can leave its chunking scheme and assume that the person saving/loading the data has some idea of what they are doing.
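
A minimal sketch of that policy (hypothetical helper name, not the actual rosettasciio implementation):

```python
import dask.array as da

def choose_ragged_chunks(data):
    """Hypothetical helper sketching the chunking policy proposed above."""
    if isinstance(data, da.Array):
        # Keep an existing dask chunking scheme: assume the user chose it.
        return data.chunks
    # Force in-memory object (ragged) arrays into a single chunk so reading
    # decodes once instead of once per position.
    return data.shape
```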

ericpre (Member, Author) commented Oct 5, 2023

Thank you @CSSFrancis for looking at this. I tried using one chunk in _get_object_dset but reading is still slow; maybe you changed something somewhere else?
Do you want to sort out this issue in a separate PR? The issue is not introduced in this PR, which is about getting the test suite to work with the markers update!

CSSFrancis (Member) commented

> Thank you @CSSFrancis for looking at this. I tried using one chunk in _get_object_dset but reading is still slow; maybe you changed something somewhere else?
> Do you want to sort out this issue in a separate PR? The issue is not introduced in this PR, which is about getting the test suite to work with the markers update!

Yep I can do that!

Do you want to merge this, then? I will open a new PR.

ericpre merged commit 42574d2 into hyperspy:main on Oct 5, 2023 (29 of 30 checks passed).
ericpre added this to the v0.2 milestone on Oct 6, 2023.