
Faster Ragged Reading of Markers #169

Merged: 4 commits into hyperspy:main on Oct 6, 2023
Conversation

@CSSFrancis (Member) commented Oct 5, 2023

Description of the change

This fixes some of the bugs brought up in #168 and #164.

Progress of the PR

  • Change implemented (can be split into several points),
  • Add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check the formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:rosettasciio build of this PR (link in the GitHub checks),
  • Ready for review.

This should be much faster than the previous implementation (for large arrays it makes ragged arrays actually usable). I wanted to start adding some benchmarks, but I'm not sure of the best way to do that. Jupyter notebooks don't seem like the right fit; maybe something similar to the examples?

Minimal example of the bug fix or the new feature

import time

import matplotlib.pyplot as plt
import hyperspy.api as hs
import numpy as np

save_times_z = []
load_times_z = []
save_times_h = []
load_times_h = []

num_pos = [100, 500, 1000, 2000, 4000, 10000, 40000, 100000, 1000000]
for n in num_pos:
    np.random.seed(42)
    # Ragged data: one variable-length float64 array per navigation position.
    data = np.array(
        [np.random.random(size=np.random.randint(0, 20)).astype(np.float64)
         for _ in range(n)],
        dtype=object,
    )
    s = hs.signals.BaseSignal(data)

    tic = time.time()
    s.save("data.zspy", overwrite=True)
    save_times_z.append(time.time() - tic)

    tic = time.time()
    hs.load("data.zspy")
    load_times_z.append(time.time() - tic)

    tic = time.time()
    s.save("data.hspy", overwrite=True)
    save_times_h.append(time.time() - tic)

    tic = time.time()
    hs.load("data.hspy")
    load_times_h.append(time.time() - tic)

plt.plot(num_pos, load_times_z, label="loading time (zspy)")
plt.plot(num_pos, save_times_z, label="saving time (zspy)")
plt.plot(num_pos, load_times_h, label="loading time (hspy)")
plt.plot(num_pos, save_times_h, label="saving time (hspy)")
plt.xlabel("number of positions")
plt.ylabel("time in sec")
plt.legend()
plt.show()

@CSSFrancis (Member, Author) commented:

@ericpre Right now I just have the chunks span the entire ragged dataset. I'm not sure this is the best approach, but you can always set the chunks via a lazy dataset if you really need to. Since there isn't a good way to automate the choice of chunks, I think this is a reasonable solution.
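To make the idea concrete, here is a minimal sketch (with a hypothetical `nav_shape`; the PR's actual code lives in rsciio/zspy/_api.py): a chunk spanning the whole ragged dataset is simply the total number of navigation positions.

```python
import numpy as np

# Hypothetical navigation shape of a ragged signal.
nav_shape = (100, 200)

# The number of ragged entries is the product of the navigation axes.
n_positions = int(np.prod(nav_shape))

# A single chunk spanning every position of the ragged (object) axis.
chunks = (n_positions,)
print(chunks)  # -> (20000,)
```

To override this, a user could first make the signal lazy and rechunk it before saving, as mentioned above.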

codecov bot commented Oct 5, 2023

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (42574d2) 85.59% compared to head (62e5e5e) 85.59%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
- Coverage   85.59%   85.59%   -0.01%     
==========================================
  Files          76       76              
  Lines       10148    10154       +6     
  Branches     2216     2217       +1     
==========================================
+ Hits         8686     8691       +5     
  Misses        944      944              
- Partials      518      519       +1     
Files Coverage Δ
rsciio/_hierarchical.py 76.40% <100.00%> (+0.10%) ⬆️
rsciio/zspy/_api.py 94.66% <75.00%> (-1.11%) ⬇️


@ericpre (Member) left a comment:

This looks good. When I tried, I used len instead of np.prod, and it seems that was the reason why it was still slow!

I am not sure how useful the benchmark file would be, because we may well forget about it if nothing runs regularly and checks how long it takes. Maybe a test with an upper limit on the execution time would be more useful?
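Such a guard could be sketched as a plain pytest-style test (all names and the threshold here are hypothetical; the real test would save and reload an actual signal, and the budget would need tuning for the CI runner):

```python
import time
import numpy as np

MAX_SECONDS = 5.0  # hypothetical budget; tune for the CI machine

def make_ragged(n, seed=42):
    # Build a ragged object array like the one in the benchmark above.
    rng = np.random.default_rng(seed)
    return np.array(
        [rng.random(rng.integers(1, 20)) for _ in range(n)], dtype=object
    )

def test_ragged_roundtrip_is_fast():
    data = make_ragged(10_000)
    tic = time.perf_counter()
    # In the real test this would be s.save(...) followed by hs.load(...);
    # a concatenate stands in here so the sketch runs without hyperspy.
    np.concatenate(data)
    elapsed = time.perf_counter() - tic
    assert elapsed < MAX_SECONDS
```

A hard time limit like this fails loudly on a regression, unlike a benchmark file that nobody re-runs.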

Review threads on rsciio/zspy/_api.py and rsciio/_hierarchical.py were resolved.
@CSSFrancis (Member, Author) commented:
@ericpre I'll just remove the Jupyter notebook for now. The information needed to recreate the tests is in this PR and #168, so that should be good.

Maybe something to come back to in time: some sort of benchmarking for the different file readers would help identify which ones are faster or slower, and whether certain readers could be made faster.

@ericpre ericpre merged commit 6552fa6 into hyperspy:main Oct 6, 2023
28 checks passed
@ericpre ericpre added this to the v0.2 milestone Oct 6, 2023
2 participants