
Faster Ragged Reading of Markers #169

Merged: 4 commits into hyperspy:main on Oct 6, 2023
Conversation

@CSSFrancis (Member) commented Oct 5, 2023

Description of the change

This fixes some of the bugs brought up in #168 and #164.

Progress of the PR

  • Change implemented (can be split into several points),
  • Add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check the formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:rosettasciio build of this PR (link in the GitHub checks),
  • Ready for review.

This should be much faster than the previous implementation (for large arrays it makes ragged arrays actually usable). I wanted to start adding some benchmarks, but I'm not sure of the best way to do that. Jupyter notebooks don't seem like the right fit; maybe something similar to the examples?

Minimal example of the bug fix or the new feature

import time

import matplotlib.pyplot as plt
import hyperspy.api as hs
import numpy as np

save_times_z = []
load_times_z = []
save_times_h = []
load_times_h = []

num_pos = [100, 500, 1000, 2000, 4000, 10000, 40000, 100000, 1000000]
for n in num_pos:
    np.random.seed(42)
    # Ragged data: one variable-length float64 array per navigation position.
    data = np.array(
        [np.random.random(size=np.random.randint(0, 20)).astype(np.float64)
         for _ in range(n)],
        dtype=object,
    )
    s = hs.signals.BaseSignal(data)

    tic = time.time()
    s.save("data.zspy", overwrite=True)
    save_times_z.append(time.time() - tic)

    tic = time.time()
    hs.load("data.zspy")
    load_times_z.append(time.time() - tic)

    tic = time.time()
    s.save("data.hspy", overwrite=True)
    save_times_h.append(time.time() - tic)

    tic = time.time()
    hs.load("data.hspy")
    load_times_h.append(time.time() - tic)

plt.plot(num_pos, load_times_z, label="loading time (zspy)")
plt.plot(num_pos, save_times_z, label="saving time (zspy)")
plt.plot(num_pos, load_times_h, label="loading time (hspy)")
plt.plot(num_pos, save_times_h, label="saving time (hspy)")
plt.xlabel("number of positions")
plt.ylabel("time in sec")
plt.legend()
plt.show()

@CSSFrancis (Member, Author) commented:

@ericpre Right now I just have the chunks span the entire ragged dataset. I'm not sure this is the best approach, but you can always set the chunks via a lazy dataset if you really need to. Since there isn't a good way to automate the choice of chunks, I think this is a reasonable solution.
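To make the idea concrete, here is a minimal sketch (with a hypothetical `nav_shape`; the PR's actual code lives in rsciio/zspy/_api.py): a chunk spanning the whole ragged dataset is simply the total number of navigation positions.

```python
import numpy as np

# Hypothetical navigation shape of a ragged signal.
nav_shape = (100, 200)

# The number of ragged entries is the product of the navigation axes.
n_positions = int(np.prod(nav_shape))

# A single chunk spanning every position of the ragged (object) axis.
chunks = (n_positions,)
print(chunks)  # -> (20000,)
```

To override this, a user could first make the signal lazy and rechunk it before saving, as mentioned above.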

codecov bot commented Oct 5, 2023

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (42574d2) 85.59% compared to head (62e5e5e) 85.59%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
- Coverage   85.59%   85.59%   -0.01%     
==========================================
  Files          76       76              
  Lines       10148    10154       +6     
  Branches     2216     2217       +1     
==========================================
+ Hits         8686     8691       +5     
  Misses        944      944              
- Partials      518      519       +1     
Files Coverage Δ
rsciio/_hierarchical.py 76.40% <100.00%> (+0.10%) ⬆️
rsciio/zspy/_api.py 94.66% <75.00%> (-1.11%) ⬇️


@ericpre (Member) left a comment:

This looks good. When I tried, I used len instead of np.prod, and it seems that was the reason why it was still slow!

I am not sure how useful the benchmark file would be, because we may well forget about it if nothing runs regularly and checks how long it takes. Maybe a test with an upper limit on the execution time would be more useful?
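Such a guard could be sketched as a plain pytest-style test (all names and the threshold here are hypothetical; the real test would save and reload an actual signal, and the budget would need tuning for the CI runner):

```python
import time
import numpy as np

MAX_SECONDS = 5.0  # hypothetical budget; tune for the CI machine

def make_ragged(n, seed=42):
    # Build a ragged object array like the one in the benchmark above.
    rng = np.random.default_rng(seed)
    return np.array(
        [rng.random(rng.integers(1, 20)) for _ in range(n)], dtype=object
    )

def test_ragged_roundtrip_is_fast():
    data = make_ragged(10_000)
    tic = time.perf_counter()
    # In the real test this would be s.save(...) followed by hs.load(...);
    # a concatenate stands in here so the sketch runs without hyperspy.
    np.concatenate(data)
    elapsed = time.perf_counter() - tic
    assert elapsed < MAX_SECONDS
```

A hard time limit like this fails loudly on a regression, unlike a benchmark file that nobody re-runs.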

Review threads on rsciio/zspy/_api.py and rsciio/_hierarchical.py were resolved.
@CSSFrancis (Member, Author) commented:
@ericpre I'll just remove the Jupyter notebook for now. The information needed to recreate the tests is in this PR and #168, so that should be good.

Maybe something to come back to in time: some sort of benchmarking for the different file readers would help identify which ones are faster or slower, and whether certain readers could be made faster.

@ericpre ericpre merged commit 6552fa6 into hyperspy:main Oct 6, 2023
28 checks passed
@ericpre ericpre added this to the v0.2 milestone Oct 6, 2023
2 participants