[FEATURE] Non-uniform rebinning #345

Closed
yimuchen opened this issue Nov 15, 2021 · 13 comments
Assignees
Labels
enhancement New feature or request

Comments

@yimuchen

yimuchen commented Nov 15, 2021

Right now, histograms can only be rebinned by an integer factor via the hist.rebin indicator. It would be nice if there were some way to rebin a given axis to arbitrary bin edges, since we might want to make a regular axis irregular just for a low-statistics region, without requiring the exact binning scheme to be known during histogram construction.

I'm not sure what the best approach would be. Maybe the existing hist.rebin class could be extended to accept something like hist.rebin( <new_axis>/<new bin edges> ) to specify the new binning scheme of interest?

@yimuchen yimuchen added the enhancement New feature or request label Nov 15, 2021
@yimuchen
Author

Below is my own implementation of the histogram manipulation that we want. Right now it only handles NamedHist and is probably very slow for large histograms, but it is a good starting point for describing what we want to do:

We would run it with something like:
rebin_hist( h, x=new_x_axis, y=new_y_axis )

https://gist.github.com/yimuchen/a5e200c001ef4ea01681a7dd8fe89162

@swertz

swertz commented Apr 13, 2022

A nice interface for this would be using array indexing, e.g. doing:

h = Hist(hist.axis.Regular(5, -5, 5))
rebinned_h = h[ [[0], [1,2,3], [4]] ]

Would return a new histogram with variable binning, and the central three bins merged into one.
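To make the intended result of this indexing proposal concrete, here is a minimal sketch in plain Python that merges bin contents according to nested index groups. The function name `merge_by_index_groups` and the example counts are hypothetical illustrations, not part of hist, and the sketch ignores underflow and overflow bins.

```python
def merge_by_index_groups(counts, edges, index_groups):
    """Merge adjacent bins given groups of bin indices.

    counts has one entry per bin, edges has len(counts) + 1 entries.
    Groups must be non-empty, contiguous, and listed in increasing order.
    """
    expected = index_groups[0][0]      # first bin index that must appear next
    new_counts = []
    new_edges = [edges[expected]]      # lower edge of the first merged bin
    for group in index_groups:
        if not group:
            raise ValueError("empty groups are not allowed")
        if list(group) != list(range(expected, expected + len(group))):
            raise ValueError("groups must be contiguous and in order")
        new_counts.append(sum(counts[i] for i in group))
        expected += len(group)
        new_edges.append(edges[expected])  # upper edge of the merged bin
    return new_counts, new_edges


# Five regular bins from -5 to 5, as in Regular(5, -5, 5); counts are made up.
counts = [1, 2, 3, 4, 5]
edges = [-5, -3, -1, 1, 3, 5]
merged_counts, merged_edges = merge_by_index_groups(
    counts, edges, [[0], [1, 2, 3], [4]]
)
```

With these inputs the central three bins are summed into one, and the resulting edges are no longer equally spaced, which is why the result would need a variable axis.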

@andrzejnovak
Member

+1 on this

@gipert

gipert commented Sep 8, 2022

If I may add, it would also be nice to be able to rebin a histogram based on a second one:

>>> h1.rebin(h2.axes.edges)

@alexander-held
Member

I just ran into a setup where I was looking for such a feature as well. Both the ability to specify new bin edges explicitly, and the possibility to pick specific bins to be merged (like @swertz's example) would be very useful!

@garvitaa

garvitaa commented Jan 9, 2023

+1 on this. Having this functionality would make it a lot easier to produce quality plots for coffea-based analyses.

@kdlong

kdlong commented Jan 19, 2023

Probably others have too, but I've written a function for this for my own studies using hist: https://gist.github.com/kdlong/d697ee691c696724fc656186c25f8814

I think it differs from the previous implementation in that it uses np.add.reduceat, so it shouldn't be as slow. I have fought with some details of it (like the treatment of overflow and underflow when rebinning to a subset), and I think I've validated it, but I wouldn't swear in blood that there aren't mistakes. Others can try it out if it's useful, and I could convert it to a PR if it goes in the direction the developers want.
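For readers who have not used np.add.reduceat, the following is a minimal sketch of the general technique on a plain NumPy array. It is not the code from the gist. The function name `merge_bins` is hypothetical, the new edges are assumed to coincide with existing edges, and bins outside the new range are simply dropped rather than moved into underflow or overflow.

```python
import numpy as np


def merge_bins(counts, edges, new_edges, axis=0):
    """Sum adjacent bins along `axis` so the result has edges `new_edges`."""
    counts = np.asarray(counts, dtype=float)
    edges = np.asarray(edges, dtype=float)
    new_edges = np.asarray(new_edges, dtype=float)

    # For each new edge, find the index of the nearest old edge.
    idx = np.abs(edges[:, None] - new_edges[None, :]).argmin(axis=0)
    if not np.allclose(edges[idx], new_edges):
        raise ValueError("every new edge must coincide with an existing edge")
    if np.any(np.diff(idx) <= 0):
        raise ValueError("new edges must be strictly increasing")

    # Keep only the bins inside [new_edges[0], new_edges[-1]].
    # Assumption: a full implementation would add the dropped bins to the flow bins.
    kept = np.take(counts, np.arange(idx[0], idx[-1]), axis=axis)

    # reduceat sums kept[starts[k]:starts[k + 1]] along `axis`;
    # the last segment runs to the end of `kept`.
    starts = idx[:-1] - idx[0]
    return np.add.reduceat(kept, starts, axis=axis)


edges = [-5, -3, -1, 1, 3, 5]
merged_1d = merge_bins([1, 2, 3, 4, 5], edges, [-5, -3, 3, 5])
merged_2d = merge_bins(np.arange(10).reshape(5, 2), edges, [-5, -3, 3, 5], axis=0)
```

Because the summation is a single vectorized call per axis, the cost does not depend on a Python loop over bins, which is the likely reason this approach is faster for large histograms.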

@fabriceMUKARAGE
Collaborator

https://github.com/fabriceMUKARAGE/rebinning_histogram

Based on the feedback above, here is the non-uniform rebinning implementation I have been working on. Please check it out and let me know whether it makes sense. The README.md and the comments in the code explain it in more detail.

@rkansal47

@kdlong Thanks! Not a developer but I think it would be a useful PR 🙂

@kdlong

kdlong commented Jul 24, 2023

From what I can tell from related posts, issues, and features, the real developers are close to having something centrally supported.

@andrzejnovak andrzejnovak moved this from 👷 HaCATthon to HaCATthon 3 in DPROC Projects Jan 17, 2024
@andrzejnovak andrzejnovak removed the status in DPROC Projects Jan 17, 2024
@andrzejnovak andrzejnovak moved this to HaCATthon 3 in DPROC Projects Jan 17, 2024
@Saransh-cpp
Member

Added to boost-histogram in scikit-hep/boost-histogram#913

@henryiii
Member

Yes, it's available now; copying from @Saransh-cpp:

In [1]: import hist

In [2]: hist.__version__
Out[2]: '2.8.0'

In [3]: import numpy as np

In [4]: h = hist.Hist(hist.axis.Regular(10, 0, 1))

In [5]: h
Out[5]: Hist(Regular(10, 0, 1, label='Axis 0'), storage=Double())

In [6]: h.fill(np.random.normal(size=1_000_000))
Out[6]: Hist(Regular(10, 0, 1, label='Axis 0'), storage=Double()) # Sum: 341415.0 (1000000.0 with flow)

In [7]: rebin = hist.rebin(factor=2)

In [8]: h[::rebin]
Out[8]: Hist(Regular(5, 0, 1, label='Axis 0'), storage=Double()) # Sum: 341415.0 (1000000.0 with flow)

In [9]: rebin = hist.rebin(groups=[1, 2, 3, 4])

In [10]: h[::rebin]
Out[10]: Hist(Variable([0, 0.1, 0.3, 0.6, 1], metadata=...), storage=Double()) # Sum: 341415.0
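In the session above, groups lists how many original bins go into each new bin: [1, 2, 3, 4] on ten bins of width 0.1 gives edges [0, 0.1, 0.3, 0.6, 1]. For the earlier requests to pass new bin edges directly, a small helper could translate edges into group sizes. The sketch below is an illustration only: `edges_to_groups` is a hypothetical name, not part of hist or boost-histogram, and it assumes that the new edges coincide with existing edges and span the full axis range.

```python
import math


def edges_to_groups(old_edges, new_edges, abs_tol=1e-9):
    """Return the number of old bins merged into each new bin."""
    positions = []
    for edge in new_edges:
        # Indices of old edges that match this new edge within the tolerance.
        matches = [
            i for i, old in enumerate(old_edges)
            if math.isclose(old, edge, abs_tol=abs_tol)
        ]
        if not matches:
            raise ValueError(f"edge {edge} is not an existing bin edge")
        positions.append(matches[0])

    # Assumption of this sketch: the new binning covers the whole axis.
    if positions[0] != 0 or positions[-1] != len(old_edges) - 1:
        raise ValueError("new edges must span the full axis range")

    groups = [hi - lo for lo, hi in zip(positions, positions[1:])]
    if any(g <= 0 for g in groups):
        raise ValueError("new edges must be strictly increasing")
    return groups


old_edges = [i / 10 for i in range(11)]  # same edges as Regular(10, 0, 1)
groups = edges_to_groups(old_edges, [0, 0.1, 0.3, 0.6, 1])
```

The returned list could then be passed as hist.rebin(groups=groups) and used in h[::rebin] as in the session above.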

@github-project-automation github-project-automation bot moved this from HaCATthon 3 to ✅ Done in DPROC Projects Oct 31, 2024
@mmarchegiani

mmarchegiani commented Nov 1, 2024

Thank you for introducing this new feature.

Quoting @henryiii: "Yes, it's available now; copying from @Saransh-cpp:"

I have tested the same snippet of code with hist version 2.8.0 and it works for me.

Then I tried to rebin a 2D histogram with the Weight storage type, and I got the following error:

In [1]: from coffea.util import load

In [2]: import hist

In [3]: filename = "/work/mmarcheg/ttHbb/analysis/ttlf_background_calibration/ttlf_background_calibration_betterbins_2018/output_all.coffea"
   ...: 

In [4]: o = load(filename)

In [5]: years = list(o["datasets_metadata"]['by_datataking_period'])
   ...: h = o["variables"]["Njet_Ht"]

In [6]: samples_data = [s for s in h.keys() if s.startswith("DATA")]

In [7]: h_data = sum({k :val for s in samples_data for k, val in h[s].items() }.values())

In [8]: h_data = h_data[{'cat':'semilep'}]

In [9]: h_data
Out[9]: 
Hist(
  Variable([4, 5, 6, 7, 8, 9, 11, 20], name='events.nJetGood', label='$N_{JetGood}$'),
  Variable([0, 100, 200, 300, 400, 500, 750, 1000, 1250, 1500, 2000, 2500, 5000], name='events.JetGood_Ht', label='Jets $H_T$ [GeV]'),
  storage=Weight()) # Sum: WeightedSum(value=1.64868e+06, variance=1.64868e+06) (WeightedSum(value=1.64868e+06, variance=1.64868e+06) with flow)

In [10]: rebin = hist.rebin(groups=[1,1,2,3])

In [11]: h_data[::rebin,sum]
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[11], line 1
----> 1 h_data[::rebin,sum]

File /work/mmarcheg/micromamba/envs/pocket-coffea/lib/python3.9/site-packages/hist/basehist.py:417, in BaseHist.__getitem__(self, index)
    410 def __getitem__(  # type: ignore[override]
    411     self, index: IndexingExpr
    412 ) -> Self | float | bh.accumulators.Accumulator:
    413     """
    414     Get histogram item.
    415     """
--> 417     return super().__getitem__(self._index_transform(index))

File /work/mmarcheg/micromamba/envs/pocket-coffea/lib/python3.9/site-packages/boost_histogram/_internal/hist.py:924, in Histogram.__getitem__(self, index)
    922     for _ in range(group):
    923         pos = [slice(None)] * (i)
--> 924         new_view[(*pos, new_j + 1, ...)] += reduced_view[  # type: ignore[arg-type]
    925             (*pos, j, ...)  # type: ignore[arg-type]
    926         ]
    927         j += 1
    929 reduced = new_reduced

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype([('value', '<f8'), ('variance', '<f8')]), dtype([('value', '<f8'), ('variance', '<f8')])) -> None

In [12]: h_data.storage_type
Out[12]: boost_histogram.storage.Weight

Is the rebinning of histograms with Weight storage type supported?
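The traceback shows that the failing operation is an in-place add on records with dtype [('value', '<f8'), ('variance', '<f8')]. The NumPy-only sketch below reproduces that failure on a structured array with the same dtype and shows that summing each field separately works. It is an illustration of the error under that assumption, not a statement of how boost-histogram handles or will handle Weight storage; `merge_groups` is a hypothetical helper and requires that the group sizes add up to the number of bins.

```python
import numpy as np

# Same record layout as reported in the traceback for Weight storage.
weight_dtype = np.dtype([("value", "<f8"), ("variance", "<f8")])

view = np.zeros(4, dtype=weight_dtype)
view["value"] = [1.0, 2.0, 3.0, 4.0]
view["variance"] = [1.0, 2.0, 3.0, 4.0]


def record_add_fails(a, b):
    """Return True if NumPy refuses to add two structured arrays."""
    try:
        a + b
    except TypeError:  # UFuncTypeError is a subclass of TypeError
        return True
    return False


def merge_groups(records, groups):
    """Merge adjacent records, summing each field on its own."""
    if sum(groups) != len(records):
        raise ValueError("group sizes must add up to the number of bins")
    starts = np.cumsum([0, *groups[:-1]])  # first bin index of each group
    out = np.zeros(len(groups), dtype=records.dtype)
    for field in records.dtype.names:
        out[field] = np.add.reduceat(records[field], starts)
    return out


fails = record_add_fails(view, view)
merged = merge_groups(view, [1, 3])
```

Summing the variance field bin by bin matches the usual convention that the variance of a merged bin is the sum of the variances of the bins it replaces.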
