diff --git a/README.md b/README.md index ce20b896..7fcac52c 100644 --- a/README.md +++ b/README.md @@ -72,6 +72,9 @@ There are a few more examples available that can directly be run using the [bind - [Reading TGraph and TGraphError from '.C' files](https://github.com/HEPData/hepdata_lib/blob/main/examples/read_c_file.ipynb) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/HEPData/hepdata_lib/main?filepath=examples/read_c_file.ipynb)

+- [Preparing scikit-hep histograms](https://github.com/HEPData/hepdata_lib/blob/main/examples/reading_scikihep_histograms.ipynb) +[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/HEPData/hepdata_lib/main?filepath=examples/reading_scikihep_histograms.ipynb) +

## External dependencies diff --git a/examples/reading_scikihep_histograms.ipynb b/examples/reading_scikihep_histograms.ipynb new file mode 100644 index 00000000..7334b5cd --- /dev/null +++ b/examples/reading_scikihep_histograms.ipynb @@ -0,0 +1,381 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reading histograms\n", + "\n", + "For newer Python-based frameworks, another common format that needs\n", + "translation is the histogram of the\n", + "[`scikit-hep.hist`](https://hist.readthedocs.io/en/latest/) package. The\n", + "functions in `hepdata_lib.hist_utils` will help you with that, and this\n", + "notebook demonstrates how to use them.\n", + "\n", + "As explained in the [Getting started notebook](Getting_started.ipynb), a\n", + "`Submission` needs to exist or be created. Here, we'll just create one without\n", + "any additional information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Welcome to JupyROOT 6.28/04\n" + ] + } + ], + "source": [ + "from hepdata_lib import Submission\n", + "\n", + "submission = Submission()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The common use-case for `scikit-hep` histograms is to allow for\n", + "after-the-fact slicing and grouping of a primary histogram. Let us first\n", + "generate a mock histogram of the kind that may appear in a typical analysis,\n", + "along with a common slicing routine.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Hist(\n", + " StrCategory(['data', 'QCD', 'ttbar'], name='dataset'),\n", + " IntCategory([-1, 0, 4, 5], name='flavor'),\n", + " Regular(60, -3, 3, name='eta'),\n", + " Regular(20, 0, 500, name='pt'),\n", + " storage=Weight()) # Sum: WeightedSum(value=221221, variance=123802) (WeightedSum(value=256973, variance=143935) with flow)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import hist\n", + "import numpy as np\n", + "\n", + "rng = np.random.default_rng(seed=123_456_789)\n", + "\n", + "h = hist.Hist(\n", + " hist.axis.StrCategory([\"data\", \"QCD\", \"ttbar\"], name=\"dataset\"),\n", + " hist.axis.IntCategory([-1, 0, 4, 5], name=\"flavor\"),\n", + " hist.axis.Regular(60, -3, +3, name=\"eta\"),\n", + " hist.axis.Regular(20, 0, 500, name=\"pt\"),\n", + " storage=hist.storage.Weight(),\n", + ")\n", + "\n", + "h.fill( ## For mock data\n", + " dataset=\"data\",\n", + " flavor=-1,\n", + " eta=rng.normal(0, 2.0, size=123_456),\n", + " pt=rng.exponential(100, size=123_456),\n", + ")\n", + "h.fill( ## For mock QCD\n", + " dataset=\"QCD\",\n", + " flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.8, 0.15, 0.05]),\n", + " eta=rng.normal(0.0, 2.0, size=1_000_000),\n", + " pt=rng.exponential(100, size=1_000_000),\n", + " weight=0.123456 * 2 * rng.random(size=1_000_000),\n", + ")\n", + "h.fill( ## For mock ttbar\n", + " dataset=\"ttbar\",\n", + " flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.45, 0.1, 0.45]),\n", + " eta=rng.normal(0.0, 1.5, size=1_000_000),\n", + " pt=rng.exponential(200, size=1_000_000),\n", + " weight=0.01 * 2 * rng.random(size=1_000_000),\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example of manual processing to 1D arrays\n", + "\n", + "Let us create a simple slicing routine to get the various histograms of\n", + "interest, then use the 
most general function, the\n", + "`hepdata_lib.hist_utils.read_hist` method, to create arrays that will be\n", + "compatible with `hepdata_lib.Variable` declarations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'hist_value': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n", + " 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n", + " 485., 401., 294., 230.]),\n", + " 'hist_variance': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n", + " 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n", + " 485., 401., 294., 230.]),\n", + " 'pt': array([( 0., 25.), ( 25., 50.), ( 50., 75.), ( 75., 100.),\n", + " (100., 125.), (125., 150.), (150., 175.), (175., 200.),\n", + " (200., 225.), (225., 250.), (250., 275.), (275., 300.),\n", + " (300., 325.), (325., 350.), (350., 375.), (375., 400.),\n", + " (400., 425.), (425., 450.), (450., 475.), (475., 500.)],\n", + " dtype=[('f0', '<f4'), ('f1', '<f4')])}\n" + ] + } + ] + } + ] +} diff --git a/hepdata_lib/hist_utils.py b/hepdata_lib/hist_utils.py new file mode 100644 --- /dev/null +++ b/hepdata_lib/hist_utils.py +"""Utilities for converting scikit-hep hist histograms into hepdata_lib objects""" +from typing import Dict, Optional, Union + +import numpy +import hist +import hist.intervals + +from hepdata_lib import Variable, Uncertainty, Table + + +def read_hist(histo: hist.Hist, flow: bool = False) -> Dict[str, numpy.ndarray]: + """ + Convert a scikit-hep histogram into a dictionary of numpy arrays that can be + used for hepdata_lib Variable and Uncertainty declarations. + + For each axis defined in the histogram, a `hepdata_lib.Variable` with + `is_independent=True` should be declared. The `values` of this variable are + stored in the returned dictionary under the axis name. + + Overflow and underflow bins are handled by a single flag for all axes, so be + sure to declare/modify the histogram axis properties according to your + needs. + + The storage content is returned as-is, so any additional uncertainty + processing needs to be handled by the user from the returned values. + """ + axes_entries = [_get_histaxis_array(ax, flow=flow) for ax in histo.axes] + ## "ij" indexing keeps the flattened axis arrays aligned with view.flatten() + axes_entries = numpy.meshgrid(*axes_entries, indexing="ij") + + ## Getting axes return values + readout = { + ax.name: axes_entries[idx].flatten() for idx, ax in enumerate(histo.axes) + } + + ## Getting the histogram return values + view = histo.view(flow=flow) + + _storage_keys = { + hist.storage.Weight: ["value", "variance"], + hist.storage.Mean: ["value", "count", "_sum_of_deltas_squared"], + hist.storage.WeightedMean: [ + "value", + "sum_of_weights", + "sum_of_weights_squared", + "_sum_of_weighted_deltas_squared", + ], + } + + if view.dtype.names is None: # Single value storages + readout["hist_value"] = view.flatten() + else: + for var_name in view.dtype.names: + readout["hist_" + var_name] = view[var_name].flatten() + + return readout + + +def _get_histaxis_array(axis, flow: bool) -> numpy.ndarray: + """ + Given a histogram axis, return the bin entries as a numpy array. + + For continuous axes, the return value is an Nx2 array of bin edge pairs; for + categorical axes, it is a length-N array of bin values. If `flow` is set to + `True`, the function will also add the overflow/underflow bins according to + the settings found in `axis.traits`. For categorical axes, this will include + an extra `"__UNDETERMINED__"` entry (for StrCategory) or an entry one above + the last category (for IntCategory). 
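+ + As an illustrative example, a `Regular(2, 0.0, 1.0)` axis read with + `flow=True` yields `[(-inf, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, inf)]`.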
+ """ + + ## Getting the entries as a simple list + entries = list(axis) + + ## Adding overflow bin + if flow and axis.traits.overflow: + if isinstance(axis, hist.axis.StrCategory): + entries.append("__UNDETERMINED__") + elif isinstance(axis, hist.axis.IntCategory): + entries.append(entries[-1] + 1) + elif isinstance(axis, hist.axis.Integer): + entries.append(numpy.inf) + else: + entries.append((axis.edges[-1], numpy.inf)) + + ## Adding underflow bin + if flow and axis.traits.underflow: + if isinstance(axis,hist.axis.Integer): + entries = [-numpy.inf] + entries + else: + entries = [(-numpy.inf, axis.edges[0])] + entries + + ## Converting to numpy array + if axis.traits.continuous: + entries = numpy.array(entries, dtype="f,f") + else: + entries = numpy.array(entries) + + return entries + + +def hist_as_variable( + var_name, + histo: hist.Hist, + flow: bool = False, + uncertainty: Optional[Dict[str, Union[str, hist.Hist, numpy.ndarray]]] = None, + **kwargs, +) -> Variable: + """ + Returning this histogram entries as a Variable object, with a simpler + interface for automatically generating values uncertainties. + + The `h` and `flow` inputs are passed directly to the `read_hist` method to + extract the value to be used for the variable. + + The `uncertainty` is a dictionary defining how uncertainties should be + defined. Dictionary keys are used as the name of the uncertainty, while the + value defines how the uncertainty should be constructed. This can either be: + + - `str`: either "poisson_asym" or "poisson_sym", indicating to extract + Poisson uncertainty directly from the histogram values. (Either the + asymmetric Garwood interval defined by `hist.intervals` or a simply, + symmetric `sqrt(n)`.) + - `float`: A flat uncertainty to be used on all bins. + - `numpy.ndarray`: An array indicating the uncertainty for each bin. The + array should be compatible with the output of `read_hist['hist_values']` + - `hist.Hist`: The histogram with bin values indicating the uncertainty to + be used for each bin. The histogram should be compatible with the input + histogram. + - `tuple(T,T)` where `T` can either be a `float`, `numpy.ndarray` or + `hist.Hist`. This is used to indicate asymmetric uncertainties, following + the lower/upper ordering convention of hepdata_lib + """ + if uncertainty is None: + uncertainty = {} + + readout = read_hist(histo, flow=flow) + var = Variable( + var_name, + is_independent=False, + is_binned=False, + **kwargs, + ) + var.values = readout["hist_value"] + + def _make_unc_array(unc_val): + if isinstance(unc_val, float): + return numpy.ones_like(readout["hist_value"]) * unc_val + if isinstance(unc_val, numpy.ndarray): + return unc_val.flatten() + if isinstance(unc_val, hist.Hist): + return read_hist(unc_val, flow=flow)["hist_value"] + raise NotImplementedError(f"Unknown uncertainty format! 
+ for unc_name, unc_proc in uncertainty.items(): + is_symmetric = None + if isinstance(unc_proc, str): + if unc_proc == "poisson_sym": # Symmetric Poisson uncertainty + is_symmetric = True + arr = _make_poisson_unc_array(readout, is_symmetric) + elif unc_proc == "poisson_asym": # Asymmetric Poisson uncertainty + is_symmetric = False + arr = _make_poisson_unc_array(readout, is_symmetric) + else: + raise NotImplementedError(f"Unknown uncertainty process {unc_proc}") + elif isinstance(unc_proc, tuple): # Asymmetric uncertainty + is_symmetric = False + assert len(unc_proc) == 2, "Asymmetric uncertainty requires exactly 2 entries" + _lo, _up = _make_unc_array(unc_proc[0]), _make_unc_array(unc_proc[1]) + arr = list(zip(_lo, _up)) + else: # Assuming symmetric uncertainty + is_symmetric = True + arr = _make_unc_array(unc_proc) + + assert len(arr) == len(readout["hist_value"]), "Mismatched array lengths" + + unc_var = Uncertainty(unc_name, is_symmetric=is_symmetric) + unc_var.values = arr + var.add_uncertainty(unc_var) + + return var + + +def _make_poisson_unc_array( + readout: Dict[str, numpy.ndarray], symmetric: bool +) -> numpy.ndarray: + """ + Given the results of `read_hist`, extract the Poisson uncertainty using + `hist.intervals`. The histogram storage type is detected automatically to + handle weighted entries. + """ + if symmetric: + if "hist_variance" not in readout.keys(): # Unweighted storage: plain sqrt(N) + n_events = readout["hist_value"] + unc_arr = numpy.sqrt(n_events) + else: # Weighted storage: use the effective number of events + _sw, _sw2 = readout["hist_value"], readout["hist_variance"] + n_events = numpy.divide( + _sw**2, _sw2, out=numpy.zeros_like(_sw), where=(_sw2 != 0) + ) + rel_unc = numpy.divide( + numpy.sqrt(n_events), + n_events, + out=numpy.zeros_like(_sw), + where=(n_events != 0), + ) + unc_arr = _sw * rel_unc + else: + _sw, _sw2 = readout["hist_value"], readout["hist_value"] + if "hist_variance" in readout.keys(): + _sw2 = readout["hist_variance"] + _lo, _up = hist.intervals.poisson_interval(_sw, _sw2) + _lo, _up = _lo - _sw, _up - _sw + # Suppressing the NaNs for zero-entry bins + _lo = numpy.nan_to_num(_lo, nan=0.0) + _up = numpy.nan_to_num(_up, nan=0.0) + unc_arr = list(zip(_lo, _up)) + + return unc_arr + + +def create_hist_base_table( + table_name: str, + histo: hist.Hist, + flow: bool = False, + axes_rename: Optional[Dict[str, str]] = None, + axes_units: Optional[Dict[str, str]] = None, +) -> Table: + """ + Prepare a Table based on a histogram. This constructs just the histogram + axes as the table's independent variables; histogram entries should be added + via the `hist_as_variable` function.
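+ + A minimal usage sketch (the table and variable names here are illustrative; + `histo` is a filled `hist.Hist` and `submission` an existing + `hepdata_lib.Submission`): + + table = create_hist_base_table("my_table", histo) + table.add_variable(hist_as_variable("event yield", histo)) + submission.add_table(table)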
+ """ + if axes_rename is None: + axes_rename = {} + if axes_units is None: + axes_units = {} + table = Table(table_name) + + readout = read_hist(histo, flow=flow) + + for axis in histo.axes: + var_name = axis.name + if axis.name in axes_rename: + var_name = axes_rename[axis.name] + elif axis.label: + var_name = axis.label + var = Variable( + var_name, + is_independent=True, + is_binned=axis.traits.continuous, + units=axes_units.get(axis.name, ""), + ) + var.values = readout[axis.name] + table.add_variable(var) + + return table diff --git a/requirements.txt b/requirements.txt index eaf80f96..b4f40231 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,6 @@ numpy PyYAML>=4.0 future +hist +scipy hepdata-validator>=0.3.5 diff --git a/tests/test_histread.py b/tests/test_histread.py new file mode 100644 index 00000000..44d80233 --- /dev/null +++ b/tests/test_histread.py @@ -0,0 +1,207 @@ +#!/usr/bin/env python +"""Test scikit-hep reading""" +from unittest import TestCase +import numpy as np +import hist +import hist.intervals +from hepdata_lib.hist_utils import read_hist, hist_as_variable, create_hist_base_table + + +class TestHistUtils(TestCase): + """Test the hist_utils functions.""" + + base_hist = hist.Hist( + hist.axis.StrCategory(["data", "QCD", "ttbar"], name="dataset"), + hist.axis.IntCategory([-1, 0, 4, 5], name="flavor"), + hist.axis.Regular(60, -3, +3, name="eta"), + hist.axis.Regular(50, 0, 500, name="pt"), + storage=hist.storage.Weight(), + ) + rng = np.random.default_rng(seed=123_456_789) + + base_hist.fill( ## For mock data + dataset="data", + flavor=-1, + eta=rng.normal(0, 2.0, size=123_456), + pt=rng.exponential(100, size=123_456), + ) + base_hist.fill( ## For Mock QCD + dataset="QCD", + flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.8, 0.15, 0.05]), + eta=rng.normal(0.0, 2.0, size=1_000_000), + pt=rng.exponential(100, size=1_000_000), + weight=0.123456 * 2 * rng.random(size=1_000_000), + ) + base_hist.fill( ## For mock ttbar + dataset="ttbar", + flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.45, 0.1, 0.45]), + eta=rng.normal(0.0, 1.5, size=1_000_000), + pt=rng.exponential(200, size=1_000_000), + weight=0.01 * 2 * rng.random(size=1_000_000), + ) + + def test_default_read(self): + """ + Ensure basic readout function generates arrays with compatible + dimensions with given the base-level histogram. Also checking alternate + binning methods. 
+ """ + # Always test the base histogram + readout = read_hist(TestHistUtils.base_hist) + + # Checking dimension compatibility + self.assertTrue(len(readout["hist_value"]) == len(readout["hist_variance"])) + self.assertTrue(len(readout["hist_value"]) == len(readout["dataset"])) + self.assertTrue(len(readout["hist_value"]) == len(readout["flavor"])) + self.assertTrue(len(readout["hist_value"]) == len(readout["eta"])) + self.assertTrue(len(readout["hist_value"]) == len(readout["pt"])) + self.assertTrue(len(readout.keys()) == 6) + + ## Histograms with alternate types + # Simple double + h_double = hist.Hist(hist.axis.Regular(10, 0, 1, name="x")) # Double + readout = read_hist(h_double) + self.assertTrue(len(readout["hist_value"]) == len(readout["x"])) + self.assertTrue(len(readout.keys()) == 2) + + h_mean = hist.Hist( + hist.axis.Regular(10, 0, 1, name="x"), storage=hist.storage.Mean() + ) + readout = read_hist(h_mean) + self.assertTrue(len(readout["hist_value"]) == len(readout["x"])) + self.assertTrue(len(readout["hist_value"]) == len(readout["hist_count"])) + self.assertTrue( + len(readout["hist_value"]) == len(readout["hist__sum_of_deltas_squared"]) + ) + self.assertTrue(len(readout.keys()) == 4) + + h_mean = hist.Hist( + hist.axis.Regular(10, 0, 1, name="x"), storage=hist.storage.WeightedMean() + ) + readout = read_hist(h_mean) + self.assertTrue(len(readout["hist_value"]) == len(readout["x"])) + self.assertTrue( + len(readout["hist_value"]) == len(readout["hist_sum_of_weights"]) + ) + self.assertTrue( + len(readout["hist_value"]) == len(readout["hist_sum_of_weights_squared"]) + ) + self.assertTrue( + len(readout["hist_value"]) + == len(readout["hist__sum_of_weighted_deltas_squared"]) + ) + self.assertTrue(len(readout.keys()) == 5) + + self.doCleanups() + + def test_projection_read(self): + """ + Ensure basic readout function generates arrays with compatible + dimensions with histogram slicing operations. 
+ """ + # Default read with slicing projection + read1 = read_hist(TestHistUtils.base_hist[{"dataset": "data", "flavor": sum}]) + read2 = read_hist(TestHistUtils.base_hist[{"dataset": "QCD", "flavor": 0j}]) + + # Checking dimension compatibility + self.assertTrue(len(read1["eta"]) == len(read1["pt"])) + self.assertTrue(len(read1["eta"]) == len(read2["pt"])) + self.assertTrue(np.all(read1["eta"] == read2["eta"])) + self.assertTrue(np.all(read1["pt"] == read2["pt"])) + + # Profiling histogram conversion + read3 = read_hist( + TestHistUtils.base_hist[{"dataset": sum, "flavor": sum}].profile("eta") + ) + self.assertTrue(len(read3["pt"]) == len(read3["hist_value"])) + self.assertTrue(len(read3.keys()) == 4) + + # Clean up + self.doCleanups() + + def test_uncertainty_generation(self): + """ + Exhaustively testing automatic variable generation with all defined + uncertainty formats + """ + + d_h = TestHistUtils.base_hist[{"dataset": "data"}] + q_h = TestHistUtils.base_hist[{"dataset": "QCD"}] + t_h = TestHistUtils.base_hist[{"dataset": "ttbar"}] + + d_arr = d_h.view(flow=True)["value"].flatten() + unc_arr = np.ones_like(d_arr) * np.random.random(size=d_arr.shape[0]) + auto_var = hist_as_variable( + "testing", + d_h, + flow=True, + uncertainty={ + "symmetric stat": "poisson_sym", + "asymmetric stat": "poisson_asym", + "symmetric float": 1.5, + "asymmetric float": (1.5, 2.2), + "symmetric array": unc_arr, + "asymmetric array": (-0.8 * unc_arr, unc_arr), + "symmetric histogram": q_h, + "asymmetric histogram": (q_h, t_h), + }, + ) + + def check_val(arr1, arr2): + return self.assertTrue( + np.all(np.isclose(arr1, arr2) | np.isnan(arr1) | np.isnan(arr2)) + ) + + check_val(auto_var.values, d_arr) + # Symmetric Poisson + check_val(auto_var.uncertainties[0].values, np.sqrt(d_arr)) + + # Asymmetric Poisson + _l, _u = hist.intervals.poisson_interval(d_arr, d_arr) + _l, _u = _l - d_arr, _u - d_arr + check_val(auto_var.uncertainties[1].values, list(zip(_l, _u))) + + # Flat uncertainties + check_val(auto_var.uncertainties[2].values, 1.5) + check_val(auto_var.uncertainties[3].values, (1.5, 2.2)) + + # Array defined uncertainties + check_val(auto_var.uncertainties[4].values, unc_arr) + check_val(auto_var.uncertainties[5].values, list(zip(-0.8 * unc_arr, unc_arr))) + + # Histogram defined uncertainties + q_arr = q_h.view(flow=True)["value"].flatten() + t_arr = t_h.view(flow=True)["value"].flatten() + check_val(auto_var.uncertainties[6].values, q_arr) + check_val(auto_var.uncertainties[7].values, list(zip(q_arr, t_arr))) + + self.doCleanups() + + def test_table_generation(self): + """ + Base table generation with base histogram + """ + _rename_ = { + "dataset": "Data set", + "flavor": "Jet flavor", + "eta": r"Jet $\eta$", + "pt": r"Jet $p_{T}$", + } + _units_ = {"pt": "GeV"} + + table = create_hist_base_table( + "my_table", + TestHistUtils.base_hist, + axes_rename=_rename_, + axes_units=_units_, + ) + + readout = read_hist(TestHistUtils.base_hist) + + for idx, (name, new_name) in enumerate(_rename_.items()): + # Checking rename + self.assertTrue(table.variables[idx].name == new_name) + self.assertTrue(table.variables[idx].units == _units_.get(name, "")) + self.assertTrue(len(table.variables[idx].values) == len(readout[name])) + + self.doCleanups()