diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 9eadc887..26227704 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -25,6 +25,13 @@
 - [Legacy Instructions](dark-brem/legacy.md)
 - [Alternative photo-nuclear models](Alternative-Photo-Nuclear-Models.md)
 
+# Physics Guides
+- [Statistics and Calculations](physics/stats/intro.md)
+  - [Averages](physics/stats/averages.md)
+  - [Resolution](physics/stats/resolution.md)
+- [ECal](physics/ecal/intro.md)
+  - [Layer Weights](physics/ecal/layer-weights.md)
+
 # For Developers
 - [Tracking Performance of ldmx-sw](performance.md)
 - [Container-Software Compatibility](compatibility.md)
diff --git a/src/physics/ecal/intro.md b/src/physics/ecal/intro.md
new file mode 100644
index 00000000..399616df
--- /dev/null
+++ b/src/physics/ecal/intro.md
@@ -0,0 +1,4 @@
# ECal

Software-related physics topics for the LDMX Electromagnetic Calorimeter (ECal).

diff --git a/src/physics/ecal/layer-weights.md b/src/physics/ecal/layer-weights.md
new file mode 100644
index 00000000..541dee72
--- /dev/null
+++ b/src/physics/ecal/layer-weights.md
@@ -0,0 +1,212 @@
# Layer Weights

As a sampling calorimeter, the ECal needs to apply weights to all of the energies
it measures in order to account for the energy "lost" by particles traversing
material that is not active or sensitive. The ECal has sensitive silicon layers
that make the actual energy measurements, while the rest of the material
(tungsten plates, cooling planes, and printed circuit boards, "PCBs") is treated
here as "absorber".

Reconstructing the amount of energy deposited within a cell of the sensitive
silicon is a separate topic, largely a question of interpreting the electronic
signals that are read out using our knowledge of the detector. This document starts
from the assumption that we already have an estimate of the amount of energy
deposited in the silicon.

~~~admonish tip title="Energy Deposited in Silicon"
I am being intentionally vague here because the estimate of the energy deposited
in the silicon can be retrieved either from the `EcalRecHits` or the `EcalSimHits`.
The simulated hits exclude any of the electronic estimation and so are, in some
sense, less realistic; however, they can be used to assign blame to energies within
the ECal, which is helpful for some studies.

The estimated energy deposited in the silicon is stored under different names in these
two classes.
- `amplitude_` member variable of `EcalHit` (`getAmplitude()` accessor in C++)
- `edep_` member variable of `EcalSimHit` (`getEdep()` accessor in C++)
~~~

Our strategy for accounting for the absorbing material is to calculate a scaling
factor \\(S\\) that scales the energy deposited in the silicon \\(E_\text{silicon}\\)
up to an estimate of the energy \\(E_\text{absorber}\\) deposited in all of the
absorbing material in front of this silicon layer, up to the next sensitive silicon
layer. This strategy lets us keep the relatively accurate estimate of the energy
deposited in the silicon and avoids double counting absorber layers by uniquely
assigning each of them to a single sensitive silicon layer.

\\[
  E_\text{absorber} \approx S E_\text{silicon}
\\]

We can estimate \\(S\\) using our knowledge of the detector design and measurements
of how much energy particles lose inside these materials. Many of these measurements
are quoted for "minimum-ionizing particles" (MIPs), so we factor this out by first
estimating the number of MIPs within the silicon and then multiplying that number
by the estimated energy loss per MIP in the absorbing material.

\\[
  S = L / E_\text{MIP in Si}
  \quad\Rightarrow\quad
  E_\text{absorber} \approx L \frac{E_\text{silicon}}{E_\text{MIP in Si}}
\\]

The \\(L\\) in this equation is what the layer weights stored within our software
represent. They are the estimated energy loss of a single MIP passing through the
absorbing layers in front of a silicon layer, up to the next silicon layer.

~~~admonish note collapsible=true title="Calculating the Layer Weights"
The [`Detectors/util/ecal_layer_stack.py`](https://github.com/LDMX-Software/ldmx-sw/blob/trunk/Detectors/util/ecal_layer_stack.py)
python script calculates these layer weights as well as other aspects of the ECal
detector design. It copies the \\(dE/dx\\) values estimated within the PDG for the
materials in our design and then uses a coded definition of the material layers and
their thicknesses to calculate these layer weights as well as the positions of the
sensitive layers.
~~~
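To get a feel for the size of these numbers, here is a rough worked example using the
v14 values quoted later on this page (a bulk layer weight of \\(L \approx 10.9\\) MeV
and \\(E_\text{MIP in Si} \approx 0.130\\) MeV for the 0.5 mm silicon sensors); treat
it as an illustration rather than an official number.

\\[
  S = L / E_\text{MIP in Si} \approx 10.9\text{ MeV} / 0.130\text{ MeV} \approx 84
\\]

In other words, the silicon itself only samples on the order of one percent of the
energy assigned to a bulk layer, which is why the layer weights dominate the
reconstructed hit energy.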
Using this estimate for the energy deposited into the absorbing material, we can combine
it with our estimate of the energy deposited into the sensitive silicon to obtain a
"total energy" represented by a single hit.

\\[
  E = E_\text{absorber}+E_\text{silicon}
  \approx L E_\text{silicon} / E_\text{MIP in Si} + E_\text{silicon}
  = \left(1 + \frac{L}{E_\text{MIP in Si}}\right) E_\text{silicon}
\\]

~~~admonish note collapsible=true title="Second Order Corrections"
From simulations of the detector design, we found that this underestimated the energy
of showers within the ECal by a few percent, so we also introduced another scaling
factor \\(C\\) (called `secondOrderEnergyCorrection` in the reconstruction code) that
multiplies this total energy.

\\[
  E = C \left(1 + \frac{L}{E_\text{MIP in Si}}\right) E_\text{silicon}
\\]

This is helpful to keep in mind when editing the reconstruction code itself, but for
simplicity and understandability, if you are manually applying the layer weights (e.g.
to use them with sim hits), you should not apply this second order correction (i.e.
just keep \\(C=1\\)).
~~~

Now that we have an estimate for the total energy represented by a single hit,
we can combine energies from multiple hits to obtain shower or cluster "features".

## Applying the Layer Weights
In many cases, one can just use the `energy_` member variable of the `EcalHit` class to
get this energy estimate (including the layer weights) already calculated by the
reconstruction; however, in some studies it is helpful to apply the layer weights
manually. This could be used to study different values of the layer weights or to apply
the layer weights to the `EcalSimHits` branch in order to estimate the geometric
acceptance (i.e. neglect any electronic noise and readout effects).

This section focuses on how to apply the layer weights within a python-based analysis
which uses the `uproot` and `awkward` packages. We are going to assume that you (the
user) have already loaded some data into an `ak.Array` with a form matching
```
N * var * {
    id: int32,
    amplitude: float32,
    ... other stuff
}
```
The `N` represents the number of events and the `var` signals that this array has a
variable-length sub-array. I am using the `EcalHit` terminology (i.e. `amplitude`
represents the energy deposited in just the silicon), but one could use the following
code with sim hits as well after a name change. You can view this form within
a jupyter notebook or by calling `show()` on the array of interest.

~~~admonish tip collapsible=true title="Loading Data into `ak.Array`"
Just as an example, here is how I load the `EcalRecHits` branch from a pass named
`example`.
```python
import uproot
import awkward as ak
f = uproot.open('path/to/file.root')
ecal_hits = f['LDMX_Events'].arrays(expressions = 'EcalRecHits_example')
# shorten names for easier access, just renaming columns not doing any data manipulation
ecal_hits = ak.zip({
    m[m.index('.')+1:].strip('_') : ecal_hits.EcalRecHits_example[m]
    for m in ecal_hits.EcalRecHits_example.fields
})
```
~~~

From here on, I assume that the variable `ecal_hits` holds an `ak.Array` of data
with a matching form.

### Obtain Layer Number
First, we need to obtain the layer number corresponding to each hit. This can be
extracted from the detector ID that accompanies each hit.
```python
ecal_hits["layer"] = (ecal_hits.id >> 17) & 0x3f
```
~~~admonish warning title="Bit Shifting Nonsense"
This bit shifting nonsense was manually deduced from the
[`EcalID` source code](https://github.com/LDMX-Software/ldmx-sw/blob/trunk/DetDescr/include/DetDescr/EcalID.h)
and should be double checked.
The layers should range in value from `0` up to the number
of sensitive layers minus one (currently `33`).

If you are running from within the ldmx-sw development or production container
(i.e. have access to an installation of `fire`), you could also do
```python
from libDetDescr import EcalID
import numpy as np
@np.vectorize
def to_layer(rawid):
    return EcalID(int(rawid)).layer()
```
to use the C++ code directly. This is less performant but is more likely to
be correct.
~~~
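If you take the `EcalID` route, the `np.vectorize`'d helper expects flat NumPy-compatible
input, so one way to apply it to the jagged `id` array (a sketch, assuming the `to_layer`
helper from the warning above and the `ecal_hits` array defined earlier) is to flatten,
convert, and then restore the per-event structure:
```python
import awkward as ak

# flatten the jagged id array into a plain 1D array of raw detector ids
flat_ids = ak.flatten(ecal_hits.id).to_numpy()
# convert each raw id to its layer number using the vectorized helper above
flat_layers = to_layer(flat_ids)
# restore the original hits-per-event structure
ecal_hits["layer"] = ak.unflatten(flat_layers, ak.num(ecal_hits.id))
```
Comparing this column with the bit-shifting result above is a quick way to double check
the manually-deduced shift and mask.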
### Efficiently Broadcasting Layer Weights
Now, the complicated bit. We want to efficiently broadcast the layer weights
to all of the hits without having to copy around a bunch of data. `awkward`
has a solution for this: we can use an
[IndexedArray](https://awkward-array.org/doc/main/reference/generated/ak.contents.IndexedArray.html?highlight=indexedarray#ak.contents.IndexedArray)
with the layer numbers of the hits as the index and a
[ListOffsetArray](https://awkward-array.org/doc/main/reference/generated/ak.contents.ListOffsetArray.html?highlight=listoffsetarray#ak.contents.ListOffsetArray)
with the same per-event offsets as the hits to get an `ak.Array` that presents the
layer weight for each hit without having to copy any data around.

First, store the layer weights in order by the layer they correspond to.
This is how the layer weights are stored within
[the python config](https://github.com/LDMX-Software/Ecal/blob/trunk/python/digi.py)
and so I have copied the v14 geometry values below.
```python
layer_weights = ak.Array([
    2.312, 4.312, 6.522, 7.490, 8.595, 10.253, 10.915, 10.915, 10.915, 10.915, 10.915,
    10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915,
    10.915, 10.915, 14.783, 18.539, 18.539, 18.539, 18.539, 18.539, 18.539, 18.539,
    18.539, 18.539, 9.938
])
```
Then we use the special arrays linked above to present these layer
weights as if there is a copy of the correct one for each hit.
```python
layer_weight_of_each_hit = ak.Array(
    ak.contents.ListOffsetArray(
        ecal_hits.layer.layout.offsets,
        ak.contents.IndexedArray(
            ak.index.Index(ecal_hits.layer.layout.content.data),
            layer_weights.layout
        )
    )
)
```

### Applying Layer Weights
Now that we have an array that has a layer weight for each hit, we can go
through the calculation described above to apply these layer weights and
get an estimate for the total energy of each hit.
```python
mip_si_energy = 0.130 #MeV - corresponds to ~3.5 eV per e-h pair <- derived from 0.5mm thick Si
(1 + layer_weight_of_each_hit / mip_si_energy)*ecal_hits.amplitude
```
If you are re-applying the layer weights used in reconstruction, you can
also confirm that your code is functioning properly by comparing the array
calculated above with the energies that the reconstruction provides alongside
the `amplitude` member variable.
```python
# we need to remove the second order correction that the reconstruction applies
secondOrderEnergyCorrection = 4000. / 3940.5 # extra correction factor used within ldmx-sw for v14
ecal_hits.energy / secondOrderEnergyCorrection
```
diff --git a/src/physics/stats/averages.md b/src/physics/stats/averages.md
new file mode 100644
index 00000000..c484a2d7
--- /dev/null
+++ b/src/physics/stats/averages.md
@@ -0,0 +1,108 @@
# Different Kinds of Averages

There are many resources available that are more helpful and more detailed than this page.
Honestly, the [Wikipedia page on Average](https://en.wikipedia.org/wiki/Average)
is a great place to start.

### mean
The "arithmetic mean" (or "mean" for short; there are different kinds of means as well)
is the most common average: simply add up all the values of the samples and divide by
the number of samples.

**Code**: [`np.mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)

How do we calculate the "error" on the mean? While the standard deviation shows the
width of a normal distribution, what shows the uncertainty in how well we know the
center of that distribution? The value we use for this is the "standard error of the
mean" (or just "standard error" for short). The
[wikipedia page](https://en.wikipedia.org/wiki/Standard_error) gives a description
in all its statistical glory, but for our purposes it's helpful to remember that the
error of the mean is the standard deviation divided by the square root of the number
of samples.

### median
The middle number of the group of samples when ranked in order. If there is an even
number of samples, then take the arithmetic mean of the two samples in the middle.

**Code**: [`np.median`](https://numpy.org/doc/stable/reference/generated/numpy.median.html)

### weighted mean
Sometimes, the different samples should be weighted differently. In the arithmetic mean,
each sample carries the same weight (I think of it as importance). We can slightly augment
the arithmetic mean to consider weights by multiplying each sample by its weight, adding
these up, and then dividing by the sum of weights. This is a nice definition since it
simplifies down to the regular arithmetic mean if all the weights are one.

Confusingly, NumPy has decided to implement the weighted mean in its `average` function.
This is just due to a difference in vocabulary.
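As a quick, self-contained check that `np.average` with `weights` matches the definition
above — the numbers here are made up purely for illustration:
```python
import numpy as np

values  = np.array([1.0, 2.0, 4.0])
weights = np.array([1.0, 1.0, 2.0])

# the definition above: multiply each sample by its weight, add them up,
# and divide by the sum of the weights -> (1 + 2 + 8) / 4 = 2.75
by_hand = (weights * values).sum() / weights.sum()

assert np.isclose(np.average(values, weights=weights), by_hand)
```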
**Code**: [`np.average`](https://numpy.org/doc/stable/reference/generated/numpy.average.html)

### iterative mean
In a lot of the distributions we look at, there is a "core" distribution that is
Gaussian, but the tails are distorted by other effects. Sometimes, we only care about
the core distribution and we want to intentionally cut away the distorted tails so that
we can study the "bulk" of the distribution. This is where an iterative approach is
helpful: we continue to take means and cut away outliers until there are no outliers
left. The main question then becomes: how do we define an outlier? A simple definition
that works well in our case (since we usually have many samples) is to consider any
sample that is further than X standard deviations from the mean an outlier.

NumPy does not have a function for this type of mean to my knowledge, so we will
construct our own. The key aspect of NumPy I use is
[boolean array indexing](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing),
which is one of many different ways NumPy allows you to access the contents of an array.
In this case, I access certain elements of an array by constructing another array of
True/False values (in the code below, this boolean array is called `selection`); when I
use that array as the "index" (the stuff between the square brackets), only the values
of the array that correspond to `True` will be returned in the output array. I construct
this boolean array by [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html)
a certain comparison against every element in the array. Using broadcasting shortens the
code by removing the manual `for` loop _and_ makes the code faster since NumPy can move
that `for` loop into its under-the-hood, faster operations in C.

**Code**:
```python
import numpy as np

# first I write my own mean calculation that includes the possibility of weights
# and returns the mean, standard deviation, and the error of the mean
def weightedmean(values, weights = None) :
    """calculate the weighted mean and standard deviation of the input values

    This function isn't /super/ necessary, but it is helpful for the itermean
    function below where the same code needs to be called multiple times.
    """
    mean = np.average(values, weights=weights)
    stdd = np.sqrt(np.average((values-mean)**2, weights=weights))
    merr = stdd/np.sqrt(weights.sum() if weights is not None else len(values)) # unweighted case: N samples
    return mean, stdd, merr

# now I can write the iterative mean
def itermean(values, weights = None, *, sigma_cut = 3.0) :
    """calculate an iterative mean and standard deviation

    If no weights are provided, then we assume they are all one.
    The sigma_cut parameter defines what an outlier is: if a sample is further from
    the mean than sigma_cut times the standard deviation, it is removed.
    """
    mean, stdd, merr = weightedmean(values, weights)
    num_included = len(values)+1 # just to get loop started
    # first selection is all non-zero weighted samples
    selection = (weights > 0) if weights is not None else np.full(len(values), True)
    while np.count_nonzero(selection) < num_included :
        # update number included for this mean
        num_included = np.count_nonzero(selection)
        # calculate mean and std dev
        mean, stdd, merr = weightedmean(values[selection], weights[selection] if weights is not None else None)
        # determine new selection, since this variable was defined outside
        # the loop, we can use it in the `while` line and it will just be updated
        selection = (values > (mean - sigma_cut*stdd)) & (values < (mean + sigma_cut*stdd))

    # left loop, meaning we settled into a state where nothing is outside sigma_cut
    # standard deviations from our mean
    return mean, stdd, merr
```
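As a quick sanity check of `itermean`, here is a toy example (randomly generated data,
not from any detector): a Gaussian core with a small population of far-away outliers
mixed in. The iterative cut should recover roughly the core mean and width.
```python
import numpy as np

rng = np.random.default_rng(42)
core     = rng.normal(10., 2., 10000)  # the "core" distribution we care about
outliers = rng.normal(50., 1., 50)     # a distorted tail, far from the core
samples  = np.concatenate([core, outliers])

mean, stdd, merr = itermean(samples, sigma_cut=3.0)
# mean and stdd should come out close to 10 and 2 since the outliers get cut away,
# while a plain np.mean(samples) would be pulled upward by the outlier population
```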
diff --git a/src/physics/stats/intro.md b/src/physics/stats/intro.md
new file mode 100644
index 00000000..c7b29b70
--- /dev/null
+++ b/src/physics/stats/intro.md
@@ -0,0 +1,5 @@
# Statistics and Calculations

This chapter gives explanations of various calculations and statistical methods that are
common in our experiment. This assorted group of topics is not a complete reference and
should only be treated as an introduction rather than a full source.
diff --git a/src/physics/stats/resolution.md b/src/physics/stats/resolution.md
new file mode 100644
index 00000000..e76135c3
--- /dev/null
+++ b/src/physics/stats/resolution.md
@@ -0,0 +1,88 @@
# Resolution
What I (Tom Eichlersmith) mean when I say "resolution".

Many of the measurements we take form distributions that are
[normal](https://en.wikipedia.org/wiki/Normal_distribution)
(a.k.a. Gaussian, a.k.a. the bell curve).

A normal distribution is completely characterized by its mean and
its standard deviation. In order to compare several different normal
distributions with different means and get an idea about which is "wider"
than another, we can divide the standard deviation by the mean to get
(what I call) the _resolution_. [^1]

[^1]: The technical term for this value is the
[Coefficient of Variation](https://en.wikipedia.org/wiki/Coefficient_of_variation).

For our purposes, a lower value for the resolution is better since that
means our measurement is more precise. Additionally, in simulations we
know what the true measurement value _should_ be and so we can compare
the mean to the true value to see how accurate our measurement is.

## Calculation
Calculating the mean and standard deviation is a fairly common task,
so both are available from numpy:
[mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
and
[std](https://numpy.org/doc/stable/reference/generated/numpy.std.html).

So if I have a set of measurements, the following python code snippet
calculates the mean, std, resolution, and accuracy.
```python
import numpy as np
# below, I randomly generate "measurements" from a normal distribution
# in our work with the ECal, these measurements will actually be
# taken from the output of the simulation using uproot
true_value = 10
true_width = 5
measurements = np.random.normal(true_value, true_width, 10000)
mean = measurements.mean()
stdd = measurements.std()
resolution = stdd/mean
accuracy = mean/true_value
```

Oftentimes, it is helpful to display the distribution of the data compared
to an _actual_ normal distribution to confirm that the data is _actually_
normal and our analysis makes sense. In this case, the following code
snippet is very helpful.
```python
import matplotlib.pyplot as plt
from scipy.stats import norm
import numpy as np
true_value, true_width = 10, 5 # same toy inputs as above
measurements = np.random.normal(true_value, true_width, 10000)

# plot the raw data as a histogram (density=True so it is normalized like a PDF)
bin_values, bin_edges, art = plt.hist(measurements, bins='auto', density=True, label='data')
bin_centers = (bin_edges[1:]+bin_edges[:-1])/2
# plot an actual normal distribution on top to see if the histogram follows that shape
plt.plot(bin_centers, norm.pdf(bin_centers, measurements.mean(), measurements.std()), label='real normal')
plt.legend()
```

**Brief Aside**:
Sometimes, we do not have access to the full list of actual measurements because
there are too many of them, so we only have access to the binned data. In this case,
we can calculate an approximate mean and standard deviation using the centers of the
bins as the "measurements" and the values in the bins as the "weights".
```python
approx_mean = np.average(bin_centers, weights=bin_values)
approx_stdd = np.sqrt(np.average((bin_centers-approx_mean)**2, weights=bin_values))
```

## Plotting
There are several plots that are commonly made when studying the resolution of things.
I'm listing them here just as a reference.

#### Symbols
Not necessarily standard, but helpful for being on the same page.
- \\(E_{meas}\\) the reconstructed total energy measured by the ECal
- \\(E_{true}\\) the known energy of the particle entering the ECal
- \\(\\langle X \\rangle\\) the mean of samples of \\(X\\)
- \\(\\sigma_X\\) the standard deviation of samples of \\(X\\)

#### Plots
- Histogram of \\(E_{meas}/E_{true}\\): this helps us see the shape of the distributions, and
  dividing by the known beam energy allows us to overlay several different beam energies
  and compare their shapes.
- Plot of \\(\\langle E_{meas} \\rangle/E_{true}\\) vs \\(\\langle E_{meas} \\rangle\\): shows how the mean (i.e. the accuracy) changes with beam energy.
- Plot of \\(\\sigma_E / \\langle E_{meas} \\rangle\\) vs \\(\\langle E_{meas} \\rangle\\): shows how the resolution changes with beam energy (a minimal example is sketched below).
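As an example of how the last plot in the list might be put together, here is a minimal
sketch. The beam energies and widths below are toy numbers generated the same way as the
snippets above, purely to show the plotting pattern — in a real study the means and
widths would come from the measured \\(E_{meas}\\) distributions (e.g. via the iterative
mean described in the Averages page).
```python
import numpy as np
import matplotlib.pyplot as plt

beam_energies = [1000., 2000., 4000., 8000.]  # MeV, arbitrary toy values
means, widths = [], []
for e_true in beam_energies:
    # fake "measurements" at each beam energy, as in the snippets above
    e_meas = np.random.normal(e_true, 0.1*e_true, 10000)
    means.append(e_meas.mean())
    widths.append(e_meas.std())
means  = np.array(means)
widths = np.array(widths)

plt.plot(means, widths/means, marker='o', linestyle='none')
plt.xlabel(r'$\langle E_{meas} \rangle$ [MeV]')
plt.ylabel(r'$\sigma_E / \langle E_{meas} \rangle$')
plt.show()
```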