Merge pull request #27 from LDMX-Software/26-start-a-physics-section-of-the-online-docs

26 start a physics section of the online docs
tomeichlersmith authored Jan 31, 2024
2 parents 74a3232 + 77c439f commit a627e99
Showing 6 changed files with 424 additions and 0 deletions.
7 changes: 7 additions & 0 deletions src/SUMMARY.md
@@ -25,6 +25,13 @@
- [Legacy Instructions](dark-brem/legacy.md)
- [Alternative photo-nuclear models](Alternative-Photo-Nuclear-Models.md)

# Physics Guides
- [Statistics and Calculations](physics/stats/intro.md)
- [Averages](physics/stats/averages.md)
- [Resolution](physics/stats/resolution.md)
- [ECal](physics/ecal/intro.md)
- [Layer Weights](physics/ecal/layer-weights.md)

# For Developers
- [Tracking Performance of ldmx-sw](performance.md)
- [Container-Software Compatibility](compatibility.md)
4 changes: 4 additions & 0 deletions src/physics/ecal/intro.md
@@ -0,0 +1,4 @@
# ECal

Software-related physics topics for the LDMX Electromagnetic Calorimeter (ECal).

212 changes: 212 additions & 0 deletions src/physics/ecal/layer-weights.md
@@ -0,0 +1,212 @@
# Layer Weights

As a sampling calorimeter, the ECal needs to apply weights to the energies
it measures in order to account for the energy "lost" when particles traverse
material that is not active or sensitive. The ECal has sensitive silicon layers
that actually make the energy measurements, while the rest of the material
(tungsten plates, cooling planes, and printed circuit boards (PCBs)) is treated
here as "absorber".

Reconstructing the amount of energy deposited within a cell of the sensitive
silicon is another topic, largely a question of interpreting the electronic signals
read out using our knowledge of the detector. This document starts from the assumption
that we already have an estimate for the amount of energy deposited in the silicon.

~~~admonish tip title="Energy Deposited in Silicon"
I am being intentionally vague here because the estimate for the energy deposited
in the silicon can be retrieved from either the `EcalRecHits` or the `EcalSimHits`.
The simulated hits exclude any of the electronic estimation and so are, in some
sense, less realistic; however, they can be used to assign blame to energies within
the ECal, which is helpful for some studies.
The estimated energy deposited in the silicon is stored under different names in these
two classes.
- `amplitude_` member variable of `EcalHit` (`getAmplitude()` accessor in C++)
- `edep_` member variable of `EcalSimHit` (`getEdep()` accessor in C++)
~~~

Our strategy for accounting for the absorbing material is to calculate a scaling factor \\(S\\)
that scales the energy deposited in the silicon \\(E_\text{silicon}\\) up to a value estimating the
energy deposited in all of the absorbing material in front of this silicon layer
(up to the next sensitive silicon layer), \\(E_\text{absorber}\\). This strategy lets us keep the relatively
accurate estimate of the energy deposited in the silicon and avoids double
counting absorber layers by uniquely assigning each of them to a single sensitive silicon layer.

\\[
E_\text{absorber} \approx S E_\text{silicon}
\\]

We can estimate \\(S\\) using our knowledge of the detector design and measurements of how
much energy particles lose inside these materials. Many of these energy-loss measurements
focus on "minimum-ionizing particles" (MIPs), so we factor this
out by first estimating the number of MIPs within the silicon and then multiplying that number
by the estimated energy loss per MIP in the absorbing material.

\\[
S = L / E_\text{MIP in Si}
\quad\Rightarrow\quad
E_\text{absorber} \approx L \frac{E_\text{silicon}}{E_\text{MIP in Si}}
\\]

The \\(L\\) in this equation is what the layer weights stored within our software represent.
They are the estimated energy loss of a single MIP passing through the absorbing layers in
front of a silicon layer, up to the next silicon layer.

~~~admonish note collapsible=true title="Calculating the Layer Weights"
The [`Detectors/util/ecal_layer_stack.py`](https://github.com/LDMX-Software/ldmx-sw/blob/trunk/Detectors/util/ecal_layer_stack.py)
python script calculates these layer weights as well as other aspects of the ECal detector design.
It copies the \\(dE/dx\\) values estimated within the PDG for the materials in our design
and then uses a coded definition of the material layers and their thicknesses to calculate
these layer weights as well as the positions of the sensitive layers.
~~~

We can combine this estimate of the energy deposited in the absorbing material
with our estimate of the energy deposited in the sensitive silicon to obtain a "total energy"
represented by a single hit.

\\[
E = E_\text{absorber}+E_\text{silicon}
\approx L E_\text{silicon} / E_\text{MIP in Si} + E_\text{silicon}
= \left(1 + \frac{L}{E_\text{MIP in Si}}\right) E_\text{silicon}
\\]
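
As a quick numeric illustration, here is a small sketch using the v14 layer-0 weight and the MIP energy in silicon quoted later in this document; the silicon energy itself is made up for the example.
```python
# hypothetical single hit in layer 0 (silicon energy chosen for illustration only)
L = 2.312             # MeV, v14 layer weight for layer 0 (see the table below)
E_mip_in_si = 0.130   # MeV, energy a single MIP deposits in 0.5mm of silicon
E_silicon = 0.25      # MeV, assumed energy deposited in the silicon

E_total = (1 + L / E_mip_in_si) * E_silicon
print(E_total)  # ~4.7 MeV, i.e. most of the hit energy is attributed to the absorber
```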

~~~admonish note collapsible=true title="Second Order Corrections"
From simulations of the detector design, we found that this underestimates the energy of showers
within the ECal by a few percent, so we also introduce another scaling factor \\(C\\)
(called `secondOrderEnergyCorrection` in the reconstruction code) that multiplies this total
energy.
\\[
E = C \left(1 + \frac{L}{E_\text{MIP in Si}}\right) E_\text{silicon}
\\]
This is helpful to keep in mind when editing the reconstruction code itself, but,
for simplicity and understandability, if you are manually applying the layer weights (e.g.
to use them with sim hits), you should not use this second order correction (i.e. just keep \\(C=1\\)).
~~~

Now that we have an estimate for the total energy represented by a single hit,
we can combine energies from multiple hits to obtain shower or cluster "features".

## Applying the Layer Weights
In many cases, one can just use the `energy_` member variable of the `EcalHit` class to get
this energy estimate (including the layer weights) already calculated by the reconstruction;
however, in some studies it is helpful to apply the layer weights manually. This can be used to study
different values of the layer weights or to apply them to the `EcalSimHits`
branch in order to estimate the geometric acceptance (i.e. neglecting any electronic noise and
readout effects).

This section focuses on how to apply the layer weights within a python-based analysis
which uses the `uproot` and `awkward` packages. We are going to assume that you (the
user) have already loaded some data into an `ak.Array` with a form matching
```
N * var * {
    id: int32,
    amplitude: float32,
    ... other stuff
}
```
The `N` represents the number of events and the `var` signals that this array has a
variable-length sub-array. I am using the `EcalHit` terminology (i.e. `amplitude`
represents the energy deposited in just the silicon), but one could use the following
code with sim hits as well after a name change. You can view this form within
a jupyter notebook or by calling `show()` on the array of interest.

~~~admonish tip collapsible=true title="Loading Data into `ak.Array`"
Just as an example, here is how I load the `EcalRecHits` branch from a pass named
`example`.
```python
import uproot
import awkward as ak
f = uproot.open('path/to/file.root')
ecal_hits = f['LDMX_Events'].arrays(expressions = 'EcalRecHits_example')
# shorten names for easier access, just renaming columns not doing any data manipulation
ecal_hits = ak.zip({
    m[m.index('.')+1:].strip('_') : ecal_hits.EcalRecHits_example[m]
    for m in ecal_hits.EcalRecHits_example.fields
})
```
~~~

From here on, I assume that the variable `ecal_hits` holds an `ak.Array` of data
with a matching form.

### Obtain Layer Number
First, we need to obtain the layer number corresponding to each hit. This can be
extracted from the detector ID that accompanies each hit.
```python
ecal_hits["layer"] = (ecal_hits.id >> 17) & 0x3f
```
~~~admonish warning title="Bit Shifting Nonsense"
This bit shifting nonsense was manually deduced from the
[`ECalID` source code](https://github.com/LDMX-Software/ldmx-sw/blob/trunk/DetDescr/include/DetDescr/EcalID.h)
and should be double checked.
The layers should range in value from `0` up to the number
of sensitive layers minus one (currently `33`).
If you are running from within the ldmx-sw development or production container
(i.e. have access to an installation of `fire`), you could also do
```python
from libDetDescr import EcalID
import numpy as np
@np.vectorize
def to_layer(rawid):
    return EcalID(int(rawid)).layer()
```
to use the C++ code directly. This is less performant but is more likely to
be correct.
~~~
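
As a quick sanity check (a sketch assuming the `ecal_hits` array from above), you can compare the extracted layer numbers against the expected range.
```python
# layer numbers should span 0 through 33 for the v14 geometry
print(ak.min(ecal_hits.layer), ak.max(ecal_hits.layer))
```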

### Efficiently Broadcasting Layer Weights
Now, the complicated bit. We want to efficiently broadcast the layer weights
to all of the hits without having to copy around a bunch of data. `awkward`
has a solution for this: we can use an
[IndexedArray](https://awkward-array.org/doc/main/reference/generated/ak.contents.IndexedArray.html?highlight=indexedarray#ak.contents.IndexedArray)
with the layer indices as the index and a
[ListOffsetArray](https://awkward-array.org/doc/main/reference/generated/ak.contents.ListOffsetArray.html?highlight=listoffsetarray#ak.contents.ListOffsetArray)
with the offsets of the layers to get an `ak.Array` that presents the layer
weights for each hit without having to copy any data around.

First, store the layer weights in order by the layer they correspond to.
This is how the layer weights are stored within
[the python config](https://github.com/LDMX-Software/Ecal/blob/trunk/python/digi.py)
and so I have copied the v14 geometry values below.
```python
layer_weights = ak.Array([
    2.312, 4.312, 6.522, 7.490, 8.595, 10.253, 10.915, 10.915, 10.915, 10.915, 10.915,
    10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915, 10.915,
    10.915, 10.915, 14.783, 18.539, 18.539, 18.539, 18.539, 18.539, 18.539, 18.539,
    18.539, 18.539, 9.938
])
```
Then we use the special arrays linked above to present these layer
weights as if there is a copy of the correct one for each hit.
```python
layer_weight_of_each_hit = ak.Array(
    ak.contents.ListOffsetArray(
        ecal_hits.layer.layout.offsets,
        ak.contents.IndexedArray(
            ak.index.Index(ecal_hits.layer.layout.content.data),
            layer_weights.layout
        )
    )
)
```
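
If you want to convince yourself that this zero-copy construction did the right thing, a minimal, copy-based cross-check (a sketch assuming the arrays defined above) is to do the lookup with plain array indexing and then restore the per-event structure.
```python
# flatten the jagged layer numbers, look the weights up with regular numpy
# indexing (this copies data), and rebuild the per-event structure
flat_weights = ak.to_numpy(layer_weights)[ak.to_numpy(ak.flatten(ecal_hits.layer))]
check = ak.unflatten(flat_weights, ak.num(ecal_hits.layer))
assert ak.all(check == layer_weight_of_each_hit)
```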

### Applying Layer Weights
Now that we have an array that has a layer weight for each hit, we can go
through the calculation described above to apply these layer weights and
get an estimate for the total energy of each hit.
```python
mip_si_energy = 0.130 #MeV - corresponds to ~3.5 eV per e-h pair <- derived from 0.5mm thick Si
(1 + layer_weight_of_each_hit / mip_si_energy)*ecal_hits.amplitude
```
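If you want a per-event quantity (for example, a total ECal energy spectrum), a short sketch summing the per-hit energies is below; the variable names simply echo the snippet above.
```python
# total energy per hit (silicon plus estimated absorber contribution)
hit_energy = (1 + layer_weight_of_each_hit / mip_si_energy)*ecal_hits.amplitude
# sum over the hits in each event to get a per-event total energy estimate
total_energy_per_event = ak.sum(hit_energy, axis=1)
```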
If you are re-applying the layer weights used in reconstruction, you can
also confirm that your code is functioning properly by comparing the array
calculated above with the energies that the reconstruction provides alongside
the `amplitude` member variable.
```python
# we need to remove the second order correction that the reconstruction applies
secondOrderEnergyCorrection = 4000. / 3940.5 # extra correction factor used within ldmx-sw for v14
ecal_hits.energy / secondOrderEnergyCorrection
```
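One way to make this comparison concrete (a sketch assuming the arrays constructed in the snippets above) is to look at the largest deviation between the two estimates, which should be at the level of floating-point rounding.
```python
import numpy as np
reapplied = (1 + layer_weight_of_each_hit / mip_si_energy)*ecal_hits.amplitude
reference = ecal_hits.energy / secondOrderEnergyCorrection
print(ak.max(np.abs(reapplied - reference)))
```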
108 changes: 108 additions & 0 deletions src/physics/stats/averages.md
@@ -0,0 +1,108 @@
# Different Kinds of Averages

There are many more helpful and more detailed resources available.
Honestly, the [Wikipedia page on Average](https://en.wikipedia.org/wiki/Average)
is a great place to start.

### mean
The "arithmetic mean" (or "mean" for short, there are different kinds of means as well),
is the most common average. Simply add up all the values of the samples and divide by
the number of samples.

**Code**: [`np.mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)

How do we calculate the "error" on the mean? While the standard deviation shows the
width of a normal distribution, what shows the "uncertainty" in how well we know the
center of that distribution? The value we use for this is the "standard error of the
mean" (or just "standard error" for short). The
[wikipedia page](https://en.wikipedia.org/wiki/Standard_error) gives a description
in all its statistical glory, but for our purposes it's helpful to remember that the
error of the mean is the standard deviation divided by the square root of the number
of samples.
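
As a minimal sketch with NumPy (the sample below is randomly generated just for illustration):
```python
import numpy as np

rng = np.random.default_rng(seed=1)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)

mean = np.mean(samples)
stdd = np.std(samples)
# standard error of the mean: standard deviation over sqrt(number of samples)
merr = stdd / np.sqrt(len(samples))
print(mean, stdd, merr)
```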

### median
The middle number of the group of samples when ranked in order. If there is an even
number of samples, then take the arithmetic mean of the two samples in the middle.

**Code**: [`np.median`](https://numpy.org/doc/stable/reference/generated/numpy.median.html)

### weighted mean
Sometimes, the different samples should be weighted differently. In the arithmetic mean,
each sample carries the same weight (I think of it as importance). We can slightly augment
the arithmetic mean to consider weights by multiplying each sample by its weight, adding these
up, and then dividing by the sum of weights. This is a nice definition since it simplifies
down to the regular arithmetic mean if all the weights are one.

Confusingly, NumPy has decided to implement the weighted mean in their `average` function.
This is just due to a difference in vocabulary.

**Code**: [`np.average`](https://numpy.org/doc/stable/reference/generated/numpy.average.html)
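
A short sketch of `np.average` with weights (the values and weights are made up for illustration):
```python
import numpy as np

values  = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([1.0, 1.0, 1.0, 5.0])

# np.average implements the weighted mean: sum(w*x)/sum(w)
print(np.average(values, weights=weights))  # 3.25
# with equal weights it reduces to the arithmetic mean
print(np.average(values), np.mean(values))  # 2.5 2.5
```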

### iterative mean
In a lot of the distributions we look at, there
is a "core" distribution that is Gaussian, but the tails are distorted by other effects.
Sometimes, we only care about the core distribution and we want to intentionally cut away
the distorted tails so that we can study the "bulk" of the distribution. This is where
an iterative approach is helpful. We continue to take means and cut away outliers until
there are no outliers left. The main question then becomes: how do we define an outlier?
A simple definition that works well in our case since we usually have many samples is
to have any sample that is further from the mean by X standard deviations be considered
an outlier.

NumPy does not have a function for this type of mean to my knowledge, so we will construct
our own. The key aspect of NumPy I use is
[boolean array indexing](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing)
which is one of many different ways NumPy allows you to access the contents of an array.
In this case, I access certain elements of an array by constructing another array of True/False
values (in the code below, this boolean array is called `selection`) and then when I use
that array as the "index" (the stuff between the square brackets), only the values of the
array that correspond to `True` values will be returned in the output array. I construct
this boolean array by [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html)
a certain comparison against every element in the array. Using broadcasting shortens the
code by removing the manual `for` loop _and_ makes the code faster since NumPy can move that
`for` loop into its under-the-hood, faster operations in C.

**Code**:
```python
import numpy as np

# first I write my own mean calculation that includes the possibility of weights
# and returns the mean, standard deviation, and the error of the mean
def weightedmean(values, weights = None) :
    """calculate the weighted mean and standard deviation of the input values

    This function isn't /super/ necessary, but it is helpful for the itermean
    function below where the same code needs to be called multiple times.
    """
    mean = np.average(values, weights=weights)
    stdd = np.sqrt(np.average((values-mean)**2, weights=weights))
    # standard error of the mean uses the effective number of samples:
    # the number of values if unweighted, otherwise the sum of the weights
    merr = stdd/np.sqrt(len(values) if weights is None else weights.sum())
    return mean, stdd, merr

# now I can write the iterative mean
def itermean(values, weights = None, *, sigma_cut = 3.0) :
    """calculate an iterative mean and standard deviation

    If no weights are provided, then we assume they are all one.
    The sigma_cut parameter is what defines what an outlier is.
    If a sample is further from the mean than sigma_cut times
    the standard deviation, it is removed.
    """
    mean, stdd, merr = weightedmean(values, weights)
    num_included = len(values)+1 # just to get loop started
    # first selection is all non-zero weighted samples
    selection = (weights > 0) if weights is not None else np.full(len(values), True)
    while np.count_nonzero(selection) < num_included :
        # update number included for this mean
        num_included = np.count_nonzero(selection)
        # calculate mean and std dev
        mean, stdd, merr = weightedmean(values[selection], weights[selection] if weights is not None else None)
        # determine new selection, since this variable was defined outside
        # the loop, we can use it in the `while` line and it will just be updated
        selection = (values > (mean - sigma_cut*stdd)) & (values < (mean + sigma_cut*stdd))

    # left loop, meaning we settled into a state where nothing is outside sigma_cut standard deviations
    # from our mean
    return mean, stdd, merr
```
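
A usage sketch (continuing from the functions above, with a made-up sample containing an injected tail):
```python
rng = np.random.default_rng(seed=42)
# gaussian core with a handful of far outliers
samples = np.concatenate([
    rng.normal(loc=10.0, scale=1.0, size=5_000),
    rng.uniform(low=30.0, high=50.0, size=50),
])
print(weightedmean(samples))             # pulled toward the tail
print(itermean(samples, sigma_cut=3.0))  # should recover the core (~10, ~1)
```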

5 changes: 5 additions & 0 deletions src/physics/stats/intro.md
@@ -0,0 +1,5 @@
# Statistics and Calculations

This chapter gives explanations of various calculations and statistical methods common in our experiment.
This assorted group of topics is not a complete reference and should be treated as an introduction rather
than a full source.