Adding Documentation Section focused on underlying stats without code #839

Merged (77 commits) on Mar 14, 2024


kcormi (Collaborator) commented May 16, 2023

This is (the start of) an attempt to make the underlying model, statistical tests, etc. used by combine clearer to users.

These pages are designed to give users concise but thorough and precise references on the details of what is being done.

The existing documentation includes much of this material, spread throughout. But it might be helpful for users to have more complete explanations in one easy-to-find spot, with reminders and references back to that material in the other parts of the documentation, which focus on how to run procedures and commands.

Open to suggestions/comments at all levels (overall structure, content, flow, choice of notation, etc.).

For those not familiar with setting up the documentation locally to have a look, please see the instructions in the contributing.md document from #838 (you can see it here: https://github.com/kcormi/HiggsAnalysis-CombinedLimit/blob/contributing/contributing.md). A page which should be identical to the one here has also been put up at: https://kcormi.github.io/HiggsAnalysis-CombinedLimit/ -- the new pages are the ones under the 'what combine does' tab.

kcormi added the "documentation" and "needs work" labels on May 16, 2023
kcormi removed the "needs work" label on Jun 30, 2023
kcormi marked this pull request as ready for review on June 30, 2023 12:53
kcormi force-pushed the user_doc_statsforward branch from 7629e3b to 8b2bc41 on September 27, 2023 05:36
kcormi force-pushed the user_doc_statsforward branch from 8b2bc41 to c4575db on October 3, 2023 06:00

kcormi (Collaborator, Author) commented Nov 30, 2023

I've left this open for an unreasonably long time for no good reason. I just gave it another check, and despite what I'm sure are many flaws, I am happy enough with it to merge it and make it public. Unless there are any loud complaints soon, I will go ahead with the merge.

Closer to the time of releasing the paper, I will go through and try to harmonize some notation etc.

kcormi force-pushed the user_doc_statsforward branch from 65ff9f8 to ddebb2b on March 12, 2024 09:24

The observation model, $\mathcal{M}_0( \vec{\Phi}_{0})$, defines the probability of any set of observations given specific values of the model's input parameters, $\vec{\Phi}_0$.
The probability for any observed data is denoted:

$$ p_{\mathcal{M}_{0}}(\mathrm{data}; \vec{\Phi}_0 ) $$

Contributor

Do we really need the _{0}? I think it looks better without this subscript

The event-count portion of the model consists of a sum over different processes.
The expected observations, $\vec{\lambda}$, are then the sum of the expected observations for each of the processes, $\vec{\lambda} =\sum_{p} \vec{\lambda}_{p}$.

The model can also be composed of multiple channels, in which case the expected observation is the set of all expected observations from the various channels, $\vec{\lambda}_{0} = \{ \vec{\lambda}_{c1}, \vec{\lambda}_{c2}, \ldots, \vec{\lambda}_{cN}\}$.

Contributor

See above, we have _{0} here but in the previous paragraph, there's no subscript (prefer without)
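
To illustrate the quoted passage, here is a minimal Python sketch of how the expected observations could be assembled: a bin-by-bin sum over processes within a channel, then a collection over channels. The yields, the channel names `c1` and `c2`, and the helper `total_expectation` are hypothetical and are not part of combine.

```python
# Hypothetical per-process expected yields for channel c1 (three bins).
signal = [1.0, 2.5, 0.5]
background = [10.0, 8.0, 6.0]


def total_expectation(processes):
    """Bin-by-bin sum over processes: lambda = sum_p lambda_p."""
    return [sum(bin_yields) for bin_yields in zip(*processes)]


# Expected observation in each channel.
lambda_c1 = total_expectation([signal, background])
lambda_c2 = total_expectation([[0.5], [3.0]])  # a single-bin second channel

# The full model expectation is the set of per-channel expectations.
model_expectation = {"c1": lambda_c1, "c2": lambda_c2}
```

Keeping the channels as separate entries, rather than flattening them into one list, mirrors the set notation in the quoted text.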

For any given model, $\mathcal{M}(\vec{\Phi})$, [the likelihood](https://pdg.lbl.gov/2022/web/viewer.html?file=../reviews/rpp2022-rev-statistics.pdf#section.40.1) defines the probability of observing a given dataset.
It is numerically equal to the probability of observing the data, given the model.

$$ \mathcal{L}_\mathcal{M}(\vec{\Phi};\mathrm{data}) = p_{\mathcal{M}}(\mathrm{data};\vec{\Phi}) $$

Contributor

Given the amount of time we took over the review of the paper for this, I would really try to stick to the paper. Specifically, we never write a likelihood with "; data", and we don't have it later in the figure or elsewhere, so I would drop it here and just keep the parameters.
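
As a concrete reading of the quoted definition, the sketch below treats a single-bin counting experiment, assuming a Poisson model with expectation `mu * SIGNAL + BACKGROUND`. The numbers and function names are hypothetical. It shows the likelihood as the probability of a fixed observation, evaluated as a function of the parameter only.

```python
import math


def poisson_pmf(n, lam):
    """Probability of observing n events when lam events are expected."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


# Hypothetical single-bin counting experiment.
SIGNAL = 5.0       # expected signal yield at mu = 1
BACKGROUND = 10.0  # expected background yield
OBSERVED = 13      # observed event count, held fixed


def likelihood(mu):
    """L(mu): probability of the fixed observation as a function of mu."""
    return poisson_pmf(OBSERVED, mu * SIGNAL + BACKGROUND)


# Scan the signal strength: the data never change, only the parameter does.
scan = {mu: likelihood(mu) for mu in (0.0, 0.5, 1.0, 1.5)}
best_mu = max(scan, key=scan.get)
```

Because the observation is baked into `likelihood`, its signature carries only the parameters, which matches the notation the reviewer prefers.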


The likelihood in combine takes the general form:

$$ \mathcal{L} = \mathcal{L}_{\textrm{data}} \cdot \mathcal{L}_{\textrm{constraint}} $$

Contributor

Can we use "primary" and "auxiliary" as in the paper, instead of "data" and "constraint"?

Where $\mathcal{L}_{\mathrm{data}}$ is equal to the probability of observing the event count data for a given set of model parameters, and $\mathcal{L}_{\mathrm{constraint}}$ represents some external constraints on the parameters.
The constraint term may encode constraints from previous measurements (such as jet energy scales) or prior beliefs about the value that some parameter in the model should have.

Both $\mathcal{L}_{\mathrm{data}}$ and $\mathcal{L}_{\mathrm{constraint}}$ can be composed of many sublikelihoods, for example for observations of different bins and constraints on different nuisance parameters.

Contributor

As before (data->primary, constraint->auxiliary)
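
The factorisation discussed in this thread can be sketched as follows. This is an illustration only, using the reviewer's primary/auxiliary names for the data and constraint terms. It assumes Poisson terms per bin and a single nuisance parameter `theta` with a unit Gaussian constraint that scales the background as `KAPPA ** theta`. All yields and names are hypothetical and none of this is combine's code.

```python
import math


def poisson_pmf(n, lam):
    """Probability of observing n events when lam events are expected."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def gaussian_pdf(x, mean, sigma):
    """Normal density with the given mean and width, evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical two-bin inputs.
OBSERVED = [12, 9]
SIGNAL = [3.0, 1.0]
BACKGROUND = [10.0, 8.0]
KAPPA = 1.10  # assumed 10% background uncertainty per unit of theta


def primary_likelihood(mu, theta):
    """Data term: product of per-bin Poisson sublikelihoods."""
    value = 1.0
    for n, s, b in zip(OBSERVED, SIGNAL, BACKGROUND):
        value *= poisson_pmf(n, mu * s + b * KAPPA ** theta)
    return value


def auxiliary_likelihood(theta):
    """Constraint term: auxiliary observable fixed at 0, unit width."""
    return gaussian_pdf(0.0, theta, 1.0)


def full_likelihood(mu, theta):
    """L = L_primary * L_auxiliary."""
    return primary_likelihood(mu, theta) * auxiliary_likelihood(theta)
```

With more nuisance parameters, `auxiliary_likelihood` would itself become a product of one constraint term per parameter, in the same way that `primary_likelihood` is a product over bins.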

While we presented the likelihoods for the template and parametric models separately, they can also be combined into a single likelihood by treating each of them as a separate channel.
When combining the models, the data likelihoods of the binned and unbinned channels are multiplied.

$$ \mathcal{L}_{\mathrm{combined}} = \mathcal{L}_{\mathrm{data}} \cdot \mathcal{L}_\mathrm{constraint} = (\prod_{c_\mathrm{template}} \mathcal{L}_{\mathrm{data}}^{c_\mathrm{template}}) (\prod_{c_\mathrm{parametric}} \mathcal{L}_{\mathrm{data}}^{c_\mathrm{parametric}}) \mathcal{L}_{\mathrm{constraint}} $$

Contributor

just one more repeat (data->primary, constraint -> auxiliary)
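
The combined form in the quoted equation can be sketched in log space, where the product over channels becomes a sum. The sketch assumes per-bin Poisson terms for template channels, an extended unbinned likelihood for parametric channels (a Poisson term for the total count times the density at each event), and one shared nuisance parameter with a unit Gaussian constraint. The channel contents, the 5% scaling, and the exponential density are all hypothetical.

```python
import math


def poisson_log_pmf(n, lam):
    """Log-probability of observing n events when lam are expected."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def template_log_likelihood(observed, expected):
    """Binned channel: sum of per-bin Poisson log terms."""
    return sum(poisson_log_pmf(n, lam) for n, lam in zip(observed, expected))


def parametric_log_likelihood(events, expected_total, density):
    """Unbinned channel, assuming an extended likelihood."""
    log_l = poisson_log_pmf(len(events), expected_total)
    return log_l + sum(math.log(density(x)) for x in events)


def constraint_log_likelihood(theta):
    """Unit Gaussian constraint on the shared nuisance parameter."""
    return -0.5 * theta * theta - 0.5 * math.log(2.0 * math.pi)


def exponential_density(x, slope=0.5):
    """Hypothetical normalised density on x >= 0."""
    return slope * math.exp(-slope * x)


# Hypothetical channels: (observed, nominal expectation[, density]).
TEMPLATE_CHANNELS = [([12, 9], [11.0, 10.0]), ([4], [5.0])]
PARAMETRIC_CHANNELS = [([0.3, 1.7, 2.2, 0.9], 4.5, exponential_density)]


def combined_log_likelihood(theta):
    """Sum of template, parametric, and constraint log-likelihood terms."""
    scale = 1.05 ** theta  # assumed common effect of the nuisance parameter
    total = sum(
        template_log_likelihood(obs, [scale * lam for lam in nominal])
        for obs, nominal in TEMPLATE_CHANNELS
    )
    total += sum(
        parametric_log_likelihood(events, scale * n_exp, density)
        for events, n_exp, density in PARAMETRIC_CHANNELS
    )
    return total + constraint_log_likelihood(theta)
```

Note that the constraint term enters once, no matter how many channels the nuisance parameter affects.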


kcormi (Collaborator, Author) commented Mar 14, 2024

Thanks, good points Nick. I changed those cases. I also tried to update the text to better match this primary/auxiliary wording, and found some other instances throughout where I made the notation and wording more consistent with what's in the paper.

kcormi merged commit 8007ee2 into cms-analysis:main on Mar 14, 2024
6 checks passed