Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tevatron inclusive DY #2185

Merged
merged 4 commits into from
Nov 25, 2024
Merged

Tevatron inclusive DY #2185

merged 4 commits into from
Nov 25, 2024

Conversation

jacoterh
Copy link
Collaborator

@jacoterh jacoterh commented Oct 21, 2024

This PR implements filter functions for the Tevatron inclusive DY datasets:

  • CDF_Z0_1P96TEV_ZRAP: agreement with legacy implementation
  • D0_WPWM_1P96TEV: agreement with legacy, except for possible rounding errors in the old implementation of the kinematics, see specifically the last two bins here
  • D0_Z0_1P96TEV: agreement with legacy. One difference: systematic in 3rd bin is symmetric already on hepdata, while the legacy implementation seems to have symmetrised plus and minus uncertainties in the past here. This also affects the central value.

General comment: small overall differences are due to formatting. For example: 1.2999999999999998 instead of 1.3. Is there any function that was used in the past to uniform the way floats were formatted?

@Radonirinaunimi
Copy link
Member

Thanks for this @jacoterh!

Small overall differences are due to formatting. For example: 1.2999999999999998 instead of 1.3. Is there any function that was used in the past to uniform the way floats were formatted?

Now, you can use this function:

def prettify_float(dumper, value):
"""
Override the default yaml representer:
https://github.com/yaml/pyyaml/blob/48838a3c768e3d1bcab44197d800145cfd0719d6/lib/yaml/representer.py#L189
This function is used to prettify the float representation in the yaml file.
If the float has more than 8 digits, it will be represented in scientific notation with 8 digits.
Note:
-----
When importing yaml in a module,
yaml.add_representer(float, prettify_float)
must be called to use this function.
"""
ret = dumper.represent_float(value)
if len(ret.value) > 8:
ret_str = f"{value:.8e}"
ret = dumper.represent_scalar('tag:yaml.org,2002:float', ret_str)
return ret

and call it as follows in the filter:

yaml.add_representer(float, prettify_float)

Also, same as in the other PRs, could you check that the datasets are loaded properly and that the (t0) covmats are the same?

from validphys.api import API
import numpy as np

inp1 = {"dataset_input": {"dataset": f"{new_implementation}"}, "theoryid": 40_000_000, "use_cuts": "internal", "t0pdfset": "NNPDF40_nnlo_as_01180", "use_t0": True}
inp2 = {"dataset_input": {"dataset": f"{old_implementation}", "variant": "legacy"}, "theoryid": 40_000_000, "use_cuts": "internal", "t0pdfset": "NNPDF40_nnlo_as_01180", "use_t0": True}

covmat1 = API.covmat_from_systematics(**inp1)
covmat2 = API.covmat_from_systematics(**inp2)

t0_covmat1 = API.t0_covmat_from_systematics(**inp1)
t0_covmat2 = API.t0_covmat_from_systematics(**inp2)

np.all(np.isclose(covmat1, covmat2))
np.all(np.isclose(t0_covmat1, t0_covmat2))

url: ''
version: -1
url: https://www.hepdata.net/record/ins856131?version=1&table=Table%202
version: 1
implemented_observables:
- observable_name: ZRAP
observable:
description: Drell-Yan Rapidity Distribution
label: CDF $Z$ rapidity (new)
units: ''
process_type: EWK_RAP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jacoterh

Remember to put as process_type: one of the processes that you will find in this file: https://github.com/NNPDF/nnpdf/blob/master/validphys2/src/validphys/process_options.py (at the bottom, either DY_Z_Y or DY_W_ETA)

and also put kinematic_override: identity.

This is to avoid going through some old code that we are trying to remove.

When you do that you will see that that file defines which kinematic variables are allowed, sometimes it is necessary to change from Q -> Q^2 or pT -> pT^2 (the kinematic info in the end is the same, but we are trying to standardize it a bit more)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done!

@jacoterh
Copy link
Collaborator Author

Thanks @Radonirinaunimi! The numbers have been formatted correctly now. I have also performed the check you proposed using the snippet you provided, see for instance here . If you run this script, you'll find that the 3rd bin in D0_Z0_1P96TEV differs from the legacy implementation for the reason I mentioned in the beginning of this PR.

Some questions:

  1. Why do the tests suddenly fail? Running pytest tests/test_datafiles.py locally works fine.
  2. I explicitly checked whether the comparison against the legacy implementation failed when I randomly changed some numbers. It did for all datasets, except for the 1st bin in D0_WPWM_1P96TEV. The test does neem to care about whatever I write here. This doesn't happen for the D0_Z0_1P96TEV and CDF_Z0_1P96TEV datasets.
  3. In the metadata, which hepdata table do I link for the url in case multiple entered the implementation?

@scarlehoff
Copy link
Member

The current failure is because it is plotting::kinematics_override (the probability that I get a variable name or word in general correct without looking in the first try is close to 0)

The previous one probably related to some change in digits looking at the logs.

@jacoterh
Copy link
Collaborator Author

jacoterh commented Nov 8, 2024

validphys reports:

observables = ["CDF_Z0_1P96TEV_ZRAP", "D0_WPWM_1P96TEV_ASY", "D0_Z0_1P96TEV_ZRAP"]

for obs in observables:

    new_implementation = obs
    old_implementation = obs

    inp1 = {
        "dataset_input": {"dataset": f"{new_implementation}"},
        "theoryid": 200,
        "use_cuts": "internal",
        "t0pdfset": "NNPDF40_nnlo_as_01180",
        "use_t0": True,
    }
    inp2 = {
        "dataset_input": {"dataset": f"{old_implementation}", "variant": "legacy"},
        "theoryid": 200,
        "use_cuts": "internal",
        "t0pdfset": "NNPDF40_nnlo_as_01180",
        "use_t0": True,
    }

    covmat1 = API.covmat_from_systematics(**inp1)
    covmat2 = API.covmat_from_systematics(**inp2)

    t0_covmat1 = API.t0_covmat_from_systematics(**inp1)
    t0_covmat2 = API.t0_covmat_from_systematics(**inp2)

    print(f"Comparison for {new_implementation}")
    print(np.all(np.isclose(covmat1, covmat2)))
    print(np.all(np.isclose(t0_covmat1, t0_covmat2)))

>>> Comparison for CDF_Z0_1P96TEV_ZRAP
True
True
Comparison for D0_WPWM_1P96TEV_ASY
True
True
Comparison for D0_Z0_1P96TEV_ZRAP
False
False

The last disagreement in D0_Z0_1P96TEV_ZRAP can be explained as follows. The systematic uncertainty in the 3rd bin appears symmetric already on hepdata, while the legacy implementation seems to have symmetrised plus and minus uncertainties in the past here. This also affects the central value.

@scarlehoff
Copy link
Member

Thanks Jaco. One last thing, could you check that also the experimental covmat for the 3 datasets together doesn't change for non-diagonal entries?

(you can also e.g. just check that the experimental chi2 for all 3 considered together is unchanged)

(this will be important for the singletop data as well, to check that possible experimental correlations across datasets are also captured)

@jacoterh
Copy link
Collaborator Author

jacoterh commented Nov 11, 2024

Sure, I made a plot of the ratio against the legacy implementation. There are no cross correlations between datasets here though (which I find odd as we have two datasets from D0), so the covmat is block diagonal meaning that the datasets can be cross checked separately. But I will also do this check for single top, where we do have cross correlations, thanks!
ratio_exp_covmat_tevatron_inc_dy

For reference, the ratio was obtained with

dsinps = [
        {'dataset': 'CDF_Z0_1P96TEV_ZRAP'},
        {'dataset': 'D0_WPWM_1P96TEV_ASY'},
        {'dataset': 'D0_Z0_1P96TEV_ZRAP'}
]

dsinps_legacy = [
        {'dataset': 'CDF_Z0_1P96TEV_ZRAP', 'variant': 'legacy'},
        {'dataset': 'D0_WPWM_1P96TEV_ASY', 'variant': 'legacy'},
        {'dataset': 'D0_Z0_1P96TEV_ZRAP', 'variant': 'legacy'}
]

inp = dict(dataset_inputs=dsinps, theoryid=200, use_cuts="internal")
inp_legacy = dict(dataset_inputs=dsinps_legacy, theoryid=200, use_cuts="internal")
cov = API.dataset_inputs_covmat_from_systematics(**inp)
cov_legacy = API.dataset_inputs_covmat_from_systematics(**inp_legacy)

Copy link
Member

@scarlehoff scarlehoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jacoterh

Thanks for this. All seems fine, the only thing, could you bump the version and change the version_comment in the metadata.yml files?

RE the failing test, you might need to regenerate pseudodata_test_fit_n3fit_240916. Since it is only the pseudodata that is relevant, you can just do:

vp-fitrename pseudodata_test_fit_n3fit_240916 pseudodata_test_fit_n3fit_241112
cp pseudodata_test_fit_n3fit_241112/filter.yaml  pseudodata_test_fit_n3fit_241112.yaml

And then just regenerate the data for the 10 replicas, no need to wait until the fit is done.

@scarlehoff scarlehoff added the Done PRs that are done but waiting on something else to merge/approve label Nov 12, 2024
@scarlehoff scarlehoff marked this pull request as ready for review November 13, 2024 15:40
@jacoterh
Copy link
Collaborator Author

@scarlehoff I have tried to follow your steps but I am a bit lost. This is what I did.

vp-get fit pseudodata_test_fit_n3fit_240916 
vp-fitrename pseudodata_test_fit_n3fit_240916 pseudodata_test_fit_n3fit_241115
cp pseudodata_test_fit_n3fit_241115/filter.yml pseudodata_test_fit_n3fit_241115.yml  

and launched the fit with

n3fit pseudodata_test_fit_n3fit_241115.yml 10

This gives me a fitted replica, but then what do I commit to make sure the tests don't fail anymore?

@scarlehoff
Copy link
Member

Upload the new fit to the server (so that it is available) amd in conftest change the name of the fit. If you then run pytest for that test it should work.

@jacoterh
Copy link
Collaborator Author

jacoterh commented Nov 15, 2024

Thanks, I ran the fit on 20 replicas, evolved them, but now postfit filters out 90% of them:
[INFO]: 2 replicas pass IntegNumber_3.
Should I just fit more replicas, i.e. is this expected? 90% seems a bit harsh.

And if it's just the pseudo data that we need, we don't even care right?

@scarlehoff
Copy link
Member

Yes, you don't even need to finish the fit. That's why if you use fit rename you get the previous one and just take need the new data csvs

@scarlehoff
Copy link
Member

Let me know whether I can merge this.

@jacoterh
Copy link
Collaborator Author

For future reference, this is the (hopefully) final validphys comparison: https://vp.nnpdf.science/PceSQWJaSLCrCp7wHzrsWg==/

If the tests pass, this branch should be ready for merging

@scarlehoff scarlehoff merged commit 676e70f into master Nov 25, 2024
6 checks passed
@scarlehoff scarlehoff deleted the tevatron_inc_DY branch November 25, 2024 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data toolchain Done PRs that are done but waiting on something else to merge/approve
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants