Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unpin ruamel.yaml version #2226

Merged
merged 20 commits into from
Dec 4, 2024
Merged

unpin ruamel.yaml version #2226

merged 20 commits into from
Dec 4, 2024

Conversation

RoyStegeman
Copy link
Member

Along with NNPDF/reportengine#67 this PR addresses the dependence on the ruamel.yaml version which is pinned to <0.18 due to changes in the API.

Hopefully this also allows us to resolve the problem with vp-nextfitruncard removing newlines and comments from the runcard 🤞

Copy link
Member

@scarlehoff scarlehoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for this, you are doing god's work here.

One thing, wouldn't it make sense to have the import ruamel.yaml ; yaml = YAML() in one single place (nnpdf_data for instance) and import it from there?

Would make your life easier I think debugging and ensuring that everything is doing the same thing.

Loader = yaml.CLoader
except AttributeError:
Loader = yaml.Loader
yaml = YAML(typ='safe')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using libyaml (CLoader) when available makes a huge difference in time when loading large (or many) files.

Copy link
Member Author

@RoyStegeman RoyStegeman Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLoader is PyYaml, do we want to keep using both yaml packages?

On that note: all filter.py use pyyaml - do we want to change that or is it fine?

Copy link
Member

@scarlehoff scarlehoff Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find an explicit mention in the docs but it might be that as long as you use safe (and pure is not True) it will use CLoader under the hood https://sourceforge.net/p/ruamel-yaml/tickets/159/#8f94

but the comment is from 2017... we should check. The difference in performance was enough to think about dropping ruamel if we can no longer use CLoader

@RoyStegeman
Copy link
Member Author

One thing, wouldn't it make sense to have the import ruamel.yaml ; yaml = YAML() in one single place (nnpdf_data for instance) and import it from there?

I'm still trying to understand some of the errors, it seems to depend on what settings are used during import (which is a bit concerning that it actually affects the way the yaml is parsed rather than just the formatting upon dumping). If this is the case then we may need a different YAML per module, though in general you're right and I'll probably clean it up once I understand the issues.

If not then you may be right

@scarlehoff
Copy link
Member

scarlehoff commented Nov 25, 2024

The documentation of ruamel is worse than our own 🫠 but from https://yaml.readthedocs.io/en/latest/basicuse/#more-examples I think you would need at least two yaml functions in utils.

  1. a "fast" where pure=False and typ=safe for the commondata, see also here to check all the combinations you can do, the goal is to arrive at cyaml/cparser 😆 https://sourceforge.net/p/ruamel-yaml/code/ci/default/tree/main.py#l105

  2. A slow typ=rt (which is the default) for all the fancy stuff.

The fast loader 1) would support only yaml 1.1 while the other is also 1.2 (or safe + pure, also 1.2), which explains probably the differences you see.

def _quick_yaml_load(filepath):
return yaml.load(filepath.read_text(encoding="utf-8"), Loader=Loader)

yaml = YAML(typ='rt')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
yaml = YAML(typ='rt')
yaml = YAML(typ='safe')

Just benchmarked it, with this change is (almost) as fast as master. Not as fast though, it takes 1.3x more.
With rt it takes 10x more.

As I said, I think the best thing at this point is to have in utils safe_load and rt_load or whatever with a comment explaining why. That will make it easier if we need to change it again.

@RoyStegeman
Copy link
Member Author

RoyStegeman commented Dec 3, 2024

I've added a single initialization invp.utils for the different options we may want to use in vp and n3fit. In nnpdf_data mostly pyyaml was used and ruamel only in some places where formatting doesn't matter, so I made nnpdf_data pyyaml-only.

The pyproject file still points to the corresponding branch in reportengine so that should be changed if we decide to merge both.

@scarlehoff
Copy link
Member

commondataparser will be moved to nnpdf_data, can it use pyyaml as well?

try:
return parse_input(inp, spec)
except ValidationError as e:
current_exc = e
# In order to provide a more complete error information, use round_trip_load
# to read the .yaml file again (insetad of using the CLoader)
current_inp = yaml.round_trip_load(input_yaml.open("r", encoding="utf-8"))
current_inp = yaml.load(input_yaml.open("r", encoding="utf-8"), yaml.Loader)
Copy link
Member

@scarlehoff scarlehoff Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work? The extra information here needs the rt loader, otherwise you could use the load from before.

Also, is yaml.Loader what in ruamel is CLoader? Otherwise it will be slow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out ruamel is built on top of pyyaml, so CLoader is the same thing in both. I took the passing ci as indication that this was fine, but I'll revert it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you need to use CLoader for the other one to speed things up (and you need the try-except since libyaml might not be installed everywhere).
I think all in all it is better to use ruamel also here, pyyaml doesn't really simplify things.

I took the passing ci as indication that this was fine, but I'll revert it

We would need a test that not only checks for errors but looks also at the message... and in this case is a ValidationError from reportengine. I'll put it in my "probably never happening but who knows" to do list

@RoyStegeman
Copy link
Member Author

RoyStegeman commented Dec 4, 2024

I wasn't too sure from the ruamel code, but it seems that ruamel doesn't use CLoader. I don't believe there is any key that allows to set that. These are the timings:

yaml_rt: 0.002154 seconds per load
yaml_fast: 0.000500 seconds per load
yaml_safe: 0.000504 seconds per load
CLoader: 0.000153 seconds per load

Here is the script for reference:

import pathlib
from timeit import timeit

from ruamel.yaml import YAML
import yaml

import nnpdf_data

yaml_rt = YAML(typ="rt")
yaml_fast = YAML(typ="safe", pure=False)
yaml_safe = YAML(typ="safe")

file_path = nnpdf_data.theory_cards / "410.yaml"


def load_with_yaml_rt():
    input_yaml = pathlib.Path(file_path)
    return yaml_rt.load(input_yaml.read_text(encoding="utf-8"))


def load_with_yaml_fast():
    input_yaml = pathlib.Path(file_path)
    return yaml_fast.load(input_yaml.read_text(encoding="utf-8"))


def load_with_CLoader():
    input_yaml = pathlib.Path(file_path)
    return yaml.load(input_yaml.read_text(encoding="utf-8"), Loader=yaml.CLoader)


def load_with_yaml_safe():
    input_yaml = pathlib.Path(file_path)
    return yaml_fast.load(input_yaml.read_text(encoding="utf-8"))

num_iterations = 1000
time_yaml_rt = timeit(load_with_yaml_rt, number=num_iterations)
time_yaml_fast = timeit(load_with_yaml_fast, number=num_iterations)
time_yaml_safe = timeit(load_with_yaml_safe, number=num_iterations)
time_yaml_CLoader = timeit(load_with_CLoader, number=num_iterations)

print(f"yaml_rt: {time_yaml_rt / num_iterations:.6f} seconds per load")
print(f"yaml_fast: {time_yaml_fast / num_iterations:.6f} seconds per load")
print(f"yaml_safe: {time_yaml_safe / num_iterations:.6f} seconds per load")
print(f"CLoader: {time_yaml_CLoader / num_iterations:.6f} seconds per load")

@RoyStegeman RoyStegeman merged commit 08e9384 into master Dec 4, 2024
6 checks passed
@RoyStegeman RoyStegeman deleted the update_ruamel branch December 4, 2024 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants