-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unpin ruamel.yaml version #2226
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for this, you are doing god's work here.
One thing, wouldn't it make sense to have the import ruamel.yaml ; yaml = YAML() in one single place (nnpdf_data for instance) and import it from there?
Would make your life easier I think debugging and ensuring that everything is doing the same thing.
nnpdf_data/nnpdf_data/utils.py
Outdated
Loader = yaml.CLoader | ||
except AttributeError: | ||
Loader = yaml.Loader | ||
yaml = YAML(typ='safe') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using libyaml (CLoader) when available makes a huge difference in time when loading large (or many) files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLoader is PyYaml, do we want to keep using both yaml packages?
On that note: all filter.py use pyyaml - do we want to change that or is it fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find an explicit mention in the docs but it might be that as long as you use safe
(and pure is not True) it will use CLoader under the hood https://sourceforge.net/p/ruamel-yaml/tickets/159/#8f94
but the comment is from 2017... we should check. The difference in performance was enough to think about dropping ruamel if we can no longer use CLoader
I'm still trying to understand some of the errors, it seems to depend on what settings are used during import (which is a bit concerning that it actually affects the way the yaml is parsed rather than just the formatting upon dumping). If this is the case then we may need a different If not then you may be right |
The documentation of ruamel is worse than our own 🫠 but from https://yaml.readthedocs.io/en/latest/basicuse/#more-examples I think you would need at least two yaml functions in utils.
The fast loader 1) would support only yaml 1.1 while the other is also 1.2 (or safe + pure, also 1.2), which explains probably the differences you see. |
def _quick_yaml_load(filepath): | ||
return yaml.load(filepath.read_text(encoding="utf-8"), Loader=Loader) | ||
|
||
yaml = YAML(typ='rt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yaml = YAML(typ='rt') | |
yaml = YAML(typ='safe') |
Just benchmarked it, with this change is (almost) as fast as master. Not as fast though, it takes 1.3x more.
With rt
it takes 10x more.
As I said, I think the best thing at this point is to have in utils safe_load
and rt_load
or whatever with a comment explaining why. That will make it easier if we need to change it again.
fad96a6
to
39741da
Compare
I've added a single initialization in The pyproject file still points to the corresponding branch in reportengine so that should be changed if we decide to merge both. |
commondataparser will be moved to nnpdf_data, can it use pyyaml as well? |
nnpdf_data/nnpdf_data/utils.py
Outdated
try: | ||
return parse_input(inp, spec) | ||
except ValidationError as e: | ||
current_exc = e | ||
# In order to provide a more complete error information, use round_trip_load | ||
# to read the .yaml file again (insetad of using the CLoader) | ||
current_inp = yaml.round_trip_load(input_yaml.open("r", encoding="utf-8")) | ||
current_inp = yaml.load(input_yaml.open("r", encoding="utf-8"), yaml.Loader) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this work? The extra information here needs the rt loader, otherwise you could use the load from before.
Also, is yaml.Loader what in ruamel is CLoader? Otherwise it will be slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out ruamel is built on top of pyyaml, so CLoader is the same thing in both. I took the passing ci as indication that this was fine, but I'll revert it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you need to use CLoader for the other one to speed things up (and you need the try-except since libyaml might not be installed everywhere).
I think all in all it is better to use ruamel also here, pyyaml doesn't really simplify things.
I took the passing ci as indication that this was fine, but I'll revert it
We would need a test that not only checks for errors but looks also at the message... and in this case is a ValidationError
from reportengine. I'll put it in my "probably never happening but who knows" to do list
I wasn't too sure from the ruamel code, but it seems that ruamel doesn't use CLoader. I don't believe there is any key that allows to set that. These are the timings:
Here is the script for reference: import pathlib
from timeit import timeit
from ruamel.yaml import YAML
import yaml
import nnpdf_data
yaml_rt = YAML(typ="rt")
yaml_fast = YAML(typ="safe", pure=False)
yaml_safe = YAML(typ="safe")
file_path = nnpdf_data.theory_cards / "410.yaml"
def load_with_yaml_rt():
input_yaml = pathlib.Path(file_path)
return yaml_rt.load(input_yaml.read_text(encoding="utf-8"))
def load_with_yaml_fast():
input_yaml = pathlib.Path(file_path)
return yaml_fast.load(input_yaml.read_text(encoding="utf-8"))
def load_with_CLoader():
input_yaml = pathlib.Path(file_path)
return yaml.load(input_yaml.read_text(encoding="utf-8"), Loader=yaml.CLoader)
def load_with_yaml_safe():
input_yaml = pathlib.Path(file_path)
return yaml_fast.load(input_yaml.read_text(encoding="utf-8"))
num_iterations = 1000
time_yaml_rt = timeit(load_with_yaml_rt, number=num_iterations)
time_yaml_fast = timeit(load_with_yaml_fast, number=num_iterations)
time_yaml_safe = timeit(load_with_yaml_safe, number=num_iterations)
time_yaml_CLoader = timeit(load_with_CLoader, number=num_iterations)
print(f"yaml_rt: {time_yaml_rt / num_iterations:.6f} seconds per load")
print(f"yaml_fast: {time_yaml_fast / num_iterations:.6f} seconds per load")
print(f"yaml_safe: {time_yaml_safe / num_iterations:.6f} seconds per load")
print(f"CLoader: {time_yaml_CLoader / num_iterations:.6f} seconds per load") |
91c2cde
to
2e4096d
Compare
Along with NNPDF/reportengine#67 this PR addresses the dependence on the ruamel.yaml version which is pinned to
<0.18
due to changes in the API.Hopefully this also allows us to resolve the problem with
vp-nextfitruncard
removing newlines and comments from the runcard 🤞