
dd_to_csv bugfixes, move cache to ~/.cache/xl2times/ #230

Merged · 16 commits · Mar 23, 2024

Conversation

@SamRWest (Collaborator) commented Mar 21, 2024

Just a minor bugfix and some housekeeping.

  • Fixes a bug in dd_to_csv.py where DD variable names containing spaces don't parse correctly.
  • Makes dd_to_csv a second script in pyproject.toml (we'd like to use it from the austimes repo, and couldn't easily otherwise).
  • Moves the pickle cache to ~/.cache/xl2times.
  • Adds cache invalidation to delete old cache files (just anything >365 days old; couldn't think of a better criterion).

```python
else:
    raise ValueError(
        f"Unexpected number of spaces in parameter value setting: {data[index]}"
    )
```
Collaborator Author:
This ValueError was getting thrown from veda-produced austimes DD files for a variable with a space in its name

Member:

@SamRWest when you write "variable" do you mean an index of a GAMS parameter or a GAMS parameter?

Collaborator Author:

Sorry, 'variable' is the wrong terminology.

I mean the key part referred to in this comment:

```python
# Either "value" for a scalar, or "key value" for an array.
```

Collaborator Author:

For example, here's a line from one of our DD files that (I think) would have triggered the ValueError, because of the spaces in `UC_non-decreasing EE penetration`:

```
'UC_non-decreasing EE penetration'.RHS.'QLD'.2015.'EE2_Pri_Pub-b'.ANNUAL 1
```

Member:

I see, thanks! It should be okay not to raise ValueError, as long as the key is in quotes. A scalar is numeric, so there should be no spaces in it.

Collaborator Author:

Cool, that was my assumption.
The new code just splits the line at the last space char, whereas the old code split it at all spaces and then raised an error if it ended up with more than two tokens, which failed on my example string above.
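To illustrate the difference, here is a minimal sketch (not the actual dd_to_csv code; variable names are illustrative) using the example line from above:

```python
line = "'UC_non-decreasing EE penetration'.RHS.'QLD'.2015.'EE2_Pri_Pub-b'.ANNUAL 1"

# Old approach: split at every space. The quoted key contains two spaces,
# so this yields four tokens and tripped the "unexpected number of spaces" error.
tokens = line.split(" ")
assert len(tokens) == 4

# New approach: split once at the last space, so everything before it is the
# dotted attribute key and the final token is the value.
split_point = line.rfind(" ")
attributes, value = line[:split_point], line[split_point + 1:]
assert value == "1"
assert attributes.endswith(".ANNUAL")
```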

Member:

Btw, sometimes there is no value, e.g. from Demo 1:

```
SET COM_TMAP
/
'REG1'.'DEM'.'TPSCOA'
'REG1'.'NRG'.'COA'
/;
```

Let's say somebody does this:

```
SET COM_TMAP
/
'REG1'.'DEM'.'TPS COA'
'REG1'.'NRG'.'COA'
/;
```

What will happen?

Member:

Ah, I can see we are handling this above when we check whether it is a parameter.

Collaborator Author:

Yep, the new code will catch this with this check, because `rfind()` returns -1 if no spaces are found in the string:

```python
if split_point == -1:
    # if only one word
    attributes, value = [], line
```
Comment on lines +61 to +69

```diff
  # So value is always the last word, or only token
  split_point = line.rfind(" ")
  if split_point == -1:
      # if only one word
      attributes, value = [], line
  else:
-     raise ValueError(
-         f"Unexpected number of spaces in parameter value setting: {data[index]}"
-     )
+     attributes, value = line[:split_point], line[split_point + 1 :]
+     attributes = attributes.split(".")
+     attributes = [a if " " in a else a.strip("'") for a in attributes]
```
Collaborator Author:

`rfind()` works more reliably on the assumption that there's always a space before the value, so it handles `'key with spaces' value`-style strings.

"""
with open(filename, "rb") as f:
digest = hashlib.file_digest(f, "sha256") # pyright: ignore
hsh = digest.hexdigest()
if os.path.isfile(cache_dir + hsh):
fname1, _timestamp, tables = pickle.load(open(cache_dir + hsh, "rb"))
hash_file = cache_dir / f"{Path(filename).stem}_{hsh}.pkl"
Collaborator:

If the cache filename contains the original filename, then perhaps we no longer need to store a tuple in the pickle? And as discussed we probably don't need to check the modified timestamp, since it's super unlikely that a file gets modified but its hash remains the same.

cleaned up caching code as suggested
```python
pickle.dump(tables, f)
logger.info(f"Saved cache for (unknown) to {cached_file}")

return tables
```
Collaborator Author:

How about this @siddharth-krishna ?

Collaborator:

Looks good, thanks!

@SamRWest (Collaborator Author) commented Mar 21, 2024

Hmmm, benchmarks are failing to find a ground-truth CSV for demo7. Gotta split, but will have a look in the morning.
Any ideas in the meantime guys?
Error is here: https://github.com/etsap-TIMES/xl2times/actions/runs/8370370796/job/22917616560?pr=230#step:14:3255

 File "/home/runner/work/xl2times/xl2times/xl2times/xl2times/__main__.py", line 509, in run
    ground_truth = read_csv_tables(args.ground_truth_dir)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
FileNotFoundError: [Errno 2] No such file or directory: 'benchmarks/csv/DemoS_007-all/benchmarks/csv/DemoS_007-all/TS_MAP.csv'

@olejandro (Member) commented Mar 21, 2024

> Hmmm, benchmarks are failing to find a ground-truth CSV for demo7. Gotta split, but will have a look in the morning.

Other demos as well!
To me this path looks strange: `benchmarks/csv/DemoS_007-all/benchmarks/csv/DemoS_007-all/TS_MAP.csv`
Should it not be `benchmarks/csv/DemoS_007-all/TS_MAP.csv` instead?

@olejandro (Member) commented:

> • Made dd_to_csv a second script in pyproject.toml (we'd like to use it from the austimes repo, and couldn't easily otherwise)

@SamRWest could you explain this a bit more? I do not think I understand the motivation / need for the change.

@SamRWest (Collaborator Author) commented:

> > • Made dd_to_csv a second script in pyproject.toml (we'd like to use it from the austimes repo, and couldn't easily otherwise)
>
> @SamRWest could you explain this a bit more? I do not think I understand the motivation / need for the change.

Just that I'd like to be able to convert DD files in other projects to CSV from the commandline:

```shell
cd some_other_project
pip install xl2times
dd_to_csv my/dd/files/
```

The last line is made possible by adding `dd_to_csv` to the `[project.scripts]` section in `pyproject.toml`, which requires `dd_to_csv.py` to be moved into the `/xl2times` module from `/scripts`.
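For reference, such an entry looks roughly like this (the exact module paths and function names here are assumptions, not copied from the repo):

```toml
[project.scripts]
xl2times = "xl2times.__main__:main"
dd_to_csv = "xl2times.dd_to_csv:main"
```

Each key becomes a console command that pip installs onto the user's PATH, pointing at the given `module:function` entry point.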

@SamRWest (Collaborator Author) commented:

Ok, tests are all passing now, but CI is failing because the runtime has gone up now that the pickle cache gets regenerated.

@siddharth-krishna - how can we update the pickle cache with the new files when CI is failing? Disable the runtime check temporarily?

@siddharth-krishna (Collaborator) left a comment:

Looks good, thanks! But we'll also need to change the cache path in the github actions yaml, and perhaps also bump the cache key so that it saves a new one. If you don't mind I can push a commit to your branch directly?

@rschuchmann would there be any issues with the Times-Miro app if the cache folder is moved to ~/.cache/xl2times/?


@siddharth-krishna changed the title from "dd_to_csv bugfixes," to "dd_to_csv bugfixes, move cache to ~/.cache/xl2times/" on Mar 22, 2024
@rschuchmann commented:
> Looks good, thanks! But we'll also need to change the cache path in the github actions yaml, and perhaps also bump the cache key so that it saves a new one. If you don't mind I can push a commit to your branch directly?
>
> @rschuchmann would there be any issues with the Times-Miro app if the cache folder is moved to ~/.cache/xl2times/?

That should be fine. Personally, I would probably use $HOME (and on Windows %HOMEPATH%), but that's up to you.

@siddharth-krishna (Collaborator) left a comment:

Now the regression tests pass with main being much slower than this PR branch! This is because the CI cache now restores ~/.cache/xl2times/ but not xl2times/.cache/, so the code on main no longer has access to its cache. :)

@rschuchmann yep, the code internally uses Path.home() which should work on Windows too.
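A quick sketch of that cross-platform claim (illustrative only; the real code's variable names may differ):

```python
from pathlib import Path

# Path.home() resolves to the user's home directory on Linux, macOS, and
# Windows alike, so the cache lands under ~/.cache/xl2times on every platform.
cache_dir = Path.home() / ".cache" / "xl2times"
assert cache_dir.name == "xl2times"
assert cache_dir.parent.name == ".cache"
```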

@olejandro olejandro marked this pull request as ready for review March 23, 2024 00:52
@olejandro (Member) commented:

@SamRWest I took the liberty to merge your PR, since all the tests are passing now. Hope you don't mind!

@olejandro olejandro merged commit bf5ea29 into main Mar 23, 2024
2 checks passed
@olejandro olejandro deleted the samw/dd_convert_bugfixes branch March 23, 2024 00:53
@SamRWest (Collaborator Author) commented:

> @SamRWest I took the liberty to merge your PR, since all the tests are passing now. Hope you don't mind!

All good @olejandro :) And thanks for sorting out the cache stuff for me @siddharth-krishna!
