
Very small differences in numerical values when determining correct rows #246

Closed
olejandro opened this issue Nov 17, 2024 · 6 comments

Comments

@olejandro (Member)

For example, there are currently 24 additional/missing values for PRC_RESID on Ireland.

@olejandro (Member, Author)

@siddharth-krishna any chance you could take a look at this one?

@siddharth-krishna (Collaborator)

I assume you mean the additional rows in https://github.com/etsap-TIMES/xl2times/actions/runs/11969575511/job/33370600563?pr=240 for commit 5e9db3e

The underlying problem is that we convert both the produced tables and the ground-truth tables to strings before comparing them, so it isn't easy to apply a tolerance (e.g. 1e-5) when comparing float values. I think we did this because we were not always using the correct data types for columns in pandas DataFrames, so we couldn't use the built-in DataFrame equals/compare methods. We now have a transform.convert_to_string that converts all columns to strings before we output them.

Perhaps it's time to look into this again and see if we can remove convert_to_string, ensure we're using correct column types (int for years, float for other numeric tables, etc) and then compare DataFrames with a tolerance value?
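For illustration, here is a minimal sketch of what comparing DataFrames with a tolerance could look like once columns have proper dtypes. The frame contents and column names here are made up, not taken from xl2times:

```python
import pandas as pd

# Illustrative frames: "produced" has a tiny float drift vs. the ground truth.
produced = pd.DataFrame({"region": ["IE", "IE"], "value": [1.00000001, 2.0]})
ground_truth = pd.DataFrame({"region": ["IE", "IE"], "value": [1.0, 2.0]})

# assert_frame_equal raises on mismatch; with check_exact=False, atol/rtol
# allow small numeric differences to pass.
pd.testing.assert_frame_equal(
    produced, ground_truth, check_exact=False, atol=1e-5, rtol=0
)
print("frames match within tolerance")
```

The same comparison on string-typed columns would fail, since "1.00000001" != "1.0" as text, which is exactly the limitation described above.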

@olejandro (Member, Author)

> I assume you mean the additional rows in https://github.com/etsap-TIMES/xl2times/actions/runs/11969575511/job/33370600563?pr=240 for commit 5e9db3e

Well, they are identified both as additional and missing. :-)

@olejandro (Member, Author)

olejandro commented Nov 22, 2024

> Perhaps it's time to look into this again and see if we can remove convert_to_string, ensure we're using correct column types (int for years, float for other numeric tables, etc) and then compare DataFrames with a tolerance value?

How about doing this in a step-wise manner? Most of the columns are fine to remain strings. We could start with, e.g., value (the most important one) and then see whether we should do the same for year. I believe we do not need to go beyond TimesModel.attributes and TimesModel.uc_attributes, at least for the moment.
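The step-wise idea could be sketched roughly as follows: leave most columns as strings and convert only the value column to float before comparing. The table contents here are illustrative, not from xl2times:

```python
import pandas as pd

# Illustrative attributes-style table where everything starts as strings.
df = pd.DataFrame(
    {"attribute": ["ACT_COST"], "year": ["2020"], "value": ["1.0000000001"]}
)

# Convert only "value" to a numeric dtype; "year" stays a string for now.
df["value"] = pd.to_numeric(df["value"])
print(df.dtypes["value"])  # prints float64
```

Once value is numeric, a tolerance-based comparison can be applied to that column alone while the rest of the pipeline is unchanged.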

@siddharth-krishna (Collaborator)

I had another idea for a quicker/step-wise solution: we could print floats as strings to a specific precision in convert_to_string, which might fix the issue for now, and then continue with the correct column types in a separate PR?
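A sketch of that quicker fix, formatting floats to a fixed number of significant figures during string conversion (illustrative only; the actual convert_to_string in xl2times may differ):

```python
def float_to_str(x: float, precision: int = 10) -> str:
    # %g-style formatting keeps `precision` significant figures and drops
    # trailing zeros, so values that differ only in the last few digits
    # stringify identically.
    return f"{x:.{precision}g}"

# The classic float artifact disappears at 10 significant figures:
print(float_to_str(0.30000000000000004))  # prints 0.3
```

This makes the string comparison tolerant of last-digit float noise without touching column dtypes, at the cost of hiding genuine differences below the chosen precision.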

olejandro pushed a commit that referenced this issue Dec 7, 2024
An attempted fix for #246, which was first noticed in #240. With some [trial and error](#240 (comment)) I set a precision of `.10g` (10 significant figures, falling back to exponential/scientific notation only for very large or small values) for floating-point numbers.

I also had to change the `gdxdiff` options to additionally use a relative-difference tolerance when comparing the output DD files to the ground truth; otherwise the higher-precision output above increased the DD diff. We now use `1e-6` for both the absolute (`Eps`) and relative (`RelEps`) difference tolerances. If I understand correctly how `gdxdiff`
[works](https://www.gams.com/latest/docs/T_GDXDIFF.html#GDXDIFF_OPTIONS_CRITERIONEXPLANATION),
this feels like a reasonable tolerance:
```
AbsDiff := Abs(V1 - V2);
if  AbsDiff <= EpsAbsolute
then
  Result := true
else
  if EpsRelative > 0.0
  then
     Result := AbsDiff / (1.0 + DMin(Abs(V1), Abs(V2))) <= EpsRelative
  else
     Result := false;
```

Finally, we were reporting the number of lines in the output of
`gdxdiff`, but that is only the number of sets/parameters that differ,
and doesn't track improvements in the number of differing rows.
I'm now passing the output of `gdxdiff` to `gdxdump` so that the "GDX
Diff" column is more accurate and tracks the number of rows in the diff.
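The `gdxdiff` criterion quoted above can be transliterated into Python to see why `1e-6` is a reasonable choice (this is an illustrative re-implementation, not GAMS or xl2times code):

```python
def values_equal(
    v1: float,
    v2: float,
    eps_absolute: float = 1e-6,
    eps_relative: float = 1e-6,
) -> bool:
    """Mirror gdxdiff's comparison: absolute check first, then relative."""
    abs_diff = abs(v1 - v2)
    if abs_diff <= eps_absolute:
        return True
    if eps_relative > 0.0:
        # The 1.0 + min(|v1|, |v2|) denominator makes this behave like an
        # absolute tolerance for small values and a relative one for large.
        return abs_diff / (1.0 + min(abs(v1), abs(v2))) <= eps_relative
    return False

print(values_equal(1_000_000.0, 1_000_000.9))  # prints True (rel diff ~9e-7)
print(values_equal(1.0, 1.1))                  # prints False
```

So with `Eps = RelEps = 1e-6`, a difference of nearly one unit is tolerated on million-scale values, while tenth-of-a-unit differences on small values are still flagged.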
@siddharth-krishna (Collaborator)

I'm closing this because #252 introduced an epsilon for comparing numeric values, and the larger goal of using correct pandas data types can be tracked by #47. Let me know if I'm missing something.
