
Very small differences in numerical values when determining correct rows #246

Closed
olejandro opened this issue Nov 17, 2024 · 6 comments

Comments

@olejandro (Member)

For example, there are currently 24 additional/missing values for PRC_RESID on Ireland.

@olejandro (Member, Author)

@siddharth-krishna any chance you could take a look at this one?

@siddharth-krishna (Collaborator)

I assume you mean the additional rows in https://github.com/etsap-TIMES/xl2times/actions/runs/11969575511/job/33370600563?pr=240 for commit 5e9db3e

The underlying problem is that we convert both the produced tables and the ground-truth tables to strings before comparing them, so it isn't easy to apply a tolerance (e.g. 1e-5) when comparing float values. I think we did this because we were not always using the correct data types for columns in pandas DataFrames, so we couldn't use the built-in DataFrame equals/compare methods. We now have a transform.convert_to_string that converts all columns to strings before we output them.

Perhaps it's time to look into this again and see if we can remove convert_to_string, ensure we're using correct column types (int for years, float for other numeric tables, etc) and then compare DataFrames with a tolerance value?
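For illustration, here is a minimal sketch of what comparing DataFrames with a tolerance could look like once columns have proper dtypes. The frame contents and column names here are made up, not taken from xl2times:

```python
import pandas as pd

# Illustrative frames: "produced" has a tiny float drift vs. the ground truth.
produced = pd.DataFrame({"region": ["IE", "IE"], "value": [1.00000001, 2.0]})
ground_truth = pd.DataFrame({"region": ["IE", "IE"], "value": [1.0, 2.0]})

# assert_frame_equal raises on mismatch; with check_exact=False, atol/rtol
# allow small numeric differences to pass.
pd.testing.assert_frame_equal(
    produced, ground_truth, check_exact=False, atol=1e-5, rtol=0
)
print("frames match within tolerance")
```

The same comparison on string-typed columns would fail, since "1.00000001" != "1.0" as text, which is exactly the limitation described above.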

@olejandro (Member, Author)

> I assume you mean the additional rows in https://github.com/etsap-TIMES/xl2times/actions/runs/11969575511/job/33370600563?pr=240 for commit 5e9db3e

Well, they are identified both as additional and missing. :-)

@olejandro (Member, Author)

olejandro commented Nov 22, 2024

> Perhaps it's time to look into this again and see if we can remove convert_to_string, ensure we're using correct column types (int for years, float for other numeric tables, etc) and then compare DataFrames with a tolerance value?

How about doing this in a step-wise manner? Most of the columns are fine to remain strings. We could start with, e.g., value (the most important one) and then see whether we should do the same for year. I believe we do not need to go beyond TimesModel.attributes and TimesModel.uc_attributes, at least for the moment.
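The step-wise idea could be sketched roughly as follows: leave most columns as strings and convert only the value column to float before comparing. The table contents here are illustrative, not from xl2times:

```python
import pandas as pd

# Illustrative attributes-style table where everything starts as strings.
df = pd.DataFrame(
    {"attribute": ["ACT_COST"], "year": ["2020"], "value": ["1.0000000001"]}
)

# Convert only "value" to a numeric dtype; "year" stays a string for now.
df["value"] = pd.to_numeric(df["value"])
print(df.dtypes["value"])  # prints float64
```

Once value is numeric, a tolerance-based comparison can be applied to that column alone while the rest of the pipeline is unchanged.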

@siddharth-krishna (Collaborator)

I had another idea for a quicker/step-wise solution: we could print floats as strings to a specific precision in convert_to_string, which might fix the issue for now, and then continue with the correct column types in a separate PR?
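A sketch of that quicker fix, formatting floats to a fixed number of significant figures during string conversion (illustrative only; the actual convert_to_string in xl2times may differ):

```python
def float_to_str(x: float, precision: int = 10) -> str:
    # %g-style formatting keeps `precision` significant figures and drops
    # trailing zeros, so values that differ only in the last few digits
    # stringify identically.
    return f"{x:.{precision}g}"

# The classic float artifact disappears at 10 significant figures:
print(float_to_str(0.30000000000000004))  # prints 0.3
```

This makes the string comparison tolerant of last-digit float noise without touching column dtypes, at the cost of hiding genuine differences below the chosen precision.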

olejandro pushed a commit that referenced this issue Dec 7, 2024
An attempted fix for #246, which was first noticed in #240. With some [trial and error](#240 (comment)) I set a precision of `.10g` (10 significant figures, falling back to exponential/scientific notation only for very large or small values) for floating-point numbers.

I also had to change the `gdxdiff` options to additionally use a relative-difference tolerance when comparing the output DD files to the ground truth; otherwise the higher-precision output above increased the DD diff. We now use `1e-6` for both the absolute (`Eps`) and relative (`RelEps`) difference tolerances. If I understand correctly how `gdxdiff`
[works](https://www.gams.com/latest/docs/T_GDXDIFF.html#GDXDIFF_OPTIONS_CRITERIONEXPLANATION),
this feels like a reasonable tolerance:
```
AbsDiff := Abs(V1 - V2);
if  AbsDiff <= EpsAbsolute
then
  Result := true
else
  if EpsRelative > 0.0
  then
     Result := AbsDiff / (1.0 + DMin(Abs(V1), Abs(V2))) <= EpsRelative
  else
     Result := false;
```

Finally, we were reporting the number of lines in the output of
`gdxdiff`, but that is only the number of sets/parameters that differ,
and doesn't track improvements in the number of differing rows.
I'm now passing the output of `gdxdiff` to `gdxdump` so that the "GDX
Diff" column is more accurate and tracks the number of rows in the diff.
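The `gdxdiff` criterion quoted above can be transliterated into Python to see why `1e-6` is a reasonable choice (this is an illustrative re-implementation, not GAMS or xl2times code):

```python
def values_equal(
    v1: float,
    v2: float,
    eps_absolute: float = 1e-6,
    eps_relative: float = 1e-6,
) -> bool:
    """Mirror gdxdiff's comparison: absolute check first, then relative."""
    abs_diff = abs(v1 - v2)
    if abs_diff <= eps_absolute:
        return True
    if eps_relative > 0.0:
        # The 1.0 + min(|v1|, |v2|) denominator makes this behave like an
        # absolute tolerance for small values and a relative one for large.
        return abs_diff / (1.0 + min(abs(v1), abs(v2))) <= eps_relative
    return False

print(values_equal(1_000_000.0, 1_000_000.9))  # prints True (rel diff ~9e-7)
print(values_equal(1.0, 1.1))                  # prints False
```

So with `Eps = RelEps = 1e-6`, a difference of nearly one unit is tolerated on million-scale values, while tenth-of-a-unit differences on small values are still flagged.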
@siddharth-krishna (Collaborator)

I'm closing this because #252 introduced an epsilon for comparing numeric values, and the larger goal of using correct pandas data types can be tracked by #47. Let me know if I'm missing something.
