merge duplicate columns #198
Conversation
…e names. replaced table validation warning with column merging
... 1 8.0 5.0 2
... 2 6.0 3.0 3
"""
df = pd.DataFrame(
I think this is the desired behaviour, matching what I've understood as VEDA's duplicate column merging logic, but I'm not sure how to test this in VEDA itself, so it'd be great if you guys could confirm. Thanks!
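For reference, a minimal sketch of how such merging could work in pandas, assuming (as I've understood VEDA's behaviour) that for duplicated column names the right-most non-empty value in each row wins. `merge_duplicate_columns` is a hypothetical helper for illustration, not the code in this PR:

```python
import numpy as np
import pandas as pd

def merge_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse identically named columns: per row, keep the
    right-most non-NaN value among the duplicates."""
    merged = {}
    for name in df.columns.unique():
        block = df.loc[:, df.columns == name]
        # ffill along axis=1 propagates values left-to-right, so the
        # last column ends up holding the right-most non-NaN per row
        merged[name] = block.ffill(axis=1).iloc[:, -1]
    return pd.DataFrame(merged)

df = pd.DataFrame(
    [[1.0, np.nan, 4.0], [2.0, 7.0, np.nan]],
    columns=["b", "b", "c"],
)
print(merge_duplicate_columns(df))
# 'b' becomes [1.0, 7.0]: row 0 keeps 1.0 (the duplicate is NaN),
# row 1 takes 7.0 from the later duplicate
```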
Yes, I think that example table looks good as such. But does it cover cases where some of the (original) columns have e.g. bfg, which would render equivalent to b? Would it be recognized as a duplicate column with the other b's?
Thanks for your feedback @Antti-L.
> But does it cover cases where some of the (original) columns have e.g. bfg, which would render equivalent to b? Would it be recognized as a duplicate column with the other b's?

Would you mind elaborating or giving an example of this case? Specifically, how would a column called `bfg` render equivalent to a column called `b`? I'm not sure exactly what you mean by this.
I wrote `b~f~g`, for example an attribute column that has tilde-separated qualifiers (indexes). Not sure why the tildes disappeared in my post... So, is the merging applied to duplicate columns before or after those tildes are processed? (Sorry, I could not see that directly in the changed code; I am not a Python coder.) I would think it should be after the tildes are processed?
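For illustration, a sketch of the tilde convention being discussed: `split_tilde_header` is a hypothetical helper (not xl2times code) that separates the base name from its tilde-separated qualifiers, after which two headers with the same base would look like duplicates:

```python
def split_tilde_header(header: str) -> tuple[str, list[str]]:
    # Split "b~f~g" into the base column name and its qualifiers
    parts = [p.strip() for p in header.split("~")]
    return parts[0], parts[1:]

print(split_tilde_header("b~f~g"))   # ('b', ['f', 'g'])
print(split_tilde_header("b"))       # ('b', [])
```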
Ah right, thanks @Antti-L that makes more sense :)
@olejandro where do the tildes in column names get processed? I'm hoping it's before here...
Actually it is done later on, in `process_flexible_import_tables` and in `process_user_constraint_tables`.
This is not the only issue though. There may also be attribute aliases, which have different names, but same meaning as duplicated columns, so merging the columns too early may affect what gets overwritten.
Also, column renaming is done in `normalize_column_aliases`, which may create duplicates.
My feeling is that we should distinguish 2 cases:
- column names that represent TIMES attributes. These should be converted into rows; rows below overwrite rows above. We do this already in `process_flexible_import_tables` and in `process_user_constraint_tables`.
- column names that represent indices of TIMES attributes. We can handle them as proposed in this PR, but it should happen after `normalize_column_aliases`.
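The first case (attribute columns converted into rows, with later rows winning) roughly corresponds to a pandas melt followed by a keep-last deduplication. A minimal sketch, with made-up attribute names (`ACTBND`, `CAP2ACT`) standing in for TIMES attributes; this is not the actual xl2times implementation:

```python
import pandas as pd

df = pd.DataFrame(
    {"techname": ["t1", "t2"], "ACTBND": [5.0, 6.0], "CAP2ACT": [1.0, 2.0]}
)

# Turn attribute columns into (attribute, value) rows
long = df.melt(id_vars=["techname"], var_name="attribute", value_name="value")
print(long)

# "rows below overwrite rows above": with duplicated keys, keeping the
# last occurrence lets later rows win
deduped = long.drop_duplicates(subset=["techname", "attribute"], keep="last")
```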
Additional info: `config.all_attributes` and `config.attr_aliases` are sets of TIMES attributes and their aliases respectively. Maybe duplicate columns could be checked against these sets and, if not in any of the sets, merged according to the procedure proposed in this PR?
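A sketch of that check, assuming `config.all_attributes` and `config.attr_aliases` behave as plain sets of names; `mergeable_duplicates` is a hypothetical helper, and the column names below are made up:

```python
from collections import Counter

def mergeable_duplicates(columns, all_attributes, attr_aliases):
    """Return names that appear more than once and are neither TIMES
    attributes nor aliases, i.e. candidates for value-merging."""
    counts = Counter(columns)
    reserved = set(all_attributes) | set(attr_aliases)
    return [name for name, n in counts.items() if n > 1 and name not in reserved]

cols = ["techname", "comm-in", "comm-in", "act_bnd", "act_bnd"]
# Pretend 'act_bnd' is a TIMES attribute; 'comm-in' is an index column
print(mergeable_duplicates(cols, {"act_bnd"}, set()))  # ['comm-in']
```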
@@ -588,7 +617,8 @@ def process_user_constraint_table(
     # TODO: apply table.uc_sets

     # Fill in UC_N blank cells with value from above
-    df["uc_n"] = df["uc_n"].ffill()
+    if "uc_n" in df.columns:
+        df["uc_n"] = df["uc_n"].ffill()
I definitely still need this check in here for austimes.
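For context, a minimal sketch of why the guard matters; the column name and data below are made up, but the pattern matches the diff above:

```python
import pandas as pd

# A user-constraint table without a UC_N column, as reportedly
# occurs in the austimes model
df = pd.DataFrame({"uc_rhsrts": [1.0, None, 3.0]})

# Without the guard, df["uc_n"] would raise KeyError on such tables;
# with it, the table passes through untouched
if "uc_n" in df.columns:
    df["uc_n"] = df["uc_n"].ffill()

print(df.columns.tolist())  # ['uc_rhsrts']
```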
Yes, we should handle this elsewhere. Will try to create a PR on this soon.
Thanks :) I wasn't sure of the best place to do this kind of validation.
It is still a bit of trial and error, but we should create a DAG for the processing steps soon.
Code looks good to me, thanks! Not sure about VEDA's behaviour, I'll leave that for one of the others to verify.
Thanks @SamRWest! Could you please share a table header example where you are experiencing a duplicate column error? Based on your example, I will modify one of the benchmarks, so we can reproduce the behaviour.
Sure thing. The offending austimes table was:
And the first 5 rows are (in CSV, so you can parse them easily):
The two
@SamRWest are you sure about closing this PR? This code would be perfect for handling duplicate column names that represent indices of TIMES attributes. Or is the plan to open a new PR?
Oh, I thought your PRs handled that now.
Sure, thanks @SamRWest.
I think this would be the best place to merge duplicate columns (either in the same transform or in a transform right after it). The column names that I believe are safe to merge are in
Added function and unit test to merge values in columns with duplicate names.
Replaced the table validation warning with column merging.
I think this should duplicate VEDA's behaviour as discussed here: #177 (comment). Can you please confirm @Antti-L ?