
merge duplicate columns #198

Closed

Conversation

@SamRWest (Collaborator) commented Feb 27, 2024

Added a function and unit test to merge values in columns with duplicate names.
Replaced the table validation warning with column merging.

I think this should duplicate VEDA's behaviour as discussed here: #177 (comment). Can you please confirm, @Antti-L?
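For reference, a minimal sketch of what such a merge could look like, assuming right-most non-empty values win; the function name and the precedence rule here are assumptions based on the discussion below, not necessarily the exact code in this PR:

import pandas as pd

def merge_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse columns that share a name, letting later (right-most)
    non-empty values overwrite earlier ones."""
    out = {}
    for name in df.columns.unique():
        cols = df.loc[:, df.columns == name]
        merged = cols.iloc[:, 0]
        for i in range(1, cols.shape[1]):
            # combine_first keeps the caller's non-null values, so each
            # later duplicate wins wherever it is non-null
            merged = cols.iloc[:, i].combine_first(merged)
        out[name] = merged
    return pd.DataFrame(out, index=df.index)

For example, a frame with columns ['a', 'b', 'b'] would collapse to ['a', 'b'], with the second b filling in wherever it has values.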

(Review comments on the docstring example in the new duplicate-column merge function:)
@SamRWest (Collaborator, Author):
I think this is the desired behaviour, matching what I've understood as VEDA's duplicate column merging logic, but I'm not sure how to test this in VEDA itself, so it'd be great if you guys could confirm. Thanks!

@Antti-L:

Yes, I think that example table looks good as such. But does it cover cases where some of the (original) columns have e.g. bfg, which would render equivalent to b? Would it be recognized as a duplicate column with the other b's?

@SamRWest (Collaborator, Author) commented Feb 27, 2024:

Thanks for your feedback @Antti-L.

> But does it cover cases where some of the (original) columns have e.g. bfg, which would render equivalent to b? Would it be recognized as a duplicate column with the other b's?

Would you mind elaborating or giving an example of this case? Specifically, how would a column called bfg render equivalent to a column called b? I'm not sure exactly what you mean by this.

@Antti-L:

I wrote b~f~g, i.e. an attribute column that has tilde-separated qualifiers (indexes). Not sure why the tildes disappeared in my post... So, is the merging applied to duplicate columns before or after processing those tildes? (Sorry, I could not see that directly in the changed code; I am not a Python coder.) I would think it should be after the processing of the tildes.
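For readers unfamiliar with the convention, a hypothetical illustration of what such a header encodes; the split below sketches the general idea, not the tool's actual parsing code:

# A header like "af~lo" carries an attribute name plus tilde-separated
# qualifiers (here, a limit type)
header = "af~lo"
attribute, *qualifiers = header.split("~")
# attribute == "af", qualifiers == ["lo"]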

@SamRWest (Collaborator, Author):

Ah right, thanks @Antti-L that makes more sense :)
@olejandro where do the tildes in column names get processed? I'm hoping it's before here...

@olejandro (Member):

Actually it is done later on, in process_flexible_import_tables and in process_user_constraint_tables.

This is not the only issue though. There may also be attribute aliases, which have different names but the same meaning as the duplicated columns, so merging the columns too early may affect what gets overwritten.

Also, column renaming is done in normalize_column_aliases, which may create duplicates.

My feeling is that we should distinguish 2 cases:

  • column names that represent TIMES attributes. These should be converted into rows; rows below overwrite rows above. We do this already in process_flexible_import_tables and in process_user_constraint_tables.
  • column names that represent indices of TIMES attributes. We can handle them as proposed in this PR, but it should happen after normalize_column_aliases.

Additional info: config.all_attributes and config.attr_aliases are sets of TIMES attributes and their aliases respectively. Maybe duplicate columns could be checked against these sets and, if not in any of the sets, merged according to the procedure proposed in this PR?
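A minimal sketch of that suggested guard, assuming only the set names given above; the function itself is hypothetical, not code from this PR:

def is_mergeable(name: str, config) -> bool:
    """True if a duplicated column is safe to merge: it is neither a
    TIMES attribute nor an alias of one (those are converted to rows
    instead)."""
    return name not in config.all_attributes and name not in config.attr_aliases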

@olejandro (Member):

@SamRWest I've opened #203 that handles column names representing TIMES attributes. Any chance you could test whether it works for AusTIMES?

@@ -588,7 +617,8 @@ def process_user_constraint_table(
     # TODO: apply table.uc_sets

     # Fill in UC_N blank cells with value from above
-    df["uc_n"] = df["uc_n"].ffill()
+    if "uc_n" in df.columns:
+        df["uc_n"] = df["uc_n"].ffill()
@SamRWest (Collaborator, Author):

I definitely still need this check in here for austimes.

@olejandro (Member):

Yes, we should handle this elsewhere. Will try to create a PR on this soon.

@olejandro (Member):

@SamRWest I've opened #200 to address this. Any chance you could test whether it resolves the issue with AusTIMES?

@SamRWest (Collaborator, Author):

Thanks :) I wasn't sure of the best place to do this kind of validation.

@olejandro (Member):

It is still a bit of trial and error, but we should create a DAG for the processing steps soon.

@SamRWest marked this pull request as ready for review February 27, 2024 01:16
@siddharth-krishna (Collaborator) left a comment:

Code looks good to me, thanks! Not sure about VEDA's behaviour; I'll leave that for one of the others to verify.

@olejandro (Member):

Thanks @SamRWest! Could you please share a table header example where you are experiencing a duplicate column error?
We already handle some of this through conversion to rows when a column contains GAMS parameter values. I believe we currently don't handle duplicate columns that contain GAMS parameter indices.

Based on your example, I will modify one of the benchmarks, so we can reproduce the behaviour.

@SamRWest (Collaborator, Author):

> Thanks @SamRWest! Could you please share a table header example where you are experiencing a duplicate column error? We already handle some of this through conversion to rows when a column contains GAMS parameter values. I believe we currently don't handle duplicate columns that contain GAMS parameter indices.
>
> Based on your example, I will modify one of the benchmarks, so we can reproduce the behaviour.

Sure thing. The offending austimes table was:

EmbeddedXlTable(tag='~FI_T', uc_sets={}, sheetname='Batteries', range='AE3:BA30', filename='austimes-lfs\\VT_AUS_ELC.xlsx')

And the first 5 rows are (in CSV, so you can parse them easily):

,techname,timeslice,comm-in,year,comm-out,comm-out-a,commgrp,start,pasti,life,fixom,afa~lo,afa,fixom,varom,cap2act,flo_func~auxsto,stg_eff,afc,flo_eff,af~fx
0,EE_Battery004,DAYNITE,ELC,2022,ELC,AuxSto,ACT,2022,0.05,15,10.688513986014,0.015625,0.96,10.688513986014,0,8.76,,0.85,0.0625,,
1,EE_Battery005,DAYNITE,ELC,2022,ELC,AuxSto,ACT,2022,0.1,30,10.688513986014,0.015625,0.96,10.688513986014,0,8.76,,0.85,0.0625,,
2,EE_Battery006,DAYNITE,ELC,2024,ELC,AuxSto,ACT,2024,0.00627,20,10.688513986014,0.0215975544922913,0.96,10.688513986014,0,8.76,,0.85,0.0863902179691654,,
3,EE_Battery007,DAYNITE,ELC,2024,ELC,AuxSto,ACT,2024,0.002,20,10.688513986014,0.0260416666666667,0.96,10.688513986014,0,8.76,,0.85,0.104166666666667,,
4,EE_Battery008,DAYNITE,ELC,2024,ELC,AuxSto,ACT,2024,0.002,20,10.688513986014,0.0208333333333333,0.96,10.688513986014,0,8.76,,0.85,0.0833333333333333,,

The two fixom columns are actually identical in this table, so this is probably human error rather than intentional. Still a corner case that we need to handle, though.
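As a quick illustration (a hypothetical snippet, not code from the PR), pandas can flag the duplicated header directly:

import pandas as pd

# Header row of the table above; only 'fixom' appears twice
header = ("techname,timeslice,comm-in,year,comm-out,comm-out-a,commgrp,start,"
          "pasti,life,fixom,afa~lo,afa,fixom,varom,cap2act,flo_func~auxsto,"
          "stg_eff,afc,flo_eff,af~fx").split(",")
cols = pd.Index(header)
print(cols[cols.duplicated()].unique().tolist())  # ['fixom']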

@SamRWest (Collaborator, Author):

Superseded by #201 and #203

@SamRWest closed this Feb 29, 2024
@olejandro (Member):

@SamRWest are you sure about closing this PR? This code would be perfect for handling duplicate column names that represent indices of TIMES attributes. Or is the plan to open a new PR?

@SamRWest (Collaborator, Author):

> @SamRWest are you sure about closing this PR? This code would be perfect for handling duplicate column names that represent indices of TIMES attributes. Or is the plan to open a new PR?

Oh, I thought your PRs handled that now.
I'm a bit lost about where, and which columns, this would handle. If you can suggest which transform this should go into, I'd be happy to reopen it.

@olejandro (Member):

Sure, thanks @SamRWest.

normalize_column_aliases handles renaming of columns and would raise an error if a table includes duplicate column names after normalisation.

I think this would be the best place to merge duplicate columns (either in the same transform or in a transform right after it). The column names that I believe are safe to merge are in config.known_columns; they are tag-specific.
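A hedged sketch of that approach, assuming config.known_columns maps each tag to a set of safe column names (the function is hypothetical and reuses the merge sketch from the PR description at the top of this thread):

def merge_known_duplicates(df, tag, config):
    """Merge duplicate columns only if every duplicated name is a known
    column for this tag; any other duplicate still indicates an error."""
    safe = set(config.known_columns.get(tag, ()))
    dupes = df.columns[df.columns.duplicated()].unique()
    unsafe = [name for name in dupes if name not in safe]
    if unsafe:
        raise ValueError(f"Cannot merge duplicate columns: {unsafe}")
    return merge_duplicate_columns(df)  # merge sketch from the PR description above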
