merge duplicate columns #198
Conversation
…e names. replaced table validation warning with column merging
... 1 8.0 5.0 2
... 2 6.0 3.0 3
"""
df = pd.DataFrame(
I think this is the desired behaviour, matching what I've understood as VEDA's duplicate column merging logic, but I'm not sure how to test this in VEDA itself, so it'd be great if you guys could confirm. Thanks!
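For reference, a minimal sketch of how such merging could work in pandas, assuming (as I've understood VEDA's behaviour) that for duplicated column names the right-most non-empty value in each row wins. `merge_duplicate_columns` is a hypothetical helper for illustration, not the code in this PR:

```python
import numpy as np
import pandas as pd

def merge_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse identically named columns: per row, keep the
    right-most non-NaN value among the duplicates."""
    merged = {}
    for name in df.columns.unique():
        block = df.loc[:, df.columns == name]
        # ffill along axis=1 propagates values left-to-right, so the
        # last column ends up holding the right-most non-NaN per row
        merged[name] = block.ffill(axis=1).iloc[:, -1]
    return pd.DataFrame(merged)

df = pd.DataFrame(
    [[1.0, np.nan, 4.0], [2.0, 7.0, np.nan]],
    columns=["b", "b", "c"],
)
print(merge_duplicate_columns(df))
# 'b' becomes [1.0, 7.0]: row 0 keeps 1.0 (the duplicate is NaN),
# row 1 takes 7.0 from the later duplicate
```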
Yes, I think that example table looks good as such. But does it cover cases where some of the (original) columns have e.g. bfg, which would render equivalent to b? Would it be recognized as a duplicate column with the other b's?
Thanks for your feedback @Antti-L.
> But does it cover cases where some of the (original) columns have e.g. bfg, which would render equivalent to b? Would it be recognized as a duplicate column with the other b's?

Would you mind elaborating or giving an example of this case? Specifically, how would a column called `bfg` render equivalent to a column called `b`? I'm not sure exactly what you mean by this.
I wrote `b~f~g`, for example an attribute column that has tilde-separated qualifiers (indexes). Not sure why the tildes disappeared in my post... So, is the merging applied to duplicate columns before or after those tildes are processed? (Sorry, I could not see that directly in the changed code; I am not a Python coder.) I would think it should be after the tildes are processed?
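For illustration, a sketch of the tilde convention being discussed: `split_tilde_header` is a hypothetical helper (not xl2times code) that separates the base name from its tilde-separated qualifiers, after which two headers with the same base would look like duplicates:

```python
def split_tilde_header(header: str) -> tuple[str, list[str]]:
    # Split "b~f~g" into the base column name and its qualifiers
    parts = [p.strip() for p in header.split("~")]
    return parts[0], parts[1:]

print(split_tilde_header("b~f~g"))   # ('b', ['f', 'g'])
print(split_tilde_header("b"))       # ('b', [])
```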
Ah right, thanks @Antti-L that makes more sense :)
@olejandro where do the tildes in column names get processed? I'm hoping it's before here...
Actually it is done later on, in `process_flexible_import_tables` and in `process_user_constraint_tables`.
This is not the only issue though. There may also be attribute aliases, which have different names, but same meaning as duplicated columns, so merging the columns too early may affect what gets overwritten.
Also, column renaming is done in `normalize_column_aliases`, which may create duplicates.
My feeling is that we should distinguish 2 cases:
- column names that represent TIMES attributes. These should be converted into rows; rows below overwrite rows above. We do this already in `process_flexible_import_tables` and in `process_user_constraint_tables`.
- column names that represent indices of TIMES attributes. We can handle them as proposed in this PR, but it should happen after `normalize_column_aliases`.
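The first case (attribute columns converted into rows, with later rows winning) roughly corresponds to a pandas melt followed by a keep-last deduplication. A minimal sketch, with made-up attribute names (`ACTBND`, `CAP2ACT`) standing in for TIMES attributes; this is not the actual xl2times implementation:

```python
import pandas as pd

df = pd.DataFrame(
    {"techname": ["t1", "t2"], "ACTBND": [5.0, 6.0], "CAP2ACT": [1.0, 2.0]}
)

# Turn attribute columns into (attribute, value) rows
long = df.melt(id_vars=["techname"], var_name="attribute", value_name="value")
print(long)

# "rows below overwrite rows above": with duplicated keys, keeping the
# last occurrence lets later rows win
deduped = long.drop_duplicates(subset=["techname", "attribute"], keep="last")
```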
Additional info: `config.all_attributes` and `config.attr_aliases` are sets of TIMES attributes and their aliases respectively. Maybe duplicate columns could be checked against these sets and, if not in any of the sets, merged according to the procedure proposed in this PR?
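A sketch of that check, assuming `config.all_attributes` and `config.attr_aliases` behave as plain sets of names; `mergeable_duplicates` is a hypothetical helper, and the column names below are made up:

```python
from collections import Counter

def mergeable_duplicates(columns, all_attributes, attr_aliases):
    """Return names that appear more than once and are neither TIMES
    attributes nor aliases, i.e. candidates for value-merging."""
    counts = Counter(columns)
    reserved = set(all_attributes) | set(attr_aliases)
    return [name for name, n in counts.items() if n > 1 and name not in reserved]

cols = ["techname", "comm-in", "comm-in", "act_bnd", "act_bnd"]
# Pretend 'act_bnd' is a TIMES attribute; 'comm-in' is an index column
print(mergeable_duplicates(cols, {"act_bnd"}, set()))  # ['comm-in']
```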
@@ -588,7 +617,8 @@ def process_user_constraint_table(
     # TODO: apply table.uc_sets

     # Fill in UC_N blank cells with value from above
-    df["uc_n"] = df["uc_n"].ffill()
+    if "uc_n" in df.columns:
+        df["uc_n"] = df["uc_n"].ffill()
I definitely still need this check in here for austimes.
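For context, a minimal sketch of why the guard matters; the column name and data below are made up, but the pattern matches the diff above:

```python
import pandas as pd

# A user-constraint table without a UC_N column, as reportedly
# occurs in the austimes model
df = pd.DataFrame({"uc_rhsrts": [1.0, None, 3.0]})

# Without the guard, df["uc_n"] would raise KeyError on such tables;
# with it, the table passes through untouched
if "uc_n" in df.columns:
    df["uc_n"] = df["uc_n"].ffill()

print(df.columns.tolist())  # ['uc_rhsrts']
```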
Yes, we should handle this elsewhere. Will try to create a PR on this soon.
Thanks :) I wasn't sure of the best place to do this kind of validation.
It is still a bit of trial and error, but we should create a DAG for the processing steps soon.
Code looks good to me, thanks! Not sure about VEDA's behaviour, I'll leave that for one of the others to verify.
Thanks @SamRWest! Could you please share a table header example where you are experiencing a duplicate column error? Based on your example, I will modify one of the benchmarks, so we can reproduce the behaviour.
Sure thing. The offending austimes table was:
And the first 5 rows are (in CSV, so you can parse them easily):
The two
@SamRWest are you sure about closing this PR? This code would be perfect for handling duplicate column names that represent indices of TIMES attributes. Or is the plan to open a new PR?
Oh, I thought your PRs handled that now.
Sure, thanks @SamRWest.
I think this would be the best place to merge duplicate columns (either in the same transform or in a transform right after it). The column names that I believe are safe to merge are in
Added function and unit test to merge values in columns with duplicate names.
Replaced the table validation warning with column merging.
I think this should duplicate VEDA's behaviour as discussed here: #177 (comment). Can you please confirm @Antti-L ?