-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Capitalise all string values in tables #259
Conversation
@siddharth-krishna what do you think? This simple change gives a very good improvement on TIM! |
This PR would make the names of processes, commodities and user constraints look different in the output. Should we consider "undoing" the changes after processing in the future? E.g. we could use the names as they are before capitalising to create a dictionary which we could later use to undo the change in the processed data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Olex. To make sure I understand correctly: this PR increases accuracy and reduces additional rows because the input data is using differently cased versions of the same names/strings, and so some transforms are not matching what they should?
Because if it was just a matter of our output having a different casing compared to the ground truth, I'd have thought #255 would suffice.
If my theory is correct, then it makes sense to normalise the casing early on like we do here. And I suppose we'd want to preserve the casing if the output was using in some visualisation tool like Miro? If so, your approach makes sense, we could have an optional flag to do that, but the question is which option to pick if the input uses more than one variant: e.g. twowords, Twowords, TwoWords
?
colnames = ["attribute", "tact", "tcap", "unit", "sourcescen"] | ||
|
||
def capitalise_attributes_table(table: EmbeddedXlTable): | ||
def capitalise_table_entries(table: EmbeddedXlTable): | ||
df = table.dataframe.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need to copy the data frame since we modify it inplace below
df.loc[i, seen_col] = df[seen_col][i].str.upper() | ||
df.loc[i, seen_col] = df[seen_col][i].str.strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df.loc[i, seen_col] = df[seen_col][i].str.upper() | |
df.loc[i, seen_col] = df[seen_col][i].str.strip() | |
df.loc[i, seen_col] = df[seen_col][i].str.upper().str.strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks! Sorry, didn't see these while merging from the mobile app. Will add them to one of the open PRs!
Thanks @siddharth-krishna! Yes, that's correct. |
No description provided.