Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capitalise all string values in tables #259

Merged
merged 6 commits into from
Dec 23, 2024
Merged

Conversation

olejandro
Copy link
Member

No description provided.

@olejandro olejandro marked this pull request as ready for review December 20, 2024 15:26
@olejandro
Copy link
Member Author

@siddharth-krishna what do you think? This simple change gives a very good improvement on TIM!
I'll try to look into why GAMS fails on Demos 09-12.

@olejandro
Copy link
Member Author

This PR would make the names of processes, commodities and user constraints look different in the output. Should we consider "undoing" the changes after processing in the future? E.g. we could use the names as they are before capitalising to create a dictionary which we could later use to undo the change in the processed data.

Copy link
Collaborator

@siddharth-krishna siddharth-krishna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Olex. To make sure I understand correctly: this PR increases accuracy and reduces additional rows because the input data is using differently cased versions of the same names/strings, and so some transforms are not matching what they should?

Because if it was just a matter of our output having a different casing compared to the ground truth, I'd have thought #255 would suffice.

If my theory is correct, then it makes sense to normalise the casing early on like we do here. And I suppose we'd want to preserve the casing if the output was using in some visualisation tool like Miro? If so, your approach makes sense, we could have an optional flag to do that, but the question is which option to pick if the input uses more than one variant: e.g. twowords, Twowords, TwoWords?

colnames = ["attribute", "tact", "tcap", "unit", "sourcescen"]

def capitalise_attributes_table(table: EmbeddedXlTable):
def capitalise_table_entries(table: EmbeddedXlTable):
df = table.dataframe.copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need to copy the data frame since we modify it inplace below

Comment on lines +1206 to +1207
df.loc[i, seen_col] = df[seen_col][i].str.upper()
df.loc[i, seen_col] = df[seen_col][i].str.strip()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df.loc[i, seen_col] = df[seen_col][i].str.upper()
df.loc[i, seen_col] = df[seen_col][i].str.strip()
df.loc[i, seen_col] = df[seen_col][i].str.upper().str.strip()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks! Sorry, didn't see these while merging from the mobile app. Will add them to one of the open PRs!

@olejandro
Copy link
Member Author

Thanks @siddharth-krishna! Yes, that's correct.
To undo the changes, we could pick the variant from the process/commodity declaration tables. If there are duplicates, we could keep the last one.

@olejandro olejandro merged commit 8a46fda into main Dec 23, 2024
2 checks passed
@olejandro olejandro deleted the olex/capitalise-entries branch December 23, 2024 12:37
olejandro added a commit that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants