More aliases (for Cameroon, Sudan, Brazil, Congo) #9
@@ -461,7 +461,11 @@ CD:
- Congo-Kinshasa
- DRC
- DR Congo
- Congo, The Democratic Republic Of The
- Congo, Democratic Republic of
We should keep "Congo, The Democratic Republic Of The". It's valid: https://www.iso.org/obp/ui/#iso:code:3166:CD.
The version with parentheses should be added too, but this looks like a broader task for another day.
I think I meant to add the extra alias, and not delete the existing one. Will fix.
@@ -2394,6 +2407,7 @@ SR:
SS:
aliases:
- S. Sudan
- South Sudan, Republic of
Do you think we may as well add "Republic of South Sudan" and "Republic of S. Sudan", and similar entries for the others? In English, "Republic of South Sudan" is how it would be formally written.
Sure. Although "S. Sudan" is a bit of a weird one, adding an extra permutation: abbreviations of "north", "south", etc. An "S." variation is included here and for a few other cases, but not for South Africa, South Georgia, North Macedonia etc.
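The permutations discussed here could, in principle, be generated rather than typed out by hand. The following is only a rough sketch: `alias_permutations` and `DIRECTIONS` are hypothetical names, not part of the gem, and the "Republic of" forms only make sense for countries that are actually republics, so the output would still need a manual review.

```ruby
# Hypothetical helper, not part of the gem: expands one name into
# directional-abbreviation and "Republic of" permutations.
DIRECTIONS = {
  "North" => "N.",
  "South" => "S.",
  "East"  => "E.",
  "West"  => "W."
}.freeze

def alias_permutations(name)
  variants = [name]

  # "South Sudan" also yields "S. Sudan"; names without a leading
  # direction word are left alone.
  DIRECTIONS.each do |word, abbr|
    next unless name.start_with?("#{word} ")
    variants << name.sub(/\A#{word} /, "#{abbr} ")
  end

  # For each variant, add the formal and the inverted "Republic of" forms.
  variants.flat_map { |v| [v, "Republic of #{v}", "#{v}, Republic of"] }.uniq
end

puts alias_permutations("South Sudan")
```

Applied to "South Sudan" this yields six strings, including "Republic of S. Sudan" and "South Sudan, Republic of". It does nothing to decide which countries should get an "S." form at all, which is the inconsistency noted above.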
Cool, thanks for this. Aside from the inline comments: Brasil, République démocratique du Congo, etc. These are Portuguese and French respectively but this is
There was also #5
I notice that "Côte D'Ivoire" is not English, and in fact many of the names contain other languages? For example, "São Tomé and Príncipe", "Sint Maarten", "Timor-Leste". There are surprises already!
I'd probably argue that for a normalisation tool to be most effective, it needs to recognise all sorts of weirdness? So the "input" labels can't be categorised as any particular language, and in fact could be a mixture, like "Republic of Côte D'Ivoire", "Sint Maarten (Dutch Part)" etc. Outputs are another matter. The aliases seem mostly to be there for recognition, and the canonical names and identifiers are the outputs. [edit] In my use case, I've been using this gem for normalising into an ISO code, and Carmen for generating names in various languages.
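To make the input/output split concrete, here is a minimal sketch of aliases used purely for recognition, with the ISO code as the only output. The `COUNTRIES` hash is a simplified, illustrative stand-in for the YAML data (the "CI" entries come from this discussion, not necessarily from the file), and `to_iso_code` is a hypothetical name rather than the gem's API.

```ruby
# Illustrative data only: ISO code => recognised input labels.
# Inputs may mix languages; the output is always the code.
COUNTRIES = {
  "CD" => ["Congo-Kinshasa", "DRC", "DR Congo",
           "Congo, The Democratic Republic Of The"],
  "SS" => ["S. Sudan", "South Sudan, Republic of"],
  "CI" => ["Côte D'Ivoire", "Republic of Côte D'Ivoire"]
}.freeze

# Build a case-insensitive index from every alias to its code.
ALIAS_INDEX = COUNTRIES.each_with_object({}) do |(code, aliases), index|
  aliases.each { |name| index[name.downcase] = code }
end.freeze

# Hypothetical lookup: returns the ISO code, or nil if unrecognised.
def to_iso_code(input)
  ALIAS_INDEX[input.strip.downcase]
end

puts to_iso_code("  dr congo ")
puts to_iso_code("Republic of Côte D'Ivoire")
```

With this shape, adding a French or Portuguese alias only widens what is recognised; it never changes what comes out.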
The difference seems to be as discussed in #8: names are often written as they are in their native language. "Leste" may be Portuguese but "The Democratic Republic of" is not. "République démocratique du Congo" is all in French. In "Republic of Côte D'Ivoire", only "Côte D'Ivoire" is French. This is the distinction. Maybe there are exceptions. "Curaçao" maybe, since this is Portuguese (amongst others), but in Papiamento it's "Kòrsou".
Given this code/conversation I think there are some ways to normalize (in general not in the PR):
To me this is English with acceptable variants (aliases) of "Cote D'Ivoire" etc...
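One possible code-level way to treat "Cote D'Ivoire" and "Côte D'Ivoire" as the same input is to fold diacritics before comparing, instead of listing both spellings as aliases. This is only a sketch of that idea, using Ruby's built-in `String#unicode_normalize`; `fold` is a hypothetical helper name and not something the gem is known to provide.

```ruby
# Hypothetical helper: folds diacritics and case so that accented and
# unaccented spellings of a name compare equal.
def fold(name)
  # NFD splits "ô" into "o" plus a combining accent mark.
  decomposed = name.unicode_normalize(:nfd)
  # \p{Mn} matches the combining marks, which are then dropped.
  decomposed.gsub(/\p{Mn}/, "").downcase
end

puts fold("Côte D'Ivoire") == fold("Cote d'Ivoire")
puts fold("São Tomé and Príncipe")
```

This only covers accents and letter case. Other differences, such as a typographic apostrophe versus a straight one, would need their own handling.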
90433dd
to
afd20a4
Compare
I've re-rolled this PR in the light of the above, see what you think now.
Great, thanks. This also gave me some things to think about and some ideas for code-level normalization.
- Congo, Democratic Republic of
- Democratic Republic of Congo
- DR Congo
- RD Congo
French! Is this intentional? 😅
Mouais? (Roughly: "yeah, sort of?")
Thanks, feel free to tweak things for consistency. Having said that, I do need that damn "Cameroun", which might be French... Possibly helpful for you or others: I discover that the
You could always just use your own YAML file. This was my original idea; I just never coded it in, but doing it would be trivial. Maybe less than trivial if we just go with an env var at first.
Yes, I see it's fairly compact. Will consider this, thanks.
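As a rough illustration of the env var approach, the loader could fall back to a bundled file unless a variable points elsewhere. Everything named here is an assumption: `COUNTRY_ALIASES_FILE`, `DEFAULT_ALIASES_FILE`, `aliases_file` and `load_aliases` are invented for the sketch and are not existing gem settings.

```ruby
require "yaml"

# Hypothetical default; stands in for whatever file the gem ships with.
DEFAULT_ALIASES_FILE = "countries.yml"

# COUNTRY_ALIASES_FILE is an invented variable name. The env argument
# defaults to the real environment but accepts a plain Hash for testing.
def aliases_file(env = ENV)
  env.fetch("COUNTRY_ALIASES_FILE", DEFAULT_ALIASES_FILE)
end

# Parse the chosen file into a Hash keyed by ISO code.
def load_aliases(path = aliases_file)
  YAML.safe_load(File.read(path))
end
```

Someone who needs "Cameroun" could then keep a private copy of the YAML with that alias added and point the variable at it, without the upstream file having to take a position on French names.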
Mainly inspired by an encounter with these in the wild:
However, I also added some extra permutations/variations that seem valid, having consulted Wikipedia.