Problems with punctuation and order #8
Comments
Hi, thanks for bringing this to my attention.
I think this will require a library to address the tricky cases, for example
In the US this is common for names. For example, Gary Dell'Abate uses the
Seems to be the best. I can see this becoming unmanageable, but it seems that we're far away from that. Given your examples and similar existing cases, we should have aliases with non-ASCII apostrophe and
I don't think mdash or endash is appropriate here. Are there other Unicode dashes that should be covered? Is there a use case for the name without a dash? I can see it both ways and don't have an issue having an alias without it.
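On the question of which other Unicode dashes exist: a rough Python sketch (not this library's code) that lists every character in the Unicode "dash punctuation" category (`Pd`) and maps each one to the ASCII hyphen. The name `ascii_dashes` is made up for illustration, and the exact list depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

# Every code point whose Unicode general category is "Pd" (dash punctuation).
# This includes the ASCII hyphen-minus, U+2010 HYPHEN, en dash and em dash.
DASHES = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pd"
]

# Translation table that maps each of those characters to "-".
TO_ASCII_HYPHEN = str.maketrans({dash: "-" for dash in DASHES})


def ascii_dashes(name):
    """Hypothetical helper: replace any Unicode dash in `name` with '-'."""
    return name.translate(TO_ASCII_HYPHEN)


if __name__ == "__main__":
    for dash in DASHES:
        print(f"U+{ord(dash):04X} {unicodedata.name(dash, '?')}")
```

Note that U+2212 MINUS SIGN is a math symbol (`Sm`), not a dash, so it would need separate handling if it turns up in data.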
Here I think it's fine to add an ASCII alias too
Yeah, this should be an alias too.
Check out master for some updates to this. Is
Thanks, will check, maybe I can remove some hacks!

The Carmen gem mentioned in the Readme for this project uses the Debian ISO-3166-1 data as a source: and I notice that data includes "Palestine, State of". I just happen to know where to find that. I've not gone to the ISO standard itself to check, which is presumably the most definitive.

My current use-case is to create a SKOS vocabulary of terms (in RDF) for the International Coop Association's database of members' locations and/or territories. Their data is notionally based on the ISO-3166-1 country code system, but they currently use English language labels instead of IDs in their database, which we need to convert to country codes. Their particular set of labels includes "Palestine, State of" and "Côte d’Ivoire" with the non-ASCII backquote. I'm not sure where these labels come from originally. I would hazard a guess that the backquote may have been automatically inserted by Word or Excel or something similar.
Thanks. At some point I will check that data to make sure it's included.
Note to self: #9 (comment)
I've a dataset which has some problematic names. Specifically:
The en.yml data file contains these relevant entries:

So it's a "close but no cigar" situation in both cases. I'm not sure how to solve this.
I'm wondering if the library should erase punctuation and flatten to ASCII when comparing? This would handle the different choice of apostrophe and any missing or altered accents in Côte D'Ivoire, but perhaps that goes too far. I can't currently think of country names it would break, but that's not to say there aren't any. And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with D'Ivoire (French).
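To make the "erase punctuation and flatten" idea concrete, here is a minimal Python sketch, assuming the comparison is done on a derived key rather than on the stored name. It is not how this library works; `comparison_key` is a hypothetical name. It strips combining accents, turns punctuation into spaces (so hyphenated names keep their word boundaries), and lowercases.

```python
import unicodedata


def comparison_key(name):
    """Hypothetical helper: reduce a country name to a loose comparison key."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks (category "Mn"), e.g. the circumflex in "ô".
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Replace every punctuation character (categories starting with "P")
    # with a space; this covers both ASCII and typographic apostrophes.
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in no_accents
    )
    # Lowercase aggressively and collapse runs of whitespace.
    return " ".join(spaced.casefold().split())


if __name__ == "__main__":
    for raw in ["Côte d’Ivoire", "Cote D'Ivoire", "Åland Islands", "Guinea-Bissau"]:
        print(repr(raw), "->", repr(comparison_key(raw)))
```

Letters that have no Unicode decomposition (for example "ø") pass through unchanged, so this is only an approximation of "flatten to ASCII". It also does nothing for alternative spellings such as Faeroes.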
There are other names with an apostrophe. These are going to be problematic, considering the general populace's facility with using punctuation. Likewise punctuation as in Bosnia-Herzegovina, Guinea-Bissau or accents as in Åland Islands, and just alternative spellings like Faeroes.
Palestine, State of does what some of the other names do, putting the main name first and any qualifiers like "State of" after a comma. But it doesn't match in this case. I think this is harder; removing punctuation is one thing, re-arranging word order is another.
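For the word-order case, one possible approach is to generate the "qualifier first" form as an extra lookup candidate. A small Python sketch under that assumption; `comma_variants` is a hypothetical helper, not part of this library.

```python
def comma_variants(name):
    """Hypothetical helper: return the name plus its 'qualifier first' form.

    "Palestine, State of" yields both the original and "State of Palestine".
    Names without a comma are returned unchanged.
    """
    variants = {name}
    # Split at the first comma only: the main name, then the qualifier.
    head, sep, qualifier = name.partition(",")
    if sep and qualifier.strip():
        variants.add(f"{qualifier.strip()} {head.strip()}")
    return variants


if __name__ == "__main__":
    print(sorted(comma_variants("Palestine, State of")))
    print(sorted(comma_variants("Guinea-Bissau")))
```

This is deliberately naive. For a name where the comma separates list items rather than a qualifier, such as "Bonaire, Sint Eustatius and Saba", it produces a meaningless extra candidate, which is harmless only if candidates are used purely for lookup.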
I see elsewhere in en.yml there are aliases. Perhaps that's a better solution, adding a lot of aliases?
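As a rough illustration of the alias approach, here is a Python sketch with a made-up alias table. The real layout of en.yml is not shown here and may well differ; `ALIASES`, `build_index` and `lookup` are all hypothetical. Every accepted spelling is listed explicitly and matched case-insensitively.

```python
# Hypothetical alias table keyed by ISO 3166-1 alpha-2 code.
ALIASES = {
    "CI": ["Côte d'Ivoire", "Côte d’Ivoire", "Cote d'Ivoire", "Ivory Coast"],
    "PS": ["State of Palestine", "Palestine, State of", "Palestine"],
}


def build_index(aliases):
    """Invert the table into a case-insensitive name -> code mapping."""
    index = {}
    for code, names in aliases.items():
        for name in names:
            index[name.casefold()] = code
    return index


INDEX = build_index(ALIASES)


def lookup(name):
    """Return the country code for a known alias, or None if unknown."""
    return INDEX.get(name.strip().casefold())


if __name__ == "__main__":
    for raw in ["Palestine, State of", "côte d’ivoire", "Atlantis"]:
        print(repr(raw), "->", lookup(raw))
```

The trade-off is that every variant has to be added by hand, which is predictable but grows with each new spelling found in the wild.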