Problems with punctuation and order #8

Open
wu-lee opened this issue Mar 1, 2021 · 5 comments

Comments

@wu-lee

wu-lee commented Mar 1, 2021

I've a dataset which has some problematic names. Specifically:

  • Palestine, State of
  • Côte d’Ivoire

The en.yml data file contains these relevant entries:

PS:
  aliases:
  - Palestinian Territories
  - Palestinian Territory
  alpha2: PS
  alpha3: PSE
  fifa: PLE
  ioc: PLE
  iso_name: Palestinian Territory, Occupied
  numeric: "275"
  official: State of Palestine
  short: Palestine
  emoji: "\U0001F1F5\U0001F1F8"
  shortcode: ":flag-ps:"

CI:
  alpha2: CI
  alpha3: CIV
  fifa: CIV
  ioc: CIV
  iso_name: Côte D'Ivoire
  numeric: "384"
  official: Republic of Côte D'Ivoire
  short: Ivory Coast
  emoji: "\U0001F1E8\U0001F1EE"
  shortcode: ":flag-ci:"

So it's a "close but no cigar" situation in both cases. I'm not sure how to solve this.

I'm wondering if the library should erase punctuation and flatten to ASCII when comparing? This would handle the different choice of apostrophe and any missing/altered accents in Côte D'Ivoire, but perhaps that goes too far. I can't currently think of country names it would break, but that's not to say there aren't any. And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with D'Ivoire (French).
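For illustration, here is a minimal sketch of what "erase punctuation and flatten to ASCII" could look like using only Ruby's standard library. The name `comparison_key` is hypothetical and not part of this gem's API, and the approach assumes UTF-8 input.

```ruby
# Hypothetical helper, not part of the gem: builds a key used only for comparison.
def comparison_key(name)
  name
    .unicode_normalize(:nfd)      # split "ô" into "o" plus a combining circumflex
    .gsub(/\p{Mn}/, "")           # drop the combining marks
    .gsub(/[^\p{Alnum}\s]/, " ")  # replace punctuation (any apostrophe, comma, hyphen) with a space
    .downcase
    .split
    .join(" ")                    # collapse runs of whitespace
end

comparison_key("Côte d’Ivoire")  # => "cote d ivoire"
comparison_key("Côte D'Ivoire")  # => "cote d ivoire"
```

Both spellings collapse to the same key. Letters with no canonical decomposition, such as ß, pass through unchanged, so this alone does not guarantee ASCII output.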

There are other names with an apostrophe. These are going to be problematic, considering the general populace's facility with using punctuation. Likewise punctuation as in Bosnia-Herzegovina, Guinea-Bissau or accents as in Åland Islands, and just alternative spellings like Faeroes.

Palestine, State of does what some of the other names do, putting the main name first and any qualifiers like "State of" after a comma. But it doesn't match in this case. I think this is harder; removing punctuation is one thing, re-arranging word order is another.
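As a rough sketch of the re-arranging step, the comma-inverted form could be turned back into natural word order before lookup. `uninvert` is a hypothetical helper, not something the gem provides.

```ruby
# Hypothetical helper: "Palestine, State of" -> "State of Palestine".
# Names without a comma are returned unchanged.
def uninvert(name)
  head, qualifier = name.split(/,\s*/, 2)
  return name if qualifier.nil? || qualifier.empty?

  "#{qualifier} #{head}"
end

uninvert("Palestine, State of")  # => "State of Palestine"
uninvert("Ivory Coast")          # => "Ivory Coast"
```

This is only safe as a second attempt after a direct match fails, because not every comma marks an inversion: "Bonaire, Sint Eustatius and Saba" would come out scrambled.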

I see elsewhere in en.yml there are aliases. Perhaps that's a better solution, adding a lot of aliases?

@sshaw
Owner

sshaw commented Mar 3, 2021

Hi, thanks for bringing this to my attention.

> I'm wondering if the library should erase punctuation and flatten to ASCII when comparing?

I think this will require a library to address the tricky cases, for example ß to ss. iconv can do this, but one nice thing about this gem is that it has no dependencies.
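To make the tricky cases concrete: Unicode decomposition removes accents, but characters such as ß, Æ and Ø have no decomposition and need an explicit table. The sketch below stays dependency-free; `flatten_accents` and its replacement table are illustrative assumptions and far from complete.

```ruby
# Hypothetical helper, not part of the gem.
def flatten_accents(name)
  # Characters with no canonical decomposition need explicit replacements.
  # This table is illustrative only.
  special_cases = { "ß" => "ss", "Æ" => "AE", "æ" => "ae", "Ø" => "O", "ø" => "o" }

  name
    .gsub(Regexp.union(special_cases.keys), special_cases)
    .unicode_normalize(:nfd)  # "ö" becomes "o" plus a combining diaeresis
    .gsub(/\p{Mn}/, "")       # drop the combining marks
end

flatten_accents("Straße")     # => "Strasse"
flatten_accents("São Paulo")  # => "Sao Paulo"
```

Anything missing from the table is left as-is, so the output is not guaranteed to be pure ASCII.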

> And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with D'Ivoire (French).

In the US this is common for names. For example, Gary Dell'Abate uses the ' (Italian) or Pedro Muñoz uses the ñ (Spanish). I also see US papers using São Paulo, Malmö, etc...

> I see elsewhere in en.yml there are aliases. Perhaps that's a better solution, adding a lot of aliases?

That seems to be the best option. I can see it becoming unmanageable, but it seems we're far away from that.

Given your examples and similar existing cases, we should have aliases with the non-ASCII apostrophe and Palestine, State of variants. But one question here: is this name part of a standard somewhere? I'm not sure how to apply this to the others. We have some already but not others. For example: State of Israel but not Israel, State of.

> Likewise punctuation as in Bosnia-Herzegovina, Guinea-Bissau

I don't think an em dash or en dash is appropriate here. Are there other Unicode dashes that should be covered?

Is there a use case for the name without a dash? I can see it both ways and don't have an issue having an alias without it.
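One way to explore the dash question without listing code points by hand is the Unicode "Dash_Punctuation" category, which Ruby regular expressions expose as `\p{Pd}`. This is only a sketch, and `normalize_dashes` is a hypothetical helper.

```ruby
# Hypothetical helper: map every character in Unicode category Pd
# (hyphen U+2010, non-breaking hyphen U+2011, en dash, em dash, ...) to "-".
def normalize_dashes(name)
  name.gsub(/\p{Pd}/, "-")
end

normalize_dashes("Guinea\u2010Bissau")        # => "Guinea-Bissau"
normalize_dashes("Guinea-Bissau").tr("-", " ") # => "Guinea Bissau" (dash-less variant)
```

If em and en dashes should stay unmatched, an explicit character class such as `/[\u2010\u2011]/` could replace `\p{Pd}`. The minus sign (U+2212) is a math symbol and is not in `Pd`.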

> or accents as in Åland Islands

Here I think it's fine to add an ASCII alias too.

> and just alternative spellings like Faeroes.

Yeah this should be an alias too.
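If the alias list grows, the apostrophe and ASCII variants could be generated rather than typed by hand. The sketch below is one possible approach; `alias_variants` is a hypothetical helper and not how the gem's data files are actually maintained.

```ruby
# Hypothetical helper: derive extra spellings of a name to add as aliases.
def alias_variants(name)
  # Accent-free spelling, e.g. "Cote D'Ivoire".
  ascii = name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")

  # For each spelling, also produce the curly (U+2019) and straight apostrophe forms.
  variants = [name, ascii].flat_map do |n|
    [n, n.tr("'", "\u2019"), n.tr("\u2019", "'")]
  end

  variants.uniq - [name]
end

alias_variants("Côte D'Ivoire")
# => ["Côte D’Ivoire", "Cote D'Ivoire", "Cote D’Ivoire"]
```

Alternative spellings like Faeroes cannot be derived this way and would still need to be listed explicitly.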

sshaw added 3 commits that referenced this issue Mar 16, 2021
@sshaw
Owner

sshaw commented Mar 16, 2021

Check out master for some updates to this.

Is Palestine, State of part of a standard somewhere?

@wu-lee
Author

wu-lee commented Mar 16, 2021

Thanks, will check, maybe I can remove some hacks!

The Carmen gem mentioned in the Readme for this project uses the Debian ISO-3166-1 data as a source: and I notice that data includes "Palestine, State of". I just happen to know where to find that - I've not gone to the ISO standard itself to check, which is presumably the most definitive.

My current use case is to create a SKOS vocabulary of terms (in RDF) for the International Coop Association's database of members' locations and/or territories. Their data is notionally based on the ISO-3166-1 country code system, but they currently use English-language labels instead of IDs in their database, which we need to convert to country codes. Their particular set of labels includes "Palestine, State of" and "Côte d’Ivoire" with the non-ASCII apostrophe (’). I'm not sure where these labels come from originally. I would hazard a guess that the curly apostrophe may have been automatically inserted by Word or Excel or something similar.

@sshaw
Owner

sshaw commented Mar 20, 2021

> The Carmen gem mentioned in the Readme for this project uses the Debian ISO-3166-1 data as a source: and I notice that data includes "Palestine, State of".

Thanks. At some point I will check that data to make sure it's included.

@sshaw
Owner

sshaw commented Mar 25, 2021

Note to self: #9 (comment)
