Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Update collation #168

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

[WIP] Update collation #168

wants to merge 4 commits into from

Conversation

camertron
Copy link
Collaborator

This PR is meant to address updating our current collation implementation to CLDR v26 and ICU 54.1.1. At the moment, there are several hurdles that need to be overcome:

  1. As @KL-7 mentioned in this issue, as of CLDR v24, the data format for collation rules has changed. Instead of a series of XML tags, the rules are now composed of a sequence of significant characters, one rule per line. I've already addressed this in the PR, i.e. the tailoring importer now parses and correctly interprets these lines.
  2. As of CLDR v22, collation test data is no longer published. Our collation implementation relies on these test data for validation, so we'll have to find some alternative way of doing this. I've tried to address the issue by using ICU to re-sort the test data in v21 before using it to validate our implementation, but I got quite a few test failures. Speaking of which...
  3. It looks like @KL-7 marked a bunch of tailoring tests as "pending" in the test files, which is done by prefixing the test case with two forward slashes, eg. "//". I couldn't figure out how he determined which tests to mark as pending, as there wasn't any corresponding importer. I've written an importer that grabs the (old) test data from CLDR v21 and re-sorts it using ICU. I would really appreciate some help figuring out which of these new tests to mark as pending. It's surprising to me that any tests fail, since my test importer uses ICU to re-sort all the test cases. Isn't our collator doing the same thing as ICU?

With this branch checked out, if you run bundle exec rake clean_vendored update:tailoring_data update:collation_tries update:tailoring_tests and then run FULL_SPEC=true bundle exec rspec spec/collation/tailoring_spec.rb quite a few locales report a bunch of failures. The most alarming of these is Japanese, which has 1007 failures out of 3339 active. I know that we haven't yet addressed things like stroke order in our collation implementation, so maybe that's the reason. Other locales like Spanish have 1 failure out of 402, which I don't understand either.

Anyway, I would really appreciate some help on this.

@camertron
Copy link
Collaborator Author

This test file used by ICU might be a good alternative source to the old tailoring test data.

@KL-7
Copy link
Contributor

KL-7 commented Nov 16, 2015

Hey @camertron, take a look at #52 and this gist – they explain why some of the tests were marked as pending.

Essentially, the main reason was that even back then the tests were already outdated, so I only aimed to pass the tests that were passed by ICU4J implementation. On top of that there was this problem with denormalized Japanese code points that resulted in one test failure, but, as mentioned in the #52, might cause more incorrect ordering on real data.

@camertron
Copy link
Collaborator Author

Ok, thanks @KL-7. I remember these from before. I understand why the tests were marked as pending, but I can't find any importer or other code that marks tests as pending if ICU doesn't sort them correctly. I attempted to do something like this with the TailoringTestsImporter in this PR, but I don't think it's working too well.

I'm currently discussing the future of the collation test data on the CLDR users mailing list and will report back here if I learn anything useful.

@KL-7
Copy link
Contributor

KL-7 commented Nov 16, 2015

Oh, sorry, I thought you were asking about the why. Regarding the how, it was done semi-manually (I went through our failures and confirmed that ICU failed on them as well) and I don't think I have any useful code left.

@CLAassistant
Copy link

CLAassistant commented Jul 18, 2019

CLA assistant check
All committers have signed the CLA.

@tahsin352
Copy link

i guess we can close this pr.

@camertron
Copy link
Collaborator Author

@tahsin352 why?

@tahsin352
Copy link

@tahsin352 why?

it seems very old pr, no updates on it for long time.

@camertron
Copy link
Collaborator Author

@tahsin352 that's true, but I'd like to keep it open. Hopefully I can use this work in the future as a springboard to finally modernize collation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants