test for WCAG technique H58: Using language attributes to identify changes in the human language #58
Comments
guesslanguage.js has a license that is not compatible with the MIT license. It's based on older scripts in other languages. |
I've gone through all the older scripts and found it's GPL all the way down, but there are projects like SpamAssassin that are Apache-licensed and use the same method. Basically, all of the libraries use the standard Unicode blocks for languages and then a simple algorithm: if characters from a given block make up over 40% of the characters in a set of strings above a certain length, the text is assigned to the corresponding language. I think we might want to contact the author of guesslanguage and ask if we can include his regex in quail. |
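The block-plus-threshold approach described above can be sketched roughly as follows. This is a minimal illustration, not the code of guesslanguage or SpamAssassin: the `SCRIPT_RANGES` table, `MIN_LETTERS`, and the function names are hypothetical, and only the 40% figure comes from the comment above. Note that it identifies a script, not a language.

```javascript
// Hypothetical table: a handful of Unicode block ranges, far from exhaustive.
const SCRIPT_RANGES = {
  arabic: [[0x0600, 0x06ff]],
  bengali: [[0x0980, 0x09ff]],
  cyrillic: [[0x0400, 0x04ff]],
  hebrew: [[0x0590, 0x05ff]],
  latin: [[0x0041, 0x005a], [0x0061, 0x007a], [0x00c0, 0x024f]],
};

const MIN_SHARE = 0.4; // the 40% figure mentioned in the thread
const MIN_LETTERS = 10; // assumed minimum sample size, not from the thread

// Map a single code point to a script name, or null if it is in no listed block.
function scriptOf(codePoint) {
  for (const [name, ranges] of Object.entries(SCRIPT_RANGES)) {
    if (ranges.some(([lo, hi]) => codePoint >= lo && codePoint <= hi)) {
      return name;
    }
  }
  return null;
}

// Return the script whose characters exceed MIN_SHARE of all recognized
// characters, or null if the sample is too small or no script qualifies.
function guessScript(text) {
  const counts = {};
  let total = 0;
  for (const ch of text) {
    const script = scriptOf(ch.codePointAt(0));
    if (script === null) continue; // spaces, digits, punctuation, unknown blocks
    counts[script] = (counts[script] || 0) + 1;
    total += 1;
  }
  if (total < MIN_LETTERS) return null;

  let best = null;
  for (const [script, n] of Object.entries(counts)) {
    if (n / total > MIN_SHARE && (best === null || n > counts[best])) {
      best = script;
    }
  }
  return best;
}
```

Because the decision is based on a share of characters rather than on any single character, a few stray characters from another block do not change the result.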
Here is another code base, with a BSD license, based on the same algorithm principle you mentioned: https://github.com/webmil/text-language-detect |
And this is also MIT licensed and very interesting: https://github.com/shuyo/ldig |
Looking at a few of the projects mentioned, I see they use the same kind of Unicode ranges, but languages like Python or C have language shortcuts that map to Unicode groups, like https://code.google.com/p/chromium-compact-language-detector/source/browse/encodings.cc. |
I've started a I'm getting these unicode sets from documents like http://unicode.org/cldr/trac/browser/trunk/common/main/ar.xml, which list
|
One question I have is how to determine what language the current document is in if it's not set. Right now I first look at the element quail was asked to be run against, then its immediate ancestor with a |
We don't need to have a base language. We could run the test on the text in each HTML element, compare the guessed language with the language guessed for the other elements, and, where they differ, check whether there is a lang attribute in between. This won't work if there is mixed language within an element. |
I was also just thinking about non-basic-Latin words that are used in English and should not be considered a context switch. Can we say for now that we will just check all text-containing elements and compare them to each other? |
Yes, that's a good solution. And if the website contains the right semantic elements (other test) this will work for block or inline quotations as well. |
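One way to read "compare them to each other" is sketched below, as an illustration only. Elements are modeled as plain objects rather than DOM nodes, `lang` stands for the element's effective lang value (its own or an inherited one), and `guess` is any function that returns a language code or null. All of these names are assumptions made for the example, not quail's API. Elements that share the same effective lang are grouped, and any element whose guessed language differs from the most common guess in its group is reported.

```javascript
// elements: [{ id, text, lang }] in document order (hypothetical shape).
// guess: (text) => language code, or null when no guess is possible.
function findUnmarkedChanges(elements, guess) {
  // Group guesses by effective lang value ("" when no lang is set).
  const groups = new Map();
  for (const el of elements) {
    const guessed = guess(el.text);
    if (guessed === null) continue; // nothing to compare
    const key = el.lang || "";
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push({ el, guessed });
  }

  const failures = [];
  for (const members of groups.values()) {
    // Find the most common guess in this group (first seen wins ties).
    const tally = new Map();
    for (const { guessed } of members) {
      tally.set(guessed, (tally.get(guessed) || 0) + 1);
    }
    let majority = null;
    for (const [language, n] of tally) {
      if (majority === null || n > tally.get(majority)) majority = language;
    }
    // Anything that disagrees with the majority lacks a matching lang attribute.
    for (const { el, guessed } of members) {
      if (guessed !== majority) failures.push(el);
    }
  }
  return failures;
}

// Toy guesser for the demo only: it looks for one common word per language.
const toyGuess = (text) => {
  const words = text.toLowerCase().split(/\s+/);
  if (words.includes("und")) return "de";
  if (words.includes("and")) return "en";
  return null;
};

const sample = [
  { id: "p1", text: "Cats and dogs", lang: null },
  { id: "p2", text: "Katzen und Hunde", lang: null }, // German, no lang set
  { id: "p3", text: "Birds and bees", lang: null },
  { id: "p4", text: "Brot und Butter", lang: "de" }, // German, marked up
];
const flagged = findUnmarkedChanges(sample, toyGuess).map((el) => el.id);
```

As noted in the thread, this cannot catch mixed languages inside a single element, since each element gets exactly one guess.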
Great! I'm also thinking later down the line we can use the Content-Language HTTP header on the page as well, but that will require making an additional HTTP request on the current page, and I have found that this can cause problems where a page initiates an action (like visiting |
Great to have this in the code. Just curious, I couldn't track down the code for that, but what happens when there is a single word in Latin script, like: 'এটি একটি ভাষা একক IBM স্ক্রিপ্ট'? |
Right now it would throw an error, but we could capture only strings of text that are longer than n characters, or only complete character groups (like a sentence). I've also reached out to an author of a good trigram database about including it in quail. I think it's really heavy to include in a browser, but quail could accept a trigram database passed to it and use it to determine language changes more accurately. |
Added an issue to the guesslanguage repo to discuss including the trigram database with quail. |
This trigram set has a BSD license: https://github.com/webmil/text-language-detect/blob/master/lib/data/lang.dat and is 340 KB |
It would be very interesting to find a way to include this even in the plugin. We could probably preload QUAIL with the trigrams of only the given base language and start loading the other trigrams as soon as a certain threshold is reached indicating that a text is not in the base language. If we make the test more lightweight, we don't even have to know which language it is; we only have to know that it is not the base language. |
Given that regex can only accurately detect changes in script, not language, we will definitely need a trigram database of some kind; however, any database would be too large to just include in the plugin code. For now, I'm going to just accept the format of the trigram databases you identified (which are all identical), so that a project simply provides quail with the database, which could include just one language or all of them. You are correct that if a base language is identified, we can just run through the text and see if anything does not match the trigrams in question, although we should probably only throw an error in that case if there are enough characters in the element for a trigram test to be viable. If no trigram database is provided, there is a method of generating trigrams, and we could go element by element, creating a trigram database on the fly and then throwing errors on elements that are wildly divergent from the rest of the page. I've started the |
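The "build trigrams on the fly and flag divergent elements" idea can be sketched as below. This is an assumption-laden illustration: it does not read the `lang.dat` format mentioned above, and the function names, the profile size, `minChars`, and `minOverlap` are all hypothetical values picked for the example. A profile is simply the set of the most frequent trigrams in some reference text, and an element diverges when too few of its own trigrams appear in that set.

```javascript
// Split text into character trigrams, padding each end with a space.
function trigrams(text) {
  const clean = text.toLowerCase().replace(/[^\p{L}]+/gu, " ").trim();
  const chars = Array.from(` ${clean} `); // code points, not UTF-16 units
  const out = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.push(chars.slice(i, i + 3).join(""));
  }
  return out;
}

// Build a profile: the `size` most frequent trigrams across the given texts.
function buildProfile(texts, size = 300) {
  const counts = new Map();
  for (const text of texts) {
    for (const gram of trigrams(text)) {
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  return new Set(ranked.slice(0, size).map(([gram]) => gram));
}

// Share of the text's trigrams that occur in the profile, or null if none.
function overlap(text, profile) {
  const grams = trigrams(text);
  if (grams.length === 0) return null;
  const hits = grams.filter((gram) => profile.has(gram)).length;
  return hits / grams.length;
}

// True when the text is long enough to test and matches the profile poorly.
function divergesFromBase(text, profile, { minChars = 40, minOverlap = 0.3 } = {}) {
  if (Array.from(text).length < minChars) return false; // too short to judge
  const share = overlap(text, profile);
  return share !== null && share < minOverlap;
}

// Usage: the profile could come from a supplied database or from the page itself.
const baseProfile = buildProfile([
  "the quick brown fox jumps over the lazy dog and the cat",
  "this is the language of the page and it is written in english",
]);
const englishDiverges = divergesFromBase(
  "the cat and the dog is in the language of the page",
  baseProfile
);
const germanDiverges = divergesFromBase(
  "der schnelle braune fuchs springt über den faulen hund",
  baseProfile
);
```

The `minChars` guard reflects the point made above that an error should only be thrown when an element has enough characters for a trigram test to be viable. A real profile would need far more reference text than the two sentences used here.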
I've spent a lot of time either building or going down different implementations, and at this point I am just going to say guessLanguage.js is our best hope, but I haven't gotten a response from the maintainer (the repo's last commit was over a year ago). At this point, I'd like to suggest we add guessLanguage as a dev dependency (i.e. it won't be rolled into a release, but people can run Thoughts? I'm going to go forward in the trigram-language branch with this model for the time being; I don't think there are any licensing issues with doing it this way. |
I'm almost ready to merge in the branch, but I'm just playing with how many characters long a string should be before guessLanguage even makes sense to use. There's some research on the subject (including math that I'm not going to pretend to understand) that is leading me toward a simple model:
Luckily we have the unicode blocks to separate these out. |
Actually, it turns out the string length is unneeded, since all character-based languages are in different Unicode singletons and are therefore captured even without guessLanguage. I'm going to pull into dev. |
Use guessLanguage.js to find language changes without unicode singletons. Closes #58
Check that the human language of the content of the element is the same as the inherited language for the element.
This can be done with a guesslanguage script. A guesslanguage script checks the characters used (Chinese, Japanese) and checks for common words ('and', 'und', 'et', etc.).
Here is one we can implement:
https://github.com/richtr/guessLanguage.js