Translation System
People who have never attempted a translation project might not be aware of the problems which can arise. We would like to explain some of the fundamental difficulties.
The importance of word order varies significantly from language to language. In some, a case system might make sentence structure very clear with little importance placed on the order of words; in others, word order might be key to conveying a particular message. As an example, consider the sentence "You have x messages". This could be broken down into:
'You have ' + $count + ' messages'
This would lead to issues further down the line. Translating only the first and last snippets and reinserting them around the count in the original order might not yield a coherent sentence in the target language, because different languages order their words differently.
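As a minimal sketch of the problem, the Python snippet below contrasts concatenated fragments with a whole-sentence template. The function names are hypothetical and exist only for this illustration; the pirate greeting is borrowed from the po examples later on this page.

```python
def greet_by_concatenation(prefix, name, suffix):
    # The code fixes the order: prefix, then name, then suffix.
    # A translator can only change the two fragments, never the position of the name.
    return prefix + name + suffix


def greet_by_template(template, name):
    # The translator owns the whole sentence and decides where the name goes.
    return template % name


print(greet_by_concatenation("Hello ", "Anna", "!"))  # Hello Anna!
print(greet_by_template("Hello %s!", "Anna"))         # Hello Anna!
print(greet_by_template("%s, ahoi! hrrr", "Anna"))    # Anna, ahoi! hrrr
```

With concatenation there is no pair of fragments that produces the last sentence, because the name always ends up in the middle.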
Not every word has a direct translation in every language. There are many cases where the meaning of a word will vary greatly because of the context that the word is in.
A concept which may be defined by a single word in the original language may have many different corresponding words in the target language, be it because the target language is more nuanced in general or because the specific subject is particularly well-defined in the target language. An example of the issues with direct translation is shown below. On German airlines, it is quite common to read the following:
"Bitte anschnallen!"
The meaning of this sentence is equivalent to the following English sentence:
"Fasten your seat belts!"
However, translating "Bitte anschnallen!" to English would yield this sentence:
"Please fasten!"
This sentence is clearly lacking, and whilst the meaning can be inferred from context, it would sound unnatural to most native English speakers. A native German speaker would be very unlikely to use the direct translation of the English sentence:
"Bitte festschnallen mit ihrem Sicherheitsgurt!"
This sounds just as unnatural to German speakers as "Please fasten!" does to English speakers. Making the translation sound more fluid and "human" requires, unsurprisingly, a human element.
Translation software is currently unable to take into account context and select from a list of translations, leading to "errors" such as in the examples above.
Right-to-left writing makes a huge difference in the web interface. It affects most punctuation and of course the flow of the page itself; even if you force a specific order, you must reverse it so that it runs from right to left.
In most languages (like English), there are 2 so-called grammatical numbers: singular and plural. In those languages, the plural is used if you have none or many, and the singular is used only if you have exactly one:
You have 1 message.
You have 2 messages. (or also 0 or more than 2)
In other languages, there are up to 6 different plural forms. Which one applies sometimes depends on complex maths that we won't explain here, but luckily the world has defined logic for this. This concept is implemented in gettext, and it is the form we actually use, because our implementation sits on top of gettext for most of its base infrastructure. The plural definition for English (and most other languages) in gettext is:
nplurals=2; plural=(n != 1)
This describes the logic we've mentioned above: there are 2 grammatical numbers (singular and plural), and the expression yields the index of the form to use. The plural form (index 1) is used if the amount described is not 1; otherwise the singular form (index 0) is used.
All these definitions are fixed. You can find them on http://translate.sourceforge.net/wiki/l10n/pluralforms.
Example: In Slovak (the language spoken in Slovakia) there are 3 "plural forms", defined by this gettext definition:
nplurals=3; plural=(n==1) ? 0 : (n>=2 && n<=4) ? 1 : 2
So the text above would require 3 cases:
Mas 1 spravu.
Mas 2 spravy. (or 3 and 4)
Mas 5 sprav. (or also 0 and more than 5)
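The two plural definitions above can be sketched in Python as follows. This is only an illustration of the rules; the function names and the SLOVAK_FORMS list are assumptions made for this example, and gettext itself evaluates the plural expression from the po header rather than using code like this.

```python
def plural_index_english(n):
    # nplurals=2; plural=(n != 1)
    return int(n != 1)


def plural_index_slovak(n):
    # nplurals=3; plural=(n==1) ? 0 : (n>=2 && n<=4) ? 1 : 2
    if n == 1:
        return 0
    if 2 <= n <= 4:
        return 1
    return 2


# The three Slovak forms from the example above, ordered by plural index.
SLOVAK_FORMS = ["Mas %d spravu.", "Mas %d spravy.", "Mas %d sprav."]


def slovak_message(n):
    # Pick the form by index, then fill in the count.
    return SLOVAK_FORMS[plural_index_slovak(n)] % n


for count in (0, 1, 2, 5):
    print(slovak_message(count))
```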
Gender is also relevant in most languages, as it can influence the form of a word. We only mention it in this documentation as an element that could be taken into consideration.
We use our own community platform for managing all the translations. This allows us to build concepts and workflows tailored specifically to our needs. Integrating translation with social components and more visualization options is key to giving translators the best possible environment for understanding the deeper meaning of the text they translate.
The storage for the translations is a very important topic, because it shapes most of the decisions you have to make afterwards. The storage must be really fast and efficient, especially inside the code. Reimplementing existing concepts here would produce massive overhead, which led us to analyze the existing libraries for this topic to find ones fast enough for our requirements. We also had to find a solution that works inside JavaScript, so that we can integrate it easily into our JavaScript code, which drives most of the visuals of DuckDuckGo in a modern browser.
There are some pretty interesting solutions in Perl which would allow us to cover all cases, such as gender, but those solutions are specific to Perl and can't work in JavaScript. In the end we settled on the very common gettext system, which also has a JavaScript implementation and has implementations in Perl, Ruby, Python and other languages where we might need to integrate translation. In particular, the existence of a very widely used and accepted plain C implementation makes it reliable in many ways. The C library ships with a command-line tool that converts the text-based data files for the translations (the so-called po files) into highly efficient binary files (called mo files), which make this data accessible very quickly. This tool is called msgfmt and is included in the gettext package of your distribution.
For the JavaScript implementation we have a small Perl program, po2json, which converts the same text data file into a JSON format that is more usable in JavaScript. This data file must of course be loaded in the browser; you might notice that big JavaScript file when DuckDuckGo loads, which bundles the translations together with the libraries for using them. We compress it to save bandwidth. More optimization options are open here.
In the end, gettext is able to solve the problems of non-direct translations and grammatical number. By itself it cannot solve the gender case or the word order problem, especially with combined elements that have to be translated independently. Extending our system to handle gender as well is possible, but we are not heading in that direction yet.
Many people who have never done translation before but have heard of gettext think that gettext itself directly handles everything you need for translation, but it has no real concept of combined tokens or placeholders for dynamic text. gettext works only as storage and accessor for the translations you need. It solves lots of problems that you can't easily solve alone, but it misses those very important details.
We still need to wrap gettext with sprintf to make it really useful. This allows us to combine tokens with HTML and other formats. We will describe this in the next section.
We released this wrapper, which has exactly the same API for Perl, Python and JavaScript, on CPAN and PyPI. You can install it with cpan or your favorite CPAN package installer for Perl, like App::cpanminus:
cpanm Locale::Simple
or for Python with pip2 in your userspace:
pip2 install --user locale-simple
Inside the Perl distribution, you will also find all the JavaScript required.
sprintf is a C function that defines the so-called printf conventions for formatting text with dynamic data. You will find an equivalent in nearly every language, so you can always refer to its usage in the documentation of your favorite programming language.
sprintf is made to take a format definition and some values for the dynamic parts in this definition to produce a new string. A simple example in Perl would be:
perl -e'print sprintf("%s is my search engine!","DuckDuckGo")."\n";'
The output of this will be:
DuckDuckGo is my search engine!
The %s in the first parameter of sprintf is replaced with the second parameter we have given to the sprintf call. This is a very simple example; %s means string in this case. Other common conversions are:
%% a percent sign
%c a character with the given number
%s a string
%d a signed integer, in decimal
%u an unsigned integer, in decimal
%o an unsigned integer, in octal
%x an unsigned integer, in hexadecimal
%e a floating-point number, in scientific notation
%f a floating-point number, in fixed decimal notation
%g a floating-point number, in %e or %f notation
You will normally only face %s and probably %d in the context of DuckDuckGo, so don't be scared about the other options.
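To see what these conversions produce, here is a small sketch using Python's % operator, which follows the same printf conventions for the conversions listed above. The sample values are arbitrary and chosen only for illustration.

```python
# Each entry pairs a format string with the tuple of values to format.
samples = [
    ("%%", ()),         # a literal percent sign
    ("%c", (65,)),      # character with the given number
    ("%s", ("text",)),  # a string
    ("%d", (-42,)),     # signed decimal integer
    ("%u", (42,)),      # unsigned decimal (Python treats it like %d)
    ("%o", (8,)),       # octal
    ("%x", (255,)),     # hexadecimal
    ("%e", (1234.5,)),  # scientific notation
    ("%f", (1.5,)),     # fixed decimal notation
    ("%g", (0.0001,)),  # %e or %f, whichever is more compact
]

results = [fmt % values for fmt, values in samples]

for (fmt, _), text in zip(samples, results):
    print(fmt, "->", text)
```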
%d specifically displays a number as a decimal integer, so if you were to write:
sprintf("%d bottles of beer",99.0);
in your language, where 99.0 denotes a floating-point number, you would still get back:
"99 bottles of beer"
A very important point here is the ability to pass several parameters and to reorder them in the format. Take this example:
sprintf("From %s to %s",'A','B');
# returns "From A to B"
This seems to always force the first value into the first %s that appears and the second value into the second %s. If the order of this phrase is the other way around in some language, we of course can't change the order in the code, nor add a switch for it. Luckily, sprintf allows us to use the data in a different order than given, as you can see in this example:
sprintf('To %2$s from %1$s','A','B');
# returns "To B from A"
This tells sprintf to put the first extra value into %1$s and the second extra value into %2$s. So if a translation with several placeholders needs a different order for them, it can use this syntax to achieve that.
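Python's built-in % operator does not understand the %1$s syntax, so the sketch below implements a tiny positional formatter to show the mechanics. The positional_sprintf helper is hypothetical, handles only %s, %d, their %N$ variants and %%, and is not the implementation shipped with Locale::Simple.

```python
import re

# Matches "%%", "%s", "%d" and the positional variants "%1$s", "%2$d", ...
_PLACEHOLDER = re.compile(r"%(?:(\d+)\$)?([sd%])")


def positional_sprintf(fmt, *args):
    """Hypothetical helper: a very small subset of sprintf with %N$ support."""
    next_index = 0  # used by placeholders without an explicit position

    def substitute(match):
        nonlocal next_index
        position, conversion = match.groups()
        if conversion == "%":
            return "%"
        if position is not None:
            index = int(position) - 1  # %1$s refers to the first argument
        else:
            index = next_index
            next_index += 1
        value = args[index]
        return str(int(value)) if conversion == "d" else str(value)

    return _PLACEHOLDER.sub(substitute, fmt)


print(positional_sprintf("From %s to %s", "A", "B"))        # From A to B
print(positional_sprintf("To %2$s from %1$s", "A", "B"))    # To B from A
```

The code that calls the formatter stays the same in both cases; only the format string, which is what the translator supplies, changes the order.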
For Locale::Simple, we made a scraper which parses any JavaScript, Perl or Python script to find those tokens. This makes it important to be careful about how tokens are laid out: always keep the complete token, and everything in it, on one line.
Every parameter of the token must be written out literally. That means you can't use code to generate a msgctxt for a group of tokens and pass it in via variables. This is neither allowed nor the purpose of translation tokens. You can only use placeholders.
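To illustrate why these two rules matter, here is a rough sketch of how a line-based scraper could work. It is a hypothetical simplification written for this page, not the actual Locale::Simple scraper, and the names scrape_line, CALL, STRING and LITERAL_COUNT are assumptions.

```python
import re

# Start of a token call: l(, ln(, lp( or lnp(, followed by the rest of the line.
CALL = re.compile(r"\b(lnp|lp|ln|l)\(\s*(.*)")
# One quoted string literal, optionally followed by a comma.
STRING = re.compile(r"""\s*(?:"((?:[^"\\]|\\.)*)"|'((?:[^'\\]|\\.)*)')\s*,?""")
# How many leading string literals each function needs
# (msgctxt, msgid and msgid_plural, as applicable).
LITERAL_COUNT = {"l": 1, "ln": 2, "lp": 2, "lnp": 3}


def scrape_line(line):
    """Return (function, [literals]) for a token on this line, else None."""
    call = CALL.search(line)
    if call is None:
        return None
    name, rest = call.groups()
    literals, pos = [], 0
    for _ in range(LITERAL_COUNT[name]):
        found = STRING.match(rest, pos)
        if found is None:
            return None  # a variable or a line break: nothing to extract
        double_quoted, single_quoted = found.groups()
        literals.append(double_quoted if double_quoted is not None else single_quoted)
        pos = found.end()
    return name, literals


print(scrape_line('print(l("Hello %s!", name))'))  # ('l', ['Hello %s!'])
print(scrape_line('lp(context, "Medium")'))        # None: msgctxt is a variable
print(scrape_line('l('))                           # None: token continues on the next line
```

A scraper of this kind only ever sees the source text, so a msgctxt held in a variable or a msgid on a following line is simply invisible to it.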
The database of the community platform stores all the translations, which then get generated into the po files used by gettext in our translation system. Here are the German translations of the examples, taken from the po file that gets generated:
msgid "Monthly newsletter:"
msgstr "Monatlicher Newsletter:"
msgctxt "size"
msgid "Medium"
msgstr "Mittelgroß"
msgid "Hello %s!"
msgstr "Hallo %s!"
msgctxt "email"
msgid "You have %d message."
msgid_plural "You have %d messages."
msgstr[0] "Du hast %d Nachricht."
msgstr[1] "Du hast %d Nachrichten."
msgid "%2$s has %1$d message."
msgid_plural "%2$s has %1$d messages."
msgstr[0] "%2$s hat %1$d Nachricht."
msgstr[1] "%2$s hat %1$d Nachrichten."
msgid "From %s to %s"
msgstr "Von %s nach %s"
The pirate translation of the "Hello %s!" example would look like this:
msgid "Hello %s!"
msgstr "%s, ahoi! hrrr"
Our example to change the order of the placeholders would look like this:
msgid "From %s to %s"
msgstr "To %2$s from %1$s"
In the case of a language which has more than 2 plural forms, the index in the brackets simply keeps counting up:
msgid "You have %d message."
msgid_plural "You have %d messages."
msgstr[0] "Mas %d spravu."
msgstr[1] "Mas %d spravy."
msgstr[2] "Mas %d sprav."
An interesting side note: the highest number of plural forms is 6, which you can find in the Arabic language.
In the case of a token with both an amount and a dynamic text, a translation whose word order matches the order of the arguments can of course use the normal sprintf placeholders without positions:
msgid "%2$s has %1$d message."
msgid_plural "%2$s has %1$d messages."
msgstr[0] "%d Nachricht hat %s."
msgstr[1] "%d Nachrichten hat %s."
l( msgid, ... )
ln( msgid, msgid_plural, count, ... )
lp( msgctxt, msgid, ... )
lnp( msgctxt, msgid, msgid_plural, count, ... )
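As a rough sketch of how these four functions combine gettext with sprintf, the Python snippet below wraps the standard library gettext module (Python 3.8 or newer for pgettext and npgettext). It is not the Locale::Simple source. It assumes that count is passed to sprintf as the first value, which is inferred from the "%2$s has %1$d message." example above, and it uses NullTranslations, which returns the untranslated msgid, so that it runs without any mo file. Python's % operator does not support %1$s style positions, so only plain placeholders work in this sketch.

```python
import gettext

# Assumption for this sketch: no catalog is loaded, so lookups fall back to the msgid.
_translations = gettext.NullTranslations()


def l(msgid, *args):
    # Look up the translation, then fill in the placeholders.
    return _translations.gettext(msgid) % args


def ln(msgid, msgid_plural, count, *args):
    # gettext picks the plural form; count becomes the first sprintf value.
    return _translations.ngettext(msgid, msgid_plural, count) % ((count,) + args)


def lp(msgctxt, msgid, *args):
    # Same as l(), but the lookup is qualified by a context.
    return _translations.pgettext(msgctxt, msgid) % args


def lnp(msgctxt, msgid, msgid_plural, count, *args):
    # Context and plural handling combined.
    translated = _translations.npgettext(msgctxt, msgid, msgid_plural, count)
    return translated % ((count,) + args)


print(l("Hello %s!", "World"))
print(ln("You have %d message.", "You have %d messages.", 3))
print(lp("size", "Medium"))
print(lnp("email", "You have %d message.", "You have %d messages.", 1))
```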