language-detection


This library detects the language of a given text string. In the training phase it parses sample texts in many different languages into sequences of N-grams and builds a database file in JSON format. In the detection phase it uses that database to determine the language of a given text. The library ships with text samples for training and detecting text in 110 languages.
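To make the idea of N-grams concrete, here is a minimal sketch of how a single word might be broken into character N-grams. It is not the library's implementation: the function name `characterNgrams` and the padding of each word with underscores are assumptions made for illustration only.

```php
<?php

/**
 * Hypothetical helper: returns all character n-grams of length $n for one word.
 * The word is lowercased and padded with "_" on both sides so that word
 * boundaries show up in the n-grams (an assumption, not necessarily what the
 * library does).
 */
function characterNgrams(string $word, int $n): array
{
    $padded = '_' . mb_strtolower($word) . '_';
    $length = mb_strlen($padded);
    $ngrams = [];

    // Slide a window of $n characters over the padded word.
    for ($i = 0; $i + $n <= $length; $i++) {
        $ngrams[] = mb_substr($padded, $i, $n);
    }

    return $ngrams;
}

$bigrams = characterNgrams('Mag', 2);
// $bigrams is ['_m', 'ma', 'ag', 'g_']
```

The multibyte functions are used so that non-ASCII text is split by character rather than by byte.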

Installation with Composer

Note: This library requires the Multibyte String extension in order to work.

$ composer require patrickschur/language-detection

Basic Usage

To detect the language correctly, the input text should be at least a few sentences long.

use LanguageDetection\Language;
 
$ld = new Language;
 
$ld->detect('Mag het een onsje meer zijn?')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151,
    "nb" => 0.48849462365591,
    "nn" => 0.48741935483871,
    "fy" => 0.47822580645161,
    "dk" => 0.47172043010753,
    "sv" => 0.46408602150538,
    "bi" => 0.46021505376344,
    "de" => 0.45903225806452,
    [...]
)

API

`__construct(array $result = [])`

You can pass an array of languages to the constructor so that the text is compared only with the given languages. This can dramatically improve performance.

$ld = new Language(['de', 'en', 'nl']);
 
// Compares the sentence only with "de", "en" and "nl" language models.
$ld->detect('Das ist ein Test');

`whitelist(string ...$whitelist)`

Provide a whitelist. Only the given languages are kept in the result.

$ld->detect('Mag het een onsje meer zijn?')->whitelist('de', 'nn', 'nl', 'af')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "nn" => 0.48741935483871,
    "de" => 0.45903225806452
)
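The effect of a whitelist can be pictured with plain PHP arrays. The sketch below does not call the library. It takes a few scores from the example results above as a literal array and keeps only the whitelisted keys, which is an assumption about the behaviour based on the example output.

```php
<?php

// Scores copied from the example results above (shortened).
$scores = [
    'nl' => 0.66193548387097,
    'af' => 0.51338709677419,
    'br' => 0.49634408602151,
    'nn' => 0.48741935483871,
    'de' => 0.45903225806452,
];

$whitelist = ['de', 'nn', 'nl', 'af'];

// Keep only the entries whose key appears in the whitelist.
// array_intersect_key() preserves the order of $scores.
$filtered = array_intersect_key($scores, array_flip($whitelist));
// array_keys($filtered) is ['nl', 'af', 'nn', 'de']
```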

`blacklist(string ...$blacklist)`

Provide a blacklist. Removes the given languages from the result.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('dk', 'nb', 'de')->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151,
    "nn" => 0.48741935483871,
    "fy" => 0.47822580645161,
    "sv" => 0.46408602150538,
    "bi" => 0.46021505376344,
    [...]
)

`bestResults()`

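Returns only the top-scoring entries. This may be more than one language when several scores are very close to each other (see the example in the FAQ).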

$ld->detect('Mag het een onsje meer zijn?')->bestResults()->close();

Result:

Array
(
    "nl" => 0.66193548387097
)
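One plausible reading of "best results" is: keep every entry whose score lies within a small distance of the top score. The sketch below illustrates that idea on plain arrays. The function name `keepBest` and the threshold of 0.025 are assumptions for illustration, not values confirmed by this document.

```php
<?php

/**
 * Hypothetical helper: keeps the entries whose score is within $threshold
 * of the highest score. Expects $scores to be sorted in descending order.
 */
function keepBest(array $scores, float $threshold = 0.025): array
{
    if ($scores === []) {
        return [];
    }

    $top = reset($scores); // first (highest) score

    return array_filter($scores, function (float $score) use ($top, $threshold): bool {
        return ($top - $score) <= $threshold;
    });
}

// Scores taken from the examples in this document.
$clearWinner = keepBest(['nl' => 0.66193548387097, 'af' => 0.51338709677419]);
$closeCall   = keepBest(['fr' => 0.91307037037037, 'en' => 0.90623333333333]);
// $clearWinner keeps only 'nl'; $closeCall keeps both 'fr' and 'en'
```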

`limit(int $offset, int $length = null)`

You can specify the number of records to return. For example, the following code returns the top three entries.

$ld->detect('Mag het een onsje meer zijn?')->limit(0, 3)->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "af" => 0.51338709677419,
    "br" => 0.49634408602151
)

`close()`

Returns the result as an array.

$ld->detect('This is an example!')->close();

Result:

Array
(
    "en" => 0.5889400921659,
    "gd" => 0.55691244239631,
    "ga" => 0.55376344086022,
    "et" => 0.48294930875576,
    "af" => 0.48218125960061,
    [...]
)

`setTokenizer(TokenizerInterface $tokenizer)`

The script uses a tokenizer to extract all words from a sentence. You can define your own tokenizer, for example to deal with numbers.

$ld->setTokenizer(new class implements TokenizerInterface
{
    public function tokenize(string $str): array 
    {
        return preg_split('/[^a-z0-9]/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    }
});

This tokenizer splits the string at every character that is not a lowercase letter from a to z or a digit from 0 to 9, so the returned tokens contain only those characters.
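The effect of this pattern is easiest to see on a sample string. The snippet below runs the same `preg_split` call outside the library. The function names are hypothetical, and lowercasing the input first is one possible variation, not something the library is stated to do.

```php
<?php

// Same split as in the tokenizer above: every character outside a-z and 0-9
// acts as a separator, so uppercase letters are dropped as well.
function splitStrict(string $str): array
{
    return preg_split('/[^a-z0-9]/u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

// Variation: lowercase first so that uppercase letters are kept.
function splitLowercased(string $str): array
{
    return splitStrict(mb_strtolower($str));
}

$strict     = splitStrict('Hello World 42');     // ['ello', 'orld', '42']
$lowercased = splitLowercased('Hello World 42'); // ['hello', 'world', '42']
```

Note that the character class only covers ASCII letters, so accented or non-Latin characters also act as separators with this pattern.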

`__toString()`

Returns the top entry of the result. Note the echo at the beginning.

echo $ld->detect('Das ist ein Test.');

Result:

de

`jsonSerialize()`

Serializes the data to JSON.

$object = $ld->detect('Tere tulemast tagasi! Nägemist!');
 
echo json_encode($object, JSON_PRETTY_PRINT);

Result:

{
    "et": 0.5224748810153358,
    "ch": 0.45817028027498674,
    "bi": 0.4452670544685352,
    "fi": 0.440983606557377,
    "lt": 0.4382866208355367,
    [...]
}
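`json_encode()` can work on an object like this because PHP calls `jsonSerialize()` on any object that implements the built-in `JsonSerializable` interface. The sketch below shows the mechanism with a hypothetical stand-in class `ScoreList`. It is not the library's result class.

```php
<?php

// Hypothetical stand-in for a result object.
final class ScoreList implements JsonSerializable
{
    private $scores;

    public function __construct(array $scores)
    {
        $this->scores = $scores;
    }

    // Called automatically by json_encode().
    public function jsonSerialize(): array
    {
        return $this->scores;
    }
}

$json = json_encode(new ScoreList(['et' => 0.5, 'fi' => 0.25]));
// $json is '{"et":0.5,"fi":0.25}'
```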

Method chaining

You can also combine methods with each other. The following example removes all entries specified in the blacklist and returns only the top four entries.

$ld->detect('Mag het een onsje meer zijn?')->blacklist('af', 'dk', 'sv')->limit(0, 4)->close();

Result:

Array
(
    "nl" => 0.66193548387097,
    "br" => 0.49634408602151,
    "nb" => 0.48849462365591,
    "nn" => 0.48741935483871
)
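Conceptually, the chain above filters the score list and then slices it. The following sketch reproduces the example on a plain array using `array_diff_key` and `array_slice`. It illustrates the assumed semantics and is not the library's code.

```php
<?php

// First entries of the unfiltered result from "Basic Usage".
$scores = [
    'nl' => 0.66193548387097,
    'af' => 0.51338709677419,
    'br' => 0.49634408602151,
    'nb' => 0.48849462365591,
    'nn' => 0.48741935483871,
    'fy' => 0.47822580645161,
    'dk' => 0.47172043010753,
    'sv' => 0.46408602150538,
];

// blacklist('af', 'dk', 'sv'): drop the listed keys.
$withoutBlacklisted = array_diff_key($scores, array_flip(['af', 'dk', 'sv']));

// limit(0, 4): take four entries starting at offset 0.
// array_slice() preserves string keys.
$topFour = array_slice($withoutBlacklisted, 0, 4);
// array_keys($topFour) is ['nl', 'br', 'nb', 'nn']
```

The order of the two steps matters: slicing before filtering would remove "af" from the top four and leave only three entries.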

ArrayAccess

You can also access the object directly as an array.

$object = $ld->detect('Das ist ein Test');
 
echo $object['de'];
echo $object['en'];
echo $object['xy']; // does not exist

Result:

0.6623339658444
0.56859582542694
NULL
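Array-style access on an object is provided by PHP's built-in `ArrayAccess` interface. The class below is a hypothetical, read-only stand-in (`ReadOnlyScores`) that returns `null` for unknown keys, mirroring the behaviour shown above. The library's actual implementation may differ.

```php
<?php

// Hypothetical read-only stand-in for a result object.
final class ReadOnlyScores implements ArrayAccess
{
    private $scores;

    public function __construct(array $scores)
    {
        $this->scores = $scores;
    }

    public function offsetExists($offset): bool
    {
        return isset($this->scores[$offset]);
    }

    // Unknown keys yield null instead of raising a warning.
    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        return $this->scores[$offset] ?? null;
    }

    // Writes are ignored in this sketch.
    public function offsetSet($offset, $value): void
    {
    }

    // Removals are ignored in this sketch.
    public function offsetUnset($offset): void
    {
    }
}

$scores  = new ReadOnlyScores(['de' => 0.6623339658444, 'en' => 0.56859582542694]);
$german  = $scores['de']; // 0.6623339658444
$unknown = $scores['xy']; // null
```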

Supported languages

The library currently supports 110 languages.

Language	Language Code	Language	Language Code
Abkhaz	ab	Italian	it
Afrikaans	af	Inuktitut	iu
Amharic	am	Japanese	ja
Arabic	ar	Javanese	jv
Aymara	ay	Georgian	ka
Azerbaijani, North (Cyrillic)	az-Cyrl	Khmer	km
Azerbaijani, North (Latin)	az-Latn	Korean	ko
Belarusan	be	Kanuri	kr
Bulgarian	bg	Kurdish	ku
Bislama	bi	Latin	la
Bengali	bn	Ganda	lg
Lingala	ln	Tibetan	bo
Lao	lo	Breton	br
Lithuanian	lt	Bosnian (Cyrillic)	bs-Cyrl
Latvian	lv	Bosnian (Latin)	bs-Latn
Marshallese	mh	Catalan	ca
Mongolian, Halh (Cyrillic)	mn-Cyrl	Chamorro	ch
Malay (Arabic)	ms-Arab	Corsican	co
Malay (Latin)	ms-Latn	Cree	cr
Maltese	mt	Czech	cs
Norwegian, Bokmål	nb	Welsh	cy
Ndonga	ng	German	de
Dutch	nl	Danish	dk
Norwegian, Nynorsk	nn	Dzongkha	dz
Navajo	nv	Greek (monotonic)	el-monoton
Polish	pl	Greek (polytonic)	el-polyton
Portuguese (Brazil)	pt-BR	English	en
Portuguese (Portugal)	pt-PT	Esperanto	eo
Romanian	ro	Spanish	es
Russian	ru	Sanskrit	sa
Estonian	et	Slovak	sk
Basque	eu	Slovene	sl
Persian	fa	Somali	so
Finnish	fi	Albanian	sq
Fijian	fj	Swati	ss
Faroese	fo	Swedish	sv
French	fr	Tamil	ta
Frisian	fy	Thai	th
Gaelic, Irish	ga	Tagalog	tl
Tonga	to	Gaelic, Scottish	gd
Turkish	tr	Galician	gl
Tatar	tt	Guarani	gn
Tahitian	ty	Gujarati	gu
Uyghur (Arabic)	ug-Arab	Hausa	ha
Uyghur (Latin)	ug-Latn	Hebrew	he
Ukrainian	uk	Urdu	ur
Hindi	hi	Uzbek	uz
Croatian	hr	Venda	ve
Hungarian	hu	Vietnamese	vi
Armenian	hy	Walloon	wa
Interlingua	ia	Wolof	wo
Indonesian	id	Xhosa	xh
Igbo	ig	Yoruba	yo
Ido	io	Chinese, Mandarin (Simplified)	zh-Hans
Icelandic	is	Chinese, Mandarin (Traditional)	zh-Hant

Other languages

The library is trainable, which means you can change, remove and add your own language files. If your language is not supported, feel free to add your own language files. To do that, create a new directory in resources and add your training text to it.

Note: The training text should be a .txt file.

Example

|- resources
    |- ham
        |- ham.txt
    |- spam
        |- spam.txt

As you can see, the library can also be used to distinguish spam from ham. After adding your own files, you must first generate a language profile for them. This may take a few seconds.

use LanguageDetection\Trainer;
 
$t = new Trainer();
 
$t->learn();

Remove these few lines after execution. You can now classify texts using your own training text.

FAQ

How can I improve the detection phase?

To improve the detection phase you have to use more n-grams. But be careful, this will slow down the script. I found that detection is much better when using around 9,000 n-grams (the default is 310). To do that, look at the code right below:

$t = new Trainer();
 
$t->setMaxNgrams(9000);
 
$t->learn();

The training step above has to run first. After that you can classify texts as before, but you must specify how many n-grams you want to use.

$ld = new Language();
 
$ld->setMaxNgrams(9000);
  
// "grille pain" is French and means "toaster" in English
var_dump($ld->detect('grille pain')->bestResults());

Result:

class LanguageDetection\LanguageResult#5 (1) {
  private $result =>
  array(2) {
    'fr' =>
    double(0.91307037037037)
    'en' =>
    double(0.90623333333333)
  }
}

Is the detection process slower if language files are very big?

No, it is not. The trainer class only uses the best 310 n-grams of each language. As long as you don't change this number or add more language files, performance is not affected. Only creating the N-grams is slower, and that has to be done only once. The detection phase is only affected when you try to detect large chunks of text.

Summary: The training phase will be slower but the detection phase remains the same.
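Why the size of the training text need not affect detection can be sketched as follows: however many N-grams the training text yields, only a fixed number of the most frequent ones is kept. The helper `topNgrams` is hypothetical and only illustrates this truncation step. It is not taken from the library.

```php
<?php

/**
 * Hypothetical helper: counts n-gram occurrences and keeps the $max most
 * frequent ones.
 */
function topNgrams(array $ngrams, int $max): array
{
    $counts = array_count_values($ngrams);
    arsort($counts); // most frequent first, keys preserved

    return array_slice(array_keys($counts), 0, $max);
}

$kept = topNgrams(['en', 'de', 'en', 'er', 'en', 'de'], 2);
// $kept is ['en', 'de']
```

The list handed to the detection phase never grows beyond `$max` entries, regardless of how long the input list is.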

Contributing

Feel free to contribute. Any help is welcome.

License

This project is licensed under the terms of the MIT license.
