Tesa (text sanitizer)

The library contains a small collection of helper classes that sanitize text or string elements of arbitrary length in order to improve search match confidence during query execution. It was written to meet the needs of the Semantic MediaWiki project but is deployed independently.

Requirements

  • PHP 5.3 / HHVM 3.5 or later
  • Recommended to enable the ICU extension

Installation

The recommended way to install this library is to add the following dependency to your composer.json.

{
	"require": {
		"onoi/tesa": "~0.1"
	}
}

Usage

use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

// Create a sanitizer for the text to be processed
$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );

// Limit the text length and normalize the case
$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

// Strip apostrophes and common URI scheme prefixes
$sanitizer->replace(
	array( "'", "http://", "https://", "mailto:", "tel:" ),
	array( '' )
);

// Set the minimum word length and the whitelisted words
$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );

// Transliterate diacritics and Greek characters
$sanitizer->applyTransliteration(
	Transliterator::DIACRITICS | Transliterator::GREEK
);

// Tokenize the text, here with a null stopword analyzer and a null synonymizer
$text = $sanitizer->sanitizeWith(
	$sanitizerFactory->newGenericTokenizer(),
	$sanitizerFactory->newNullStopwordAnalyzer(),
	$sanitizerFactory->newNullSynonymizer()
);

  • SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
  • IcuWordBoundaryTokenizer is the preferred tokenizer when the ICU extension is available
  • NGramTokenizer is provided to increase CJK match confidence when the back-end does not provide an explicit ngram tokenizer
  • StopwordAnalyzer, together with a LanguageDetector, is provided as a means to keep frequent "noise" words out of a search index and thereby reduce ambiguity
  • Synonymizer currently only provides an interface
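
To illustrate why n-grams help with CJK text, which has no whitespace between words, the following is a minimal sketch of character bigram splitting in plain PHP. It is not the library's NGramTokenizer implementation; the function name characterNgrams and its parameters are hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php

/**
 * Split a UTF-8 string into overlapping character n-grams.
 * Hypothetical helper for illustration only; not part of the Tesa API.
 */
function characterNgrams( $text, $n = 2 ) {
	$length = mb_strlen( $text, 'UTF-8' );

	// Text no longer than n yields a single token (or none when empty)
	if ( $length <= $n ) {
		return $length === 0 ? array() : array( $text );
	}

	$ngrams = array();

	// Slide a window of n characters over the text, one character at a time
	for ( $i = 0; $i <= $length - $n; $i++ ) {
		$ngrams[] = mb_substr( $text, $i, $n, 'UTF-8' );
	}

	return $ngrams;
}

$tokens = characterNgrams( '東京都庁' );
// $tokens is array( '東京', '京都', '都庁' )
```

Because every adjacent character pair becomes a token, a query for 京都 can match the indexed text even though no word boundary was ever detected, at the cost of a larger index and some false positives.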

Contribution and support

If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.

Tests

The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the composer phpunit command from the root directory.

Release notes

  • 0.1.0 Initial release (2016-08-07)
  • Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface

Acknowledgments

  • The Transliterator uses the same diacritics conversion table as http://jsperf.com/latinize (except the German diaeresis ä, ü, and ö)
  • The stopwords used by the StopwordAnalyzer have been collected from different sources; each JSON file identifies its origin
  • CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
  • JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
  • TextCatLanguageDetector uses the wikimedia/textcat library to make predictions about a language

License

GNU General Public License 2.0 or later.