Statistical implementation

sandroacoelho edited this page Jul 26, 2013 · 1 revision

Our latest implementation is based on statistical methods and is available in a number of languages. Data collection can be performed on a Hadoop cluster using our version of PigNLProc. More details on the indexing process of this implementation can be found here and a fully automated indexing tool can be found here.

Open issues and questions

There are still several open issues with this implementation; see the list in our issue tracker.

Q: Can the memory footprint be reduced?
A: The memory footprint of this implementation is mainly due to context words. There are three ways to reduce it:
1. Use disk-based context lookup instead of memory-based context lookup (see Issue #187).
2. Do not consider context (en_small.tar.gz).
3. Prune context data (see Issue #167).

Q: Can I pass a parameter to show more or fewer entities depending on their score?
A: See Issue #188.
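
Until such a parameter is available, one possible workaround is to filter the returned entities on the client side. The following is only a minimal sketch of that idea, not part of Spotlight itself: the function `filter_by_score`, the field names `surface_form`, `uri` and `score`, and the sample values are all hypothetical and would need to be adapted to the output you actually receive.

```python
from typing import Dict, List


def filter_by_score(entities: List[Dict], min_score: float) -> List[Dict]:
    """Keep only the entities whose score is at least min_score."""
    return [e for e in entities if e["score"] >= min_score]


# Hypothetical annotations: field names and scores are made up for illustration
# and are not taken from real Spotlight output.
annotations = [
    {"surface_form": "Berlin", "uri": "http://dbpedia.org/resource/Berlin", "score": 0.92},
    {"surface_form": "Paris", "uri": "http://dbpedia.org/resource/Paris", "score": 0.61},
    {"surface_form": "May", "uri": "http://dbpedia.org/resource/May", "score": 0.18},
]

# A higher threshold shows fewer entities, a lower threshold shows more.
kept = filter_by_score(annotations, min_score=0.5)
print([e["surface_form"] for e in kept])  # prints ['Berlin', 'Paris']
```

Filtering after the fact does not reduce the work done on the server; it only changes which entities are displayed.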

Downloads

Download page

You can also use Spotlight out of the box on a Linux machine by following this guide.

For the memory requirements of the models, see our paper. As the English model is fairly big, en_small.tar.gz is provided as a low-memory alternative; it does not consider context words and hence provides lower accuracy.