Statistical implementation

Our latest implementation is based on statistical methods and is available in a number of languages. Data collection can be performed on a Hadoop cluster using our version of PigNLProc. More details on the indexing process of this implementation can be found here and a fully automated indexing tool can be found here.

Open issues and questions

Several issues with this implementation remain open; see the list in our issue tracker.

Q: Can the memory footprint be reduced?
A: The memory footprint of this implementation is mainly due to context words. There are three ways to reduce it:
1. Use disk-based context lookup instead of memory-based context lookup (see Issue #187).
2. Do not consider context (en_small.tar.gz).
3. Prune the context data (see Issue #167).

Q: I want to pass a parameter to show more or fewer entities depending on their score.
A: See Issue #188.
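Independently of Issue #188, results can also be filtered on the client side after annotation. The following is a minimal Python sketch and not part of Spotlight itself: the `filter_by_score` helper, the `uri` and `score` field names, and the sample annotations are all hypothetical and only illustrate the idea of a score threshold.

```python
from typing import Dict, List


def filter_by_score(entities: List[Dict], min_score: float) -> List[Dict]:
    """Keep only the entities whose score is at least min_score.

    The "score" key is an assumed field name, not a documented Spotlight field.
    """
    return [e for e in entities if float(e["score"]) >= min_score]


# Hypothetical annotation results, invented for illustration only.
annotations = [
    {"uri": "http://dbpedia.org/resource/Berlin", "score": 0.92},
    {"uri": "http://dbpedia.org/resource/Paris", "score": 0.41},
    {"uri": "http://dbpedia.org/resource/Apple", "score": 0.07},
]

# A higher threshold shows fewer entities; a lower one shows more.
strict = filter_by_score(annotations, 0.5)
lenient = filter_by_score(annotations, 0.05)
```

Raising `min_score` trades recall for precision, so the threshold is best tuned on a sample of your own documents.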

Downloads

Download page

You can also use Spotlight out of the box on a Linux machine by following this guide.

For the memory requirements of the models, see our paper. Because the English model is fairly large, en_small.tar.gz is provided as a low-memory alternative. It does not consider context words and therefore provides lower accuracy.

