NER tagging plugin for Vulyk, crowdsourcing framework

To connect plugin to Vulyk:

Install this plugin as pip package pip install git+https://github.com/lang-uk/vulyk-ner.git
Make sure to include configuration for it in local_settings.py

ENABLED_TASKS = {
    # other plugins will be somewhere here
    'vulyk_ner': 'NERTaggingTaskType'
}

Full installation instructions can be found here https://github.com/mrgambal/vulyk

Running tests after you made changes

python -m unittest discover -s test

Conversion utils included

There are two utitilies included in this package:

convert2vulyk.py which is Swiss Army Knife to convert/tag texts into the format, suitable for vulyk tasks
convert_vulyk2iob.py which allows you to convert the individual answers, exported from vulyk with ./manage.py db export command into standard IOB

Convert texts with `convert2vulyk.py`

convert2vulyk.py subcommand convert allows you to convert bunch of files (either txt or json, see --format) into a jsonlines file that you can feed directly into vulyk. It also can autodiscover annotation layer in brat standoff format (see --ann_autodiscovery). You can supply a glob-style string as input_files param for batch processing. Beware, when applied to raw txt file, the tool will fix the whitespaces around punctuation according to the rules of typography

Converting pre-tokenized json files ([["sent1_word1", "sent1_word2"], ["sent2_word1"]] format)

python bin/convert2vulyk.py -f json convert  tokenized/jsons/*.json > vulyk_tasks.jsonlines

Converting text files with no annotations

This will tokenize input text files using whitespace tokenizer, adjust whitespaces and ignore annotation layer (if any)

python bin/convert2vulyk.py -f txt convert --ignore_annotations tokenized/txt/*.txt > vulyk_tasks.jsonlines

Converting text files with annotation layer stored in *.ann format

This will tokenize input text files using whitespace tokenizer, adjust whitespaces, autodiscover *.ann file next to *.txt (if any) and adjust positions of found NER tokens.

python bin/convert2vulyk.py -f txt convert  tokenized/txt/*.txt > vulyk_tasks.jsonlines

Tag texts with `convert2vulyk.py`

Subcommand tag allows you to pre-annotate given texts (tokenized or raw) using either stanza or spacy. You might as well specify your own models with --ner-model

To do so, you have to install extra dependencies

pip install -r extra_requirements.txt

Tag pretokenized json files with SpaCy model

python bin/convert2vulyk.py -f json tag --ner_framework spacy --ner_model /my/best/spacy/model tokenized/json/*.json > vulyk_tasks.jsonlines

Tokenize and tag raw text files with stanza model

Beware: to tokenize raw texts, script will use lang-uk's tokenize-uk tokenizer, which is sometime naïve

python bin/convert2vulyk.py -f txt tag --ner_framework stanza --ner_model "uk" tokenized/txt/*.txt > vulyk_tasks.jsonlines

Import to Vulyk

./manage.py db load ner_tagging_task --batch batch_name ./path/save_to_file.json

For more details and possible parameters refer to python bin/convert2vulyk.py -h

Convert vulyk results to IOB with `convert2vulyk.py`

The tool allows you to convert one or more batches with answers, exported from vulyk into the iob files:

python bin/convert_vulyk2iob.py "test_results/*.jsonlines" test_results/iobs/

Each individual answer from the annotator will be stored according to scheme {batch_dir}/{username}/{task_id}.iob, where batch_dir is the basename of the input files, username is the name of the annotator, task_id is the unique identifier of the task from vulyk.

As usual, python bin/convert_vulyk2iob.py -h is your friend.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

NER tagging plugin for Vulyk, crowdsourcing framework

Conversion utils included

Convert texts with `convert2vulyk.py`

Converting pre-tokenized json files ([["sent1_word1", "sent1_word2"], ["sent2_word1"]] format)

Converting text files with no annotations

Converting text files with annotation layer stored in *.ann format

Tag texts with `convert2vulyk.py`

Tag pretokenized json files with SpaCy model

Tokenize and tag raw text files with stanza model

Import to Vulyk

Convert vulyk results to IOB with `convert2vulyk.py`

Files

README.md

Latest commit

History

README.md

File metadata and controls

NER tagging plugin for Vulyk, crowdsourcing framework

Conversion utils included

Convert texts with convert2vulyk.py

Converting pre-tokenized json files ([["sent1_word1", "sent1_word2"], ["sent2_word1"]] format)

Converting text files with no annotations

Converting text files with annotation layer stored in *.ann format

Tag texts with convert2vulyk.py

Tag pretokenized json files with SpaCy model

Tokenize and tag raw text files with stanza model

Import to Vulyk

Convert vulyk results to IOB with convert2vulyk.py

Convert texts with `convert2vulyk.py`

Tag texts with `convert2vulyk.py`

Convert vulyk results to IOB with `convert2vulyk.py`