Home

Welcome to the CliNER wiki!

CliNER

Clinical Named Entity Recognition system (CliNER) is an open-source natural language processing system for named entity recognition in the clinical text of electronic health records. CliNER is designed to follow best practices in clinical concept extraction, as established in the i2b2 2010 shared task.

CliNER treats concept extraction as a sequence classification task, where every token is predicted IOB-style as one of: Problem, Test, Treatment, or None. Command-line flags let you choose between two sequence classification algorithms:

1. CRF (default), with linguistic and domain-specific features
2. LSTM
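To make the IOB-style labeling concrete, here is a minimal sketch that converts concept spans into one tag per token. It is not CliNER's code: the function name `spans_to_iob`, the tag spellings (`B-problem`, `I-test`, `O`), and the sample sentence are assumptions made for illustration.

```python
from typing import List, Tuple

# (start_token, end_token_inclusive, label); an illustrative span format,
# not CliNER's internal representation.
Span = Tuple[int, int, str]


def spans_to_iob(tokens: List[str], spans: List[Span]) -> List[str]:
    """Return one IOB tag per token; tokens outside every span get 'O'."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the concept
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"        # remaining tokens of the concept
    return tags


tokens = ["Patient", "denies", "chest", "pain", "after", "stress", "test", "."]
spans = [(2, 3, "problem"), (5, 6, "test")]
tags = spans_to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The "None" class in the description above corresponds to the `O` (outside) tag in this sketch.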

Please note that for optimal performance, CliNER requires users to obtain a Unified Medical Language System (UMLS) license, since the UMLS Metathesaurus is used as one of the knowledge sources for the above classifiers.

Free software: Apache v2.0 license

Optional Resources

There are a few external resources that are not packaged with CliNER but can improve prediction performance when used for feature extraction with the CRF.

GENIA

Why would I want this? The GENIA tagger is a tool similar to CliNER but designed for biomedical text. Depending on the domain of your data, its pretrained model may or may not improve CliNER's performance at detecting concepts.

The GENIA tagger identifies named entities in biomedical text. To install:

    > wget http://www.nactem.ac.uk/tsujii/GENIA/tagger/geniatagger-3.0.2.tar.gz
    > tar xzvf geniatagger-3.0.2.tar.gz
    > cd geniatagger-3.0.2
    > make

Edit config.txt so that the GENIA entry references the geniatagger executable you just built (e.g. "GENIA /someuser/CliNER/geniatagger-3.0.2/geniatagger").
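The following sketch shows one way to read settings of this shape. It assumes config.txt holds one whitespace-separated "KEY value" pair per line and that the literal value "None" means the resource is not configured, which is inferred only from the two examples on this page. The `parse_config` helper is hypothetical, and CliNER's own config reader may behave differently.

```python
from typing import Dict, Optional


def parse_config(text: str) -> Dict[str, Optional[str]]:
    """Parse 'KEY value' lines; 'None' or a missing value maps to None."""
    settings: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)       # split on the first run of whitespace
        if not parts:
            continue                      # skip blank lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = None if value in ("", "None") else value
    return settings


# Example content modeled on the entries quoted in this page.
example = "GENIA /someuser/CliNER/geniatagger-3.0.2/geniatagger\nUMLS None\n"
config = parse_config(example)
print(config)
```

A quick check like this can confirm that the GENIA entry points to the path you intended before running CliNER.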

GENIA Reference

UMLS

Why would I want this? The UMLS, or Unified Medical Language System, is a very comprehensive database of various medical terms and concepts. Access to it would allow CliNER to leverage domain-specific knowledge.

SORRY! This resource contains potentially sensitive clinical data and requires a confidentiality agreement. We can't do that part for you. Please see the "Additional Resources" portion of this readme for instructions on how to obtain the UMLS tables.

In order to use the UMLS tables, you must request a license. See: http://www.nlm.nih.gov/databases/umls.html

How to obtain UMLS tables:

1. Download all the files from: https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
2. Unzip mmsys.zip into a folder and put all the other downloaded files into that folder.
3. Execute run_linux.sh and select 'Install UMLS' in the GUI.
4. Choose a destination for the umls directory, hit 'Ok' and then 'Create New Config'.
5. Accept the agreement.
6. Select 'Only Active UMLS Sources' as your default subset.
7. Select 'Done' at the top right of the GUI pane and then select 'Begin Subset'.

This process may take a while. When it finishes, the directory '<Destination_Directory_Path>/META' should contain the necessary files.

You will need the following tables: LRABR, MRREL.RRF, MRCONSO.RRF, MRSTY.RRF

Put these tables in the $CLINER_DIR/umls_tables directory.
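Before editing the config, it can help to confirm that the files actually landed in the right place. The sketch below is a hypothetical helper, not part of CliNER. It assumes `CLINER_DIR` is set as an environment variable (falling back to the current directory) and checks only the three .RRF files named above. Extend the list with any other table you need.

```python
import os
from pathlib import Path
from typing import Iterable, List


def missing_tables(table_dir: Path, required: Iterable[str]) -> List[str]:
    """Return the required file names that are not present in table_dir."""
    return [name for name in required if not (table_dir / name).is_file()]


# Assumption: only the .RRF tables are checked here; add others as needed.
required = ["MRREL.RRF", "MRCONSO.RRF", "MRSTY.RRF"]
table_dir = Path(os.environ.get("CLINER_DIR", ".")) / "umls_tables"

missing = missing_tables(table_dir, required)
if missing:
    print("Missing from", table_dir, ":", ", ".join(missing))
else:
    print("All listed tables found in", table_dir)
```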

In order to tell CliNER that the tables are there, you must edit the file "$CLINER_DIR/config.txt" and change the line saying "UMLS None" to "UMLS ".

The database will be built from the tables when CliNER is run for the first time.

UMLS Reference

Please email [email protected] with your installation issues/questions.

i2b2 2010 Shared Task Data

These are resources that require login credentials to access secure data, so we can't provide you with them directly.

The Data Use and Confidentiality Agreement (DUA) with i2b2 forbids us from redistributing the i2b2 data. In order to gain access to the data, you must go to:

https://www.i2b2.org/NLP/DataSets/AgreementAR.php

to register and sign the DUA. Then you will be able to request the data through them.

Sample Result

The CliNER pipeline assumes that the clinical text has already been tokenized in accordance with the i2b2 format. We have included a simple tokenization script (see: tools/tok.py) that you can use or modify as you wish.
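As a rough illustration of what tokenized input can look like, the sketch below splits each line into words and punctuation marks and joins them with single spaces, one sentence per line. This is not the logic of tools/tok.py. The regular expression, the function names, and the assumption that the target layout is space-separated tokens with one sentence per line are all illustrative.

```python
import re
from typing import List

# A word (optionally continued by -, ' or . plus more word characters, so
# "98.6" stays whole), or any single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-'.]\w+)*|[^\w\s]")


def tokenize_line(line: str) -> List[str]:
    """Split one line of text into word and punctuation tokens."""
    return TOKEN_RE.findall(line)


def to_tokenized_text(raw: str) -> str:
    """Keep the input's line breaks; separate tokens with single spaces."""
    return "\n".join(" ".join(tokenize_line(line)) for line in raw.splitlines())


sample = "Pt. denies chest pain; BP 120/80, temp 98.6."
tokens = tokenize_line(sample)
print(to_tokenized_text(sample))
```

Real clinical notes contain abbreviations, dates, and dosage strings that a pattern this simple will split inconsistently, so treat it as a starting point only.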

The silver model does come with some degradation of performance. Given that the alternative is no model, I think this is okay, but be aware that if you have the i2b2 training data, then you can build a model that performs even better on the i2b2 test data.

Original Model (trained on i2b2-train data with UMLS + GENIA feats)

TESTING 1.1 - Exact span for all concepts together

	TP	FN	FP	Recall	Precision	F1
Class Exact Span	23358	4904	7696	0.826	0.752	0.788

TESTING 1.2 - Exact span for separate concept classes

	TP	FN	FP	Recall	Precision	F1
Exact Span With Matching Class for Problem	9478	2291	3077	0.805	0.755	0.779
Exact Span With Matching Class for Treatment	6881	1402	2398	0.831	0.742	0.784
Exact Span With Matching Class for Test	6999	1211	2221	0.852	0.759	0.803
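The TP, FN, and FP columns can be understood through a small counting sketch. It assumes that a predicted concept counts as a true positive only when its span and its class both match a gold concept exactly, which is one reading of "Exact Span With Matching Class". It is not the official i2b2 evaluation script, and the tuple layout and sample concepts are made up for illustration.

```python
from typing import Dict, Set, Tuple

# (line, start_token, end_token, label); an illustrative layout.
Concept = Tuple[int, int, int, str]


def count_exact_matches(
    gold: Set[Concept], predicted: Set[Concept]
) -> Dict[str, Dict[str, int]]:
    """Per-class TP/FN/FP where span and label must both match exactly."""
    counts: Dict[str, Dict[str, int]] = {}
    labels = {concept[3] for concept in gold | predicted}
    for label in sorted(labels):
        g = {c for c in gold if c[3] == label}
        p = {c for c in predicted if c[3] == label}
        counts[label] = {
            "TP": len(g & p),   # predicted and in gold
            "FN": len(g - p),   # in gold but not predicted
            "FP": len(p - g),   # predicted but not in gold
        }
    return counts


gold = {(1, 2, 3, "problem"), (1, 5, 6, "test"), (2, 0, 1, "treatment")}
predicted = {(1, 2, 3, "problem"), (1, 5, 5, "test"), (2, 0, 1, "problem")}
counts = count_exact_matches(gold, predicted)
for label, row in counts.items():
    print(label, row)
```

In this toy example, the prediction with a shortened span and the prediction with the wrong class each count as both a false positive and a false negative.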

Silver Model (trained on MIMIC data that was annotated by the Original Model)

TESTING 1.1 - Exact span for all concepts together

	TP	FN	FP	Recall	Precision	F1
Class Exact Span	20771	5504	10283	0.791	0.669	0.725

TESTING 1.2 - Exact span for separate concept classes

	TP	FN	FP	Recall	Precision	F1
Exact Span With Matching Class for Problem	8735	2875	3820	0.752	0.696	0.723
Exact Span With Matching Class for Treatment	5961	1278	3318	0.823	0.642	0.722
Exact Span With Matching Class for Test	6075	1351	3145	0.818	0.659	0.730
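The Recall, Precision, and F1 columns follow from the TP, FN, and FP counts using the standard definitions. The short sketch below recomputes the two "all concepts together" rows from the tables above. The function name `prf` is just a label chosen here.

```python
from typing import Tuple


def prf(tp: int, fn: int, fp: int) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) from raw counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# TP, FN, FP taken from the "Class Exact Span" rows above.
rows = {
    "Original model": (23358, 4904, 7696),
    "Silver model": (20771, 5504, 10283),
}

for name, (tp, fn, fp) in rows.items():
    recall, precision, f1 = prf(tp, fn, fp)
    print(f"{name}: recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```

Rounded to three decimals, the results agree with the reported values (0.826 / 0.752 / 0.788 and 0.791 / 0.669 / 0.725).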
