Basic Architecture
Most pipes provided by EDS-NLP aim to qualify pre-extracted entities. In a nutshell, the basic usage of the library is to:
- Implement a normaliser (see `eds.normalizer`)
- Add an entity recognition component (e.g. the simple but powerful `eds.matcher`)
- Add zero or more entity qualification components, such as `eds.negation`, `eds.family` or `eds.hypothesis`. These qualifiers typically help detect false positives, as in the sketch below.
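A minimal sketch of such a pipeline follows; the component names come from the documentation above, while the matcher terms and example text are purely illustrative, and exact signatures may vary between versions.

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())   # normalisation
nlp.add_pipe(eds.sentences())    # sentence boundaries, used by the qualifiers
nlp.add_pipe(
    eds.matcher(
        terms={"diabete": ["diabete", "diabète", "diabétique", "diabetique"]},
        attr="NORM",
    )
)
nlp.add_pipe(eds.negation())     # qualifiers
nlp.add_pipe(eds.family())
nlp.add_pipe(eds.hypothesis())

doc = nlp("Le patient n'est pas diabétique.")
for ent in doc.ents:
    print(ent.text, ent._.negation, ent._.family, ent._.hypothesis)
```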
Scope
Since the basic usage of EDS-NLP components is to qualify entities, most pipes can function in two modes:
- Annotation of the extracted entities (this is the default). To increase throughput, only pre-extracted entities (found in `doc.ents`) are processed.
- Full-text, token-wise annotation. This mode is activated by setting the `on_ents_only` parameter to `False`.
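As a hedged sketch of the second mode (the example text is illustrative, and the token-level extension name follows the qualifier's documented behaviour):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.sentences())
# Token-wise annotation: qualify every token, not just pre-extracted entities.
nlp.add_pipe(eds.negation(on_ents_only=False))

doc = nlp("Le patient ne présente pas de fièvre.")
print([(token.text, token._.negation) for token in doc])
```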
The possibility of full-text annotation implies that one could use the pipes the other way around, e.g. detecting all negations once and for all in an ETL phase, and reusing the results afterwards. However, this is not the intended use of the library, which aims to help researchers downstream as a standalone application.
Result persistence
Depending on their purpose (entity extraction, qualification, etc.), EDS-NLP pipes write their results to `Doc.ents`, `Doc.spans` or to a custom attribute.
Extraction pipes
Extraction pipes (matchers, the date detector or NER pipes, for instance) write their results directly to the `Doc.ents` attribute.
Note that spaCy prohibits overlapping entities within the `Doc.ents` attribute. To circumvent this limitation, we filter spans and keep all discarded entities within the `discarded` key of the `Doc.spans` attribute.
Some pipes write their output to the `Doc.spans` dictionary. We enforce the following doctrine:
- Should the pipe extract entities that are directly informative (typically the output of the `eds.matcher` component), said entities are stashed in the `Doc.ents` attribute.
- On the other hand, should the entity be useful to another pipe but less so in itself (e.g. the output of the `eds.sections` or `eds.dates` components), it will be stashed in a specific key within the `Doc.spans` attribute.
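A hedged illustration of where each kind of output ends up (the matcher terms, note text and span-group keys follow the conventions described above):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.matcher(terms={"diabete": ["diabete", "diabète"]}, attr="NORM"))
nlp.add_pipe(eds.sections())
nlp.add_pipe(eds.dates())

doc = nlp("Antécédents : diabète diagnostiqué en mars 2012.")

print(doc.ents)               # directly informative entities (eds.matcher)
print(doc.spans["sections"])  # contextual spans (eds.sections)
print(doc.spans["dates"])     # contextual spans (eds.dates)
```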
Entity tagging
Moreover, most pipes declare spaCy extensions on the `Doc`, `Span` and/or `Token` objects.
These extensions are especially useful for qualifier pipes, but can also be used by other pipes to persist relevant information. For instance, the `eds.dates` pipeline component:
- Populates `Doc.spans["dates"]`
- For each detected item, keeps the normalised date in `Span._.date`
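A minimal sketch of reading that extension back (the example date is illustrative):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates())

doc = nlp("Consultation du 12 janvier 2023.")
for date in doc.spans["dates"]:
    print(date.text, date._.date)  # the normalised date kept by the component
```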
Core Components
This section deals with the "core" functionalities offered by EDS-NLP:
- Generic matching against regular expressions and lists of terms
- Text cleaning
- Sentence boundary detection
Available components
Component | Description |
---|---|
eds.normalizer | Non-destructive input text normalisation |
eds.sentences | Better sentence boundary detection |
eds.matcher | A simple yet powerful entity extractor |
eds.terminology | A simple yet powerful terminology matcher |
eds.contextual_matcher | A conditional entity extractor |
eds.endlines | An unsupervised model to classify each end line |
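As a hedged sketch, these core components are typically combined as follows (the terms, regular expression and example text are illustrative):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())   # non-destructive text normalisation
nlp.add_pipe(eds.sentences())    # sentence boundary detection
nlp.add_pipe(
    eds.matcher(
        terms={"covid": ["covid", "coronavirus"]},
        regex={"respiratoire": r"respiratoires?"},
        attr="NORM",             # match on the normalised text
    )
)

doc = nlp("Détresse respiratoire aiguë sur COVID.")
print(doc.ents)
```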
Miscellaneous
This section regroups components that extract information which can be used by other components, but which has little medical value in itself.
For instance, the date detection and normalisation pipeline falls into this category.
Available components
Component | Description |
---|---|
eds.dates | Date extraction and normalisation |
eds.consultation_dates | Identify consultation dates |
eds.quantities | Quantity extraction and normalisation |
eds.sections | Section detection |
eds.reason | Rule-based hospitalisation reason detection |
eds.tables | Tables detection |
eds.split | Doc splitting |
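For instance, a hedged sketch of the section detector (the note text is illustrative):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.sections())

doc = nlp("Antécédents : HTA.\nConclusion : retour à domicile.")
for section in doc.spans["sections"]:
    print(section.label_, repr(section.text))
```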
Named Entity Recognition Components
We provide several Named Entity Recognition (NER) components. Named Entity Recognition is the task of identifying short relevant spans of text, named entities, and classifying them into pre-defined categories. In the case of clinical documents, these entities can be scores, disorders, behaviors, codes, dates, quantities, etc.
Span setters: where are extracted entities stored?
A component assigns entities to a document by adding them to the `doc.ents` or `doc.spans[group]` attributes. `doc.ents` only supports non-overlapping entities: if two entities overlap, the longest one will be kept. `doc.spans[group]`, on the other hand, can contain overlapping entities. To control where entities are added, you can use the `span_setter` argument of any of these components.
Valid values for the `span_setter` argument of a component are:
- a `(doc, matches) -> None` callable
- a span group name
- a list of span group names
- a dict mapping group names to `True` or to a list of labels
The group name `"ents"` is a special case, and will add the matches to `doc.ents`.
Examples
- `span_setter=["ents", "ckd"]` will add the matches to both `doc.ents` and `doc.spans["ckd"]`. It is equivalent to `{"ents": True, "ckd": True}`.
- `span_setter={"ents": ["foo", "bar"]}` will add the matches with labels "foo" and "bar" to `doc.ents`.
- `span_setter="ents"` will add all matches only to `doc.ents`.
- `span_setter="ckd"` will add all matches only to `doc.spans["ckd"]`.
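As a hedged sketch, this is how the argument can be passed to one of the NER components listed below (the example text is illustrative):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
# Store COVID mentions both in doc.ents and in a dedicated span group.
nlp.add_pipe(eds.covid(span_setter=["ents", "covid"]))

doc = nlp("Patient admis pour suspicion de COVID-19.")
print(doc.ents, doc.spans["covid"])
```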
Available components
Component | Description |
---|---|
eds.covid | A COVID mentions detector |
eds.charlson | A Charlson score extractor |
eds.sofa | A SOFA score extractor |
eds.elston_ellis | An Elston & Ellis code extractor |
eds.emergency_priority | A priority score extractor |
eds.emergency_ccmu | A CCMU score extractor |
eds.emergency_gemsa | A GEMSA score extractor |
eds.tnm | A TNM score extractor |
eds.adicap | An ADICAP code extractor |
eds.drugs | A drug mentions extractor |
eds.cim10 | A CIM10 terminology matcher |
eds.umls | A UMLS terminology matcher |
eds.ckd | CKD extractor |
eds.copd | COPD extractor |
eds.cerebrovascular_accident | Cerebrovascular accident extractor |
eds.congestive_heart_failure | Congestive heart failure extractor |
eds.connective_tissue_disease | Connective tissue disease extractor |
eds.dementia | Dementia extractor |
eds.diabetes | Diabetes extractor |
eds.hemiplegia | Hemiplegia extractor |
eds.leukemia | Leukemia extractor |
eds.liver_disease | Liver disease extractor |
eds.lymphoma | Lymphoma extractor |
eds.myocardial_infarction | Myocardial infarction extractor |
eds.peptic_ulcer_disease | Peptic ulcer disease extractor |
eds.peripheral_vascular_disease | Peripheral vascular disease extractor |
eds.solid_tumor | Solid tumor extractor |
eds.alcohol | Alcohol consumption extractor |
eds.tobacco | Tobacco consumption extractor |
Span Pooler
The `eds.span_pooler` component is a trainable span embedding component. It generates span embeddings from a word embedding component and a span getter. It can be used to train a span classifier, as in `eds.span_classifier`.
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object |
name | Name of the component |
embedding | The word embedding component |
pooling_mode | How word embeddings are aggregated into a single embedding per span |
hidden_size | The size of the hidden layer. If None, no projection is done and the output of the span pooler is used directly. |
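A hedged configuration sketch nesting the pooler inside a trainable span qualifier; the transformer model name, span group and qualified attribute are illustrative, and the exact `eds.span_classifier` signature may differ between versions.

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.span_classifier(
        embedding=eds.span_pooler(
            embedding=eds.transformer(
                model="prajjwal1/bert-tiny",  # illustrative model name
                window=128,
                stride=96,
            ),
            pooling_mode="mean",
        ),
        span_getter=["ents"],
        attributes=["_.negation"],  # assumed attribute to classify
    ),
    name="qualifier",
)
```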
Text CNN
The `eds.text_cnn` component is a simple 1D convolutional network used to contextualize word embeddings (as computed by the `embedding` component passed as argument).
To be memory efficient when handling batches of variable-length sequences, this module employs sequence packing, while taking care to avoid contamination between the different docs.
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object |
name | The name of the component |
embedding | Embedding module to apply to the input |
output_size | Size of the output embeddings. Defaults to the embedding size. |
out_channels | Number of channels |
kernel_sizes | Window size of each kernel |
activation | Activation function to use |
residual | Whether to use residual connections |
normalize | Whether to normalize before or after the residual connection |
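A hedged sketch of plugging the CNN between a transformer and a trainable NER head; the model name and the `eds.ner_crf` settings (including the span group holding the training annotations) are illustrative and may differ between versions.

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(
        embedding=eds.text_cnn(
            embedding=eds.transformer(
                model="prajjwal1/bert-tiny",  # illustrative model name
                window=128,
                stride=96,
            ),
            kernel_sizes=(3, 4, 5),
            residual=True,
        ),
        mode="joint",
        target_span_getter="gold-ner",  # assumed span group with gold entities
    ),
    name="ner",
)
```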
Trainable components overview
In addition to its rule-based pipeline components, EDS-NLP offers new trainable components to fit and run machine learning models for classic biomedical information extraction tasks.
All trainable components implement the `TorchComponent` class, which provides a common API for training and inference.
Available components:
Name | Description |
---|---|
eds.transformer | Embed text with a transformer model |
eds.text_cnn | Contextualize embeddings with a CNN |
eds.span_pooler | A span embedding component that aggregates word embeddings |
eds.ner_crf | A trainable component to extract entities |
eds.span_classifier | A trainable component for multi-class multi-label span classification |
eds.span_linker | A trainable entity linker (i.e. to a list of concepts) |
Tutorials
We provide step-by-step guides to get you started. We cover the following use-cases:
Spacy representations
Learn the basics of how documents are represented with spaCy.
Matching a terminology
Extract phrases that belong to a given terminology.
Qualifying entities
Ensure extracted concepts are not invalidated by linguistic modulation.
Detecting dates
Detect and parse dates in a text.
Processing multiple texts
Improve the inference speed of your pipeline.
Detecting hospitalisation reason
Identify spans mentioning the reason for hospitalisation or tag entities as the reason.
Detecting false endlines
Classify each line ending and add the `excluded` attribute to these tokens.
Aggregating results
Aggregate the results of your pipeline at the document level.
FastAPI
Deploy your pipeline as an API.
Visualization
Quickly visualize the results of your pipeline as annotations or tables.
Deep learning tutorial
Learn how EDS-NLP handles training deep-neural networks.
Training API
Learn how to quickly train a deep-learning model with `edsnlp.train`.
Overview of connectors
EDS-NLP provides a series of connectors to convert back and forth between different formats and the spaCy representation.
We provide the following connectors:
Pipeline evaluation
Utilities
EDS-NLP provides a few utilities to deploy pipelines, process RegExps, etc.
Work with RegExp
Tests Utilities
We provide a few testing utilities that simplify the process of:
- creating testing examples for NLP pipelines;
- testing documentation code blocks.