diff --git a/master/assets/overrides/partials/comments.html b/master/assets/overrides/partials/comments.html
index 3c70a68f1..c727c146b 100644
--- a/master/assets/overrides/partials/comments.html
+++ b/master/assets/overrides/partials/comments.html
@@ -1,5 +1,4 @@
 {% if page.url.split("/")[0] in ["concepts", "tutorials", "pipes", "tokenizers", "data", "utilities"] %}
-

{{ lang.t("meta.comments") }}

Basic Architecture

Most pipes provided by EDS-NLP aim to qualify pre-extracted entities. A typical pipeline is therefore built in three steps:

  1. Add a normaliser (see eds.normalizer)
  2. Add an entity recognition component (e.g. the simple but powerful eds.matcher)
  3. Add zero or more entity qualification components, such as eds.negation, eds.family or eds.hypothesis. These qualifiers typically help detect false positives.

Scope

Since the basic usage of EDS-NLP components is to qualify entities, most pipes can function in two modes:

  1. Annotation of the extracted entities (this is the default). To increase throughput, only pre-extracted entities (found in doc.ents) are processed.
  2. Full-text, token-wise annotation. This mode is activated by setting the on_ents_only parameter to False.
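
The difference between the two modes can be illustrated with a toy qualifier. This is only a sketch and not EDS-NLP's implementation: the cue list, the `window` parameter and the `negated_tokens` helper are assumptions made up for the example, and only the `on_ents_only` name comes from the text above.

```python
from typing import Dict, List, Tuple

# Hypothetical cue list, for illustration only.
NEGATION_CUES = {"no", "without"}


def negated_tokens(
    tokens: List[str],
    ents: List[Tuple[int, int]],
    on_ents_only: bool = True,
    window: int = 2,
) -> Dict[int, bool]:
    """Flag a token as negated if a cue occurs in the `window` tokens before it."""
    if on_ents_only:
        # Default mode: only tokens covered by an entity span are examined.
        candidates = sorted({i for start, end in ents for i in range(start, end)})
    else:
        # Full-text mode: every token is examined.
        candidates = list(range(len(tokens)))
    result = {}
    for i in candidates:
        context = tokens[max(0, i - window):i]
        result[i] = any(t.lower() in NEGATION_CUES for t in context)
    return result


tokens = ["Patient", "has", "no", "fever", "but", "cough"]
ents = [(3, 4), (5, 6)]  # (start, end) token offsets, end exclusive

on_ents = negated_tokens(tokens, ents)                        # two tokens processed
full_text = negated_tokens(tokens, ents, on_ents_only=False)  # six tokens processed
```

In the default mode the amount of work scales with the number of entity tokens rather than with the length of the document, which is where the throughput gain comes from.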

Full-text annotation means that the pipes could also be used the other way around, e.g. by detecting all negations once and for all in an ETL phase and reusing the results subsequently. However, this is not the intended use of the library, which aims to help researchers downstream as a standalone application.

Result persistence

Depending on their purpose (entity extraction, qualification, etc), EDS-NLP pipes write their results to Doc.ents, to Doc.spans or to a custom attribute.

Extraction pipes

Extraction pipes (matchers, the date detector or NER pipes, for instance) write their results directly to the Doc.ents attribute.

Note that spaCy prohibits overlapping entities within the Doc.ents attribute. To circumvent this limitation, we filter spans, and keep all discarded entities within the discarded key of the Doc.spans attribute.
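
As a rough illustration of this filtering step, the sketch below keeps the longest span among overlapping candidates and collects the rest separately. It works on plain `(start, end, label)` tuples and is an assumption about the general idea, in the spirit of spaCy's `spacy.util.filter_spans`, not a copy of the library's code; the `filter_overlapping` helper is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def filter_overlapping(spans: List[Span]) -> Tuple[List[Span], List[Span]]:
    """Greedy filter: prefer longer spans, then earlier ones."""
    kept: List[Span] = []
    discarded: List[Span] = []
    taken = set()  # token indices already covered by a kept span
    # Longest spans first; ties are broken by the earliest start.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if any(i in taken for i in range(start, end)):
            discarded.append(span)
        else:
            kept.append(span)
            taken.update(range(start, end))
    by_start = lambda s: s[0]
    return sorted(kept, key=by_start), sorted(discarded, key=by_start)


candidates = [(0, 2, "disease"), (1, 5, "disease"), (6, 7, "drug")]
ents, discarded = filter_overlapping(candidates)
# `ents` would go to Doc.ents, `discarded` to Doc.spans["discarded"].
```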

Some pipes write their output to the Doc.spans dictionary. We enforce the following doctrine:

  • Should the pipe extract entities that are directly informative (typically the output of the eds.matcher component), said entities are stashed in the Doc.ents attribute.
  • On the other hand, should the entity be useful to another pipe, but less so in itself (eg the output of the eds.sections or eds.dates component), it will be stashed in a specific key within the Doc.spans attribute.

Entity tagging

Moreover, most pipes declare spaCy extensions on the Doc, Span and/or Token objects.

These extensions are especially useful for qualifier pipes, but can also be used by other pipes to persist relevant information. For instance, the eds.dates pipeline component:

  1. Populates Doc.spans["dates"]
  2. For each detected item, keeps the normalised date in Span._.date

      Core Components

      This section deals with "core" functionalities offered by EDS-NLP:

      • Generic matchers against regular expressions and lists of terms
      • Text cleaning
      • Sentence boundary detection

      Available components

      Component Description
      eds.normalizer Non-destructive input text normalisation
      eds.sentences Better sentence boundary detection
      eds.matcher A simple yet powerful entity extractor
      eds.terminology A simple yet powerful terminology matcher
      eds.contextual_matcher A conditional entity extractor
      eds.endlines An unsupervised model to classify each end line

          Miscellaneous

          This section groups components that extract information which is useful to other components but has little medical value in itself.

          For instance, the date detection and normalisation pipeline falls in this category.

          Available components

          Component Description
          eds.dates Date extraction and normalisation
          eds.consultation_dates Identify consultation dates
          eds.quantities Quantity extraction and normalisation
          eds.sections Section detection
          eds.reason Rule-based hospitalisation reason detection
          eds.tables Table detection
          eds.split Doc splitting

              Named Entity Recognition Components

              We provide several Named Entity Recognition (NER) components. Named Entity Recognition is the task of identifying short relevant spans of text, named entities, and classifying them into pre-defined categories. In the case of clinical documents, these entities can be scores, disorders, behaviors, codes, dates, quantities, etc.

              Span setters: where are extracted entities stored?

              A component assigns entities to a document by adding them to the doc.ents or doc.spans[group] attributes. doc.ents only supports non-overlapping entities: if two entities overlap, the longest one is kept. doc.spans[group], on the other hand, can contain overlapping entities. To control where entities are added, use the span_setter argument of any of these components.

              Valid values for the span_setter argument of a component are:

              • a (doc, matches) -> None callable
              • a span group name
              • a list of span group names
              • a dict mapping each group name to True or to a list of labels

              The group name "ents" is a special case, and will add the matches to doc.ents.

              Examples

              • span_setter=["ents", "ckd"] will add the matches to both doc.ents and doc.spans["ckd"]. It is equivalent to {"ents": True, "ckd": True}.
              • span_setter={"ents": ["foo", "bar"]} will add the matches labelled "foo" or "bar" to doc.ents.
              • span_setter="ents" will add all matches only to doc.ents.
              • span_setter="ckd" will add all matches only to doc.spans["ckd"].
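
The equivalences listed above can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not EDS-NLP's code: the document is a plain dict, matches are `(start, end, label)` tuples, the `normalize_setter` and `apply_setter` helpers are hypothetical, the callable form is left out, and no overlap filtering is applied to "ents".

```python
from typing import Dict, List, Tuple, Union

Match = Tuple[int, int, str]  # (start, end, label)
SetterDict = Dict[str, Union[bool, List[str]]]
SetterSpec = Union[str, List[str], SetterDict]


def normalize_setter(spec: SetterSpec) -> SetterDict:
    """Bring the str and list forms to the dict form."""
    if isinstance(spec, str):
        return {spec: True}
    if isinstance(spec, list):
        return {group: True for group in spec}
    return dict(spec)


def apply_setter(doc: dict, matches: List[Match], spec: SetterSpec) -> None:
    """Store matches in doc["ents"] and/or doc["spans"][group]."""
    for group, labels in normalize_setter(spec).items():
        if labels is True:
            selected = list(matches)
        else:
            # `labels` is a list: keep only matches with one of these labels.
            selected = [m for m in matches if m[2] in labels]
        if group == "ents":  # special case described above
            doc["ents"].extend(selected)
        else:
            doc["spans"].setdefault(group, []).extend(selected)


doc = {"ents": [], "spans": {}}
matches = [(0, 2, "foo"), (3, 4, "ckd")]
apply_setter(doc, matches, {"ents": ["foo", "bar"], "ckd": True})
```

After this call, `doc["ents"]` holds only the "foo" match, while `doc["spans"]["ckd"]` holds both matches.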

              Available components

              Component Description
              eds.covid A COVID mentions detector
              eds.charlson A Charlson score extractor
              eds.sofa A SOFA score extractor
              eds.elston_ellis An Elston & Ellis code extractor
              eds.emergency_priority A priority score extractor
              eds.emergency_ccmu A CCMU score extractor
              eds.emergency_gemsa A GEMSA score extractor
              eds.tnm A TNM score extractor
              eds.adicap An ADICAP code extractor
              eds.drugs A drug mentions extractor
              eds.cim10 A CIM10 terminology matcher
              eds.umls A UMLS terminology matcher
              eds.ckd CKD extractor
              eds.copd COPD extractor
              eds.cerebrovascular_accident Cerebrovascular accident extractor
              eds.congestive_heart_failure Congestive heart failure extractor
              eds.connective_tissue_disease Connective tissue disease extractor
              eds.dementia Dementia extractor
              eds.diabetes Diabetes extractor
              eds.hemiplegia Hemiplegia extractor
              eds.leukemia Leukemia extractor
              eds.liver_disease Liver disease extractor
              eds.lymphoma Lymphoma extractor
              eds.myocardial_infarction Myocardial infarction extractor
              eds.peptic_ulcer_disease Peptic ulcer disease extractor
              eds.peripheral_vascular_disease Peripheral vascular disease extractor
              eds.solid_tumor Solid tumor extractor
              eds.alcohol Alcohol consumption extractor
              eds.tobacco Tobacco consumption extractor

                  Span Pooler

                  The eds.span_pooler component is a trainable span embedding component. It generates span embeddings from a word embedding component and a span getter. It can be used to train a span classifier, as in eds.span_classifier.

                  Parameters

                  PARAMETER DESCRIPTION
                  nlp

                  The pipeline object

                  TYPE: Optional[Pipeline] DEFAULT: None

                  name

                  Name of the component

                  TYPE: str DEFAULT: 'span_pooler'

                  embedding

                  The word embedding component

                  TYPE: WordEmbeddingComponent

                  pooling_mode

                  How word embeddings are aggregated into a single embedding per span.

                  TYPE: Literal['max', 'sum', 'mean'] DEFAULT: mean

                  hidden_size

                  The size of the hidden layer. If None, no projection is done and the output of the span pooler is used directly.

                  TYPE: Optional[int] DEFAULT: None
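
To make the three pooling_mode values concrete, here is a minimal sketch of how word embeddings could be aggregated into one vector per span. It uses plain Python lists rather than tensors, ignores the optional hidden_size projection, and the `pool_span` helper is a hypothetical name, not part of the component's API.

```python
from typing import List


def pool_span(
    word_embeddings: List[List[float]],
    start: int,
    end: int,
    mode: str = "mean",
) -> List[float]:
    """Aggregate the embeddings of words start..end (end exclusive)."""
    rows = word_embeddings[start:end]
    if not rows:
        raise ValueError("cannot pool an empty span")
    columns = list(zip(*rows))  # one tuple per embedding dimension
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Three words with 2-dimensional embeddings; the span covers the first two.
embeddings = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0]]
pooled_max = pool_span(embeddings, 0, 2, mode="max")    # [3.0, 4.0]
pooled_sum = pool_span(embeddings, 0, 2, mode="sum")    # [4.0, 4.0]
pooled_mean = pool_span(embeddings, 0, 2, mode="mean")  # [2.0, 2.0]
```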


                      Text CNN

                      The eds.text_cnn component is a simple 1D convolutional network to contextualize word embeddings (as computed by the embedding component passed as argument).

                      To be memory efficient when handling batches of variable-length sequences, this module employs sequence packing, while taking care to avoid contamination between documents.

                      Parameters

                      PARAMETER DESCRIPTION
                      nlp

                      The pipeline object

                      TYPE: PipelineProtocol DEFAULT: None

                      name

                      The name of the component

                      TYPE: str DEFAULT: 'text_cnn'

                      embedding

                      Embedding module to apply to the input

                      TYPE: TorchComponent[WordEmbeddingBatchOutput, BatchInput]

                      output_size

                      Size of the output embeddings. Defaults to the input_size.

                      TYPE: Optional[int] DEFAULT: None

                      out_channels

                      Number of channels

                      TYPE: int DEFAULT: None

                      kernel_sizes

                      Window size of each kernel

                      TYPE: Sequence[int] DEFAULT: (3, 4, 5)

                      activation

                      Activation function to use

                      TYPE: str DEFAULT: relu

                      residual

                      Whether to use residual connections

                      TYPE: bool DEFAULT: True

                      normalize

                      Whether to normalize before or after the residual connection

                      TYPE: Literal['pre', 'post', 'none'] DEFAULT: pre
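
The idea of packing without cross-document contamination can be sketched as follows. This is a toy, single-channel version in plain Python, written under the assumption that positions belonging to another document are treated like zero padding; the `pack` and `conv1d_packed` helpers are hypothetical and the actual component works on tensors with several kernels and channels.

```python
from typing import List, Tuple


def pack(docs: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Concatenate variable-length docs and remember which doc owns each position."""
    flat: List[float] = []
    doc_ids: List[int] = []
    for doc_id, doc in enumerate(docs):
        flat.extend(doc)
        doc_ids.extend([doc_id] * len(doc))
    return flat, doc_ids


def conv1d_packed(flat: List[float], doc_ids: List[int], kernel: List[float]) -> List[float]:
    """Same-length 1D convolution whose window never reads another doc's tokens."""
    left = (len(kernel) - 1) // 2
    out = []
    for i in range(len(flat)):
        total = 0.0
        for j, weight in enumerate(kernel):
            pos = i - left + j
            # Out-of-range or foreign positions act as zero padding.
            if 0 <= pos < len(flat) and doc_ids[pos] == doc_ids[i]:
                total += weight * flat[pos]
        out.append(total)
    return out


docs = [[1.0, 2.0, 3.0], [10.0, 20.0]]
kernel = [1.0, 1.0, 1.0]  # window size 3

packed_out = conv1d_packed(*pack(docs), kernel)
separate_out = [y for doc in docs for y in conv1d_packed(*pack([doc]), kernel)]
```

Processing the packed batch gives the same values as processing each document on its own, which is the property the packing scheme has to preserve, without allocating padding up to the longest document.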


                          Trainable components overview

                          In addition to its rule-based pipeline components, EDS-NLP offers new trainable components to fit and run machine learning models for classic biomedical information extraction tasks.

                          All trainable components implement the TorchComponent class, which provides a common API for training and inference.

                          Available components:

                          Name Description
                          eds.transformer Embed text with a transformer model
                          eds.text_cnn Contextualize embeddings with a CNN
                          eds.span_pooler A span embedding component that aggregates word embeddings
                          eds.ner_crf A trainable component to extract entities
                          eds.span_classifier A trainable component for multi-class multi-label span classification
                          eds.span_linker A trainable entity linker (i.e. to a list of concepts)

                              Tutorials

                              We provide step-by-step guides to get you started. We cover the following use-cases:


                                  Overview of connectors

                                  EDS-NLP provides a series of connectors that convert back and forth between various formats and the spaCy representation.

                                  We provide the following connectors:


                                      Pipeline evaluation


                                          Utilities

                                          EDS-NLP provides a few utilities to deploy pipelines, process RegExps, etc.


                                              Work with RegExp


                                                  Tests Utilities

                                                  We provide a few testing utilities that simplify the process of:

                                                  • creating testing examples for NLP pipelines;
                                                  • testing documentation code blocks.
