Feature request: Unify span getters / setters #203

Closed
Thomzoy opened this issue Apr 6, 2023 · 3 comments

@Thomzoy
Collaborator

Thomzoy commented Apr 6, 2023

Feature type

We might want to have a more uniform way of getting spans in pipelines. Currently, we have on_ents_only, on_spans, etc...
An idea is to expose a span_getter key in the configuration that could look like:

span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)

If a more complex getter is needed, it could come from a span_getter factory.

@Thomzoy Thomzoy changed the title Feature request: [feature] Feature request: Unify span getters Apr 6, 2023
@percevalw
Member

To complete this, here are the different uses of such getters in the lib atm:

  • on_ents_only: in span classification pipes, such as negation or hypothesis to only classify ents instead of all tokens (this is specific to the implementation of the context algorithms), and in dates to only detect dates inside existing entities (useful for normalization purposes)

  • on_span_groups/on_ents: used in span qualifier to retrieve the list of ents / spans that should be classified

        on_ents: Union[bool, Sequence[str]]
            Whether to look into `doc.ents` for spans to classify. If a list of strings
            is provided, only the span of the given labels will be considered. If None
            and `on_span_groups` is False, labels mentioned in `label_constraints`
            will be used.
        on_span_groups: Union[bool, Sequence[str], Mapping[str, Sequence[str]]]
            Whether to look into `doc.spans` for spans to classify:
    
            - If True, all span groups will be considered
            - If False, no span group will be considered
            - If a list of str is provided, only these span groups will be kept
            - If a mapping is provided, the keys are the span group names and the values
              are either a list of allowed labels in the group or True to keep them all
  • ent_labels / spans_labels: in trainable NER pipe to retrieve the list of ents / spans that should be extracted

        ent_labels: Iterable[str]
            list of labels to filter entities for in `doc.ents`
        spans_labels: Mapping[str, Iterable[str]]
            Mapping from span group names to list of labels to look for entities
            and assign the predicted entities

    However, these are tightly tied to the output format of the component.

  • as_ents in measurements and dates: whether to export matches as ents instead of just outputting them to a span group
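
The four accepted forms of on_span_groups described in the docstring above can be reduced to a single mapping. The following is only a sketch of that idea, not the library's actual code: the function name resolve_span_groups and the available_groups argument (the span group names present on the document) are assumptions made for illustration.

```python
from typing import Dict, List, Mapping, Sequence, Union

Labels = Union[bool, List[str]]  # True means "keep every label in the group"
OnSpanGroups = Union[bool, Sequence[str], Mapping[str, Union[bool, Sequence[str]]]]


def resolve_span_groups(
    on_span_groups: OnSpanGroups,
    available_groups: Sequence[str],
) -> Dict[str, Labels]:
    """Normalize `on_span_groups` to a {group name: allowed labels} mapping."""
    if on_span_groups is True:
        # All span groups are considered, without label filtering
        return {name: True for name in available_groups}
    if on_span_groups is False:
        # No span group is considered
        return {}
    if isinstance(on_span_groups, dict):
        # Keys are group names, values are allowed labels or True
        return {
            name: True if labels is True else list(labels)
            for name, labels in on_span_groups.items()
        }
    # A list of group names: keep these groups, with all their labels
    return {name: True for name in on_span_groups}
```

With this normalization, a component only has to iterate over one mapping, whatever form the user wrote in the configuration.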

There are in fact two kinds of span manipulation, which occur before and after a pipe respectively:

  • span getters gather spans from multiple sources (from ents, from span groups, filtered by labels, etc.)
  • span setters output spans to a given destination (to ents or to span groups)

The upcoming refactoring of edsnlp will allow most rule-based NER components to specify zones where they should look up entities. Following a discussion with @Thomzoy, we will also update components to specify where to output their predictions, i.e. spans or extensions.

Outputs

Here are some suggestions for various rule-based NER components. It seems that the behaviors of such components are too diverse to factorize the span setting parameters.

class DatesAndDurations:
    def __init__(
        self, 
        # value of the .label_ attribute set on dates/durations
        date_label="date", 
        duration_label="duration",
        # name of the span group `.spans[name]` to write matches (with an overwriting behavior)
        to_date_span_group="dates",
        to_duration_span_group="durations",
        # whether to also store matches as standard spaCy `.ents` entities
        to_ents: bool = True,
        # Where to look for candidates, by default the whole document (see below)
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

class MyMatcher:
    def __init__(
        self,
        label: str = "my-custom-match",
        to_span_group: str = "my-custom-matches",
        # or can be overwritten: to_span_group = "my-custom-matches-ml",
        to_ents: bool = True,
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

Inputs

On the other hand, span getters look better suited for factorization. A span getter could be:

  • a simple string, to look up a span group
    span_getter = "dates"
  • a list of strings, to look up multiple span groups
    span_getter = ["dates", "durations"]
  • a more complex/complete configuration, e.g. the one suggested by @Thomzoy
    span_getter = dict(
        ents = True,
        spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
        labels = ["relevant_label"], # to keep only entities with a specific `label_`
    )
  • a callable, to allow for more customizations
    def span_getter(doc):
        # do something with the doc
        return spans

In fact, the first three options could automatically be converted into hardcoded callables by the component, so that the component would only have to deal with a callable.
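
A minimal sketch of that conversion is given below. It is an illustration under stated assumptions rather than the library's implementation: the helper name make_span_getter is hypothetical, and Span and Doc are small stand-ins for the spaCy classes so that the snippet runs without spaCy installed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Minimal stand-ins for spaCy's Span and Doc (assumption: only the
# attributes used by the getter are modelled here).
@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label_: str


@dataclass
class Doc:
    ents: List[Span] = field(default_factory=list)
    spans: Dict[str, List[Span]] = field(default_factory=dict)


SpanGetter = Callable[[Doc], List[Span]]


def make_span_getter(config) -> SpanGetter:
    """Convert a string, list of strings, dict or callable into a callable."""
    if callable(config):
        return config  # already a getter, nothing to do
    if isinstance(config, str):
        config = {"spans": [config]}  # a single span group
    elif not isinstance(config, dict):
        config = {"spans": list(config)}  # several span groups

    use_ents = bool(config.get("ents", False))
    groups = config.get("spans", False)
    labels = config.get("labels")  # None means "keep every label"

    def getter(doc: Doc) -> List[Span]:
        candidates = list(doc.ents) if use_ents else []
        if groups is True:
            names = list(doc.spans)  # all span groups
        elif groups:
            names = [name for name in groups if name in doc.spans]
        else:
            names = []
        for name in names:
            candidates.extend(doc.spans[name])
        if labels is not None:
            candidates = [s for s in candidates if s.label_ in labels]
        return candidates

    return getter
```

Note that this sketch does not deduplicate a span that appears both in ents and in a span group; whether it should is a design choice left open here.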

Trainable NER

The trainable NER components are a bit more complex, as we have to deal with both span getters (during training / evaluation) and span setters (during inference / evaluation).

Since the span setting configuration is inferred from the span getting configuration in the current implementation, it would be nice to keep this behavior. Learning from the .ents collection is not desirable, since this field is prone to overwriting, which does not mix well with evaluation and training.

I suggest that the span_getter only allow specifying the span groups to look up, and that the span setting configuration be inferred from it, as is done currently.

The above span_getter configurations could be reused, with fewer options (no ents, no callable):

target_span_getter = dict(
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)
# or
target_span_getter = ["dates", "durations"]
# or 
target_span_getter = "dates"
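
As a rough sketch of how the span setting configuration might be inferred from such a restricted getter: the function name infer_to_span_groups and the seen_groups argument (span group names found in the training data, needed when spans is True) are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Sequence, Union

Labels = Union[bool, List[str]]  # True means "keep every label"


def infer_to_span_groups(
    target_span_getter,
    seen_groups: Sequence[str] = (),
) -> Dict[str, Labels]:
    """Map each span group read during training to the labels written to it."""
    if callable(target_span_getter):
        raise TypeError("callables are not supported for trainable NER targets")
    if isinstance(target_span_getter, str):
        return {target_span_getter: True}
    if not isinstance(target_span_getter, dict):
        # A list of span group names
        return {name: True for name in target_span_getter}
    if target_span_getter.get("ents"):
        raise ValueError("`ents` is not supported for trainable NER targets")

    groups = target_span_getter.get("spans", True)
    labels = target_span_getter.get("labels")
    allowed: Labels = True if labels is None else list(labels)
    if groups is True:
        # "All span groups": resolved from what was seen in the training data
        names = list(seen_groups)
    elif groups:
        names = list(groups)
    else:
        names = []
    return {name: allowed for name in names}
```

Predictions would then be written back to the same span groups they were learned from, which keeps training, evaluation and inference aligned.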

For span setting, to_span_groups and to_ents parameters could be used, with
defaults inferred from the training data:

# to set "date" labelled matches to the "dates-ml" span group, same for durations
to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}
# to set all predictions to a single span group
to_span_groups = "ner-predictions"

# and to_ents to set all predictions in ents
to_ents = True
# or to filter by label
to_ents = ["date", "duration"]
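
To make the intended semantics of these two parameters concrete, here is a small sketch. It is not the library's code: the function name set_spans is hypothetical (it echoes the renaming suggested later in this thread), predictions are modelled as plain (start, end, label) tuples, and doc is a plain dict standing in for a spaCy Doc.

```python
from typing import List, Mapping, Optional, Sequence, Tuple, Union

Prediction = Tuple[int, int, str]  # (start, end, label)
ToEnts = Union[bool, Sequence[str]]
ToSpanGroups = Union[str, Mapping[str, Union[str, Sequence[str]]]]


def set_spans(
    doc: dict,
    predictions: List[Prediction],
    to_ents: ToEnts = False,
    to_span_groups: Optional[ToSpanGroups] = None,
) -> None:
    """Write predictions to doc["ents"] and/or doc["spans"], overwriting targets."""
    if to_ents is True:
        # Boolean: all predictions go to ents
        doc["ents"] = list(predictions)
    elif to_ents:
        # List of labels: only predictions with these labels go to ents
        doc["ents"] = [p for p in predictions if p[2] in to_ents]

    if to_span_groups is None:
        return
    if isinstance(to_span_groups, str):
        # Single string: all predictions go to one span group
        doc["spans"][to_span_groups] = list(predictions)
        return
    for group, labels in to_span_groups.items():
        # Mapping: each group receives the predictions with its label(s)
        allowed = [labels] if isinstance(labels, str) else list(labels)
        doc["spans"][group] = [p for p in predictions if p[2] in allowed]
```

A real implementation would also have to deal with overlapping spans before assigning to ents, since spaCy does not allow overlapping entities there; that step is left out of this sketch.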

Summary

A span getter can be either:

  • a simple string, to look up a span group
    span_getter = "dates"
  • a list of strings, to look up multiple span groups
    span_getter = ["dates", "durations"]
  • a more complex/complete configuration, e.g. the one suggested by @Thomzoy
    span_getter = dict(
        ents = True,
        spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
        labels = ["relevant_label"], # to keep only entities with a specific `label_`
    )
  • a callable, to allow for more customizations
    def span_getter(doc):
        # do something with the doc
        return spans

Each component using a span getter should accept some of these configurations, but not necessarily
all of them, and convert them to a callable if needed.

Span setters can target .ents or .spans (or both). to_ents-like params can be a mix of:

  • a boolean, to set all predictions to ents
    to_ents = True
  • a list of strings, to set all predictions with these labels to ents
    to_ents = ["date", "duration"]

And to_span_groups-like params can be a mix of:

  • a string, to set all predictions to a single span group
    to_span_groups = "ner-predictions"
  • a mapping, to set predictions to different span groups
    to_span_groups = {
        "dates-ml": "date",
        "durations-ml": ["duration"],
    }

Each component using span setters/getters should accept some of these configurations, but not necessarily all of them. It's important that we don't enforce strict modularity or uniformity, as it would make the API too complex and rigid.

@percevalw percevalw changed the title Feature request: Unify span getters Feature request: Unify span getters / setters Aug 24, 2023
@aricohen93
Collaborator

As discussed, remember to:

  • rename the annotate function, maybe to set_spans
  • always save in span_groups (due to interactions)
  • in the DatesAndDurations example, replace the two arguments to_date_span_group and to_duration_span_group with a dict in span_group
  • take into consideration interactions between pipelines (e.g. dates and biology)

percevalw added a commit that referenced this issue Sep 12, 2023
Co-Authored-By: Perceval Wajsbürt
Co-Authored-By: Thomas Petit-Jean
percevalw added a commit that referenced this issue Sep 13, 2023
@percevalw
Member

Closing as this was merged in #213
