Feature request: Unify span getters / setters #203
Comments
To complete this, here are the different uses of such getters in the library at the moment:
There are in fact two kinds of span manipulation, occurring before and after a pipe:
The upcoming refactoring of edsnlp will allow most rule-based NER components to specify zones where they should look up entities. Following a discussion with @Thomzoy, we will also update components to specify where to output their predictions, i.e. spans or extensions.

Outputs

Here are some suggestions for various rule-based NER components. It seems that the behaviors of such components are too diverse to factorize the span setting parameters.

class DatesAndDurations:
    def __init__(
        self,
        # value of the .label_ attribute set on dates/durations
        date_label="date",
        duration_label="duration",
        # name of the span group `.spans[name]` to write matches (with an overwriting behavior)
        to_date_span_group="dates",
        to_duration_span_group="durations",
        # whether to also store matches as standard spaCy `.ents` entities
        to_ents: bool = True,
        # where to look for candidates, by default the whole document (see below)
        span_getter: Optional[SpanGetter] = None,
    ):
        ...
class MyMatcher:
    def __init__(
        self,
        label: str = "my-custom-match",
        to_span_group: str = "my-custom-matches",
        # or can be overwritten: to_span_group = "my-custom-matches-ml",
        to_ents: bool = True,
        span_getter: Optional[SpanGetter] = None,
    ):
        ...

Inputs

On the other hand, span getters look better suited for factorization. A span getter could be:
In fact, the first three options could automatically be converted into hardcoded callables by the component, so that the component would only have to deal with a callable.

Trainable NER

The trainable NER components are a bit more complex, as we have to deal with both span getters (during training / evaluation) and span setters (during inference / evaluation). Since the span setting configuration is inferred from the span getting configuration in the current implementation, it would be nice to keep this behavior.

I suggest that the span_getter only allow specifying the span groups to look up, and that the span setting configuration be inferred from it, as is done currently. The above span_getter configurations could be reused, with fewer options (no ents, no callable):

target_span_getter = dict(
    spans = ["first_span_key", "second_span_key"],  # or `True` to get all SpanGroups
    labels = ["relevant_label"],  # to keep only entities with a specific `label_`
)
# or
target_span_getter = ["dates", "durations"]
# or
target_span_getter = "dates"

For span setting:

# to set "date" labelled matches to the "dates-ml" span group, same for durations
to_span_groups = {
    "dates-ml": "date",
    "durations-ml": ["duration"],
}
# to set all predictions to a single span group
to_span_groups = "ner-predictions"
# and to_ents to set all predictions in ents
to_ents = True
# or to filter by label
to_ents = ["date", "duration"]

Summary

A span getter can be either:
Each component using a span getter should accept some of these configurations, but not necessarily all of them.

Span setters can target

And

Each component using span setters/getters should accept some of these configurations, but not necessarily all of them. It's important that we don't enforce strict modularity or uniformity, as it would make the API too complex and rigid.
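To make the span setting proposal concrete, here is a minimal sketch of how a component could apply `to_span_groups` and `to_ents` to its matches. The `set_spans` helper is hypothetical (it is not part of edsnlp), and the document and matches are plain `SimpleNamespace` stand-ins rather than spaCy objects. Using `True` internally to mean "every label" is also an assumption of this sketch.

```python
from types import SimpleNamespace


def set_spans(doc, matches, to_span_groups=None, to_ents=False):
    """Hypothetical helper: write `matches` to span groups and/or ents."""
    # A single string means: put every match in that one span group.
    if isinstance(to_span_groups, str):
        to_span_groups = {to_span_groups: True}
    for name, labels in (to_span_groups or {}).items():
        if isinstance(labels, str):
            labels = [labels]
        # Overwriting behavior: the group is replaced, not extended.
        doc.spans[name] = [
            m for m in matches if labels is True or m.label_ in labels
        ]
    if to_ents:
        # `True` keeps every match, a list keeps only the listed labels.
        kept = [m for m in matches if to_ents is True or m.label_ in to_ents]
        doc.ents = list(doc.ents) + kept


# Stand-in document and matches (assumptions, not spaCy objects)
doc = SimpleNamespace(ents=[], spans={})
matches = [
    SimpleNamespace(text="2020", label_="date"),
    SimpleNamespace(text="2 days", label_="duration"),
]
set_spans(
    doc,
    matches,
    to_span_groups={"dates-ml": "date", "durations-ml": ["duration"]},
    to_ents=["date"],
)
```

With real spaCy documents, `doc.ents` cannot hold overlapping spans, so an actual implementation would also need to resolve overlaps before assigning entities.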
As discussed remember to:
Co-Authored-By: Perceval Wajsbürt <[email protected]> Co-Authored-By: Thomas Petit-Jean <[email protected]>
Closing as this was merged in #213
Feature type
We might want to have a more uniform way of getting spans in pipelines. Currently, we have on_ents_only, on_spans, etc.

An idea is to expose a span_getter key in the configuration that could look like:

If a more complex getter is needed, it could come from a span_getter factory.
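As an illustration of the idea, here is a minimal sketch of how a component could turn a declarative `span_getter` value into a single callable. The `make_span_getter` function and the `Span` / `Doc` stand-in classes are hypothetical and only mimic the `.label_`, `.ents` and `.spans` attributes of spaCy objects. The `spans` and `labels` keys follow the configuration discussed in this thread, while the `ents` key is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Union


@dataclass
class Span:
    """Stand-in for a spaCy Span, carrying only what the sketch needs."""
    text: str
    label_: str


@dataclass
class Doc:
    """Stand-in for a spaCy Doc exposing `.ents` and `.spans`."""
    ents: List[Span] = field(default_factory=list)
    spans: Dict[str, List[Span]] = field(default_factory=dict)


SpanGetter = Callable[[Doc], Iterable[Span]]
SpanGetterArg = Union[str, List[str], dict, SpanGetter]


def make_span_getter(arg: SpanGetterArg) -> SpanGetter:
    """Convert a declarative span getter configuration into a callable."""
    if callable(arg):
        return arg  # already a callable, nothing to convert
    if isinstance(arg, str):
        arg = {"spans": [arg]}  # single span group name
    elif isinstance(arg, list):
        arg = {"spans": arg}  # list of span group names
    groups = arg.get("spans", [])
    labels = arg.get("labels")
    use_ents = arg.get("ents", False)  # hypothetical key

    def getter(doc: Doc) -> List[Span]:
        candidates = list(doc.ents) if use_ents else []
        # `spans=True` means: look in every span group of the document
        names = list(doc.spans) if groups is True else groups
        for name in names:
            candidates.extend(doc.spans.get(name, []))
        if labels is not None:
            candidates = [s for s in candidates if s.label_ in labels]
        return candidates

    return getter
```

A component would then call `make_span_getter(span_getter)` once at initialization and only deal with the resulting callable at run time.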