Differentiate duration and datetime #83

Closed
wants to merge 5 commits into from
12 changes: 12 additions & 0 deletions changelog.md
@@ -1,5 +1,17 @@
# Changelog

## Unreleased

### Added

- New `to_duration` method to convert an absolute date into a date relative to the note_datetime (or None)
- New `use_date_label` option in `eds.dates` to store absolute and relative dates under the same `date` label (instead of `absolute` and `relative`)

### Changed

- Duration entities (from `eds.dates`) are now stored in the `durations` span group, separate from the `dates` span group
- `to_datetime` now only returns absolute dates: relative dates are converted into absolute ones if `doc._.note_datetime` is given, and `None` is returned otherwise
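
The contract described in these entries can be sketched with the standard library alone. This is only an illustration under stated assumptions: `resolve_datetime` and `resolve_duration` are hypothetical names, not the edsnlp API, and returning a `timedelta` from the duration helper is an assumption, since the changelog does not state the return type of `to_duration`.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_datetime(
    absolute: Optional[datetime] = None,
    offset: Optional[timedelta] = None,
    note_datetime: Optional[datetime] = None,
) -> Optional[datetime]:
    """Hypothetical stand-in for the `to_datetime` contract."""
    if absolute is not None:
        # Absolute dates are returned as-is.
        return absolute
    if offset is not None and note_datetime is not None:
        # Relative dates are anchored on the note's datetime.
        return note_datetime + offset
    # A relative date without a reference point cannot be resolved.
    return None


def resolve_duration(
    absolute: datetime,
    note_datetime: Optional[datetime] = None,
) -> Optional[timedelta]:
    """Hypothetical stand-in for `to_duration` (timedelta is an assumption)."""
    if note_datetime is None:
        return None
    # Offset of the absolute date with respect to the note's datetime.
    return absolute - note_datetime


note = datetime(2021, 8, 27)
one_week_ago = timedelta(weeks=-1)

print(resolve_datetime(offset=one_week_ago))                      # None
print(resolve_datetime(offset=one_week_ago, note_datetime=note))  # 2021-08-20 00:00:00
print(resolve_duration(datetime(2021, 8, 23), note))              # -4 days, 0:00:00
```

The point of the sketch is that `None` signals "not resolvable without a note datetime", so callers can tell an unresolved relative date apart from an actual datetime.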

## v0.8.0 (2023-03-09)

### Added
29 changes: 12 additions & 17 deletions docs/pipelines/misc/dates.md
@@ -35,27 +35,30 @@ doc = nlp(text)

dates = doc.spans["dates"]
dates
# Out: [23 août 2021, il y a un an, pendant une semaine, mai 1995]
# Out: [23 août 2021, il y a un an, mai 1995]

dates[0]._.date.to_datetime()
# Out: 2021-08-23T00:00:00+02:00

dates[1]._.date.to_datetime()
# Out: -1 year
# Out: None

note_datetime = pendulum.datetime(2021, 8, 27, tz="Europe/Paris")

dates[1]._.date.to_datetime(note_datetime=note_datetime)
# Out: DateTime(2020, 8, 27, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
# Out: 2020-08-27T00:00:00+02:00

date_3_output = dates[3]._.date.to_datetime(
date_2_output = dates[2]._.date.to_datetime(
note_datetime=note_datetime,
infer_from_context=True,
tz="Europe/Paris",
default_day=15,
)
date_3_output
# Out: DateTime(1995, 5, 15, 0, 0, 0, tzinfo=Timezone('Europe/Paris'))
date_2_output
# Out: 1995-05-15T00:00:00+02:00

doc.spans["durations"]
# Out: [pendant une semaine]
```
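
The separation between the `dates` and `durations` span groups shown in this example can be sketched without spaCy. `FakeSpan` and `split_span_groups` are hypothetical stand-ins introduced for illustration only; the real component works on spaCy `Span` objects and writes to `doc.spans`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span: just a text and a label."""

    text: str
    label_: str


def split_span_groups(spans: List[FakeSpan]) -> Dict[str, List[FakeSpan]]:
    """Route duration spans to their own group, everything else to `dates`."""
    return {
        "dates": [s for s in spans if s.label_ != "duration"],
        "durations": [s for s in spans if s.label_ == "duration"],
    }


spans = [
    FakeSpan("23 août 2021", "absolute"),
    FakeSpan("il y a un an", "relative"),
    FakeSpan("pendant une semaine", "duration"),
    FakeSpan("mai 1995", "absolute"),
]
groups = split_span_groups(spans)

print([s.text for s in groups["dates"]])
print([s.text for s in groups["durations"]])
```

Because the duration is no longer in the `dates` group, `mai 1995` moves from index 3 to index 2, which is why the example above now reads `dates[2]`.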

## Declared extensions
@@ -66,17 +69,9 @@ The `eds.dates` pipeline declares one [spaCy extension](https://spacy.io/usage/p

The pipeline can be configured using the following parameters :

| Parameter | Explanation | Default |
|------------------|--------------------------------------------------|-----------------------------------|
| `absolute` | Absolute date patterns, eg `le 5 août 2020` | `None` (use pre-defined patterns) |
| `relative` | Relative date patterns, eg `hier`) | `None` (use pre-defined patterns) |
| `durations` | Duration patterns, eg `pendant trois mois`) | `None` (use pre-defined patterns) |
| `false_positive` | Some false positive patterns to exclude | `None` (use pre-defined patterns) |
| `detect_periods` | Whether to look for periods | `False` |
| `detect_time` | Whether to look for time around dates | `True` |
| `on_ents_only` | Whether to look for dates around entities only | `False` |
| `as_ents` | Whether to save detected dates as entities | `False` |
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` |
::: edsnlp.pipelines.misc.dates.factory.create_component
options:
only_parameters: true

## Authors and citation

15 changes: 3 additions & 12 deletions docs/pipelines/qualifiers/history.md
@@ -80,18 +80,9 @@ doc.ents[3]._.history # (2)

The pipeline can be configured using the following parameters :

| Parameter | Explanation | Default |
| -------------------- | -------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
| `attr` | spaCy attribute to match on (eg `NORM`, `TEXT`, `LOWER`) | `"NORM"` |
| `history` | History patterns | `None` (use pre-defined patterns) |
| `termination` | Termination patterns (for syntagma/proposition extraction) | `None` (use pre-defined patterns) |
| `use_sections` | Whether to use pre-annotated sections (requires the `sections` pipeline) | `False` |
| `use_dates` | Whether to use dates pipeline (requires the `dates` pipeline and ``note_datetime`` context is recommended) | `False` |
| `history_limit` | If `use_dates = True`. The number of days after which the event is considered as history. | `14` (2 weeks) |
| `exclude_birthdate` | If `use_dates = True`. Whether to exclude the birth date from history dates. | `True` |
| `closest_dates_only` | If `use_dates = True`. Whether to include the closest dates only. If `False`, it includes all dates in the sentence. | `True` |
| `on_ents_only` | Whether to qualify pre-extracted entities only | `True` |
| `explain` | Whether to keep track of the cues for each entity | `False` |
::: edsnlp.pipelines.qualifiers.history.factory.create_component
options:
only_parameters: true

## Declared extensions

16 changes: 14 additions & 2 deletions edsnlp/pipelines/misc/dates/dates.py
@@ -41,18 +41,23 @@ class Dates(BaseComponent):
false_positive : Union[List[str], str]
List of regular expressions for false positive (eg phone numbers, etc).
on_ents_only : Union[bool, str, List[str]]
Wether to look on dates in the whole document or in specific sentences:
Whether to look for dates in the whole document or in specific sentences:

- If `True`: Only look in the sentences of each entity in doc.ents
- If False: Look in the whole document
- If given a string `key` or list of string: Only look in the sentences of
each entity in `#!python doc.spans[key]`
detect_periods : bool
Whether to detect periods (experimental)
detect_time: bool
Whether to detect time inside dates
as_ents : bool
Whether to treat dates as entities
attr : str
spaCy attribute to use
use_date_label: bool
Whether to use a shared `date` label for absolute and relative dates
instead of `absolute` and `relative` labels
"""

# noinspection PyProtectedMember
@@ -68,8 +73,10 @@ def __init__(
detect_time: bool,
as_ents: bool,
attr: str,
use_date_label: bool = False,
):

self.use_date_label = use_date_label
self.nlp = nlp

if absolute is None:
@@ -193,8 +200,12 @@ def parse(self, dates: List[Tuple[Span, Dict[str, str]]]) -> List[Span]:
for span, groupdict in dates:
if span.label_ == "relative":
parsed = RelativeDate.parse_obj(groupdict)
if self.use_date_label:
span.label_ = "date"
elif span.label_ == "absolute":
parsed = AbsoluteDate.parse_obj(groupdict)
if self.use_date_label:
span.label_ = "date"
else:
parsed = Duration.parse_obj(groupdict)

@@ -275,7 +286,8 @@ def __call__(self, doc: Doc) -> Doc:
dates = self.process(doc)
dates = self.parse(dates)

doc.spans["dates"] = dates
doc.spans["dates"] = [d for d in dates if d.label_ != "duration"]
doc.spans["durations"] = [d for d in dates if d.label_ == "duration"]

if self.detect_periods:
doc.spans["periods"] = self.process_periods(dates)
63 changes: 53 additions & 10 deletions edsnlp/pipelines/misc/dates/factory.py
@@ -25,17 +25,59 @@
@Language.factory("eds.dates", default_config=DEFAULT_CONFIG, assigns=["doc.spans"])
def create_component(
nlp: Language,
name: str,
absolute: Optional[List[str]],
relative: Optional[List[str]],
duration: Optional[List[str]],
false_positive: Optional[List[str]],
on_ents_only: Union[bool, List[str]],
detect_periods: bool,
detect_time: bool,
as_ents: bool,
attr: str,
name: str = "eds.dates",
absolute: Optional[List[str]] = None,
relative: Optional[List[str]] = None,
duration: Optional[List[str]] = None,
false_positive: Optional[List[str]] = None,
on_ents_only: Union[bool, List[str]] = False,
detect_periods: bool = False,
detect_time: bool = True,
as_ents: bool = False,
attr: str = "LOWER",
use_date_label: bool = False,
):
"""
Tags and normalizes dates, using the open-source `dateparser` library.

The pipeline uses spaCy's `filter_spans` function.
It filters out false positives, and introduces a hierarchy between patterns.
For instance, in case of ambiguity, the pipeline will decide that a date is a
date without a year rather than a date without a day.

Parameters
----------
nlp : spacy.language.Language
Language pipeline object
absolute : Union[List[str], str]
List of regular expressions for absolute dates.
relative : Union[List[str], str]
List of regular expressions for relative dates
(eg `hier`, `la semaine prochaine`).
duration : Union[List[str], str]
List of regular expressions for durations
(eg `pendant trois mois`).
false_positive : Union[List[str], str]
List of regular expressions for false positive (eg phone numbers, etc).
on_ents_only : Union[bool, str, List[str]]
Whether to look for dates in the whole document or in specific sentences:

- If `True`: Only look in the sentences of each entity in doc.ents
- If False: Look in the whole document
- If given a string `key` or list of string: Only look in the sentences of
each entity in `#!python doc.spans[key]`
detect_periods : bool
Whether to detect periods (experimental)
detect_time: bool
Whether to detect time inside dates
as_ents : bool
Whether to treat dates as entities
attr : str
spaCy attribute to use
use_date_label: bool
Whether to use a shared `date` label for absolute and relative dates
instead of `absolute` and `relative` labels
"""
return Dates(
nlp,
absolute=absolute,
@@ -47,4 +89,5 @@ def create_component(
detect_time=detect_time,
as_ents=as_ents,
attr=attr,
use_date_label=use_date_label,
)
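
The effect of the `use_date_label` flag forwarded here can be sketched in isolation. `final_label` is a hypothetical helper, written only to mirror the relabelling that the `parse` hunk above applies to `span.label_`; it is not part of edsnlp.

```python
def final_label(label: str, use_date_label: bool = False) -> str:
    """Hypothetical helper mirroring the relabelling done in `Dates.parse`."""
    if use_date_label and label in ("absolute", "relative"):
        # Absolute and relative dates share a single label.
        return "date"
    # Durations (and every label when the flag is off) are left unchanged.
    return label


for label in ("absolute", "relative", "duration"):
    print(label, "->", final_label(label, use_date_label=True))
```

Since durations keep their `duration` label in both modes, filtering on `label_ != "duration"` to build the `dates` span group gives the same split whether or not the flag is set.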