
Commit

docs: add synthetic dataset description and reorganise pages
percevalw committed Jun 18, 2024
1 parent 73dede8 commit 07cadb0
Showing 10 changed files with 130 additions and 32 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@

# EDS-Pseudo

This project aims at detecting identifying entities documents, and was primarily tested
The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested
on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists in a
@@ -84,7 +84,7 @@ test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.
```
To apply the model on many documents using one or more GPUs, refer to the documentation
of [edsnlp](https://aphp.github.io/edsnlp/latest/tutorials/multiple-texts/).
of [edsnlp](https://aphp.github.io/eds-pseudo/main/inference).
<!-- metrics -->
Binary file added docs/assets/figures/data-augmentation.png
58 changes: 51 additions & 7 deletions docs/dataset.md
@@ -2,19 +2,25 @@

!!! warning "Disclaimer"

We do not provide the dataset due to privacy and regulatory constraints. You will
We do not provide our internal dataset due to privacy and regulatory constraints. You will
however find the description of the dataset below. We also release the code for the
rule-based annotation system.

You can find fictive data in the
[`data/gen_dataset`](https://github.com/aphp/eds-pseudo/tree/main/data/gen_dataset/)
folder to test the model.
You can find a description of the fictive dataset generation process in the [synthetic dataset](#synthetic-dataset) section.

## Format

By default, we expect the annotations to follow the format of the demo dataset [data/gen_dataset](https://github.com/aphp/eds-pseudo/tree/main/data/gen_dataset), but you can change the format by modifying the [config file](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg), and the "Datasets" part of it in particular, or the code of the [adapter](https://github.com/aphp/eds-pseudo/blob/main/eds_pseudo/adapter.py).
We expect the annotations to be a jsonlines file with the following format:

## Data Selection
```json
{ "note_id": "any-id-1", "note_text": "Jacques Chirac a été maire de Paris", "entities": [{"start": 0, "end": 7, ...] }
{ "note_id": "any-id-2", "note_text": "Elle est née en 2006", "entities": [{"start": 16, "end": 20, ...] }
...
```

You can change this format by modifying the [config file](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) (in particular its "datasets" section), or the code of the [adapter](https://github.com/aphp/eds-pseudo/blob/main/eds_pseudo/adapter.py), which is responsible for loading the data during training and evaluation.
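
For illustration, the snippet below is a minimal sketch of how such a file can be read with the Python standard library; it is a hypothetical helper, not the project's actual adapter, and the entity fields beyond `start` and `end` depend on your annotation scheme.

```python
import json


def read_jsonl(path):
    """Yield one annotated record per line of a jsonlines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Print the text covered by each annotated entity span.
for record in read_jsonl("data/gen_dataset/train.jsonl"):
    text = record["note_text"]
    for ent in record["entities"]:
        print(record["note_id"], text[ent["start"]:ent["end"]])
```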

## Internal Data Selection

We annotated around 4000 documents, selected according to the distribution of AP-HP's
Clinical Data Warehouse (CDW), to obtain a sample that is representative of the actual
@@ -54,7 +60,7 @@ We annotated clinical documents with the following entities :
## Statistics

To inspect the statistics for the latest version of our dataset, please refer to the
[latest release](/eds-pseudo/latest/dataset#statistics).
[v0.2.0 release](/eds-pseudo/v0.2.0/dataset#statistics).

<!--

@@ -73,3 +79,41 @@ The software tools used to annotate the documents with personal identification e
- [LabelStudio](https://labelstud.io/) for the first annotation campaign
- [Metanno](https://github.com/percevalw/metanno) for the second annotation campaign
but any annotation software will do.

## Synthetic dataset

We now describe the synthetic dataset generation process used to produce the public pseudonymisation model.

### Augmentation

Each synthetic training document is generated by augmenting a base fictitious template, replacing annotated entities with random values that are either generated from scratch or picked from predefined public lists:

- `PRENOM`: INSEE deceased list and INSEE natality list
- `NOM`: INSEE deceased list
- `VILLE`: INSEE deceased list
- `HOPITAL`: Handcrafted list
- `DATE`: Random dates, formatted as the original value in the template
- `ADRESSE`: No augmentation for now
- `MAIL`: Generated from fake first names, last names and handcrafted domains
- `TEL`: Random phone number
- `ZIP`: Random zip code
- `IPP`: Random number
- `NDA`: Random number
- `SECU`: Random number (following the French NSS format constraints)
- `DATE_NAISSANCE`: Random date

<figure style="text-align: center" markdown>
<img src="../assets/figures/data-augmentation.png" alt = "Template augmentation" style="max-height: 400px; max-width: 100%">
</figure>
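
For illustration, the offset-shifting logic behind this replacement step could look like the sketch below. The value pools and the `label` field on each entity are assumptions made for the example; the actual generation logic lives in [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py).

```python
import random

# Hypothetical value pools: the real ones are built from INSEE lists
# and handcrafted resources, as described above.
POOLS = {
    "PRENOM": ["Camille", "Dominique", "Sacha"],
    "NOM": ["Martin", "Bernard", "Petit"],
    "VILLE": ["Lyon", "Nantes", "Lille"],
}


def draw(label, original):
    """Pick a surrogate value for an entity, falling back to the original."""
    if label == "TEL":
        return "0" + "".join(random.choices("0123456789", k=9))
    return random.choice(POOLS.get(label, [original]))


def augment(text, entities):
    """Replace each entity span with a surrogate value and recompute
    the character offsets of the resulting annotations."""
    parts, new_entities, cursor = [], [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        parts.append(text[cursor:ent["start"]])
        surrogate = draw(ent["label"], text[ent["start"]:ent["end"]])
        start = sum(len(p) for p in parts)
        parts.append(surrogate)
        new_entities.append(
            {"start": start, "end": start + len(surrogate), "label": ent["label"]}
        )
        cursor = ent["end"]
    parts.append(text[cursor:])
    return "".join(parts), new_entities


# Example on the template shown in the Format section.
augmented_text, augmented_entities = augment(
    "Jacques Chirac a été maire de Paris",
    [
        {"start": 0, "end": 7, "label": "PRENOM"},
        {"start": 8, "end": 14, "label": "NOM"},
        {"start": 30, "end": 35, "label": "VILLE"},
    ],
)
```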

### Template writing

The template writing process was done iteratively:

1. We wrote a few starting annotated samples that we added to the base template list.
2. We augmented the base templates following the process of the previous section.
3. We trained the model on this augmented dataset.
4. We evaluated the model on the internal training set (acting as a validation set).
5. We picked the examples with the worst performance, wrote fictitious snippets with similar grammatical and syntactic structures, and added them to the base template list.
6. At the same time, we improved the augmentation process to account for these errors.
7. We repeated the process starting from step 2 until we reached a satisfactory performance.
15 changes: 9 additions & 6 deletions docs/index.md
@@ -1,14 +1,17 @@
# Overview

EDS-Pseudo is a project aimed at detecting identifying entities in textual documents,
and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).
The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested
on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists in a
hybrid model (rule-based + deep learning) for which we provide rules [`eds_pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds-pseudo/pipes) and a training recipe [`scripts/train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py).
hybrid model (rule-based + deep learning) for which we provide
rules ([`eds-pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes))
and a training recipe [`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py).

We also provide a small set of fictive documents
[`data/gen_dataset/train.jsonl`](https://github.com/aphp/eds-pseudo/blob/main/data/gen_dataset/train.jsonl)
to test the method.
We also provide some fictitious
templates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a
script ([`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)) to
generate a synthetic dataset.

The entities that are detected are listed below.

4 changes: 1 addition & 3 deletions docs/inference.md
@@ -1,6 +1,4 @@
# Inference

## Parallelizing inference
# Parallelized Inference

When processing multiple documents, we can optimize the inference by parallelizing the
computation on multiple cores and GPUs.
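
As a baseline before parallelizing, a minimal single-process sketch could look like the following; it assumes the spaCy-like `nlp.pipe` batching API and the public pretrained model, while the multi-core and multi-GPU backends are described in the edsnlp documentation.

```python
import edsnlp

# Single-process baseline: batch the documents through the pipeline.
nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)

texts = [
    "Mme Dupont est née le 12/03/1984.",
    "Adresser le compte rendu au Dr Martin, Hôpital Saint-Louis.",
]

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
```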
47 changes: 47 additions & 0 deletions docs/pretrained.md
@@ -0,0 +1,47 @@
# Pretrained model

An even simpler option than the rule-based model, with better performance (although a bit more compute-intensive),
is to use the public pretrained model.
This model is available on the HuggingFace model hub at
[AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data described in the
[Dataset](/dataset) page. You can also test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.

## Installation

1. Install the latest version of edsnlp

```shell
pip install "edsnlp[ml]" -U
```

2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public)
3. Create and copy a token at [https://hf.co/settings/tokens?new_token=true](https://hf.co/settings/tokens?new_token=true)
4. Register the token (only once) on your machine

```python
import huggingface_hub

huggingface_hub.login(
    token=YOUR_TOKEN,
    new_session=False,
    add_to_git_credential=True,
)
```
5. Load the model

```python
import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)
for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
```
To apply the model in parallel on many documents using one or more GPUs, refer to the [Inference](/inference) page.
7 changes: 3 additions & 4 deletions docs/quickstart.md → docs/rule-based.md
@@ -1,4 +1,4 @@
# Quickstart
# Rule-based model

## Installation

@@ -18,10 +18,9 @@ poetry install
If you face issues with the installation, try to lower the maximum python version to
<= 3.10 (in `pyproject.toml`).

## Without machine learning
## Rule-based model definition

If you do not have a labelled dataset, you can still use the rule-based components of the
model.
A simple option is to use the rule-based components of the model.

```python
import edsnlp
1 change: 1 addition & 0 deletions docs/synthetic-dataset.md
@@ -0,0 +1 @@
# Synthetic Dataset
20 changes: 12 additions & 8 deletions docs/training.md
@@ -1,4 +1,7 @@
# Training
# Training a custom model

If neither the rule-based model nor the public model is sufficient for your needs, you can
train your own model. This section will guide you through the process.

## Requirements

@@ -11,9 +14,8 @@ To train a model, you will need to provide:
In any case, you will need to modify the
[configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) file to
reflect these changes. This configuration already contains the rule-based components of
the previous section, feel free to add or remove them as you see fit. You may also want
to modify the [pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) file to change the name of packaged model
(defaults to `eds-pseudo-aphp`).
the previous section; feel free to add or remove them as you see fit. The [configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) file also contains
the name of the packaged model in the `[package]` section (defaults to `eds-pseudo-public`).

## DVC

@@ -49,10 +51,10 @@ dvc repro
python scripts/package.py
```

You should now be able to install and publish it:
You should now be able to install and use it:

```{: .shell data-md-color-scheme="slate" }
pip install dist/eds_pseudo_aphp-0.3.0-*
pip install dist/eds_pseudo_your_eds-0.3.0-*
```

## Use it
@@ -62,10 +64,10 @@ To test it, execute
=== "Loading the packaged model"

```python
import eds_pseudo_aphp
import eds_pseudo_your_eds

# Load the model
nlp = eds_pseudo_aphp.load()
nlp = eds_pseudo_your_eds.load()
```

=== "Loading from the folder"
@@ -106,3 +108,5 @@ existing_nlp = ...

existing_nlp.add_pipe(nlp.get_pipe("ner"), name="ner")
```

To apply the model in parallel on many documents using one or more GPUs, refer to the [Inference](/inference) page.
6 changes: 4 additions & 2 deletions mkdocs.yml
@@ -28,11 +28,13 @@ theme:

nav:
- index.md
- quickstart.md
- Demo: https://eds-pseudo-public.streamlit.app" target="_blank
- dataset.md
- rule-based.md
- pretrained.md
- training.md
- inference.md
- reproducibility.md
- dataset.md
- results.md
- changelog.md

