
Commit

docs: add synthetic dataset description and reorganise pages
percevalw committed Jun 18, 2024
1 parent 73dede8 commit 07cadb0
Showing 10 changed files with 130 additions and 32 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@

# EDS-Pseudo

This project aims at detecting identifying entities documents, and was primarily tested
The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested
on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists in a
@@ -84,7 +84,7 @@ test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.
```
To apply the model on many documents using one or more GPUs, refer to the documentation
of [edsnlp](https://aphp.github.io/edsnlp/latest/tutorials/multiple-texts/).
of [edsnlp](https://aphp.github.io/eds-pseudo/main/inference).
<!-- metrics -->
Binary file added docs/assets/figures/data-augmentation.png
58 changes: 51 additions & 7 deletions docs/dataset.md
@@ -2,19 +2,25 @@

!!! warning "Disclaimer"

We do not provide the dataset due to privacy and regulatory constraints. You will
We do not provide our internal dataset due to privacy and regulatory constraints. You will
however find the description of the dataset below. We also release the code for the
rule-based annotation system.

You can find fictive data in the
[`data/gen_dataset`](https://github.com/aphp/eds-pseudo/tree/main/data/gen_dataset/)
folder to test the model.
You can find a description of the fictive dataset generation process in the [synthetic dataset](#synthetic-dataset) section.

## Format

By default, we expect the annotations to follow the format of the demo dataset [data/gen_dataset](https://github.com/aphp/eds-pseudo/tree/main/data/gen_dataset), but you can change the format by modifying the [config file](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg), and the "Datasets" part of it in particular, or the code of the [adapter](https://github.com/aphp/eds-pseudo/blob/main/eds_pseudo/adapter.py).
We expect the annotations to be a jsonlines file with the following format:

## Data Selection
```json
{ "note_id": "any-id-1", "note_text": "Jacques Chirac a été maire de Paris", "entities": [{"start": 0, "end": 7, ...] }
{ "note_id": "any-id-2", "note_text": "Elle est née en 2006", "entities": [{"start": 16, "end": 20, ...] }
...
```

You can change this format by modifying the [config file](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) (in particular its "datasets" section), or the code of the [adapter](https://github.com/aphp/eds-pseudo/blob/main/eds_pseudo/adapter.py), which is responsible for loading the data during training and evaluation.
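
For illustration, the snippet below is a minimal sketch of how such a file can be read with the Python standard library; it is a hypothetical helper, not the project's actual adapter, and the entity fields beyond `start` and `end` depend on your annotation scheme.

```python
import json


def read_jsonl(path):
    """Yield one annotated record per line of a jsonlines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Print the text covered by each annotated entity span.
for record in read_jsonl("data/gen_dataset/train.jsonl"):
    text = record["note_text"]
    for ent in record["entities"]:
        print(record["note_id"], text[ent["start"]:ent["end"]])
```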

## Internal Data Selection

We annotated around 4000 documents, selected according to the distribution of AP-HP's
Clinical Data Warehouse (CDW), to obtain a sample that is representative of the actual
@@ -54,7 +60,7 @@ We annotated clinical documents with the following entities :
## Statistics

To inspect the statistics for the latest version of our dataset, please refer to the
[latest release](/eds-pseudo/latest/dataset#statistics).
[v0.2.0 release](/eds-pseudo/v0.2.0/dataset#statistics).

<!--

@@ -73,3 +79,41 @@ The software tools used to annotate the documents with personal identification e
- [LabelStudio](https://labelstud.io/) for the first annotation campaign
- [Metanno](https://github.com/percevalw/metanno) for the second annotation campaign
but any annotation software will do.

## Synthetic dataset

We now describe the synthetic dataset generation process used to produce the public pseudonymisation model.

### Augmentation

Each synthetic training document is generated by augmenting a base fictitious template, replacing annotated entities with random values that are either generated from scratch or picked from predefined public lists:

- `PRENOM`: INSEE deceased list and INSEE natality list
- `NOM`: INSEE deceased list
- `VILLE`: INSEE deceased list
- `HOPITAL`: Handcrafted list
- `DATE`: Random dates, formatted as the original value in the template
- `ADRESSE`: No augmentation for now
- `MAIL`: Generated from fake first names, last names and handcrafted domains
- `TEL`: Random phone number
- `ZIP`: Random zip code
- `IPP`: Random number
- `NDA`: Random number
- `SECU`: Random number (following the French NSS format constraints)
- `DATE_NAISSANCE`: Random date

<figure style="text-align: center" markdown>
<img src="../assets/figures/data-augmentation.png" alt = "Template augmentation" style="max-height: 400px; max-width: 100%">
</figure>
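
For illustration, the offset-shifting logic behind this replacement step could look like the sketch below. The value pools and the `label` field on each entity are assumptions made for the example; the actual generation logic lives in [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py).

```python
import random

# Hypothetical value pools: the real ones are built from INSEE lists
# and handcrafted resources, as described above.
POOLS = {
    "PRENOM": ["Camille", "Dominique", "Sacha"],
    "NOM": ["Martin", "Bernard", "Petit"],
    "VILLE": ["Lyon", "Nantes", "Lille"],
}


def draw(label, original):
    """Pick a surrogate value for an entity, falling back to the original."""
    if label == "TEL":
        return "0" + "".join(random.choices("0123456789", k=9))
    return random.choice(POOLS.get(label, [original]))


def augment(text, entities):
    """Replace each entity span with a surrogate value and recompute
    the character offsets of the resulting annotations."""
    parts, new_entities, cursor = [], [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        parts.append(text[cursor:ent["start"]])
        surrogate = draw(ent["label"], text[ent["start"]:ent["end"]])
        start = sum(len(p) for p in parts)
        parts.append(surrogate)
        new_entities.append(
            {"start": start, "end": start + len(surrogate), "label": ent["label"]}
        )
        cursor = ent["end"]
    parts.append(text[cursor:])
    return "".join(parts), new_entities


# Example on the template shown in the Format section.
augmented_text, augmented_entities = augment(
    "Jacques Chirac a été maire de Paris",
    [
        {"start": 0, "end": 7, "label": "PRENOM"},
        {"start": 8, "end": 14, "label": "NOM"},
        {"start": 30, "end": 35, "label": "VILLE"},
    ],
)
```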

### Template writing

The template writing process was done iteratively:

1. We wrote a few starting annotated samples that we added to the base template list.
2. We augmented the base templates following the process of the previous section.
3. We trained the model on this augmented dataset.
4. We evaluated the model on the internal training set (acting as a validation set).
5. We picked the examples with the worst performance, wrote fictitious snippets with similar grammatical and syntactic structures, and added them to the base template list.
6. At the same time, we improved the augmentation process to account for these errors.
7. We repeated the process starting from step 2 until we reached a satisfactory performance.
15 changes: 9 additions & 6 deletions docs/index.md
@@ -1,14 +1,17 @@
# Overview

EDS-Pseudo is a project aimed at detecting identifying entities in textual documents,
and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).
The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested
on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists in a
hybrid model (rule-based + deep learning) for which we provide rules [`eds_pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds-pseudo/pipes) and a training recipe [`scripts/train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py).
hybrid model (rule-based + deep learning) for which we provide
rules ([`eds-pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes))
and a training recipe [`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py).

We also provide a small set of fictive documents
[`data/gen_dataset/train.jsonl`](https://github.com/aphp/eds-pseudo/blob/main/data/gen_dataset/train.jsonl)
to test the method.
We also provide some fictitious
templates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a
script ([`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)) to
generate a synthetic dataset.

The entities that are detected are listed below.

4 changes: 1 addition & 3 deletions docs/inference.md
@@ -1,6 +1,4 @@
# Inference

## Parallelizing inference
# Parallelized Inference

When processing multiple documents, we can optimize the inference by parallelizing the
computation on multiple cores and GPUs.
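
As a baseline before parallelizing, a minimal single-process sketch could look like the following; it assumes the spaCy-like `nlp.pipe` batching API and the public pretrained model, while the multi-core and multi-GPU backends are described in the edsnlp documentation.

```python
import edsnlp

# Single-process baseline: batch the documents through the pipeline.
nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)

texts = [
    "Mme Dupont est née le 12/03/1984.",
    "Adresser le compte rendu au Dr Martin, Hôpital Saint-Louis.",
]

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
```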
47 changes: 47 additions & 0 deletions docs/pretrained.md
@@ -0,0 +1,47 @@
# Pretrained model

An even simpler option than the rule-based model, with better performance (although a bit more compute-intensive),
is to use the public pretrained model.
This model is available on the HuggingFace model hub at
[AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data described in the
[Dataset](/dataset) page. You can also test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.

## Installation

1. Install the latest version of edsnlp

```shell
pip install "edsnlp[ml]" -U
```

2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public)
3. Create and copy a token at [https://hf.co/settings/tokens?new_token=true](https://hf.co/settings/tokens?new_token=true)
4. Register the token (only once) on your machine

```python
import huggingface_hub

huggingface_hub.login(
    token=YOUR_TOKEN,
    new_session=False,
    add_to_git_credential=True,
)
```
5. Load the model

```python
import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)
for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
```
To apply the model in parallel on many documents using one or more GPUs, refer to the [Inference](/inference) page.
7 changes: 3 additions & 4 deletions docs/quickstart.md → docs/rule-based.md
@@ -1,4 +1,4 @@
# Quickstart
# Rule-based model

## Installation

@@ -18,10 +18,9 @@ poetry install
If you face issues with the installation, try to lower the maximum python version to
<= 3.10 (in `pyproject.toml`).

## Without machine learning
## Rule-based model definition

If you do not have a labelled dataset, you can still use the rule-based components of the
model.
A simple option is to use the rule-based components of the model.

```python
import edsnlp
1 change: 1 addition & 0 deletions docs/synthetic-dataset.md
@@ -0,0 +1 @@
# Synthetic Dataset
20 changes: 12 additions & 8 deletions docs/training.md
@@ -1,4 +1,7 @@
# Training
# Training a custom model

If neither the rule-based model nor the public model is sufficient for your needs, you can
train your own model. This section will guide you through the process.

## Requirements

@@ -11,9 +14,8 @@ To train a model, you will need to provide:
In any case, you will need to modify the
[configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) file to
reflect these changes. This configuration already contains the rule-based components of
the previous section, feel free to add or remove them as you see fit. You may also want
to modify the [pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) file to change the name of packaged model
(defaults to `eds-pseudo-aphp`).
the previous section; feel free to add or remove them as you see fit. The [configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) file also contains
the name of the packaged model in the `[package]` section (defaults to `eds-pseudo-public`).

## DVC

@@ -49,10 +51,10 @@ dvc repro
python scripts/package.py
```

You should now be able to install and publish it:
You should now be able to install and use it:

```{: .shell data-md-color-scheme="slate" }
pip install dist/eds_pseudo_aphp-0.3.0-*
pip install dist/eds_pseudo_your_eds-0.3.0-*
```

## Use it
@@ -62,10 +64,10 @@ To test it, execute
=== "Loading the packaged model"

```python
import eds_pseudo_aphp
import eds_pseudo_your_eds

# Load the model
nlp = eds_pseudo_aphp.load()
nlp = eds_pseudo_your_eds.load()
```

=== "Loading from the folder"
@@ -106,3 +108,5 @@ existing_nlp = ...

existing_nlp.add_pipe(nlp.get_pipe("ner"), name="ner")
```

To apply the model in parallel on many documents using one or more GPUs, refer to the [Inference](/inference) page.
6 changes: 4 additions & 2 deletions mkdocs.yml
@@ -28,11 +28,13 @@ theme:

nav:
- index.md
- quickstart.md
- Demo: https://eds-pseudo-public.streamlit.app" target="_blank
- dataset.md
- rule-based.md
- pretrained.md
- training.md
- inference.md
- reproducibility.md
- dataset.md
- results.md
- changelog.md

