diff --git a/README.md b/README.md
index 8e3b477..442a005 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@

 # EDS-Pseudo

-This project aims at detecting identifying entities documents, and was primarily tested
+The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested
 on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

 The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists in a
@@ -84,7 +84,7 @@ test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.
 ```

 To apply the model on many documents using one or more GPUs, refer to the documentation
-of [edsnlp](https://aphp.github.io/edsnlp/latest/tutorials/multiple-texts/).
+of [edsnlp](https://aphp.github.io/eds-pseudo/main/inference).

diff --git a/docs/assets/figures/data-augmentation.png b/docs/assets/figures/data-augmentation.png
new file mode 100644
index 0000000..8fb7bcc
Binary files /dev/null and b/docs/assets/figures/data-augmentation.png differ
diff --git a/docs/dataset.md b/docs/dataset.md
index b8f0912..6664873 100644
--- a/docs/dataset.md
+++ b/docs/dataset.md
@@ -2,19 +2,25 @@

 !!! warning "Disclaimer"

-    We do not provide the dataset due to privacy and regulatory constraints. You will
+    We do not provide our internal dataset due to privacy and regulatory constraints. You will
     however find the description of the dataset below. We also release the code for the
     rule-based annotation system.

-    You can find fictive data in the
-    [`data/gen_dataset`](https://github.com/aphp/eds-pseudo/tree/main/data/gen_dataset/)
-    folder to test the model.
+    You can find the fictive dataset generation description in the [synthetic dataset](#synthetic-dataset) section.

 ## Format

-By default, we expect the annotations to follow the format of the demo dataset [data/gen_dataset](https://github.com/aphp/eds-pseudo/tree/main/data/gen_dataset), but you can change the format by modifying the [config file](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg), and the "Datasets" part of it in particular, or the code of the [adapter](https://github.com/aphp/eds-pseudo/blob/main/eds_pseudo/adapter.py).
+We expect the annotations to be a jsonlines file with the following format:

-## Data Selection
+```json
+{ "note_id": "any-id-1", "note_text": "Jacques Chirac a été maire de Paris", "entities": [{"start": 0, "end": 7, ...}] }
+{ "note_id": "any-id-2", "note_text": "Elle est née en 2006", "entities": [{"start": 16, "end": 20, ...}] }
+...
+```
+
+but you can change the format by modifying the [config file](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg), and the "datasets" part of it in particular, or the code of the [adapter](https://github.com/aphp/eds-pseudo/blob/main/eds_pseudo/adapter.py) which is responsible for loading the data during the training and evaluation.
+
+## Internal Data Selection

 We annotated around 4000 documents, selected according to the distribution of AP-HP's
 Clinical Data Warehouse (CDW), to obtain a sample that is representative of the actual
@@ -54,7 +60,7 @@ We annotated clinical documents with the following entities :
 ## Statistics

 To inspect the statistics for the latest version of our dataset, please refer to the
-[latest release](/eds-pseudo/latest/dataset#statistics).
+[v0.2.0 release](/eds-pseudo/v0.2.0/dataset#statistics).
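
Review note: the jsonlines format introduced in the `docs/dataset.md` hunk can be sanity-checked before training with a few lines of Python. The sketch below is not part of the repository. The helper names `parse_notes` and `entity_texts` are hypothetical, the sample records keep only the `start` and `end` keys because the documentation elides the remaining entity fields, and the offsets are assumed to be end-exclusive character offsets, which is consistent with the two documented examples.

```python
import json


def parse_notes(lines):
    """Parse jsonlines records and check that every entity span fits its note.

    Hypothetical helper: only the keys shown in the documented format
    (note_id, note_text, entities, start, end) are relied upon.
    """
    notes = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        note = json.loads(line)
        text = note["note_text"]
        for entity in note["entities"]:
            start, end = entity["start"], entity["end"]
            # Assumption: character offsets, with an exclusive end
            if not 0 <= start < end <= len(text):
                raise ValueError(
                    f"line {line_number}: span ({start}, {end}) is outside the note text"
                )
        notes.append(note)
    return notes


def entity_texts(note):
    """Return the text covered by each entity of a parsed note."""
    return [note["note_text"][e["start"]:e["end"]] for e in note["entities"]]


sample = [
    '{"note_id": "any-id-1", "note_text": "Jacques Chirac a été maire de Paris", "entities": [{"start": 0, "end": 7}]}',
    '{"note_id": "any-id-2", "note_text": "Elle est née en 2006", "entities": [{"start": 16, "end": 20}]}',
]
notes = parse_notes(sample)
print([entity_texts(note) for note in notes])  # [['Jacques'], ['2006']]
```

Slicing the note text with each span is a cheap way to spot off-by-one offsets in an annotation export before it reaches the adapter.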