Questions About Example Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb #23

Open
JBarsotti opened this issue Jun 21, 2024 · 5 comments

Comments

@JBarsotti

JBarsotti commented Jun 21, 2024

This is an amazing module. Thanks for all your hard work.

I was working through the notebook notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb, and I have a couple of questions:

  1. Why do we need to retrain the modelpack on our own data? I've tried it without retraining, and it still seems to work okay. Am I missing something?
  2. I have access to the entire UMLS database. I tried to use that as my medpack model, but it doesn't seem to work with the code in notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb. Even on the simple example "This patient suffers from diabetes," it isn't able to recognize diabetes as an entity. When I run it on large clinical notes, many of the CUIs it returns do not map to preferred names; they are just listed as "Unknown." Any ideas?

Thanks for an awesome module! It really is great.

@mart-r
Collaborator

mart-r commented Jun 21, 2024

Hi,

To answer your questions:

  1. Retraining is necessary if the data you use the model on is in some way different from the data it was originally trained on. For instance, different hospitals/trusts may have different conventions on how to describe similar situations. So if the base model works well enough for you, then that's great - keep using it. But in general, in order to improve performance on a particular dataset, fine-tuning on that dataset - or another similar dataset - is needed.
  2. By "the entire UMLS database" do you perhaps mean the full UMLS model as distributed in the MedCAT README? I will assume that's what you meant. Unfortunately, the models we provide publicly are not guaranteed to be particularly performant. The full UMLS model is an example model. While it was trained (in a self-supervised capacity) on MIMIC-III (which undoubtedly has many, many references to diabetes), it has not received any validation on its performance. My best bet is that the model was unable to disambiguate the name and was thus unable to determine which concept was being referenced in the training data. UMLS is a massive ontology, and "diabetes" may refer to many different concepts. Due to the self-supervised nature of the training it received, the model was then unable to properly learn the name. But again, this is just speculation.
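One way to check the ambiguity hypothesis above is to count how many concepts a given name maps to in the model's concept database. The following is a minimal sketch, not MedCAT's own tooling: `candidate_cuis` is a hypothetical helper, the CUIs in the toy dictionary are made-up placeholders, and the assumption that `cat.cdb.name2cuis` keys are lower-cased with `~` joining the tokens should be verified against your MedCAT version.

```python
from typing import Dict, List


def candidate_cuis(name2cuis: Dict[str, List[str]], name: str,
                   separator: str = "~") -> List[str]:
    """Return the CUIs a surface name maps to (hypothetical helper).

    Assumes keys are lower-cased and tokens are joined by `separator`.
    """
    key = separator.join(name.lower().split())
    return list(name2cuis.get(key, []))


# Toy stand-in for cat.cdb.name2cuis; the CUIs are made-up placeholders.
toy_name2cuis = {
    "diabetes": ["CUI_A", "CUI_B", "CUI_C"],
    "type~2~diabetes": ["CUI_B"],
}

ambiguous = candidate_cuis(toy_name2cuis, "Diabetes")
print(len(ambiguous), "candidate concepts for 'diabetes'")

# With a real model pack (not run here; the path is a placeholder):
# from medcat.cat import CAT
# cat = CAT.load_model_pack("<path to model pack>")
# print(candidate_cuis(cat.cdb.name2cuis, "diabetes"))
```

If a name maps to many CUIs in your model, that would be consistent with the disambiguation explanation, though it would not prove it.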

@JBarsotti
Author

Thank you for the fast reply! Your responses make sense. One other question:

If I wanted to include ICD-10 codes as part of a model, is there a way to do that using the prebuilt models, or do I need a new one?

@mart-r
Collaborator

mart-r commented Jun 21, 2024

Some models have ICD-10 mappings baked into them. So you may be able to look up the CUIs in cat.cdb.addl_info['cui2icd10']. In fact, if you use CAT.get_entities, the default behaviour would be to make use of the ICD-10 mappings embedded in the CDB (if they exist).
A recognised entity in this case could look something like this:

{'pretty_name': 'Fever', 'cui': '386661006', 'type_ids': ['67667581'], 'types': ['finding'], 'source_value': 'fever', 'detected_name': 'fever', 'acc': 1.0, 'context_similarity': 1.0, 'start': 29, 'end': 34, 'icd10': ['R509', 'R508', 'R502', 'P819', 'P818', 'T670', 'O752', 'P810', 'O864'], 'ontologies': ['SNOMED-CT'], 'snomed': [], 'id': 2, 'meta_anns': {}}

If the ICD-10 mappings do not exist within a model pack, you would need to add them or map the SNOMED or UMLS term yourself.
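The lookup described above could be sketched roughly as follows. `icd10_codes` is a hypothetical helper, not part of MedCAT. It assumes the `{'entities': {id: {...}}}` result shape of CAT.get_entities, and it assumes `cui2icd10` maps each CUI to a list of code strings (some model packs may store richer records, so check yours first).

```python
from typing import Any, Dict, List


def icd10_codes(entity: Dict[str, Any],
                cui2icd10: Dict[str, List[str]]) -> List[str]:
    """Prefer codes already attached to the entity; otherwise fall back
    to the CUI -> ICD-10 mapping. Returns [] when neither has codes."""
    codes = entity.get("icd10") or cui2icd10.get(entity["cui"], [])
    return list(codes)


# Entity trimmed from the example in this thread.
result = {
    "entities": {
        2: {"pretty_name": "Fever", "cui": "386661006",
            "icd10": ["R509", "R508", "R502"]},
    }
}

# Stand-in for cat.cdb.addl_info.get("cui2icd10", {}) on a model pack
# that has no ICD-10 mappings embedded.
cui2icd10: Dict[str, List[str]] = {}

for ent in result["entities"].values():
    print(ent["pretty_name"], icd10_codes(ent, cui2icd10))
```

An empty list for every entity would suggest the model pack carries no ICD-10 mappings, in which case they would have to be added separately.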

@JBarsotti
Author

Thanks again for the reply! Is there a model out there that you would recommend that has ICD codes baked in?

@mart-r
Collaborator

mart-r commented Jun 21, 2024

I don't know off the top of my head. But the SNOMED model is more likely to have ICD-10 mappings, since we have built-in functionality for that within the SNOMED preprocessing script.
