Discrepancies between reports for images from the same study #4

Open
anicolson opened this issue Aug 9, 2024 · 0 comments

Hi,

The reports differ between images from the same study (in df_chexpert_plus_240401.csv):

>>> import pandas as pd
>>> df = pd.read_csv('/datasets/work/hb-mlaifsp-mm/work/archive/chexpertplus/df_chexpert_plus_240401.csv')
>>> def study_id(path):
...     parts = path.split('/')
...     return f'{parts[0]}_{parts[1]}_{parts[2]}'
... 
>>> df['study_id'] = df['path_to_image'].apply(study_id)
>>> unique_counts = df.groupby('study_id')['section_impression'].nunique()
>>> study_ids_with_descrepencies = unique_counts[unique_counts > 1].index
>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['path_to_image', 'section_impression']].head(10)
                                   path_to_image                                 section_impression
62   train/patient25098/study3/view2_lateral.jpg  \n1.  2021/2/6 QS/1 Data Systems 0951 HOURS AP...
65   train/patient25098/study3/view1_frontal.jpg  \n1.  3/18/2009 kollabio 0951 HOURS AP AND LAT...
91   train/patient15635/study4/view2_lateral.jpg  \nNO INTERVAL CHANGE SINCE 5/2/2000. AGAIN, TH...
93   train/patient15635/study4/view1_frontal.jpg  \nNO INTERVAL CHANGE SINCE 7-10-13. AGAIN, THE...
144  train/patient14693/study7/view1_frontal.jpg   \n \n1.   CARDIOMEGALY AND MILD PULMONARY EDE...
147  train/patient14693/study7/view2_lateral.jpg   \n \n1.   CARDIOMEGALY AND MILD PULMONARY EDE...
390  train/patient08385/study1/view1_frontal.jpg  \n \n1.  Interval resolution of RIGHT pleural ...
391  train/patient08385/study1/view2_lateral.jpg  \n \n1.  Interval resolution of RIGHT pleural ...
400  train/patient05959/study4/view1_frontal.jpg   \n \n1. FRACTURE OF THE POSTEROLATERAL LEFT S...
401  train/patient05959/study1/view1_frontal.jpg  \n1. EXTENSIVE PATCHY OPACITY THROUGH THE RIGH...
>>> len(df[df['study_id'].isin(study_ids_with_descrepencies)][['path_to_image', 'section_impression']])
11303

This occurs with other sections as well.
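
A rough sketch of how the same check could be run over every section at once. It assumes the report columns all share the `section_` prefix (only `section_impression` is confirmed above); the helper name `discrepant_studies_per_section` is hypothetical.

```python
import pandas as pd

def discrepant_studies_per_section(df, prefix='section_'):
    """Count, per section column, the studies whose images disagree."""
    # Same study key as above: the first three path components joined by '_'.
    study = (
        df['path_to_image'].str.split('/').str[:3].str.join('_')
        .rename('study_id')
    )
    # Assumption: every report section column starts with `prefix`.
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Number of distinct non-null texts per study and section.
    n_unique = df.groupby(study)[cols].nunique()
    # A study counts as discrepant for a section if it has more than one text.
    return (n_unique > 1).sum()
```

Calling `discrepant_studies_per_section(df)` returns one count per section column, which gives a quick view of which sections are affected.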

The differences appear to be limited to dates (plus some strings added around the dates) and phone numbers, e.g.:

>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[0].item()
'\n1.  2021/2/6 QS/1 Data Systems 0951 HOURS AP AND LATERAL VIEWS OF THE UPRIGHT CHEST\nREDEMONSTRATE RIGHT INTERNAL JUGULAR DOUBLE LUMEN CATHETER WITH TIP\nIN THE REGION OF THE CAVOATRIAL JUNCTION.\n2.  THERE HAS BEEN INTERVAL DEVELOPMENT OF BILATERAL LEFT GREATER\nTHAN RIGHT PLEURAL EFFUSIONS.  THERE ARE ALSO INCREASED BIBASILAR\nOPACITIES WHICH ARE LIKELY COMPRESSIVE ATELECTASIS AS A RESULT OF THE\nEFFUSIONS.  THESE FINDINGS ARE RELATIVELY ACUTE ONSET,  AND SUGGEST\nPULMONARY EDEMA, ALTHOUGH INFECTION CANNOT BE EXCLUDED.\n'
>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[1].item()
'\n1.  3/18/2009 kollabio 0951 HOURS AP AND LATERAL VIEWS OF THE UPRIGHT CHEST\nREDEMONSTRATE RIGHT INTERNAL JUGULAR DOUBLE LUMEN CATHETER WITH TIP\nIN THE REGION OF THE CAVOATRIAL JUNCTION.\n2.  THERE HAS BEEN INTERVAL DEVELOPMENT OF BILATERAL LEFT GREATER\nTHAN RIGHT PLEURAL EFFUSIONS.  THERE ARE ALSO INCREASED BIBASILAR\nOPACITIES WHICH ARE LIKELY COMPRESSIVE ATELECTASIS AS A RESULT OF THE\nEFFUSIONS.  THESE FINDINGS ARE RELATIVELY ACUTE ONSET,  AND SUGGEST\nPULMONARY EDEMA, ALTHOUGH INFECTION CANNOT BE EXCLUDED.\n'

Another example:

>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[6].item()
'\n \n1.  Interval resolution of RIGHT pleural effusion without active \ndisease in the chest.\n \n"Physician to Physician Radiology Consult Line: (916) 919-2522"\n \n'
>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[7].item()
'\n \n1.  Interval resolution of RIGHT pleural effusion without active \ndisease in the chest.\n \n"Physician to Physician Radiology Consult Line: (616) 985-3791"\n \n'

I was wondering whether this is due to the de-identification process used, i.e., have the dates and phone numbers been replaced with random versions? If so, can we ignore these differences?
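
One way to test this hypothesis would be to mask dates and phone numbers before comparing. The sketch below is only illustrative: the two regular expressions and the `mask_surrogates` helper are assumptions that cover the formats seen in the examples above, not a complete de-identification pattern set.

```python
import re

# Illustrative patterns only: they match the formats seen in the examples
# (2021/2/6, 3/18/2009, 7-10-13, (916) 919-2522), not every possible case.
PHONE_RE = re.compile(r'\(\d{3}\)\s*\d{3}-\d{4}')
DATE_RE = re.compile(r'\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b')

def mask_surrogates(text):
    """Replace phone numbers and numeric dates with fixed placeholders."""
    text = PHONE_RE.sub('<PHONE>', text)
    return DATE_RE.sub('<DATE>', text)

a = '"Physician to Physician Radiology Consult Line: (916) 919-2522"'
b = '"Physician to Physician Radiology Consult Line: (616) 985-3791"'
print(a == b)                                    # False
print(mask_surrogates(a) == mask_surrogates(b))  # True
```

Applying it with `df['section_impression'].fillna('').map(mask_surrogates)` before the `nunique()` check would show how many studies still disagree after masking. The strings added around the dates (e.g. `QS/1 Data Systems` vs. `kollabio`) are not covered by these patterns, so those studies would still be flagged.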

Thanks,
A.
