Discrepancies between reports for images from the same study #4

anicolson · 2024-08-09T05:12:50Z

Hi,

In df_chexpert_plus_240401.csv, images from the same study sometimes have different report text:

>>> import pandas as pd
>>> df = pd.read_csv('/datasets/work/hb-mlaifsp-mm/work/archive/chexpertplus/df_chexpert_plus_240401.csv')
>>> def study_id(path):
...     parts = path.split('/')
...     return f'{parts[0]}_{parts[1]}_{parts[2]}'
... 
>>> df['study_id'] = df['path_to_image'].apply(study_id)
>>> unique_counts = df.groupby('study_id')['section_impression'].nunique()
>>> study_ids_with_descrepencies = unique_counts[unique_counts > 1].index
>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['path_to_image', 'section_impression']].head(10)
                                   path_to_image                                 section_impression
62   train/patient25098/study3/view2_lateral.jpg  \n1.  2021/2/6 QS/1 Data Systems 0951 HOURS AP...
65   train/patient25098/study3/view1_frontal.jpg  \n1.  3/18/2009 kollabio 0951 HOURS AP AND LAT...
91   train/patient15635/study4/view2_lateral.jpg  \nNO INTERVAL CHANGE SINCE 5/2/2000. AGAIN, TH...
93   train/patient15635/study4/view1_frontal.jpg  \nNO INTERVAL CHANGE SINCE 7-10-13. AGAIN, THE...
144  train/patient14693/study7/view1_frontal.jpg   \n \n1.   CARDIOMEGALY AND MILD PULMONARY EDE...
147  train/patient14693/study7/view2_lateral.jpg   \n \n1.   CARDIOMEGALY AND MILD PULMONARY EDE...
390  train/patient08385/study1/view1_frontal.jpg  \n \n1.  Interval resolution of RIGHT pleural ...
391  train/patient08385/study1/view2_lateral.jpg  \n \n1.  Interval resolution of RIGHT pleural ...
400  train/patient05959/study4/view1_frontal.jpg   \n \n1. FRACTURE OF THE POSTEROLATERAL LEFT S...
401  train/patient05959/study1/view1_frontal.jpg  \n1. EXTENSIVE PATCHY OPACITY THROUGH THE RIGH...
>>> len(df[df['study_id'].isin(study_ids_with_descrepencies)][['path_to_image', 'section_impression']])
11303

This occurs with other sections as well.
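To put a number on this for several sections at once, something like the sketch below could be used. It is only an illustration: `discrepancy_counts` is a hypothetical helper, `section_findings` is an assumed column name (only `section_impression` is shown above), and the toy frame stands in for the real CSV.

```python
import pandas as pd

def discrepancy_counts(df, section_columns):
    """Count, per section column, the studies with more than one distinct text."""
    # Same study id as above: the first three path components joined by '_'.
    study_id = df['path_to_image'].str.split('/').str[:3].str.join('_')
    counts = {}
    for col in section_columns:
        n_unique = df.groupby(study_id)[col].nunique()
        counts[col] = int((n_unique > 1).sum())
    return counts

# Toy data for illustration only; 'section_findings' is an assumed column name.
toy = pd.DataFrame({
    'path_to_image': [
        'train/patient1/study1/view1_frontal.jpg',
        'train/patient1/study1/view2_lateral.jpg',
        'train/patient2/study1/view1_frontal.jpg',
    ],
    'section_impression': ['A 1/1/2000', 'A 2/2/2001', 'B'],
    'section_findings': ['X', 'X', 'Y'],
})
print(discrepancy_counts(toy, ['section_impression', 'section_findings']))
```

On the real file, `section_columns` would be whichever `section_*` columns the CSV actually contains.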

The differences seem to be limited to dates (plus some strings inserted next to them) and phone numbers. For example:

>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[0].item()
'\n1.  2021/2/6 QS/1 Data Systems 0951 HOURS AP AND LATERAL VIEWS OF THE UPRIGHT CHEST\nREDEMONSTRATE RIGHT INTERNAL JUGULAR DOUBLE LUMEN CATHETER WITH TIP\nIN THE REGION OF THE CAVOATRIAL JUNCTION.\n2.  THERE HAS BEEN INTERVAL DEVELOPMENT OF BILATERAL LEFT GREATER\nTHAN RIGHT PLEURAL EFFUSIONS.  THERE ARE ALSO INCREASED BIBASILAR\nOPACITIES WHICH ARE LIKELY COMPRESSIVE ATELECTASIS AS A RESULT OF THE\nEFFUSIONS.  THESE FINDINGS ARE RELATIVELY ACUTE ONSET,  AND SUGGEST\nPULMONARY EDEMA, ALTHOUGH INFECTION CANNOT BE EXCLUDED.\n'
>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[1].item()
'\n1.  3/18/2009 kollabio 0951 HOURS AP AND LATERAL VIEWS OF THE UPRIGHT CHEST\nREDEMONSTRATE RIGHT INTERNAL JUGULAR DOUBLE LUMEN CATHETER WITH TIP\nIN THE REGION OF THE CAVOATRIAL JUNCTION.\n2.  THERE HAS BEEN INTERVAL DEVELOPMENT OF BILATERAL LEFT GREATER\nTHAN RIGHT PLEURAL EFFUSIONS.  THERE ARE ALSO INCREASED BIBASILAR\nOPACITIES WHICH ARE LIKELY COMPRESSIVE ATELECTASIS AS A RESULT OF THE\nEFFUSIONS.  THESE FINDINGS ARE RELATIVELY ACUTE ONSET,  AND SUGGEST\nPULMONARY EDEMA, ALTHOUGH INFECTION CANNOT BE EXCLUDED.\n'

Another example:

>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[6].item()
'\n \n1.  Interval resolution of RIGHT pleural effusion without active \ndisease in the chest.\n \n"Physician to Physician Radiology Consult Line: (916) 919-2522"\n \n'
>>> df[df['study_id'].isin(study_ids_with_descrepencies)][['section_impression']].iloc[7].item()
'\n \n1.  Interval resolution of RIGHT pleural effusion without active \ndisease in the chest.\n \n"Physician to Physician Radiology Consult Line: (616) 985-3791"\n \n'

Is this due to the de-identification process, i.e., have the dates and phone numbers been replaced with random versions? If so, can we ignore these differences?
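A rough way to test this hypothesis would be to mask dates and phone numbers and then re-check which studies still differ. The sketch below is only an assumption about what such a check could look like: the regexes cover just the date and phone formats seen in the examples above, and `mask_identifiers` and `studies_still_differing` are hypothetical helper names.

```python
import re
import pandas as pd

# Patterns cover only the formats seen above (e.g. 2021/2/6, 3/18/2009, 7-10-13)
# and US-style phone numbers such as (916) 919-2522.
DATE_RE = re.compile(r'\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b')
PHONE_RE = re.compile(r'\(\d{3}\)\s*\d{3}-\d{4}')

def mask_identifiers(text):
    """Replace dates and phone numbers with fixed placeholders."""
    if not isinstance(text, str):
        return text  # leave missing values untouched
    text = PHONE_RE.sub('<PHONE>', text)
    return DATE_RE.sub('<DATE>', text)

def studies_still_differing(df, column='section_impression'):
    """Return study ids whose texts still differ after masking."""
    study_id = df['path_to_image'].str.split('/').str[:3].str.join('_')
    masked = df[column].map(mask_identifiers)
    n_unique = masked.groupby(study_id).nunique()
    return sorted(n_unique[n_unique > 1].index)

# Toy frame built from shortened versions of the examples above.
toy = pd.DataFrame({
    'path_to_image': [
        'train/patient15635/study4/view2_lateral.jpg',
        'train/patient15635/study4/view1_frontal.jpg',
        'train/patient08385/study1/view1_frontal.jpg',
        'train/patient08385/study1/view2_lateral.jpg',
        'train/patient25098/study3/view2_lateral.jpg',
        'train/patient25098/study3/view1_frontal.jpg',
    ],
    'section_impression': [
        'NO INTERVAL CHANGE SINCE 5/2/2000.',
        'NO INTERVAL CHANGE SINCE 7-10-13.',
        'Consult Line: (916) 919-2522',
        'Consult Line: (616) 985-3791',
        '2021/2/6 QS/1 Data Systems 0951 HOURS',
        '3/18/2009 kollabio 0951 HOURS',
    ],
})
print(studies_still_differing(toy))
```

Studies that differ only in dates or phone numbers drop out after masking, while ones with extra inserted strings (such as "QS/1 Data Systems" versus "kollabio") would still be reported and need a separate look.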

Thanks,
A.
