EMR-data-science: Introduction to Data Science with Simulated Electronic Medical Record Data

Sanity checking skills for clinical informatics

This is a collection of open source educational materials (mostly Databricks notebooks) that introduce fundamental data science concepts to a clinical audience. We focus on exploratory analysis, visualization, and interpretable machine learning (ML) models, on the assumption that these skills are particularly useful to clinician data scientists who plan and oversee research and need to sanity check its findings.

The data for these exercises was generated by Synthea using the standard collection of modules. ML is most useful when classifications or predictions of outcomes must be made on the basis of many weak associations (if a small number of strong associations suffices, you probably don't need ML). Unfortunately, Synthea data often lacks the subtle statistical relationships among variables that would make for compelling machine learning demonstrations. Some associations are missing from the simulation entirely, while others are implausibly strong. As a result, some outcomes are impossible to predict, while others can be predicted with far too much certainty.
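As a rough illustration of what "too strong" or "missing" can look like in numbers, the sketch below computes the phi coefficient (the Pearson correlation of two binary variables) from a 2x2 table of feature versus outcome counts. The function names and the `weak` and `strong` thresholds are assumptions made for this example, not values taken from the notebooks, and a near-zero association is only "conspicuously absent" when clinical knowledge says one should exist.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 table of counts.

    a: feature present, outcome present
    b: feature present, outcome absent
    c: feature absent,  outcome present
    d: feature absent,  outcome absent
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        # One of the variables is constant, so no association is measurable.
        return 0.0
    return (a * d - b * c) / denom


def triage(phi, weak=0.05, strong=0.8):
    """Label an association using illustrative (hypothetical) thresholds."""
    strength = abs(phi)
    if strength >= strong:
        return "suspiciously strong"
    if strength <= weak:
        return "conspicuously absent"
    return "plausible"


# Made-up counts for 1,000 patients, for illustration only.
examples = {
    "deterministic": (50, 0, 0, 950),   # feature and outcome always coincide
    "independent": (5, 45, 95, 855),    # outcome rate is 10% with or without the feature
    "moderate": (20, 30, 80, 870),      # feature raises the outcome rate somewhat
}
for name, table in examples.items():
    phi = phi_coefficient(*table)
    print(f"{name}: phi={phi:.3f} -> {triage(phi)}")
```

A feature that lands in the "suspiciously strong" bucket is a candidate for leakage from how the simulation was written, so it is worth inspecting before it goes into a model.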

However, the same statistically implausible relationships that make it difficult to demonstrate ML on this data also make it a treasure trove for sanity checking! Clinicians will easily be able to identify associations between disorders, treatments, observations, and patient characteristics that are either suspiciously strong or conspicuously absent.
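One way to hunt for such associations is to score how often pairs of conditions occur in the same patient. The following is a minimal sketch in plain Python, assuming the data has already been reduced to a mapping from patient ID to a set of condition names. The function name, the toy patients, and the choice of Jaccard similarity and lift as metrics are assumptions for illustration and may differ from what the co-occurrence notebook computes.

```python
from itertools import combinations


def cooccurrence_metrics(patient_conditions):
    """Return count, Jaccard similarity, and lift for each co-occurring pair.

    patient_conditions: dict mapping patient ID -> set of condition names.
    Pairs that never occur together in any patient get no entry.
    """
    n_patients = len(patient_conditions)
    single_counts = {}
    pair_counts = {}
    for conditions in patient_conditions.values():
        for condition in conditions:
            single_counts[condition] = single_counts.get(condition, 0) + 1
        # Sort so each unordered pair is always stored under the same key.
        for pair in combinations(sorted(conditions), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    metrics = {}
    for (x, y), both in pair_counts.items():
        metrics[(x, y)] = {
            "count": both,
            # Share of patients with either condition who have both.
            "jaccard": both / (single_counts[x] + single_counts[y] - both),
            # Observed co-occurrence relative to what independence predicts.
            "lift": (both * n_patients) / (single_counts[x] * single_counts[y]),
        }
    return metrics


# Toy data, invented for this example.
patients = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes", "hypertension"},
    "p3": {"hypertension"},
    "p4": {"asthma"},
}
for pair, scores in cooccurrence_metrics(patients).items():
    print(pair, scores)
```

A lift far above 1 flags a pair that occurs together much more often than chance would predict. A clinically expected pair with no entry at all is the "conspicuously absent" case.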

After negotiating some potential pitfalls, we identify a set of features that are correlated (but not too strongly) with a clinical outcome, which lets us demonstrate a machine learning classifier. The model we use is an Explainable Boosting Machine (EBM), a form of generalized additive model that comes with its own visualization tools for understanding the contribution of each feature to the prediction.
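To make the "generalized additive" idea concrete, the sketch below scores one patient with a hand-written additive model: each feature contributes a value looked up from its own piecewise-constant shape function, the contributions are summed with an intercept, and a logistic function turns the sum into a probability. This is not the EBM training algorithm and does not use the EBM library. The feature names, bin edges, scores, and intercept are all invented for illustration.

```python
import bisect
import math


def shape_value(bin_edges, bin_scores, x):
    """Piecewise-constant shape function: return the score of the bin holding x.

    bin_edges must be sorted, and bin_scores needs len(bin_edges) + 1 entries.
    """
    return bin_scores[bisect.bisect_right(bin_edges, x)]


def explain_prediction(intercept, shapes, row):
    """Return (probability, per-feature contributions) for one row."""
    contributions = {
        name: shape_value(edges, scores, row[name])
        for name, (edges, scores) in shapes.items()
    }
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


# Hypothetical shape functions: feature -> (bin edges, score per bin).
shapes = {
    "age": ([40, 60], [-0.5, 0.0, 0.75]),
    "bmi": ([25, 30], [-0.25, 0.0, 0.5]),
}
intercept = -1.0

probability, contributions = explain_prediction(
    intercept, shapes, {"age": 65, "bmi": 27}
)
print(f"probability={probability:.3f}")
for name, value in contributions.items():
    print(f"  {name}: {value:+.2f}")
```

Because the prediction is a plain sum, each feature's contribution can be read off and plotted on its own, which is what makes this family of models convenient for sanity checking.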

These are the HTML versions of the notebooks:

Co-occurrence plots, using various metrics:

Sample Data

The 'sample_data.zip' archive contains CSV files copied from the "Synthetic Mass" 1k patient sample.

This dataset is described in this reference:

Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. 
Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. 
Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
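
For a quick look at the archive outside Databricks, the CSV files can be read straight from the zip with the Python standard library. This is a minimal sketch, not the loading procedure used in the workshop notebooks. The helper names are hypothetical, and the member names inside the archive should be taken from the listing rather than assumed.

```python
import csv
import io
import zipfile


def list_csv_members(zip_source):
    """Return the names of the CSV files inside a zip archive.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [name for name in archive.namelist() if name.lower().endswith(".csv")]


def read_csv_member(zip_source, member, limit=None):
    """Read one CSV member into a list of dicts, optionally only the first rows."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = []
            for index, row in enumerate(csv.DictReader(text)):
                if limit is not None and index >= limit:
                    break
                rows.append(row)
            return rows


# Example usage (commented out because it needs the archive on disk):
# for name in list_csv_members("sample_data.zip"):
#     print(name, read_csv_member("sample_data.zip", name, limit=2))
```

Passing `limit` keeps the preview cheap for the larger tables, since the loop stops after the requested number of rows.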

Workshop Instructions

The workshop instructions are in the ML_with_simulated_EMR.pptx file; see Part 0: Setting up Databricks.

