This is a collection of open-source educational materials (mostly Databricks notebooks) for introducing fundamental concepts of data science to a clinical audience. We focus on exploratory analysis, visualization, and interpretable machine learning (ML) models, on the assumption that these skills will be particularly useful to clinician data scientists who plan and oversee research and need to sanity-check its findings.
The data for these exercises was generated by Synthea using the standard collection of modules. ML is most useful in situations where classifications or predictions of outcomes must be made on the basis of many weak associations (if they can be made based on a small number of strong associations, you probably don't need ML). Unfortunately, Synthea data often lacks the subtle statistical relationships among variables that would make for compelling machine learning demonstrations. Sometimes an association that would be expected in real data was never included in the simulation, and sometimes an association is unrealistically strong. As a result, some outcomes are impossible to predict, while others can be predicted with far too much certainty.
However, the same statistically inappropriate relationships that make it difficult to demonstrate ML on this data also make it a treasure trove for sanity checking! Clinicians will easily be able to identify associations between disorders, treatments, observations, and patient characteristics that are either suspiciously strong or conspicuously absent.
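As a rough illustration of this kind of sanity check, the sketch below scores how often pairs of conditions occur in the same patient. It uses lift as the metric, which is an assumption made here for illustration and not necessarily one of the metrics used in the notebooks; the function name `cooccurrence_lift` and the toy patient data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_lift(patient_conditions):
    """Return {(a, b): lift} for every pair of conditions seen in the same patient.

    lift = P(a and b) / (P(a) * P(b)). A value near 1.0 means the pair
    co-occurs about as often as independence would predict; a much larger
    value flags a suspiciously strong association.
    """
    n_patients = len(patient_conditions)
    single = Counter()  # number of patients with each condition
    pair = Counter()    # number of patients with both conditions of a pair
    for conditions in patient_conditions.values():
        unique = sorted(set(conditions))  # count each condition once per patient
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): (count / n_patients)
        / ((single[a] / n_patients) * (single[b] / n_patients))
        for (a, b), count in pair.items()
    }


# Hypothetical toy data: patient id -> list of recorded conditions.
toy_patients = {
    "p1": ["diabetes", "hypertension"],
    "p2": ["diabetes", "hypertension"],
    "p3": ["hypertension"],
    "p4": ["sprain"],
}
lift = cooccurrence_lift(toy_patients)
```

Pairs that never occur together do not appear in the result at all, so a conspicuously absent association shows up as a missing key rather than as a low score.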
After negotiating some potential pitfalls, we are able to identify a set of features correlated (but not too strongly correlated) with a clinical outcome, which lets us demonstrate a machine learning classifier. The model we use is an Explainable Boosting Machine (EBM), a form of generalized additive model that comes with its own visualization tools for understanding the contribution of each feature to the prediction.
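The feature selection step described above could be sketched as a simple correlation screen. This is a minimal illustration, not the procedure used in the notebooks: the choice of Pearson correlation, the `low` and `high` thresholds, the function names, and the toy feature columns are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def screen_features(features, outcome, low=0.1, high=0.9):
    """Keep features whose |correlation| with the outcome lies in [low, high].

    The thresholds are illustrative: below `low` the feature carries little
    signal, above `high` it is suspiciously close to a copy of the outcome.
    """
    kept = {}
    for name, values in features.items():
        r = pearson(values, outcome)
        if low <= abs(r) <= high:
            kept[name] = r
    return kept


# Hypothetical toy data: one binary outcome and three candidate features.
outcome = [1, 1, 1, 0, 0, 0]
features = {
    "leaky_copy": [1, 1, 1, 0, 0, 0],  # identical to the outcome
    "useful": [1, 1, 0, 1, 0, 0],      # moderately associated
    "constant": [0, 0, 0, 0, 0, 0],    # carries no information
}
kept = screen_features(features, outcome)
```

Here only `useful` survives the screen: `leaky_copy` is perfectly correlated with the outcome and `constant` is uncorrelated with it.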
These are the HTML versions of the notebooks:
Co-occurrence plots, using various metrics:
The 'sample_data.zip' archive contains CSV files copied from the "Synthetic Mass" 1k patient sample.
This dataset is described in this reference:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J.
Synthea™ Novel coronavirus (COVID-19) model and synthetic data set.
Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
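Outside Databricks, the CSV files in the 'sample_data.zip' archive can be inspected with the Python standard library alone. The sketch below is one way to do that; the helper names are hypothetical, and the member names inside the archive are not assumed, so list them first.

```python
import csv
import io
import zipfile


def list_csv_members(zip_path):
    """Return the sorted names of all CSV files inside the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )


def read_csv_member(zip_path, member):
    """Read one CSV file from the archive into a list of dicts keyed by column name."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Assumes the files are UTF-8 encoded with a header row.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

For example, `list_csv_members("sample_data.zip")` shows which tables are available, and any of the returned names can then be passed to `read_csv_member`.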
The workshop instructions are in the ML_with_simulated_EMR.pptx file; see Part 0: Setting up Databricks.