Computational Notebook for "Segmentation-free Characterization of Cell Types and Microscale Tissue Assemblies in Human Colorectal Cancer with Variational Autoencoders"
Gregory J. Baker1,2,3,*,#, Edward Novikov1,4,*, Yu-An Chen1,2, Clemens B. Hug1, Sebastián A. Cajas Ordóñez4, Siyu Huang4, Clarence Yapp1,5, Shannon Coy1,2,6, Hanspeter Pfister4, Artem Sokolov1,7, Peter K. Sorger1,2,3,#
1Laboratory of Systems Pharmacology, Program in Therapeutic Science, Harvard Medical School, Boston, MA
2Ludwig Center for Cancer Research at Harvard, Harvard Medical School, Boston, MA
3Department of Systems Biology, Harvard Medical School, Boston, MA
4Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA
5Image and Data Analysis Core, Harvard Medical School, Boston, MA
6Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA
*Co-first Authors: G.J.B., E.N.
#Corresponding Authors: [email protected] (G.J.B.), [email protected] (P.K.S)
A detailed characterization of human tissue organization and understanding of how multiscale histological structures differ in response to disease and therapy can serve as important biomarkers of disease progression and therapeutic response. Although highly multiplex images of tissue contain detailed information on the abundance and distribution of proteins within and across cells, their analysis via segmentation-based methods captures little morphological information, is influenced by signal contamination across segmentation boundaries, and requires custom algorithms to study multi-cellular tissue architectures. Here we classify individual cell states and recurrent microscale tissue motifs in human colorectal adenocarcinoma by training a class of generative neural networks (variational autoencoders, VAEs) on multi-scale image patches derived from whole-slide imaging data. Our work demonstrates how unsupervised computer vision can be used to characterize cells and their higher-order structural assemblies in a manner that simultaneously accounts for protein abundance and spatial distribution while overcoming limitations of segmentation-based analysis.
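As background for readers of the notebooks, the sketch below illustrates the general variational autoencoder technique (an encoder producing a latent mean and log-variance, the reparameterization trick, and a reconstruction-plus-KL loss) applied to multi-channel image patches. It is a minimal PyTorch illustration with placeholder class and variable names and placeholder channel, patch-size, and latent dimensions; it is not the study's actual architecture or training code, which is available in the VAE analysis pipeline repository referenced below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchVAE(nn.Module):
    """Minimal VAE for multi-channel image patches (illustrative only)."""
    def __init__(self, n_channels=20, patch_size=32, latent_dim=16):
        super().__init__()
        flat = n_channels * patch_size * patch_size
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(flat, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, flat), nn.Sigmoid(),
        )
        self.shape = (n_channels, patch_size, patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width) image patches scaled to [0, 1]
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        x_hat = self.decoder(z).view(-1, *self.shape)
        return x_hat, mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld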
The Python code (i.e., Jupyter Notebooks) in this GitHub repository was used to generate the figures in the aforementioned study. To run the code, first clone this repo onto your computer, then download the required input data files from the Sage Bionetworks Synapse data repository into the top level of the cloned repo. Next, change directories into the top level of the cloned repo, then create and activate a dedicated Conda environment containing the necessary Python libraries by running the following commands:
cd <path/to/this/repo>
conda env create -f environment.yml
conda activate vae-paper
Run the computational notebook in JupyterLab with this command:
jupyter lab
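Alternatively, a notebook can be executed non-interactively from the command line with nbconvert; the filename below is a placeholder for whichever notebook in this repository you want to run:

jupyter nbconvert --to notebook --execute --inplace example_figure.ipynb  # placeholder filename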
Source code for the VAE analysis pipeline used in this study is freely available and archived on GitHub.
New data associated with this paper are available at the HTAN Data Portal. The input data required to run the source code in this repository are freely available at Sage Synapse.
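Input files hosted on Synapse can also be fetched programmatically with the synapseclient Python package; the sketch below is a hypothetical example and the Synapse ID shown is a placeholder, not the study's actual accession:

import synapseclient

syn = synapseclient.Synapse()
syn.login()  # requires a free Synapse account or personal access token
entity = syn.get("syn00000000", downloadLocation=".")  # placeholder Synapse ID
print(f"Downloaded {entity.name} to {entity.path}")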
This work was supported by Ludwig Cancer Research and the Ludwig Center at Harvard (P.K.S., S.S.) and by NIH NCI grants U54-CA225088, U2C-CA233280, and U2C-CA233262 (P.K.S., S.S.). S.S. is supported by the BWH President’s Scholars Award.
The Python code (i.e., Jupyter Notebooks) in this GitHub repository is archived on Zenodo at