Output Data Structure

Original Pipeline Output

For the cell-gene clustering model, the pipeline returns the following variables:

Variable	Description
z	Gene cluster assignments from the final iteration of Gibbs sampling
complete.z	History of gene cluster assignments across all iterations of Gibbs sampling
z.stability	[0,1] measure of stability for the gene clustering chain
complete.z.stability	History of z.stability over all iterations of Gibbs sampling
z.prob	Probability of each cluster assignment
y	Cell cluster assignments from the final iteration of Gibbs sampling
complete.y	History of cell cluster assignments across all iterations of Gibbs sampling
y.stability	[0,1] measure of stability for the cell clustering chain
complete.y.stability	History of y.stability over all iterations of Gibbs sampling
completeLogLik	Log-likelihood of all gene and cell cluster assignments over all iterations of Gibbs sampling
finalLogLik	Log-likelihood of final gene and cell cluster assignments

Object Design for Output

Ideally, what we get back should contain a list of mcmclist objects, each holding mcmc objects. Each list item is an mcmclist for one of the models that was run (e.g. k=4, l=10 versus k=5, l=10), and the mcmc objects in that mcmclist contain information on the Gibbs sampling for each chain in that model (phew!). The output should also contain the parameters used to run celda, as well as additional information on model performance (the complete log-likelihood, etc.).
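To make the nesting concrete, here is a minimal sketch of the "list of models, each with a list of chains" layout. It is written in Python purely for illustration and is not celda's actual return type. The class names (`Chain`, `ModelRun`, `CeldaOutput`), the field names, the `iterations` parameter, and all values are hypothetical; only the idea of per-iteration histories, final assignments, run parameters, and log-likelihoods comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chain:
    """One Gibbs sampling chain; each history list has one entry per iteration."""
    complete_z: List[List[int]]    # gene cluster assignments at each iteration
    complete_y: List[List[int]]    # cell cluster assignments at each iteration
    complete_log_lik: List[float]  # log-likelihood at each iteration

    @property
    def z(self) -> List[int]:
        # Final gene cluster assignments are the last entry of the history.
        return self.complete_z[-1]

    @property
    def y(self) -> List[int]:
        # Final cell cluster assignments are the last entry of the history.
        return self.complete_y[-1]

    @property
    def final_log_lik(self) -> float:
        return self.complete_log_lik[-1]


@dataclass
class ModelRun:
    """All chains for one model configuration (hypothetical stand-in for an mcmclist)."""
    k: int
    l: int
    chains: List[Chain] = field(default_factory=list)


@dataclass
class CeldaOutput:
    """Top-level result: run parameters plus one ModelRun per model."""
    run_params: Dict[str, Any]
    models: List[ModelRun] = field(default_factory=list)


# Toy values only: two iterations, three genes, two cells.
chain = Chain(
    complete_z=[[1, 2, 2], [1, 1, 2]],
    complete_y=[[1, 2], [2, 2]],
    complete_log_lik=[-120.5, -98.25],
)
output = CeldaOutput(
    run_params={"iterations": 2},  # hypothetical parameter name
    models=[ModelRun(k=4, l=10, chains=[chain])],
)

first_chain = output.models[0].chains[0]
print(first_chain.z, first_chain.final_log_lik)
```

Deriving the final assignments from the history, rather than storing them separately, keeps the two from drifting apart; whether that trade-off suits the real object is an open design choice.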

There are a couple of gotchas to note up front that will make designing a well-behaved data structure annoying:

Different models have different outputs. The gene, cell, and gene*cell clustering models all behave differently, with the last one returning clustering information in both gene and cell space. I'm not 100% sure of the best way to structure the output given that.
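One possible way to handle this, sketched here in Python only as an illustration and not as a settled design, is a single result type in which the gene-space (`z`) and cell-space (`y`) fields are optional and each model fills in only what it produces. The class name `ClusteringResult`, the `model` labels, and the `spaces` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusteringResult:
    """Result for one model; fields a model does not produce stay None."""
    model: str                     # hypothetical label: "gene", "cell", or "gene*cell"
    z: Optional[List[int]] = None  # gene cluster assignments, if any
    y: Optional[List[int]] = None  # cell cluster assignments, if any

    def spaces(self) -> List[str]:
        """Report which clustering spaces this result actually contains."""
        found = []
        if self.z is not None:
            found.append("gene")
        if self.y is not None:
            found.append("cell")
        return found


# Toy values only.
gene_only = ClusteringResult(model="gene", z=[1, 1, 2])
both = ClusteringResult(model="gene*cell", z=[1, 1, 2], y=[2, 1])

print(gene_only.spaces())
print(both.spaces())
```

The alternative is one class per model type, which avoids empty fields but means downstream code has to branch on the class.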
