
Output Data Structure

Sean edited this page Mar 28, 2017 · 3 revisions

Original Pipeline Output

For the cell-gene clustering model, the pipeline returns the following variables:

| Variable | Description |
| --- | --- |
| z | Gene cluster assignments from the final iteration of Gibbs sampling |
| complete.z | History of gene cluster assignments across all iterations of Gibbs sampling |
| z.stability | Stability measure in [0, 1] for the gene clustering chain |
| complete.z.stability | History of z.stability over all iterations of Gibbs sampling |
| z.prob | Probability of each cluster assignment |
| y | Cell cluster assignments from the final iteration of Gibbs sampling |
| complete.y | History of cell cluster assignments across all iterations of Gibbs sampling |
| y.stability | Stability measure in [0, 1] for the cell clustering chain |
| complete.y.stability | History of y.stability over all iterations of Gibbs sampling |
| completeLogLik | Log-likelihood of all gene and cell cluster assignments over all iterations of Gibbs sampling |
| finalLogLik | Log-likelihood of the final gene and cell cluster assignments |
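
To make the relationship between the "complete" histories and the "final" values concrete, here is a minimal sketch in Python (chosen for illustration only; the pipeline itself does not use this code). The class name `CellGeneChainOutput` and its snake_case field names are hypothetical. It assumes the final values are simply the last entries of the corresponding histories, which the descriptions above suggest but do not state outright. The stability and probability fields are left out because their computation is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellGeneChainOutput:
    """Hypothetical container for one Gibbs chain of the cell-gene model."""

    complete_z: List[List[int]]    # gene cluster labels, one row per iteration
    complete_y: List[List[int]]    # cell cluster labels, one row per iteration
    complete_log_lik: List[float]  # log-likelihood, one value per iteration

    @property
    def z(self) -> List[int]:
        # Assumption: `z` is the last row of `complete.z`.
        return self.complete_z[-1]

    @property
    def y(self) -> List[int]:
        # Assumption: `y` is the last row of `complete.y`.
        return self.complete_y[-1]

    @property
    def final_log_lik(self) -> float:
        # Assumption: `finalLogLik` is the last entry of `completeLogLik`.
        return self.complete_log_lik[-1]


# Toy chain: 2 iterations, 3 genes, 2 cells (made-up values).
chain = CellGeneChainOutput(
    complete_z=[[1, 1, 2], [1, 2, 2]],
    complete_y=[[1, 2], [2, 2]],
    complete_log_lik=[-120.5, -98.25],
)
```

Deriving the final values from the histories, rather than storing them twice, is one way to keep the two from drifting apart; whether that is worth it depends on how large the histories get.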

Object Design for Output

Ideally, what we get back should contain a list of mcmclist objects, each holding mcmc objects. Each list item is an mcmclist for one of the models that was run (e.g. k=4, l=10 versus k=5, l=10), and the mcmc objects in that mcmclist contain information on the Gibbs sampling for each chain in that model (phew!). The output should also contain the parameters used to run celda, as well as additional information on model performance (the complete log-likelihood, etc.).
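
The nesting described above (run, then models, then chains) can be sketched as follows. This is an illustration in Python, not the actual celda object: the names `Chain`, `ModelResult`, `RunOutput`, `add_model`, and the entries in `params` are all hypothetical, and each chain is reduced to its log-likelihood history to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Chain:
    """One Gibbs sampling chain (stands in for a single mcmc object)."""
    complete_log_lik: List[float]


@dataclass
class ModelResult:
    """All chains for one model setting (stands in for an mcmclist)."""
    k: int
    l: int
    chains: List[Chain]


@dataclass
class RunOutput:
    """Top-level output: run parameters plus one entry per model."""
    params: Dict[str, object]
    models: Dict[Tuple[int, int], ModelResult] = field(default_factory=dict)

    def add_model(self, result: ModelResult) -> None:
        # Key each model by its (k, l) setting so it can be looked up directly.
        self.models[(result.k, result.l)] = result


# Hypothetical run comparing k=4, l=10 against k=5, l=10 (made-up values).
run = RunOutput(params={"max_iter": 2, "n_chains": 2})
run.add_model(ModelResult(k=4, l=10, chains=[Chain([-110.0, -101.5]),
                                             Chain([-112.0, -100.0])]))
run.add_model(ModelResult(k=5, l=10, chains=[Chain([-108.0, -99.0]),
                                             Chain([-109.5, -97.5])]))
```

Keying models by `(k, l)` is just one option; a plain ordered list with the settings stored on each item would match the "list of mcmclist" wording more literally.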

There are a couple of gotchas to note up front that make designing a well-behaved data structure awkward:

  • Different models have different outputs. The gene, cell, and gene*cell clustering models all behave differently, with the last one returning clustering information in both gene and cell space. I'm not 100% sure of the best way to structure the output given that.