Store original cell-by-gene matrix #1012

arteymix · 2024-01-31T20:14:04Z

This is not strictly necessary: we can always store the data aggregated by cell type, which addresses the sparsity issue, and keep the original files on disk in their original format.

If we have to keep them in Gemma, we will need to explore some form of CSR (compressed sparse row) storage for data vectors.
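As a rough illustration of what CSR-style storage could mean for a single data vector, the sketch below keeps only the non-zero values together with the cell positions they occupy. The class name `SparseVector` and its methods are hypothetical and are not part of Gemma; this is one possible layout under those assumptions, not a proposal for the actual schema.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not an existing Gemma class.
final class SparseVector {
    final int length;      // total number of cells the vector spans
    final int[] indices;   // positions of the non-zero entries, in ascending order
    final double[] values; // non-zero values, aligned with indices

    SparseVector(int length, int[] indices, double[] values) {
        this.length = length;
        this.indices = indices;
        this.values = values;
    }

    // Build the sparse form by dropping every zero from a dense vector.
    static SparseVector fromDense(double[] dense) {
        int nnz = 0;
        for (double v : dense) {
            if (v != 0.0) nnz++;
        }
        int[] idx = new int[nnz];
        double[] val = new double[nnz];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                idx[k] = i;
                val[k] = dense[i];
                k++;
            }
        }
        return new SparseVector(dense.length, idx, val);
    }

    // Look up one cell; positions that were not stored are implicitly zero.
    double get(int cellIndex) {
        if (cellIndex < 0 || cellIndex >= length) {
            throw new IndexOutOfBoundsException("cell index " + cellIndex);
        }
        int pos = Arrays.binarySearch(indices, cellIndex);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // Expand back to a dense array.
    double[] toDense() {
        double[] dense = new double[length];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }
}
```

With this layout the storage cost scales with the number of non-zero entries rather than the number of cells, at the price of a binary search per random lookup.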

Original matrices do not have the dimensionality of EE's bioassays. They will have to be mapped in some way so that they can be annotated with cell types, etc.

arteymix · 2024-01-31T20:19:53Z

This is how I envision this:

SingleCellDimension will store the cell IDs, along with the cell type and bioassay that each individual cell belongs to
SingleCellDataVector will store cell-level expression data and will refer to a SingleCellDimension that is reused across all genes

class DesignElementDataVector {
    // relocate bioAssayDimension to RawExpressionDataVector and ProcessedExpressionDataVector, or to a new shared subclass
    QuantitationType quantitationType; // reused for all vectors
    ExpressionExperiment expressionExperiment;
}

class SingleCellDataVector extends DesignElementDataVector {
    SingleCellDimension singleCellDimension; // reused for all vectors
}

class SingleCellDimension {
    List<String> cellIds;
    Integer numberOfCells; // equal to cellIds.size()
    List<Characteristic> cellTypes;
    List<BioAssay> bioAssays;
}

To store cell IDs more efficiently, we can join them with a newline and gzip-compress the result as a binary blob. This can be achieved transparently with a Hibernate UserType.
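The join-and-gzip step could look roughly like the sketch below, which uses only `java.util.zip` from the JDK. The class name `CellIdCodec` is hypothetical, the Hibernate UserType wiring is deliberately left out, and the sketch assumes cell IDs are non-empty and never contain a newline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical codec for illustration only; not an existing Gemma class.
final class CellIdCodec {

    // Join the IDs with '\n' and gzip the UTF-8 bytes into a blob.
    static byte[] compress(List<String> cellIds) {
        byte[] joined = String.join("\n", cellIds).getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(joined);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reverse of compress: gunzip, decode as UTF-8 and split on '\n'.
    static List<String> decompress(byte[] blob) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String joined = new String(out.toByteArray(), StandardCharsets.UTF_8);
        if (joined.isEmpty()) {
            // Assumption: IDs are non-empty, so an empty payload means no cells.
            return new ArrayList<>();
        }
        return Arrays.asList(joined.split("\n", -1));
    }
}
```

Cell barcodes share an alphabet and often a common suffix, so they tend to compress well, but the whole blob has to be decompressed to read any single ID.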

To be more efficient in storing cell types and bioassays, we can store two companion arrays that identify which range of the vector the type or bioassay applies to. I will also explore the possibility of having transparent sparse lists with a Hibernate UserType.
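One way to read the companion-array idea is as a list of start offsets: if cells are ordered so that all cells of a bioassay are contiguous, a single `int[]` of range starts is enough to recover the bioassay of any cell. The sketch below assumes exactly that ordering, assumes every bioassay has at least one cell (so the offsets are strictly ascending), and uses the hypothetical name `RangeIndex`.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not an existing Gemma class.
final class RangeIndex {

    // offsets[i] is the index of the first cell of the i-th bioassay.
    // Assumptions: offsets[0] == 0 and offsets is strictly ascending.
    static int ownerOf(int[] offsets, int numberOfCells, int cellIndex) {
        if (cellIndex < 0 || cellIndex >= numberOfCells) {
            throw new IndexOutOfBoundsException("cell index " + cellIndex);
        }
        int pos = Arrays.binarySearch(offsets, cellIndex);
        // Exact hit: the cell starts a range. Otherwise binarySearch returns
        // -(insertionPoint) - 1, and the owning range is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Number of cells in the i-th range; the last range runs to numberOfCells.
    static int sizeOf(int[] offsets, int numberOfCells, int i) {
        int end = (i + 1 < offsets.length) ? offsets[i + 1] : numberOfCells;
        return end - offsets[i];
    }
}
```

For example, with offsets `{0, 3, 5}` and 9 cells, cells 0 to 2 belong to the first bioassay, cells 3 and 4 to the second, and cells 5 to 8 to the third. The same layout only works for cell types if cells of one type are contiguous, which is not generally the case.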

arteymix · 2024-02-07T17:30:11Z

Following up on a discussion with @ppavlidis, it will be important to store original cell type labels in the single-cell dimension.

At first, I thought of using the same approach that we use for bioassays, but that has some drawbacks, so we will need something similar to how we store cell IDs.

Unlike cell IDs though, types are going to be very repetitive and might benefit from some kind of relational storage for the types and a byte vector for assigning the types to individual cells. Having that would make it easy to query which labels are being used across all our datasets and establish rules for assigning factor values later on.
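The combination of a label table and a per-cell byte vector amounts to dictionary encoding. The sketch below shows the in-memory shape only: `CellTypeLabelling`, `encode` and `labelOf` are hypothetical names, the distinct labels stand in for rows of a relational table, and one byte per cell caps the scheme at 256 distinct labels per labelling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical structure for illustration only; not an existing Gemma class.
final class CellTypeLabelling {
    final List<String> labels; // distinct labels, in order of first appearance
    final byte[] assignment;   // assignment[i] is the index into labels for cell i

    CellTypeLabelling(List<String> labels, byte[] assignment) {
        this.labels = labels;
        this.assignment = assignment;
    }

    // Dictionary-encode one label per cell into a label table plus byte codes.
    static CellTypeLabelling encode(List<String> perCellLabels) {
        Map<String, Integer> index = new LinkedHashMap<>();
        byte[] assignment = new byte[perCellLabels.size()];
        for (int i = 0; i < assignment.length; i++) {
            String label = perCellLabels.get(i);
            Integer code = index.get(label);
            if (code == null) {
                if (index.size() == 256) {
                    throw new IllegalArgumentException("more than 256 distinct labels");
                }
                code = index.size();
                index.put(label, code);
            }
            assignment[i] = (byte) code.intValue();
        }
        return new CellTypeLabelling(new ArrayList<>(index.keySet()), assignment);
    }

    // Mask with 0xFF so codes 128..255 are read back as unsigned.
    String labelOf(int cellIndex) {
        return labels.get(assignment[cellIndex] & 0xFF);
    }
}
```

Because the distinct labels are kept apart from the per-cell codes, listing the labels in use across datasets would only touch the small label table and never the byte vectors.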

arteymix · 2024-02-10T02:54:55Z

I've turned cell type labelling into an analysis, so that we can attach multiple labellings to our single-cell dimensions. This makes sense if we have to relabel cells and want to retain the original labels in the database.

arteymix self-assigned this Jan 31, 2024

arteymix added the "single cell" label (Issues related to single-cell data support) Jan 31, 2024

arteymix linked a pull request Feb 10, 2024 that will close this issue

Prototype for storing single-cell data #1020

Draft

6 tasks

arteymix closed this as completed Sep 17, 2024
