Store original cell-by-gene matrix #1012

Closed
arteymix opened this issue Jan 31, 2024 · 3 comments · May be fixed by #1020
Labels
single cell Issues related to single-cell data support

Comments

@arteymix
Member

This is not strictly necessary: we can always store the data aggregated by cell type, which addresses the sparsity issue, and keep the original files on disk in their original format.

If we have to keep them in Gemma, we will need to explore some form of CSR storage for data vectors.
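
As a rough illustration of what CSR (compressed sparse row) storage could look like, here is a minimal sketch. It assumes one row per gene and one column per cell, so that each row corresponds to one data vector. The class name `CsrMatrixSketch` and its methods are hypothetical and not part of Gemma.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical CSR container (assumption: rows are genes, columns are cells). */
final class CsrMatrixSketch {
    final int numCols;
    final int[] rowPtr;    // entries of row i are at positions rowPtr[i] .. rowPtr[i + 1] - 1
    final int[] colIdx;    // column (cell) index of each stored value
    final double[] values; // non-zero values, stored row by row

    private CsrMatrixSketch(int numCols, int[] rowPtr, int[] colIdx, double[] values) {
        this.numCols = numCols;
        this.rowPtr = rowPtr;
        this.colIdx = colIdx;
        this.values = values;
    }

    /** Builds the three CSR arrays from a dense matrix, keeping only non-zero entries. */
    static CsrMatrixSketch fromDense(double[][] dense) {
        int numCols = dense.length == 0 ? 0 : dense[0].length;
        int[] rowPtr = new int[dense.length + 1];
        List<Integer> cols = new ArrayList<>();
        List<Double> vals = new ArrayList<>();
        for (int i = 0; i < dense.length; i++) {
            for (int j = 0; j < numCols; j++) {
                if (dense[i][j] != 0.0) {
                    cols.add(j);
                    vals.add(dense[i][j]);
                }
            }
            rowPtr[i + 1] = cols.size();
        }
        int[] colIdx = cols.stream().mapToInt(Integer::intValue).toArray();
        double[] values = vals.stream().mapToDouble(Double::doubleValue).toArray();
        return new CsrMatrixSketch(numCols, rowPtr, colIdx, values);
    }

    /** Expands one row (one gene's vector) back into a dense array. */
    double[] denseRow(int row) {
        double[] out = new double[numCols];
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; k++) {
            out[colIdx[k]] = values[k];
        }
        return out;
    }
}
```

With this layout, a single data vector only needs its slice of `colIdx` and `values`, so a per-gene vector could store just those two arrays.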

The original matrices do not have the same dimensionality as the EE's (ExpressionExperiment's) bioassays. They will have to be mapped in some way so that they can be annotated with cell types, etc.

@arteymix arteymix self-assigned this Jan 31, 2024
@arteymix
Member Author

arteymix commented Jan 31, 2024

This is how I envision this:

  • SingleCellDimension will store cell IDs, cell types and the bioassay that each individual cell belongs to
  • SingleCellDataVector will store cell-level expression data and will refer to a SingleCellDimension that is reused across all genes

class DesignElementDataVector {
    // relocate bioAssayDimension to RawExpressionDataVector and ProcessedExpressionDataVector, or to a new shared subclass
    QuantitationType quantitationType; // reused for all vectors
    ExpressionExperiment expressionExperiment;
}

class SingleCellDataVector extends DesignElementDataVector {
    SingleCellDimension singleCellDimension; // reused for all vectors
}

class SingleCellDimension {
    List<String> cellIds;
    Integer numberOfCells; // equal to cellIds.size()
    List<Characteristic> cellTypes;
    List<BioAssay> bioAssays;
}

To store cell IDs more efficiently, we can join them with newlines and gzip-compress the result into a binary blob. This can be achieved transparently with a Hibernate UserType.
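
The join-and-gzip step could be sketched as follows. This only covers the conversion between a list of IDs and a blob, not the Hibernate UserType wiring. The class name `CellIdCodec` is hypothetical, and the sketch assumes that cell IDs are non-empty and never contain a newline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (assumption: IDs are non-empty and contain no newline). */
final class CellIdCodec {

    /** Joins the IDs with newlines and gzip-compresses the UTF-8 bytes. */
    static byte[] compress(List<String> cellIds) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(String.join("\n", cellIds).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses compress(): gunzips the blob and splits it on newlines. */
    static List<String> decompress(byte[] blob) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String joined = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        if (joined.isEmpty()) {
            return new ArrayList<>(); // an empty blob payload means no cells
        }
        return new ArrayList<>(Arrays.asList(joined.split("\n", -1)));
    }
}
```

Cell barcodes tend to share long prefixes and suffixes, which is the kind of redundancy gzip handles well, but the actual saving would have to be measured on real datasets.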

To be more efficient in storing cell types and bioassays, we can store two companion arrays that identify which range of the vector the type or bioassay applies to. I will also explore the possibility of having transparent sparse lists with a Hibernate UserType.
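
One possible reading of the companion-array idea is sketched below: one array holds the distinct items (for example bioassays) and the other holds the index of the first cell each item applies to. This assumes that cells are ordered so that each item covers one contiguous range. The class name `RangeEncodedList` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical range-encoded list (assumption: each item covers one contiguous run of cells). */
final class RangeEncodedList<T> {
    private final T[] items;     // item applying to each run
    private final int[] offsets; // offsets[i] is the index of the first cell of run i, ascending
    private final int size;      // total number of cells

    RangeEncodedList(T[] items, int[] offsets, int size) {
        if (items.length != offsets.length) {
            throw new IllegalArgumentException("items and offsets must have the same length");
        }
        if (size > 0 && (offsets.length == 0 || offsets[0] != 0)) {
            throw new IllegalArgumentException("the first run must start at cell 0");
        }
        this.items = items;
        this.offsets = offsets;
        this.size = size;
    }

    /** Returns the item that applies to the given cell, using a binary search over the offsets. */
    T get(int cellIndex) {
        if (cellIndex < 0 || cellIndex >= size) {
            throw new IndexOutOfBoundsException("cell index out of range: " + cellIndex);
        }
        int pos = Arrays.binarySearch(offsets, cellIndex);
        // When the index is not an exact run start, binarySearch returns -(insertionPoint) - 1,
        // and the enclosing run is the one just before the insertion point.
        int run = pos >= 0 ? pos : -pos - 2;
        return items[run];
    }

    int size() {
        return size;
    }
}
```

For example, items `{"assayA", "assayB", "assayC"}` with offsets `{0, 3, 5}` and a size of 8 would assign cells 0 to 2 to the first assay, cells 3 and 4 to the second, and cells 5 to 7 to the third.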

@arteymix arteymix added the single cell Issues related to single-cell data support label Jan 31, 2024
@arteymix
Member Author

arteymix commented Feb 7, 2024

Following up on a discussion with @ppavlidis, it will be important to store original cell type labels in the single-cell dimension.

At first, I thought of using the same approach that we use for bioassays, but it has some drawbacks, so we will need something closer to how we store cell IDs.

Unlike cell IDs though, types are going to be very repetitive and might benefit from some kind of relational storage for the types and a byte vector for assigning the types to individual cells. Having that would make it easy to query which labels are being used across all our datasets and establish rules for assigning factor values later on.
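
A minimal sketch of that split, with the distinct labels kept in a list (standing in for the relational table) and a byte vector assigning one label to each cell. The class name `CellTypeAssignment` is hypothetical, and the 256-label limit is an assumption that follows from using one byte per cell.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical label dictionary plus per-cell byte codes (assumption: at most 256 distinct labels). */
final class CellTypeAssignment {
    final List<String> labels; // distinct labels, in order of first appearance
    final byte[] codes;        // codes[i] is the position in labels of the label of cell i

    private CellTypeAssignment(List<String> labels, byte[] codes) {
        this.labels = labels;
        this.codes = codes;
    }

    /** Builds the dictionary and the byte vector from one label per cell. */
    static CellTypeAssignment encode(List<String> perCellLabels) {
        Map<String, Integer> index = new LinkedHashMap<>();
        byte[] codes = new byte[perCellLabels.size()];
        for (int i = 0; i < codes.length; i++) {
            String label = perCellLabels.get(i);
            Integer code = index.get(label);
            if (code == null) {
                if (index.size() >= 256) {
                    throw new IllegalArgumentException("more than 256 distinct labels");
                }
                code = index.size();
                index.put(label, code);
            }
            codes[i] = (byte) code.intValue();
        }
        return new CellTypeAssignment(new ArrayList<>(index.keySet()), codes);
    }

    /** Looks up the label of a cell; the mask is needed because Java bytes are signed. */
    String labelOf(int cellIndex) {
        return labels.get(codes[cellIndex] & 0xFF);
    }
}
```

Because the distinct labels live apart from the per-cell codes, listing the labels in use only requires reading the dictionary, not decoding every cell.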

@arteymix arteymix linked a pull request Feb 10, 2024 that will close this issue
6 tasks
@arteymix
Member Author

I've turned cell type labelling into an analysis, so that we can attach multiple labellings to our single-cell dimensions. This makes sense if we have to relabel cells and want to retain the original labels in the database.
