Store original cell-by-gene matrix #1012
This is how I envision this:
class DesignElementDataVector {
    // relocate bioAssayDimension to RawExpressionDataVector and ProcessedExpressionDataVector, or to a new shared subclass
    QuantitationType quantitationType; // reused for all vectors
    ExpressionExperiment expressionExperiment;
}
class SingleCellDataVector extends DesignElementDataVector {
    SingleCellDimension singleCellDimension; // reused for all vectors
}
class SingleCellDimension {
    List<String> cellIds;
    Integer numberOfCells; // equal to cellIds.size()
    List<Characteristic> cellTypes;
    List<BioAssay> bioAssays;
}
To be more efficient in storing cell IDs, we can join them with newlines and gzip-compress the result into a binary blob. This can be done transparently with a Hibernate UserType. To be more efficient in storing cell types and bioassays, we can store two companion arrays that identify which range of the vector each type or bioassay applies to. I will also explore the possibility of transparent sparse lists backed by a Hibernate UserType.
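A minimal sketch of the two storage ideas above, under stated assumptions: `CellIdCodec` and `RangeLookup` are hypothetical names, not existing Gemma classes, and the Hibernate `UserType` wiring is left out entirely. The codec assumes cell IDs are non-empty and never contain a newline. The lookup assumes the companion array holds the strictly increasing start offset of each bioassay's block of cells, beginning at 0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for the cell ID blob; not an existing Gemma class. */
class CellIdCodec {

    /** Joins the cell IDs with '\n' and gzip-compresses the UTF-8 bytes. */
    static byte[] compress(List<String> cellIds) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(String.join("\n", cellIds).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return bytes.toByteArray();
    }

    /** Reverses compress(); an empty payload is read back as an empty list. */
    static List<String> decompress(byte[] blob) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            String joined = new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
            if (joined.isEmpty()) {
                return Collections.emptyList();
            }
            return Arrays.asList(joined.split("\n", -1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("AAACCTG-1", "AAACGGG-1", "TTTGTCA-1");
        System.out.println(decompress(compress(ids)).equals(ids));
        System.out.println(RangeLookup.indexOf(new int[] { 0, 3, 7 }, 5));
    }
}

/** Hypothetical lookup from a cell position to the bioassay (or type) range containing it. */
class RangeLookup {

    /** offsets[i] is the first cell position of range i; offsets must be strictly increasing. */
    static int indexOf(int[] offsets, int cellIndex) {
        int pos = Arrays.binarySearch(offsets, cellIndex);
        // Not found: binarySearch returns -(insertionPoint) - 1; the owning range is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

With offsets, the per-cell bioassay list never has to be materialized: one `int` per bioassay is stored instead of one reference per cell, and the lookup stays logarithmic in the number of bioassays.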
Following up on a discussion with @ppavlidis, it will be important to store original cell type labels in the single-cell dimension. At first, I thought of using the same approach that we use for bioassays, but that had some drawbacks, so we will need something similar to how we store cell IDs. Unlike cell IDs, though, types are going to be very repetitive and might benefit from some kind of relational storage for the types and a byte vector for assigning the types to individual cells. Having that would make it easy to query which labels are being used across all our datasets and establish rules for assigning factor values later on.
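A rough sketch of the label-plus-byte-vector idea, assuming a hypothetical `CellTypeLabelling` class (not an existing Gemma entity). The distinct labels stand in for what would be relational rows, and each cell stores one unsigned byte indexing into them, which caps this particular layout at 256 distinct labels per labelling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical container: distinct labels plus one byte per cell pointing at a label. */
class CellTypeLabelling {
    final List<String> labels;  // distinct labels, in order of first appearance
    final byte[] assignments;   // assignments[i] is the (unsigned) label index of cell i

    private CellTypeLabelling(List<String> labels, byte[] assignments) {
        this.labels = labels;
        this.assignments = assignments;
    }

    /** Builds the compact form from one label per cell. */
    static CellTypeLabelling fromPerCellLabels(List<String> perCell) {
        Map<String, Integer> index = new LinkedHashMap<>();
        byte[] assignments = new byte[perCell.size()];
        for (int i = 0; i < perCell.size(); i++) {
            String label = perCell.get(i);
            Integer id = index.get(label);
            if (id == null) {
                if (index.size() >= 256) {
                    throw new IllegalArgumentException("more than 256 distinct labels");
                }
                id = index.size();
                index.put(label, id);
            }
            int value = id;
            assignments[i] = (byte) value;
        }
        return new CellTypeLabelling(new ArrayList<>(index.keySet()), assignments);
    }

    /** Returns the label of a cell; the mask reads the byte as unsigned. */
    String labelOf(int cellIndex) {
        return labels.get(assignments[cellIndex] & 0xFF);
    }

    public static void main(String[] args) {
        CellTypeLabelling labelling = fromPerCellLabels(
                Arrays.asList("neuron", "astrocyte", "neuron", "microglia"));
        System.out.println(labelling.labels);
        System.out.println(labelling.labelOf(3));
    }
}
```

Because the distinct labels live apart from the per-cell vector, listing the labels used by a dataset never requires decoding the assignments.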
I've turned cell type labelling into an analysis, so that we can attach multiple labellings to our single-cell dimensions. This makes sense if we have to relabel cells and want to retain the original labels in the database.
It's not absolutely necessary, because we can always store the data aggregated by cell type, which addresses the sparsity issue, and keep the original files on disk in their original format.
If we have to keep them in Gemma, we will need to explore some form of CSR storage for data vectors.
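For reference, a minimal sketch of compressed sparse row (CSR) storage, assuming a rectangular gene-by-cell matrix in which each row would correspond to one data vector. `CsrMatrix` is a hypothetical name, and how the three arrays would be persisted is not addressed here.

```java
/** Hypothetical CSR container; each row would correspond to one data vector. */
class CsrMatrix {
    final int columns;
    final double[] values;      // non-zero values, row by row
    final int[] columnIndices;  // column of each entry in values
    final int[] rowPointers;    // row r occupies [rowPointers[r], rowPointers[r + 1])

    private CsrMatrix(int columns, double[] values, int[] columnIndices, int[] rowPointers) {
        this.columns = columns;
        this.values = values;
        this.columnIndices = columnIndices;
        this.rowPointers = rowPointers;
    }

    /** Converts a rectangular dense matrix, keeping only the non-zero entries. */
    static CsrMatrix fromDense(double[][] dense) {
        int columns = dense.length == 0 ? 0 : dense[0].length;
        int nonZeros = 0;
        for (int r = 0; r < dense.length; r++) {
            for (int c = 0; c < columns; c++) {
                if (dense[r][c] != 0.0) {
                    nonZeros++;
                }
            }
        }
        double[] values = new double[nonZeros];
        int[] columnIndices = new int[nonZeros];
        int[] rowPointers = new int[dense.length + 1];
        int k = 0;
        for (int r = 0; r < dense.length; r++) {
            for (int c = 0; c < columns; c++) {
                if (dense[r][c] != 0.0) {
                    values[k] = dense[r][c];
                    columnIndices[k] = c;
                    k++;
                }
            }
            rowPointers[r + 1] = k;
        }
        return new CsrMatrix(columns, values, columnIndices, rowPointers);
    }

    /** Expands one row back into a dense vector. */
    double[] denseRow(int row) {
        double[] out = new double[columns];
        for (int k = rowPointers[row]; k < rowPointers[row + 1]; k++) {
            out[columnIndices[k]] = values[k];
        }
        return out;
    }

    public static void main(String[] args) {
        CsrMatrix m = fromDense(new double[][] { { 0, 0, 3 }, { 0, 0, 0 }, { 1, 0, 2 } });
        System.out.println(java.util.Arrays.toString(m.denseRow(2)));
    }
}
```

Storage grows with the number of non-zero entries rather than with genes times cells, which is the property that matters for sparse single-cell counts.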
Original matrices do not have the dimensionality of EE's bioassays. They will have to be mapped in some way so that they can be annotated with cell types, etc.