detect that a GEO series contains single-cell data
detect what kind of single cell format is being used
delegate loading to the appropriate SingleCellDataLoader implementation
So far, I've encountered the following formats:
MEX files (barcodes, feature, matrix) prefixed by GSM IDs
MEX files, one supplementary file per sample, no GSM IDs in names
AnnData, single file
Seurat Disk, single file
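As a rough sketch of the detection step, the formats above could be told apart from supplementary file names alone. The function name `detect_format`, the returned labels and the file-name patterns are assumptions for illustration (they follow common naming such as `barcodes.tsv.gz` or `.h5ad`), not something GEO guarantees:

```python
import re
from typing import Iterable, Optional

# Assumed patterns based on common naming; real submissions will deviate.
MEX_PARTS = {
    "barcodes": re.compile(r"barcodes\.tsv(\.gz)?$", re.IGNORECASE),
    "features": re.compile(r"(features|genes)\.tsv(\.gz)?$", re.IGNORECASE),
    "matrix": re.compile(r"matrix\.mtx(\.gz)?$", re.IGNORECASE),
}


def detect_format(file_names: Iterable[str]) -> Optional[str]:
    """Guess the single-cell format from a list of supplementary file names."""
    names = list(file_names)
    lowered = [n.lower() for n in names]
    if any(n.endswith((".h5ad", ".h5ad.gz")) for n in lowered):
        return "ANNDATA"
    if any(n.endswith((".h5seurat", ".h5seurat.gz")) for n in lowered):
        return "SEURAT_DISK"
    # MEX needs all three parts: barcodes, features (or genes) and matrix.
    if all(any(p.search(n) for n in names) for p in MEX_PARTS.values()):
        return "MEX"
    return None
```

Returning `None` leaves room for the custom or pre-aggregated formats, which would need to be handled (or rejected) separately.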
GEO has metadata for the storage format. We can use that if available.
Ideally we should use supplementary files provided at the sample-level with a fallback on the series.
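The fallback could look something like this. It is only a sketch: `select_supplementary_files` and its arguments are hypothetical names, and it assumes the file lists have already been fetched from GEO:

```python
from typing import Dict, List


def select_supplementary_files(
    sample_files: Dict[str, List[str]], series_files: List[str]
) -> Dict[str, List[str]]:
    """Prefer sample-level files; use the series-level files for samples with none."""
    return {
        gsm: list(files) if files else list(series_files)
        for gsm, files in sample_files.items()
    }
```

Samples that fall back on the series all end up pointing at the same files, so the loader still has to work out which cells belong to which sample.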
In some cases, it will be necessary to inspect the content of a TAR archive and extract only the necessary files.
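For the TAR case, a minimal sketch using Python's standard `tarfile` module. The `WANTED` pattern and the `extract_wanted` helper are assumptions for illustration; here they only target MEX files:

```python
import os
import re
import shutil
import tarfile
from typing import List

# Assumed pattern for the entries worth extracting (MEX files only here).
WANTED = re.compile(
    r"(barcodes\.tsv|(features|genes)\.tsv|matrix\.mtx)(\.gz)?$", re.IGNORECASE
)


def extract_wanted(tar_path: str, dest_dir: str) -> List[str]:
    """Extract only the entries matching WANTED and return the written paths."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path, mode="r:*") as tar:
        for member in tar:
            if not (member.isfile() and WANTED.search(member.name)):
                continue
            # Keep only the base name so entries cannot escape dest_dir.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            source = tar.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return extracted
```

Flattening to base names is a simplification: two entries with the same name in different directories would overwrite each other, so a real implementation would need to keep them apart.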
For AnnData and Seurat Disk, there is no standardized way of attributing a sample name to individual cells and the matrix does not contain GSM IDs. We have to detect which variable encodes the sample name.
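One possible heuristic for finding that variable: keep the cell-level columns whose number of distinct values equals the number of GSM samples, then prefer columns with a suggestive name. Everything here is an assumption for illustration, including `guess_sample_column` and the list of name hints. The input is a plain mapping of column name to per-cell values, which for AnnData would come from the `obs` data frame:

```python
from typing import Dict, Optional, Sequence

# Assumed name hints, in order of preference; "orig.ident" is Seurat's default.
NAME_HINTS = ("sample", "orig.ident", "donor", "patient", "batch")


def guess_sample_column(obs: Dict[str, Sequence[str]], n_samples: int) -> Optional[str]:
    """Return the column most likely to encode the sample, or None if unclear."""
    candidates = [col for col, values in obs.items() if len(set(values)) == n_samples]
    if not candidates:
        return None

    def rank(col: str) -> int:
        lowered = col.lower()
        for i, hint in enumerate(NAME_HINTS):
            if hint in lowered:
                return i
        return len(NAME_HINTS)

    best = min(candidates, key=rank)
    # Several candidates and no name hint to break the tie: give up.
    if rank(best) == len(NAME_HINTS) and len(candidates) > 1:
        return None
    return best
```

When this returns `None`, the choice probably has to be made by hand.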
For MEX, cell types are generally provided alongside in a tabular format. We can add minimal support for loading those files. I expect however that it will be difficult to cover all possible real world examples.
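Minimal support could start with something like the sketch below, which reads a delimited table into a barcode-to-cell-type mapping. The column-name hints and the `read_cell_types` helper are assumptions; author-provided files will often need different hints or manual mapping:

```python
import csv
from typing import Dict, List, Optional, Sequence, TextIO

# Assumed column-name hints; real files vary widely.
BARCODE_HINTS = ("barcode", "cell_id")
CELL_TYPE_HINTS = ("cell_type", "celltype", "cluster", "annotation")


def _find_column(header: List[str], hints: Sequence[str]) -> Optional[str]:
    for hint in hints:
        for name in header:
            if hint in name.lower().replace(" ", "_"):
                return name
    return None


def read_cell_types(handle: TextIO) -> Dict[str, str]:
    """Read a TSV or CSV table and map each barcode to its cell type."""
    first = handle.readline()
    # Crude delimiter guess: tab if the header contains one, comma otherwise.
    delimiter = "\t" if "\t" in first else ","
    header = next(csv.reader([first], delimiter=delimiter))
    barcode_col = _find_column(header, BARCODE_HINTS)
    type_col = _find_column(header, CELL_TYPE_HINTS)
    if barcode_col is None or type_col is None or barcode_col == type_col:
        raise ValueError(f"Cannot identify barcode and cell type columns in {header}")
    reader = csv.DictReader(handle, fieldnames=header, delimiter=delimiter)
    return {row[barcode_col]: row[type_col] for row in reader}
```

Raising on an unrecognized header keeps unsupported files from being loaded with the wrong columns.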
Last but not least, some authors provide single cell data in a custom format or already aggregated by cell types.
I've made good progress on this. I've downloaded 935 MEX datasets and 16 AnnData datasets so far. Next, I have to look into the cell type assignments provided by the authors.