Identify the input formats for single-cell data we want to support in Gemma #1028

arteymix · 2024-02-12T00:44:38Z

It looks like there are four data formats commonly used out there:

SeuratDisk, an HDF5-based storage for Seurat
AnnData on-disk storage, another HDF5-based storage
MEX, a 10X format with the .mtx extension; it does not include samples/factors, so an additional user-supplied mapping would be necessary
Loom, which is another HDF5-based format

arteymix · 2024-02-12T00:46:06Z

I'm currently implementing the import for AnnData HDF5. It will be built on top of the HDF5 JNI API and will be reusable for Seurat.

MTX is relatively easy to do since it's just a tabular format that we can parse with Apache Commons CSV.
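To make the MTX point concrete, here is a minimal sketch of reading a Matrix Market coordinate file, the layout used by `matrix.mtx` in MEX. It uses only the Java standard library rather than Apache Commons CSV, since the entries are whitespace-separated. The `MtxSketch` class and its `SparseMatrix` holder are hypothetical names for illustration, not Gemma code. In 10X output, rows are features (genes) and columns are barcodes (cells).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Gemma): reads a Matrix Market coordinate file. */
final class MtxSketch {

    /** Sparse matrix in coordinate form; indices are converted to zero-based. */
    static final class SparseMatrix {
        final int rows;     // features (genes) in a 10X MEX matrix
        final int columns;  // barcodes (cells) in a 10X MEX matrix
        final List<int[]> indices = new ArrayList<>();  // {row, column} pairs
        final List<Double> values = new ArrayList<>();

        SparseMatrix(int rows, int columns) {
            this.rows = rows;
            this.columns = columns;
        }
    }

    static SparseMatrix parse(BufferedReader reader) {
        try {
            String line = reader.readLine();
            if (line == null || !line.startsWith("%%MatrixMarket") || !line.contains("coordinate")) {
                throw new IllegalArgumentException("Expected a %%MatrixMarket coordinate banner.");
            }
            // Skip comment lines ('%') and blank lines that precede the size line.
            do {
                line = reader.readLine();
            } while (line != null && (line.startsWith("%") || line.trim().isEmpty()));
            if (line == null) {
                throw new IllegalArgumentException("Missing size line.");
            }
            // Size line: <rows> <columns> <number of non-zero entries>
            String[] size = line.trim().split("\\s+");
            if (size.length != 3) {
                throw new IllegalArgumentException("Malformed size line: " + line);
            }
            SparseMatrix matrix = new SparseMatrix(Integer.parseInt(size[0]), Integer.parseInt(size[1]));
            int expectedEntries = Integer.parseInt(size[2]);
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] fields = line.trim().split("\\s+");
                if (fields.length < 3) {
                    throw new IllegalArgumentException("Malformed entry: " + line);
                }
                // Indices in the file are one-based.
                matrix.indices.add(new int[] {
                    Integer.parseInt(fields[0]) - 1, Integer.parseInt(fields[1]) - 1 });
                matrix.values.add(Double.parseDouble(fields[2]));
            }
            if (matrix.values.size() != expectedEntries) {
                throw new IllegalArgumentException("Entry count does not match the size line.");
            }
            return matrix;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

MEX files are often gzip-compressed, in which case the reader would wrap a `java.util.zip.GZIPInputStream`. A real importer would likely stream entries instead of collecting them in lists.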

arteymix · 2024-02-14T01:50:03Z

Another format to consider is aggregated MEX, which is basically multiple samples combined in a single matrix.

arteymix · 2024-02-14T01:52:31Z

Sample associations for MEX formats can generally be determined from the naming scheme of the submitted files, which are prefixed with GSM IDs.

I haven't looked yet at aggregated ones, but I suspect we might not have that benefit.

Those nitty-gritty details should be dealt with in the GEO loader.
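As a rough illustration of the GSM-prefix naming scheme, the sketch below extracts the sample accession from a file name and groups files per sample. It assumes names of the form `GSM<digits>_<rest>`. The class name, method names and example file names are hypothetical and do not reflect the actual GEO loader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not the actual GEO loader): maps supplementary files to GSM IDs. */
final class GsmPrefixSketch {

    // Assumption: per-sample files are named "GSM<digits>_<anything>".
    private static final Pattern GSM_PREFIX = Pattern.compile("^(GSM\\d+)_.+$");

    /** Returns the GSM accession prefixing the file name, if there is one. */
    static Optional<String> sampleAccession(String fileName) {
        Matcher matcher = GSM_PREFIX.matcher(fileName);
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** Groups file names by GSM accession; files without a GSM prefix are omitted. */
    static Map<String, List<String>> groupBySample(List<String> fileNames) {
        Map<String, List<String>> bySample = new LinkedHashMap<>();
        for (String fileName : fileNames) {
            sampleAccession(fileName).ifPresent(
                gsm -> bySample.computeIfAbsent(gsm, k -> new ArrayList<>()).add(fileName));
        }
        return bySample;
    }
}
```

Files that lack a GSM prefix, which may be the case for aggregated MEX, are simply left out of the grouping here. They would still need a user-supplied mapping.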

arteymix · 2024-04-12T21:25:18Z

This is done. We'll support MEX, AnnData and SeuratDisk.

arteymix assigned sanjarogic Feb 12, 2024

arteymix linked a pull request Feb 12, 2024 that will close this issue

Prototype for storing single-cell data #1020


arteymix added the "single cell" label (Issues related to single-cell data support) Feb 12, 2024

arteymix closed this as completed Apr 12, 2024