Load single-cell data from GEO #1029

Closed
3 tasks done
arteymix opened this issue Feb 12, 2024 · 2 comments · May be fixed by #1020
Labels
single cell Issues related to single-cell data support

Comments

@arteymix
Member

arteymix commented Feb 12, 2024

  • detect that a GEO series contains single-cell data
  • detect which single-cell format is being used
  • delegate loading to the appropriate SingleCellDataLoader implementation

So far, I've encountered the following formats:

  • MEX files (barcodes, features, matrix) prefixed by GSM IDs
  • MEX files, one supplementary file per sample, no GSM IDs in names
  • AnnData, single file
  • Seurat Disk, single file

GEO has metadata for the storage format. We can use that if available.
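
When that metadata is missing, file names are one possible fallback. The following is only a rough sketch of name-based detection, not what the linked PR does: the function names `detect_format` and `group_mex_by_sample` are hypothetical, and the patterns assume the usual `barcodes.tsv` / `features.tsv` (or `genes.tsv`) / `matrix.mtx` naming for MEX, `.h5ad` for AnnData and `.h5Seurat` for Seurat Disk.

```python
import re
from collections import defaultdict

# Assumed naming conventions; real GEO supplementary files vary widely.
MEX_PART = re.compile(
    r"(barcodes\.tsv|features\.tsv|genes\.tsv|matrix\.mtx)(?:\.gz)?$",
    re.IGNORECASE,
)
GSM_PREFIX = re.compile(r"^(GSM\d+)_")


def detect_format(filenames):
    """Guess the single-cell format from supplementary file names (hypothetical helper)."""
    lowered = [name.lower() for name in filenames]
    if any(name.endswith(".h5ad") for name in lowered):
        return "anndata"
    if any(name.endswith(".h5seurat") for name in lowered):
        return "seurat_disk"
    parts = set()
    for name in filenames:
        match = MEX_PART.search(name)
        if match:
            part = match.group(1).lower().split(".")[0]
            # Older MEX exports use genes.tsv instead of features.tsv.
            parts.add("features" if part == "genes" else part)
    if parts == {"barcodes", "features", "matrix"}:
        return "mex"
    return None


def group_mex_by_sample(filenames):
    """Group MEX files by GSM prefix; files without one are grouped under None."""
    groups = defaultdict(list)
    for name in filenames:
        if MEX_PART.search(name):
            match = GSM_PREFIX.match(name)
            groups[match.group(1) if match else None].append(name)
    return dict(groups)
```

Files grouped under `None` correspond to the second case in the list above (no GSM IDs in the names) and would need to be matched to samples some other way.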

Ideally, we should use supplementary files provided at the sample level and fall back on the series-level files otherwise.
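
To illustrate the fallback, here is a minimal sketch. The function name and the shape of its inputs (a list of series-level files and a mapping from GSM ID to sample-level files) are assumptions made for the example.

```python
def pick_supplementary_files(series_files, sample_files_by_gsm):
    """Prefer sample-level files; fall back on series-level files per sample.

    Returns {gsm_id: (level, files)} where level is "sample" or "series".
    Hypothetical helper, for illustration only.
    """
    chosen = {}
    for gsm_id, sample_files in sample_files_by_gsm.items():
        if sample_files:
            chosen[gsm_id] = ("sample", list(sample_files))
        else:
            chosen[gsm_id] = ("series", list(series_files))
    return chosen
```

Recording which level each file came from matters later: series-level files usually hold several samples at once, so they still need to be split or attributed per sample.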

In some cases, it will be necessary to inspect the content of a TAR archive and extract only the necessary files.
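
A sketch of selective extraction using Python's standard `tarfile` module. The `extract_matching` helper and the MEX file-name pattern are assumptions; the contents are read into memory rather than written to disk, which also sidesteps unsafe member paths.

```python
import io
import re
import tarfile

# Assumed pattern for the files worth keeping (MEX parts, optionally gzipped).
MEX_MEMBER = re.compile(
    r"(barcodes\.tsv|features\.tsv|genes\.tsv|matrix\.mtx)(?:\.gz)?$",
    re.IGNORECASE,
)


def extract_matching(archive, pattern=MEX_MEMBER):
    """Return {member name: bytes} for regular files whose name matches `pattern`.

    `archive` is a seekable binary file object containing a TAR archive.
    """
    extracted = {}
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        for member in tar:
            if member.isfile() and pattern.search(member.name):
                extracted[member.name] = tar.extractfile(member).read()
    return extracted
```

For large archives, streaming each member to a file would be preferable to holding everything in memory.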

For AnnData and Seurat Disk, there is no standardized way of attributing a sample name to individual cells and the matrix does not contain GSM IDs. We have to detect which variable encodes the sample name.
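
One possible heuristic, sketched below, is to pick the per-cell column whose distinct values overlap most with the sample names known from GEO (GSM IDs or sample titles). This is an assumption about how detection could work, not a description of the actual implementation; `guess_sample_column` is a hypothetical name, and the per-cell columns are passed as a plain dictionary (for AnnData they would come from the `obs` table).

```python
def guess_sample_column(obs_columns, known_samples):
    """Return the column whose distinct values best match the known sample names.

    obs_columns: {column name: list of per-cell values}
    known_samples: sample identifiers from GEO (GSM IDs, titles, ...)
    Returns None if no column shares any value with known_samples.
    """
    known = {str(sample).lower() for sample in known_samples}
    best_column, best_score = None, 0.0
    for column, values in obs_columns.items():
        distinct = {str(value).lower() for value in values}
        if not distinct:
            continue
        # Fraction of the column's distinct values that are known sample names.
        score = len(distinct & known) / len(distinct)
        if score > best_score:
            best_column, best_score = column, score
    return best_column
```

Exact matching is the simplest case. In practice the values may carry prefixes, suffixes, or abbreviations of the sample title, so a looser comparison or a manual override by curators would likely be needed.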

For MEX, cell types are generally provided alongside in a tabular format. We can add minimal support for loading those files. I expect, however, that it will be difficult to cover all possible real-world examples.
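
Minimal support could look like the sketch below, which reads a delimited file into a barcode-to-cell-type mapping. The column names `barcode` and `cell_type` and the tab delimiter are placeholders; author-provided files use many different layouts, so these would have to be configurable per dataset.

```python
import csv
import io


def read_cell_types(handle, barcode_column="barcode",
                    cell_type_column="cell_type", delimiter="\t"):
    """Read a tabular cell type assignment into {barcode: cell type}.

    `handle` is a text file object with a header row. The default column
    names are placeholders, not a convention used by GEO submitters.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = {barcode_column, cell_type_column} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing column(s): {sorted(missing)}")
    return {row[barcode_column]: row[cell_type_column] for row in reader}


# Usage with an in-memory file:
example = io.StringIO("barcode\tcell_type\nAAAC-1\tneuron\nAAAG-1\tastrocyte\n")
assignments = read_cell_types(example)
```

Failing loudly on missing columns keeps unsupported layouts visible instead of silently loading an empty or wrong assignment.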

Last but not least, some authors provide single-cell data in a custom format or already aggregated by cell type.

@arteymix arteymix self-assigned this Feb 12, 2024
@arteymix arteymix added the single cell Issues related to single-cell data support label Feb 12, 2024
@arteymix arteymix linked a pull request Feb 25, 2024 that will close this issue
6 tasks
@arteymix
Member Author

arteymix commented Apr 4, 2024

I'll attempt to parse metadata for all the datasets that our curators have collected so far and identify those that we can load right now.

@arteymix
Member Author

I've made good progress on this. I've downloaded 935 MEX datasets and 16 AnnData datasets so far. I now have to look into the cell type assignments provided by the authors.
