Data format specs after mapping #102

rcannood · 2022-09-29T08:34:44Z

This is a first attempt at deriving a data format specification.

Once we have settled on some of the APIs, we could include them in our config.vsh.yaml definitions (similar to https://github.com/openproblems-bio/openproblems-v2/tree/main/src/label_projection/api).

After Cell Ranger or BD Rhapsody mapping

obs:
  index # cell id
  sample
  cell_type # human-readable name
  organism # ?
  tissue # ?

mod:
  # gene expression
  rna:
    layers:
      counts
      velocity_spliced # ?
      velocity_unspliced # ? 
    var:
      index # feature_id, preferably an ensembl id
      feature_name

  # Antibody Capture
  prot:
    layers:
      counts
    var:
      index # feature_id
      feature_name # Associated protein names

  # Immune receptor (IR) data
  vdj:
    obsm:
      vdj_t
      vdj_b

  # Custom Capture
  custom:
    X: # raw counts

uns:
  sample_info: # dictionary of data frames, every data frame has a 'sample_id' column
    cellranger: h5attributes(h5)
    qc: # Data frame with columns:
      sample_id # corresponds to .obs["sample_id"]
      component_id # the component that generated these qc values, e.g. mapping/cellranger_count
      category # 10x example [Cells, Library], BD example [Sequencing Quality, Library Quality, ...]
      group_name # example 'ABC_1'
      metric_name
      metric_value # numerical values, example 1000, 0.1 -- strip % signs
  param_log: # list of dicts
    - pipeline_id
      component_id
      component_version
      id
      params: { input: ..., output: ..., arg1: ..., arg2: ... } # store only the base names of files, not their full paths
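
A rough sketch of how the field list above could be checked. It assumes the contents of the MuData object are first summarised as nested dicts and lists of slot names; `REQUIRED_AFTER_MAPPING` and `missing_fields` are hypothetical names, not part of any existing component. Fields marked with `?` are left out because they are still undecided, and only the rna modality is treated as required here (an assumption).

```python
# Hypothetical spec: slot names expected after mapping, taken from the list above.
# Fields marked "?" are omitted; treating only "rna" as required is an assumption.
REQUIRED_AFTER_MAPPING = {
    "obs": ["sample", "cell_type"],
    "mod": {
        "rna": {"layers": ["counts"], "var": ["feature_name"]},
    },
    "uns": ["sample_info", "param_log"],
}


def missing_fields(spec, summary, prefix=""):
    """Return dotted paths of fields in `spec` that are absent from `summary`."""
    missing = []
    if isinstance(spec, dict):
        for key, sub_spec in spec.items():
            path = f"{prefix}{key}"
            if key not in summary:
                missing.append(path)
            else:
                # Descend into the matching slot and check its children.
                missing.extend(missing_fields(sub_spec, summary[key], path + "."))
    else:
        # `spec` is a list of leaf names that must all be present.
        for name in spec:
            if name not in summary:
                missing.append(f"{prefix}{name}")
    return missing
```

Calling `missing_fields(REQUIRED_AFTER_MAPPING, summary)` returns an empty list when every listed slot is present, so a component could fail early with a readable message otherwise.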

After single sample RNA

mod:
  rna:
    obs:
      doublet_prob
      doublet_score
      doublet_bool
      <standard names for scanpy calculate qc metrics>
    var:
      <standard names for scanpy calculate qc metrics>
    layers:
      ambient_corrected_counts

After multi sample RNA

New fields:

mod:
  rna:
    var:
      highly_variable # boolean
    layers:
      normalized

After integration RNA

New fields:

mod:
  rna:
    obs:
      cluster
    obsm:
      X_pca
      X_integrated
      X_umap
    obsp:
      connectivities
      distances
    uns:
      neighbors: # for compatibility with umap
        connectivities_key
        distances_key
        params: { ... }
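
The `neighbors` entry only stores the names of the `obsp` slots, so it is easy for the two to drift apart. A minimal sketch of a consistency check, assuming `uns` is available as a plain dict and the `obsp` keys as a collection of strings (`check_neighbors` is a hypothetical helper):

```python
def check_neighbors(uns, obsp_keys, neighbors_key="neighbors"):
    """Return a list of problems with uns[neighbors_key]; empty if it looks consistent."""
    neighbors = uns.get(neighbors_key)
    if neighbors is None:
        return [f"uns['{neighbors_key}'] is missing"]

    problems = []
    for field in ("connectivities_key", "distances_key"):
        target = neighbors.get(field)
        if target is None:
            problems.append(f"uns['{neighbors_key}']['{field}'] is missing")
        elif target not in obsp_keys:
            # The key is recorded, but the matrix it points to is not in obsp.
            problems.append(f"obsp['{target}'] referenced by '{field}' does not exist")
    if "params" not in neighbors:
        problems.append(f"uns['{neighbors_key}']['params'] is missing")
    return problems
```

An empty return value means both recorded keys resolve to existing `obsp` entries and `params` is present.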

After annotation

Since annotation could be used across modalities, it should be possible to write its output to the root of the MuData object.

obsm:
  annotation_scvi: # data frame with the predictions and scores?
  annotation_bbknn: # data frame
  # all in one: with just the predictions?
  annotation:
    prediction_scvi
    prediction_bbknn
    ...
uns:
  ...?

WIP!

Logging QC metrics

uns:
  sample_info: # dictionary of data frames, every data frame has a 'sample_id' column
    cellranger: h5attributes(h5)
    qc: # Data frame with columns:
      sample_id # corresponds to .obs["sample_id"]
      component_id # the component that generated these qc values, e.g. mapping/cellranger_count
      category # 10x example [Cells, Library], BD example [Sequencing Quality, Library Quality, ...]
      group_name # example 'ABC_1'
      metric_name
      metric_value # numerical values, example 1000, 0.1 -- strip % signs
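
A possible way to build the rows of the `qc` data frame from a component's raw metrics, sketched with plain dicts. The helper names are hypothetical. Two assumptions are made: thousands separators are commas, and a `%` sign is stripped without rescaling the value.

```python
def parse_metric_value(raw):
    """Convert a raw metric such as "1,000" or "95.2%" to a float.

    Assumptions: commas are thousands separators, and a trailing % sign is
    stripped without rescaling (so "95.2%" becomes 95.2, not 0.952).
    """
    if isinstance(raw, (int, float)):
        return float(raw)
    return float(str(raw).strip().rstrip("%").replace(",", ""))


def qc_rows(sample_id, component_id, category, group_name, metrics):
    """Build one row (dict) per metric, using the column names of the qc data frame."""
    return [
        {
            "sample_id": sample_id,
            "component_id": component_id,
            "category": category,
            "group_name": group_name,
            "metric_name": name,
            "metric_value": parse_metric_value(value),
        }
        for name, value in metrics.items()
    ]
```

The resulting list of dicts can be turned into a data frame in one step, and rows from different components can simply be concatenated because they share the same columns.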

Logging execution

uns:
  param_log: # list of dicts
    - pipeline_id
      component_id
      component_version
      id
      params: { input: ..., output: ..., arg1: ..., arg2: ... } # store only the base names of files, not their full paths
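
One way a component could assemble a `param_log` entry, as a sketch only. `param_log_entry` and its `file_args` parameter are hypothetical. Which arguments count as file paths is an assumption here (`input` and `output` by default).

```python
import os


def param_log_entry(pipeline_id, component_id, component_version, run_id,
                    params, file_args=("input", "output")):
    """Build one param_log dict; file arguments are reduced to their base names."""

    def strip_dir(value):
        # Accept either a single path or a list of paths.
        if isinstance(value, (list, tuple)):
            return [os.path.basename(str(v)) for v in value]
        return os.path.basename(str(value))

    cleaned = {
        key: strip_dir(value) if key in file_args else value
        for key, value in params.items()
    }
    return {
        "pipeline_id": pipeline_id,
        "component_id": component_id,
        "component_version": component_version,
        "id": run_id,
        "params": cleaned,
    }
```

Each component would append its entry to the existing `uns["param_log"]` list, so the list records the order in which components ran.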

rcannood · 2022-12-06T13:38:22Z

An (incomplete) overview is included on the website: https://openpipelines.bio/guide/data_api.html

rcannood mentioned this issue Nov 17, 2022

Add multiomics support to Cell Ranger #86

Closed

4 tasks
