Data format specs after mapping #102

Open
rcannood opened this issue Sep 29, 2022 · 1 comment


rcannood commented Sep 29, 2022

This is a first attempt at deriving a data format specification.

Once we figure out some of the APIs, we could include these in our config.vsh.yaml definitions (similar to https://github.com/openproblems-bio/openproblems-v2/tree/main/src/label_projection/api )

After Cell Ranger or BD Rhapsody mapping

obs:
  index # cell id
  sample
  cell_type # human-readable name
  organism # ?
  tissue # ?

mod:
  # gene expression
  rna:
    layers:
      counts
      velocity_spliced # ?
      velocity_unspliced # ? 
    var:
      index # feature_id, preferably an ensembl id
      feature_name

  # Antibody Capture
  prot: 
    layers:
      counts
    var:
      index # feature_id
      feature_name # Associated protein names

  # Immune receptor (IR) data
  vdj: 
    obsm:
      vdj_t
      vdj_b

  # Custom Capture
  custom:
    X: # raw counts

uns:
  sample_info: # dictionary of data frames, every data frame has a 'sample_id' column
    cellranger: h5attributes(h5)
    qc: # Data frame with columns:
      sample_id # corresponds to .obs["sample_id"]
      component_id # the component that generated these qc values, e.g. mapping/cellranger_count
      category # 10x example [Cells, Library], BD example [Sequencing Quality, Library Quality, ...]
      group_name # example 'ABC_1'
      metric_name
      metric_value # numerical values, example 1000, 0.1 -- strip % signs
  param_log: # list of dicts
    - pipeline_id
      component_id
      component_version
      id
      params: { input: ..., output: ..., arg1: ..., arg2: ... } # store only the base names of files, not their full paths

After single sample RNA

mod:
  rna:
    obs:
      doublet_prob
      doublet_score
      doublet_bool
      <standard names for scanpy calculate qc metrics>
    var:
      <standard names for scanpy calculate qc metrics>
    layers:
      ambient_corrected_counts

After multi sample RNA

New fields:

mod:
  rna:
    var:
      highly_variable # boolean
    layers:
      normalized

After integration RNA

New fields:

mod:
  rna:
    obs:
      cluster
    obsm:
      X_pca
      X_integrated
      X_umap
    obsp:
      connectivities
      distances
    uns:
      neighbors: # for compatibility with umap
        connectivities_key
        distances_key
        params: { ... }
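
Because the later stages only list new fields, the full set of fields expected in mod["rna"] at a given stage is the union of everything introduced up to that point. A minimal sketch of that accumulation in plain Python follows. The stage names, the "slot/key" path notation, and the helpers `expected_fields` and `missing_fields` are illustrative assumptions, not part of the pipeline; the post-mapping fields and the scanpy QC metric columns are left out for brevity.

```python
# Hypothetical stage names and "slot/key" paths, transcribed from the RNA
# sections of this issue. Not an official schema.
NEW_FIELDS = {
    "single_sample": [
        "obs/doublet_prob",
        "obs/doublet_score",
        "obs/doublet_bool",
        "layers/ambient_corrected_counts",
    ],
    "multi_sample": ["var/highly_variable", "layers/normalized"],
    "integration": [
        "obs/cluster",
        "obsm/X_pca",
        "obsm/X_integrated",
        "obsm/X_umap",
        "obsp/connectivities",
        "obsp/distances",
        "uns/neighbors",
    ],
}


def expected_fields(stage):
    """Return every field introduced up to and including `stage`."""
    if stage not in NEW_FIELDS:
        raise KeyError(f"unknown stage: {stage}")
    fields = []
    for name, new in NEW_FIELDS.items():  # dicts preserve insertion order
        fields.extend(new)
        if name == stage:
            break
    return fields


def missing_fields(stage, present):
    """List the expected fields that are absent from the set `present`."""
    return [field for field in expected_fields(stage) if field not in present]


# Example: a multi-sample output that is still missing the normalized layer.
present = set(expected_fields("single_sample")) | {"var/highly_variable"}
still_missing = missing_fields("multi_sample", present)
```

A check like this could run at the end of each workflow to report which slots a component forgot to write.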

After annotation

Since annotation could be used across modalities, it should be possible to write its output to the root of the MuData object.

obsm:
  annotation_scvi: # data frame with the predictions and scores?
  annotation_bbknn: # data frame
  # all in one: with just the predictions?
  annotation:
    prediction_scvi
    prediction_bbknn
    ...
uns:
  ...?
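
To make the "all in one" option concrete, here is a small sketch in plain Python that merges per-method predictions into one record per cell, with one `prediction_<method>` column per method. The function name `combine_predictions`, the input shape, and the example labels are assumptions for illustration; the issue does not yet settle whether scores are stored alongside the predictions.

```python
def combine_predictions(per_method):
    """Merge {method: {cell_id: label}} into {cell_id: {prediction_<method>: label}}.

    Cells that a method did not annotate get None for that method's column.
    """
    cell_ids = sorted({cid for preds in per_method.values() for cid in preds})
    return {
        cid: {
            f"prediction_{method}": preds.get(cid)
            for method, preds in per_method.items()
        }
        for cid in cell_ids
    }


# Hypothetical predictions from two annotation methods.
table = combine_predictions(
    {
        "scvi": {"cell_1": "T cell", "cell_2": "B cell"},
        "bbknn": {"cell_1": "T cell"},
    }
)
```

The resulting mapping has the same columns as the `annotation` entry sketched above and could be turned into a data frame indexed by cell id.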

WIP!

Logging QC metrics

uns:
  sample_info: # dictionary of data frames, every data frame has a 'sample_id' column
    cellranger: h5attributes(h5)
    qc: # Data frame with columns:
      sample_id # corresponds to .obs["sample_id"]
      component_id # the component that generated these qc values, e.g. mapping/cellranger_count
      category # 10x example [Cells, Library], BD example [Sequencing Quality, Library Quality, ...]
      group_name # example 'ABC_1'
      metric_name
      metric_value # numerical values, example 1000, 0.1 -- strip % signs
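
As a sketch of how a component might turn a mapper's summary metrics into rows of this long-format table, the snippet below emits one row per metric and strips `%` signs and thousands separators so that `metric_value` is numeric. The function names, the input shape, and the example metrics are assumptions for illustration; real Cell Ranger or BD Rhapsody summaries would need their own parsing.

```python
def parse_metric_value(raw):
    """Convert strings such as '1,000' or '10.5%' to a float.

    The % sign is dropped but the number is not rescaled; whether
    percentages should become fractions is not specified in this issue.
    """
    return float(str(raw).strip().rstrip("%").replace(",", ""))


def qc_rows(sample_id, component_id, category, group_name, metrics):
    """Build one QC row (a dict) per entry in `metrics`."""
    return [
        {
            "sample_id": sample_id,
            "component_id": component_id,
            "category": category,
            "group_name": group_name,
            "metric_name": name,
            "metric_value": parse_metric_value(value),
        }
        for name, value in metrics.items()
    ]


# Illustrative input; the metric names and values are made up.
rows = qc_rows(
    sample_id="sample_1",
    component_id="mapping/cellranger_count",
    category="Cells",
    group_name="ABC_1",
    metrics={"Estimated Number of Cells": "1,000", "Fraction Reads in Cells": "10.5%"},
)
```

Keeping one metric per row means 10x and BD outputs can share the same columns even though their metric names differ.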

Logging execution

uns:
  param_log: # list of dicts
    - pipeline_id
      component_id
      component_version
      id
      params: { input: ..., output: ..., arg1: ..., arg2: ... } # store only the base names of files, not their full paths
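
A minimal sketch of building one `param_log` entry, reducing file arguments to their base names as the comment above asks. The helper name `param_log_entry`, the `file_keys` argument, and all example values are hypothetical; a real component would presumably know from its config which arguments are files.

```python
from pathlib import PurePosixPath


def param_log_entry(pipeline_id, component_id, component_version, run_id,
                    params, file_keys=("input", "output")):
    """Return one param_log dict; arguments named in `file_keys` keep only their base name."""
    cleaned = {
        key: PurePosixPath(value).name if key in file_keys else value
        for key, value in params.items()
    }
    return {
        "pipeline_id": pipeline_id,
        "component_id": component_id,
        "component_version": component_version,
        "id": run_id,
        "params": cleaned,
    }


# Hypothetical values for illustration only.
entry = param_log_entry(
    pipeline_id="example_pipeline",
    component_id="mapping/cellranger_count",
    component_version="0.0.0",
    run_id="sample_1",
    params={
        "input": "/data/run_1/sample_1.fastq.gz",
        "output": "/results/sample_1.h5mu",
        "arg1": 5,
    },
)

param_log = [entry]  # uns["param_log"] is a list of such dicts
```

Appending one entry per executed component gives a record of how a file was produced without leaking absolute paths from the machine that ran it.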

rcannood commented Dec 6, 2022

An (incomplete) overview is included on the website: https://openpipelines.bio/guide/data_api.html
