Osteosarcoma Dataset Annotation (SCPCP000017) #601

yutarohtanaka · 2024-07-12T16:41:15Z

yutarohtanaka
Jul 12, 2024

Proposed analysis

We plan to perform the annotation of osteosarcoma snRNA-seq samples in the SCPCP000017 (n=28) dataset. Our processing and cell type annotation will include filtering for ambient and background RNA, filtering for low quality nuclei and doublets, cell type annotation, and malignant cell annotation.

Scientific goals

To share a validated, curated set of cell type annotations for the osteosarcoma samples in this dataset.

Methods or approach

1. Filtering for Ambient RNA
CellBender is a computational tool that removes ambient / background RNA from count matrices. We will compare the performance of CellBender with that of DropletUtils::emptyDropsCellRanger() (which we understand was used to produce the “filtered counts” provided) to determine which method performs best on this data, and then remove all potential background RNA.
(We are more than happy to skip this step if it would be preferable to use the emptyDrops-filtered matrices.)
2. Filtering for Low Quality Nuclei
Here, “low quality nuclei” are defined as nuclei with fewer than 300 genes or 500 UMI counts, or more than 6,000 genes or 50,000 UMI counts. Additionally, we filter out nuclei that have no ribosomal gene expression, or in which mitochondrial genes account for more than 20%, or hemoglobin genes for more than 5%, of total expression. We use scanpy built-in functions to perform this.
We will also filter out any sparsely expressed genes that are expressed in fewer than 5 cells.
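As a rough illustration of the thresholds above, here is a minimal sketch in plain Python. It assumes the per-nucleus metrics (gene count, UMI count, and ribosomal / mitochondrial / hemoglobin percentages) have already been computed, for example with `scanpy.pp.calculate_qc_metrics`. The `NucleusQC` class and both function names are hypothetical and only show the logic, not the code we will actually run.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NucleusQC:
    """Per-nucleus QC metrics (hypothetical container for illustration)."""
    n_genes: int      # number of genes detected
    n_umis: int       # total UMI counts
    pct_ribo: float   # percent of expression from ribosomal genes
    pct_mito: float   # percent of expression from mitochondrial genes
    pct_hb: float     # percent of expression from hemoglobin genes


def passes_qc(m: NucleusQC) -> bool:
    """Return True if a nucleus survives the thresholds listed in step 2."""
    if m.n_genes < 300 or m.n_umis < 500:        # too little signal
        return False
    if m.n_genes > 6000 or m.n_umis > 50_000:    # suspiciously large
        return False
    if m.pct_ribo == 0:                          # no ribosomal expression
        return False
    if m.pct_mito > 20 or m.pct_hb > 5:          # mitochondrial / hemoglobin caps
        return False
    return True


def genes_to_keep(cells_per_gene: dict[str, int], min_cells: int = 5) -> list[str]:
    """Keep genes expressed in at least `min_cells` cells."""
    return [gene for gene, n in cells_per_gene.items() if n >= min_cells]
```

Keeping the thresholds in one small function makes them easy to report alongside the results and to adjust per cohort if needed.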
3. Filtering for Doublets
We have primarily used scrublet in our prior work, and have found that it identifies doublets (and multiplets) with reasonable confidence. Here, we plan to use scrublet to call and filter out any potential doublets in each sample.
4. Annotating Cell Types
We will use two separate methods of cell type annotation - a manually curated marker gene based approach, and a supervised machine learning approach - to increase the confidence and granularity of the annotated cell types.

Marker Gene Based Approach
We will curate a list of 5-10 marker genes for each cell type, and use the decoupler AUCell tool to score every cell in each sample against these lists. The genes for each cell type will be curated from existing literature and datasets such as CellxGene. Each cell will be assigned the cell type for which it scores highest.
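To make the scoring idea concrete, the sketch below implements a simplified, AUCell-style recovery score in plain Python and assigns each cell its highest-scoring type. This is not decoupler's implementation: the normalization and the `top_n` cutoff are simplifying assumptions, and the function names and the gene lists in the usage note are placeholders rather than our curated markers.

```python
from __future__ import annotations


def recovery_auc(expression: dict[str, float], markers: set[str], top_n: int) -> float:
    """Simplified AUCell-style score for one cell.

    Genes are ranked by expression (highest first), and we accumulate how
    quickly the marker genes are "recovered" within the top `top_n` genes.
    The area is normalized by the best achievable area, so the score is in [0, 1].
    """
    ranked = sorted(expression, key=lambda g: (-expression[g], g))[:top_n]
    hits = 0
    area = 0
    for gene in ranked:
        if gene in markers:
            hits += 1
        area += hits  # running count of recovered markers at each rank
    best_hits = min(len(markers), len(ranked))
    max_area = sum(min(rank, best_hits) for rank in range(1, len(ranked) + 1))
    return area / max_area if max_area else 0.0


def assign_cell_type(
    expression: dict[str, float],
    marker_sets: dict[str, set[str]],
    top_n: int = 50,
) -> tuple[str, dict[str, float]]:
    """Score a cell against every marker list and return the top-scoring type."""
    scores = {ct: recovery_auc(expression, genes, top_n) for ct, genes in marker_sets.items()}
    # Ties are resolved alphabetically so the result is deterministic.
    best = max(sorted(scores), key=scores.get)
    return best, scores
```

For example, with `marker_sets = {"T cell": {"CD3E", "CD2"}, "Osteoblast": {"COL1A1", "RUNX2"}}` and a cell whose highest-expressed genes are CD3E and CD2, `assign_cell_type` returns "T cell". In practice it may also be worth recording the margin between the top two scores, since a narrow margin is a natural trigger for manual review.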
Supervised Machine Learning
We plan to use the CellTypist package, a supervised, adaptable cell type prediction method. We will use the package-provided pretrained models for immune, epithelial, and endothelial cells, and will fine-tune an additional model for mesenchymal cells on a compiled single-cell expression object of mesenchymal cells extracted from CellxGene.
Each cell will be assigned the cell type that the model returns the highest confidence score for.
In our experience, these two approaches largely agree (especially in annotating non-malignant cells), and using two complementary approaches has increased our confidence in the resulting annotations. We can provide a detailed cell type table with calls from each of these methods as supplementary data, if needed. When the two methods assign conflicting cell types to a cell, we will conduct further manual review using an expanded set of marker genes, and re-examine the confidence scores computed by the machine learning method.
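A small sketch of how the two sets of calls could be compared to find the cells that need manual review. It assumes each method's output has been reduced to a barcode-to-label mapping with harmonized label names; the function name and the barcodes in the usage note are hypothetical.

```python
from __future__ import annotations


def compare_annotations(
    marker_calls: dict[str, str],
    model_calls: dict[str, str],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Compare two barcode -> cell type mappings.

    Returns the fraction of shared barcodes with identical labels, plus a
    list of (barcode, marker_label, model_label) for every disagreement.
    """
    shared = sorted(set(marker_calls) & set(model_calls))
    conflicts = [
        (barcode, marker_calls[barcode], model_calls[barcode])
        for barcode in shared
        if marker_calls[barcode] != model_calls[barcode]
    ]
    agreement = (len(shared) - len(conflicts)) / len(shared) if shared else 0.0
    return agreement, conflicts
```

For instance, if three of four shared barcodes carry the same label in both mappings, the function returns an agreement of 0.75 and a one-element conflict list, which becomes the queue for review with the expanded marker set.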

5. Validating Cell Types
In step 4, we annotate the cell types on a per-nucleus basis. To confirm these annotations, we perform PCA, Leiden clustering, and UMAP visualization. We expect nuclei of the same annotated type to cluster together. If nuclei cluster with nuclei annotated as a different type, we refine the annotations.
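One way to turn "nuclei that cluster with a different type" into a concrete list is to compare each nucleus's label with the majority label of its cluster. The sketch below assumes cluster assignments are already available (for example from Leiden clustering in scanpy); the function name and the `min_majority` cutoff are illustrative assumptions, not part of the planned pipeline.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def flag_off_cluster_nuclei(
    cluster_of: dict[str, str],
    label_of: dict[str, str],
    min_majority: float = 0.5,
) -> tuple[dict[str, str], list[str]]:
    """Find nuclei whose annotation differs from their cluster's majority label.

    A cluster only gets a majority label if one cell type makes up more than
    `min_majority` of its nuclei; mixed clusters are left for manual inspection.
    """
    labels_by_cluster: dict[str, list[str]] = defaultdict(list)
    for barcode, cluster in cluster_of.items():
        labels_by_cluster[cluster].append(label_of[barcode])

    majority: dict[str, str] = {}
    for cluster, labels in labels_by_cluster.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) > min_majority:
            majority[cluster] = label

    flagged = sorted(
        barcode
        for barcode, cluster in cluster_of.items()
        if cluster in majority and label_of[barcode] != majority[cluster]
    )
    return majority, flagged
```

The flagged barcodes are candidates for refinement rather than automatic relabeling, since a discordant nucleus may also indicate a clustering artifact.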
6. Identifying Malignant Cells
Because osteosarcoma is known to arise from an osteoblastic cell lineage, we run inferCNVpy per sample to infer copy number alterations, using the non-mesenchymal cells (e.g., endothelial, immune, and epithelial cells) as the “non-malignant” control cells.
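A hedged sketch of how the per-sample reference selection could be wired up. The `min_cells` cutoff, the label names, the `cell_type` column, and both function names are assumptions made for illustration. The call to `infercnvpy.tl.infercnv` with `reference_key` and `reference_cat` follows our understanding of the inferCNVpy API, which also expects genomic positions (chromosome, start, end) in `adata.var`.

```python
from __future__ import annotations

from collections import Counter

# Non-mesenchymal types used as the "non-malignant" reference (assumed label names).
REFERENCE_TYPES = ("Endothelial", "Immune", "Epithelial")


def reference_categories(labels, candidates=REFERENCE_TYPES, min_cells: int = 20) -> list[str]:
    """Return the candidate reference types with at least `min_cells` cells in this sample."""
    counts = Counter(labels)
    return [cell_type for cell_type in candidates if counts[cell_type] >= min_cells]


def run_infercnv(adata, label_key: str = "cell_type", min_cells: int = 20) -> list[str]:
    """Run inferCNVpy on one sample, using its non-mesenchymal cells as reference."""
    import infercnvpy as cnv  # imported lazily so the helper above has no heavy dependency

    refs = reference_categories(adata.obs[label_key], min_cells=min_cells)
    if not refs:
        raise ValueError("no reference cell type has enough cells in this sample")
    cnv.tl.infercnv(adata, reference_key=label_key, reference_cat=refs)
    return refs
```

Checking the reference counts first matters because some samples may contain very few normal cells, in which case the inferred CNV profile would rest on an unreliable baseline.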
7. Validation through Cohort Merging
After the normal and tumor cell types have been annotated per sample in steps 1-6, we will merge the whole cohort into one expression matrix, and perform PCA, Leiden clustering, and UMAP visualization on the whole cohort object.
Given known inter-tumor heterogeneity, we expect that normal cell types will cluster together across samples, and that tumor cells will predominantly cluster by individual sample.
We will adjust any existing cell type annotations if cells cluster with other cell types.
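The expectation above can be checked numerically by asking, for each cohort-level cluster, how much of it comes from a single sample. The sketch below is one simple way to do that in plain Python; the function name and the identifiers in the usage note are hypothetical.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def dominant_sample_fraction(
    cluster_of: dict[str, str],
    sample_of: dict[str, str],
) -> dict[str, tuple[str, float]]:
    """For each cluster, report its most common sample and that sample's share of the cluster."""
    samples_by_cluster: dict[str, list[str]] = defaultdict(list)
    for cell, cluster in cluster_of.items():
        samples_by_cluster[cluster].append(sample_of[cell])

    summary: dict[str, tuple[str, float]] = {}
    for cluster, samples in samples_by_cluster.items():
        sample, count = Counter(samples).most_common(1)[0]
        summary[cluster] = (sample, count / len(samples))
    return summary
```

Under this reading, clusters of tumor cells should show a dominant-sample fraction close to 1, while clusters of normal cell types should be spread across samples. A normal-annotated cluster dominated by one sample, or a tumor cluster mixing many samples, would be worth a closer look.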

Existing modules

This processing and cell type annotation workflow largely follows the existing documentation in #292 (Ewing Sarcoma), with some adaptations.

Input data

We will start with the count matrices (the .h5ad “unfiltered counts file”) provided in the SCPCP000017 data repository. The analysis will be conducted using publicly available packages, and we will provide a final curated table of cell type markers, including the references used, along with the cell type annotation files.

Scientific literature

(CellBender) https://www.nature.com/articles/s41592-023-01943-7
(scanpy) https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0
(decouplerpy) https://doi.org/10.1093/bioadv/vbac016
(CellTypist) https://www.cell.com/cell/fulltext/S0092-8674(23)01312-0
(CellxGene) https://doi.org/10.1101/2023.10.30.563174
(inferCNVpy) https://github.com/icbi-lab/infercnvpy

Other details

All of this analysis can be performed in our local and cloud environments, and will predominantly be conducted in Python.
We plan to have all of this annotation performed and available to share within the next two months.

Jen-OMalley · 2024-07-12T16:58:59Z

Jen-OMalley
Jul 12, 2024
Maintainer

Hi @yutarohtanaka. I'm Jen, the Scientific Community Manager at the Data Lab. Thank you for sharing your proposed analysis! The team is currently reviewing this. We are looking forward to discussing more with you soon! We'll get back to you with next steps within 3 business days.

In the meantime, we'll be setting up an AWS account for you. Once we do, you should receive an email with an invitation to finish setting up your account. I'll reach out again when you should be expecting to see this. Let us know if you have any questions about OpenScPCA. We look forward to working with you!

2 replies

yutarohtanaka Jul 12, 2024
Author

Thanks so much!

jaclyn-taroni Jul 12, 2024
Maintainer

Hi @yutarohtanaka, your AWS account has been created, and you should receive an email to complete setup. Here are instructions for setting up AWS!

sjspielman · 2024-07-18T17:02:36Z

sjspielman
Jul 18, 2024
Maintainer

Hi @yutarohtanaka, I'm Stephanie, one of the Data Scientists in the Data Lab. We're looking forward to having you on board as an OpenScPCA contributor!

Before you get started, I wanted to provide some additional guidance.

First, in your opening discussion, you describe some data processing steps including filtering ambient RNA, low-quality nuclei, and doublets.

In an effort to keep the analysis across modules uniform and transparent, we instead ask that you start your analysis with the processed objects available on the ScPCA Portal (_processed.rds or _processed_rna.h5ad). These objects have already undergone removal of empty droplets, filtering of low quality cells, and normalization. See the documentation on the ScPCA Portal for more information about how these objects were processed. This ensures that all analyses in OpenScPCA have the same starting point.

Therefore, rather than starting with the .h5ad "unfiltered" files provided in the SCPCP000017 data repository, it would be preferable for you to start with the .h5ad (or .rds if you end up venturing into R at any point, though I do see your proposed methods are all python!) "processed" datasets instead. Please let us know if you have any specific questions or comments about this starting point!

In addition, as part of OpenScPCA we are annotating doublets in all samples using scDblFinder. We are currently in the process of running scDblFinder on all samples, and the results will be made available to you via S3. These results will include a TSV file with the output from scDblFinder. If you plan to annotate and remove doublets, please use these results. If you would like to use these results before they are available, you may also run the available script in the doublet-detection module.
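As a rough sketch of how such a TSV could be used to keep only singlets: the column names `barcodes` and `class` below are assumptions (please check the header of the actual file), as is the function name. The `singlet` / `doublet` class values follow scDblFinder's usual output.

```python
from __future__ import annotations

import csv
import io


def singlet_barcodes(
    tsv_text: str,
    barcode_col: str = "barcodes",  # assumed column name
    class_col: str = "class",       # assumed column name
) -> set[str]:
    """Return the barcodes that a scDblFinder-style TSV classifies as singlets."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[barcode_col] for row in reader if row[class_col] == "singlet"}


# Tiny made-up example with the assumed layout.
example = "barcodes\tscore\tclass\nAAAC\t0.02\tsinglet\nAAAG\t0.97\tdoublet\nAAAT\t0.10\tsinglet\n"
keep = singlet_barcodes(example)
```

The resulting set can then be used to subset the processed object by barcode before any downstream annotation.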

If you need to use filtering, normalization, or doublet detection methods that are different from the methods that are already defined within the project, please provide your rationale and evidence that supports the alternative method and we can discuss this further.

When you are ready to get started, you can download the data using the download-data.py script in the repository, which will download the processed objects by default.
Note that you may also prefer to work with our test data, described in the linked documentation above, while developing this code.

Please follow the below steps to start contributing to the project:

After you have initiated your module, you will be ready to continue with the rest of the analysis that you proposed. I would recommend that you break up your work into the following steps, where each bullet point would be an issue and at least one subsequent pull request (PR).

- Curate a list of marker genes you intend to use
  - If you already have this set of genes, this could just be adding a metafile with that information to your module for scripts to consume, as well as associated documentation for where these genes came from and how they were chosen.
- Use the list of marker genes to annotate samples using AUCell
- Annotate samples with CellTypist
  - I'm not sure if this will be a separate step for you, but in case it is, you should first write any separate scripts needed to subset data to mesenchymal cells. This would not be a separate step if the subsetting is directly part of the CellTypist script you'll write. If you do generate subsetted data files along the way, you might save them in your module's scratch directory.
- Validate cell types, including comparing AUCell and CellTypist results to see if the annotations match your expectation that these methods return similar cell types
- Identify malignant cells with inferCNVpy
  - If you perform any additional validation here or comparison to previously-identified cell types, then that would also be its own issue and PR.
- Merge all samples together
  - This analysis will likely be something we'll want to come back to and discuss. For example, for visualization, you will probably want to perform batch correction/integration, but this may not be necessary for all downstream applications. To the extent you want to integrate the data, you'll need to pick a suitable tool, and we're definitely glad to discuss this more once you get to this analysis stage!
- Generate the final table of results
  - It's likely the final annotations you derive may come from several sources, so at some point you'll want to write code that combines the specific labels you consider final into a single table.
  - In addition, this final table might contain manually-labeled cells in addition to labels from automated methods. Please make sure that manual changes you make to automated cell type labels are clearly documented (e.g., what evidence or information was used to make the manual call? why do you think a given manual call is better than an automated one?) and reproducible as well. In other words, manual cell typing should still be part of a script, not a fully manual overwrite of a result file.
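To illustrate what "manual cell typing as part of a script" could look like, here is a minimal sketch in which every manual change is a small record carrying its own evidence, applied on top of the automated labels. The class, function, and column names are hypothetical, not an OpenScPCA convention.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ManualOverride:
    """One documented manual relabeling (hypothetical structure)."""
    barcode: str
    new_label: str
    evidence: str  # why the manual call is preferred over the automated one


def build_final_table(
    automated: dict[str, str],
    overrides: list[ManualOverride],
) -> list[dict[str, str]]:
    """Combine automated labels with scripted manual overrides into one table."""
    by_barcode = {o.barcode: o for o in overrides}
    unknown = sorted(set(by_barcode) - set(automated))
    if unknown:
        raise KeyError(f"overrides refer to unknown barcodes: {unknown}")

    rows = []
    for barcode in sorted(automated):
        override = by_barcode.get(barcode)
        rows.append(
            {
                "barcode": barcode,
                "automated_label": automated[barcode],
                "final_label": override.new_label if override else automated[barcode],
                "source": "manual" if override else "automated",
                "evidence": override.evidence if override else "",
            }
        )
    return rows
```

Because the overrides live in code (or in a versioned input file the script reads), rerunning the script reproduces the final table exactly, and the evidence column documents each manual decision alongside the label it replaced.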

For more information on contributing to the project, I recommend you start by reviewing these sections of the documentation:

Finally, I'd like to note one part of your opening discussion comment:

We plan to have all of this annotation performed and available to share within the next two months.

Please bear in mind that, as an OpenScPCA contributor, your work will be conducted openly and iteratively. This means that, although the final results will of course take some time to fully establish, your analysis code will be openly available the entire time you are working. You can find more information about this in our policies, notably this one:

Code and other materials you commit to your fork and push to GitHub will be publicly available and licensed under the same license as the AlexsLemonade repository (Creative Commons Attribution 4.0 International and 3-Clause BSD licenses).

Let us know if you have any questions or comments, or want to generally discuss anything else! We're happy to have you on board!

4 replies

yutarohtanaka Jul 21, 2024
Author

Hi Stephanie,

Thank you so much for your feedback - and excited to be working with you and the DataLab!

Data Processing
Thank you for your guidance - we will be sure to let you know if we find that alternative methods appear to perform better on the specific cohort, but otherwise will plan to use the processed objects with the scDblFinder identified doublets removed.
PR
We will be sure to open pull requests when we complete each step of the analysis as you've suggested.
Data Sharing
Thank you for pointing us to your policies - we are more than happy to iteratively share the code / analysis, including any intermediate output.

Thanks so much again, and will be in touch when we have made progress, or anything else comes up!

sjspielman Jul 26, 2024
Maintainer

Great! A couple quick replies -

otherwise will plan to use the processed objects with the scDblFinder identified doublets removed.

Please stay tuned to our announcements page. We will post an announcement when the scDblFinder results for all ScPCA libraries have been processed for you to use straight away without having to run that module yourself.

Thanks so much again, and will be in touch when we have made progress, or anything else comes up!

Sounds good! Just FYI, I'd recommend filing your first issue and pull request establishing your analysis module very early in this process so that you can continue to build up incrementally from there with your next issues and PRs. This will also ensure you're headed in the right direction for this analysis regarding the overall scheme and structure of the OpenScPCA open contribution model.

sjspielman Aug 1, 2024
Maintainer

Hi @yutarohtanaka , I wanted to let you know that results from scDblFinder are now available for download, as described in this announcement. Let us know if you have questions about using these results in your analysis, should you choose to do so!

yutarohtanaka Aug 3, 2024
Author

@sjspielman
Thank you so much for letting us know!
We will start working with these results, and will let you know of any discrepancies we find relative to the doublet detection methods we've already run on this and other datasets. We will also be sure to submit a pull request for this submission very soon.
Thanks again!
