Osteosarcoma Dataset Annotation (SCPCP000017) #601
Hi @yutarohtanaka. I'm Jen, the Scientific Community Manager at the Data Lab. Thank you for sharing your proposed analysis! The team is currently reviewing this. We are looking forward to discussing more with you soon! We'll get back to you with next steps within 3 business days. In the meantime, we'll be setting up an AWS account for you. Once we do, you should receive an email with an invitation to finish setting up your account. I'll reach out again when you should be expecting to see this. Let us know if you have any questions about OpenScPCA. We look forward to working with you!
Hi @yutarohtanaka, I'm Stephanie, one of the Data Scientists in the Data Lab. We're looking forward to having you on board as an OpenScPCA contributor! Before you get started, I wanted to provide some additional guidance.
First, in your opening discussion, you describe some data processing steps including filtering ambient RNA, low-quality nuclei, and doublets. In an effort to keep the analysis across modules uniform and transparent, we instead ask that you start your analysis with the processed objects available on the ScPCA Portal ( Therefore, rather than starting with the
In addition, as part of OpenScPCA we have also annotated doublets in all samples using scDblFinder. Currently, we are in the process of running scDblFinder on all samples, and the results will be made available to you via S3. These results will include a TSV file with the output from scDblFinder. If you plan to annotate and remove doublets, please use these results. If you would like to use these results before they are available, you may also run the available script in the doublet-detection module.
If you need to use filtering, normalization, or doublet detection methods that are different from the methods that are already defined within the project, please provide your rationale and evidence that supports the alternative method and we can discuss this further.
When you are ready to get started, you can download the data using the
Please follow the below steps to start contributing to the project:
After you have initiated your module, you will be ready to continue with the rest of the analysis that you proposed. I would recommend that you break up your work into the following steps, where each bullet point would be an issue and at least one subsequent pull request (PR).
For more information on contributing to the project, I recommend you start by reviewing these sections of the documentation:
Finally, I'd like to note one part of your opening discussion comment:
Please bear in mind that, as an OpenScPCA contributor, your work will be conducted openly and iteratively. This means that, although the final results will of course take some time to fully establish, your analysis code will be openly available the entire time you are working. You can find more information about this in our policies, notably this one:
Let us know if you have any questions or comments, or want to generally discuss anything else! We're happy to have you on board!
Proposed analysis
We plan to annotate the osteosarcoma snRNA-seq samples in the SCPCP000017 dataset (n=28). Our workflow will include filtering for ambient and background RNA, filtering for low-quality nuclei and doublets, cell type annotation, and malignant cell annotation.
Scientific goals
To share a validated, curated set of cell type annotations for the osteosarcoma samples in this dataset.
Methods or approach
Filtering for Ambient RNA
CellBender is a computational tool that removes ambient / background RNA from count matrices. We will compare the performance of CellBender to DropletUtils::emptyDropsCellRanger() (which we understand was used to produce the “filtered counts” provided) to identify the best-performing method on this data, and remove all potential background RNA.
(More than happy to skip this step if it would be preferable to use the emptyDrops-filtered matrices)
Filtering for Low Quality Nuclei
Here, “low quality nuclei” are defined as nuclei with fewer than 300 expressed genes or 500 UMI counts, or with more than 6,000 expressed genes or 50,000 UMI counts. Additionally, we filter out nuclei that have no ribosomal gene expression, or in which mitochondrial or hemoglobin genes account for more than 20% or 5%, respectively, of total expression. We use scanpy built-in functions to perform this.
We will also filter out sparsely expressed genes that are detected in fewer than 5 cells.
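The thresholds above can be written out as a small filter. The following is a minimal NumPy-only sketch, not the scanpy calls the proposal intends to use. The function names `qc_mask` and `genes_in_min_cells` are hypothetical, the gene-symbol prefixes (`MT-`, `RPS`/`RPL`, `HB` not followed by `P`) are assumptions about how these gene families are named, and percentages are assumed to be computed over total UMI counts.

```python
import re
import numpy as np

def qc_mask(counts, gene_names,
            min_genes=300, max_genes=6000,
            min_counts=500, max_counts=50_000,
            max_pct_mt=20.0, max_pct_hb=5.0):
    """Boolean mask of nuclei that pass the proposed QC thresholds.

    counts: (n_nuclei, n_genes) array of raw UMI counts.
    gene_names: one gene symbol per column of `counts`.
    """
    counts = np.asarray(counts)
    names = [g.upper() for g in gene_names]

    # Assumed symbol conventions; adjust to the annotation actually in use.
    is_mt = np.array([g.startswith("MT-") for g in names])
    is_ribo = np.array([g.startswith(("RPS", "RPL")) for g in names])
    is_hb = np.array([re.match(r"HB[^P]", g) is not None for g in names])

    total = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)
    denom = np.maximum(total, 1)  # avoid division by zero for empty barcodes
    pct_mt = 100.0 * counts[:, is_mt].sum(axis=1) / denom
    pct_hb = 100.0 * counts[:, is_hb].sum(axis=1) / denom
    ribo = counts[:, is_ribo].sum(axis=1)

    return (
        (n_genes >= min_genes) & (n_genes <= max_genes)
        & (total >= min_counts) & (total <= max_counts)
        & (ribo > 0)
        & (pct_mt <= max_pct_mt) & (pct_hb <= max_pct_hb)
    )

def genes_in_min_cells(counts, min_cells=5):
    """Boolean mask of genes detected in at least `min_cells` nuclei."""
    return (np.asarray(counts) > 0).sum(axis=0) >= min_cells
```

Keeping the thresholds as keyword arguments makes it straightforward to report how many nuclei each criterion removes per sample.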
Filtering for Doublets
We have primarily used scrublet in our prior work, and found that it is able to identify doublets (and multiplets) with reasonable confidence. Here, we plan to use scrublet to call and filter out any potential doublets in each sample.
Annotating Cell Types
We will use two separate methods of cell type annotation, a manually curated marker-based approach and a supervised machine learning approach, to increase the confidence and granularity of the annotated cell types.
We will use lists of 5-10 genes for each cell type, and use the decoupler AUCell tool to score all of the cells in the samples on these genes. The genes for each cell type will be curated from existing literature and datasets such as CellxGene. Each cell will be assigned the cell type that it scores the highest on.
We plan to use the CellTypist package, a supervised, adaptable cell type prediction method. We will run this by using the package-provided pretrained models for the immune, epithelial, and endothelial cells, and will fine-tune an additional model to provide annotations for mesenchymal cells on a compiled single-cell expression object of mesenchymal cells extracted from CellxGene.
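The marker-based assignment rule (score every cell on each marker list, then take the highest-scoring cell type) can be illustrated as follows. This is a simplified, AUCell-like rank score written in NumPy for illustration only; it is not decoupler's implementation. The function names, the `top_frac` cutoff, and the "unassigned" fallback for cells with no expressed markers are all assumptions.

```python
import numpy as np

def aucell_like_scores(expr, gene_names, marker_sets, top_frac=0.05):
    """Score each cell by how strongly each marker set is enriched among
    its top-ranked genes (normalized recovery-curve area in [0, 1])."""
    expr = np.asarray(expr, dtype=float)
    n_cells, n_genes = expr.shape
    k = max(1, int(round(top_frac * n_genes)))  # number of top genes considered
    index = {g: i for i, g in enumerate(gene_names)}

    # Per-cell gene ranks: 0 is the most highly expressed gene.
    order = np.argsort(-expr, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(n_cells)[:, None], order] = np.arange(n_genes)[None, :]

    labels = list(marker_sets)
    scores = np.zeros((n_cells, len(labels)))
    for j, label in enumerate(labels):
        cols = [index[g] for g in marker_sets[label] if g in index]
        if not cols:
            continue
        r = ranks[:, cols]
        expressed = expr[:, cols] > 0  # ignore markers with zero expression
        # A marker at rank r < k contributes (k - r) to the recovery area.
        area = (np.clip(k - r, 0, None) * expressed).sum(axis=1)
        m = min(len(cols), k)
        max_area = (k - np.arange(m)).sum()  # all markers at the very top
        scores[:, j] = area / max_area
    return labels, scores

def assign_cell_types(labels, scores):
    """Assign each cell its highest-scoring label, or 'unassigned' if all scores are 0."""
    best = scores.argmax(axis=1)
    return [labels[b] if scores[i, b] > 0 else "unassigned"
            for i, b in enumerate(best)]
```

A short usage example with made-up values: with `marker_sets = {"T cell": ["CD3E", "CD2"], "Endothelial": ["PECAM1", "VWF"]}`, a cell whose top-ranked genes are CD3E and CD2 is assigned "T cell".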
Validating Cell Types
In step 4 (Annotating Cell Types), we annotate the cell types on a per-nucleus basis. To confirm these annotations, we perform PCA, Leiden clustering, and UMAP visualization. We expect nuclei of the same annotated type to cluster together. If nuclei cluster with nuclei annotated as a different type, we refine the annotations made.
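One way to make this check concrete is to summarize, for each cluster, the majority annotation and the fraction of nuclei that agree with it. The sketch below assumes cluster assignments (for example, from Leiden clustering) and per-nucleus labels are already available as parallel sequences; the function names and the `min_purity` cutoff of 0.8 are hypothetical choices, not part of the proposal.

```python
from collections import Counter, defaultdict

def cluster_label_summary(cluster_ids, labels, min_purity=0.8):
    """For each cluster, report the majority label, its purity, and
    whether the cluster falls below the (assumed) purity cutoff."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1

    summary = {}
    for cluster, counts in by_cluster.items():
        majority, n_majority = counts.most_common(1)[0]
        purity = n_majority / sum(counts.values())
        summary[cluster] = {
            "majority": majority,
            "purity": purity,
            "review": purity < min_purity,
        }
    return summary

def discordant_nuclei(cluster_ids, labels, summary):
    """Indices of nuclei whose label differs from their cluster's majority label."""
    return [i for i, (cluster, label) in enumerate(zip(cluster_ids, labels))
            if label != summary[cluster]["majority"]]
```

The discordant indices give a candidate list for manual review; whether to relabel them remains a judgment call based on marker expression.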
Identifying Malignant Cells
Because osteosarcoma is known to arise from an osteoblastic (mesenchymal) cell lineage, we run inferCNVpy per sample to infer copy number alterations, using the non-mesenchymal cells (e.g., endothelial, immune, and epithelial cells) as the control, “non-malignant” cells.
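The idea behind expression-based CNV inference with a reference population can be sketched in a few lines. This is a conceptual illustration only and does not reproduce inferCNVpy's implementation: it assumes a log-normalized matrix whose genes are already ordered by genomic position on a single chromosome, and the function names and the `window` size are hypothetical.

```python
import numpy as np

def cnv_profile(log_expr, is_reference, window=5):
    """Reference-centered, position-smoothed expression per cell.

    log_expr: (n_cells, n_genes) log-normalized expression, with genes
        ordered by genomic position (assumed: a single chromosome).
    is_reference: boolean array marking the control ("non-malignant") cells.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    is_reference = np.asarray(is_reference, dtype=bool)

    # Center every gene on its mean expression in the reference cells.
    baseline = log_expr[is_reference].mean(axis=0)
    centered = log_expr - baseline

    # Running mean over neighboring genes to damp gene-level noise.
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, centered
    )

def cnv_score(profile):
    """Per-cell summary: mean absolute deviation from the reference baseline."""
    return np.abs(profile).mean(axis=1)
```

Cells carrying broad gains or losses should show higher scores than the reference cells; in practice the score distribution would be inspected per sample rather than thresholded at a fixed value.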
Validation through Cohort Merging
After the normal and tumor cell types have been annotated per sample in steps 1-6 (ambient RNA filtering through malignant cell identification), we will merge the whole cohort into one expression matrix, and perform PCA, Leiden clustering, and UMAP visualization on the whole cohort object.
Given known inter-tumor heterogeneity, we expect that normal cell types identified will cluster together, and tumor cells should predominantly cluster per individual sample.
We will adjust any cell type annotations that have already been made if any cells cluster with other cell types.
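The expectation stated above (normal cell types mixing across samples, tumor cells clustering by sample) can be quantified per cluster with the normalized Shannon entropy of its sample composition. This is one possible metric, offered as an assumption rather than the planned method; the function name `sample_mixing` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def sample_mixing(cluster_ids, sample_ids):
    """Normalized entropy of sample composition for each cluster.

    Returns values in [0, 1]: 0 means the cluster comes from a single
    sample, 1 means all samples contribute equally.
    """
    n_samples = len(set(sample_ids))
    by_cluster = defaultdict(Counter)
    for cluster, sample in zip(cluster_ids, sample_ids):
        by_cluster[cluster][sample] += 1

    mixing = {}
    for cluster, counts in by_cluster.items():
        total = sum(counts.values())
        entropy = sum((n / total) * math.log(total / n) for n in counts.values())
        mixing[cluster] = entropy / math.log(n_samples) if n_samples > 1 else 0.0
    return mixing
```

Under this metric, clusters annotated as normal cell types would be expected to score closer to 1, and clusters annotated as tumor closer to 0.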
Existing modules
This processing and cell type annotation workflow largely follows the existing documentation in #292 (Ewing Sarcoma), with some adaptations.
Input data
We will start with the count matrices (the .h5ad “unfiltered counts file”) provided in the SCPCP000017 data repository. The analysis will be conducted using publicly available packages, and we will provide a final curated table of cell type markers, including the references used, along with the cell type annotation files.
Scientific literature
(CellBender) https://www.nature.com/articles/s41592-023-01943-7
(scanpy) https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0
(decouplerpy) https://doi.org/10.1093/bioadv/vbac016
(CellTypist) https://www.cell.com/cell/fulltext/S0092-8674(23)01312-0
(CellxGene) https://doi.org/10.1101/2023.10.30.563174
(inferCNVpy) https://github.com/icbi-lab/infercnvpy
Other details
All of this analysis can be performed on our local and cloud environments, and will predominantly be conducted in Python.
We plan to have all of this annotation performed and available to share within the next two months.