From ac36ad386ec92d13d4aeceaa8d4581865763b0b3 Mon Sep 17 00:00:00 2001
From: Dominik <dotto@fredhutch.org>
Date: Tue, 28 Nov 2023 13:29:16 -0800
Subject: [PATCH] move manuscript data processing to a notebook

---
 README.md                       |   8 +
 notebooks/manuscript_data.ipynb | 501 ++++++++++++++++++++++++++++++++
 2 files changed, 509 insertions(+)
 create mode 100644 notebooks/manuscript_data.ipynb

diff --git a/README.md b/README.md
index 7f2a5289..740c0fda 100755
--- a/README.md
+++ b/README.md
@@ -17,6 +17,14 @@ Palantir has been implemented in Python3 and can be installed using:
 
 A tutorial on Palantir usage and results visualization for single cell RNA-seq data can be found in this notebook: http://nbviewer.jupyter.org/github/dpeerlab/Palantir/blob/master/notebooks/Palantir_sample_notebook.ipynb
 
+## Processed data and metadata
+
+Processed `anndata` objects (readable with `scanpy`) are available for download for the three replicates generated in the manuscript:
+- [Replicate 1 (Rep1)](https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep1.h5ad)
+- [Replicate 2 (Rep2)](https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep2.h5ad)
+- [Replicate 3 (Rep3)](https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep3.h5ad)
+
+This notebook details how to use the data in `Python` and `R`: http://nbviewer.jupyter.org/github/dpeerlab/Palantir/blob/master/notebooks/manuscript_data.ipynb
 
 ## Comparison to trajectory detection algorithms
 Notebooks detailing the generation of results comparing Palantir to trajectory detection algorithms are available [here](https://github.com/dpeerlab/Palantir/blob/master/notebooks/comparisons)
diff --git a/notebooks/manuscript_data.ipynb b/notebooks/manuscript_data.ipynb
new file mode 100644
index 00000000..2277e576
--- /dev/null
+++ b/notebooks/manuscript_data.ipynb
@@ -0,0 +1,501 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "68a5c2f5-9391-4170-b5ea-9df9ad5eafb4",
+   "metadata": {},
+   "source": [
+    "# Access and Analyze the `anndata` Objects from the Palantir Manuscript\n",
+    "\n",
+    "This notebook shows how to load the `anndata` objects released with the Palantir manuscript, in Python with `scanpy` and in R with `Seurat`. Three replicates of human CD34+ bone marrow cells are available for download:\n",
+    "\n",
+    "- [Replicate 1 (Rep1)](https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep1.h5ad)\n",
+    "- [Replicate 2 (Rep2)](https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep2.h5ad)\n",
+    "- [Replicate 3 (Rep3)](https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep3.h5ad)\n",
+    "\n",
+    "Each `anndata` object contains the following fields:\n",
+    "\n",
+    "1. `.X`: Filtered, normalized, and log-transformed count matrix.\n",
+    "2. `.raw`: Raw (unnormalized) count matrix after filtering.\n",
+    "3. `.obsm['MAGIC_imputed_data']`: Count matrix imputed with the MAGIC algorithm.\n",
+    "4. `.obsm['tsne']`: t-SNE maps (as presented in the manuscript), generated using scaled diffusion components.\n",
+    "5. `.obs['clusters']`: Cell clustering information.\n",
+    "6. `.obs['palantir_pseudotime']`: Cell pseudo-time ordering, as determined by Palantir.\n",
+    "7. `.obs['palantir_diff_potential']`: Palantir-determined differentiation potential of cells.\n",
+    "8. `.obsm['palantir_branch_probs']`: Probabilities of cells branching into different lineages, according to Palantir.\n",
+    "9. `.uns['palantir_branch_probs_cell_types']`: Labels for Palantir branch probabilities.\n",
+    "10. `.uns['ct_colors']`: Color codes for cell types, as used in the manuscript.\n",
+    "11. `.uns['cluster_colors']`: Color codes for cell clusters, as used in the manuscript.\n",
+    "12. `.varm['mast_diff_res_pval']`: MAST algorithm p-values for differential expression analysis across clusters.\n",
+    "13. `.varm['mast_diff_res_statistic']`: Test statistics from the MAST differential expression analysis.\n",
+    "14. `.uns['mast_diff_res_columns']`: Column names for MAST differential expression results.\n",
+    "\n",
+    "## Python Code for Data Access"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "63f356a7-3856-4596-a7b3-9fc05cc3029a",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:20:46.755293Z",
+     "iopub.status.busy": "2023-11-28T21:20:46.755059Z",
+     "iopub.status.idle": "2023-11-28T21:20:59.646740Z",
+     "shell.execute_reply": "2023-11-28T21:20:59.645355Z",
+     "shell.execute_reply.started": "2023-11-28T21:20:46.755266Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import scanpy as sc\n",
+    "\n",
+    "# Read each replicate from ../data/; if a file is missing, scanpy downloads it from backup_url first\n",
+    "adata_Rep1 = sc.read(\n",
+    "    \"../data/human_cd34_bm_rep1.h5ad\",\n",
+    "    backup_url=\"https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep1.h5ad\",\n",
+    ")\n",
+    "adata_Rep2 = sc.read(\n",
+    "    \"../data/human_cd34_bm_rep2.h5ad\",\n",
+    "    backup_url=\"https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep2.h5ad\",\n",
+    ")\n",
+    "adata_Rep3 = sc.read(\n",
+    "    \"../data/human_cd34_bm_rep3.h5ad\",\n",
+    "    backup_url=\"https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep3.h5ad\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "bee4a735-7c47-415a-b1e3-ee776998dbd5",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:20:59.650053Z",
+     "iopub.status.busy": "2023-11-28T21:20:59.649313Z",
+     "iopub.status.idle": "2023-11-28T21:20:59.659463Z",
+     "shell.execute_reply": "2023-11-28T21:20:59.658910Z",
+     "shell.execute_reply.started": "2023-11-28T21:20:59.650021Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AnnData object with n_obs × n_vars = 5780 × 14651\n",
+       "    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'\n",
+       "    uns: 'cluster_colors', 'ct_colors', 'palantir_branch_probs_cell_types'\n",
+       "    obsm: 'tsne', 'MAGIC_imputed_data', 'palantir_branch_probs'"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "adata_Rep1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "515e6760-8f95-42d6-87ba-1a2375797ccf",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:20:59.660313Z",
+     "iopub.status.busy": "2023-11-28T21:20:59.660133Z",
+     "iopub.status.idle": "2023-11-28T21:20:59.676952Z",
+     "shell.execute_reply": "2023-11-28T21:20:59.676283Z",
+     "shell.execute_reply.started": "2023-11-28T21:20:59.660295Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AnnData object with n_obs × n_vars = 6501 × 14913\n",
+       "    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'\n",
+       "    uns: 'cluster_colors', 'ct_colors', 'palantir_branch_probs_cell_types'\n",
+       "    obsm: 'tsne', 'MAGIC_imputed_data', 'palantir_branch_probs'"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "adata_Rep2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "61d7a8e0-0916-4099-8982-5599d7166104",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:20:59.678250Z",
+     "iopub.status.busy": "2023-11-28T21:20:59.677863Z",
+     "iopub.status.idle": "2023-11-28T21:20:59.691822Z",
+     "shell.execute_reply": "2023-11-28T21:20:59.691131Z",
+     "shell.execute_reply.started": "2023-11-28T21:20:59.678220Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "AnnData object with n_obs × n_vars = 12046 × 14044\n",
+       "    obs: 'clusters', 'palantir_pseudotime', 'palantir_diff_potential'\n",
+       "    uns: 'cluster_colors', 'ct_colors', 'palantir_branch_probs_cell_types'\n",
+       "    obsm: 'tsne', 'MAGIC_imputed_data', 'palantir_branch_probs'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "adata_Rep3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b057a720-f0f4-40b0-8bcf-02efc9b2124d",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T19:21:40.634650Z",
+     "iopub.status.busy": "2023-11-28T19:21:40.634039Z",
+     "iopub.status.idle": "2023-11-28T19:21:40.647637Z",
+     "shell.execute_reply": "2023-11-28T19:21:40.646498Z",
+     "shell.execute_reply.started": "2023-11-28T19:21:40.634595Z"
+    }
+   },
+   "source": [
+    "## Converting `anndata` Objects to `Seurat` Objects Using R\n",
+    "\n",
+    "To work with the data in R, convert each `anndata` object to a `Seurat` object in three steps:\n",
+    "\n",
+    "1. **Set Up R Environment and Libraries**:\n",
+    "   - Load the necessary libraries: `Seurat` and `anndata`.\n",
+    "\n",
+    "2. **Download and Read the Data**:\n",
+    "   - Use `curl::curl_download` to download each `.h5ad` file from the provided URLs if it is not already in `../data/`.\n",
+    "   - Read the file using the `read_h5ad` function from the `anndata` R package.\n",
+    "\n",
+    "3. **Create Seurat Objects**:\n",
+    "   - Use the `CreateSeuratObject` function to build a Seurat object from the expression matrix in `.X` (passed as `counts`, although `.X` holds normalized, log-transformed values) and the cell metadata in `.obs`.\n",
+    "   - Add the t-SNE embedding as a dimensional reduction, the MAGIC-imputed expression as a second assay, and the Palantir branch probabilities as cell metadata.\n",
+    "\n",
+    "### R Code Snippet"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "562d56fb-80dc-4f44-8266-3ca559e79106",
+   "metadata": {
+    "jupyter": {
+     "source_hidden": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# This cell exists only to run R code inside this Python notebook through rpy2 when using a conda kernel\n",
+    "import sys\n",
+    "import os\n",
+    "\n",
+    "# Get the path to the python executable\n",
+    "python_executable_path = sys.executable\n",
+    "\n",
+    "# Extract the path to the environment from the path to the python executable\n",
+    "env_path = os.path.dirname(os.path.dirname(python_executable_path))\n",
+    "\n",
+    "print(\n",
+    "    f\"Conda env path: {env_path}\\n\"\n",
+    "    \"Please make sure you have R installed in the conda environment.\"\n",
+    ")\n",
+    "\n",
+    "os.environ['R_HOME'] = os.path.join(env_path, 'lib', 'R')\n",
+    "\n",
+    "%load_ext rpy2.ipython"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "ed46f119-e8be-45ba-b447-b46e8b947cf8",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:21:01.081154Z",
+     "iopub.status.busy": "2023-11-28T21:21:01.080675Z",
+     "iopub.status.idle": "2023-11-28T21:23:08.313753Z",
+     "shell.execute_reply": "2023-11-28T21:23:08.313058Z",
+     "shell.execute_reply.started": "2023-11-28T21:21:01.081128Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "R[write to console]: Loading required package: SeuratObject\n",
+      "\n",
+      "R[write to console]: Loading required package: sp\n",
+      "\n",
+      "R[write to console]: \n",
+      "Attaching package: ‘SeuratObject’\n",
+      "\n",
+      "\n",
+      "R[write to console]: The following object is masked from ‘package:base’:\n",
+      "\n",
+      "    intersect\n",
+      "\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "    WARNING: The R package \"reticulate\" only fixed recently\n",
+      "    an issue that caused a segfault when used with rpy2:\n",
+      "    https://github.com/rstudio/reticulate/pull/1188\n",
+      "    Make sure that you use a version of that package that includes\n",
+      "    the fix.\n",
+      "    "
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "R[write to console]: \n",
+      "Attaching package: ‘anndata’\n",
+      "\n",
+      "\n",
+      "R[write to console]: The following object is masked from ‘package:SeuratObject’:\n",
+      "\n",
+      "    Layers\n",
+      "\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Data is of class matrix. Coercing to dgCMatrix.\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Data is of class matrix. Coercing to dgCMatrix.\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Data is of class matrix. Coercing to dgCMatrix.\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n",
+      "R[write to console]: Warning:\n",
+      "R[write to console]:  Feature names cannot have underscores ('_'), replacing with dashes ('-')\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%R\n",
+    "library(Seurat)\n",
+    "library(anndata)\n",
+    "\n",
+    "create_seurat <- function(url) {\n",
+    "  file_path <- sub(\"https://s3.amazonaws.com/dp-lab-data-public/palantir/\", \"../data/\", url, fixed = TRUE)\n",
+    "  if (!file.exists(file_path)) {\n",
+    "    curl::curl_download(url, file_path)\n",
+    "  }\n",
+    "  data <- read_h5ad(file_path)\n",
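+    "  # anndata stores cells x genes; Seurat expects genes x cells, hence t(). Note that .X is normalized and log-transformed.\n",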
+    "  seurat_obj <- CreateSeuratObject(\n",
+    "    counts = t(data$X), \n",
+    "    meta.data = data$obs,\n",
+    "    project = \"CD34+ Bone Marrow Cells\"\n",
+    "  )\n",
+    "  tsne_data <- data$obsm[[\"tsne\"]]\n",
+    "  rownames(tsne_data) <- rownames(data$obs)\n",
+    "  colnames(tsne_data) <- c(\"tSNE_1\", \"tSNE_2\")\n",
+    "  seurat_obj[[\"tsne\"]] <- CreateDimReducObject(\n",
+    "    embeddings = tsne_data,\n",
+    "    key = \"tSNE_\"\n",
+    "  )\n",
+    "  imputed_data <- t(data$obsm[[\"MAGIC_imputed_data\"]])\n",
+    "  colnames(imputed_data) <- rownames(data$obs)\n",
+    "  rownames(imputed_data) <- rownames(data$var)\n",
+    "  seurat_obj[[\"MAGIC_imputed\"]] <- CreateAssayObject(counts = imputed_data)\n",
+    "  fate_probs <- as.data.frame(data$obsm[[\"palantir_branch_probs\"]])\n",
+    "  colnames(fate_probs) <- data$uns[[\"palantir_branch_probs_cell_types\"]]\n",
+    "  rownames(fate_probs) <- rownames(data$obs)\n",
+    "  seurat_obj <- AddMetaData(seurat_obj, metadata = fate_probs)\n",
+    "\n",
+    "  return(seurat_obj)\n",
+    "}\n",
+    "\n",
+    "human_cd34_bm_Rep1 <- create_seurat(\"https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep1.h5ad\")\n",
+    "human_cd34_bm_Rep2 <- create_seurat(\"https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep2.h5ad\")\n",
+    "human_cd34_bm_Rep3 <- create_seurat(\"https://s3.amazonaws.com/dp-lab-data-public/palantir/human_cd34_bm_rep3.h5ad\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a7c8b823-4d18-4252-acc1-4a9f51f929b9",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:23:08.315660Z",
+     "iopub.status.busy": "2023-11-28T21:23:08.315364Z",
+     "iopub.status.idle": "2023-11-28T21:23:08.361153Z",
+     "shell.execute_reply": "2023-11-28T21:23:08.360630Z",
+     "shell.execute_reply.started": "2023-11-28T21:23:08.315642Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "An object of class Seurat \n",
+      "29302 features across 5780 samples within 2 assays \n",
+      "Active assay: RNA (14651 features, 0 variable features)\n",
+      " 1 layer present: counts\n",
+      " 1 other assay present: MAGIC_imputed\n",
+      " 1 dimensional reduction calculated: tsne\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%R\n",
+    "\n",
+    "human_cd34_bm_Rep1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "094067ac-b251-4e37-8d67-eedc2641b8fa",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:23:08.362383Z",
+     "iopub.status.busy": "2023-11-28T21:23:08.361964Z",
+     "iopub.status.idle": "2023-11-28T21:23:08.400063Z",
+     "shell.execute_reply": "2023-11-28T21:23:08.399518Z",
+     "shell.execute_reply.started": "2023-11-28T21:23:08.362356Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "An object of class Seurat \n",
+      "29826 features across 6501 samples within 2 assays \n",
+      "Active assay: RNA (14913 features, 0 variable features)\n",
+      " 1 layer present: counts\n",
+      " 1 other assay present: MAGIC_imputed\n",
+      " 1 dimensional reduction calculated: tsne\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%R\n",
+    "\n",
+    "human_cd34_bm_Rep2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "6fb000c4-41ee-4147-aba8-08c0e6f7deb5",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2023-11-28T21:23:08.401196Z",
+     "iopub.status.busy": "2023-11-28T21:23:08.400878Z",
+     "iopub.status.idle": "2023-11-28T21:23:08.441148Z",
+     "shell.execute_reply": "2023-11-28T21:23:08.440627Z",
+     "shell.execute_reply.started": "2023-11-28T21:23:08.401171Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "An object of class Seurat \n",
+      "28088 features across 12046 samples within 2 assays \n",
+      "Active assay: RNA (14044 features, 0 variable features)\n",
+      " 1 layer present: counts\n",
+      " 1 other assay present: MAGIC_imputed\n",
+      " 1 dimensional reduction calculated: tsne\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%R\n",
+    "\n",
+    "human_cd34_bm_Rep3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e208ff84-85d0-40f7-b08d-9153537b088a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "da1",
+   "language": "python",
+   "name": "da1"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  },
+  "widgets": {
+   "application/vnd.jupyter.widget-state+json": {
+    "state": {},
+    "version_major": 2,
+    "version_minor": 0
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}