Skip to content

Internship report comparing available tools to process Seq-Scope data

Notifications You must be signed in to change notification settings

SchapiroLabor/VoigtGesa_2024_seqscope_report

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 

Repository files navigation

Workflow Optimization for Seq-Scope Spatial Transcriptomics Data Analysis

Internship Report

In this project, I helped to establish the spatial transcriptomics platform Seq-Scope using a novel flow cell. The wet lab experiments were performed by collaborators, who provided me with the datasets and sample information. My contribution was to 1) systematically compare available data analysis pipelines, and 2) assess the quality of the data. The latter provided information to adjust wet lab procedures for further experimental runs. This report describes challenges and outcomes of both aspects of my internship.

Seq-Scope Overview

Seq-Scope is a spatial transcriptomics (ST) method that promises high resolution at low cost. Since its introduction through a publication by Cho et al. (2022), no fewer than seven other manuscripts have been made available that describe modifications and improvements for both wet and dry lab.

Briefly, the experimental procedure includes the generation of spatial barcode (HDMI) clusters on a flow cell, tissue hybridization with imaging, and sequencing of the captured transcripts (Fig. 1). The output files reflect these three main steps: The so-called 1st-Seq files contain the sequence and location of each spatial barcode (HDMI) cluster. A bright-field image shows the hematoxylin & eosin (H&E) stained tissue slice on the flow cell. The 2nd-Seq files contain the sequence of the captured mRNA on one read (R2) and the spatial barcode at the capture location on the other read (R1). The second sequencing step is performed on a different instrument, so the transcript location is found by matching the HDMIs from both sequencing files. Furthermore, the H&E image, if properly aligned to the sequencing data coordinates, can be used to guide cell typing.

image

Fig. 1: Seq-Scope wet lab procedure, modified after Cho et al. (2021). A-C) Oligonucleotides, each containing a distinct spatial barcode (HDMI), are hybridized at random locations on the flow cell. As they are locally amplified during 1st-Seq, position and sequence of the spatial barcode are registered. D) The reads are enzymatically cut by DraI to expose the oligo-dT region. E) A thin tissue section is placed on the flow cell, and mRNA from the tissue binds to the oligo-dT tail. The tissue gets stained with H&E, is imaged, and removed from the flow cell. F-J) The mRNA gets transcribed to cDNA, thereby hybridizing spatial barcode and mRNA information. After library preparation, the reads get sequenced again (2nd-Seq).

Datasets

By our collaborators, we were provided with datasets from several experimental runs, initially using healthy murine liver tissue, then healthy or infarcted heart tissue from mouse or zebrafish. The 1st-Seq data was generated using the Illumina MiniSeq platform, which has an non-patterned flow cell with an imaging field of ca. 1.5 cm2. There were several issues with the initial liver dataset, which, in brief, led to very few reads containing mappable cDNA, as well as most HDMIs from 1st-Seq and 2nd-Seq not matching. This report will thus mostly focus on the heart datasets.

As a direct comparison, I processed a publicly available dataset of healthy murine liver that was provided as an exemplary dataset with the NovaScope pipeline. This served as a positive control for pipeline functionalities and to compare quality control (QC) outputs. It was generated using a NovaSeq6000 S4 flow cell for 1st-Seq, which is a patterned flow cell with an imaging field of ca. 12.8 x 7 mm (ca. 0.9 cm2). The liver data of the Cho et al. (2021) publication was originally used but discontinued due to major differences to our in-house setup, mainly using HDMI sequences of length 20 nt and the Illumina MiSeq platform for 1st-Seq, which has near-round imaging tiles, leading to a very fractured final read distribution.

Since both our own and the publicly available datasets were not created for biological interpretation, the sequencing depth of 2nd-Seq was low. Approximately 163 million 2nd-Seq read pairs were provided for the NovaScope dataset. For our datasets, sequencing depth was adjusted based on area covered by tissue on the 1st-Seq flow cell, 2nd-Seq library quality, and monetary constraints. In total, between 31 and 236 million 2nd-Seq read pairs were registered for each of our experiments. For clarity, the term 'shallow liver dataset' will subsequently only be used for the NovaScope data.

Quality Control and Troubleshooting

For quality control (QC), I first ran fastqc on the sequencing files, which provides a report compiling several QC statistics and plots. I also randomly sampled a number of read sequences to exemplify these results. Further, I used QC metrics that were provided as output by the tested pipelines.

The quality of the datasets improved with every experimental run. The following sections describe the issues that were still present in the last dataset we received at the time of writing.

Incomplete DraI enzymatic digestion

Reads of 2nd-Seq R2 that show an intact DraI cleavage site are highly present in the libraries (see also Fig. 2). I estimate this fraction to be between 5-20%, but definitely severely limiting the experiment. Reads that are not properly cleaved after 1st-Seq do not allow for the binding of mRNA. Since they are sequenced during 2nd-Seq nonetheless, these reads contribute to sequencing cost but do not provide useful information. Most importantly, they limit the total amount of information that could be obtained from the library due to the decreased binding of mRNA. Since activity of the enzyme was unimpaired outside of the Seq-Scope run, the issue might be with the buffer conditions on the 1st-Seq flow cell, or with washing off the cleaved read parts. Further tests are ongoing.

Varying library quality

Using identical experimental protocols, the amount of information gained from the experiments varied widely. For two replicates of experiments using mouse heart, for example, one library had 5x fewer entries in the final gene-by-position matrix than the other, when controlling for 2nd-Seq sequencing depth. This quality difference was also apparent prior to 2nd-Seq due to the low fraction of reads of the required length. While this helped to inform the desired sequencing depth per sample, we currently have little indication what might cause this variability. Additionally, the low library quality likely also contributed to the fraction of reads that had intact primers and barcodes but were too short to map (25-46% instead of 6% in the positive control) (see also Fig. 2).

image

Fig. 2: Exemplary reads from 2nd-Seq R2. Four reads each were sampled from random locations in the fastq files. The DraI cleavage site is marked in red. The lower four reads exemplify the issue of non-mappable reads due to their length.

Sequencing cluster collapse

For 2nd-Seq R2, the fastqc plot "Per base sequence content" showed a peak in "G" content at around 60 bp (Fig. 3). As helpfully suggested by Adam Cribbs, this is due to partial sequencing cluster collapse. With increasing length of the read, the probability of reaching the poly-A tail increases. At a certain point, most nucleotides are adenine, which decreases the nucleotide diversity (locally) to the point that bases at this position get registered as the default "G". One option to circumvent this issue is by using a more robust sequencer, as our machine only used two colors for base-calling and was therefore more sensitive to low nucleotide diversity. Another option is to increase nucleotide diversity by pooling different samples. Since neither of these options was available to us at the time, we decided to increase nucleotide diversity by spiking in more PhiX (20% instead of 5%). This did not visibly improve the issue. However, since the cluster collapse stably appeared at around 60 bp, the number of bases sequenced before the collapse was deemed sufficient for gene mapping. Since the PhiX reads contribute to sequencing cost without providing information of interest, I would recommend to reduce the spike-in percentage to previous levels.

Fig. 3: Fastqc plot showing per base sequence content of 2nd-Seq R2 for one of our datasets.

High noise

The number of mapped reads was similarly high in regions of the flow cell that were tissue covered compared to those that did not (Fig. 4). This high signal-to-noise ratio might be improved by increasing enzymatic tissue digestion, to improve release of mRNA, which will be tested in an upcoming experiment. To decrease the level of noise caused by mRNA diffusion, the incubation time of the tissue on the flow cell should be as short as possible.

Other aspects

Mechanical scratches on the inner flow cell surfaces, as well as air bubbles during 1st-Seq, result in lack of 1st-Seq data in the affected areas (Fig. 4), and should thus be minimized. Additionally, the number of reads registered during 1st-Seq is slightly lower in certain areas of the flow cell. This is because the non-patterned flow cell used here is registered in rows of rectangular tiles, and the sequencing coverage is marginally lower in tile corners. Further, the tiles do not cover the entire flow cell, so samples of particular interest should be placed close to the center of the flow cell to ensure capture of the entire sample area. In areas of fibrosis, the number of captured reads is also decreased. Since these issues can only partially be mitigated in the wet lab, artifacts need to be controlled for in the dry lab as well, such as by gene imputation or exclusion of affected areas.

image

Fig. 4: Plot of total gene counts per spatial location, output by the NovaScope pipeline (brightness increases with count, no legend given). Three heart tissue samples were placed on the flow cell. Their partial outline is visible, but almost masked by background signal. Dark areas show scratches and air bubbles on the flow cell, as well as corners of sequencing tiles.

Comparison of Available Pipelines

For the original Seq-Scope publication, data processing was performed using STtools, which focusses on high-resolution platforms but should also be compatible with other ST methods. Similarly, the snakemake pipeline SpaceMake implements settings for compatibility with Slide-Seq V2, 10x Visium, and Seq-Scope, as well as offering customization options for other ST methods. The original pipeline has recently been included in a package called Open-ST, which improves some steps in the previous workflow, but mainly offers 3D reconstruction of consecutive tissue slices. The most recent option for Seq-Scope data processing is NovaScope, another snakemake pipeline that handles data processing until the generation of a spatial gene expression matrix (SGE) (see Main Pipeline Functionalities). For the results obtained here, I used NovaScope (before release of v1.0.0), as well as OpenST (v0.1.2) and SpaceMake (v0.7.8)

Installation

Both tools require the setup of a conda environment as well as the installation of external tools. In addition, reference genome files need to be provided. SpaceMake offers instructions to create a conda environment based on an environment.yaml file, and then, via pip, to install the spacemake package. Additionally, Drop-seq needs to be installed, a Java and R-based library originally written for the identically named single-cell sequencing technology, which is to be built from source via the GitHub repository. The Open-ST package (openst) can be downloaded from GitHub, and will soon be available via pip as well.

The NovaScope documentation guides through the creation of a conda (mamba) environment with Snakemake, a python virtual environment with a pyenv_req.txt file. Several external tools are required:

  • STARsolo, a widely-used alignment tool, to be installed from source.
  • SAMtools, a common C-based tool suite to process and view read alignments, available as a module on the bwHPC or otherwise from GitHub or Sourceforge.
  • spatula, a C++ library to perform NovaScope-specific functionalities, available through GitHub.
  • qgenlib, a C++ library mainly handling command-line parsing and helper functions for spatula, available through GitHub.
  • HTSlib, a C library to access special SAMtools file formats, and is also required for qgenlib, to be installed from source.
  • ImageMagick, to edit and convert images, to be installed from source.
  • GDAL, a tool required only required for histology alignment, available on the bwHPC or through conda. A Linux-compatible Docker image is also provided, containing all the required dependencies. However, to fully take advantage of parallelization using a slurm scheduler e.g., on the bwHPC, the developers strongly recommend to install the required tools directly.

In both cases, installation is documented for the main components. NovaScope also supplies installation instructions for external tools, if not preinstalled.

Documentation & User Friendliness

SpaceMake documentation is provided and puts a special focus on customization of input parameters to be compatible with different platforms, but lacks detail in other aspects. Notably, there is no explanation of individual steps, which hinders troubleshooting and interpretation of results. The Open-ST documentation, which is currently in beta phase, only describes steps implemented in the additional openst package. In contrast, the NovaScope documentation is very extensive and well organized (as are its log files), including explanation of output files for every step.

When asking for assistance via GitHub issues, developers from both platforms (SpaceMake & NovaScope) are quick to respond and interested in helping to resolve problems.

Compatibility with (our) Seq-Scope Data

Both Open-ST and NovaScope were specifically designed for the NovaSeq6000 S4 flow cell. In case of Seq-Scope data from other systems, data processing is still possible (see Main Pipeline Functionalities). The Open-ST processing through SpaceMake offers particularly many customization options since it was originally built to be compatible with other platforms as well, such as Visium and Slide-Seq. However, automatic (fine) image alignment is dependent on the fiducial marks of a patterned flow cell and therefore not suitable for our MiniSeq system.

Resource Usage

To process the exploratory liver data, SpaceMake was run with the following parameters:

#SBATCH --partition=single
#SBATCH --time=04:00:00
#SBATCH --mem=60gb
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

With these resources, the pipeline took ca. 5h 45 min to process the exploratory dataset until the last possible step. The runtime could be decreased to 2h when mapping only against the mouse reference genome (see Open-ST/SpaceMake Pipeline).

NovaScope can be configured to independently submit jobs via the slurm scheduler using the so-called master_job.slurm file with long runtime and a cluster configuration file. Since the individual steps almost exclusively do not require much memory and/or multiple cores, this temporarily frees up resources for other cluster users. The computationally most intensive step is the alignment, for which the default parameters (also used here) are 10 threads and 70 GB. On the shallow liver dataset, it took 2h 15 min to finish all steps. Processing time for our other datasets varied, depending on library quality.

Main Pipeline Functionalities

The processing of Seq-Scope data can be broken down into just a few key steps: preprocessing of the fastq files, alignment of cDNA to a reference genome, and finally the creation of a spatial gene expression matrix (SGE). From this point, processing is more dependent on the biological or technical usecase of the data (see also Downstream Options).

Open-ST/SpaceMake Pipeline

Before running the pipeline, the HDMIs and tile information, including coordinates, need to be extracted from the raw 1st-Seq fastq files. The reverse complement of the HDMI needs to be created in order for later matching with the 2nd-Seq HDMI. The openst function barcode_preprocessing accomplishes these tasks while writing the output to individual files per tile.

SpaceMake first uses the 2nd-Seq files to extract the HDMI and UMI according to the barcode_flavor specified in the configuration file. As opposed to NovaScope, SpaceMake offers poly-A trimming and removal of adaptor residuals, implemented via Drop-seq functions.

In the second step, reads are mapped against the reference genome. The recommended mapping_strategy uses bowtie2 to discard reads that align to the PhiX or ribosomal RNA reference. Only reads passing this filter are mapped against a user-provided reference genome using STARsolo.

The mapped reads are then collected into a digital gene expression matrix (DGE). Only the top n barcodes with the most mapped reads are included (openst default n = 100,000), which limits downstream processing and can also be used to exclude off-tissue spots. Note that at this step, the DGE is only based on 2nd-Seq data, meaning the indices are HDMIs extracted from R1 and have not been matched against 1st-Seq HDMIs. In this step, QC statistics such as the fraction of unmapped reads by cause are generated and summarized in an html report.

For the creation of the SGE, HDMIs from 1st-Seq and 2nd-Seq R1 are now matched with each other. Only spatial barcodes that are contained in both and also have mapped reads are included in the final matrix. For the exploratory dataset, this was the last possible processing step due to the low number of reads passing these criteria.

NovaScope Pipeline

As preprocessing steps, the spatial barcode and corresponding tile & coordinates are extracted and sorted by tile. The reverse complement of the HDMIs is generated and duplicates are removed. For all these steps, QC statistics are provided, as well as a plot of barcode density per coordinates, based on a flow cell layout file.

Spatial barcodes from 1st- and 2nd-Seq are matched. Only reads whose HDMIs are contained in the resulting whitelist are then mapped to a reference genome using STARsolo. The output provides information on several transcriptomic features ('Gene', 'GeneFull', 'SJ', and 'Velocyto', see STARsolo documentation), as well as several QC statistics (e.g., how many reads could not be mapped due to which reason).

Lastly, the SGE is created directly from the mapped reads. The results are visualized by plotting the read density by coordinates, as total read count and by a gene set of interest.

For further information about individual rules, see the NovaScope documentation, as well as the preprint.

Downstream Options

Following the creating of an SGE, the most important analysis steps are image alignment and spot clustering.

In order to relate the tile coordinates back to the H&E image, both Open-ST and NovaScope offer automatic alignment. However, high precision (Open-ST claims to achieve ~1um accuracy) is achieved using the fiducial marks of the patterned flow cell. Additionally, the NovaScope module for image alignment is currently not available. Alternative tools for image registration have not been tested as part of this project.

Since read counts per barcode are low, individual spots cannot be used directly for robust cell typing. There are several approaches to solve this issue, ranging from simply aggregating spots for clustering to using neural networks to assign cell types based on neighboring transcriptomic profiles. For Seq-Scope specifically, how (or whether) the H&E image data is used for segmentation and cell typing is another important metric to consider. Since an extensive comparison between methods lays outside of the scope of this report, only a brief overview will be given here.

SpaceMake by default imposes a Visium- or Slide-Seq-like meshgrid on the spots, aggregating them into hexagonal or circular structures of (by default) 10 μm diameter. These units are then processed like single cells. Further automatic analyses including filtering, clustering, and marker gene identification is based on scanpy and squidpy functionalities. A more sophisticated approach is implemented by Open-ST, which uses the H&E image to guide spot aggregation. In addition to the aforementioned image alignment, cell segmentation with radial expansion is offered using cellpose. NovaScope recommends the use of FICTURE (see below), and provides a step-by-step tutorial for downstream processing in a separate GitHub repository.

While there is a plethora of segmentation tools compatible with spatial transcriptomics data, some of prominent examples include:

  • Baysor, which is commonly used, especially in combination with cellpose segmentation masks as priors. However, Baysor in its native implementation scales poorly, which is especially problematic for non-targeted transcriptomic data such as Seq-Scope.
  • FICTURE, a self-described "segmentation-free spatial factorization method" which assigns spots to types based transcripts in neighboring spots. In the preprint, this tool showed similar accuracy to Baysor while performing better at scale. Image data is not used to inform factorization.
  • Boms, a cell segmentation tool originally developed for fluorescent in-situ hybridization (FISH) data, which can incorporate e.g., cellpose segmentation masks as a prior. A main drawback is that there is no manuscript and very little documentation provided.
  • Bering, which segments cells based on transcript colocalization using graph & convolutional neural networks.

Conclusion & Outlook

In this project, I processed several Seq-Scope datasets and systematically compared different pipelines developed for this purpose. I was able to identify a number of issues in data quality, which helped to significantly increase it in subsequent datasets. I further identified the most suitable processing tools for data analysis, such that, within our lab, datasets can now be processed seamlessly in just a few hours.

Despite this progress, the data quality is currently too low to derive meaningful biological insights. The main limiting factor is the efficiency of the DraI enzymatic digestion after 1st-Seq. Without significant improvements, even immensely higher sequencing depth for 2nd-Seq would not provide the necessary transcriptome coverage. A back-of-the-envelope calculation showed that, even when using our library with the best quality, 2nd-Seq read pairs in the order of billions would be required to reach adequate gene counts per cell for phenotyping. However, it might be possible to achieve better results using tissues other than heart, such as brain tissue.

Of the pipelines specifically designed for Seq-Scope data processing, NovaScope clearly stands out with a fast and well-documented workflow. While SpaceMake offers better compatibility with other ST platforms or experimental designs, and Open-ST extends spatial data analysis to another dimension, these additional features are less relevant for a standard experimental design such as ours.

While both SpaceMake and NovaScope are capable of processing Seq-Scope data from the sequencing files to the SGE matrix, they are limited in some crucial aspects. Firstly, image registration relies on the fiducial marks of patterned flow cells such as the NovaSeq6000, as has not been made available as part of NovaScope. As recent publications such as the FICTURE preprint show, non-patterned flow cells as in the HiSeq2500 system are also attractive options for Seq-Scope users. Support for image data from such platforms is currently not provided by either tool.

Furthermore, downstream options are limited. While NovaScope provides guidance on using FICTURE for spatial factorization, this method does not integrate image data at all. Conversely, Open-ST performs cell segmentation with cellpose, which is not informed by transcriptome data. While both approaches might be viable options depending on the usecase, to fully leverage the advantages of Seq-Scope, cell typing would be informed by both transcriptome and image data simultaneously.

In summary, while Seq-Scope is a promising technology, it requires further development to leverage its full potential. Using multiple tissues and several rounds of optimization, we were not able to obtain data of sufficient quality. For data processing, NovaScope is a well-documented pipeline that efficiently processes raw files to generate a gene-by-location matrix. For further processing, other studies will be needed to best combine the potential of high-resolution spatial transcriptomics data and H&E images.

About

Internship report comparing available tools to process Seq-Scope data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published