%Chapter 1
\newcommand{\edit}[1]{\textcolor{black}{#1}}
\renewcommand{\thechapter}{1}
\chapter{Introduction}
\section{Genome assembly}
The genome sequence of an organism is the blueprint for building that organism.
It is a key resource that allows researchers to better understand the organism's function and evolution.
Initially published in 2001, the human genome has undergone dozens of revisions over the years~\cite{monya2012} as researchers fill in gaps and correct mistakes in the sequence.
Determining which parts of the genome are missing, which are mistakes, and which are experimental artifacts of the sequencing machine is not an easy task.
%The genome sequence of an organism is a critical resource for biologists trying to understand the organism's function and evolution.
Obtaining the genome of an organism is difficult as modern sequencing
technologies can only ``read'' small stretches of the genome (called
\emph{reads}).
The fact that these tiny \emph{reads} (under a few
thousand basepairs/characters in length) can be glued together to reconstruct genomes
comprising millions to billions of basepairs is by no means evident
and was the subject of vigorous scientific debate during the early
days of sequencing technologies~\cite{green1997against,weber1997human}. The modern genomic revolution was in no small part made
possible by the development of algorithms and computational tools called
\emph{genome assemblers} able to reconstruct near-complete
representations of a genome's sequence from the fragmented data
generated by sequencing instruments. Despite tremendous advances made
over the past 30 years in both sequencing technologies and assembly
algorithms, genome assembly remains a highly difficult computational
problem. In all but the simplest cases, genome assemblers cannot
fully and correctly reconstruct an organism's genome. Instead, the
output of an assembler consists of a set of contiguous sequence
fragments (\emph{contigs}), which can be further ordered and oriented
into \emph{scaffolds}, representing the relative placement of the
contigs, with possible intervening gaps, along the genome.
% The genome of an organism is a blueprint for life.
% The human genome was published in 2001.
% Even a genome as highly researched as the human periodically releases updates correcting mistakes.
% The main reference has undergone over 38 revisions.
% Filling in gaps.
% Correcting misassemblies.
% When a new organism is assembled, it is not easy determining what is missing, what is a mistake, and what is experimental artifact.
\subsection{Computational challenges of assembly}
Theoretical analyses of the assembly problem, commonly formulated as
an optimization problem within an appropriately defined graph, have
shown that assembly is
NP-hard~\cite{myers1995,medvedev2007computability}, i.e., finding the
optimal solution may require an exhaustive search of an
exponential number of possible solutions. The difficulty of genome
assembly is due to the presence of repeated DNA
segments (\emph{repeats}) in most genomes. Repeats longer than the length of the sequenced reads lead to ambiguity in the reconstruction of the genome
-- many different genomes can be built from the same set of
reads~\cite{nagarajan2009complexity,kingsford2010assembly}.
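As a schematic illustration (not drawn from any particular genome), suppose a repeat $R$ occurs three times in a genome, separating unique segments $A$, $B$, $C$, and $D$. If no read is long enough to span an entire copy of $R$ together with sequence on both of its flanks, the two arrangements
\[
A\,R\,B\,R\,C\,R\,D \qquad \textrm{and} \qquad A\,R\,C\,R\,B\,R\,D
\]
give rise to exactly the same collection of reads, so the sequencing data alone cannot distinguish between the two reconstructions.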
As a result, practical implementations of assembly algorithms (such as
ABySS~\cite{ABySS}, Velvet~\cite{Velvet},
SOAPdenovo~\cite{li2010novo}, etc.) return just an approximate
solution that contains errors, is fragmented, or both.
Ideally, in a genomic experiment, assembly would be followed by the
scrupulous manual curation of the assembled sequence to correct the
hundreds to thousands of errors~\cite{salzberg2005misassemblies}, and
fill in the gaps between the assembled
contigs~\cite{nagarajan2010finishing}. Despite the value of fully
completed and verified genome sequences~\cite{fraser2002value}, the
substantial effort and associated cost necessary to conduct a
finishing experiment to its conclusion can only be justified for a
few high-priority genomes (such as reference strains or model
organisms). The majority of the genomes sequenced today are
automatically reconstructed in a ``draft'' state. Although
valuable biological conclusions can be derived from draft
sequences~\cite{branscomb2002high}, these genomes are of uncertain
quality~\cite{chain2009genome}, possibly impacting the conclusions of
analyses and experiments that rely on their primary sequence.
\subsection{Assessing the quality of an assembly}
% Currently, there are two ways to evaluate assemblies -
% reference-based and \emph{de novo} evaluation. When a reference
% genome is available, an assembly quality can be estimated
% based on percentage of a true genome reconstructed (comple- teness statistics) or some biologically relevant scores such as a number of bases covered by GenBank or a number of amino
Assessing the quality of the sequence output by an assembler is
of critical importance, not just to inform downstream analyses, but
also to allow researchers to choose from among a rapidly increasing
collection of genome assemblers. Despite apparent incremental
improvements in the performance of genome assemblers, none of the
software tools available today outperforms the rest in all assembly
tasks. As highlighted by recent assembly
bake-offs~\cite{earl2011assemblathon,salzberg2011gage}, different
assemblers ``win the race'' depending on the specific characteristics
of the sequencing data, the structure of the genome being assembled,
or the specific needs of the downstream analysis process.
Furthermore, these recent competitions have highlighted the inherent
difficulty of assessing the quality of an assembly. More
specifically, all assemblers attempt to find a trade-off between
contiguity (the size of the contigs generated) and accuracy of the
resulting sequence. Evaluating this trade-off is difficult even when
a gold standard is available, e.g., when re-assembling a genome with
known sequence. In most practical settings, a reference genome
sequence is not available, and the validation process must rely on
other sources of information, such as independently derived data from
mapping experiments~\cite{zhou2007validation}, or from transcriptome
sequencing~\cite{adamidi2011novo}. Such data are, however, often not
generated due to their high cost relative to the rapidly decreasing
costs of sequencing. Most commonly, validation relies on \emph{de
novo} approaches based on the sequencing data alone, which include
global ``sanity checks'' (such as gene density, expected to be high in
bacterial genomes, measured, for example, through the fraction of the
assembled sequence that can be recognized by PFAM
profiles~\cite{genovo2011}) and internal consistency
measures~\cite{amosvalidate2008} that evaluate the placement of reads
and mate-pairs along the assembled sequence.
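As a schematic example of such a consistency check (the exact criteria vary from tool to tool), a mate-pair whose two reads map to the same contig is typically considered concordant only if the implied insert size $d$ is close to the library mean, e.g.,
\[
|d - \mu| \le k\,\sigma,
\]
where $\mu$ and $\sigma$ are the mean and standard deviation of the insert-size distribution and $k$ is a small constant; regions enriched in discordant or unmapped pairs are then treated as candidate misassemblies.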
The validation approaches outlined above can highlight a number of
inconsistencies or errors in the assembled sequence, information
valuable as a guide for further validation and refinement experiments,
but difficult to use in a comparative setting where the goal is to
compare the quality of multiple assemblies of the same dataset. For
example, even when a reference genome sequence is available, all
differences between the reassembled genome and the reference are, at
some level, assembly mistakes, yet it is unclear whether
single-nucleotide differences and short indels should be weighed as
heavily as larger structural errors (e.g., translocations or
large-scale copy-number changes)~\cite{earl2011assemblathon} when
comparing different assemblies. Furthermore, while recent advances in visualization
techniques, such as the FRCurve of Narzisi et
al.~\cite{FRC2011,vezzi2012feature}, have made it easier for
scientists to visualize the overall trade-off between
assembly contiguity and correctness, no established
approaches exist that allow one to appropriately weigh
the relative importance of the multitude of assembly quality measures,
many of which provide redundant information~\cite{vezzi2012feature}.
\section{Contributions of this dissertation}
In Chapter 2, we present our LAP framework, an objective and holistic approach for evaluating and
comparing the quality of assemblies derived from the same dataset. Our
approach defines the quality of an assembly as the likelihood that the
observed reads are generated from the given assembly, a value that can
be accurately estimated by appropriately modeling the sequencing
process. We show that our approach automatically and accurately reproduces the
reference-based rankings of assembly tools produced by the highly cited
Assemblathon~\cite{earl2011assemblathon} and GAGE~\cite{salzberg2011gage}
competitions.
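At its core, and in deliberately simplified notation (the full model is developed in Chapter 2), the score can be sketched as the average log-likelihood of the reads $r_1, \ldots, r_N$ given the assembly $A$,
\[
\mathrm{LAP}(A) = \frac{1}{N} \sum_{i=1}^{N} \log \Pr(r_i \mid A),
\]
where $\Pr(r_i \mid A)$ is computed under a probabilistic model of the sequencing process; assemblies under which the observed reads are more probable receive higher scores.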
In Chapter 3, we extend our \emph{de novo} LAP framework to evaluate metagenomic assemblies.
We show that by modifying our likelihood calculation to take into account the abundances of the assembled sequences, we can accurately and efficiently compare metagenomic assemblies.
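Schematically, and in notation that is only illustrative (the precise formulation is given in Chapter 3), the probability of a read $r$ can be viewed as a mixture over the assembled sequences weighted by their estimated relative abundances,
\[
\Pr(r \mid A) \approx \sum_{j} \pi_j \, \Pr(r \mid s_j),
\]
where $s_j$ denotes the $j$-th assembled sequence and $\pi_j$ its estimated relative abundance.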
%We evaluate our extended framework on results generated from the Human Microbiome Project (HMP) and
We find that, on data from the Human Microbiome Project (HMP)~\cite{mitreva2012structure,methe2012framework}, our extended LAP framework produces results that closely match the reference-based evaluation metrics and outperforms other \emph{de novo} metrics traditionally used to measure assembly quality.
Finally, we have integrated our LAP framework into the metagenomic analysis pipeline MetAMOS~\cite{treangen2013metamos}, allowing any user to reproduce these assembly quality evaluations with relative ease.
In Chapter 4, we provide a novel regression testing framework for genome assemblers.
Our framework uses two assembly evaluation mechanisms, \emph{assembly likelihood} (calculated using our LAP framework~\cite{LAP}) and \emph{read-pair coverage} (calculated using REAPR~\cite{hunt2013reapr}), to determine whether code modifications result in non-trivial
changes in assembly quality.
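As a schematic decision rule (the actual criteria and thresholds are described in Chapter 4; the notation here is only illustrative), a code change taking the assembler from version $v$ to version $v'$ can be flagged as non-trivial when
\[
|Q_m(v') - Q_m(v)| > \tau_m \qquad \textrm{for some metric } m \in \{\textrm{LAP}, \textrm{read-pair coverage}\},
\]
where $Q_m(v)$ denotes the value of metric $m$ on the assembly produced by version $v$ and $\tau_m$ is a tolerance chosen to absorb insignificant variation.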
%We evaluate our framework using popular assemblers SOAPdenovo \cite{li2010novo} and Minimus \cite{sommer2007minimus}.
We study assembler evolution in two contexts. First,
we examine how assembly quality changes
throughout the version history of the popular assembler SOAPdenovo~\cite{luo2012soapdenovo2}. Second,
we show that our framework can correctly detect decreases in assembly quality
in fault-seeded versions of another assembler, Minimus~\cite{sommer2007minimus}.
Our results also show that our framework accurately detects the trivial
changes in assembly quality produced by permuting the input reads or running on
multi-core systems, changes that traditional regression
testing methods fail to detect.
In Chapter 5, we build on the pipeline described in Chapter 4 and introduce VALET, a \emph{de novo} pipeline for finding misassemblies within metagenomic assemblies.
We flag regions of the genome that are statistically inconsistent with the data generation process and underlying species abundances.
VALET is the first tool to accurately and efficiently find misassemblies in metagenomic datasets.
We run VALET on publicly available datasets and use the findings to suggest improvements for future metagenomic assemblers.
In Chapter 6, we discuss our other contributions to bioinformatics relating to the domains of clustering, compression, and cloud computing.