Historian is a multiple aligner that aims at providing accurate historical reconstructions of the evolution of a set of DNA or protein sequences. This stands in contrast to most multiple alignment tools, which instead try to provide protein alignments that correctly identify regions of 3D structural homology. If you are trying to predict the structure of a protein, you should probably use one of those other tools. If you care about the evolutionary history of your sequences, use Historian.
Most multiple alignment tools (think: Clustal, Muscle, ProbCons) are optimized for homology-based structure prediction, and tested on structural alignment benchmarks (e.g. BAliBase, Oxbench, Prefab, Sabmark). That's a good empirical approach as far as it goes, because structurally-informed protein alignments make a good "gold standard" for benchmarking alignment tools. Often, these tools have scoring schemes that are optimized for reproducing common signatures of protein selection, such as reduced indel rates in hydrophobic regions. However, optimizing for structure has the unfortunate drawback of introducing biases into the estimates of indel (and possibly substitution) rates (as a rule of thumb, all methods tend to underestimate mutation rates, but with standard multiple aligners the biases can be unpredictable and can vary widely at different indel rates). Consequently, this approach yields a less-than-accurate picture of evolutionary history.
By contrast, Historian uses an explicit evolutionary model of indel and substitution events, derived rigorously from statistical phylogenetics using finite-state transducers as evolutionary operators. In simulation tests (using the third-party evolution simulator indel-Seq-Gen), it introduces significantly fewer biases than other tools. It also performs pretty well on structural alignment benchmarks, though not as well as tools like Muscle and ProbCons that are optimized for that.
The basic method and initial benchmarks of the approach are described in (Accurate Reconstruction of Insertion-Deletion Histories by Statistical Phylogenetics; Westesson, Lunter, Paten and Holmes, PLoS One, 2012). The implementation described in that paper is a program called ProtPal. Historian is a clean reimplementation of ProtPal that also runs a lot faster, is more user-friendly, and has significantly more features.
The closest method to Historian is PRANK (Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis; Löytynoja and Goldman, Science, 2008). Compared to PRANK, Historian is of comparable accuracy (on simulation benchmarks) or slightly more accurate (on structural alignment benchmarks), has more features, and runs faster.
To build from source, type make
. This will create a binary file bin/historian
.
At present, Historian requires the following build environment to compile:
- Apple LLVM version 7.3.0 or later (clang-703.0.31), or gcc version 4.8.3 or later
- Boost C++ library version 1.62.0 or later
- Gnu Scientific Library (GSL) version 2.2.1 or later
If you want to run the tests, type make test
. (The tests are oriented to a Mac OSX build environment; numerical precision errors may cause slight differences in output on different machines, which may lead to some tests failing.)
Pre-compiled binaries are also available from the GitHub repository release page.
The simplest way to use Historian is just to point it at a FASTA file. It will then estimate a guide alignment, estimate a tree from that (using neighbor-joining), and perform a full ancestral reconstruction.
For example, using a test file of HIV GP120 sequences that is included in the repository:
historian data/gp120.fa
This will generally be pretty fast, but you can make it faster (at a slight cost in accuracy) using the -fast
option:
historian data/gp120.fa -fast
The -fast
option is an alias for several reconstruction options, as described in the help message.
Historian is one of these toolbox programs where the first argument can be a command, specifying what action is to be performed. If you omit this first command-argument, Historian assumes you want to reconstruct something. You can make this explicit as follows:
historian reconstruct data/gp120.fa
You can also abbreviate reconstruct
to recon
or just r
:
historian r data/gp120.fa
All commands can be abbreviated to single letters. Of course, if you are doing reconstruction (as noted above), you can omit the command entirely, but the examples further below will use different commands.
By default, the historian
program runs in classic Unix mode, i.e. printing nothing except the output. This can be a bit boring for long jobs, so run with -v
for verbose colorized logging output, or -vv
for more verbose, or -vvv
(or equivalently -v3
) for even more logging, and so on. Probably -v2
is about the right balance between showing some feedback on progress and not drowning you in psychedelic rainbow log messages.
historian -v2 data/gp120.fa
This produces output somewhat like this (but more colorful - GitHub-flavored Markdown won't show the colors):
Using default amino acid model (lg)
Alphabet: arndcqeghilkmfpstwyv
Substitution model has 1 mixture component, expected rate 1
Insertion rate 0.01, expected insertion length 2.94118
Deletion rate 0.01, expected deletion length 2.94118
Loading sequences from data/gp120.fa
Building guide alignment (data/gp120.fa)
Estimating initial tree by neighbor-joining (data/gp120.fa)
Starting reconstruction on 19-node tree (data/gp120.fa)
Aligning ENV_HV2BE/24-510 (489 states, 488 transitions) and ENV_HV2D1/24-501 (480 states, 479 transitions)
Aligning ENV_HV2G1/23-502 (482 states, 481 transitions) and (ENV_HV2BE/24-510:0.106114,ENV_HV2D1/24-501:0.0784677) (689 states, 808 transitions)
...
...and so on.
Historian speaks a variety of input and output formats. By default, it outputs alignments in Stockholm format, which allows easy extraction of the alignment while also affording space for metadata like trees. If you prefer your alignments in another output format, for example Nexus or FASTA, use -output nexus
or -output fasta
.
Input sequence formats will usually be auto-detected, but this behavior can be overridden to stipulate particular file formats via the command-line arguments.
As briefly alluded to above, Historian does several performance-optimizing steps en route to a reconstruction. First, it builds a quick-guess multiple alignment by a greedy maximal-spanning-tree type approach; this can optionally be accelerated by a k-mer match step (confining the alignment to diagonals of the dynamic programming matrix that pass a minimum threshold of k-mer matches) and by using a sparse random spanning forest instead of a dense all-vs-all comparison. Second, it uses this alignment to build a guide tree by neighbor-joining. Third, it builds a progressive reconstruction that includes suboptimal alignments in something like a partial-order graph. And fourth, it optionally does iterative refinement to optimize the reconstruction. The latter two steps (reconstruction and refinement) can be constrained to stay near the guide alignment for performance reasons.
The default settings attempt to navigate this maze of options for you, mostly using the higher-accuracy options until memory becomes a limiting factor and then switching to the more approximate options. However, as a power user, you may want to take control of these options. Command-line arguments allow you to supply guide alignments and/or guide trees, and change the parameters or behavior of the standard workflow.
For example, if you want to supply sequences in FASTA format (gp120.fa) and a guide tree in New Hampshire format (gp120.tree.nh)
historian -seqs data/gp120.fa -tree data/gp120.tree.nh
Alternatively, if your sequences are in Nexus or Stockholm format, you can encode the tree together with your sequences, using the appropriate syntax for encoding New Hampshire-format trees in those formats (#=GF NH
for Stockholm).
If you want to estimate the tree and use UPGMA instead of neighbor-joining, so as to enforce an ultrametric tree, use the -upgma
option, e.g.
historian -upgma data/gp120.fa
To save the guide alignment, use the -saveguide
option:
historian data/gp120.fa -saveguide gp120.guide.fa
If you already have your sequences aligned, and you want to use this alignment as the guide alignment, you only need to supply that alignment (not the sequences):
historian -guide gp120.guide.fa
The guide alignment is, by default, just treated as a "hint". Historian will do dynamic programming in a "band" around the guide alignment, sliding gaps back and forth up to a maximum distance specified by the -band
argument. If, instead, you want to use the guide alignment as a strict constraint, and find the best reconstruction that is exactly consistent with the guide, then set the band to zero:
historian -guide gp120.guide.fa -band 0
For some alignments, it may be the case that there is no reconstruction under Historian's model that is exactly consistent with the guide (due to ordering of gaps), in which case you might want to relax the band to 1:
historian -guide gp120.guide.fa -band 1
These arguments are all listed in the help text, available via the -h
option and copied below.
Historian's underlying model is a simple one: there is a substitution rate matrix, an insertion rate, a deletion rate, and insertion/deletion extension probabilities. These are all specified in a JSON file format, several examples of which can be found in the model directory.
The default model lg
is an amino acid substitution matrix estimated by Le and Gascuel (2008) using XRate on a dataset of Pfam alignments, with indel rates and probabilities that were also estimated from Pfam. However, the historian
program allows you to load a model from a file using the -model
option, or to use one of the preset models using the -preset
option. You can also add discretized-gamma rate categories using the -gamma
and -shape
options. For example, to use the Whelan and Goldman model with 4 rate categories and gamma shape parameter 1.5:
historian data/gp120.fa -preset wag -gamma 4 -shape 1.5 -v2
Alternatively, the model parameters can be estimated directly from sequence data using the built-in EM algorithm that is the same algorithm used by XRate (as described in this paper).
To estimate rates from data, use the fit
command. Model-fitting takes a little longer than reconstruction, since the EM algorithm typically takes a few iterations to converge, so you might want to turn on some logging. For example:
historian fit data/gp120.fa -fast -vv >gp120.model.json
You can then load this model using the -model
option, and use it to reconstruct another sequence history. For example, using it to reconstruct the ancestors of the Cas9 bridge helix domain (Pfam family PF16593), included as file data/PF16593.fa in the repository:
historian reconstruct -model gp120.model.json data/PF16593.fa
Often, a single protein family will not include enough information to reliably fit an amino acid rate matrix - the model will be over-trained. You can aggregate multiple datasets together into a larger training set simply by listing the files successively on the command line:
historian fit -fast data/gp120.fa data/PF16593.fa -v3 >aggregated.model.json
If you only want to estimate the indel rates and not the substitution matrix, then you can use the -fixsubrates
option to hold the substitution rates constant:
historian fit data/gp120.fa -fixsubrates >gp120.model.json
Conversely, you can use -fixgaprates
to hold the indel rates (and indel extension parameters) constant, while estimating substitution rates. Other aspects of the model-fitting algorithm (for example, the use of Laplace pseudocounts, or the EM convergence criteria) can be set via the command-line options.
The model-fitting algorithm described above is an Expectation-Maximization algorithm that iteratively estimates the a posteriori expected number of indel and substitution events of each possible type, along with the amount of time spent in each state, and then uses these expected counts and times to update the mutation rates.
For some applications, these event counts and wait times - the so-called "sufficient statistics" for EM - may be more useful (or more stable, or reliable) than the rate parameters themselves.
If you are just interested in these counts, you can get at them using the count
command. For example:
historian count data/gp120.fa >gp120.counts.json
One application of the counts is as pseudocounts representing hyperparameters of Dirichlet priors. JSON counts files that were computed using the count
command can be passed as pseudocounts to the fit
command via the -counts
option:
historian fit data/PF16593.fa -counts gp120.counts.json -nolaplace >PF16593.model.json
The -nolaplace
option here indicates that we don't want to add the usual +1 Laplace pseudocounts in this case.
You can also use the count
command to estimate counts for a bunch of alignments in parallel, then combine the counts using the sum
command, and finally estimate the maximum-likelihood parameters for these combined counts using fit
. This amounts to doing a single iteration of the EM algorithm. Probably the only reason you'd want to do this would be if you were implementing some kind of roll-your-own MapReduce-style approach to splitting up the EM algorithm on a large dataset. Which is probably unlikely, but here's roughly what it'd look like:
historian count file1.fa >file1.counts.json
historian count file2.fa >file2.counts.json
...
historian count fileN.fa >fileN.counts.json
historian sum file1.counts.json file2.counts.json ... fileN.counts.json >summed.counts.json
historian fit -nolaplace -counts summed.counts.json >updated.model.json
If you care to, you can simulate from a model using the generate
command. You will need to specify a tree:
historian generate data/gp120.tree.nh
The simulator accepts the standard model-specification options (-preset
, -gamma
, -scale
etc.). You can change the output format with -output
and the random number seed with -seed
.
Thus, for example
historian generate data/gp120.tree.nh -seed 123 -scale 10 -preset dayhoff -output fasta
Historian includes an experimental MCMC implementation for co-sampling trees and alignments. Currently, this implementation only works for ultrametric trees. It is available via the mcmc
command.
At its core, Historian uses the phylogenetic transducer method. See Westesson et al, 2012 for an evaluation and brief description of the method, or this arXiv report for a tutorial introduction.
Very briefly, the idea of this method is as follows. The main recursion of Felsenstein's pruning algorithm for calculating the likelihood of a multiple alignment column can be summarized, in matrix form, as Fn=(Bl Fl).(Br Fr) where n, l and r are the node and its two children, Bn is the branch substitution matrix on the branch leading to node n, (A B) denotes the matrix product and A.B the pointwise (Hadamard) product, with each Fn denoting an ancestral sequence profile. If for our matrix representation we use weighted finite-state transducers, with (A B) denoting the operation of transducer composition and A.B the operation of transducer intersection, then Felsenstein's algorithm yields an instance of Sankoff's algorithm for multiple sequence alignment, and Fn is a state machine. We constrain the algorithm to be practical by retaining only high-probability states of Fn at each stage. The branch transducers Bn are derived using a simple approximation that indel events on a single branch never overlap.
- The tests work for the following development environment. Other builds may give subtly different results due to rounding errors.
- Apple LLVM version 7.3.0 (clang-703.0.31)
- Boost 1.62.0
- GSL 2.2.1
- At present, Historian requires the Boost C++ library due to inconsistencies in the way different C++ compilers and standard libraries implement regular expressions.
The following is the message that appears when you type historian help
:
Usage: historian {recon[struct],count,fit,mcmc,generate,help,version} [options]
EXAMPLES
Reconstruction:
historian recon seqs.fa [-tree tree.nh] -output fasta >reconstruction.fa
historian recon -guide guide.fa [-tree tree.nh] >reconstruction.stk
historian recon guide.stk >reconstruction.stk
historian recon data.nex -output nexus >reconstruction.nex
Event counting:
historian count seqs.fa [-tree tree.nh] [-model model.json] >counts.json
historian count -guide guide.fa [-tree tree.nh] >counts.json
historian count -recon reconstruction.fa -tree tree.nh >counts.json
Model fitting:
historian fit seqs.fa >newmodel.json
historian fit -counts counts.json >newmodel.json
Simulation:
historian generate [-model model.json] [-rootlen N] tree.nh >sim.stk
Commands can be abbreviated to single letters, like so:
historian r seqs.fa >reconstruction.stk
historian c seqs.fa >counts.json
historian f -counts counts.json >model.json
historian g tree.nh >sim.stk
(etc.)
If a command is omitted, 'reconstruct' is assumed.
OPTIONS
Model specification options
~~~~~~~~~~~~~~~~~~~~~~~~~~~
-model <file> Load substitution & indel model from file (JSON)
-preset <name> Select preset model by name
(jc, dayhoff, jtt, wag, lg, ECMrest, ECMunrest)
-normalize Normalize expected substitution rate
-insrate <R>, -delrate <R>, -insextprob <P>, -delextprob <P>
Override indel parameters
-inslen <L>, -dellen <L>
Alternate way of setting -insextprob & -delextprob
-gaprate <R>, -gapextprob <P>, -gaplen <L>
Shorthand to set both insertion & deletion params
-subscale <N>, -indelscale <N>, -scale <N>
Scale substitution rates, indel rates, or both
-gamma <N> Add N discretized-gamma rate categories
-shape <S> Specify shape parameter for gamma distribution
-savemodel <f> Save model to file, prior to any model-fitting
Reconstruction file I/O options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-auto <file> Auto-detect file format and guess its purpose
-seqs <file> Specify unaligned sequence file (FASTA)
-guide <file> Specify guide alignment file (gapped FASTA)
-tree <file> Specify phylogeny file (New Hampshire)
-nexus <file>, -stockholm <file>
Specify phylogeny & guide alignment together
-saveguide <f> Save guide alignment to file
(guide tree too, if output format allows)
-output (nexus|fasta|stockholm)
Specify output format (default is Stockholm)
-noancs Do not display ancestral sequences
-codon Interpret sequences as spliced protein-coding DNA/RNA
Reconstruction algorithm options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The reconstruction algorithm iterates through the guide tree in postorder,
aligning each sibling pair and reconstructing a profile of their parent.
The dynamic programming is constrained to a band around a guide alignment.
-band <n> Size of band around guide alignment (default 40)
-noband Unlimit band, removing dependence on guide alignment
The reconstructed parent profile is a weighted finite-state transducer
sampled from the posterior distribution implied by the children. The
posterior probability threshold for inclusion in the parent profile and
max number of states in the parent profile can both be tweaked to trade off
sensitivity vs performance.
-profminpost <P>, -profsamples <N>
Specify minimum posterior prob. (P) for retaining DP states
in profile (default .001), or sample N traces randomly
-profmaxstates <S>, -profmaxmem <M>
Limit profile to at most S states, or to use at most M% of
memory for DP matrix (default is -profmaxstates 12000)
Following alignment, ancestral sequence reconstruction can be performed.
-ancseq Predict ancestral sequences (default is to leave them as *'s)
-ancprob Report posterior probabilities for ancestral residues
For additional accuracy in historical reconstruction, the alignment can be
iteratively refined, or MCMC-sampled. By default, refinement is enabled and
MCMC is disabled. (MCMC currently requires an ultrametric tree.)
-norefine Disable iterative refinement after initial reconstruction
-mcmc Run MCMC sampler after reconstruction
-samples <N> Number of MCMC iterations per sequence (default 100)
-trace <file> Specify MCMC trace filename
-fixtree Fix tree during MCMC (sample alignment only)
Guide alignment & tree estimation options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The guide aligner builds a maximal spanning tree of pairwise alignments.
It can be accelerated in two ways. The first is by using a sparse random
forest instead of a fully connected all-vs-all pairwise comparison.
-rndspan Use a sparse random spanning graph, not all-vs-all pairs
The second way to optimize construction of the guide alignment is by
confining the pairwise DP matrix to cells around a subset of diagonals
that contain above a threshold number of k-mer matches. To turn on the
former optimization, use -rndspan; the latter is turned on by default for
sequences whose full DP matrix would not otherwise fit in memory (the
memory threshold can be set with -kmatchmb). It can be disabled with
-kmatchoff, or enabled (for a particular k-mer threshold) with -kmatchn.
-kmatchn <n> Threshold# of kmer matches to seed a diagonal
(default sets this as low as available memory will allow)
-kmatch <k> Length of kmers for pre-filtering heuristic (default 6)
-kmatchband <n> Size of DP band around kmer-matching diagonals (default 64)
-kmatchmb <M> Set kmer threshold to use M megabytes of memory
-kmatchoff No kmer threshold, do full DP
Following construction of the guide alignment, a tree is estimated using a
distance matrix method. By default this is neighbor-joining.
-upgma Use UPGMA to estimate tree (default for MCMC)
-nj Use neighbor-joining, not UPGMA, to estimate tree
-jc Use Jukes-Cantor-like estimates for distance matrix
If you are confident the guide alignment & tree should be reasonably obvious,
and just want to get on to reconstruction as quickly as possible:
-fast Run fast. Shorthand for the following:
-profminpost .01 -profmaxstates 5000
-faster Run even faster. Shorthand for the following:
-rndspan -kmatchn 3 -band 10 -profmaxstates 1 -jc -norefine
Model-fitting and event-counting options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In reconstruction mode, any supplied alignment will be interpreted as a hint,
i.e. a guide alignment. In MCMC, counting, or model-fitting mode, any alignment
that contains a full ancestral sequence reconstruction will be interpreted as a
reconstruction. To force the alignment to be interpreted as a reconstruction,
precede it with -recon, -nexusrecon or -stockrecon (depending on the format).
-recon <file>, -nexusrecon <file>, -stockrecon <file>
Use precomputed reconstruction (FASTA/NEXUS/Stockholm)
-mininc <n> EM convergence threshold as relative log-likelihood increase
(default is .001)
-maxiter <n> Max number of EM iterations (default 100)
-nolaplace Do not add Laplace +1 pseudocounts during model-fitting
-fixsubrates Do not estimate substitution rates or initial composition
-fixgaprates Do not estimate indel rates or length distributions
General options
~~~~~~~~~~~~~~~
-verbose, -vv, -vvv, -v4, -v5, etc.
Various levels of logging (-nocolor for monochrome)
-V, --version Print GNU-style version info
-h, --help Print help message
-seed <n> Seed random number generator (mt19937; default seed 5489)
REFERENCES
The initial reconstruction uses phylogenetic transducers, as described in:
Westesson, Lunter, Paten & Holmes (2012). Accurate Reconstruction of
Insertion-Deletion Histories by Statistical Phylogenetics.
PLoS One, DOI: 10.1371/journal.pone.0034572
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034572
A longer, tutorial-style introduction is available here:
Westesson, Lunter, Paten & Holmes (2012).
Phylogenetic Automata, Pruning, and Multiple Alignment.
http://arxiv.org/abs/1103.4347
Model-fitting uses the following phylogenetic EM algorithm:
Holmes & Rubin (2002). An Expectation Maximization Algorithm
for Training Hidden Substitution Models.
Journal of Molecular Biology, 317(5).
The MCMC kernels for co-sampling alignments and trees are described in:
Holmes & Bruno (2001). Evolutionary HMMs: A Bayesian Approach to
Multiple Alignment. Bioinformatics, 17(9).
Redelings & Suchard (2005). Joint Bayesian Estimation of Alignment
and Phylogeny. Systematic Biology, 54(3).