This program calculates transitive alignments, it takes as its input a set of alignmnets from a set of query sequences to an intermediate database, a set of aligmnets from the intermediate database to a target database, and outputs a set of alignments from the queries to the targets.
Typically, the target database is well described but distantly related
to the query sequences, and the intermediate database is large, but
less well described. Using the intermediate to provide a large
evolutionary context allows transalign
to detect distant
relationships with a higher sensitivity than direct alignments, and
without needing to construct explicit stochastic models of the
targets.
First, you need BLAST results in XML format, and an installed
transalign
executable (see the next section for this). This will
require a bit of disk space. We're going to use UniRef 50 as the
intermediate database, and SwissProt (sprot.fa) as the final target.
The process will look something like this:
formatdb -i uniref50.fa -pT
formatdb -i sprot.fa -pT
blastx -i input.fa -d uniref50.fa -e1e-4 -a 8 -m 7 -o inp_vs_u50.xml
blastp -i uniref50.fa -d sprot.fa -e1e-4 -a 8 -m 7 -o u50_vs_sp.xml
(Options are: -e
limits the evalue of hits to avoid generating an excess of false
positives, -a
specifies the number of parallel threads, and -m
specifies the output format, in this case XML.)
You should now have the necessary input data, and you can run
transalign inp_vs_u50.xml u50_vs_sp.xml > inp_vs_sp.txt
You can display the brief, built-in help by running
transalign --help
. This gives the following output:
transalign v0.1, ©2012 Ketil Malde
transalign [OPTIONS] [BLASTXML FILES]
Calculate sequence alignments transitively.
Common flags:
-l --long long output detailing matching positions
-c --cache generate alignment cache for initial alignment
-n --limit=INT max number of alignments to consider
-o --outfile=ITEM output to file
-e --extract=ITEM explicit list of alignments to extract
--cite output citation information
-b --blastfilter=NUM exclude intermediate alignment with per-column score
less than this value (not using this option disables
the filter)
-d --debug show debug output
-? --help Display help message
-V --version Print version information
--numeric-version Print just the version number
-v --verbose Loud verbosity
-q --quiet Quiet verbosity
Long output produces a large table matching query positions with target positions, while the default is to output a table similar to BLAST tabular output.
Sometimes BLAST will generate a large number of alignments (for
instance will very repetitive proteins generate many alternative
pairwise matches) which causes transalign
to consume a substantial
amount of memory. You can limit the number of considered alignments
using -n
.
Using the -e
option, transalign
outputs only alignments for the
requested query sequences. This is most useful when the alignment
caches are already generated.
To speed up operation and avoid doing the same work over, transalign
builds alignment caches (for BLAST results foo.xml
, it will create a
directory foo.xml.d
containing the cache). You can also construct
this cache separately, using blastextract foo.xml
. There is also a
program, showcache
, to inspect a cached alignment. The default is
to build a cache for the second step, but not the first, -c
builds
caches for both input files.
As the BLAST XML output can sometimes be large, transalign
will
parse these files once to generate a cache for them. With recent
versions of transalign
, this cache will be split over multiple
directories, limiting the number of files per directory to something
manageable. This does not mean that using NFS necessesarily is such a
good idea, but at least it should work.
The output is in a table format, somewhat similar to BLAST's (-m 8). The columns are: Query, Target, Score, Alignment length, Average score, Query start, Query end, Target start, Target end.
The program is written in Haskell, and distributed as source code.
This means you need a working Haskell compiler and environment, and
optionally, the necessary tools to download the source from its
darcs
repository.
See the generic
installation instructions for
details on the various ways to acquire and install transalign
.