ketil-malde/transalign

 
 

Transalign - calculate transitive alignments

This program calculates transitive alignments. It takes as input a set of alignments from query sequences to an intermediate database and a set of alignments from the intermediate database to a target database, and it outputs a set of alignments from the queries to the targets.

Typically, the target database is well described but distantly related to the query sequences, and the intermediate database is large, but less well described. Using the intermediate to provide a large evolutionary context allows transalign to detect distant relationships with a higher sensitivity than direct alignments, and without needing to construct explicit stochastic models of the targets.
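To make the idea of a transitive alignment concrete, here is a minimal Python sketch that composes two position maps through the intermediate sequence. It is an illustration only and not transalign's actual algorithm: the dictionary representation, the `compose` function, and the toy coordinates are all hypothetical, and scoring is ignored entirely.

```python
def compose(query_to_mid, mid_to_target):
    """Map query positions to target positions via shared intermediate positions.

    Both arguments are dicts of position -> position (a hypothetical
    representation chosen for this sketch). Query positions whose
    intermediate position is not aligned to the target are dropped.
    """
    return {
        q: mid_to_target[m]
        for q, m in query_to_mid.items()
        if m in mid_to_target
    }


# Toy coordinates (made up for illustration):
# query 1..4 is aligned to intermediate 10..13,
# intermediate 11..14 is aligned to target 100..103.
query_to_mid = {1: 10, 2: 11, 3: 12, 4: 13}
mid_to_target = {11: 100, 12: 101, 13: 102, 14: 103}

transitive = compose(query_to_mid, mid_to_target)
```

Only the query positions that reach the target through the intermediate survive, which is why a large intermediate database helps: more intermediates mean more possible paths from query to target.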

Running transalign

First, you need BLAST results in XML format and an installed transalign executable (see "Downloading and installing transalign" below). This will require a bit of disk space. We're going to use UniRef 50 as the intermediate database, and SwissProt (sprot.fa) as the final target. The process will look something like this:

formatdb -i uniref50.fa -pT
formatdb -i sprot.fa -pT
blastx -i input.fa -d uniref50.fa -e1e-4 -a 8 -m 7 -o inp_vs_u50.xml
blastp -i uniref50.fa -d sprot.fa -e1e-4 -a 8 -m 7 -o u50_vs_sp.xml

(Options are: -e limits the E-value of hits to avoid generating an excess of false positives, -a specifies the number of parallel threads, and -m specifies the output format, in this case XML.) You should now have the necessary input data, and you can run

transalign inp_vs_u50.xml u50_vs_sp.xml > inp_vs_sp.txt

Transalign options

You can display the brief, built-in help by running transalign --help. This gives the following output:

transalign v0.1, ©2012 Ketil Malde

transalign [OPTIONS] [BLASTXML FILES]
Calculate sequence alignments transitively.

Common flags:
  -l --long             long output detailing matching positions
  -c --cache            generate alignment cache for initial alignment
  -n --limit=INT        max number of alignments to consider
  -o --outfile=ITEM     output to file
  -e --extract=ITEM     explicit list of alignments to extract
     --cite             output citation information
  -b --blastfilter=NUM  exclude intermediate alignment with per-column score
                        less than this value (not using this option disables
                        the filter)
  -d --debug            show debug output
  -? --help             Display help message
  -V --version          Print version information
     --numeric-version  Print just the version number
  -v --verbose          Loud verbosity
  -q --quiet            Quiet verbosity

Long output produces a large table matching query positions with target positions, while the default is to output a table similar to BLAST tabular output.

Sometimes BLAST will generate a large number of alignments (for instance, very repetitive proteins will generate many alternative pairwise matches), which causes transalign to consume a substantial amount of memory. You can limit the number of considered alignments using -n.

Using the -e option, transalign outputs only alignments for the requested query sequences. This is most useful when the alignment caches are already generated.
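When transalign is driven from a script, it can help to assemble the command line in one place. The following Python sketch builds an argument list from the flags shown in the help text above (-l, --limit, --outfile). The function name `transalign_argv` and the example limit of 100 are hypothetical, and the -e option is left out because the format of its argument is not described here.

```python
import shlex


def transalign_argv(first_xml, second_xml, limit=None, long_output=False, outfile=None):
    """Build an argument list for transalign from the documented flags.

    first_xml:  BLAST XML of queries vs. the intermediate database
    second_xml: BLAST XML of the intermediate vs. the target database
    """
    argv = ["transalign"]
    if long_output:
        argv.append("-l")                      # long output with matching positions
    if limit is not None:
        argv.append(f"--limit={limit}")        # max number of alignments to consider
    if outfile is not None:
        argv.append(f"--outfile={outfile}")    # write to a file instead of stdout
    argv += [first_xml, second_xml]
    return argv


# Example value for the limit is arbitrary, chosen only for illustration.
argv = transalign_argv("inp_vs_u50.xml", "u50_vs_sp.xml", limit=100, outfile="inp_vs_sp.txt")
command_line = shlex.join(argv)  # a printable form of the same command
```

The list can be passed to `subprocess.run(argv, check=True)` once transalign is installed and on the PATH.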

To speed up operation and avoid repeating work, transalign builds alignment caches (for BLAST results foo.xml, it will create a directory foo.xml.d containing the cache). You can also construct this cache separately, using blastextract foo.xml. There is also a program, showcache, to inspect a cached alignment. By default, transalign builds a cache for the second step but not the first; -c builds caches for both input files.
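A script can check whether a cache already exists before deciding to run blastextract. This small Python sketch follows the naming described above (foo.xml gives foo.xml.d). It assumes the cache directory sits next to the XML file, and the helper names `cache_dir` and `has_cache` are hypothetical.

```python
from pathlib import Path


def cache_dir(blast_xml):
    """Return the expected cache directory: the XML path with '.d' appended."""
    return Path(str(blast_xml) + ".d")


def has_cache(blast_xml):
    """True if the expected cache directory exists (its contents are not checked)."""
    return cache_dir(blast_xml).is_dir()
```

For example, `has_cache("u50_vs_sp.xml")` looks for a directory named u50_vs_sp.xml.d in the current working directory.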

Good and bad practices

As the BLAST XML output can sometimes be large, transalign will parse these files once to generate a cache for them. With recent versions of transalign, this cache will be split over multiple directories, limiting the number of files per directory to something manageable. This does not mean that using NFS is necessarily a good idea, but at least it should work.

Examining the output

The output is in a table format, somewhat similar to BLAST's (-m 8). The columns are: Query, Target, Score, Alignment length, Average score, Query start, Query end, Target start, Target end.
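The nine columns listed above can be read into named fields for further filtering. This Python sketch assumes the fields are separated by whitespace, that sequence identifiers contain no whitespace, and that there is no header line; none of that is confirmed here, so check it against real output first. The `Hit` type, the function names, and the sample row are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    score: float
    length: int        # alignment length
    avg_score: float   # average score
    q_start: int
    q_end: int
    t_start: int
    t_end: int


def parse_line(line):
    """Parse one row of the default (short) output into a Hit."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    q, t, score, length, avg, qs, qe, ts, te = fields
    return Hit(q, t, float(score), int(length), float(avg),
               int(qs), int(qe), int(ts), int(te))


def parse_table(lines):
    """Parse all non-blank lines of an output file."""
    return [parse_line(line) for line in lines if line.strip()]


# Made-up row, for illustration only.
sample = "query1\ttargetA\t250.0\t100\t2.5\t1\t100\t5\t104"
hit = parse_line(sample)
```

With an output file on disk, `parse_table(open("inp_vs_sp.txt"))` would return a list of `Hit` records that can be sorted or filtered, for example by `avg_score`.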

Downloading and installing transalign

The program is written in Haskell, and distributed as source code. This means you need a working Haskell compiler and environment, and optionally, the necessary tools to download the source from its darcs repository.

See the generic installation instructions for details on the various ways to acquire and install transalign.

About

GSoC 2014
