- Introduction
- Installation
  2.1. Installing from conda
  2.2. Compile from source
- Running EMERALD
  3.1. EMERALD input
  3.2. Command line options
  3.3. EMERALD output
  3.4. Example
  3.5. Precomputed alignment safety windows
- About EMERALD
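EMERALD is a command line protein sequence aligner that explores the suboptimal alignment space and calculates safety windows: partial alignments contained in an $\alpha$ proportion of all alignments in that space.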
EMERALD's features include:
- custom substitution matrices (by default: BLOSUM62) and an affine-linear gap score
- multithreading
- selection of a custom representative sequence
Schematic representation of EMERALD’s safety window calculation
EMERALD is available pre-compiled for Linux and macOS (Apple silicon). You can download the EMERALD binary and run it on the command line.
EMERALD can be installed via conda:
conda install -c conda-forge -c bioconda emerald
EMERALD is written in C++ and uses the gmp library for the representation of big integers. Additionally, cmake is needed for the compilation. After installing gmp and downloading the source, navigate to its main directory and run
cmake .
followed by
make
to compile.
Use `--help` for a first overview of the commands.
EMERALD expects `.fasta` cluster files of protein sequences.
EMERALD defines two kinds of sequences: the single representative sequence, and the cluster members, which are all the other sequences. The representative sequence is aligned with each of the cluster members.
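The split into a representative sequence and cluster members can be sketched in Python as follows. This is a minimal illustration, not EMERALD's own parser: the function names `read_fasta` and `split_cluster` are hypothetical, and matching the `--reference` identifier as a substring of the description is an assumption about how "unique identifier" is interpreted.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (description, sequence)


def read_fasta(text: str) -> List[Record]:
    """Read FASTA text into (description, sequence) pairs, in file order."""
    records: List[Record] = []
    description: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def split_cluster(
    records: List[Record], reference: Optional[str] = None
) -> Tuple[Record, List[Record]]:
    """Return (representative, cluster members).

    Without `reference`, the first record is the representative.
    With `reference`, exactly one description must contain it
    (substring matching is an assumption of this sketch).
    """
    if not records:
        raise ValueError("the cluster contains no sequences")
    index = 0
    if reference is not None:
        matches = [i for i, (desc, _) in enumerate(records) if reference in desc]
        if len(matches) != 1:
            raise ValueError(f"{reference!r} matches {len(matches)} sequences")
        index = matches[0]
    members = records[:index] + records[index + 1:]
    return records[index], members
```

Each cluster member returned here corresponds to one sequence that is aligned against the representative.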
- `-f, --file {FILE}`
  Path to input FASTA file, mandatory argument.
- `-o, --output {FILE}`
  Path to output file, mandatory argument. Note: EMERALD does not erase the content of the output file but only appends to it.
- `-a, --alpha {value}`
  $\alpha$ value for safety, $0.5 < \alpha \leq 1$, by default: 0.75. The safety windows will be partial alignments contained in an $\alpha$ proportion of all alignments. If $\alpha$ is chosen outside this range, a warning will be displayed; EMERALD will keep running but may crash.
- `-d, --delta {value}`
  $\Delta$ value for the size of the suboptimal space, any non-negative integer, by default: 0. The larger $\Delta$ is, the more alignments are considered suboptimal, which decreases the number and lengths of the safety windows.
- `-i, --threads {value}`
  Number of threads to use, by default: 1.
- `-r, --reference {sequence}`
  Select a specific sequence as the representative sequence by some unique identifier in the sequence description. By default, the first sequence in the cluster is the representative.
- `-c, --costmat {file}`
  This file is a lower triangular matrix C for which C[a][b] is the score of aligning the amino acids a and b. The amino acids are given in the following order: Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val. Examples are given in the utils directory.
- `-s, --special {value}`
  Integer score assigned to aligned amino acids in which one of the two is not included in the list above.
- `-g, --gapcost {value}` and `-e, --startgap {value}`
  Define the affine-linear gap score function, by default -1 and -11, respectively.
- `-m, --windowmerge`
  In addition to printing the calculated safety windows, EMERALD merges them and prints additional lines with the merged safety windows. Safety windows are merged if they intersect or are adjacent to each other.
- `-w, --drawgraph {dir}`
  Experimental: writes dot graph files into the given directory, plotting the suboptimal alignment graph.
By default, EMERALD uses the BLOSUM62 substitution matrix for its cost assignments.
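When EMERALD is called from a script, the options above can be assembled and checked before the run starts. The following Python sketch is one way to do that. The helper `build_emerald_command` and the default binary path `./emerald` are assumptions for illustration. Unlike EMERALD itself, which only warns when $\alpha$ is out of range, this sketch rejects such values up front.

```python
from typing import List, Optional


def build_emerald_command(
    fasta: str,
    output: str,
    alpha: float = 0.75,
    delta: int = 0,
    threads: int = 1,
    reference: Optional[str] = None,
    binary: str = "./emerald",  # assumed location of the executable
) -> List[str]:
    """Build an argument list using the documented short options."""
    if not 0.5 < alpha <= 1:
        raise ValueError("alpha must satisfy 0.5 < alpha <= 1")
    if delta < 0:
        raise ValueError("delta must not be negative")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    command = [
        binary,
        "-f", fasta,
        "-o", output,
        "-a", str(alpha),
        "-d", str(delta),
        "-i", str(threads),
    ]
    if reference is not None:
        command += ["-r", reference]
    return command


# The list can then be passed to subprocess.run(command, check=True).
```

Because EMERALD appends to the output file, a script that reruns the same command may want to remove or rename an existing output file first.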
EMERALD's output is stored in the given output file, while stdout is used for log messages. The first part of the output is the following.
Representative sequence description
Representative sequence
Number of aligned sequence pairs
This is followed, for every aligned sequence pair, by:
Cluster sequence description
Cluster sequence
Number of safety windows
Finally, every safety window is printed on a separate line as four integers (see the example below).
Safety windows are half-open intervals: the left index is inclusive, the right index is exclusive, and indexing starts at 0.
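The `-m, --windowmerge` option merges safety windows that intersect or are adjacent. For half-open intervals, two intervals `[l1, r1)` and `[l2, r2)` with `l1 <= l2` intersect or touch exactly when `l2 <= r1`. The Python sketch below applies that rule to a list of one-dimensional intervals. It illustrates the interval logic only and is not EMERALD's implementation; how EMERALD merges the intervals on both sequences jointly is not described here, and the name `merge_intervals` is hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open: left inclusive, right exclusive


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge half-open intervals that intersect or are adjacent."""
    merged: List[Interval] = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            # Overlapping (left < previous right) or adjacent (left == previous right).
            previous_left, previous_right = merged[-1]
            merged[-1] = (previous_left, max(previous_right, right))
        else:
            merged.append((left, right))
    return merged
```

For example, `[0, 2)` and `[2, 5)` are adjacent and merge into `[0, 5)`, while `[0, 2)` and `[3, 5)` stay separate because index 2 belongs to neither.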
`examples/ex1.fasta` (same as in the Overview):
>Representative sequence
MSFDLKSKFLG
>Cluster member 1
MSKLKDFLFKS
>Cluster member 2
MSLGSFKDKFL
>Cluster member 3
MSLKDKKFLKS
>Cluster member 4
MSFLKKKFDSL
Output (in `examples/ex1.out`):
$ ./emerald -f examples/ex1.fasta -o examples/ex1.out -a 0.75 -d 8
>Representative sequence
MSFDLKSKFLG
5
>Cluster member 1
MSKLKDFLFKS
3
0 2 0 2
4 6 3 5
8 11 8 11
>Cluster member 2
MSLGSFKDKFL
2
0 3 0 3
4 9 5 10
>Cluster member 3
MSLKDKKFLKS
2
0 2 0 2
7 10 6 9
>Cluster member 4
MSFLKKKFDSL
2
0 3 0 3
5 9 4 8
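An output file like the one above can be read back with a short script. The Python sketch below is a hedged illustration with hypothetical names (`parse_emerald_output`, `AlignedPair`, `EmeraldResult`). It assumes the file holds the output of a single run without `--windowmerge`, and it keeps each safety window as a plain tuple of integers rather than interpreting the columns. Since the header line in the example reads 5 while four cluster member blocks follow, the sketch reads blocks until the end of the input instead of relying on that count.

```python
from typing import List, NamedTuple, Tuple

Window = Tuple[int, ...]  # the integers of one safety window line, uninterpreted


class AlignedPair(NamedTuple):
    description: str
    sequence: str
    windows: List[Window]


class EmeraldResult(NamedTuple):
    representative_description: str
    representative_sequence: str
    declared_pairs: int
    pairs: List[AlignedPair]


def _strip_marker(line: str) -> str:
    """Drop a leading '>' from a description line, if present."""
    return line[1:].strip() if line.startswith(">") else line


def parse_emerald_output(text: str) -> EmeraldResult:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("output is too short to contain a header")
    rep_description = _strip_marker(lines[0])
    rep_sequence = lines[1]
    declared_pairs = int(lines[2])

    pairs: List[AlignedPair] = []
    i = 3
    while i < len(lines):
        if i + 2 >= len(lines):
            raise ValueError("truncated cluster member block")
        description = _strip_marker(lines[i])
        sequence = lines[i + 1]
        count = int(lines[i + 2])  # number of safety windows that follow
        rows = lines[i + 3 : i + 3 + count]
        if len(rows) != count:
            raise ValueError("fewer safety window lines than announced")
        windows = [tuple(int(x) for x in row.split()) for row in rows]
        pairs.append(AlignedPair(description, sequence, windows))
        i += 3 + count
    return EmeraldResult(rep_description, rep_sequence, declared_pairs, pairs)
```

Because EMERALD appends to an existing output file, a file written by several runs would need to be split per run before it is passed to this function.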
We have pre-computed safety windows for the SwissProt database (~400k sequences), clustered with DIAMOND2 DeepClust. This pre-computed dataset can be downloaded from figshare.
EMERALD is being developed by Andreas Grigorjew in the Graph Algorithms team, part of the Algorithmic Bioinformatics group at the University of Helsinki.
If you encounter bugs or want to give feedback, please use the Issue tracker or contact me directly.
Please cite the following reference when using EMERALD for your research:
- Grigorjew, A., Gynter, A., Dias, F.H. et al. Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD. Genome Biol 24, 168 (2023). https://doi.org/10.1186/s13059-023-03008-6
- An author erratum is available here.
Experimental data was clustered using DIAMOND DeepClust:
- Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust", bioRxiv 2023.01.24.525373; doi: https://doi.org/10.1101/2023.01.24.525373