Allio, R., Schomaker-Bastos, A., Romiguier, J., Prosdocimi, F., Nabholz, B., & Delsuc, F.
MitoFinder is a pipeline to assemble mitochondrial genomes and annotate mitochondrial genes from trimmed sequencing reads.
This software is suitable for all Linux-like systems with gcc installed (unfortunately not Windows versions before 10).
- Installation guide for MitoFinder
- How to use MitoFinder
- Detailed options
- INPUTS
- OUTPUTS
- Particular cases
- UCE annotation
- How to cite MitoFinder
- How to get reference mitochondrial genomes from NCBI
- How to submit your annotated mitochondrial genome(s) to GenBank NCBI
Clone MitoFinder from GitHub
git clone https://github.com/RemiAllio/MitoFinder.git
cd MitoFinder
./install.sh
PATH/TO/MITOFINDER/mitofinder -h
or download master.zip
wget https://github.com/RemiAllio/MitoFinder/archive/master.zip
unzip master.zip
mv MitoFinder-master MitoFinder
cd MitoFinder
./install.sh
PATH/TO/MITOFINDER/mitofinder -h
cd PATH/TO/MITOFINDER/
p=$(pwd)
echo -e "\n#Path to mitofinder \nexport PATH=\$PATH:$p" >> ~/.bashrc
source ~/.bashrc
WARNING: If you previously installed MitoFinder on your system and want to install a new version, you should replace the old MitoFinder path with the updated one in your ~/.bashrc file. To do so, edit your ~/.bashrc file, remove the lines that add MitoFinder to the PATH, and close your terminal. Then open a new terminal and re-run the commands above.
TIP: If you are connected to a cluster, you can use either nano or vi to edit the ~/.bashrc file.
To check if the right version of MitoFinder is actually in your PATH:
mitofinder -v
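The PATH setup above appends a new line to ~/.bashrc each time it is run. As a minimal sketch (the function name add_mitofinder_to_path is hypothetical and not part of MitoFinder), the append can be made safe to repeat by first checking whether the exact line is already there. It does not remove entries that point to an older installation in a different directory; those still have to be deleted by hand as described in the warning above.

```shell
# Append a PATH export for MitoFinder to a shell startup file, once only.
# Usage: add_mitofinder_to_path /abs/path/to/MitoFinder [rcfile]
# (hypothetical helper; rcfile defaults to ~/.bashrc)
add_mitofinder_to_path() (
    dir=$1
    rcfile=${2:-"$HOME/.bashrc"}
    # \$PATH stays literal so it is expanded when the rc file is sourced
    line="export PATH=\$PATH:$dir"
    # -x: whole-line match, -F: fixed string (no regex), -q: quiet
    if grep -qxF "$line" "$rcfile" 2>/dev/null; then
        echo "already present in $rcfile"
    else
        printf '\n#Path to mitofinder\n%s\n' "$line" >> "$rcfile"
        echo "added to $rcfile"
    fi
)
```

For example, add_mitofinder_to_path "$(pwd)" run from the MitoFinder directory, followed by source ~/.bashrc.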
Clone MitoFinder from GitHub
git clone https://github.com/RemiAllio/MitoFinder.git
or download master.zip
wget https://github.com/RemiAllio/MitoFinder/archive/master.zip
unzip master.zip
mv MitoFinder-master MitoFinder
Once installed, you need to indicate the paths to the directories containing the executables in the Mitofinder.config file.
TIPS:
(1) If the executable is in your PATH, you can find it with which. For example, which megahit.
(2) If not, you can go to the directory containing the executable and use pwd to get the path. Then, you can copy the path into the Mitofinder.config file.
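The first tip can be applied to several tools at once. The sketch below is only an illustration: exe_dir is a hypothetical helper, and the tool names in the loop are the ones mentioned in this guide (adjust them to the assembler you installed). It prints the directory to copy into Mitofinder.config for each executable found on the PATH; it does not edit the config file itself.

```shell
# Print the directory that holds an executable found on PATH.
# (hypothetical helper; fails if the name is not on PATH)
exe_dir() (
    path=$(command -v "$1") || { echo "$1: not found in PATH" >&2; exit 1; }
    dirname "$path"
)

# Report where each tool lives, or flag it as missing.
for tool in megahit makeblastdb blastn blastx; do
    if dir=$(exe_dir "$tool" 2>/dev/null); then
        printf '%s\t%s\n' "$tool" "$dir"
    else
        printf '%s\tNOT FOUND\n' "$tool"
    fi
done
```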
Given that MitoFinder uses makeblastdb, blastn, and blastx, you need to download the associated binaries (latest versions here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/).
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.10.0+-x64-macosx.tar.gz
tar -xvf ncbi-blast-2.10.0+-x64-macosx.tar.gz
cd ncbi-blast-2.10.0+/bin/
Once installed, you need to indicate the PATH to the directory containing the binaries in the Mitofinder.config file.
To get MitoFinder to work, you need to install at least one of the following assemblers.
Installation with conda:
conda install -c bioconda megahit
If you install megahit with conda, the executable will be in the miniconda/anaconda bin directory.
For example : /Users/remiallio/bin/miniconda2/bin/
We recommend downloading the pre-compiled binaries:
curl http://cab.spbu.ru/files/release3.14.0/SPAdes-3.14.0-Darwin.tar.gz -o SPAdes-3.14.0-Darwin.tar.gz
tar -zxf SPAdes-3.14.0-Darwin.tar.gz
cd SPAdes-3.14.0-Darwin/bin/
To our knowledge, IDBA-UD is not supported for Mac OS at the moment.
The arwen source code is available in the arwen directory of MitoFinder. However, the bundled executable is compiled for Linux, so on Mac OS you need to compile it yourself using gcc.
cd PATH/TO/MITOFINDER/arwen/
gcc arwen1.2.3.c
mv a.out arwen
Once it is compiled, you can test it by running:
./arwen -h
First, you can choose the assembler using the following options:
--megahit (default: faster)
--metaspades (recommended: a bit slower but more efficient (see associated paper). WARNING: Not compatible with single-end reads)
--idba
TIP: use mitofinder --example to print basic usage examples
mitofinder -j [seqid] -1 [left_reads.fastq.gz] -2 [right_reads.fastq.gz] -r [genbank_reference.gb] -o [genetic_code] -p [threads] -m [memory]
mitofinder -j [seqid] -s [SE_reads.fastq.gz] -r [genbank_reference.gb] -o [genetic_code] -p [threads] -m [memory]
MitoFinder can also be run directly on a previously computed assembly (one or several contigs in FASTA format)
mitofinder -j [seqid] -a [assembly.fasta] -r [genbank_reference.gb] -o [genetic_code] -p [threads] -m [memory]
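The three command lines above differ only in how the input is passed (-1/-2, -s, or -a). The sketch below, built around a hypothetical helper named mitofinder_cmd, prints the matching command for a given set of files so it can be reviewed before running. Treating a single .fasta/.fa/.fas file as an assembly is an assumption of this sketch, not a MitoFinder rule, and file names containing spaces are not handled.

```shell
# Print the MitoFinder command line matching the kind of input provided.
# Usage: mitofinder_cmd SEQID REFERENCE.gb GENETIC_CODE INPUT1 [INPUT2]
#   two inputs        -> paired-end reads (-1/-2)
#   one FASTA input   -> existing assembly (-a)   [assumption of this sketch]
#   one other input   -> single-end reads (-s)
mitofinder_cmd() (
    seqid=$1 ref=$2 code=$3 in1=$4 in2=${5:-}
    if [ -n "$in2" ]; then
        input="-1 $in1 -2 $in2"
    else
        case $in1 in
            *.fasta|*.fa|*.fas) input="-a $in1" ;;
            *)                  input="-s $in1" ;;
        esac
    fi
    echo "mitofinder -j $seqid $input -r $ref -o $code"
)
```

The -p and -m options can be appended to the printed command as in the examples above.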
To restart an interrupted run, use the same command line.
WARNING: If you want to compute the assembly again (for example because it failed), you have to remove the assembly results directory or use the --override option. If not, MitoFinder will skip the assembly step.
Depending on how closely related your reference is, you can adjust the following parameters: --nwalk; --blast-eval; --blast-identity-nucl; --blast-identity-prot; --blast-size
cd PATH/TO/MITOFINDER/test_case/
mitofinder -j Aphaenogaster_megommata_SRR1303315 -1 Aphaenogaster_megommata_SRR1303315_R1_cleaned.fastq.gz -2 Aphaenogaster_megommata_SRR1303315_R2_cleaned.fastq.gz -r reference.gb -o 5 -p 5 -m 10
usage: mitofinder [-h] [--megahit] [--idba] [--metaspades] [-j PROCESSNAME]
[-1 PE1] [-2 PE2] [-s SE] [-a ASSEMBLY] [-m MEM]
[-l SHORTESTCONTIG] [-p PROCESSORSTOUSE] [-r REFSEQFILE]
[-e BLASTEVAL] [-n NWALK] [--override] [--adjust-direction]
[--ignore] [--new-genes] [--allow-intron] [--numt]
[--intron-size INTRONSIZE] [--max-contig MAXCONTIG]
[--cds-merge] [--out-gb] [--contig-size CONTIGSIZE]
[--rename-contig RENAME]
[--blast-identity-nucl BLASTIDENTITYNUCL]
[--blast-identity-prot BLASTIDENTITYPROT]
[--blast-size ALIGNCUTOFF] [--circular-size CIRCULARSIZE]
[--circular-offset CIRCULAROFFSET] [-o ORGANISMTYPE] [-v]
[--example] [--citation]
Mitofinder is a pipeline to assemble and annotate mitochondrial DNA from
trimmed sequencing reads.
optional arguments:
-h, --help show this help message and exit
--megahit Use Megahit for assembly. (Default)
--idba Use IDBA-UD for assembly.
--metaspades Use MetaSPAdes for assembly.
-j PROCESSNAME, --seqid PROCESSNAME
Sequence ID to be used throughout the process
-1 PE1, --Paired-end1 PE1
File with forward paired-end reads
-2 PE2, --Paired-end2 PE2
File with reverse paired-end reads
-s SE, --Single-end SE
File with single-end reads
-a ASSEMBLY, --assembly ASSEMBLY
File with your own assembly
-m MEM, --max-memory MEM
max memory to use in Go (MEGAHIT or MetaSPAdes)
-l SHORTESTCONTIG, --length SHORTESTCONTIG
Shortest contig length to be used (MEGAHIT). Default =
100
-p PROCESSORSTOUSE, --processors PROCESSORSTOUSE
Number of threads Mitofinder will use at most.
-r REFSEQFILE, --refseq REFSEQFILE
Reference mitochondrial genome in GenBank format
(.gb).
-e BLASTEVAL, --blast-eval BLASTEVAL
e-value of blast program used for contig
identification and annotation. Default = 0.00001
-n NWALK, --nwalk NWALK
Maximum number of codon steps to be tested on each
size of the gene to find the start and stop codon
during the annotation step. Default = 5 (30 bases)
--override This option forces MitoFinder to override the previous
output directory for the selected assembler.
--adjust-direction This option tells MitoFinder to adjust the direction
of selected contig(s) (given the reference).
--ignore This option tells MitoFinder to ignore the non-
standart mitochondrial genes.
--new-genes This option tells MitoFinder to try to annotate the
non-standard animal mitochondrial genes (e.g. rps3 in
fungi). If several references are used, make sure the
non-standard genes have the same names in the several
references
--allow-intron This option tells MitoFinder to search for genes with
introns. Recommendation : Use it on mitochondrial
contigs previously found with MitoFinder without this
option.
--numt This option tells MitoFinder to search for both
mitochondrial genes and NUMTs. Recommendation : Use it
on nuclear contigs previously found with MitoFinder
without this option.
--intron-size INTRONSIZE
Size of intron allowed. Default = 5000 bp
--max-contig MAXCONTIG
Maximum number of contigs matching to the reference to
keep. Default = 0 (unlimited)
--cds-merge This option tells MitoFinder to not merge the exons in
the NT and AA fasta files.
--out-gb Do not create annotation output file in GenBank
format.
--contig-size CONTIGSIZE
Minimum size of a contig to be considered. Default =
1000
--rename-contig RENAME
"yes/no" If "yes", the contigs matching the
reference(s) are renamed. Default is "yes" for de novo
assembly and "no" for existing assembly (-a option)
--blast-identity-nucl BLASTIDENTITYNUCL
Nucleotide identity percentage for a hit to be
retained. Default = 50
--blast-identity-prot BLASTIDENTITYPROT
Amino acid identity percentage for a hit to be
retained. Default = 40
--blast-size ALIGNCUTOFF
Percentage of overlap in blast best hit to be
retained. Default = 30
--circular-size CIRCULARSIZE
Size to consider when checking for circularization.
Default = 45
--circular-offset CIRCULAROFFSET
Offset from start and finish to consider when looking
for circularization. Default = 200
-o ORGANISMTYPE, --organism ORGANISMTYPE
Organism genetic code following NCBI table (integer):
1. The Standard Code 2. The Vertebrate Mitochondrial
Code 3. The Yeast Mitochondrial Code 4. The Mold,
Protozoan, and Coelenterate Mitochondrial Code and the
Mycoplasma/Spiroplasma Code 5. The Invertebrate
Mitochondrial Code 6. The Ciliate, Dasycladacean and
Hexamita Nuclear Code 9. The Echinoderm and Flatworm
Mitochondrial Code 10. The Euplotid Nuclear Code 11.
The Bacterial, Archaeal and Plant Plastid Code 12. The
Alternative Yeast Nuclear Code 13. The Ascidian
Mitochondrial Code 14. The Alternative Flatworm
Mitochondrial Code 16. Chlorophycean Mitochondrial
Code 21. Trematode Mitochondrial Code 22. Scenedesmus
obliquus Mitochondrial Code 23. Thraustochytrium
Mitochondrial Code 24. Pterobranchia Mitochondrial
Code 25. Candidate Division SR1 and Gracilibacteria
Code
-v, --version Version 1.3
--example Print getting started examples
--citation How to cite MitoFinder
MitoFinder needs several files to run, depending on the method you have chosen (see above):
- Reference_file.gb containing at least one mitochondrial genome of reference extracted from NCBI
- left_reads.fastq.gz containing the left reads of paired-end sequencing
- right_reads.fastq.gz containing the right reads of paired-end sequencing
- SE_reads.fastq.gz containing the reads of single-end sequencing
- assembly.fasta containing the assembly in which MitoFinder has to find and annotate mitochondrial contig(s)
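Before launching a paired-end run, it can be worth checking that the two read files are consistent. The sketch below is an optional sanity check and not part of MitoFinder: count_fastq_reads and check_pairs are hypothetical helpers, and they assume gzip-compressed FASTQ files with exactly four lines per read.

```shell
# Count reads in a gzip-compressed FASTQ file (assumes 4 lines per record).
count_fastq_reads() (
    lines=$(gzip -cd "$1" | wc -l)
    echo $((lines / 4))
)

# Succeed only if both mate files contain the same number of reads.
check_pairs() (
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads" >&2
        exit 1
    fi
)
```

For example: check_pairs left_reads.fastq.gz right_reads.fastq.gz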
MitoFinder returns several files for each mitochondrial contig found:
- [Seq_ID]_final_genes_NT.fasta containing the nucleotide sequences of the final genes selected from all contigs found by MitoFinder
- [Seq_ID]_final_genes_AA.fasta containing the amino acid sequences of the final genes selected from all contigs found by MitoFinder
- [Seq_ID]_mtDNA_contig.fasta containing a mitochondrial contig
- [Seq_ID]_mtDNA_contig.gff containing the final annotation for a given contig (GFF3 format)
- [Seq_ID]_mtDNA_contig.tbl containing the final annotation for a given contig (GenBank submission format)
- [Seq_ID]_mtDNA_contig.gb containing the final annotation for a given contig (GenBank format for visualization)
- [Seq_ID]_mtDNA_contig_genes_NT.fasta containing the nucleotide sequences of annotated genes for a given contig
- [Seq_ID]_mtDNA_contig_genes_AA.fasta containing the amino acid sequences of annotated genes for a given contig
- [Seq_ID]_mtDNA_contig.png schematic representation of the annotation of the mtDNA contig
- [Seq_ID]_mtDNA_contig.infos containing the initial contig name, the length of the contig and the GC content
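A quick way to inspect the FASTA outputs listed above is to print the length of every record. The following sketch uses a hypothetical helper, fasta_lengths, which works on any FASTA file and makes no assumption about how MitoFinder names the sequences.

```shell
# Print "header<TAB>length" for every record in a FASTA file.
# (hypothetical helper; sequence lines may be wrapped)
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len
               name = substr($0, 2); len = 0; next }
             { len += length($0) }
        END  { if (name != "") print name "\t" len }
    ' "$1"
}
```

For example, running fasta_lengths on a [Seq_ID]_final_genes_NT.fasta file lists each recovered gene with its length in nucleotides.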
/!\ Close reference required /!\
For the particular cases below, we recommend using MitoFinder in two different steps. First, you can use it to assemble and/or identify mitochondrial-like contigs, then use it in a second step to annotate these particular contigs (option -a) with the corresponding additional options.
Also, these options are recommended for cases in which a (really) close reference is available.
/!\ Close reference required /!\
In some taxa (e.g. fungi), mitochondrial genes can contain intron(s). For these cases, we added the --allow-intron option (combined with --intron-size and --cds-merge). However, it is important to note that, although the search for start and stop codons is functional with this option, there is no search for intron boundaries. The exon annotation is based only on similarity with the reference. That is why a close reference is necessary, and even with a good reference we recommend double-checking the exon annotation.
/!\ Close reference required /!\
Once you have identified nuclear contigs that may contain NUMTs, you can use MitoFinder to find the NUMTs using the --numt option. Basically, this option allows MitoFinder to find the same gene several times in a contig. Given that NUMTs can be full of stop codons, we recommend limiting the number of walks that MitoFinder can do when looking for start and stop codons (--nwalk 0) to improve the annotation.
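The two-step use described for these particular cases could look like the sketch below. It only uses options listed in the help output, but the file names, sequence IDs, and genetic codes are hypothetical placeholders to adapt to your data. The run helper is also hypothetical: it prints each command by default and executes it only when RUN=1 is set.

```shell
# Dry-run by default: print each command; set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

ref=reference.gb   # hypothetical: closely related reference in GenBank format

# Second pass on mitochondrial contigs from a first run, allowing introns.
# Add --cds-merge if you want the behaviour described for it in the help text.
run mitofinder -j intron_pass -a mito_contigs.fasta -r "$ref" -o 4 \
    --allow-intron --intron-size 5000

# Second pass on nuclear contigs to look for NUMTs, with codon walking disabled.
run mitofinder -j numt_pass -a nuclear_contigs.fasta -r "$ref" -o 2 \
    --numt --nwalk 0
```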
MitoFinder starts by assembling both mitochondrial and nuclear reads using de novo metagenomic assemblers. It is only in a second step that mitochondrial contigs are identified and extracted. MitoFinder thus provides UCE contigs that are already assembled and the annotation can be done from the following file:
- [Seq_ID]_link_[assembler].scafSeq containing all assembled contigs from raw reads.
To do so, we recommend the use of the PHYLUCE pipeline which is specifically designed to annotate ultraconserved elements (Faircloth 2015; Tutorial: https://phyluce.readthedocs.io/en/latest/tutorial-one.html#finding-uce-loci).
You can thus use the file [Seq_ID]_link_[assembler].scafSeq and start the PHYLUCE pipeline at the "Finding UCE" step.
If you use MitoFinder, please cite:
- Allio, R, Schomaker‐Bastos, A, Romiguier, J, Prosdocimi, F, Nabholz, B, Delsuc, F. MitoFinder: Efficient automated large‐scale extraction of mitogenomic data in target enrichment phylogenomics. Mol Ecol Resour. 2020; 00: 1– 14. https://doi.org/10.1111/1755-0998.13160
Please also cite the following references depending on the option chosen for the assembly step in MitoFinder:
- Li, D., Luo, R., Liu, C. M., Leung, C. M., Ting, H. F., Sadakane, K., Yamashita, H. & Lam, T. W. (2016). MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102(6), 3-11.
- Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome research, 27(5), 824-834.
- Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.
For tRNAs annotation:
- Laslett, D., & Canbäck, B. (2008). ARWEN: a program to detect tRNA genes in metazoan mitochondrial nucleotide sequences. Bioinformatics, 24(2), 172-175.
For UCEs extraction:
- Faircloth, B. C. (2016). PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics, 32(5), 786-788.
- Go to NCBI
- Select "Nucleotide" in the search bar
- Search for mitochondrion genomes:
- RefSeq (if available)
- Sequence length from 12000 to 20000
- Download complete record in GenBank format
If you have few mitochondrial genomes to submit, you should be able to do it with BankIt through the NCBI submission portal.
If you want to submit several complete or partial mitogenomes, we designed MitoFinder to streamline the submission process using tbl2asn.
tbl2asn requires:
- Template file containing a text ASN.1 Submit-block object (suffix .sbt). Create submission template.
- Nucleotide sequence data containing the mitochondrial sequence(s) and associated information (suffix .fsa).
- Feature Table containing annotation information for the mitochondrial sequence(s).
- Comment file containing assembly and annotation method information (assembly.cmt). Create comment template
Because tbl2asn requires the FASTA file to contain information associated with the data, we wrote a script to create a FASTA file containing the mitochondrial contig(s) found by MitoFinder for each species (Seq_ID) with the associated information. This script and the associated example files can be found in the MitoFinder directory named "NCBI_submission".
- index_file.csv A CSV file (comma-delimited table) containing the metadata information.
The headers of the index file are as follows: Directory path, Seq ID, organism, location, mgcode, SRA, keywords ...
The first two columns are mandatory and the names cannot be changed but you can complete the index file with the different source modifiers of NCBI by adding columns in the index file.
The directory path corresponds to the path where the [Seq_ID]_mtDNA_contig.fasta file, or the [Seq_ID]_mtDNA_contig_*.fasta files if you have several contigs for the same individual, can be found. If left blank, the script will search for the contig in the directory from which you run the script (./).
/PATH/TO/MITOFINDER/NCBI_submission/create_tbl2asn_files.py -i index_file.csv
TIPS:
(1) You can copy or link (symbolic links) all your FASTA and TBL contig files in the same directory and run the script from this directory.
(2) You can leave blanks in the index file if some species do not need a given source modifier.
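As an illustration of the index file layout described above, the sketch below writes a one-sample index_file.csv and checks that every row has the same number of columns as the header. The column names follow the list given above, but their exact spelling in the file is an assumption here; the example files shipped in the NCBI_submission directory are the authoritative template. The sample values reuse the test case name, and the location and keywords columns are left blank.

```shell
# Write a minimal index file with one sample (values are illustrative).
cat > index_file.csv <<'EOF'
Directory path,Seq ID,organism,location,mgcode,SRA,keywords
./,Aphaenogaster_megommata_SRR1303315,Aphaenogaster megommata,,5,SRR1303315,
EOF

# Every row should have as many comma-separated fields as the header.
awk -F, 'NR == 1 { n = NF } NF != n { bad = 1 } END { exit bad }' index_file.csv \
    && echo "index_file.csv: column count is consistent"
```

Note that this simple check does not handle quoted fields that themselves contain commas.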
- [Seq_ID].fsa new FASTA file containing all mtDNA contigs and the information for a given [Seq_ID]
- [Seq_ID].tbl new TBL file containing all mtDNA contigs and the information for a given [Seq_ID]
Once your FASTA and TBL files have been created, you can run tbl2asn (download here: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/) as follows:
tbl2asn -t template.sbt -i [Seq_ID].fsa -V v -w assembly.cmt -a s
This command will create several files:
- [Seq_ID].sqn Submission file (.sqn) to be sent by e-mail to gb-sub@ncbi.nlm.nih.gov
- [Seq_ID].val Containing ERROR and WARNING values associated with tbl2asn. (ERROR explanations here)
If you don't have any errors and you are happy with the annotation, you can submit your mitochondrial contig(s) by sending the .sqn files to gb-sub@ncbi.nlm.nih.gov
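When many samples are processed, the .val files can be summarized before deciding what to submit. The sketch below assumes that each message line in a .val file starts with its severity followed by a colon (for example "ERROR:" or "WARNING:"); check this against your own files. count_val is a hypothetical helper, not part of tbl2asn or MitoFinder.

```shell
# Count validator messages of a given severity in a .val file.
# Assumes each message line starts with "<SEVERITY>:".
count_val() (
    severity=$1 file=$2
    # grep -c prints 0 but exits 1 when nothing matches; keep the 0.
    grep -c "^$severity:" "$file" || true
)

# One summary line per .val file in the current directory.
for f in *.val; do
    [ -e "$f" ] || continue   # no .val files: the glob stays literal
    printf '%s\tERROR=%s\tWARNING=%s\n' "$f" \
        "$(count_val ERROR "$f")" "$(count_val WARNING "$f")"
done
```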