Trim, circularise, orient & filter long read bacterial genome assemblies
There is already a good piece of software to trim/circularise and orient genome assemblies called Circlator. Please try that first!
You should only try Berokka if:
- You only have the contig files and do not have the corrected reads anymore
- Your contigs are simple cases with clear overhang and could be done manually with BLAST
- Circlator fails on your data even after troubleshooting
NOTE: orientation to dnaA or rep genes is not yet implemented.
Using Homebrew (Linux or macOS) will install all the dependencies for you:
brew install brewsci/bio/berokka
Using Bioconda will take care of everything:
conda install -c conda-forge -c bioconda -c defaults berokka
git clone https://github.com/tseemann/berokka.git
./berokka/bin/berokka -h
You will need to install all the dependencies manually:
- BioPerl >= 1.6 (for Bio::SeqIO and Bio::SearchIO)
- BLAST+ >= 2.3.0 (for blastn)
Input should be completed long-read assemblies in FASTA format, such as those from Canu or HGAP.
% berokka --outdir trimdir canu.contigs.fasta
<snip>
Did you know? berokka is a play on the concept of overhang vs hangover
% ls trimdir/
01.input.fa
02.trimmed.fa
03.results.tab
% cat trimdir/03.results.tab
#sequence status old_len new_len trimmed
tig00000000 trimmed 5461026 5448790 12236
tig00000002 trimmed 138825 113601 25224
tig00000003 trimmed 57075 43297 13778
tig00000004 kept 24900 24900 0
tig00000006 trimmed 1620 1320 300
tig00000007 removed 2380 0 0
Filename | Format | Description |
---|---|---|
01.input.fa | FASTA | All the input sequences |
02.trimmed.fa | FASTA | The (possibly) trimmed sequences |
03.results.tab | TSV | Summary of results |
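If you want to post-process the summary, the following is a minimal Python sketch that reads 03.results.tab and reports which contigs were removed and how many bp were trimmed in total. It assumes the five tab-separated columns and the `#`-prefixed header line shown in the example above; the helper name `read_results` is hypothetical and not part of Berokka.

```python
import io

# Sample content copied from the example 03.results.tab above (tab-separated).
SAMPLE = "\n".join([
    "#sequence\tstatus\told_len\tnew_len\ttrimmed",
    "tig00000000\ttrimmed\t5461026\t5448790\t12236",
    "tig00000002\ttrimmed\t138825\t113601\t25224",
    "tig00000003\ttrimmed\t57075\t43297\t13778",
    "tig00000004\tkept\t24900\t24900\t0",
    "tig00000006\ttrimmed\t1620\t1320\t300",
    "tig00000007\tremoved\t2380\t0\t0",
]) + "\n"


def read_results(handle):
    """Parse a results table into a list of dicts (hypothetical helper)."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the '#sequence ...' header line.
        if not line or line.startswith("#"):
            continue
        name, status, old_len, new_len, trimmed = line.split("\t")
        rows.append({
            "sequence": name,
            "status": status,
            "old_len": int(old_len),
            "new_len": int(new_len),
            "trimmed": int(trimmed),
        })
    return rows


# For a real run, replace the StringIO with open("trimdir/03.results.tab").
rows = read_results(io.StringIO(SAMPLE))
removed = [r["sequence"] for r in rows if r["status"] == "removed"]
total_trimmed = sum(r["trimmed"] for r in rows)

print("removed contigs:", removed)
print("total bp trimmed:", total_trimmed)
```

On the example table this reports tig00000007 as removed and 51538 bp trimmed in total.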
The 02.trimmed.fa output has been augmented with new header data (unless --noanno is used):
- circular=true - indicates that this is a circular sequence (Rebaler uses this)
- overhang=N - indicates that N bp were trimmed off
- len=N - the new contig length, if len was present (Canu adds this)
- suggestCircular=yes - if the no version was present (Canu adds this)
- class=replicon - if class=contig was present and we circularised
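Downstream scripts can pick these tags out of the FASTA description line. The sketch below is one way to do that in Python, assuming the tags appear as space-separated `key=value` tokens after the sequence ID. The example header is made up for illustration and the helper name `parse_header_tags` is hypothetical; check a real 02.trimmed.fa for the exact layout.

```python
def parse_header_tags(header):
    """Split a FASTA header into (sequence_id, {key: value}) (hypothetical helper)."""
    header = header.lstrip(">").strip()
    seq_id, _, description = header.partition(" ")
    tags = {}
    for token in description.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            tags[key] = value
    return seq_id, tags


# Illustrative header only; the tag order and companion tags are assumptions.
example = ">tig00000000 len=5448790 class=replicon suggestCircular=yes circular=true overhang=12236"
seq_id, tags = parse_header_tags(example)

is_circular = tags.get("circular") == "true"
overhang_bp = int(tags.get("overhang", 0))
print(seq_id, is_circular, overhang_bp)
```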
- --filter <FASTA> allows you to remove contigs which match 50% of sequences in this file. Berokka comes with the standard PacBio control sequence. You can provide your own FASTA file using this option. If you want to disable filtering, use --filter 0.
- --readlen LENGTH can be used for datasets that won't seem to circularise. It affects the length of the match it attempts to make using BLAST.
- --noanno will ensure that the FASTA descriptions are not altered between the input and output FASTA files.
- --keepfiles and --debug are primarily for use by the developer.
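To make the idea of the start/end match concrete, here is a toy Python sketch of overhang trimming. It is not how Berokka works internally: Berokka uses BLAST, which tolerates the mismatches and indels found in real long-read assemblies, whereas this sketch only finds exact matches. The function name `trim_overhang` and the `min_len` parameter are hypothetical and do not correspond to any Berokka option.

```python
def trim_overhang(contig, min_len=3):
    """Toy illustration: trim the longest exact end-of-contig repeat of the start.

    Returns (trimmed_sequence, overhang_length). An overhang of 0 means no
    exact prefix/suffix match of at least min_len bases was found.
    """
    # Only consider overhangs up to half the contig, so prefix and suffix
    # never overlap each other.
    for k in range(len(contig) // 2, min_len - 1, -1):
        if contig[:k] == contig[-k:]:
            return contig[:-k], k
    return contig, 0


# A 14 bp "circular genome" whose assembly ran 4 bp past the start again.
genome = "ACGTTGCAAGGCTA"
assembled = genome + genome[:4]

trimmed, overhang = trim_overhang(assembled)
print(trimmed, overhang)
```

In this example the 4 bp repeat at the end is removed and the original 14 bp sequence is recovered.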
Berocca is a brand of effervescent drink and vitamin tablets containing vitamin B and C.
It is a popular cure for a hangover. A key role of the berokka tool is to remove the "overhang" that occurs at the ends of long-read assemblies of circular genomes.
Please file questions, bugs or ideas on the Issue Tracker.
Not published yet.
- Torsten Seemann