A pipeline to identify homologous or homeologous regions within or between genomes. A masked genome is aligned to itself or to another genome using blast, and the alignments are then filtered. The output is in a format suitable for generating Circos plots.
These scripts have been tested with Python 3. The scripts require the following programs and files.
Programs:
blast (output must be in tabular format 6, i.e. -outfmt 6; see example below)
Files:
One masked genome file, or two when comparing different genomes
- Generate a hard-masked genome (in genomes from NCBI, soft-masked bases are lowercase; the command below replaces them with N):
example:
sed -e '/^>/! s/[[:lower:]]/N/g' GCF_002021735.2_Okis_V2_genomic.fna > GCF_002021735.2_Okis_V2_genomic.masked.fna
- Generate blast database:
example:
makeblastdb -in GCF_002021735.2_Okis_V2_genomic.masked.fna -dbtype nucl
- Align the masked genome to its own blast database:
example:
blastn -task megablast -db GCF_002021735.2_Okis_V2_genomic.masked.fna -query GCF_002021735.2_Okis_V2_genomic.masked.fna -out GCF_002021735.2.vs.self.aln -outfmt 6 -perc_identity 80 -max_hsps 40000
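For reference, `-outfmt 6` writes one tab-separated line per alignment with twelve default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The sketch below is a minimal, illustrative reader for that format. It is not part of this pipeline, and the names `Fmt6Hit`, `parse_fmt6_line`, and `is_trivial_self_hit` are hypothetical. In a self-alignment every region also matches itself, so the example shows one way to recognise such hits; it makes no assumption about how the filter scripts treat them.

```python
from typing import NamedTuple


class Fmt6Hit(NamedTuple):
    """One alignment from BLAST tabular output (default -outfmt 6 columns)."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_fmt6_line(line: str) -> Fmt6Hit:
    """Split one tab-delimited line and convert each column to its type."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError(f"expected 12 columns, found {len(f)}")
    return Fmt6Hit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )


def is_trivial_self_hit(hit: Fmt6Hit) -> bool:
    """True when a sequence is aligned to exactly the same interval of itself."""
    return (
        hit.qseqid == hit.sseqid
        and hit.qstart == hit.sstart
        and hit.qend == hit.send
    )


# Made-up example line; sequence names and values are illustrative only.
example = "chr1\tchr2\t92.5\t1200\t80\t10\t1000\t2199\t50000\t51199\t0.0\t1500"
hit = parse_fmt6_line(example)
print(hit.qseqid, hit.sseqid, hit.pident, hit.length)
```

A quick check like this on the first few lines of the `.aln` file can confirm that the alignment was written with the expected columns before running the filter step.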
- Identify homeologous regions in genome:
python General_linear_filter_fmt6.v1.3.py -aln GCF_002021735.2.vs.self.aln -gap 100000 -min 10000 -print no 2> Gap100K.Min10K.txt
help (and further explanations): python General_linear_filter_fmt6.v1.3.py -h
note: requires Linear_Alignments_v4.py and GeneralOverlap_v1.py in the same working directory
- Output in Circos plot format and filter small homeologous regions:
python CircosOutput.v1.1.py -input Gap100K.Min10K.txt -tbl ChrNameColor.txt -tLen 100000 > Gap100K.Min10K.circos.txt 2> Gap100K.Min10K.pid.circos.txt
help (and further explanations): python CircosOutput.v1.1.py -h
The tbl is a tab-delimited file with the chromosome name in the first column, the new chromosome name in the second column (can be the same as the first column), and the color for the Circos plot after that. See the example file in the repository.
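As a rough illustration of the table layout described above, the sketch below reads a three-column, tab-delimited table into a dictionary. It assumes the color sits in the third column, which is inferred from the description rather than confirmed. The function name `read_chr_table` and the sequence names and colors in the example are hypothetical, and the example file in the repository remains the authoritative reference.

```python
import csv
import io


def read_chr_table(handle):
    """Map original chromosome name -> (new name, color) from a tab-delimited table."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines.
        if not row or not row[0].strip():
            continue
        if len(row) < 3:
            raise ValueError(f"expected 3 tab-separated columns, found {len(row)}")
        original, new_name, color = row[0], row[1], row[2]
        table[original] = (new_name, color)
    return table


# Made-up table contents for illustration only.
example_tbl = "scaffold_A\tChr01\tred\nscaffold_B\tChr02\tblue\n"
table = read_chr_table(io.StringIO(example_tbl))
print(table["scaffold_A"])
```

Reading the table this way before running CircosOutput.v1.1.py is one way to catch rows with missing columns or spaces used in place of tabs.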
Distributed under the MIT License.