ERVmap is one part curated database of human proviral ERV loci and one part a stringent algorithm to determine which ERVs are transcribed in RNA seq data.
Tokuyama M. et. al., ERVmap analysis reveals genome-wide transcription of human endogenous retroviruses. Proc Natl Acad Sci U S A. 2018 Dec 11;115(50):12565-12572. doi: 10.1073/pnas.1814589115.
ERVmap 1.1 includes:
- The introduction of ERVmap auto, to allow easy processing of multiple files
- Descriptive error catching for incorrect arguments
- Thorough documentation for easier setup and use
- Improved folder structure
- Bug fixes
Also keep an eye out for ERVmap 2, featuring updated dependencies and cleaner architecture. Coming soon.
bedtools 2.29.2
btrim 0.3.0 (http://graphics.med.yale.edu/trim/)
bwa 0.7.17
cufflinks 2.2.1
perl 5.26
python 2.7.16
R 3.6.2
samtools 1.9
tophat 2.1.1
bowtie 2.4.2
htseq 0.11.3 for Python
File::Type 0.22 for Perl
DESeq 1.38.0 for R (via BiocManager)
erv_genome.pl
interleaved.pl
run_clean_htseq.pl
clean_htseq.pl
merge_count.pl
normalize_with_file.pl
normalize_deseq.r
File | Possible source | Suggested file location |
---|---|---|
HG38 primary assembly fasta | ensembl | ref/Bowtie2_genome/genome.fa AND ref/bwa_genome/genome.fa |
HG38 comprehensive gene annotation (gtf) | ensembl | ref/genes.gtf |
ERV bed | Included | ref/ERVmap.bed |
Illumina Adapter | Included* | ref/illumina_adapter.txt |
Genome file | Included | ref/GRCh38.genome_file.txt |
Bowtie transcriptome files | Autogenerated (see below) | ref/Bowtie2_genome/known |
*Oligonucleotide sequences © 2021 Illumina, Inc. All rights reserved.
Index files must be generated manually before the first run if not already generated:
- BWA index files:
bwa index -p ref/bwa_genome/genome ref/bwa_genome/genome.fa
- Bowtie index files:
bowtie2-build ref/Bowtie2_genome/genome.fa ref/Bowtie2_genome/genome
Transcriptome files are normally auto-generated on the first run of ERVmap. However, if ERVmap is unable to generate the transcriptome files (e.g. on a server with limited write permissions for computational nodes), the transcriptome can be manually generated with the following code after generating the Bowtie index files:
tophat2 -G ref/genes.gtf --transcriptome-index=Bowtie2_genome/known Bowtie2_genome/genome
Running the below blocks of code in the terminal will yield raw counts for cellular genes and ERVmap loci as separate files.
Skip the following step for single end data
interleaved.pl --read1 my_file_R1.fastq.gz --read2 my_file_R2.fastq.gz > my_file.fastq
The above will convert my_file_R1.fastq.gz
and my_file_R2.fastq.gz
into one file, my_file.fastq
Modify the below lines of code for your particular data and folder structure
dataset_name=my_data #used for creating unique directory names
input_file=/path/to/my_file.fastq.gz #absolute path to file
ref=~/ERVmap/ref #folder containing required reference files (see below)
scripts=~/ERVmap/scripts #folder containing included perl and R scripts
$scripts/erv_genome.pl -start_stage 1 -end_stage 6 \
--fastq $input_file \
--genome $ref/bwa_genome/genome \
--genome_Bowtie2 $ref/Bowtie2_genome/genome \
--bed $ref/ERVmap.bed \
--genomefile $ref/GRCh38.genome_file.txt \
--gtf $ref/genes.gtf \
--transcriptome $ref/Bowtie2_genome/known \
--adaptor $ref/illumina_adapter.txt \
--filter $scripts/parse_bam.pl \
--cell ${dataset_name}_working
Note that if btrim, tophat2, bwa, samtools, or bedtools are not in your path variable, they will have to be specified as arguments as well. (if unsure, check with type <program>
)
output_folder=./output/$dataset_name
mkdir -p $output_folder/erv $output_folder/cellular
input_name=${input_file##*/}
input_name=${input_name%%.*}
mv ./${dataset_name}_working/herv_coverage_GRCh38_genome.txt $output_folder/erv/${input_name}.e
mv ./${dataset_name}_working/GRCh38/htseq.cnt $output_folder/cellular/${input_name}.c
Repeat the above steps for all the input samples.
These steps will yield normalized ERV read counts based on size factors obtained through DESeq analysis. Use the output files from above.
$scripts/run_clean_htseq.pl $output_folder/cellular c c2 __ $scripts
$scripts/merge_count.pl 3 6 e $output_folder/erv > $output_folder/erv/merged_erv.txt
$scripts/merge_count.pl 0 1 c2 $output_folder/cellular > $output_folder/cellular/merged_cellular.txt
$scripts/normalize_deseq.r $output_folder/cellular/merged_cellular.txt $output_folder/cellular/normalized_cellular.txt $output_folder/cellular/normalized_factors.txt
$scripts/normalize_with_file.pl $output_folder/cellular/normalized_factors.txt $output_folder/erv/merged_erv.txt > $output_folder/full_normalized_erv_counts.txt
Also included in this github is ERVmap auto. After installing dependencies and gathering reference files, ERVmap auto allows you to run the above code on all appropriately named fastq files in a specified directory.
If installation is completed with files and directories named as specified above, and with the ERVmap base directory in your home folder (~
) this will work as-is. Otherwise, you will need to edit ERVmap_auto.sh to work with your folder structure.
To use, place all fastq files in an input folder, with the following names:
- Single end data as <name>_SS.fastq.gz
- First reads of pair-end data as <name>_R1.fastq.gz
- Second reads of pair-end data as <name>_R2.fastq.gz
Then run bash ERVmap_auto.sh <input_folder>
This will output the following files of interest in the specified output folder:
- full_normalized_erv_counts.txt: a tab-separated table of ERV counts across all samples, normalized to size factors based on cellular gene count using DEseq
- merged_erv.txt: a tab-separated table of un-normalized ERV counts across all samples
- normalized_cellular.txt: a space-separated table of cellular gene counts across all samples, normalized using DEseq
- merged_cellular.txt: a space-separated table of un-normalized cellular gene counts across all samples
- normalized_factors.txt: a tab-separated table containing the factors used to normalize each of the input samples
- Maria Tokuyama
- Thomas Deckers
- Eric Liu
- Yong Kong