This resource provides the Perl code used in the paper below:
- "Dynamic Landscape of Human L1 Transposition Revealed with Functional Data Analysis" (2020). Molecular Biology and Evolution 37 (12), 3576-3600. D Chen, MA Cremona, Z Qi et al.
The pipeline has been tailored for the Washington University HTCF computing environment (https://htcf.wustl.edu/docs/), which uses the Slurm queueing system (https://slurm.schedmd.com/tutorials.html). No guarantees are made about other systems, setups, or configurations.
The main functions are:
- Align the LINE1 (L1) reads to the human genome to identify the integration sites.
- Cluster the L1 integration sites by a custom window size (default: 500 bp).
- Perform de novo motif discovery on the L1 integration sites to identify potential binding proteins.
- Run motif enrichment analysis for the specified transcription factor(s).
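The clustering step can be pictured with a small Python sketch. This is not the Perl code from the repository, and the exact clustering rule used by the pipeline is not described here. The sketch assumes a simple rule: sites on the same chromosome are sorted, and a site joins the current cluster when it lies within `window` bp of the previous site. The function name `cluster_sites` and the `(chrom, pos)` input format are hypothetical.

```python
from collections import defaultdict


def cluster_sites(sites, window=500):
    """Cluster (chrom, pos) integration sites.

    Assumed rule (not confirmed by the pipeline): a site joins the current
    cluster when it is within `window` bp of the previous site on the same
    chromosome. Returns a list of (chrom, start, end, n_sites) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        start = end = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - end <= window:
                # Close enough to the previous site: extend the cluster.
                end = pos
                count += 1
            else:
                # Gap larger than the window: close the cluster, start a new one.
                clusters.append((chrom, start, end, count))
                start = end = pos
                count = 1
        clusters.append((chrom, start, end, count))
    return clusters
```

With the default 500 bp window, sites at chr1:100 and chr1:450 would fall into one cluster, while a site at chr1:1200 would start a new one.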
These scripts are wrapped by a master Perl script. Download all the scripts into one folder and run the wrapper script 'L1_bar_wrapperV3.pl'. Note: the prefix of the output name is taken from the first hyphen ("-") delimited field of the input name.
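The note on output naming can be illustrated with a short Python sketch (again, not the wrapper's own Perl code). It assumes the prefix is derived from the file name without its directory, and that a name containing no hyphen is used whole; neither detail is stated for the wrapper. The function name `output_prefix` and the example file names are hypothetical.

```python
import os


def output_prefix(read_file):
    """Return the text before the first hyphen in the input file name.

    Assumptions: the directory part is ignored, and a name without any
    hyphen is returned unchanged.
    """
    file_name = os.path.basename(read_file)
    return file_name.split("-", 1)[0]
```

Under these assumptions, an input named `sampleA-rep1_R1.fq` would give outputs prefixed with `sampleA`, so it is worth naming input files with the desired prefix before the first hyphen.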
Usage: perl L1_bar_wrapperV3.pl <read1.fq> <read2.fq> <barcode> <genome_aligner> <TF_to_scan>
- For `<genome_aligner>`: options can be combined without spaces in between (e.g. 12, 123 or 1234). The genome database is hg19.
  1. bowtie2 for read2 single-end alignment
  2. bowtie2 for paired-end alignment
  3. novoalign for read2 single-end alignment
  4. novoalign for paired-end alignment
- For `<TF_to_scan>`: underscore-delimited transcription factors to scan.
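As a rough illustration of how the two option arguments can be read, here is a Python sketch (not the wrapper's Perl code). It assumes that the digits 1 to 4 of `<genome_aligner>` correspond to the four aligner options in the order listed above, and that `<TF_to_scan>` is split on underscores. The names `ALIGNER_MODES`, `parse_aligner_options` and `parse_tf_list`, and the example TF names in the usage note, are hypothetical.

```python
# Assumed mapping: digits follow the order in which the options are listed.
ALIGNER_MODES = {
    "1": ("bowtie2", "read2 single-end"),
    "2": ("bowtie2", "paired-end"),
    "3": ("novoalign", "read2 single-end"),
    "4": ("novoalign", "paired-end"),
}


def parse_aligner_options(option_string):
    """Turn a combined option string such as '123' into a list of modes."""
    modes = []
    for digit in option_string:
        if digit not in ALIGNER_MODES:
            raise ValueError(f"unknown aligner option: {digit!r}")
        mode = ALIGNER_MODES[digit]
        if mode not in modes:  # ignore repeated digits
            modes.append(mode)
    return modes


def parse_tf_list(tf_string):
    """Split an underscore-delimited TF argument, dropping empty fields."""
    return [name for name in tf_string.split("_") if name]
```

For example, `parse_aligner_options("12")` selects both bowtie2 modes, and `parse_tf_list("CTCF_YY1")` yields two transcription factor names. One consequence of the underscore delimiter is that a factor name containing an underscore cannot be passed as a single entry.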
- Plot the genome-wide distribution of features with custom R plotting functions (cyto_plotV2.R).
- A simplified cytoband.txt (cytoband_hg19_2), which omits the grey regions in the chromosome plot.
genomeplot.pdf