This repository contains scripts used for the construction and analysis of minigraph-based graphs for version 2 of the barley pan-genome.
Scripts are written in BASH and R and were developed and tested on Rocky Linux v8.8. Other Linux distributions may be compatible but no testing has taken place in this regard.
The scripts are organised as follows:
scripts/graphConstruction:
- buildGraph_incremental.sh: a Slurm array job script for the incremental construction and SV calling of pan-genome graphs with minigraph, using a single chromosome for each subtask
- combineGraphs.sh: combines all graphs built for individual chromosomes into a single graph and builds a GBWT index
- convert.sh: converts the chromosome-based graphs from GFA to VG format for downstream analysis
- convertToGFA.sh: converts the combined, genome-wide graph from VG to GFA format for compatibility with other tools
scripts/SVstats:
- extractSVCoordsByType.sh: extracts structural variant entries from the final BED files produced during graph construction/SV calling and separates them by type (simple inversions, deletions and insertions, but no nested complex variants)
- violinPlot.r: takes the output from extractSVCoordsByType.sh and plots the length statistics of each SV category in a violin plot
- SV_comparison.sh: extracts deletions and insertions from both Minigraph and Assemblytics SV files and feeds their coordinates to bedtools intersect for overlap comparison
scripts/graphStats:
- odgiheaps_byGroup.sh: runs odgi heaps command for each group (domesticated, wild and all lines) and then plots a saturation plot using script heaps_fit.R which is supplied with the odgi package
- vgStat.sh and vgStat_joint.sh: computes basic statistics for chromosome-based graphs and combined graph respectively
scripts/mappingExperiment:
- mapWGS.sh: a Slurm array job script for downloading and mapping five public barley whole genome shotgun datasets from the European Nucleotide Archive (ENA); mappings are carried out using both bwa mem and vg giraffe for comparison
- extractStats.sh: extracts and formats mapping statistics from each mapping for further analysis
- calcMappingStats.r: takes the output from the preceding script and calculates comparative mapping statistics such as % reads mapped, % properly paired reads, etc.
scripts/mappingExperiment/linearisedGraph:
- 00_gfa2fa.sh: converts separate GFA format graphs (one per chromosome) into FASTA files using gfatools gfa2fa
- 01_concat.sh: combines the per-chromosome FASTA files
- 02_indexBWA.sh: builds BWA index for mapping
- 03_mapWGS.sh: array job for downloading and mapping of the five public WGS samples
scripts/srh1_analysis:
- 01_extractNodesToFASTA.sh: extracts all individual nodes from genome-wide graph to a FASTA file
- 02_blastVsNodes.sh: BLASTs the srh1 enhancer region from cv. Barke against all 76 pan-genome accessions
- 03_extractSubgraph.sh: extracts a smaller subgraph containing the nodes spanning the deletion and flanking regions
- 04_odgiviz.sh: code for plotting the subgraph
- 05_mapToGraph.sh: mapping of the eight core800 samples to the graph
- 06_chunkGam.sh: extraction of a smaller portion of the GAM files produced in the previous step
- 07_genotype.sh: genotyping of the srh1 region in the graph for the eight core800 samples
Input data for graph construction was 76 barley genomes (approx. 5 gigabases each), split into separate chromosome input files each. Graph construction was carried out incrementally with minigraph on a per-chromosome basis and required 44-49 GB peak RAM and 10-22 days of wallclock time per subtask. Minigraph offers multithreading but this is effectively disabled when single, entire chromosome sequences are used as input (see lh3/minigraph#62) and results in a single thread of execution for graph construction.