/minigraph-barley

Repository for scripts used in barley pan-genome graph construction with Minigraph

Primary LanguageShell

minigraph-barley

This repository contains scripts used for the construction and analysis of minigraph-based graphs for version 2 of the barley pan-genome.

Scripts are written in BASH and R and were developed and tested on Rocky Linux v8.8. Other Linux distributions may be compatible but no testing has taken place in this regard.

The scripts are organised as follows:

scripts/graphConstruction:

  • buildGraph_incremental.sh: a Slurm array job script for the incremental construction and SV calling of pan-genome graphs with minigraph, using a single chromosome for each subtask
  • combineGraphs.sh: combines all graphs built for individual chromosomes into a single graph and builds a GBWT index
  • convert.sh: converts the chromosome-based graphs from GFA to VG format for downstream analysis
  • convertToGFA.sh: converts the combined, genome-wide graph from VG to GFA format for compatibility with other tools

scripts/SVstats:

  • extractSVCoordsByType.sh: extracts structural variant entries from the final BED files produced during graph construction/SV calling and separates them by type (simple inversions, deletions and insertions, but no nested complex variants)
  • violinPlot.r: takes the output from extractSVCoordsByType.sh and plots the length statistics of each SV category in a violin plot
  • SV_comparison.sh: extracts deletions and insertions from both Minigraph and Assemblytics SV files and feeds their coordinates to bedtools intersect for overlap comparison

scripts/graphStats:

  • odgiheaps_byGroup.sh: runs odgi heaps command for each group (domesticated, wild and all lines) and then plots a saturation plot using script heaps_fit.R which is supplied with the odgi package
  • vgStat.sh and vgStat_joint.sh: computes basic statistics for chromosome-based graphs and combined graph respectively

scripts/mappingExperiment:

  • mapWGS.sh: a Slurm array job script for downloading and mapping five public barley whole genome shotgun datasets from the European Nucleotide Archive (ENA); mappings are carried out using both bwa mem and vg giraffe for comparison
  • extractStats.sh: extracts and formats mapping statistics from each mapping for further analysis
  • calcMappingStats.r: takes the output from the preceding script and calculates comparative mapping statistics such as % reads mapped, % properly paired reads, etc.

scripts/mappingExperiment/linearisedGraph:

  • 00_gfa2fa.sh: converts separate GFA format graphs (one per chromosome) into FASTA files using gfatools gfa2fa
  • 01_concat.sh: combines the per-chromosome FASTA files
  • 02_indexBWA.sh: builds BWA index for mapping
  • 03_mapWGS.sh: array job for downloading and mapping of the five public WGS samples

scripts/srh1_analysis:

  • 01_extractNodesToFASTA.sh: extracts all individual nodes from genome-wide graph to a FASTA file
  • 02_blastVsNodes.sh: BLASTs the srh1 enhancer region from cv. Barke against all 76 pan-genome accessions
  • 03_extractSubgraph.sh: extracts a smaller subgraph containing the nodes spanning the deletion and flanking regions
  • 04_odgiviz.sh: code for plotting the subgraph
  • 05_mapToGraph.sh: mapping of the eight core800 samples to the graph
  • 06_chunkGam.sh: extraction of a smaller portion of the GAM files produced in the previous step
  • 07_genotype.sh: genotyping of the srh1 region in the graph for the eight core800 samples

Input data for graph construction was 76 barley genomes (approx. 5 gigabases each), split into separate chromosome input files each. Graph construction was carried out incrementally with minigraph on a per-chromosome basis and required 44-49 GB peak RAM and 10-22 days of wallclock time per subtask. Minigraph offers multithreading but this is effectively disabled when single, entire chromosome sequences are used as input (see lh3/minigraph#62) and results in a single thread of execution for graph construction.