/tiny

Whole genome alignments using LastZ software in vertebrate species. The code (1) drives embarrassingly parallel execution of commands, (2) merges output into sensible formats and (3) provides ways to draw figures.

Primary language: R. License: MIT.

Micro-chromosome conservation in vertebrates

Contributors: Hardip Patel, Paul Waters, Aurora Ruíz-Herrera, Arthur Georges, Jenny A.M. Graves

Data preparation

The process was automated using the preparegenomes.sh script. Basic steps were as follows.

  1. Genome assemblies were downloaded from relevant sources (NCBI or DNAZoo). Check metadata/species.txt for details of species and download paths.
  2. .2bit, .capsule, .sizes files were created.
  3. Genome files were split into smaller regions (1Mb per sequence and ~5Mb of total sequence per file) without overlap. This step allows for embarrassingly parallel LastZ alignments that capitalise on large HPC facilities.
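The splitting step can be sketched roughly as follows. This is a minimal illustration, not the logic of preparegenomes.sh: the function names are hypothetical, and it assumes that windows from different sequences may share an output file and that the .sizes file uses the usual two-column "name, length" layout.

```python
from typing import Dict, Iterator, List, Tuple

Chunk = Tuple[str, int, int]  # (sequence name, 0-based start, exclusive end)


def parse_sizes(text: str) -> Dict[str, int]:
    """Parse the two-column 'name length' layout of a .sizes file."""
    sizes = {}
    for line in text.splitlines():
        if line.strip():
            name, length = line.split()[:2]
            sizes[name] = int(length)
    return sizes


def split_sequence(name: str, length: int, chunk_size: int) -> Iterator[Chunk]:
    """Yield non-overlapping windows that cover one sequence end to end."""
    for start in range(0, length, chunk_size):
        yield (name, start, min(start + chunk_size, length))


def plan_chunks(
    sizes: Dict[str, int],
    chunk_size: int = 1_000_000,  # 1Mb per sub-sequence
    max_bases: int = 5_000_000,   # ~5Mb of total sequence per file
) -> List[List[Chunk]]:
    """Group windows into batches; each batch stands for one output file."""
    batches: List[List[Chunk]] = []
    current: List[Chunk] = []
    total = 0
    for name, length in sizes.items():
        for chunk in split_sequence(name, length, chunk_size):
            size = chunk[2] - chunk[1]
            # Start a new file once adding this window would exceed the cap.
            if current and total + size > max_bases:
                batches.append(current)
                current, total = [], 0
            current.append(chunk)
            total += size
    if current:
        batches.append(current)
    return batches
```

Each batch can then be written to its own file and aligned as an independent job, which is what makes the alignment stage embarrassingly parallel.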

One-way all-vs-all whole genome alignments using LastZ

We used the following parameters for LastZ alignments:
K=2400 L=3000 Y=9400 H=2000 --ambiguous=iupac
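As a minimal sketch, the parameters above could be assembled into one LastZ command per pair of split files. The helper name lastz_command, the file names, and the choice of --format=axt output are assumptions made for illustration; the repository's scripts may invoke LastZ differently.

```python
import shlex
from typing import List

# Scoring and threshold parameters quoted in this README.
LASTZ_PARAMS = ["K=2400", "L=3000", "Y=9400", "H=2000", "--ambiguous=iupac"]


def lastz_command(target: str, query: str, output: str, lastz: str = "lastz") -> List[str]:
    """Build the argument list for one target-vs-query alignment job."""
    return [
        lastz,
        target,
        query,
        *LASTZ_PARAMS,
        "--format=axt",        # assumed output format
        f"--output={output}",  # write alignments to a file instead of stdout
    ]


def as_shell_line(command: List[str]) -> str:
    """Render a command as a single shell-safe line, e.g. for a job list."""
    return shlex.join(command)
```

Writing one such line per file pair into a job list lets a scheduler run the alignments independently.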

List of species considered for analyses

Species | Species Code | Common Name | NCBI Taxonomy ID | Citation
Gallus gallus | CHICK | Chicken | 9031 | Hillier, L.W. et al, 2004
Trachemys scripta elegans | TRASE | Red-eared slider turtle | 31138 | Simison, W.B. et al, 2020
Corvus monedula | CORMO | Jackdaw | 30423 | Weissensteiner, M.H. et al, 2020
Cygnus olor | CYGOL | Mute swan | 8869 | VGP & B10K: Embargoed
Naja naja | NAJNA | Indian cobra | 35670 | Suryamohan, K. et al, 2020
Ornithorhynchus anatinus | ORNAN | Platypus | 9258 | Zhou, Y., et al, 2021
Gopherus evgoodei | GOPEV | Goode's thornscrub tortoise | 1825980 | VGP: Embargoed
Dermochelys coriacea | DERCO | Leatherback sea turtle | 27794 | VGP: Embargoed
Chelonia mydas | CHEMY | Green sea turtle | 8469 | VGP: Embargoed. Also at DNAzoo from male blood. Draft genome: Zhuo, W. et al, 2013
Lacerta agilis | LACAG | Sand lizard | 80427 | VGP: Embargoed
Thamnophis elegans | THAEL | Western terrestrial garter snake | 35005 | VGP: Embargoed
Podarcis muralis | PODMU | Common wall lizard | 64176 | Andrade, P. et al, 2019
Zootoca vivipara | ZOOVI | Common lizard | 8524 | Yurchenko, A.A. et al, 2020
Crotalus viridis | CROVV | Prairie rattlesnake | 8742 | Pasquesi, G.I.M. et al, 2018
Calypte anna | CALAN | Anna's hummingbird | 9244 | Rhie, A. et al, 2021
Falco rusticolus | FALRU | Gyrfalcon | 120794 | VGP & B10K: Embargoed
Aquila chrysaetos chrysaetos | AQUCH | Golden eagle | 223781 | Sanger 25G & VGP: Embargoed. Also at DNAzoo from female blood. Draft genome: Bussche, R.A. et al, 2017
Dromaius novaehollandiae | DRONO | Emu | 8790 | DNAzoo from male blood. Draft genome: Sackton, T.B. et al, 2019
Salvator merianae | SAMER | Argentine black and white tegu | 96440 | DNAzoo from male blood. Draft genome: Roscito, J.G. et al, 2018
Python bivittatus | PYTBI | Burmese python | 176946 | DNAzoo from female blood. Draft genome: Castoe, T.A. et al, 2013
Gymnogyps californianus | GYMCA | California condor | 33616 | DNAzoo from blood. Draft genome: Unpublished
Alligator mississippiensis | ALLMI | American alligator | 8496 | DNAzoo from male blood. Draft genome: St John, J.A., et al, 2017
Alligator sinensis | ALLSI | Chinese alligator | 38654 | DNAzoo from female blood. Draft genome: Wan, Q.H. et al, 2013
Homo sapiens | HUMAN | Human | 9606 | Lander, E. et al, 2001
Sarcophilus harrisii | SARHA | Tasmanian devil | 9305 | Wellcome Sanger Institute
Branchiostoma floridae | BRAFL | Florida lancelet | 7739 | Simakov, O. et al, 2020
Phascolarctos cinereus | PHACI | Koala | 38626 | DNAzoo from male heart. Draft genome: Johnson, R.N. et al, 2018
Casuarius casuarius | CASCA | Southern cassowary | 8787 | DNAzoo from male blood. Draft genome: Sackton, T.B. et al, 2019
Tympanuchus cupido | TYMCU | Greater prairie chicken | 9004 | DNAzoo from male blood. Draft genome: Unpublished
Eopsaltria australis | EOPAU | Eastern yellow robin | 44318 | DNAzoo from female liver. Draft genome: Gan, H.M. et al, 2019
Lichenostomus melanops cassidix | LIMCA | Helmeted honeyeater | 1497555 | DNAzoo. Draft genome: Unpublished
Patagioenas fasciata | PATFA | Band-tailed pigeon | 372321 | DNAzoo from 'frozen' male sample. Draft genome: Murray, G.G.R. et al, 2017
Phalacrocorax auritus | PHAAI | Double-crested cormorant | 56069 | DNAzoo from blood. Draft genome: Burga, A. et al, 2017
Rhea americana | RHEAM | Greater rhea | 8797 | DNAzoo from female blood. Draft genome: Sackton, T.B. et al, 2019
Strix occidentalis | STROC | Spotted owl | 201991 | DNAzoo sample not described. Draft genome: Hanna, Z.R. et al, 2017
Struthio camelus | STRCA | Common ostrich | 8801 | DNAzoo from female blood. Draft genome: Zhang, G. et al, 2014
Intellagama lesueurii lesueurii | INTLE | Eastern water dragon | 103694 | DNAzoo from liver. Draft genome: Unpublished

Typical workflow

This workflow is automated using the getMAF.sh script.

  1. Fix LastZ alignment coordinates (fixlastz.pl). Because alignments were performed on 1Mb sub-sequences, reported positions must be converted back to genomic coordinates.
  2. Perform chaining of LastZ alignments (axtChain).
  3. Sort chains (chainSort).
  4. Generate prenet files from chain output (chainPreNet).
  5. Perform netting of chains (chainNet).
  6. Report nets of alignments against target genome (netSyntenic).
  7. Report nets of alignments against query genome (netSyntenic).
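The coordinate fix in step 1 amounts to adding an offset to each alignment interval. The sketch below is not a port of fixlastz.pl. It assumes a hypothetical chunk naming scheme of the form "name:start-end" (0-based start, exclusive end) and axt-style 1-based inclusive coordinates, where minus-strand positions are counted on the reverse complement of the sequence.

```python
from typing import Tuple


def parse_chunk_name(name: str) -> Tuple[str, int, int]:
    """Split a hypothetical 'chr1:1000000-2000000' chunk name into its parts."""
    seq, _, span = name.rpartition(":")
    start, end = span.split("-")
    return seq, int(start), int(end)


def lift_interval(
    start: int,
    end: int,
    chunk_start: int,
    chunk_end: int,
    seq_length: int,
    strand: str = "+",
) -> Tuple[int, int]:
    """Convert chunk-relative coordinates to whole-sequence coordinates."""
    if strand == "+":
        # Forward strand: shift by the chunk's start position.
        offset = chunk_start
    else:
        # Reverse strand: coordinates count from the far end of the sequence,
        # so the shift is the distance from the chunk's end to the sequence end.
        offset = seq_length - chunk_end
    return start + offset, end + offset
```

For example, an alignment at positions 1 to 100 of the chunk chr1:1000000-2000000 maps to positions 1000001 to 1000100 of chr1 on the forward strand.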

Further steps that can be performed but have not yet been performed for this project are listed below. The commands are included in the getMAF.sh script but are commented out for now.

  1. Convert nets to .axt format (netToAxt).
  2. Sort .axt alignment files (axtSort).
  3. Generate .maf format output for alignments (axtToMaf).
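The chaining, netting, and optional conversion steps map onto the UCSC command-line tools named in this README. The sketch below only builds the command lines and does not run them. The intermediate file names, the helper name chain_net_commands, and the -linearGap=loose setting are assumptions; getMAF.sh may use different names and options.

```python
from typing import List


def chain_net_commands(
    prefix: str,
    target_2bit: str,
    query_2bit: str,
    target_sizes: str,
    query_sizes: str,
    include_maf: bool = False,
) -> List[List[str]]:
    """Return the chain/net commands for one target-query pair, in order."""
    p = prefix
    commands = [
        # -linearGap is required by axtChain; 'loose' is an assumed choice.
        ["axtChain", "-linearGap=loose", f"{p}.axt", target_2bit, query_2bit, f"{p}.chain"],
        ["chainSort", f"{p}.chain", f"{p}.sorted.chain"],
        ["chainPreNet", f"{p}.sorted.chain", target_sizes, query_sizes, f"{p}.prenet.chain"],
        ["chainNet", f"{p}.prenet.chain", target_sizes, query_sizes,
         f"{p}.target.net", f"{p}.query.net"],
        ["netSyntenic", f"{p}.target.net", f"{p}.target.syn.net"],
        ["netSyntenic", f"{p}.query.net", f"{p}.query.syn.net"],
    ]
    if include_maf:
        # Optional conversion steps, currently commented out in getMAF.sh.
        commands += [
            ["netToAxt", f"{p}.target.syn.net", f"{p}.prenet.chain",
             target_2bit, query_2bit, f"{p}.net.axt"],
            ["axtSort", f"{p}.net.axt", f"{p}.net.sorted.axt"],
            ["axtToMaf", f"{p}.net.sorted.axt", target_sizes, query_sizes, f"{p}.maf"],
        ]
    return commands
```

Each inner list can be passed to subprocess.run once the tools are on the PATH. Keeping the conversion steps behind a flag mirrors how they are kept but disabled in the script.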

Credits

  1. Workflow ideas for LastZ alignments were borrowed from Daren Card (thanks mate), available here.

  2. Lastz alignment parameters were obtained from several sources:

    • The bird genome alignment paper GitHub repo here.
    K=2400 L=3000 Y=9400 H=2000
    
    K = 2400, L = 3000, Y = 9400, H = 2000 for placental mammals
    K = 2400, L = 3000, Y = 3400, H = 2000 for non-placental mammals
    K = 1500, L = 2500 and W = 5  to find co-linear alignments in the un-aligning regions that are flanked by local alignments (gaps in the chains)
    
    • Ensembl Compara LastZ pairwise alignment settings from the GitHub repo here
    default => 'T=1 L=3000 H=2200 O=400 E=30 --ambiguous=iupac', # ensembl genomes settings
    # Vertebrata
    7742    => 'T=1 K=3000 L=3000 H=2200 O=400 E=30 --ambiguous=iupac',
    # Catarrhini, Sus, Carnivora, Triticeae
    9526    => 'T=1 K=5000 L=5000 H=3000 M=10 O=400 E=30 Q='
    9822    => 'T=1 K=5000 L=5000 H=3000 M=10 O=400 E=30 --ambiguous=iupac',
    33554   => 'T=1 K=5000 L=5000 H=3000 M=10 O=400 E=30 --ambiguous=iupac',
    147389  => 'T=1 K=5000 L=5000 H=3000 M=10 O=400 E=30 --ambiguous=iupac --identity=75..100',
    # Vigna, Solanaceae
    3913    => 'T=1 L=3000 H=2200 O=400 E=30 --ambiguous=iupac --matchcount=1000',
    4070    => 'T=1 L=3000 H=2200 O=400 E=30 --ambiguous=iupac --matchcount=1000',
    #4107    => 'K=5000 L=5000 H=3000 O=400 E=30 --ambiguous=iupac M=10 --notransition --step=20',
    #4107    => 'K=5000 L=5000 H=3000 M=10 O=400 E=30 --ambiguous=iupac --notransition --step=20',
    
    • Default parameters of LastZ
    # hsp_threshold (K)      = 3000
    # gapped_threshold (L)   = 3000
    # x_drop (X)             = 910
    # y_drop (Y)             = 9400
    # gap_open_penalty (O)   = 400
    # gap_extend_penalty (E) = 30
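To compare the parameter sets above at a glance, a small helper can report which values differ from the listed LastZ defaults. This is an illustrative sketch: the function names are hypothetical, only numeric KEY=value tokens are considered, and a default of None means the parameter is not in the default list above.

```python
from typing import Dict, Optional, Tuple

# Default values as listed in this README.
LASTZ_DEFAULTS = {"K": 3000, "L": 3000, "X": 910, "Y": 9400, "O": 400, "E": 30}


def parse_params(text: str) -> Dict[str, int]:
    """Extract numeric KEY=value tokens, skipping '--long' options."""
    params = {}
    for token in text.split():
        if token.startswith("--"):
            continue
        key, _, value = token.partition("=")
        if value.isdigit():
            params[key] = int(value)
    return params


def diff_from_defaults(text: str) -> Dict[str, Tuple[Optional[int], int]]:
    """Map each changed parameter to a (listed default, chosen value) pair."""
    return {
        key: (LASTZ_DEFAULTS.get(key), value)
        for key, value in parse_params(text).items()
        if LASTZ_DEFAULTS.get(key) != value
    }
```

Applied to the settings used in this project, the helper reports K and H as the values that depart from the listed defaults.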