/soybean_sv_paper

Structural variation analysis in soybean using Oxford Nanopore and Illumina sequencing data

Primary language: R. License: GNU General Public License v3.0 (GPL-3.0).

Code for the analysis of structural variation in soybean

Overview

This repository contains all the code needed to reproduce the analyses presented in the paper titled "Combined use of Oxford Nanopore and Illumina sequencing yields insights into soybean structural variation biology".

As a disclaimer, readers should be aware that most of the code was reorganized and integrated into the Makefile only after analyses were performed. Therefore, those trying to run the analyses might run into issues related to paths or software version incompatibilities. We encourage users who encounter problems while trying to run this code to open a GitHub issue or contact the repo maintainer directly. We believe that the code in this repository and the associated Makefile should still be useful to help those interested in understanding the analyses that were performed.

Software dependencies

These analyses were initially run on a Linux high-performance computer running openSUSE Leap 15.3. The code should nevertheless run on any Linux machine with the bash shell, provided that the dependencies below are installed.

The following software should be installed to reproduce the analyses. Some of these programs may themselves have additional dependencies. Versions used for this work are indicated in parentheses. The path to each of the executables should be modified in the Makefile for the code to run properly.

The breakpoint refinement pipeline should be installed under scripts/breakpoint_refinement.

The scripts and Makefile also expect to find Circos configuration files (the files distributed under etc in the software package) under external/circos_config_files.

Some programs needed for reproducing analyses were modified from existing software:

  • We forked the R package sveval and slightly modified it to add support for benchmarking duplications and for extracting more exhaustive output. This version can be installed from our fork by using the commit 65f2781cad9c1e0979c93efac41f8157a436703f on branch soybean-nanopore-svs.

  • The script scripts/addMissingPaddingGmax4.py was adapted from addMissingPaddingHg38.py to use the soybean reference genome instead of the human reference genome. The original MIT copyright notice is included in our modified file.

Data availability

Sequencing data

  • The Illumina data used in this project is available from the SRA using the BioProject accession number PRJNA356132. This data should be placed under illumina_data/raw_fastq/ to reproduce the analyses.

  • The Oxford Nanopore data generated by this project is available from the SRA using the BioProject accession number PRJNA751911. This data should be placed under nanopore_data/ to reproduce the analyses.

Reference data

The following datasets are available from the Web and should be added to the repository to reproduce the analyses:

  • The SoyTEdb fasta file (SoyBase_TE_Fasta.txt) can be downloaded from SoyBase and should be placed under te_analysis/te_database/ to reproduce the analyses.
  • The non-reference transposable elements found by Tian et al. (2012) can be downloaded from the supplementary data to their paper. The data can be converted to a text file and saved under te_analysis/tian2012_tes.txt.
  • The reference genome sequence and annotation of soybean cultivar Williams82, assembly version 4, can be downloaded from Phytozome. The files needed (Gmax_508_v4.0.fa, Gmax_508_Wm82.a4.v1.gene_exons.gff3, Gmax_508_Wm82.a4.v1.gene.gff3, Gmax_508_Wm82.a4.v1.repeatmasked_assembly_v4.0.gff3) should be placed under refgenome/.
  • Gene Ontology annotations for Williams82 assembly version 4 can be downloaded from SoyBase. We saved this file under the name gene_analysis/soybase_genome_annotation_v4.0_04-20-2021.txt because we accessed it on April 20, 2021.
  • The soybean chloroplast and mitochondrion genome sequences can be downloaded from SoyBase. These should be concatenated and saved as refgenome/bt_decoy_sequences.fasta. They should also be appended to Gmax_508_v4.0.fa and the result saved as refgenome/Gmax_508_v4.0_mit_chlp.fasta.
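
The two concatenation steps described in the last item could be sketched as follows. The function name build_organelle_fastas is hypothetical, and the order in which the organelle sequences are written is an assumption. Only the output paths come from the description above.

```shell
# Sketch only: build_organelle_fastas is a hypothetical helper, not part of this repository.
# Usage: build_organelle_fastas <chloroplast fasta> <mitochondrion fasta> <refgenome dir>
build_organelle_fastas() {
    chloroplast="$1"
    mitochondrion="$2"
    refdir="$3"

    # Decoy file: chloroplast followed by mitochondrion (this order is an assumption).
    # Plain cat assumes that each input ends with a newline; otherwise a FASTA
    # header would be glued to the end of the previous sequence line.
    cat "$chloroplast" "$mitochondrion" > "$refdir/bt_decoy_sequences.fasta" || return 1

    # Reference genome followed by the same organelle sequences.
    cat "$refdir/Gmax_508_v4.0.fa" "$refdir/bt_decoy_sequences.fasta" \
        > "$refdir/Gmax_508_v4.0_mit_chlp.fasta"
}
```

A call could look like build_organelle_fastas chloroplast.fasta mitochondrion.fasta refgenome, where the first two names stand for whatever the files downloaded from SoyBase are called.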

Data generated by the analysis

Several of the VCF files generated by the analysis as well as the result from the permutation test on the overlap between SVs and genic features are available on Figshare.

Querying the Makefile

We used GNU Make to describe the dependencies among our scripts and data through a Makefile. In theory, the Makefile should allow running all the analyses that were done in the paper in the proper order, given that all the sequencing and reference data are available. In practice, our Makefile is intended as a tool to query this repository to understand what scripts should be run and in what order to obtain a particular result. Here, we give a short introduction for people who are not yet familiar with Make so they can query our Makefile effectively.

GNU Make is installed by default on many Linux distributions. If it is not available on your system, please visit the GNU Make website for download and installation instructions.

Prior to querying the Makefile using GNU Make, you should have created files and directories pointing to all the required data and scripts as described above. One does not need to actually download the data, but can simply touch the files needed. This can be achieved automatically by running the convenience script touch_files.sh.
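
To illustrate what creating placeholders involves, here is a minimal sketch that sets up empty files for the reference data listed in the data availability section. It is not the content of touch_files.sh, which may well cover more files than this. The REPO_ROOT variable is a hypothetical convenience that defaults to the current directory.

```shell
# Sketch only: this is NOT the content of touch_files.sh.
# REPO_ROOT is a hypothetical variable; it defaults to the current directory.
repo_root="${REPO_ROOT:-.}"

# Directories named in the data availability section.
mkdir -p "$repo_root/illumina_data/raw_fastq" \
         "$repo_root/nanopore_data" \
         "$repo_root/refgenome" \
         "$repo_root/te_analysis/te_database" \
         "$repo_root/gene_analysis"

# Empty placeholders for the reference files; Make only looks at whether
# a file exists and when it was last modified, not at its content.
touch "$repo_root/refgenome/Gmax_508_v4.0.fa" \
      "$repo_root/refgenome/Gmax_508_Wm82.a4.v1.gene_exons.gff3" \
      "$repo_root/refgenome/Gmax_508_Wm82.a4.v1.gene.gff3" \
      "$repo_root/refgenome/Gmax_508_Wm82.a4.v1.repeatmasked_assembly_v4.0.gff3" \
      "$repo_root/refgenome/bt_decoy_sequences.fasta" \
      "$repo_root/refgenome/Gmax_508_v4.0_mit_chlp.fasta" \
      "$repo_root/te_analysis/te_database/SoyBase_TE_Fasta.txt" \
      "$repo_root/te_analysis/tian2012_tes.txt" \
      "$repo_root/gene_analysis/soybase_genome_annotation_v4.0_04-20-2021.txt"
```

Placeholders are enough for dry runs with make -n, but any recipe that is actually executed would of course need the real data.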

Make describes dependencies using a set of targets which depend on a list of prerequisites, and includes for each target a so-called recipe of shell commands used to create the target from the prerequisites. To get a list of the available targets in the Makefile, simply type the following command while in the top-level directory of this repository:

make list

Each of these targets can be given as an argument to the make command to launch the commands required to create the target. For example, the following command would run all the analyses used to make the paper:

make all

We do not recommend running this command given the high computing requirements of these analyses. However, the list of all commands that would be run if the command were to be launched can be obtained with the -n option:

make -n all

As an example, to get all the commands needed to create Figure 1 from scratch, the following command can be used:

make -n figures/figure_1.png

Make automatically determines which commands need to be run based on the last time when each target was updated. If you want to trick Make into thinking that all targets were properly created, you can use the -t option to touch each target and update its timestamp instead of running the commands:

make -t all

If this command runs properly, then running make all should print make: Nothing to be done for 'all'. If you want the list of commands to be printed even though the target is up to date, then you can add the -B option:

make -Bn all
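
The behaviour of the -n, -t and -B options can be tried out safely on a toy Makefile before querying the real one. The sketch below is self-contained and runs in a scratch directory. The Makefile, its single recipe and all file names are hypothetical and unrelated to this repository.

```shell
# Toy example in a scratch directory; all file and target names are hypothetical.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Recipe lines in a Makefile must start with a tab, hence the \t below.
printf '.PHONY: all\nall: result.txt\n\nresult.txt: input.txt\n\tsort input.txt > result.txt\n' > Makefile
printf 'b\na\n' > input.txt

make -n all     # dry run: prints the sort command, creates nothing
make -t all     # touches result.txt (left empty) instead of running sort
make all        # result.txt now looks up to date, so nothing is run
make -Bn all    # -B treats every target as out of date: the sort command is printed again
```

Note that after make -t the file result.txt exists but is empty, which is why this option is only useful for querying the workflow and not for producing results.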

You can simulate what commands would actually be run from a given point in the analysis by modifying the timestamp of a code or data file of interest and running make -n again. For example, the code below would list all the commands needed to update Figure 3 after the code to call SNVs using Platypus has been modified.

touch structure_analysis/call_snps.sh
make -n figures/figure_3.png
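
The way a modified timestamp propagates through the dependency chain can also be seen on a toy two-step workflow. In the sketch below, the Makefile and the files call_snps.sh, plot.sh, calls.vcf and figure.png are hypothetical stand-ins created in a scratch directory, not files from this repository.

```shell
# Toy example in a scratch directory; all file and target names are hypothetical.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Two-step workflow: call_snps.sh -> calls.vcf -> figure.png (plot.sh draws the figure).
printf 'figure.png: calls.vcf plot.sh\n\tsh plot.sh calls.vcf > figure.png\n\ncalls.vcf: call_snps.sh\n\tsh call_snps.sh > calls.vcf\n' > Makefile
touch call_snps.sh plot.sh

make -t figure.png            # pretend both targets have already been built
sleep 1                       # make sure the next timestamp is strictly newer

touch call_snps.sh            # simulate an edit to the upstream script
upstream_edit=$(make -n figure.png)
echo "$upstream_edit"         # both recipes are listed

make -t figure.png            # mark everything as up to date again
sleep 1
touch plot.sh                 # simulate an edit to the plotting script only
downstream_edit=$(make -n figure.png)
echo "$downstream_edit"       # only the plotting recipe is listed
```

Editing a script early in the chain therefore lists every downstream command, whereas editing a script late in the chain lists only the steps that depend on it.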

With these tools in hand, you should be able to effectively query the Makefile and understand the analysis workflow that we used.

Citation

If you use part of this code for your analyses, please cite:

Lemay, MA., Sibbesen, J.A., Torkamaneh, D. et al. Combined use of Oxford Nanopore and Illumina sequencing yields insights into soybean structural variation biology. BMC Biol 20, 53 (2022). https://doi.org/10.1186/s12915-022-01255-w