This repository contains all the scripts needed to repeat the mapping step in the assembly of the Lotus japonicus Gifu v1.2 genome. It describes how we used paired-end reads from two distinct recombinant inbred line (RIL) populations, namely Gifu×Burttii and MG20×Gifu, to polish our assembly.
The following datasets must be available on your machine. They can be downloaded from the ENA and NCBI SRA links provided below:
- Paired-end reads from the parent lines:
  - Gifu: https://www.ebi.ac.uk/ena/data/view/SAMEA4807266 (part of the PRJEB27969 study),
  - Burttii: [TBA], and
  - MG20: https://www.ebi.ac.uk/ena/data/view/SAMEA4807284 (part of the PRJEB27969 study)
- All the paired-end reads for the RIL populations:
  - Gifu×Burttii: https://submit.ncbi.nlm.nih.gov/subs/sra/SUB4699503/overview, and
  - MG20×Gifu: https://submit.ncbi.nlm.nih.gov/subs/sra/SUB4699504/overview
The following tools must be installed and available in your environment before you run any of the bash scripts in this repository. The version listed for each tool is the minimum version that was used for the analysis and verified to work.
- bcftools 1.3
- bwa 0.7.15
- picard 2.7.1 (run in Java 1.8.0 environment)
- sambamba 0.6.5
- samtools 1.3
- vcftools 0.1.15
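Before starting, it can help to confirm that the tools are actually reachable from your shell. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper that only tests whether each name resolves on `PATH` and does not compare version numbers. It assumes picard is run through `java -jar`, so it looks for `java` rather than a `picard` command; adjust this to your installation.

```shell
# Hypothetical helper (not part of this repository).
# Prints the name of every given command that cannot be found on PATH,
# one per line, and returns 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the tools listed above. "java" stands in for picard,
# assuming picard is started as "java -jar picard.jar" on your system:
# check_tools bcftools bwa java sambamba samtools vcftools
```

An empty output with exit status 0 means every name was found; it says nothing about whether the installed versions meet the minimums above.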
The two final steps of the RIL mapping pipeline, `07_vcf-concat.sh` and `08_variant-smoothing.sh`, run Python scripts from bash. Their dependencies are:
If you want to perform data visualization as the final step, you will also need R and the following packages installed:
- ggplot2
- gplots
- reshape2
- ape
- RColorBrewer
R packages can be installed by running `install.packages("PACKAGE_NAME")` in an R session.
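If you prefer to install the packages from the command line, the call can be assembled in bash and handed to `Rscript`. This is a sketch under assumptions: `r_install_expr` is a hypothetical helper, and the CRAN mirror URL is one possible choice (a mirror has to be named explicitly because a non-interactive R session cannot prompt for one).

```shell
# Hypothetical helper (not part of this repository).
# Builds an R expression that installs the given packages from a CRAN mirror.
r_install_expr() {
    local quoted=""
    local pkg
    for pkg in "$@"; do
        # Append "pkg", inserting ", " between entries after the first one.
        quoted="${quoted:+$quoted, }\"$pkg\""
    done
    printf 'install.packages(c(%s), repos = "https://cloud.r-project.org")\n' "$quoted"
}

# Example (requires R on PATH):
# Rscript -e "$(r_install_expr ggplot2 gplots reshape2 ape RColorBrewer)"
```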
The pipeline consists of bash scripts, located in the `/lib` directory, that should be run sequentially, starting from the lowest index, `01`, up to `08`. The scripts have been intentionally split into several files, rather than combined into one large file, so that errors can be caught and parameters fine-tuned between steps. It is, of course, possible to automate the entire process, but that is outside the scope of this repository.
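For orientation, the run order can be expressed as a small loop. This is only a sketch, not something the repository provides: `run_steps` is a hypothetical helper, and it assumes that each script blocks until its work has finished. If a step only submits jobs to a queue and returns immediately, you still need to wait for those jobs before moving on to the next step.

```shell
# Hypothetical helper (not part of this repository).
# Runs every NN_*.sh script in the given directory in ascending order
# (glob results are sorted, so 01 comes before 02, and so on) and stops
# at the first script that exits with a non-zero status.
run_steps() {
    local dir=$1
    local script
    for script in "$dir"/[0-9][0-9]_*.sh; do
        # If the glob matched nothing, the literal pattern is left behind.
        [ -e "$script" ] || { echo "no step scripts in $dir" >&2; return 1; }
        echo "running $(basename "$script")"
        bash "$script" || { echo "step failed: $script" >&2; return 1; }
    done
}

# Example: run_steps lib
```

Running the scripts one at a time by hand remains the intended workflow, since it leaves room to inspect the output of each step before continuing.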
Each bash script contains certain variables that must be configured manually on your end. They are marked within the blocks at the top of the files:
###=====================###
### CONFIGURATION START ###
###=====================###
# (configuration here)
###=====================###
### CONFIGURATION END ###
###=====================###
The scripts contain markers that indicate commands that should be executed as part of a job submission script on the computing cluster that you are using. These operations are computationally expensive and should not be run directly in your interactive terminal session. These lines are marked with:
###======================###
### JOB SUBMISSION START ###
###======================###
jobCommand="(arbitrary command here)"
###======================###
### JOB SUBMISSION END ###
###======================###
The `jobCommand` should be executed as-is by your computing cluster's own job management and dispatch system. Please consult your system administrator or person-in-charge if you have any doubts about submitting jobs to a queue.
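How the `jobCommand` string reaches the queue depends entirely on your cluster. As one illustration, assuming a Slurm cluster (an assumption, since this repository does not prescribe a scheduler), the string could be passed unchanged to `sbatch --wrap`. `submit_job` and the `DRY_RUN` switch are hypothetical names introduced for this sketch.

```shell
# Hypothetical helper (not part of this repository).
# Hands a jobCommand string, unchanged, to the scheduler.
# Assumption: the cluster uses Slurm; replace the sbatch call with your
# own scheduler's submit command if it does not.
submit_job() {
    local job_command=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be submitted without contacting the scheduler.
        printf 'would submit: %s\n' "$job_command"
    else
        sbatch --wrap="$job_command"
    fi
}

# Example:
# DRY_RUN=1 submit_job "$jobCommand"
```

The dry-run mode is there so that the command can be inspected before anything is queued; resource options such as memory or wall time are left out because they are specific to each cluster.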
TL;DR: Leave the Python files alone. They are simply dependencies used by the bash scripts and should never be executed individually, out of context, or out of order.
The Rscript in the `/visualization` directory allows you to visualize data generated from the genetic map. Please note that all steps in the `/lib` directory must be completed first: this Rscript requires certain files to be in place before you can proceed.
The Rscript will generate charts in PDF and PNG format in the respective RIL population folders, namely the `Gifu-Burttii` and `Gifu-MG20` folders in the output directory. The Rscript needs to be run twice: remember to swap the `pop1` and `pop2` variables around between runs, so that both RIL populations are processed.