The scripts in this repository generate the outputs of the Solanum appendiculatum genome project.
- Meng Wu
- Rafael Guerrero
- De novo Assemble Genome
- Estimate Genome Size
- Examine Genome Quality
- Genome Annotation
- Gene Family Analysis
- Transcriptome Analysis
- Sex Determination Analysis
sh assembly_scripts/pool_assembly.sh
Notes: You need to change the PATH in the corresponding config file
jellyfish count -m 25 -t 8 -C -s 4G -o male_reads.counts \
<(zcat <male_reads1.fq>) <(zcat <male_reads2.fq>)
jellyfish count -m 25 -t 8 -C -s 4G -o fema_reads.counts \
<(zcat <female_reads1.fq>) <(zcat <female_reads2.fq>)
jellyfish histo -o male_reads.histo male_reads.counts
jellyfish histo -o fema_reads.histo fema_reads.counts
Notes: The two histo files can then be uploaded to the GenomeScope website (http://qb.cshl.edu/genomescope/) to estimate the genome size separately for female and male plants.
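As a quick sanity check before uploading, a rough genome size can be read off a `jellyfish histo` file by dividing the total number of k-mers by the depth of the main peak. The sketch below is only that back-of-the-envelope calculation, not the model GenomeScope fits. The function name and the `min_depth` error cutoff are assumptions made for illustration, and in a highly heterozygous genome the tallest peak may be the heterozygous one, which would inflate the estimate.

```python
def estimate_genome_size(histo_lines, min_depth=5):
    """Rough estimate: (total k-mers above an error cutoff) / (peak depth).

    histo_lines: lines of a `jellyfish histo` file, "depth count" per line.
    min_depth:   hypothetical cutoff; k-mers seen fewer times are treated
                 as sequencing errors and ignored.
    Returns (estimated_size_in_bp, peak_depth).
    """
    hist = []
    for line in histo_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        hist.append((int(parts[0]), int(parts[1])))

    solid = [(depth, n) for depth, n in hist if depth >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")

    # Depth with the largest number of distinct k-mers = main coverage peak.
    peak_depth = max(solid, key=lambda dn: dn[1])[0]
    # Each distinct k-mer seen `depth` times contributes `depth` k-mers.
    total_kmers = sum(depth * n for depth, n in solid)
    return total_kmers // peak_depth, peak_depth


if __name__ == "__main__":
    # Toy histogram for illustration only, not real data.
    toy = ["1 1000", "2 300", "5 10", "9 40", "10 100", "11 40", "20 5"]
    size, peak = estimate_genome_size(toy)
    print(f"peak depth: {peak}, rough size: {size} bp")
```

For the real files, pass `open("male_reads.histo")` or `open("fema_reads.histo")` as `histo_lines`.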
sh qc_scripts/contaminants_dbbuild.sh -d <working dir> -n 16
sh qc_scripts/contaminants_remove.sh -d <working dir> -g <genome file> -r solanum \
-R human,bacteria,viral -n 16 -i 80 -c 50
Notes: After building the contaminant database, you need to manually edit "DeconSeqConfig.pm" (see the example file in 'qc_scripts') in the deconseq2 package directory before running 'contaminants_remove.sh'.
sh qc_scripts/basecall_qc.sh -d <working dir> -g <genome file> \
-f <female_reads1.fq> -r <female_reads2.fq>
sh qc_scripts/basecall_qc.sh -d <working dir> -g <genome file> \
-f <male_reads1.fq> -r <male_reads2.fq>
<path_to_BUSCO>/run_BUSCO.py -c 4 -i <genome file> -m genome -f \
-l <path_to_BUSCO>/embryophyta_odb9 -o busco_check
sh annotation_scripts/repeats_annotation.sh -d <working dir> -g <genome file> \
-T <Tpases020812DNA> -P <alluniRefprexp070416> -n 16 -S 1234
sh annotation_scripts/rnaseq_alignment.sh
maker -EXE # generate maker_exe.ctl; need to add the PATH of required programs
sh annotation_scripts/gene_annotation.sh -d <working dir> -g <genome file> \
-e <transcriptome fasta> -p <uniprot solanaceae fasta> \
-R <Repeats.lib> -B <busco/embryophyta_odb9> \
-M maker_exe.ctl -I solanum_appendiculatum
python annotation_scripts/annotation_by_aed.py --genome <genome file> --gff <GFF output from MAKER2> --AED 0.5
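To make the AED filtering step concrete, here is a minimal sketch of selecting transcripts by AED from a MAKER GFF. It is not the code in 'annotation_by_aed.py'. It assumes that mRNA records carry an `_AED=` attribute in column 9 (as MAKER writes them) and that the threshold is inclusive; the function name is hypothetical.

```python
import re

# MAKER stores the score in column 9 as "_AED=<value>"; anchoring on ";" or
# line start avoids matching the separate "_eAED=" attribute.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def mrna_ids_passing_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose AED is <= max_aed (assumed inclusive)."""
    keep = []
    for line in gff_lines:
        if line.startswith("#"):
            continue  # comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level records carry the AED we test
        aed = AED_RE.search(cols[8])
        ident = ID_RE.search(cols[8])
        if aed and ident and float(aed.group(1)) <= max_aed:
            keep.append(ident.group(1))
    return keep


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = [
        "##gff-version 3",
        "ctg1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1",
        "ctg1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.10;_eAED=0.10",
        "ctg1\tmaker\tmRNA\t200\t300\t.\t+\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.80;_eAED=0.80",
    ]
    print(mrna_ids_passing_aed(toy))
```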
sh annotation_scripts/function_blast.sh -d <working dir> \
-s <proteins fasta from MAKER> -B <the directory saving all the blast datasets>
java -Xmx2g -jar <AHRD PATH>/ahrd.jar <your_use_case_file.yml>
Notes: Here the BLAST databases include 'ITAG3.2_proteins.fasta', 'TAIR10_pep_20101214', and 'uniprot_sprot.fasta'. You need to modify the YAML file to add the PATH of the BLAST results (see the example file in 'annotation_scripts').
SatsumaSynteny -t <tomato genome> -q <genome assembly> -o <output dir> -n 8
orthofinder -f <protein sequences directory>
Notes: The directory contains the input protein FASTA files, with one file per species. For S. appendiculatum, the FASTA file was the one generated by 'annotation_by_aed.py'. The 'Orthogroups.GeneCount.csv' file in the OrthoFinder outputs can then be lightly reformatted and renamed 'unfiltered_cafe_input.txt' (see the example file in 'cafe_scripts').
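The reformatting of 'Orthogroups.GeneCount.csv' can be sketched as below. This assumes the OrthoFinder table is tab-delimited with the orthogroup ID in the first column and a trailing 'Total' column, and that the CAFE input uses the 'Desc' and 'Family ID' leading columns from the CAFE tutorial. The function name is hypothetical; check the result against the example file in 'cafe_scripts' before using it.

```python
import csv
import io


def genecount_to_cafe(in_handle, out_handle):
    """Convert an OrthoFinder gene-count table to a CAFE-style table.

    Drops the trailing 'Total' column (if present) and prepends a
    placeholder description column. Returns the number of families written.
    """
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")

    header = next(reader)
    has_total = header[-1] == "Total"
    species = header[1:-1] if has_total else header[1:]
    writer.writerow(["Desc", "Family ID"] + species)

    n_families = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts = row[1:-1] if has_total else row[1:]
        # "(null)" is the placeholder description used in the CAFE tutorial.
        writer.writerow(["(null)", row[0]] + counts)
        n_families += 1
    return n_families


if __name__ == "__main__":
    # Toy table for illustration only; species names are made up.
    toy = "\tSapp\tSlyc\tTotal\nOG0000000\t3\t5\t8\nOG0000001\t1\t0\t1\n"
    out = io.StringIO()
    genecount_to_cafe(io.StringIO(toy), out)
    print(out.getvalue(), end="")
```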
python cafe_scripts/cafetutorial_clade_and_size_filter.py \
-i unfiltered_cafe_input.txt -o filtered_cafe_input.txt -s
sh cafe.sh
Notes: For details on setting up CAFE, see https://iu.app.box.com/v/cafetutorial-pdf
featureCounts -a <GTF file> -t exon -p -o raw_featurecount_data.txt <bam_file1> <bam_file2> <...>
Notes: The GFF file is generated above by 'annotation_by_aed.py'; the BAM files are generated by 'rnaseq_alignment.sh'.
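To inspect 'raw_featurecount_data.txt' before the differential expression step, the table can be loaded with a few lines of Python. This sketch assumes the standard featureCounts layout: a leading comment line starting with '#', then a header whose first six columns are Geneid, Chr, Start, End, Strand, and Length, followed by one count column per BAM file. The function name is hypothetical and is not part of the repository.

```python
import csv
import io


def read_featurecounts(handle):
    """Return (sample_names, {gene_id: [count per sample]})."""
    # The first line of the file is a '#' comment recording the command used.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")

    header = next(reader)
    samples = header[6:]  # columns after Geneid..Length are the BAM files

    counts = {}
    for row in reader:
        if not row:
            continue
        counts[row[0]] = [int(x) for x in row[6:]]
    return samples, counts


if __name__ == "__main__":
    # Toy table for illustration only.
    toy = (
        "# Program:featureCounts; Command: ...\n"
        "Geneid\tChr\tStart\tEnd\tStrand\tLength\tf1.bam\tm1.bam\n"
        "g1\tctg1\t1\t100\t+\t100\t12\t0\n"
        "g2\tctg1\t200\t300\t-\t101\t3\t7\n"
    )
    samples, counts = read_featurecounts(io.StringIO(toy))
    print(samples, counts)
```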
Rscript DEG_scripts/DEG-analysis.R
sh sexy_scripts/sexy_kmers.sh
Identify Illumina reads containing SSKs, align them to the genome, and count read depth across 10 kb windows
sh sexy_scripts/sexy_illumina.sh
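The windowed counting step can be illustrated with a small sketch. It is not taken from 'sexy_illumina.sh'. As a simplification it counts each aligned read once, in the 10 kb window that contains its 1-based start position, rather than computing per-base depth. The function name and the input format (contig, start) are assumptions.

```python
from collections import Counter


def window_read_counts(alignments, window=10_000):
    """Count reads per fixed-size window.

    alignments: iterable of (contig, start) with 1-based start positions.
    Returns a Counter keyed by (contig, 0-based window start).
    """
    counts = Counter()
    for contig, start in alignments:
        # Positions 1..10000 fall in window 0, 10001..20000 in window 10000.
        win_start = (start - 1) // window * window
        counts[(contig, win_start)] += 1
    return counts


if __name__ == "__main__":
    # Toy alignments for illustration only.
    toy = [("ctg1", 1), ("ctg1", 10_000), ("ctg1", 10_001), ("ctg2", 25_000)]
    for (contig, win_start), n in sorted(window_read_counts(toy).items()):
        print(contig, win_start, win_start + 10_000, n, sep="\t")
```

Comparing these per-window counts between female and male reads is one way to spot windows enriched for one sex.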
sh assembly_scripts/fema_assembly.sh
sh assembly_scripts/male_assembly.sh
Note: The goal is only to obtain the corrected PacBio long reads. You can abort the script once the reads FASTA file "mr.41.15.17.0.029.1.fa" appears in the working directory.
Identify corrected PacBio reads (from the separate female and male genome assemblies) containing SSKs and align them to the genome
sh sexy_scripts/sexy_pacbio.sh
sh sexy_scripts/sexy_variants.sh