Montpellier, 2017-2019
Submitted to Molecular Ecology Resources, 2019
- R packages: ggplot2, plyr, reshape, gridExtra
See https://www.sylabs.io/docs/ for instructions to install Singularity.
singularity pull --name snpsdata_analysis.simg shub://Grelot/reserveBenefit--snpsdata_analysis:snpsdata_analysis
singularity run snpsdata_analysis.simg
We work on three species: Mullus surmuletus, Diplodus sargus and Serranus cabrilla. Let's define the wildcard species as any of these three species.
- genome assembly: species.fasta
- SNPs data from RADseq: species.vcf
Only one randomly selected SNP was retained per locus, and a locus was retained only if it was present in at least 85% of individuals. Individuals with excess coverage depth (>1,000,000x) or >30% missing data were filtered out. We kept only loci with an observed heterozygosity of at most 0.6.
- Remove loci with inbreeding coefficient Fis > 0.5 or < -0.5
- Keep all pairs of loci that are closer than 5000 bp
- Keep pairs of loci with linkage disequilibrium r² > 0.8
- Keep SNPs with a minimum minor allele frequency (MAF) of 1%
- Remove loci that deviate significantly (p-value < 0.01) from the Hardy-Weinberg genotype frequencies expected under random mating
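The per-locus thresholds listed above (Fis, MAF) can be illustrated with a short Python sketch. This is not the code in filter_vcf.sh; it only shows, under stated assumptions, how such statistics could be derived from the genotypes of one biallelic locus. The function names `locus_stats` and `passes_filters` are hypothetical, and Fis is assumed here to be computed as 1 - Ho/He.

```python
def locus_stats(genotypes):
    """Compute MAF, observed/expected heterozygosity and Fis for one biallelic locus.

    genotypes: list of VCF-style GT strings such as '0/0', '0/1', '1/1';
    any genotype containing '.' is treated as missing and ignored.
    """
    called = [g.replace("|", "/").split("/") for g in genotypes if "." not in g]
    n = len(called)
    if n == 0:
        return None  # no called genotype at this locus
    alt_count = sum(allele == "1" for gt in called for allele in gt)
    p_alt = alt_count / (2 * n)                 # alternate allele frequency
    maf = min(p_alt, 1 - p_alt)                 # minor allele frequency
    h_obs = sum(gt[0] != gt[1] for gt in called) / n
    h_exp = 2 * p_alt * (1 - p_alt)             # Hardy-Weinberg expectation
    # Assumption: Fis = 1 - Ho/He; undefined (NaN) for monomorphic loci.
    fis = 1 - h_obs / h_exp if h_exp > 0 else float("nan")
    return {"maf": maf, "h_obs": h_obs, "h_exp": h_exp, "fis": fis}


def passes_filters(stats, max_abs_fis=0.5, min_maf=0.01):
    """Apply the Fis and MAF thresholds from the list above to one locus."""
    if stats is None:
        return False
    # A NaN Fis fails the comparison, so monomorphic loci are dropped.
    return abs(stats["fis"]) <= max_abs_fis and stats["maf"] >= min_maf


if __name__ == "__main__":
    stats = locus_stats(["0/0", "0/1", "1/1", "0/1", "./."])
    print(stats, passes_filters(stats))
```

The Hardy-Weinberg test and the linkage disequilibrium steps are left out of this sketch because they need an exact test and pairwise comparisons, which are better delegated to a dedicated tool.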
- species.vcf: SNPs from RADseq data of species
- species.sumstats.tsv: summary statistics for each population
- species.lmiss: table of the number of missing individuals by locus
- species.imiss: table of the number of missing loci by individual
- species.idepth: table of the mean locus depth coverage by individual
- species.geno.ld: table of linkage disequilibrium r²
- species.snps.fisloc_rm.vcf
- species.fisloc_rm.ld_5000.log
- species.fisloc_rm.ld_5000.recode.vcf
- species.fisloc_rm.ld_5000.r2.recode.vcf
- species.fisloc_rm.ld_5000.r2.maf001.recode.vcf
- species.fisloc_rm.ld_5000.r2.maf001.hwe.recode.vcf: final filtered SNPs
- species filtering_count_snps_report.tsv: number of SNPs at each filtering step
cd filter_vcf
bash filter_vcf.sh
- Split the genome into windows of 400 kbp.
- Count the number of SNPs located in each genome window.
- Count the number of reads for each SNP in each individual.
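As an illustration of the windowing step, the following Python sketch counts SNPs per 400 kbp window from a list of (scaffold, position) pairs. It is not the code in snpsontothegenome/command.sh. The function name `snps_per_window` is hypothetical, positions are assumed to be 1-based as in a VCF file, and the output rows are assumed to be BED-like (0-based start, exclusive end).

```python
from collections import Counter

WINDOW_SIZE = 400_000  # window length in bp, as stated in the list above


def snps_per_window(snps, window_size=WINDOW_SIZE):
    """Count SNPs in fixed-size genome windows.

    snps: iterable of (scaffold, position) tuples, positions 1-based.
    Returns a sorted list of (scaffold, start, end, n_snps) rows.
    """
    counts = Counter()
    for scaffold, pos in snps:
        # Convert the 1-based position to the 0-based start of its window.
        start = (pos - 1) // window_size * window_size
        counts[(scaffold, start)] += 1
    return [
        (scaffold, start, start + window_size, n)
        for (scaffold, start), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    example = [("scaffold1", 1), ("scaffold1", 400_000),
               ("scaffold1", 400_001), ("scaffold2", 10)]
    for row in snps_per_window(example):
        print(*row, sep="\t")
```

Note that this sketch only reports windows containing at least one SNP and does not clip the last window to the scaffold length; both would need the scaffold sizes from the genome fasta file.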
- species.fasta: genome fasta file of species
- species.vcf: SNPs from RADseq data of species
- species.gff3: coordinates and related information of the coding-region annotation of the genome of species
- species coverage.bed: a table with one row per 400,000 bp window of the genome of species, with genome coordinates (scaffold, start position, end position) and coverage (number of SNPs)
- species meandepth.bed: a table with one row per SNP, with genome-window coordinates (scaffold, start position, end position) and depth coverage (number of reads) for each SNP in each individual
- species coords.snps.bed: coordinates (scaffold, position) of SNPs on the genome
- species coding.snps.bed: SNPs located in coding regions
bash snpsontothegenome/command.sh
Rscript snpsontothegenome/figure_cover_genome.R
- species coords.snps.bed: coordinates (scaffold, position) of SNPs on the genome
Rscript snpsontothegenome/average_distance_loci.R
Simply count the number of lines in the file species coding.snps.bed (each line is a SNP located in a coding region).
Simply count SNPs annotated as "mitochondrial" by Augustus
species | number of SNPs located in mitochondrial regions |
---|---|
diplodus | 173 |
mullus | 178 |
serran | 226 |
- distance_loci.csv: mean, median, standard deviation (sd), maximum and minimum of the distance between consecutive loci (in bp)
species | mean | median | sd | max | min |
---|---|---|---|---|---|
diplodus | 35388.9078430345 | 23751 | 34996.9143024498 | 459616 | 5000 |
mullus | 30716.8684498214 | 20930 | 29189.8335674228 | 384550 | 5002 |
serran | 28239.7585528699 | 19084 | 27013.2843728281 | 403508 | 733 |
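The statistics in this table can be approximated with the Python sketch below. It is not the code of snpsontothegenome/average_distance_loci.R. The function names are hypothetical, and two assumptions are made: distances are only taken between loci on the same scaffold, and sd is the sample standard deviation.

```python
import statistics
from itertools import groupby


def consecutive_distances(snps):
    """Distances (bp) between consecutive loci on the same scaffold.

    snps: iterable of (scaffold, position) tuples in any order.
    """
    distances = []
    # Sorting groups loci by scaffold and orders them by position.
    for _, group in groupby(sorted(snps), key=lambda snp: snp[0]):
        positions = [pos for _, pos in group]
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def summarise(distances):
    """Mean, median, sample sd, max and min; needs at least two distances."""
    return {
        "mean": statistics.mean(distances),
        "median": statistics.median(distances),
        "sd": statistics.stdev(distances),
        "max": max(distances),
        "min": min(distances),
    }


if __name__ == "__main__":
    example = [("scaffold1", 100), ("scaffold1", 600), ("scaffold1", 1600),
               ("scaffold2", 50), ("scaffold2", 2050)]
    print(summarise(consecutive_distances(example)))
```

With real data, the (scaffold, position) pairs would come from the coords.snps.bed file of each species.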
- summary_snps.csv: number of SNPs, average distance between consecutive loci (in bp), number of SNPs located in a coding region and number of SNPs located in mitochondrial regions for each species
species | number_snps | average_distance_bp | number_coding_snps | number_mitochondrial_snps |
---|---|---|---|---|
diplodus | 20074 | 35389 | 11978 | 173 |
mullus | 15710 | 30717 | 10304 | 178 |
serranus | 21101 | 28240 | 13107 | 226 |
More details about the location of coding SNPs (CDS, exon, intron) are in this table: count_snps_annotation.csv