Data and Code accompanying the manuscript

Epigenetic and genetic population structure is coupled in a marine invertebrate

Repository structure

  • analyses - code output, organized by subdirectories
  • code - code used, including knitr products
  • data - files associated with annotation, genetics, and methylation
  • genome-annot - code and fasta files used for genome annotation
  • genome-features - genome feature tracks for visualization and analysis (e.g. bedtools)
  • protocols - protocol for 2bRAD library prep
  • snps - calling snps from MBD-BS data (not used in manuscript)

Below are selected files in the repository that are specifically referenced in the manuscript

Genome Assembly

  • PBJelly code genome-annot/20171130_emu_pbjelly.ipynb
  • Maker custom repeat library genome-annot/Ostrea_lurida_v081-families.fa
  • transcriptome assembly genome-annot/Olurida_transcriptome_v3.fasta
  • Crassostrea gigas proteins genome-annot/GCA_000297895.1_oyster_v9_protein.faa
  • Crassostrea virginica proteins genome-annot/GCF_002022765.2_C_virginica-3.0_protein.faa

Genetic Analyses

  • Meyer 2bRAD protocol protocols/2bRAD_11Aug2015.pdf

Genetics analyses code

  • ANGSD_HCSSonly.ipynb: ANGSD genotyping, converting ANGSD output to VCF, running outlier analysis with Bayescan, generating a relatedness matrix (not used in paper), running ngsAdmix for admixture, calculating FST (overall, per gene, per SNP), counting the number of SNPs in genes, generating GO enrichment files for genes with FST > 0.3, PCA of all genetic samples, and PCA of MBD samples

Genetics analyses output files

  • HCSS_Afilt32m70_01_pp90_m75_BSouts.vcf: VCF file of 7 SNPs determined to be outliers by Bayescan with FDR < 0.1 and prior of 10.

  • HCSS_sfsm70_Per{Site,Gene}Fst.csv: CSV files with pairwise FST values > 0 on either a per site or per gene basis.

2bRAD data

  • HCSS_Afilt32m70_01_pp90.vcf: the primary genetic data used in the manuscript, HC and SS populations, all SNPs with MAF > 1% and a genotype probability of > 90%; _m75.recode.vcf is filtered for SNPs genotyped in at least 75% of individuals
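The filters named above (MAF > 1%, genotyped in at least 75% of individuals) can be illustrated with a minimal sketch. This is not the repository's ANGSD/VCFtools code: it assumes hard-called genotypes coded as alternate-allele counts (0, 1, 2) with None for missing calls, and the function names are hypothetical. The genotype-probability filter is not modelled here.

```python
from typing import List, Optional

# Hypothetical encoding: alt-allele count per individual, None = not genotyped.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at this site."""
    if not genotypes:
        return 0.0
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Minor allele frequency among called (diploid) genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def keep_site(genotypes: List[Genotype],
              min_maf: float = 0.01,
              min_call_rate: float = 0.75) -> bool:
    """Keep a SNP if MAF > min_maf and it is genotyped in >= min_call_rate of individuals."""
    return (minor_allele_frequency(genotypes) > min_maf
            and call_rate(genotypes) >= min_call_rate)
```

For example, `keep_site([0, 1, 2, None])` passes both thresholds, while a monomorphic site such as `[0, 0, 0, 0]` fails the MAF filter.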

Genetic analyses subdirectories

DNA Methylation

Methylation code

  • 00-Bismark.sh - Bismark code
  • 01-methylkit.Rmd - methylKit code plus additional analyses. Includes an initial look at methylation data, filtering for coverage, incorporating loci that were very likely unmethylated in one population but highly methylated in the other, generating a final methylation dataset for comparative analysis among populations, conducting differential methylation analysis, PCA, and generating a distance matrix from % methylation for integration with genetic data.
  • 02-Generating-gene-region-feature-files.Rmd - Bedtools to expand genome feature files to include 2kb up/downstream of features.
  • 03-General-Methylation-Patterns.Rmd - Uses all data from both populations to characterize methylation in the Ostrea lurida genome. Includes calling methylation status & filtering for coverage, using bedtools to identify genome features containing methylated CpGs, testing for differences among distributions of methylated CpGs vs. all CpGs. Also includes methylation island analysis code that was not used for paper.
  • 04-DMG-analysis.Rmd - Differentially methylated gene analysis among populations using binomial GLMs, gene annotation and enrichment analysis, overlap among DMGs and DMLs, Pst calcs at the gene level. Contains lots of extraneous code to identify DMGs and run Pst using various thresholds for the min. number of methylated loci per gene.
  • 05-Annotations.Rmd - Uses bedtools and enrichment analyses to characterize where differentially methylated loci (DMLs) are located in the genome. Includes extraneous code annotating "MACAU" loci (i.e. "Size associated loci", or SALs).
  • 06-Meth-Pst-bins - Calculate methylation Pst for genome bins. 10kb bins were used in the paper; the notebook also contains extraneous code for 1kb bins.
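As an illustration of the 2kb up/downstream expansion described for 02-Generating-gene-region-feature-files.Rmd, the sketch below shows the interval arithmetic that such an expansion involves. It is not the repository's bedtools code: the function names are hypothetical, coordinates are assumed to be BED-style (0-based, half-open), strand is ignored, and the chromosome length must be supplied so intervals can be clamped to the sequence ends.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open as in BED


def expand_feature(start: int, end: int, chrom_length: int,
                   flank: int = 2000) -> Interval:
    """Feature plus `flank` bp on each side, clamped to [0, chrom_length]."""
    return max(0, start - flank), min(chrom_length, end + flank)


def upstream_flank(start: int, flank: int = 2000) -> Interval:
    """Only the `flank` bp immediately before the feature (strand ignored)."""
    return max(0, start - flank), start


def downstream_flank(end: int, chrom_length: int,
                     flank: int = 2000) -> Interval:
    """Only the `flank` bp immediately after the feature (strand ignored)."""
    return end, min(chrom_length, end + flank)
```

A gene at (5000, 6000) on a 100000 bp contig would expand to (3000, 8000); a gene near a contig end is truncated rather than extended past the sequence.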

Methylation analyses subdirectories

  • analyses/methylation/ - Contains output files primarily from the 01-methylkit.Rmd notebook relevant to comparative methylation analysis (but not including DML and DMG analyses).
  • analyses/methylation-genome-characteristics/ - Contains output files primarily from the 03-General-Methylation-Patterns.Rmd notebook relevant to characterizing methylation in the Ostrea lurida genome.
  • analyses/DMGs/ - Contains output files primarily from the 04-DMG-analysis.Rmd notebook relevant to identifying differentially methylated genes and their functions.
  • analyses/DMLs/ - Contains output files primarily from the 01-methylkit.Rmd and 05-Annotations.Rmd notebooks relevant to differentially methylated loci and their functions.

Methylation analyses output files

  • all_methylated_5x & all_unmethylated_5x - R objects containing loci in the O. lurida genome that are methylated and unmethylated, respectively, used to characterize methylation patterns for the species / draft genome.
  • analyses/methylation-genome-characteristics/methylated-<feature>.bed - Bed files containing locations of methylated loci, where feature is gene, gene2kb, 2kbflank-up, 2kbflank-down, exon, CDS, mRNA, TE, or ASV.
  • meth_filter - methylBase object containing coverage, numCs, and numTs for each sample for the final set of filtered loci used in comparative analyses.
  • perc.meth & percent-methylation-filtered.tab - Percent methylation data for loci that were filtered at 5x, used in comparative analyses.
  • myDiff25p, dml25_counts, myDiff25p.tab, and dml25.bed - differentially methylated loci (DMLs) among populations, using a minimum methylation difference of 25%.
  • dist.manhat.csv and dist.manhat.DMLs.csv - Manhattan distance matrices generated from % methylation matrices using all methylation data and differentially methylated loci, respectively.
  • PCA.filtered - PCA R object representing global methylation patterns & differences among populations (uses all loci after filtering)
  • perc_meth_bins_10kb_Pst and Pst_bins_10kb.tab - Pst calculation results for random 10kb bins that contain both genetic and methylation data.
  • DMGs_2kbslop and DMGs_2kbslop_annotated.tab - Differentially methylated genes (DMGs) among populations.
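To make the Manhattan distance matrices listed above concrete, here is a minimal sketch of how a pairwise matrix can be derived from a percent-methylation table. The repository computes these in R; this version is only an illustration, and the sample names, data layout (one list of per-locus percentages per sample), and function names are assumptions.

```python
from typing import Dict, List


def manhattan(a: List[float], b: List[float]) -> float:
    """Sum of absolute per-locus differences between two samples."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same loci")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(perc_meth: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Manhattan distances between all samples."""
    names = list(perc_meth)
    return {
        a: {b: manhattan(perc_meth[a], perc_meth[b]) for b in names}
        for a in names
    }


# Hypothetical example: % methylation at three loci for two samples.
example = {
    "HC_1": [10.0, 80.0, 50.0],
    "SS_1": [20.0, 60.0, 50.0],
}
dist = distance_matrix(example)
```

The resulting matrix is symmetric with zeros on the diagonal, so only the upper triangle carries information when it is later compared with a genetic distance matrix.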

Combined methylation and genetic analyses

  • MBD_samples_genetic_analysisHCSS_5x.ipynb: integrating methylation and genetic data for 5x filtered methylation data; correlate distance matrices, correlate Pst and FST, correlate PC scores, explore CpG-SNPs (includes some preliminary analyses with additional GBS SNPs not used in the paper), TsTv ratio
  • mQTL_analysis-RankNorm-5x.ipynb (large file, view on nbviewer): Methylation QTL analysis
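One simple way to correlate two distance matrices, as mentioned for the combined analysis notebook, is to compare their upper-triangle entries. The sketch below does this with a Pearson correlation. It is an assumption about the general approach, not a reproduction of the notebook, and the function names are hypothetical. Because pairwise distances are not independent, a significance test would normally need a permutation procedure (such as a Mantel test), which is not shown here.

```python
from math import sqrt
from typing import List, Sequence


def upper_triangle(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Off-diagonal entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def matrix_correlation(m1: Sequence[Sequence[float]],
                       m2: Sequence[Sequence[float]]) -> float:
    """Correlation between the upper triangles of two distance matrices."""
    return pearson(upper_triangle(m1), upper_triangle(m2))
```

Both matrices must list samples in the same order, since entries are paired by position.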

Combined methylation and genetic analyses subdirectories

Code for manuscript figures: Figures.Rmd

Archived files

Several directories contain archive/ subdirectories, which contain code and results from analyses that were not ultimately needed for the paper. For instance, we performed a MACAU analysis to identify methylated loci associated with oyster size ("Size associated loci", or "SALs"), but this was not included in our publication. NOTE: some MACAU-relevant code also remains in non-archived notebooks (e.g. in the 05-Annotations.Rmd notebook).

  • Files for relatedness matrix for MACAU: HSmbdsamples_rab.txt and mbdsamples_rab.txt. The file starting with HS was generated from an ANGSD run using only HC/SS samples; the other is from a run using HC/SS/NF samples. The two files are very tightly correlated, so the choice between them should not matter much.
    • GWAS_*.pvalues: p-values from GWAS of either weight or width for each SNP
  • Afilt32m70_01_pp75.vcf: 3 populations, all SNPs with MAF > 1% and a genotype probability of > 75%; _m75.recode.vcf is filtered for SNPs genotyped in at least 75% of individuals