Natural selection constrains neutral diversity across a wide range of species

This repository contains code to replicate the analysis in "Natural selection constrains neutral diversity across a wide range of species"

Prepare inputs

This section describes the scripts in the "prepare_input_files" folder used to:

  • process and clean genetic maps in preparation for computing genetic maps
  • estimate functional density across the genome from GFF files
  • estimate theta for four-fold degenerate sites from VCF files

These inputs then form the basis of the recombination rate estimation and population genetic models in the next sections.

Initial setup:

  1. extract_exons_from_gff.pl: script to convert GFF files with CDS records to bed files containing coding exon records
  2. get_gene_density.pl: script to estimate gene density in windows across the genome from exon BED files, launched with run_pi.sh helper script for this analysis
  3. get_pi.pl: script to estimate pi in windows across the genome from parsed and filtered VCF files, launched with run_gd.sh helper script for this analysis
  4. load_maps.R: script to clean up genetic maps and produce input for recombination rate estimation; run as R --vanillia < load_maps.R

The raw files required to run these scripts are not distributed on Github, as they are either too large or the terms under which redistribution is allowed are unclear. Thus, to replicate our analysis we have provided intermediate files that represent the output of these steps:

  • the R objects all_maps_list.RData and raw_genetic_maps.RData can be loaded into R (with the load command) and represent, respectively, all the genetic maps as a list and the raw underlying map data frames
  • map_info_raw.txt is a short description of the genetic maps generated by the load_maps.R script
  • std.*.txt and q30.*.txt are summarized, windowed estimates of nucleotide diversity (pi) for all species, with either the standard filters or a more stringent Q30 filter (described in the manuscript)
  • gd.*.txt are summarized, windowed estimates of coding sequence density (defined as coding sites / total sites) in windows across each genome
  • total.exons.txt is a summary file with the total number of exonic coding bases in each genome (produced by get_gene_density.pl)
  • *.average_pi.txt are summary files of overall pi for each genome (produced by get_pi.pl)

It should be possible to replicate our analysis directly from these files without having to rerun any scripts in the prepare_input_files folder. However, we provide all the scripts both as documentation and to allow for easier reuse of our pipeline.

Recombination rate estimation

This section describes the script in the "rec_rate_est" folder used to:

  • Clean up genetic map data
  • Estimate recombination rates
  • Produce windowed files for population genetic models

There is only a single script, linkedsel_rec_est.R, that does all the necessary steps. These steps are:

  1. Remove mismapped markers from genetic maps (cases where genetic and physical maps are not congruent)
  2. Remove duplicate markers from genetic maps (cases where a single marker has multiple hits to the genome assembly)
  3. Make a mask of regions to ignore with low quality recombination estimates
  4. Estimate recombination rates using two methods
  5. Read in gene density and pi estimates from prepare_input_files
  6. Produce output files for population genetic modeling

Note: this script will also produce a large number of plots, and requires that the directories plots/recrate and plots/marey_maps exist! To turn off plotting, remove or comment out the for loops in the R code that follow comments starting with #plot...

Note also: this script depends on four R packages: polynom, splines, qualV, and plyr

Population genetics models

This section describes the scripts in the "pogen_models" folder used to:

  • compute the effect of background selection on windows across the genome, taking into account recombination rate and gene density, based on the method of Rockman et al and Flowers et al.
  • fit a variety of population genetic models to the data using nonlinear regression in R

Population genetic modeling:

  1. Compile compute_gk.cpp. It should compile on both Mac OS X and most flavors of Linux. It has been tested on Mac OS X 10.6.8 with g++ version 4.2.1, and on CentOS 6.4 with g++ 4.4.7 To compile, simply run g++ -o compute_gk compute_gk.cpp
  2. Using the output of the linkedsel_rec_est.R script above, run compute_gk. An example of how to do this in parallel with the Perl module Parallel::ForkManager is included as run_gk.pl. This is the script we used to produce the results in the manuscript. Note that the number of threads is HARD CODED and will need to be changed for your system. To run compute_gk without the helper Perl script, see the comments in the beginning of the code. While this code should be fine to use generally, note that it requires input in a specific format and DOES NOT sanitize or check inputs, as it is written to use only the output of the R script. Proceed with caution.
  3. Run the selection models coded in run_bgs.R with the perl script run_selection_models.pl. Note this script also uses Parallel:ForkManager, and expects the input directory (with the output of compute_gk) as the first command line option. run_bgs.R requires the R packages data.table and minpack.lm, and potentially requires a lot of memory (>16 GB) for species with big genomes The output directories for both logs and model output are hardcoded in the scripts as Rlogs and modres. Please create these directories before running, or change the code as appropriate. The logs directory is in the perl script, and the modres in the R script.
  4. Run summarize_select.R to parse the output of the population genetic models and produce a final summary table for analysis. Note that this script requires a selfing.txt file describing the levels of selfing in each species to be analyzed, and hard codes several parameters such as the number of windows and the mutation estimates.

Species properties

This section describes the scripts "spec_props" used to:

  • estimate species range based on occurrence data

And the data in "spec_props" describing:

  • body size, generation time, and trophic level data for each species
  • assembly QC parameters

Estimate species range:

  1. the load_range_gbif.R script downloads geotagged occurrence data for each species from GBIF and saves it to a .RData file. This can take some time to run.
  2. the species_range.R script uses R to compute ranges based on an alpha-hull, and writes the results out as range.final.
  3. in some cases, occurrence data is not available from GBIF, so geotagged locations are either loaded from an included file or hard coded in the R script

Note that the range estimation scripts require a number of R packages: dismo, rgeos, alphahull, rgdal, maptools, sp. The species_range.R script also requires the 50m ocean shapefile from http//www.naturalearthdata.com/download/50m/physical/ne_50m_ocean.zip

Data files:

  1. ne_cors.txt: body size, mass, trophic level, kingdom, generation time, and density. Sources are described in spcor_ne.xls and the manuscript.
  2. qc_params.txt: total assembly size, size of assembly placed on chromosomes, ungapped placed assembly size, total genome size from C-value estimates, and whether our polymorphism data comes from domesticated populations. Sources described in spcor_assembly.xls and the manuscript.

Final analysis

This section describes the scripts in the "final_analysis" folder used to produce the final figures and statistics for the manuscript.

  • linear_models.R: this R script contains the source code to replicate the linear modeling of the relationship between the impact of selection on neutral diversity and proxies for Nc, as described in the manuscript. It also contains the source code to generate Figure 3.
  • correlation_analysis.R: this R script contains the source code to replicate the analysis of partial correlations between neutral diversity and recombination rate, as described in the manuscript. It also contains the source code to generate Figure 2.
  • best_model_analysis.R: this R script contains the source code to replicate the analysis of the relative likelihoods of different population genetic models, as described in the manuscript. It also contains the source code to generate Figure 4.
  • figure1.R: this R script contains the source code to generate Figure 1.

Misc scripts

The "misc_scripts" folder contains a few filtering and data parsing scripts used for our analysis.