burnt-rosemary

Data import modules for the Baxter lab's GWAS database

Installation

Environment

Create a .env file in the project's root directory. You can use .env.example as a starting point.

# .env.example
database=baxdb
user=baxdb_owner
password=password
host=localhost
port=5432
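
To sanity-check a .env file before running an import, something like the following could be used. This is a minimal sketch using only the standard library; the helper names parse_env and load_env are hypothetical and are not part of this repository, and the importer itself may read the file differently.

```python
from pathlib import Path

# The five keys shown in .env.example.
REQUIRED_KEYS = ("database", "user", "password", "host", "port")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"expected KEY=VALUE, got: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


def load_env(path=".env"):
    """Read a .env file and check that every connection key is present."""
    settings = parse_env(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError(f"missing keys in {path}: {', '.join(missing)}")
    # The port is the only numeric setting in the example file.
    settings["port"] = int(settings["port"])
    return settings
```

Calling load_env() from the project root returns a dictionary of connection settings, or raises an error naming the keys that are absent.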

Usage

Importing data into the GWAS database is split into four phases: initialization, gather, collection, and then results.

python import.py --verbose -f data/maize.json data/maize282

The input configuration file (.json) is used to locate the data files. Below is an example data configuration file.

{
  "species_shortname": "maize",
  "species_binomial_name": "Zea mays",
  "species_subspecies": "",
  "species_variety": "",
  "population_name": "Maize282",
  "number_of_chromosomes": 10,
  "genotype_version_assembly_name": "B73 RefGen_v4",
  "genotype_version_annotation_name": "AGPv4",
  "reference_genome_line_name": "282set_B73",
  "phenotype_filename": "5.mergedWeightNorm.LM.rankAvg.longFormat.csv",
  "gwas_algorithm_name": "MLMM",
  "imputation_method_name": "impute to major allele",
  "kinship_algortihm_name": "van raden",
  "kinship_filename": "4.AstleBalding.synbreed.kinship.csv",
  "population_structure_algorithm_name": "Eigenstrat",
  "population_structure_filename": "4.Eigenstrat.population.structure.10PCs.csv",
  "gwas_run_filename": "9.mlmmResults.csv",
  "gwas_results_filename": "9.mlmmResults.csv",
  "missing_SNP_cutoff_value": 0.2,
  "missing_line_cutoff_value": 0.2,
  "minor_allele_frequency_cutoff_value": 0.1
}
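
Before starting an import, it can help to confirm that every file named in the configuration exists. The sketch below assumes that the directory passed on the command line (data/maize282 in the example above) holds the files named by the *_filename keys; that layout is an assumption, and the helper names are hypothetical rather than part of import.py.

```python
import json
from pathlib import Path

# Keys in the example configuration whose values name data files.
FILENAME_KEYS = (
    "phenotype_filename",
    "kinship_filename",
    "population_structure_filename",
    "gwas_run_filename",
    "gwas_results_filename",
)


def load_config(config_path):
    """Load the JSON configuration file into a dictionary."""
    with open(config_path, encoding="utf-8") as handle:
        return json.load(handle)


def resolve_data_files(config, data_dir):
    """Map each *_filename key to a path inside data_dir (assumed layout)."""
    data_dir = Path(data_dir)
    return {key: data_dir / config[key] for key in FILENAME_KEYS}


def missing_files(paths):
    """Return the configuration keys whose files do not exist on disk."""
    return sorted(key for key, path in paths.items() if not path.is_file())
```

For example, missing_files(resolve_data_files(load_config("data/maize.json"), "data/maize282")) returns an empty list when all five files are in place.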

Required Input Files

Phenotype file .csv
Kinship .csv
Population structure .csv
GWAS results/run .csv
Genotype .012, .012.indv, and .012.pos (generated from VCF, for example with the VCFtools --012 option)

1. Phenotype File

This file contains every measure and its measurements for each pedigree. It is the source for the phenotype table.

2. Kinship File

This file is a simple 2D matrix of all the lines/pedigrees and their kinship measurements.
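
A quick shape check on the kinship matrix can catch a truncated or misaligned file early. The sketch below assumes a specific layout that this README does not confirm: a header row of line names (with a corner cell in the first column), followed by one row per line that starts with the line name. The function name read_kinship is hypothetical.

```python
import csv


def read_kinship(handle):
    """Return (line_names, matrix) from a kinship CSV.

    Assumed layout: header row of line names after a corner cell,
    then one row per line: its name followed by the kinship values.
    """
    rows = [row for row in csv.reader(handle) if row]
    if not rows:
        raise ValueError("kinship file is empty")
    header = rows[0][1:]
    names, matrix = [], []
    for row in rows[1:]:
        names.append(row[0])
        matrix.append([float(value) for value in row[1:]])
    # Row labels and column labels should list the same lines in order.
    if names != header:
        raise ValueError("row and column line names differ")
    # Every row needs one value per line for the matrix to be square.
    if any(len(values) != len(names) for values in matrix):
        raise ValueError("kinship matrix is not square")
    return names, matrix
```

Pass it an open file, for example with open(path, newline="") as handle: names, matrix = read_kinship(handle).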

3. Population Structure File

This file contains N principal components that define the population structure.

4. GWAS Results/Run File

This file contains the results of the GWAS analysis. It includes the SNP, p-value, cofactor, null p-value, model, trait, number of SNPs, number of lines, and principal components.

5. Genotype Files

Genotype data is sometimes combined into a single set of three files, but for import it must be split by chromosome, using the naming convention: chr<NUMBER>_<SPECIES>.<EXTENSION>

For example: chr4_maize.012, chr4_maize.012.pos, chr4_maize.012.indv.
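
The per-chromosome naming convention can be checked with a short script. This sketch assumes that chromosomes are numbered 1 through number_of_chromosomes and that the species part of the file name matches species_shortname from the configuration file; both are assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

# The three genotype file extensions listed under Required Input Files.
GENOTYPE_EXTENSIONS = ("012", "012.indv", "012.pos")


def expected_genotype_files(species_shortname, number_of_chromosomes):
    """List the expected genotype file names for every chromosome."""
    return [
        f"chr{chromosome}_{species_shortname}.{extension}"
        for chromosome in range(1, number_of_chromosomes + 1)
        for extension in GENOTYPE_EXTENSIONS
    ]


def missing_genotype_files(data_dir, species_shortname, number_of_chromosomes):
    """Return the expected genotype file names not found in data_dir."""
    data_dir = Path(data_dir)
    return [
        name
        for name in expected_genotype_files(species_shortname, number_of_chromosomes)
        if not (data_dir / name).is_file()
    ]
```

With the example configuration, missing_genotype_files("data/maize282", "maize", 10) checks for 30 files, three per chromosome.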

tparkerd/pgwasdbi