burnt-rosemary
Data importation modules for Baxter lab's GWAS database
Installation
Environment
Create a .env file in the project's root directory. You may use .env.example as a basis for it.
# .env.example
database=baxdb
user=baxdb_owner
password=password
host=localhost
port=5432
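As a rough illustration of how these settings could be read, the sketch below parses KEY=VALUE lines into a dictionary. The helper names parse_env and load_env are hypothetical and are not taken from this repository; the import modules may well load the file differently (for example through a dotenv library).

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def load_env(path=".env"):
    """Hypothetical helper: read connection settings from a .env file."""
    return parse_env(Path(path).read_text())


# The contents of .env.example shown above, used here as inline sample input.
example = """# .env.example
database=baxdb
user=baxdb_owner
password=password
host=localhost
port=5432
"""
settings = parse_env(example)
print(settings["database"], settings["port"])
```

All values come back as strings, so a caller that needs a numeric port would convert it with int(settings["port"]).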
Usage
Importing data into the GWAS database is split into four phases: initialization, gather, collection, and then results.
python import.py --verbose -f data/maize.json data/maize282
The input configuration file (.json) is used to locate the data files. Below is an example data configuration file.
{
"species_shortname": "maize",
"species_binomial_name": "Zea mays",
"species_subspecies": "",
"species_variety": "",
"population_name": "Maize282",
"number_of_chromosomes": 10,
"genotype_version_assembly_name": "B73 RefGen_v4",
"genotype_version_annotation_name": "AGPv4",
"reference_genome_line_name": "282set_B73",
"phenotype_filename": "5.mergedWeightNorm.LM.rankAvg.longFormat.csv",
"gwas_algorithm_name": "MLMM",
"imputation_method_name": "impute to major allele",
"kinship_algortihm_name": "van raden",
"kinship_filename": "4.AstleBalding.synbreed.kinship.csv",
"population_structure_algorithm_name": "Eigenstrat",
"population_structure_filename": "4.Eigenstrat.population.structure.10PCs.csv",
"gwas_run_filename": "9.mlmmResults.csv",
"gwas_results_filename": "9.mlmmResults.csv",
"missing_SNP_cutoff_value": 0.2,
"missing_line_cutoff_value": 0.2,
"minor_allele_frequency_cutoff_value": 0.1
}
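A configuration like the one above can be checked before an import starts. The sketch below is only an illustration: the list of keys treated as mandatory is an assumption made for this example, and missing_keys and load_config are hypothetical helpers rather than functions from import.py.

```python
import json

# Assumption: these keys are treated as mandatory for illustration only.
# The real importer may require more, fewer, or different keys.
ASSUMED_REQUIRED_KEYS = (
    "species_shortname",
    "number_of_chromosomes",
    "phenotype_filename",
    "kinship_filename",
    "population_structure_filename",
    "gwas_run_filename",
    "gwas_results_filename",
)


def missing_keys(config):
    """Return the assumed-required keys that are absent from config."""
    return [key for key in ASSUMED_REQUIRED_KEYS if key not in config]


def load_config(path):
    """Load a JSON configuration file and fail early if keys are missing."""
    with open(path) as handle:
        config = json.load(handle)
    missing = missing_keys(config)
    if missing:
        raise KeyError(f"missing configuration keys: {', '.join(missing)}")
    return config


# A deliberately incomplete configuration to show what the check reports.
sample = json.loads('{"species_shortname": "maize", "number_of_chromosomes": 10}')
print(missing_keys(sample))
```

Failing early on a missing key keeps a long-running import from stopping partway through with only some tables populated.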
Required Input Files
- Phenotype file (.csv)
- Kinship (.csv)
- Population structure (.csv)
- GWAS results/run (.csv)
- Genotype (.012, .012.indv, and .012.pos; generated from VCF files, e.g. with VCFtools --012)
1. Phenotype File
This file contains all measures and measurements for each pedigree. It is the source for the phenotype table.
2. Kinship File
This file is a simple 2D matrix of all the lines/pedigrees and their kinship measurements.
3. Population Structure File
This file contains N principal components that define the population structure.
4. GWAS Results/Run File
This file contains the results of the GWAS analysis. It includes the SNP, p-value, cofactor, null p-value, model, trait, number of SNPs, number of lines, and principal components.
5. Genotype Files
These files are sometimes collapsed into three single files, but they must be separated by chromosome, using the naming convention: chr<NUMBER>_species.<EXTENSION>
For example: chr4_maize.012, chr4_maize.012.pos, chr4_maize.012.indv.
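The naming convention can be checked with a short script before running an import. The sketch below assumes that chromosomes are numbered 1 through number_of_chromosomes and that all genotype files sit in a single directory. Both are assumptions for illustration, and the function names are hypothetical.

```python
from pathlib import Path

# The three per-chromosome genotype file extensions listed above.
GENOTYPE_EXTENSIONS = ("012", "012.indv", "012.pos")


def expected_genotype_files(species, number_of_chromosomes):
    """List the file names implied by chr<NUMBER>_species.<EXTENSION>."""
    return [
        f"chr{number}_{species}.{extension}"
        for number in range(1, number_of_chromosomes + 1)
        for extension in GENOTYPE_EXTENSIONS
    ]


def missing_genotype_files(data_dir, species, number_of_chromosomes):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [
        name
        for name in expected_genotype_files(species, number_of_chromosomes)
        if not (root / name).is_file()
    ]


names = expected_genotype_files("maize", 10)
print(len(names), names[:3])
```

For the maize example, missing_genotype_files("data/maize282", "maize", 10) would report any of the 30 expected files that are absent, provided the directory layout matches the assumption above.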