This is a companion repo for gwaslab.
A collection of commonly used formats for GWAS summary statistics.
All the formats are stored as json files.
Each format consists of the following info (manually curated):
meta_data
: metadata, including software name, source URLs, version and so on.
format_dict
: target format to gwaslab format column-name conversion dictionary
For example, the format for the METAL software:
{
"meta_data":{"format_name":"metal",
"format_source":"https://genome.sph.umich.edu/wiki/METAL_Documentation",
"format_version":"20220726"
},
"format_dict":{
"MarkerName":"SNPID",
"Allele1":"EA",
"Allele2":"NEA",
"Freq1":"EAF",
"Effect":"BETA",
"StdErr":"SE",
"P-value":"P",
"Direction": "DIRECTION"
}
}
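As a minimal sketch of how such a file could be used (this is not gwaslab's own loading code), the snippet below parses the METAL example and applies `format_dict` to a list of column names. The helper `rename_columns` and the column `ExtraColumn` are hypothetical names introduced for illustration only.

```python
import json

# The METAL format definition shown above, parsed from its JSON text.
metal_format = json.loads("""
{
  "meta_data": {
    "format_name": "metal",
    "format_source": "https://genome.sph.umich.edu/wiki/METAL_Documentation",
    "format_version": "20220726"
  },
  "format_dict": {
    "MarkerName": "SNPID",
    "Allele1": "EA",
    "Allele2": "NEA",
    "Freq1": "EAF",
    "Effect": "BETA",
    "StdErr": "SE",
    "P-value": "P",
    "Direction": "DIRECTION"
  }
}
""")


def rename_columns(header, mapping):
    """Hypothetical helper: map column names, leaving unknown ones unchanged."""
    return [mapping.get(col, col) for col in header]


# Header of a METAL-style file; "ExtraColumn" is a made-up column that is
# not listed in format_dict.
metal_header = ["MarkerName", "Allele1", "Allele2", "Freq1", "Effect",
                "StdErr", "P-value", "Direction", "ExtraColumn"]

# METAL column names -> gwaslab column names.
gwaslab_header = rename_columns(metal_header, metal_format["format_dict"])

# Inverting the dictionary gives the gwaslab -> METAL direction.
to_metal = {v: k for k, v in metal_format["format_dict"].items()}
round_trip = rename_columns(gwaslab_header, to_metal)

print(gwaslab_header)
```

Columns missing from `format_dict` are passed through untouched in this sketch; a stricter reader might drop them or raise an error instead.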
Supported formats:
ssf
: GWAS-SSF
gwascatalog
: GWAS Catalog format
pgscatalog
: PGS Catalog format
plink
: PLINK output format
plink2
: PLINK2 output format
saige
: SAIGE output format
regenie
: output format
fastgwa
: output format
metal
: output format
mrmega
: output format
fuma
: input format
ldsc
: input format
locuszoom
: input format
vcf
: gwas-vcf format
bolt_lmm
: output format
- GWAS-SSF
- CITATION: Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv.
- GWAS Catalog
- SOURCE: https://www.ebi.ac.uk/gwas/docs/summary-statistics-format
- CITATION: Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., ... & Parkinson, H. (2019). The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1), D1005-D1012.
- metal
- SOURCE: https://genome.sph.umich.edu/wiki/METAL_Documentation
- CITATION: Willer, C. J., Li, Y., & Abecasis, G. R. (2010). METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26(17), 2190-2191.
- pgscatalog
- SOURCE: https://www.pgscatalog.org/downloads/#dl_ftp_scoring
- CITATION: Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.
- saige
- SOURCE: https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE#output-file
- CITATION: Zhou, W., Nielsen, J. B., Fritsche, L. G., Dey, R., Gabrielsen, M. E., Wolford, B. N., ... & Lee, S. (2018). Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature genetics, 50(9), 1335-1341.
- regenie
- SOURCE: https://rgcgithub.github.io/regenie/options/#output
- CITATION: Mbatchou, J., Barnard, L., Backman, J., Marcketta, A., Kosmicki, J. A., Ziyatdinov, A., ... & Marchini, J. (2021). Computationally efficient whole-genome regression for quantitative and binary traits. Nature genetics, 53(7), 1097-1103.
- plink
- SOURCE: https://www.cog-genomics.org/plink/1.9/formats
- CITATION: Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
- plink2
- SOURCE: https://www.cog-genomics.org/plink/2.0/formats
- CITATION: Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4(1), s13742-015.
- fastgwa
- SOURCE: https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
- CITATION: Jiang, L., Zheng, Z., Qi, T., Kemper, K. E., Wray, N. R., Visscher, P. M., & Yang, J. (2019). A resource-efficient tool for mixed model association analysis of large-scale data. Nature genetics, 51(12), 1749-1755.
- mrmega
- SOURCE: https://genomics.ut.ee/en/tools
- CITATION: Mägi, R., Horikoshi, M., Sofer, T., Mahajan, A., Kitajima, H., Franceschini, N., ... & Morris, A. P. (2017). Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Human molecular genetics, 26(18), 3639-3650.
- fuma
- SOURCE: https://fuma.ctglab.nl/tutorial#snp2gene
- CITATION: Watanabe, K., Taskesen, E., Van Bochoven, A., & Posthuma, D. (2017). Functional mapping and annotation of genetic associations with FUMA. Nature communications, 8(1), 1-11.
- ldsc
- SOURCE: https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation
- CITATION: Bulik-Sullivan, B. K., Loh, P. R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., ... & Neale, B. M. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics, 47(3), 291-295.
- locuszoom
- SOURCE: https://my.locuszoom.org/about/
- CITATION: Pruim, R. J., Welch, R. P., Sanna, S., Teslovich, T. M., Chines, P. S., Gliedt, T. P., ... & Willer, C. J. (2010). LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics, 26(18), 2336-2337.
- vcf
- SOURCE: https://github.com/MRCIEU/gwas-vcf-specification
- CITATION: Lyon, M. S., Andrews, S. J., Elsworth, B., Gaunt, T. R., Hemani, G., & Marcora, E. (2021). The variant call format provides efficient and robust storage of GWAS summary statistics. Genome biology, 22(1), 1-10.
- bolt_lmm
- SOURCE: https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html
- CITATION: Loh, P. R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., ... & Price, A. L. (2015). Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nature genetics, 47(3), 284-290.
Future update: fields to be added to meta_data:
format_cite_name
: formal name of the format, e.g. GWAS-SSF v0.1
format_separator
: separator used in the format, e.g. \t
format_na
: NA notation in the format, e.g. #NA
format_comment
: comment line prefix, e.g. #
format_col_order
: column order
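Since these fields are not implemented yet, the following is only a sketch of how a reader might use them once they exist. The `meta_data` values, the sample text, and the function `read_sumstats` are all hypothetical and chosen for illustration; only the field names come from the list above.

```python
import csv

# Hypothetical meta_data using the proposed fields; values are illustrative.
meta_data = {
    "format_name": "example",
    "format_separator": "\t",
    "format_na": "#NA",
    "format_comment": "#",
    "format_col_order": ["SNPID", "EA", "NEA", "P"],
}

# Made-up summary statistics: one comment line, a header, two records.
text = (
    "# example comment line\n"
    "SNPID\tP\tEA\tNEA\n"
    "rs1\t0.05\tA\tG\n"
    "rs2\t#NA\tC\tT\n"
)


def read_sumstats(text, meta_data):
    """Hypothetical reader driven entirely by the proposed meta_data fields."""
    # Skip empty lines and lines starting with the comment prefix.
    # Note: this assumes no data line starts with that prefix.
    lines = [
        line for line in text.splitlines()
        if line and not line.startswith(meta_data["format_comment"])
    ]
    reader = csv.DictReader(lines, delimiter=meta_data["format_separator"])
    rows = []
    for row in reader:
        # Replace the format's NA notation with None.
        cleaned = {
            key: (None if value == meta_data["format_na"] else value)
            for key, value in row.items()
        }
        # Reorder columns according to format_col_order.
        rows.append({col: cleaned.get(col) for col in meta_data["format_col_order"]})
    return rows


rows = read_sumstats(text, meta_data)
print(rows)
```

One point the sketch exposes: if `format_na` and `format_comment` share a leading character, as in the examples above, a reader has to check the comment prefix only at the start of a line, and a data line whose first field is missing would be skipped as a comment.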