jianyangqt/gcta

Error: there are too many SNPs that have large difference in allele frequency


Hi @anglixue,

I have been trying to run mtCOJO on a trait (GWAS summary available here: https://figshare.com/articles/dataset/scz2022/19426775?file=34517828) while adjusting for another trait (GWAS summary available here: https://conservancy.umn.edu/handle/11299/241912 filename: GSCAN_CigDay_2022_GWAS_SUMMARY_STATS_EUR.txt.gz) using GCTA v1.94.1 and this is the error message I received:

Error: there are too many SNPs that have large difference in allele frequency. Please check the GWAS summary data.
An error occurs, please check the options or data

This is the command I used:
./gcta --bfile /home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes --mtcojo-file data_list_scz_cigday.txt --ref-ld-chr /home/gcta/eur_w_ld_chr/ --w-ld-chr /home/gcta/eur_w_ld_chr/ --out mtcojo_scz_cigday

This is the log file:


  • Genome-wide Complex Trait Analysis (GCTA)
  • version v1.94.1 Linux
  • Built at Nov 15 2022 21:14:25, by GCC 8.5
  • (C) 2010-present, Yang Lab, Westlake University
  • Please report bugs to Jian Yang jian.yang@westlake.edu.cn

Analysis started at 16:32:44 PST on Mon Nov 27 2023.
Hostname: tscc-4-60.sdsc.edu

Accepted options:
--bfile /home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes
--mtcojo-file data_list_scz_cigday.txt
--ref-ld-chr /home/gcta/eur_w_ld_chr/
--w-ld-chr /home/gcta/eur_w_ld_chr/
--out mtcojo_scz_cigday

Reading PLINK FAM file from [/home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.fam].
2504 individuals to be included from [/home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.fam].
Reading PLINK BIM file from [/home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bim].
80845844 SNPs to be included from [/home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bim].

Reading GWAS summary data from [data_list_scz_cigday.txt] ...
7341181 SNPs in common between the target trait and the covariate trait(s).
Filtering out SNPs with multiple alleles or missing value ...
864 SNPs have missing value or mismatched alleles. These SNPs have been saved in [mtcojo_scz_cigday.badsnps].
7340317 SNPs are retained after filtering.
There are 3888 genome-wide significant SNPs with p < 5.0e-08.

Reading PLINK BED file from [/home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bed] in SNP-major format ...
Genotype data for 2504 individuals and 3888 SNPs to be included from [/home/1000G/ALL.chr1-22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bed].
Calculating allele frequencies ...
Checking the difference in allele frequency between the GWAS summary datasets and the LD reference sample...
5478219 SNP(s) have large difference of allele frequency between the GWAS summary data and the reference sample. These SNPs have been saved in [mtcojo_scz_cigday.freq.badsnps].
Error: there are too many SNPs that have large difference in allele frequency. Please check the GWAS summary data.
An error occurs, please check the options or data

Could you please help me solve this issue?

Thanks!
Shreya

Have you checked the definition of the effect allele in your summary data? The effect alleles may be mismatched between the GWAS and the LD reference. In the .ma format, A1 is the effect allele (see https://yanglab.westlake.edu.cn/software/gcta/#COJO).
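As an illustration of that check, here is a minimal sketch of mapping one summary-statistics record into the COJO-style .ma column order (SNP, A1, A2, freq, b, se, p, N) so that A1 is the effect allele and freq refers to A1. The input keys (`snp`, `effect_allele`, `other_allele`, `freq`, `beta`, `se`, `p`, `n`) and the `freq_is_for_effect_allele` switch are hypothetical; real GWAS files use their own headers, and which allele the frequency column describes has to be confirmed from each study's README.

```python
# Sketch only: the input keys below are hypothetical, not the headers of any real file.
MA_HEADER = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def to_ma_row(row, freq_is_for_effect_allele=True):
    """Build one .ma-style record with A1 as the effect allele."""
    freq = float(row["freq"])
    if not freq_is_for_effect_allele:
        # The source reports the frequency of the other allele, so flip it
        # to obtain the frequency of the effect allele (A1).
        freq = 1.0 - freq
    return [
        row["snp"],
        row["effect_allele"].upper(),
        row["other_allele"].upper(),
        f"{freq:.6g}",
        row["beta"],
        row["se"],
        row["p"],
        row["n"],
    ]


example = {
    "snp": "rs1", "effect_allele": "a", "other_allele": "G",
    "freq": "0.8", "beta": "0.02", "se": "0.01", "p": "0.045", "n": "100000",
}
print("\t".join(MA_HEADER))
print("\t".join(to_ma_row(example, freq_is_for_effect_allele=False)))
```

If the frequency column turns out to describe the non-effect allele in only one of the two input files, this kind of flip would need to be applied to that file alone.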

Hi there, I am getting the same error, and my effect allele frequency column seems to be correct! I was wondering if you managed to solve this issue?

Thanks,
Aydan

Hi Aydan,
Usually this error is due to an inconsistent definition of allele frequency in the user's own GWAS summary data, even if the allele frequency column seems to be correct.
I would suggest generating the frequency file from the LD reference you provided using PLINK's allele-frequency function, and then comparing that file with the frequencies in your own GWAS summary data.
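A minimal sketch of that comparison, assuming the reference frequencies come from PLINK 1.9 `--freq` (a `.frq` file with columns CHR, SNP, A1, A2, MAF, NCHROBS, where MAF is the frequency of A1) and the GWAS data are in .ma-style columns (SNP, A1, A2, freq). The function names and the 0.2 threshold are illustrative choices here, not values taken from mtCOJO itself, and strand flips (complementary alleles) are not handled.

```python
# Reference frequencies could be produced with, for example:
#   plink --bfile <reference_prefix> --freq --out ref
# which writes ref.frq. The names below are illustrative.

def read_table(path):
    """Read a whitespace-delimited file with a header row into a list of dicts."""
    with open(path) as handle:
        header = handle.readline().split()
        return [dict(zip(header, line.split())) for line in handle if line.strip()]


def compare_freqs(ref_rows, gwas_rows, threshold=0.2):
    """Compare the GWAS A1 frequency with the reference frequency of the same allele.

    Returns (flagged, mismatched):
      flagged    - (SNP, gwas_freq, ref_freq) where the absolute difference exceeds threshold
      mismatched - SNP IDs whose allele pair does not match the reference
    """
    ref = {r["SNP"]: r for r in ref_rows}
    flagged, mismatched = [], []
    for g in gwas_rows:
        r = ref.get(g["SNP"])
        if r is None or r["MAF"] == "NA":
            continue  # SNP absent from the reference, or no frequency available
        a1, a2 = g["A1"].upper(), g["A2"].upper()
        if (a1, a2) == (r["A1"], r["A2"]):
            ref_freq = float(r["MAF"])        # same A1 in both files
        elif (a1, a2) == (r["A2"], r["A1"]):
            ref_freq = 1.0 - float(r["MAF"])  # A1 and A2 are swapped in the reference
        else:
            mismatched.append(g["SNP"])
            continue
        gwas_freq = float(g["freq"])
        if abs(gwas_freq - ref_freq) > threshold:
            flagged.append((g["SNP"], gwas_freq, ref_freq))
    return flagged, mismatched


# Toy records standing in for read_table("ref.frq") and read_table("trait.ma").
ref_rows = [
    {"SNP": "rs1", "A1": "A", "A2": "G", "MAF": "0.10"},
    {"SNP": "rs2", "A1": "C", "A2": "T", "MAF": "0.30"},
    {"SNP": "rs3", "A1": "A", "A2": "C", "MAF": "0.20"},
]
gwas_rows = [
    {"SNP": "rs1", "A1": "G", "A2": "A", "freq": "0.12"},  # freq looks like it is for A, not G
    {"SNP": "rs2", "A1": "C", "A2": "T", "freq": "0.28"},
    {"SNP": "rs3", "A1": "G", "A2": "T", "freq": "0.25"},  # alleles do not match the reference
]
flagged, mismatched = compare_freqs(ref_rows, gwas_rows)
print("flagged:", flagged)
print("mismatched:", mismatched)
```

In the toy data, rs1 is the pattern to look for: its frequency is close to the reference MAF, yet it belongs to the other allele, so the aligned difference is large. If most SNPs are flagged this way, the frequency column probably refers to the non-effect allele.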

Thank you for your response! Would this mean that there are inconsistencies in the frequency (effect allele) column, or also in the definition of A1 and A2? Could it be because I am using 1000G as the reference for individual-level genotypes instead of the genotype data from the relevant GWAS studies (these are not publicly available)?

I have also calculated the distribution of the differences in frequency. What threshold does mtCOJO use for the difference? Comparing the GWAS with 1000G, it seems that fewer than 1% of SNPs have a frequency difference of more than 0.1.

Here, interval is the difference between the MAF from 1000G and the MAF from the GWAS, and count is the number of SNPs.
[Image: screenshot of the interval/count table, 2024-09-10]
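A tabulation like the one described above can be sketched as follows. The function names and the choice of ten equal-width bins are assumptions for illustration. One caveat: a difference in MAF is unchanged when A1 and A2 are swapped (because MAF is min(f, 1 - f)), so a MAF-based table can look clean even when the frequency of the effect allele differs substantially from the reference.

```python
from collections import Counter


def bin_differences(diffs, n_bins=10):
    """Count absolute frequency differences in equal-width intervals over [0, 1].

    The last interval also includes a difference of exactly 1.0.
    """
    counts = Counter(min(int(abs(d) * n_bins), n_bins - 1) for d in diffs)
    return {
        f"[{i / n_bins:.1f}, {(i + 1) / n_bins:.1f})": counts.get(i, 0)
        for i in range(n_bins)
    }


def fraction_above(diffs, threshold):
    """Fraction of SNPs whose absolute difference exceeds the threshold."""
    return sum(abs(d) > threshold for d in diffs) / len(diffs)


# Toy differences (GWAS frequency minus reference frequency), not real data.
diffs = [0.01, -0.03, 0.05, 0.12, -0.45]
for interval, count in bin_differences(diffs).items():
    print(interval, count)
print("fraction above 0.1:", fraction_above(diffs, 0.1))
```

Running the same tabulation on differences in the frequency of the effect allele, rather than on MAF, shows whether the two views of the data agree.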