bulik/ldsc

Is it necessary to use the --merge-alleles parameter when doing genetic correlation in ldsc?

Opened this issue · 11 comments

Hi,
I have two sumstats, before doing genetic correlation, I have intersected the SNPs of the two data and generated the sumstats.gz files. When I do rg, it reports this error:

./ldsc.py
--ref-ld-chr eur_w_ld_chr/
--out ...
--rg ...sumstats.gz,...sumstats.gz
--w-ld-chr eur_w_ld_chr/

Reading summary statistics from ...1_sumstats.gz ...
Read summary statistics for 2629833 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1290028 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1290028 SNPs.
After merging with reference panel LD, 522341 SNPs remain.
After merging with regression SNP LD, 522341 SNPs remain.
Computing rg for phenotype 2/2
Reading summary statistics from ...2_sumstats.gz ...
Read summary statistics for 2689050 SNPs.
After merging with summary statistics, 522329 SNPs remain.
522328 SNPs with valid alleles.
ERROR computing rg for phenotype 2/2, from file ...2_sumstats.gz.
Traceback (most recent call last):
File "...sumstats.py", line 409, in estimate_rg
loop = _read_other_sumstats(args, log, p2, sumstats, ref_ld_cnames)
File "...sumstats.py", line 441, in _read_other_sumstats
loop['Z2'] = _align_alleles(loop.Z2, alleles)
File "...sumstats.py", line 517, in _align_alleles
raise KeyError(msg)
KeyError: 'Incompatible alleles in .sumstats files: AGAC. Did you forget to use --merge-alleles with munge_sumstats.py?'

When I use the --merge-alleles to match w_hm3.snplist, it can work. I think this may be caused by the unequal number of SNPs in the two sumstats.gz data. But when I use w_hm3.snplist, although the total number of SNPs written is the same, the number of SNPs with effect sizes is also inconsistent.The following is the number of SNPs written after using w_hm3.snplist for two sumstats.gz data:
1:Writing summary statistics for 1217311 SNPs (524794 with nonmissing beta)
2:Writing summary statistics for 1217311 SNPs (532061 with nonmissing beta)
.How can this problem be solved? Do I have to use the --merge-alleles parameter?

@dqq0404 The code does not properly handle indels. The simplest solution is to remove indels from your input data.

@dqq0404 The code does not properly handle indels. The simplest solution is to remove indels from your input data.

Thanks!! I have solved this problem. I want to ask that can I use the parameter--ref-ld-chr eur_w_ld_chr/ to calculate the gcov_intercept in order to calculate potential sample overlap? Or do I have to calculate my own ldsccore as a template?

@dqq0404 No, you cannot use ldsc to calculate sample overlap since it assumes there was none.

The simplest way to estimate sample overlap from summary statistics is to compute the correlation between null z-scores (absolute value < 2).

@dqq0404 You can calculate the intercept of the regression using reference LD scores.

Note that LAVA is estimating the variance-covariance matrix of the effect sizes, not the sample overlap.

If ldsc is applied on non-EUR gwas summary data, does w_hm3.snplist also work? (--merge-alleles w_hm3.snplist)

@leoarrow1 Yes, assuming that you are using reference weights computed for the same SNPs in w_hm3.snplist using reference genotypes of individuals of similar genetic ancestry.

@leoarrow1 Yes, assuming that you are using reference weights computed for the same SNPs in w_hm3.snplist using reference genotypes of individuals of similar genetic ancestry.

Thanks for prompt response! However, when I applied this to non-EUR GWAS summary statistics, it would remove the majority SNPs when using --merge-alleles w_hm3.snplist. These are two example which are intended for ldsc.py --rg $trait1ldsc,$trait2ldsc.

# non-EUR was summary example 1 (records in $ldsc/munge_sumstats.py)
...
Read 8421976 SNPs from --sumstats file.
Removed 7359129 SNPs not in --merge-alleles.
Removed 137645 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 167 variants that were not SNPs or were strand-ambiguous.
925035 SNPs remain.
...
# non-EUR was summary example 2
...
Read 10080803 SNPs from --sumstats file.
Removed 8895794 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 535 variants that were not SNPs or were strand-ambiguous.
1184474 SNPs remain.
...

Is this normal for LDSC analysis? Does the remained SNPs (the minority of all SNPs) still have enough power to perform LDSC analysis? Any thought is appreciated!

I think it is normal.Because when I use the reference of European 1000G ldscore without using --merge-alleles w_hm3.snplist, the number of remaining SNPs is close to yours, and I found that the reference genome of Europeans has retained the SNP of w_hm3_no_MHC. If it is for non-EUR, I think it should be the same. You can try using the non-EUR reference template without the --merge-alleles parameter to see if the number of retained SNPs is similar.

@leoarrow1 These numbers are expected. Only SNPs for which weights are computed are used.

The total number of variants present in weights.EAS.hm3_noMHC (https://zenodo.org/records/7768714/files/1000G_Phase3_EAS_weights_hm3_no_MHC.tgz?download=1) is 1071448.

The total number of variants present in 1000G_Phase3_weights_hm3_no_MHC (https://zenodo.org/records/7768714/files/1000G_Phase3_weights_hm3_no_MHC.tgz?download=1) is 1187349.