daner Direction field potentially unreliable
rkwalters opened this issue · 1 comments
Summary
In some cases metaber7
flips the sign of the meta-analysis result to keep the effect for the intended A1/A2, but does not also flip signs in the Direction field.
Details
The workflow for metaber7
is roughly:
- Built a
metal
script for the meta-analysis - Read through each input GWAS file to build an index of descriptive info on each variant (e.g. chr/bp, a1/a2, info, freq). For the sake of tracking frequency, A1/A2 are taken from the first time the variant is encountered.
- Create a "clean" copy of each input GWAS (excludes extreme ORs and SNPs with A1/A2 discordant with the indexed A1/A2.
- Run meta-analysis using
metal
with the clean GWASs - Read through
metal
results, and write daner output with the meta-analysis results plus the info thatmetal
isn't set to track (case/control freq, info, sample size, etc).
For step 5, it is possible that the order of A1/A2 reported by metal
is not the order of A1/A2 indexed by metaber7
in step 2 (see Background below). In order to ensure that the allele frequency information added to the final output is correct, metaber7
checks for this swap. If the alleles have been swapped, metaber7
flips the sign of the effect in the meta-analysis results to match the A1/A2 order indexed in step 2. This maintains A1/A2 in the order indexed from the output GWAS, and reports the corresponding freq and effect size.
This is all reasonable, but the meta-analysis results also include a "Direction" field that reports the sign of the effect from each cohort in the meta-analysis. This field gets passed through directly from the metal
output, and thus always reports the sign for the A1/A2 ordering in metal
even when the the final output is reported a swapped order with the sign of the effect size flipped.
Example
Adapted from observed data, with numbers rounded/obscured to protect actual results
Dataset 1:
CHR SNP BP A1 A2 FRQ_A_500 FRQ_U_10000 INFO OR SE P ngt
1 rs123456789 87654321 G T 0.01000000000000000 0.02000000000000000 0.90000 1.0100000000000 0.3000000 0.9000000 0
Dataset 2:
CHR SNP BP A1 A2 FRQ_A_1500 FRQ_U_10000 INFO OR SE P ngt
1 rs123456789 87654321 G T 0.01000000000000000 0.02000000000000000 0.90000 0.9500000000000 0.2000000 0.7000000 0
Metal output:
MarkerName Allele1 Allele2 Effect StdErr P-value Direction HetISq HetChiSq HetDf HetPVal
rs123456789 t g 0.0900 0.2000 0.500 -+ 0.0 0.100 1 0.7000
Note: effect of T allele is positive, with Direction "-+" in the two cohorts, matching the above input with A1/A2 swapped.
Daner output:
CHR SNP BP A1 A2 FRQ_A_5000 FRQ_U_10000 INFO OR SE P ngt Direction HetISqt HetDf HetPVa Nca Nco Neff
1 rs123456789 87654321 G T 0.01000 0.02000 0.900 0.9000 0.2000 0.5000 0 -+ 0.0 1 0.7000 2000 20000 8000.00
Note: A1/A2 swapped to match input GWAS, and reports negative effect (OR < 1) of G allele. Direction field still reports "-+", which is result for T allele rather than G allele.
Possible Solutions
The relevant code block in metaber7
for step 5 is lines 923-1056. Should be relatively straight-forward to fix in one of two ways:
-
Update the Direction field when flipping the effect size from
metal
. This update could be done along the effect size flip at line 956. It would require parsing the Direction field string to make the relevant substitutions. Printing this updated string would occur at line 1019 (in place of$cells[6]
). -
Take A1/A2 order as reported by
metal
, and flip the indexed allele frequency to match. In place of the effect size flip at 956, frequenciesfca{}/nca{}
andfco{}/nco{}
currently printed lins 996-1000 wound need to be updated. Also, the A1/A2 frommetal
(e.g.$a1_name
) would need to be printed rather than the indexed alleles (e.g.$a1{$snp_name}
) at line 975-976.
Not sure which of these might be preferred based on any impacts on downstream analysis/processing.
Additional Background
This issue arises in part due to strange allele swapping behavior from metal
. While metal
documentation states that the reported Allele1
is the first allele from the first occurrence of the marker in the input files, it appears this isn't actually the case. This alternative instruction file instead says the "lowest numeric reference allele" with be used if the input files differ, which seems closer to accurate, but still not quite right.
The relevant code block from the metal
source code is:
bool FlipAlleles(String & al1, String & al2, double & effect, double & freq)
{
al1.ToLower();
al2.ToLower();
if (al1 > al2)
{
al1.Swap(al2);
effect *= -1.0;
freq = 1.0 - freq;
}
if (al1 == "a" || al1 == "c" && al2 == "g")
return false;
FlipAllele(al1);
FlipAllele(al2);
if (al1 > al2)
{
al1.Swap(al2);
effect *= -1.0;
freq = 1.0 - freq;
}
return true;
}
This gets run for the first occurrence of every marker.
If it returns true
, then FlipAllele(al1)
and FlipAllele(al2)
are run again prior to final output. Because this code actually alters al1
and al2
, any switching from al1.Swap(al2)
will propagate though to metal
results files.
By my read, this means markers where the first occurrence has alleles C/A, C/T, G/A, G/C, G/T, and T/A markers will have their allele order swapped, while A/C, A/G, A/T, C/G, T/C, and T/G markers will have their allele orders preserved. Looking through some results from metal
seems to confirm this. It's also consistent with the motivating example above that swaps G/T to T/G.
Thank you for the explanation about how METAL actually handles the flipping of the alleles. It was baffling me when I was comparing my results with what I thought they should look like after referencing the METAL documentation. Your interpretation of METAL's behavior is consistent with what I see in my results.
Do you have any sense of why METAL handles the flipping of the alleles this way?