Ripkelab/ricopili

daner Direction field potentially unreliable

rkwalters opened this issue · 1 comments

Summary

In some cases metaber7 flips the sign of the meta-analysis result to keep the effect for the intended A1/A2, but does not also flip signs in the Direction field.

Details

The workflow for metaber7 is roughly:

  1. Built a metal script for the meta-analysis
  2. Read through each input GWAS file to build an index of descriptive info on each variant (e.g. chr/bp, a1/a2, info, freq). For the sake of tracking frequency, A1/A2 are taken from the first time the variant is encountered.
  3. Create a "clean" copy of each input GWAS (excludes extreme ORs and SNPs with A1/A2 discordant with the indexed A1/A2.
  4. Run meta-analysis using metal with the clean GWASs
  5. Read through metal results, and write daner output with the meta-analysis results plus the info that metal isn't set to track (case/control freq, info, sample size, etc).

For step 5, it is possible that the order of A1/A2 reported by metal is not the order of A1/A2 indexed by metaber7 in step 2 (see Background below). In order to ensure that the allele frequency information added to the final output is correct, metaber7 checks for this swap. If the alleles have been swapped, metaber7 flips the sign of the effect in the meta-analysis results to match the A1/A2 order indexed in step 2. This maintains A1/A2 in the order indexed from the output GWAS, and reports the corresponding freq and effect size.

This is all reasonable, but the meta-analysis results also include a "Direction" field that reports the sign of the effect from each cohort in the meta-analysis. This field gets passed through directly from the metal output, and thus always reports the sign for the A1/A2 ordering in metal even when the the final output is reported a swapped order with the sign of the effect size flipped.

Example

Adapted from observed data, with numbers rounded/obscured to protect actual results

Dataset 1:

CHR SNP BP A1 A2 FRQ_A_500 FRQ_U_10000 INFO OR SE P ngt
1 rs123456789 87654321 G T 0.01000000000000000 0.02000000000000000 0.90000 1.0100000000000 0.3000000 0.9000000 0

Dataset 2:

CHR SNP BP A1 A2 FRQ_A_1500 FRQ_U_10000 INFO OR SE P ngt
1 rs123456789 87654321 G T 0.01000000000000000 0.02000000000000000 0.90000 0.9500000000000 0.2000000 0.7000000 0

Metal output:

MarkerName	Allele1	Allele2	Effect	StdErr	P-value	Direction	HetISq	HetChiSq	HetDf	HetPVal
rs123456789	t	g	0.0900	0.2000	0.500	-+	0.0	0.100	1	0.7000

Note: effect of T allele is positive, with Direction "-+" in the two cohorts, matching the above input with A1/A2 swapped.

Daner output:

CHR	SNP	BP	A1	A2	FRQ_A_5000	FRQ_U_10000	INFO	OR	SE	P	ngt	Direction	HetISqt	HetDf	HetPVa	Nca	Nco	Neff
1	rs123456789	87654321	G	T	0.01000	0.02000	0.900	0.9000	0.2000	0.5000	0	-+	0.0	1	0.7000	2000	20000	8000.00

Note: A1/A2 swapped to match input GWAS, and reports negative effect (OR < 1) of G allele. Direction field still reports "-+", which is result for T allele rather than G allele.

Possible Solutions

The relevant code block in metaber7 for step 5 is lines 923-1056. Should be relatively straight-forward to fix in one of two ways:

  1. Update the Direction field when flipping the effect size from metal. This update could be done along the effect size flip at line 956. It would require parsing the Direction field string to make the relevant substitutions. Printing this updated string would occur at line 1019 (in place of $cells[6]).

  2. Take A1/A2 order as reported by metal, and flip the indexed allele frequency to match. In place of the effect size flip at 956, frequencies fca{}/nca{} and fco{}/nco{} currently printed lins 996-1000 wound need to be updated. Also, the A1/A2 from metal (e.g. $a1_name) would need to be printed rather than the indexed alleles (e.g. $a1{$snp_name}) at line 975-976.

Not sure which of these might be preferred based on any impacts on downstream analysis/processing.

Additional Background

This issue arises in part due to strange allele swapping behavior from metal. While metal documentation states that the reported Allele1 is the first allele from the first occurrence of the marker in the input files, it appears this isn't actually the case. This alternative instruction file instead says the "lowest numeric reference allele" with be used if the input files differ, which seems closer to accurate, but still not quite right.

The relevant code block from the metal source code is:

bool FlipAlleles(String & al1, String & al2, double & effect, double & freq)
   {
   al1.ToLower();
   al2.ToLower();

   if (al1 > al2)
      {
      al1.Swap(al2);
      effect *= -1.0;
      freq = 1.0 - freq;
      }

   if (al1 == "a" || al1 == "c" && al2 == "g")
      return false;

   FlipAllele(al1);
   FlipAllele(al2);

   if (al1 > al2)
      {
      al1.Swap(al2);
      effect *= -1.0;
      freq = 1.0 - freq;
      }

   return true;
   }

This gets run for the first occurrence of every marker.

If it returns true, then FlipAllele(al1) and FlipAllele(al2) are run again prior to final output. Because this code actually alters al1 and al2, any switching from al1.Swap(al2) will propagate though to metal results files.

By my read, this means markers where the first occurrence has alleles C/A, C/T, G/A, G/C, G/T, and T/A markers will have their allele order swapped, while A/C, A/G, A/T, C/G, T/C, and T/G markers will have their allele orders preserved. Looking through some results from metal seems to confirm this. It's also consistent with the motivating example above that swaps G/T to T/G.

Thank you for the explanation about how METAL actually handles the flipping of the alleles. It was baffling me when I was comparing my results with what I thought they should look like after referencing the METAL documentation. Your interpretation of METAL's behavior is consistent with what I see in my results.

Do you have any sense of why METAL handles the flipping of the alleles this way?