rgcgithub/regenie

Document output format with `--htp` flag for region-based regenie step 2


The output format for gene/region-based regenie step 2 [edit: when using --htp] does not appear to be documented; https://rgcgithub.github.io/regenie/options/#output_1 just says it is "the same output format mentioned above."

With --build-mask sum, [edit: as mentioned in #336, --htp is intentionally ignored in this case, so] the columns have the same names as a single-variant run:

CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ N TEST BETA SE CHISQ LOG10P EXTRA

but they are used in ways that don't quite match the header. Empirically, it looks like:

  • CHROM and GENPOS are the coordinates of the first variant in the region
  • ID is <gene name>.<mask name>
  • ALLELE0 is "ref"
  • ALLELE1 is the mask name
  • A1FREQ is presumably calculated after combining all alleles into a pseudo-allele as specified by --build-mask (right?)
  • TEST is "ADD" (at least with all the input parameters I tried)
  • EXTRA is usually NA; presumably it's that "additional column included to specify if Firth/SPA corrections failed."
  • The rest of the columns are exactly what they say on the tin.
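For anyone else consuming this format downstream, here is a minimal Python sketch of how a row could be parsed under the interpretation above. It assumes the file is whitespace-delimited and that the mask name contains no dots, so that ID can be split on its last "."; neither assumption is confirmed by the documentation. `parse_mask_row` is a hypothetical helper, and the example row is made up for illustration, not real regenie output.

```python
# Column names copied from the header shown above.
COLUMNS = "CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ N TEST BETA SE CHISQ LOG10P EXTRA".split()


def parse_mask_row(line):
    """Turn one whitespace-delimited result line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # Assumption: ID is "<gene name>.<mask name>" and the mask name has no dot,
    # so splitting on the LAST dot keeps gene names that contain dots intact.
    gene, _, mask = row["ID"].rpartition(".")
    row["GENE"] = gene
    row["MASK"] = mask
    return row


# Made-up row for illustration only (not real regenie output).
example = "1 12345 GENE1.M1 ref M1 0.01 1000 ADD 0.5 0.1 25.0 6.2 NA"
row = parse_mask_row(example)
print(row["GENE"], row["MASK"], row["ALLELE1"])
```

Under these assumptions, MASK and ALLELE1 carry the same value, which is a cheap consistency check on the interpretation.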

With [edit: --htp and] --build-mask max or --build-mask comphet, however, the output format is completely different, and now includes the following columns:

Name Chr Pos Ref Alt Trait Cohort Model Effect LCI_Effect UCI_Effect Pval AAF Num_Cases Cases_Ref Cases_Het Cases_Alt Num_Controls Controls_Ref Controls_Het Controls_Alt Info

As far as I can tell:

  • Name is <gene name>.<mask name>
  • Chr and Pos are again the coordinates of the first variant
  • Ref is "ref"
  • Alt is the mask
  • Trait is the phenotype name
  • Cohort is "TEST"
  • Model is something like "ADD-WGR-FIRTH" (I get the gist but I'm not sure what the exact info conveyed here is?)
  • There are now columns for Effect, LCI_Effect (?), and UCI_Effect (?) instead of a single beta
  • Pval is self-explanatory
  • AAF I'm again assuming is calculated based on the combined pseudo-alleles
  • Num_Cases, Cases_Ref, etc. are self-explanatory for binary traits, and for continuous traits the counts seem to be shoehorned in by considering everyone a "case" and leaving the "controls" columns NA
  • Info seems to have a series of semicolon-separated name=value pairs, e.g. REGENIE_BETA=0.026852;REGENIE_SE=0.009518;MAC=160787.000000;SCORE=294.673996;SKATV=10948.247814;LOG10P=2.313457
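As a sketch of how the Info field could be unpacked, assuming it really is a flat list of semicolon-separated name=value pairs as in the example above: `parse_info` is a hypothetical helper, not part of regenie. The conversion at the end relies on LOG10P being -log10 of the p-value, as in the standard regenie output; whether that always equals the Pval column is not something I have verified.

```python
def parse_info(info):
    """Parse 'NAME=value;NAME=value;...' into a dict, converting numbers to float."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing or doubled semicolon
        name, sep, value = item.partition("=")
        if not sep:
            parsed[name] = True  # assumption: a bare token is a flag
            continue
        try:
            parsed[name] = float(value)
        except ValueError:
            parsed[name] = value  # keep non-numeric values as strings
    return parsed


# Info string copied from the example above.
info = ("REGENIE_BETA=0.026852;REGENIE_SE=0.009518;MAC=160787.000000;"
        "SCORE=294.673996;SKATV=10948.247814;LOG10P=2.313457")
parsed = parse_info(info)

# LOG10P is -log10(p), so the p-value is recovered as 10 ** -LOG10P.
pval = 10 ** -parsed["LOG10P"]
print(parsed["REGENIE_BETA"], parsed["REGENIE_SE"], pval)
```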

I'd greatly appreciate it if you could confirm or correct my assumptions here, fill in the gaps, and put all the information into the official documentation. A few "sample output" files could also be quite helpful, especially for designing workflows that take regenie's output as their input.

I'm guessing that LCI_Effect and UCI_Effect are the lower and upper ends of a confidence interval on the beta (aka Effect). It would be nice to have these documented.
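If that guess is right, one way to sanity-check it would be to compare the two columns against a standard 95% Wald interval built from the REGENIE_BETA and REGENIE_SE values in the Info field. The sketch below is only that check, not a statement of how regenie computes these columns; in particular, the 95% level is an assumption, and for binary traits Effect may be reported on the odds-ratio scale, in which case the exponentiated bounds are the ones to compare. `wald_ci` is a hypothetical helper.

```python
import math


def wald_ci(beta, se, z=1.959964):
    """Return (lower, upper) of a Wald interval beta +/- z * se (z defaults to the 95% level)."""
    return beta - z * se, beta + z * se


# Values taken from the Info example earlier in this issue.
beta, se = 0.026852, 0.009518
lower, upper = wald_ci(beta, se)

# Assumption: if Effect is an odds ratio, compare against the exponentiated bounds instead.
or_lower, or_upper = math.exp(lower), math.exp(upper)
print(lower, upper, or_lower, or_upper)
```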

Hi,

The details of the output with --build-mask are specified on the website:
[image: screenshot of the --build-mask output documentation]

What is the full command you are using? (it seems like you may be using the --htp option which is an internal dev command [not documented on the website] & has a different format from the native one)

Cheers,
Joelle

@joellembatchou You are correct, we are using the --htp TEST option. This particular regenie pipeline actually predates my tenure at my company, so I'll ask around and see if anyone else remembers why we did that.

My best guess here is that we specifically wanted the Cases_Ref / Cases_Het / Controls_Alt etc. statistics, which are not available in the standard, documented output format.

So I guess my request is for there to be some documented way to get the Cases_Ref / Cases_Het / Controls_Alt etc. statistics out of regenie: either document the currently-available --htp, or add (an option for) these columns in the standard output format.
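For reference, this is roughly how we pull those count columns out of the --htp output today. It is a minimal sketch that assumes the file is tab-delimited with the header listed earlier and that missing counts are written as NA (as we see in the "controls" columns for continuous traits); none of this is documented. `read_counts` is a hypothetical helper, and the row is made up for illustration, including the placeholder Model value.

```python
import csv
import io

COUNT_COLUMNS = [
    "Num_Cases", "Cases_Ref", "Cases_Het", "Cases_Alt",
    "Num_Controls", "Controls_Ref", "Controls_Het", "Controls_Alt",
]


def read_counts(handle):
    """Yield (Name, counts) per row; counts maps column name to float, or None for NA."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-delimited
    for rec in reader:
        counts = {
            col: (None if rec[col] == "NA" else float(rec[col]))
            for col in COUNT_COLUMNS
        }
        yield rec["Name"], counts


HEADER = ("Name Chr Pos Ref Alt Trait Cohort Model Effect LCI_Effect UCI_Effect "
          "Pval AAF Num_Cases Cases_Ref Cases_Het Cases_Alt Num_Controls "
          "Controls_Ref Controls_Het Controls_Alt Info").split()

# Made-up continuous-trait row for illustration only (not real regenie output).
ROW = ("GENE1.M1 1 12345 ref M1 PHENO1 TEST MODEL_PLACEHOLDER 0.5 0.3 0.7 "
       "0.001 0.01 1000 980 20 0 NA NA NA NA MAC=20").split()

text = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"
results = list(read_counts(io.StringIO(text)))
print(results)
```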