Table of Contents
- GENERAL INFORMATION
- CITATION
- DOWNLOAD AND INSTALLATION
- SYNOPSIS
- OPTIONS
- INPUT FILES
  - Phenotype File
  - VCF File
  - Genetic Data File in Plain-Text Format
  - Pairwise Inclusion Probability Matrix
  - File that Contains the File Names of the Pairwise Inclusion Probability Matrices
  - File that Contains the Variants for Conditional Analysis
  - File that Contains Variants' Grouping Information in Gene-based Analysis
  - File that Contains the Subset of Variants to be Analyzed in Single-Variant Analysis
- OUTPUT FILES
- VERSION HISTORY
- CONTACT
SUGEN is a command-line software program written in C++ to implement the weighted and unweighted approaches described by Lin et al. (2014) for various types of association analysis under complex survey sampling. The current version of the program can accommodate continuous, binary, and right-censored time-to-event traits. It can perform single-variant and gene-based association analysis. In single-variant analysis, it can perform standard association analysis, conditional analysis, and gene-environment interaction analysis using Wald statistics. In standard association analysis, we include the SNP of interest and other covariates (if any) as predictors in the regression model. In conditional analysis, we include the SNP of interest, the SNPs that are conditioned on, and other covariates (if any) as predictors in the regression model. In gene-environment interaction analysis, we include the SNP of interest, the environment variables, the interactions between the SNP and environment variables, and other covariates (if any) as predictors in the regression model. In gene-based analysis, it generates the score statistics and covariance matrix for variants in each gene. These summary statistics can be loaded into the software program MASS to perform all commonly used gene-based association tests.
Lin, D. Y., Tao, R., Kalsbeek, W., Zeng, D., Gonzalez, F., Fernández-Rhodes, L., Graff, M., Koch, G., North, K. E., and Heiss, G. (2014). "Genetic Association Analysis Under Complex Survey Sampling: The Hispanic Community Health Study/Study of Latinos", American Journal of Human Genetics, 95(6): 675-688.
The latest version of SUGEN can be downloaded from github or github page.
- Unzip the package.
    unzip SUGEN-master.zip
- Go to the SUGEN directory.
    cd ./SUGEN-master
- Install SUGEN. When successful, an executable called "SUGEN" will be generated in ./SUGEN-master.
    make
SUGEN [--pheno pheno_file] [--formula formula] [--id-col iid] [--family-col fid] \
[--weight-col wt] [--vcf vcf_file.gz] [--genetic-text genetic_text_file] [--dosage] [--probmatrix prob_file] \
[--subset subset_expression] [--unweighted] [--model model] [--robust-variance] \
[--left-truncation left_truncation_time] [--cond cond_file] [--ge envi_covs] [--score] \
[--score-rescale rescale_rule] [--group group_file] [--hetero-variance strata] [--out-prefix out_prefix] \
[--out-zip] [--extract-chr chr] [--extract-range range] [--extract-file extract_file] \
[--ge-output-detail] [--group-maf maf_ub] [--group-callrate cr_lb]
- --pheno pheno_file
  Specifies the phenotype file. The default name is pheno.txt.
- --formula formula
  Specifies the regression formula. In linear or logistic regression, the format of formula is "trait=covariate_1+covariate_2+...+covariate_p". The trait and covariates must appear in pheno_file. If there is no covariate, then we specify the formula as "trait=". In Cox proportional hazards regression, the format of formula is "(time, event)=covariate_1+covariate_2+...+covariate_p". The time, event indicator, and covariates must appear in pheno_file. If there is no covariate, then we specify the formula as "(time, event)=".
- --id-col iid
  Specifies the subject ID column in pheno_file. The default column name is IID.
- --family-col fid
  Specifies the family ID column in pheno_file. The default column name is FID. If study subjects are independent, then we specify the family ID column to be the same as the subject ID column.
- --weight-col wt
  Specifies the weight column in pheno_file. The default column name is WT. This option is ignored if --unweighted is specified.
- --vcf vcf_file.gz
  Specifies the block compressed and indexed VCF file. The default name is geno.vcf.gz.
- --genetic-text genetic_text_file
  Specifies the genetic data file in plain-text format. This option cannot be specified together with any of the following options: --vcf, --dosage, --cond, --extract-chr, --extract-range, --extract-file, --group, --group-maf, --group-callrate, --score, --score-rescale.
- --dosage
  Analyzes dosage data in the VCF file. The dosages must be stored in the DS field of the VCF file. This requirement is the same as in RAREMETALWORKER.
- --probmatrix prob_file
  Specifies the file that contains the file names of the pairwise inclusion probability matrices. The default name is probmatrix.txt. This option is optional in weighted analysis and ignored in unweighted analysis.
- --subset subset_expression
  Restricts analysis to a subset of subjects in pheno_file. For example, to restrict the analysis to subjects whose var_a equals level_1, where var_a is a column in pheno_file and level_1 is one of the values of var_a, we can specify --subset "var_a=level_1".
- --unweighted
  Uses the unweighted approach.
- --model model
  Specifies the regression model. There are three options: linear (linear regression), logistic (logistic regression), and coxph (Cox proportional hazards regression). The default value is linear. In linear or logistic regression, the trait is continuous or binary (0/1), respectively. In Cox proportional hazards regression, the event time is positive, and the event indicator is binary (0/1).
- --robust-variance
  If this option is specified, then the robust variance estimator will be used. Otherwise, the model-based variance estimator will be used.
- --left-truncation left_truncation_time
  Specifies the left truncation time (if any) in Cox proportional hazards regression.
- --cond cond_file
  In single-variant analysis, performs conditional analysis conditioning on the variants included in cond_file. There is no default value for cond_file. The format of the variant IDs in cond_file is chromosome:position. This option is valid only when --score is not specified. In this situation, either --cond cond_file or --ge envi_covs can be specified, but not both. If neither is specified, then standard association analysis is performed.
- --ge envi_covs
  In single-variant analysis, performs gene-environment interaction analysis. envi_covs are the names of the environment variables. The format of envi_covs is covariate_1,covariate_2,...,covariate_k. That is, multiple environment variables are separated by commas. There is no default value for envi_covs. This option is valid only when --score is not specified. In this situation, either --cond cond_file or --ge envi_covs can be specified, but not both. If neither is specified, then standard association analysis is performed.
- --score
  Uses score statistics.
- --score-rescale rescale_rule
  Specifies the method to rescale the score statistics. There are two options: naive and optimal. The default value is naive. This option is valid only when --score is specified.
- --group group_file
  Performs gene-based association analysis. Gene memberships of variants are defined in group_file. There is no default value for group_file. This option is valid only when --score is specified.
- --hetero-variance strata
  Allows the residual variance in linear regression to be different in different levels of strata.
- --out-prefix prefix
  Specifies the prefix of the output files. The default prefix is results.
- --out-zip
  Zips the output files.
- --extract-chr chr
  Restricts single-variant analysis to variants in chromosome chr. This option is valid only when --group group_file is not specified.
- --extract-range range
  Restricts single-variant analysis to variants in chromosome chr and positions in range. The format of range is 1000000-2000000. This option is valid only when --group group_file is not specified and --extract-chr chr is specified.
- --extract-file extract_file
  Restricts single-variant analysis to variants in extract_file. The format of the variant IDs in extract_file is chromosome:position. This option is valid only when --group group_file, --extract-chr chr, and --extract-range range are not specified.
- --ge-output-detail
  In gene-environment interaction analysis, outputs the covariances between the effect estimates of the genetic variant, environment variables, and gene-environment interaction variables. Otherwise, only the variances of these effect estimates are output.
- --group-maf maf_ub
  Specifies the minor allele frequency (MAF) upper bound for gene-based association analysis. maf_ub is a real number between 0 and 1. Its default value is 0.05. Variants with MAFs greater than maf_ub will not be included in the analysis.
- --group-callrate cr_lb
  Specifies the call rate lower bound for gene-based association analysis. cr_lb is a real number between 0 and 1. Its default value is 0. Variants with call rates less than cr_lb will not be included in the analysis.
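To illustrate how the options combine, the following Python sketch assembles a SUGEN argument list and checks a few of the constraints stated above: --cond and --ge are mutually exclusive, neither is valid with --score, and --group requires --score. The helper name build_sugen_command and the covariate names age and sex are hypothetical; only the option names come from this document, and the sketch does not run SUGEN.

```python
import shlex

def build_sugen_command(pheno, formula, vcf, model="linear", unweighted=False,
                        cond=None, ge=None, score=False, group=None,
                        out_prefix="results"):
    """Assemble a SUGEN argument list (hypothetical helper, not part of SUGEN)."""
    if cond and ge:
        raise ValueError("--cond and --ge cannot be specified together")
    if score and (cond or ge):
        raise ValueError("--cond and --ge are valid only without --score")
    if group and not score:
        raise ValueError("--group is valid only when --score is specified")
    if model not in ("linear", "logistic", "coxph"):
        raise ValueError("model must be linear, logistic, or coxph")

    args = ["SUGEN", "--pheno", pheno, "--formula", formula,
            "--vcf", vcf, "--model", model, "--out-prefix", out_prefix]
    if unweighted:
        args.append("--unweighted")
    if cond:
        args += ["--cond", cond]
    if ge:
        # Multiple environment variables are joined by commas.
        args += ["--ge", ",".join(ge)]
    if score:
        args.append("--score")
    if group:
        args += ["--group", group]
    return args

# Gene-environment interaction analysis for a binary trait (made-up column names).
cmd = build_sugen_command("pheno.txt", "trait=age+sex", "geno.vcf.gz",
                          model="logistic", ge=["sex"])
print(shlex.join(cmd))
```

The printed line can be pasted into a shell once the file names and column names are replaced with real ones.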
The phenotype file should be tab-delimited. Missing data are denoted by NA. The rows represent study subjects. The 1st row is the header line. This file should include the subject ID column, family ID column (unless the subjects are independent), weight column (unless the unweighted approach is used, i.e., when --unweighted is specified), trait column (with trait values being continuous or binary if model=linear or model=logistic, respectively), event time and indicator columns (if model=coxph), and covariate columns (unless there is no covariate in formula). Subjects with missing values in any of the columns specified by --formula formula, --id-col iid, --family-col fid, or --weight-col wt are excluded from the analysis.
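As a rough illustration of this layout, the Python sketch below writes a small tab-delimited phenotype table with a header line and NA for missing values, then lists the subjects that have complete data in the columns used by the analysis. The column names IID, FID, and WT are the defaults named in this document; the trait and age columns, the records, and both helper functions are made up for the example.

```python
import csv
import io

# Hypothetical records; None marks a missing value.
records = [
    {"IID": "s1", "FID": "f1", "WT": 12.5, "trait": 1.3, "age": 40},
    {"IID": "s2", "FID": "f1", "WT": 8.0, "trait": None, "age": 52},
    {"IID": "s3", "FID": "f2", "WT": 20.1, "trait": 0.7, "age": 35},
]
columns = ["IID", "FID", "WT", "trait", "age"]

def write_pheno(records, columns, handle):
    """Write a tab-delimited table with a header line and NA for missing values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        writer.writerow(["NA" if rec[c] is None else rec[c] for c in columns])

def complete_cases(records, used_columns):
    """Return the IDs of subjects with no missing value in the given columns."""
    return [r["IID"] for r in records
            if all(r[c] is not None for c in used_columns)]

buffer = io.StringIO()
write_pheno(records, columns, buffer)
pheno_text = buffer.getvalue()

# Subject s2 has a missing trait value, so it would not enter the analysis.
kept = complete_cases(records, ["IID", "FID", "WT", "trait", "age"])
print(pheno_text)
print(kept)
```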
The VCF file contains the genotype data. The format specifications of a VCF file can be found here. The VCF file should be compressed and indexed by bgzip and tabix, respectively, using the following commands:
bgzip vcf_file
tabix -p vcf -f vcf_file.gz
We recommend that users store the SNP genotype or dosage data in VCF format, because far more analysis and output options are available in SUGEN when using VCF files.
The genetic data file in plain-text format should be tab-delimited. Missing data are denoted by NA. The rows represent genetic features, such as SNPs or genes. The columns represent study subjects. The 1st row contains the subject IDs. The 1st column contains the genetic feature IDs. An example is as follows:
id | subject_1 | subject_2 | subject_3 |
---|---|---|---|
gene1 | 0.07 | 0.25 | 0.37 |
gene2 | NA | 0.67 | 0.15 |
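The layout above can be read back with a few lines of Python. This is only a sketch for checking a file before passing it to --genetic-text: the function parse_genetic_text is hypothetical, and the example text reproduces the table above with tab delimiters.

```python
example = (
    "id\tsubject_1\tsubject_2\tsubject_3\n"
    "gene1\t0.07\t0.25\t0.37\n"
    "gene2\tNA\t0.67\t0.15\n"
)

def parse_genetic_text(text):
    """Return (subject_ids, {feature_id: values}) with None for NA entries."""
    lines = text.strip().split("\n")
    # The 1st row holds the subject IDs after the feature ID column.
    subject_ids = lines[0].split("\t")[1:]
    features = {}
    for line in lines[1:]:
        fields = line.split("\t")
        values = [None if v == "NA" else float(v) for v in fields[1:]]
        if len(values) != len(subject_ids):
            raise ValueError(
                f"row {fields[0]} has {len(values)} values, "
                f"expected {len(subject_ids)}"
            )
        features[fields[0]] = values
    return subject_ids, features

subjects, features = parse_genetic_text(example)
print(subjects)
print(features)
```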
The files that contain the pairwise inclusion probability matrices should be tab-delimited. The 1st row is the header line containing the subject IDs. The remaining rows constitute a symmetric square matrix. That is to say, the number of rows equals the number of columns plus 1 (for the header line). The marginal inclusion probability of the ith subject is in the (i+1)th row and ith column. The pairwise inclusion probability of the ith and jth subjects is in the (i+1)th row and jth column, as well as in the (j+1)th row and ith column. All inclusion probabilities are strictly greater than 0 and less than or equal to 1. Missing values are not allowed. Note that there can be multiple pairwise inclusion probability matrices. Subjects in different pairwise inclusion probability matrices are assumed to be independent. Note that these pairwise inclusion probability matrices are optional in the weighted approach and not needed in the unweighted approach.
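The requirements in this paragraph (square, symmetric, all entries in (0, 1], no missing values) can be checked before running SUGEN. The Python sketch below is one way to do so; check_prob_matrix and its tolerance argument are hypothetical, and the two-subject matrix is made up.

```python
def check_prob_matrix(text, tol=1e-12):
    """Validate a tab-delimited pairwise inclusion probability matrix.

    Returns the subject IDs from the header line. A missing value such as
    NA fails in float() with a ValueError, since missing values are not allowed.
    """
    lines = text.strip().split("\n")
    ids = lines[0].split("\t")
    rows = [[float(v) for v in line.split("\t")] for line in lines[1:]]
    n = len(ids)
    if len(rows) != n or any(len(r) != n for r in rows):
        raise ValueError("matrix must be square, one row and column per subject")
    for i in range(n):
        for j in range(n):
            p = rows[i][j]
            if not (0.0 < p <= 1.0):
                raise ValueError(
                    f"probability outside (0, 1] at row {i + 1}, column {j + 1}"
                )
            if abs(p - rows[j][i]) > tol:
                raise ValueError(f"matrix not symmetric at ({i + 1}, {j + 1})")
    return ids

# Diagonal entries are marginal inclusion probabilities; off-diagonal entries
# are pairwise inclusion probabilities.
example = "s1\ts2\n0.5\t0.2\n0.2\t0.4\n"
ids = check_prob_matrix(example)
print(ids)
```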
In the file that contains the file names of the pairwise inclusion probability matrices, each row is the file name of one pairwise inclusion probability matrix. Note that this file is optional in the weighted approach and not needed in the unweighted approach.
In the file that contains the variants for conditional analysis, each row is a variant ID, which should be in chromosome:position format. Note that this file is needed only when we perform conditional analysis (i.e., when --cond cond_file is specified).
In the file that contains the variants' grouping information for gene-based analysis, each row is a gene, which should be in the following format:
    gene_1 variant_1,variant_2
    gene_2 variant_3,variant_4,variant_5
The gene and variant IDs are separated by a tab. The variant IDs in the same gene are separated by commas. Variant IDs should be in chromosome:position format. Note that this file is needed only when we perform gene-based analysis (i.e., when --group group_file is specified).
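A grouping file in this format can be parsed and sanity-checked as follows. This Python sketch is illustrative only: parse_group_file is a hypothetical helper, the gene names and positions are made up, and the check that the position is a whole number is an assumption about what chromosome:position IDs look like.

```python
def parse_group_file(text):
    """Return {gene_id: [variant IDs]} from tab- and comma-delimited text."""
    genes = {}
    for line in text.strip().split("\n"):
        # Gene ID and variant list are separated by a single tab.
        gene, variants = line.split("\t")
        ids = variants.split(",")
        for vid in ids:
            chrom, sep, pos = vid.partition(":")
            if not sep or not chrom or not pos.isdigit():
                raise ValueError(f"{vid!r} is not in chromosome:position format")
        genes[gene] = ids
    return genes

example = "gene_1\t1:1000,1:2000\ngene_2\t2:500,2:750,2:900\n"
groups = parse_group_file(example)
print(groups)
```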
In the file that contains the subset of variants to be analyzed in single-variant analysis, each row is a variant ID, which should be in chromosome:position format. Note that this file is needed only when --extract-file extract_file is specified.
The rows of prefix.wald.out represent variants. The first row is the header line. Missing values are denoted by NA. Tables 1-3 describe the columns of prefix.wald.out in standard association analysis, conditional analysis, and gene-environment interaction analysis, respectively.
Table 1. Columns of prefix.wald.out in standard association analysis.

Column Name | Description |
---|---|
CHROM | Chromosome. |
POS | Position. |
VCF_ID | Variant ID in the VCF file. |
REF | Reference allele. |
ALT | Alternative allele. |
ALT_AF | Alternative allele frequency. |
ALT_AC | Alternative allele count. |
N_INFORMATIVE | Number of subjects included in the analysis. |
N_REF | Number of subjects with two reference alleles. |
N_HET | Number of subjects with one reference allele and one alternative allele. |
N_ALT | Number of subjects with two alternative alleles. |
N_DOSE | Number of subjects with genotype dosages. |
ALT_AF_CASE | Alternative allele frequency among cases. This column is present only when model=logistic. |
N_CASE | Number of cases included in the analysis. This column is present only when model=logistic. |
ALT_AF_EVENT | Alternative allele frequency among subjects with an event. This column is present only when model=coxph. |
N_EVENT | Number of subjects with an event included in the analysis. This column is present only when model=coxph. |
BETA | Effect estimate. |
SE | Standard error estimate of BETA. |
PVALUE | p-value. |
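The BETA, SE, and PVALUE columns are related through the Wald statistic. The Python sketch below shows that relationship under the assumption that PVALUE is the two-sided p-value of BETA/SE against a standard normal distribution; this document says that Wald statistics are used but does not spell out the formula, so treat this as a consistency check and not as SUGEN's implementation. The function name and input values are made up.

```python
import math

def wald_pvalue(beta, se):
    """Two-sided p-value for H0: beta = 0 with a standard normal reference."""
    z = beta / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# z = 0.12 / 0.04 = 3, which gives a p-value of about 0.0027.
p = wald_pvalue(0.12, 0.04)
print(p)
```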
Table 2. Columns of prefix.wald.out in conditional analysis.

Column Name | Description |
---|---|
CHROM | Chromosome. |
POS | Position. |
VCF_ID | Variant ID in the VCF file. |
REF | Reference allele. |
ALT | Alternative allele. |
ALT_AF | Alternative allele frequency. |
ALT_AC | Alternative allele count. |
N_INFORMATIVE | Number of subjects included in the analysis. |
N_REF | Number of subjects with two reference alleles. |
N_HET | Number of subjects with one reference allele and one alternative allele. |
N_ALT | Number of subjects with two alternative alleles. |
N_DOSE | Number of subjects with genotype dosages. |
ALT_AF_CASE | Alternative allele frequency among cases. This column is present only when model=logistic. |
N_CASE | Number of cases included in the analysis. This column is present only when model=logistic. |
ALT_AF_EVENT | Alternative allele frequency among subjects with an event. This column is present only when model=coxph. |
N_EVENT | Number of subjects with an event included in the analysis. This column is present only when model=coxph. |
BETA | Effect estimate. |
SE | Standard error estimate of BETA. |
PVALUE | p-value. |
BETA_variant | Effect estimate of the variant that is conditioned on. |
SE_variant | Standard error estimate of BETA_variant. |
PVALUE_variant | p-value of the variant that is conditioned on. |
Table 3. Columns of prefix.wald.out in gene-environment interaction analysis.

Column Name | Description |
---|---|
CHROM | Chromosome. |
POS | Position. |
VCF_ID | Variant ID in the VCF file. |
REF | Reference allele. |
ALT | Alternative allele. |
ALT_AF | Alternative allele frequency. |
ALT_AC | Alternative allele count. |
N_INFORMATIVE | Number of subjects included in the analysis. |
N_REF | Number of subjects with two reference alleles. |
N_HET | Number of subjects with one reference allele and one alternative allele. |
N_ALT | Number of subjects with two alternative alleles. |
N_DOSE | Number of subjects with genotype dosages. |
ALT_AF_CASE | Alternative allele frequency among cases. This column is present only when model=logistic. |
N_CASE | Number of cases included in the analysis. This column is present only when model=logistic. |
ALT_AF_EVENT | Alternative allele frequency among subjects with an event. This column is present only when model=coxph. |
N_EVENT | Number of subjects with an event included in the analysis. This column is present only when model=coxph. |
PVALUE_G | p-value of the variant. |
PVALUE_INTER | p-value of the interaction term(s) between the variant and environment variable(s). |
PVALUE_BOTH | p-value of both the variant and gene-environment interaction terms. |
BETA_G | Effect estimate of the variant. |
BETA_envi | Effect estimate of environment variable envi. |
BETA_G:envi | Effect estimate of the interaction term between the variant and environment variable envi, denoted by G:envi. |
COV_G_G | Variance estimate of BETA_G. |
COV_envi_envi | Variance estimate of BETA_envi. |
COV_G:envi_G:envi | Variance estimate of BETA_G:envi. |
COV_G_envi | Covariance estimate between BETA_G and BETA_envi. This column is present only when --ge-output-detail is specified. |
COV_G_G:envi | Covariance estimate between BETA_G and BETA_G:envi. This column is present only when --ge-output-detail is specified. |
COV_envi_G:envi | Covariance estimate between BETA_envi and BETA_G:envi. This column is present only when --ge-output-detail is specified. |
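One use of the covariance columns is to obtain the effect of the variant at a chosen value of the environment variable, together with its standard error. The Python sketch below assumes a single environment variable and a linear predictor containing BETA_G, BETA_envi, and BETA_G:envi terms, so that the variant effect at envi = e is BETA_G + e * BETA_G:envi. The function name and all numbers are made up; this is a standard variance calculation, not a description of what SUGEN does internally.

```python
import math

def variant_effect_at(envi_value, beta_g, beta_inter,
                      var_g, var_inter, cov_g_inter):
    """Variant effect at a given environment value, with its standard error.

    var_g, var_inter, and cov_g_inter correspond to the COV_G_G,
    COV_G:envi_G:envi, and COV_G_G:envi columns, respectively.
    """
    effect = beta_g + envi_value * beta_inter
    variance = (var_g
                + envi_value ** 2 * var_inter
                + 2.0 * envi_value * cov_g_inter)
    return effect, math.sqrt(variance)

effect, se = variant_effect_at(1.0, beta_g=0.10, beta_inter=0.05,
                               var_g=0.0004, var_inter=0.0009,
                               cov_g_inter=-0.0002)
print(effect, se)
```

Without COV_G_G:envi, which is written only when --ge-output-detail is specified, the standard error at a nonzero environment value cannot be computed from the output file alone.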
The rows of prefix.score.snp.out represent SNPs. The first row is the header line. Missing values are denoted by NA. Table 4 describes the columns of prefix.score.snp.out in standard association analysis.
Table 4. Columns of prefix.score.snp.out.

Column Name | Description |
---|---|
GENE_ID | Gene ID. In single-variant analysis (i.e., --group group_file is not specified), GENE_ID equals CHROM:POS. |
CHROM | Chromosome. |
POS | Position. |
VCF_ID | Variant ID in the VCF file. |
REF | Reference allele. |
ALT | Alternative allele. |
ALT_AF | Alternative allele frequency. |
ALT_AC | Alternative allele count. |
N_INFORMATIVE | Number of subjects included in the analysis. |
N_REF | Number of subjects with two reference alleles. |
N_HET | Number of subjects with one reference allele and one alternative allele. |
N_ALT | Number of subjects with two alternative alleles. |
N_DOSE | Number of subjects with genotype dosages. |
U | Score statistic. |
V | Variance estimate of U. |
BETA | Effect estimate. |
SE | Standard error estimate of BETA. |
PVALUE | p-value. |
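For orientation, the usual one-degree-of-freedom score test compares U^2 / V with a chi-square distribution. The Python sketch below computes that statistic and its p-value from the U and V columns. It assumes PVALUE is obtained this way and does not model the rescaling selected by --score-rescale, so the result may not match the file exactly. The function name and input values are made up.

```python
import math

def score_test(u, v):
    """Return (chi-square statistic, p-value) of the 1-df score test."""
    if v <= 0:
        raise ValueError("the variance estimate V must be positive")
    chi2 = u * u / v
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

chi2, p = score_test(6.0, 4.0)
print(chi2, p)
```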
The gene-based summary statistics are stored in MASS format. They can be loaded into the software program MASS to perform all commonly used gene-based association tests. They can also be converted by the software program PreMeta to files that are compatible with other commonly used rare-variant meta-analysis software programs, including RAREMETAL, seqMeta, and MetaSKAT.
- 1.0 (released on May 29th, 2013)
  First version released.
- 2.0 (released on Nov 12th, 2013)
  - Added the capability to perform gene-environment interaction analysis.
  - Deleted the tab delimiter at the end of each row in the output file.
- 3.0 (released on Dec 7th, 2013)
  Added the capability to perform logistic regression for binary (0/1) traits.
- 4.0 (released on Feb 9th, 2014)
  Added the capability to analyze data with multiple pairwise inclusion probability matrices.
- 4.1 (released on Mar 13th, 2014)
  Added the capability to deal with imputed genotype dosages.
- 5.0 (released on May 21st, 2014)
  - Modified the variance estimation formula. Included both the model-based and robust variance estimators.
  - Changed the format of the phenotype file.
- 5.1 (released on Aug 14th, 2014)
  Added the capability to perform conditional analysis.
- 5.2 (released on Sep 21st, 2014)
  Modified the variance estimation formula. Used a new approach to trim the pairwise inclusion probabilities.
- 6.0 (released on Oct 1st, 2014)
  Added the unweighted approach.
- 6.1 (released on Oct 6th, 2014)
  Changed some option names. Changed some column names in output files.
- 6.2 (released on Nov 18th, 2014)
  Changed the name of the software program from "SOLReg" to "SUGEN".
- 6.3 (released on Nov 13th, 2015)
  Improved the computational efficiency of unweighted analysis.
- 7.0 (released on March 30th, 2016)
  Improved the user interface. Changed the genotype file format from plain text to VCF. Added the capability to perform gene-based association analysis.
- 7.1 (released on May 2nd, 2016)
  Added the capability to handle dosage data.
- 7.2 (released on May 5th, 2016)
  Fixed a bug in reading the phenotype file when it contains redundant columns.
- 7.3 (released on May 30th, 2016)
  - Fixed a bug in gene-environment interaction analysis where the environment variable is the last covariate in the model.
  - Added the --subset option.
  - Added the --hetero-variance option.
  - Modified the model-based variance estimator so that it is stable for rare variants.
- 8 (released on September 29, 2016)
  - Added the capability to perform Cox proportional hazards regression.
  - Modified the model-based covariance matrix estimator in gene-based tests so that it is more accurate for rare variants.
  - Fixed a bug in reading the phenotype file when the subject ID or family ID column is the last column of the phenotype file.
- 8.1 (released on November 2, 2016)
  - Added p-values in the gene-environment interaction analysis output file.
  - Fixed a bug in the weighted approach.
- 8.2 (released on January 5, 2017)
  - Added columns ALT_AF_CASE (ALT_AF_EVENT) and N_CASE (N_EVENT) to the single-variant analysis results file in logistic (Cox proportional hazards) regression.
- 8.3 (released on January 18, 2017)
  - Added the --ge-output-detail option.
- 8.4 (released on February 26, 2017)
  - Produced warnings instead of errors when there is no SNP to be conditioned on in conditional analysis.
- 8.5 (released on March 21, 2017)
  - Fixed a bug in the p-value calculation in gene-environment interaction analysis.
- 8.6 (released on May 28, 2017)
  - Added ALT_AC calculation for dosage data.
- 8.7 (released on July 19, 2017)
  - Added the capability to handle genetic data files in plain-text format.
- 8.8 (released on July 29, 2017)
  - Fixed a bug in weighted analysis when some studies have no subjects eligible for the association analysis.
- 8.9 (released on July 27, 2018)
  - Updated libStatGen to v1.0.14 and Eigen to v3.3.4.
- 8.10 (released on June 30, 2019)
  - Fixed a bug in the data preparation function for score tests in proportional hazards regression.
- 8.11 (current version, released on November 12, 2019)
  - Fixed a bug in reading the left-truncation time.
For questions, please contact Ran Tao.