/tejaas

Tejaas - a tool for discovering trans-eQTLs

Primary LanguageCGNU General Public License v3.0GPL-3.0

Tejaas: Discover trans-eQTLs

Description

Tejaas is a command line tool to find trans-eQTLs from eQTL data. It is released under the GNU General Public License version 3.

Tejaas is based on the hypothesis that a trans-eQTL should regulate the expression levels of multiple genes. In brief, it implements two statistical methods to find trans-eQTLs:

  • RR-score (Reverse Regression): It performs a multiple linear regression with L2-regularization using expression levels of all genes to explain the genotype of a candidate SNP. In contrast to conventional methods, the direction of the regression is reversed, with the gene expressions as explanatory variables. RR-score is a statistic which estimates whether more genes are required to explain the allele counts of a SNP than expected by chance.
  • JPA-score (Joint P-value Analysis): It evaluates the distribution of p-values of the pairwise linear association of a candidate SNP with all available gene expression levels. Any null SNP (no trans-effect) will have a uniform distribution of p-values, while a trans-eQTL will be associated with more genes than expected by chance, leading to overdispersion near zero. The JPA-score is a statistic which estimates whether the distribution of p-values is significantly overdispersed near zero.

Additionally, it also implements a non-linear unsupervised confounder correction using k-nearest neighbors called KNN correction.

Dependencies

  • Python version 3.6 or higher
  • C compiler
  • Any linear algebra routine, e.g. Intel MKL, OpenBLAS, etc.
  • Any flavor of Message Passing Interface (MPI), e.g. OpenMPI, MPICH, etc.
  • Other Python libraries:
    • NumPy / array operations
    • SciPy / optimization and other special functions
    • statsmodel / used for ECDF calculation in JPA-score
    • Pygtrie / used for reading MAF file in RR-score / maf null
    • mpi4py / linked to MPI and MKL for python parallelization
    • scikit-learn / used for PCA decomposition in KNN correction

Installation

See the wiki for detailed installation instructions. Here is a quick start guide:

1. Install dependencies

Installation of Tejaas depends on other dependencies. If you are using conda, then they can be installed using

conda install numpy scipy statsmodels scikit-learn
pip install mpi4py

2. Install Tejaas

2a. Tejaas can be installed directly from the PyPI repository:

pip install tejaas

2b. You can also download this repository and build Tejaas:

git clone git@github.com:soedinglab/tejaas.git
cd tejaas
pip install -e .

3. Run Tejaas!

tejaas [OPTIONS]

See below for valid options or try tejaas --help.

Run an example to check installation

An example script example/run_example.sh is provided to check the installation.

cd tejaas/example
./run_example.sh <outdir> <ncpu>

The script will download some example input files in <outdir>/data and run Tejaas on <ncpu> cores. The output will be created in <outdir>/data. Note that the data subdirectory is automatically created within the <outdir> specified by the user. Check if the output is correct:

python compare_with_gold.py --outdir <outdir>

This will check if the output in <outdir>/data (again note that data subdirectory is automatically appended and need not be specified by the user) matches with the results provided in the example/gold subdirectory.

Input Files

  • Gene expression file
  • Genotype file
    • VCF
    • Oxford
    • Dosage
  • GENCODE file
  • Population minor allele frequency

Tejaas [OPTIONS]

Option                   Argument Description Priority Default value
--vcf FILEPATH Input genotype file in vcf.gz format Required (vcf or oxf) --
--oxf FILEPATH Input genotype file in Oxford format Required (vcf or oxf) --
--dosage Flag for reading dosage files. The file is specified with the --oxf option, e.g. --oxf FILEPATH --dosage Optional False
--fam FILEPATH Input fam file for samples names of Oxford genotype Optional --
--chrom INT Chromosome number of the genotype file Required --
--include-SNPs START:END Colon-separated index of SNPs to be included Optional --
--gx FILEPATH Input gene expression file for trans-eQTL discovery Required --
--gxcorr FILEPATH Input gene expression file for target gene discovery Optional --gx file
--gxfmt OPTION Input gene expression file format (see format details below). Supported options: gtex, cardiogenics, geuvadis Optional gtex
--gtf FILEPATH Input GTF file from GENCODE to read gene Ensembl IDs. Used for selecting biotypes and getting genomic locations. Required --
--trim Flag to trim version number from GENCODE Ensembl IDs Optional False
--biotype OPTION Which biotypes to select from the GTF file. Supported options: protein_coding, lncRNA. Optional protein_coding lncRNA
--outprefix STRING Full path to output file names. The extensions are generated by Tejaas. Optional out
--method OPTION Name of method to run. Supported options: jpa or rr Optional rr
--null OPTION Null model to use for RR-score. Supported options: perm or maf Optional perm
--cismask Flag to mask cis-Genes within a window for each candidate SNP. Gene positions are obtained from the GENCODE annotation file. Optional False
--window FLOAT Window (number of base pairs) used for masking cis genes Optional 1e6
--prior-sigma FLOAT Standard deviation of the normal prior for reverse multiple linear regression Optional 0.1
--knn INT Number of neighbours for KNN (use 0 if you do not want to use KNN) Optional 0
--psnpthres FLOAT Target genes will be reported only for trans-eQTLs below this threshold p-value for RR/JPA-score Optional 0.0001
--pgenethres FLOAT Target genes will be reported only if their association with trans-eQTLs are below this threshold p-value Optional 0.05
--jpanull FILEPATH File containing list of null model JPA-scores Optional --
--maf-file FILEPATH Read minor allele frequency (MAF) of SNPs from this file, e.g. to read population MAF for maf null (see documentation for file format) Optional --
--shuffle Flag to randomly shuffle the genotypes to obtain a null distribution Optional False
--shuffle-with FILEPATH Shuffle the genotypes in the same order of donor IDs specified in FILEPATH Optional --

Usage Examples

  1. For quick start or installation check, run Tejaas with all default options:
tejaas --vcf ${VCFFILE} --chrom ${CHRM} --gx ${GXFILE} --gtf ${GTFFILE} --cismask --outprefix ${OUTPREFIX}

This will create RR-scores at γ=0.1 and masking all genes within 1Mb of each SNP. The p-values will be computed from the permuted null model. Default format for the gene expression is the same as the GTEx format, and default gtf file is the GENCODE v26 release. For target gene discovery, it will use the same file as used for trans-eQTL discovery.

  1. Example of running Tejaas RR-score. We recommend using the perm null model for calcuting p-values from the RR-score and a separate confounder-corrected gene expression file for target gene discovery. In this example, RR-score is calculated for first 1000 SNPs excluding the first 20 --include-SNPs 21:1000. KNN correction is performed with 20 nearest neighbors --knn 20. All cis-genes within +2MB and -2Mb are masked during analysis --cismask --window 2e6. RR-score calculation uses a prior normal distribution with standard deviation of 0.05 --prior-sigma 0.05. The output reports target genes only for SNPs with p-value < 1e-6 --psnpthres 0.000001. Here, GXFILE is the raw gene expression file, GXCORRFILE is the confounder-corrected gene expression file, VCFFILE is the genotype file in .vcf.gz format and GTFFILE is the GENCODE annotation file.
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 21:1000 --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
                   --gxfmt gtex --gtf ${GTFFILE} --trim  --outprefix ${OUTPREFIX} \
                   --cismask --window 2e6 --psnpthres 0.000001 \
                   --knn 20 --method rr --null perm --prior-sigma 0.05
  1. Example of running JPA-score with no KNN correction. Empirical p-values are calculated from the null scores loaded from NULLFILE specified by the --jpanull option. If NULLFILE does not exist, then it will create 100000 null scores and write them in the NULLFILE before calculating JPA-scores. If --jpanull option is not used, then p-values for the JPA-scores are calculated from an analytical construction of null model.
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 1:100 \
                   --gx ${GXFILE} --gxfmt gtex --gtf ${GTFFILE} --outprefix ${OUTPREFIX} \
                   --knn 0 --method jpa --jpanull ${NULLFILE}
  1. Example of parallelizing job submission.
NMAX=20000 # number of SNPs per job
for CHRM in $( seq 1 22 ); do
    VCFFILE="file_path_here_${CHRM}.vcf.gz"
    NTOT=$( calculate_no_of_SNPs_in_this_chromosome )
    NJOB=$( echo $(( (NTOT + NMAX - 1)/NMAX )) )
    for (( i=0; i < ${NJOB}; i++ )); do
        STARTSNP=$(( NMAX * i + 1 ))
        ENDSNP=$(( NMAX * (i + 1) ))
        if [ ${ENDSNP} -gt ${NTOT} ]; then
            ENDSNP=${NTOT}
        fi
        mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs ${STARTSNP}:${ENDSNP} --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
                           --gxfmt gtex --gtf ${GTFFILE} --trim  --outprefix ${OUTPREFIX} \
                           --cismask --psnpthres 0.000001 --knn 20 --method rr --null perm --prior-sigma 0.05
    done
done

Contributors

Version History

1.0.2 Bump release version