
Genome association wrapper


Askat

You can download the program here

Alternatively, you can get the whole project, including a zip file, with this command: git clone https://github.com/pcingola/Askat.git (requires git to be installed).

You can take a look at some slides on the wrapper.


Basics

Requirements

  • R : Make sure you have a reasonably up-to-date version of R installed and available in your PATH.
  • Java 1.6 : Most modern computers have Java installed.
  • Fast-LMM : Pre-compiled versions of Fast-LMM can be found here (Linux and Windows).
    WARNING: You should use a relatively new version of Fast-LMM (e.g. v1.09).
  • R Libraries: The ASKAT wrapper will tell you how to install any missing R libraries.

Install

Just unzip the file (running the JAR is covered in the next section):

unzip askat.zip

Running the program

To run the program, go into the directory where you installed it and run the JAR file:

cd askat
java -Xmx2G -jar Askat.jar [options] genotype

E.g.: To show a help message, just run the program without any parameters:

$ java -jar Askat.jar
ASKAT algorithm by Karim Oualkacha, optimized from N^3 to N^2 complexity by Stepan Grinek 
Askat wrapper version 1.01 (build 2012-11-21), by Pablo Cingolani

Usage: java -jar Askat.jar [options] genotype
Options:
	-b <num>       : Number of SNPs used for calculating the kinship matrix. Default: 100000
	-d             : Debug mode (implies verbose)
	-d1            : Debug mode. Perform only one sub-block calculation and stop
	-i <bed>       : BED file containing intervals to group SNPs. Default: none
	-maxMaf        : Maximum MAF (minor allelel frequency). Default: 1.0
	-noDep         : Do not perform dependency check.
	-h             : Show this help and exit.
	-kin <type>    : Kinship estimation type. Options {chr, avg, all, block}. Default: CHROMOSOME
	-p <num>       : Number of parallel processes. Default: 8
	-pathBin <dir> : Path to binary programs (e.g. FastLmm). Default: './'.
	-pathR <dir>   : Path to R scripts (ASKAT scripts). Default './r/'.
	-sb <num>      : Number of SNPs used for calculating the ASKAT algorithm. Default: 20
	-pACC <double> : Accuracy parameter for the p-value computation, default is 1e-9.
  -v             : Be verbose.

Some comments about the 'intervals' option

Whenever the command line option '-i file.bed' is specified, the Askat wrapper analyzes all variants matching an interval as a single block. This means that variants are not subdivided into sub-blocks, which is effectively like setting the sub-block parameter ('-sb') to infinity.

If a variant hits more than one interval, it is analyzed in every interval it hits. Although overlapping intervals are allowed, each interval is treated as a unique block (this means that no statistical corrections are made in the model).
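The grouping rule described above can be illustrated with a minimal Python sketch. It is not the wrapper's actual code: the function name `group_variants_by_interval` is hypothetical, and the coordinate handling (BED intervals as 0-based, end-exclusive; variant positions as 1-based) is an assumption based on the usual BED convention, not something this document confirms for Askat.

```python
from collections import defaultdict


def group_variants_by_interval(variants, intervals):
    """Assign each variant to every interval it overlaps (hypothetical helper).

    variants:  list of (chrom, pos, snp_id), with 1-based positions.
    intervals: list of (chrom, start, end, name), assumed BED-style
               (start is 0-based, end is exclusive).
    Returns a dict {interval_name: [snp_id, ...]}.
    """
    blocks = defaultdict(list)
    for chrom, pos, snp_id in variants:
        for i_chrom, start, end, name in intervals:
            # A 1-based position 'pos' lies in [start, end) when start < pos <= end.
            if chrom == i_chrom and start < pos <= end:
                blocks[name].append(snp_id)
    return dict(blocks)


if __name__ == "__main__":
    # Made-up intervals that overlap between positions 151 and 200.
    intervals = [("1", 0, 200, "intervalA"), ("1", 150, 400, "intervalB")]
    variants = [("1", 100, "snp_1"), ("1", 200, "snp_2"), ("1", 300, "snp_3")]
    for name, snps in group_variants_by_interval(variants, intervals).items():
        print(name, snps)
```

In this sketch snp_2 lands in both blocks, mirroring the statement that a variant hitting several intervals is analyzed in each of them.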

Running a full example

The genotype 'karim4k' is available for testing the program.

WARNING! By default Askat will try to use ALL the processors available in the computer. This will produce a significant slowdown for all other users of the system. Use the '-p <num>' option to limit the number of parallel processes.

The command line for the example is:

$ java -Xmx1g -jar Askat.jar -v karim4k | tee karim4k.out 

Note that we specified 1G of RAM for the program (parameter '-Xmx1g'); in other cases, more memory may be required.
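If you launch the wrapper from a script, the command line can be assembled programmatically, for instance to cap the number of parallel processes on a shared machine. The following is only a sketch: the helper `build_askat_command` and its defaults are hypothetical, and only the flags it emits ('-Xmx', '-jar', '-v', '-p') are taken from the help message above.

```python
import shlex


def build_askat_command(genotype, max_heap="1g", processes=None,
                        verbose=True, jar="Askat.jar"):
    """Return the Askat command line as an argument list (hypothetical helper)."""
    cmd = ["java", f"-Xmx{max_heap}", "-jar", jar]
    if verbose:
        cmd.append("-v")
    if processes is not None:
        # '-p <num>' limits the number of parallel processes.
        cmd += ["-p", str(processes)]
    cmd.append(genotype)
    return cmd


if __name__ == "__main__":
    cmd = build_askat_command("karim4k", processes=4)
    # Print the command; pass 'cmd' to subprocess.run() to actually execute it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run` (from the directory containing Askat.jar) would start the run; the sketch only prints the command so it can be inspected first.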

The output looks like this:

00:00:00.001  ASKAT algorithm by Karim Oualkacha
00:00:00.004  Askat wrapper version 1.0 - epsilon 'almost 1.0' (build 2012-06-29), by Pablo Cingolani

00:00:00.004  Checking dependencies.
00:00:00.005  Checking dependency: Program 'R'
00:00:00.274  OK
00:00:00.275  Checking dependency: Program 'Rscript'
00:00:00.434  OK
00:00:00.434  Checking dependency: Program 'fastlmmc'
00:00:00.535  OK
00:00:00.536  Checking dependency: R library 'GenABEL'
00:00:02.099  Checking dependency: R library 'CompQuadForm'
00:00:02.387  Checking dependency: R library 'nFactors'
00:00:02.982  Checking dependency: R library 'MASS'
00:00:03.286  All dependencies found.

00:00:03.287  Running algorithm.
00:00:03.287  Creating blocks.
00:00:04.423  Creating block 'karim4k.block.1_100.tped'. Number of entries: 4000
00:00:04.511  Running block: karim4k.block.1_100.tped
00:00:04.512  Calculating kinship matrix for block: karim4k.block.1_100
00:00:23.197  Starting block: karim4k.block.1_100
00:00:23.284  Create batches.
      File 'karim4k.block.1_100.tped' has 4000 lines.
      Split up to 500 lines per batch.
00:00:23.287  Batch 1. Line 1. Creating batch : karim4k.block.1_100.1.askat
00:00:23.496  Batch 2. Line 501. Creating batch : karim4k.block.1_100.2.askat
00:00:23.673  Batch 3. Line 1001. Creating batch : karim4k.block.1_100.3.askat
00:00:23.845  Batch 4. Line 1501. Creating batch : karim4k.block.1_100.4.askat
00:00:24.019  Batch 5. Line 2001. Creating batch : karim4k.block.1_100.5.askat
00:00:24.190  Batch 6. Line 2501. Creating batch : karim4k.block.1_100.6.askat
00:00:24.365  Batch 7. Line 3001. Creating batch : karim4k.block.1_100.7.askat
00:00:24.538  Batch 8. Line 3501. Creating batch : karim4k.block.1_100.8.askat
00:01:33.747  ASKAT_RESUTS:  Block:  karim4k.block.1_100.1.askat  Sub-Block:  1 - 20   chr:pos:  1:100 - 1:2000       Id:  snp_1 - snp_20       p-value:  0.0006166988  Q:  38588.06  ...
00:01:35.749  ASKAT_RESUTS:  Block:  karim4k.block.1_100.2.askat  Sub-Block:  1 - 20   chr:pos:  1:50100 - 1:52000    Id:  snp_501 - snp_520    p-value:  0.6740977     Q:  10817.98  ...
00:01:36.750  ASKAT_RESUTS:  Block:  karim4k.block.1_100.3.askat  Sub-Block:  1 - 20   chr:pos:  1:100100 - 1:102000  Id:  snp_1001 - snp_1020  p-value:  4.551971e-07  Q:  57434.71  ...
00:01:36.750  ASKAT_RESUTS:  Block:  karim4k.block.1_100.8.askat  Sub-Block:  1 - 20   chr:pos:  1:350100 - 1:352000  Id:  snp_3501 - snp_3520  p-value:  0.6530349     Q:  11460.14  ...
00:01:37.751  ASKAT_RESUTS:  Block:  karim4k.block.1_100.4.askat  Sub-Block:  1 - 20   chr:pos:  1:150100 - 1:152000  Id:  snp_1501 - snp_1520  p-value:  0.9720287     Q:  5038.441  ...
00:01:37.751  ASKAT_RESUTS:  Block:  karim4k.block.1_100.6.askat  Sub-Block:  1 - 20   chr:pos:  1:250100 - 1:252000  Id:  snp_2501 - snp_2520  p-value:  0.1502854     Q:  18876.09  ...
00:01:38.752  ASKAT_RESUTS:  Block:  karim4k.block.1_100.5.askat  Sub-Block:  1 - 20   chr:pos:  1:200100 - 1:202000  Id:  snp_2001 - snp_2020  p-value:  1.340421e-06  Q:  54710.54  ...
00:01:38.753  ASKAT_RESUTS:  Block:  karim4k.block.1_100.7.askat  Sub-Block:  1 - 20   chr:pos:  1:300100 - 1:302000  Id:  snp_3001 - snp_3020  p-value:  1.385112e-05  Q:  48982.58  ...
00:02:22.786  ASKAT_RESUTS:  Block:  karim4k.block.1_100.1.askat  Sub-Block:  21 - 40  chr:pos:  1:2100 - 1:4000      Id:  snp_21 - snp_40      p-value:  0.2958578     Q:  16597.14  ...
00:02:25.788  ASKAT_RESUTS:  Block:  karim4k.block.1_100.2.askat  Sub-Block:  21 - 40  chr:pos:  1:52100 - 1:54000    Id:  snp_521 - snp_540    p-value:  0.7232833     Q:  10029.99  ...
00:02:26.789  ASKAT_RESUTS:  Block:  karim4k.block.1_100.3.askat  Sub-Block:  21 - 40  chr:pos:  1:102100 - 1:104000  Id:  snp_1021 - snp_1040  p-value:  0.01380279    Q:  28006.57  ...
00:02:27.790  ASKAT_RESUTS:  Block:  karim4k.block.1_100.8.askat  Sub-Block:  21 - 40  chr:pos:  1:352100 - 1:354000  Id:  snp_3521 - snp_3540  p-value:  0.001698173   Q:  37035.88  ...
00:02:28.791  ASKAT_RESUTS:  Block:  karim4k.block.1_100.4.askat  Sub-Block:  21 - 40  chr:pos:  1:152100 - 1:154000  Id:  snp_1521 - snp_1540  p-value:  0.1902052     Q:  18511.61  ...
00:02:28.791  ASKAT_RESUTS:  Block:  karim4k.block.1_100.5.askat  Sub-Block:  21 - 40  chr:pos:  1:202100 - 1:204000  Id:  snp_2021 - snp_2040  p-value:  0.6349479     Q:  10588.31  ...
00:02:28.792  ASKAT_RESUTS:  Block:  karim4k.block.1_100.6.askat  Sub-Block:  21 - 40  chr:pos:  1:252100 - 1:254000  Id:  snp_2521 - snp_2540  p-value:  0.8626755     Q:  8638.986  ...
00:02:28.792  ASKAT_RESUTS:  Block:  karim4k.block.1_100.7.askat  Sub-Block:  21 - 40  chr:pos:  1:302100 - 1:304000  Id:  snp_3021 - snp_3040  p-value:  0.3494253     Q:  15464.29  ...
...
...
...
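Since the results are written as tagged log lines, they can be pulled out of a saved log (such as 'karim4k.out' above) with a small script. The sketch below assumes the field layout is exactly as in the sample output shown here, including the tag spelling 'ASKAT_RESUTS'; the function name `parse_results` is hypothetical, and any fields after 'Q:' are ignored.

```python
import re

# The tag is spelled "ASKAT_RESUTS" in the wrapper output shown above.
RESULT_RE = re.compile(
    r"ASKAT_RESUTS:\s+Block:\s+(?P<block>\S+)\s+"
    r"Sub-Block:\s+(?P<sb_start>\d+)\s*-\s*(?P<sb_end>\d+)\s+"
    r"chr:pos:\s+(?P<start>\S+)\s*-\s*(?P<end>\S+)\s+"
    r"Id:\s+(?P<id_start>\S+)\s*-\s*(?P<id_end>\S+)\s+"
    r"p-value:\s+(?P<pvalue>\S+)\s+Q:\s+(?P<q>\S+)"
)


def parse_results(lines):
    """Return one dict per result line; all other log lines are skipped."""
    rows = []
    for line in lines:
        match = RESULT_RE.search(line)
        if match is None:
            continue
        row = match.groupdict()
        row["sb_start"] = int(row["sb_start"])
        row["sb_end"] = int(row["sb_end"])
        row["pvalue"] = float(row["pvalue"])
        row["q"] = float(row["q"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Two result lines and one unrelated line, copied from the output above.
    sample = [
        "00:00:23.284  Create batches.",
        "00:01:33.747  ASKAT_RESUTS:  Block:  karim4k.block.1_100.1.askat  Sub-Block:  1 - 20   chr:pos:  1:100 - 1:2000       Id:  snp_1 - snp_20       p-value:  0.0006166988  Q:  38588.06  ...",
        "00:01:36.750  ASKAT_RESUTS:  Block:  karim4k.block.1_100.3.askat  Sub-Block:  1 - 20   chr:pos:  1:100100 - 1:102000  Id:  snp_1001 - snp_1020  p-value:  4.551971e-07  Q:  57434.71  ...",
    ]
    # List sub-blocks from most to least significant.
    for row in sorted(parse_results(sample), key=lambda r: r["pvalue"]):
        print(row["block"], row["start"], row["end"], row["pvalue"], sep="\t")
```

Because batches run in parallel, result lines are not emitted in genomic order, so sorting (by p-value or by position) after parsing is usually needed.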

Input data formats

The ASKAT wrapper requires the input data to be provided as two files, in TFAM and TPED format (for details, see the PLINK software package).

  • TFAM: This is the typical "Transposed FAM". A TFAM file holds individual and family information, one individual per row.
    The columns are:
    • Family ID
    • Individual ID
    • Paternal ID
    • Maternal ID
    • Sex (1=male; 2=female; other=unknown)
    • Phenotype
  • TPED: This is a "Transposed PED". A TPED file holds SNP information, one SNP per row (covering all samples).
    The columns are:
    • chromosome (1-22, X, Y or 0 if unplaced)
    • rs# or snp identifier
    • Genetic distance (morgans)
    • Base-pair position (bp units)
    • Columns 5 and on: Genotype information (two bases per sample)

E.g.: TFAM data:

1 1_1 0 0 1 4.91995
1 1_2 0 0 2 7.14442
1 1_3 1_1 1_2 2 5.26482
2 2_1 0 0 1 2.87951
2 2_2 0 0 2 1.52721
2 2_3 2_1 2_2 2 2.45878
2 2_4 2_1 2_2 1 2.59495
3 3_1 0 0 1 2.17923
3 3_2 0 0 2 3.91823
3 3_3 3_1 3_2 1 2.55475
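As an illustration of the six TFAM columns, the following Python sketch reads one row into a named record. The names `TfamRow`, `parse_tfam_line` and `is_founder` are hypothetical, and two assumptions are made: the phenotype is numeric (as in the example rows above), and a parental ID of 0 means "parent not in the data set", which is the usual PLINK convention.

```python
from typing import NamedTuple

SEX_CODES = {"1": "male", "2": "female"}  # any other code is treated as unknown


class TfamRow(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: float


def parse_tfam_line(line):
    """Parse one whitespace-separated TFAM row (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
    family, individual, father, mother, sex, phenotype = fields
    return TfamRow(family, individual, father, mother,
                   SEX_CODES.get(sex, "unknown"), float(phenotype))


def is_founder(row):
    """Assumption: '0' in both parental columns marks a founder."""
    return row.paternal_id == "0" and row.maternal_id == "0"


if __name__ == "__main__":
    # Rows copied from the TFAM example above.
    for text in ["1 1_1 0 0 1 4.91995", "1 1_3 1_1 1_2 2 5.26482"]:
        row = parse_tfam_line(text)
        print(row.individual_id, row.sex, row.phenotype, is_founder(row))
```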

E.g.: TPED data (long lines have been truncated):

1 snp_1 0 100 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_2 0 200 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_3 0 300 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_4 0 400 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_5 0 500 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_6 0 600 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A T T A A T A T A T A T A A A T A A A A A T A T A A A ...
1 snp_7 0 700 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_8 0 800 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ...
1 snp_9 0 900 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A T A A A A A T A T A A A A A ...
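To make the TPED layout concrete, here is a short Python sketch that splits one row into its four leading columns plus one allele pair per sample, and then computes a minor allele frequency of the kind the '-maxMaf' option filters on. The function names are hypothetical, the example row is made up (the rows above are truncated), and the sketch assumes biallelic SNPs with '0' as the missing-allele code (the usual PLINK convention). It is not the wrapper's own MAF computation.

```python
from collections import Counter


def parse_tped_line(line):
    """Split a TPED row into (chrom, snp_id, genetic_dist, pos, genotypes)."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns: {line!r}")
    chrom, snp_id, genetic_dist, pos = fields[:4]
    alleles = fields[4:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in pairs (two per sample)")
    # Pair up consecutive alleles: columns 5-6 are sample 1, 7-8 sample 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return chrom, snp_id, float(genetic_dist), int(pos), genotypes


def minor_allele_frequency(genotypes, missing="0"):
    """Frequency of the rarer allele; assumes a biallelic SNP."""
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # monomorphic or entirely missing
    return min(counts.values()) / total


if __name__ == "__main__":
    # Made-up row with four samples, in the same layout as the example above.
    chrom, snp_id, dist, pos, genotypes = parse_tped_line(
        "1 snp_x 0 600 A A T T A T A A")
    print(snp_id, pos, genotypes, minor_allele_frequency(genotypes))
```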