######################################################################################### TTTTTTTTTTTTTTTTTTTTTTT OOOOOOOOO BBBBBBBBBBBBBBBBB IIIIIIIIII T:::::::::::::::::::::T OO:::::::::OO B::::::::::::::::B I::::::::I T:::::::::::::::::::::T OO:::::::::::::OO B::::::BBBBBB:::::B I::::::::I T:::::TT:::::::TT:::::T O:::::::OOO:::::::O BB:::::B B:::::B II::::::II TTTTTT T:::::T TTTTTT O::::::O O::::::O B::::B B:::::B I::::I T:::::T O:::::O O:::::O B::::B B:::::B I::::I T:::::T O:::::O O:::::O B::::BBBBBB:::::B I::::I T:::::T O:::::O O:::::O B:::::::::::::BB I::::I T:::::T O:::::O O:::::O B::::BBBBBB:::::B I::::I T:::::T O:::::O O:::::O B::::B B:::::B I::::I T:::::T O:::::O O:::::O B::::B B:::::B I::::I T:::::T O::::::O O::::::O B::::B B:::::B I::::I TT:::::::TT O:::::::OOO:::::::O BB:::::BBBBBB::::::B II::::::II T:::::::::T OO:::::::::::::OO B:::::::::::::::::B I::::::::I T:::::::::T OO:::::::::OO B::::::::::::::::B I::::::::I TTTTTTTTTTT OOOOOOOOO BBBBBBBBBBBBBBBBB IIIIIIIIII TOBI: Tumor Only Boosting Identification of Driver Mutations Tumor-Only Boosting Identification (TOBI) is a framework for unified germline and somatic analysis analysis using largely tumor-only samples. TOBI uses gradient booosting to learn features of confirmed somatic variants from a small training set of tumor-normal sampless, then generates a classification model that identifies variants with somatic characteristics in tumor-only samples. First, WES files from tumor samples undergo variant calling, annotation, and filtering for quality. TOBI then merges variants across multiple samples. In pre-processing, TOBI labels variants from the training set as somatic (“som”) or non-somatic (“non_som"). Finally, in the machine learning step, TOBI generates and applies a somatic classifier. Ver. 1.2: April 12, 2016 cjmadubata & tchu modified from Alireza Roshan Ghias's code (Ver. 1.1: Nov 07, 2014 https://github.com/alireza202/TOBI.git TOBI) dependencies: - Python 2.7.11 - Perl v5.10.1 - R v3.1.2 - Java 1.7.0_25 - samtools 0.1.19 - bcftools 0.1.19 - VCFtools v0.1.10.1 - snpEff v3.6 & dbNSFP (https://sites.google.com/site/jpopgen/dbNSFP) - snpSift v3.6 ######################################################################################### ###varCall_filtering### inputs at each step: V (variant calling): indexed .bam files in a folder. Files must have .bam extension and filename cannot start with a number. A (annotation): .vcf files in a folder. Files must have .vcf extension and filename cannot start with a number. If starting from this step, please format vcf to match bcftools output. F (filter): .vcf files in a folder. Files must have .vcf extension and filename cannot start with a number. usage: TOBIvaf.py [-h] [--inputdir INPUTDIR] [--output OUTPUT] [--config CONFIG] [--steps STEPS] [--cluster {hpc,amazon}] [--debug] [--cleanup] [--ref REF] [--start START] [--end END] [--snpeff SNPEFF] [--annovcf ANNOVCF] [--dbnsfp DBNSFP] [--vcftype {default,TCGA}] [--mergename MERGENAME] TOBIv1.2: Tumor Only Boosting Identification of Driver Mutations All arguments can be specified in a config file. (See included varCall.config file as an example). Arguments: General Arguments: -h, --help show this help message and exit --inputdir INPUTDIR [REQUIRED] directory for bam/vcf files. --output OUTPUT [REQUIRED] output directory. --config CONFIG config file specifying command line arguments. Arguments specified in the command line overwrite config file arguments. --steps STEPS [REQUIRED] Specify which steps of pipeline to run. V: variant calling A: annotate F: filter M: merge eg. --steps AF --cluster {hpc,amazon} [REQUIRED] Specify which cluster to run on. hpc: run on an SGE hpc cluster amazon: CURRENTLY UNIMPLEMENTED --debug Debug/verbose flag. Default: False --cleanup Delete temporary debug files. Default True VCF Step Arguments: --ref REF [REQUIRED - VCF] Reference genome file. --start START Start index used for testing. Will not work in config. Default 1 --end END End index used for testing. Will not work in config. Default 74 Annotation Step Arguments: --snpeff SNPEFF [REQUIRED - ANNOTATE] Directory where snpEff is --annovcf ANNOVCF [REQUIRED - ANNOTATE] A comma separated list of .vcf files to annotate with. --dbnsfp DBNSFP [REQUIRED - ANNOTATE] Path to dbNSFP file Filter Step Arguments: --vcftype {default,TCGA} Specifies vcf type specically for TCGA filtering Merge Step Arguments: --mergename MERGENAME [REQUIRED - MERGE] Name for final merged file ######################################################################################### ### machine_learning ### Step 8. Pre-processing using R. Needs customization each time. usage: TOBIml.py [-h] [--input INPUT] [--output OUTPUT] [--somatic SOMATIC] [--log LOG] [--check_missed CHECK_MISSED] [--suffix SUFFIX] [--vcftype {default,TCGA}] [--train_size TRAIN_SIZE] [--verbose] {preprocess,machinelearning} TOBIv1.2: Tumor Only Boosting Identification of Driver Mutations. Machine learning step. positional arguments: {preprocess,machinelearning} preprocess: preprocessing step; machinelearning: machine learning step optional arguments: -h, --help show this help message and exit --input INPUT [REQUIRED] input file --output OUTPUT [REQUIRED] output file for PP, output folder for ML --somatic SOMATIC [REQUIRED] formatted file containing somatic variants --log LOG Optional argument to specify a log to pipe stdout and stderr to --check_missed CHECK_MISSED [PP ARG] checking which mutations in important genes are missed by filtering --suffix SUFFIX [ML ARG] a label specific to this particular run (e.g. <date>_<disease>) --vcftype {default,TCGA} Specifies vcf type specically for TCGA filtering --train_size TRAIN_SIZE [ML ARG] number of patients you want in the training set. --verbose verbose flag