GARFIELD-NGS: Genomic vARiants FIltering by dEep Learning moDels in NGS

By Viola Ravasio and Edoardo Giacopuzzi (edoardo.giacopuzzi@unibs.it) Version 1.0

USAGE: perl Predict.pl --input input.vcf[.gz] --output output.vcf --platform [illumina/ion]

GARFIELD-NGS is implemented as a Perl script and requires Java to perform predictions. The tool was tested on RHEL 7.2 with Perl 5.16 and Java JDK 1.8.0_65. bgzip is also required to handle compressed VCF files and must be available on your PATH.
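The requirements above can be checked before a run. The following is a minimal sketch, not part of GARFIELD-NGS: the helper name `missing_tools` is hypothetical, and it assumes the executables are called `perl`, `java` and `bgzip` on your system. It only tests that they are on the PATH, not their versions.

```python
import shutil

# Executables named in the requirements; bgzip is only needed for .vcf.gz input.
REQUIRED_TOOLS = ("perl", "java", "bgzip")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("perl, java and bgzip are all available")
```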

Clone this repository to any folder; the Predict.pl script can then be run from any location. All included files must remain in the same folder as Predict.pl. Temporary files are created during prediction in the same folder as the input VCF file and are removed automatically.

The script takes an input VCF (or vcf.gz) file and adds a CP value for each variant to the INFO field, generating a new VCF file. Prediction scores are reported as [Sample Name]_true=[value] in the output VCF.

The sequencing platform used for prediction (ion or illumina) must be specified with --platform.
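When Predict.pl is called from another script, the command line from the USAGE section can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_command` and `run_garfield` are hypothetical helper names, and only the three options shown in USAGE (--input, --output, --platform) are used.

```python
import subprocess

# Platform values accepted by --platform, as listed in the USAGE line.
PLATFORMS = ("illumina", "ion")


def build_command(predict_pl, input_vcf, output_vcf, platform):
    """Assemble the Predict.pl command line as an argument list (hypothetical helper)."""
    if platform not in PLATFORMS:
        raise ValueError("platform must be one of: " + ", ".join(PLATFORMS))
    return [
        "perl", str(predict_pl),
        "--input", str(input_vcf),
        "--output", str(output_vcf),
        "--platform", platform,
    ]


def run_garfield(predict_pl, input_vcf, output_vcf, platform):
    """Run Predict.pl and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(predict_pl, input_vcf, output_vcf, platform), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.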

GARFIELD-NGS requires VCF files generated by GATK for the Illumina platform or by TVC (Torrent Variant Caller) for the ION platform. Multi-sample VCF files can be processed; in this case, an independent CP value is added to the output file for each sample, in the format [Sample Name]_true=[value]. The tool returns a standard VCF file and can thus be easily integrated into established analysis pipelines.
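The per-sample scores can be read back from the INFO column of the output VCF. The sketch below assumes each score is stored as a semicolon-separated INFO entry whose key is the sample name followed by `_true`, as described above. The function name `cp_scores` and the example INFO string are hypothetical.

```python
def cp_scores(info):
    """Map sample name -> CP value for every [Sample Name]_true=[value] entry in INFO."""
    suffix = "_true"
    scores = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep or not key.endswith(suffix):
            continue  # flag without a value, or an unrelated INFO key
        try:
            scores[key[: -len(suffix)]] = float(value)
        except ValueError:
            continue  # assumption: non-numeric values are skipped, not reported
    return scores


# Hypothetical INFO column from a two-sample output VCF.
example_info = "DP=35;QD=12.4;SampleA_true=0.912;SampleB_true=0.044"
print(cp_scores(example_info))
```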

Predictions are based on four deep learning models, trained specifically on INDELs and SNPs from the Illumina and ION platforms. The calculated CP value ranges from 0 to 1, with higher values associated with true variants. Variants with a score below the filtering threshold should be considered false positives.

Please note that GARFIELD-NGS is optimized to work on single-sample VCF files. Multi-sample VCF files are supported, but prediction values may be less reliable. Our prediction models are based on INFO column values that variant callers compute per variant, so in multi-sample files these values reflect the cumulative data across all samples, which could reduce prediction reliability for individual samples.

Suggested CP thresholds for filtering:
ION: SNPs 0.139, INDELs 0.320
Illumina: SNPs 0.025, INDELs 0.630
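The thresholds above can be applied to a score as follows. This is a sketch rather than the tool's own logic: the names `THRESHOLDS`, `variant_type` and `passes_threshold` are hypothetical, and the SNP/INDEL classification (every allele exactly one base long means SNP, anything else means INDEL) is an assumption about how variants are split between the models.

```python
# Suggested CP thresholds from this README, keyed by --platform value and variant type.
THRESHOLDS = {
    "ion": {"SNP": 0.139, "INDEL": 0.320},
    "illumina": {"SNP": 0.025, "INDEL": 0.630},
}


def variant_type(ref, alt):
    """Classify a VCF record from its REF and ALT columns (assumed rule, see lead-in)."""
    alleles = [ref] + alt.split(",")
    return "SNP" if all(len(allele) == 1 for allele in alleles) else "INDEL"


def passes_threshold(score, ref, alt, platform):
    """Return True if the CP score is at or above the suggested threshold."""
    return score >= THRESHOLDS[platform][variant_type(ref, alt)]


# A CP of 0.2 keeps an ION SNP but rejects an ION deletion.
print(passes_threshold(0.2, "A", "G", "ion"))
print(passes_threshold(0.2, "AT", "A", "ion"))
```

Scores below the threshold are treated as false positives, in line with the recommendation above; a score exactly equal to the threshold is kept.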

For additional information on GARFIELD-NGS, see our paper preprint: http://www.biorxiv.org/content/early/2017/06/14/149146