/ForestQC

Quality control on genetic variants from next-generation sequencing data using random forest

Primary LanguagePythonMIT LicenseMIT

ForestQC

ForestQC uses a random forest classifier to identify high-quality and low-quality variants from genetic variants detected from next-generation sequencing data. Citation: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007556

Installation

To install the software. This software is compatible with OSX, 64-bit Linux and 64-bit Windows systems.

$ conda install forestqc -c avallonking

To update the software.

$ conda update forestqc -c avallonking

After installation, you can check the usage

$ ForestQC -h
# and check the usage for each command, for example, stat
$ ForestQC stat -h

We provide a usage example. Note that all example files are only for testing. Do not use them to analyze real data.

$ git clone git@github.com:avallonking/ForestQC.git
$ cd ForestQC/examples/example_script_and_output/
$ cat example.script.sh
$ sh example.script.sh

Workflow

example/example_output/example.script.sh provides step-by-step tutorial about how to run this software.

Before doing the classification, we should set the thresholds for Outlier_GQ and Outlier_DP. It would print out Outlier_DP threshold and Outlier_GQ threshold on the screen, which would be used as inputs in the next step. Idealy, all vcf files in this analysis should be included, but we suggest to use Chromosome 1 only to speed up the computation if the dataset is very large. Please make sure the memory size (set with -m argument) smaller than the total size of input files and the memory limit of your system

$ ForestQC set_outlier -m 1G -i [vcf_file1],[vcf_file2],[...]

If you use other reference genome other than hg19, please use our script to generate a GC content table for your reference genome. Otherwise, we have prepared hg19 GC content table in examples/gc_content_hg19.tsv and you can directly use it.

# generate GC content table for your reference genome
$ ForestQC compute_gc -i [ref_genome.fasta] -o [output_file]

First, we need to calculate ths statistics from vcf file. It will output a file containing all statistics information for each variant. Note:

  • It is highly recommended to calculate the statistics for different regions of the genome at the same time, which can make this procedure much faster
  • The vcf file should have the information of each individual on eahc site
  • You don't have to merge all vcf files together
  • If no discordant genotype file provided, the number of discordant genotype of all variants will be NA
  • If no pedigree file provided, the mendel errors of all variants will be NA
  • If no HWE p-value file provided, HWE p-value would be computed by ForestQC. Or if ForestQC cannot find HWE p-value for some sites in the provided HWE p-value file, ForestQC would compute HWE p-value for those sites
$ ForestQC stat -i [input_vcf] -o [output_filename] -c [gc_content_file] -g [gender_file(optional)] -p [ped_file (optional)] -d [discordant_genotype_file (optional)] -w [hwe_file(optional)] --gq [Outlier_GQ] --dp [Outlier_DP] -as [user_defined_statistics_file (optional)]

Second, we need to divide the dataset into high-quality, low-quality and "undertermined" variants. The output files would be three: good.xx (high-quality variants), bad.xx (low-quality variants) and grey.xx. (variants with undetermined quality) The output filename is only the suffix, that is, the latter part of the file name. All output files would be in the same folder with input files. Note that you don't have to merge the input file together. Also, the model used in classification and splitting must be the same. Or you would get ValueError

$ ForestQC split -i [input_file] -o [output_filename_suffix (optional)] -t [user_defined_threshold_file (optional)] -as [user_defined_statistics_names (if user-defined statistics added in last step, this is required)]

Third, we can train our random forest model and apply it on the classification of gray variants. The output files would be predicted_good_xx and predictred_bad_xx. The output filename is only the suffix, that is, the latter part of the file name. All output files would be in the same folder with input files. Note that you need to merge all high-quality variants into one file, all low-quality variants into one file and all "undetermined" variants into one file. Also, the model used in classification and splitting must be the same. Otherwise, you would get ValueError

$ ForestQC classify -g [good_variants] -b [bad_variants] -y [grey_variants] -o [output_filename_suffix (optional)] -t [probability_threshold (optional)] -f [selected_features_names (optional)]

Fourth, to get the final sets of good variants and bad variants, we need to combine the variants predicted by the random forest model (step 3) and the variants separated by the filters (step 2).

$ cat [good_variants(from step 2)] [predicted_good_variants(from step 3)] > [total_good_variants]
$ cat [bad_variants(from step 2)] [predicted_bad_variants(from step 3)] > [total_bad_variants]

Fifth, to get a VCF file containing the variants passing ForestQC, you can get the positions of bad variants and remove them from the original VCF file.

# if chromosome names in the original VCF file are like "chr1", "chr2", etc.
$ awk -F "\t" 'NR>1{print $2"\t"$3}' [total_bad_variants] > [bad_variants_positions]
# if chromosome names in the original VCF file are like "1", "2", etc.
$ awk -F "\t" 'NR>1{print $2"\t"$3}' [total_bad_variants] | sed 's/chr//' > [bad_variants_positions]
# remove bad variants from the original VCF file
$ vcftools --gzvcf [original_gziped_vcf_file] --exclude-positions [bad_variants_positions] --recode --recode-INFO-all -c | gzip -c > [output_vcf]

And VCFtools would need to be installed.

File format

Note that all files are tab-separated. If you came across any IndexError, please check your input files and make sure they are all tab-separated.

Output file

  • Statistics file (tab-separated, assuming that we included user-defined features)
RSID CHR POS REF ALT MAF Mean_DP Mean_GQ SD_DP SD_GQ Outlier_DP Outlier_GQ Discordant_Geno Mendel_Error Missing_Rate HWE ABHet ABHom GC user-defined_feature1 user-defined_feature2
chr1:144  1 144 A T 0.03  54.00 54.00 23.00 13.24 0.43 0.23 0.1 0.06 0.01 1.0 0.45 0.99 0.435 2.0 4.3
chr1:145  1 145 A T 0.03  54.02 52.00 26.00 11.64 0.33 0.43 0.2 0.03 0.03 1.0 0.49 0.98 0.435 NA 4.3
  • Final result file (Prbability means the probability of a variant being good. The variant is a good variant if Good = 1 or 1.0, or a bad variant if Good = 0, grey variants do not have Good column before it is predicted)
RSID CHR POS REF ALT MAF Mean_DP Mean_GQ SD_DP SD_GQ Outlier_DP Outlier_GQ Discordant_Geno Mendel_Error Missing_Rate HWE ABHet ABHom GC user-defined_feature1 user-defined_feature2 Probability Good
chr1:144  1 144 A T 0.03  54.00 54.00 23.00 13.24 0.43 0.23 0.1 0.03 0.01 1.0 0.45 0.99 0.435 2.0 4.3 1

Input file

  • Gender file(2 columns, tab-separated. f for female, m for male):
SampleID  Gender
12312sd12  m
4342sd423  f
7hi987989  0     # Any samples whose gender is neither m nor f will be ignored
  • Pedigree file(9 columns, tab-separated). Note that SeqID is the sample ID occuring in VCF files. If control information is provided, ForestQC only calculate HWE p-value for each site with control samples.
FamilyID IndividualID FatherID MotherID Sex PhenotypeID(1 if it is control) DBPID SamID SeqID
C1  2 3 4 m  1  xxx xxx xxx
  • Discordant genotype rate file (2 columns, tab-separated, gzip-compressed or not)
SNP_ID  Discordant_Genotype_Rate
chr1:259   0.01
chr3:122   0.20
  • HWE p-value file (2 columns, tab-separated):
SNP_ID  HWE_p-value
chr3:899   1.0
chr2:900   0.77
  • User-defined features (tab-separated, data are calculated by users in advance. In practice, missing values will be imputed with the median of the sample)
RSID feature1 feature2 feature3
chr1:32 2.3 4.5 2.3
chr2:34 4.3 NA 5.6
  • User-defined filter thresholds (tab-separated, 4 columns, "rare" means filters for rare variants, "common" for common variants, "all" for both)
good all filter threshold
bad all MAF threshold
bad rare Mendel_Error threshold
bad common Missing_Rate threshold
outlier rare Missing_Rate threshold
outlier common Missing_Rate threshold
  • GC content file (tab-separated, 3 columns, including chrosome, starting positions and GC content)
chrM 1 0.3
chrM 1001 0.46