k-mer based filtering of ExpansionHunter 5 output.
- Version number change only from v0.12.0.
- --save_filtered flag added to save removed calls in a separate output VCF (_removed.vcf)
- --auto flag added to allow auto margin resizing on genotypes below the read length. This allows the separate use of both --rank and --auto in one command.
- Revise kmer_mul argument to take a float in range 0.0-1.0. Default value = 0.2
- Update margin tightening for genotyped motif lengths < read length
- new handling for motifs < read length - if a motif is < the read length the margin size is set to the locus + 20% of the genotyped motif length. This reduces the left-censoring behaviour the filtering was exhibiting at certain loci
- reference-sized STRs should now be correctly reported, unless they occur as the result of a LowDepth call - ie genotypes of ./. in the format string will be filtered, but genotypes of / will not.
- fixes
--rank
parameter - adds
auto
option to--rank
. This is recommended for most uses.auto
will act as--rank 1
on loci with only one motif defined, but at motifs where n loci are defined, it will act as--rank n
.
Ensure you have Python 3.6 or later installed. You can download Python from the official website: https://www.python.org/
To install Pysam, use pip:
pip install pysam
If you need to install python libraries to user install directories:
pip install pysam --user
Alternatively, install Conda, then:
conda env create -f environment.yml
source activate eh5_kmer
python kmer_filter.py -b <bam_path> -c <catalog_path> -v <vcf_path> -o <output_path> -m <margin> -r <kmer_mul> --logs
- -b, --bam: Path to the EH5 BAM file.
- -c, --catalog: Path to the JSON catalog file used by EH5.
- -v, --vcf: Path to the EH5 VCF file.
- -o, --output: Path for the output files. The script will generate _validated.vcf (and optionally _kmers.tsv.gz and _removed.vcf).
- -m, --margin: Margin for fetching sequence reads from the BAM file (default is 1,000).
- --rank: integer. k-mer rank; will accept a call if it is in the top (rank) of kmers at the locus in BAMs (default = 1).
- --auto: boolean. If present, will act as --rank n at motifs where n loci are defined.
- -r, --kmer_mul: Multiplier for the k-mer size (float in range 0.0 - 1.0, default is 0.2).
- --keep_lowdepth: boolean. Low depth calls are removed by default. If present, calls flagged as LowDepth are kmer checked (default = False)
- --logs: boolean. If present, will save _kmers.tsv.gz file containing all kmer counts in the BAM.
- --save_filtered: boolean. If present, will save removed calls in a separate file _removed.vcf.
To process a single sample (here imaginatively called 'sample'):
Without the log containing all kmer counts:
python kmer_filter.py -b sample.bam -c catalog.json -v sample.vcf -o ./sample_kmer
With the log containing all kmer counts:
python kmer_filter.py -b sample.bam -c catalog.json -v sample.vcf -o ./sample_kmer --logs