A simple tool to filter BLASTx results with special emphasis on ancient DNA studies. xFilter implements the same filtering approach as FAMLI but adds a couple of features designed for the annotation of ancient DNA short reads. FAMLI solved one the principal problems when annotating short reads by iteratively assigning multi-mapped reads to the most likely true protein. You can find more information here.
In addition to the filter that removes references with uneven coverage, xFilter allows to filter them based on the expected breadth of coverage. xFilter can dynamically trim the references for the coverage calculations using the average alignment length of the queries mapped to each reference. Furthermore, xFilter allows to aggregate the coverage values of the filtered references into higher categories, like KEGG orthologs or viral genomes.
For the BLASTx searches we recommend to use MMSseqs2 with parameters optimized for ancient DNA data as described [here]
We recommend having conda installed to manage the virtual environments
First, we create a conda virtual environment with:
wget https://raw.githubusercontent.com/genomewalker/x-filter/master/environment.yml
conda env create -f environment.yml
Then we proceed to install using pip:
pip install x-filter
Using pip
pip install git+ssh://git@github.com/genomewalker/x-filter.git
By cloning in a dedicated conda environment
git clone git@github.com:genomewalker/x-filter.git
cd x-filter
conda env create -f environment.yml
conda activate x-filter
pip install -e .
xFilter uses a BLASTx m8 formatted file containing aligned reads to references. It has to contain query and subject lengths. For a complete list of options:
$ xFilter --help
usage: xFilter [-h] [-i INPUT] [-t THREADS] [-p PREFIX] [-n ITERS] [-e EVALUE] [-s SCALE] [-b BITSCORE] [-f FILTER]
[--breadth BREADTH] [--breadth-expected-ratio BREADTH_EXPECTED_RATIO] [--depth DEPTH]
[--depth-evenness DEPTH_EVENNESS] [--evalue-perc EVALUE_PERC] [--evalue-perc-step EVALUE_PERC_STEP]
[-m MAPPING_FILE] [--no-trim] [--anvio] [--annotation-source ANNOTATION_SOURCE] [--debug] [--version]
A simple tool to filter BLASTx m8 files using the FAMLI algorithm
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
A blastx m8 formatted file containing aligned reads to references. It has to contain query
and subject lengths (default: None)
-t THREADS, --threads THREADS
Number of threads to use (default: 1)
-p PREFIX, --prefix PREFIX
Prefix used for the output files (default: None)
-n ITERS, --n-iters ITERS
Number of iterations for the FAMLI-like filtering (default: 25)
-e EVALUE, --evalue EVALUE
Evalue where to filter the results (default: 1e-10)
-s SCALE, --scale SCALE
Scale to select the best weithing alignments (default: 0.9)
-b BITSCORE, --bitscore BITSCORE
Bitscore where to filter the results (default: 60)
-f FILTER, --filter FILTER
Which filter to use. Possible values are: breadth, depth, depth_evenness,
breadth_expected_ratio (default: breadth_expected_ratio)
--breadth BREADTH Breadth of the coverage (default: 0.5)
--breadth-expected-ratio BREADTH_EXPECTED_RATIO
Expected breath to observed breadth ratio (default: 0.5)
--depth DEPTH Depth to filter out (default: 0.1)
--depth-evenness DEPTH_EVENNESS
Reference with higher evenness will be removed (default: 1.0)
--evalue-perc EVALUE_PERC
Percentage of the -log(Evalue) to filter out results (default: None)
--evalue-perc-step EVALUE_PERC_STEP
Step size to find the percentage of the -log(Evalue) to filter out results (default: 0.1)
-m MAPPING_FILE, --mapping-file MAPPING_FILE
File with mappings to genes for aggregation (default: None)
--no-trim Deactivate the trimming for the coverage calculations (default: True)
--anvio Create output compatible with anvi'o (default: False)
--annotation-source ANNOTATION_SOURCE
Source of the annotation (default: unknown)
--debug Print debug messages (default: False)
--version Print program version
One would run xFilter as:
xFilter --input xFilter-1M-test.m8.gz --bitscore 60 --evalue 1e-5 --filter breadth_expected_ratio --breadth-expected-ratio 0.4 --n-iters 25 --mapping-file ko_gene_list.tsv.gz --threads 8
--input: A BLASTx tabular output with the query and subject lengths.
--bitscore: Bitscore where to filter the BLASTx results.
--evalue: Evalue where to filter the BLASTx results.
--filter: Which filter to use to filter the references. We have the following options:
- depth: Filter out references with a depth below a given threshold.
- depth_evenness: Filter out references with a depth evenness below a given threshold. This is FAMLI's approach (SD/MEAN)
- breadth: Filter out references with a breadth below a given threshold.
- breadth_expected_ratio: Filter out references with a breadth below the observed breadth to expected breadth ratio as in breadth/(1 - e-coverage).
--breadth-expected-ratio: The observed breadth to expected breadth ratio.
--mapping-file: File with mappings to genes for aggregation. It contains two columns: the gene name and the grouping.
--threads: Number of threads
xFilter by default will trim the coverage values on both 5' and 3' ends based on the average alignment length (in amino acid) mapped to each query. This can be deactivated with the --no-trim option.
xFilter will generate the following files:
- {prefix}_no-multimap.tsv.gz: This file contains the filtered BLASTx results after removing non-well supported references and multi-mappings.
- {prefix}_cov-stats.tsv.gz: This file contains the coverage statistics of the references after the filtering. It contains the following columns:
- reference: Reference name
- depth_mean: Coverage mean
- depth_std: Coverage standard deviation
- depth_evenness: Coverage evenness (SD/MEAN)
- breadth: Breadth of coverage
- breadth_expected: Expected breadth of coverage
- breadth_expected_ratio: Observed breadth to expected breadth ratio (scaled)
- n_alns: Number of alignments
- {prefix}_group-abundances-agg.tsv.gz: If a mapping file is provided, it reports:
- group: Group name
- coverage_mean: Mean coverage values
- coverage_stdev: Standard deviation of coverage values
- coverage_median: Median coverage values
- coverage_sum: Sum of coverage values
- n_genes: Number of genes in the group
- avg_read_length: Average read length of the reads mapping to a gene
- stdev_read_length: Standard deviation of read length of the reads mapping to a gene
- avg_identity: Average identity of the reads mapping to a gene
- stdev_identity: Standard deviation of identity of the reads mapping to a gene
- {prefix}_group-abundances.tsv.gz: If a mapping file is provided, it reports:
- reference: Reference name
- group: Group name
- depth_mean: Coverage mean
- depth_std: Coverage standard deviation
- depth_evenness: Coverage evenness (SD/MEAN)
- breadth: Breadth of coverage
- breadth_expected: Expected breadth of coverage
- breadth_expected_ratio: Observed breadth to expected breadth ratio
- n_alns: Number of alignments
- avg_read_length: The average of the average lengths of the reads mapping to the genes in a group
- stdev_read_length: The average of the standard deviations of the lengths of the reads mapping to the genes in a group
If you use --anvio
it will generate the output necessary for the anvi-estimate-metabolism
program. Check here for its usage. You can define the source
using --annotation-source
Note: At the moment only available in anvi'o
master
- {prefix}_group-abundances-anvio.tsv.gz
- gene_id: Reference name
- enzyme_accession: Group name
- source: Source of the annotation
- coverage: Coverage mean
- detection: Breadth of coverage
At the moment we only can handle files with 2,147,483,647
alignments owing to a restriction imposed by datatable. If there are more alignments, xFilter
will read the files by batches and try to filter the alignments using an iterative approach where each reference keeps the alignments within a given percentage of the best -log(evalue)
. By default, xFilter
will start iteratively trying to keep 90% of the alignments until the 10% for each reference, using steps of 10% and then it will stop if the data cannot be fitted within the datatable
limitations. The user can specify the granularity of the iterability with the --evalue-perc-step
options. A specific threshold can also be provided with --evalue-perc-step
xFilter performs a very rough estimation of the number of alignments based on reading 1% of the file and estimate the size of each line