metacleaner

Automated curation of barcode sequence databases for metabarcoding and metagenomics

DNA barcode reference databases generated by tools like MetaCurator - which operate on sequences retrieved from NCBI - sometimes contain falsely labelled accessions (e.g., see: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13314). In particular, we found that pollen metabarcoding experiments using plant ITS1 and ITS2 region databases would yield many reads corresponding to random sequences in fungi, or to ITS1/ITS2 sequences in plants of the wrong genera or multiple genera at the same % identity and query coverage. You can read a detailed report of this problem using an example dataset on my research website.

metacleaner takes as input a .fasta file and searches for hits against "good" and "bad" sequence databases to filter undesired accessions before downstream use.

Workflow:

Query sequences are searched for hits against known undesired sequences using blastn; query sequences with hits above user-defined thresholds for percentage identity (pident) and query coverage (qcovs) are flagged as mislabeled.
Query sequences are searched for hits against known desired sequences using blastn. Query sequences with no hits above user-defined thresholds for percentage identity (pident) and query coverage (qcovs) are flagged as potentially mislabeled, while hits above these thresholds are flagged as candidate clean sequences.
Taxonomy info for the top hits against desired sequences for both the candidate clean sequences and potentially mislabeled sequences is retrieved using taxonomizr and compared against the taxonomy info of the query sequence. If the taxonomy info of the query and subject (including cases where there are multiple hits) are not similar at a user-defined level of taxonomy (one of superkingdom, phylum, class, order, family, genus, or species), the query sequence is flagged as mislabeled. An additional inclusion filter can be set with -w, --addfilter. All hits at the pident and qcovs thresholds must have this information or they will be flagged as mislabeled.
Flagged sequences are filtered from the query database.

Output:

FILEPREFIX_badseqids.txt - a tab delimited file listing: 1) the flagged query accession, 2) taxonomy info for the query accession 3) the top hit for the query sequence against the provided sequence databases, 4) taxonomy info for the subject accession, and 5) the reason for flagging the query sequence.
FILEPREFIX_clean.fasta - the cleaned sequences.
FILEPREFIX_clean.tax - taxonomy information for the cleaned sequences.

Setup:

metacleaner requires as input:

A properly formatted .fasta file (no line breaks) with NCBI accession IDs as sequence headers
A path to a directory containing either blastdb databases - corresponding to the known "good" and "bad" sequences - generated with makeblastdb, or .fasta files to create the databases from (for example, plant and non-plant ITS1/ITS2 sequences)

Additionally, taxonomizr requires an SQLite database called accessionTaxa.sql. The directory containing this file can be specified using -d or --taxadb, or will otherwise be generated via taxonomizr::prepareDatabase() (warning: this requires up to 70Gb of disk space and may take several hours to complete).

Usage:

Important note: metacleaner.py must be run from the same directory as taxafilter.R!

python3 metacleaner.py --query /path/to/query.fasta --outdir /path/to/output_directory --blastdbdir /path/to/blastdb_directory --badblastdb badSeqs --badblastdbinput badSeqs.fasta --goodblastdb goodSeqs --goodblastdbinput goodSeqs.fasta --taxadb /path/to/directory/containing/accessionTaxa.sql

Options:

-q, --query <(required) path to query fasta file>   
-o, --outdir <path to output directory>   
-e, --pident <(default=100) percent identity threshold for filtering>   
-v, --qcovs <(default=100) query cover threshold for filtering>   
-s, --sortonly <(default=F) one of T/F. T = start at sorting step (requires previously generated blastn output files in output directory specified by -o)    
-l, --filterlevel <(default=genus) one of superkingdom, phylum, class, order, family, genus, or species>   
-b, --blastdbdir <(required) path to badblastdb directory>   
-x, --badblastdb <(required) badblastdb name>    
-f, --badblastdbinput <path to fasta file for badblastdb (required if badblastdb does not already exist in directory specified by -b)>   
-y, --goodblastdb <(required) goodblastdb name>    
-g, --goodblastdbinput <path to fasta file for goodblastdb (required if goodblastdb does not already exist in directory specified by -b)>   
-d, --taxadb <(required unless constructing from scratch) path to directory containing accessionTaxa.sql file for taxonomizr>    
-c, --chunks <(default=100) number of chunks to split query file into (higher values may increase speed for larger query files)>   
-t, --threads <(default=1) number of threads for blastn>  
-w, --addfilter <Additional filter level, one of taxa levels as in -l (--filterlevel) and filter value, separated by a comma (e.g., class,Magnoliopsida);
hits at pident and qcovs with taxa info other than that set in addfilter will be removed>

If you use metacleaner in your research, please cite:

Crone, M., Boyle, N., Bresnahan, S.T., Biddinger, D., Grozinger, C.M. (2023). More than mesolectic: Characterizing the nutritional niche of Osmia cornifrons. Ecology and Evolution 13, e10640. https://doi.org/10.1002/ece3.10640

sbresnahan/metacleaner