DominikBuchner/BOLDigger

Discrepancy between BOLDigger output and BOLD's identification engine

Closed this issue · 3 comments

Hi,

I came across something weird by chance while going through my BOLDigger output file.

I have ~8,000 COI metabarcoding sequences which I classified with BOLDigger. I was using boldigger-cline v2.1.2 at that time, which was in July 2023. I opened this issue here because I don't think this is an issue specific to the command-line tool.

Below are the top 20 hits for ASV17731:

[image]

When I manually check this sequence on BOLD's website against the All Barcode Records on BOLD database, I get the following nearest matches:

[image]

What bothers me is that these 20 matches in BOLD and BOLDigger are almost identical. But BOLDigger says this ASV has 87.38% similarity with the insect family Chironomidae, while on BOLD, this similarity value is attributed to a taxon of Ochrophyta.

I checked this now, in August 2023, so a month after the initial classification. But I honestly don't think that the time gap has anything to do with this. Or am I wrong?

How can this sequence, according to BOLDigger, have the exact same similarity value for an insect family as well as an alga, while the former is not even listed in the output when I manually consult the BOLD identification engine?

This is the sequence in case you would like to reproduce the problem:

ASV17731
ATTATCATCTATTCAAGCGCATTCAGGGCCTTCAGTAGATATGGCGATTTTTAGTTTACATTTATCAGGTGCAGGTTCTATTTTAGGAGCAATTAATTTTATTGTAACTATCTTTAACATGCGTGCCCCAGGACTTTTCTTACATAAAATGCCTCTTTTTGTATGATCTGTATTAGTAACTGCATTTTTACTTTTATTATCTTTACCAGTTTTCGCTGGAGCAATTACTATGCTTTTAACAGATCGTAACTTTAATACAAGCTTTTATGATCCTGCCGGAGGAGGAGATCCAGTATTATACCAACATCTTTTC

Cheers

nauras

Can you please check if this issue persists once you have figured out your versioning problems? It might be solved by one of the more recent versions!

Yes, just had the same thought. I most likely didn't use v2.2.0 after all, but 1.0.0.
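For anyone hitting the same kind of version confusion, a quick way to confirm which release is actually installed in the active environment is to ask Python's package metadata directly. This is only a minimal sketch: it uses the standard-library `importlib.metadata`, and the distribution names `boldigger-cline` and `boldigger` are assumed to match what was installed via pip. The helper `installed_version` is a hypothetical name for illustration, not part of BOLDigger.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to what pip reports.
    for name in ("boldigger-cline", "boldigger"):
        print(name, installed_version(name) or "not installed")
```

Running this inside the same environment that is used for the classification avoids mixing up versions across several environments.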

Update: after updating boldigger-cline to v2.2.1, this issue has been solved.

I just tested this with a fasta file of 10 sequences containing the ASV in question. The top 20 hits now equal the top 20 hits when performing a manual identification on BOLD.
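A small test file like the one described above can be put together with a few lines of plain Python. The sketch below is not taken from BOLDigger; the function names (`read_fasta`, `subset_with_target`, `write_fasta`) are hypothetical, and it assumes that the sequence ID is the first whitespace-delimited token of each FASTA header.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there is one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def record_id(header):
    """Return the first whitespace-delimited token of a header ('' if empty)."""
    parts = header.split()
    return parts[0] if parts else ""


def subset_with_target(records, target_id, size=10):
    """Return up to `size` records, with the target record placed first."""
    target = [r for r in records if record_id(r[0]) == target_id]
    if not target:
        raise ValueError(f"{target_id} not found in input")
    others = [r for r in records if record_id(r[0]) != target_id]
    return target[:1] + others[: size - 1]


def write_fasta(records, path):
    """Write (header, sequence) tuples as single-line FASTA records."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")
```

With hypothetical file names, `write_fasta(subset_with_target(read_fasta("all_asvs.fasta"), "ASV17731"), "test_10.fasta")` would produce a 10-sequence file that includes the ASV in question and can be passed to the tool for a quick re-check.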