Sequence identifier warnings from `get-ncbi-data`.
mikerobeson commented
A user initially reported this issue when running the following command:
qiime rescript get-ncbi-data \
--p-query "txid4751[ORGN] AND (ITS1 OR ITS2 OR its1 OR its2) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
--p-ranks kingdom phylum class order family genus species \
--p-rank-propagation \
--p-n-jobs 4 \
--o-sequences ITS-ref-seqs-ng.qza \
--o-taxonomy ITS-ref-tax-ng.qza \
--verbose
This produced the following warnings:
WARNING:2023-05-10 08:31:04,095:MainProcess:Using pdb|8E5T|3 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7V08|6 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7UQZ|6 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7UQB|6 as a sequence identifier, because it did not come down with an accession version.
...
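To see how many distinct records are affected, the identifiers can be pulled out of a saved copy of the `--verbose` output. The following is a minimal sketch, not part of RESCRIPt: the function name `fallback_identifiers` is hypothetical, and the pattern assumes the warning text is exactly as shown above.

```python
import re

# Assumption: the warning wording matches the log lines quoted in this issue.
WARNING_RE = re.compile(
    r"Using (\S+) as a sequence identifier, "
    r"because it did not come down with an accession version"
)


def fallback_identifiers(log_text):
    """Return identifiers named in the warnings, in first-seen order, without duplicates."""
    seen = {}
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            # dict keys keep insertion order, so this de-duplicates stably
            seen.setdefault(match.group(1), None)
    return list(seen)
```

Passing the captured log text to `fallback_identifiers` gives a list that can be compared against the headers in the exported FASTA file.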
I was able to reproduce the issue. I exported the resulting FASTA file and observed sequences with headers like those shown above. I also manually ran BLAST on a few of these sequences, and they did appear to contain ITS sequences, though I have not tested this thoroughly. I am not sure why pdb identifiers are used when the returned data might actually contain the requested ITS DNA sequences.
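For anyone who wants to repeat the check on their own export, here is a rough sketch of how the affected records could be listed. It is not RESCRIPt code: both function names are hypothetical, and `split_pdb_id` assumes the identifiers follow a `pdb|ENTRY|CHAIN` layout, as the examples in the warnings suggest.

```python
def pdb_style_headers(fasta_text):
    """Return the IDs of FASTA records whose identifier starts with 'pdb|'."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # The ID is the first whitespace-delimited token of the header.
            seq_id = fields[0] if fields else ""
            if seq_id.startswith("pdb|"):
                ids.append(seq_id)
    return ids


def split_pdb_id(seq_id):
    """Split 'pdb|ENTRY|CHAIN' into (entry, chain); return None for other layouts."""
    parts = seq_id.split("|")
    if len(parts) == 3 and parts[0] == "pdb":
        return parts[1], parts[2]
    return None
```

Reading the exported FASTA file into a string and passing it to `pdb_style_headers` lists the affected records. The entry part returned by `split_pdb_id` can then be looked up by hand to see where a given sequence came from.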
The warning message comes specifically from these lines in `ncbi.py`.
This is probably not a true bug, but it does make it difficult to trace these data back to their origin.