bokulich-lab/RESCRIPt

Sequence identifier warnings from `get-ncbi-data`.


A user initially reported this issue when running the following command:

qiime rescript get-ncbi-data   \
    --p-query "txid4751[ORGN] AND (ITS1 OR ITS2 OR its1 OR its2) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 4 \
    --o-sequences ITS-ref-seqs-ng.qza \
    --o-taxonomy ITS-ref-tax-ng.qza \
    --verbose

This resulted in the following warnings:

WARNING:2023-05-10 08:31:04,095:MainProcess:Using pdb|8E5T|3 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7V08|6 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7UQZ|6 as a sequence identifier, because it did not come down with an accession version.
WARNING:2023-05-10 08:31:04,096:MainProcess:Using pdb|7UQB|6 as a sequence identifier, because it did not come down with an accession version.
...

I was able to reproduce the issue. I exported the resulting FASTA file and did observe sequences with headers like those shown above. I also manually ran BLAST on a few of the sequences, and they did appear to contain ITS sequences, though I have not tested this thoroughly. I am not sure why PDB identifiers are used when the returned data might actually contain the requested ITS DNA sequences.
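To help trace which records are affected, here is a minimal sketch for listing PDB-style headers in an exported FASTA file. It is not part of RESCRIPt: the function name `find_pdb_headers` and the regular expression are my own, and I am assuming the identifiers follow the `pdb|entry|chain` layout seen in the warnings above.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: identifiers look like the ones in the warnings, e.g. "pdb|8E5T|3",
# where the second field is a 4-character PDB entry and the third is the chain.
PDB_ID = re.compile(r"^pdb\|([0-9A-Za-z]{4})\|(\S+)$")


def find_pdb_headers(fasta_lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (sequence_id, pdb_entry, chain) for each PDB-style FASTA header."""
    hits = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # skip empty headers
        seq_id = fields[0]  # the identifier is the first token after ">"
        match = PDB_ID.match(seq_id)
        if match:
            hits.append((seq_id, match.group(1), match.group(2)))
    return hits
```

The function accepts any iterable of lines, so an open file handle for the exported FASTA can be passed in directly. The returned PDB entry IDs can then be looked up individually to check what each record actually is.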

The warning message comes from these lines in ncbi.py.

This is probably not a true bug, but it makes it difficult to trace the origin of these sequences.