Discrepancy in Size of nt Dataset Downloads: Direct Link vs. update_blastdb.pl Command
ShuZishan opened this issue · 2 comments
Why is the nt dataset downloaded from https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ larger (378 GB) than the one downloaded with the command update_blastdb.pl --decompress nt (151 GB)? Why do the two downloads differ? Could you provide details on what data has been added or removed, and the reasons for these changes? I would greatly appreciate it.
You want to go to NCBI for that info since they set up both sets of data: https://www.ncbi.nlm.nih.gov/books/NBK62345/
nt.##.tar.gz The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, con, and environmental sequences are excluded. RefSeq genomic entries are also excluded.
nt.gz The FASTA equivalent of the nt.##.tar.gz database files.
Search the page for "Getting the preformatted database files" for a description of the benefits of the files downloaded through update_blastdb.pl.
But here's the ultimate explanation for the file size discrepancy:
Preformatted database files remove the makeblastdb formatting steps, and saves valuable processing time and diskspace
If I understand correctly, the preformatted downloads are stored as optimized binary databases rather than as plain-text FASTA files, so the same sequences take up less space.
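To make the binary-versus-text point concrete, here is a toy sketch of how packing nucleotides into bits shrinks them relative to FASTA text. It is not the real BLAST database layout: the 2-bits-per-base encoding, the helper names pack_2bit and fasta_size, and the 1,000-base sequence are all assumptions made for illustration.

```python
# Toy illustration only: this is NOT the actual BLAST on-disk format.
# Assumption: each unambiguous base (A, C, G, T) is stored in 2 bits.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def fasta_size(header: str, seq: str, width: int = 80) -> int:
    """Bytes needed for one plain-text FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    text = ">" + header + "\n" + "\n".join(lines) + "\n"
    return len(text.encode("ascii"))


seq = "ACGT" * 250  # hypothetical 1,000-base sequence
packed = pack_2bit(seq)
print("packed bytes:", len(packed))
print("FASTA bytes: ", fasta_size("toy_record", seq))
```

In this toy case the packed form is roughly a quarter the size of the text form. A real preformatted database also carries headers, indexes, and other metadata, which would be one reason the observed gap (378 GB vs. 151 GB) is smaller than four-fold.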