Identify the excluded genomes and why they are excluded
ccbaumler commented
Upon comparing the two final manifests:
wc -l manifest/sourmash.manifest.original.csv
65705 manifest/sourmash.manifest.original.csv
wc -l manifest/sourmash.manifest.csv
64375 manifest/sourmash.manifest.csv
comm -23 <(sort manifest/sourmash.manifest.original.csv) <(sort manifest/sourmash.manifest.csv) | wc -l
1330
The 1330 excluded genomes may be exported into a new file, excluded.genomes.csv:
head excluded.genomes.csv
signatures/000eca019c6c56e66c37f649faabea61.sig.gz,000eca019c6c56e66c37f649faabea61,000eca01,31,DNA,0,1000,5917,1,"GCF_000745545.1 Caulobacter henricii strain=CF287, ASM74554v1",/dev/fd/63
signatures/0013ee76e5e54ff16572c6019ac87675.sig.gz,0013ee76e5e54ff16572c6019ac87675,0013ee76,31,DNA,0,1000,3113,1,"GCF_004345375.1 Nitrosomonas sp. Nm134 strain=Nm134, ASM434537v1",/dev/fd/63
signatures/004c6c71df933b0727f08d39ac08ede0.sig.gz,004c6c71df933b0727f08d39ac08ede0,004c6c71,31,DNA,0,1000,3976,1,"GCF_011065905.1 Clostridium estertheticum strain=FP4, ASM1106590v1",/dev/fd/63
signatures/006c16914076de98b82c3051fd6d3152.sig.gz,006c16914076de98b82c3051fd6d3152,006c1691,31,DNA,0,1000,4587,1,"GCF_000968535.2 Methylomicrobium alcaliphilum 20Z strain=20Z, ASM96853v1",/dev/fd/63
signatures/007e39a06d05ecbd99f4171d71bcd29f.sig.gz,007e39a06d05ecbd99f4171d71bcd29f,007e39a0,31,DNA,0,1000,4581,1,"GCF_000245055.1 Desulfovibrio sp. U5L strain=U5L, ASM24505v1",/dev/fd/63
signatures/00a238371d14b6c28dec7e031038892d.sig.gz,00a238371d14b6c28dec7e031038892d,00a23837,31,DNA,0,1000,3197,1,"GCF_000211855.2 Lacinutrix sp. 5H-3-7-4 strain=5H-3-7-4, ASM21185v3",/dev/fd/63
signatures/00e81a5cad7337fcb893c47b08b6deb8.sig.gz,00e81a5cad7337fcb893c47b08b6deb8,00e81a5c,31,DNA,0,1000,9994,1,"GCF_003752655.1 Streptomyces griseorubiginosus strain=SAI-142, ASM375265v1",/dev/fd/63
signatures/014909b6d652f382c8204d23cb3f144f.sig.gz,014909b6d652f382c8204d23cb3f144f,014909b6,31,DNA,0,1000,2707,1,"GCA_011333355.1 Deltaproteobacteria bacterium, ASM1133335v1",/dev/fd/63
signatures/017eec1003b2034b03ed00d3f18179a6.sig.gz,017eec1003b2034b03ed00d3f18179a6,017eec10,31,DNA,0,1000,4379,1,"GCF_000243715.2 Leptospira broomii serovar Hurstbridge str. 5399 strain=5399, gls454050v02",/dev/fd/63
signatures/01c84d63434ff050360b40cb49897db9.sig.gz,01c84d63434ff050360b40cb49897db9,01c84d63,31,DNA,0,1000,5848,1,"GCF_000364225.1 Eubacterium plexicaudatum ASF492 strain=ASF492, Euba_plex_ASF492_V1",/dev/fd/63
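The export step itself is not shown above. A minimal sketch, assuming it reuses the same `comm -23` comparison and redirects the result to a file; the helper name `export_excluded` is hypothetical, and temporary files stand in for the process substitution so the function also runs under plain `sh`:

```shell
# Hypothetical helper: write the lines that appear in the first manifest
# but not in the second to an output file.
export_excluded() {
    original=$1
    current=$2
    out=$3
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || return 1
    # comm needs both inputs sorted identically; LC_ALL=C pins the collation.
    LC_ALL=C sort "$original" > "$tmp_a"
    LC_ALL=C sort "$current" > "$tmp_b"
    # -2 drops lines unique to the second file, -3 drops lines common to both.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b" > "$out"
    rm -f "$tmp_a" "$tmp_b"
}

# Usage with the paths from this issue (skipped if the manifests are absent).
if [ -f manifest/sourmash.manifest.original.csv ] && [ -f manifest/sourmash.manifest.csv ]; then
    export_excluded manifest/sourmash.manifest.original.csv \
        manifest/sourmash.manifest.csv excluded.genomes.csv
    wc -l excluded.genomes.csv
fi
```

Because the comparison is line-based, a row counts as excluded whenever any of its fields differs between the two manifests, not only when the genome is missing.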
The first five entries in excluded.genomes.csv have one of three statuses: suppressed, replaced by another version, or apparently fine (it seems like it should work):
| genbank | refseq | name | status |
|---|---|---|---|
| GCA_000745545.1 | GCF_000745545.1 | ASM74554v1 | suppressed |
| GCA_004345375.1 | GCF_004345375.1 | ASM434537v1 | suppressed |
| GCA_011065905.1 | GCF_011065905.1 | ASM1106590v1 | replaced v2 |
| GCA_000968535.1 | GCF_000968535.2 | ASM96853v1 | Appears good |
| GCA_000245055.1 | GCF_000245055.1 | ASM24505v1 | suppressed |
I am thinking that I could include a script to parse the genbank/refseq string to check the FTP server status.
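The parsing step could be sketched roughly as follows. The function names `extract_accessions` and `accession_to_url` are hypothetical, and the URL layout (`genomes/all/<GCA|GCF>/<3 digits>/<3 digits>/<3 digits>/`) is an assumption about how the NCBI server arranges assembly directories; the sketch stops at the three-digit directory rather than guessing the full assembly folder name:

```shell
# Hypothetical helper: pull every GCA_/GCF_ accession out of a manifest-style CSV.
extract_accessions() {
    grep -oE 'GC[AF]_[0-9]{9}\.[0-9]+' "$1"
}

# Hypothetical helper: turn an accession such as GCF_000745545.1 into the
# parent directory URL assumed for it on the NCBI genomes server.
accession_to_url() {
    acc=$1
    prefix=${acc%%_*}      # GCF or GCA
    digits=${acc#*_}       # 000745545.1
    digits=${digits%%.*}   # 000745545
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/\n' \
        "$prefix" "$p1" "$p2" "$p3"
}

# Build the URL list from the excluded genomes (skipped if the file is absent).
if [ -f excluded.genomes.csv ]; then
    extract_accessions excluded.genomes.csv | while IFS= read -r acc; do
        accession_to_url "$acc"
    done > ftp.list.txt
fi
```

This writes one URL per line to ftp.list.txt.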
xargs <ftp.list.txt curl -I 2>&1 | awk '/HTTP\// {print $2}'
or
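#!/bin/bash
# Report every URL in ftp.list.txt whose response carries an error status.
while IFS= read -r site
do
    # --spider checks the URL without downloading; -S prints the response headers.
    if wget --spider -S "$site" 2>&1 | grep -qE 'HTTP/[0-9.]+ (403|404|500|502|503)'; then
        echo "$site is down"
    fi
done < ftp.list.txt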
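Neither variant above keeps the URL and its status code together for every entry. A possible third sketch that does, assuming `curl` is available; the helper names `classify_status` and `check_url` and the label categories are hypothetical:

```shell
# Hypothetical helper: map an HTTP status code to a coarse label.
classify_status() {
    case $1 in
        2??|3??) echo "reachable" ;;
        403|404) echo "missing_or_forbidden" ;;
        5??)     echo "server_error" ;;
        *)       echo "unknown" ;;
    esac
}

# Hypothetical helper: send a HEAD request and print "URL<TAB>code<TAB>label".
check_url() {
    # -s silences progress, -o /dev/null discards headers, -w prints the code.
    code=$(curl -s -o /dev/null -I --max-time 30 -w '%{http_code}' "$1")
    printf '%s\t%s\t%s\n' "$1" "$code" "$(classify_status "$code")"
}

# Check every URL in the list (skipped if the file is absent).
if [ -f ftp.list.txt ]; then
    while IFS= read -r site; do
        check_url "$site"
    done < ftp.list.txt
fi
```

Whether a suppressed or replaced assembly actually returns an error status on the server is not established here, so the output would still need to be compared against the status table above.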