Bookworm-project/HTMetadata-Bookworm

Hathifile selection against IDs is missing data

Opened this issue · 4 comments

See: #7 (comment)

To fix this, we need to compare the trimmed hathifile with the ID list. Presumably some esoteric filenames are being missed, or the recently pushed method that strips special characters before comparison is changing the sort order and tricking the script into thinking an ID was missed.
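If sort order is the suspect, one way to rule it out is to force both lists into plain byte order before comparing. This is only a minimal sketch: `missing_ids` is a hypothetical helper name, not something in the repo, and it assumes both inputs are plain text files with one ID per line.

```shell
# Hypothetical helper (not part of the repo): list IDs that appear in the
# first file but not in the second, one per line.
missing_ids() {
    wanted=$(mktemp) && found=$(mktemp) || return 1
    # LC_ALL=C forces plain byte-order collation, so sort and comm agree on
    # ordering even for IDs containing '$', ':' or '/'.
    LC_ALL=C sort -u "$1" > "$wanted"
    LC_ALL=C sort -u "$2" > "$found"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$wanted" "$found"
    rm -f "$wanted" "$found"
}
```

Something like `missing_ids data/pd-ids.txt found-data.txt > missing` would then write out only the IDs that never showed up, regardless of how either file was sorted originally.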

Possibly not important: In my latest run, I'm getting a total number that's slightly different from any mentioned so far: about 4.62 million volumes. I'm not trimming the Hathi file (or am I?), so I'm not sure where the discrepancy comes from.

bschmidt@sibelius:/raid/htrc-bookworm$ wc -l jsoncatalog.txt 
4624648 jsoncatalog.txt

Maybe the Metadata API isn't as current as the EF dataset; these volumes come in so fast. Could you compare the file lists? My query list is still 300k smaller, so checking on my system wouldn't tell me whether something is missing from my query list or whether it's a missing response, as in your issue.

Try this admittedly ugly comparison, and maybe try a manual query with some of the IDs here (substituting a volume ID for ID): http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=id:ID&wt=json

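# Extract each record's filename, undo the filename-safe encoding ('+' -> ':', '=' -> '/'), and sort.
perl -pe 's/^.+?filename": "(.+?)".+/$1/' jsoncatalog.txt | sed 's/+/:/g; s:=:/:g' | sort > found-data.txt
# Show IDs that appear in only one of the two lists; both inputs must be sorted the same way.
comm -3 data/pd-ids.txt found-data.txt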
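For anyone puzzling over the two sed substitutions: they undo the filename-safe encoding, where `+` stands in for `:` and `=` stands in for `/`. Here is a small sketch of just that step. `decode_id` is a hypothetical helper name and the sample ID is a made-up placeholder, not a real volume.

```shell
# Hypothetical helper: turn a filename-safe volume ID back into its original form.
decode_id() {
    # '+' stands in for ':' and '=' stands in for '/'.
    printf '%s\n' "$1" | sed -e 's/+/:/g' -e 's:=:/:g'
}

# Made-up placeholder ID, for illustration only.
decode_id 'xyz.ark+=12345=t0example'   # prints xyz.ark:/12345/t0example
```

IDs without either character, like `mdp.39015077515180`, pass through unchanged.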

I just ran this script, and it finds 225179 missing items. The manual query also fails for every ID I've tried. Here's a random sample of 0.01% in case anyone wants to check some; the full list is downloadable here: http://benschmidt.org/missing.txt. I would guess these are just PD things that haven't yet made the cut into Solr? That's probably an unavoidable problem.

bschmidt@sibelius:/raid/htrc-bookworm$ perl -ne 'print if rand() < .0001' missing
mdp.39015077515180
mdp.39015081888565
mdp.49015000671744
uc1.b104616
uc1.b111973
uc1.b242615
uc1.b24556
uc1.b257337
uc1.b258547
uc1.b261686
uc1.b266411
uc1.b288418
uc1.b289276
uc1.b33892
uc1.b50948
uc1.b51033
uc1.$b570776
uc1.b612454
uc1.b70318
uc1.b70577
uc1.b84638
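For anyone who wants to re-run the manual query on these IDs, here is a rough sketch using curl. The endpoint is the one quoted in this thread and may no longer be reachable; `solr_query_for_id` and `query_id` are hypothetical helper names. Wrapping the ID in double quotes is my assumption about how to keep characters like `:` and `/` from being read as query syntax.

```shell
# Endpoint taken from the thread; it may no longer be reachable.
SOLR_URL='http://chinkapin.pti.indiana.edu:9994/solr/meta/select/'

# Hypothetical helper: build the Solr query string for one volume ID.
# The ID is wrapped in double quotes so ':' and '/' are not parsed as query syntax.
solr_query_for_id() {
    printf 'id:"%s"' "$1"
}

# Hypothetical helper: fetch the raw JSON response for one volume ID.
# curl -G sends the --data-urlencode values as a URL-encoded query string.
query_id() {
    curl -sG "$SOLR_URL" \
        --data-urlencode "q=$(solr_query_for_id "$1")" \
        --data-urlencode 'wt=json'
}
```

Remember to single-quote IDs on the command line, e.g. `query_id 'uc1.$b570776'`, so the shell does not try to expand `$b570776` as a variable. If the server returns a standard Solr JSON response, a missing volume should show up as `"numFound":0`.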

Yup, thanks for checking. We're at 4.6 million, then.

On Wed, Mar 18, 2015 at 6:35 PM Benjamin Schmidt notifications@github.com
wrote:

I just ran this script, and the manual query also fails for every ID I've
tried. Here's a random sample of 0.01% in case anyone wants to check some:
full list is downloadable here http://benschmidt.org/missing.txt. I
would guess these are just PD things that haven't yet made the cut into
Solr? That's probably an unavoidable problem.
