iipc/openwayback

Crawl appears to have stalled

mtcjayne opened this issue · 3 comments

I have a crawl that has been sitting at 0 URIs/sec and 0 KB/sec for at least an hour, and I'm not sure if this is normal, recoverable behavior or not.

Here are some of the stats shown on the job page:

Job is Active: RUNNING
Totals
2932 downloaded + 10533 queued = 13465 total
152 MiB crawled (152 MiB novel, 0 B dupByHash, 0 B notModified)
Alerts
none
Rates
0 URIs/sec (0.19 avg); 0 KB/sec (9 avg)
Load
0 active of 25 threads; NaN congestion ratio; -1 deepest queue; 0 average depth
Elapsed
4h20m24s318ms
Threads
25 threads: 25 ABOUT_TO_GET_URI; 25 noActiveProcessor 
Frontier
RUN - 35 URI queues: 0 active (0 in-process; 0 ready; 0 snoozed); 0 inactive; 0 ineligible; 0 retired; 35 exhausted 
Memory
63232 KiB used; 123856 KiB current heap; 253440 KiB max heap

Here is the log from when I first started it (before correcting the contact information) until I checkpointed and stopped it.

You might want to shift this over to https://github.com/internetarchive/heritrix3 as it's more likely to be seen by people with Heritrix 3 experience.

It does look odd, as the queues are all exhausted, so the crawl should be in the EMPTY state rather than RUNNING. Maybe repost on https://github.com/internetarchive/heritrix3 and add details of what version and operating system you are running, etc.?
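If it helps while gathering those details, the state the job reports can also be pulled from the Heritrix REST API rather than eyeballed from the web UI. This is just a minimal sketch, assuming Heritrix 3's defaults (HTTPS on port 8443, digest auth with the credentials you started the engine with, a self-signed certificate) and a hypothetical job name `myjob` — adjust all of those for your setup:

```python
# Sketch: fetch a Heritrix 3 job's status page as XML via the REST API.
# Assumptions: engine at https://localhost:8443, digest auth with the
# credentials Heritrix was started with (the -a option), a job named
# "myjob", and the default self-signed certificate (hence verify=False).
import requests
from requests.auth import HTTPDigestAuth

JOB_URL = "https://localhost:8443/engine/job/myjob"  # hypothetical job name

resp = requests.get(
    JOB_URL,
    auth=HTTPDigestAuth("admin", "admin"),   # replace with your -a credentials
    headers={"Accept": "application/xml"},   # ask for XML instead of HTML
    verify=False,                            # self-signed cert by default
)
resp.raise_for_status()

# Print any line of the XML that mentions the controller state or the
# frontier, which is where RUNNING/EMPTY and the queue counts show up.
for line in resp.text.splitlines():
    if "state" in line.lower() or "frontier" in line.lower():
        print(line.strip())
```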

Those SEVERE errors look suspicious:

2018-04-24T02:25:34.762Z SEVERE close() ros[ToeThread #16: https://i.warosu.org/robots.txt

Maybe one of the error logs (nonfatal-errors.log, runtime-errors.log or alerts.log), or heritrix_out.log, has more info about that, like a stack trace?
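For example, something like the sketch below could pull the SEVERE entries (and any stack-trace lines that follow them) out of those logs. The job directory path is a placeholder, and the "stack trace" detection is just a heuristic for typical Java trace lines (`at ...`, `Caused by: ...`):

```python
# Sketch: scan Heritrix job logs for SEVERE entries and trailing stack traces.
# The job directory below is a placeholder -- point it at your own job's
# logs/ directory. Note that heritrix_out.log usually lives alongside the
# Heritrix install rather than under the job directory, so adjust its path.
from pathlib import Path

JOB_LOGS = Path("/path/to/heritrix/jobs/myjob/logs")   # hypothetical path
LOG_NAMES = ["nonfatal-errors.log", "runtime-errors.log",
             "alerts.log", "heritrix_out.log"]

def is_trace_line(line: str) -> bool:
    # Heuristic: Java stack-trace continuation lines are indented or start
    # with "at " / "Caused by:".
    stripped = line.lstrip()
    return line[:1] in (" ", "\t") or stripped.startswith(("at ", "Caused by:"))

for name in LOG_NAMES:
    path = JOB_LOGS / name
    if not path.exists():
        continue
    printing = False
    with path.open(errors="replace") as f:
        for line in f:
            if "SEVERE" in line:
                print(f"--- {name} ---")
                printing = True
            elif printing and not is_trace_line(line):
                printing = False
            if printing:
                print(line.rstrip())
```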

(Yeah this belongs in https://github.com/internetarchive/heritrix3 but oh well 😁 )

ldko commented

@nlevitt it is over there now too. 😉 Sorry, I probably should have closed this issue here when the new one was opened over there. Discussion can continue there.