clarin-eric/linkchecker

Adding flag/unflag operation for URLs in process

Closed this issue · 2 comments

wowasa commented

The next version of linkchecker-persistence will allow flagging URLs at look-up to exclude them from subsequent look-ups. To use the flag, the Spout has to flag each URL it integrates into the processing chain, while either the StatusUpdaterBolt or the ack method of the Spout has to unflag it after processing.
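The flag/unflag cycle described above could look roughly like the following sketch. It uses a hypothetical in-memory class, `FlaggingUrlStore`, as a stand-in for the persistence layer; the class and its method names (`lookupAndFlag`, `unflag`) are assumptions for illustration and are not the linkchecker-persistence API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the persistence layer (illustration only).
class FlaggingUrlStore {
    // Maps each URL to true while it is in the processing chain.
    private final Map<String, Boolean> inProcess = new LinkedHashMap<>();

    // Registers a URL as known and not yet in process.
    void add(String url) {
        inProcess.putIfAbsent(url, false);
    }

    // Look-up as the spout would call it: skips flagged URLs and flags
    // every URL it hands out, so the next look-up will not return it again.
    List<String> lookupAndFlag(int limit) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : inProcess.entrySet()) {
            if (batch.size() >= limit) {
                break;
            }
            if (!entry.getValue()) {
                entry.setValue(true);
                batch.add(entry.getKey());
            }
        }
        return batch;
    }

    // Called once processing has finished, e.g. from a status-updating
    // step or from the spout's ack handling.
    void unflag(String url) {
        inProcess.computeIfPresent(url, (key, flagged) -> false);
    }
}
```

In this sketch a URL handed out by one look-up stays invisible to later look-ups until `unflag` is called for it, which is the behaviour the proposal relies on.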

wowasa commented

As far as I can see from the logs, the flag will significantly decrease the number of database look-ups: at the moment, each look-up returns a considerable number of URLs that are already in the processing chain and are dropped by the LPASpout, which ensures that each URL is in the buffer only once.

wowasa commented

This is abandoned, since I realized that slow queues (those with long crawl delays) would grow steadily with this approach.
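One way to read this concern is sketched below as a toy model of a single slow host. The model is an assumption, not taken from the project: URLs are numbered in the order they become due, each look-up returns `batchSize` URLs for the host, and the host can process only `processedPerRound` URLs between two look-ups. Without the flag, a look-up keeps returning the same pending URLs and the spout drops the duplicates. With the flag, every look-up returns URLs that are not yet in process, so the buffer receives new entries faster than it drains. The class and parameter names are hypothetical.

```java
// Toy model of one slow host's queue (illustration only, all names hypothetical).
class SlowQueueSimulation {
    // Returns the number of URLs waiting in the buffer after `rounds` look-ups.
    static long queueSizeAfter(boolean useFlag, int rounds, int batchSize, int processedPerRound) {
        long done = 0;          // URLs 0..done-1 have been processed
        long bufferedUpTo = 0;  // URLs done..bufferedUpTo-1 are waiting in the buffer
        for (int round = 0; round < rounds; round++) {
            if (useFlag) {
                // Flagged URLs are excluded, so the look-up returns only new URLs.
                bufferedUpTo += batchSize;
            } else {
                // The look-up returns the first batchSize unprocessed URLs;
                // those already buffered are dropped as duplicates.
                bufferedUpTo = Math.max(bufferedUpTo, done + batchSize);
            }
            // The long crawl delay limits how many URLs leave the queue per round.
            done = Math.min(done + processedPerRound, bufferedUpTo);
        }
        return bufferedUpTo - done;
    }
}
```

Under these assumptions, the queue without the flag stays bounded by the batch size, while the queue with the flag grows by `batchSize - processedPerRound` per look-up whenever the batch is larger than what the host can process in that interval.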