crawler-commons/url-frontier

Reduce thread contention when the DB gets large

Closed this issue · 0 comments

The queues object we use is a LinkedHashMap wrapped in a synchronized map, so that we can control the iteration order while benefiting from a simple locking mechanism. The only time we need to explicitly lock access to the map is when we iterate over its content, in order to avoid a ConcurrentModificationException.
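A minimal sketch of that pattern (field and method names here are illustrative, not the frontier's actual code):

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueuesHolder {
    // LinkedHashMap preserves insertion order (so queues rotate predictably);
    // the synchronized wrapper makes individual calls thread-safe.
    private final Map<String, Deque<String>> queues =
            Collections.synchronizedMap(new LinkedHashMap<>());

    public void addURL(String queueId, String url) {
        // computeIfAbsent() is atomic on the wrapper's mutex; we also hold the
        // lock across the add() so the per-queue deque is not mutated concurrently.
        synchronized (queues) {
            queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
        }
    }

    public int totalURLs() {
        // Iteration is the one case needing an explicit lock: the wrapper only
        // guards single calls, so iterating without synchronizing on the map
        // risks a ConcurrentModificationException if another thread mutates it.
        synchronized (queues) {
            int total = 0;
            for (Deque<String> q : queues.values()) {
                total += q.size();
            }
            return total;
        }
    }
}
```

Per the `Collections.synchronizedMap` contract, callers must synchronize on the returned map itself when iterating, which is exactly the explicit lock described above.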

When the DB gets large (e.g. 1 billion URLs in total in a single instance), the threads contend heavily for access to the queues; in particular, AbstractFrontierService.getURLs() holds the lock on the queues for a long time and blocks the addition of new URLs (which requires calling computeIfAbsent() on the queues). As a result, URLs take a long time to get added, which can cause timeout problems on the crawler side.
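The contention can be reproduced in miniature: one thread iterates while holding the map's monitor (standing in for a getURLs() pass over a huge map), and a writer's computeIfAbsent() stalls until the monitor is released. All names and the holdMillis parameter are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class ContentionDemo {
    // Measures how long a writer's computeIfAbsent() is blocked while another
    // thread iterates the synchronized map. holdMillis stands in for the time
    // a getURLs()-style scan would spend holding the lock on a very large map.
    static long measureBlockedMillis(long holdMillis) throws InterruptedException {
        Map<String, Deque<String>> queues =
                Collections.synchronizedMap(new LinkedHashMap<>());
        queues.put("host1", new ArrayDeque<>(List.of("u1")));

        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread reader = new Thread(() -> {
            // Simulates getURLs(): holds the map's monitor for the whole scan.
            synchronized (queues) {
                lockHeld.countDown();
                for (Deque<String> q : queues.values()) {
                    try { Thread.sleep(holdMillis); } catch (InterruptedException ignored) {}
                }
            }
        });
        reader.start();
        lockHeld.await(); // wait until the reader actually owns the monitor

        // Simulates adding a URL: computeIfAbsent() needs the same monitor.
        long start = System.nanoTime();
        queues.computeIfAbsent("host2", k -> new ArrayDeque<>());
        long blocked = (System.nanoTime() - start) / 1_000_000;
        reader.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("computeIfAbsent blocked ~" + measureBlockedMillis(300) + " ms");
    }
}
```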

Given that the getURLs() process only needs the iterator to take the head of the queues and put it back at the end, there is no need to hold the lock on the queues while calling the costly sendURLsForQueue().
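One way the fix could look, as a simplified sketch (hypothetical names; a real implementation would also need to handle URLs arriving for a queue while it is detached):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class FrontierSketch {
    private final Map<String, Deque<String>> queues =
            Collections.synchronizedMap(new LinkedHashMap<>());

    public void offer(String queueId, String url) {
        synchronized (queues) {
            queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
        }
    }

    // Hold the lock only long enough to detach the head queue, run the costly
    // send WITHOUT the lock, then re-insert at the tail to keep the rotation.
    public void serveHeadQueue(Consumer<Deque<String>> sendURLsForQueue) {
        String headId;
        Deque<String> headQueue;
        synchronized (queues) {
            Iterator<Map.Entry<String, Deque<String>>> it = queues.entrySet().iterator();
            if (!it.hasNext()) return;
            Map.Entry<String, Deque<String>> head = it.next();
            headId = head.getKey();
            headQueue = head.getValue();
            it.remove(); // other threads can now call computeIfAbsent() freely
        }
        try {
            sendURLsForQueue.accept(headQueue); // costly work, lock not held
        } finally {
            synchronized (queues) {
                queues.put(headId, headQueue); // back at the tail of iteration order
            }
        }
    }

    public List<String> queueOrder() {
        synchronized (queues) {
            return new ArrayList<>(queues.keySet());
        }
    }
}
```

Because LinkedHashMap iterates in insertion order, removing the head and re-putting it after the send lands the queue at the tail, preserving the round-robin behaviour the frontier relies on.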