decrypto-org/spider

Improve prioritization algorithm to circumvent black holes


Currently, we have an issue with pages that contain a considerable number of links to themselves, such as websites that list all Bitcoin transactions or blocks, or websites that host an extensive library and let you read a book page by page. To circumvent this issue and still get content for all not yet scraped pages, we should first consider scraping paths of hosts that have not been scraped yet.
So the prioritization would be as follows (always sorted by depth as well; we want to scrape in depth order):

  1. Scrape paths from hosts that have never been scraped yet
  2. Sort by the number of unique incoming/outgoing links (unique in the sense of distinct hosts that link to this host)
  3. Sort randomly as a tie-break (see the ordering sketch below)
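
A minimal sketch of the resulting ordering, assuming a candidate set that already carries a depth column, a hostNeverScraped flag and the inuniquecount computed in the first snippet below (all three names are placeholders, not actual schema columns):

        SELECT
            pathid
        FROM
            candidates                 -- hypothetical result set combining the snippets below
        ORDER BY
            depth ASC,                 -- always scrape in depth order
            hostNeverScraped DESC,     -- paths of never-scraped hosts come first
            inuniquecount DESC,        -- more distinct hosts linking here = higher priority (assumed direction)
            RAND()                     -- random tie-break (RANDOM() on PostgreSQL)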

Here are a few SQL snippets from today's meeting (@dionyziz, @zetavar):

Find all unique incoming/outgoing link counts (the snippet covers the incoming direction):

        -- count how many distinct hosts link to each destination path
        SELECT
             destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
        FROM
             links l JOIN paths p ON l.srcpathid = p.pathid
        WHERE
             destpathid IN (1, 2, 3, 4, ...)
        GROUP BY
             destpathid

Find all hosts that are not yet scraped and not currently in progress:

        -- a host counts as unscraped if none of its paths has finished and none is in progress
        SELECT
            baseUrlBaseUrlId, MIN(lastFinishedTimestamp) AS mintime, BOOL_OR(inProgress) AS ongoing
        FROM
            paths
        GROUP BY
            baseUrlBaseUrlId
        HAVING
            BOOL_OR(inProgress) = false AND
            MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
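
One possible way to combine the two snippets into a single prioritization query. This is only a sketch: the depth column, the assumption that unscraped paths carry a zero lastFinishedTimestamp, the two spellings of the host foreign key (baseUrlId vs. baseUrlBaseUrlId, kept as in the snippets above), and the mix of BOOL_OR (PostgreSQL) with a zero timestamp and RAND() (MySQL) would all need to be checked against the actual schema and DBMS:

        SELECT
            p.pathid
        FROM
            paths p
            -- hosts where nothing has been scraped yet and nothing is in progress
            LEFT JOIN (
                SELECT baseUrlBaseUrlId
                FROM paths
                GROUP BY baseUrlBaseUrlId
                HAVING BOOL_OR(inProgress) = false
                   AND MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
            ) fresh ON fresh.baseUrlBaseUrlId = p.baseUrlBaseUrlId
            -- number of distinct hosts linking to each candidate path
            LEFT JOIN (
                SELECT l.destpathid, COUNT(DISTINCT src.baseUrlId) AS inuniquecount
                FROM links l JOIN paths src ON l.srcpathid = src.pathid
                GROUP BY l.destpathid
            ) incoming ON incoming.destpathid = p.pathid
        WHERE
            p.inProgress = false AND
            p.lastFinishedTimestamp = '0000-00-00 00:00:00'    -- only paths not scraped yet
        ORDER BY
            p.depth ASC,                                       -- scrape in depth order
            (fresh.baseUrlBaseUrlId IS NOT NULL) DESC,         -- never-scraped hosts first
            COALESCE(incoming.inuniquecount, 0) DESC,          -- then by distinct incoming hosts
            RAND()                                             -- random tie-break (RANDOM() on PostgreSQL)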