Improve prioritizing algorithm to circumvent black holes
jogli5er commented
Currently, we have an issue with sites that contain a considerable number of links back to themselves, such as websites that list all Bitcoin transactions or blocks, or websites that host an extensive library and let you read a book page by page. The crawler can sink into these black holes and keep scraping the same host indefinitely. To circumvent this issue and get a content page for every host that has not been scraped yet, we should first consider scraping pages of hosts that were never scraped.
So the prioritization would be as follows (always sorted by depth as well, since we want to scrape in depth order; see the sketch after this list):
- Scrape paths from hosts that have never been scraped yet
- Sort by the number of unique incoming/outgoing links (unique in the sense of distinct hosts that link to this host)
- Sort randomly as a tie-breaker
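As a rough sketch, this whole ordering could be expressed in a single query. The depth and inuniquecount columns (the latter materialized from the snippets below), as well as RANDOM(), are assumptions beyond what we have in the schema today, and this is one plausible reading of how the depth sort interleaves with the host tiers:

-- One plausible combined ordering (assumed columns: depth, inuniquecount)
SELECT
    p.pathid
FROM
    paths p
JOIN (
    -- per host: has any page of this host ever been scraped?
    SELECT
        baseUrlBaseUrlId,
        MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00' AS never_scraped
    FROM paths
    GROUP BY baseUrlBaseUrlId
) h ON h.baseUrlBaseUrlId = p.baseUrlBaseUrlId
ORDER BY
    h.never_scraped DESC,   -- hosts that were never scraped come first
    p.depth ASC,            -- always scrape in depth order
    p.inuniquecount DESC,   -- then by unique incoming links
    RANDOM();               -- random tie-break (RANDOM() in PostgreSQL, RAND() in MySQL)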
Here are a few SQL snippets from today's meeting (@dionyziz, @zetavar):
Find the unique incoming/outgoing link counts
-- Count, per destination path, how many distinct hosts link to it
SELECT
    l.destpathid, COUNT(DISTINCT p.baseUrlId) AS inuniquecount
FROM
    links l JOIN paths p ON l.srcpathid = p.pathid
WHERE
    l.destpathid IN (1, 2, 3, 4, ...)
GROUP BY
    l.destpathid
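This snippet only counts incoming links; the outgoing counterpart would presumably join in the other direction and count the distinct hosts a source path points to (a sketch over the same schema):

-- Count, per source path, how many distinct hosts it links out to
SELECT
    l.srcpathid, COUNT(DISTINCT p.baseUrlId) AS outuniquecount
FROM
    links l JOIN paths p ON l.destpathid = p.pathid
WHERE
    l.srcpathid IN (1, 2, 3, 4, ...)
GROUP BY
    l.srcpathid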
Find all hosts that are not yet scraped
-- One row per host: earliest finish time and whether any of its paths is in progress
SELECT
    baseUrlBaseUrlId,
    MIN(lastFinishedTimestamp) AS mintime,
    BOOL_OR(inProgress) AS ongoing
FROM
    paths
GROUP BY
    baseUrlBaseUrlId
HAVING
    BOOL_OR(inProgress) = false AND
    -- the zero timestamp is the "never scraped" sentinel
    MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
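To turn these hosts into concrete scrape candidates, the result could be joined back to the paths table and ordered by depth; this is only a sketch, and the depth column and the LIMIT are assumptions:

-- Candidate paths from hosts that were never scraped, shallowest first
SELECT
    p.pathid
FROM
    paths p
JOIN (
    SELECT baseUrlBaseUrlId
    FROM paths
    GROUP BY baseUrlBaseUrlId
    HAVING BOOL_OR(inProgress) = false AND
           MIN(lastFinishedTimestamp) = '0000-00-00 00:00:00'
) fresh ON fresh.baseUrlBaseUrlId = p.baseUrlBaseUrlId
ORDER BY
    p.depth ASC   -- assumed depth column; scrape in depth order
LIMIT 10;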