bigscience-workshop/metadata

perf: filter out 404 URLs using Common Crawl's cluster.idx

Tentative Tasks

  • 0. Download the month-wise cluster.idx;
  • 1. Convert cluster.idx → a Python dictionary of URL (actually SURT) parts;
  • 2. Convert each input URL → (SURT) parts;
  • 3. (Partially) Match (2) with (1) → a list of cdx-\d{5}.gz files (with byte ranges, of course; see the sketch after this list);
  • 4. Convert each matched cdx-\d{5}.gz → a structure like (1), except probably keyed on whole SURTs only;
  • 5. (Exactly) Match (2) with (4) → None or a WARC file path with a byte range;
  • 6. Try (0)-(5) with additional cluster.idx files from months other than the one the OpenWebText URLs come from;
  • 7. Multi-thread at least (2) and (4); in theory (1), (3), (5), and by extension (6) are thread-safe, but this is not 100% certain yet (a thread-pool sketch follows the Tentative Outcomes bullet).
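
A minimal sketch of steps (1)-(3), assuming cluster.idx uses its usual layout (one line per compressed cluster: SURT key plus timestamp, then tab-separated cdx filename, offset, length, and cluster id) and that each line carries the first key of its cluster. The structure here is a sorted list rather than a dictionary, since the partial match in (3) is just a binary search; to_surt() is a rough stand-in for the real surt package:

```python
import bisect
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT key, e.g. 'com,example)/path?q=1' (lowercased).
    The real `surt` package handles many more edge cases (ports,
    default pages, session IDs), so treat this as an approximation."""
    parts = urlsplit(url.lower())
    host = ",".join(reversed(parts.hostname.split(".")))
    key = host + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def load_cluster_idx(path):
    """Step (1): parse cluster.idx into a sorted list of
    (surt_key, cdx_filename, offset, length) tuples. Assumes five
    tab-separated fields per line: 'SURT key + space + timestamp',
    cdx filename, offset, length, cluster id."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            key_ts, fname, offset, length, _ = line.rstrip("\n").split("\t")
            entries.append((key_ts.split(" ")[0], fname, int(offset), int(length)))
    return entries  # cluster.idx is already sorted by SURT key


def candidate_blocks(entries, keys, url):
    """Steps (2)-(3): partial-match a URL's SURT against the cluster index.
    `keys` is [e[0] for e in entries], precomputed once. Returns the block
    whose first key is <= the query, plus its successor in case the query
    sits on a cluster boundary."""
    key = to_surt(url)
    i = bisect.bisect_right(keys, key)
    return entries[max(i - 1, 0) : i + 1]


# Usage: load once, precompute keys, then look up many URLs.
# entries = load_cluster_idx("cluster.idx")
# keys = [e[0] for e in entries]
# candidate_blocks(entries, keys, "https://example.com/some/page")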

Tentative Outcomes

  • Running (0)-(3) sequentially for 2018-10 can be done in 6 minutes on Colab.
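
For comparison, the six minutes above is for a strictly sequential run. A hedged sketch of task (7), mapping the per-URL matching over a thread pool; candidate_blocks() and its arguments come from the sketch above, and the worker count is a guess:

```python
from concurrent.futures import ThreadPoolExecutor


def match_all(entries, keys, urls, max_workers=16):
    """Task (7): (2) and (3) only read the shared entries/keys,
    so mapping them over a thread pool should be safe."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: (u, candidate_blocks(entries, keys, u)), urls))
```

Note that pure-Python matching is GIL-bound, so the real win from threads should come in step (4), where the work is network I/O.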

Background

Currently trying to do partial matching iteratively with cluster.idx and cdx-\d{5}.gz locally.

The pessimistic comments below are copied from earlier in the thread:

Unfortunately, the chance of getting a matched URL from cluster.idx is much lower than I had anticipated.
For example, among 10,240 successfully downloaded URLs from 2018-10, only 7 are found in the corresponding cluster.idx.
Since cluster.idx only samples approximately one URL in every 3,000 (one per cluster) from the whole index, this is, after all, an understandable outcome...

cluster.idx only samples approximately one URL in every 3,000 (one per cluster) from the whole index

Although it is possible to develop a fuzzy search that uses a partial URL to close in on potential index files (cdx-\d{5}.gz), and then to apply that fuzzy search recursively within those cdx-\d{5}.gz files, I probably don't have enough time to do so...
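
For reference, once a candidate block is known, the exact stage of that lookup, i.e. steps (4)-(5), is fairly mechanical: each cluster of a cdx-\d{5}.gz is an independently gzipped chunk addressable by the offset and length recorded in cluster.idx. A minimal sketch under that assumption, reusing to_surt() from the earlier sketch; the collection path is also an assumption (CC-MAIN-2018-43 appears to be the crawl matching 2018-10), and timestamps are ignored, so the first hit wins:

```python
import gzip
import json

import requests

# Assumption: CC-MAIN-2018-43 is the collection covering 2018-10; adjust as needed.
CDX_BASE = "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2018-43/indexes/"


def lookup_in_block(url, block):
    """Steps (4)-(5): fetch one gzipped cluster out of a cdx-NNNNN.gz file
    via an HTTP Range request, then exact-match the SURT key inside it.
    Returns (warc_path, offset, length), or None if the URL is absent."""
    key = to_surt(url)  # from the earlier sketch
    _, fname, offset, length = block
    resp = requests.get(
        CDX_BASE + fname,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Each cluster is an independently gzipped member, so the fetched
    # bytes decompress on their own.
    for line in gzip.decompress(resp.content).decode("utf-8").splitlines():
        surt_key, _timestamp, meta = line.split(" ", 2)
        if surt_key == key:
            record = json.loads(meta)
            return record["filename"], int(record["offset"]), int(record["length"])
    return None
```

A None here is exactly the signal the issue title is after: the URL has no capture in the index, so it can be filtered out before any WARC fetch is attempted.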

Reopening for another attempt.

Basically done.