bigscience-workshop/metadata

perf: filter out 404 URLs using Common Crawl's cluster.idx

Tentative Tasks

  • 0. Download the month-wise cluster.idx;
  • 1. Convert cluster.idx → a Python dictionary of URL (actually SURT) parts;
  • 2. Convert each input URL → (SURT) parts;
  • 3. (Partially) Match (2) with (1) → a list of cdx-\d{5}.gz files (with byte ranges, of course; see the sketch after this list);
  • 4. Convert each matched cdx-\d{5}.gz → a structure like (1), except probably keyed on whole SURTs only;
  • 5. (Exactly) Match (2) with (4) → None or a WARC file path with a byte range;
  • 6. Try (0)-(5) with additional cluster.idx files from months other than the one the OpenWebText URLs come from;
  • 7. Multi-thread at least (2) and (4); in theory (1), (3), (5), and by extension (6) are thread-safe, but this is not 100% certain yet (a thread-pool sketch follows the Tentative Outcomes bullet).
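
A minimal sketch of steps (1)-(3), assuming cluster.idx uses its usual layout (one line per compressed cluster: SURT key plus timestamp, then tab-separated cdx filename, offset, length, and cluster id) and that each line carries the first key of its cluster. The structure here is a sorted list rather than a dictionary, since the partial match in (3) is just a binary search; to_surt() is a rough stand-in for the real surt package:

```python
import bisect
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT key, e.g. 'com,example)/path?q=1' (lowercased).
    The real `surt` package handles many more edge cases (ports,
    default pages, session IDs), so treat this as an approximation."""
    parts = urlsplit(url.lower())
    host = ",".join(reversed(parts.hostname.split(".")))
    key = host + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def load_cluster_idx(path):
    """Step (1): parse cluster.idx into a sorted list of
    (surt_key, cdx_filename, offset, length) tuples. Assumes five
    tab-separated fields per line: 'SURT key + space + timestamp',
    cdx filename, offset, length, cluster id."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            key_ts, fname, offset, length, _ = line.rstrip("\n").split("\t")
            entries.append((key_ts.split(" ")[0], fname, int(offset), int(length)))
    return entries  # cluster.idx is already sorted by SURT key


def candidate_blocks(entries, keys, url):
    """Steps (2)-(3): partial-match a URL's SURT against the cluster index.
    `keys` is [e[0] for e in entries], precomputed once. Returns the block
    whose first key is <= the query, plus its successor in case the query
    sits on a cluster boundary."""
    key = to_surt(url)
    i = bisect.bisect_right(keys, key)
    return entries[max(i - 1, 0) : i + 1]


# Usage: load once, precompute keys, then look up many URLs.
# entries = load_cluster_idx("cluster.idx")
# keys = [e[0] for e in entries]
# candidate_blocks(entries, keys, "https://example.com/some/page")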

Tentative Outcomes

  • Running (0)-(3) sequentially for 2018-10 can be done in 6 minutes on Colab.
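
For comparison, the six minutes above is for a strictly sequential run. A hedged sketch of task (7), mapping the per-URL matching over a thread pool; candidate_blocks() and its arguments come from the sketch above, and the worker count is a guess:

```python
from concurrent.futures import ThreadPoolExecutor


def match_all(entries, keys, urls, max_workers=16):
    """Task (7): (2) and (3) only read the shared entries/keys,
    so mapping them over a thread pool should be safe."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: (u, candidate_blocks(entries, keys, u)), urls))
```

Note that pure-Python matching is GIL-bound, so the real win from threads should come in step (4), where the work is network I/O.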

Background

Currently trying to do partial matching iteratively with cluster.idx and cdx-\d{5}.gz locally.

The pessimistic comments below are copied from earlier in the thread:

Unfortunately, the chance of getting a matched URL from cluster.idx is much lower than I had anticipated.
For example, among 10,240 successfully downloaded URLs from 2018-10, only 7 are found in the corresponding cluster.idx.
Since cluster.idx only samples approximately one URL in every 3,000 (one per cluster) from the whole index, this is, after all, an understandable outcome...

cluster.idx only samples approximately one URL in every 3,000 (one per cluster) from the whole index

Although it is possible to develop a fuzzy search that uses a partial URL to close in on potential index files (cdx-\d{5}.gz), and then to apply that fuzzy search recursively within those cdx-\d{5}.gz files, I probably don't have enough time to do so...
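
For reference, once a candidate block is known, the exact stage of that lookup, i.e. steps (4)-(5), is fairly mechanical: each cluster of a cdx-\d{5}.gz is an independently gzipped chunk addressable by the offset and length recorded in cluster.idx. A minimal sketch under that assumption, reusing to_surt() from the earlier sketch; the collection path is also an assumption (CC-MAIN-2018-43 appears to be the crawl matching 2018-10), and timestamps are ignored, so the first hit wins:

```python
import gzip
import json

import requests

# Assumption: CC-MAIN-2018-43 is the collection covering 2018-10; adjust as needed.
CDX_BASE = "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2018-43/indexes/"


def lookup_in_block(url, block):
    """Steps (4)-(5): fetch one gzipped cluster out of a cdx-NNNNN.gz file
    via an HTTP Range request, then exact-match the SURT key inside it.
    Returns (warc_path, offset, length), or None if the URL is absent."""
    key = to_surt(url)  # from the earlier sketch
    _, fname, offset, length = block
    resp = requests.get(
        CDX_BASE + fname,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Each cluster is an independently gzipped member, so the fetched
    # bytes decompress on their own.
    for line in gzip.decompress(resp.content).decode("utf-8").splitlines():
        surt_key, _timestamp, meta = line.split(" ", 2)
        if surt_key == key:
            record = json.loads(meta)
            return record["filename"], int(record["offset"]), int(record["length"])
    return None
```

A None here is exactly the signal the issue title is after: the URL has no capture in the index, so it can be filtered out before any WARC fetch is attempted.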

Reopening for another attempt.

Basically done.