perf: filter out 404 URLs using Common Crawl's cluster.idx
Closed this issue · 4 comments
Tentative Tasks
- 0. Download the month-wise `cluster.idx`;
- 1. Convert `cluster.idx` → a Python dictionary of URL (actually SURT) parts;
- 2. Convert each input URL → (SURT) parts;
- 3. (Partially) match (2) with (1) → a list of `cdx-\d{5}.gz` (with ranges, of course);
- 4. Convert matched `cdx-\d{5}.gz` → like (1), except probably for whole SURTs only;
- 5. (Exactly) match (2) with (4) → `None` or a WARC file path with a range;
- 6. Try (0)-(5) with more `cluster.idx` files that are not in exactly the same month as the OpenWebText URLs;
- 7. Multi-thread at least (2) and (4); in theory (1), (3), (5), and by extension (6) are thread-safe, but not 100% sure yet.
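The index-side half of steps (1) and (3) can be sketched roughly as below. The `cluster.idx` line layout (SURT plus timestamp, then cdx file, offset, length, cluster id, tab-separated) is my reading of Common Crawl's format, and helper names like `load_cluster_index` are made up for illustration. Because the SURT keys are sorted, a binary search narrows each input SURT down to the one `cdx-\d{5}.gz` block (with its byte range) that could contain it:

```python
import bisect

def load_cluster_index(lines):
    """Parse cluster.idx lines into parallel, sorted lists of keys and blocks."""
    keys, blocks = [], []
    for line in lines:
        key_ts, cdx_file, offset, length, _cluster = line.rstrip("\n").split("\t")
        keys.append(key_ts.split(" ")[0])  # keep the SURT, drop the timestamp
        blocks.append((cdx_file, int(offset), int(length)))
    return keys, blocks

def candidate_block(keys, blocks, surt):
    """Return the (cdx file, offset, length) block that may contain `surt`."""
    # The block starting at the greatest key <= surt is the only candidate.
    i = bisect.bisect_right(keys, surt) - 1
    return blocks[max(i, 0)]

# Toy two-line index in the assumed format:
lines = [
    "com,example)/a 20181001000000\tcdx-00000.gz\t0\t1000\t0",
    "org,sample)/z 20181001000000\tcdx-00001.gz\t1000\t1000\t1",
]
keys, blocks = load_cluster_index(lines)
print(candidate_block(keys, blocks, "com,example)/article"))
# -> ('cdx-00000.gz', 0, 1000)
```

Since each cluster covers ~3000 index lines, this lookup is cheap; the expensive part is fetching and scanning the matched blocks.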
Tentative Outcomes
- Running (0)-(3) sequentially for `2018-10` can be done in 6 minutes on Colab.
Background
Currently trying to actually do the partial matching iteratively with `cluster.idx` and `cdx-\d{5}.gz` locally.
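For step (2), the URL → SURT conversion can be sketched as below, assuming the common "reversed host, comma-separated" SURT convention used by Common Crawl's index (e.g. `https://www.example.com/a` → `com,example)/a`). Real SURT canonicalization (e.g. the `surt` package) handles many more edge cases; this is only an approximation:

```python
from urllib.parse import urlsplit

def to_surt_key(url: str) -> str:
    """Approximate a URL's SURT key: reversed host labels + path + query."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Drop a leading "www." and reverse the remaining host labels.
    if host.startswith("www."):
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{reversed_host}){path}{query}"

print(to_surt_key("https://www.example.com/article?id=7"))
# -> com,example)/article?id=7
```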
Below is cut-and-pasted from my earlier, more pessimistic comments:
Unfortunately, the chance of getting a matched URL from `cluster.idx` is much lower than I had anticipated. For example, among 10,240 successfully downloaded URLs of 2018-10, only 7 are found in the corresponding `cluster.idx`. Since `cluster.idx` only samples approximately every 3000 URLs (as a cluster) from the whole index, this is, after all, an understandable outcome...

> `cluster.idx` only samples approximately every 3000 URLs (as a cluster) from the whole index

Although it is possible to develop a fuzzy search that uses a partial URL to close in on potential index files (`cdx-\d{5}.gz`), and then recursively apply that fuzzy search to those files, I probably don't have enough time to do so...
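Such a search would still be workable per block: each `cluster.idx` entry points at a gzipped slice of a `cdx-\d{5}.gz` file, so steps (4)-(5) reduce to one HTTP Range request plus a gzip decompress, yielding ~3000 index lines to scan for an exact SURT match. A minimal sketch, where the `CDX_BASE` URL and the "`SURT timestamp {json}`" cdx line layout are my assumptions about Common Crawl's setup:

```python
import gzip
import json
import urllib.request

# Assumed download location for the 2018-10 (CC-MAIN-2018-43) index shards.
CDX_BASE = "https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2018-43/indexes/"

def fetch_block(cdx_file, offset, length):
    """Download and decompress one gzipped slice of a cdx-XXXXX.gz file."""
    req = urllib.request.Request(
        CDX_BASE + cdx_file,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8")

def exact_match(block_text, surt):
    """Step (5): return (WARC path, offset, length) for `surt`, or None."""
    for line in block_text.splitlines():
        key, _timestamp, payload = line.split(" ", 2)
        if key == surt:
            meta = json.loads(payload)
            return meta["filename"], int(meta["offset"]), int(meta["length"])
    return None
```

`exact_match` returning `None` is exactly the signal wanted here: the URL is absent from the crawl (likely a 404) and can be filtered out without ever touching a WARC file.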
Reopening for another attempt.
Basically done.