warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.


wat-benchmark

This repository acts as a Hello World for working with WARC files.

Its subfolders contain implementations that fetch a WARC file and search all captures from .com domains for a regex that detects YouTube links.
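As a rough illustration of that task, here is a minimal Python sketch that uses only the standard library. It is not the code from the subfolders: the function names `iter_records`, `grep_warc` and `grep_warc_file` are hypothetical, the YouTube pattern is an assumed stand-in for whatever regex the implementations use, and the WARC file is assumed to have been downloaded to a local path already.

```python
import gzip
import re
from urllib.parse import urlsplit

# Assumed pattern for illustration; the repository's actual regex may differ.
YOUTUBE_RE = re.compile(rb"https?://(?:www\.)?youtube\.com/watch\?v=[\w-]{11}")


def iter_records(stream):
    """Yield (headers, body) for each record in a decompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.partition(b":")
            key = name.strip().lower().decode("ascii")
            headers[key] = value.strip().decode("utf-8", "replace")
        # The record body is exactly Content-Length bytes long.
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def grep_warc(stream):
    """Return (url, link) pairs found in response captures from .com hosts."""
    hits = []
    for headers, body in iter_records(stream):
        if headers.get("warc-type") != "response":
            continue  # ignore request, metadata and warcinfo records
        url = headers.get("warc-target-uri", "")
        host = urlsplit(url).hostname or ""
        if not host.endswith(".com"):
            continue
        for link in YOUTUBE_RE.findall(body):
            hits.append((url, link.decode("ascii")))
    return hits


def grep_warc_file(path):
    """Open a local .warc.gz file and grep it."""
    # gzip.open reads through the concatenated gzip members
    # (one per record) that Common Crawl files consist of.
    with gzip.open(path, "rb") as stream:
        return grep_warc(stream)
```

Calling `grep_warc_file("sample.warc.gz")` on a local file (the file name is a placeholder) returns a list of `(url, link)` tuples. The regex runs on the raw bytes of each record, HTTP headers included, so no character decoding of the page is attempted.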

See also the blog post.

This is not bulletproof, production-ready code: I/O retries, closing resources and robust character decoding are omitted to keep the focus on the WARC aspect of the code.