cc-getpage

cc-getpage is a lightweight Python utility for retrieving individual pages from the Common Crawl archive. It looks up a specific web page in Common Crawl's index and downloads only the corresponding WARC file segment.

For bulk downloads or entire snapshots, please use the official cc-downloader program.

Features

Fetches specific web pages from Common Crawl archives
Lists available crawl snapshots for selection
Supports manual or automatic crawl selection
Displays archived versions of a URL for selection
Downloads only the necessary WARC segment
Includes automatic retries with backoff
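
The lookup, segment download, and retry features listed above can be sketched roughly as follows. This is a minimal illustration, not the code in cc-getpage.py: the function names are hypothetical, and the retry count and delays are assumed values. It relies on the public Common Crawl index server, which returns one JSON object per capture with `filename`, `offset`, and `length` fields, and on an HTTP Range request against the data host.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(crawl_id, page_url):
    """Build the index query URL for one page in one crawl."""
    query = urllib.parse.urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_index_lines(text):
    """The index answers with one JSON object per line, one per capture."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def byte_range_header(record):
    """Range header value covering exactly one record (the end is inclusive)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"bytes={start}-{end}"


def backoff_delays(retries, base=1.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... (assumed policy)."""
    return [base * (2 ** attempt) for attempt in range(retries)]


def fetch(url, headers=None, retries=4):
    """GET a URL, sleeping between failed attempts and re-raising the last error."""
    delays = backoff_delays(retries)
    for attempt, delay in enumerate(delays, start=1):
        try:
            request = urllib.request.Request(url, headers=headers or {})
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == len(delays):
                raise
            time.sleep(delay)
    raise ValueError("retries must be at least 1")


def fetch_capture(crawl_id, page_url):
    """Return the gzip-compressed WARC record of the first capture found."""
    index_text = fetch(index_query_url(crawl_id, page_url)).decode("utf-8")
    captures = parse_index_lines(index_text)
    if not captures:
        raise LookupError(f"no capture of {page_url} in {crawl_id}")
    record = captures[0]
    return fetch(
        f"{DATA_HOST}/{record['filename']}",
        {"Range": byte_range_header(record)},
    )
```

The bytes returned by `fetch_capture` form a single gzip member, so `gzip.decompress(data)` should yield the WARC record headers followed by the archived HTTP response. Requesting only that byte range is what avoids downloading the whole WARC file.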

Usage

python cc-getpage.py <URL> [CRAWL-ID]
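
Crawl IDs have the form CC-MAIN-2024-10. As a rough sketch of where valid values for CRAWL-ID come from, the snippet below reads the crawl list that the Common Crawl index server publishes as collinfo.json. The helper names are hypothetical, and choosing the first entry when no ID is given assumes the list is ordered newest first; the actual selection logic in cc-getpage.py may differ.

```python
import json
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def crawl_ids(collinfo_text):
    """Extract the crawl IDs from the collinfo.json document."""
    return [entry["id"] for entry in json.loads(collinfo_text)]


def pick_crawl(ids, requested=None):
    """Use the requested crawl if it exists, otherwise the first one listed."""
    if requested is None:
        return ids[0]  # assumption: the list is ordered newest first
    if requested not in ids:
        raise ValueError(f"unknown crawl ID: {requested}")
    return requested


def download_collinfo():
    """Fetch the raw collinfo.json text from the index server."""
    with urllib.request.urlopen(COLLINFO_URL, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `pick_crawl(crawl_ids(download_collinfo()))` would then give an ID that can be passed as the second argument on the command line.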

Contribute

Pull requests are welcome. Feel free to improve features or fix bugs.

License

This project is licensed under the MIT License.

Contact