cc-getpage is a lightweight Python utility for retrieving individual pages from the Common Crawl archive. It looks up a page in Common Crawl's index and downloads only the corresponding WARC file segment.
For bulk downloads or entire snapshots, please use the official cc-downloader program.
- Fetches specific web pages from Common Crawl archives
- Lists available crawl snapshots for selection
- Supports manual or automatic crawl selection
- Displays archived versions of a URL for selection
- Downloads only the necessary WARC segment
- Includes automatic retries with backoff
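The segment download and retry features listed above can be illustrated with a minimal sketch. It is not the code in cc-getpage.py. It assumes the index returns one JSON object per line with `filename`, `offset`, and `length` fields, and that WARC files are served from `https://data.commoncrawl.org/`. All function names, the sample record path, and the backoff parameters are hypothetical and chosen only for illustration.

```python
import gzip
import json


def parse_index_record(line: str) -> dict:
    """Pull the fields needed for a fetch out of one JSON index line."""
    record = json.loads(line)
    return {
        "filename": record["filename"],
        # The index reports offset and length as strings, so convert them.
        "offset": int(record["offset"]),
        "length": int(record["length"]),
    }


def warc_url(filename: str, base: str = "https://data.commoncrawl.org/") -> str:
    """Build the download URL for a WARC file (base URL is an assumption)."""
    return base + filename


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header covering exactly one WARC record."""
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def decode_segment(raw: bytes) -> str:
    """Decompress a downloaded segment; each record is its own gzip member."""
    return gzip.decompress(raw).decode("utf-8", errors="replace")


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> list:
    """Return the wait time in seconds before each retry (hypothetical values)."""
    return [base * factor ** attempt for attempt in range(retries)]
```

With these helpers, a fetch amounts to sending a GET request to `warc_url(record["filename"])` with `range_header(record["offset"], record["length"])`, sleeping for the next value from `backoff_delays` after a failed attempt, and passing the response body to `decode_segment`. Requesting only that byte range is what avoids downloading the whole WARC file.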
python cc-getpage.py <URL> [CRAWL-ID]
Pull requests are welcome. Feel free to improve features or fix bugs.
This project is licensed under the MIT License.
For support or questions, visit Common Crawl or open an issue on GitHub. You're also welcome to join our Discord server or Google Group.