Domain Crawler and Comparison Tool

This repository contains Python scripts to crawl and compare a website for changes.

Overview

  1. crawl_website.py - A web crawler that visits every page of a given domain and exports each URL with its status code, page size, and page height in both .txt and .html formats.
  2. compare.py - A script that takes two .txt files (generated by crawl_website.py), representing the old and new versions of a website, and compares them side by side. It exports the differences into an HTML file, highlighting the discrepancies.
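As a rough illustration of the crawling step, the sketch below shows one way to pick out the links on an already downloaded page that stay on the same domain, which is the core of visiting "all the pages of a given domain". It uses only the standard library. `LinkCollector` and `same_domain_links` are hypothetical names chosen for this example, not functions from this repository, and the actual crawler may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return sorted, de-duplicated absolute links that share page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.add(absolute)
    return sorted(links)


# Hypothetical page content, used only to demonstrate the function.
sample = (
    '<a href="/about">About</a> '
    '<a href="https://other.example/x">External</a> '
    '<a href="contact.html#form">Contact</a>'
)
print(same_domain_links("https://example.com/index.html", sample))
# ['https://example.com/about', 'https://example.com/contact.html']
```

A full crawler would repeat this for each newly found link, keeping a set of visited URLs so that no page is fetched twice.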

Installation

Installing Python

  • Windows: Download the installer from Python's official site and follow the installation steps. Make sure to check the "Add Python to PATH" checkbox during installation.

  • macOS: Older macOS releases shipped with Python pre-installed, but recent releases may not include a usable Python 3. Download the latest version from Python's official site.

  • Linux: Use your distribution's package manager to install Python. For example, on Ubuntu:

sudo apt-get update
sudo apt-get install python3

Installing Requirements

After installing Python, install the required packages. Navigate to the project folder in your terminal and run the command below. If pip is not available on its own, python -m pip (or python3 -m pip) works the same way:

pip install -r requirements.txt

Usage

  1. Crawling a Website:
python crawl_website.py

Follow the prompts to enter the website domain and select the type of crawl.

  2. Comparing Websites:
python compare.py

Follow the prompts to select the .txt files to be compared.
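To give a sense of what a side-by-side comparison with highlighted differences can look like, here is a minimal sketch built on the standard library's `difflib.HtmlDiff`. It is not the code in compare.py. The layout of the exported .txt files is not documented here, so the sketch simply treats each file as a list of plain text lines. The function names and the file names in the usage note are assumptions made for illustration.

```python
import difflib
from pathlib import Path


def build_diff_report(old_lines, new_lines, old_label="old", new_label="new"):
    """Return a complete HTML page showing two lists of lines side by side."""
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        old_lines, new_lines, fromdesc=old_label, todesc=new_label
    )


def compare_files(old_path, new_path, report_path):
    """Read two text exports and write a side-by-side HTML diff report."""
    old_lines = Path(old_path).read_text(encoding="utf-8").splitlines()
    new_lines = Path(new_path).read_text(encoding="utf-8").splitlines()
    html = build_diff_report(old_lines, new_lines, str(old_path), str(new_path))
    Path(report_path).write_text(html, encoding="utf-8")
```

Calling `compare_files("old_site.txt", "new_site.txt", "report.html")` with two exports (hypothetical file names) would write a report in which added, removed, and changed lines are color-coded by `HtmlDiff`.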

Why Compare Websites?

  • Migrating to a New Platform/Host: Before switching to a new platform or hosting service, you may want to confirm that every URL from the old platform exists on the new one and functions as expected.

  • Switching WordPress Themes: A change in theme may result in differences in content display, load times, or even broken links. Comparing the website before and after the switch can highlight these issues.

  • SEO Analysis: Ensuring that URLs, especially high-traffic ones, remain consistent during any changes can help preserve SEO rankings.

  • Quality Assurance: Before rolling out a redesigned website, comparing the old and new sites can help identify bugs, missing content, or other issues that need to be addressed.