Web Crawler

This project builds a web crawler that checks every page on a website and reports any broken webpages.

User Story

As a developer

I want a tool that automatically checks every webpage on the website

So that I can quickly identify whether new features or bug fixes introduced to the website break any existing pages.

Acceptance Criteria

  • All public-facing webpages on the website can be automatically discovered and tested.
  • Any pages that return errors are logged for follow-up.

Getting Started

Add URLs for crawling

In the spider class (e.g. ./mycrawler/spiders/pageavailability.py), replace the example.com URL with the real URL you want to crawl, as sketched below.
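For reference, here is a minimal sketch of what such a spider might look like, assuming a standard Scrapy project layout. The class name, spider name, and error-logging details are illustrative, not the project's actual code.

    import scrapy

    class PageAvailabilitySpider(scrapy.Spider):
        """Crawls a site and logs any page that returns an error status."""
        name = "pageavailability"
        allowed_domains = ["example.com"]     # replace with the real domain
        start_urls = ["https://example.com"]  # replace with the real start URL

        # Let common error responses reach parse() so they can be logged,
        # instead of being filtered out by Scrapy's default HTTP error middleware.
        handle_httpstatus_list = [404, 500, 502, 503]

        def parse(self, response):
            if response.status >= 400:
                self.logger.error("Broken page: %s (HTTP %s)",
                                  response.url, response.status)
                return
            # Follow internal links so the whole site gets covered.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Scrapy deduplicates already-visited URLs by default, so following every link in parse() will still terminate once the site has been fully covered.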

Install and Run

This project has been tested on macOS only.

  1. Install Docker for Mac
  2. Clone this project to your local environment.
  3. Run docker-compose up from the top-level directory of this project.

The docker-compose up command starts the crawler service and runs the crawler against the specified website.
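The compose file itself is not shown here, but a minimal docker-compose.yml for this kind of setup might look like the following. The service name and command are assumptions, not the project's actual configuration.

    version: "3"
    services:
      crawler:
        build: .                                # build the image from the local Dockerfile
        command: scrapy crawl pageavailability  # run the spider by its (assumed) name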

Common Practices

Avoiding getting banned for scraping
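Scrapy ships with several settings that help a crawler stay polite and reduce the risk of being blocked. The values below are illustrative, not this project's actual configuration; tune them for the target site.

    # mycrawler/settings.py (illustrative values)
    ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
    DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same domain
    CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallel requests per domain
    AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server load
    USER_AGENT = "mycrawler (+https://example.com/contact)"  # identify the crawler honestly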