SiteSpider

SiteSpider is a lightweight, asynchronous web crawler built using Node.js. This tool allows users to crawl websites, collect all unique URLs, and generate reports on the links discovered.

Features

  • Concurrent Crawling: Issues asynchronous requests so that multiple pages are fetched at once, maximizing crawling speed.
  • Link Discovery: Extracts both relative and absolute URLs from each crawled page (see the sketch after this list).
  • URL Hit Tracking: Counts how many times each URL appears within the site.
  • HTML-Only Parsing: Skips non-HTML responses to focus solely on web content.
  • Reporting: Exports crawl results to a CSV file or Excel file for easy analysis and documentation.
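
Put together, a single crawl step might look roughly like the following. This is a minimal, hypothetical sketch rather than the project's actual code: it assumes jsdom for HTML parsing and Node's built-in fetch (Node 18+), and the names crawlPage and getURLsFromHTML are illustrative only.

const { JSDOM } = require('jsdom');

// Extract every href from a page, resolving relative URLs against the base.
function getURLsFromHTML(htmlBody, baseURL) {
  const urls = [];
  const dom = new JSDOM(htmlBody);
  for (const anchor of dom.window.document.querySelectorAll('a')) {
    const href = anchor.getAttribute('href');
    if (!href) continue;
    try {
      urls.push(new URL(href, baseURL).href); // relative hrefs are resolved; absolute ones pass through
    } catch {
      // ignore malformed hrefs
    }
  }
  return urls;
}

// Crawl one page, updating a { url: hitCount } map, then recurse into new links.
async function crawlPage(baseURL, currentURL, pages = {}) {
  // Stay on the same host as the base URL.
  if (new URL(currentURL).hostname !== new URL(baseURL).hostname) return pages;

  const normalized = new URL(currentURL).href;
  if (pages[normalized]) {
    pages[normalized]++; // URL hit tracking: already seen, just bump the count
    return pages;
  }
  pages[normalized] = 1;

  let resp;
  try {
    resp = await fetch(normalized);
  } catch {
    return pages; // network error: skip this URL
  }

  // HTML-only parsing: skip images, PDFs, and other non-HTML responses.
  if (!resp.headers.get('content-type')?.includes('text/html')) return pages;

  const html = await resp.text();
  // Crawl newly discovered links concurrently.
  await Promise.all(
    getURLsFromHTML(html, baseURL).map((url) => crawlPage(baseURL, url, pages))
  );
  return pages;
}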

Installation

To get started, clone this repository and install the required dependencies:

git clone https://github.com/shyarnis/SiteSpider.git
cd SiteSpider
yarn install 

Usage

To start the web crawler, run the following command:

yarn start <website_url> [report_format]
  • <website_url>: Base URL of the site you want to crawl.
  • [report_format] (optional): Specify console, csv, or xlsx. Defaults to console if not provided.
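
Under the hood, yarn start simply runs the entry script with these arguments. A hypothetical sketch of the argument handling (the project's actual entry point may differ):

// argv[2] is the website URL, argv[3] the optional report format.
const [, , websiteURL, reportFormat = 'console'] = process.argv;

if (!websiteURL) {
  console.error('Usage: yarn start <website_url> [report_format]');
  process.exit(1);
}
if (!['console', 'csv', 'xlsx'].includes(reportFormat)) {
  console.error(`Unknown report format: ${reportFormat}`);
  process.exit(1);
}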

Examples

  • Crawl and print the report to the console.
yarn start https://tinyclouds.org
  • Crawl and save the report as a CSV file.
yarn start https://tinyclouds.org csv
  • Crawl and save the report as an Excel file.
yarn start https://tinyclouds.org xlsx

Reports are saved in the application's output directory.
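
As an illustration, the CSV report could be produced along these lines. This is a sketch only; the project's actual report code, column names, and output filename may differ.

const fs = require('fs');
const path = require('path');

// Write { url: hits } pairs to output/report.csv (hypothetical filename).
function writeCSVReport(pages, outputDir = 'output') {
  fs.mkdirSync(outputDir, { recursive: true });
  const rows = Object.entries(pages)
    .sort((a, b) => b[1] - a[1]) // most-visited URLs first
    .map(([url, hits]) => `"${url}",${hits}`);
  const csv = ['url,hits', ...rows].join('\n');
  fs.writeFileSync(path.join(outputDir, 'report.csv'), csv);
}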