This is a web crawler application that recursively explores and maps the structure of a website.
To use the app, run:
`npm start <url to crawl>`
Example:
`npm start https://www.spacedventures.com/`
This web crawler application does the following:
- Starts from a given base URL and recursively crawls all pages within the same domain.
- Extracts and follows links from each page to discover the site structure.
- Keeps track of how many times each page is linked to within the site.
- Respects a same-domain policy, never following links to external websites.
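A minimal sketch of an entry point that ties these steps together might look like the following. The file names (`main.js`, `crawl.js`), the `crawlPage(baseURL, currentURL, pages)` signature, and the report printed at the end are assumptions for illustration, not necessarily how the app is actually structured.

```js
// main.js (hypothetical entry point): read the base URL from argv,
// crawl the site, and print how often each page is linked to internally.
const { crawlPage } = require('./crawl.js');

async function main() {
  if (process.argv.length !== 3) {
    console.error('usage: npm start <url to crawl>');
    process.exit(1);
  }
  const baseURL = process.argv[2];
  console.log(`starting crawl of ${baseURL}`);

  // pages maps each normalized URL to the number of times it was encountered.
  const pages = await crawlPage(baseURL, baseURL, {});

  for (const [url, count] of Object.entries(pages)) {
    console.log(`${count} internal links to ${url}`);
  }
}

main();
```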
The main `crawlPage` function:
- Checks that the current URL is within the same domain as the base URL.
- Normalizes URLs and tracks pages that have already been visited.
- Fetches page content, checks for errors, and ensures it's HTML.
- Extracts URLs from the HTML and recursively crawls them.
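A rough sketch of that logic is below; the exact signature, the reliance on Node's built-in `fetch`, and the helper names `normalizeURL` and `getURLsFromHTML` (covered next) are assumptions based on the description above rather than a copy of the project's code.

```js
async function crawlPage(baseURL, currentURL, pages) {
  // Only crawl pages on the same domain as the base URL.
  const baseURLObj = new URL(baseURL);
  const currentURLObj = new URL(currentURL);
  if (baseURLObj.hostname !== currentURLObj.hostname) {
    return pages;
  }

  // If the page has been visited already, just bump its link count.
  const normalizedCurrentURL = normalizeURL(currentURL);
  if (pages[normalizedCurrentURL] > 0) {
    pages[normalizedCurrentURL]++;
    return pages;
  }
  pages[normalizedCurrentURL] = 1;

  console.log(`actively crawling: ${currentURL}`);
  try {
    const resp = await fetch(currentURL);

    // Skip pages that respond with an error status code.
    if (resp.status > 399) {
      console.log(`fetch error, status ${resp.status} on page ${currentURL}`);
      return pages;
    }

    // Skip responses that are not HTML (images, PDFs, etc.).
    const contentType = resp.headers.get('content-type');
    if (!contentType || !contentType.includes('text/html')) {
      console.log(`non-HTML response (${contentType}) on page ${currentURL}`);
      return pages;
    }

    // Extract every link from the page and crawl each one recursively.
    const htmlBody = await resp.text();
    for (const nextURL of getURLsFromHTML(htmlBody, baseURL)) {
      pages = await crawlPage(baseURL, nextURL, pages);
    }
  } catch (err) {
    console.log(`fetch failed on page ${currentURL}: ${err.message}`);
  }
  return pages;
}
```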
The `getURLsFromHTML` function:
- Parses HTML using JSDOM.
- Extracts all `<a>` tags and their `href` attributes.
- Handles both relative and absolute URLs.
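Assuming the `jsdom` package is used and the base URL is passed in for resolving relative links, the extraction could be sketched like this:

```js
const { JSDOM } = require('jsdom');

// Collect the href of every <a> tag in an HTML document, resolving
// relative paths (e.g. "/about") against the site's base URL.
function getURLsFromHTML(htmlBody, baseURL) {
  const urls = [];
  const dom = new JSDOM(htmlBody);
  const linkElements = dom.window.document.querySelectorAll('a');
  for (const linkElement of linkElements) {
    const href = linkElement.getAttribute('href');
    if (!href) {
      continue; // anchor without an href attribute
    }
    try {
      // new URL() resolves relative hrefs against baseURL and leaves
      // absolute ones untouched; malformed URLs throw and are skipped.
      urls.push(new URL(href, baseURL).href);
    } catch (err) {
      console.log(`skipping invalid URL "${href}": ${err.message}`);
    }
  }
  return urls;
}
```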
The `normalizeURL` function:
- Standardizes URLs by removing the protocol and trailing slashes.
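A sketch of that normalization, assuming only the hostname and path are treated as significant when identifying a page:

```js
// Map variants such as "https://example.com/path/" and "http://example.com/path"
// to the same key by dropping the protocol and any trailing slash.
function normalizeURL(urlString) {
  const urlObj = new URL(urlString);
  const hostPath = `${urlObj.hostname}${urlObj.pathname}`;
  if (hostPath.length > 0 && hostPath.endsWith('/')) {
    return hostPath.slice(0, -1);
  }
  return hostPath;
}
```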
The crawler returns an object where:
- Keys are normalized URLs of crawled pages
- Values are the number of times each URL was encountered during the crawl
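For example, a crawl of a small site might return an object along these lines (the URLs and counts are purely illustrative):

```js
{
  'example.com': 12,
  'example.com/about': 3,
  'example.com/blog': 7,
  'example.com/blog/first-post': 1
}
```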
This output provides:
- A map of all unique pages within the domain
- The frequency of internal links to each page
This web crawler can be useful for:
- Creating a sitemap
- Analyzing internal linking structure
- Identifying the most linked-to pages within a site
- Understanding the overall structure of a website
This crawler is designed for educational and analytical purposes. Always respect robots.txt files and website terms of service when crawling websites.