Apache Nutch Web Crawler

This project uses Apache Nutch to crawl websites and index the extracted content into Apache Solr, where it can be searched.

Prerequisites

Install and Configure Apache Nutch:
- Download and extract the latest version of Nutch.
- Edit nutch-site.xml to set project-specific options such as Solr integration and crawl delay.
Install and Start Solr:
- Download, extract, and start Solr.
- Create a new core for Nutch (the Admin UI link in this README assumes the core is named nutch).
Crawl Configuration:
- Set up seed URLs.
- Configure regex-urlfilter.txt to specify allowed or disallowed URLs.
Start the crawl using Nutch's command-line tools.
View and search the indexed data in Solr's Admin UI.
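
The seed list and URL filter from the crawl configuration step can be prototyped before running a crawl. The following is a minimal Python sketch, not part of Nutch. It writes a seed file and a filter file, then approximates how regex-urlfilter.txt rules are applied: rules are checked top to bottom, the first matching pattern decides, and a URL that matches no rule is dropped. The example.com domain, the output paths, and the helper names are all hypothetical, and Python's re module only approximates the Java regular expressions Nutch uses.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical seed list for illustration; replace with your own site.
SEED_URLS = ["https://example.com/"]

# Rules in regex-urlfilter.txt format: '+' accepts, '-' rejects,
# and lines starting with '#' are comments.
FILTER_RULES = r"""# skip common binary and static files
-\.(gif|jpg|png|css|js|zip|gz|pdf)$
# stay inside the (hypothetical) example.com site
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def is_allowed(url, rules):
    """First matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def write_files(out_dir):
    """Write urls/seed.txt and conf/regex-urlfilter.txt under out_dir."""
    seed_path = out_dir / "urls" / "seed.txt"
    filter_path = out_dir / "conf" / "regex-urlfilter.txt"
    for path in (seed_path, filter_path):
        path.parent.mkdir(parents=True, exist_ok=True)
    seed_path.write_text("\n".join(SEED_URLS) + "\n")
    filter_path.write_text(FILTER_RULES)
    return seed_path, filter_path


if __name__ == "__main__":
    # Written to a temporary directory so an existing Nutch conf/ is untouched.
    seed, flt = write_files(Path(tempfile.mkdtemp()))
    print(f"wrote {seed} and {flt}")
    rules = parse_rules(FILTER_RULES)
    for url in ("https://example.com/docs/page.html",
                "https://example.com/logo.png",
                "https://other.org/"):
        print(url, "->", "accept" if is_allowed(url, rules) else "reject")
```

Checking a handful of representative URLs this way makes it easier to spot a rule that is ordered too early or too late before copying the filter into the Nutch conf directory.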

Initiate web crawls with Nutch's command-line tools.
View the indexed data in Solr's Admin UI: http://localhost:8983/solr/#/nutch/core-overview
Search the indexed content with Solr's query interface, for example the Query tab of the Admin UI.
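
Indexed content can also be queried outside the Admin UI through Solr's HTTP select handler. The sketch below is one possible way to do that from Python using only the standard library. It assumes Solr is running on localhost:8983 with a core named nutch (as in the Admin UI link above) and that the indexed documents have url, title, and content fields. The field names and the helper functions are assumptions for illustration, so adjust them to match your schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"
CORE = "nutch"  # assumed core name, taken from the Admin UI URL


def build_select_url(query, rows=10, fields=("url", "title")):
    """Build a URL for the core's /select handler."""
    params = {
        "q": query,               # Solr query string, e.g. "content:nutch"
        "rows": rows,             # maximum number of documents to return
        "fl": ",".join(fields),   # fields to include in each document
        "wt": "json",             # ask for a JSON response
    }
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


def extract_docs(payload):
    """Pull the list of matching documents out of a Solr JSON response."""
    return payload.get("response", {}).get("docs", [])


def search(query, rows=10):
    """Run the query against Solr and return the matching documents."""
    with urlopen(build_select_url(query, rows), timeout=10) as resp:
        return extract_docs(json.load(resp))


if __name__ == "__main__":
    try:
        # "content" is an assumed field name; change it to fit your schema.
        for doc in search("content:nutch"):
            print(doc.get("url"), "|", doc.get("title"))
    except OSError as exc:
        print(f"Could not reach Solr: {exc}")
```

Keeping the URL construction separate from the network call makes it easy to print the request and paste it into a browser when a query returns nothing.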