This project uses Apache Nutch to crawl websites and indexes the extracted data into Apache Solr, making the crawled content searchable.
- Java Development Kit (JDK)
- Windows Subsystem for Linux (WSL) for Windows users
- A code editor such as Visual Studio Code
- Install and Configure Apache Nutch:
  - Download and extract the latest version of Nutch.
  - Modify `nutch-site.xml` for specific configurations like Solr integration, crawl delay, etc.
- Install and Start Solr:
  - Download, extract, and start Solr.
  - Create a new core for Nutch.
- Crawl Configuration:
  - Set up seed URLs.
  - Configure `regex-urlfilter.txt` to specify allowed or disallowed URLs.
- Start the Crawl Process using Nutch commands.
- View & Search Indexed Data in Solr's Admin UI.
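
The allow/disallow rules in `regex-urlfilter.txt` are easier to reason about with a small model of how they are evaluated. The sketch below is not Nutch code. It is a rough Python approximation, assuming the usual convention that each rule starts with `+` (accept) or `-` (reject), that the first rule whose pattern is found in the URL decides the outcome, and that a URL matching no rule is rejected. The rules and the `example.com` host are placeholders, and Nutch itself uses Java regular expressions, which differ from Python's in some details.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt: a leading "+" accepts,
# a leading "-" rejects, and "#" starts a comment line.
RULES_TEXT = r"""
# skip common binary and media file suffixes
-\.(gif|jpg|png|css|js|zip|pdf)$
# stay within one (placeholder) site
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Return a list of (accept, compiled_pattern) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_allowed(url, rules):
    """Apply the first rule whose pattern is found in the URL; reject if none match."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
for url in (
    "https://www.example.com/docs/",
    "https://www.example.com/logo.png",
    "https://other.org/",
):
    print(url, is_allowed(url, rules))
```

Because the first matching rule wins, the order of lines matters: placing the suffix exclusions before the site rule is what keeps image and archive URLs on the allowed site out of the crawl.
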
- Initiate web crawls using Nutch's command line tools.
- View indexed data on Solr's Admin UI: http://localhost:8983/solr/#/nutch/core-overview
- Use Solr's query interface to search through the indexed content.
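
Searching can also be done outside the Admin UI by calling the core's `/select` handler over HTTP. The following is a minimal sketch under a few assumptions: Solr is running locally on port 8983 with a core named `nutch` (both taken from the Admin UI URL above), and the indexed documents have `url`, `title`, and `content` fields, which depends on the schema actually in use. The helper names `build_select_url` and `search` are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # host and port from the Admin UI URL
CORE = "nutch"                            # core name from the Admin UI URL


def build_select_url(query, rows=10, fields=("url", "title")):
    """Build a URL for the core's /select request handler."""
    params = {
        "q": query,               # Solr query string, e.g. "content:nutch"
        "rows": rows,             # maximum number of documents to return
        "fl": ",".join(fields),   # fields to include (assumed to exist in the schema)
        "wt": "json",             # ask for a JSON response
    }
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


def search(query, rows=10):
    """Run the query against a running Solr instance and return the matching documents."""
    with urlopen(build_select_url(query, rows)) as response:
        return json.load(response)["response"]["docs"]


# Building the URL needs no running server; search() does.
print(build_select_url("content:nutch", rows=5))
```

With Solr running and a crawl indexed, `search("content:nutch")` would return a list of dictionaries, one per matching document, limited to the requested fields.
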