The aim of this project is to carry out the main stages of information retrieval (crawling, indexing, querying and classification) and to apply different methods to enhance and optimise each stage. For this report, our group selected Amazon Books as the target website for crawling and uses the crawled information for further processing.
Ensure Python and pip are installed on the machine and that both are included in the PATH environment variable.
To start crawling links, cd to the root folder and run:
cd crawl
python crawlLinks.py
crawlLinks.py currently crawls links under the cook-outdoor cooking topic. To crawl links for other topics, change the parameters at line 18 and line 47 of the script.
To start crawling book details, run:
python crawlBooks.py
Change the input (links) and output (book details) files if you want to crawl different topics.
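The link-crawling step can be pictured with a minimal sketch like the one below. It is not the code in crawlLinks.py; it only assumes that book links can be recognised by a substring in the href. The names LinkCollector and extract_links, the "/dp/" marker and the example.com base URL are all hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical substring that identifies a book link
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)
            # Keep first-seen order and skip duplicates.
            if url not in self.links:
                self.links.append(url)


def extract_links(html, base_url, marker="/dp/"):
    """Return the de-duplicated list of matching links found in an HTML string."""
    parser = LinkCollector(base_url, marker)
    parser.feed(html)
    return parser.links


sample = (
    '<a href="/dp/0000000001">A</a> '
    '<a href="/help">Help</a> '
    '<a href="/dp/0000000001">A again</a>'
)
print(extract_links(sample, "https://www.example.com"))
```

The sketch works on an HTML string, so it can be tried without any network access; a real crawler would also need to fetch pages and respect the site's crawling rules.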
To start indexing, cd to the root folder and run:
cd solr/solr-7.2.1/bin
solr start
solr create -c amazon
cd ..
python solr_indexing.py
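As a rough illustration of what indexing into the amazon core involves, the sketch below builds a JSON update request for Solr's /update handler. It is not the content of solr_indexing.py: the function name build_update_request, the document fields (id, title, category) and the default address http://localhost:8983 are assumptions. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_update_request(docs, core="amazon", host="http://localhost:8983"):
    """Build (but do not send) a POST request that adds docs to a Solr core."""
    url = f"{host}/solr/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document; the real field names depend on the crawled data.
books = [{"id": "book-1", "title": "Example Title", "category": "example"}]
request = build_update_request(books)
print(request.get_method(), request.full_url)

# With Solr running, the request could be sent with:
# urllib.request.urlopen(request)
```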
To install Django, run:
pip install django
To start the Django web server, run:
cd gui
python manage.py runserver
Open a web browser and go to:
http://127.0.0.1:8000
Type your book query, select the book categories, then press Enter or click the submit button to run the query.
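A query with selected categories could be translated into a Solr select URL roughly as sketched below. This assumes the GUI talks to Solr's /select handler on the amazon core and that documents carry a category field; the function name build_select_url, the field name and the host address are hypothetical, and the actual GUI code may work differently.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_select_url(query, categories=(), core="amazon", host="http://localhost:8983"):
    """Build a Solr select URL with an optional category filter query (fq)."""
    params = [("q", query), ("wt", "json")]
    if categories:
        # Quote each category so multi-word names are treated as phrases.
        joined = " OR ".join(f'"{c}"' for c in categories)
        params.append(("fq", f"category:({joined})"))
    return f"{host}/solr/{core}/select?" + urlencode(params)


url = build_select_url("grilling", ["outdoor cooking"])
print(url)

# Decode the query string again to inspect the parameters that were sent.
decoded = parse_qs(urlparse(url).query)
print(decoded["fq"])
```

Using fq for the categories keeps the filter separate from the main query in q, so the category choice narrows the results without affecting relevance scoring.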
To install the Python packages for classification, run:
pip install scikit-learn
pip install pandas
pip install numpy
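Before running the classification script, a quick check like the one below can confirm that the packages are importable. It is a small sketch, not part of the project: the helper name missing_packages is hypothetical. Note that scikit-learn is imported under the module name sklearn, which differs from its pip package name.

```python
import importlib.util

# Map each pip package name to the module name used in import statements.
REQUIRED = {"scikit-learn": "sklearn", "pandas": "pandas", "numpy": "numpy"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```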
To run classification, cd to the root folder and run:
cd classification
python classification2.py