Web Retrieval Engine Implementation for University Domain

Developed a web retrieval engine based on the vector space model for the University of Memphis domain (memphis.edu).
Crawled and preprocessed 10,000 web pages and documents (text, pdf, docx and pptx) from the University of Memphis domain.
Built the following modules: an incremental web crawler; a text preprocessor that works on raw HTML and documents (removes markup, metadata, digits, punctuation, extra whitespace and stop words, lowercases the text, then tokenizes and stems it); an indexer (doc-url, doc-term, term-doc); a TF-IDF vector generator; a web page relevance ranker; and a performance evaluator (F1, precision, recall).
Used the TF-IDF vector space model for web page matching and the cosine similarity function for web page ranking.
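
The matching and ranking steps can be illustrated with a minimal sketch. This is not the code from "search_engine.py"; the function names, the tiny stop-word list and the sample documents are assumptions made for illustration, and stemming is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the project's actual list is not shown here.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def preprocess(text):
    """Strip markup tags, lowercase, drop digits/punctuation, remove stop words."""
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML-like tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def build_idf(docs_tokens):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    return {t: count * idf[t] for t, count in Counter(tokens).items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Rank documents (dict of id -> raw text) by cosine similarity to the query."""
    ids = list(docs)
    docs_tokens = [preprocess(docs[i]) for i in ids]
    idf = build_idf(docs_tokens)
    query_vec = tfidf_vector(preprocess(query), idf)
    scored = [(i, cosine(query_vec, tfidf_vector(toks, idf)))
              for i, toks in zip(ids, docs_tokens)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample documents, not pages from memphis.edu.
    sample = {
        "a": "<p>Graduate admissions deadlines</p>",
        "b": "Campus parking permits",
        "c": "Graduate housing and parking",
    }
    for doc_id, score in rank("graduate admissions", sample):
        print(doc_id, round(score, 3))
```

Storing vectors as dicts keeps them sparse, which matters when the vocabulary is much larger than any single page.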

Go to search_engine/search_engine_website.
Run the inverse_document_indexer_final function in the "search_engine.py" file to collect documents (html/php/txt/doc/docx/ppt/pptx) using the web crawler.
This also builds the vector space model: the inverse document index and the TF-IDF vectors for all collected documents.
To crawl a different website, change the url value in "search_engine.py".
Enter a query term to search within the collected web documents.
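
As a rough illustration of the doc-term and term-doc indexes that the indexing step produces, the sketch below builds both from already-tokenized documents and looks up the documents that share a term with a query. The function names and the sample data are hypothetical and do not reflect the internals of inverse_document_indexer_final.

```python
from collections import Counter, defaultdict


def build_indexes(doc_tokens):
    """Build doc-term and term-doc indexes from a dict of doc_id -> token list.

    doc_term maps doc_id -> {term: count}.
    term_doc maps term -> {doc_id: count} (the inverted direction).
    """
    doc_term = {}
    term_doc = defaultdict(dict)
    for doc_id, tokens in doc_tokens.items():
        counts = dict(Counter(tokens))
        doc_term[doc_id] = counts
        for term, count in counts.items():
            term_doc[term][doc_id] = count
    return doc_term, dict(term_doc)


def candidate_docs(query_tokens, term_doc):
    """Return the ids of documents containing at least one query term."""
    found = set()
    for term in query_tokens:
        found.update(term_doc.get(term, {}))
    return found


if __name__ == "__main__":
    # Made-up tokenized documents keyed by a numeric id.
    tokens_by_doc = {
        0: ["library", "hours", "library"],
        1: ["campus", "map"],
    }
    doc_term, term_doc = build_indexes(tokens_by_doc)
    print(candidate_docs(["library", "map"], term_doc))
```

Looking up candidates through the term-doc index means only documents that share a term with the query need to be scored, rather than every collected document.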

To run the Django server, go to "search_engine/search_engine_website".
Open a command prompt in the directory that contains manage.py and run manage.py with the full path to python.exe, in the following manner:
C:\Users\Anjana\Anaconda3\python manage.py runserver (format: <location of python.exe>\python manage.py runserver)
To view the web interface for the search engine, go to http://127.0.0.1:8000/

Current Version : v1.0.0.0

Last Update : 12.01.2017

anjanatiha/Web-Retrieval-Search-Engine-Implementation-for-University-Web-Domain