A search engine for crawling news from the web, storing in a structured way, and querying through the stored documents for finding the most relevant results using Machine Learning and Information Retrieval techniques.
Before running the server, you need to install the required packages. To do that, create a virtual environment and run
$ pip install -r requirements.txt
The version of python used is 3.8.10
. Before running the search engine for the first time, issue
the following command:
$ python manage.py migrate
For running the search engine, use the following command:
$ python manage.py runserver
This search engine is designed to retrieve the most relevant documents given a query by the user. The steps of implementation are comprised of 1) indexing and storing documents, 2) querying through documents.
When a document is received, first, some preprocessing steps, including Normalization, Tokenization, Stemmingand Removing stop words are done. Then the document is stored as a vector. We need to accept a weighting scheme for weighting words within a document and vectorize it. For this purpose, TF-IDF method is used.
When the user provides a query, the query is converted to a vector using the same process as used for documents (TF-IDF). Then the similarity of the query to each document is calculated using the Cosine similarity. Then, K most similar documents to the query are returned.
According to Wikipedia, we can have different weighting scheme combinations for queries and documents. The weighting schemes we have implemented are as shown in the following table:
Different weighting schemes for converting documents and queries to vectors
In this table,
As discussed, the relevant documents are retrieved based on their Cosine similarity with the query. However, we have a large number of documents, and comparing the query with each document and returning the most relevant ones is not an efficient method. For this reason, we use clustering techniques to cluster the documents and compare each query only to the centers of the cluster. The most relevant clusters are then retrieved, and the query is compared to the documents of those clusters. For clustering the documents, we need to specify the number of clusters. To do that, we apply the Residual Sum of Squares (RSS) method.
The search can also be done in a specific category of documents. The category should be given in the query by cat:<category name>
To search through the categories, first, we should perform classification over documents. For this purpose, we categorize all
the News to 8 classes: Science, Culture-Art, Politics, Economy, Social, International, Sport, and Multimedia.
Since classification is a supervised method, we require to have labeled data. Therefore, we labeled some documents
and classified them using the KNN algorithm.