Information Retrieval Project

This project implements a small information retrieval pipeline in Python, using libraries such as pandas and NLTK. It reads and preprocesses a set of text documents, calculates TF-IDF scores for the documents and for a query, and then uses cosine similarity to rank the documents and retrieve those relevant to the query.
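For reference, one common way to define the quantities named above is given below. The README does not state which TF or IDF variant the project uses, so the exact forms (relative-frequency TF, unsmoothed logarithmic IDF) and the symbols are assumptions: f_{t,d} is the count of term t in document d, N is the number of documents, and n_t is the number of documents containing t.

```latex
\begin{align*}
\mathrm{tf}(t, d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log \frac{N}{n_t} \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \\
\cos(q, d) &= \frac{\sum_{t} q_t \, d_t}{\sqrt{\sum_{t} q_t^2} \, \sqrt{\sum_{t} d_t^2}}
\end{align*}
```

Here q_t and d_t are the TF-IDF weights of term t in the query and in the document. The cosine is undefined when either vector is all zeros, because the denominator is then zero.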

Project Overview

Mount Google Drive: The project begins by mounting Google Drive to access the dataset and save the results.
Import Libraries: Import the required libraries, including pandas and NLTK.
Read Documents: Read the ten text documents from Google Drive and store each one in its own variable (d1 to d10).
Preprocess Documents: Preprocess each document by lowercasing, tokenizing, removing punctuation, removing stop words, lemmatizing, and stemming.
Write Preprocessed Documents: Write the preprocessed documents to new files in the "dataOut" directory on Google Drive.
Read Preprocessed Documents: Read the preprocessed documents for further analysis.
Compute Most Frequent Words: Calculate the most frequent words in each document and display the top 4 words for each.
Create a Bag of Words: Build the set of unique words across all documents. This word set is the vocabulary for the dictionaries used to calculate TF, IDF, and TF-IDF.
Calculate TF for Documents: Compute Term Frequency (TF) for each word in each document.
Calculate IDF for Documents: Compute Inverse Document Frequency (IDF) for each word across all documents.
Calculate TF-IDF for Documents: Compute the TF-IDF score of each word in each document.
Preprocess the Query: Preprocess the query with the same steps as the documents: lowercasing, tokenizing, removing punctuation, removing stop words, lemmatizing, and stemming.
Compute TF-IDF for the Query: Calculate TF-IDF scores for the query.
Rank Documents by Similarity: Calculate the cosine similarity between the query and all documents, ranking the documents based on similarity scores.
Display Relevant Documents: List the relevant and non-relevant documents according to their cosine similarity scores. Documents whose score is NaN are treated as non-relevant.
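The TF, IDF, and TF-IDF steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the stop-word set, the function names, and the three toy documents are hypothetical, lemmatization and stemming are omitted, and the TF and IDF forms (relative frequency and log(N / df)) are assumed because the README does not specify them.

```python
import math
import string
from collections import Counter

# Hypothetical stop-word set for illustration; the project uses NLTK instead.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "and", "of", "in", "to"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def term_frequency(tokens, vocabulary):
    """Relative frequency of every vocabulary word in one token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


def inverse_document_frequency(docs_tokens, vocabulary):
    """log(N / df) per word, where df is the number of documents containing it."""
    n_docs = len(docs_tokens)
    idf = {}
    for word in vocabulary:
        df = sum(1 for tokens in docs_tokens if word in tokens)
        idf[word] = math.log(n_docs / df) if df else 0.0
    return idf


def tf_idf(tf, idf):
    """Multiply each word's TF by its IDF."""
    return {word: tf[word] * idf[word] for word in tf}


# Toy documents standing in for d1 to d10.
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are pets.",
]
docs_tokens = [preprocess(doc) for doc in documents]
vocabulary = sorted(set().union(*docs_tokens))  # the shared word set

idf = inverse_document_frequency(docs_tokens, vocabulary)
doc_vectors = [tf_idf(term_frequency(t, vocabulary), idf) for t in docs_tokens]

# The query is weighted with the IDF values learned from the documents.
query_vector = tf_idf(term_frequency(preprocess("cat on a mat"), vocabulary), idf)
```

Scoring the query against the same vocabulary and IDF values keeps the query and document vectors comparable, which the cosine similarity step relies on.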

Given a query, the project searches the dataset for relevant documents and ranks them by their similarity to the query, which yields a simple information retrieval system.
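The ranking step can be sketched as follows. This is an illustration rather than the project's implementation: the function names and the toy vectors are hypothetical, and the sketch assumes that a NaN score comes from a zero-length vector (a document or query that shares no weighted terms with the vocabulary), which is one plausible cause but is not confirmed by the README.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns NaN when either vector has zero norm, since the ratio is undefined.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Split documents into a ranked list and a NaN-scored non-relevant list."""
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
    relevant = sorted(
        ((name, s) for name, s in scores.items() if not math.isnan(s)),
        key=lambda item: item[1],
        reverse=True,
    )
    non_relevant = [name for name, s in scores.items() if math.isnan(s)]
    return relevant, non_relevant


# Toy TF-IDF vectors over a three-word vocabulary (values are made up).
doc_vecs = {
    "d1": [0.2, 0.0, 0.5],
    "d2": [0.0, 0.4, 0.0],
    "d3": [0.0, 0.0, 0.0],  # all-zero vector, so its score is NaN
}
query_vec = [0.1, 0.0, 0.3]

relevant, non_relevant = rank_documents(query_vec, doc_vecs)
for name, score in relevant:
    print(f"{name}: {score:.3f}")
print("Non-relevant:", non_relevant)
```

In this sketch a document with a score of exactly 0.0 still appears at the bottom of the ranked list. The README only says that NaN scores mark a document as non-relevant, so how zero scores should be treated is left open here.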

Renad-CAI/Query-Based-Information-Retrieval-with-TF-IDF-and-Cosine-Similarity-using-Python
