This project implements a simple information retrieval pipeline in Python, using libraries such as pandas and NLTK. It reads and preprocesses a set of text documents, computes TF-IDF scores for the documents and for a query, and uses cosine similarity to rank and retrieve the documents most relevant to that query.

- Mount Google Drive: Mount Google Drive to access the dataset and save the results.
- Import Libraries: Import the necessary libraries, such as pandas and NLTK.
- Read Documents: Read the text documents from Google Drive and store each one in its own variable (d1 to d10).
- Preprocess Documents: Preprocess each document by lowercasing, tokenizing, removing punctuation, removing stop words, lemmatizing, and stemming.
- Write Preprocessed Documents: Write the preprocessed documents to new files in the "dataOut" directory on Google Drive.
- Read Preprocessed Documents: Read the preprocessed documents back for further analysis.
- Compute Most Frequent Words: Count word frequencies in each document and display the top 4 words for each.
- Create a Bag of Words: Build a wordset from all documents. It serves as the dictionary used to calculate TF, IDF, and TF-IDF.
- Calculate TF for Documents: Compute the Term Frequency (TF) of each word in each document.
- Calculate IDF for Documents: Compute the Inverse Document Frequency (IDF) of each word across all documents.
- Calculate TF-IDF for Documents: Compute the TF-IDF score of each word in each document.
- Preprocess the Query: Preprocess the query with the same steps as the documents: lowercasing, tokenizing, removing punctuation, removing stop words, lemmatizing, and stemming.
- Compute TF-IDF for the Query: Calculate TF-IDF scores for the query.
- Rank Documents by Similarity: Calculate the cosine similarity between the query and every document, then rank the documents by their similarity scores.
- Display Relevant Documents: Display the relevant and non-relevant documents based on the cosine similarity scores. Documents with NaN scores are considered non-relevant.

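
The preprocessing applied to both the documents and the query can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the `normalize` hook are hypothetical names introduced here. In the project itself the stop-word list would more likely come from NLTK (for example `nltk.corpus.stopwords.words("english")`), and `normalize` could wrap NLTK's `WordNetLemmatizer().lemmatize` followed by `PorterStemmer().stem`.

```python
import string

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "and", "to", "in", "for", "on"})

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=STOP_WORDS, normalize=None):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words.

    `normalize` is an optional callable applied to each remaining token,
    standing in for the lemmatization and stemming steps.
    """
    lowered = text.lower()
    without_punct = lowered.translate(_PUNCT_TABLE)
    tokens = without_punct.split()
    kept = [t for t in tokens if t not in stop_words]
    if normalize is not None:
        kept = [normalize(t) for t in kept]
    return kept


print(preprocess("The cats are sitting on the mat, quietly!"))
# ['cats', 'sitting', 'mat', 'quietly']
```

Using one function for both documents and queries keeps their tokens comparable, which the later TF-IDF and cosine similarity steps rely on.
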
Given a query, the project searches your dataset for relevant documents and ranks them by their similarity to the query, which yields a simple information retrieval system.
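
The scoring and ranking steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the project's implementation: all function names are hypothetical, TF is taken as the word count divided by the document length, IDF is assumed to be `log(N / df)` with the natural logarithm, and a cosine similarity with a zero-length vector is reported as NaN. The toy documents are already tokenized and are invented for the example.

```python
import math
from collections import Counter


def term_frequency(tokens, vocabulary):
    """TF: count of each vocabulary word divided by the number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


def inverse_document_frequency(docs, vocabulary):
    """IDF assumed as log(N / df); the project may use a different variant."""
    n_docs = len(docs)
    idf = {}
    for w in vocabulary:
        df = sum(1 for tokens in docs.values() if w in tokens)
        idf[w] = math.log(n_docs / df) if df else 0.0
    return idf


def tf_idf_vector(tokens, idf, vocabulary):
    """TF-IDF weights laid out in vocabulary order."""
    tf = term_frequency(tokens, vocabulary)
    return [tf[w] * idf[w] for w in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; NaN if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else float("nan")


def rank_documents(docs, query_tokens):
    """Return (ranked relevant documents, names of non-relevant documents)."""
    vocabulary = sorted({w for tokens in docs.values() for w in tokens})
    idf = inverse_document_frequency(docs, vocabulary)
    query_vec = tf_idf_vector(query_tokens, idf, vocabulary)
    scores = {
        name: cosine_similarity(tf_idf_vector(tokens, idf, vocabulary), query_vec)
        for name, tokens in docs.items()
    }
    # Documents with a NaN score are treated as non-relevant; the rest are
    # sorted from most to least similar (zero scores end up at the bottom).
    relevant = sorted(
        ((name, s) for name, s in scores.items() if not math.isnan(s)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    non_relevant = [name for name, s in scores.items() if math.isnan(s)]
    return relevant, non_relevant


# Toy, already-preprocessed documents (invented for this example).
docs = {
    "d1": ["cat", "sat", "mat"],
    "d2": ["dog", "chased", "cat"],
    "d3": ["stock", "market", "fell"],
    "d4": [],  # empty after preprocessing, so its vector is all zeros
}

relevant, non_relevant = rank_documents(docs, ["cat", "mat"])
for name, score in relevant:
    print(f"{name}: {score:.3f}")
print("non-relevant:", non_relevant)
```

In this sketch a query made up entirely of words that are absent from the vocabulary produces an all-zero query vector, so every document scores NaN and ends up in the non-relevant list.
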