Tweet Retrieval

An implementation of an inverted index for text retrieval, based on the ranked information retrieval model for natural language queries. The implementation covers the following steps:

Preprocessing

Tokenization
Stopwords filtering
Stemming
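
The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the stopword list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem each remaining token."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The cats are running in the park"))
# ['cat', 'runn', 'park']
```

The output shows why a real stemmer is preferable: the naive rule turns "running" into "runn" rather than "run".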

Index building

Terms are weighted with TF-IDF (Term Frequency - Inverse Document Frequency), and documents are ranked by cosine similarity.
The index can be stored in secondary memory.
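
As a rough sketch of how such an index might be built in memory, the function below maps each term to its posting list of TF-IDF weights. The weighting variant, (1 + log10 tf) * log10(N / df), and the names `build_index`, `idf`, and `norms` are assumptions made for illustration; the repository may use a different scheme.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a TF-IDF inverted index.

    docs maps docID -> list of preprocessed terms.
    Returns (index, idf, norms), where index maps term -> {docID: weight},
    idf maps term -> inverse document frequency, and norms maps
    docID -> Euclidean length of the document vector.
    """
    n_docs = len(docs)

    # First pass: log-scaled term frequency per (term, document).
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = 1 + math.log10(tf)

    # Document frequency is the length of each posting list.
    idf = {term: math.log10(n_docs / len(plist)) for term, plist in postings.items()}

    # Second pass: combine tf and idf, and accumulate vector lengths.
    index = {}
    squared = defaultdict(float)
    for term, plist in postings.items():
        index[term] = {}
        for doc_id, tf_weight in plist.items():
            weight = tf_weight * idf[term]
            index[term][doc_id] = weight
            squared[doc_id] += weight * weight

    norms = {doc_id: math.sqrt(total) for doc_id, total in squared.items()}
    return index, idf, norms


docs = {1: ["cat", "park"], 2: ["cat", "dog"]}
index, idf, norms = build_index(docs)
```

In this toy collection "cat" appears in every document, so its IDF is zero and it contributes nothing to the ranking. Storing the document norms alongside the index avoids recomputing them at query time.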

Query

Three different queries are proposed, and the top 10 tweets that best match each query are shown.
Each query consists of one or more words in natural language.
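
The ranking step could look roughly like the sketch below, which scores documents by cosine similarity and keeps the top k. The `search` function and the toy `index`, `idf`, and `norms` values are hypothetical and only illustrate the data layout assumed here (term -> {docID: weight}); query terms are expected to be preprocessed the same way as the tweets.

```python
import heapq
import math
from collections import Counter, defaultdict


def search(query_terms, index, idf, norms, k=10):
    """Return up to k (docID, score) pairs ranked by cosine similarity."""
    # Weight query terms like document terms; ignore terms not in the index.
    q_weights = {
        term: (1 + math.log10(tf)) * idf[term]
        for term, tf in Counter(query_terms).items()
        if term in idf
    }
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []

    # Accumulate dot products by walking only the relevant posting lists.
    scores = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, d_weight in index[term].items():
            scores[doc_id] += q_weight * d_weight

    # Normalize by both vector lengths, then keep the k best.
    ranked = (
        (dot / (q_norm * norms[doc_id]), doc_id)
        for doc_id, dot in scores.items()
        if norms[doc_id] > 0
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, ranked)]


# Hypothetical toy index with unit weights, for illustration only.
index = {"park": {1: 1.0}, "dog": {2: 1.0, 3: 1.0}, "bark": {3: 1.0}}
idf = {"park": 1.0, "dog": 1.0, "bark": 1.0}
norms = {1: 1.0, 2: 1.0, 3: math.sqrt(2)}

results = search(["dog", "bark"], index, idf, norms, k=10)
```

Here tweet 3 ranks first because it contains both query terms, followed by tweet 2. Using `heapq.nlargest` avoids sorting every candidate when only the top 10 are needed.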

To analyze the performance of the implementation, a collection of approximately 20,000 tweets is provided.

The dictionary of terms is built from the text attribute of each tweet.
docID is the ID of the tweet.

Following the performance analysis, an algorithmic solution is proposed for using secondary memory with large data collections.
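
The text does not say which algorithm is proposed. One common approach, shown here purely as an assumption, is block-based construction: build partial indexes that fit in memory, write each one to disk sorted by term, then merge the sorted blocks in a single streaming pass. The function names, the JSON-lines file layout, and the `block_size` parameter are all hypothetical.

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict
from itertools import groupby


def write_block(postings, path):
    """Write one in-memory block to disk, one term per line, sorted by term."""
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(postings):
            f.write(json.dumps([term, postings[term]]) + "\n")


def read_block(path):
    """Lazily yield (term, posting_list) pairs from a block file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, plist = json.loads(line)
            yield term, plist


def build_on_disk(doc_stream, out_dir, block_size=2):
    """Index (docID, terms) pairs in blocks of block_size documents.

    Posting lists hold raw term frequencies keyed by docID as a string,
    because JSON object keys must be strings. Returns the merged file path.
    """
    block_paths, postings, count = [], defaultdict(dict), 0
    for doc_id, terms in doc_stream:
        key = str(doc_id)
        for term in terms:
            postings[term][key] = postings[term].get(key, 0) + 1
        count += 1
        if count == block_size:
            path = os.path.join(out_dir, f"block{len(block_paths)}.jsonl")
            write_block(postings, path)
            block_paths.append(path)
            postings, count = defaultdict(dict), 0
    if postings:  # flush the last, partially filled block
        path = os.path.join(out_dir, f"block{len(block_paths)}.jsonl")
        write_block(postings, path)
        block_paths.append(path)

    # Merge the sorted blocks; only one line per block is held in memory.
    merged_path = os.path.join(out_dir, "index.jsonl")
    with open(merged_path, "w", encoding="utf-8") as out:
        streams = [read_block(p) for p in block_paths]
        merged = heapq.merge(*streams, key=lambda pair: pair[0])
        for term, group in groupby(merged, key=lambda pair: pair[0]):
            combined = {}
            for _, plist in group:
                combined.update(plist)
            out.write(json.dumps([term, combined]) + "\n")
    return merged_path


with tempfile.TemporaryDirectory() as tmp:
    tweets = [(1, ["cat", "park"]), (2, ["cat", "dog"]), (3, ["dog", "dog"])]
    merged_index = dict(read_block(build_on_disk(iter(tweets), tmp)))
```

Because the merged file stores raw term frequencies, TF-IDF weights would be computed in a later pass, once the document frequency of each term (the length of its merged posting list) is known.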

SebastianHurtado98/tweetRetrieval
