Description

This project is a concurrent information retrieval application based on the Standard Boolean Model. It processes queries in parallel over the Cranfield Collection of text documents, using atomic memory transactions implemented in C++.

Standard Boolean Model

Based on Boolean logic and classical set theory, the Boolean Model represents documents and queries as sets of terms. As a result, retrieval depends only on whether a document contains the query terms or not.

For example, given a set of documents Doc_i and a query Q:

Doc₁ -> {word₁, word₂, word₃}
Doc₂ -> {word₂, word₃}
Doc₃ -> {word₃}
Q -> {word₁, word₂, word₃}

The Boolean model would evaluate the documents as follows:

Doc₁ -> score = 3 (contains 3 terms)
Doc₂ -> score = 2 (contains 2 terms)
Doc₃ -> score = 1 (contains 1 term)
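
The scoring in the example above can be sketched in a few lines of C++. This is only an illustration, not the project's actual code: the `TermSet` alias and the `score` function are hypothetical names introduced here, and the sketch assumes each document and query has already been reduced to a set of terms.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical alias: a document or a query reduced to its set of terms.
using TermSet = std::set<std::string>;

// Counts how many of the query terms also occur in the document.
// (Hypothetical helper, written for illustration only.)
std::size_t score(const TermSet& doc, const TermSet& query) {
    std::size_t matches = 0;
    for (const std::string& term : query) {
        if (doc.count(term) > 0) {
            ++matches;
        }
    }
    // A score can never exceed the number of query terms.
    assert(matches <= query.size());
    return matches;
}
```

With the sets from the example, calling `score` on Doc₁, Doc₂ and Doc₃ against Q yields 3, 2 and 1 respectively. Because `score` only reads its arguments, separate queries could in principle be scored concurrently without sharing mutable state.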

Cranfield Collection

The Cranfield test collection includes 1400 abstracts of aeronautical journal articles, a set of 225 queries, and exhaustive relevance judgments for all (query, document) pairs.

Pre-Processing

Initially, the Cranfield collection was stored in two files:

cran.all.1400, which contains 1400 abstracts of aeronautical journal articles
cran.qry, which contains the 225 queries

To facilitate parallel processing, the documents and queries are split into separate text files: 1400 for the documents and 225 for the queries.
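
The splitting step might look roughly like the following sketch. It is not the project's actual pre-processing code. It assumes that every record in the Cranfield files starts with a marker line of the form `.I <id>`, and the `Record` alias and both function names are hypothetical. Section markers such as `.T` or `.W` are left in the record text untouched.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: (record id, raw text of that record).
using Record = std::pair<std::string, std::string>;

// Splits a stream into records, assuming each record begins with ".I <id>".
// Lines that appear before the first marker are ignored.
std::vector<Record> splitRecords(std::istream& in) {
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(".I ", 0) == 0) {  // line starts with ".I "
            assert(line.size() > 3 && "marker line must carry an id");
            records.emplace_back(line.substr(3), std::string());
        } else if (!records.empty()) {
            records.back().second += line;
            records.back().second += '\n';
        }
    }
    return records;
}

// Convenience wrapper for text that is already held in memory.
std::vector<Record> splitRecordsFromString(const std::string& text) {
    std::istringstream in(text);
    return splitRecords(in);
}
```

Each returned record could then be written to its own file with `std::ofstream`, which gives one file per document or query as described above.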

In addition to splitting, the SnowballAnalyzer and StopAnalyzer classes of Apache Lucene are used for stemming and stop-word removal.
