Information Retrieval. Assignment 2

This is further improvement of my boolean_retrieval project.

Preparing environment

Technical requirements still the same, you need:

Python 3.6
NLTK with NLTK Data

You can find installation details in boolean_retrieval project.

Structure

There are 2 subprojects in this repository, each responsible for different thing:

index_builder
search

You can run each through python main.py and see nice help using python main.py -h (there you can discover, for example, what arguments to use for changing default filenames)

index_builer

This subproject creates an inverted index for documents from documents.json (default name, could be changed) and stores it in index.json (default name, could be changed).

The structure of documents.json must be the following:

[
{ "id": 57, "title": "...", "body": "..."},
{ "id": 368, "title": "...", "body": "..."},

]

index.json now has more information, comparing to boolean_retrieval, where we had only posting lists. Here it contains document_frequency for each term and term_frequency for term-document pairs.

The structure of index.json:

{
    "museum": {
        "document_frequency": 2,
        "term_frequency": {
            "158": 1,
            "3573": 5
        }
    },
    "cooper": {
        "document_frequency": 5,
        "term_frequency": {
            "158": 2,
            "3754": 7,
            "4148": 1,
            "5522": 2,
            "5578": 1
        }
    },
    
}

For stemming SnowballStemmer from nltk is used.

search

This simple search engine performs ranked retrieval of documents from documents.json using index.json. Comparing to boolean_retrieval, now you can type free text queries:

PS C:\workspace\ranked_retrieval\search> python main.py --verbose --limit=3
> word processing
score=0.2829, document_id=3182
WORD PROCESSING' AN INTRODUCTION AND APPRAISAL.

IDENTIFIES AND DESCRIBES THE FACILITIES AVAILABLE FOR MECHANISING AND AUTOMATING LIBRARY AND INFORMATION SERVICES USING WORD PROCESSING SYSTEMS-TEXT STORAGE AND SEARCHING, ORDER PROCESSING AND BUDGET CONTROL, PREPARATION OF CATALOGUES, CIRCULATION CONTROL, PERIODICALS CONTROL AND CIRCULATION, INFORMATION STORAGE AND RETRIEVAL SYSTEM USE, AND POSSIBLE FUTURE USES.

--------------------------------------------------
score=0.1365, document_id=3444
COMPUTER-ASSISTED PRODUCTION OF BIBLIOGRAPHIC DATABASES IN HISTORY.

PAPER PRESENTED AT THE 2ND INTERNATIONAL CONFERENCE ON DATA BASES IN THE HUMANITIES AND SOCIAL SCIENCES, MADRID 16-19 JUNE 80. DESCRIBES THE USE OF A COMPUTERISED TEXT PROCESSING SYSTEM TO PRODUCE PRINTED AND ON-LINE VERSIONS OF THE BIBLIOGRAPHIC DATA BASES IN HISTORY-2HISTORICAL ABSTRACTS1 AND 2AMERICA' HISTORY AND LIFE1. ABSTRACTS OR CITATIONS OF BOOKS, ARTICLES AND DISSERTATIONS ARE EDITED, SUBJECT-CLASSIFIED, AND SUBJECT AND AUTHOR-INDEXED. THE AMERICAN BIBLIOGRAPHICAL CENTER'S PROFILE INDEX IS A COMPUTER-ASSISTED, ROTATED STRING INDEX, THE VOCABULARY IS A COMBINATION OF NATURAL LANGUAGE KEY WORDS AND PREFERRED SUBJECT HEADINGS SELECTED BY EDITOR-INDEXERS. THE PROCESSING SYSTEM COMPRISES KEYBOARDING, EXTRACTION OF INDEX TERMS, EXPANSION OF ABBREVIATED INDEX TERMS TO FULL TEXT, FILE CORRECTION, SORTING AND PHOTOCOMPOSITION OF PAGES. FILES OF BIBLIOGRAPHIC ENTRIES ARE SORTED FOR USE IN 1 OR MORE DATA BASES, AND ANNUAL INDEX FILES ARE MERGED TO PRODUCT CUMULATIVE INDEXES.

--------------------------------------------------
score=0.0855, document_id=2643
AUDIO-VISUAL MATERIAL IN LIBRARY COLLECTIONS.

IN THE LAST DECADE THE DEMANDS OF SOCIETY FOR INFORMATION HAVE UNDERGONE A GREAT CHANGE. LEARNING AND SELF-EDUCATION ARE PLAYING AN INCREASINGLY GREATER PART IN THE EDUCATION PROCESS AND AUDIO-VISUAL MATERIAL IN LIBRARIES HAVE AN IMPORTANT ROLE. DESCRIBES THE ORGANISATION AND PROCESSING OF AN AUDIO-VISUAL COLLECTION AND THE SERVICES WHICH CAN BE PROVIDED.

--------------------------------------------------

With argument --verbose you can see full text of documents. Argument --limit limits, how many results to show (default is 20)

Implemenation details

For stemming again SnowballStemmer is used. Weighting scheme for ranked retrieval is lnc.ltc:

Before asking any queries the system pre-calculates term-document weights, using formula 1 + log10(term_frequency) and normalizes it by document vector's length (for cosine similarity). It is done only for present term-document pairs from index.json, and results are stored in a HashMap (memory efficient, comparing to two-dimensional array) for fast future accesses. Also, inverse document frequency idf is computed for all terms. All pre-computations require linear time depending on the present term-document pairs from index.json.
When free text query is typed, the system computes term-query weights using formula (1 + log10(term_frequency_in_query)) * idf(term) and normalizes them too. It requires linear time depending on the query length.

To efficiently calculate document scores term-at-a-time approach is used for query terms:

for term, query_weight in term_query_weights.items():
    for document_id, document_weight in term_document_weights[term].items():
        from_document_id_to_score[document_id] += query_weight * document_weight

Time complexity will linearly depend on number of term-document pairs (from pre-computed HashMap) for query terms.

Documents, are sorted by their scores (O(N log N), where N is the number of documents, containing query terms) to show top relevant.

Evaluation

Test set: 100 randomly choosen documents from LISA document collection. index_builder/documents.json contains exactly these 100 documents.

Test query: information retrieval

Manually determined relevance for this query of all 100 documents:

document_id	relevant?	document_id	relevant?
2	No	27	No
158	No	214	No
293	No	398	Yes
426	No	440	No
473	No	703	No
811	No	951	No
953	No	1019	No
1020	No	1088	No
1089	No	1160	No
1238	No	1305	No
1319	No	1461	No
1489	No	1491	No
1625	No	1675	No
1709	No	1734	No
1789	Yes	1830	No
1854	No	1900	No
1912	No	2151	No
2257	No	2311	No
2359	No	2384	No
2458	No	2487	No
2493	No	2507	No
2525	No	2643	No
2668	No	2685	No
2789	Yes	2790	Yes
2795	No	2808	No
2843	No	2882	Yes
2883	No	2981	No
3066	No	3182	No
3187	No	3258	No
3388	Yes	3441	No
3444	No	3511	No
3573	No	3586	No
3748	No	3754	No
3814	No	3910	Yes
3955	No	4019	No
4034	No	4046	No
4117	No	4148	No
4231	No	4249	No
4369	No	4445	No
4534	No	4535	No
4554	No	4747	No
4767	No	4914	No
4918	No	4941	No
5023	No	5081	No
5300	No	5421	No
5456	No	5522	No
5578	No	5661	No
5711	No	5803	No
5840	No	5968	No
5971	No	5990	No

Let's test the system:

PS C:\workspace\ranked_retrieval\search> python main.py --limit 20
> information retrieval
score=0.3585, document_id=1789 <- relevant
score=0.3412, document_id=2790 <- relevant
score=0.3352, document_id=3388 <- relevant
score=0.3332, document_id=398  <- relevant
score=0.3196, document_id=2882 <- relevant
score=0.3021, document_id=2789 <- relevant
score=0.2221, document_id=3182
score=0.1807, document_id=3814
score=0.1798, document_id=5990
score=0.1661, document_id=3910 <- relevant
score=0.1558, document_id=4369
score=0.1517, document_id=5803
score=0.0874, document_id=3258
score=0.0847, document_id=1734
score=0.0831, document_id=3748
score=0.0797, document_id=4747
score=0.0783, document_id=2151
score=0.0773, document_id=2257
score=0.0746, document_id=4046
score=0.0728, document_id=3511

It has found all relevant documents.

Metrics on top 20

Confusion matrix:


7	0
13	80

Precision = 7 / 20 = 0.35
Recall = 7 / 7 = 1
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 0.519

Screenshots

To see working functionality, check out Screenshots directory

and-kul/ranked_retrieval