This is further improvement of my boolean_retrieval project.
Technical requirements still the same, you need:
- Python 3.6
- NLTK with NLTK Data
You can find installation details in boolean_retrieval project.
There are 2 subprojects in this repository, each responsible for different thing:
- index_builder
- search
You can run each through python main.py
and see nice help using python main.py -h
(there you can discover, for example, what arguments to use for changing default filenames)
This subproject creates an inverted index for documents from documents.json (default name, could be changed) and stores it in index.json (default name, could be changed).
The structure of documents.json must be the following:
[
{ "id": 57, "title": "...", "body": "..."},
{ "id": 368, "title": "...", "body": "..."},
]
index.json now has more information, comparing to boolean_retrieval, where we had only posting lists. Here it contains document_frequency for each term and term_frequency for term-document pairs.
The structure of index.json:
{
"museum": {
"document_frequency": 2,
"term_frequency": {
"158": 1,
"3573": 5
}
},
"cooper": {
"document_frequency": 5,
"term_frequency": {
"158": 2,
"3754": 7,
"4148": 1,
"5522": 2,
"5578": 1
}
},
}
For stemming SnowballStemmer from nltk is used.
This simple search engine performs ranked retrieval of documents from documents.json using index.json. Comparing to boolean_retrieval, now you can type free text queries:
PS C:\workspace\ranked_retrieval\search> python main.py --verbose --limit=3
> word processing
score=0.2829, document_id=3182
WORD PROCESSING' AN INTRODUCTION AND APPRAISAL.
IDENTIFIES AND DESCRIBES THE FACILITIES AVAILABLE FOR MECHANISING AND AUTOMATING LIBRARY AND INFORMATION SERVICES USING WORD PROCESSING SYSTEMS-TEXT STORAGE AND SEARCHING, ORDER PROCESSING AND BUDGET CONTROL, PREPARATION OF CATALOGUES, CIRCULATION CONTROL, PERIODICALS CONTROL AND CIRCULATION, INFORMATION STORAGE AND RETRIEVAL SYSTEM USE, AND POSSIBLE FUTURE USES.
--------------------------------------------------
score=0.1365, document_id=3444
COMPUTER-ASSISTED PRODUCTION OF BIBLIOGRAPHIC DATABASES IN HISTORY.
PAPER PRESENTED AT THE 2ND INTERNATIONAL CONFERENCE ON DATA BASES IN THE HUMANITIES AND SOCIAL SCIENCES, MADRID 16-19 JUNE 80. DESCRIBES THE USE OF A COMPUTERISED TEXT PROCESSING SYSTEM TO PRODUCE PRINTED AND ON-LINE VERSIONS OF THE BIBLIOGRAPHIC DATA BASES IN HISTORY-2HISTORICAL ABSTRACTS1 AND 2AMERICA' HISTORY AND LIFE1. ABSTRACTS OR CITATIONS OF BOOKS, ARTICLES AND DISSERTATIONS ARE EDITED, SUBJECT-CLASSIFIED, AND SUBJECT AND AUTHOR-INDEXED. THE AMERICAN BIBLIOGRAPHICAL CENTER'S PROFILE INDEX IS A COMPUTER-ASSISTED, ROTATED STRING INDEX, THE VOCABULARY IS A COMBINATION OF NATURAL LANGUAGE KEY WORDS AND PREFERRED SUBJECT HEADINGS SELECTED BY EDITOR-INDEXERS. THE PROCESSING SYSTEM COMPRISES KEYBOARDING, EXTRACTION OF INDEX TERMS, EXPANSION OF ABBREVIATED INDEX TERMS TO FULL TEXT, FILE CORRECTION, SORTING AND PHOTOCOMPOSITION OF PAGES. FILES OF BIBLIOGRAPHIC ENTRIES ARE SORTED FOR USE IN 1 OR MORE DATA BASES, AND ANNUAL INDEX FILES ARE MERGED TO PRODUCT CUMULATIVE INDEXES.
--------------------------------------------------
score=0.0855, document_id=2643
AUDIO-VISUAL MATERIAL IN LIBRARY COLLECTIONS.
IN THE LAST DECADE THE DEMANDS OF SOCIETY FOR INFORMATION HAVE UNDERGONE A GREAT CHANGE. LEARNING AND SELF-EDUCATION ARE PLAYING AN INCREASINGLY GREATER PART IN THE EDUCATION PROCESS AND AUDIO-VISUAL MATERIAL IN LIBRARIES HAVE AN IMPORTANT ROLE. DESCRIBES THE ORGANISATION AND PROCESSING OF AN AUDIO-VISUAL COLLECTION AND THE SERVICES WHICH CAN BE PROVIDED.
--------------------------------------------------
With argument --verbose
you can see full text of documents.
Argument --limit
limits, how many results to show (default is 20)
For stemming again SnowballStemmer is used. Weighting scheme for ranked retrieval is lnc.ltc:
-
Before asking any queries the system pre-calculates term-document weights, using formula
1 + log10(term_frequency)
and normalizes it by document vector's length (for cosine similarity). It is done only for present term-document pairs from index.json, and results are stored in a HashMap (memory efficient, comparing to two-dimensional array) for fast future accesses. Also, inverse document frequencyidf
is computed for all terms. All pre-computations require linear time depending on the present term-document pairs from index.json. -
When free text query is typed, the system computes term-query weights using formula
(1 + log10(term_frequency_in_query)) * idf(term)
and normalizes them too. It requires linear time depending on the query length. -
To efficiently calculate document scores term-at-a-time approach is used for query terms:
for term, query_weight in term_query_weights.items(): for document_id, document_weight in term_document_weights[term].items(): from_document_id_to_score[document_id] += query_weight * document_weight
Time complexity will linearly depend on number of term-document pairs (from pre-computed HashMap) for query terms.
-
Documents, are sorted by their scores (O(N log N), where N is the number of documents, containing query terms) to show top relevant.
Test set: 100 randomly choosen documents from LISA document collection. index_builder/documents.json contains exactly these 100 documents.
Test query: information retrieval
Manually determined relevance for this query of all 100 documents:
document_id | relevant? | document_id | relevant? |
---|---|---|---|
2 | No | 27 | No |
158 | No | 214 | No |
293 | No | 398 | Yes |
426 | No | 440 | No |
473 | No | 703 | No |
811 | No | 951 | No |
953 | No | 1019 | No |
1020 | No | 1088 | No |
1089 | No | 1160 | No |
1238 | No | 1305 | No |
1319 | No | 1461 | No |
1489 | No | 1491 | No |
1625 | No | 1675 | No |
1709 | No | 1734 | No |
1789 | Yes | 1830 | No |
1854 | No | 1900 | No |
1912 | No | 2151 | No |
2257 | No | 2311 | No |
2359 | No | 2384 | No |
2458 | No | 2487 | No |
2493 | No | 2507 | No |
2525 | No | 2643 | No |
2668 | No | 2685 | No |
2789 | Yes | 2790 | Yes |
2795 | No | 2808 | No |
2843 | No | 2882 | Yes |
2883 | No | 2981 | No |
3066 | No | 3182 | No |
3187 | No | 3258 | No |
3388 | Yes | 3441 | No |
3444 | No | 3511 | No |
3573 | No | 3586 | No |
3748 | No | 3754 | No |
3814 | No | 3910 | Yes |
3955 | No | 4019 | No |
4034 | No | 4046 | No |
4117 | No | 4148 | No |
4231 | No | 4249 | No |
4369 | No | 4445 | No |
4534 | No | 4535 | No |
4554 | No | 4747 | No |
4767 | No | 4914 | No |
4918 | No | 4941 | No |
5023 | No | 5081 | No |
5300 | No | 5421 | No |
5456 | No | 5522 | No |
5578 | No | 5661 | No |
5711 | No | 5803 | No |
5840 | No | 5968 | No |
5971 | No | 5990 | No |
Let's test the system:
PS C:\workspace\ranked_retrieval\search> python main.py --limit 20
> information retrieval
score=0.3585, document_id=1789 <- relevant
score=0.3412, document_id=2790 <- relevant
score=0.3352, document_id=3388 <- relevant
score=0.3332, document_id=398 <- relevant
score=0.3196, document_id=2882 <- relevant
score=0.3021, document_id=2789 <- relevant
score=0.2221, document_id=3182
score=0.1807, document_id=3814
score=0.1798, document_id=5990
score=0.1661, document_id=3910 <- relevant
score=0.1558, document_id=4369
score=0.1517, document_id=5803
score=0.0874, document_id=3258
score=0.0847, document_id=1734
score=0.0831, document_id=3748
score=0.0797, document_id=4747
score=0.0783, document_id=2151
score=0.0773, document_id=2257
score=0.0746, document_id=4046
score=0.0728, document_id=3511
It has found all relevant documents.
Confusion matrix:
7 | 0 |
13 | 80 |
- Precision = 7 / 20 = 0.35
- Recall = 7 / 7 = 1
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 0.519
To see working functionality, check out Screenshots directory