A fast, multithreaded indexer and search engine for Wikipedia.
To index a Wikidump, place it in the indexer/ folder and, from that folder, run the command:
bash indexing.sh
To search an indexed Wikidump database, write the search words in searcher/searchtext.txt, one word per line, and, from the searcher/ folder, run the command:
bash searching.sh
Used to scrape pages from random categories to get a smaller subset of Wikipedia for testing.
Indexes Wikidump files using an XML SAX handler. Text is preprocessed with techniques such as stemming and stop-word removal, then stored as an inverted index in a SQLite database.
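
The indexing steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table names (docs, postings), the tiny stop-word list, and the crude_stem helper are all assumptions made for the example, and crude_stem is only a stand-in for a real stemmer such as Porter's.

```python
import re
import sqlite3
import xml.sax
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, split on non-alphanumerics, drop stop words, then stem.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


class PageHandler(xml.sax.ContentHandler):
    """Collects <title> and <text> of each <page> and writes postings to SQLite."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.tag = None
        self.title, self.text = [], []
        self.doc_id = 0

    def startElement(self, name, attrs):
        self.tag = name
        if name == "page":
            self.title, self.text = [], []

    def characters(self, content):
        # SAX may deliver one element's text in several chunks, so accumulate.
        if self.tag == "title":
            self.title.append(content)
        elif self.tag == "text":
            self.text.append(content)

    def endElement(self, name):
        if name == "page":
            self.doc_id += 1
            title = "".join(self.title).strip()
            counts = Counter(tokenize(title + " " + "".join(self.text)))
            self.conn.execute(
                "INSERT INTO docs VALUES (?, ?, ?)",
                (self.doc_id, title, sum(counts.values())),
            )
            self.conn.executemany(
                "INSERT INTO postings VALUES (?, ?, ?)",
                [(term, self.doc_id, tf) for term, tf in counts.items()],
            )
        self.tag = None


def build_index(xml_bytes, conn):
    # Inverted index: one row per (term, document) pair with its term frequency.
    conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, title TEXT, length INTEGER)")
    conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER)")
    conn.execute("CREATE INDEX idx_term ON postings (term)")
    xml.sax.parseString(xml_bytes, PageHandler(conn))
    conn.commit()
```

A streaming SAX parser fits this job because a full Wikipedia dump is far too large to load as a single in-memory tree; the handler only ever holds one page at a time.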
Searches the SQLite database and ranks pages using the Okapi BM25 method, returning the top results.
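
As a rough illustration of BM25 ranking over an inverted index in SQLite, the sketch below scores documents for a list of query terms. The schema (docs and postings tables), the make_demo_db helper, the parameter defaults k1=1.5 and b=0.75, and the non-negative IDF variant are assumptions chosen for the example; the project's own schema and constants may differ.

```python
import math
import sqlite3
from collections import defaultdict


def make_demo_db():
    # Hypothetical schema: docs(doc_id, title, length) and postings(term, doc_id, tf).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, title TEXT, length INTEGER)")
    conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?, ?)",
        [(1, "Search engine", 4), (2, "Cooking", 4), (3, "Index", 2)],
    )
    conn.executemany(
        "INSERT INTO postings VALUES (?, ?, ?)",
        [("search", 1, 2), ("engine", 1, 1), ("index", 1, 1),
         ("cook", 2, 3), ("index", 2, 1), ("index", 3, 2)],
    )
    return conn


def bm25_search(conn, terms, k1=1.5, b=0.75, top_k=10):
    """Return up to top_k (doc_id, score) pairs, best first."""
    n_docs, avg_len = conn.execute("SELECT COUNT(*), AVG(length) FROM docs").fetchone()
    if not n_docs:
        return []
    scores = defaultdict(float)
    for term in set(terms):
        rows = conn.execute(
            "SELECT p.doc_id, p.tf, d.length FROM postings p "
            "JOIN docs d ON d.doc_id = p.doc_id WHERE p.term = ?",
            (term,),
        ).fetchall()
        if not rows:
            continue
        df = len(rows)  # number of documents containing the term
        # IDF with +1 inside the log so it never goes negative (a common variant).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for doc_id, tf, length in rows:
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * length / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Calling bm25_search(make_demo_db(), ["index"]) ranks the short document that mentions the term twice above the longer ones that mention it once, which is the behavior the term-frequency saturation and length normalization are meant to produce.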
- Add support for searching within specific fields (category, title, etc.)
- Add a front end