- download the wikipedia xml dump, e.g.: https://dumps.wikimedia.org/dewiki/20190501/dewiki-20190501-pages-articles-multistream.xml.bz2
- decompress the .bz2 file to get the plain XML
- create a folder named index in the same directory
- fill out all config files in the .config folder
- run indexer.py to build the index and set everything up (takes several hours)
- run api.py to start a development web server that serves the REST API endpoint
- run frontend.py as a client to the API server to try out the application
The config files (in the .config folder) contain configurable values for each script.
The indexer must be run before the searcher script can be used. ! Attention ! This takes a couple of hours and needs more than 10 GB of disk space.
What does it do:
- The indexer takes the Wikipedia dump as XML and slices it into smaller files of 100 Wikipedia articles each.
- It also creates an index that stores every article title together with the filename of the slice containing the article text and the article id.
- The index is then sorted so it can be searched efficiently.
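The slicing and index-building steps above can be sketched as follows. This is a minimal illustration, not the real indexer: the function names, the `chunk_*.txt`/`index.txt` filenames, and the tab-separated index format are assumptions, and XML parsing of the dump is left out (the sketch takes already-extracted `(title, article_id, text)` records).

```python
import os

CHUNK_SIZE = 100  # articles per slice, as described above


def _flush(chunk, chunk_no, out_dir):
    # write one slice of article texts to its own file (filename scheme is assumed)
    path = os.path.join(out_dir, f"chunk_{chunk_no:05d}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(chunk))


def build_index(articles, out_dir):
    """Slice (title, article_id, text) records into files of CHUNK_SIZE
    articles and write a sorted title index. Hypothetical sketch."""
    index = []
    chunk, chunk_no = [], 0
    for title, article_id, text in articles:
        chunk.append(text)
        # remember which slice holds this article
        index.append((title, f"chunk_{chunk_no:05d}.txt", article_id))
        if len(chunk) == CHUNK_SIZE:
            _flush(chunk, chunk_no, out_dir)
            chunk, chunk_no = [], chunk_no + 1
    if chunk:
        _flush(chunk, chunk_no, out_dir)
    # sort by title so the searcher can do fast lookups
    index.sort(key=lambda entry: entry[0].lower())
    with open(os.path.join(out_dir, "index.txt"), "w", encoding="utf-8") as f:
        for title, filename, article_id in index:
            f.write(f"{title}\t{filename}\t{article_id}\n")
```

The real indexer streams the multi-gigabyte XML instead of holding articles in memory; the sketch only shows the chunking and sorting logic.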
The searcher looks up a given search word in the index and returns the matching article. If the search word itself is not in the index but at least one article title starts with it, a list of those candidate titles is returned instead.
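Because the index is sorted, the exact-match and prefix-fallback lookup described above can be done with a binary search. A minimal sketch (the `search` function and the tuple layout are assumptions, not the real searcher's API):

```python
import bisect


def search(index, word):
    """index: list of (title, filename, article_id) tuples, sorted by title.
    Returns the exact entry, a list of titles starting with `word`, or None."""
    keys = [title.casefold() for title, _, _ in index]
    needle = word.casefold()
    pos = bisect.bisect_left(keys, needle)  # binary search in the sorted titles
    if pos < len(keys) and keys[pos] == needle:
        return index[pos]  # exact hit: the article file can now be loaded
    # no exact hit: collect all titles that start with the search word
    suggestions = []
    while pos < len(keys) and keys[pos].startswith(needle):
        suggestions.append(index[pos][0])
        pos += 1
    return suggestions or None
```

For example, searching "Apf" in an index containing "Apfel" and "Apfelbaum" returns both titles as suggestions.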
The api script provides the REST API through a development web server running on localhost, port 5000. The API has one endpoint ("/definition/") and accepts POST requests with an option to receive either a long or a short answer. Because the index file is loaded into memory, about 2 GiB of free memory must be available on the host system!
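The endpoint can be sketched with Flask (the development server the TODO list refers to). The JSON field names `word` and `short` and the in-memory `DEFINITIONS` dict are illustrative assumptions; only the "/definition/" route, the POST method, and the short/long option come from the description above.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# stand-in for the real index lookup (assumption for illustration)
DEFINITIONS = {
    "python": "Python ist eine Programmiersprache. Sie ist weit verbreitet."
}


@app.route("/definition/", methods=["POST"])
def definition():
    data = request.get_json(force=True)
    word = data.get("word", "").casefold()
    text = DEFINITIONS.get(word)
    if text is None:
        return jsonify({"error": "not found"}), 404
    if data.get("short"):
        # crude "short answer": first sentence only
        text = text.split(". ")[0]
    return jsonify({"definition": text})


if __name__ == "__main__":
    app.run(port=5000)  # development server only, see TODO list
```
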
The frontend script is a graphical client to the REST API server.
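Independently of the GUI, the request a client sends to the API looks roughly like this (stdlib only; the field names `word` and `short` are assumptions, not taken from the real frontend):

```python
import json
import urllib.request

API_URL = "http://localhost:5000/definition/"  # dev server started by api.py


def build_request(word, short=True):
    """Build the POST request a client would send to /definition/."""
    payload = json.dumps({"word": word, "short": short}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_definition(word, short=True):
    # requires the API server to be running
    with urllib.request.urlopen(build_request(word, short)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```
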
Some tests are implemented.
- Complete the Docstrings for all scripts, classes and functions
- Replace the Flask development server with a production WSGI server
- Implement more tests
- Optimize the parsing of the articles to get better (short) results