university-books-concordance

A university project in which we had to take plain-text (txt) book files from Project Gutenberg, then parse and index the words in each book. I used CoreNLP (through a .NET wrapper) to help parse the documents.
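To give a rough idea of what "index the words in each book" means, here is a minimal sketch of a concordance: a map from each word to the line numbers it appears on. It is an illustration only and not the project's code. It tokenizes with a simple regular expression instead of CoreNLP, and the names ConcordanceSketch and Build are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration, not part of the project.
public static class ConcordanceSketch
{
    // Maps each lower-cased word to the 1-based line numbers it occurs on.
    public static SortedDictionary<string, List<int>> Build(string text)
    {
        var index = new SortedDictionary<string, List<int>>(StringComparer.Ordinal);
        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            // Assumption: a word is a run of ASCII letters, optionally with inner apostrophes.
            foreach (Match m in Regex.Matches(lines[i], @"[A-Za-z]+(?:'[A-Za-z]+)*"))
            {
                string word = m.Value.ToLowerInvariant();
                List<int> hits;
                if (!index.TryGetValue(word, out hits))
                {
                    hits = new List<int>();
                    index[word] = hits;
                }
                hits.Add(i + 1);
            }
        }
        return index;
    }

    public static void Main()
    {
        var index = Build("The cat sat.\nThe dog sat too.");
        foreach (var entry in index)
        {
            Console.WriteLine(entry.Key + ": " + string.Join(", ", entry.Value));
        }
    }
}
```

The real application stores its index in MySQL rather than in memory, so treat this only as a picture of the data being collected.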

Disclaimers:

Use it at your own risk - I am not liable for anything
If you do use it for your own university project, please don't just rename variables and submit it, because you will be caught. Use it only to help you understand how one might approach such a task.

Screenshots

Open Issues

Books with a . (period) in their names will not be loaded (fix by preprocessing the documents - see |high_ascii_normalization.cs| for an example)
Some words are not parsed as expected:
- ain't is parsed as ai (fix by updating the code in |qualified_words.cs|)
- some words come out starting with ’ (for example ’s) (fix by preprocessing the document and replacing or removing these ’ characters)
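The apostrophe preprocessing mentioned above could look roughly like the sketch below. It is not taken from |high_ascii_normalization.cs|; the class and method names are hypothetical, and it assumes the troublesome character is the typographic apostrophe U+2019 (plus its opening counterpart U+2018).

```csharp
using System;

// Hypothetical helper, not part of the project.
public static class ApostrophePreprocessor
{
    // Replaces typographic apostrophes with the ASCII one,
    // or removes them entirely when remove is true.
    public static string Normalize(string text, bool remove)
    {
        string replacement = remove ? "" : "'";
        return text
            .Replace("\u2018", replacement)  // left single quotation mark
            .Replace("\u2019", replacement); // right single quotation mark / apostrophe
    }

    public static void Main()
    {
        string sample = "the dog\u2019s bone";
        Console.WriteLine(Normalize(sample, false)); // the dog's bone
        Console.WriteLine(Normalize(sample, true));  // the dogs bone
    }
}
```

Running something like this over each book before loading it is one way to apply the fix. Whether replacing or removing gives better results depends on how the parser treats the ASCII apostrophe.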

Getting Started

You need Visual Studio 2015 (Express is good enough); the project uses NuGet for its dependencies
You also need Java JRE 8 (32-bit) - otherwise you might get an error such as "failed to initialize CoreNLP"
Install MySQL Server (Community Edition is good enough)
Create a new (empty) schema, for example: books (any name works)
Compile the project
Edit application.exe.config and set 'storage_folder' and 'connection_string' to valid values
Run application.exe (/output/release/application/application.exe)
Press the ResetDB button and you can start
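As a rough guide for the config step above, the two settings might look like the fragment below. This is an assumption, not a copy of the real file: it guesses that both keys live under appSettings and that the connection string uses the usual MySQL Connector/NET form. The folder path, user and password are placeholders, so keep whatever layout your generated application.exe.config already has and change only the values.

```xml
<!-- Hypothetical layout: check the actual application.exe.config for where these keys live -->
<configuration>
  <appSettings>
    <!-- Placeholder folder where the application keeps its book files -->
    <add key="storage_folder" value="C:\books\storage" />
    <!-- Placeholder credentials; Database must match the schema you created -->
    <add key="connection_string" value="Server=localhost;Database=books;Uid=your_user;Pwd=your_password;" />
  </appSettings>
</configuration>
```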

You can download txt book files from https://www.gutenberg.org/. Sample documents and the full DB schema are also included in the project (under /database).

Features (assignment tasks)