This is our project for the course Information Retrieval and Data Mining given at the TU/e. It is a search engine written in Python that lets you index and search a collection of NIPS papers. It is based on the Python package Whoosh, which is an indexing and searching library implemented in pure Python.
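To give an idea of what Whoosh provides, here is a minimal sketch; the schema and field names below are illustrative only and not the project's actual schema:

```python
# Minimal Whoosh sketch (illustrative schema, not the project's actual one):
# build a small index on disk and run a query against it.
import os

from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(paper_id=ID(stored=True), title=TEXT(stored=True), abstract=TEXT)

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(paper_id="1", title="Example NIPS paper",
                    abstract="A toy abstract about neural information processing systems.")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("abstract", ix.schema).parse("neural")
    for hit in searcher.search(query):
        print(hit["paper_id"], hit["title"])
```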
- Java Development Kit (JDK) with the `JAVA_HOME` environment variable set to the installation directory (for example `C:\Program Files\Java\jdk1.8.0_101`).
- Anaconda, which is a package manager for Python. As we will use PyPI later on, add the `Scripts` directory, which can be found in the Anaconda root installation directory (e.g. `C:\ProgramData\Anaconda3\Scripts`), to your PATH system variable. Make sure to reopen any terminal in order to have the `pip` command available.
command available. - The Stanford CoreNLP: download and extract the folder to a desired location.
- The SQLite database of the collection of NIPS papers. Make sure to put this file (which is named `database.sqlite` by default) in the directory called `data` in the root of this project.
- The wordnet and stopwords resources, which can be downloaded using the NLTK Downloader. In PyCharm, open up a Python console (Tools -> Python Console) and enter the following commands:
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('stopwords')
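To sanity-check these prerequisites, you can run a rough sketch like the following from the same Python console, assuming the working directory is the project root (the paths are the defaults mentioned above and may differ on your machine):

```python
# Rough sanity check of the prerequisites described above. The paths are
# the defaults from this README and may differ on your machine.
import os
import nltk

print("JAVA_HOME:", os.environ.get("JAVA_HOME", "NOT SET"))
print("database.sqlite present:", os.path.isfile(os.path.join("data", "database.sqlite")))

for resource in ("corpora/wordnet", "corpora/stopwords"):
    try:
        nltk.data.find(resource)
        print(resource, "OK")
    except LookupError:
        print(resource, "missing")
```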
- In version 3.8.0 of CoreNLP (which is currently the latest stable version), there is a problem with removing certain control characters from the text that will be tokenized. For some control characters, this resulted in a `JSONDecodeError`. In this commit, the problem was solved. However, there has not yet been a new release of CoreNLP that contains this fix. Therefore, you should manually build the project, which can be found here. Building CoreNLP is very simple, as it only requires you to have Ant and the JDK installed:
  - After following the steps listed on the repository, you will end up with a jar called `stanford-corenlp.jar`.
  - Assuming you have downloaded and extracted the Stanford CoreNLP, go to the CoreNLP folder and remove the `stanford-corenlp-3.8.0.jar` that came with the Stanford CoreNLP.
  - Rename the previously built `stanford-corenlp.jar` to `stanford-corenlp-3.8.0.jar` and move it to the location where you removed the jar that originally came with the package. That's it!
- To install the required Python packages, run the following command:
pip install -r requirements.txt
- NLTK version 3.2.5 is required to run the application. Currently, conda installs NLTK 3.2.4 and ignores the correctly installed version from PyPI. To solve this:
  - Remove NLTK both from Anaconda (`conda remove nltk`) and PyPI (`pip uninstall nltk`).
  - Let (in this case) PyCharm reindex the installed packages (which happens automatically if your "focus" is on the application).
  - Install NLTK using PyPI (`pip install nltk`).
  - Ignore any messages from PyCharm stating that the requirement (nltk >= 3.2.5) is not fulfilled.
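After these steps, you can verify from a Python console that the correct version is picked up (the version installed from PyPI may be newer, but should be at least 3.2.5):
>>> import nltk
>>> nltk.__version__
'3.2.5'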
- December 6th: Download the latest sqlite database from here, with a new version of the paper-to-author suggestions.
- December 3rd: Download the latest sqlite database from here, with fixed references and added suggested papers.
- November 26th: Download the latest sqlite database from here, with more abstracts than ever and some manually added authors for papers.
- November 24th: Download the latest sqlite database from here, including topics and new paper suggestions.
- November 16th: Download the latest sqlite database from here, with updated author suggestions from Bart and paper-to-author suggestions by Matthias.
- November 13th: Download the latest sqlite database from here. This version contains the references, authors, and the author graph. No topics yet, and some abstracts may be missing.
- If you encounter a module-not-found exception for a local module and you are using PyCharm, mark the corresponding directory as a source root. For example, if PyCharm mentions that the `tokenizers.stanford` module is not found, mark the `indexer` folder (where `stanford` is a child of `tokenizers` inside `indexer`) as a source. This can be done by going to File > Settings > Project Structure > indexer > Mark as source at the top of the pop-up window. A `sys.path`-based alternative is sketched after this list.
- If your Windows PowerShell / Git Bash doesn't respond, right-click its title bar and hit Properties. In the Edit Options, disable Quick Edit mode and Insert mode.
- "NLTK was unable to find the `JAVA_HOME` environment variable": make sure you have set the environment variable. If you have done this, restart PyCharm and try again. This should work.
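For the module-not-found issue above, a workaround with the same effect when running scripts outside PyCharm is to put the `indexer` directory on `sys.path` yourself. This is only a sketch; it assumes the `indexer` folder sits directly in the project root, next to the script doing the import:

```python
# Workaround for the module-not-found issue when PyCharm's "Mark as source"
# is not available: put the indexer directory on sys.path before importing
# local modules such as tokenizers.stanford. Assumes indexer/ is in the
# project root, next to this script.
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "indexer"))

import tokenizers.stanford  # should now resolve without the PyCharm setting
```

For the `JAVA_HOME` issue, you can check from a Python console whether the variable is actually visible to the process (the path shown is just the example installation directory from above):
>>> import os
>>> os.environ.get('JAVA_HOME')
'C:\\Program Files\\Java\\jdk1.8.0_101'
If this prints `None`, the process was started before the variable was set; set it and reopen the terminal or restart PyCharm.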
Our project makes use of the Stanford CoreNLP, which provides a set of human language technology tools. The Stanford CoreNLP ships with a built-in server, which requires only the CoreNLP dependencies.
To run the Stanford CoreNLP as a server, do the following:
- Go to the location to which you extracted the Stanford CoreNLP.
- Open up a command line, go into the CoreNLP directory, and enter the following command:
# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -encoding utf8 -lowerCase -port 9000 -timeout 800000
- In the command line, the final message should be:
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
- If no port is specified, the default port will be 9000.
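Once you see that message, a quick smoke test from a Python console is a sketch like the following, using NLTK's CoreNLP client and assuming the default host and port from the command above:

```python
# Smoke test against the running CoreNLP server; assumes it is listening
# on localhost:9000 as started with the command above.
from nltk.parse.corenlp import CoreNLPParser

tokenizer = CoreNLPParser(url='http://localhost:9000')
print(list(tokenizer.tokenize('The quick brown fox jumps over the lazy dog.')))
```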
To run the project with the web interface, you can use the following command:
python.exe manage.py runserver
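Assuming the web interface is a standard Django project (which `manage.py runserver` suggests), the development server listens on `http://127.0.0.1:8000/` by default; a different port can be passed as an extra argument, for example:
python.exe manage.py runserver 8080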
To debug using the web interface, you can do the following in PyCharm:
- Create a new Python Run/Debug configuration
- In the Configuration tab, enter the following values in the corresponding fields:
  - Script: `path\to\manage.py`
  - Script parameters: `runserver`
  - Working directory: `path\to\projectroot`
- Run the project using this configuration
It can happen that the terminal is not showing the most recent output. Pressing a key while the terminal has focus causes it to show any unprocessed output. Make sure to try this when you have the feeling that indexing/processing is taking forever.