irext

Evaluation of Categorical Labelling of Wikipedia Pages

There is one package inside of classify which contains the main.py script which runs the program.

It requires one argument, the textfile, and has an optional argument for how much data you want to sample from the unprocessed Wikipedia pages.

We use python3 in our project, and assume you have installed the trec_car tools from:

https://github.com/TREMA-UNH/trec-car-tools

To install the required dependencies use the shell script provided or the setup.py script, make sure you have pip3 installed:

sudo sh install.sh

You can run the project with a datafile (unprocessed Wikipedia CBOR file):

python3 classify/main.py [datafile]

usage: main.py [-h] [--samples SAMPLES] datafile [datafile ...]

Evaluating Category Classification in Information Retrieval

positional arguments:
  datafile           the unprocessed cbor data file for Wikipedia pages

optional arguments:
  -h, --help         show this help message and exit
  --samples SAMPLES  number of samples to pull from the dataset

claire-voyant/irext

irext