- [Overview](#overview)
- [Module Description](#module-description)
- [Usage](#usage)
- [Development](#development)
- [Change Log](#change-log)
## Overview

The purpose of this module is to preprocess a set of SGML documents representing a Reuters article database into datasets of feature vectors and class labels. These datasets will be employed in future assignments for automated categorization, similarity search, and building document graphs.
## Module Description

This Python module contains the following files and directories:

- preprocess.py - main module for preprocessing the Reuters article database
- tfidf.py - sub-module implementing tf-idf scoring for term selection
- feature1.py, feature2.py, feature3.py - sub-modules that generate the feature vector datasets
- data/
  - reut2-xxx.sgm - formatted articles, where xxx ranges from 000 to 021
Running preprocess.py will generate the following files:

- dataset1.csv
- dataset2.csv
- dataset3.csv
The feature vectors in the datasets were generated using the following methodologies:
- TF-IDF of title & body words to select the top 1000 words as features
- Filtering nouns & verbs from the term lists, and repeating the previous process
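The exact scoring lives in tfidf.py; the general idea of selecting the top-k terms by TF-IDF can be sketched as follows. The function name and the max-over-documents aggregation here are illustrative assumptions, not the module's exact method:

```python
import math
from collections import Counter

def top_k_tfidf_features(docs, k):
    """Score every term by its best TF-IDF over all docs; keep the top k.

    docs is a list of token lists (e.g. stemmed title & body words).
    """
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    best = {}
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # normalized term frequency times inverse document frequency
            score = (count / len(doc)) * math.log(n / df[term])
            best[term] = max(best.get(term, 0.0), score)
    # highest score first; ties broken alphabetically
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With k set to 1000, this mirrors the top-1000-words selection described above; terms appearing in every document score zero and naturally drop out.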
For a more detailed report of the methodology used to sanitize the data and construct these refined datasets and feature vectors, read the file in this project titled Report1.md:
> less Report1.md
Potential additions to future iterations of feature vector generation:
- different normalization
- bigram/trigram/n-gram aggregation
- stratified sampling: starting letter, stem, etc.
- binning: equal-width & equal-depth (grouping by topics/places, part-of-speech, etc)
- entropy-based discretization (partitioning based on entropy calculations)
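As one illustration of the binning idea above, equal-width binning partitions a numeric range into k intervals of equal size. This is a minimal sketch, not part of the module; the function name is hypothetical:

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a zero-width range
    # clamp the top edge so max(values) falls into bin k - 1
    return [min(int((v - lo) / width), k - 1) for v in values]
```

Equal-depth binning would instead sort the values and cut them into k groups of (roughly) equal size, trading uniform interval widths for uniform bin populations.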
## Usage

This module relies on several libraries to perform preprocessing. Before anything else, ensure NLTK is installed:
> pip install nltk
Next, enter a Python shell and download the necessary NLTK data:
> python
>>> import nltk
>>> nltk.download()
From the download window, ensure punkt, wordnet, and stopwords are downloaded onto your machine:
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Download which package (l=list; x=cancel)?
Identifier> punkt
Downloading package punkt to /home/3/loua/nltk_data...
Unzipping tokenizers/punkt.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> stopwords
Downloading package stopwords to /home/3/loua/nltk_data...
Unzipping corpora/stopwords.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> wordnet
Downloading package wordnet to /home/3/loua/nltk_data...
Unzipping corpora/wordnet.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> q
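Alternatively, the same three packages can be fetched without the interactive menu, using NLTK's command-line downloader:

```shell
> python -m nltk.downloader punkt wordnet stopwords
```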
Next, ensure BeautifulSoup4 is installed:
> pip install beautifulsoup4
To run the code, first ensure the preprocess.py file has execute privileges:
> chmod +x preprocess.py
Next, ensure the tfidf.py, feature1.py, feature2.py, and feature3.py files are in the same directory as preprocess.py. Also ensure the data/ directory containing the reut2-xxx.sgm files is present in that same folder. To begin preprocessing the data, run:
> python preprocess.py
or
> ./preprocess.py
The preprocessing might take some time to complete.
Once preprocess.py finishes execution, the code generates three datasets, labeled dataset1.csv, dataset2.csv, and dataset3.csv, in the project directory (the same folder as preprocess.py). To view these datasets, run:
> less datasetX.csv
where X is replaced with 1, 2, or 3 depending on the dataset.
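The generated datasets can also be inspected programmatically with the standard csv module. A minimal sketch; the column names in the usage note below are illustrative, since the exact schema is described in Report1.md:

```python
import csv

def load_dataset(path):
    """Read a dataset CSV into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

For example, `load_dataset("dataset1.csv")` returns one dict per document, mapping each header field (feature words and class labels) to its string value for that row.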
## Development

- This module was developed using Python 2.7.10 with the NLTK and BeautifulSoup4 libraries.
- Ankai Lou (lou.56@osu.edu)
- Daniel Jaung (jaung.2@osu.edu)
## Change Log

2015-09-11 - Version 1.0.3

- Finalized the construction and output of dataset3.csv
- Updated Report1.md to reflect the approach/rationale of dataset3.csv
- Finalized documentation
- Included usage of scikit-learn
2015-09-11 - Version 1.0.2
- Updated the tf-idf module to use log normalization & probabilistic inverse frequency
- Finalized the construction and output of dataset2.csv
- Updated Report1.md to reflect the approach/rationale of dataset2.csv
- Began construction of dataset3.csv
- TODO: finish Report1.md and dataset3.csv
2015-09-11 - Version 1.0.1
- Fixed the tf-idf module to provide normalized scores in the range [0,1]
- Updated tokenization in preprocess.py to filter non-English words and shorter stems
- Updated the feature selection process for feature vector 1 to run in minimal time
- Finalized the construction and output of dataset1.csv
- Began construction of dataset2.csv
- TODO: finish Report1.md and dataset2.csv; start dataset3.csv
2015-09-10 - Version 1.0.0
- Initial code import
- Added functionality to generate parse tree
- Added functionality to generate document objects
- Added functionality to tokenize, stem, and filter words
- Added functionality to generate lexicons for title & body words
- Prepared documents for feature selection & dataset generation