- [Overview](#overview)
- [Module Description](#module-description)
- [Usage](#usage)
- [Development](#development)
- [Change Log](#change-log)
## Overview

The purpose of this module is to preprocess a set of SGML documents representing a Reuters article database into datasets of feature vectors and class labels. These datasets will be employed in future assignments for automated categorization, similarity search, and building document graphs.
## Module Description

This Python module contains the following files and directories:

- preprocess.py - main module for preprocessing the Reuters article database
- tfidf.py - sub-module implementing tf-idf scoring for term selection
- feature1.py, feature2.py, feature3.py - sub-modules that generate the feature vector datasets
- data/
  - reut2-xxx.sgm - formatted articles, where xxx ranges from 000 to 021
Running preprocess.py will generate the following files:

- dataset1.csv
- dataset2.csv
- dataset3.csv
The feature vectors in the datasets were generated using the following methodologies:
- TF-IDF of title & body words to select the top 1000 words as features
- Filtering nouns & verbs from the term lists, and repeating the previous process
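The exact scoring lives in tfidf.py; the general idea of selecting the top-k terms by TF-IDF can be sketched as follows. The function name and the max-over-documents aggregation here are illustrative assumptions, not the module's exact method:

```python
import math
from collections import Counter

def top_k_tfidf_features(docs, k):
    """Score every term by its best TF-IDF over all docs; keep the top k.

    docs is a list of token lists (e.g. stemmed title & body words).
    """
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    best = {}
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # normalized term frequency times inverse document frequency
            score = (count / len(doc)) * math.log(n / df[term])
            best[term] = max(best.get(term, 0.0), score)
    # highest score first; ties broken alphabetically
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With k set to 1000, this mirrors the top-1000-words selection described above; terms appearing in every document score zero and naturally drop out.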
For a more detailed report of the methodology used to sanitize the data and construct these refined datasets and feature vectors, read the file in this project titled Report1.md:
> less Report1.md
Potential additions to future iterations of feature vector generation:
- different normalization
- bigram/trigram/n-gram aggregation
- stratified sampling: starting letter, stem, etc.
- binning: equal-width & equal-depth (grouping by topics/places, part-of-speech, etc)
- entropy-based discretization (partitioning based on entropy calculations)
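As one illustration of the binning idea above, equal-width binning partitions a numeric range into k intervals of equal size. This is a minimal sketch, not part of the module; the function name is hypothetical:

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a zero-width range
    # clamp the top edge so max(values) falls into bin k - 1
    return [min(int((v - lo) / width), k - 1) for v in values]
```

Equal-depth binning would instead sort the values and cut them into k groups of (roughly) equal size, trading uniform interval widths for uniform bin populations.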
## Usage

This module relies on several libraries to perform preprocessing. Before anything else, ensure NLTK is installed:
> pip install nltk
Next, enter a Python shell and download the necessary NLTK data:
> python
>>> import nltk
>>> nltk.download()
From the download window, ensure punkt, wordnet, and stopwords are downloaded onto your machine:
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Download which package (l=list; x=cancel)?
Identifier> punkt
Downloading package punkt to /home/3/loua/nltk_data...
Unzipping tokenizers/punkt.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> stopwords
Downloading package stopwords to /home/3/loua/nltk_data...
Unzipping corpora/stopwords.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> wordnet
Downloading package wordnet to /home/3/loua/nltk_data...
Unzipping corpora/wordnet.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> q
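Alternatively, the same three packages can be fetched without the interactive menu, using NLTK's command-line downloader:

```shell
> python -m nltk.downloader punkt wordnet stopwords
```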
Next, ensure BeautifulSoup4 is installed:
> pip install beautifulsoup4
To run the code, first ensure the preprocess.py file has execute privileges:
> chmod +x preprocess.py
Next, ensure the tfidf.py, feature1.py, feature2.py, and feature3.py files are in the same directory as preprocess.py. Also ensure the data/ directory containing the reut2-xxx.sgm files is present in that same folder. To begin preprocessing the data, run:
> python preprocess.py
or
> ./preprocess.py
The preprocessing might take some time to complete.
Once preprocess.py finishes execution, the code generates three datasets, labeled dataset1.csv, dataset2.csv, and dataset3.csv, in the project directory (the same folder as preprocess.py). To view these datasets, run:
> less datasetX.csv
where X is replaced with 1, 2, or 3 depending on the dataset.
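The generated datasets can also be inspected programmatically with the standard csv module. A minimal sketch; the column names in the usage note below are illustrative, since the exact schema is described in Report1.md:

```python
import csv

def load_dataset(path):
    """Read a dataset CSV into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

For example, `load_dataset("dataset1.csv")` returns one dict per document, mapping each header field (feature words and class labels) to its string value for that row.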
## Development

- This module was developed using Python 2.7.10 with the NLTK and BeautifulSoup4 libraries.
- Ankai Lou (lou.56@osu.edu)
- Daniel Jaung (jaung.2@osu.edu)
## Change Log

2015-09-11 - Version 1.0.3

- Finalized the construction and output of dataset3.csv
- Updated Report1.md to reflect the approach/rationale of dataset3.csv
- Finalized documentation
- Included usage of scikit-learn
2015-09-11 - Version 1.0.2
- Updated the tf-idf module to use log normalization & probabilistic inverse frequency
- Finalized the construction and output of dataset2.csv
- Updated Report1.md to reflect the approach/rationale of dataset2.csv
- Began construction of dataset3.csv
- TODO: finish Report1.md and dataset3.csv
2015-09-11 - Version 1.0.1
- Fixed the tf-idf module to provide normalized scores in the range [0,1]
- Updated tokenization in preprocess.py to filter non-English words and shorter stems
- Updated the feature selection process for feature vector 1 to run in minimal time
- Finalized the construction and output of dataset1.csv
- Began construction of dataset2.csv
- TODO: finish Report1.md and dataset2.csv; start dataset3.csv
2015-09-10 - Version 1.0.0
- Initial code import
- Added functionality to generate parse tree
- Added functionality to generate document objects
- Added functionality to tokenize, stem, and filter words
- Added functionality to generate lexicons for title & body words
- Prepared documents for feature selection & dataset generation