===============================================================
CORPUS CATCHER
by Translate.org.za
<http://translate.sourceforge.net/wiki/corpuscatcher/index>

Version 0.1
Copyright (c) 2008 Zuza Software Foundation
Last updated: July 16, 2008

INTRODUCTION
CorpusCatcher is a corpus collection toolset. It helps you build language- or topic-specific corpora from publicly available web resources. Such corpora are useful for many purposes, in particular as data for building spell checkers.
It is written in Python, so it can easily be used, in part or in whole, in other Python projects. It was originally written to simplify the use of BootCaT <http://clic.cimec.unitn.it/marco/tools_and_resources.html>, but has since grown to replace the BootCaT parts it used with Python ports.
If you are interested in CorpusCatcher or are working on spell checkers, you may also want to look at Spelt <http://translate.sourceforge.net/wiki/spelt/index>.
DOCUMENTATION
See the wiki at <http://translate.sourceforge.net/wiki/corpuscatcher/index> (complete instructions are in the README file there).
INSTALLATION
CorpusCatcher consists of simple command-line tools written in Python, so installation only requires extracting the files in the distribution archive into a directory.
Dependencies:
- Python >= 2.4
- mechanize module (only tested with version 0.1.7b)
- pysearch module (only tested with version 3.0)
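
Because installation is just extracting the archive, nothing checks the dependencies for you. The following is a minimal sketch of such a check; it is not part of CorpusCatcher. The helper names are hypothetical, and the import names ("mechanize" for mechanize, "yahoo.search" for pysearch) are assumptions that may need adjusting for your installation.

```python
import sys

# Import names are assumptions, not taken from the CorpusCatcher sources:
# "mechanize" for the mechanize module, "yahoo.search" for pysearch.
REQUIRED_MODULES = ["mechanize", "yahoo.search"]
MINIMUM_PYTHON = (2, 4)


def python_is_recent_enough(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python %d.%d or later is required" % MINIMUM_PYTHON)
    for name in missing_modules(REQUIRED_MODULES):
        print("Missing module: %s" % name)
```

Saved under any file name and run with the Python interpreter you intend to use, the script prints one line per problem it finds and nothing if all checks pass. It does not check module versions, so the "only tested with" notes above still apply.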