Corpus collection toolset

Primary language: Python. License: GNU General Public License v2.0 (GPL-2.0).

===============================================================

                     CORPUS CATCHER
                           by
                    Translate.org.za
<http://translate.sourceforge.net/wiki/corpuscatcher/index>

                       Version 0.1
                   Copyright (c) 2008
                Zuza Software Foundation
               Last updated: July 16, 2008

  1. INTRODUCTION

    CorpusCatcher is a corpus collection toolset. It helps you build language-specific or topic-specific corpora from publicly available web resources. Such corpora are useful for many purposes, in particular as data for building spell checkers.

    It is written in Python and can therefore easily be used, in part or in whole, in other Python projects. It was originally written to simplify the use of BootCaT <http://clic.cimec.unitn.it/marco/tools_and_resources.html>, but has since replaced the BootCaT parts it used with Python ports.

    If you are interested in CorpusCatcher, or are working on spell checkers, you may also want to look at Spelt <http://translate.sourceforge.net/wiki/spelt/index>.

  2. DOCUMENTATION

    See the wiki at <http://translate.sourceforge.net/wiki/corpuscatcher/index> (complete instructions are in the README file there).

  3. INSTALLATION

    These tools are simple command-line tools written in Python, so installation only requires extracting the files in the distribution archive into a directory.

    Dependencies:

    1. Python >= 2.4
    2. mechanize module (only tested with version 0.1.7b)
    3. pysearch module (only tested with version 3.0)
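
    The dependency list above can be checked before running the tools. The following is a minimal sketch, not part of CorpusCatcher itself. It assumes a modern Python 3 interpreter (for importlib.util), and it assumes the two modules are importable under the names "mechanize" and "pysearch"; the latter name in particular is a guess based on the package name and may differ in your installation. It does not check module versions.

```python
import importlib.util
import sys

# Minimum interpreter version taken from the dependency list above.
MIN_PYTHON = (2, 4)

# Import names are assumptions: "pysearch" in particular may be
# installed under a different importable name on your system.
REQUIRED_MODULES = ["mechanize", "pysearch"]


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_modules(names):
    """Return the subset of module names that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for missing parent packages or malformed names.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    if not python_ok():
        print("Python >= %d.%d is required" % MIN_PYTHON)
    for name in missing_modules(REQUIRED_MODULES):
        print("Missing module: %s" % name)
```

    Running the script prints one line per unmet dependency and nothing if everything was found.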