/tadpole

The good old predecessor of Frog

Primary LanguageLexGNU General Public License v3.0GPL-3.0

Tadpole 0.6

  A Tagger-Lemmatizer-Morphological-Analyzer-Dependency-Parser for Dutch
  Version  0.6
  http://ilk.uvt.nl/tadpole
 
  Copyright 2006-2010 Bertjan Busser, Antal van den Bosch, and Ko
  van der Sloot
  ILK Research Group, Faculty of Humanities, Tilburg University
  http://ilk.uvt.nl

  Tadpole is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 3 of the License, or
  (at your option) any later version.

  Tadpole is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program.  If not, see <http://www.gnu.org/licenses/>.

  For more information and updates, see:
      http://ilk.uvt.nl/tadpole

---------------------------------------------------------------------
Installation and Quick Start

Tadpole relies on Timbl version 6.3, TimblServer version 1.0, and Mbt
version 3.2. TimblServer relies on Timbl; Mbt relies on Timbl and
TimblServer. The logical order of installation is therefore (1) Timbl,
(2) TimblServer, (3) Mbt, and (4) Tadpole. Tadpole will NOT work with
previous versions of Timbl and Mbt. The three software packages can be
downloaded from

  http://ilk.uvt.nl/timbl (Timbl and TimblServer)
  http://ilk.uvt.nl/mbt   (Mbt)

Please consult the installation instructions with these packages.

Tadpole also relies on Python 2.5 or higher, libboost 1.33 or higher,
and ICU 3.6 or higher. Please consult your system maintainer if you
cannot install these packages yourself.

When you have downloaded the Tadpole tarball from
http://ilk.uvt.nl/downloads/pub/software/tadpole , you can untar the
package, and go to the Tadpole directory. If you installed Timbl,
TimblSever and Mbt in the same install directory (i.e., you specified
the same install directory with "--prefix=<installdir>" in all three
package installations), it is sufficient to the same with Tadpole.

%prompt$> tar zxvf tadpole-0.6.tar.gz
%prompt$> cd tadpole-0.6
%prompt$> ./configure --prefix=<installdir>
%prompt$> make && make install

Invoking the Tadpole binary without arguments prints a basic usage:

%prompt$> ./Tadpole
Tadpole v.0.6
Options:
	-d <dirName> path to config dir (default ./config)
	-T <taggerconfigfile> (uses Mbt-style settings file)
	-M <MBMAconfigfile> (morphological analysis)
	   accepts:
		t <treefile>
		m <mode>
	-L <MBlemconfigfile> (lemmatizer)
	   accepts:
		p <prefix> (for filenames)
		O <timbl options>
	-U <mwuconfigfile> (multiwordchunker)
	   accepts:
		t <mwu unit file>
		c <connect_char> (char between tokens in a mwu)
	-P <parserconfigfile> 
	   accepts:
	   to do...
	-t <testfile>
	--testdir=<directory> (all files in this dir will be processed
	-o <outputfile> (default stdout)
	--outputdir=<outputfile> (default stdout)
	-s <output field separator> (default tab)
	-S <port> (run as server instead of reading from testfile)
	-K : keep intermediate files, (last sentence only) (default false)
	-d <debug level> (for more verbosity)
	--skip=<components> Allows to skip certain Tadpole components.
	  Especially the dependency parser is resource intensive and may
	  want to be skipped when not required. Components are indicated by
	  one character, multiple may be combined:
	  t - tokeniser, p - parser, m - morphological analyser

The following command line is an example run of Tadpole on the provided
sample text file test.txt


%prompt%> ./Tadpole -t test.txt

This should produce output (to stdout) like this:

1    De	    de	    [de]   LID(bep,stan,rest)	2	det
2    oprichter	    oprichter			[op][richt][er]	N(soort,ev,basis,zijd,stan)	8	su
3    van	    van				[van]		VZ(init)			2	mod
4    Wikipedia	    Wikipedia			[Wikipedia]	SPEC(deeleigen)			3	obj1
5    ,		    ,				[,]		LET()				4	punct
6    Jimmy_Wales    Jimmy_Wales			[Jimmy]_[Wales]	SPEC(deeleigen)			2	app
7    ,		    ,				[,]		LET()				6	punct
8    wil	    willen			[wil]		WW(pv,tgw,ev)			0	ROOT
9    een	    een				[een]		LID(onbep,stan,agr)		11	det
10   nieuwe	    nieuw			[nieuw][e]	ADJ(prenom,basis,met-e,stan)	11	mod
11   zoekmachine    zoekmachine			[zoek][machine]	N(soort,ev,basis,zijd,stan)	8	su
12   lanceren	    lanceren			[lanceren]	WW(inf,vrij,zonder)		8	vc
13   .		    .				[.]		LET()				12	punct

The first column is a token counter; the second column is the token
itself, followed by its lemma and its morphological analysis. The
fifth column is the CGN POS tag. The sixth column points to the
token counter of the head token of the line's token in the dependency
graph; the seventh column contains the type of dependency relation
between the two tokens.


---------------------------------------------------------------------
Credits

Many thanks go out to the people who made the developments of the
Tadpole components possible: Walter Daelemans, Jakub Zavrel, Ko van
der Sloot, Sabine Buchholz, Sander Canisius, Gert Durieux, and Peter
Berck. 

Thanks to Erik Tjong Kim Sang and Lieve Macken for stress-testing the
first versions of Tadpole, and to Rogier Kraf, Guy De Pauw, Joost
Hengstmengel, Frederik Vaassen, Wouter van Atteveldt, Joseph Turian,
Barbara Plank, Jan-Pieter Kunst, Robert Hensing, Theo van den Heuvel,
and Martha van den Hoven for valuable bug reports, comments, and
suggestions for improvements.


---------------------------------------------------------------------
References

Tadpole is described in the following paper:

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (to
 appear). An efficient memory-based morphosyntactic tagger and parser for
 Dutch, To appear in Selected Papers of the 17th Computational Linguistics in
 the Netherlands Meeting, Leuven, Belgium.

We kindly ask you to refer to this paper if you make use of Tadpole in
your own work.

You can find more information on components of Tadpole in these papers,
which can be downloaded from http://ilk.uvt.nl/publications :

Daelemans, W., Zavrel, J, Berck, P, and Gillis, S. (1996). MBT: A
 Memory-Based Part of Speech Tagger-Generator. In: E. Ejerhed and I. Dagan
 (eds.) Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen,
 Denmark, pp. 14-27.

Van den Bosch, A., Daelemans, W., and Weijters, A. (1996). Morphological
 analysis as classification: An inductive-learning approach. In Proceedings
 of NeMLaP-2, Bilkent University, Turkey, 79-89.

Van den Bosch, A., and Daelemans, W. (1999). Memory-based morphological
 analysis. In Proceedings of the 37th Annual Meeting of the Association for
 Computational Linguistics, ACL'99, University of Maryland, USA, June 20-26,
 1999, pp. 285-292.

Zavrel, J., and Daelemans W. (1999).  Recent Advances in Memory-Based
 Part-of-Speech Tagging. In: Actas del VI Simposio Internacional de
 Comunicacion Social, Santiago de Cuba, pp. 590-597.