Tadpole 0.6 A Tagger-Lemmatizer-Morphological-Analyzer-Dependency-Parser for Dutch Version 0.6 http://ilk.uvt.nl/tadpole Copyright 2006-2010 Bertjan Busser, Antal van den Bosch, and Ko van der Sloot ILK Research Group, Faculty of Humanities, Tilburg University http://ilk.uvt.nl Tadpole is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. Tadpole is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. For more information and updates, see: http://ilk.uvt.nl/tadpole --------------------------------------------------------------------- Installation and Quick Start Tadpole relies on Timbl version 6.3, TimblServer version 1.0, and Mbt version 3.2. TimblServer relies on Timbl; Mbt relies on Timbl and TimblServer. The logical order of installation is therefore (1) Timbl, (2) TimblServer, (3) Mbt, and (4) Tadpole. Tadpole will NOT work with previous versions of Timbl and Mbt. The three software packages can be downloaded from http://ilk.uvt.nl/timbl (Timbl and TimblServer) http://ilk.uvt.nl/mbt (Mbt) Please consult the installation instructions with these packages. Tadpole also relies on Python 2.5 or higher, libboost 1.33 or higher, and ICU 3.6 or higher. Please consult your system maintainer if you cannot install these packages yourself. When you have downloaded the Tadpole tarball from http://ilk.uvt.nl/downloads/pub/software/tadpole , you can untar the package, and go to the Tadpole directory. If you installed Timbl, TimblSever and Mbt in the same install directory (i.e., you specified the same install directory with "--prefix=<installdir>" in all three package installations), it is sufficient to the same with Tadpole. %prompt$> tar zxvf tadpole-0.6.tar.gz %prompt$> cd tadpole-0.6 %prompt$> ./configure --prefix=<installdir> %prompt$> make && make install Invoking the Tadpole binary without arguments prints a basic usage: %prompt$> ./Tadpole Tadpole v.0.6 Options: -d <dirName> path to config dir (default ./config) -T <taggerconfigfile> (uses Mbt-style settings file) -M <MBMAconfigfile> (morphological analysis) accepts: t <treefile> m <mode> -L <MBlemconfigfile> (lemmatizer) accepts: p <prefix> (for filenames) O <timbl options> -U <mwuconfigfile> (multiwordchunker) accepts: t <mwu unit file> c <connect_char> (char between tokens in a mwu) -P <parserconfigfile> accepts: to do... -t <testfile> --testdir=<directory> (all files in this dir will be processed -o <outputfile> (default stdout) --outputdir=<outputfile> (default stdout) -s <output field separator> (default tab) -S <port> (run as server instead of reading from testfile) -K : keep intermediate files, (last sentence only) (default false) -d <debug level> (for more verbosity) --skip=<components> Allows to skip certain Tadpole components. Especially the dependency parser is resource intensive and may want to be skipped when not required. Components are indicated by one character, multiple may be combined: t - tokeniser, p - parser, m - morphological analyser The following command line is an example run of Tadpole on the provided sample text file test.txt %prompt%> ./Tadpole -t test.txt This should produce output (to stdout) like this: 1 De de [de] LID(bep,stan,rest) 2 det 2 oprichter oprichter [op][richt][er] N(soort,ev,basis,zijd,stan) 8 su 3 van van [van] VZ(init) 2 mod 4 Wikipedia Wikipedia [Wikipedia] SPEC(deeleigen) 3 obj1 5 , , [,] LET() 4 punct 6 Jimmy_Wales Jimmy_Wales [Jimmy]_[Wales] SPEC(deeleigen) 2 app 7 , , [,] LET() 6 punct 8 wil willen [wil] WW(pv,tgw,ev) 0 ROOT 9 een een [een] LID(onbep,stan,agr) 11 det 10 nieuwe nieuw [nieuw][e] ADJ(prenom,basis,met-e,stan) 11 mod 11 zoekmachine zoekmachine [zoek][machine] N(soort,ev,basis,zijd,stan) 8 su 12 lanceren lanceren [lanceren] WW(inf,vrij,zonder) 8 vc 13 . . [.] LET() 12 punct The first column is a token counter; the second column is the token itself, followed by its lemma and its morphological analysis. The fifth column is the CGN POS tag. The sixth column points to the token counter of the head token of the line's token in the dependency graph; the seventh column contains the type of dependency relation between the two tokens. --------------------------------------------------------------------- Credits Many thanks go out to the people who made the developments of the Tadpole components possible: Walter Daelemans, Jakub Zavrel, Ko van der Sloot, Sabine Buchholz, Sander Canisius, Gert Durieux, and Peter Berck. Thanks to Erik Tjong Kim Sang and Lieve Macken for stress-testing the first versions of Tadpole, and to Rogier Kraf, Guy De Pauw, Joost Hengstmengel, Frederik Vaassen, Wouter van Atteveldt, Joseph Turian, Barbara Plank, Jan-Pieter Kunst, Robert Hensing, Theo van den Heuvel, and Martha van den Hoven for valuable bug reports, comments, and suggestions for improvements. --------------------------------------------------------------------- References Tadpole is described in the following paper: Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (to appear). An efficient memory-based morphosyntactic tagger and parser for Dutch, To appear in Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium. We kindly ask you to refer to this paper if you make use of Tadpole in your own work. You can find more information on components of Tadpole in these papers, which can be downloaded from http://ilk.uvt.nl/publications : Daelemans, W., Zavrel, J, Berck, P, and Gillis, S. (1996). MBT: A Memory-Based Part of Speech Tagger-Generator. In: E. Ejerhed and I. Dagan (eds.) Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14-27. Van den Bosch, A., Daelemans, W., and Weijters, A. (1996). Morphological analysis as classification: An inductive-learning approach. In Proceedings of NeMLaP-2, Bilkent University, Turkey, 79-89. Van den Bosch, A., and Daelemans, W. (1999). Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL'99, University of Maryland, USA, June 20-26, 1999, pp. 285-292. Zavrel, J., and Daelemans W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In: Actas del VI Simposio Internacional de Comunicacion Social, Santiago de Cuba, pp. 590-597.