/ilps-nl-splitter

Primary LanguagePerlGNU Lesser General Public License v3.0LGPL-3.0

This repository contains the nl-splitter compound segmentation tool for archival purposes since the original website (https://ilps.science.uva.nl/resources/compound-splitter-nl/) is no longer available. See https://web.archive.org/web/20200813005715/https://ilps.science.uva.nl/resources/compound-splitter-nl/ for an archived version of the website. All credit goes to the original authors. Find the original README below.


Compound word splitter for the Dutch language, v 1.0

(c) 2006-2010 ISLA, University of Amsterdam

Contact: Valentin Jijkoun jijkoun@uva.nl

Files

  • compound_server.conf - configuration file for the server.

  • compound_server.pl - perl script that runs a server which splits words that get sent to it into parts. The actual compound splitting is done by CompoundSplitter.pm. Run the server using command:

    $ perl compound_server.pl &
    

    The server listens on port 50500 (by default). It accepts one word per line. If the word is a compound, it prints back the split version of the word. If the word is not a compound, it prints an empty line.

    You can test the server manually to TCP port 50500:

    $ telnet localhost 50500 
    
  • CompoundSplitter.pm - perl module that does the actual compound splitting, used by compound_server.pl

  • compoundSplitter.t - test Perl script for CompoundSplitter.pm

  • run.out - frequency counts of Dutch words, based on the Alpino Dutch newspaper corpus; used as dictionary by CompoundSplitter.pm

Authors

The following people were involved in the implementation of the compound splitter at various times (in alphabetical order):

Katja Hofmann Valentin Jijkoun Jaap Kamps Christof Monz