Implementation of the LocalMaxs algorithm for extracting multiword units (MWUs) from plain text, described in (Silva and Lopes, 1999).
These programs implement the LocalMaxs algorithm for extracting multiword units (MWUs) from plain text, described in (Silva and Lopes, 1999). There are two versions of the algorithm: strict and relaxed. (The paper describes the strict version.)
There are also two implementations of strict LocalMaxs: a simple one and a bigcorpus one. If your corpus is large, the bigcorpus version may be the only one able to cope with it. Otherwise, you may choose the relaxed version for its greater recall or the strict version for its greater precision.
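For bigrams, the two "glue" functions offered by these programs are the Dice coefficient and the Symmetric Conditional Probability (SCP) from the paper. A minimal sketch of both on raw counts (only the bigram case; the programs also apply the paper's Fair Dispersion Normalization for longer n-grams, which is omitted here):

```python
from collections import Counter

def dice(f_xy, f_x, f_y):
    # Dice coefficient for a bigram (x, y): 2*f(x,y) / (f(x) + f(y))
    return 2 * f_xy / (f_x + f_y)

def scp(f_xy, f_x, f_y, total):
    # Symmetric Conditional Probability for a bigram (x, y):
    # p(x,y)^2 / (p(x) * p(y))
    p_xy = f_xy / total
    return (p_xy ** 2) / ((f_x / total) * (f_y / total))

# Toy corpus, purely for illustration
tokens = "new york is in new york state".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

f_xy = bigrams[("new", "york")]                        # 2
print(dice(f_xy, unigrams["new"], unigrams["york"]))   # 2*2/(2+2) = 1.0
print(scp(f_xy, unigrams["new"], unigrams["york"], total))  # ≈ 1.0
```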
Originally there were four versions available, which were consolidated into the three below:
- Simple version: multiwords 1.2, strict and relaxed. (This implementation is not suitable for large corpora because it requires a lot of memory.)
- Strict-bigcorpus version: multiwords 1.5 bigcorpus-rev5, strict only. (This can handle huge corpora; see details above.)
- Relaxed version: multiwords 2.0, relaxed only. (Uses somewhat less memory by working on disk; slow.)
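All versions share the same selection criterion: an n-gram is kept when its glue is a local maximum with respect to the (n-1)-grams it contains and the (n+1)-grams that contain it. A rough sketch of the strict test, assuming glue values have already been computed into a dict (this simplifies the exact conditions in the paper and in the code):

```python
def is_local_max(ngram, glue):
    # Strict LocalMaxs test (sketch): an n-gram is kept when its glue
    # is not exceeded by any contained (n-1)-gram and strictly exceeds
    # every (n+1)-gram that contains it.  `glue` maps word tuples to
    # precomputed glue values (dice or scp).
    g = glue[ngram]
    n = len(ngram)
    # (n-1)-grams contained in the n-gram (only checked for n > 2)
    subs = [ngram[:-1], ngram[1:]] if n > 2 else []
    # (n+1)-grams in the table that contain the n-gram
    supers = [w for w in glue
              if len(w) == n + 1 and (w[:-1] == ngram or w[1:] == ngram)]
    return (all(g >= glue.get(s, 0.0) for s in subs) and
            all(g > glue[s] for s in supers))

# Hypothetical glue values, purely for illustration
glue = {("new", "york"): 0.9,
        ("new", "york", "city"): 0.95,
        ("york", "city"): 0.4}
print(is_local_max(("new", "york", "city"), glue))  # True
print(is_local_max(("york", "city"), glue))         # False
```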
Simple version (multiwords.py)
Requirements: Python 3
Command syntax:
./multiwords.py dice|scp strict|relaxed MAXN
For example, this command will extract bigrams and trigrams from the given corpus, using scp as the "glue" function:
./multiwords.py scp strict 3 < corpus.txt > mwus.txt
Note: A bug has been fixed since the original version; see the source code for details.
Relaxed version (multiwords2.py)
Requirements: Python 3
Command syntax:
./multiwords2.py dice|scp MAXN TEXTFILE OUTPUTDIR
MAXN is an integer ≥ 2
TEXTFILE is the corpus file, previously tokenized and lowercased
OUTPUTDIR is the name of a directory (it will be created if it doesn't exist) where the program writes temporary and output files. The output files will be named OUTPUTDIR/Nmwus.txt, N being 2, 3, ..., MAXN.
For example, this command will extract bigrams and trigrams from the given corpus, using scp as the "glue" function:
./multiwords2.py scp 3 corpus.txt results
The output files will be results/2mwus.txt and results/3mwus.txt
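The programs expect the corpus already tokenized and lowercased (the TEXTFILE above). A minimal preparation sketch, assuming a simple whitespace/punctuation tokenization (`prepare.py` is a hypothetical helper, not part of this repository; it is not the authors' preprocessing):

```python
import re
import sys

def prepare(line):
    # Lowercase and put spaces around punctuation so that every token
    # is whitespace-separated.  This is only one plausible tokenization;
    # adapt it to your corpus and language.
    line = line.lower()
    line = re.sub(r"([^\w\s])", r" \1 ", line)
    return " ".join(line.split())

if __name__ == "__main__":
    # Use as a filter: ./prepare.py < raw.txt > corpus.txt
    for line in sys.stdin:
        print(prepare(line))
```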
Note: This version is highly optimised for speed and for handling big corpora. The original version is preserved in a previous commit of this repository; that version needs Perl, sed and Python 3 to work.
Strict-bigcorpus version (multiwords.sh)
Requirements:
- MAWK, as it is the fastest AWK implementation available
- Standard Unix tools: grep, cut, sort, uniq
- Nothing else...
Command syntax:
./multiwords-strict-bigcorpus-rev5/multiwords.sh dice|scp MAXN SORTBUF < input.txt > output.txt
SORTBUF: there are four sort invocations in the pipeline, so you should not specify more than 90% of your system's memory.
You should export TEMPDIR=/much_space/ because of the big temporary files.
For example:
./multiwords-strict-bigcorpus-rev5/multiwords.sh dice 3 90% < input.txt > output.txt
This modified code is made available under the GNU Lesser General Public License v3.0. The original code, available from here and here (also included in this repository), is under the Creative Commons Attribution 3.0 Unported License by the authors Joaquim Ferreira da Silva and José Gabriel Pereira Lopes. My modifications are licensed under the GNU GPL v3.0 license (CC-BY-3.0 -> CC-BY-SA 3.0 -> CC-BY-SA 4.0 -> GNU GPL v3.0).
If you use this implementation of the original Local Maxima algorithm, please cite the following paper:
A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. Joaquim Ferreira da Silva and José Gabriel Pereira Lopes. In Proceedings of the Sixth Meeting on Mathematics of Language (MOL6), Orlando, Florida, July 23-25, 1999, pp. 369-381.
@inproceedings{da1999local,
title={A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora},
author={Silva, Joaquim Ferreira da and Lopes, José Gabriel Pereira},
booktitle={Sixth Meeting on Mathematics of Language},
pages={369--381},
year={1999}
}