tweet-norm-es: Spanish Tweet Normalization
Our system for the Tweet-Norm 2013 competition, which we later improved for the paper in Ruiz, Cuadros, Etchegoyhen (2014). (See complete references at the end).
Requires
- psutil: apt-get install python-psutil
- SRILM and pysrilm (https://github.com/njsmith/pysrilm)
- KenLM (https://github.com/kpu/kenlm)
Running
Preferred: from Python shell. Options can be specified in tnconfig.py, but some of the settings in that file can be modified with command line options when calling the program (we'll expose more options in the CLI in the future).
>>> import sys
>>> sys.argv = [""]
>>> execfile("/path/to/tweet-norm-es/twenor/processing.py")
# If using command line arguments instead of tnconfig.py
Usage: processing.py [-h] [-t] [-c COMMENT]
optional arguments:
-h, --help show this help message and exit
-t, --tag tag with FreeLing
-c COMMENT, --comment COMMENT
comment for run (shown in cumulog.txt)
#COMMAND_LINE OPTIONS BELOW HERE NOT YET FUNCTIONAL (set them in config/tnconfig.py)
-b, --baseline baseline run: accept all OOV
-x MAXDISTA, --maxdista MAXDISTA
maximum edit distance above which candidate is
filtered
-d DISTAW, --distaw DISTAW
weight for edit-distance scores
-l LMW, --lmw LMW weight for language model scores
-p LMPATH, --lmpath LMPATH
path to Arpa file for language model
-w LM_WINDOW, --lm_window LM_WINDOW
left-window for context lookup in language model
E.g.
>>> sys.argv = ["", "--comment", "test Freeling tagging", "--tag"]
>>> execfile("/path/to/tnor2/twenor/processing.py")
Also from command line:
python /path/to/tweet-norm-es/twenor/processing.py
Project Structure
tweet-norm-es
|_ config
|_ tnconfig.py Config file
|_ scripts
|_ neweval.py Tweet-Norm workshop's evaluation script
|_ twenor
|_ preparation.py Common preparation functions
|_ freelmgr.py Freeling Analyzer calls
|_ twittero.py Basic tweet analysis objects: Tweet, Token, OOV, ...
|_ preprocessing.py OOV preprocessing with regexes and lists
|_ editor.py Candidate Generation and Distance-Scoring
|_ lmmgr.py Language Model creation, candidate lookup and scoring
|_ postprocessing.py Recasing
|_ entities.py Form lookup in entity resources
|_ processing.py Main program
|_ network.py Combination network to generate all candidate combinations for a tweet
|_ global_lm_scorer.py Applies candidate-combination network
|_ data Regex lists, entity lists, correction model data, LMs etc.
|_ evaluation
|_ dev devset texts and annotations
|_ eval test-set texts and annotations
System Architecture
Publications
-
Ruiz Fabo, Pablo, Montse Cuadros, and Thierry Etchegoyhen. (2014). Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models. Procesamiento del Lenguaje Natural 52:45-52. SEPLN, Spanish NLP Society.
-
Ruiz, Pablo, Montse Cuadros, and Thierry Etchegoyhen. (2013) Lexical Normalization of Spanish Tweets with Preprocessing Rules, Domain-specific Edit Distances, and Language Models. In Tweet-Norm@ SEPLN, pp. 59-63. IV Congreso Español de Informática.