/enelvo

A flexible normalizer for user-generated content

Primary LanguagePythonMIT LicenseMIT


Enelvo

A flexible normaliser for user-generated content in Portuguese.

tests Code style: black

Enelvo is a tool for normalising noisy words in user-generated content written in Portuguese -- such as tweets, blog posts, and product reviews. It is capable of identifying and normalising spelling mistakes, internet slang, acronyms, proper nouns, and others.

The tool was developed as part of my master's project. You can find more details about how it works in my dissertation (in Portuguese) or in this paper (in English). For more information in Portuguese, please visit the project website.

Citing

If you use Enelvo or any code from Enelvo in your research work, you are kindly asked to acknowledge the use of the tool in your publications.

Bertaglia, Thales Felipe Costa, and Maria das Graças Volpe Nunes. "Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization." Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). 2016.

BibTeX:

@inproceedings{bertaglia2016exploring,
  title={Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization},
  author={Bertaglia, Thales Felipe Costa and Nunes, Maria das Gra{\c{c}}as Volpe},
  booktitle={Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)},
  pages={112--120},
  year={2016}
}

Getting Started

You can install Enelvo using pip by running:

pip3 install --user enelvo 

To make sure that the installation was successful, run:

python3 -m enelvo --input in.txt --output out.txt

If eveything went correctly, out.txt will be written -- containing the normalised version of in.txt.

There is also a REST-based microservice for Enelvo, developed by Thiago D'Ávila. Instructions can be found on the repository page.

Running

You can use the tool, with the most simple configuration, by running:

python3 -m enelvo --input in.txt --output out.txt

There are two required arguments: --input (path to the input file or folder) and --output (path+file name or just path, if the input is a folder to which Enelvo will write the output). Enelvo considers that each line in the input file is a sentence, so format it accordingly. Use option -h to see the full list of arguments and their explanation.

If your input is a folder/directory, you need to use the flag -F. Each output file will be written to the directory specified in --output, as original_file_name + .normalized.

You can also run Enelvo in interactive mode. In this case, you will be able to type in sentences and their normalised version will be displayed in real-time. To use interactive mode, just run:

python3 -m enelvo --interactive

Each of the arguments and their usage will be explained in the following section.

Arguments

There are some arguments that allow you to personalise how Enelvo works. You can find the details by adding -h or --help when running the tool. The main ones are:

                   Option                         Default Description
-h, --help - Displays list of commands.
-l, --lex LEX_FILE unitex-full-clean+enelvo-ja-corrigido.txt Changes the lexicon of words considered correct.
-F, --folder - Sets input as a folder.
-iglst, --ignore-list LIST_FILE None Sets a list of words that will be ignored by the normaliser.
-fclst, --force-list LIST_FILE None Sets a list of words (and optionally their corrections) that will always be processed by the normaliser.
-t, --tokenizer readable regular Changes tokeniser configuration. A readable tokeniser does not replace entities (hashtags, number etc) for tags.
-cpns, --capitalize-pns - Capitalises proper nouns (e.g. maria -> Maria).
-cacs, --capitalize-acs - Capitalises acronyms (e.g bbb -> BBB).
-cinis, --capitalize-inis - Capitalises initials in a sentence.
-sn, --sanitize - Removes punctuation, emojis, and emoticons from all sentences.

The following sections will explain each one more thoroughly.

Changing the Lexicon

Argument -l or --lex lets you choose the lexicon of words considered correct -- i.e, this argument sets the language dictionary. The input must be the full file path (e.g. ../some/folder/dict-pt.txt).

Ignore and Force Lists

Unfortunately, the language lexicons we use are not perfect. Sometimes they contain words that are not in fact correct, therefore preventing them from being normalised. They also sometimes don't contain words that are correct, thus wrongly marking them as noise. In order to ease this problem, Enelvo implements ignore and force lists.

An ignore list is a list of words that will always be considered correct -- even if not contained in the language lexicon. To use it, add -iglst path_to_list or -ignore-list path_to_list. The input must be the full path file and the file must contain a single word per line.

A force list is a list of words that will always be considered noisy -- even if contained in the language lexicon. Thus, these words will alway be normalised. To use it, add -fclst path_to_list or -force-list path_to_list. The input must be the full path file and the file must contain a single word per line.

For the force list, you can also force a correction. To do so, just write the word and its correction separated by a comma. You can mix both formats, for example:

vc
q,que
oq, o que
kk
etc

Lines containing a comma will assume that the word after the comma is a forced correction. Other lines will just force the word to be corrected regularly by the normaliser.

Changing the Tokeniser

By default, the tokeniser used in Enelvo replaces some entities with pre-defined tags. Twitter usernames become USERNAME, numbers (including dates, phone numbers etc) -> NUMBER, URLs -> URL, Twitter hashtags -> HASHTAG, emojis -> EMOJI etc.

If you want to keep the tokens as they are (so no replacement tags), use -t readable or --tokenizer readable.

Capitalising Entities

Enelvo can capitalise different entities using lexicons. In order to do so, you just need to set a flag for each entity that you want to capitalise.

To capitalise proper nouns, set -cpns or --capitalize-pns.

To capitalise acronyms, set -cacs or --capitalize-acs.

To capitalise initials (first letter after punctuation or at the beggining of a sentence), set -cinis or --capitalize-inis.

Cleaning the Text

Enelvo also provides some methods for "cleaning" the text. If you want to remove punctuation, emojis, and emoticons from all sentences, simply set -snor --sanitize.

Other Arguments

There are some other arguments used to control the internal functioning of the normalisation methods (like thresholds etc). Use -h or --help to see further details.

What Else?

Everything described here is related to using Enelvo as a tool. However, it can be personalised and configured way further when used as an API/library. It is possible to generate and score candidates using a lot of different metrics and methods -- you can even use your own metrics! The easiest way of doing this is using the Normaliser class. Have a look at example.py and normaliser.py to understand how to start. The code is reasonably well-commented, so it shouldn't be too difficult to understand what is happening.

If you have any questions, comments, suggestions or problems with Enelvo, please feel free to contact me.

Acknowledgements

Many people were fundamental in carrying out this project (and my master's in general), so I would like to thank some of them:

Graça Nunes, Henrico Brum, Rafael Martins, Raphael Silva, and Thiago Pardo, who devoted a (big) portion of their valuable time to annotate the corpus that served as the basis for this project.

Marcos Treviso for helping me organise and implement many parts os this project, and for teaching me a great deal of what I know about NLP.

Carolina Coimbra and Thiago D'Ávila, for being the first ones to use Enelvo, for reporting many bugs, and for suggesting valuable improvements to the tool.

All my fellow labmates from NILC for helping throughout my whole master's.

Thank you all! ❤️