HAREM Datasets Preprocessing

The HAREM collections are popular Portuguese datasets that are commonly used for the Named Entity Recognition (NER) task. In their original XML format, some phrases can have multiple entity identification solutions, and entities can be assigned more than one class (<ALT> tags and | characters indicate multiple solutions). This annotation scheme is good for representing vagueness and indeterminacy. However, it introduces complications when modeling NER as a sequence tagging problem, especially during evaluation, because a single true answer is required.

The script xml_to_json.py converts the XML file to JSON format and selects a single solution for all <ALT> tags and vague entities:

For each Entity with multiple classes, it selects the first valid class.
For each <ALT> tag, it selects the solution with the highest number of entities.
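The two selection rules above can be sketched in a few lines of Python. This is only an illustration, not the code in xml_to_json.py: the function names, the representation of an alternative as a list of entity dicts, and the tie-breaking behaviour (the earliest alternative wins) are all assumptions.

```python
from typing import List, Optional, Set


def first_valid_class(categ: str, valid: Set[str]) -> Optional[str]:
    """Return the first class of a `|`-separated string that is in `valid`.

    Hypothetical helper: returns None when no class is valid.
    """
    for label in categ.split("|"):
        label = label.strip()
        if label in valid:
            return label
    return None


def pick_alternative(alternatives: List[List[dict]]) -> List[dict]:
    """Return the alternative that contains the most entities.

    Assumption: each alternative is a list of entity dicts. On a tie,
    max() keeps the earliest alternative.
    """
    return max(alternatives, key=len)
```

For example, first_valid_class("ABSTRACCAO|PESSOA", {"PESSOA", "LOCAL"}) skips ABSTRACCAO because it is not in the valid set and returns "PESSOA".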

The script has been tested with the following XML files:

FirstHAREM: CDPrimeiroHAREMprimeiroevento.xml
MiniHAREM: CDPrimeiroHAREMMiniHAREM.xml

Total and Selective scenarios

Recent work often trains models and reports performance for two scenarios: Total and Selective. The Total scenario corresponds to the full dataset with 10 Entity classes:

PESSOA (Person)
ORGANIZACAO (Organization)
LOCAL (Location)
TEMPO (Date)
VALOR (Value)
ABSTRACCAO (Abstraction)
ACONTECIMENTO (Event)
COISA (Thing)
OBRA (Title)
OUTRO (Other)

The Selective scenario considers only the first 5 classes of the list above.

The script is compatible with both scenarios and selects only the entities that belong to the chosen scenario.
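As a rough sketch of how the two scenarios relate, the Selective class set is simply the first five entries of the Total list. The constant names, the filter_entities helper, and the "label" key of each entity dict below are hypothetical and are not taken from xml_to_json.py.

```python
from typing import Dict, List, Tuple

# Class names in the order listed above.
TOTAL_CLASSES: Tuple[str, ...] = (
    "PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR",
    "ABSTRACCAO", "ACONTECIMENTO", "COISA", "OBRA", "OUTRO",
)
# The Selective scenario keeps only the first five classes.
SELECTIVE_CLASSES: Tuple[str, ...] = TOTAL_CLASSES[:5]


def classes_for(scenario: str) -> Tuple[str, ...]:
    """Map a scenario name to its class tuple."""
    if scenario == "total":
        return TOTAL_CLASSES
    if scenario == "selective":
        return SELECTIVE_CLASSES
    raise ValueError(f"unknown scenario: {scenario!r}")


def filter_entities(entities: List[Dict[str, str]], scenario: str) -> List[Dict[str, str]]:
    """Keep only entities whose (assumed) "label" key is in the scenario."""
    allowed = set(classes_for(scenario))
    return [ent for ent in entities if ent["label"] in allowed]
```

Under these assumptions, an entity labelled OBRA is kept in the Total scenario and dropped in the Selective one.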

Usage

The scripts are tested with Python 3.6.

Install the requirements:

$ pip install -r requirements.txt

Run the script:

$ python xml_to_json.py path_to_xml_file.xml --scenario [total|selective]

The converted file is saved with the same name as the input, followed by the suffix -{scenario}.json.
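One plausible reading of this naming rule is sketched below. It assumes that the .xml extension is dropped and that the output is written next to the input file; neither detail is confirmed here, and output_path is a hypothetical helper, not a function of the script.

```python
from pathlib import Path


def output_path(xml_path: str, scenario: str) -> Path:
    """Build the assumed output path: <input stem>-<scenario>.json."""
    src = Path(xml_path)
    # with_name() replaces only the final path component, so the
    # result stays in the same directory as the input file.
    return src.with_name(f"{src.stem}-{scenario}.json")


print(output_path("CDPrimeiroHAREMMiniHAREM.xml", "selective"))
```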

Tests

To run the tests, first install the test requirements, then run the test script:

$ pip install -r requirements_test.txt
$ HAREM_DATA_DIR=test_files/ python tests.py

fabiocapsouza/harem_preprocessing
