The HAREM collections are popular Portuguese datasets that are commonly used for the Named Entity Recognition (NER) task. In their original XML format, some phrases can have multiple entity identification solutions, and entities can be assigned more than one class (<ALT> tags and | characters indicate multiple solutions).
This annotation scheme is good for representing vagueness and indeterminacy. However, it introduces complications when modeling NER as a sequence tagging problem, especially during evaluation, because a single true answer is required.
The script xml_to_json.py converts the XML file to JSON format and selects a single solution for all <ALT> tags and vague entities:
- For each entity with multiple classes, it selects the first valid class.
- For each <ALT> tag, it selects the solution with the highest number of entities.
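The two selection rules above can be illustrated with a minimal sketch. This is not the code from xml_to_json.py: the function names, the representation of an alternative as a plain list of entity labels, and the tie-breaking rule (the first alternative wins) are all assumptions made for illustration.

```python
from typing import List, Optional, Set


def first_valid_class(categories: str, valid: Set[str]) -> Optional[str]:
    """Return the first class of a '|'-separated string that is in `valid`.

    Hypothetical helper: returns None when no class is valid, in which
    case a caller would presumably drop the entity.
    """
    for label in categories.split("|"):
        label = label.strip()
        if label in valid:
            return label
    return None


def pick_alternative(alternatives: List[List[str]]) -> List[str]:
    """Return the alternative containing the most entities.

    Each alternative is modeled here as a list of entity labels.
    Assumption: on a tie, the first alternative is kept, which is
    how Python's max() behaves.
    """
    return max(alternatives, key=len)


valid_classes = {"PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR"}
chosen_class = first_valid_class("OBRA|PESSOA", valid_classes)
chosen_alt = pick_alternative([["PESSOA"], ["PESSOA", "LOCAL"]])
```

Here OBRA is skipped because it is not in the set of valid classes, so PESSOA is chosen, and the second alternative is selected because it contains two entities.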
The script has been tested with the following XML files:
- FirstHAREM: CDPrimeiroHAREMprimeiroevento.xml
- MiniHAREM: CDPrimeiroHAREMMiniHAREM.xml
Recent work often trains on and reports performance for two scenarios: Total and Selective. The Total scenario corresponds to the full dataset with 10 entity classes:
- PESSOA (Person)
- ORGANIZACAO (Organization)
- LOCAL (Location)
- TEMPO (Date)
- VALOR (Value)
- ABSTRACCAO (Abstraction)
- ACONTECIMENTO (Event)
- COISA (Thing)
- OBRA (Title)
- OUTRO (Other)
The Selective scenario considers only the first 5 classes of the list above.
The script is compatible with both scenarios and selects entities according to the chosen scenario.
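As a rough illustration of how the two scenarios relate, the class sets can be written out as below. The constant and function names are hypothetical and are not taken from xml_to_json.py; only the class names and their order come from the list above.

```python
from typing import List

# The 10 classes of the Total scenario, in the order listed above.
TOTAL_CLASSES = [
    "PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR",
    "ABSTRACCAO", "ACONTECIMENTO", "COISA", "OBRA", "OUTRO",
]

# The Selective scenario keeps only the first 5 classes.
SELECTIVE_CLASSES = TOTAL_CLASSES[:5]


def classes_for(scenario: str) -> List[str]:
    """Return the entity classes kept for a scenario (hypothetical helper)."""
    if scenario == "total":
        return list(TOTAL_CLASSES)
    if scenario == "selective":
        return list(SELECTIVE_CLASSES)
    raise ValueError(f"unknown scenario: {scenario!r}")
```

Under this sketch, an entity annotated only as OBRA would be kept in the Total scenario and discarded in the Selective one.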
The scripts are tested with Python 3.6.
Install the requirements:
$ pip install -r requirements.txt
Run the script:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective]
The converted file will be saved with the same name and suffix -{scenario}.json
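To locate the converted file from another script, the output path can be derived from the input path. The sketch below assumes that the .xml extension is replaced and that the file is written next to the input; neither detail is confirmed here, and output_path is a hypothetical helper.

```python
from pathlib import Path


def output_path(xml_path: str, scenario: str) -> Path:
    """Build the expected output path for a converted file.

    Assumption: "same name" means the input file name without its
    .xml extension, in the same directory as the input.
    """
    source = Path(xml_path)
    return source.with_name(f"{source.stem}-{scenario}.json")


converted = output_path("data/CDPrimeiroHAREMMiniHAREM.xml", "selective")
```

With these assumptions, converted points to data/CDPrimeiroHAREMMiniHAREM-selective.json.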
To run the tests, first install the test requirements, then run the test script:
$ pip install -r requirements_test.txt
$ HAREM_DATA_DIR=test_files/ python tests.py