/temporal-annotation

Linguistic annotation of events, time expressions and links between them for economic news.

Primary language: Python

Temporal Relation Annotation

This is a temporal annotation project, tailored to the annotation of economic news. The annotation guidelines are in many ways a streamlining of TimeML, from which they draw deep inspiration.

This repository includes three things: the set-up files for annotation, a set of data annotated by three annotators, and the code needed to run our baseline classifier, which was trained on the preliminary annotations.

The project set-up files for use with MAE include:

  • MAE Set-up File (temporal_annoation.dtd)
  • Batched data of 100 articles in the annotation_data folder.
  • The full source data as Full-Economic-News_DFW-839861.csv, for reference.

The set of data annotated by three annotators can be found in raw_annotation_data, in subdirectories organized by annotator.

Code

The scripts for our classifier should be run in the following order.

Convert the source CSV to batched XML files for use in MAE:

  • csv_xml.py
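
The conversion step can be pictured roughly as follows. This is a minimal sketch, not the contents of csv_xml.py: the column name `text`, the root element name `TemporalAnnotation`, the `TEXT`/`TAGS` layout, and the batch size are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

def batch_rows(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def batch_to_xml(batch, root_tag="TemporalAnnotation", text_column="text"):
    """Build one XML string holding the batch text and an empty TAGS element.

    The element names are assumed, not taken from the project's DTD.
    """
    root = ET.Element(root_tag)
    text = ET.SubElement(root, "TEXT")
    # Join the article bodies in the batch, separated by blank lines.
    text.text = "\n\n".join(row[text_column] for row in batch)
    ET.SubElement(root, "TAGS")
    return ET.tostring(root, encoding="unicode")

def csv_to_batches(csv_text, batch_size=10, text_column="text"):
    """Parse CSV text and return one XML string per batch of rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [batch_to_xml(b, text_column=text_column)
            for b in batch_rows(rows, batch_size)]
```

Each returned string could then be written to its own file and opened in MAE together with the DTD.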

Extract tags and their position information from MAE-generated XML files:

  • process_tags.py
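
As an illustration of what tag extraction might involve, the sketch below reads an MAE-style XML string with the standard library. It assumes that annotations sit under a `TAGS` element and that extent tags carry a `spans="start~end"` attribute (with commas separating discontinuous spans); the project's actual attribute names may differ, and `extract_tags` is a hypothetical helper, not a function from process_tags.py.

```python
import xml.etree.ElementTree as ET

def extract_tags(xml_text):
    """Return one dict per tag under <TAGS>: its type, attributes, and offsets."""
    root = ET.fromstring(xml_text)
    tags_node = root.find("TAGS")
    if tags_node is None:
        return []
    tags = []
    for element in tags_node:
        record = {"type": element.tag, **element.attrib}
        spans = element.attrib.get("spans")
        if spans:
            # "7~11,20~25" -> [(7, 11), (20, 25)]; link tags have no spans.
            record["offsets"] = [
                tuple(int(n) for n in part.split("~"))
                for part in spans.split(",")
            ]
        tags.append(record)
    return tags
```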

Add “unspecified” tags to untagged adjacent pairs:

  • unspecified_generator.py
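
One way to read this step is sketched below. It assumes that "adjacent" means consecutive extent tags in text order and that a pair counts as tagged if a link exists in either direction; both are guesses about the project's definition, and the tuple layouts are hypothetical.

```python
def add_unspecified_links(extents, links):
    """Add an "unspecified" link for each adjacent pair that has no link.

    extents: list of (tag_id, start_offset)
    links:   list of (from_id, to_id, rel_type)
    """
    # Order tag ids by where they start in the text.
    ordered = [tag_id for tag_id, _ in sorted(extents, key=lambda e: e[1])]
    # Treat links as undirected when checking whether a pair is covered.
    linked = {frozenset((a, b)) for a, b, _ in links}
    result = list(links)
    for left, right in zip(ordered, ordered[1:]):
        if frozenset((left, right)) not in linked:
            result.append((left, right, "unspecified"))
    return result
```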

Read from the processed tags to calculate IAA scores by tag type:

  • IAA.py
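
The README does not state which agreement measure IAA.py uses. As one possible illustration only, the sketch below computes pairwise exact-match F1 between two annotators, grouped by tag type; the `(tag_type, start, end)` tuple format is an assumption.

```python
from collections import defaultdict

def agreement_by_type(tags_a, tags_b):
    """Return {tag_type: exact-match F1} for two sets of (tag_type, start, end)."""
    by_type_a, by_type_b = defaultdict(set), defaultdict(set)
    for tag in tags_a:
        by_type_a[tag[0]].add(tag)
    for tag in tags_b:
        by_type_b[tag[0]].add(tag)
    scores = {}
    for tag_type in set(by_type_a) | set(by_type_b):
        a, b = by_type_a[tag_type], by_type_b[tag_type]
        shared = len(a & b)
        # F1 between two sets reduces to 2 * shared / (|A| + |B|).
        scores[tag_type] = 2 * shared / (len(a) + len(b))
    return scores
```

With three annotators, the same function could be applied to each of the three annotator pairs and the results averaged per tag type.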

Select shared tags to create standard train/test datasets:

  • build_gold.py
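
A rough sketch of selecting shared tags and splitting them is given below. It assumes "shared" means produced identically by every annotator and uses a seeded shuffle with a hypothetical 80/20 split; build_gold.py may define both differently.

```python
import random

def shared_tags(annotations):
    """annotations: dict of annotator -> set of hashable tags.

    Returns the tags that every annotator produced.
    """
    sets = list(annotations.values())
    return set.intersection(*sets) if sets else set()

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle deterministically and return (train, test) lists."""
    ordered = sorted(items)  # fixed starting order so the seed is reproducible
    random.Random(seed).shuffle(ordered)
    n_test = int(len(ordered) * test_fraction)
    return ordered[n_test:], ordered[:n_test]
```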

Obtain bag-of-words and tag features from standard datasets:

  • extract_features.py
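
To make "bag-of-words and tag features" concrete, here is a small standard-library sketch. The choice of context text, the tokenizer, and the feature names (`word=`, `from_type=`, `to_type=`) are assumptions for illustration rather than the features extract_features.py actually produces.

```python
import re
from collections import Counter

def link_features(context, from_type, to_type):
    """Return a sparse feature dict for one link.

    Lowercased word counts from the context, plus the types of the two
    linked tags as indicator features.
    """
    words = re.findall(r"[a-z0-9']+", context.lower())
    features = Counter(f"word={w}" for w in words)
    features[f"from_type={from_type}"] = 1
    features[f"to_type={to_type}"] = 1
    return dict(features)

def vectorize(feature_dicts):
    """Map sparse dicts onto dense rows over a shared, sorted vocabulary."""
    vocabulary = sorted({name for d in feature_dicts for name in d})
    rows = [[d.get(name, 0) for name in vocabulary] for d in feature_dicts]
    return vocabulary, rows
```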

Implement a logistic regression model for link classification, train, and evaluate:

  • build_model.py
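
For readers unfamiliar with the model, the sketch below trains a binary logistic regression by batch gradient descent in plain Python and reports accuracy. It is a simplification for illustration: the project's link classifier presumably distinguishes more than two relation types and may rely on a library, and the hyperparameters here are arbitrary.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(rows, labels, epochs=500, learning_rate=0.5):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this example: p(y=1 | x) - y.
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

def accuracy(weights, bias, rows, labels):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(weights, bias, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)
```

A multi-class version could train one such model per relation type and pick the highest-scoring one.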

Datasets

annotation_data: all data from the news corpus

annotated_data:

  • Starting with “NEWS” and ending with annotator initials: annotated files from MAE
  • Starting with a number and ending with annotator initials: cleaned tag sets

raw_annotated_data: original files from annotators

silver.txt: all unique TLINK tags

features_silver.txt: feature vectors made from the silver dataset