Syntactic chunk and semantic entity annotations, with guidelines, for the Harvey corpus of primary care text.

GNU General Public License v3.0 (GPL-3.0)

Harvey Corpus Repository

This repository contains the annotation guidelines used to build the Harvey corpus of clinical text, together with redacted versions of the annotations.

Files:

  • Annotation guidelines: guidelines.pdf
  • Syntactic chunk annotation (redacted): annotation/harvey-chunks-redacted.txt
  • Semantic expressions (redacted): annotation/harvey-expressions-redacted.txt

About Harvey

The Harvey corpus is a collection of linguistically annotated, de-identified clinical text. The data consists of primary care patient examination notes (GP notes) with three layers of linguistic annotation. The data was licensed to the PREP project at the University of Sussex and the Brighton and Sussex Medical School. The first annotation layer contains part-of-speech tags automatically assigned by cTAKES. The other two layers consist of manually annotated syntactic chunks and named entities (expressions).

References

@Article{Savkov2016,
author="Savkov, Aleksandar
and Carroll, John
and Koeling, Rob
and Cassell, Jackie",
title="Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus",
journal="Language Resources and Evaluation",
year="2016",
month="Sep",
day="01",
volume="50",
number="3",
pages="523--548",
issn="1574-0218",
doi="10.1007/s10579-015-9330-7",
url="https://doi.org/10.1007/s10579-015-9330-7"
}

Licence

The Harvey corpus annotations and guidelines are released under the GNU General Public License v3.0 (GPL-3.0).