Harvey Corpus Repository

This repository contains the annotation guidelines used to build the Harvey corpus of clinical text, together with redacted versions of the annotation files.

Files:

Annotation guidelines: guidelines.pdf
Syntactic chunk annotation (redacted): annotation/harvey-chunks-redacted.txt
Semantic expressions (redacted): annotation/harvey-expressions-redacted.txt

About Harvey

The Harvey corpus is a collection of linguistically annotated, de-identified clinical text. The data consists of primary care patient examination notes (GP notes) with three layers of linguistic annotation. The data was licensed to the PREP project at the University of Sussex and the Brighton and Sussex Medical School. The first annotation layer contains part-of-speech tags automatically assigned by cTAKES. The other two layers consist of manually annotated syntactic chunks and named entities (expressions).

References

@Article{Savkov2016,
author="Savkov, Aleksandar
and Carroll, John
and Koeling, Rob
and Cassell, Jackie",
title="Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus",
journal="Language Resources and Evaluation",
year="2016",
month="Sep",
day="01",
volume="50",
number="3",
pages="523--548",
issn="1574-0218",
doi="10.1007/s10579-015-9330-7",
url="https://doi.org/10.1007/s10579-015-9330-7"
}

Licence

The Harvey Corpus annotations and guidelines are released under the GPL licence.
