Segmentation Corpora

This repository holds multiply segmented corpora from the papers below. The data formats follow the Segmentation Representation Specification Version 1.1 [PDF] and are of two types:

  • JSON (JavaScript Object Notation) or
  • TSV (Tab Separated Values)

To evaluate these corpora and to compute segmentation metrics, use the SegEval software package.


  • /kubla_khan_fournier_2013/ and kubla_khan_fournier_2013.json - Segmentations of the poem Kubla Khan by Samuel Taylor Coleridge (1816), codings collected by Fournier (2013); and
  • /stargazer_hearst_1997/ and stargazer_hearst_1997.json - Segmentations of Stargazers look for life by Baker (1990), codings collected by Hearst (1997).
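As a rough illustration of working with the JSON files, the sketch below summarizes each coder's segmentation using only the Python standard library. It assumes, without confirmation from the specification, that a file has a top-level "items" object mapping each document name to a mapping from coder name to a list of segment sizes. The function name summarize_codings, the sample document, and the coder names are hypothetical; check the specification for the actual layout.

```python
import json


def summarize_codings(dataset):
    """Count segments and total units per (document, coder) pair.

    ASSUMPTION: `dataset` looks like
    {"items": {document: {coder: [segment sizes...]}}}.
    This layout is a guess for illustration, not taken from the spec.
    """
    summary = {}
    for item_name, codings in dataset.get("items", {}).items():
        for coder, masses in codings.items():
            summary[(item_name, coder)] = {
                "segments": len(masses),  # number of segments the coder marked
                "units": sum(masses),     # total units covered by those segments
            }
    return summary


# Hypothetical in-memory sample standing in for a real corpus file.
sample_text = """
{
  "segmentation_type": "linear",
  "items": {
    "example_doc": {
      "coder_a": [2, 3, 1],
      "coder_b": [3, 3]
    }
  }
}
"""

sample = json.loads(sample_text)
summary = summarize_codings(sample)
for (item_name, coder), stats in sorted(summary.items()):
    print(item_name, coder, stats)
```

To try this on a file from this repository, such as kubla_khan_fournier_2013.json, replace json.loads(sample_text) with json.load on the opened file, provided the assumed layout holds. If every coder segmented the same text, the "units" totals should agree across coders, which makes a quick sanity check before computing metrics.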