refcor

Reference corpora for authorship attribution studies

This repository contains three collections of novels developed for stylometric authorship attribution studies. Each collection contains seventy-five novels by twenty-five authors, with three texts per author.

| | German | English | French |
|---|---|---|---|
| Source of the texts | TextGrid | Gutenberg | Ebooks libres et gratuits |
| Range of original publication dates | 1774–1926 | 1838–1921 | 1827–1934 |
| Total number of tokens | 10,354,989 | 11,771,901 | 7,401,126 |
| Length of shortest novel (tokens) | 19,820 | 40,720 | 33,501 |
| Length of longest novel (tokens) | 761,821 | 456,637 | 209,992 |
| Mean length of novels (tokens) | 138,067 | 156,958 | 98,681 |
| Standard deviation of novel length (tokens) | 134,857 | 85,890 | 42,194 |
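
The length statistics above could be recomputed along the following lines. This is a minimal sketch, not the procedure used to build the table: the tokenization rule (runs of word characters), the use of the sample standard deviation, and the function names `count_tokens` and `summarize` are all assumptions made for illustration, so the results may differ from the published figures.

```python
import re
import statistics


def count_tokens(text: str) -> int:
    # Assumption: a token is a maximal run of word characters.
    # The corpus authors' actual tokenizer is not documented here.
    return len(re.findall(r"\w+", text))


def summarize(token_counts: list[int]) -> dict[str, int]:
    # Summary statistics over per-novel token counts.
    # Assumption: sample standard deviation (n - 1 denominator).
    return {
        "total": sum(token_counts),
        "shortest": min(token_counts),
        "longest": max(token_counts),
        "mean": round(statistics.mean(token_counts)),
        "stdev": round(statistics.stdev(token_counts)),
    }


# Usage with made-up texts standing in for the novels of one collection.
texts = ["Call me Ishmael.", "It was the best of times, it was the worst of times."]
print(summarize([count_tokens(t) for t in texts]))
```

To apply this to a collection, read each novel's text from disk and pass the resulting list of counts to `summarize`; the file layout is left out because it is not described above.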