Reference corpora for authorship attribution studies.
This repository contains three collections of novels developed for stylometric authorship attribution studies. Each collection contains seventy-five novels by twenty-five authors, with three novels per author.
Property | German | English | French |
---|---|---|---|
Source of the texts | TextGrid | Gutenberg | Ebooks libres et gratuits |
Range of original publication dates | 1774–1926 | 1838–1921 | 1827–1934 |
Total number of tokens | 10,354,989 | 11,771,901 | 7,401,126 |
Length of shortest novel (tokens) | 19,820 | 40,720 | 33,501 |
Length of longest novel (tokens) | 761,821 | 456,637 | 209,992 |
Mean length of novels (tokens) | 138,067 | 156,958 | 98,681 |
Standard deviation of novel length (tokens) | 134,857 | 85,890 | 42,194 |
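
The length statistics in the table can be recomputed for any of the collections along the following lines. This is a minimal sketch, not the procedure used to produce the figures above: it assumes one plain-text file per novel with a `.txt` extension, whitespace tokenization, and the sample standard deviation, none of which is confirmed by this repository. The function names `count_tokens`, `length_stats`, and `corpus_stats` are hypothetical.

```python
from pathlib import Path
from statistics import mean, stdev


def count_tokens(text: str) -> int:
    """Count tokens by splitting on whitespace (an assumed tokenization)."""
    return len(text.split())


def length_stats(token_counts: list[int]) -> dict[str, float]:
    """Summarize a list of per-novel token counts."""
    return {
        "total": sum(token_counts),
        "shortest": min(token_counts),
        "longest": max(token_counts),
        "mean": mean(token_counts),
        # Sample standard deviation; the table may use the population form.
        "stdev": stdev(token_counts),
    }


def corpus_stats(directory: str) -> dict[str, float]:
    """Compute length statistics over every .txt file in a directory.

    The directory layout (one UTF-8 .txt file per novel) is an assumption.
    """
    counts = [
        count_tokens(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.txt"))
    ]
    return length_stats(counts)
```

Calling `corpus_stats` with the path to one collection's folder returns a dictionary with the same five quantities as the table rows. Results will differ from the published figures if the original counts were produced with a different tokenizer.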