A structured list of textual corpora, created for use with corpus downloader. This list is read by the Python module and command-line program corpus-downloader, originally developed for use in DHBox.
Do you own or maintain a textual corpus? Please fork this repository and add it to the list. Here are a list of current fields and their descriptions:
shortname
: a short and easy-to-type name for your corpus, e.g. shc
for the corpus “Shakespeare His Contemporaries.”
title
: the full name for your corpus, e.g. “Shakespeare His Contemporaries.”
categories
: a disciplinary label you can assign to your corpus. If if it’s mostly of interest to classics scholars, write “classics.” Current values include “literature,” “classics,” and “history,” but feel free to add your own.
languages
: The ISO 649-2 language code for the language(s) of your corpus. Examples include deu
(German), eng
(English), enm
(Middle English), and fra
(French). For a more complete list, check here.
text
: information about the actual text(s) of your corpus.
markup
: how the text is encoded. E.g. TXT
(plain text), HTML
(HTML files), or TEI
(TEI XML).
url
: the URL for your textual corpus, e.g. http://www.folgerdigitaltexts.org/download/txt/FolgerDigitalTexts_TXT_Complete.zip
file-format
: the file format of the URL above. In the above example, zip
.
- shortname: perseus-c-greek
title: Perseus Canonical Greek
categories: classics
authors: multiple
languages: grc
text:
markup: TEI
url: https://github.com/PerseusDL/canonical-greekLit.git
file-format: git