The texts used for building Archive for Danish Literature

Public ADL text sources

The Archive for Danish Literature, ADL, comes to you via a collaboration between

As of this writing, the corpus comprises 498 volumes with a total of 165512 pages of Danish literature. The whole corpus has been encoded using TEI, but only about two-thirds of the pages have been subject to OCR and text encoding. This repository contains all of those texts.

We also describe our data, in particular our encoding practices, and give information on how we envisage that submissions could be structured.

Getting text

As you might have noticed, all the texts are in an XML format called Text Encoding Initiative (TEI). For many purposes, if not all, that is a good format.

If you want to extract texts from the files, you can use the following scripts:

  1. get_titles.xsl
  2. get_the_text.xsl
  3. extract_stuff.sh

The first one (get_titles.xsl) creates a list of works inside a TEI file.

xsltproc  get_titles.xsl texts/hcaeventyr01val.xml 
workid57967;Eventyr, fortalte for Børn. Første Samling. Første Hefte. 1885.
workid58084;Fyrtøiet
workid59091;Lille Claus og store Claus
workid61051;Prindsessen paa Ærten
workid61317;Den lille Idas Blomster
workid62461;Eventyr, fortalte for Børn. Første Samling. Andet Hefte. 1885.
workid62544;Tommelise
workid64209;Den uartige Dreng
workid64656;Reisekammeraten

...
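If you want to process that listing further, each line is a work identifier and a title separated by a semicolon. The following is a minimal Python sketch of how such output could be parsed. It is not part of the repository, and the function name parse_titles is hypothetical; the sample lines are copied from the output above.

```python
def parse_titles(output):
    """Split each 'workid;title' line into a (work_id, title) pair."""
    works = []
    for line in output.splitlines():
        line = line.strip()
        if not line or ";" not in line:
            continue  # skip blank or malformed lines
        # Split on the first ';' only, since a title may itself contain one.
        work_id, title = line.split(";", 1)
        works.append((work_id, title))
    return works


# Sample lines taken from the get_titles.xsl output shown above.
sample = """workid58084;Fyrtøiet
workid59091;Lille Claus og store Claus
workid61051;Prindsessen paa Ærten"""

for work_id, title in parse_titles(sample):
    print(work_id, title)
```

Splitting on the first semicolon only keeps titles intact even if they contain a semicolon themselves.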

The second script (get_the_text.xsl) creates one text file per title in the TEI file.
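If you prefer not to use XSLT, the same idea can be sketched in Python with the standard library. This is an illustration only, not a description of what get_the_text.xsl does. It assumes that the work identifiers (such as workid58084) appear as xml:id attributes on elements in the TEI file; the function name text_of_work and the small sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as ElementTree reports it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def text_of_work(tei_xml, work_id):
    """Return the plain text below the element whose xml:id is work_id.

    Returns None if no element carries that identifier.
    """
    root = ET.fromstring(tei_xml)
    for element in root.iter():
        if element.get(XML_ID) == work_id:
            # itertext() yields every text node below the element.
            # Text nodes are concatenated as-is, so words are separated
            # only where the source file itself contains whitespace.
            raw = "".join(element.itertext())
            return " ".join(raw.split())  # collapse runs of whitespace
    return None


# Hypothetical, heavily simplified TEI document (assumed structure).
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div xml:id="workid1">
      <head>First title</head>
      <p>Some text.</p>
    </div>
    <div xml:id="workid2">
      <head>Second title</head>
      <p>Other text.</p>
    </div>
  </body></text>
</TEI>"""

print(text_of_work(sample, "workid2"))
```

Looking up the element by xml:id avoids depending on a particular element name, which may differ between files.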

Finally, you can adapt the shell script extract_stuff.sh to do both things directly.

Contributing documents

Projects with relevant scope can contribute documents to ADL, provided that

  • Copyright issues are resolved
  • They are accepted by DSL and KB
  • The XML is valid TEI
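Before submitting, a quick sanity check can catch the most basic problems. The Python sketch below only tests that a file is well-formed XML and that its root element is TEI in the TEI namespace. It is not schema validation, which requires a TEI schema and a validating tool, and it assumes that each file has a single TEI root element. The function name looks_like_tei is hypothetical.

```python
import xml.etree.ElementTree as ET

# Root element name in the TEI P5 namespace, in ElementTree notation.
TEI_ROOT = "{http://www.tei-c.org/ns/1.0}TEI"


def looks_like_tei(xml_text):
    """Return True if xml_text is well-formed XML with a TEI root element.

    This is a rough pre-check only; it does not validate against a schema.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML
    return root.tag == TEI_ROOT


print(looks_like_tei('<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'))
print(looks_like_tei("<TEI/>"))  # root element is not in the TEI namespace
```

A file that fails this check cannot be valid TEI, but a file that passes may still be invalid.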

Contributions are received through a branch and a pull request, as is the usual practice on GitHub.