/dach-gt

Ground truth and full text for selected prints of German libraries

Primary language: Shell. License: Creative Commons Zero v1.0 Universal (CC0-1.0).

Ground truth and full text for selected prints of German archives and libraries

Collection of useful commands

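# Remove carriage returns and blank (empty or whitespace-only) lines from ALTO and PAGE XML.
# Single quotes keep the shell from interpreting any part of the Perl code.
perl -i -ne 'tr|\r||d; next if /^\s*$/; print' *.xml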

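# Remove ALTO files without full text, i.e. files in which no CONTENT attribute has a non-empty value.
# Run this only in a directory of ALTO files: a PAGE file normally contains no CONTENT attribute and would be deleted as well.
# The pattern CONTENT="[^"] does not match CONTENT="" when other attributes follow on the same line.
# --null and xargs -0 keep file names with spaces intact.
grep -L --null 'CONTENT="[^"]' *.xml | xargs -0 rm -f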

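# Remove PAGE files without full text, i.e. files with no non-empty <Unicode> element on a single line.
# Run this only in a directory of PAGE files: an ALTO file contains no <Unicode> element and would be deleted as well.
# --null and xargs -0 keep file names with spaces intact.
grep -L --null '<Unicode>..*</Unicode>' *.xml | xargs -0 rm -f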
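Since both removal commands delete files irreversibly, a preview of what would be removed can help. The following is a minimal sketch, not part of the repository: the function name `list_without_match` and its two parameters (a directory and a grep pattern) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the repository):
# print the *.xml files in directory $1 that contain no line matching the
# basic regular expression $2. Nothing is deleted.
list_without_match() {
    dir=$1
    pattern=$2
    # grep -L prints the names of files in which the pattern never matches.
    # If the directory holds no .xml file, the glob stays unexpanded and grep
    # reports an error, which is hidden here.
    grep -L -e "$pattern" -- "$dir"/*.xml 2>/dev/null
    # grep's exit status is ignored: an empty list is a valid result.
    return 0
}
```

Calling `list_without_match . PATTERN` with the same pattern that is passed to `grep -L` in the removal command prints the candidates for deletion in the current directory, one per line.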