/hathi

A collection of scripts for enhancing HathiTrust records

Primary Language: XSLT

Important

Archived May 2024. No longer used or open for changes.

Get MARC Records From Hathi OAI

The first step is to harvest MARC records from the HathiTrust OAI feed for volumes that are in the public domain or have been opened by a rights holder.
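
As a rough illustration of what such a harvest request looks like, the sketch below builds an OAI-PMH ListRecords URL. The verb, metadataPrefix, and set parameters are standard OAI-PMH. The base URL, the set name, and the function name oai_list_records_url are placeholders assumed here, not values taken from this repository. marc21 is a commonly used prefix for MARCXML, but the prefix should be confirmed against the feed's ListMetadataFormats response.

```shell
#!/bin/sh
# Hypothetical helper: build an OAI-PMH ListRecords request URL.
# Arguments: base URL, metadataPrefix, set (all supplied by the caller).
oai_list_records_url() {
  base="$1"
  prefix="$2"
  set_spec="$3"
  printf '%s?verb=ListRecords&metadataPrefix=%s&set=%s\n' "$base" "$prefix" "$set_spec"
}

# Example with placeholder values; substitute the real feed URL and set name.
oai_list_records_url "https://oai.example.org/oai" "marc21" "example:pd"
```

The resulting URL could be fetched with curl and the response saved as an .xml file for the later steps. Large result sets are paged, so a real harvester would also have to follow the resumptionToken returned in each response.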

Use HathiFiles to Enhance OAI MARC Records

OAI MARC records don't include MARC 245 subfield p or a government documents indicator. Use the HathiFiles metadata to add both, matching on the HathiTrust record number extracted from the OAI harvest.

Retrieve the latest hathifile

OUTDIR="output"
# Make sure the output directory exists before downloading into it
mkdir -p "$OUTDIR"
curl https://www.hathitrust.org/filebrowser/download/177119 -o "$OUTDIR/hathi_full.txt.gz"
gunzip "$OUTDIR/hathi_full.txt.gz"

Get HathiTrust Record Numbers from OAI harvest

OAIFILES="../oai_harvester/harvested/to-process/*.xml"
mkdir -p "$OUTDIR/ids"
echo "Extracting record numbers..."
# $OAIFILES is intentionally unquoted so the shell expands the glob
for f in $OAIFILES
do
  name=$(basename "$f" .xml)
  saxon -o "$OUTDIR/ids/$name.txt" "$f" extractIdentifiers.xsl
done
echo "Sorting..."
sort "$OUTDIR"/ids/*.txt -o "$OUTDIR/all-OAI-ids.txt"

Use Hathitrust record IDs to create ID/title lookup from HathiFiles

  • The result file includes three data elements: HT record number, HathiFiles title, and gov docs flag.

java -jar parseHathiFiles.jar "$OUTDIR/hathi_full.txt" "$OUTDIR/IdToTitle" "$OUTDIR/all-OAI-ids.txt"
sort "$OUTDIR/IdToTitle" -o "$OUTDIR/IdToTitleSorted"
perl TitleLookup.pl "$OUTDIR/IdToTitleSorted"
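
The jar's source is not shown here, so the following is only a sketch of the kind of lookup this step appears to produce, written with awk and not taken from the jar. It assumes the hathifile is tab-delimited with the HathiTrust record number in column 4, the title in column 12, and the US government document flag in column 16; check these positions against the HathiFiles field description before relying on them. The function name build_lookup is hypothetical, and emitting only one row per record number is also an assumption.

```shell
#!/bin/sh
# Sketch: emit "record number <TAB> title <TAB> gov docs flag" for every
# hathifile row whose record number appears in the ID list.
# Column positions (4, 12, 16) are assumptions about the hathifile layout.
build_lookup() {
  ids="$1"        # one record number per line
  hathifile="$2"  # tab-delimited hathifile
  awk -F '\t' -v OFS='\t' '
    # First file: load the ID list into a lookup table
    NR == FNR { wanted[$1] = 1; next }
    # Second file: print the first row seen for each wanted record number
    ($4 in wanted) && !seen[$4]++ { print $4, $12, $16 }
  ' "$ids" "$hathifile"
}
```

A call such as build_lookup "$OUTDIR/all-OAI-ids.txt" "$OUTDIR/hathi_full.txt" would write the lookup rows to standard output. The sketch expects a non-empty ID list, because the NR == FNR test cannot tell the two files apart when the first one is empty.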

Merge titles in harvested records

mkdir -p "$OUTDIR/xml"
for f in $OAIFILES
do
  name=$(basename "$f" .xml)
  echo "Processing $name..."
  saxon -o "$OUTDIR/xml/$name.xml" "$f" mergeHathiFiles.xsl
  echo "done."
done

Compress files and copy to bonnet for harvesting

tar cvfz hathi_enhanced_2017-02-07.tar.gz $OUTDIR/xml/*.xml
scp hathi_enhanced_2017-02-07.tar.gz exlibris@bonnnet.bc.edu:primo/hathitrust/
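
The archive name above hard-codes a date. A small variation, offered as a suggestion and not as part of the original workflow, derives the stamp from the current date. The function name archive_name and the variable ARCHIVE are introduced here for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print an archive name stamped with today's date
# in the same YYYY-MM-DD form used by the hard-coded file name.
archive_name() {
  printf 'hathi_enhanced_%s.tar.gz\n' "$(date +%Y-%m-%d)"
}

ARCHIVE=$(archive_name)
echo "Archive will be written to $ARCHIVE"
```

The value of $ARCHIVE could then stand in for the hard-coded file name in both the tar and scp commands.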