A collection of MODS records from the LC web archives with exploratory activities for manipulating XML with Python in Jupyter Notebooks.

LCWA MODS

This repo contains sample data for testing scripts that process text and XML. The data comes from the Library of Congress Web Archives (LCWA), a program that has been selecting, harvesting, and preserving Web sites since 2000. It is an unsystematically collected sample of metadata for 28 sites preserved in the Web archives at the Library of Congress, gathered for practicing basic operations on and manipulations of XML. The metadata follows the structure defined in the Metadata Object Description Schema (MODS), a format initially developed in 2002 for communicating resource descriptions among libraries and archives. The data is described in more detail below. In addition to this readme, this git repository contains:

The Jupyter notebook explains how to download MODS records for use as sample data when practicing XML parsing in Python. These sample files were generated in August 2018 from MODS metadata records for archived Web sites collected by the Library of Congress.

Those familiar with the Library may know about the LCCN, a general control number that provides unique identifiers for most items held by the Library of Congress. The Web archives described in these MODS records do not have LCCNs. Instead, an LCWA (Web Archives) identifier serves as the unique identifier for each metadata record.

These are newer MODS records that don't have Library of Congress Subject Headings (LCSH):

lcwaN0010234,lcwaN0001999,lcwaN0003238,lcwaN0010144,lcwaN0010145,
lcwaN0012178,lcwaN0012179,lcwaN0012180,lcwaN0012184,lcwaN0012195,
lcwaN0010932,lcwaN0010933,lcwaN0010936,lcwaN0010937,lcwaN0010940

These have LCSH in <subject>:

lcwaN0010888,lcwaN0010226,lcwaN0009692,lcwaN0009700,lcwaN0010401

These are election sites whose <subject> elements include both LCSH headings and "local" headings marked as "lcwat". The "lcwat" headings come from a taxonomy that was developed for the quick categorization of sites during the nomination and harvesting process:

lcwaE0008846,lcwaE0008263,lcwaE0008338,lcwaE0008918,lcwaE0008001
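
To illustrate the distinction between LCSH and "lcwat" headings, here is a minimal Python sketch that groups the <subject> elements of a MODS record by their authority attribute. The sample record is hand-written for illustration and is not copied from the repository, the function name `subjects_by_authority` is hypothetical, and the sketch assumes the records use the standard MODS version 3 namespace and mark the heading source in the `authority` attribute of <subject>.

```python
import xml.etree.ElementTree as ET

# Namespace URI for MODS version 3, in ElementTree's "{uri}tag" form.
MODS = "{http://www.loc.gov/mods/v3}"

# Hypothetical record for illustration only; not taken from the sample data.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example Campaign Site</title></titleInfo>
  <subject authority="lcsh"><topic>Elections</topic></subject>
  <subject authority="lcwat"><topic>Government and politics</topic></subject>
  <subject><topic>Uncategorized heading</topic></subject>
</mods>"""


def subjects_by_authority(mods_xml):
    """Group subject headings by the value of their authority attribute."""
    root = ET.fromstring(mods_xml)
    grouped = {}
    # iter() walks the whole tree, so this also works if <mods> is wrapped
    # in an outer element such as <modsCollection>.
    for subject in root.iter(MODS + "subject"):
        # Subjects without an authority attribute are filed under "none".
        authority = subject.get("authority", "none")
        # Join the text of child elements (topic, geographic, ...) into one heading.
        terms = [
            child.text.strip()
            for child in subject
            if child.text and child.text.strip()
        ]
        grouped.setdefault(authority, []).append(" -- ".join(terms))
    return grouped


headings = subjects_by_authority(SAMPLE_MODS)
print(headings)
```

To try this on one of the sample files, read the file contents into a string and pass it to the function in place of `SAMPLE_MODS`.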

These are some previous-generation records, which illustrate slight differences in format and naming convention:

  • lcwa00097019 Brazilian Presidential Election 2010 Web Archive
  • dfd3979a7fb56bb3acc06b7b0129633c,00853935a711639f58b0f35bae8d7781 Example from 2002 Winter Olympics and NYPL (September 11, 2001 Web Archive)
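
The identifier styles listed above can be told apart with a few regular expressions. The following is a rough Python sketch; the patterns and the labels ("lcwaN", "lcwaE", "lcwa-numeric", "hash") are inferred only from the identifiers shown in this readme and are assumptions, not an official naming rule, and `classify` and `split_ids` are hypothetical helper names.

```python
import re

# Patterns inferred from the identifiers listed in this readme (assumption).
PATTERNS = [
    ("lcwaN", re.compile(r"lcwaN\d{7}")),         # e.g. lcwaN0010234
    ("lcwaE", re.compile(r"lcwaE\d{7}")),         # e.g. lcwaE0008846
    ("lcwa-numeric", re.compile(r"lcwa\d{8}")),   # e.g. lcwa00097019
    ("hash", re.compile(r"[0-9a-f]{32}")),        # 32 hexadecimal characters
]


def classify(identifier):
    """Return the label of the first pattern that matches the whole identifier."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


def split_ids(line):
    """Split a comma-separated line of identifiers, ignoring empty entries."""
    return [part.strip() for part in line.split(",") if part.strip()]


ids = split_ids("lcwaN0010234,lcwaE0008846,lcwa00097019,")
labels = [classify(i) for i in ids]
print(list(zip(ids, labels)))
```

Filtering empty entries in `split_ids` keeps a trailing comma at the end of a line from producing a blank identifier.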