This repository contains scripts to analyze, convert, and publish subject indexing data from the K10plus catalog.
The data is reduced and cleaned up in four sets:

1. Full K10plus dumps (PICA+): not published so far
2. Subject-related fields (PICA+): https://doi.org/10.5281/zenodo.6810555
3. Vocabularies, identifiers, and sources: not published so far
4. Syntactically valid identifiers of known vocabularies, aka normalized subject indexing data (TSV and RDF): https://doi.org/10.5281/zenodo.7016625
If you are not interested in sources and cataloging details, set 4 is the best choice. For the full picture, use set 2.

Requires pica-rs >= 0.13.0, Bash, and Perl to be installed.
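A quick sanity check that the required tools are available (a minimal sketch; the exact version output differs between tools):

```bash
# Check that pica-rs (>= 0.13.0), Perl, and Bash are installed
pica --version
perl --version
bash --version
```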
- Processing from set 1 to 2 is implemented in `extract-subject-fields.sh`.
- Processing from set 2 to 3 is implemented in `extract-subjects.pl`:
  `cat kxp-subjects-sample_2021-06-30*.dat | ./extract-subjects.pl > subjects.tsv`
- Processing from set 3 to 4 is implemented in `clean-subjects.pl` and `triples.pl`. An additional error file `invalid-ids.tsv` is created during processing.
- Final statistics are created in `stats.sh`.

See `Makefile` for execution; a rough sketch of the full chain is shown below.
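This is a hedged sketch of how the scripts could be chained by hand: apart from the `extract-subjects.pl` call documented above, the file names and script interfaces are assumptions, and the `Makefile` remains the authoritative recipe.

```bash
# Hypothetical end-to-end run; interfaces other than extract-subjects.pl
# (stdin to stdout, as shown above) are assumptions.
./extract-subject-fields.sh                                                     # set 1 -> set 2
cat kxp-subjects-sample_2021-06-30*.dat | ./extract-subjects.pl > subjects.tsv  # set 2 -> set 3
./clean-subjects.pl < subjects.tsv > kxp-subjects.tsv   # set 3 -> set 4 (TSV); invalid-ids.tsv is created here
./triples.pl < kxp-subjects.tsv > kxp-subjects.nt       # set 3 -> set 4 (RDF)
./stats.sh                                              # final statistics
```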
Dumps of subject indexing in the K10plus catalog are published at quarterly to yearly intervals. Each dump is around 15 gigabytes (around 10% of the full K10plus data) and split into multiple files. Copies of the full dump may be found at https://analytics.gbv.de/dumps/kxp/. Data is provided in PICA Normalized format with one record per line.
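Because the dump files contain one record per line, basic inspection works with standard shell tools. A small sketch (the file name pattern is illustrative, and dump parts may be gzipped):

```bash
# Count records across all parts of a dump (one PICA record per line);
# zcat -f also passes through uncompressed files.
zcat -f kxp-dump_*.dat* | wc -l
```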
The data is reduced to the data fields used for subject indexing in the K10plus catalog and limited to records with at least one library holding. Records without any subject indexing are omitted. See the K10plus format documentation and the file README.md of the data publication for details.
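The reduction itself is done by the scripts in this repository, but for illustration, a filter of this kind can be expressed with pica-rs. The field tags and the expression below are placeholders, not the actual rules:

```bash
# Illustrative sketch only: keep records that contain at least one of two
# placeholder subject fields; the real field list is defined in extract-subject-fields.sh.
pica filter "045E? || 041A?" k10plus-dump.dat -o kxp-subjects.dat
```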
The script `zenodo-get` in this repository can be used to automatically download the published data from Zenodo.
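As a manual alternative, the file links of a Zenodo record can be looked up via the Zenodo API; the record number below is derived from the DOI of the normalized data set (assumed URL pattern):

```bash
# Fetch record metadata (including file download links) from the Zenodo API.
curl -s "https://zenodo.org/api/records/7016625"
```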
The TSV file `subjects.tsv` can be generated with `extract-subjects.pl`; it is not published yet. The file contains four columns:

- Bibliographic record identifier (PPN)
- Vocabulary symbol
- Notation or identifier in the vocabulary (subfield `$a` in most cases)
- Source (subfield `$A` in most cases, if available). Multiple sources are concatenated with separator `|`.
Note: Source field `$A` for GND contains other data as well unless this issue has been fixed, so use with care!
The data can also be used to detect cataloging errors such as invalid notations.
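For a quick overview, standard shell tools are enough; for instance, a sketch counting how many subject statements each vocabulary contributes (column 2):

```bash
# Count subject statements per vocabulary symbol (second column of subjects.tsv).
cut -f2 subjects.tsv | sort | uniq -c | sort -rn
```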
The normalized subject indexing data in TSV format only contains columns 1-3, limited to syntactically valid notations of selected vocabularies (see `vocabularies.json`).
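To work with a single vocabulary from the normalized TSV, filter on the second column; the file name and the vocabulary symbol "rvk" are assumptions for illustration:

```bash
# Extract rows of one vocabulary (symbol assumed to be "rvk") from the normalized TSV.
awk -F'\t' '$2 == "rvk"' kxp-subjects.tsv > rvk.tsv
```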
See https://coli-conc.gbv.de/subjects/ for an API that makes use of this data.
2022-03-31:
- RDF Dump includes skos:inScheme statements