Inspect MARC data from Alma via OAI-PMH using Solr
Install the required gems with bundle
bundle install
Don't need the solr_wrapper dependency?
bundle install --without=development
If needed, start a Solr instance (you may want to use a separate terminal for this)
bundle exec solr_wrapper
Load data from Alma (see the configuration section for more)
bundle exec ruby ./load.rb
Get a CSV with some stats (see the reporting section for more)
bundle exec ruby ./report.rb # ==> ./data.csv
Clear all the data out of Solr
bundle exec ruby ./clean.rb
There are two configuration options: the URL to Solr, and the name of the OAI set in Alma.
The former is used by all three scripts in this repo, while the latter is only used by load.rb.
These can be changed by setting the environment variables SOLR_URL and ALMA_SET, respectively.
By default, the Solr URL is the one used by solr_wrapper (http://127.0.0.1:8983/solr/alma-data-core), and the Alma set is blacklighttest.
If you wanted to use a set called 'sunspot', you could run
ALMA_SET=sunspot bundle exec ruby ./load.rb
Or, if you have a permanent Solr at https://example.com/solr/sunspot-core, you could run
SOLR_URL="https://example.com/solr/sunspot-core" bundle exec ruby ./report.rb
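As a rough illustration of how a script could pick up these two settings, here is a minimal sketch. The constant and method names are hypothetical and are not taken from the scripts in this repo; only the variable names and default values come from the description above.

```ruby
# Defaults as described above. Constant and method names are illustrative only.
DEFAULT_SOLR_URL = 'http://127.0.0.1:8983/solr/alma-data-core'
DEFAULT_ALMA_SET = 'blacklighttest'

# Build the settings from an environment-like hash (ENV by default),
# falling back to the defaults when a variable is not set.
def settings(env = ENV)
  {
    solr_url: env.fetch('SOLR_URL', DEFAULT_SOLR_URL),
    alma_set: env.fetch('ALMA_SET', DEFAULT_ALMA_SET)
  }
end

puts settings.inspect
```

Taking the environment as a parameter keeps the lookup easy to exercise without touching the real ENV.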
Example Solr fields could be f_001_isi, d_040_ssim, i_028_ind2_ssim, or s_040_a_isi.
The pieces of a field name between the underscores each have some meaning, starting with the prefix.
Prefixes come in five flavors:
- f_ fields count the number of times that field occurred in a MARC record.
- c_ fields are for control fields, and contain the value of the MARC field.
- d_ fields are for data fields, and contain the subfield codes from that MARC field.
- s_ fields are for subfields, and contain the values for that subfield.
- i_ fields are for indicators, and contain the values for that indicator.
The next piece is the field's tag from the MARC record.
s_ and i_ Solr fields have an extra piece here.
For i_ fields, it is either ind1 or ind2, corresponding to which indicator they describe.
For s_ fields, this piece is the subfield code.
Finally, each field has a suffix: either _isi or _ssim.
If the suffix is _isi, the Solr field is a count of how many times the field was found in the MARC record.
If the suffix is _ssim, the Solr field contains the values from the MARC field.
With suffixes, there are some caveats:
- There are no f_*_ssim fields, as the values in tags are present in other places (c_*_ssim or s_*_ssim fields).
- There are no c_*_isi fields, as that data would match the f_*_isi fields.
- d_*_ssim fields only contain unique elements, rather than all the subfield codes.
- d_*_isi fields do count the total number of times the subfields appeared.
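To make the difference between the f_ and d_ counts concrete, here is a small sketch. The DataField struct and the data_field_stats method are hypothetical stand-ins, not code from load.rb, and the sketch assumes that d_*_isi is the total number of subfields across every occurrence of the tag.

```ruby
# Hypothetical in-memory stand-in for one MARC data field.
# subfields is an array of [code, value] pairs.
DataField = Struct.new(:tag, :subfields)

# Build the f_ and d_ entries for the data fields of a single record.
def data_field_stats(fields)
  doc = {}
  fields.group_by(&:tag).each do |tag, group|
    codes = group.flat_map { |field| field.subfields.map(&:first) }
    doc["f_#{tag}_isi"] = group.size   # how many times the tag occurred
    doc["d_#{tag}_ssim"] = codes.uniq  # unique subfield codes only
    doc["d_#{tag}_isi"] = codes.size   # assumed: every subfield occurrence
  end
  doc
end

record = [
  DataField.new('040', [%w[a DLC], %w[c DLC], %w[d OCL], %w[d XYZ]])
]
doc = data_field_stats(record)
puts doc.inspect
```

In this example the single 040 field repeats subfield d, so d_040_ssim lists three codes while d_040_isi is 4.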
Using the example fields from above (f_001_isi, d_040_ssim, i_028_ind2_ssim, or s_040_a_isi), we now know that, for a single MARC record or Solr document:
- f_001_isi counts how many times the 001 field appeared.
- d_040_ssim contains all the unique subfield codes for the 040 field.
- i_028_ind2_ssim contains all the values for the second indicator for the 028 field.
- s_040_a_isi counts how many times the 040$a subfield occurred.
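The naming scheme can also be read mechanically. The following is a sketch of a parser for these field names; the regular expression and the parse_field_name method are illustrative and not part of this repo. It assumes the tag itself contains no underscore, so names built from escaped tags would need extra handling.

```ruby
# prefix, tag, optional extra piece (ind1/ind2 or a subfield code), then suffix.
FIELD_NAME = /\A(?<prefix>[fcdsi])_(?<tag>[^_]+)(?:_(?<extra>.+?))?_(?<suffix>isi|ssim)\z/

# Split a Solr field name into its pieces, or return nil if it does not fit.
def parse_field_name(name)
  match = FIELD_NAME.match(name)
  return nil unless match

  {
    prefix: match[:prefix],
    tag: match[:tag],
    extra: match[:extra], # nil for f_, c_, and d_ fields
    suffix: match[:suffix]
  }
end

%w[f_001_isi d_040_ssim i_028_ind2_ssim s_040_a_isi].each do |name|
  puts "#{name}: #{parse_field_name(name).inspect}"
end
```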
Additionally, each document contains some other fields:
- id, which should be the same as the c_001_ssim field.
- l_ldr_ssim contains the value of the leader, and (for completeness) l_ldr_isi counts the number of times a leader was found.
- escaped_tags_ssim and escaped_codes_ssim will have any tags and subfield codes (respectively) that had to be changed to fit Solr's field name limitations.
- raw_tsi contains the raw MARC XML, if the correct variable was set in the load.rb script when it was run.
The load.rb script takes some effort to ensure an ingest completes successfully.
The most common issue is with non-conforming MARC data.
Solr does not allow certain non-alphanumeric characters in its field names, so the two fields escaped_tags_ssim and escaped_codes_ssim capture offenders for easy identification.
For example, if a record has an 856 tag with a '|' subfield code, the escaped_codes_ssim field will have '856_|' in its values.
You can still find the value of that subfield: it will be in the s_856___ssim field on the Solr doc.
All offending characters are replaced with underscores ('_').
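The escaping rule can be sketched as follows. The method names are hypothetical, and the exact set of characters that load.rb treats as offending is not stated here, so this sketch assumes that anything other than an ASCII letter or digit is replaced.

```ruby
# Assumption: every character that is not an ASCII letter or digit is an offender.
def escape_for_solr(piece)
  piece.gsub(/[^A-Za-z0-9]/, '_')
end

# Name of the s_ field for a tag and subfield code, with both pieces escaped.
def subfield_field_name(tag, code, suffix)
  "s_#{escape_for_solr(tag)}_#{escape_for_solr(code)}_#{suffix}"
end

# Entry to record in escaped_codes_ssim, or nil when the code needed no change.
def escaped_code_entry(tag, code)
  escape_for_solr(code) == code ? nil : "#{tag}_#{code}"
end

puts subfield_field_name('856', '|', 'ssim')
puts escaped_code_entry('856', '|')
```

With the 856 example above, the field name comes out as s_856___ssim: one underscore separates the tag from the code, one stands in for the '|', and one precedes the suffix.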
Less likely, but still handled, are records with multiple or missing 001s, leaders, or other control fields. In these cases, the script will use a fallback identifier rather than guessing which value is correct. The fallback identifiers start at zero and increment by one for each case found.
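One way to picture the fallback behaviour is the sketch below. The FallbackIds class is hypothetical and the real format of the fallback identifiers in load.rb is not described here; the sketch simply hands out integers starting at zero whenever a record does not have exactly one 001 value.

```ruby
# Hypothetical helper that hands out fallback identifiers, starting at zero.
class FallbackIds
  def initialize
    @next = 0
  end

  # Use the 001 value only when exactly one is present;
  # otherwise return the next fallback identifier.
  def id_for(control_numbers)
    return control_numbers.first if control_numbers.size == 1

    fallback = @next
    @next += 1
    fallback
  end
end

ids = FallbackIds.new
puts ids.id_for(['991234']).inspect
puts ids.id_for([]).inspect
puts ids.id_for(%w[991 992]).inspect
```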
Finally, the script will retry its OAI and Solr queries a limited number of times before accepting defeat.
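A bounded retry of this kind could look roughly like the sketch below. The with_retries method and its default of three attempts are assumptions for illustration; the actual limit and any delay between attempts in load.rb are not stated here.

```ruby
# Run the block, retrying on any StandardError until max_attempts is reached.
# The final failure is re-raised to the caller.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    retry if attempts < max_attempts
    raise
  end
end

# Example: a block that fails on its first call and succeeds on the second.
tries = 0
outcome = with_retries do
  tries += 1
  raise 'temporary failure' if tries < 2

  'done'
end
puts outcome
```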
WARNING! Rerunning the ingest script will overwrite the import.log file.
The report takes all the different _isi fields and produces two statistics for each: how many MARC records (or Solr documents) have that field, and how many occurrences of that field exist across all the records.
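The two statistics can be illustrated with a small sketch over in-memory documents. The isi_stats method, the sample documents, and the column names are hypothetical; the real report.rb queries Solr, and the layout of its data.csv is not described here.

```ruby
# For every *_isi field, count the documents that have it and sum its values.
def isi_stats(docs)
  stats = Hash.new { |hash, key| hash[key] = { records: 0, occurrences: 0 } }
  docs.each do |doc|
    doc.each do |field, count|
      next unless field.end_with?('_isi') && count.positive?

      stats[field][:records] += 1
      stats[field][:occurrences] += count
    end
  end
  stats
end

# Made-up sample documents, for illustration only.
docs = [
  { 'id' => '1', 'f_001_isi' => 1, 's_040_a_isi' => 1 },
  { 'id' => '2', 'f_001_isi' => 1, 's_040_a_isi' => 2, 'f_650_isi' => 3 }
]

stats = isi_stats(docs)
rows = [%w[field records occurrences]]
stats.sort_by { |field, _| field }.each do |field, stat|
  rows << [field, stat[:records], stat[:occurrences]]
end
# Field names here contain no commas, so a plain join is enough for this sketch.
puts rows.map { |row| row.join(',') }
```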
WARNING! Rerunning the report script will overwrite the data.csv file.
Coming soon!