deutsche-nationalbibliothek/pica-rs

Write PPN-prepended PICA to update and merge PICA dumps?

Closed this issue · 2 comments

I am not sure whether the following use case is relevant to pica-rs or can better be solved by other means: Given two PICA+ database dumps (normalized PICA format, one line per record), merge both but keep only the first of multiple records with the same PPN. This is required to update dumps with lists of changed and added records (deletion of records would require special treatment, e.g. filtering out records that contain only a PPN and no other fields, but this is not the hard part).

I guess the standard Unix tool sort is the best bet when the PPN is given as a separate key (sort -u -k1,1). For large datasets, the options --parallel and --buffer-size help to speed up sorting until disk I/O becomes the bottleneck. If both dumps are pre-sorted, sort -m -u -k1,1 will do.
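A minimal sketch of both variants, assuming GNU sort and input lines of the form "PPN, space, record". The function names, file names, and the values 4 and 1G are placeholders, not recommendations. LC_ALL=C is added here so that keys compare bytewise. GNU sort documents -u as keeping the first of each run of equal keys, whereas POSIX only promises one line per key, so listing the update file first is what should make its records win under GNU sort.

```shell
# merge_unsorted FILE...: full sort, one record per PPN (first field).
# --parallel and --buffer-size are GNU extensions; 4 and 1G are placeholders.
merge_unsorted() {
  LC_ALL=C sort -u -k1,1 --parallel=4 --buffer-size=1G "$@"
}

# merge_sorted FILE...: inputs are already sorted by PPN, so only merge them.
merge_sorted() {
  LC_ALL=C sort -m -u -k1,1 "$@"
}

# Tiny demo with made-up records in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '111 updated-111' '333 added-333' > "$tmp/update.txt"
printf '%s\n' '111 old-111' '222 old-222' > "$tmp/base.txt"
# With GNU sort this should keep the updated copy of 111, then 222 and 333.
merge_sorted "$tmp/update.txt" "$tmp/base.txt"
rm -r "$tmp"
```

Both inputs to merge_sorted have to be sorted under the same locale that the merge uses, otherwise the result is not guaranteed to be ordered.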

What's needed is a method to create normalized PICA files with the PPN prepended to each record (separated by a space), so that they can be processed by sort. Removing the PPN afterwards could be done with sed or by pica-rs. Example of the format:

012345X 003@ 012345X021A aEin Buchhzum Lesen

If such a feature does not suit pica-rs, using its PICA parsing library may help to write a performant script to prepend PPNs, right?
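As a stopgap that needs neither pica-rs nor its library, the prefix format can be sketched with awk. This assumes normalized PICA as described above: one record per line, fields terminated by the byte 0x1E, subfields introduced by 0x1F, and the PPN in subfield 0 of field 003@. The function name prepend_ppn is hypothetical, and this is an illustration rather than a validated parser.

```shell
# prepend_ppn [FILE...]: print "PPN record" for every input record.
prepend_ppn() {
  awk '
    BEGIN { FS = "\036" }                     # 0x1E ends each field
    {
      ppn = ""
      for (i = 1; i <= NF && ppn == ""; i++) {
        if (index($i, "003@ ") == 1) {        # the field that carries the PPN
          n = split($i, sf, "\037")           # sf[1] is the tag, sf[2..n] subfields
          for (j = 2; j <= n; j++)
            if (substr(sf[j], 1, 1) == "0") { # subfield code 0
              ppn = substr(sf[j], 2)
              break
            }
        }
      }
      print ppn " " $0                        # the record itself stays unchanged
    }
  ' "$@"
}

# Demo record built with printf so the control bytes are explicit.
printf '003@ \0370012345X\036021A \037aEin Buch\037hzum Lesen\036\n' | prepend_ppn
```

Records without a 003@ field get an empty key, that is, the output line starts with a space, so they sort to the front and are easy to spot.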

> I am not sure whether the following use case is relevant to pica-rs or can better be solved by other means: Given two PICA+ database dumps (normalized PICA format, one line per record), merge both but keep only the first of multiple records with the same PPN.

I have a similar, though not identical, requirement: see #585. Does concatenation of records without duplicates (based on the IDN value) solve your problem? If so, I would like to add a switch (hash or idn) for detecting duplicates in #585.

Otherwise, how about using the following procedure in order to produce your "prefix-format":

$ pica cat -s DUMP.dat.gz -o dump.dat
$ pica select "003@.0" dump.dat -o idn.csv
$ paste -d ' ' idn.csv dump.dat
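Because paste pairs lines purely by position, the two files only line up if pica select wrote exactly one value per record. The sketch below wraps the paste step in a line-count check and adds the reverse step that strips the prefix again. It assumes the idn.csv and dump.dat files produced by the commands above; the function names are hypothetical.

```shell
# prefix_records IDS DUMP: print "PPN record" lines, refusing misaligned inputs.
prefix_records() {
  ids=$1
  dump=$2
  n_ids=$(wc -l < "$ids")
  n_dump=$(wc -l < "$dump")
  # $(( )) normalizes the padded numbers some wc implementations print.
  if [ $((n_ids)) -ne $((n_dump)) ]; then
    echo "line counts differ: $ids has $((n_ids)), $dump has $((n_dump))" >&2
    return 1
  fi
  paste -d ' ' "$ids" "$dump"
}

# strip_ppn [FILE...]: drop everything up to and including the first space.
strip_ppn() {
  sed 's/^[^ ]* //' "$@"
}

# Possible use once both files exist (not executed here):
#   prefix_records idn.csv dump.dat | LC_ALL=C sort -u -k1,1 | strip_ppn > sorted.dat
```

Only the first space is treated as the separator when stripping, so the space that follows each field tag inside the record is left alone.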

> If such a feature does not suit pica-rs, using its PICA parsing library may help to write a performant script to prepend PPNs, right?

Yes, you can use the pica-record crate.

Thanks for the idea to use pica select and paste to sort records by PPN!

> Does concatenation of records without duplicates (based on the IDN value) solve your problem? If so, I would like to add a switch (hash or idn) for detecting duplicates in #585.

Yes, it would, but with our 80 million records a simple BTreeSet of PPN strings may not be a good solution. A PPN can be mapped to a uint32 (by removing the last character), and seen values could be tracked with a set of non-overlapping uint32 ranges. A BTreeSet of ranges would then do with O(log n) per lookup, but with sorted sets of records we can have O(1) per record, so I'd prefer sort and merge.
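The constant-cost argument for sorted input can be illustrated with a one-pass filter that remembers only the previous key, however many records have gone by. This is a sketch, assuming PPN-prefixed lines that are already sorted by PPN with the preferred copy of each record first; first_per_key is a hypothetical name.

```shell
# first_per_key [FILE...]: keep the first line of each run of equal PPNs.
first_per_key() {
  awk -F '[ ]' '
    { key = $1 "" }                    # force string comparison: "0123" is not "123"
    NR == 1 || key != prev { print }   # a new key starts a new run
    { prev = key }                     # the only state carried between records
  ' "$@"
}

# Demo on made-up, already sorted input.
printf '%s\n' '111 updated' '111 old' '222 old' '333 added' | first_per_key
```

The string concatenation matters because awk would otherwise compare numeric-looking fields as numbers and treat PPNs that differ only in leading zeros as equal.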