impresso/impresso-text-acquisition

`handelsztg` integrity issues


The sanity check scripts have spotted an inconsistency in six issues of handelsztg:

  • handelsztg-1884-01-22-a
  • handelsztg-1884-02-27-a
  • handelsztg-1884-04-21-a
  • handelsztg-1884-09-11-a
  • handelsztg-1884-11-17-a
  • handelsztg-1884-12-31-a

What happens is:

  • some page IDs are found in the page JSON files but not in the issue JSON files (s3://original-canonical-testing/handelsztg/issues/ vs s3://original-canonical-testing/handelsztg/pages/); see below for the full list
  • I've manually checked one of these six cases, and the issue JSON document is missing from the corresponding .bz2 file; checked with:
bzcat /tmp/handelsztg-1884-issues.jsonl.bz2 | jq --slurp '.[] | select(.id=="handelsztg-1884-12-31-a")'
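
To repeat the manual check for the other five issues without loading the whole archive into memory, something like the following Python sketch could be used. It assumes the archive is bz2-compressed JSON Lines (one issue document per line) and that each document carries its identifier in an `id` field, as the jq filter above implies. The function name `find_issue` is hypothetical and not part of the repository.

```python
import bz2
import json


def find_issue(archive_path, issue_id):
    """Return the first JSON document whose "id" equals issue_id, or None.

    Assumes archive_path is a bz2-compressed JSON Lines file.
    """
    with bz2.open(archive_path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            doc = json.loads(line)
            if doc.get("id") == issue_id:
                return doc
    return None
```

Calling `find_issue("/tmp/handelsztg-1884-issues.jsonl.bz2", "handelsztg-1884-12-31-a")` should return `None` if the issue document is indeed missing from the archive.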

Affected page IDs:
handelsztg-1884-01-22-a-p0001
handelsztg-1884-01-22-a-p0003
handelsztg-1884-01-22-a-p0004
handelsztg-1884-01-22-a-p0005
handelsztg-1884-01-22-a-p0006
handelsztg-1884-02-27-a-p0001
handelsztg-1884-02-27-a-p0002
handelsztg-1884-02-27-a-p0003
handelsztg-1884-02-27-a-p0004
handelsztg-1884-02-27-a-p0005
handelsztg-1884-04-21-a-p0001
handelsztg-1884-04-21-a-p0002
handelsztg-1884-04-21-a-p0003
handelsztg-1884-09-11-a-p0001
handelsztg-1884-09-11-a-p0002
handelsztg-1884-09-11-a-p0003
handelsztg-1884-11-17-a-p0001
handelsztg-1884-11-17-a-p0003
handelsztg-1884-11-17-a-p0004
handelsztg-1884-12-31-a-p0001
handelsztg-1884-12-31-a-p0002
handelsztg-1884-12-31-a-p0003
handelsztg-1884-12-31-a-p0004
handelsztg-1884-12-31-a-p0005
handelsztg-1884-12-31-a-p0006
handelsztg-1884-12-31-a-p0007
handelsztg-1884-12-31-a-p0008
handelsztg-1884-12-31-a-p0009
handelsztg-1884-12-31-a-p0010
handelsztg-1884-12-31-a-p0011
handelsztg-1884-12-31-a-p0012
handelsztg-1884-12-31-a-p0013
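
The cross-check behind the list above can be sketched roughly as follows. This is not the actual sanity check script. It is a minimal illustration that assumes a page ID is always its parent issue ID followed by a `-pNNNN` suffix (a pattern inferred from the IDs listed here). The names `issue_id_of` and `orphan_pages` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed page ID pattern: <issue id>-p<four digits>
PAGE_SUFFIX = re.compile(r"-p\d{4}$")


def issue_id_of(page_id):
    """Derive the parent issue ID by stripping the trailing -pNNNN suffix."""
    return PAGE_SUFFIX.sub("", page_id)


def orphan_pages(page_ids, issue_ids):
    """Group page IDs whose parent issue is absent from issue_ids.

    Returns a dict mapping each missing issue ID to its list of page IDs.
    """
    known = set(issue_ids)
    orphans = defaultdict(list)
    for page_id in page_ids:
        parent = issue_id_of(page_id)
        if parent not in known:
            orphans[parent].append(page_id)
    return dict(orphans)
```

Fed with all page IDs found under pages/ and all issue IDs found under issues/, this would return one entry per issue that has pages but no issue document, which makes it easy to see how many pages each missing issue accounts for.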