cisocrgroup/ocrd_cis

ocropy segment does not handle input files properly

Closed this issue · 2 comments

finkf commented

ocrd-cis-ocropy-segment does not segment any image files.

Looking at the code in ocrd_cis/ocropy/segment.py line 220 looks wrong:

for (n, input_file) in enumerate(self.input_files):

Shouldn't this be:

for (n, input_file) in enumerate(self.workspace.mets.find_files(fileGrp=self.input_file_grp)):

As well as line 113 seems weird:

if hasattr(self, 'output_file_grp'):
  try:
    self.output_file_grp, self.image_file_grp = self.output_file_grp.split(',')
  except ValueError:
    self.image_file_grp = FALLBACK_FILEGRP_IMG
    LOG.info("No output file group for images specified, falling back to '%s'", FALLBACK_FILEGRP_IMG)

Looking at the code in ocrd_cis/ocropy/segment.py line 220 looks wrong:

for (n, input_file) in enumerate(self.input_files):

Shouldn't this be:

for (n, input_file) in enumerate(self.workspace.mets.find_files(fileGrp=self.input_file_grp)):

These 2 are the same. But recently, the input_files property has become even more, because it must also differentiate between PAGE and image files in the same fileGrp, for the same pageId.

So: no, this is the correct pattern, and since OCR-D/spec#164 we must use input_files.

As well as line 113 seems weird:

if hasattr(self, 'output_file_grp'):
  try:
    self.output_file_grp, self.image_file_grp = self.output_file_grp.split(',')
  except ValueError:
    self.image_file_grp = FALLBACK_FILEGRP_IMG
    LOG.info("No output file group for images specified, falling back to '%s'", FALLBACK_FILEGRP_IMG)

This used to be the correct pattern for processors that write AlternativeImages. (The conditional was necessary because there are other non-processing contexts like --help which have no output_file_grp defined.)

After OCR-D/spec#164 we write them to the same output fileGrp, so this is not needed anymore (see #57).

finkf commented

Ok. Thanks for the clarification