Fix PDF image extraction + predictable image paths

Question

axfelix opened this issue 8 years ago · 8 comments

PDF image extraction doesn't work at the moment -- I'm not sure if Grobid or Cermine implement it at all, but PDF images aren't extracted from most documents in our corpus. We need to add a call to pdfimages (from xpdf/poppler) for documents that are uploaded as PDF and not passed through meTypeset. We should make sure the output location for these matches that of meTypeset (currently /var/documents/$user/$job/metypeset/media/image#.png) so that these image URLs are predictable for Substance.

Answer 1 · 2016-10-06T17:18:27.000Z

@kaschioudi - no hurry on this unless you're blocked on other issues.

Answer 2 · 2016-10-26T23:10:41.000Z

OK, so pdfimages generally works for this on some test documents, using the syntax pdfimages -j file.pdf image, where file.pdf is the input and "image" is the output file name prefix. Some of the output is in PPM rather than JPEG format, but ImageMagick can fix that easily, e.g.: for x in *.ppm; do magick "$x" "${x%.ppm}.jpg"; done; rm *.ppm
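For reference, the pdfimages step above could be wrapped roughly like this. It is only a sketch, not what is on any branch: the function names and the default "image" prefix are made up for illustration, and it assumes pdfimages (poppler/xpdf) and ImageMagick 7 (the magick executable) are both on the PATH.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call; -j keeps embedded JPEG streams as .jpg files."""
    return ["pdfimages", "-j", str(pdf_path), str(prefix)]


def jpg_name(ppm_path):
    """Map image-000.ppm to image-000.jpg, leaving the directory unchanged."""
    return Path(ppm_path).with_suffix(".jpg")


def extract_images(pdf_path, out_dir, prefix="image"):
    """Extract images into out_dir, then convert leftover .ppm files to JPEG.

    Hypothetical helper: the name and signature are assumptions, not project code.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir / prefix), check=True)
    for ppm in sorted(out_dir.glob("*.ppm")):
        # Assumes ImageMagick 7, which provides the `magick` executable.
        subprocess.run(["magick", str(ppm), str(jpg_name(ppm))], check=True)
        ppm.unlink()  # drop the original once the JPEG copy exists
    # Return everything written under the prefix so a caller can pass it on.
    return sorted(out_dir.glob(prefix + "*"))
```

Returning the list of extracted files is one possible way to hand the image list to a later stage; whether that fits the existing module boundaries is an open question.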

Shouldn't be too hard to implement this in its own module, then add a call to the merge module that adds image elements to the end of the Body text for any PDF that's missing them?
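A rough sketch of what that merge step might look like, assuming the merged document is JATS/NLM-style XML with an un-namespaced body element directly under the root, and that images are referenced as fig/graphic elements with an xlink:href attribute. The function name and those element names are assumptions for illustration, not the actual merge module.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# Serialize the attribute as xlink:href rather than an auto-generated prefix.
ET.register_namespace("xlink", XLINK_NS)


def append_missing_images(xml_text, image_paths):
    """Append a <fig><graphic/></fig> to the end of <body> for each image
    that is not already referenced anywhere in the document.

    Hypothetical helper: element names follow JATS conventions by assumption.
    """
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        raise ValueError("document has no <body> element")
    href = "{%s}href" % XLINK_NS
    already_linked = {g.get(href) for g in root.iter("graphic")}
    for path in image_paths:
        if path in already_linked:
            continue  # this image is already referenced, so skip it
        fig = ET.SubElement(body, "fig")
        ET.SubElement(fig, "graphic", {href: path})
    return ET.tostring(root, encoding="unicode")
```

Skipping images that are already linked means the step can run on every PDF without duplicating figures in documents that already have them.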

Answer 3 · 2016-11-07T22:53:19.000Z

I've mostly implemented this on the pdfimages branch. I still need a) a general cleanup and sanity check over the code, and b) a good way of passing the list of pdfimages from the Extraction module to the Merge module that re-adds them.

Answer 4 · 2017-01-04T22:16:30.000Z

Removed the branch because this was implemented upstream in Cermine (see the comment on CeON/CERMINE#34) -- now we just need to make sure that Cermine's output images are moved to the same path as meTypeset's. Currently Cermine's output images are in a folder called documentname.images and meTypeset's are in metypeset/media.
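The move itself could be as small as the sketch below. It assumes the documentname.images folder sits directly inside the job directory and that file names are kept as they are; both are guesses, and the function name is hypothetical. Renaming to the image#.png pattern would be a separate step if Substance needs it.

```python
import shutil
from pathlib import Path


def harmonize_cermine_images(job_dir, document_name):
    """Move Cermine's extracted images into the meTypeset media folder.

    Assumes <job_dir>/<document_name>.images as the source (a guess) and
    <job_dir>/metypeset/media as the target. Returns the new paths.
    """
    job_dir = Path(job_dir)
    source = job_dir / (document_name + ".images")
    target = job_dir / "metypeset" / "media"
    if not source.is_dir():
        return []  # nothing was extracted for this document
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for image in sorted(source.iterdir()):
        destination = target / image.name
        if destination.exists():
            # Refuse to overwrite an image meTypeset already produced.
            raise FileExistsError(str(destination))
        shutil.move(str(image), str(destination))
        moved.append(destination)
    source.rmdir()  # the source folder is empty once every entry has moved
    return moved
```

Any relative image links in the Cermine XML would still need to be rewritten to point at the new location.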

Answer 5 · 2017-01-30T19:06:06.000Z

Looking into this...

We need to switch from using the PdfNLMContentExtractor class in Cermine to the ContentExtractor class (in https://github.com/pkp/xmlps/blob/master/module/Cermine/src/Cermine/Model/Converter/Cermine.php) to benefit from image extraction support.

However, trying to do this with an upstream Cermine build causes the converter to fail in our stack. Upstream Cermine builds still work fine when called with the legacy(?) PdfNLMContentExtractor class. From looking at our code, I initially thought this was because the old method only produces one output file and was therefore designed to be pipeable, which is how we appear to be using it: https://github.com/pkp/xmlps/blob/master/module/Cermine/src/Cermine/Model/Converter/Cermine.php#L103. However, I can't reproduce this piping behaviour when running upstream builds of Cermine locally, which really confuses me, as I'm not sure how it continues to work in our pipeline...

@kaschioudi, if you have any ideas...

Answer 6 · 2017-03-08T21:48:08.000Z

Thanks! Going to review this. We should probably make sure meTypeset and Cermine images are output to the same path moving forward so that we don't need to think about different relative links in XML output.

Answer 7 · 2017-03-29T21:23:42.000Z

This is working and merged into master. Leaving open until we've harmonized the image output paths, though.

Answer 8 · 2018-06-29T16:44:40.000Z

I believe this was fixed in the most recent round of Texture work through the meta file wrapper.