Sicos1977/TesseractOCR

Ability to read MultiPageTiffs from memory

washcycle opened this issue · 14 comments

I'm looking to use this Leptonica function to read multi-page TIFFs from memory:

/*!
 * \brief   pixaReadMemMultipageTiff()
 *
 * \param[in]    data    const; multiple pages; tiff-encoded
 * \param[in]    size    size of cdata
 * \return  pixa, or NULL on error
 *
 * <pre>
 * Notes:
 *      (1) This is an O(n) read-from-memory version of pixaReadMultipageTiff().
 * </pre>
 */
PIXA *
pixaReadMemMultipageTiff(const l_uint8  *data,
                         size_t          size)
{

https://github.com/DanBloomberg/leptonica/blob/master/src/tiffio.c

Would we only need to update the Interop class to add this?

I dropped support for multi-page TIFF images in favor of making this library much easier to use. Just use another tool to split the TIFF into separate files first and then feed them to TesseractOCR.

Can you please reconsider this? Splitting first introduces considerable overhead.

I can add the method mentioned in the first post of this issue; after that you have to feed the pix objects to the OCR engine yourself. I'm not going to change the OCR classes, though, because I dropped support for multi-page TIFFs so that this library would be much easier to use.

That would be great. Having the ability to load a multi-page TIFF and iterate through the images is all I need. I can feed the individual images into Tesseract myself.

Thank you for reconsidering this!😊

I'll try to make some time to implement it this weekend.

Is it possible to supply me with a multi-page tiff?

Let me know if I can help with the implementation.

Help is always welcome; at the moment time is my issue. I'll try to implement the new feature next week. I need to finish some other work first.

What kind of API did you have in mind?

Just using Leptonica to split the multi-page TIFF into separate PIX (image) objects and feed them into the Tesseract engine one by one.
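For context on why a per-page split works from a memory buffer at all: a classic TIFF stores each page as an image file directory (IFD), and the IFDs form a linked list through "next IFD" offsets. The following is a minimal sketch that only counts pages by walking that chain. `tiff_count_pages` is a hypothetical helper written for illustration; it is not part of Leptonica or TesseractOCR, it is not how Leptonica decodes pages, and it assumes classic TIFF (magic 42), not BigTIFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit unsigned value in the file's byte order (le != 0: little-endian). */
static uint16_t rd16(const uint8_t *d, int le)
{
    return le ? (uint16_t)(d[0] | (d[1] << 8))
              : (uint16_t)((d[0] << 8) | d[1]);
}

/* Read a 32-bit unsigned value in the file's byte order. */
static uint32_t rd32(const uint8_t *d, int le)
{
    return le ? ((uint32_t)d[0] | ((uint32_t)d[1] << 8) |
                 ((uint32_t)d[2] << 16) | ((uint32_t)d[3] << 24))
              : (((uint32_t)d[0] << 24) | ((uint32_t)d[1] << 16) |
                 ((uint32_t)d[2] << 8) | (uint32_t)d[3]);
}

/*
 * Hypothetical helper: count the IFDs (pages) of a classic TIFF held in memory.
 * Returns the page count, or -1 if the buffer is not a well-formed classic TIFF.
 */
int tiff_count_pages(const uint8_t *data, size_t size)
{
    int le;
    size_t pages = 0;
    uint32_t off;

    if (data == NULL || size < 8) return -1;            /* header is 8 bytes */
    if (data[0] == 'I' && data[1] == 'I') le = 1;       /* little-endian */
    else if (data[0] == 'M' && data[1] == 'M') le = 0;  /* big-endian */
    else return -1;
    if (rd16(data + 2, le) != 42) return -1;            /* classic TIFF magic */

    off = rd32(data + 4, le);                           /* offset of first IFD */
    while (off != 0) {
        uint16_t n;
        size_t remaining;

        if ((size_t)off > size - 2) return -1;          /* entry count must fit */
        n = rd16(data + off, le);                       /* number of 12-byte entries */
        remaining = size - (size_t)off - 2;
        if (remaining < (size_t)n * 12 + 4) return -1;  /* entries + next offset must fit */

        pages++;
        if (pages > size) return -1;                    /* more IFDs than bytes: the chain loops */
        off = rd32(data + off + 2 + (size_t)n * 12, le);/* 0 terminates the chain */
    }
    return (int)pages;
}
```

Each step of the loop costs time proportional to nothing but the header fields it reads, so visiting every page is linear in the number of pages; decoding the pixel data of each page is a separate job that a library such as Leptonica handles.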

Sorry for the very long delay, but I have added this method to the Array class:

        /// <summary>
        ///     Loads the multi-page tiff from the memory <paramref name="bytes"/>
        /// </summary>
        /// <param name="bytes">The raw bytes of the multi-page tiff file</param>
        /// <returns>An <see cref="Array"/> with one pix per page</returns>
        public static Array LoadMultiPageTiffFromMemory(byte[] bytes)
        {
            IntPtr pixaHandle;

            fixed (byte* ptr = bytes)
            {
                pixaHandle = LeptonicaApi.Native.pixaReadMemMultipageTiff(ptr, bytes.Length);
            }

            if (pixaHandle == IntPtr.Zero) throw new IOException("Failed to load multi page image from memory");

            return new Array(pixaHandle);
        }

You can use it like this to read a multi-page TIFF image from memory:

        [TestMethod]
        public void CanParseMultiPageTifFromMemory()
        {
            using var engine = CreateEngine();
            var bytes = File.ReadAllBytes(TestFilePath("./processing/multi-page.tif"));
            using var pixA = TesseractOCR.Pix.Array.LoadMultiPageTiffFromMemory(bytes);
            var i = 1;

            foreach (var pix in pixA)
            {
                using (var page = engine.Process(pix))
                {
                    var text = page.Text.Trim();

                    var expectedText = $"Page {i}";
                    // MSTest expects (expected, actual), so the expected value goes first
                    Assert.AreEqual(expectedText, text);
                }

                i++;
            }
        }