Ability to read MultiPageTiffs from memory

Question

washcycle opened this issue 2 years ago · 14 comments

I'm looking to use Leptonica to read multi-page TIFFs from memory:

/*!
 * \brief   pixaReadMemMultipageTiff()
 *
 * \param[in]    data    const; multiple pages; tiff-encoded
 * \param[in]    size    size of cdata
 * \return  pixa, or NULL on error
 *
 * <pre>
 * Notes:
 *      (1) This is an O(n) read-from-memory version of pixaReadMultipageTiff().
 * </pre>
 */
PIXA *
pixaReadMemMultipageTiff(const l_uint8  *data,
                         size_t          size)
{

https://github.com/DanBloomberg/leptonica/blob/master/src/tiffio.c

Would we only need to update the Interop class to add this?
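For context on what "multiple pages" means at the byte level: a classic TIFF stores each page as an image file directory (IFD), and every IFD ends with the offset of the next one (0 for the last page). The sketch below is not Leptonica code. It is a minimal illustration with hypothetical names (`tiff_count_pages`, `rd16`, `rd32`, `MAX_PAGES`) that walks that chain in a memory buffer and counts the pages. It assumes classic TIFF only (magic number 42, not BigTIFF) and does not decode any image data.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary cap (an assumption) so a cyclic IFD chain cannot loop forever. */
#define MAX_PAGES 100000

/* Read a 16-bit value in the byte order declared by the TIFF header. */
static uint16_t rd16(const uint8_t *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Read a 32-bit value in the byte order declared by the TIFF header. */
static uint32_t rd32(const uint8_t *p, int big_endian)
{
    return big_endian
        ? ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
          ((uint32_t)p[2] << 8)  |  (uint32_t)p[3]
        : ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
          ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/*
 * Hypothetical helper: count the pages (IFDs) of a classic TIFF in memory.
 * Returns -1 if the buffer is not a classic TIFF, if an offset points
 * outside the buffer, or if the chain exceeds MAX_PAGES.
 */
int tiff_count_pages(const uint8_t *data, size_t size)
{
    int big_endian;
    int pages = 0;
    uint32_t offset;

    if (data == NULL || size < 8) return -1;

    if (data[0] == 'I' && data[1] == 'I') big_endian = 0;       /* little-endian */
    else if (data[0] == 'M' && data[1] == 'M') big_endian = 1;  /* big-endian */
    else return -1;

    if (rd16(data + 2, big_endian) != 42) return -1;  /* classic TIFF magic */

    offset = rd32(data + 4, big_endian);  /* offset of the first IFD */
    while (offset != 0) {
        uint16_t entries;
        size_t ifd_len;

        /* The 2-byte entry count must lie inside the buffer. */
        if (offset > size || size - offset < 2) return -1;
        entries = rd16(data + offset, big_endian);

        /* IFD layout: count (2) + entries (12 each) + next-IFD offset (4). */
        ifd_len = 2 + (size_t)entries * 12 + 4;
        if (size - offset < ifd_len) return -1;

        if (++pages > MAX_PAGES) return -1;
        offset = rd32(data + offset + 2 + (size_t)entries * 12, big_endian);
    }
    return pages;
}
```

Because each IFD already points at the next one, a reader that remembers the current offset can visit every page exactly once, which is one plausible reading of the O(n) note in the Leptonica comment above.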

Answer 1 · 2022-11-26T16:08:56.000Z

I dropped support for multi-page TIFF images in favor of making this library much easier to use. Just use another tool to split the TIFF into separate files first, and then feed them to TesseractOCR.

Answer 2 · 2022-11-26T16:12:13.000Z

Fair point. Excellent idea though, I never even considered that as an option. Regards, Matt

Answer 3 · 2023-02-28T22:29:46.000Z

Can you please reconsider this? Splitting first introduces considerable overhead.

Answer 4 · 2023-03-01T06:27:13.000Z

I can add the method mentioned in the first post of this issue; after that, you have to feed the pix objects to the OCR engine yourself. But I'm not going to change the OCR classes, because I dropped support for multi-page TIFFs to make this library much easier to use.

Answer 5 · 2023-03-01T15:39:53.000Z

That would be great. Having the ability to load a multi-page TIFF and iterate through the images is all I need. I can feed the individual images into Tesseract myself.

Answer 6 · 2023-03-02T12:40:48.000Z

Thank you for reconsidering this!😊

Answer 7 · 2023-03-02T17:05:17.000Z

I'll try to make some time to implement it this weekend.

Answer 8 · 2023-03-08T17:54:30.000Z

Is it possible to supply me with a multi-page tiff?

Answer 9 · 2023-03-08T18:19:42.000Z

Found this one:

https://www.nightprogrammer.org/wp-uploads/2013/02/multipage_tiff_example.tif

Answer 10 · 2023-03-10T23:45:35.000Z

Let me know if I can help with the implementation.

Answer 11 · 2023-03-11T14:44:53.000Z

Help is always welcome; at the moment, time is my issue. I'll try to implement the new feature in the next week. First I need to finish some other work.

Answer 12 · 2023-03-12T14:19:18.000Z

What kind of API did you have in mind?

Answer 13 · 2023-03-14T18:09:57.000Z

Just using Leptonica to split the multi-page TIFF into separate PIX (image) objects and feed them into the Tesseract engine one by one.

Answer 14 · 2023-04-22T07:33:16.000Z

Sorry for the long delay, but I added this method to the Array class:

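        /// <summary>
        ///     Loads a multi-page TIFF from the in-memory <paramref name="bytes"/>
        /// </summary>
        /// <param name="bytes">The raw bytes of the TIFF-encoded, multi-page file</param>
        /// <returns>An <see cref="Array"/> with the images that were read</returns>
        /// <exception cref="IOException">Raised when the native call returns a null handle</exception>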
        public static Array LoadMultiPageTiffFromMemory(byte[] bytes)
        {
            IntPtr pixaHandle;

            fixed (byte* ptr = bytes)
            {
                pixaHandle = LeptonicaApi.Native.pixaReadMemMultipageTiff(ptr, bytes.Length);
            }

            if (pixaHandle == IntPtr.Zero) throw new IOException("Failed to load multi page image from memory");

            return new Array(pixaHandle);
        }

You can use it like this to read a multi-page TIFF image from memory:

        [TestMethod]
        public void CanParseMultiPageTifFromMemory()
        {
            using var engine = CreateEngine();
            var bytes = File.ReadAllBytes(TestFilePath("./processing/multi-page.tif"));
            using var pixA = TesseractOCR.Pix.Array.LoadMultiPageTiffFromMemory(bytes);
            var i = 1;

            foreach (var pix in pixA)
            {
                using (var page = engine.Process(pix))
                {
                    var text = page.Text.Trim();

                    var expectedText = $"Page {i}";
                    Assert.AreEqual(expectedText, text);
                }

                i++;
            }
        }