mupdf tar fixes

Question

q3cpma opened this issue 5 years ago · 14 comments

Hello, would it be possible to import these commits fixing tar (cbt) support from mupdf? Without these, mupdf advertises GNU tar support but doesn't handle long names, and it doesn't support ustar name fields of length 100.

http://git.ghostscript.com/?p=mupdf.git;a=blobdiff;f=source/fitz/untar.c;h=1b73cfb577ea91c68d797da3cd471cdb104d4609;hp=92788155500db0dbfdfd49eb741d6dc37793d305;hb=407f39091c765303cd2712930c2fdd4cbee3cc69;hpb=cfe80eb6ad89eb4906320180ca833a6e9d9568f6
http://git.ghostscript.com/?p=mupdf.git;a=blobdiff;f=source/fitz/untar.c;h=416c99dada201013cfa9c448dae648e86550ebd1;hp=1b73cfb577ea91c68d797da3cd471cdb104d4609;hb=4429482cf95ab2aedd6dc866f808c7593d0884cd;hpb=407f39091c765303cd2712930c2fdd4cbee3cc69
http://git.ghostscript.com/?p=mupdf.git;a=blobdiff;f=source/fitz/untar.c;h=76edc613c5340010bc05098c5220fa8d572e8668;hp=416c99dada201013cfa9c448dae648e86550ebd1;hb=5f5664306d9cec6ba49236736b565b7d2ed4d741;hpb=4429482cf95ab2aedd6dc866f808c7593d0884cd
http://git.ghostscript.com/?p=mupdf.git;a=blobdiff;f=source/fitz/untar.c;h=131ace02ea32e6cb5bb55f3b09ba2c6c7215ffcb;hp=76edc613c5340010bc05098c5220fa8d572e8668;hb=6cc1dd819d77a7dbff55e52ae589c1fded074f4c;hpb=5f5664306d9cec6ba49236736b565b7d2ed4d741
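For readers unfamiliar with the two cases named above, here is a minimal Python sketch of what a reader has to handle. It is not mupdf's C code and makes no claim about how the linked commits are written. It assumes the standard 512-byte header layout (name at offset 0, size at 124, typeflag at 156, magic at 257, prefix at 345) and the GNU convention that a typeflag `L` entry carries the name of the following entry. The helper names `field`, `list_names` and `make_archive` are made up for illustration.

```python
import io
import tarfile

BLOCK = 512  # tar headers and payloads are padded to 512-byte blocks


def field(raw: bytes) -> str:
    """Decode a fixed-width field; it may fill its slot with no trailing NUL."""
    return raw.split(b"\0", 1)[0].decode("utf-8", "replace")


def list_names(data: bytes) -> list:
    """Walk raw tar bytes and return member names (ustar and GNU long names)."""
    names, long_name, pos = [], None, 0
    while pos + BLOCK <= len(data):
        header = data[pos:pos + BLOCK]
        if header == b"\0" * BLOCK:  # end-of-archive marker
            break
        size = int(field(header[124:136]).strip() or "0", 8)  # octal size
        typeflag = header[156:157]
        payload = data[pos + BLOCK:pos + BLOCK + size]
        if typeflag == b"L":
            # GNU extension: the payload holds the name of the *next* entry.
            long_name = field(payload)
        else:
            name = field(header[0:100])  # up to 100 bytes, NUL is optional
            prefix = field(header[345:500])
            if long_name is not None:
                name, long_name = long_name, None
            elif header[257:263] == b"ustar\0" and prefix:
                name = prefix + "/" + name  # POSIX ustar split name
            names.append(name)
        # Skip the header plus the payload rounded up to whole blocks.
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return names


def make_archive(fmt: int, name: str) -> bytes:
    """Build a one-member archive in memory with Python's tarfile module."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo(name)
        info.size = 3
        tf.addfile(info, io.BytesIO(b"abc"))
    return buf.getvalue()


ustar_names = list_names(make_archive(tarfile.USTAR_FORMAT, "a" * 100))
gnu_names = list_names(make_archive(tarfile.GNU_FORMAT, "b" * 101))
print(len(ustar_names[0]), len(gnu_names[0]))
```

The point of `field` is that a 100-byte name fills the whole slot, so a reader that expects a terminating NUL inside the field will truncate or misread it.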

Answer 1 · 2020-02-08T10:33:21.000Z

Reference to relevant issue: koreader/koreader#5624

Possibly held up by #762 but if you can do the same as #943 I don't think anybody would have any objections.

Answer 2 · 2020-02-08T14:07:42.000Z

On Sat, Feb 08, 2020 at 02:33:21AM -0800, Frans de Jonge wrote:
> Reference to relevant issue: koreader/koreader#5624
> Possibly held up by #762 but if you can do the same as #943 I don't think anybody would have any objections.

Here's a patch for that repo that should do it: https://0x0.st/iivm.diff

I merged everything under a single patch named tar-fixes, hope it's okay. Since Lua on Gentoo (especially mine) is broken, building koreader-base didn't work easily, so I tested it on mupdf-1.13.0 directly. It built and solved my problems as it should.

Answer 3 · 2020-02-08T16:42:33.000Z

I'm building on Gentoo with no issues?

$ eix -I dev-lang/lua
[I] dev-lang/lua
     Available versions:  
     (0)    5.1.5-r4 [M](~)5.1.5-r5
     (5.1)  [M](~)5.1.5-r100 [M](~)5.1.5-r101 [M](~)5.1.5-r102 [M](~)5.1.5-r103
     (5.2)  [M](~)5.2.3 [M](~)5.2.3-r1 [M](~)5.2.3-r2 [M](~)5.2.3-r3 [M](~)5.2.4^t [M](~)5.2.4-r1^t [M](~)5.2.4-r2^t
     (5.3)  [M](~)5.3.3 [M](~)5.3.3-r1 [M](~)5.3.3-r2 [M](~)5.3.5^t [M](~)5.3.5-r1^t [M](~)5.3.5-r2^t
       {+deprecated doc emacs readline static test test-complete ABI_MIPS="n32 n64 o32" ABI_RISCV="lp64 lp64d" ABI_S390="32 64" ABI_X86="32 64 x32"}
     Installed versions:  5.1.5-r4(12:10:00 AM 01/28/2020)(deprecated readline -emacs -static ABI_MIPS="-n32 -n64 -o32" ABI_RISCV="-lp64 -lp64d" ABI_S390="-32 -64" ABI_X86="64 -32 -x32")
     Homepage:            http://www.lua.org/
     Description:         A powerful light-weight programming language designed for extending applications

[I] dev-lang/luajit
     Available versions:  (2) 2.0.5-r1 **2.1.0_beta3
       {lua52compat static-libs}
     Installed versions:  2.0.5-r1(2)(04:30:11 PM 07/10/2018)(-lua52compat -static-libs)
     Homepage:            http://luajit.org/
     Description:         Just-In-Time Compiler for the Lua programming language

Found 2 matches

(Well, barring the usual LuaRocks 3 weirdness, which I take care of like this: http://trac.ak-team.com/trac/browser/niluje/Configs/trunk/Kindle/Misc/koreader-luarocks-3.patch).

Sidebar: I'm not sure anyone really actually wants to be using CBT instead of CBZ. By nature, tar is not seekable (the T stands for "tape", after all, not really a seekable medium ^^), so it's designed entirely around being streamed.

That point may be moot if mupdf doesn't actually seek in zips either, though ;D.
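To make the seekability point concrete, here is a small Python sketch using only the standard `zipfile` and `tarfile` modules. It says nothing about how mupdf reads either format. A zip stores a central directory at the end of the file, so one read yields every member's offset. A tar has no index, so member offsets are only known after visiting each header in order. The page names and sizes are invented for the example.

```python
import io
import tarfile
import zipfile

# Five fake "pages" of 1000 bytes each (illustrative data only).
pages = {f"page{i:03}.jpg": bytes([i]) * 1000 for i in range(5)}

# Zip: offsets come straight from the central directory.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_STORED) as zf:
    for name, data in pages.items():
        zf.writestr(name, data)
with zipfile.ZipFile(zbuf) as zf:
    zip_offsets = {zi.filename: zi.header_offset for zi in zf.infolist()}

# Tar: offsets are discovered by walking the headers one after another.
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    for name, data in pages.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
tbuf.seek(0)
with tarfile.open(fileobj=tbuf, mode="r:") as tf:
    tar_offsets = {member.name: member.offset_data for member in tf}

print(zip_offsets)
print(tar_offsets)
```

On a regular file a tar reader can still seek past each payload, so the cost is one header read per entry rather than reading all the data. It just cannot jump to page N without visiting the headers before it.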

Answer 4 · 2020-02-08T16:51:08.000Z

Clearly one person does. ;-)

In any case using CBZ with PNG or JPEG doesn't seem like it'd make much sense either. Those are already compressed in a very efficient manner; throwing ZIP on top just seems like a waste of cycles all around. And I recall that for some reason even uncompressed ZIP seemed to waste more cycles than expected.

Incidentally, what I noticed in my comics from Humble Bundle is that the PDF is preferable over CBZ as a rule of thumb. Fundamentally they're both largely just a collection of JPEGs, but PDF can include vectors, most often or at least most noticeably text.

Answer 5 · 2020-02-08T16:59:07.000Z

When I say CBZ, I usually mean with no compression (i.e., store, not deflate) ;).
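As an illustration of store versus deflate, here is a hedged Python sketch with the standard `zipfile` module. Random bytes stand in for already-compressed JPEG/PNG data, which is an assumption made for the demo. The file name `page001.jpg` and the helper `pack` are made up.

```python
import io
import os
import zipfile

# Random bytes approximate already-compressed image data (assumption).
payload = os.urandom(64 * 1024)


def pack(method: int) -> zipfile.ZipInfo:
    """Write one member with the given method and return its directory entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("page001.jpg", payload)
    with zipfile.ZipFile(buf) as zf:
        return zf.getinfo("page001.jpg")


stored = pack(zipfile.ZIP_STORED)
deflated = pack(zipfile.ZIP_DEFLATED)
print("stored:  ", stored.file_size, "->", stored.compress_size)
print("deflated:", deflated.file_size, "->", deflated.compress_size)
```

With store, the member is copied byte for byte, so the reader only has to hand the bytes to the image decoder. With deflate on incompressible input, the size barely changes while both writer and reader still pay for the codec.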

There's a bit of an issue with PDFs for Comics in our case: the whole file is read and stored in memory on open. Which means a high-res GN will kill a 512MB device :/. (I can actually confirm that, we get oom-killed when trying to load the > 500MB PDF I had on hand ^^).

I haven't looked at where the culprit is, but switching to an mmap would take care of that kind of issue.
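For illustration only, this is roughly what the mmap idea looks like in Python. The real fix would live in C or Lua code that I haven't looked at, so treat this as a sketch of the technique, not of the codebase. `read_range` is a hypothetical helper and the demo file is synthetic.

```python
import mmap
import os
import tempfile


def read_range(path: str, offset: int, length: int) -> bytes:
    """Return a slice of the file without reading the whole file first."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are faulted in on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.bin")
    with open(path, "wb") as f:
        f.write(b"A" * 1_000_000 + b"needle" + b"B" * 1_000_000)
    chunk = read_range(path, 1_000_000, 6)
    print(chunk)
```

The mapping is backed by the page cache, so the kernel can drop clean pages under memory pressure, whereas a buffer filled by reading the whole file has to stay resident or be swapped.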

Answer 6 · 2020-02-08T17:01:13.000Z

Huh, I read Cells at Work on my H2O and that one was 457 MB. Not quite 512 but you'd think it'd be close enough for problems.

Most comics don't really work on my H2O because, well, they're not manga. :-P But the PDF vector advantage nevertheless persists.

Answer 7 · 2020-02-08T17:02:26.000Z

Might have been sneakier than that then, that particular file was very neatly laid out, which means there were fancy uncompressed (very) high-res TIFF images mixed w/ PDF vectors and stuff.

Since it crashed on open because of an oom-kill, I just assumed that it was trying to read the whole file in memory ;).

Answer 8 · 2020-02-08T17:11:19.000Z

In case you grabbed the Humble Bundle in question, it's Cells at Work Volume 5.

I just double checked and memory use while loading that file is < 200 MB. It does, however, quickly balloon past 1 GB if you skip through pages, although I can't seem to get it past 1.3 GB. But that's on my desktop where I'm currently only using ~14 GB RAM (out of 32) so I don't know how that behavior maps to the H2O. If nothing else it'd be a lot harder to skip through that quickly.

Answer 9 · 2020-02-08T18:20:20.000Z

The memory usage you may see as you browse pages with PDF is probably just the visited pages' blitbuffers getting cached - that cache size is computed depending on the system available/free memory: https://github.com/koreader/koreader/blob/master/frontend/cache.lua#L15-L58

Answer 10 · 2020-02-08T18:22:09.000Z

On Sat, Feb 08, 2020 at 08:42:34AM -0800, NiLuJe wrote:
> I'm building on Gentoo with no issues?
> (Well, barring the usual LuaRocks 3 weirdness, which I take care of like [this](http://trac.ak-team.com/trac/browser/niluje/Configs/trunk/Kindle/Misc/koreader-luarocks-3.patch)).

Well, I tried to unmask a recent Lua to use it with vis, then reverted, and that has left some things a bit unstable for me right now (probably some link from eselect-lua is still there).

> Sidebar: I'm not sure anyone really actually wants to be using CBT instead of CBZ. By nature, tar is not seekable (T stands for "tape", after all, not really a seekable medium ^^).

I'm pretty sure we always read comics sequentially, though. Even when using zip's store method, tar is faster:

    $ du -h Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbt
    453M  Berserk v01 (2003) (Digital) (danke-Empire).cbt
    $ time unzip -q Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbz -d out
    unzip -q Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbz -d out  1.96s user 0.14s system 99% cpu 2.110 total
    $ time 7z x -oout Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbz
    ...
    7z x -oout Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbz  0.14s user 0.15s system 99% cpu 0.290 total
    $ time tar -x -f Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbt -C out
    tar -x -f Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbt -C out  0.01s user 0.16s system 99% cpu 0.171 total
    $ time busybox tar -x -f Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbt -C out
    busybox tar -x -f Berserk\ v01\ \(2003\)\ \(Digital\)\ \(danke-Empire\).cbt -  0.00s user 0.14s system 99% cpu 0.142 total

It IS a useless benchmark since mupdf uses zlib's minizip and it streams page per page, which means that JPEG/PNG decoding and disk read will be the true bottleneck. Still, using an archive format without compression really makes sense for JPEG/PNG content.

I agree that tar/pax needs a replacement; Plan 9 should really have saved us on that point.

Answer 11 · 2020-02-08T18:34:20.000Z

@poire-z Oh right, exactly. So on the H2O they'd just get unloaded a lot quicker, but there'd never be a problem except with really high DPI individual images or something.

Answer 12 · 2020-02-08T18:53:32.000Z

If I got comics containing vectors, I'd still rasterize them ahead of time, since I prefer to trade a very small amount of disk space to avoid this expensive operation on underpowered hardware.

Seriously, Kobo should include some hardware acceleration for PDF rasterization and image decoding. Same for Amazon, they're just selling underpowered smartphones right now. Compare this to digital cameras, where hardware is REALLY interesting (Sony's Bionz: MIPS 3000 with DSP, or Nikon's Expeed: ARM+FR-V+DSP; all running embedded Linux or µITRON).

The current state is:

* Kindle: NXP i.MX7D and i.MX6 SoloLite, which have an EPD controller, but no special purpose DSP/ASIC. Too bad, since they used the i.MX508 on the first Paperwhite, which was made specially for ereaders.
* Kobo: i.MX6 SoloLite everywhere

I can understand Kobo not having the resources, but Amazon has no excuse for not making something a bit more interesting (other than money). The state of modern "embedded".

(Sorry for off-topic)

Answer 13 · 2020-02-08T19:06:17.000Z

Sorry, here's a better patch that also solves koreader/koreader#4287, forgot about it. https://0x0.st/iixR.diff

Answer 14 · 2020-02-08T19:13:53.000Z

After a bit of testing on various formats (gnu with entry of length 101, ustar with 100, ustar with split component 5+100 and pax), only pax doesn't work with these patches. Should be okay.
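For anyone who wants to reproduce that kind of test matrix, here is a hedged sketch that builds comparable fixtures with Python's `tarfile` module. The exact archives used above are not shown in the thread, so the names, the 101-byte pax name and the helpers `build` and `first_typeflag` are assumptions made for illustration.

```python
import io
import tarfile

# One archive per case from the list above (names are made up).
CASES = {
    "gnu, 101-byte name": (tarfile.GNU_FORMAT, "g" * 101),
    "ustar, 100-byte name": (tarfile.USTAR_FORMAT, "u" * 100),
    "ustar, 5+100 split": (tarfile.USTAR_FORMAT, "abcde/" + "s" * 100),
    "pax, 101-byte name": (tarfile.PAX_FORMAT, "p" * 101),
}


def build(fmt: int, name: str) -> bytes:
    """Create a one-member archive in memory in the requested format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo(name)
        info.size = 4
        tf.addfile(info, io.BytesIO(b"data"))
    return buf.getvalue()


def first_typeflag(data: bytes) -> bytes:
    """Typeflag byte (offset 156) of the first header block."""
    return data[156:157]


fixtures = {label: build(fmt, name) for label, (fmt, name) in CASES.items()}
for label, data in fixtures.items():
    print(f"{label}: first typeflag {first_typeflag(data)!r}")
```

The first header differs per case: a regular `0` entry for both ustar archives, an `L` long-name entry for GNU, and an `x` extended header for pax. Pax long names arrive through that separate extended-header mechanism, which a reader handling only the ustar fields and the GNU `L` entry will not pick up.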