Consider more reliable and stable detection methods for distinguishing zimit classic and zimit2

Question

Jaifroid opened this issue 8 months ago · 5 comments

As suggested here, we could look for both the 'warc2zim' AND 'zimit' strings in the Scraper metadata (we currently look only for 'warc2zim', which is not guaranteed to be stable). If '_sw:yes' is not in the tags, then it's zimit2. If it is there, then there is a Service Worker, meaning it's zimit classic.

We currently rely on finding 'warc-headers' in the declared MIME type. But it's possible (if currently unlikely) that such headers could be reintroduced if they are needed in future versions of zimit2, so it would be good to have other options as outlined above.
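The two checks described above could be sketched roughly as follows. This is only an illustration in Python, not the readers' actual code: the function names, the return labels, the case-insensitive matching, and the assumption that tags arrive as a single ';'-separated string are all hypothetical.

```python
from typing import Optional


def detect_zimit_type(scraper: str, tags: str) -> Optional[str]:
    """Proposed check: return 'zimit-classic', 'zimit2', or None if not a Zimit ZIM.

    `scraper` is the ZIM's Scraper metadata string and `tags` its Tags string.
    Lower-casing and ';' splitting are assumptions made for this sketch.
    """
    scraper_lc = scraper.lower()
    # Require BOTH strings in the Scraper metadata, as proposed in the issue.
    if "warc2zim" not in scraper_lc or "zimit" not in scraper_lc:
        return None
    tag_list = [t.strip() for t in tags.split(";") if t.strip()]
    # '_sw:yes' signals a Service Worker, which means zimit classic.
    if "_sw:yes" in tag_list:
        return "zimit-classic"
    return "zimit2"


def has_warc_headers(declared_mime_type: str) -> bool:
    """Current check: look for 'warc-headers' in a declared MIME type."""
    return "warc-headers" in declared_mime_type.lower()
```

A reader could try the tag-based check first and keep the MIME-type check as a fallback, so that detection does not depend on a single signal.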

Answer 1 · 2024-01-25T16:53:51.000Z

c528c94 addresses the first part of this issue (adds test for 'zimit' in the scraper name).

Answer 2 · 2024-01-25T18:03:12.000Z

@Jaifroid The recommended way of doing it is to rely on the _sw ZIM tag. Zimit2 should not need anything special at reader level AFAIK.
@benoit74 I wonder: is this not explicit in the documentation of warc2zim?

Answer 3 · 2024-01-25T18:57:16.000Z

Thanks, @kelson42. I agree, but I can't use that method yet because all the zimit2 ZIMs produced so far have '_sw:yes'. Until that's fixed as requested by rgaudin, I have to use the current method.

There is a specific requirement in the reader to detect links and PDFs that cannot be opened in the webview or iframe due to sandboxing / CSP. Kiwix Serve has already been patched via libkiwix, and other readers that use libkiwix will have the patch. The issue is that Wombat aggressively rewrites such links, so they can't be detected without either temporarily disabling Wombat or using other workarounds. I've patched both KJS readers.

Answer 4 · 2024-02-12T09:00:35.000Z

Both changes have been made:

https://dev.library.kiwix.org/raw/solar.lowtechmagazine.com_en_all_2024-02/meta/Scraper : warc2zim 2.0.0-dev2 + zimit 2.0.0-dev1 + Browsertrix crawler 0.12.4
https://dev.library.kiwix.org/raw/solar.lowtechmagazine.com_en_all_2024-02/meta/Tags : _ftindex:yes;_category:other;lowtech

Not all test ZIMs have been rebuilt yet with this latest code change, but at least you have a few to test.
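As a quick sanity check, the Scraper and Tags values quoted above can be inspected with a few lines of Python. This is a hedged sketch only: `parse_zim_tags` is a hypothetical helper, and treating Tags as a ';'-separated list of `name:value` entries is an assumption based on the value shown above.

```python
from typing import Dict


def parse_zim_tags(tags: str) -> Dict[str, str]:
    """Split a ';'-separated Tags string into a name -> value mapping.

    Tags without a ':' (such as 'lowtech') map to an empty string.
    """
    parsed: Dict[str, str] = {}
    for tag in tags.split(";"):
        tag = tag.strip()
        if not tag:
            continue
        name, sep, value = tag.partition(":")
        parsed[name] = value if sep else ""
    return parsed


# Values copied from the metadata URLs quoted above.
scraper = "warc2zim 2.0.0-dev2 + zimit 2.0.0-dev1 + Browsertrix crawler 0.12.4"
tags = parse_zim_tags("_ftindex:yes;_category:other;lowtech")

names_both_tools = "warc2zim" in scraper and "zimit" in scraper
has_service_worker = tags.get("_sw") == "yes"
print(names_both_tools, has_service_worker)
```

For this ZIM the Scraper string names both tools and there is no `_sw` tag, which matches what is expected of a zimit2 archive.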

Answer 5 · 2024-02-12T09:01:46.000Z

@benoit74 Excellent, thanks!