haystack/nb

The sidebar doesn't load when a pdf has one page only

Opened this issue · 32 comments

Currently in nbclient, if a pdf has only one page, the sidebar does not load.

Does this happen only for single-page pdfs, or for multi-page pdfs as well?

Please provide publicly accessible test cases if possible.

https://home.ttic.edu/~avrim/book.pdf

This is the textbook that I am using. After some experimentation I realized that the issue is that the library you are using to make the pdf annotatable requires intense preprocessing (around 3-4 minutes for initial setup), and until the entire pdf has been preprocessed, neither the annotation sidebar nor the annotations show up. This makes sense, since the pdf is the core and the annotations build on top of it, but it becomes a really big issue when that setup time is required every time the open window or tab is changed.
This makes the software unusable, as the loading time is just not bearable.

I tried out the same document with nb1 and found that, because each page was rendered as a single image and the annotations were block-like in nature (thus losing fine control), the rendering of each page was initiated only as and when required, which made the process faster.
My suggestion would be to provide a highlighting ability that doesn't map directly to the threads, but instead maps to a background block-style annotation that the user can neither see nor interact with, which might make the system faster.

I'm probably missing out a lot of details since I don't know the code thoroughly, but I'd be happy to help out!

Also, would it be possible to make the code of nb1 publicly available? I wasn't able to find it in the haystack repositories.

The right solution to the problem that you've identified is for NB to "process" the pdf (which nowadays means converting it to html for in-browser rendering) on the server once, and store it there, and deliver that HTML directly to the client at time of use, instead of the current approach of shipping the pdf to each client for processing at the time of use. There should be an issue for this but I can't find it; if it really isn't there we should add it @JumanaFM .

Exactly, preprocessing is something that I think is happening on the client side, and if possible it should happen all at once during the pdf upload process. It would probably save a lot of resources.

On another note, I checked this same issue with Mozilla's built-in pdf viewer and hypothes.is's pdf annotator as well (both being open source), but neither of them seems to have this issue. Any idea how they manage it, and whether the same source code can be used? Mozilla's viewer doesn't have the ability to annotate and highlight, but other plugins based on Mozilla's pdf viewer work pretty smoothly too.

Also, thanks for the nb1 link!

Ahh, got it. But I briefly looked at the nb2 source code and you also use pdf.js. Sorry in advance if it's a basic question.

Right; we use the same canonical library as everyone else. But we're running it every time on the client, when instead it really ought to be run once on the server.

If you are looking to contribute this would be a very nice issue to work on.

I'm planning to modify nb a bit for my own requirements, and I really need to be able to work with big files for this. I'd love to contribute!
Any resources for this specific issue you could point me towards?

I'd love for you to contribute back anything you think could be helpful to others. In particular, this prerendering of large pdfs would be of great general benefit. NB1 did this (it rendered into images instead of html, but same idea).

I take it you've already found the client and server code. We're active on the repo discussion and happy to help out if you need help understanding or finding specific things.

Yup, I have already set up nb2 on my laptop, but my system kept crashing because of the local hosting. I think that for some reason nb2 does both pre-rendering and client-side rendering, as loading took twice as long on my local nb as on the hosted nb. Just a guess though.

I'll start with figuring out how nb1 rendered images so that I can use that here.

Totally agreed. Especially the points about pdf to images. I initially wanted to use nb1 when that was the only option available, but then I realized that without fine control over the text, the context of the related question would be lost on the readers. This wouldn't be a very big issue, and could easily be worked around, but it made me postpone my project.

Nb2 did initially feel like more of a frontend modification at the cost of speed, but as I went deeper, I realized that a lot of features were added, making it more user-friendly.

But if I shouldn't even refer to nb1 code, where is a good place to start?

Are you asking specifically about how to tackle server side rendering in nb2?

Yes. Maybe some resource or something I can look into or something that already implements this well.

This is how it's done on NB currently https://github.com/haystack/nb/blob/7f0e24a07db0b5de1f54c5d4f20114a14d994f73/public/nb_viewer.html
Take a look and contribute if you can, we appreciate it!

At present, nb_viewer fetches the target pdf from the nb server, then uses the pdf.js library to convert it to html that nb can annotate. We should instead use the same pdf.js library on the server to convert the pdf to html there once, then save the resulting html in a suitable cache directory so that it can be served on request.
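A minimal sketch of the convert-once-then-cache idea described above. This is not NB's code: Python is used only for brevity, and `get_html`, `cached_html_path`, and the `convert` callback are hypothetical names standing in for whatever wraps pdf.js on the server. The cache key is assumed to be a hash of the pdf's bytes, so re-uploading the same file reuses the same html.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_html_path(pdf_bytes: bytes, cache_dir: Path) -> Path:
    """Location of the cached html, derived from the pdf's content hash."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return cache_dir / f"{digest}.html"


def get_html(pdf_path: Path, cache_dir: Path,
             convert: Callable[[bytes], str]) -> str:
    """Return html for a pdf, running the converter only on the first request."""
    pdf_bytes = pdf_path.read_bytes()
    target = cached_html_path(pdf_bytes, cache_dir)
    if target.exists():
        # Cache hit: serve the stored html without touching the converter.
        return target.read_text(encoding="utf-8")

    html = convert(pdf_bytes)  # the slow step; runs once per distinct pdf
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(html, encoding="utf-8")
    tmp.replace(target)  # rename into place so readers never see a partial file
    return html
```

Running the conversion at upload time, rather than on the first view, would just mean calling `get_html` once from the upload handler so the first reader already gets a cache hit.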

Really helpful, thanks!

Why not just save the generated html file on the server, deleting the original pdf?

What I'm thinking is that once the professor uploads the file, the server converts it to an html file and saves that file for all later use.
If the student/professor wants to download the file as a pdf, we perform the same conversion in reverse on the server and provide the document.

It seems that converting pdfs to html documents doesn't always work out, and most of the files have their own specific fonts, without which the file gets corrupted.
Also, I looked a bit deeper into the hypothesis code, and it seems that they aren't using the pdf-to-html system either.
Not really sure how to proceed at this point.

Pdfs that cannot be converted are just as big a problem with the current system as they would be with server-side conversion, since it's the same library either way. So we're no worse off doing the conversion server side.

But such problematic pdfs are rare and getting rarer, because pdfjs is also the library that Firefox uses to render pdfs in the browser, so it gets lots of attention.

Google Chrome uses a different conversion library, pdfium, for the same purpose. We could use that library instead of pdfjs if we decided it was more robust. Pdfium would have to run in a separate process since it isn't js based, but we could easily have our server invoke it as needed, using for example this python wrapper.
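For illustration, having the server invoke a converter in a separate process could look roughly like the sketch below. Everything about the command is an assumption: `convert_in_subprocess` is a hypothetical helper, and the `command <input> <output>` calling convention is made up for the example rather than taken from pdfium or any wrapper around it.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def convert_in_subprocess(command: Sequence[str], pdf_path: Path,
                          out_path: Path, timeout_s: float = 300.0) -> Path:
    """Run an external converter as `command <pdf_path> <out_path>`.

    The argument order is an assumed convention for this sketch only.
    """
    result = subprocess.run(
        [*command, str(pdf_path), str(out_path)],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # a stuck conversion should not hang the server
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"converter failed ({result.returncode}): {result.stderr.strip()}"
        )
    if not out_path.exists():
        raise RuntimeError("converter exited cleanly but wrote no output")
    return out_path
```

Keeping the converter behind a single function like this means a crash or hang in the external tool surfaces as an ordinary error in the server instead of taking the server process down with it.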

Riiight, that makes sense. I'll try this.

@JumanaFM sorry for bothering you again and again, but is there any documentation for pdf.js at all? No matter where I search, I can't seem to find any. The official docs point to links that are incomplete, and the only documentation that exists is user contributed and doesn't make a lot of sense (https://github.com/MeiKatz/pdfjs-docs/blob/master/README.md). Where did you look for the documentation?

I don't mind switching to pdfium, but if I can, I'd prefer staying close to the source code.

Not a bother, happy to help!
The best resource is the official page
https://mozilla.github.io/pdf.js/

Another resource that might be helpful is hypothesis
https://github.com/hypothesis/pdf.js-hypothes.is

Thanks a lot!! I found a few more random resources, but the best docs are in the examples on the official page itself. Not a lot to go by, but you can get a brief overview.

It might be worth investigating online which of pdf.js and pdfium is considered the most robust, handles the most pdf weirdness, and produces the best html. All we do is invoke it for conversion, so the coupling to nb is very light and it would probably be quite easy to switch. We would, however, need to keep using pdfjs for the legacy documents, since we rely on the converted html being the same every time.
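One way to keep legacy documents on their original converter while new uploads move to a different one is to record, per document, which converter produced its html. The sketch below is only an illustration of that idea: `Document`, `ConverterRegistry`, and the `"pdfjs"` / `"pdfium"` ids are hypothetical names, and the converters themselves are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Converter = Callable[[bytes], str]


@dataclass
class Document:
    doc_id: str
    pdf_bytes: bytes
    converter_id: str  # recorded at upload time and never changed afterwards


class ConverterRegistry:
    def __init__(self, default_id: str) -> None:
        self._converters: Dict[str, Converter] = {}
        self.default_id = default_id

    def register(self, converter_id: str, convert: Converter) -> None:
        self._converters[converter_id] = convert

    def new_document(self, doc_id: str, pdf_bytes: bytes) -> Document:
        """Pin a new upload to whatever the current default converter is."""
        return Document(doc_id, pdf_bytes, self.default_id)

    def render(self, doc: Document) -> str:
        """Always use the converter the document was pinned to."""
        return self._converters[doc.converter_id](doc.pdf_bytes)
```

With this shape, switching the default only affects documents created afterwards, so html that existing annotations are anchored to stays the same.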

Sure, I'll look into comparing both too.