Google Books alternative - Full text browsing and search
ocdtrekkie opened this issue · 153 comments
Project description
I have a lot of ebooks, mostly in PDF format, but some EPUBs as well, and I feel the information in them is often better than on the Internet. But I've got no way to find that information easily. I'd like a self-hosted (ideally web-based) platform that I can just upload all my books, and then search and browse them from wherever I am.
I asked for something like this on HN, and other than general full text search apps to run on my desktop, there really wasn't anything to do this.
Relevant Technology
Needs to support hosting on a Linux box. Should ideally be usable from Windows, Linux, Android, Mac, etc. through a web interface.
Who is this for
Probably at least somewhat savvy users since we're talking about self-hosting here, but hopefully one could admin a server for their family to use or something.
Some technology one could build on for indexing is ElasticSearch and the mapper-attachments.
I'm doing this. Will update here :)
@mysticmode Cool! Really looking forward to this. There's a lot of potential here to make an open-source ebook indexing library. One could do auto-fetching of book covers, an overview of all books, fetching reviews, highlights/bookmarks, etc.
I think there is a lot of work that could be done here and there might be room for a collaboration for several people at some point.
This'll be awesome
This could also be easily wrapped in something like electron if that adds any value.
To make it a bit easier you could just make a website and use kiosk view/app view in Chrome. They probably have something similar in Firefox etc. That way you don't have to deal with Electron as well.
There is already a server mode for the ebook management software calibre, but it's rather ugly and feature-poor. Developing something with a good server client architecture would be good. Server should do the scraping, searching and managing of ebooks (maybe throw in a few converter plugins for different formats), the client would display it neatly, possibly via chrome or electron. Would be interested, especially serverside.
What about building an electron wrapped app, that authenticates and syncs with dropbox. We could use the API to search the contents of PDF files on your dropbox and bring the results back to the Electron app. I don't think the contents search supports EPUBs though.
A dependency on Dropbox, a proprietary cloud service would mostly defeat the point.
Great idea! I'm a heavy user of Google Books, but would really like an open source solution. How do you plan to search in PDF files? Some kind of OCR will be required. It's also worth mentioning that I have non-English books as well.
@la0rg Most PDFs are already OCR'd. Or, in many cases, were always digital to begin with. Page images are rare, PDFs are a multimedia format.
Yes, pdf as file format should be supported, but for OCR you are better served with tesseract, which is Open Source and does an amazing job.
Elastic search is good. I'm trying to take baby steps here and implement something which is well-thought and discussed rather than just another ebook reader which is open source and web client based.
First step. Domain registration http://www.libreread.org/ :)
I'm not sure, if we can discuss the whole process here. So, I'll create a slack team and let you know.
"I feel the information in them is often better than on the Internet. But I've got no way to find that information easily"
@ocdtrekkie Could you give some examples of how the full-text search should work for you?
Also, if you take existing ebook readers like Kindle, iBooks, etc.: apart from their being proprietary, which features that you need are missing from them or from other ebook readers?
Answer from anyone on the features is appreciated! :)
Thanks!
@mysticmode If I'm looking for a section on a programming structure, for example, in my programming books, I'd expect searching it to show me which books mention it, and let me open that book to the first mention.
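To make the requested behaviour concrete (which books mention a term, and where the first mention is), here is a minimal sketch in Python. It uses a toy in-memory inverted index, not Elasticsearch or anything from the project; the function names and the sample texts are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and yield (word position, word) pairs."""
    for position, match in enumerate(re.finditer(r"[a-z0-9]+", text.lower())):
        yield position, match.group()


def build_index(books):
    """Map each word to {book title: [word positions]}.

    `books` is a hypothetical dict of title -> extracted plain text.
    """
    index = defaultdict(dict)
    for title, text in books.items():
        for position, word in tokenize(text):
            index[word].setdefault(title, []).append(position)
    return index


def search(index, term):
    """Return (title, first position, mention count), most mentions first."""
    hits = index.get(term.lower(), {})
    results = [(title, positions[0], len(positions))
               for title, positions in hits.items()]
    return sorted(results, key=lambda hit: (-hit[2], hit[0]))


# Invented sample data standing in for text extracted from two books.
books = {
    "Book A": "A linked list is a structure. A linked list stores nodes.",
    "Book B": "Arrays come first. Later we meet the linked list.",
}
index = build_index(books)
results = search(index, "linked")
for title, first_position, count in results:
    print(f"{title}: {count} mention(s), first at word {first_position}")
```

A real reader would map the word position back to a page or chapter so the book can be opened at the first mention; that mapping is left out here.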
Another feature that would be super important would be the ability to import a set of my ebooks data in some format I could easily create. In my case, I have a homegrown (and relatively shoddy) database app which indexes my books, and I'd want to be able to import the metadata I have into a format I could feed into this system, rather than having to sit there and reenter all of that information.
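One possible shape for such an import is a plain CSV file with one row per ebook. The sketch below is only an assumption about what "a format I could easily create" might look like; the column names (`filename`, `title`, `author`) and the sample row are hypothetical.

```python
import csv
import io

# Hypothetical column names; a real importer would document its own.
REQUIRED_FIELDS = ("filename", "title", "author")


def load_metadata(csv_text):
    """Parse CSV text into a dict keyed by ebook filename."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    missing = [name for name in REQUIRED_FIELDS if name not in present]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    records = {}
    for row in reader:
        records[row["filename"]] = {
            "title": row["title"].strip(),
            "author": row["author"].strip(),
        }
    return records


# Invented sample export from a homegrown database.
sample = "filename,title,author\nexample.pdf, Example Book ,Jane Doe\n"
metadata = load_metadata(sample)
print(metadata["example.pdf"])
```

CSV is used here only because most homegrown databases can export it; JSON would work just as well.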
I'm not super concerned about the actual reading elements of this, much more than embedding the PDF reader in my web browser. But that's just my personal use case, I suppose.
Features:
- Open Source
- Support PDF and EPUB
- Browser based
- Full-text search on the metadata and the content.
- Fetch book reviews
- Centralised annotations (maybe a browser plugin like hypothes.is)
- Highlights
- Library with categories
I'm looking into Elastic Search for PDF formats
Basically this would be Plex, but for books
It's certainly cool to have a lot of features, and there is a lot of potential here, but I think it's smart to start with the "core feature" and scope out from that once it's solved. I'd say start small, solve the critical core feature (which is creating a full-text index for PDF ebooks) and expand on that in time. Along the way you'll learn a lot about the problem and probably see new ways to improve it.
Just creating the infrastructure for a distributable project with ElasticSearch and attachments is a job in and of itself. A good way to start might be to test it out with an ElasticSearch plugin for searching your indexes, and then create a small web server and a UI as an extension of that once the index fundamentals are in place. With that foundation it's much easier for people who want to join in to work in parallel, I think. Just my two cents. I think the chance of success increases drastically if the scope is small from the beginning.
I'm with @SuperManitu here about calibre. I've always been thinking of something that either uses calibre's own database (which is probably a bad idea) or just imports from and/or exports to it as a starting point. I also thought it could be something peer-to-peer, e.g. utilizing torrent technology to broadcast metadata or content. That might really come in handy in an academic context with open data in mind.
I agree with @la0rg on OCR support for PDF formats. I think full-text search for most ebooks that are in PDF formats would work well on most cases with the extracted data. But OCR support should be in the pipeline though.
As @mikaelbr suggested, I'll try to start with the basic implementation of PDF extraction and search and share it here. Then if people find it good, we can move on from there.
I'm looking into pdf.js and elastic-search. Through pdf.js we could get the rendered HTML5.
- I'm passing that to Elasticsearch. Using the html_strip char filter, we could get the search content stripped of HTML.
- We could run full-text search over all the documents and find the relevant book, since all the book contents are in Elasticsearch as documents.
- I'm passing the HTML content stored in the Elasticsearch documents to the client, instead of having the PDF converted and rendered every time in the browser.
- I'm keeping the PDF files as a backup. If it is self-hosted, people would want to download the books whenever they need.
I'm writing this, if someone knows these technologies for them to tell me if I'm on the right way :)
I'll do the above approach and try to share it in couple days.
I don't think pdf.js is needed; you will almost never read the PDF in the browser. For those cases I would just open the PDF in a new tab and let the browser do its job. pdf.js is rather slow and not a pleasurable reading experience.
No. When the user uploads a book, I'd be using pdf.js and converting to HTML through a headless browser like phantom-js on the server.
I tried python pdfminer, but the extracted HTML is messy. PDF.js gives clean code.
It might be easier to do the server in Java, as Elasticsearch provides a Java API, you can use a REST microframework like Jersey and you have libraries like PDFBox for reading pdfs. Plus Node's performance on bigger files is really bad.
I would opt for ElasticSearch + Jersey (or similar) REST Server + clientside SPA (preferably typescript) + optionally (later on) nodejs for Server-rendering the SPA.
As this is a rather complicated setup provide a zero-config docker container to run it.
Working with filesystems is costly. I need to think about this.
Maybe I could use java as a semi-standalone process to extract pdfs.
But I'm planning to use nodejs as the base for the application. As this is open source, I think using javascript for server is better when it comes to collaboration and it works pretty well for SPA.
I'll try the pdf extractors and see which one suits best. As for text, python pdfminer works well.
I don't think JavaScript is better for collaboration; a type system can help you a lot when using code written by others.
I'll create a working prototype in the next days
What language would you use then?
On Wed, Oct 19, 2016, 14:24 SuperManitu notifications@github.com wrote:
I don't think JavaScript is better for collaboration; a type system can help you a lot when using code written by others.
I'll create a working prototype in the next days
As said, a small REST server written in Java, as Java is easy to adopt and learn, has a good type system and has very good tooling (Maven and Eclipse/IntelliJ).
For the SPA I would use TypeScript, as it enhances JavaScript with an unobtrusive type system and is very popular, so you get type definitions for almost all JavaScript libraries.
What about C++? Also has good tools and isn't much harder to learn than Java IMO
We could also write it in Python, which is by far easier to learn for newer programmers, and has great support for platforms, with plenty of libs we could use.
I personally like C++ far more than Java, but it has too many disadvantages in this particular case:
Pro:
- No JVM required
Cons:
- No central module registry (like maven or npm)
- Many caveats not clear to new developers (const, object creation and manual memory management to name a few)
- complicated dependency management (autotools, cmake, or similar)
Python:
Con:
- No type system
Yeah, those are good points. Dependency management isn't really something contributors need to worry about a lot, considering most of that will be done during the initial phase. Consts and memory management are indeed a bit harder for newer devs.
Python however I don't see why we couldn't use. The lack of types isn't really that big of a problem IMO, but it is handy.
With Python we don't have to worry too much about setups and the like, as we have pip for deps and could use something like pep8 for formatting tools, plus most (if not all) IDEs have support for Python.
I'm interested in this, and I'm for using Python. The low barrier to entry for newer devs is probably a huge plus for using the language for the backend. Also, @SuperManitu, I believe @mysticmode has already created a repo for this project.
I wouldn't use Python, because having a type system is really helpful, Python has the tendency to be rather slow compared to Java, and I had problems with some libs using native code. Plus, using indentation as blocks is ugly.
@supermanitu, if we were to use Java, what backend framework would you suggest?
As said I would use the Jersey RESTful Framework: https://jersey.java.net/documentation/latest/getting-started.html#new-project-structure
Here is a rather good explanation: http://www.vogella.com/tutorials/REST/article.html
I don't like Python either because of the indentation "thing". If you're up for using something like Python that doesn't use that system, we could use Ruby.
I'm personally not for using Ruby. PHP and Java are good for me.
When we are handling files, in our case pdf and epub, we could try using Xpdf. I haven't tested it yet, but I read in a few places that it gives better output than PDFBox. It's written in C++ and it's licensed under the GPL. If I start this project, I would like it to be under the GPL. And there are multiple PDF extractors available in C++ which we can test.
As far as the language for the core part of our application goes, I certainly won't choose Java. Its confusing licensing philosophy for its ecosystem, the court claims, its object orientation, and the fact that it is hard for a new programmer to pick up make me go for another language that has a fairly good attitude towards open philosophy, performs well compared to Java, and is readable.
I would choose Elixir. If we need multi-threading for this application, it handles that right away, and it performs well compared to Java. The web framework Phoenix is Ruby-inspired. The coding approach is readable and far easier to pick up quickly than Java.
Gotta say, Elixir performs far better than Ruby, plus you get the taste of Ruby's style with the speed of Erlang.
Never used Elixir, but from what I've seen it looks good. Should be a good choice for the server.
I think using node.js with Walmart's Electrode as a framework would be great. The project can be super modularized. I believe React would work well for this project, and it is such a popular framework that people can pick it up easily and really help with development. It's also easy to onboard someone.
Do we really need electron though? Couldn't it just be a website?
I absolutely would like to see a standards-compliant website I can access from anywhere. (As a personal note, I'd like to be able to host it as a Sandstorm.io app, which is possible as long as it's A. web-based and B. runs on 64-bit Linux.)
For the SPA I would make a simple website. Using Electron can be done later if needed (but I doubt that). As frontend frameworks there is either React + Redux or Cycle.js (which I'm in favour of). I've used Angular2 and won't use it again.
@SuperManitu Could you explain why we need a front-end framework in our use case? I don't think our front end is complex. So far, going by what @ocdtrekkie pointed out earlier, the browser-based app should be standards-compatible, and feature-wise it should be minimal at first. Most of the processing happens in the backend.
Let's build it with bare-bones JavaScript or maybe TypeScript, and along the way we'll figure out if we need a framework to help solve the problems we're facing at that time. I think building the minimal version first -> then get it right -> then get it better is the way to go.
To initiate this project, I'm doing the initial setup now and will share it here once it's ready.
I think it's better if we take this discussion to chat. I've created a slack team http://libreread.slack.com
Let's discuss the tools and intricacies of the app in the chat. And If we want to discuss about the features, we can post here.
Please share your email if you would like to join the development, so I could add you on slack team.
Thanks!
Yes of course, for the beginning we dont need a framework. Just standard Typescript. Awaiting the setup :)
Email is supermanitu@gmail.com
@mysticmode email is nickmorrrison09@gmail.com
@FutureProg hey, let me know if you get the invite :) It's bouncing here and I couldn't send the invite again. Weird
@mysticmode Nope, haven't received one yet. I think you can post a join link?
@FutureProg I revoked and sent it again. Please check now.
@mysticmode I'm not certain what's going on but I still haven't received an invite (and it says my email isn't in the system when I do a password reset). Are you sure you're sending it to nickmorrison09@gmail.com ?
I have no interest in joining Slack. Please just make sure when there's a repository to check out, that the link to it makes it into this thread.
FYI: There's also a Slack team that can be used for discussing projects from this repo. You can get an automatic invitation here http://opensourceideas.herokuapp.com/
@ocdtrekkie We'll make sure a link to the project is posted here and in this overview
@FutureProg yeah, your email had a spelling mistake earlier :)
@ocdtrekkie Sure, we are yet to do a basic setup, we are testing tools and working on the mockup. I need to think about this, I'm keeping slack only for discussing on development. This is the repo https://github.com/mysticmode/libreread
I'll notify you about the process in this thread soon.
Elasticsearch with the Apache Tika mapper plugin already does search of PDFs and many other formats, you just need to provide a web client.
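For anyone who wants to try that route, here is a rough sketch of what feeding a PDF to the mapper-attachments plugin looks like from Python. It only builds the request bodies and makes no network calls. The field name `file` and the sample bytes are hypothetical, and the `attachment` field type is specific to that plugin (later Elasticsearch versions replaced it with an ingest-based attachment processor), so treat this as an illustration, not a tested recipe.

```python
import base64
import json

# Mapping body: a field of type "attachment" asks the plugin to run the
# content through Tika. The field name "file" is a hypothetical choice.
ATTACHMENT_MAPPING = {"properties": {"file": {"type": "attachment"}}}


def build_document(title, pdf_bytes):
    """Return the JSON body for indexing one book.

    The plugin expects the raw file content as a base64 string.
    """
    return json.dumps({
        "title": title,
        "file": base64.b64encode(pdf_bytes).decode("ascii"),
    })


# Invented bytes standing in for a real PDF read from disk.
body = build_document("Example Book", b"%PDF-1.4 example bytes")
print(body)
```

The body would then be sent to the index with any HTTP client; the web client mentioned above only needs to issue search queries against the same index.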
Hi, I have created a basic version of LibreRead. You can signup, upload books and it will get indexed in Elastic Search as a background job. Once the indexing is complete, you can search through all books by Metadata and Content. I have implemented PDF.js for UI consistency. Also I'm going to use ePub.js for epubs.
https://github.com/mysticmode/LibreRead
Here are some screenshots of what I built.
There is more work to be done in order to publish this. I think I'm going to roll out the beta version first.
To-Do:
- Multiple user roles
  - Super admin (can manage books and add/revoke users)
  - Admin (can upload books)
  - User (can only read books)
- Book access
  - Public (any user can read)
  - Private (only the user who uploaded the book can read it)
  - Permissions (you can select specific users for reading access; this option will be available when you upload)
- Create notes and see other users' notes while reading.
- ePub implementation
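The roles and access levels in this to-do list could be modelled roughly like this. It is a sketch of one reading of the list, not the project's code: the class and function names are invented, and it assumes that the uploader can always read their own book and that admins get no automatic read access to private books.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    SUPER_ADMIN = "super_admin"
    ADMIN = "admin"
    USER = "user"


class Access(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    PERMISSIONS = "permissions"


@dataclass
class Book:
    title: str
    uploader: str
    access: Access = Access.PRIVATE
    allowed_users: set = field(default_factory=set)


def can_upload(role):
    """Super admins and admins may upload; plain users only read."""
    return role in (Role.SUPER_ADMIN, Role.ADMIN)


def can_manage_users(role):
    """Only the super admin may add or revoke users."""
    return role is Role.SUPER_ADMIN


def can_read(username, book):
    """Apply the three access levels; the uploader can always read (assumption)."""
    if book.access is Access.PUBLIC or username == book.uploader:
        return True
    if book.access is Access.PERMISSIONS:
        return username in book.allowed_users
    return False


shared = Book("Example Book", uploader="alice",
              access=Access.PERMISSIONS, allowed_users={"bob"})
print(can_read("bob", shared), can_read("carol", shared))
```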
This is well underway with brilliant work from all of you and @mysticmode. I'm closing the issue as it's started and progress has been made. Please feel free to continue the discussion even though the issue is closed.
Hi, I'm asking for code contribution help from you guys for LibreRead.
Major help is needed in the backend which is written in Flask. For setup, Please check the Readme post in the repo for development setup. I can help you with that, see my email below.
I'm working on a new design and it would be greatly helpful if someone could port the old app code to work with the new design and continue taking part with the development.
Current list of features:
- Add books (only pdfs)
- Full-text search
- Collections
- Highlights + Annotations
To do list of features:
- Multiple user roles
- Book Access
- ePub Implementation (Major work, Not sure for the first release)
You can email me for further detail hello@nirm.al or please comment here.
UPDATE on LibreRead:
- It's a single user product.
- Backend being written in Go.
- Updated README.md with new features & goals for the initial release.
- New design.
Soon I'll show you the version put on the server for testing.
I deployed my code to the server http://172.104.59.151:8080/signin
email: rkumarnirmal@gmail.com
password: demo
It's not done yet. Right now I'm testing file upload and full-text search. Would love to know your feedback.
Work needs to be done:
- Collections
- Account settings
- Landing page design
- Deployment documentation
- Much more testing
@mysticmode Pretty Neat! I even tried sign up. But in the confirmation mail I got the URL as "localhost:8080...." instead of "172.104.59.151:8080...."
@karuppiah7890 Sorry for that! I should disable signup. It's a single user product :)
Adding one more request. The design is responsive. But still needs some fine-tuning. You could check in hand-held devices and please let me know if I need to make any change.
Thanks!
@mysticmode Oh yeah, I remember seeing somewhere in discussions that it's for a single user. Self hosted system and for private use.
@mysticmode Not very responsive. Menu is responsive, but books are eaten up (and no scroll bar) when size width decreases. Try decreasing your browser width and see it.
And a good tool to try out responsiveness across lots of devices is : https://sizzy.co/
Code for the sizzy project https://github.com/kitze/sizzy
@karuppiah7890 Could you check in a mobile/tablet browser once? I've added max-device-width. I think that is the reason why, if you decrease the desktop browser width, you are not seeing it respond.
@karuppiah7890 I could add multi-user support. But that brings more complexity. For example, you wouldn't like your invited user to fill up the storage space with lots of books. We should be providing a way to control the storage space for each user.
And if there is multi-user support, people would expect to share ebooks with other users. In that case, We need to put a notice that we are not responsible for sharing copyrighted ebooks in the platform. And you have to control the book reading access. You wouldn't like to share it to everyone.
Those will take more thoughts and time. I'm trying to keep it simple for the initial release.
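To give a sense of the extra bookkeeping multi-user support would need, here is a small sketch of per-user storage accounting. Everything in it (the class names, the single shared quota, the in-memory dict) is hypothetical; a real implementation would persist usage and probably allow per-user limits.

```python
class QuotaExceeded(Exception):
    """Raised when an upload would push a user over the quota."""


class StorageLedger:
    """Track bytes used per user against one shared quota (assumption)."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used = {}

    def remaining(self, username):
        return self.quota_bytes - self.used.get(username, 0)

    def record_upload(self, username, size_bytes):
        """Reject the upload before storing anything if it does not fit."""
        if size_bytes > self.remaining(username):
            raise QuotaExceeded(f"{username} has {self.remaining(username)} bytes left")
        self.used[username] = self.used.get(username, 0) + size_bytes


ledger = StorageLedger(quota_bytes=100)
ledger.record_upload("bob", 60)
print(ledger.remaining("bob"))
```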
Makes sense!
@mysticmode It works fine in my tab!
Cool stuff!
I've added EPUB support including full-text search feature. But it's still WIP, we need to do more user testing.
For Demo:
http://demo.libreread.org/
email: rkumarnirmal@gmail.com
password: demo
To Do list:
- Test by uploading and using more EPUBs.
- Highlights & Annotations for both PDF and EPUB
- Account settings
I'm planning to launch beta in the next couple weeks by completing the above to do list.
IMPORTANT NOTICE:
The project repo has been moved to Savannah
https://savannah.nongnu.org/projects/libreread/
@mysticmode Does the upload work for demo ?
@karuppiah7890 Yeah, is there a problem?
Oops. Sorry. Yes, I just noticed. It does work, but it doesn't show the thumbnail of the Ebook though. And I was dumb as I forgot it takes some time to index the PDF in the background and started trying out the search immediately after upload, to see if the book is present as thumbnail wasn't shown. But now I am able to search. Just that the thumbnail is just white
The book has a simple page with title (first page) as a thumbnail, from what I see in my file explorer
@mysticmode Check the demo site. The book I uploaded is Eloquent JavaScript
Yeah, I can see that and it doesn't generate cover for some PDFs. I'm using poppler-utils to generate PDF cover.
Some days before @ocdtrekkie pointed out that the user should be able to edit the Metadata of the ebook(title, author and cover). So you can add cover to the ebook manually.
I'll be adding that feature in the next coming days :)
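For reference, rendering the first page of a PDF with poppler-utils can be done with `pdftoppm`. The sketch below shows one way to call it from Python; it is not the project's code, the function names and the 400-pixel size are arbitrary, and it returns `None` when no cover could be produced so the caller can fall back to a manually attached cover.

```python
import shutil
import subprocess


def cover_command(pdf_path, output_prefix, size=400):
    """Build a pdftoppm call that renders only page 1 as a single PNG."""
    return [
        "pdftoppm",
        "-png",                   # write PNG instead of the default PPM
        "-f", "1", "-l", "1",     # first and last page to render
        "-singlefile",            # no page-number suffix on the output name
        "-scale-to", str(size),   # scale the long side to `size` pixels
        pdf_path,
        output_prefix,
    ]


def generate_cover(pdf_path, output_prefix):
    """Return the PNG path, or None if pdftoppm is missing or fails."""
    if shutil.which("pdftoppm") is None:
        return None
    result = subprocess.run(cover_command(pdf_path, output_prefix),
                            capture_output=True)
    if result.returncode != 0:
        return None
    return output_prefix + ".png"


print(" ".join(cover_command("book.pdf", "covers/book")))
```

This does not address PDFs whose first page renders as blank or plain white; for those, the manual cover upload described above is still needed.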
That sounds cool. It's true. Sometimes the Ebook may not have all the metadata properly. :)
@karuppiah7890 Could you try some other PDFs which have a cover image?
As said, I'll add the feature where you can attach a cover manually if the code doesn't generate one.
Yes, it works in a pretty smooth manner
There is a problem with my EPUB implementation. Right now I'm doing it like this
- unzip epub
- load all htmls into a single file
- show that file
That way I could just show the entire epub content as a single html file. I thought scrolling through a single page would be more natural than clicking next and previous buttons.
But I have a problem here, EPUB table of contents doesn't work that way. Each link points to the particular file. I tried to manipulate it, but it didn't go well.
So, I'm going to do EPUB viewer in a traditional way like other EPUB readers. I'm going to use Redis for this.
- unzip epub
- Fetch the spine data(id and href) and store in redis
- Load each html based on the spine data by using next/previous button.
This way table of contents will automatically work and I don't need to do manipulation.
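The spine-based approach can be sketched in Python with only the standard library. This is an illustration of the EPUB layout, not the LibreRead implementation: it reads `META-INF/container.xml` to find the package file, then resolves each spine `itemref` through the manifest. A plain list stands in for Redis here, and the function names are made up.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def read_spine(epub_file):
    """Return the reading order as a list of (idref, path inside the archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(archive.read(opf_path))
    base = posixpath.dirname(opf_path)  # hrefs are relative to the package file
    manifest = {item.attrib["id"]: item.attrib["href"]
                for item in package.findall("opf:manifest/opf:item", NS)}
    return [(ref.attrib["idref"], posixpath.join(base, manifest[ref.attrib["idref"]]))
            for ref in package.findall("opf:spine/opf:itemref", NS)]


def neighbours(spine, index):
    """Return the (previous, next) spine entries for the next/previous buttons."""
    previous_item = spine[index - 1] if index > 0 else None
    next_item = spine[index + 1] if index + 1 < len(spine) else None
    return previous_item, next_item
```

Calling `read_spine("book.epub")` once at upload time and storing the result (in Redis or anywhere else) means the viewer only needs the current index to find the next or previous file, and table-of-contents links keep pointing at real files.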
I'll get back when I'm done with this implementation.
Things need to be done for the beta release:
- Working EPUB implementation.
- Edit/Delete books.
- Edit/Delete Collections.
- Highlights/Annotations.
- Account settings.
EPUB support, including the search functionality, is implemented now. Please let me know your feedback.
http://demo.libreread.org
Email: rkumarnirmal@gmail.com
Password: demo
I'm going to work on Highlight/Annotations feature.
I have done Highlights & Annotations for PDFs. Now you can Highlight a text and add comment to it on PDF files.
I'm starting to do the same for EPUBs now.
But we need to do more user testing on this feature. If you are interested, please check the uploaded book here
http://demo.libreread.org
Email: rkumarnirmal@gmail.com
Password: demo
Thanks!
It's pretty neat @mysticmode ! But I noticed some unusual things while trying out stuff
@karuppiah7890 Thanks for trying it out! What is it?
Once I highlight a line and then delete the highlight, then I am not able to highlight it again
@karuppiah7890 Wow! That's a nice find. Thank you! will fix it :)
It never gets selected with the blue highlight itself, when trying to highlight a text which is a superset of it. Everything gets selected except the old highlighted text
Ah! That's quite complicated for me now. Highlighting over the already highlighted text. I'll try to do that.
No, I mean, highlighting over an already highlighted and deleted highlight text
Yup, I will fix that :)
Like, say the text is "This is an example of an highlighted text". I highlight the word "example" and then delete the highlight. Then when I try to highlight whole sentence, it doesn't show blue highlight for "example" while selecting also.
And you could change the color of the highlight selection from blue to something else and also make it transparent so people can read when they are highlighting. And I see you use tooltips to show options to change the color and for notes and for deleting highlight. I think you could use the same for asking people if they want to highlight a text - as a tooltip, instead of showing an alert for asking if they want to highlight. I am talking about something like Medium publication highlight feature.