/search-ebooks

Python command-line program that searches through content and metadata of different types of ebooks [work-in-progress]

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

search-ebooks


🚧    Work-In-Progress

search-ebooks is a Python command-line program that searches through content and metadata of various types of ebooks (djvu, epub, pdf, txt). It is based on the pyebooktools package for building ebook management scripts.

⚠️

  • For the moment, the search-ebooks script is only tested on macOS. It will be tested also on linux.
  • More to come! Check the Roadmap to know what is coming soon.

Introduction

search-ebooks allows you to choose the search methods for the different ebook formats. These are the supported search-backends for each type of ebooks:

File type Supported search-backends
.djvu
  1. djvutxt + re (default)
  2. ebook-convert + re
.epub
  1. ebook-convert + re (default)
  2. epubtxt + re
.doc [1]
  1. catdoc or textutil + re (default)
  2. ebook-convert + re
.pdf
  1. pdftotext + re (default)
  2. ebook-convert + re
  • The utilities mentioned in the Supported search-backends column are used to extract text from ebooks. Then the text is searched with Python's re library (re.findall and re.search). epubtxt is the only one that is not an ebook converter per se like the others.
  • More specifically, epubtxt consists in uncompressing first the epub file with unzip since epubs are zipped HTML files. Then, the extracted text is searched with Python's re library. I tried to use zipgrep to do both the unzipping and searching but I couldn't make it to work with regular expressions such as \bpattern\b.
  • The default search methods (except for epub) are used since they are quicker to extract text than calibre's ebook-convert. But if these default utilities are not installed, then the searching relies on ebook-convert for converting the documents to txt
  • On macOS, textutil is a built-in command-line text converter.
  • Eventually, I will add support for Lucene as a search backend since it has "powerful indexing and search features, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities".

⚠️

I didn't set epubtxt as a default search-backend for epub files because it also includes the HTML tags in the extracted text even though text extraction is faster than with ebook-convert.

Once I clean up the extracted text, I will set epubtxt as a default search method for epub files if it is still faster than ebook-convert for text extraction.


All the utilities that extract text make use of a file-based cache to save the converted files (txt) of the ebooks and hence subsequent searching can be greatly speed up.

The search results are ranked in decreasing order of the number of matches found in the ebook contents.

Dependencies

Python dependencies

  • Platforms: macOS [soon linux]
  • Python: >= 3.6
  • diskcache >= 5.2.1 for caching persistently the converted files into txt
  • pyebooktools >= 0.1.0a3 for converting files to txt (see convert_to_txt.py) along with its library lib.py that has useful functions for building ebook management scripts.

Other dependencies

As explained in the documentation for pyebooktools, you need recent version of:

  • calibre to convert ebook files to txt and get metadata from ebooks

And optionally, you might need recent versions of the following utilities:

  • (Highly recommended) poppler, catdoc and DjVuLibre can be installed for faster than calibre's conversion of .pdf, .doc and .djvu files respectively to .txt.

    ⚠️

    On macOS, you don't need catdoc since it has the built-in textutil command-line tool that converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file.

  • Tesseract for running OCR on books - version 4 gives better results even though it's still in alpha. OCR is disabled by default since it is a slow resource-intensive process.

Cache

Cache is used especially to save the converted ebook files into txt to avoid re-converting them which is a time consuming process, especially if it is a document with hundreds of pages. DiskCache, a disk and file backed cache library, is used by the search-ebooks script.

The cache is also used to save the results of calibre's ebook-meta when searching the metadata of ebooks such as their authors and tags.

The search-ebooks script can use the cache with the --use-cache flag.

ℹ️

The MD5 hashes of the ebook files are used as keys to the file-based cache.

A file-based cache library was choosen instead of a memory-based cache like Redis because the converted files (txt) needed to be persistent to speed up subsequent searches and since we are storing huge quantities of data (e.g. we can have thousands of ebooks to search from), a memory-based cache might not be suited. In order to avoid using too much disk space, you can set the cache size with the --cache-size-limit flag which by default is set to 1 GB.

As an example to see how much disk space you might need to cache the txt conversion of one thousand ebooks, let's say that on average each txt file (what is actually being cached) uses approximately 700 KB which roughly corresponds to a file with 350 pages. Thus, you will need a cache size of at least 700 MB to be able to store the txt conversion of one thousand ebooks.

Also DiskCache has interesting features compared to other file-based cache libraries such as being thread-safe and process-safe and supporting multiple eviction policies. See Features for a more complete list.

See DiskCache Cache Benchmarks for comparaisons to Memcached and Redis.

⚠️

  • When enabling the cache with the --use-cache flag, the search-ebooks script has to cache the converted ebooks (txt) if they were not already saved in previous runs. Therefore, the speed up of the searching will be seen in subsequent executions of the script.
  • Keep in mind that caching has its caveats. For instance if the ebook is modified (e.g. tags were added) then the search-ebooks script has to run ebook-meta again since the keys in the cache are the MD5 hashes of the ebooks.
  • There is no problem in the cache growing without bounds since its size is set to a maximum of 1 GB by default (check the --cache-size-limit option) and its eviction policy determines what items get to be evicted to make space for more items which by default it is the least-recently-stored eviction policy (check the --eviction-policy option).

Tips

Use --regex for regex-based search

Use the --regex flag to perform regex-based search of ebook contents and metadata. Thus:

  • --query "a battle" will find any line that contains the words "a battle".
  • --query "^a battle" --regex will find any line that starts with the words "a battle" because the --regex flag considers the search query as a regex.

ℹ️

By default, the search-ebooks script considers the search queries as non-regex, i.e. it searches for the given query anywhere in the text by not processing any regex tokens (e.g. $ or ^).

When searching ebook contents and metadata at the same time, note that both types of search are linked by ANDs. For instance, the following command will search for the "reason" string on those ebooks whose filenames start with "The" and whose tags contain "history":

$ search-ebooks ~/ebooks/ --query "reason" --filename "^The" --tags "history" --regex -i --use-cache

Search for a string that spans multiple lines

Let's say we want to search for the string "turned into a democracy" in the following text:

Find string than can span multiple lines in a text

The difficulty in searching the given string is that sometimes it spans multiple lines and you want to make the regex as general as possible in matching the string no matter where the newline(s) happens in the string.


If we use the simple search query without tokens "turned into a democracy", we will only match the first occurrence of the given string, as shown in the following regex101.com demo:

Result of executing a simple search query without tokens, just the string

To match all occurrences of the string no matter how many lines it spans, the following regex will do the trick: "turned\s+into\s+a\s+democracy". We replaced the space between the words with whitespaces (one or unlimited), as shown in the following regex101.com demo:

Result of executing a search query where spaces between words are replaced white multiple whitespaces

We can now try it out with the search-ebooks script which will search the ~/ebooks/ folder from the Examples:

$ search-ebooks ~/ebooks/ --query "turned\s+into\s+a\s+democracy" --regex -i --use-cache

Output:

Output of ``search-ebooks`` script when using the correct search query with appropriate tokens

ℹ️

Only the ebook Politics_ A Treatise on Government by Aristotle whose two versions epub and txt correspond to the same translation could match the given string "turned into a democracy" which is found in the following part of the txt version:

section where the match was found in the book *Politics_ A Treatise on Government by Aristotle.txt*

and in the text conversion of the epub file:

section where the match was found in the book *Politics_ A Treatise on Government by Aristotle.epub*

Advanced tip: perform "full word" search with regex

The search-ebooks script accepts regular expressions for the search queries through the --regex flag. Thus you can perform specific searches such as a "full word" search (also called "whole words only" search) or a "starts with" search by making use of regex-based search queries.

This is how you would perform some of the important types of search based on regular expressions:

Search type Regex Examples
"full word" search \bword\b: surround the word with the \b anchor --query "\bknowledge\b" --regex: will match exactly the word "knowledge" thus words like "acknowledge" or "knowledgeable" will be rejected
"starts with" search ^string: add the caret ^ before the string to match lines that start with the given string --query "^Th" --regex: will find all lines that start with the characters "Th"
"ends with" search string$: add the dollar sign $ at the end of the string to match all lines that start with the given string --query "through the$" --regex: will find all lines that end with the words "through the"
"contains pattern" search
  • string: a regex without tokens will find the string anywhere in the text even if it is part of a word.
  • string1|string2: searches for the literal text string1 or string2. The vertical bar is called the alternation operator.
  • --query "^The|disputed.$" --regex: will find all lines that either start with "The" or end with "disputed."
  • --filename "Aristotle|Plato" --regex: will select those ebooks whose filenames contain either "Aristotle" or "Plato"

ℹ️

The --regex flag in the examples allow you to perform regex-based search of ebook contents and metadata, i.e. the search-ebooks treats the search queries as regular expressions.

Examples

We will present search examples that are not trivial in order to show the potential of the search-ebooks script for executing complex queries.

This is the ~/ebooks/ folder that contains the files which we will search from in the following examples:

List of ebooks to search from

ℹ️

Of the total eight PDF files, two are files that contain only images: Les Misérables by Victor Hugo.pdf and The Republic by Plato.pdf which both consist of only two images for testing purposes.

Search ebooks with certain filenames

We want to search for the word "knowledge" but only for those ebooks whose filenames contain either "Aristotle" or "Plato" and also we want the search to be case insensitive (i.e. ignore case):

$ search-ebooks ~/ebooks/ --query "\bknowledge\b" --filename "Aristotle|Plato" --regex -i --use-cache

ℹ️

  • --regex treats the search query and metadata (e.g. filename) as regex.
  • \bknowledge\b matches exactly the word "knowledge", i.e. it performs a “whole words only” search. Thus, words like "acknowledge" or "knowledgeable" are rejected.
  • The -i flag ignores case when searching in ebook contents and metadata.
  • Since we already converted the files to txt in previous runs, we make use of the cache with the --use-cache flag.

Output:

Output for example: filenames satisfy a given pattern

ℹ️

  • The txt and pdf versions of The Ethics of Aristotle by Aristotle show different number of matches because they are not the same translations and hence the word "knowledge" might come from the introduction (written by another author) or the translator's footnotes, depending on the version of the text.
  • On the other hand, the txt and epub versions of Politics_ A Treatise on Government by Aristotle show the same number of matches because they are both the same translation.
  • As explained previously, The Republic by Plato.pdf is not included in the matches because it is a file with images only and since we didn't use the --ocr flag, the file couldn't be converted to txt. The next example makes use of the --ocr flag.

Search documents with images

We will execute the previous query but this time we will include the file The Republic by Plato.pdf (which contains images) in the search by using the --ocr flag which will convert the images to text with Tesseract:

$ search-ebooks ~/ebooks/ --query "\bknowledge\b" --filename "Aristotle|Plato" --regex -i --use-cache --ocr true

ℹ️

  • The --ocr flag allows you to search .pdf, .djvu and image files but it is disabled by default because OCR is a slow resource-intensive process.

  • The --ocr flag takes on three values: {always,true,false} where:

    • always: try OCR-ing first the ebook before trying the simple conversion tools
    • true: use OCR for books that failed to be converted to txt or were converted to empty files by the simple conversion tools
    • false: try the simple conversion tools only. No OCR.

    More info in pyebooktools README.


Output:

Output for example: OCR PDF file with images

ℹ️

  • Since the file The Republic by Plato.pdf was not already processed, the cache didn't have its text conversion at the start of the script. But by the end of the script, the text conversion was saved in the cache.
  • As you can see from the search time, OCR is a slow process. Thus, use it wisely!

Search ebook metadata

Search for the words "confront" OR "treason" in ebook contents but only for those ebooks that have the "drama" AND "history" tags:

$ search-ebooks ~/ebooks/ --query "confront|treason" --tags "^(.*drama)(.*history).*$" --regex -i --use-cache

ℹ️

  • The AND-based regex ^(.*drama)(.*history).*$ is a little more complex than an OR-based regex which only uses a vertical bar |. [2]
  • calibre's ebook-meta is used by the search-ebooks script to get ebook metadata such as Title and Tags. The cache not only save the text conversion but also ebook metadata.
  • The --tags option acts like a filter by only executing the "confront|treason" regex on those ebooks that have at least the two tags "drama" and "history".

Output:

Output for example: search ebook metadata

ℹ️

  • The results of ebook-meta were already cached from previous runs of the search-ebooks script by using the --use-cache flag. Hence, the running time of the script can be speed up not only by caching the text conversion of ebooks but also the results of ebook-meta.

  • Here is the output of ebook-meta when running it on Julius Caesar by William Shakespeare.epub:

    Output of ``ebook-meta``
  • All the other 16 ebooks from the ~/ebooks/ folder were rejected for not satisfying the two regexes (--query and --tags).

  • Julius Caesar by William Shakespeare.pdf doesn't have any tag, unlike its epub counterpart.

  • Julius Caesar by William Shakespeare.epub only matches once for the word "treason".


If we don't cache calibre's ebook-meta and the converted files (txt), the searching time is greater:

Output for example: search ebook metadata without cache

ℹ️

See Cache for important info to know about using the --use-cache flag.

Roadmap

Starting from first priority tasks

Short-term

  1. Add examples for searching text content and metadata of ebooks
  2. Add instructions on how to install the searchebooks package
  3. Test on linux
  4. Create a docker image for this project

Medium-term

  1. Add tests on Travis CI

  2. Eventually add documentation on Read the Docs

  3. Read also metadata from calibre's metadata.opf if found

  4. Add support for Lucene as a search backend

    PyLucene will be used to access Lucene's text indexing and searching capabilities from Python

Long-term

  1. Add support for multiprocessing so you can have multiple ebook files being searched in parallel based on the number of cores

  2. Implement a GUI, specially to make navigation of search results easier since you can have thousands of matches for a given search query

    Though, for the moment not sure which GUI library to choose from (e.g. Kivy, TkInter)

License

This program is licensed under the GNU General Public License v3.0. For more details see the LICENSE file in the repository.

References

[1]txt, html, rtf, rtfd, doc, wordml, or webarchive. See https://ss64.com/osx/textutil.html
[2]Regex from stackoverflow (but without positive lookahead)