pyebooktools: A Python repository from raul23

🚧 Work-In-Progress

This project (version 0.1.0a3) is a Python port of ebook-tools which is written in Shell by na--. The Python script ebooktools.py is a collection of tools for automated organization and management of large ebook collections.

Check also my other project search-ebooks which is based on pyebooktools for searching through the content and metadata of ebooks.

⚠️

Check organize-ebooks which is the Python port of organize-ebooks.sh and includes a Docker image for easy installation of all needed dependencies and Python package.

About

The ebooktools.py script is a Python port of the shell scripts from ebook-tools and makes use of the following modules:

edit_config.py edits a configuration file which can either be the main config file that contains all the options defined below or the logging config file whose default values are defined in default_logging.py. The edit subcommand from the ebooktools.py script uses this module.
convert_to_txt.py converts the supplied file to a text file. It can optionally also use OCR for .pdf, .djvu and image files. The convert subcommand from the ebooktools.py script uses this module.
find_isbns.py tries to find valid ISBNs inside a file or in a string if no file was specified. Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found. For more details, see
- the documentation for ebook-tools (shell scripts) or
- search_file_for_isbns() from lib.py (Python function where ISBNs search in files is implemented).
The find subcommand from the ebooktools.py script uses this module.
organize_ebooks.py is used to automatically organize folders with potentially huge amounts of unorganized ebooks. This is done by renaming the files with proper names and moving them to other folders:
- By default it searches the supplied ebook files for ISBNs, downloads the book metadata (author, title, series, publication date, etc.) from online sources like Goodreads, Amazon and Google Books and renames the files according to a specified template.
- If no ISBN is found, the script can optionally search for the ebooks online by their title and author, which are extracted from the filename or file metadata.
- Optionally an additional file that contains all the gathered ebook metadata can be saved together with the renamed book so it can later be used for additional verification, indexing or processing.
- Most ebook types are supported: .epub, .mobi, .azw, .pdf, .djvu, .chm, .cbr, .cbz, .txt, .lit, .rtf, .doc, .docx, .pdb, .html, .fb2, .lrf, .odt, .prc and potentially others. Even compressed ebooks in arbitrary archive files are supported. For example a .zip, .rar or other archive file that contains the .pdf or .html chapters of an ebook can be organized without a problem.
- Optical character recognition (OCR [Wikipedia]) can be automatically used for .pdf, .djvu and image files when no ISBNs were found in them by the fast and straightforward conversion to .txt. This is very useful for scanned ebooks that only contain images or were badly OCR-ed in the first place.
- Files are checked for corruption (zero-filled files, broken pdfs, corrupt archive, etc.) and corrupt files can optionally be moved to another folder.
- Non-ebook documents, pamphlets and pamphlet-like documents like saved webpages, short pdfs, etc. can also be detected and optionally moved to another folder.
Ref.: [ORG]
The organize subcommand from the ebooktools.py script uses this module.
rename_calibre_library.py traverses a calibre library folder, renames all the book files in it by reading their metadata from calibre's metadata.opf files. Then the book files are either moved or symlinked to an output folder along with their corresponding metadata files. The rename subcommand from the ebooktools.py script uses this module.
split_into_folders.py splits the supplied ebook files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files. The split subcommand from the ebooktools.py script uses this module.
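
As a rough illustration of the splitting behaviour described for split_into_folders.py, the sketch below groups a list of file names into consecutively numbered folders. The function name plan_split, the folder_pattern format and the start_number parameter are assumptions made for this example; they are not taken from the actual module, and nothing is moved on disk.

```python
from typing import Dict, List


def plan_split(files: List[str], files_per_folder: int,
               start_number: int = 0,
               folder_pattern: str = "{:05d}") -> Dict[str, List[str]]:
    """Map consecutive folder names to the files each one would receive.

    All names and defaults here are hypothetical; this only computes a plan.
    """
    if files_per_folder < 1:
        raise ValueError("files_per_folder must be at least 1")
    plan = {}
    for offset in range(0, len(files), files_per_folder):
        # Each chunk of files_per_folder files gets the next folder number
        folder_number = start_number + offset // files_per_folder
        folder_name = folder_pattern.format(folder_number)
        plan[folder_name] = files[offset:offset + files_per_folder]
    return plan


if __name__ == "__main__":
    books = ["a.epub", "b.pdf", "c.mobi", "d.djvu", "e.txt"]
    for folder, names in plan_split(books, files_per_folder=2).items():
        print(folder, names)
```

Computing the plan separately from the file operations makes it easy to preview the result before anything is touched.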

Thus, you have access to various subcommands from within the ebooktools.py script.

⭐

ebook-tools is the original Shell project I ported to Python. I used the same names for the script options (short and longer versions) so that if you used the shell scripts, you will easily know how to run the corresponding subcommand with the given options.

ebooktools.py is the name of the Python script, which will always be referred to that way in this document (i.e. no hyphen and ending with .py) to distinguish it from the original Shell project ebook-tools.

pyebooktools is the name of the Python package that you need to install to have access to the ebooktools.py script.

Installation and dependencies

To install the script ebooktools.py, follow these steps:

Install the dependencies below.
Install the pyebooktools package below.

Python dependencies

Platforms: macOS [soon linux]
Python: >= 3.6
lxml >= 4.4 for parsing Calibre's metadata.opf files.

ℹ️

When installing the pyebooktools package below, the lxml library is automatically installed if it is not found or upgraded to the correct supported version.
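
To give an idea of what reading calibre's metadata.opf files involves, here is a minimal sketch that extracts the title and authors from an OPF document. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without extra packages, whereas the project itself relies on lxml. The sample document and the function name read_opf_metadata are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace used by OPF metadata elements such as dc:title
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical, heavily trimmed metadata.opf content
SAMPLE_OPF = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>An Example Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:creator>John Smith</dc:creator>
  </metadata>
</package>
"""


def read_opf_metadata(opf_text):
    """Return the title and the list of authors found in an OPF string."""
    # Parse from bytes so the XML encoding declaration is honored
    root = ET.fromstring(opf_text.encode("utf-8"))
    title_element = root.find(".//dc:title", DC_NS)
    title = title_element.text if title_element is not None else None
    authors = [el.text for el in root.findall(".//dc:creator", DC_NS)]
    return {"title": title, "authors": authors}


if __name__ == "__main__":
    print(read_opf_metadata(SAMPLE_OPF))
```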

Other dependencies

As explained in the documentation for ebook-tools, you need recent versions of:

calibre for fetching metadata from online sources, conversion to txt (for ISBN searching) and ebook metadata extraction. Versions 2.84 and above are preferred because of their ability to manually specify from which specific online source we want to fetch metadata. For earlier versions you have to set isbn_metadata_fetch_order and organize_without_isbn_sources to empty strings.

p7zip for ISBN searching in ebooks that are in archives.

Tesseract for running OCR on books - version 4 gives better results even though it's still in alpha. OCR is disabled by default and another engine can be configured if preferred.

Optionally poppler, catdoc and DjVuLibre can be installed to convert .pdf, .doc and .djvu files respectively to .txt faster than calibre does.

⚠️

On macOS, you don't need catdoc since the operating system has the built-in textutil command-line tool that converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file.

Optionally the Goodreads and WorldCat xISBN calibre plugins can be installed for better metadata fetching.

⭐

If you only install calibre among these dependencies, you can still have a functioning program that will organize and manage your ebook collections:

fetching metadata from online sources will work: by default calibre comes with Amazon and Google sources among others

conversion to txt will work: calibre's own ebook-convert tool will be used

All subcommands should work but accuracy and performance will be affected as explained in the list of dependencies above.

Install `pyebooktools`

Install first the Python dependencies and other tools.
It is highly recommended to install the pyebooktools package in a virtual environment using for example venv or conda.
Make sure to update pip:
```
$ pip install --upgrade pip
```

Install the pyebooktools package (bleeding-edge version) with pip:

$ pip install git+https://github.com/raul23/pyebooktools#egg=pyebooktools

⚠️

Make sure that pip is working with the correct Python version. It might be the case that pip is using Python 2.x. You can find which Python version pip uses with the following:
$ pip -V
If pip is working with the wrong Python version, then try to use pip3, which works with Python 3.x.

Test installation

Test your installation by importing pyebooktools and printing its version:
```
$ python -c "import pyebooktools; print(pyebooktools.__version__)"
```
You can also test that you have access to the ebooktools.py script by showing the program's version:
```
$ ebooktools --version
```

Usage, options and configuration

All of the options documented below can either be passed to the ebooktools.py script via command-line arguments or via the configuration file config.py which is created along with the logging config file logging.py when the ebooktools.py script is run the first time with any of the subcommands defined below. The default values for these config files are taken from default_config.py and default_logging.py, respectively.

To use the parameters found in the configuration file config.py, pass the --use-config flag. With this flag, you don't need to type a long command line in the terminal. See the edit subcommand to know how to edit this configuration file.

Most arguments are not required and if nothing is specified, the default values defined in the default config file default_config.py will be used.
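
The way the defaults, the config file and the command line relate to each other can be pictured with the sketch below. It is only an illustration: the function resolve_options and the plain dictionaries standing in for default_config.py, config.py and the parsed arguments are assumptions, not the script's actual implementation.

```python
def resolve_options(defaults, config_file, cli_args, use_config=False):
    """Return the effective options for a run (hypothetical helper).

    With use_config=True the values from the config file are used and the
    command-line values are ignored; otherwise the command-line values
    override the defaults. A value of None means "not given".
    """
    chosen = config_file if use_config else cli_args
    options = dict(defaults)
    options.update({key: value for key, value in chosen.items()
                    if value is not None})
    return options


if __name__ == "__main__":
    defaults = {"dry_run": False, "symlink_only": False, "quiet": False}
    config_file = {"dry_run": True}
    cli_args = {"symlink_only": True, "quiet": None}
    print(resolve_options(defaults, config_file, cli_args))
    print(resolve_options(defaults, config_file, cli_args, use_config=True))
```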

The ebooktools.py script consists of various subcommands for the organization and management of ebook collections. The usage pattern for running one of the subcommands is as follows:

ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]

where [OPTIONS] includes general options (as defined in the General options section) and options specific to the subcommand (as defined in the Script usage, subcommands and options section).
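
This usage pattern maps naturally onto argparse subparsers. The sketch below is an assumption about how such a command line could be declared, not a copy of the script's own parser; only a few of the documented general options are wired in, and the function name build_parser is hypothetical.

```python
import argparse

SUBCOMMANDS = ("edit", "convert", "find", "organize", "rename", "split")


def build_parser():
    """Build a parser with one subparser per subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ebooktools")
    subparsers = parser.add_subparsers(dest="subcommand")
    subparsers.required = True  # a subcommand must always be given
    for name in SUBCOMMANDS:
        sub = subparsers.add_parser(name)
        # A few general options shared by every subcommand
        sub.add_argument("-q", "--quiet", action="store_true")
        sub.add_argument("--verbose", action="store_true")
        sub.add_argument("-u", "--use-config", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["find", "--verbose"])
    print(args.subcommand, args.verbose, args.quiet)
```

Giving each subcommand its own subparser is also what lets -h print a separate help message per subcommand.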

⚠️

In order to avoid data loss, use the --dry-run or --symlink-only option when running some of the subcommands (e.g. rename and split) to make sure that they would do what you expect them to do, as explained in the Security and safety section.

General options

Most of these options are part of the common library lib.py and may affect some or all of the subcommands.

General control flags

-h, --help; no config variable; default value False

Show the help message and exit.

-v, --version; no config variable; default value False

Show program's version number and exit.

-q, --quiet; config variable quiet; default value False

Enable quiet mode, i.e. nothing will be printed.

--verbose; config variable verbose; default value False

Print various debugging information, e.g. print traceback when there is an exception.

-u, --use-config; no config variable; default value False

If this is enabled, the parameters found in the main config file config.py will be used instead of the command-line arguments.

ℹ️

Note that any other command-line argument that you use in the terminal with the --use-config flag is ignored, i.e. only the parameters defined in the main config file config.py will be used.

-d, --dry-run; config variable dry_run; default value False

If this is enabled, no file rename/move/symlink/etc. operations will actually be executed.

--sl, --symlink-only; config variable symlink_only; default value False

Instead of moving the ebook files, create symbolic links to them.

--km, --keep-metadata; config variable keep_metadata; default value False

Do not delete the gathered metadata for the organized ebooks; instead, save it in an accompanying file together with each renamed book. It is very useful for additional verification, indexing or processing at a later date. [KM]
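
The --dry-run and --symlink-only flags described above could be honored by a small helper like the one sketched here. This is an illustration under stated assumptions: the function name move_or_link and its messages are invented and do not come from lib.py.

```python
import os
import shutil
import tempfile


def move_or_link(src, dst, dry_run=False, symlink_only=False):
    """Move src to dst, symlink it instead, or only report the action."""
    action = "symlink" if symlink_only else "move"
    if dry_run:
        # Nothing is touched on disk in dry-run mode
        return "(dry run) would {} '{}' -> '{}'".format(action, src, dst)
    if symlink_only:
        os.symlink(os.path.abspath(src), dst)
    else:
        shutil.move(src, dst)
    return "{} '{}' -> '{}'".format(action, src, dst)


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    source = os.path.join(folder, "book.pdf")
    open(source, "w").close()
    target = os.path.join(folder, "renamed.pdf")
    print(move_or_link(source, target, dry_run=True))
    print(move_or_link(source, target, symlink_only=True))
```

Checking dry_run before any file operation keeps the preview path free of side effects.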

Script usage, subcommands and options

The usage pattern for running a given subcommand is the following:

ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]

where [OPTIONS] includes general options and options specific to the subcommand as shown below.

ℹ️

Don't forget the name of the Python script ebooktools before the subcommand.

All subcommands are affected by the following global options:

-h, --help
-q, --quiet
--verbose
-u, --use-config
--log-level
--log-format

The -h, --help option can be applied specifically to each subcommand or to the ebooktools.py script (when called without the subcommand). Thus when you want the help message for a specific subcommand, you do:

ebooktools {edit,convert,find,organize,rename,split} -h

which will show you the options that affect the chosen subcommand.

And if you want the help message for the whole ebooktools.py script:

ebooktools -h

which will show you the project description and description of each subcommand without showing the subcommand options.

Examples

More examples can be found at examples.rst.

Example 1: convert a pdf file to text with OCR

To convert a pdf file to text with OCR:

$ ebooktools convert --ocr always -o converted.txt pdf_to_convert.pdf

By setting --ocr to always, the pdf file will be first OCRed before trying the simple conversion tools (pdftotext or calibre's ebook-convert if the former command is not found).

Running pyebooktools v0.1.0a3
Verbose option disabled
OCR=always, first try OCR then conversion
Will run OCR on file 'pdf_to_convert.pdf' with 1 page...
OCR successful!
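
The effect of --ocr on the order of the conversion attempts can be summarised with a small sketch. The always case follows the description above. The false value (OCR disabled) and the intermediate true value (OCR only as a fallback) are assumed from the ebook-tools documentation, and the function name and step labels are invented for illustration.

```python
def conversion_steps(ocr_mode):
    """Return the ordered list of strategies to try for a given OCR mode."""
    if ocr_mode == "always":
        # OCR first, then the simple conversion tools
        return ["ocr", "convert"]
    if ocr_mode == "true":
        # Assumed behaviour: simple conversion first, OCR as a fallback
        return ["convert", "ocr"]
    if ocr_mode == "false":
        # Assumed behaviour: OCR is never attempted
        return ["convert"]
    raise ValueError("unknown OCR mode: {!r}".format(ocr_mode))


if __name__ == "__main__":
    for mode in ("always", "true", "false"):
        print(mode, conversion_steps(mode))
```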

Example 2: find ISBNs in a pdf file

Find ISBNs in a pdf file:

$ ebooktools find pdf_file.pdf

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
Searching file 'pdf_file.pdf' for ISBN numbers...
Extracted ISBNs:
9789580158448
1000100111

The search for ISBNs starts in the first pages of the document to increase the likelihood that the first extracted ISBN is the correct one. Then the last pages are analyzed in reverse. Finally, the rest of the pages are searched.

Thus, in this example, the first extracted ISBN is the correct one associated with the book since it was found in the first page.
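
The search order described above (first pages, then last pages in reverse, then the rest) can be sketched as a simple reordering of a list. The function name reorder_for_isbn_search and the first_n and last_n parameters are hypothetical; the real implementation may work on lines of converted text rather than on pages.

```python
def reorder_for_isbn_search(pages, first_n, last_n):
    """Return pages ordered as: first pages, last pages reversed, the rest."""
    if first_n < 0 or last_n < 0:
        raise ValueError("first_n and last_n must be non-negative")
    # Clamp the counts so the head and tail never overlap
    first_n = min(first_n, len(pages))
    last_n = min(last_n, len(pages) - first_n)
    split_at = len(pages) - last_n
    head = pages[:first_n]
    tail = pages[split_at:]
    middle = pages[first_n:split_at]
    return head + tail[::-1] + middle


if __name__ == "__main__":
    print(reorder_for_isbn_search(list(range(1, 11)), first_n=2, last_n=3))
```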

The last sequence, 1000100111, was found in the middle of the document and is not the book's ISBN: it passes the ISBN checksum, so it is technically valid, but the regular expression isbn_blacklist_regex did not catch it. It may be a binary sequence that is part of a problem in a book about digital systems.
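
To see why a sequence such as 1000100111 can slip through, the sketch below implements the standard ISBN-10 and ISBN-13 checksum rules. It is an independent illustration and not the validation code from lib.py; the function names are assumptions, and only the checksum is verified (no prefix or blacklist checks).

```python
DIGITS = "0123456789"


def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum: weighted sum (10..1) divisible by 11."""
    if len(isbn) != 10:
        return False
    if any(c not in DIGITS for c in isbn[:9]):
        return False
    if isbn[9] not in DIGITS + "Xx":
        return False
    values = [int(c) for c in isbn[:9]]
    values.append(10 if isbn[9] in "Xx" else int(isbn[9]))
    total = sum(weight * v for weight, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Check the ISBN-13 checksum: weights 1,3,1,3,... divisible by 10."""
    if len(isbn) != 13 or any(c not in DIGITS for c in isbn):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(isbn):
    """Accept either form, ignoring hyphens and spaces."""
    cleaned = isbn.replace("-", "").replace(" ", "")
    return is_valid_isbn10(cleaned) or is_valid_isbn13(cleaned)


if __name__ == "__main__":
    for candidate in ("9789580158448", "1000100111", "1000100112"):
        print(candidate, is_valid_isbn(candidate))
```

Because the checksum alone accepts many digit runs that are not real ISBNs, a blacklist of suspicious patterns is a useful complement.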

Uninstall

To uninstall the pyebooktools package:

$ pip uninstall pyebooktools

ℹ️

When uninstalling the pyebooktools package, you might be informed that the configuration files logging.py and config.py won't be removed by pip. You can remove those files manually by noting their paths returned by pip. Or you can leave them so your saved settings can be re-used the next time you re-install the package.

Example: uninstall the package and remove the config files

$ pip uninstall pyebooktools
Found existing installation: pyebooktools 0.1.0a3
Uninstalling pyebooktools-0.1.0a3:
  Would remove:
    /Users/test/miniconda3/envs/ebooktools_py37/bin/ebooktools
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools-0.1.0a3.dist-info/*
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/*
  Would not remove (might be manually added):
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/config.py
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/logging.py
Proceed (y/n)? y
  Successfully uninstalled pyebooktools-0.1.0a3
$ rm -r /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/

Limitations

The same limitations as for ebook-tools apply to this project too:

Automatic organization can be slow - all the scripts are synchronous and single-threaded and metadata lookup by ISBN is not done concurrently. This is intentional so that the execution can be easily traced and so that the online services are not hammered by requests. If you want to optimize the performance, run multiple copies of the script on different folders.
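
Running several copies of the script on different folders could be automated along the lines of the sketch below. It assumes that the organize subcommand takes the folder to organize as a positional argument, which is not confirmed by this document; the function names are hypothetical, and --dry-run is enabled by default so that nothing is moved while experimenting.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(folder, dry_run=True):
    """Build the command line for one folder (positional folder is assumed)."""
    command = ["ebooktools", "organize"]
    if dry_run:
        command.append("--dry-run")
    command.append(folder)
    return command


def organize_folders(folders, max_workers=2, dry_run=True,
                     runner=subprocess.run):
    """Run one copy of the script per folder, a few at a time."""
    commands = [build_command(folder, dry_run) for folder in folders]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Results come back in the same order as the folders
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Print the commands instead of executing them
    for command in organize_folders(["folder_a", "folder_b"],
                                    runner=lambda cmd: cmd):
        print(" ".join(command))
```

Keeping max_workers small respects the concern stated above about not hammering the online metadata services.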

The default setting for isbn_metadata_fetch_order includes two non-standard metadata sources: Goodreads and WorldCat xISBN. For best results, install the plugins (1, 2) for them in calibre and fine-tune the settings for metadata sources in the calibre GUI.

Security and safety

Important security and safety tips from the ebook-tools documentation:

Please keep in mind that this is beta-quality software. To avoid data loss, make sure that you have a backup of any files you want to organize. You may also want to run the scripts with the --dry-run or --symlink-only option the first time to make sure that they would do what you expect them to do.

Also keep in mind that these shell scripts parse and extract complex arbitrary media and archive files and pass them to other external programs written in memory-unsafe languages. This is not very safe and specially-crafted malicious ebook files can probably compromise your system when you use these scripts. If you are cautious and want to organize untrusted or unknown ebook files, use something like QubesOS or at least do it in a separate VM/jail/container/etc.

NOTE: --dry-run and --symlink-only can be applied to the following subcommands:

organize
rename
split: only --dry-run is applicable

Roadmap

Starting from first priority tasks

Short-term

Port all ebook-tools shell scripts into Python
- ~~organize-ebooks.sh~~ : done, see organize_ebooks.py
- interactive-organizer.sh
- ~~find-isbns.sh~~ : done, see find_isbns.py
- ~~convert-to-txt.sh~~ : done, see convert_to_txt.py
- ~~rename-calibre-library.sh~~ : done, see rename_calibre_library.py
- ~~split-into-folders.sh~~ : done, see split_into_folders.py
Status: only interactive-organizer.sh remaining, will port later
Add cache support when converting files to txt

Status: working on it since it is also needed for my other project search-ebooks which makes heavy use of pyebooktools
Test on linux
Create a docker image for this project

Medium-term

Add tests on Travis CI
Eventually add documentation on Read the Docs
Add a fix subcommand that will try to fix corrupted PDF files based on one of the following utilities:
- ~~gs: Ghostscript~~ ; done, see fix_file_for_corruption()
- pdftocairo: from Poppler
- mutool: it does not "print" the PDF file
- cpdf
It will also check PDF files based on one of the following utilities:
- pdfinfo
- pdftotext
- qpdf
- jhove
Add a remove subcommand that can remove annotations (incl. highlights, comments, notes, arrows), bookmarks, attachments and metadata from PDF files based on the cpdf utility

NOTE: pdftk can also remove annotations

Credits

Special thanks to na--, the developer of ebook-tools, for having made these very useful tools. I learned a lot (especially bash) while porting them to Python.
Thanks to all the developers of the different programs used by this project such as calibre, Tesseract, text converters (djvutxt and pdftotext) and many other utilities!