Latest version: 0.0.7 (2018/09)
Small Python library to read metadata information from an ePub (2 and 3) file.
It does not depend on any external library and runs on Python 3 and 2.
pip install epub_meta
import epub_meta
Discover the main metadata of the ePub file
>>> metadata = epub_meta.get_epub_metadata('/path/to/my_epub_file.epub')
>>> type(metadata)
<dict>
>>> metadata
{ ... }
Example:
>>> data = epub_meta.get_epub_metadata('/path/to/pro_git.epub', read_cover_image=True, read_toc=True)
>>> print(data)
{
'authors': [u'Scott Chacon'],
'epub_version': u'2.0',
# ISBN, uuids etc
'identifiers': [u'bf50c6e1-eb0a-4a1c-a2cd-ea8809ae086a', u'9781430218333'],
'language': u'en',
'publication_date': u'2009-08-19T00:00:00+00:00',
'publisher': u'Springer',
'subject': u'Software Development',
'title': u'Pro Git',
# import base64 ; base64.b64decode(data.cover_image_content)
'cover_image_content': [base64 string],
'cover_image_extension': '.jpg',
'toc': [
{'index': 0, 'title': 'Getting Started', 'src': 'progit_split_000.html', 'level': 0},
{'index': 1, 'title': 'Git Basics', 'level': 0, 'src': 'progit_split_008.html'},
{'index': 2, 'title': 'Git Branching', 'level': 0, 'src': 'progit_split_017.html'},
{'index': 3, 'title': 'Git on the Server', 'src': 'progit_split_025.html', 'level': 0},
{'index': 4, 'title': 'Distributed Git', 'src': 'progit_split_037.html', 'level': 0},
{'index': 5, 'title': 'Git Tools', 'src': 'progit_split_042.html', 'level': 0},
{'index': 6, 'title': 'Customizing Git', 'src': 'progit_split_051.html', 'level': 0},
{'index': 7, 'title': 'Git and Other Systems', 'src': 'progit_split_057.html', 'level': 0},
{'index': 8, 'title': 'Git Internals', 'src': 'progit_split_061.html', 'level': 0}
],
'file_size_in_bytes': 4346158
}
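Since each toc entry carries index and level keys, a readable outline can be rebuilt from the list. The following is a minimal sketch, assuming the entries look like the ones in the example above. format_toc is a hypothetical helper, not part of epub_meta, and the level-1 entry in the sample data is made up for illustration.

```python
def format_toc(toc, indent="  "):
    """Return the ToC as indented text, one line per entry.

    Assumes each entry is a dict with 'index', 'level' and 'title' keys,
    as in the example output. format_toc is a hypothetical helper.
    """
    lines = []
    # Sort by 'index' so the reading order is explicit.
    for entry in sorted(toc, key=lambda e: e["index"]):
        # Indent by 'level' to show the depth of the entry.
        lines.append(indent * entry["level"] + entry["title"])
    return "\n".join(lines)


# Sample data; the level-1 entry is invented to show the indentation.
sample_toc = [
    {"index": 1, "title": "Git Basics", "level": 0, "src": "progit_split_008.html"},
    {"index": 0, "title": "Getting Started", "level": 0, "src": "progit_split_000.html"},
    {"index": 2, "title": "Recording Changes", "level": 1, "src": "progit_split_009.html"},
]
print(format_toc(sample_toc))
```

With real data, the list passed in would be the value stored under the toc key of the returned metadata.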
You can access the dict keys using dot notation:
data.authors
data.epub_version
...
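The cover image is returned as a base64 string, so it has to be decoded before it can be written to disk. Here is a minimal sketch of that step, assuming cover_image_content and cover_image_extension have the shape shown in the example. save_cover is a hypothetical helper, not part of epub_meta, and the sample dict is a stand-in with fake image bytes.

```python
import base64
import os
import tempfile


def save_cover(metadata, directory, basename="cover"):
    """Decode the base64 cover and write it to disk; return the path or None.

    Assumes 'cover_image_content' is a base64 string (or bytes) and
    'cover_image_extension' includes the leading dot, as in the example.
    save_cover is a hypothetical helper, not part of epub_meta.
    """
    content = metadata.get("cover_image_content")
    if not content:
        return None  # nothing to write (assumed to mean no cover was read)
    extension = metadata.get("cover_image_extension") or ""
    path = os.path.join(directory, basename + extension)
    with open(path, "wb") as handle:
        handle.write(base64.b64decode(content))
    return path


# Stand-in for the dict returned by get_epub_metadata(); the bytes are fake.
fake_metadata = {
    "cover_image_content": base64.b64encode(b"fake image bytes").decode("ascii"),
    "cover_image_extension": ".jpg",
}
cover_path = save_cover(fake_metadata, tempfile.mkdtemp())
print(cover_path)
```

The sketch uses plain key access rather than dot notation so that it also works with an ordinary dict.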
You should check for invalid ePub files or for unknown ePub conventions:
try:
    epub_meta.get_epub_metadata('/path/to/my_epub_file.epub')
except epub_meta.EPubException as e:
    print(e)
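When reading a whole folder of books, the same try/except can be used to keep one bad file from stopping the run. This is only a sketch: read_many is a hypothetical helper, and the reader function and exception class are passed in as arguments so the example runs without epub_meta or real ePub files. The fake reader and fake exception below are stand-ins.

```python
def read_many(paths, read_func, error_cls):
    """Read metadata for several files, keeping failures separate.

    read_func and error_cls are injected so the sketch does not import
    epub_meta itself; read_many is a hypothetical helper.
    """
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = read_func(path)
        except error_cls as error:
            # Record the message and carry on with the remaining files.
            failures[path] = str(error)
    return results, failures


# Stand-ins used only to exercise the helper without real ePub files.
class FakeEPubError(Exception):
    pass


def fake_reader(path):
    if not path.endswith(".epub"):
        raise FakeEPubError("not an ePub: " + path)
    return {"title": path}


ok, failed = read_many(["a.epub", "b.txt"], fake_reader, FakeEPubError)
print(ok)
print(failed)
```

With the library installed, the call would presumably look like read_many(paths, epub_meta.get_epub_metadata, epub_meta.EPubException).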
To parse the ePub OPF file yourself, you can get the raw content of the OPF XML file:
print(epub_meta.get_epub_opf_xml('/path/to/my_epub_file.epub'))
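Once you have the OPF XML, the standard library is enough to pull out individual fields. The following sketch assumes the XML follows the usual OPF layout, with Dublin Core elements inside the metadata element. dc_values is a hypothetical helper, and SAMPLE_OPF is a hand-written stand-in for the text returned by get_epub_opf_xml.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OPF package documents and Dublin Core metadata.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample standing in for the output of get_epub_opf_xml().
SAMPLE_OPF = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Pro Git</dc:title>"
    "<dc:creator>Scott Chacon</dc:creator>"
    "<dc:language>en</dc:language>"
    "</metadata>"
    "</package>"
)


def dc_values(opf_xml, name):
    """Return the text of every Dublin Core element called `name`."""
    root = ET.fromstring(opf_xml)
    return [el.text for el in root.findall("opf:metadata/dc:" + name, NS)]


print(dc_values(SAMPLE_OPF, "title"))
print(dc_values(SAMPLE_OPF, "creator"))
```

Real OPF files vary between publishers, so a field missing from the metadata dict may simply be stored under a different element or attribute.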
- Fixed url encoded strings
- Accepting relative paths
- Discover description if available
- Parsing and reading authors in pr02.html file if available
- Parsing and reading the publish date in pr01.html if available
- No more duplicate authors (preserving the order)
- Improvements in the ToC parser/reader
- Avoid infinite loop for bad/unknown epub files
- Backward incompatibility: Returning ToC as a list of objects instead of a list of strings
- The ToC information includes the source of the section: property src
- The ToC is hierarchical, so we include a level property to identify the depth of the ToC item in the list
- The ToC order is important, so we include an index property to keep the order explicit
- Trimming some string values
- Added the file size into the metadata dict
- Fixed TOC discovering for ePub v3 files
- get_epub_metadata(path, read_cover_image=True, read_toc=True) function
- get_epub_opf_xml(path) function
- Read cover image content in base64
- Read TOC contents as a list of strings
Useful commands:
# Create a virtual env
make prepare
# Install all dependencies
make deps
# Run tests
make test
# Run tests with Tox (for all Python compatible versions)
make test_all
# Run coverage
make coverage
# Useful command for running tests before pushing to Git
make push