aerkalov/ebooklib

Option to not-read certain media types

stichiboi opened this issue · 2 comments

Hello,
I'm trying to read data from EPUBs I downloaded from the web.
I'm only interested in the text; I don't care about images or styles.
Would it be possible to add a media_type_filter option so that only the specified media types are loaded from the manifest?

I imagine something along these lines in epub.EpubReader._load_manifest:

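media_type = r.get('media-type')
# Skip items whose media type is not in the filter; an empty or unset filter loads everything.
# `continue` assumes this check runs inside the loop over manifest items.
if self.media_type_filter and media_type not in self.media_type_filter:
    continue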

The media_type_filter would just be a list that I pass in with the reader options.
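
To make the proposal concrete, here is a minimal standalone sketch of the filtering step using only the Python standard library. It is not ebooklib code: the function name select_manifest_items, the SAMPLE_OPF document, and the media_type_filter parameter are all hypothetical and only illustrate how manifest entries could be skipped by media type before anything is read from the archive.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPF package documents.
OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical manifest, made up for this example.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="styles/main.css" media-type="text/css"/>
    <item id="font" href="styles/3.ttf" media-type="application/x-font-ttf"/>
  </manifest>
</package>"""


def select_manifest_items(opf_xml, media_type_filter=None):
    """Return (href, media_type) pairs for the manifest items to load.

    An empty or missing filter keeps every item; otherwise only items
    whose media type appears in the filter are kept.
    """
    root = ET.fromstring(opf_xml)
    selected = []
    for item in root.iter("{%s}item" % OPF_NS):
        media_type = item.get("media-type")
        if media_type_filter and media_type not in media_type_filter:
            continue  # skip this item, keep scanning the rest of the manifest
        selected.append((item.get("href"), media_type))
    return selected


text_only = select_manifest_items(SAMPLE_OPF, ["application/xhtml+xml"])
everything = select_manifest_items(SAMPLE_OPF)
```

With the filter set to the XHTML media type, the font entry is never selected, so its missing file would never be requested from the archive.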

Just to be transparent: this idea originates from an error I keep getting when reading some EPUBs:

KeyError: "There is no item named 'styles/3.ttf' in the archive"

This error originates from the epub rather than from ebooklib: opening the file with Atom shows that indeed there is no styles/3.ttf (there is a fonts/3.ttf).

I don't want to throw away the whole EPUB just because the styles cannot be read, so ideally I could just skip reading them.

This should also make the process quicker.

But I'm no expert in EPUB, so maybe this is not a good idea 😓

Good point. Right now everything fails if the EPUB claims to contain something that is actually missing from the archive. One approach would be an option on the EpubReader, something like failing silently. The other would be what you suggested: a list of things to ignore or allow.
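
As a rough illustration of the "fail silently" idea, the sketch below reads a list of hrefs from a zip archive and records missing members instead of aborting. It uses only the standard library; read_items and ignore_missing are hypothetical names, not existing ebooklib options. It relies on zipfile.ZipFile.read raising KeyError for a name that is not in the archive, which is the same error reported above.

```python
import io
import zipfile


def read_items(archive, hrefs, ignore_missing=False):
    """Read each href from the archive.

    Returns (contents, missing). With ignore_missing=False the first
    missing member raises KeyError, matching the current behaviour.
    """
    contents = {}
    missing = []
    for href in hrefs:
        try:
            contents[href] = archive.read(href)
        except KeyError:
            if not ignore_missing:
                raise
            missing.append(href)  # remember it, but keep going
    return contents, missing


# Build a tiny in-memory archive whose font lives under fonts/, not styles/.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("text/ch1.xhtml", "<html/>")
    zf.writestr("fonts/3.ttf", b"\x00")

with zipfile.ZipFile(buf) as zf:
    contents, missing = read_items(
        zf, ["text/ch1.xhtml", "styles/3.ttf"], ignore_missing=True
    )
```

Returning the list of missing hrefs lets the caller decide whether to log them or ignore them, rather than hiding the problem completely.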