kevinboone/epub2txt2

Spine items with URL-encoded hrefs are not handled correctly

kevinboone opened this issue · 0 comments

Although unusual, it's legitimate for the XHTML documents in an EPUB to have filenames containing whitespace and punctuation characters. When these files are referenced in the manifest/spine in content.opf, the href values should be URL-encoded. Often they aren't but, when they are, epub2txt fails because it doesn't decode the href before using it as a filename. So if we have

<item href="foo%20bar.xhtml"/>

the program ends up looking for a file that is actually called "foo%20bar.xhtml" instead of decoding it to "foo bar.xhtml".
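
For illustration, here is a minimal sketch of the kind of decoding step that seems to be missing. It is not code from epub2txt; `decode_href` and `hex_value` are hypothetical names, and the sketch assumes the href is a plain NUL-terminated string that only needs its `%XX` escapes turned back into bytes before being used as a filename.

```c
#include <stdlib.h>
#include <string.h>

/* Return the numeric value of a hex digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper, not part of epub2txt: return a newly allocated
 * copy of href with each %XX escape replaced by the byte it encodes.
 * Malformed escapes (and %00, which would truncate the string) are
 * copied through unchanged. The caller frees the result.
 * Returns NULL if allocation fails.
 */
char *decode_href(const char *href)
{
    size_t len = strlen(href);
    char *out = malloc(len + 1); /* decoded text is never longer */
    size_t i, j = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        if (href[i] == '%' && i + 2 < len + 1 && i + 2 <= len - 1) {
            int hi = hex_value(href[i + 1]);
            int lo = hex_value(href[i + 2]);
            if (hi >= 0 && lo >= 0 && (hi != 0 || lo != 0)) {
                out[j++] = (char)(hi * 16 + lo);
                i += 2; /* skip the two hex digits just consumed */
                continue;
            }
        }
        out[j++] = href[i]; /* ordinary character or malformed escape */
    }
    out[j] = '\0';
    return out;
}
```

With something like this applied to each href before the file is opened, "foo%20bar.xhtml" would become "foo bar.xhtml". Note that the sketch deliberately leaves "+" alone, since that convention belongs to form encoding rather than to URI paths.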