bfabiszewski/libmobi

convert mobi ebook to epub error

hehonghui opened this issue · 12 comments

Converting a mobi file to epub completes successfully, but the resulting epub is malformed: it can't be opened by iBooks or by many Android epub readers. I checked the epub file with calibre-edit and got the errors below:

ERROR: Parsing failed: xmlParseEntityRef: no name, line 1, column 807    [OEBPS/part00000.html]
INFO: File too large    [OEBPS/part00000.html]

123_test.epub.zip

Mobitool offers only a very simple conversion. I included it just as a reference for how the library could be used to create a converter.
This tool only wraps the HTML extracted from the mobi file in an epub container. It doesn't check whether the original HTML is correct and doesn't try to repair it.
In this case the original HTML seems to be broken (it has an unescaped entity), and consequently the resulting epub is also broken.
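
To illustrate why an unescaped entity breaks the epub: readers parse the content documents as XML, and a bare `&` is not well-formed XML. The minimal sketch below reproduces that kind of failure with Python's standard XML parser. The helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of libmobi or mobitool.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample fragments, not taken from the attached files.
broken = "<p>Tom & Jerry</p>"      # bare ampersand: not well-formed XML
fixed = "<p>Tom &amp; Jerry</p>"   # ampersand written as an entity

print(is_well_formed(broken))  # False
print(is_well_formed(fixed))   # True
```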

But all the epub ebooks I converted were broken, not just the attached file.

Could you give me some examples? I need original mobi files to check it.
I noticed your previous example was converted from a webpage by some calibre plugin. Maybe the plugin creates invalid HTML?

The mobi file was created by calibre; you can check the attached 123_test.mobi. Thanks @bfabiszewski

123_test.mobi.zip

@bfabiszewski I tested some mobi files that were not created by calibre, and the epub output had errors too.

The file 123_test.mobi contains broken HTML. The problem is the unescaped ampersand, which later breaks the epub reader.
Below are just a few warnings extracted from the HTML linter log:

line 1 column 993 - Warning: unescaped & which should be written as &amp;
line 1 column 1638 - Warning: unescaped & which should be written as &amp;
line 1 column 1754 - Warning: unescaped & which should be written as &amp;
line 1 column 1869 - Warning: unescaped & which should be written as &amp;
line 1 column 1981 - Warning: unescaped & which should be written as &amp;
...
15199 warnings, 116 errors were found! Not all warnings/errors were shown.

You can give me another example that was not created by calibre, but I expect similar problems.
If the original mobi file contains broken HTML, the same HTML will be present in the epub file.
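
For anyone post-processing the extracted HTML themselves, the warnings above suggest one possible repair: rewrite every `&` that does not start a character reference as `&amp;`. The following is only a rough sketch using a regular expression. The names `BARE_AMP` and `escape_bare_ampersands` are made up for this example, and the pattern does not special-case scripts, comments, or CDATA sections, so a real HTML tool is likely a safer choice.

```python
import re

# An ampersand that does NOT begin a named (&amp;), decimal (&#38;)
# or hexadecimal (&#x26;) character reference.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(html: str) -> str:
    """Replace bare ampersands with &amp;, leaving existing references alone."""
    return BARE_AMP.sub("&amp;", html)

# Hypothetical input, not taken from the attached mobi files.
print(escape_bare_ampersands('<a href="?a=1&b=2">Q &amp; A</a>'))
```

Text that happens to look like a reference (for example `AT&T;`) is left untouched by this pattern, which is one reason to treat it as a stopgap only.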

This is a mobi file created by a publisher in China, and the epub produced from it has errors too. I know about the unescaped ampersand issue; it may be common in mobi files. Calibre fixes it during conversion, so when you convert mobi to epub with calibre, you don't see it. If libmobi could do this too, it would be awesome!

47967-用Python写网络爬虫(第2版)-180927.mobi.zip

The file 47967-用Python写网络爬虫(第2版)-180927.mobi seems to convert fine. I can open the resulting epub in iBooks.

image

I converted it on Android and got the unescaped ampersand error. In short, if libmobi could fix the unescaped issue automatically, that would be awesome. :)

I see.
I don't plan to add any such features to mobitool.
It is not an easy task to create a decent converter. Calibre, for example, completely rewrites the original HTML, introducing its own styles and fixing issues.
Libmobi is a library that can be used to create such a converter, but it does not have any features for working with HTML or EPUB. Such a converter should use libmobi to unpack the mobi file and use other tools or libraries to work with the markup.
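
As a rough illustration of that division of labour, a separate post-processing step could take the epub produced by mobitool and pass each HTML part through whatever markup repair is chosen. The sketch below uses only Python's `zipfile` module. `rewrite_epub` and `fix_markup` are hypothetical names, the parts are assumed to be UTF-8, and nothing here is part of libmobi.

```python
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")

def rewrite_epub(src, dst, fix_markup):
    """Copy the EPUB at `src` to `dst`, passing each HTML part through fix_markup."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        names = zin.namelist()
        # The EPUB container format expects "mimetype" to be the first
        # entry in the archive and to be stored without compression.
        if "mimetype" in names:
            zout.writestr("mimetype", zin.read("mimetype"),
                          compress_type=zipfile.ZIP_STORED)
        for name in names:
            if name == "mimetype" or name.endswith("/"):
                continue  # already written, or a directory entry
            data = zin.read(name)
            if name.lower().endswith(HTML_SUFFIXES):
                text = data.decode("utf-8")  # assumption: parts are UTF-8
                data = fix_markup(text).encode("utf-8")
            zout.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

`fix_markup` can be any function from string to string, for example a call into an HTML cleaning library. Both `src` and `dst` may be file paths or file-like objects.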

@bfabiszewski How did you get the error report? Can you share the tool with me?

I used tidy.
tidy -e -utf8 file.html
Tidy can also fix most of the errors.
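
If the check needs to run from a script, the same command can be wrapped as in the sketch below. It assumes the `tidy` executable is on the PATH. The helper names `tidy_command` and `run_tidy` are invented for this example, and the repair variant uses tidy's `-m` (modify the file in place), `-asxhtml` and `-q` options, which are one possible choice rather than the settings used above.

```python
import subprocess

def tidy_command(path, repair=False):
    """Build a tidy command line for the file at `path`."""
    if repair:
        # -m rewrites the file in place, -asxhtml requests XHTML output,
        # -q suppresses non-essential output.
        return ["tidy", "-q", "-m", "-asxhtml", "-utf8", path]
    # -e reports errors and warnings only, as in the command above.
    return ["tidy", "-e", "-utf8", path]

def run_tidy(path, repair=False):
    """Run tidy and return (exit_status, diagnostics_text)."""
    result = subprocess.run(tidy_command(path, repair),
                            capture_output=True, text=True)
    # tidy exits with 0 when clean, 1 on warnings, 2 on errors,
    # and writes its diagnostics to stderr.
    return result.returncode, result.stderr
```

Calling `run_tidy("file.html")` corresponds to the report-only command, and `run_tidy("file.html", repair=True)` attempts a repair in place, so it is worth working on a copy.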