[Bug]: Conversion from symbols to entities causes reader to crash
Closed this issue · 7 comments
Bug Description
Sigil automatically converts symbols to entities (e.g. a non-breaking space to &nbsp;).
Please note that this causes some readers to crash.
Yes, calibre also warns about that, but I only noticed when I experienced it myself. One reader that crashes is https://www.epubread.com/ for example. So please be wary and reconsider this.
Also, Sigil behaves unexpectedly:
When importing a book with missing doctype it adds it and converts symbols to entities.
When importing a book with good doctype it does not convert symbols to entities.
I created test files for you:
symbol vs entity.zip
Platform (OS)
Linux
OS Version / Specifics
What version of Sigil are you using?
2.2.0
Any backtraces or crash reports
No response
If a reader crashes over a spec-compliant entity, then that is a reader bug.
But regardless of that, remove the non-breaking space entity (either named or numeric) from Sigil's Preserve Entities list (in Sigil prefs), and Sigil will no longer convert the non-breaking space character to an entity.
There is no bug that I see here, only misunderstood behavior/settings.
Unicode characters are not illegal in the epub2 spec.
Thanks, I stopped searching after a few minutes for this information.
And we have no default conversion list
Of course you do. The setting right after standard installation is what is considered default.
I see reason to consider changing it. Actively changing something that was valid before but breaks on devices after the change is a big step.
I see beauty in both programs.
What inconsistency bug are you referring to?
From my initial message:
Also, Sigil behaves unexpectedly:
When importing a book with missing doctype it adds it and converts symbols to entities.
When importing a book with good doctype it does not convert symbols to entities.
You can try this with the files I provided.
I am surprised it only cares about character conversion when it also found a missing doctype.
Hope this helps
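For anyone who wants to check their own files for the same pattern, here is a minimal sketch in Python. It is not part of Sigil or calibre. The function name `survey_epub` and the regular expressions are assumptions made for illustration. For each (X)HTML file in an epub it reports whether a doctype is present, how many non-breaking space entities it contains, and how many literal non-breaking space characters it contains.

```python
import io
import re
import zipfile

# Hypothetical helpers for illustration only; not taken from Sigil or calibre.
DOCTYPE_RE = re.compile(rb"<!DOCTYPE\s+html", re.IGNORECASE)
# Matches &nbsp;, &#160; and &#xA0; (with optional leading zeros).
NBSP_ENTITY_RE = re.compile(rb"&(nbsp|#0*160|#x0*a0);", re.IGNORECASE)


def survey_epub(epub):
    """Return {name: (has_doctype, nbsp_entities, nbsp_chars)} per (X)HTML file.

    `epub` may be a path or a binary file-like object.
    """
    report = {}
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            data = zf.read(name)
            has_doctype = DOCTYPE_RE.search(data) is not None
            entities = len(NBSP_ENTITY_RE.findall(data))
            # Assumes UTF-8 content, which is the usual case for epub files.
            chars = data.decode("utf-8", errors="replace").count("\u00a0")
            report[name] = (has_doctype, entities, chars)
    return report
```

Running it on the same book before and after opening and saving it in Sigil should show whether the entity count changed only in files that had no doctype.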
I forgot one thing: the vast amount of folk new to Sigil and/or epub creation often know nothing of invisible (x)html characters. Most like being able to see the entities where these special situations occur. THAT is why it's the default behavior, and why we've provided the more experienced users a way to override that behavior.
To make this issue clearer: in epub2, using named entities (i.e. "&nbsp;" vs. numeric entities such as "&#160;") requires the proper epub2 doctype, which is where the namespace for supporting named entities is provided.
Because calibre ignores the doctype and does not require it, it is forced to convert all named entities to their character equivalents. That is okay, since that is what they do, but it is NOT spec behaviour. In epub3 only numeric entities are allowed, except for the basic XML entities required by parsing.
So no, this is not a bug on Sigil's part. Sigil's job is to generate epubs that can be made to meet the spec for epub2 or epub3, and to properly use and support legal entities if the epub creator so chooses.
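The difference between named and numeric entities can be illustrated with a short Python sketch. This is not how Sigil or calibre work internally. Python's standard XML parser simply stands in for a strict reader that has no DTD to consult, and the names `named_to_numeric` and `parses_without_dtd` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Rewrite known HTML named entities as decimal numeric references."""
    def repl(match):
        name = match.group(1)
        # Leave the XML built-ins and any unknown names untouched.
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])
    return NAMED_ENTITY_RE.sub(repl, markup)


def parses_without_dtd(markup):
    """Return True if a DTD-less XML parse of `markup` succeeds."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    named = "<p>a&nbsp;b</p>"
    print(parses_without_dtd(named))                    # named entity, no DTD
    print(parses_without_dtd(named_to_numeric(named)))  # numeric reference
```

In this sketch the named form is rejected as an undefined entity, while the numeric form parses and yields the same U+00A0 character. That is the situation a reader ends up in if it parses the file as XML without loading the epub2 DTD.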