Treating XHTML as HTML breaks cross-platform CFI consistency
winniequinn opened this issue · 10 comments
Due to what I believe to be a behavior of libxml2, XHTML content from within an EPUB has the following meta tag added if no http-equiv="Content-Type" meta tag exists:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This causes browsers to interpret XHTML documents as HTML, which in turn triggers browser fallback behavior that results in radically different DOMs across browsers. As a result, the logic for calculating CFIs produces CFIs that are not portable across platforms.
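To help confirm whether a served document actually carries the injected tag, here is a minimal Java sketch that scans raw markup for an http-equiv="Content-Type" meta tag. The class and method names (ContentTypeMetaCheck, hasContentTypeMeta, declaresTextHtml) are hypothetical and not part of Readium; the check is a plain-text regular expression, not a real parser, so treat it as a debugging aid only.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Readium SDK.
final class ContentTypeMetaCheck {
    // Matches <meta ... http-equiv="Content-Type" ...> regardless of case or attribute order.
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*\\bhttp-equiv\\s*=\\s*[\"']content-type[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // True if the markup contains any http-equiv="Content-Type" meta tag.
    static boolean hasContentTypeMeta(String markup) {
        return META.matcher(markup).find();
    }

    // True if the first such meta tag mentions text/html anywhere in the tag.
    static boolean declaresTextHtml(String markup) {
        Matcher m = META.matcher(markup);
        return m.find() && m.group().toLowerCase(Locale.ROOT).contains("text/html");
    }

    public static void main(String[] args) {
        String injected = "<head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\" /></head>";
        System.out.println(hasContentTypeMeta(injected));
        System.out.println(declaresTextHtml(injected));
    }
}
```

Running this against the bytes the app's HTTP server returns, and against the original file inside the EPUB, should show at which stage the tag appears.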
Thank you to @jccr for helping me debug this issue.
This issue is a bug.
Expected Behaviour
Readium produces CFIs that are portable across Android and iOS.
Observed behaviour
Readium does not.
Test file(s)
Here is a single XHTML file from an EPUB (source.xhtml) along with the DOMs produced for Android and iOS:
https://gist.github.com/winniequinn/b0865f110ffc821e2e583bfa51c30f17
Product
Native application (Readium SDK C++)
If I remember correctly, the application/xhtml+xml HTTP content-type is correctly included in the response created by the Android and iOS apps' respective HTTP servers. Let me find the source code references...
Android:
https://github.com/readium/SDKLauncher-Android/blob/master/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L85-L86
https://github.com/readium/SDKLauncher-Android/blob/0bd7eb1ac88616a1eba3dfe27dd5d93c17d8ef44/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L85-L86
tmpMap.put("html", "application/xhtml+xml"); // FORCE
tmpMap.put("xhtml", "application/xhtml+xml"); // FORCE
https://github.com/readium/SDKLauncher-Android/blob/master/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L209-L229
https://github.com/readium/SDKLauncher-Android/blob/0bd7eb1ac88616a1eba3dfe27dd5d93c17d8ef44/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L209-L229
String mime = null;
int dot = uri.lastIndexOf('.');
if (dot >= 0) {
    mime = MIME_TYPES.get(uri.substring(dot + 1).toLowerCase());
}
if (mime == null) {
    mime = "application/octet-stream";
}
ManifestItem item = pckg.getManifestItem(uri);
String contentType = item != null ? item.getMediaType() : null;
if (!mime.equals("application/xhtml+xml")
        && !mime.equals("application/xml") // FORCE
        && contentType != null && contentType.length() > 0) {
    mime = contentType;
}
PackageResource packageResource = pckg.getResourceAtRelativePath(uri);
boolean isHTML = mime.equals("text/html")
        || mime.equals("application/xhtml+xml");
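For anyone who wants to poke at this resolution order outside the app, here is a self-contained sketch of the same logic. It is an approximation, not the launcher's code: the class name MimeResolverSketch and the resolve signature are hypothetical, the manifest lookup is replaced by a plain string parameter, and the "xml" map entry is an assumption (only the "html" and "xhtml" entries are shown above).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the MIME resolution in EpubServer.java.
final class MimeResolverSketch {
    private static final Map<String, String> MIME_TYPES = new HashMap<>();
    static {
        MIME_TYPES.put("html", "application/xhtml+xml");  // FORCE, as in the launcher
        MIME_TYPES.put("xhtml", "application/xhtml+xml"); // FORCE, as in the launcher
        MIME_TYPES.put("xml", "application/xml");         // assumed entry, not shown in the thread
    }

    // manifestMediaType stands in for item.getMediaType(); pass null when there is no manifest item.
    static String resolve(String uri, String manifestMediaType) {
        String mime = null;
        int dot = uri.lastIndexOf('.');
        if (dot >= 0) {
            mime = MIME_TYPES.get(uri.substring(dot + 1).toLowerCase(Locale.ROOT));
        }
        if (mime == null) {
            mime = "application/octet-stream";
        }
        // The forced XML types win over whatever the manifest declares.
        boolean forced = mime.equals("application/xhtml+xml") || mime.equals("application/xml");
        if (!forced && manifestMediaType != null && !manifestMediaType.isEmpty()) {
            mime = manifestMediaType;
        }
        return mime;
    }

    public static void main(String[] args) {
        System.out.println(resolve("OEBPS/chapter1.xhtml", "text/html")); // application/xhtml+xml
        System.out.println(resolve("OEBPS/cover.jpg", "image/jpeg"));     // image/jpeg
    }
}
```

The point of the "FORCE" entries is that a file with an .html or .xhtml extension is served as application/xhtml+xml even if the OPF manifest declares text/html for it.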
iOS:
- (NSDictionary *)httpHeaders {
    if(m_resource.relativePath) {
        NSString* ext = [[m_resource.relativePath pathExtension] lowercaseString];
        if([ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"]) {
            return [NSDictionary dictionaryWithObject:@"application/xhtml+xml" forKey:@"Content-Type"]; // FORCE
        }
        else if([ext isEqualToString:@"xml"]) {
            return [NSDictionary dictionaryWithObject:@"application/xml" forKey:@"Content-Type"]; // FORCE
        }
    }
readium-sdk/Platform/Apple/RDServices/Main/RDPackageResourceConnection.m
Lines 275 to 302 in 3ba644f
NSString* ext = [[path pathExtension] lowercaseString];
bool isHTML = [ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"] || [resource.mimeType isEqualToString:@"application/xhtml+xml"] || [resource.mimeType isEqualToString:@"text/html"];
if (isHTML) {
    NSString * FALLBACK_HTML = @"<?xml version=\"1.0\" encoding=\"UTF-8\"?><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>HTML READ ERROR</title></head><body>ERROR READING HTML BYTES!</body></html>";
    NSData *data = [resource readDataFull];
    if (data == nil || data.length == 0)
    {
        data = [FALLBACK_HTML dataUsingEncoding:NSUTF8StringEncoding];
    }
    NSXMLParser *xmlparser = [[NSXMLParser alloc] initWithData:data];
    [xmlparser setShouldResolveExternalEntities:NO];
    BOOL isXhtmlWellFormed = [xmlparser parse];
    if (isXhtmlWellFormed == NO)
    {
        NSLog(@"XHTML PARSE ERROR: %@", xmlparser.parserError);
        // Can be used to check / debug encoding issues
        NSString * dataStr = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        //NSLog(@"XHTML SOURCE: %@", dataStr);
        // FORCE HTML WebView parser
        //@"application/xhtml+xml"
        resource.mimeType = @"text/html";
    }
As you can see, on iOS there is a fallback to HTML for malformed XHTML documents ... yes, there exist EPUBs authored with XHTML that is not well-formed XML :(
But otherwise, the application/xhtml+xml content type should be correctly passed into HTTP responses via the appropriate headers ... well, at least in the ReadiumSDK "launcher apps".
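To make the iOS fallback easier to reason about (and to test a suspect file on a desktop JVM), here is a rough Java equivalent of the well-formedness check. It is a sketch under assumptions: the class name XhtmlFallbackSketch and method chooseContentType are hypothetical, it uses the JDK's SAX parser rather than NSXMLParser, and it approximates setShouldResolveExternalEntities:NO by resolving every external entity to an empty stream. It does not model the FALLBACK_HTML substitution for empty data.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper mirroring the iOS "serve as text/html if not well-formed" fallback.
final class XhtmlFallbackSketch {
    static String chooseContentType(byte[] data) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Never fetch external DTDs or entities; hand back an empty stream instead.
                    return new InputSource(new StringReader(""));
                }
            };
            parser.parse(new ByteArrayInputStream(data), handler);
            return "application/xhtml+xml"; // parsed cleanly: keep the XML content type
        } catch (Exception e) {
            return "text/html"; // any parse failure: fall back to the HTML parser
        }
    }

    public static void main(String[] args) {
        byte[] ok = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
                + "<body><p>ok</p></body></html>").getBytes(StandardCharsets.UTF_8);
        byte[] broken = "<html><body><p>unclosed<br></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(chooseContentType(ok));
        System.out.println(chooseContentType(broken));
    }
}
```

If the sample file lands in the text/html branch of a check like this, the differing DOMs would be explained by the HTML fallback rather than by any injected meta tag.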
I am not sure about libxml's responsibility here. Are you sure it isn't an implementation trick specific to UIWebView / WKWebView (on iOS)?
Could you please share the EPUB you are using to reproduce the issue? I would like to run a quick test with SDKLauncher-OSX.
There is no DOM parsing (libxml or otherwise) involved when the iOS / Android / OSX platform-native code (not C++, but Java or Objective-C) pre-processes the HTML documents in order to inject the navigator.epubReadingSystem stuff (it is all regular expressions / plain text) ... so I do not understand why / how the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> markup would be added.
IMO, the next best course of action is to reproduce (if possible) the bug with the SDKLauncher-iOS/Android/OSX apps, because it may be that your app is built with different logic.
@winniequinn
Let's try to reproduce this using the vanilla launchers.
Could you share the EPUB that was used in your testing?
@danielweck
I couldn't reproduce this using various books, including an Alice book similar to winnie's sample. I am using the Android launcher.
This repository seems dormant, so I hope no one minds me doing a little housecleaning and closing out this stale issue I reported. @danielweck, feel free to reopen if you think it's worth keeping this active!