readium/readium-sdk

Treating XHTML as HTML breaks cross-platform CFI consistency

winniequinn opened this issue · 10 comments

Due to what I believe to be a behavior of libxml2, XHTML content from within an EPUB has the following meta tag added if no http-equiv="Content-Type" meta tag exists:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This causes browsers to interpret XHTML documents as HTML which in turn triggers browser fallback behavior that results in radically different DOMs across browsers. As such, the logic for calculating CFIs produces CFIs that are not portable across platforms.

Thank you to @jccr for helping me debug this issue.

This issue is a bug.

Expected Behaviour

Readium produces CFIs that are portable across Android and iOS.

Observed behaviour

Readium does not.

Test file(s)

Here is a single XHTML file from an EPUB (source.xhtml) along with the DOMs produced for Android and iOS:

https://gist.github.com/winniequinn/b0865f110ffc821e2e583bfa51c30f17

Product

Native application (Readium SDK C++)

If I remember correctly , the application/xhtml+xml HTTP content-type is correctly included in the response created by the Android and iOS apps' respective HTTP servers. Let me find the source code references...

Android:

https://github.com/readium/SDKLauncher-Android/blob/master/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L85-L86
https://github.com/readium/SDKLauncher-Android/blob/0bd7eb1ac88616a1eba3dfe27dd5d93c17d8ef44/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L85-L86

tmpMap.put("html", "application/xhtml+xml"); // FORCE
tmpMap.put("xhtml", "application/xhtml+xml"); // FORCE

https://github.com/readium/SDKLauncher-Android/blob/master/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L209-L229
https://github.com/readium/SDKLauncher-Android/blob/0bd7eb1ac88616a1eba3dfe27dd5d93c17d8ef44/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L209-L229

		String mime = null;
		int dot = uri.lastIndexOf('.');
		if (dot >= 0) {
			mime = MIME_TYPES.get(uri.substring(dot + 1).toLowerCase());
		}
		if (mime == null) {
			mime = "application/octet-stream";
		}

		ManifestItem item = pckg.getManifestItem(uri);
		String contentType = item != null ? item.getMediaType() : null;
		if (!mime.equals("application/xhtml+xml")
				&& !mime.equals("application/xml") // FORCE
				&& contentType != null && contentType.length() > 0) {
			mime = contentType;
		}

		PackageResource packageResource = pckg.getResourceAtRelativePath(uri);

		boolean isHTML = mime.equals("text/html")
|| mime.equals("application/xhtml+xml");

iOS:

https://github.com/readium/readium-sdk/blob/master/Platform/Apple/RDServices/Main/RDPackageResourceResponse.m#L57-L66

- (NSDictionary *)httpHeaders {
if(m_resource.relativePath) {
NSString* ext = [[m_resource.relativePath pathExtension] lowercaseString];
if([ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"]) {
return [NSDictionary dictionaryWithObject:@"application/xhtml+xml" forKey:@"Content-Type"]; // FORCE
}
else if([ext isEqualToString:@"xml"]) {
return [NSDictionary dictionaryWithObject:@"application/xml" forKey:@"Content-Type"]; // FORCE
}
}

- (NSDictionary *)httpHeaders {
	if(m_resource.relativePath) {
		NSString* ext = [[m_resource.relativePath pathExtension] lowercaseString];
		if([ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"]) {
			return [NSDictionary dictionaryWithObject:@"application/xhtml+xml" forKey:@"Content-Type"]; // FORCE
		}
		else if([ext isEqualToString:@"xml"]) {
			return [NSDictionary dictionaryWithObject:@"application/xml" forKey:@"Content-Type"]; // FORCE
		}
}

https://github.com/readium/readium-sdk/blob/master/Platform/Apple/RDServices/Main/RDPackageResourceConnection.m#L275-L302

NSString* ext = [[path pathExtension] lowercaseString];
bool isHTML = [ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"] || [resource.mimeType isEqualToString:@"application/xhtml+xml"] || [resource.mimeType isEqualToString:@"text/html"];
if (isHTML) {
NSString * FALLBACK_HTML = @"<?xml version=\"1.0\" encoding=\"UTF-8\"?><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>HTML READ ERROR</title></head><body>ERROR READING HTML BYTES!</body></html>";
NSData *data = [resource readDataFull];
if (data == nil || data.length == 0)
{
data = [FALLBACK_HTML dataUsingEncoding:NSUTF8StringEncoding];
}
NSXMLParser *xmlparser = [[NSXMLParser alloc] initWithData:data];
[xmlparser setShouldResolveExternalEntities:NO];
BOOL isXhtmlWellFormed = [xmlparser parse];
if (isXhtmlWellFormed == NO)
{
NSLog(@"XHTML PARSE ERROR: %@", xmlparser.parserError);
// Can be used to check / debug encoding issues
NSString * dataStr = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
//NSLog(@"XHTML SOURCE: %@", dataStr);
// FORCE HTML WebView parser
//@"application/xhtml+xml"
resource.mimeType = @"text/html";
}

		NSString* ext = [[path pathExtension] lowercaseString];
		bool isHTML = [ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"] || [resource.mimeType isEqualToString:@"application/xhtml+xml"] || [resource.mimeType isEqualToString:@"text/html"];

		if (isHTML) {
			NSString * FALLBACK_HTML = @"<?xml version=\"1.0\" encoding=\"UTF-8\"?><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>HTML READ ERROR</title></head><body>ERROR READING HTML BYTES!</body></html>";

			NSData *data = [resource readDataFull];
			if (data == nil || data.length == 0)
			{
				data = [FALLBACK_HTML dataUsingEncoding:NSUTF8StringEncoding];
			}

			NSXMLParser *xmlparser = [[NSXMLParser alloc] initWithData:data];
			[xmlparser setShouldResolveExternalEntities:NO];
			BOOL isXhtmlWellFormed = [xmlparser parse];

			if (isXhtmlWellFormed == NO)
			{
				NSLog(@"XHTML PARSE ERROR: %@", xmlparser.parserError);

				// Can be used to check / debug encoding issues
				NSString * dataStr = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
				//NSLog(@"XHTML SOURCE: %@", dataStr);

				// FORCE HTML WebView parser
				//@"application/xhtml+xml"
				resource.mimeType = @"text/html";
}

As you can see, in iOS there is a fallback to HTML for malformed XHTML documents ... yes, there exists EPUBs authored with not-well-formed XML-HTML :(
But otherwise, the application/xhtml+xml content type should be correctly passed into HTTP responses via the appropriate headers ... well, at least in the ReadiumSDK "launcher apps".

I am not sure about libxml's responsibility here, are you sure it isn't an implementation trick specific to UIWebview / WKWebView? (on iOS)

jccr commented

Could you please share the EPUB you are using to reproduce the issue? I would like to run a quick test with SDKLauncher-OSX.

There is no DOM parsing (libxml or not) involved when the iOS / Android / OSX platform-native code (not C++, but Java or ObjectiveC) pre-processed the HTML documents in order to inject the navigator.epubReadingSystem stuff (it is all regular expression / plain text) ... so I do no understand why / how the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> markup would be added.

IMO, the next best course of action is to reproduce (if possible) the bug with the SDKLauncher-iOS/Android/OSX apps, because it may be that your app is built with a different logic.

jccr commented

@winniequinn
Let's try to reproduce this using the vanilla launchers.
Could you share the EPUB that was used in your testing?

jccr commented

@danielweck
I couldn't reproduce this using various books, including a similar Alice book like winnie's sample. I am using the Android launcher.

This repository seems dormant, so I hope no one minds me doing a little housecleaning and closing out this stale issue I reported. @danielweck, feel free to reopen if you think it's worth keeping this active!