Whitelist attribute xml:space not recognised
Opened this issue · 4 comments
We have exported a SVG with Adobe and the sanitizer does not like that. It give the following errors:
There are sanitization issues with this SVG file:
Suspicious attribute 'space' in line 4
Suspicious attribute 'enable-background' in line 4
Generator is: Adobe Illustrator 19.1.0, SVG Export Plug-In . SVG Version: 6.00 Build 0)
Issue #63 is for the enable-background.
But the space attribute is something weird.
Relevant code of the SVG:
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="Layer_1" version="1.1" x="0px" y="0px" width="600px" height="600px" viewBox="0 0 600 600" xml:space="preserve">
If you sanitize this the getXmlIssues() function wil return the error above:
Suspicious attribute 'space' in line 4
Somehow the code strips xml:
Did not found the problem/solution for this.
Problem is: https://github.com/darylldoyle/svg-sanitizer/blob/master/src/Sanitizer.php#L369
Change:
$attrName = $element->attributes->item($x)->name
to
$attrName = $element->attributes->item($x)->nodeName;
it solves the problem, I'm unsure about the side effects...
Three problems:
- The default value of
xml:whitepace
in XML isxml:whitepace="default"
, notxml:whitepace="preserve"
. Currently the sanitizer assumes the reverse and addsxml:whitepace="preserve"
, which is never needed in any sanitized XVG. And this changes the behavior of parsers, adding lot of whitespace-only text elements visible to parsers. - If we specify
xml:whitepace="preserve"
in the input, the white spaces present in text element contents is correctly filtered out, and the output does not "need" to generate any newlines or indentation of elements, all is packed in one line. But if there are some other spurious elements (not used in SVG that has useful NO text element in its own schema, except if other non-SVG elements are present and included by using some namespace prefix), these indentation spaces and line breaks are kept (even if these elements have been sanitized and eliminated) even if they are "sane". You should drop them, or possibly indent elements meaningfully. - Finally, when you sanitize elements, if there are no contents, you should not use explicit close tags to ALL elements, but use the self-closed syntax with
<element [attributes...]/>
instead of<element [attributes...]></element>
which is unnecessarily verbose, especially when you've incorrectly assumedxml:whitepace="preserve"
which is only meaningful if there are text-elements (never used in valid SVG, but that may appear in the HTML5-embedded SVG that allows mixing non-SVG elements); using the correct default value ofxml:whitespace="default"
, you expect to see possible text-elements, but valid SVG discards them if they are only whitespace, even in HTML5! - SVG only has one defined text-element: the
<style>...</style>
element. But even there,xml:white-space="preserve"
is not needed. You may still output the CSS with all these whitespaces compressed, or with basic newlines at start and end and betweenselectors { [key:value];* [key:value] }
, and inside braces{}
at start and end and in the middle betweenkey:value
pairs (pairs may be indented according the brace-level). Note also that the CSS in the text-content of<style>
elements and insidestyle="..."
attributes of any SVG elements must be sanitized too (they can contain scripts, and unsafe URIs with some dangerous URI schemes, or "bad" external sites if you accept "http:" or "https:"). - However "data:" URI schemes are safe, for example to embed bitmap images, provided that these "data:" URI use "safe" media type. Validating and sanitizing the content of these embedded medias may require another recursive parser/sanitizer for each supported media type (after decoding them safely generally from Base64 into a binary object), notably JPEG, GIF, PNG, videos, PDFs, or even another SVG embedded in a "data:" URI. Sanitizing these medias would need their own options (e.g. for privacy, to remove internal tracing tags, or to respect and preserve the author/copyright/licence tags; this is a complex task because these medias may be secured and you may break digital signatures; you may just be able to support a few simple media-types like plain text, SVG, or a subset of GIF, PNG, JPEG features but should signal the presence of unsupported features; such data URIs to medias will generally occur inside the CSS content of the
<style>
element orstyle="..."
attributes).
So please use xml:whitepace="default"
as the default and cleanup the generated code! Indentation and line breaks in CSS text-element and of XML elements is optional and should be an option (a checkbox) but proper cleanup of white spaces in input should still be performed according to the xml:whitepace
value (which may change only explicitly inside each XML element).
Problem is: https://github.com/darylldoyle/svg-sanitizer/blob/master/src/Sanitizer.php#L369
Change:
$attrName = $element->attributes->item($x)->name
to
$attrName = $element->attributes->item($x)->nodeName;
it solves the problem, I'm unsure about the side effects...
Just fixed this locally, and then realized there may be an opened issue already...
Opened a pull-request, hoping this will be fixed, because currently, there is no way to determine whether the file was sanitized or not.
Note however that "xml:" is not a really namespace in XML, so it doe snot have to be declared (it is implicitly defined and reserved for the XML standard itself) and "xml:whitespace" is an implicit property that is defined and valid in all XML files, independantly of their processing schema: it must be processed directly at parser level without any need of further processing and it is not subject to any interpretation.
You can treat it as a "pseudo" namespace, though, given that no valid XML file can redefine it with a valid XML namespace declaration, so it can safely be bound internally when using any XML parser, including for parsing any version of SVG or any reduced profile (such as TinyVG whose purpose is only to preserve a standard "minimal" rendering comptible with all devices, while dropping most other metadata and semantic informations, e.g. for project management, sources, licencing conditions, etc.).
[ It is just the same for implicitly predefined entities "amp", "lt", and "gt", and the generic syntax of numerical character references that are handled at parser level; this difference just contrasts with SGML and legacy HTML parsers before HTML5, that required such declaration, in a DTD schema, but not with XHTML which is prebound to XML, independantly of the schemas used for HTML4, HTML5 and their evolutions and independantly of the new stricter HTML5 parser that adds further extensions and restrictions not defined in XML, such as new predefined entities and attributes and the deprecation of must namespaces in HTML5 which is then not extensible in its syntax except by converting HTML5 to XHTML via the XML syntax, where you then need to add explicit declarations of additional HTML5 entities in an explicit schema; but thanks, HTML5 provides the stadnard URI for these standardized schema extension so that HTML5 parser can produce a unique predicatible HTML5 while uniquely preserving the DOM with a bijection between the two syntaxes, considered completely equivalent even it the standard schema is then evolutive and must be backward compatible with previous specifications of HTML5; this also applies to the integration of a part of SVG inside HTML5, but with a restriction: it can only use XML 1.1 and not XML 1.0, as HTML5 does not allow specifying any embedded DTD but only an implicit external schema. ]