This is an approximate implementation of the WHATWG innerText attribute that works from a plain DOM, without any knowledge of rendering or CSS. It was drafted as a replacement for the `textContent` DOM property in microformats2 parsers, so that a plaintext version of an element can still be laid out for reading without unexpected side effects (like `<br>` elements disappearing).
For whitespace normalisation this emulates the white space processing rules from the CSS Text Module Level 3 specification, currently a draft from the CSS Working Group at W3C.
A second, independent implementation of these ideas is available in Python: sknebel/python-innertext.
Of the 6 steps described for the retrieval of innerText, only one is changed in this implementation:
- If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the `textContent` IDL attribute on this element.
Returning just `textContent` (because we are “a non-CSS user agent”) is undesirable, so continue. For the purpose of this first step, the assumption is made that whatever element you are trying to read the text from is being rendered.
This step can safely be skipped in its entirety.
For the 10 steps described to run for inner text collection a couple of things are important:
- It has to know whether this is the first time it is being run (i.e. directly on the element we are determining the innerText of) and whether it is currently in “pre” mode (where it knows not to touch whitespace).
- It has to know to ignore the very first U+000A LINE FEED (LF) (`\n`) character inside `<pre>` and `<textarea>` elements, but only if the DOM parser hasn’t already stripped it.
- It has to be able to tell when it encounters a block it has already run whitespace processing on. For this, `BLOCK_START` and `BLOCK_END` constants are used by this implementation.
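The last two requirements could be sketched roughly as follows. This is a minimal illustration, not the code of this implementation: the sentinel objects, the `strip_leading_lf` helper, and its `parser_strips_lf` flag are all hypothetical names introduced here.

```python
# Hypothetical sentinels: unique objects can never be confused with text items.
BLOCK_START = object()
BLOCK_END = object()


def strip_leading_lf(tag_name: str, text: str, parser_strips_lf: bool) -> str:
    """Drop one leading LF from the text of a <pre> or <textarea>.

    `parser_strips_lf` is an assumed flag telling us whether the DOM parser
    already removed that first line feed; if it did, nothing more is removed.
    """
    if parser_strips_lf:
        return text
    if tag_name.lower() in ("pre", "textarea") and text.startswith("\n"):
        return text[1:]
    return text
```

Using identity-compared sentinels rather than magic strings means no text content can ever be mistaken for a marker.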
A number of changes have been made to the described steps, as follows:
- If node's computed value of 'visibility' is not 'visible', then return items.
It is assumed that all nodes have `visibility` set to `visible`. There are only 6 cases of special `visibility` values in the default CSS, all of which set elements that are already hidden to `collapse` within a table. (See the CSS user agent style sheet for tables.)
This step can safely be skipped, as those edge-cases should get covered by the next step.
- If node is not being rendered, then return items. For the purpose of this step, the following elements must act as described if the computed value of the 'display' property is not 'none': […]
“Being rendered” means a very specific thing here, namely that an element has a CSS layout box. The assumption of this implementation is that every DOM element has a layout box unless it is specifically set to not have one through default styling with `display: none;`.
See hidden elements, flow content and tables for selectors that do this.
To match the assumption made in step 1 of the outer innerText steps, the initial element must be treated as being rendered when the collection step is run on it, even if the default CSS says it isn’t.
The special rules for `select`, `optgroup`, and `option` are ignored.
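A rough sketch of this “being rendered” check, under stated assumptions: the `HIDDEN_BY_DEFAULT` set below is only an illustrative subset of elements that the HTML user agent style sheet hides, and `is_being_rendered` with its parameters is a hypothetical helper, not this implementation's API. The authoritative selectors are the ones in the referenced style sheets.

```python
# Illustrative subset only: elements given `display: none` by default styling.
HIDDEN_BY_DEFAULT = frozenset({
    "area", "base", "datalist", "head", "link", "meta", "noembed",
    "noframes", "param", "rp", "script", "style", "template", "title",
})


def is_being_rendered(tag_name: str,
                      has_hidden_attribute: bool = False,
                      is_initial: bool = False) -> bool:
    """Approximate 'has a layout box' without any CSS engine."""
    if is_initial:
        # The element innerText was requested on always counts as rendered.
        return True
    if has_hidden_attribute:
        return False
    return tag_name.lower() not in HIDDEN_BY_DEFAULT
```

For example, `is_being_rendered("script")` is false, while `is_being_rendered("script", is_initial=True)` is true because of the assumption made for the initial element.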
- If node is a Text node, […]
This step describes whitespace processing that cannot be done on a per-text-node basis. Therefore it is only implemented as:
If node is a Text node, then return items as a list containing the node’s `textContent`.
- If node is a br element, then append a string containing a single U+000A LINE FEED (LF) character to items.
Because whitespace processing does not happen in step 4 but later, the line feed introduced by `<br>` has to be guarded as a fully processed block. Therefore this step is implemented as:
If node is a br element, then return items as a list containing (in order) BLOCK_START, a string containing a single U+000A LINE FEED (LF) character, and BLOCK_END.
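The Text node and `br` steps as modified above might look like this. The sketch uses Python's standard `xml.dom.minidom` purely as a stand-in DOM; `collect_leaf` and the sentinel objects are hypothetical names, not part of this implementation.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical sentinels marking an already processed block.
BLOCK_START = object()
BLOCK_END = object()


def collect_leaf(node):
    """Return the items for a Text node or <br>; None for any other node."""
    if node.nodeType == Node.TEXT_NODE:
        # No whitespace processing yet: hand back the raw text.
        return [node.data]
    if node.nodeType == Node.ELEMENT_NODE and node.tagName.lower() == "br":
        # Guard the line feed so later normalisation leaves it alone.
        return [BLOCK_START, "\n", BLOCK_END]
    return None
```

Because the line feed sits between the two markers, the later whitespace handling will not turn it into a space.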
- If node's computed value of 'display' is 'table-cell', and node's CSS box is not the last 'table-cell' box of its enclosing 'table-row' box, then append a string containing a single U+0009 CHARACTER TABULATION (tab) character to items.
- If node's computed value of 'display' is 'table-row', and node's CSS box is not the last 'table-row' box of the nearest ancestor 'table' box, then append a string containing a single U+000A LINE FEED (LF) character to items.
These two steps are not currently implemented. Tables are an outstanding question.
- If node is a p element, then append 2 (a required line break count) at the beginning and end of items.
Instead of `2`, this implementation adds `1` as the required line break count. This is to align with older tests and may be changed in the future.
EXTRA STEP: Images
After handling `<p>` elements, this implementation handles `<img>` elements in accordance with older microformats parser implementations. This is implemented as:
If node is an img element, then:
* if node has an alt attribute, return items as a list containing only the alt attribute’s value with a single space character before and after.
* else if node has a src attribute, return items as a list containing only the src attribute’s value with a single space character before and after.
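A minimal sketch of the image step, again using `xml.dom.minidom` as a stand-in DOM. `collect_img` is a hypothetical helper, and returning `None` when neither attribute is present is an assumption made here to signal “not handled by this step”; the text above does not say what happens in that case.

```python
from xml.dom import minidom


def collect_img(node):
    """Return the items for an <img> element, preferring alt over src."""
    if node.hasAttribute("alt"):
        return [" " + node.getAttribute("alt") + " "]
    if node.hasAttribute("src"):
        return [" " + node.getAttribute("src") + " "]
    # Assumption: neither attribute present, so this step does not apply.
    return None
```

The surrounding spaces keep the replacement text from running into neighbouring words; any doubled spaces are collapsed later by whitespace normalisation.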
- If node's used value of 'display' is block-level or 'table-caption', then append 1 (a required line break count) at the beginning and end of items.
This is implemented as described, with 2 assumptions:
- There is only one element that will have a `display` of `table-caption`, and that is the `caption` element (per the default CSS for tables).
- Elements are considered “block-level” if they are on MDN’s list of block-level elements.
EXTRA STEP: Whitespace handling
- All successive string items in the items list that are not between `BLOCK_START` and `BLOCK_END` markers are merged into one item.
- If the current block is in “pre” mode, do not do anything with the merged string items.
- If the current block is not in “pre” mode, apply the Whitespace Normalisation as described below to the merged string items.
- After merging the strings in items, prepend the list items with a BLOCK_START and append a BLOCK_END marker.
- Return items.
Done! :D
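The whitespace handling step could be sketched as below. This is an illustration under assumptions: `merge_strings`, its parameters, and the sentinel objects are hypothetical names, and the normaliser is passed in as a function so the sketch stays independent of how normalisation is actually done.

```python
# Hypothetical sentinels marking an already processed block.
BLOCK_START = object()
BLOCK_END = object()


def merge_strings(items, in_pre, normalise):
    """Merge runs of unguarded strings, normalise them, and wrap the result."""
    merged, buffer = [], []
    depth = 0  # > 0 while inside a BLOCK_START ... BLOCK_END pair

    def flush():
        if buffer:
            text = "".join(buffer)
            merged.append(text if in_pre else normalise(text))
            buffer.clear()

    for item in items:
        if depth == 0 and isinstance(item, str):
            buffer.append(item)  # part of a run of successive strings
            continue
        flush()  # anything else ends the current run
        if item is BLOCK_START:
            depth += 1
        elif item is BLOCK_END:
            depth -= 1
        merged.append(item)  # guarded content and counts pass through as-is
    flush()
    # Mark the whole result as processed for the caller.
    return [BLOCK_START, *merged, BLOCK_END]
```

Non-string items such as required line break counts end a run of strings, so text on either side of them is normalised separately.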
For ease of processing, it is recommended to replace all CRLF character combinations with LFs. CRLFs and LFs are both defined as segment breaks, and by normalising to one of the two, the other steps only need to handle one.
For all steps below we assume the `white-space` property of all nodes to be `normal`.
Follow Phase One of White Space Processing: Collapsing and Transformation.
- All spaces and tabs immediately preceding or following a segment break are removed.
The word “spaces” here is assumed to mean only U+0020 SPACE characters, and “tabs” only U+0009 CHARACTER TABULATION characters.
- Segment breaks are transformed for rendering according to the segment break transformation rules.
The next section of the spec is included here as step 2 of the process. The steps taken are the segment break transformation rules:
- As with spaces, any collapsible segment break immediately following another collapsible segment break is removed.
This is implemented as collapsing every sequence of LF characters into a single LF character.
- If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.
This is implemented as replacing all sequences `U+200B` `U+000A` and `U+000A` `U+200B` with just a single `U+200B`.
- Otherwise, if the East Asian Width property […]
This is skipped to keep language recognition out of the implementation.
- Otherwise, if the content language of […]
This too is skipped to keep language recognition out of the implementation.
- Otherwise, the segment break is converted to a space (U+0020).
This is implemented as replacing any `U+000A` with `U+0020`.
- Every tab is converted to a space (U+0020).
This is implemented as replacing any `U+0009` with `U+0020`.
- Any space immediately following another collapsible space […] is collapsed to have zero advance width.
This is implemented as collapsing every sequence of space characters into a single space character.
EXTRA STEP: Removing leading and trailing space
Because of how this white space handling has been delayed in the inner text collection steps, we can be sure that any string begins and ends at a block border. This means each string is surrounded either by required line breaks or by the start/end of the document.
As hanging white space is removed around line breaks, we can do that here. This is implemented as stripping all leading and trailing space characters.
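Taken together, the normalisation steps above, including the CRLF replacement and the final trimming, can be sketched with plain string and regular expression operations. `normalise_whitespace` is a hypothetical name, and the sketch assumes `white-space: normal` throughout, as stated earlier.

```python
import re


def normalise_whitespace(text: str) -> str:
    """Approximate CSS white space collapsing for `white-space: normal`."""
    # Preparation: treat CRLF and LF alike.
    text = text.replace("\r\n", "\n")
    # Step 1: remove spaces and tabs directly around segment breaks.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Step 2a: collapse runs of segment breaks into one.
    text = re.sub(r"\n+", "\n", text)
    # Step 2b: a break next to a zero-width space is removed.
    text = text.replace("\u200b\n", "\u200b").replace("\n\u200b", "\u200b")
    # Step 2c: every remaining segment break becomes a space.
    text = text.replace("\n", " ")
    # Step 3: every tab becomes a space.
    text = text.replace("\t", " ")
    # Step 4: collapse runs of spaces into one.
    text = re.sub(r" +", " ", text)
    # Extra step: strip leading and trailing space characters only.
    return text.strip(" ")
```

For instance, `normalise_whitespace("  Hello \r\n\t world \n\n again  ")` yields `"Hello world again"`. The order matters: tabs next to a line break must be removed in step 1 before the remaining tabs are turned into spaces in step 3.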