lexborisov/myhtml

Element wrong location level error handling

Opened this issue · 11 comments

elekt commented

I am working on a project that parses HTML and replaces href attributes.
If the HTML is invalid because an <a> tag appears where a table cell (e.g. <td>) is expected, myhtml_insertion_mode_in_table handles the parse error by "foster parenting": it calls myhtml_insertion_mode_in_body with the <a> token.

The problem is that, as a result, when I loop through the tree's nodes the node appears to have been added twice. The clone is added in myhtml_tree_active_formatting_reconstruction.

See the minimal HTML to reproduce:
testminimal_github.txt

In my application I throw away the copy of the node, but for some reason when this happens the href (link 1 in the example) remains unchanged. It also messes up the order in which I get the nodes with node = myhtml_node_next(node). I would like to fix this bug in myhtml, and I would appreciate some help.

I am not looking to fix the invalid HTML, but to make sure every href link is changed and the structure stays the same.

Hi!
I'll deal with this soon.
Thanks!

I'm trying to understand the problem, but I don't see it yet.
The specification actually requires this behavior. Try checking how a modern browser handles this example.

Elekt is my colleague. Our use case is the following:

  1. Parse input HTML
  2. Modify some attributes
  3. Regenerate the HTML, but keep it as close to the original input HTML as possible (so without fixing it, adding more nodes, et cetera)
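
One way to keep the regenerated HTML close to the input is to avoid serializing the tree at all and instead copy the original bytes through, substituting only the changed attribute values. The sketch below assumes the byte offset and length of each value are already known (for instance from the parser's position data). `edit_t` and `apply_edits` are hypothetical names for illustration, not part of myhtml.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical edit record: replace `length` bytes starting at `begin`
 * in the original buffer with `text`. Not a myhtml type. */
typedef struct {
    size_t begin;
    size_t length;
    const char *text;
} edit_t;

/* Copies `src` into a new NUL-terminated buffer while applying `edits`.
 * Edits must be sorted by `begin` and must not overlap. Returns NULL on
 * invalid edits or allocation failure; the caller frees the result. */
char *apply_edits(const char *src, size_t src_len,
                  const edit_t *edits, size_t count)
{
    size_t out_len = src_len;
    size_t cursor = 0;

    /* First pass: validate the edits and compute the output size. */
    for (size_t i = 0; i < count; i++) {
        if (edits[i].begin < cursor || edits[i].begin > src_len ||
            edits[i].length > src_len - edits[i].begin)
            return NULL;
        out_len = out_len - edits[i].length + strlen(edits[i].text);
        cursor = edits[i].begin + edits[i].length;
    }

    char *out = malloc(out_len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy untouched bytes verbatim, splice in replacements. */
    char *dst = out;
    cursor = 0;
    for (size_t i = 0; i < count; i++) {
        size_t gap = edits[i].begin - cursor;
        size_t text_len = strlen(edits[i].text);
        memcpy(dst, src + cursor, gap);
        dst += gap;
        memcpy(dst, edits[i].text, text_len);
        dst += text_len;
        cursor = edits[i].begin + edits[i].length;
    }
    memcpy(dst, src + cursor, src_len - cursor);
    dst[src_len - cursor] = '\0';
    return out;
}
```

Because everything outside the edited ranges is copied byte for byte, broken markup in the input stays broken in the output, which is the stated goal here.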

Is there a way to find out whether a node was artificially added by myhtml? We currently check whether position.length == 0, but this does not work in the example given above.

There is a "flags" member in myhtml_tree_node_t, but it does not really seem to be in use. It would be nice if this flag could be set to a special value that user space programs can inspect to check whether a node is (for example):

  • a real node that comes from the input HTML
  • an artificial node that was created by myhtml to fix a broken tree
  • a node that was moved to a different location in the tree to fix things
  • a node that was duplicated and added to the tree to fix things (like the links in the above example)
  • a node that was later modified by the user space program (like having a modified attribute)
  • a mismatched node (like not linked to a closing node)
  • a node that was opened and closed in one tag (like <br/>)
  • et cetera

For our own use case it would already be very helpful if we could recognize "artificial" nodes, so that we can skip them when we regenerate the source code.
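
The skip step could look roughly like the sketch below, which uses the zero-length position check mentioned earlier. It relies on artificial nodes carrying a zero length, which the example above shows is not always the case today. The types are simplified stand-ins so the snippet compiles on its own; `skip_pos_t`, `skip_node_t`, `is_artificial` and `next_source_node` are hypothetical names, not myhtml API.

```c
#include <stddef.h>
#include <stdbool.h>

/* Stand-in for the parser's position data: byte offset and length of
 * the node's text in the original input. Hypothetical. */
typedef struct {
    size_t begin;
    size_t length;
} skip_pos_t;

/* Stand-in node: a singly linked list in document order. */
typedef struct skip_node {
    skip_pos_t pos;
    struct skip_node *next;
} skip_node_t;

/* A node is treated as artificial when it has no extent in the input. */
bool is_artificial(const skip_node_t *node)
{
    return node->pos.length == 0;
}

/* Returns the first node at or after `node` that came from the input,
 * or NULL when only artificial nodes remain. */
const skip_node_t *next_source_node(const skip_node_t *node)
{
    while (node != NULL && is_artificial(node))
        node = node->next;
    return node;
}
```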

I found a bug. We need pos.len = 0 (for a cloned element), but right now it contains garbage. This needs to be fixed.

elekt commented

Can you elaborate a bit more?
I assume it needs to be set in myhtml_tree_node_clone.

It seems not. I will try to deal with this today.
We need to understand at what point the cloned nodes get garbage in their position values.
Position values in cloned nodes must be zero.
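
To illustrate the invariant only: a clone routine would copy the node's data and then explicitly zero its position fields, so that a clone can never inherit or expose stale offsets. The sketch below uses stand-in types. `clone_pos_t`, `clone_demo_node_t` and `clone_demo_node` are hypothetical, and where myhtml actually fills these fields is not established in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for position data (offset and length in the input). */
typedef struct {
    size_t begin;
    size_t length;
} clone_pos_t;

/* Stand-in node carrying a tag id and two source positions. */
typedef struct {
    int tag_id;
    clone_pos_t raw_pos;
    clone_pos_t element_pos;
} clone_demo_node_t;

/* Returns a heap-allocated copy of `src` whose positions are zeroed,
 * marking it as having no extent in the original input.
 * Returns NULL on allocation failure; the caller frees the result. */
clone_demo_node_t *clone_demo_node(const clone_demo_node_t *src)
{
    clone_demo_node_t *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;

    *copy = *src;  /* copy tag id and any other payload */

    /* Reset positions explicitly instead of inheriting them. */
    memset(&copy->raw_pos, 0, sizeof copy->raw_pos);
    memset(&copy->element_pos, 0, sizeof copy->element_pos);
    return copy;
}
```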

Hello @lexborisov, do you need more info or help in any form?

@EmielBruijntjes
I understand the task, but it will take time. In enum myhtml_tree_node_flags we need to create something like MyHTML_TREE_NODE_CLONE and MyHTML_TREE_NODE_MOVED.

For use:

if (node->flags & (MyHTML_TREE_NODE_CLONE|MyHTML_TREE_NODE_MOVED)) {
...
}
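
A self-contained sketch of how such bit flags might be defined and tested is shown below. The `DEMO_NODE_*` constants stand in for the proposed MyHTML_TREE_NODE_CLONE and MyHTML_TREE_NODE_MOVED. Their values, the `flagged_node_t` struct and the helper functions are assumptions made for illustration; none of this exists in myhtml.

```c
#include <stdbool.h>

/* Hypothetical bit flags standing in for the proposed
 * MyHTML_TREE_NODE_CLONE and MyHTML_TREE_NODE_MOVED. Values are made up;
 * each flag occupies its own bit so they can be combined with |. */
enum demo_node_flags {
    DEMO_NODE_UNDEF = 0x00,
    DEMO_NODE_CLONE = 0x01,
    DEMO_NODE_MOVED = 0x02
};

/* Stand-in node holding only the flags word. */
typedef struct {
    unsigned flags;
} flagged_node_t;

/* What the parser would do when it clones or relocates a node. */
void mark_node(flagged_node_t *node, unsigned flag)
{
    node->flags |= flag;
}

/* What a user space program would check before re-emitting a node. */
bool was_repaired(const flagged_node_t *node)
{
    return (node->flags & (DEMO_NODE_CLONE | DEMO_NODE_MOVED)) != 0;
}
```

Using one bit per condition lets a node be both cloned and moved at the same time, and leaves room for further states from the list earlier in the thread.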

@lexborisov Is there anything I can do to help you here? It's a feature that we would really like to have.

Sorry, but I cannot do anything about it in the current project. The most I could do is somehow mark the cloned elements, but I would rather not spend time on that.