lexborisov/myhtml

Erlang/Elixir bindings

Overbryd opened this issue · 4 comments

Hi Alexander,

I hope you are pleased to hear that I started working on an Elixir/Erlang binding for your marvellous library.
My main motivation is to parse a given HTML document into a tree-structure.
That is what the binding currently achieves.

The repository can be inspected here: https://github.com/Overbryd/myhtmlex

I am currently investigating which mode of operation I want to run this binding in.
Doing this, I found that re-using a reference to a tree somehow is slower than parsing a new tree.
There is a high chance that I am wrong and maybe I am doing something stupid, so I kindly ask you to have a look at the following C-functions that implement the NIF:

  • Opening & parsing a tree and building a tree in the same call (faster)

  • Opening & parsing a tree into a reference, then building a tree given a reference (slower)

    • nif_open(...)
      There I open a myhtml_tree_t and give it back to the VM using an allocated resource. For that I have a small struct myhtmlex_ref_t that holds a reference to the myhtml_tree_t and, for convenience, a reference to the root node as a myhtml_tree_node_t.

    • nif_decode_tree(...)
      Here I receive a reference to myhtmlex_ref_t and build the Erlang tree.

The idea was to "open & parse" a myhtml-tree, and then re-use it multiple times. Weirdly this is much slower than opening/parsing and building an Erlang tree in the same call.

I just want to rule out that I have a wrong assumption about myhtml_tree_t: that it is a fully parsed tree that can be walked very fast.

Kind regards,

Lukas Rieder

Hi Lukas,

I'm very glad to hear that you are creating bindings for myhtml. This is very good news!

    1. myhtml_parse always cleans the current tree before it begins parsing, so there is no need to call myhtml_tree_clean before myhtml_parse.
    2. You are destroying only one node, not the tree. myhtml_node_free releases the resources of one (the current) node only. To destroy the tree (all nodes and resources), use myhtml_tree_destroy. But if you call myhtml_tree_destroy, you always have to create a new tree with myhtml_tree_create afterwards, which is expensive. Just delete this line.
      The tree is cleaned each time you call myhtml_parse (no new resources are allocated; the current resources are reused).
  • In nif_open(...) you create a new tree on every call, which takes a while.
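
The reuse pattern described in the points above can be sketched with a toy model. This is not myhtml code: toy_tree_t, toy_tree_create, toy_tree_parse, toy_tree_destroy and the toy_alloc_calls counter are hypothetical stand-ins, and the "parser" only records the offsets of '<' characters. It is meant to show the shape of the idea only: the pool is allocated once, and every parse cleans and refills it instead of allocating again.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a parser tree: a node pool that is
   allocated once and reset on every parse. Not myhtml's real layout. */
typedef struct {
    size_t *node_offsets;   /* offset of each '<' found in the input */
    size_t  capacity;       /* number of slots in the pool */
    size_t  length;         /* number of slots currently in use */
} toy_tree_t;

/* Counts pool allocations, purely for illustration. */
static int toy_alloc_calls = 0;

/* The expensive step: allocate the tree and its pool (capacity > 0). */
toy_tree_t *toy_tree_create(size_t capacity) {
    toy_tree_t *tree = malloc(sizeof *tree);
    if (tree == NULL) return NULL;
    tree->node_offsets = malloc(capacity * sizeof *tree->node_offsets);
    if (tree->node_offsets == NULL) {
        free(tree);
        return NULL;
    }
    tree->capacity = capacity;
    tree->length = 0;
    toy_alloc_calls++;
    return tree;
}

/* Mirrors the behaviour described for myhtml_parse: clean first,
   then reuse the memory that is already there. */
size_t toy_tree_parse(toy_tree_t *tree, const char *html) {
    assert(tree != NULL && html != NULL);
    tree->length = 0;  /* "clean": keep the memory, drop the old contents */
    for (size_t i = 0; html[i] != '\0' && tree->length < tree->capacity; i++) {
        if (html[i] == '<') {
            tree->node_offsets[tree->length++] = i;
        }
    }
    return tree->length;
}

/* Release everything; after this a new toy_tree_create is required. */
void toy_tree_destroy(toy_tree_t *tree) {
    if (tree == NULL) return;
    free(tree->node_offsets);
    free(tree);
}
```

Calling toy_tree_create once and then toy_tree_parse for each document keeps toy_alloc_calls at 1 no matter how many documents are parsed. Destroying and recreating the tree for every document would pay the allocation cost each time.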

See this instruction

Consequently if I parse multiple html documents in parallel, I need one
tree per instance.
Is that correct?

But sequentially, I only need one tree for parsing multiple documents.

Yes

I have built nif_open with the idea in mind that the user can call the tree
traversal functions provided in myhtml.

That's right, best practice for bindings.
We build the tree and call the C functions from the other language. Something like this (Perl code):

my $tree = $myhtml->parse($html);
print $tree->as_text();
$tree->destroy();

Code for bindings (PerlXS):

MyHTML::Tree
parse(myhtml, html)
	MyHTML myhtml;
	const char* html;

	CODE:
		myhtml_tree_t *tree = myhtml_tree_create();
		myhtml_tree_init(tree, myhtml);
		myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));

		RETVAL = tree;
	OUTPUT:
		RETVAL

SV*
as_text(tree)
	MyHTML::Tree tree;

	CODE:
		mycore_string_raw_t str_raw;
		mycore_string_raw_clean_all(&str_raw);

		myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str_raw);
		
		RETVAL = newSVpv(str_raw.data, str_raw.length);
		mycore_string_raw_destroy(&str_raw, false);
	OUTPUT:
		RETVAL

void
destroy(tree)
	MyHTML::Tree tree;
	
	CODE:
		myhtml_tree_destroy(tree);

In other words, we do not copy anything out of the C structures, because the data inside them can change. We just build a bridge from your language to the C structures.
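
A minimal sketch of that "bridge, not copy" idea, in plain C with hypothetical names (toy_node_t, toy_handle_t, toy_handle_wrap and toy_handle_tag are illustrative and not part of myhtml or of any binding): the handle given to the other language only borrows a pointer, and every accessor reads through it, so later changes on the C side stay visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C-side node; a real binding would point at the library's
   own node type instead. */
typedef struct {
    const char *tag;
} toy_node_t;

/* The bridge: it borrows a pointer and owns nothing, so no data is copied. */
typedef struct {
    toy_node_t *node;
} toy_handle_t;

/* Wrap an existing node in a handle for the host language. */
toy_handle_t toy_handle_wrap(toy_node_t *node) {
    assert(node != NULL);
    toy_handle_t handle = { node };
    return handle;
}

/* Read through the handle at call time instead of caching the value. */
const char *toy_handle_tag(const toy_handle_t *handle) {
    assert(handle != NULL && handle->node != NULL);
    return handle->node->tag;
}
```

If the node's tag is changed on the C side after wrapping, the next toy_handle_tag call returns the new value, whereas a copy taken at wrap time would be stale. The cost of this design is that the handle must not outlive the structure it points into.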

Thanks so much. All questions cleared up. I will close this now.