Erlang/Elixir bindings

Question

Overbryd opened this issue 7 years ago · 4 comments

Hi Alexander,

I hope you are pleased to hear that I started working on an Elixir/Erlang binding for your marvellous library.
My main motivation is to parse a given HTML document into a tree-structure.
That is what the binding currently achieves.

The repository can be inspected here: https://github.com/Overbryd/myhtmlex

I am currently investigating how this binding should operate.
While doing so, I found that re-using a reference to a tree is somehow slower than parsing a new tree.
There is a good chance that I am doing something wrong, so I kindly ask you to have a look at the following C functions that implement the NIF:

Opening & parsing a tree and building the Erlang tree in the same call (faster):
- nif_decode(...)

Opening & parsing a tree into a reference, then building the Erlang tree from that reference (slower):
- nif_open(...)
  There I open a myhtml_tree_t and give it back to the VM using an allocated resource. For that I have a small struct, myhtmlex_ref_t, that holds a reference to the myhtml_tree_t and, for convenience, a reference to the root node as a myhtml_tree_node_t.
- nif_decode_tree(...)
  Here I receive a reference to a myhtmlex_ref_t and build the Erlang tree.

The idea was to "open & parse" a myhtml-tree, and then re-use it multiple times. Weirdly this is much slower than opening/parsing and building an Erlang tree in the same call.

I just want to rule out that my assumption about myhtml_tree_t is wrong, namely that it is a fully parsed tree that can be walked very fast.

Kind regards,

Lukas Rieder

Answer 1 · 2017-08-30T14:37:55.000Z

Hi Lukas,

I'm very glad to hear that you are creating bindings for myhtml. This is very good news!

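1. myhtml_parse always cleans the current tree before it begins parsing, so there is no need to call myhtml_tree_clean before myhtml_parse (https://github.com/Overbryd/myhtmlex/blob/106512ce76f5d90d3647a9896462b5cd77b61c7d/src/myhtmlex.c#L98).
2. You are destroying only one node, not the tree (https://github.com/Overbryd/myhtmlex/blob/106512ce76f5d90d3647a9896462b5cd77b61c7d/src/myhtmlex.c#L114). myhtml_node_free destroys the resources of one (the current) node only. To destroy the tree (all nodes and resources), use myhtml_tree_destroy. But if you use myhtml_tree_destroy, you always need to create a new tree with myhtml_tree_create, and that is expensive. Just delete this line.
  The tree is cleaned each time you call myhtml_parse (no new resources are allocated; the current resources are reused).
  See here: https://github.com/Overbryd/myhtmlex/blob/106512ce76f5d90d3647a9896462b5cd77b61c7d/src/myhtmlex.c#L134
  and here: https://github.com/Overbryd/myhtmlex/blob/106512ce76f5d90d3647a9896462b5cd77b61c7d/src/myhtmlex.c#L298
- In nif_open(...) you create a new tree every time nif_open is called (https://github.com/Overbryd/myhtmlex/blob/106512ce76f5d90d3647a9896462b5cd77b61c7d/src/myhtmlex.c#L43), and that takes a while.

See the instructions in the wiki: https://github.com/lexborisov/myhtml/wiki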
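
To illustrate why re-parsing into an existing tree is cheaper than creating a fresh one each time, here is a minimal toy model in C. It is only a sketch: toy_tree_t and every toy_* function are hypothetical stand-ins and are not part of the myhtml API. The model assumes, as described above, that a parse implicitly cleans the tree and reuses storage that is already allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parser tree; NOT the myhtml API. */
typedef struct {
    char  *buf;     /* storage owned by the tree */
    size_t cap;     /* bytes currently allocated */
    size_t len;     /* bytes used by the current document */
    int    allocs;  /* how often the storage had to be (re)allocated */
} toy_tree_t;

/* Plays the role of "create a tree": starts with no storage at all. */
toy_tree_t *toy_tree_create(void) {
    return calloc(1, sizeof(toy_tree_t));
}

/* Plays the role of "parse": drops the previous document, then reuses
 * the existing storage when the new document fits into it. */
int toy_tree_parse(toy_tree_t *t, const char *html) {
    assert(t != NULL && html != NULL);
    size_t need = strlen(html) + 1;
    t->len = 0;                       /* implicit clean before parsing */
    if (need > t->cap) {              /* grow only when really necessary */
        char *p = realloc(t->buf, need);
        if (p == NULL) return -1;
        t->buf = p;
        t->cap = need;
        t->allocs++;
    }
    memcpy(t->buf, html, need);
    t->len = need - 1;
    return 0;
}

/* Plays the role of "destroy the tree": releases everything. */
void toy_tree_destroy(toy_tree_t *t) {
    if (t != NULL) {
        free(t->buf);
        free(t);
    }
}

/* One tree for all documents; returns the number of storage allocations. */
int parse_all_reusing(const char *const *docs, size_t n) {
    toy_tree_t *t = toy_tree_create();
    if (t == NULL) return -1;
    for (size_t i = 0; i < n; i++) {
        if (toy_tree_parse(t, docs[i]) != 0) {
            toy_tree_destroy(t);
            return -1;
        }
    }
    int allocs = t->allocs;
    toy_tree_destroy(t);
    return allocs;
}

/* A fresh tree per document; returns the number of storage allocations. */
int parse_all_recreating(const char *const *docs, size_t n) {
    int allocs = 0;
    for (size_t i = 0; i < n; i++) {
        toy_tree_t *t = toy_tree_create();
        if (t == NULL) return -1;
        if (toy_tree_parse(t, docs[i]) != 0) {
            toy_tree_destroy(t);
            return -1;
        }
        allocs += t->allocs;
        toy_tree_destroy(t);
    }
    return allocs;
}
```

For three documents where the first is the longest, parse_all_reusing counts one storage allocation while parse_all_recreating counts three. The real library manages far more state than a single buffer, so the model only shows the direction of the effect, not its size.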

Answer 2 · 2017-08-30T15:47:52.000Z

Ok, I understand. I will remove the unnecessary cleanup lines on nodes.

Consequently, if I parse multiple HTML documents in parallel, I need one tree per instance. Is that correct?

But sequentially, I only need one tree for parsing multiple documents.

I have built nif_open with the idea in mind that the user can call the tree traversal functions provided in myhtml.

Answer 3 · 2017-08-30T16:14:14.000Z

> Consequently if I parse multiple html documents in parallel, I need one tree per instance.
> Is that correct?
>
> But sequentially, I only need one tree for parsing multiple documents.

Yes

> I have built nif_open with the idea in mind that the user can call the tree traversal functions provided in myhtml.

That's right, and it is best practice for bindings.
We build the tree and call the C functions from another language. Something like this (Perl code):

my $tree = $myhtml->parse($html);
print $tree->as_text();
$tree->destroy();

Code for bindings (PerlXS):

MyHTML::Tree
parse(myhtml, html)
	MyHTML myhtml;
	const char* html;

	CODE:
		myhtml_tree_t *tree = myhtml_tree_create();
		myhtml_tree_init(tree, myhtml);
		myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));

		RETVAL = tree;
	OUTPUT:
		RETVAL

SV*
as_text(tree)
	MyHTML::Tree tree;

	CODE:
		mycore_string_raw_t str_raw;
		mycore_string_raw_clean_all(&str_raw);

		myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str_raw);
		
		RETVAL = newSVpv(str_raw.data, str_raw.length);
		mycore_string_raw_destroy(&str_raw, false);
	OUTPUT:
		RETVAL

void
destroy(tree)
	MyHTML::Tree tree;
	
	CODE:
		myhtml_tree_destroy(tree);

In other words, we do not copy anything out of the C structures, because the data inside them can change. We just build a bridge from your language to the C structures.
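
The difference between bridging and copying can be sketched in a few lines of C. This is only an illustration under stated assumptions: lib_node_t, bridge_handle_t, snapshot_t and their functions are hypothetical names invented for this example, not myhtml or NIF APIs. The handle stores nothing but a pointer and reads through it on every call, while the snapshot copies a value once and can go stale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C-side structure owned by the library (a stand-in). */
typedef struct {
    const char *text;
} lib_node_t;

/* What a binding hands to the host language: a pointer, no copied data. */
typedef struct {
    lib_node_t *node;
} bridge_handle_t;

/* For contrast: a value copied out of the structure at one point in time. */
typedef struct {
    size_t text_length;
} snapshot_t;

/* Wrap a library node without copying any of its contents. */
bridge_handle_t bridge_wrap(lib_node_t *node) {
    assert(node != NULL);
    bridge_handle_t h = { node };
    return h;
}

/* Reads through the pointer at call time, so it sees the current state. */
size_t bridge_text_length(bridge_handle_t h) {
    return h.node->text != NULL ? strlen(h.node->text) : 0;
}

/* Copies the length once; later changes to the node are not reflected. */
snapshot_t snapshot_take(const lib_node_t *node) {
    assert(node != NULL);
    snapshot_t s = { node->text != NULL ? strlen(node->text) : 0 };
    return s;
}
```

If the node's text changes after both were created, bridge_text_length reports the new length while the snapshot still holds the old one. The price of the bridge is a lifetime rule: the handle is valid only as long as the underlying structure is alive, which is why the Perl example above calls destroy explicitly.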

Answer 4 · 2017-08-31T09:39:58.000Z

Thanks so much. All questions cleared up. I will close this now.