Erlang/Elixir bindings
Overbryd opened this issue · 4 comments
Hi Alexander,
I hope you are pleased to hear that I started working on an Elixir/Erlang binding for your marvellous library.
My main motivation is to parse a given HTML document into a tree-structure.
That is what the binding currently achieves.
The repository can be inspected here: https://github.com/Overbryd/myhtmlex
I am currently investigating which mode of operation this binding should run in.
While doing this, I found that re-using a reference to a tree is somehow slower than parsing a new tree.
There is a high chance that I am wrong or doing something stupid, so I kindly ask you to have a look at the following C functions that implement the NIF:
- Opening & parsing a tree and building a tree in the same call (faster)
- Opening & parsing a tree into a reference, then building a tree given a reference (slower)
- nif_open(...): There I open a myhtml_tree_t and give it back to the VM using an allocated resource. Therefore I have a small struct myhtmlex_ref_t that holds a reference to the myhtml_tree_t and also a reference to the root node as a myhtml_tree_node_t for convenience.
- nif_decode_tree(...): Here I receive a reference to the myhtmlex_ref_t and build the Erlang tree.
The idea was to "open & parse" a myhtml tree and then re-use it multiple times. Weirdly, this is much slower than opening/parsing and building an Erlang tree in the same call.
I just want to rule out that I have a wrong assumption about myhtml_tree_t being a fully parsed tree that can be walked very fast.
Kind regards,
Lukas Rieder
Hi Lukas,
I'm very glad to hear that you are creating bindings for myhtml. This is very good news!
- myhtml_parse always cleans the current tree before it begins parsing. There is no need to call myhtml_tree_clean before myhtml_parse.
- You are destroying only one node, not the tree. The myhtml_node_free function destroys the resources of one (the current) node only. To destroy the tree (all nodes and resources), use myhtml_tree_destroy. But if you use myhtml_tree_destroy, you always need to create a new tree with myhtml_tree_create, which is expensive. Just delete this line. The tree will be cleaned each time you call myhtml_parse (no new resources will be allocated; the current resources will be reused).
- In nif_open(...) you create a new tree on every call, which takes a while.
Consequently, if I parse multiple HTML documents in parallel, I need one tree per instance. Is that correct?
But sequentially, I only need one tree for parsing multiple documents.
Yes
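To make the cost difference concrete, here is a toy model in plain C. It is not myhtml code: toy_tree_t, toy_tree_create, toy_parse and the allocation counter are hypothetical stand-ins, and the "parser" only records where each '<' occurs. The point it sketches is that a parse call which cleans and refills an existing tree needs no new allocations, while a create/destroy cycle per document pays for them every time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a parser tree: storage is allocated once, at creation. */
typedef struct {
    size_t *tag_offsets;  /* offset of each '<' found in the last input */
    size_t  capacity;     /* number of slots in tag_offsets */
    size_t  length;       /* number of slots currently in use */
} toy_tree_t;

/* Counts how many times a tree was created (assumption: used only to
   make the cost of create/destroy visible in this sketch). */
static size_t toy_alloc_count = 0;

toy_tree_t *toy_tree_create(size_t capacity) {
    toy_tree_t *tree = malloc(sizeof *tree);
    if (tree == NULL) return NULL;
    tree->tag_offsets = malloc(capacity * sizeof *tree->tag_offsets);
    if (tree->tag_offsets == NULL) {
        free(tree);
        return NULL;
    }
    tree->capacity = capacity;
    tree->length = 0;
    toy_alloc_count++;
    return tree;
}

/* Forget the previous contents but keep the storage. */
void toy_tree_clean(toy_tree_t *tree) {
    assert(tree != NULL);
    tree->length = 0;
}

/* Cleans the tree first, then refills it; returns the number of tags seen. */
size_t toy_parse(toy_tree_t *tree, const char *html) {
    assert(tree != NULL && html != NULL);
    toy_tree_clean(tree);
    for (size_t i = 0; html[i] != '\0' && tree->length < tree->capacity; i++) {
        if (html[i] == '<') {
            tree->tag_offsets[tree->length++] = i;
        }
    }
    return tree->length;
}

void toy_tree_destroy(toy_tree_t *tree) {
    if (tree == NULL) return;
    free(tree->tag_offsets);
    free(tree);
}
```

Parsing several documents through one toy_tree_t leaves toy_alloc_count at 1, whereas creating and destroying a tree per document raises it once per document. For parallel parsing, each worker would hold its own tree.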
I have built nif_open with the idea in mind that the user can call the tree
traversal functions provided in myhtml.
That's right, this is best practice for bindings.
We build the tree and call C functions from another language. Something like this (Perl code):
my $tree = $myhtml->parse($html);
print $tree->as_text();
$tree->destroy();
Code for bindings (PerlXS):
MyHTML::Tree
parse(myhtml, html)
    MyHTML myhtml;
    const char* html;
  CODE:
    /* Create a tree bound to the given myhtml instance and parse into it. */
    myhtml_tree_t *tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));
    RETVAL = tree;
  OUTPUT:
    RETVAL

SV*
as_text(tree)
    MyHTML::Tree tree;
  CODE:
    /* Serialize the whole document into a raw string buffer. */
    mycore_string_raw_t str_raw;
    mycore_string_raw_clean_all(&str_raw);
    myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str_raw);
    RETVAL = newSVpv(str_raw.data, str_raw.length);
    mycore_string_raw_destroy(&str_raw, false);
  OUTPUT:
    RETVAL

void
destroy(tree)
    MyHTML::Tree tree;
  CODE:
    /* Release the tree and all of its nodes. */
    myhtml_tree_destroy(tree);
In other words, we do not copy anything out of the C structures, because the data inside them can change. We just build a bridge from your language to the C structures.
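As a rough illustration of the bridge idea, the sketch below uses plain C with made-up types (toy_node_t, toy_tree_t, toy_ref_t; none of them come from myhtml or myhtmlex). The handle stores pointers to the tree and its root, in the spirit of the myhtmlex_ref_t described earlier, so a later change to the tree is visible through the handle, while a value copied out beforehand goes stale.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the C library's structures. */
typedef struct {
    char tag[16];
} toy_node_t;

typedef struct {
    toy_node_t root;
} toy_tree_t;

/* The handle a binding would give to the host language: pointers only. */
typedef struct {
    toy_tree_t *tree;
    toy_node_t *root;  /* cached for convenience */
} toy_ref_t;

toy_ref_t toy_ref_open(toy_tree_t *tree) {
    assert(tree != NULL);
    toy_ref_t ref = { tree, &tree->root };
    return ref;
}

/* Simulates the library changing its own data, e.g. on a re-parse. */
void toy_tree_set_root_tag(toy_tree_t *tree, const char *tag) {
    assert(tree != NULL && tag != NULL);
    snprintf(tree->root.tag, sizeof tree->root.tag, "%s", tag);
}

/* Reads through the handle at call time instead of keeping a copy. */
const char *toy_ref_root_tag(const toy_ref_t *ref) {
    assert(ref != NULL);
    return ref->root->tag;
}

/* Returns 1 if the root tag currently equals the given name, else 0. */
int toy_ref_root_is(const toy_ref_t *ref, const char *tag) {
    assert(tag != NULL);
    return strcmp(toy_ref_root_tag(ref), tag) == 0;
}
```

If the root tag is changed after the handle was opened, toy_ref_root_tag(&ref) yields the new tag, whereas a string copied out earlier still holds the old one.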
Thanks so much. All questions cleared up. I will close this now.