Implement Ox support
soulcutter opened this issue · 18 comments
It would be nice to support parsers other than Nokogiri. Ox in particular is supposed to have great performance, and so would be a good first candidate.
This looks like a very interesting feature. After some testing, Ox is way faster than Nokogiri but less feature-complete, especially with regard to XPath. The SAX feature, however, looks quite similar to Nokogiri's.
Is there somebody working on this already?
No, this has been on the backburner for a while - I just haven't had a reason to revisit it, though it's pretty much what's holding up a 1.0 release
ox-mapper could be a good start:
I'm currently replacing sax-machine with saxerator in a project of mine, and if all goes well I'd love to try my hand at this Ox support.
👍
@soulcutter, @jalberto
Hi guys, I've started working on implementing an Ox parser for saxerator.
You can find the commits here: https://github.com/fanantoxa/saxerator/commits/implementing-ox-parser
This is a first draft and will need refactoring later, but first I want to get the code working.
Could you help me with it? Parsing strings works well now, but I have some problems with file parsing (it doesn't want to parse nested items).
I'd be very happy if you could help with the implementation or just offer some suggestions.
#33 is closer to what I had in mind (although totally broken in its current state)
Rather than duplicating everything to get ox in there, I took the approach of trying to extract the most basic interface that would work https://github.com/soulcutter/saxerator/blob/extract-adapter/lib/saxerator/sax_handler.rb and move all references to nokogiri into https://github.com/soulcutter/saxerator/blob/extract-adapter/lib/saxerator/adapters/nokogiri.rb
In its current state I have not yet even begun the ox handler, but I'm hoping it would be fairly simple (although I may have to tweak the SaxHandler interface depending on how it reads attributes, but one thing at a time)
Hey, I got the adapter working for nokogiri! Do you think that's a solid-enough basis for adding ox support?
@soulcutter Looks good, but it might not be enough. Different parsers have different capabilities. Ox is actually lighter than Nokogiri. I'll take a look at the code tomorrow.
IINM The tricky thing with ox will be how it parses attributes, but I think there should be a way to collect those before triggering a start_element(name, attrs)
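A rough sketch of that buffering idea, for illustration only (this is not code from the linked branches). It assumes Ox calls `start_element(name)`, then `attr(name, value)` once per attribute, then `attrs_done`, and that names arrive as symbols. `BufferingOxAdapter` and `RecordingHandler` are hypothetical names, and the callbacks are driven by hand here so the example runs without the ox gem:

```ruby
# Hypothetical adapter: collects Ox-style attribute callbacks and emits a
# single start_element(name, attrs) once the opening tag is complete.
class BufferingOxAdapter
  def initialize(handler)
    @handler = handler # must respond to start_element, characters, end_element
    @pending_name = nil
    @pending_attrs = {}
  end

  # The element name arrives first, without any attributes.
  def start_element(name)
    @pending_name = name.to_s
    @pending_attrs = {}
  end

  # Each attribute arrives as its own callback.
  def attr(name, value)
    @pending_attrs[name.to_s] = value
  end

  # The opening tag is finished: forward the combined payload.
  # Defensive assumption: ignore the call if no element is pending.
  def attrs_done
    return unless @pending_name

    @handler.start_element(@pending_name, @pending_attrs)
    @pending_name = nil
    @pending_attrs = {}
  end

  def text(value)
    @handler.characters(value)
  end

  def end_element(name)
    @handler.end_element(name.to_s)
  end
end

# Toy handler that records the events it receives.
class RecordingHandler
  attr_reader :events

  def initialize
    @events = []
  end

  def start_element(name, attrs)
    @events << [:start_element, name, attrs]
  end

  def characters(string)
    @events << [:characters, string]
  end

  def end_element(name)
    @events << [:end_element, name]
  end
end

# Simulate the callback order for <book id="1">Dune</book>.
handler = RecordingHandler.new
adapter = BufferingOxAdapter.new(handler)
adapter.start_element(:book)
adapter.attr(:id, "1")
adapter.attrs_done
adapter.text("Dune")
adapter.end_element(:book)
p handler.events
```

The handler never sees a bare `start_element(name)`, so it can stay identical to one written against a parser that delivers attributes up front.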
@soulcutter Hi. Sorry for the delay, I've been a bit overloaded at my new job.
I've taken a look at your changes and they look cool. But as you mentioned, we have problems with Latches.
Ox has different callbacks with different numbers of params:
def instruct(target); end
def end_instruct(target); end
def attr(name, str); end
def attr_value(name, value); end
def attrs_done(); end
def doctype(str); end
def comment(str); end
def cdata(str); end
def text(str); end
def value(value); end
def start_element(name); end
def end_element(name); end
Instead of Nokogiri's:
So I think we have to add a new abstraction here for Latches too.
I have a pretty good idea of how I can match ox callbacks to the SaxHandler api. I'll have something to show within a couple days.
@soulcutter Cool)) If you don't have time, you can tell me what you want to change and I'll try to implement it)
@soulcutter I've also done a bit of research on a few other parsers that you wanted to support.
I think we need to think a bit more about the abstraction, because they also have different callbacks:
Oga
on_document
on_doctype
on_cdata
on_comment
on_proc_ins
on_xml_decl
on_text
on_element
on_element_children
on_attribute
on_attributes
after_element
LibXML
on_cdata_block(cdata)
on_characters(chars)
on_comment(msg)
on_end_document()
on_end_element_ns(name, prefix, uri)
on_error(msg)
on_external_subset(name, external_id, system_id)
on_has_external_subset()
on_has_internal_subset()
on_internal_subset(name, external_id, system_id)
on_is_standalone()
on_processing_instruction(target, data)
on_reference(name)
on_start_document()
on_start_element_ns(name, attributes, prefix, uri, namespaces)
So it would be great if we had the ability to change the callback names.
The different APIs are the reason behind the adapter extraction. Each adapter will take the messages sent by the parser and translate them to the constrained API which we define (which happens to be implemented through delegation because it's the first implementation, and because it seemed like a sensible enough interface).
For example, oga will not send a start_element until it gets a message other than attribute-related ones. Then it will have stored the element name and all its attributes, so it's a complete payload for our interface.
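To make the translation step concrete, here is a minimal sketch of an adapter for the LibXML-style callback names listed in this thread. It is illustrative only: `LibxmlStyleAdapter` and `PathHandler` are hypothetical names, the constrained interface is assumed to be `start_element(name, attrs)`, `characters(string)` and `end_element(name)`, and the attributes argument is assumed to be convertible with `to_h`. The callbacks are invoked by hand so the example runs without libxml-ruby:

```ruby
require "forwardable"

# Hypothetical adapter: renames parser-specific callbacks to the common
# interface and drops arguments the handler does not need.
class LibxmlStyleAdapter
  extend Forwardable

  # Same payload under a different name: plain delegation is enough.
  def_delegator :@handler, :characters, :on_characters

  def initialize(handler)
    @handler = handler
  end

  # Richer payload than the common interface needs: namespace parts are dropped.
  def on_start_element_ns(name, attributes, _prefix, _uri, _namespaces)
    @handler.start_element(name, attributes.to_h)
  end

  def on_end_element_ns(name, _prefix, _uri)
    @handler.end_element(name)
  end
end

# Toy handler written only against the common interface: it records the
# element path at which each piece of text was seen.
class PathHandler
  attr_reader :paths

  def initialize
    @stack = []
    @paths = []
  end

  def start_element(name, _attrs)
    @stack.push(name)
  end

  def characters(string)
    @paths << [@stack.join("/"), string]
  end

  def end_element(_name)
    @stack.pop
  end
end

# Simulate the callbacks for <library><book id="1">Dune</book></library>.
path_handler = PathHandler.new
libxml_adapter = LibxmlStyleAdapter.new(path_handler)
libxml_adapter.on_start_element_ns("library", {}, nil, nil, {})
libxml_adapter.on_start_element_ns("book", { "id" => "1" }, nil, nil, {})
libxml_adapter.on_characters("Dune")
libxml_adapter.on_end_element_ns("book", nil, nil)
libxml_adapter.on_end_element_ns("library", nil, nil)
p path_handler.paths
```

With this shape, supporting another parser means writing one more small adapter class, while handlers stay untouched.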
Bradley Schaefer
#34 outlines sorta what I had in mind
@soulcutter I've created a pull request against the ox-adapter branch: #35