Mimino666/python-xextract

Text Node Element support

phpdude opened this issue · 7 comments

Hi again,

The library doesn't fully support text nodes. They need to be supported because bare text is common in real-world markup: page authors often output data without wrapping it in any HTML element. I tried to select such entries with this example code:

Prefix(css='.user', children=[
    Text(name="position", xpath='text()')
])

This example produces an XPathExtractorList whose items have a "_root" of type lxml.etree._ElementUnicodeResult. That type has no xpath attribute, so any further selection is cut short by the validation check in xextract.extractors.lxml_extractor.XPathExtractor#select, which returns an empty list:

        if not hasattr(self._root, 'xpath'):
            return XPathExtractorList([])
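To illustrate why that check trips, here is a minimal sketch that uses lxml directly rather than xextract. The HTML snippet and variable names are made up for illustration; the point is only that a text() query yields string-like results instead of elements.

```python
from lxml import html

# Hypothetical, shortened markup (not the real page).
doc = html.fromstring('<p class="user"><span>English</span><br>Sales</p>')

# A text() step returns lxml "smart strings", not element objects.
nodes = doc.xpath('//p[@class="user"]/text()')
node = nodes[0]

is_string = isinstance(node, str)   # smart strings subclass str
has_xpath = hasattr(node, 'xpath')  # no xpath() method on a string result

print(type(node).__name__, is_string, has_xpath)
```

Since the result is already a string, there is nothing left to run a nested selector against, which is consistent with the empty list returned by the check above.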

We must fix it :)

I'm ready to help if you know a good way to support it :)

Can you please provide an HTML example and the output you are trying to extract from it?

Of course :)

<p class="user">
                        <span>                <span>English</span>,                            <span>Polish</span>            </span><br>
                                                                                        Management,
                            Accountancy, invoices,                            Logistics,                            Marketing,                            Domestic forwarder,                            International forwarder,                            Sales,                            Company owner,                            Supplies,                            Management or governing body                                        <br>Dyrektor Handlowy - właściciel
                        </p>

I want to extract Dyrektor Handlowy - właściciel

Try:

Element(xpath='//p[@class="user"]/text()')

The Element parser returns the lxml result as-is, which in the case of text extraction is a unicode string.
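For reference, a rough sketch of what that XPath yields when run with lxml alone, plus one possible way to pick out the wanted value. The markup is a shortened, hypothetical version of the HTML above, and taking the last non-empty text node is an assumption about this particular layout, not something xextract does for you.

```python
from lxml import html

# Shortened, hypothetical version of the markup posted above.
SAMPLE = """<p class="user">
    <span><span>English</span>, <span>Polish</span></span><br>
    Management, Sales, Company owner
    <br>Dyrektor Handlowy - właściciel
</p>"""

doc = html.fromstring(SAMPLE)

# Every direct text node of the paragraph, including whitespace-only ones.
raw_texts = doc.xpath('//p[@class="user"]/text()')

# Strip surrounding whitespace and drop the empty leftovers.
texts = [t.strip() for t in raw_texts if t.strip()]

# Assumption: the wanted value is the last non-empty text node.
position = texts[-1]
print(position)
```

The same stripping and filtering would be needed on whatever the Element parser hands back, since the text nodes keep their original whitespace.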

Oh, lol. I missed it, I am sorry :)

You should add this to the README :)

Yeah it works, I tested it! Thanks :)

No problem :) I have updated README.