jarun/googler

Error when Searching for News

jamulix opened this issue · 8 comments

Since April 1, I always get the same error message when searching for news with googler version 4.3.2 and Python version 3.8.5.
It happens with or without the noprompt option (--np), and it does not depend on the Python 3 version either: the result is the same with Python 3.7.
I suspect the reason is that Google has been returning a different response format since April 1 (the day of the first occurrence). The response format might also be different in Europe and Asia.

Example:

googler --np -N Biden
[ERROR] DOM builder aborted at 1067:345: expecting end tag 'div', got 'body'

Debug output:

googler --np -d -N Biden

[DEBUG] googler version 4.3.2
[DEBUG] Python version 3.8.5
[DEBUG] Platform: Linux-5.4.0-72-lowlatency-x86_64-with-glibc2.29
[DEBUG] Connecting to new host www.google.com
[DEBUG] Opened socket to 142.250.185.164:443
[DEBUG] new_connection completed in 0.047s
[DEBUG] Fetching URL /search?ie=UTF-8&oe=UTF-8&q=Biden&sei=rt1Yvv03QqmJw3+861xOtA&tbm=nws
[DEBUG] Cookie: CONSENT=PENDING+820
[DEBUG] Redirecting to URL https://consent.google.com/m?continue=https://www.google.com/search%3Fie%3DUTF-8%26oe%3DUTF-8%26q%3DBiden%26sei%3Drt1Yvv03QqmJw3%2B861xOtA%26tbm%3Dnws&gl=DE&m=0&pc=srp&hl=de&src=1
[DEBUG] Connecting to new host consent.google.com
[DEBUG] Opened socket to 172.217.23.110:443
[DEBUG] new_connection completed in 0.051s
[DEBUG] Fetching URL /m?continue=https://www.google.com/search%3Fie%3DUTF-8%26oe%3DUTF-8%26q%3DBiden%26sei%3Drt1Yvv03QqmJw3%2B861xOtA%26tbm%3Dnws&gl=DE&m=0&pc=srp&hl=de&src=1
[DEBUG] fetch_page completed in 0.303s
[DEBUG] Response body written to '/tmp/googler-response-3vjkj5_r.html'.
Traceback (most recent call last):
  File "/usr/local/bin/googler", line 3819, in <module>
    main()
  File "/usr/local/bin/googler", line 3796, in main
    repl.fetch()
  File "/usr/local/bin/googler", line 2726, in enforced_method
    method(self, *args, **kwargs)
  File "/usr/local/bin/googler", line 2848, in fetch
    parser = GoogleParser(page, news=self._google_url.news, videos=self._google_url.videos)
  File "/usr/local/bin/googler", line 2326, in __init__
    self.parse(html)
  File "/usr/local/bin/googler", line 1569, in wrapped
    ret = func(*args, **kwargs)
  File "/usr/local/bin/googler", line 2330, in parse
    tree = parse_html(html)
  File "/usr/local/bin/googler", line 780, in parse_html
    builder.feed(html)
  File "/usr/lib/python3.8/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib/python3.8/html/parser.py", line 173, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python3.8/html/parser.py", line 421, in parse_endtag
    self.handle_endtag(elem)
  File "/usr/local/bin/googler", line 705, in handle_endtag
    raise DOMBuilderException(
__main__.DOMBuilderException: DOM builder aborted at 1067:345: expecting end tag 'div', got 'body'

The response body written to '/tmp/googler-response-3vjkj5_r.html' is attached here:
response.zip

Code snippet from /usr/local/bin/googler, around line 705:

class DOMBuilder(HTMLParser):
    """
    HTML parser / DOM builder.

    Subclasses :class:`html.parser.HTMLParser`.

    Consume HTML and builds a :class:`Node` tree. Once finished, use
    :attr:`root` to access the root of the tree.

    This parser cannot parse malformed HTML with tag mismatch.
    """
...

    def handle_endtag(self, tag: str) -> None:
        tag = tag.lower()
        children = []
        while self._stack and not self._stack[-1]._partial:
            children.append(self._stack.pop())
        if not self._stack:
            raise DOMBuilderException(self.getpos(), "extra end tag: %s" % repr(tag))
        parent = self._stack[-1]
        if parent.tag != tag:
            raise DOMBuilderException(                           #####   Line 705   ####
                self.getpos(),
                "expecting end tag %s, got %s" % (repr(parent.tag), repr(tag)),
            )
        parent.children = list(reversed(children))
        parent._partial = False
        for child in children:
            child.parent = parent
        self._namespace_stack.pop()
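
The failure mode can be reproduced outside googler with a much smaller parser. The following is only a minimal sketch, not googler's actual DOMBuilder: `StrictTagChecker`, `VOID_TAGS` and `check` are hypothetical names, and the void-element list is deliberately incomplete. It shows how a strict stack-based parser built on `html.parser.HTMLParser` aborts as soon as an end tag does not match the innermost open tag.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of void elements that never get an end tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class StrictTagChecker(HTMLParser):
    """Hypothetical, simplified stand-in for a strict DOM builder."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.open_tags:
            raise ValueError("extra end tag: %r" % tag)
        expected = self.open_tags.pop()
        if expected != tag:
            # Same kind of mismatch as reported in the traceback above.
            raise ValueError("expecting end tag %r, got %r" % (expected, tag))


def check(html):
    """Return None if the tags are balanced, else the error message."""
    checker = StrictTagChecker()
    try:
        checker.feed(html)
        checker.close()
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The <div> is never closed, so </body> arrives while 'div' is still open.
    print(check("<html><body><div><p>hi</p></body></html>"))
```

A browser-grade HTML5 parser would recover from the unclosed div; a strict checker like this one cannot, which is the trade-off the docstring above already states.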


Ubuntu 20.04
Kernel:
Linux raika 5.4.0-72-lowlatency #80-Ubuntu SMP PREEMPT Mon Apr 12 18:37:24 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

googler version 4.3.2
Python version 3.8.5
KDE Konsole Version 19.12.3

%locale

LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

jarun commented

Thanks for the details and the response file. @zmwangx please have a look.

Yes, this is tag soup with wrongly closed tags; even prettier can't format it:

$ prettier -w googler-response-3vjkj5_r.html
googler-response-3vjkj5_r.html
[error] googler-response-3vjkj5_r.html: SyntaxError: Unexpected closing tag "body". It may happen when the tag has already been closed by another tag. For more info see https://www.w3.org/TR/html5/syntax.html#closing-elements-that-have-implied-end-tags (1067:346)
[error]   1065 | ]
[error]   1066 | ]
[error] > 1067 | , sideChannel: {}});</script><script id="wiz_jd" nonce="bzE0bpnLpLEKxNbvHL7Y6w">if (window['_wjdc']) {const wjd = {}; window['_wjdc'](wjd); delete window['_wjdc'];}</script><script aria-hidden="true" nonce="bzE0bpnLpLEKxNbvHL7Y6w">window.wiz_progress&&window.wiz_progress(); window.stopScanForCss&&window.stopScanForCss(); ccTick('bl');</script></body></html>
[error]        |                                                                                                                                                                                                                                                                                                                                                          ^^^^^^^

I'll see what I can do later, other than introducing a full-blown HTML5 parser.

Wait, it's actually some kind of notice...

[Image: notice]

So the good news is that we don't actually need to parse this tag soup successfully, as the content is meaningless. The bad news is that I'm not sure how to work around it when I don't even get the notice myself in the first place.

Probably need user contribution.
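
Since the notice page itself carries no results, one option is to recognize the redirect before the body ever reaches the parser. The following is a rough sketch only, based on the redirect URL in the debug output above; `is_consent_redirect` and `original_url` are hypothetical helpers, not existing googler functions, and it assumes the consent host and the `continue` parameter stay as they appear in that log.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: host name taken from the "[DEBUG] Redirecting to URL" line above.
CONSENT_HOST = "consent.google.com"


def is_consent_redirect(location):
    """Return True if a redirect target points at the consent page."""
    return urlsplit(location).hostname == CONSENT_HOST


def original_url(location):
    """Recover the originally requested URL from the 'continue' parameter."""
    values = parse_qs(urlsplit(location).query).get("continue")
    return values[0] if values else None


if __name__ == "__main__":
    redirect = (
        "https://consent.google.com/m?continue=https://www.google.com/search"
        "%3Fie%3DUTF-8%26oe%3DUTF-8%26q%3DBiden%26sei%3Drt1Yvv03QqmJw3%2B861xOtA"
        "%26tbm%3Dnws&gl=DE&m=0&pc=srp&hl=de&src=1"
    )
    if is_consent_redirect(redirect):
        print("Consent page instead of results; original URL was:")
        print(original_url(redirect))
```

This only turns the confusing DOM builder error into a clear message; it does not get past the consent page.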

jarun commented

I believe this is some kind of user consent prompt, which we can't parse.

This seems to be German. Is it possible to google in a regular browser?

Translated:

It's just some annoying cookie consent crap. Some EU user needs to figure out what cookie to add to suppress this.

Clicking on "I agree" takes you to https://consent.google.com, which sets a cookie like this along with a 303 redirect, FWIW:

set-cookie: NID=214=DeX...<long string omitted>..._qE; expires=Tue, 26-Oct-2021 03:35:42 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=none

On NID cookie: https://policies.google.com/technologies/cookies?hl=en-US#:~:text=For%20example%2C%20most,user%E2%80%99s%20last%20use.
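
For anyone who wants to experiment with replaying such a cookie, here is a minimal sketch of turning a `set-cookie` value into a `Cookie` request header using the standard library's `http.cookies`. The function name `cookie_header_from_set_cookie` is hypothetical and the cookie value is a placeholder, not the real (omitted) string. Whether sending NID back actually suppresses the consent page is not established in this thread.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Build a Cookie request header value from one Set-Cookie value.

    Attributes such as expires, path and domain are dropped, because only
    name=value pairs are sent back to the server.
    """
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())


if __name__ == "__main__":
    # Placeholder value; the real NID string is omitted in the comment above.
    example = (
        "NID=214=PLACEHOLDER; expires=Tue, 26-Oct-2021 03:35:42 GMT; "
        "path=/; domain=.google.com; Secure; HttpOnly; SameSite=none"
    )
    print(cookie_header_from_set_cookie(example))
```

The resulting string could then be passed as a `Cookie` header on the search request to test whether the redirect still occurs.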

I am based in Germany and am seeing the same thing.

jarun commented

I don't think we can do anything here. Closing the issue.