Receiving URLs apart from specified domain and duplicate URLs with different suffix html IDs
Mahesha999 opened this issue · 1 comment
Issue 1
I ran
googlesearch.search("Berlin Germany en.wikipedia.org", stop=10)
It returned the following URLs:
"https://en.wikipedia.org/wiki/Berlin",
"https://en.wikipedia.org/wiki/History_of_Berlin",
"https://en.wikipedia.org/wiki/Geography_of_Berlin",
"https://en.wikipedia.org/wiki/West_Berlin",
"https://en.wikipedia.org/wiki/Berlin/Brandenburg_Metropolitan_Region",
"https://simple.wikipedia.org/wiki/Berlin",
"https://simple.wikipedia.org/wiki/Berlin#History",
"https://simple.wikipedia.org/wiki/Berlin#Education",
"https://simple.wikipedia.org/wiki/Berlin#Culture",
"https://simple.wikipedia.org/wiki/Berlin#Economy"
I was expecting it to return URLs only from the en.wikipedia.org domain, but it also returned results from simple.wikipedia.org. Though I am fine with this for my particular task, that might not be the case for others. Is there any way to restrict the results to a single domain?
Trying
list(g.search('Berlin Germany', stop=10, tld='en.wikipedia.org'))
gives:
gaierror Traceback (most recent call last)
/usr/lib/python3.8/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
1353 try:
-> 1354 h.request(req.get_method(), req.selector, req.data, headers,
1355 encode_chunked=req.has_header('Transfer-encoding'))
/usr/lib/python3.8/http/client.py in request(self, method, url, body, headers, encode_chunked)
1255 """Send a complete request to the server."""
-> 1256 self._send_request(method, url, body, headers, encode_chunked)
1257
/usr/lib/python3.8/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
1301 body = _encode(body, 'body')
-> 1302 self.endheaders(body, encode_chunked=encode_chunked)
1303
/usr/lib/python3.8/http/client.py in endheaders(self, message_body, encode_chunked)
1250 raise CannotSendHeader()
-> 1251 self._send_output(message_body, encode_chunked=encode_chunked)
1252
/usr/lib/python3.8/http/client.py in _send_output(self, message_body, encode_chunked)
1010 del self._buffer[:]
-> 1011 self.send(msg)
...
-> 1357 raise URLError(err)
1358 r = h.getresponse()
1359 except:
URLError: <urlopen error [Errno -2] Name or service not known>
Issue 2
My main concern is that I am receiving the same URL multiple times, just with different HTML fragment IDs as suffixes. That is, the URLs below all correspond to the same webpage, differing only in their fragment IDs:
"https://simple.wikipedia.org/wiki/Berlin",
"https://simple.wikipedia.org/wiki/Berlin#History",
"https://simple.wikipedia.org/wiki/Berlin#Education",
"https://simple.wikipedia.org/wiki/Berlin#Culture",
"https://simple.wikipedia.org/wiki/Berlin#Economy"
I want it to return https://simple.wikipedia.org/wiki/Berlin
only once. Is there any way to achieve this? Am I missing something obvious?
Hi! :)
To restrict the results to a single domain, you can change your query as follows: "Berlin site:en.wikipedia.org".
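If you also want a client-side safety net, you can filter the returned URLs by hostname yourself. A minimal sketch (the filter_by_domain helper and the sample list are my own illustration, not part of the googlesearch library):

```python
from urllib.parse import urlparse

def filter_by_domain(urls, domain):
    """Keep only URLs whose hostname exactly matches `domain`."""
    return [u for u in urls if urlparse(u).netloc == domain]

results = [
    "https://en.wikipedia.org/wiki/Berlin",
    "https://simple.wikipedia.org/wiki/Berlin",
]
print(filter_by_domain(results, "en.wikipedia.org"))
# → ['https://en.wikipedia.org/wiki/Berlin']
```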
The tld argument does not affect the domain of the results; it selects which Google domain is queried, for example google.de instead of google.com.
As for the seemingly "duplicate" results: the library can't really drop them without losing information that some users may want to keep, so you'll have to do your own filtering. I'd recommend using urllib.parse to parse each result and strip the fragment suffix; you could also do this with simple string manipulation, but urllib is cleaner and more robust.
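For example, urldefrag from the standard library splits a URL into its base and its #fragment. A sketch of the deduplication (the sample list mirrors the URLs from the question):

```python
from urllib.parse import urldefrag

urls = [
    "https://simple.wikipedia.org/wiki/Berlin",
    "https://simple.wikipedia.org/wiki/Berlin#History",
    "https://simple.wikipedia.org/wiki/Berlin#Education",
]

# Drop the fragment (#...) from each URL, keeping first-seen order
deduped = []
for url in urls:
    base, _fragment = urldefrag(url)
    if base not in deduped:
        deduped.append(base)

print(deduped)  # → ['https://simple.wikipedia.org/wiki/Berlin']
```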
Hope that helps!