MarioVilas/googlesearch

Receiving URLs outside the specified domain, and duplicate URLs with different HTML fragment IDs

Mahesha999 opened this issue · 1 comment

Issue 1

I did

googlesearch.search("Berlin Germany en.wikipedia.org", stop=10) 

It returned the following URLs:

"https://en.wikipedia.org/wiki/Berlin",
"https://en.wikipedia.org/wiki/History_of_Berlin",
"https://en.wikipedia.org/wiki/Geography_of_Berlin",
"https://en.wikipedia.org/wiki/West_Berlin",
"https://en.wikipedia.org/wiki/Berlin/Brandenburg_Metropolitan_Region",
"https://simple.wikipedia.org/wiki/Berlin",
"https://simple.wikipedia.org/wiki/Berlin#History",
"https://simple.wikipedia.org/wiki/Berlin#Education",
"https://simple.wikipedia.org/wiki/Berlin#Culture",
"https://simple.wikipedia.org/wiki/Berlin#Economy"

I was expecting it to return URLs only from the en.wikipedia.org domain, but it also returned URLs from simple.wikipedia.org. This is fine for my particular task, but it might not be for others. Is there any way to restrict the results to a single domain?

Trying

list(g.search('Berlin Germany', stop=10, tld='en.wikipedia.org'))

gives:

gaierror                                  Traceback (most recent call last)
/usr/lib/python3.8/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1353             try:
-> 1354                 h.request(req.get_method(), req.selector, req.data, headers,
   1355                           encode_chunked=req.has_header('Transfer-encoding'))

/usr/lib/python3.8/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1255         """Send a complete request to the server."""
-> 1256         self._send_request(method, url, body, headers, encode_chunked)
   1257 

/usr/lib/python3.8/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1301             body = _encode(body, 'body')
-> 1302         self.endheaders(body, encode_chunked=encode_chunked)
   1303 

/usr/lib/python3.8/http/client.py in endheaders(self, message_body, encode_chunked)
   1250             raise CannotSendHeader()
-> 1251         self._send_output(message_body, encode_chunked=encode_chunked)
   1252 

/usr/lib/python3.8/http/client.py in _send_output(self, message_body, encode_chunked)
   1010         del self._buffer[:]
-> 1011         self.send(msg)
...
-> 1357                 raise URLError(err)
   1358             r = h.getresponse()
   1359         except:

URLError: <urlopen error [Errno -2] Name or service not known>

Issue 2

My main concern is that I am receiving the same URL multiple times, just with different HTML fragment IDs appended. That is, the URLs below all correspond to the same webpage:

"https://simple.wikipedia.org/wiki/Berlin",
"https://simple.wikipedia.org/wiki/Berlin#History",
"https://simple.wikipedia.org/wiki/Berlin#Education",
"https://simple.wikipedia.org/wiki/Berlin#Culture",
"https://simple.wikipedia.org/wiki/Berlin#Economy"

I want it to return https://simple.wikipedia.org/wiki/Berlin only once. Is there any way to achieve this? Am I missing something stupid?

Hi! :)

To restrict the results to a single domain, you can change your query as follows: "Berlin site:en.wikipedia.org".
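A minimal sketch of that approach: build the query with the `site:` operator, and optionally add a defensive client-side check on the hostname using `urllib.parse.urlparse` (the helper name `from_domain` is mine, not part of the library):

```python
from urllib.parse import urlparse

# Query using Google's site: operator to restrict results to one domain;
# pass this string to googlesearch.search() as usual.
query = "Berlin site:en.wikipedia.org"

def from_domain(urls, domain):
    """Keep only URLs whose hostname exactly matches the given domain."""
    return [u for u in urls if urlparse(u).netloc == domain]

# Example using the URLs from the issue above:
results = [
    "https://en.wikipedia.org/wiki/Berlin",
    "https://simple.wikipedia.org/wiki/Berlin",
]
print(from_domain(results, "en.wikipedia.org"))
# -> ['https://en.wikipedia.org/wiki/Berlin']
```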

The `tld` argument does not affect the domain of the results; it selects which Google domain to query - for example, google.de instead of google.com.

As for the seemingly "duplicate" results, the library can't really avoid returning them without losing information that some users may want to preserve, so you'll have to do your own filtering. I'd recommend using urllib to parse each result and remove the fragment suffix - you could also do this with simple string parsing, but urllib is cleaner and more robust.
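A minimal sketch of that filtering, using `urllib.parse.urldefrag` from the standard library (the helper name `dedupe_fragments` is mine, not part of the library):

```python
from urllib.parse import urldefrag

def dedupe_fragments(urls):
    """Strip #fragment suffixes and keep each page only once,
    preserving the original result order."""
    seen = set()
    deduped = []
    for url in urls:
        base, _fragment = urldefrag(url)  # splits off everything after '#'
        if base not in seen:
            seen.add(base)
            deduped.append(base)
    return deduped

# Example using the URLs from the issue above:
urls = [
    "https://simple.wikipedia.org/wiki/Berlin",
    "https://simple.wikipedia.org/wiki/Berlin#History",
    "https://simple.wikipedia.org/wiki/Berlin#Education",
    "https://simple.wikipedia.org/wiki/Berlin#Culture",
    "https://simple.wikipedia.org/wiki/Berlin#Economy",
]
print(dedupe_fragments(urls))
# -> ['https://simple.wikipedia.org/wiki/Berlin']
```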

Hope that helps!