A Python library that queries Google, Bing, Yahoo and other search engines and collects the results from multiple search engine results pages.
Please note that web-scraping may be against the TOS of some search engines, and may result in a temporary ban.
Google
Bing
Yahoo
Duckduckgo
Startpage
Aol
Dogpile
Ask
Mojeek
Torch
- Creates output files (html, csv, json).
- Supports search filters (url, title, text).
- HTTP and SOCKS proxy support.
- Collects dark web links with Torch.
- Easy to add new search engines. You can add a new engine by creating a new class in
search_engines/engines/
and add it to thesearch_engines_dict
dictionary insearch_engines/engines/__init__.py
. The new class should subclassSearchEngine
, and override the following methods:_selectors
,_first_page
,_next_page
. - Python2 - Python3 compatible.
- Now also supports getting URLs from Image Searches for images. Still experimental!
- You can use a randomly generated fake useragent to attempt to improve search engine scraping success
Python 2.7 - 3.7 with
Requests and
BeautifulSoup
Run the setup file: $ python setup.py install
.
Done!
As a library:
from search_engines import Google
engine = Google()
results = engine.search("my query")
links = results.links()
print(links)
If you'd like to use a randomly generated useragent:
from search_engines import Startpage
engine=Startpage(fakeagent=True)
results=engine.search("my query")
links=results.links()
print(links)
If you're looking to get images:
from search_engines import Yahoo
engine = Yahoo() #highly recommended to use fakeagent=True
results=engine.search("cat",searchtype="image")
links=results.links()
print(links)
Note that you probably will not get many images for now, as pagination is still a work-in-progress
Currently the following Engines are supported for image search:
- Yahoo
- Bing
- AOL
- Qwant ** temperamental even with fakeagent
- Mojeek ** works well with fakeagent flag thrown, and has pagination!
- Google ** temperamental even with fakeagent
As a CLI script:
$ python search_engines_cli.py -e google,bing -q "my query" -o json,print
- async-search-scraper A really cool asynchronous implementation, written by @soxoj