crawl spider
A web crawler (also known as a web spider or web robot, and within the FOAF community more often called a web chaser) is a program or script that automatically crawls the web according to a set of rules.
###Do it yourself###
###The crawler framework consists of 5 parts:###
- Controller
- UrlManager
- HTMLDownloader
- HTMLParser
- Outputer
Controller: controls the start and end of the crawl.
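As a rough illustration of what controlling the start and end of the crawl could look like, here is a minimal sketch. Everything in it is an assumption rather than the project's actual code: the `Controller` class, the `max_pages` stop condition, and the method names `add_new_url`, `has_new_url`, `get_new_url`, `download`, `collect` and `output` are hypothetical. Only `parse(page_url, html_cont)` is taken from this README.

```python
class Controller:
    """Hypothetical controller that wires the other four parts together."""

    def __init__(self, urls, downloader, parser, outputer, max_pages=10):
        self.urls = urls              # UrlManager-like object (assumed interface)
        self.downloader = downloader  # HTMLDownloader-like object (assumed interface)
        self.parser = parser          # HTMLParser-like object with parse()
        self.outputer = outputer      # Outputer-like object (assumed interface)
        self.max_pages = max_pages    # assumed stop condition

    def crawl(self, root_url):
        """Crawl from root_url and return the number of pages processed."""
        count = 0
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url() and count < self.max_pages:
            page_url = self.urls.get_new_url()
            try:
                html_cont = self.downloader.download(page_url)
                result = self.parser.parse(page_url, html_cont)
                if result is None:  # parse() had nothing to work with
                    continue
                new_urls, new_data = result
                for url in new_urls:
                    self.urls.add_new_url(url)
                self.outputer.collect(new_data)
                count += 1
            except Exception as exc:
                # One failing page should not stop the whole crawl.
                print("crawl failed for %s: %s" % (page_url, exc))
        self.outputer.output()
        return count
```

The crawl starts by seeding the URL manager with `root_url` and ends when no unseen URLs remain or `max_pages` pages have been processed, at which point the collected data is written out.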
    class UrlManager:
        def __init__(self):
            self.new_urls = set()  # URLs waiting to be crawled
            self.old_urls = set()  # URLs that have already been crawled
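The two sets suggest how the URL bookkeeping could work: a URL moves from `new_urls` to `old_urls` once it is handed out, so no page is crawled twice. The sketch below fills in that idea. The methods `add_new_url`, `add_new_urls`, `has_new_url` and `get_new_url` are hypothetical names chosen for illustration and do not appear in the original.

```python
class UrlManager:
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out for crawling

    def add_new_url(self, url):
        """Queue url unless it is empty or already known (hypothetical helper)."""
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue every URL in an iterable; None is treated as empty."""
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        """Report whether any URL is still waiting to be crawled."""
        return len(self.new_urls) > 0

    def get_new_url(self):
        """Remove one pending URL, mark it as crawled, and return it."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because `set.pop()` returns an arbitrary element, this version crawls in no particular order; a queue would be needed for strict breadth-first crawling.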
HTMLDownloader: downloads the HTML page.
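The downloader code is not shown here, so the following is only a sketch of one way to fetch a page with the standard library. The class layout, the `download` method name and the `timeout` parameter are assumptions.

```python
from urllib.request import urlopen


class HTMLDownloader:
    def download(self, url, timeout=10):
        """Return the raw bytes found at url, or None when url is None."""
        if url is None:
            return None
        # urlopen raises URLError or HTTPError on failure,
        # so the caller decides how to handle a page that cannot be fetched.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
```

Returning bytes rather than decoded text leaves the decoding to the parser, which fits a parser that is told the encoding explicitly.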
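    from bs4 import BeautifulSoup

    class HTMLParser:
        def parse(self, page_url, html_cont):
            # Nothing to parse without both a URL and page content.
            if page_url is None or html_cont is None:
                return None
            # from_encoding applies when html_cont is raw bytes.
            soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
            new_urls = self._get_new_urls(page_url, soup)  # links to crawl next
            new_data = self._get_new_data(page_url, soup)  # data extracted from this page
            return new_urls, new_data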
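`parse` hands link discovery to `_get_new_urls`, whose body is not shown. A hypothetical version might collect every `<a href>` on the page and resolve it against `page_url`. The sketch below is written as a standalone function named `get_new_urls` for illustration, and the filter that keeps only `http` and `https` links is an added assumption.

```python
from urllib.parse import urljoin, urlparse


def get_new_urls(page_url, soup):
    """Return the set of absolute http(s) links found in a parsed page."""
    new_urls = set()
    # find_all('a', href=True) yields only anchor tags that carry an href.
    for link in soup.find_all('a', href=True):
        # Resolve relative links such as "b.html" or "/c" against the page URL.
        full_url = urljoin(page_url, link['href'])
        # Assumed filter: skip mailto:, javascript: and similar schemes.
        if urlparse(full_url).scheme in ('http', 'https'):
            new_urls.add(full_url)
    return new_urls
```

Returning a set means duplicate links on the same page collapse into a single entry before they reach the URL manager.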
Outputer: takes the extracted data and writes it to the 'output.html' file.
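A minimal sketch of such an output step follows. It assumes each piece of collected data is a dict and renders one table row per dict. The `collect` and `output` method names, the `datas` attribute and the table layout are hypothetical; only the 'output.html' file name comes from the text.

```python
import html


class Outputer:
    def __init__(self):
        self.datas = []  # collected records, assumed to be dicts

    def collect(self, data):
        """Store one record; None is ignored."""
        if data is not None:
            self.datas.append(data)

    def output(self, path="output.html"):
        """Write all collected records to path as a simple HTML table."""
        with open(path, "w", encoding="utf-8") as fout:
            fout.write("<html><head><meta charset='utf-8'></head><body><table>\n")
            for data in self.datas:
                fout.write("<tr>")
                for value in data.values():
                    # Escape values so page text cannot break the markup.
                    fout.write("<td>%s</td>" % html.escape(str(value)))
                fout.write("</tr>\n")
            fout.write("</table></body></html>\n")
```

Opening the file with an explicit `encoding="utf-8"` avoids platform-dependent defaults when the crawled pages contain non-ASCII text.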