Chapter 6 hands-on project: basic crawler

Question

Opened this issue 6 years ago · 12 comments

nunu969316192 commented 6 years ago

It looks like Baidu Baike can no longer be crawled with the code from the book.

nunu969316192 commented 6 years ago

403
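
A 403 usually means the server refused the request itself, so one thing worth ruling out is the client's default User-Agent. The sketch below only illustrates that check and is not a confirmed fix for Baidu Baike: `build_request`, `fetch`, the User-Agent string, and the example.com URL are all placeholders introduced here, not names from the book.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical browser-like User-Agent; any realistic value could be used.
DEFAULT_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=DEFAULT_UA):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and return (status code, decoded body).

    urlopen raises urllib.error.HTTPError if the server still answers 403.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


# Placeholder URL; non-ASCII path segments are percent-encoded first.
url = "https://example.com/item/" + quote("爬虫")
req = build_request(url)
print(req.get_header("User-agent"))
```

If the request still fails with the header set, the cause lies elsewhere (for example a changed URL scheme or page structure), which is what the later replies suggest.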

Answer 1 · 2018-04-12T02:03:30.000Z

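I checked the code several times and found no errors, but it reports "crawl failed".
Even the HTML fetched for the Baidu Baike entry '爬虫' (crawler) comes back empty.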

Answer 2 · 2018-04-13T00:56:56.000Z

You need to analyze Baike's front-end code; its markup has changed. You can refer to the code I wrote: https://gitee.com/zmrwego/webCrawler

Answer 3 · 2018-04-24T07:50:45.000Z

https://gitee.com/zmrwego/a_simple_reptile should work now.

Answer 4 · 2018-05-03T16:07:45.000Z

Just migrate the code with the 2to3.py tool.
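
As a rough illustration of what such a migration changes, here is a small Python 3 snippet with the Python 2 forms shown as comments. It assumes the Python 2 code used `urlparse.urljoin` and `print` statements; `to_absolute` and the example.com URLs are hypothetical names used only for this sketch.

```python
# Python 2 forms that 2to3 rewrites (shown only as comments):
#   import urlparse
#   print "crawl failed"
#   full_url = urlparse.urljoin(page_url, href)
from urllib.parse import urljoin


def to_absolute(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# print is a function in Python 3
print(to_absolute("https://example.com/item/a", "/item/b"))
# prints https://example.com/item/b
```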

Answer 5 · 2018-05-16T10:44:22.000Z

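With this code, the output file ends up with only half of the data: for example, it crawls 100 entries but the HTML contains only 50. After removing the statement self.datas.remove(data), the HTML contains all 100. I haven't figured out why. (Python rookie)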

Answer 6 · 2018-05-16T11:48:35.000Z

The duplicates were removed.

Answer 7 · 2018-05-17T01:08:59.000Z

I compared the entries in the output HTML with the ones in memory; it is not duplicates being removed.

Answer 8 · 2018-05-17T08:50:10.000Z

Is this in dataoutput.py? I don't see self.datas.remove(data) in that file.

Answer 9 · 2018-05-21T08:52:35.000Z

Yes, it's in dataoutput.py. It is the last statement inside the for data in self.datas loop.
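
A likely explanation for the "exactly half" symptom: removing items from a list while a for loop is iterating over that same list makes the loop skip every other element. The sketch below reproduces this with plain integers; `write_and_remove` and `write_then_clear` are hypothetical stand-ins for the loop in dataoutput.py, with `written.append` taking the place of `fout.write`.

```python
def write_and_remove(datas):
    """Mimic the loop described above: 'write' each item, then remove it
    from the list while the loop is still iterating over that list."""
    written = []
    for data in datas:
        written.append(data)  # stands in for fout.write(...)
        datas.remove(data)    # shifts the remaining items left by one
    return written


def write_then_clear(datas):
    """Write every item first, and empty the list only after the loop."""
    written = []
    for data in datas:
        written.append(data)
    del datas[:]
    return written


items = list(range(10))
print(write_and_remove(items))  # [0, 2, 4, 6, 8]
print(items)                    # [1, 3, 5, 7, 9] were never written

items = list(range(10))
print(write_then_clear(items))  # all ten items, 0 through 9
```

After each removal the next item slides into the index the loop has already visited, so it is never seen. That matches 100 crawled entries producing 50 rows, and it explains why deleting the remove call restores all 100.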

Answer 10 · 2018-05-21T09:14:28.000Z

Take a look: where is that line? Try pulling the latest code from Git.

# coding:utf-8
import codecs
import time


class DataOutput(object):
    def __init__(self):
        # Timestamped output file, e.g. baike_2018_05_21_09_14_28.html
        self.filepath = 'baike_%s.html' % (time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime()))
        self.output_head(self.filepath)
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)
        # Flush the buffer to disk once it holds more than 10 items
        if len(self.datas) > 10:
            self.output_html(self.filepath)

    def output_head(self, path):
        '''
        Write the HTML header.
        :return:
        '''
        fout = codecs.open(path, 'w', encoding='utf-8')
        fout.write("<html>")
        fout.write(r'''<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />''')
        fout.write("<body>")
        fout.write("<table>")
        fout.close()

    def output_html(self, path):
        '''
        Append the buffered data to the HTML file.
        :param path: file path
        :return:
        '''
        fout = codecs.open(path, 'a', encoding='utf-8')
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'])
            fout.write("<td>%s</td>" % data['summary'])
            fout.write("</tr>")
        # Empty the buffer only after the loop, so no item is skipped
        self.datas = []
        fout.close()

    # The name is spelled 'ouput_end' in the original; kept unchanged
    def ouput_end(self, path):
        '''
        Write the HTML footer.
        :param path: path of the output file
        :return:
        '''
        fout = codecs.open(path, 'a', encoding='utf-8')
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()

Answer 11 · 2018-05-22T14:06:34.000Z

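Once datas holds 10 items, they are written out in one batch to reduce the CPU load, and the items that have already been written are then dropped with datas.remove(data). Without that statement, the first 10 items would simply be written again and again. time.sleep(3) is added before the end so that the data is fully written before the process exits. For details see https://gitee.com/zmrwego/a_simple_reptile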
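
The batching idea can also be sketched without removing items inside the loop and without a sleep: write the whole buffer, clear it afterwards, and flush whatever is left when finishing. This is a minimal sketch under those assumptions, not the code from either repository; `BatchedHtmlRows`, `store`, `flush`, `close`, and the temporary file path are hypothetical names introduced for the example.

```python
import os
import tempfile


class BatchedHtmlRows(object):
    """Buffer rows in memory and append them to an HTML file in batches."""

    def __init__(self, path, batch_size=10):
        self.path = path
        self.batch_size = batch_size
        self.rows = []
        with open(self.path, "w", encoding="utf-8") as fout:
            fout.write("<html><body><table>\n")

    def store(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        # Leaving the with-block closes the file, which flushes its buffer
        with open(self.path, "a", encoding="utf-8") as fout:
            for row in self.rows:
                fout.write("<tr><td>%s</td></tr>\n" % row)
        self.rows = []  # clear only after every buffered row is written

    def close(self):
        self.flush()  # write the final, partially filled batch
        with open(self.path, "a", encoding="utf-8") as fout:
            fout.write("</table></body></html>\n")


path = os.path.join(tempfile.mkdtemp(), "rows.html")
out = BatchedHtmlRows(path)
for i in range(25):
    out.store("item %d" % i)
out.close()

with open(path, encoding="utf-8") as fin:
    print(fin.read().count("<tr>"))  # 25
```

Because each file is closed before the method returns, the rows are handed to the operating system at that point, so this version does not rely on a delay before the process exits.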
