WeChat Official Accounts Platform Developer Documentation

Question

Opened this issue 9 years ago · 3 comments

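Could you help generate a recipe file for this site?
http://mp.weixin.qq.com/wiki/home/index.html
Following the tutorial you provided, I copied some code and tried to modify it into a recipe of my own, but it did not work. I don't know Python, so I'm asking for help.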

Answer 1 · 2015-06-01T08:06:54.000Z

from calibre.web.feeds.recipes import BasicNewsRecipe


class Wechat_Api(BasicNewsRecipe):

    title = 'wechat_api'
    description = '微信公众平台开发者文档'
    cover_url = 'http://img.sj33.cn/uploads/allimg/201402/7-140223103130591.png'

    url_prefix = 'http://mp.weixin.qq.com/wiki'
    no_stylesheets = True
    keep_only_tags = [{ 'class': 'portal' }]

    def parse_index(self):
        soup = self.index_to_soup(self.url_prefix + '/home/index.html')

        div = soup.find('div', { 'class': 'body' })

        articles = []
        for link in div.findAll('a'):
            til = link.contents[0].strip()
            href = link['href']
            # Only hrefs that begin with '../' are kept; everything else is skipped.
            if not href.startswith('../'):
                continue
            url = self.url_prefix + '/' + href[3:]
            print(url)
            articles.append({ 'title': til, 'url': url })

        # One section that holds every article.
        return [('Weixin_Api', articles)]

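This is roughly the script. It fetches the table of contents, but not the page content. Please advise how to change it so that the content is fetched correctly. Thanks.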

Answer 2 · 2015-06-01T08:22:34.000Z

I took a quick look. Changing keep_only_tags to the following should fix it.

  keep_only_tags = [
    dict(name='div', attrs={'id':['bodyContent']}),
  ]

However, this keeps only the body text. It would be better to include the title as well, so dict(name='div', attrs={'class':['content_hd', 'bodyContent']}) is a better choice.
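
To make the difference between the two selectors concrete, here is a minimal sketch of how a keep_only_tags-style spec might be matched against a tag. It is an illustration only, not calibre's or BeautifulSoup's actual implementation, and the helper name `matches` is hypothetical. It assumes that a list value means "any of these" and that `class` can hold several space-separated values.

```python
SPEC = dict(name='div', attrs={'class': ['content_hd', 'bodyContent']})


def matches(tag_name, tag_attrs, spec):
    """Return True if a tag satisfies one keep_only_tags-style spec (illustrative)."""
    if spec.get('name') not in (None, tag_name):
        return False
    for key, wanted in spec.get('attrs', {}).items():
        # A plain string is treated like a one-element list.
        wanted = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        actual = tag_attrs.get(key, '')
        # class may carry several values; other attributes are compared whole.
        values = actual.split() if key == 'class' else [actual]
        if not any(v in wanted for v in values):
            return False
    return True


print(matches('div', {'class': 'bodyContent'}, SPEC))  # True
print(matches('div', {'class': 'portal'}, SPEC))       # False
print(matches('div', {'id': 'bodyContent'}, SPEC))     # False: id is not class
```

The last call shows why it is worth checking the page source: the first suggestion selects on `id`, the second on `class`, and a spec written for one will not match a div that only carries the other.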

Answer 3 · 2015-06-02T01:35:36.000Z

from calibre.web.feeds.recipes import BasicNewsRecipe


class Wechat_Api(BasicNewsRecipe):

    title = '微信公众平台开发者文档'
    description = '微信公众平台开发者文档'
    cover_url = 'http://img.sj33.cn/uploads/allimg/201402/7-140223103130591.png'

    url_prefix = 'http://mp.weixin.qq.com/wiki'
    keep_only_tags = [
        dict(name='div', attrs={'class':['content_hd', 'bodyContent']})
    ]

    def parse_index(self):
        soup = self.index_to_soup(self.url_prefix + '/home/index.html')

        # The mw-panel div holds the catalog, with one 'portal' div per chapter.
        catalog = soup.find('div', { 'id': 'mw-panel' })
        portals = catalog.findAll('div', { 'class': 'portal' })

        ans = []

        for chapter in portals:
            # Chapter title, taken from the h5 heading.
            title = chapter.find('h5').contents[1].strip()
            print(title)

            articles = []
            for link in chapter.findAll('a'):
                href = link['href']
                # Skip anything that is not a '../' relative link.
                if not href.startswith('../'):
                    continue

                til = link.contents[0].strip()
                print(til)
                url = self.url_prefix + '/' + href[3:]
                articles.append({ 'title': til, 'url': url })

            # One (section title, articles) pair per chapter.
            ans.append((title, articles))

        return ans

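I revised the code, and fetching the content now seems to work without problems. It doesn't feel too hard to use after all ^_^
One issue remains: tables in the fetched pages are not displayed completely. Is there any way to handle this?

After repeated testing, I found that this problem does not occur when generating a mobi file; it only appears when generating an epub %>_<%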
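
As a side note on the link handling in the recipe above (keep only hrefs that begin with '../', then rebuild an absolute URL), the same step can be sketched on its own with the standard library. This is a hedged alternative, not what the recipe does: it swaps the manual `href[3:]` slicing for `urllib.parse.urljoin`. The helper name `resolve_link` and the sample path are hypothetical.

```python
from urllib.parse import urljoin

# The index page the recipe starts from.
INDEX_URL = 'http://mp.weixin.qq.com/wiki/home/index.html'


def resolve_link(href, index_url=INDEX_URL):
    """Return an absolute URL for a '../' relative href, or None for other links."""
    if not href.startswith('../'):
        return None
    # urljoin resolves '../' against the directory of the index page.
    return urljoin(index_url, href)


# Hypothetical path, used only to show the shape of the result.
print(resolve_link('../example/page.html'))
print(resolve_link('#top'))
```

For hrefs with a single leading '../' this should give the same URL as prefixing url_prefix by hand. The two approaches differ for hrefs with several '../' segments, which urljoin collapses and the slicing approach leaves in the URL.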