"抱歉，章节内容不支持该浏览器显示～" (Sorry, the chapter content cannot be displayed in this browser), linovelib ID 3721

Question

Closed this issue 7 months ago · 18 comments

ID 3721

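In the downloaded EPUB, every chapter shows the following text instead of its content:

(ò﹏ò)
Sorry, the chapter content cannot be displayed in this browser～
[To use the full reading features]
Please consider reading with a native browser such as [Chrome], [Safari], or
[Edge]!
Thank you!!!

I can access the chapter content normally in my own browser.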

Answer 1 · 2024-03-10T08:49:30.000Z

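The issue is confirmed. The cause is that bilinovel has changed its anti-crawler strategy again. The solution needs further investigation.

I have tentatively traced it to this file: https://www.bilinovel.com/cdn-cgi/challenge-platform/scripts/jsd/main.js

I have already solved the problem locally; please wait for the fix to land. @Kuan-Lun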

Answer 2 · 2024-03-12T09:09:05.000Z

Fixed. Please update the code and its dependencies to verify the fix.

Answer 3 · 2024-03-15T07:39:03.000Z

ID 3721 still has a problem.

2024-03-15,15:35:35 INFO     LinovelibMobileSpider Succeed to get the novel of book_id: 3721                                                                                                                                            linovelib_mobile_spider.py:84
                    INFO     LinovelibMobileSpider book name:《乱世千金倪亚・利斯顿》                                                                                                                                                   linovelib_mobile_spider.py:94
                    INFO     LinovelibMobileSpider Succeed to get the catalog of book_id: 3721                                                                                                                                         linovelib_mobile_spider.py:144
                    INFO     LinovelibMobileSpider volume: 第一卷                                                                                                                                                                      linovelib_mobile_spider.py:160
                    INFO     LinovelibMobileSpider chapter : 插图                                                                                                                                                                      linovelib_mobile_spider.py:175

DevTools listening on ws://127.0.0.1:62433/devtools/browser/f6f82007-4288-47c4-854a-33efab67be19
2024-03-15,15:35:46 INFO     LinovelibMobileSpider  初始化 Driver 完毕...    
Traceback (most recent call last):
  File "[path]\linovelib2epub-main\main.py", line 30, in <module>
    linovelib_epub.run()
  File "[path]\linovelib2epub-main\src\linovelib2epub\linovel.py", line 412, in run
    novel = self._spider.fetch()
            ^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 53, in fetch
    novel_whole = self._fetch()
                  ^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 474, in _fetch
    new_novel_with_content = self._crawl_book_content(book_catalog_url)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 193, in _crawl_book_content
    if not new_title.text.startswith(light_novel_chapter.title):
           ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'

Answer 4 · 2024-03-15T08:47:58.000Z

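Please provide a screenshot of your crawl similar to this one.

The error AttributeError: 'NoneType' object has no attribute 'text' means the chapter title could not be found.
In the screenshot above, the title is: 第六章倪亚•利斯顿的职业访问

However, I cannot reproduce this on my side, and I do not yet know why it happens.

My first guess is a problem with the local Chrome version, but I do not have enough information to diagnose it.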
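
The traceback fails because the title lookup returned None and .text was then read from it. A minimal sketch of a guard follows, assuming the lookup returns None when the element is missing. Title, PageContentError, and title_matches are hypothetical names for illustration and are not part of linovelib2epub.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Title:
    """Hypothetical stand-in for a parsed title element exposing `.text`."""
    text: str


class PageContentError(Exception):
    """Raised when a fetched page lacks the expected chapter title."""


def title_matches(new_title: Optional[Title], expected: str) -> bool:
    # A missing title may mean the page is not a normal chapter page,
    # so fail with a descriptive error instead of an AttributeError on None.
    if new_title is None:
        raise PageContentError(f"no title element found; expected {expected!r}")
    return new_title.text.startswith(expected)
```

Raising a dedicated error here would let the caller decide whether to retry the page or abort, rather than crashing inside the comparison.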

Answer 5 · 2024-03-15T13:23:20.000Z

2024-03-15,21:19:52 INFO     LinovelibMobileSpider Succeed to get the novel of book_id: 3721                                                                                                                                            linovelib_mobile_spider.py:84
                    INFO     LinovelibMobileSpider book name:《乱世千金倪亚・利斯顿》                                                                                                                                                   linovelib_mobile_spider.py:94
                    INFO     LinovelibMobileSpider Succeed to get the catalog of book_id: 3721                                                                                                                                         linovelib_mobile_spider.py:144
                    INFO     LinovelibMobileSpider volume: 第一卷                                                                                                                                                                      linovelib_mobile_spider.py:160
                    INFO     LinovelibMobileSpider chapter : 插图                                                                                                                                                                      linovelib_mobile_spider.py:175

DevTools listening on ws://127.0.0.1:62529/devtools/browser/feb744e3-d24e-4e5d-b57d-f97a1233dc25
2024-03-15,21:19:57 INFO     LinovelibMobileSpider  初始化 Driver 完毕...                                                                                                                                                              linovelib_mobile_spider.py:297
Traceback (most recent call last):
  File "[path]\linovelib2epub-main\main.py", line 30, in <module>
    linovelib_epub.run()
  File "[path]\linovelib2epub-main\src\linovelib2epub\linovel.py", line 412, in run
    novel = self._spider.fetch()
            ^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 53, in fetch
    novel_whole = self._fetch()
                  ^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 474, in _fetch
    new_novel_with_content = self._crawl_book_content(book_catalog_url)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 193, in _crawl_book_content
    if not new_title.text.startswith(light_novel_chapter.title):
           ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'

Answer 6 · 2024-03-15T15:53:40.000Z

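Still no clue.
Please download the page source of a page visited by the automated Chrome driver (for example https://tw.linovelib.com/novel/3721/190988.html) and post it in this issue for my reference.

It may also be something else; I just do not know yet. 😞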

Answer 7 · 2024-03-15T16:22:29.000Z

I opened the page in the Chrome browser and saved it; see the attached file.

Answer 8 · 2024-03-15T17:04:19.000Z

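"I opened the page in the Chrome browser and saved it; see the attached file."

I took a look; the format is fine. It is probably caused by something else.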

Answer 9 · 2024-03-16T04:10:42.000Z

I have added debug code. After updating, please run the code below and post the first part of the log.
PS: The log will be very long; you only need to provide the first part.

from linovelib2epub import Linovelib2Epub

if __name__ == '__main__':
    linovelib_epub = Linovelib2Epub(book_id=3721, clean_artifacts=False, log_level='DEBUG')
    linovelib_epub.run()

Answer 10 · 2024-03-18T02:40:03.000Z

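I looked at the log. The cause is that the previously mentioned rate limit (#36) was triggered, but the program did not retry, so it crawled empty content. In the previous approach the program retried automatically when it hit the rate limit, but after the switch to selenium it no longer retries, so this looks like an exception-handling problem.

A temporary workaround is to set chapter_crawl_delay higher, but even at 15 this only makes the limit page appear later. For example, with chapter_crawl_delay set to 5 I could only crawl up to chapter 9, and with 15 I could reach chapter 21. Solving this properly still requires the corresponding exception handling.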

    linovelib_epub = Linovelib2Epub(book_id=2986, target_site=TargetSite.LINOVELIB_MOBILE, chapter_crawl_delay=15)
    linovelib_epub.run()

The problematic log is as follows:

2024-03-18,10:24:21 INFO     LinovelibMobileSpider  linovelib_mobile_spider.py:228   Processing page... https://www.bilinovel.com/novel/2986/148834_2.html
2024-03-18,10:24:22 DEBUG    LinovelibMobileSpider  linovelib_mobile_spider.py:186   page_resp='<html class="no-js" lang="en-US"><!--<![endif]--><head>\n<title>Access denied | www.bilinovel.com used Cloudflare to restrict access</title>\n<meta charset="UTF-8">\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n<meta name="robots" content="noindex, nofollow">\n<meta name="viewport" content="width=device-width,initial-scale=1">\n<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/main.css">\n\n\n<script>\n(function(){if(document.addEventListener&&window.XMLHttpRequest&&JSON&&JSON.stringify){var e=function(a){var c=document.getElementById("error-feedback-survey"),d=document.getElementById("error-feedback-success"),b=new XMLHttpRequest;a={event:"feedback clicked",properties:{errorCode:1015,helpful:a,version:1}};b.open("POST","https://sparrow.cloudflare.com/api/v1/event");b.setRequestHeader("Content-Type","application/json");b.setRequestHeader("Sparrow-Source-Key","c771f0e4b54944bebf4261d44bd79a1e");\nb.send(JSON.stringify(a));c.classList.add("feedback-hidden");d.classList.remove("feedback-hidden")};document.addEventListener("DOMContentLoaded",function(){var a=document.getElementById("error-feedback"),c=document.getElementById("feedback-button-yes"),d=document.getElementById("feedback-button-no");"classList"in a&&(a.classList.remove("feedback-hidden"),c.addEventListener("click",function(){e(!0)}),d.addEventListener("click",function(){e(!1)}))})}})();\n</script>\n\n<script defer="" src="https://performance.radar.cloudflare.com/beacon.js"></script>\n</head>\n<body>\n  <div id="cf-wrapper">\n    <div class="cf-alert cf-alert-error cf-cookie-error hidden" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>\n    <div id="cf-error-details" class="p-0">\n      <header class="mx-auto pt-10 lg:pt-6 lg:px-8 w-240 lg:w-full mb-15 antialiased">\n         <h1 
class="inline-block md:block mr-2 md:mb-2 font-light text-60 md:text-3xl text-black-dark leading-tight">\n           <span data-translate="error">Error</span>\n           <span>1015</span>\n         </h1>\n         <span class="inline-block md:block heading-ray-id font-mono text-15 lg:text-sm lg:leading-relaxed">Ray ID: 8661c41c69e78cda •</span>\n         <span class="inline-block md:block heading-ray-id font-mono text-15 lg:text-sm lg:leading-relaxed">2024-03-18 02:24:22 UTC</span>\n        <h2 class="text-gray-600 leading-1.3 text-3xl lg:text-2xl font-light">You are being rate limited</h2>\n      </header>\n\n      <section class="w-240 lg:w-full mx-auto mb-8 lg:px-8">\n          <div id="what-happened-section" class="w-1/2 md:w-full">\n            <h2 class="text-3xl leading-tight font-normal mb-4 text-black-dark antialiased" data-translate="what_happened">What happened?</h2>\n            <p>The owner of this website (www.bilinovel.com) has banned you temporarily from accessing this website.</p>\n            \n          </div>\n\n          \n      </section>\n\n      <div class="py-8 text-center" id="error-feedback">\n    <div id="error-feedback-survey" class="footer-line-wrapper">\n        Was this page helpful?\n        <button class="border border-solid bg-white cf-button cursor-pointer ml-4 px-4 py-2 rounded" id="feedback-button-yes" type="button">Yes</button>\n        <button class="border border-solid bg-white cf-button cursor-pointer ml-4 px-4 py-2 rounded" id="feedback-button-no" type="button">No</button>\n    </div>\n    <div class="feedback-success feedback-hidden" id="error-feedback-success">\n        Thank you for your feedback!\n    </div>\n</div>\n\n\n      <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">\n  <p class="text-13">\n    <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong 
class="font-semibold">8661c41c69e78cda</strong></span>\n    <span class="cf-footer-separator sm:hidden">•</span>\n    <span id="cf-footer-item-ip" class="cf-footer-item sm:block sm:mb-1">\n      Your IP:\n      <button type="button" id="cf-footer-ip-reveal" class="cf-footer-ip-reveal-btn">Click to reveal</button>\n      <span class="hidden" id="cf-footer-ip">198.98.54.160</span>\n      <span class="cf-footer-separator sm:hidden">•</span>\n    </span>\n    <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>\n    \n  </p>\n  <script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>\n</div><!-- /.error-footer -->\n\n\n    </div><!-- /#cf-error-details -->\n  </div><!-- /#cf-wrapper -->\n\n  <script>\n  window._cf_translation = {};\n  \n  \n</script>\n\n\n\n<span style="display: none !important;"><img width="0" height="0" hidden="" referrerpolicy="no-referrer" src="https://fastly.cedexis-test.com/img/20367/r20-100KB.png?r=42951627" style="display: none !important;"></span></body></html>'
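
The log shows that a rate-limited request still returns an ordinary HTML document (Cloudflare error 1015), so code that retries only on exceptions would not notice it. A minimal sketch of detecting that page from its markup follows. is_rate_limited is a hypothetical helper, and the marker strings are copied from this log rather than from any documented Cloudflare contract, so they may change.

```python
from typing import Optional

# Marker strings observed in the block page logged above (an assumption:
# Cloudflare may change this wording at any time).
RATE_LIMIT_MARKERS = (
    "You are being rate limited",
    "used Cloudflare to restrict access",
)


def is_rate_limited(html: Optional[str]) -> bool:
    """Return True if the page source looks like the rate-limit block page."""
    if not html:
        return False
    return any(marker in html for marker in RATE_LIMIT_MARKERS)
```

A caller could treat a True result the same way as a failed request, for example by waiting and fetching the page again.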

Answer 11 · 2024-03-18T03:29:45.000Z

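"I looked at the log. The cause is that the previously mentioned rate limit (#36) was triggered, but the program did not retry, so it crawled empty content."

Your case differs from that of @Kuan-Lun above. @Kuan-Lun gets an exception right away and cannot even fetch a normal page, whereas your later requests trigger the risk-control limit and are then denied access.

"In the previous approach the program retried automatically when it hit the rate limit, but after the switch to selenium it no longer retries, so this looks like an exception-handling problem."

The selenium approach does retry as well, but I did not add an interval between retries, so they fire too quickly. That was an oversight on my part.

The current code is as follows: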

page_resp = self._fetch_page(page_link, max_retries=self.spider_settings['http_retries'])

    def _fetch_page(self, url: str, max_retries: int = 5) -> str | None:
        if not self._driver:
            self._init_browser_driver()

        driver = self._driver

        request_count = 0
        # total requests num = self(1) + max_retries
        # if max_retries= 5, then total is 1+5=6

        while request_count <= max_retries:
            try:
                driver.get(url)
                html = driver.page_source
                return html
            except Exception as e:
                request_count += 1
                self.logger.warn(f"{url} encountered {e.__class__.__name__}, retrying ({request_count}/{max_retries})...")

        return None

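The root cause is that the while loop in _fetch_page() has no delay between attempts, which is too aggressive. I will fix this with the exponential backoff approach used previously.

Design trade-off:

Alternatively, this per-page crawl delay could be exposed as a user-configurable API parameter, giving users more control.
For example, a new page_crawl_delay parameter could be set either to an exponential-backoff flag or to a fixed time such as 3s.
What do you think? @GOKORURI007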
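
The exponential backoff idea can be sketched as follows. This is an illustration under assumptions, not the code in the repository: fetch_with_backoff, base_delay, and the injected fetch and sleep callables are hypothetical names.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url); after each failure wait base_delay * 2**attempt."""
    # Total requests = 1 initial attempt + max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                break  # no retries left, so do not sleep again
            # The delay doubles on every retry: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

Passing sleep as a parameter keeps the function testable without real waiting; a fixed delay could be supported by replacing the doubling expression with a constant.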

Answer 12 · 2024-03-18T04:30:43.000Z

Attempted fix in b503537.
@GOKORURI007 @Kuan-Lun

Answer 13 · 2024-03-18T09:00:15.000Z

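"I have added debug code. After updating, please run the code below and post the first part of the log. PS: The log will be very long; you only need to provide the first part."

from linovelib2epub import Linovelib2Epub

if __name__ == '__main__':
    linovelib_epub = Linovelib2Epub(book_id=3721, clean_artifacts=False, log_level='DEBUG')
    linovelib_epub.run()

See the attached file for details; the first 20 lines are below.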

(.venv) PS [path]\linovelib2epub-main> python .\main.py
2024-03-18,16:56:59 INFO     LinovelibMobileSpider Succeed to get the novel of book_id: 3721                                                                                                          linovelib_mobile_spider.py:85
                    INFO     LinovelibMobileSpider book name:《乱世千金倪亚・利斯顿》                                                                                                                 linovelib_mobile_spider.py:95
                    INFO     LinovelibMobileSpider Succeed to get the catalog of book_id: 3721                                                                                                       linovelib_mobile_spider.py:145
                    INFO     LinovelibMobileSpider volume: 第一卷                                                                                                                                    linovelib_mobile_spider.py:161
                    INFO     LinovelibMobileSpider chapter : 插图                                                                                                                                    linovelib_mobile_spider.py:176

DevTools listening on ws://127.0.0.1:49158/devtools/browser/213e932b-ce10-4e1b-9938-4b70540147c7
2024-03-18,16:57:05 INFO     LinovelibMobileSpider  初始化 Driver 完毕...                                                                                                                            linovelib_mobile_spider.py:334
2024-03-18,16:57:08 DEBUG    LinovelibMobileSpider page_resp='<html lang="zh-Hant"><head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>亂世千金倪亞・利斯頓 第一卷   linovelib_mobile_spider.py:186
                             插圖_嗶哩輕小說</title>\n<meta name="keywords" content="亂世千金倪亞・利斯頓,插圖,嗶哩輕小說">\n<meta name="description"
                             content="拥有最强前世的转生千金，第二人生也要大闹天下！！\n「既然人难免一死，我宁可死在战斗中。」\n过去曾有..">\n<meta name="viewport"
                             content="initial-scale=1.0,minimum-scale=1.0,user-scalable=yes,width=device-width">\n<meta name="theme-color" content="#232323" media="(prefers-color-scheme:
                             dark)">\n<meta name="applicable-device" content="mobile">\n<link rel="stylesheet" href="https://tw.linovelib.com/themes/zhmb/css/read.css?v0409a1">\n<link
                             rel="stylesheet" href="https://tw.linovelib.com/themes/zhmb/css/chapter.css?v1126a2">\n<link rel="dns-preconnect" href="https://tw.linovelib.com">\n<link
                             rel="prerender" href="https://tw.linovelib.com/novel/3721/190989.html">\n<link rel="alternate" hreflang="zh-Hans"
                             href="https://www.bilinovel.com/novel/3721/190988.html">\n<script
                             src="https://pagead2.googlesyndication.com/pagead/managed/js/adsense/m202403130201/slotcar_library_fy2021.js"></script><script
                             src="https://pagead2.googlesyndication.com/pagead/managed/js/adsense/m202403130201/reactive_library_fy2021.js"></script><script
                             src="https://pagead2.googlesyndication.com/pagead/managed/js/adsense/m202403130201/show_ads_impl_fy2021.js"></script><script

Answer 14 · 2024-03-18T10:36:19.000Z

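@Kuan-Lun I have checked your log; the format is fine.
After this period of trial and investigation, it is now fairly certain that, since the refactor of this project, if a request for any page of a chapter triggers risk control and access is restricted, the program cannot get the normal content.
This error occurs probabilistically. On the program side, all we can do is keep improving the code and add more retry or page-refresh mechanisms.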

Answer 15 · 2024-03-18T15:42:16.000Z

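"Alternatively, this per-page crawl delay could be exposed as a user-configurable API parameter, giving users more control.
For example, a new page_crawl_delay parameter could be set either to an exponential-backoff flag or to a fixed time such as 3s."

@wdpm Good idea. If exponential backoff were the only option, the first few retry intervals could be too short and trigger the rate limit several times. Triggering risk control as rarely as possible benefits both the site administrators and users. So, at least for linovelib, offering a fixed-time option would do a better job of staying under the server's rate limit.

Also, I personally think that rather than adding a fixed delay to the retry interval, it is better to add a fixed delay to fetch_page regardless of whether risk control was triggered, the way chapter_crawl_delay works.
In actual testing, I added a page_crawl_delay in a way similar to chapter_crawl_delay. With chapter_crawl_delay=5 and page_crawl_delay=3, I crawled an entire novel on linovelib without triggering the risk-control limit. So consider adding a fixed-delay API parameter for fetch_page, while the retry delay simply uses exponential backoff.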

for page_link in catalog_chapter.chapter_urls:
    # use selenium instead of direct requests
    # add page_crawl_delay() here
    time.sleep(3)     # during testing I added time.sleep() here directly
    # the retry delay inside self._fetch_page keeps using exponential backoff
    page_resp = self._fetch_page(page_link, max_retries=self.spider_settings['http_retries'])
    self.logger.debug(f'{page_resp=}')
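
The loop can be generalized so the fixed delay is a parameter rather than a hard-coded sleep. A minimal sketch follows, assuming a hypothetical crawl_pages helper. page_crawl_delay is the parameter name proposed in this thread, while fetch_page and sleep are injected stand-ins rather than the project's actual methods.

```python
import time
from typing import Callable, Iterable, List, Optional


def crawl_pages(
    page_links: Iterable[str],
    fetch_page: Callable[[str], Optional[str]],
    page_crawl_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Optional[str]]:
    """Fetch each page in order, pausing a fixed time before every request."""
    results: List[Optional[str]] = []
    for page_link in page_links:
        # Wait before every page, whether or not a rate limit was hit.
        sleep(page_crawl_delay)
        results.append(fetch_page(page_link))
    return results
```

Any retry logic stays inside fetch_page, so the fixed pacing and the backoff on failure remain independent settings.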

Answer 16 · 2024-03-19T04:10:02.000Z

Updated in 3ec5c95.
@GOKORURI007

Answer 17 · 2024-05-19T02:32:05.000Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 0 days.

Answer 18 · 2024-05-19T02:32:06.000Z

This issue was closed because it has been stalled for 0 days with no activity.