"Sorry, chapter content cannot be displayed in this browser~" (linovelib ID 3721)
Closed this issue · 18 comments
ID 3721
In the downloaded EPUB, the body of every chapter shows the following:
(ò﹏ò)
Sorry, chapter content cannot be displayed in this browser~
[To use the full reading features]
Please consider reading with a native browser such as [Chrome], [Safari] or
[Edge]!
Thank you!!!
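When I use a browser, I can access the chapter content normally.
This issue has been confirmed. The cause is that bilinovel has changed its anti-crawler strategy again. A solution needs further investigation.
Initial investigation points to this file: https://www.bilinovel.com/cdn-cgi/challenge-platform/scripts/jsd/main.js
I have resolved this problem locally; please wait for the fix to land. @Kuan-Lun
Fixed. Please update the code and the dependencies to verify the fix.
For one, ID 3721 still has the problem.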
2024-03-15,15:35:35 INFO LinovelibMobileSpider Succeed to get the novel of book_id: 3721 linovelib_mobile_spider.py:84
INFO LinovelibMobileSpider book name:《乱世千金倪亚・利斯顿》 linovelib_mobile_spider.py:94
INFO LinovelibMobileSpider Succeed to get the catalog of book_id: 3721 linovelib_mobile_spider.py:144
INFO LinovelibMobileSpider volume: 第一卷 linovelib_mobile_spider.py:160
INFO LinovelibMobileSpider chapter : 插图 linovelib_mobile_spider.py:175
DevTools listening on ws://127.0.0.1:62433/devtools/browser/f6f82007-4288-47c4-854a-33efab67be19
2024-03-15,15:35:46 INFO LinovelibMobileSpider 初始化 Driver 完毕...
Traceback (most recent call last):
  File "[path]\linovelib2epub-main\main.py", line 30, in <module>
    linovelib_epub.run()
  File "[path]\linovelib2epub-main\src\linovelib2epub\linovel.py", line 412, in run
    novel = self._spider.fetch()
            ^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 53, in fetch
    novel_whole = self._fetch()
                  ^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 474, in _fetch
    new_novel_with_content = self._crawl_book_content(book_catalog_url)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 193, in _crawl_book_content
    if not new_title.text.startswith(light_novel_chapter.title):
           ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'
2024-03-15,21:19:52 INFO LinovelibMobileSpider Succeed to get the novel of book_id: 3721 linovelib_mobile_spider.py:84
INFO LinovelibMobileSpider book name:《乱世千金倪亚・利斯顿》 linovelib_mobile_spider.py:94
INFO LinovelibMobileSpider Succeed to get the catalog of book_id: 3721 linovelib_mobile_spider.py:144
INFO LinovelibMobileSpider volume: 第一卷 linovelib_mobile_spider.py:160
INFO LinovelibMobileSpider chapter : 插图 linovelib_mobile_spider.py:175
DevTools listening on ws://127.0.0.1:62529/devtools/browser/feb744e3-d24e-4e5d-b57d-f97a1233dc25
2024-03-15,21:19:57 INFO LinovelibMobileSpider 初始化 Driver 完毕... linovelib_mobile_spider.py:297
Traceback (most recent call last):
  File "[path]\linovelib2epub-main\main.py", line 30, in <module>
    linovelib_epub.run()
  File "[path]\linovelib2epub-main\src\linovelib2epub\linovel.py", line 412, in run
    novel = self._spider.fetch()
            ^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 53, in fetch
    novel_whole = self._fetch()
                  ^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 474, in _fetch
    new_novel_with_content = self._crawl_book_content(book_catalog_url)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[path]\linovelib2epub-main\src\linovelib2epub\spider\linovelib_mobile_spider.py", line 193, in _crawl_book_content
    if not new_title.text.startswith(light_novel_chapter.title):
           ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'
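Still no leads.
Please download the page source of a page visited by the automated Chrome driver (for example https://tw.linovelib.com/novel/3721/190988.html ) and post it in this issue for my reference.
It may also be some other cause that I am not aware of yet. 😞
I opened the page in Chrome and saved it; see the attached file.
I have added debug code. After updating, try the code below and post the first part of the log.
PS: The log will be long; you only need to provide the first part.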
from linovelib2epub import Linovelib2Epub

if __name__ == '__main__':
    linovelib_epub = Linovelib2Epub(book_id=3721, clean_artifacts=False, log_level='DEBUG')
    linovelib_epub.run()
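I looked at the log. The cause is that the access rate limit mentioned earlier (#36) was triggered, but the program did not retry, so it crawled empty content. In the previous approach the program retried automatically when the rate limit was hit, but after switching to the selenium approach it no longer retries; this looks like an exception-handling problem.
A temporary workaround is to set chapter_crawl_delay higher, but even a value of 15 only postpones the appearance of the limit page. For example, with chapter_crawl_delay set to 5 I could only crawl up to chapter 9, and with 15 up to chapter 21. Solving this still requires proper exception handling.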
linovelib_epub = Linovelib2Epub(book_id=2986, target_site=TargetSite.LINOVELIB_MOBILE, chapter_crawl_delay=15)
linovelib_epub.run()
The problematic log is as follows:
2024-03-18,10:24:21 INFO LinovelibMobileSpider linovelib_mobile_spider.py:228 Processing page... https://www.bilinovel.com/novel/2986/148834_2.html
2024-03-18,10:24:22 DEBUG LinovelibMobileSpider linovelib_mobile_spider.py:186 page_resp='<html class="no-js" lang="en-US"><!--<![endif]--><head>\n<title>Access denied | www.bilinovel.com used Cloudflare to restrict access</title>\n<meta charset="UTF-8">\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n<meta name="robots" content="noindex, nofollow">\n<meta name="viewport" content="width=device-width,initial-scale=1">\n<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/main.css">\n\n\n<script>\n(function(){if(document.addEventListener&&window.XMLHttpRequest&&JSON&&JSON.stringify){var e=function(a){var c=document.getElementById("error-feedback-survey"),d=document.getElementById("error-feedback-success"),b=new XMLHttpRequest;a={event:"feedback clicked",properties:{errorCode:1015,helpful:a,version:1}};b.open("POST","https://sparrow.cloudflare.com/api/v1/event");b.setRequestHeader("Content-Type","application/json");b.setRequestHeader("Sparrow-Source-Key","c771f0e4b54944bebf4261d44bd79a1e");\nb.send(JSON.stringify(a));c.classList.add("feedback-hidden");d.classList.remove("feedback-hidden")};document.addEventListener("DOMContentLoaded",function(){var a=document.getElementById("error-feedback"),c=document.getElementById("feedback-button-yes"),d=document.getElementById("feedback-button-no");"classList"in a&&(a.classList.remove("feedback-hidden"),c.addEventListener("click",function(){e(!0)}),d.addEventListener("click",function(){e(!1)}))})}})();\n</script>\n\n<script defer="" src="https://performance.radar.cloudflare.com/beacon.js"></script>\n</head>\n<body>\n <div id="cf-wrapper">\n <div class="cf-alert cf-alert-error cf-cookie-error hidden" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>\n <div id="cf-error-details" class="p-0">\n <header class="mx-auto pt-10 lg:pt-6 lg:px-8 w-240 lg:w-full mb-15 antialiased">\n <h1 class="inline-block md:block mr-2 
md:mb-2 font-light text-60 md:text-3xl text-black-dark leading-tight">\n <span data-translate="error">Error</span>\n <span>1015</span>\n </h1>\n <span class="inline-block md:block heading-ray-id font-mono text-15 lg:text-sm lg:leading-relaxed">Ray ID: 8661c41c69e78cda •</span>\n <span class="inline-block md:block heading-ray-id font-mono text-15 lg:text-sm lg:leading-relaxed">2024-03-18 02:24:22 UTC</span>\n <h2 class="text-gray-600 leading-1.3 text-3xl lg:text-2xl font-light">You are being rate limited</h2>\n </header>\n\n <section class="w-240 lg:w-full mx-auto mb-8 lg:px-8">\n <div id="what-happened-section" class="w-1/2 md:w-full">\n <h2 class="text-3xl leading-tight font-normal mb-4 text-black-dark antialiased" data-translate="what_happened">What happened?</h2>\n <p>The owner of this website (www.bilinovel.com) has banned you temporarily from accessing this website.</p>\n \n </div>\n\n \n </section>\n\n <div class="py-8 text-center" id="error-feedback">\n <div id="error-feedback-survey" class="footer-line-wrapper">\n Was this page helpful?\n <button class="border border-solid bg-white cf-button cursor-pointer ml-4 px-4 py-2 rounded" id="feedback-button-yes" type="button">Yes</button>\n <button class="border border-solid bg-white cf-button cursor-pointer ml-4 px-4 py-2 rounded" id="feedback-button-no" type="button">No</button>\n </div>\n <div class="feedback-success feedback-hidden" id="error-feedback-success">\n Thank you for your feedback!\n </div>\n</div>\n\n\n <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">\n <p class="text-13">\n <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">8661c41c69e78cda</strong></span>\n <span class="cf-footer-separator sm:hidden">•</span>\n <span id="cf-footer-item-ip" class="cf-footer-item sm:block sm:mb-1">\n Your IP:\n <button type="button" id="cf-footer-ip-reveal" 
class="cf-footer-ip-reveal-btn">Click to reveal</button>\n <span class="hidden" id="cf-footer-ip">198.98.54.160</span>\n <span class="cf-footer-separator sm:hidden">•</span>\n </span>\n <span class="cf-footer-item sm:block sm:mb-1"><span>Performance & security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>\n \n </p>\n <script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>\n</div><!-- /.error-footer -->\n\n\n </div><!-- /#cf-error-details -->\n </div><!-- /#cf-wrapper -->\n\n <script>\n window._cf_translation = {};\n \n \n</script>\n\n\n\n<span style="display: none !important;"><img width="0" height="0" hidden="" referrerpolicy="no-referrer" src="https://fastly.cedexis-test.com/img/20367/r20-100KB.png?r=42951627" style="display: none !important;"></span></body></html>'
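"I looked at the log. The cause is that the access rate limit mentioned earlier (#36) was triggered, but the program did not retry, so it crawled empty content."
Your case differs from that of @Kuan-Lun above: @Kuan-Lun gets an exception right away and cannot obtain even normal page content, whereas in your case later requests trigger the rate-limit protection, which then blocks access.
"In the previous approach the program retried automatically when the rate limit was hit, but after switching to the selenium approach it no longer retries; this looks like an exception-handling problem."
The selenium approach does retry as well, but I did not add any interval between retries at the time, so the retries fired too quickly. That was an oversight on my part.
The current code is as follows: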
page_resp = self._fetch_page(page_link, max_retries=self.spider_settings['http_retries'])

def _fetch_page(self, url: str, max_retries: int = 5) -> str | None:
    if not self._driver:
        self._init_browser_driver()
    driver = self._driver
    request_count = 0
    # total requests num = self(1) + max_retries
    # if max_retries= 5, then total is 1+5=6
    while request_count <= max_retries:
        try:
            driver.get(url)
            html = driver.page_source
            return html
        except Exception as e:
            request_count += 1
            self.logger.warn(f"{url} encountered {e.__class__.__name__}, retrying ({request_count}/{max_retries})...")
    return None
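The root cause is that the while loop in _fetch_page() has no delay between attempts, which is too aggressive. I will fix this with the exponential backoff approach used previously.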
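For illustration, a retry loop with exponential backoff might look roughly like the sketch below. It is not the code from the actual fix: `backoff_delays`, `fetch_with_backoff`, and the `base`, `factor`, `cap` and `sleep` parameters are hypothetical names, and `fetch` stands in for whatever performs `driver.get(url)` and returns `driver.page_source`.

```python
import time
from typing import Callable, Optional


def backoff_delays(max_retries: int, base: float = 1.0,
                   factor: float = 2.0, cap: float = 60.0) -> list[float]:
    """Seconds to wait before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def fetch_with_backoff(
    fetch: Callable[[str], str],          # hypothetical stand-in for the selenium fetch
    url: str,
    max_retries: int = 5,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url); on an exception wait with growing delays, then retry."""
    delays = backoff_delays(max_retries, base=base)
    # total requests = 1 initial request + max_retries retries
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == max_retries:
                break  # retries exhausted, give up
            print(f"{url} encountered {exc.__class__.__name__}, "
                  f"retrying in {delays[attempt]:.0f}s ({attempt + 1}/{max_retries})...")
            sleep(delays[attempt])
    return None
```

With the default values the waits are 1, 2, 4, 8 and 16 seconds. Passing `sleep` as a parameter keeps the function testable without real waiting.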
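Design trade-off:
This per-page crawl delay could also be exposed as a user-configurable API parameter, giving users more control.
For example, add a page_crawl_delay parameter that can be set either to an exponential-backoff marker or to a fixed duration such as 3s.
What do you think? @GOKORURI007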
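One way such a parameter could be interpreted is sketched below. This is only an assumption about the design: the `EXPONENTIAL` marker value, the `retry_delay` helper, and the `base` and `cap` parameters are hypothetical and are not part of the linovelib2epub API.

```python
from typing import Union

EXPONENTIAL = "exponential"  # hypothetical marker value for the backoff mode


def retry_delay(page_crawl_delay: Union[float, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if page_crawl_delay == EXPONENTIAL:
        # backoff mode: base, 2*base, 4*base, ... capped at `cap`
        return min(cap, base * 2 ** attempt)
    is_number = isinstance(page_crawl_delay, (int, float)) and not isinstance(page_crawl_delay, bool)
    if not is_number or page_crawl_delay < 0:
        raise ValueError(f"invalid page_crawl_delay: {page_crawl_delay!r}")
    # fixed mode: the same wait on every attempt
    return float(page_crawl_delay)
```

A fixed value such as `3` yields the same 3-second wait on every attempt, while the marker yields growing waits, so the user chooses between predictable pacing and fast recovery from a one-off failure.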
try to fix by b503537
@GOKORURI007 @Kuan-Lun
See the attached file for details; the first 20 lines are below.
(.venv) PS [path]\linovelib2epub-main> python .\main.py
2024-03-18,16:56:59 INFO LinovelibMobileSpider Succeed to get the novel of book_id: 3721 linovelib_mobile_spider.py:85
INFO LinovelibMobileSpider book name:《乱世千金倪亚・利斯顿》 linovelib_mobile_spider.py:95
INFO LinovelibMobileSpider Succeed to get the catalog of book_id: 3721 linovelib_mobile_spider.py:145
INFO LinovelibMobileSpider volume: 第一卷 linovelib_mobile_spider.py:161
INFO LinovelibMobileSpider chapter : 插图 linovelib_mobile_spider.py:176
DevTools listening on ws://127.0.0.1:49158/devtools/browser/213e932b-ce10-4e1b-9938-4b70540147c7
2024-03-18,16:57:05 INFO LinovelibMobileSpider 初始化 Driver 完毕... linovelib_mobile_spider.py:334
2024-03-18,16:57:08 DEBUG LinovelibMobileSpider page_resp='<html lang="zh-Hant"><head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>亂世千金倪亞・利斯頓 第一卷 linovelib_mobile_spider.py:186
插圖_嗶哩輕小說</title>\n<meta name="keywords" content="亂世千金倪亞・利斯頓,插圖,嗶哩輕小說">\n<meta name="description"
content="拥有最强前世的转生千金,第二人生也要大闹天下!!\n「既然人难免一死,我宁可死在战斗中。」\n过去曾有..">\n<meta name="viewport"
content="initial-scale=1.0,minimum-scale=1.0,user-scalable=yes,width=device-width">\n<meta name="theme-color" content="#232323" media="(prefers-color-scheme:
dark)">\n<meta name="applicable-device" content="mobile">\n<link rel="stylesheet" href="https://tw.linovelib.com/themes/zhmb/css/read.css?v0409a1">\n<link
rel="stylesheet" href="https://tw.linovelib.com/themes/zhmb/css/chapter.css?v1126a2">\n<link rel="dns-preconnect" href="https://tw.linovelib.com">\n<link
rel="prerender" href="https://tw.linovelib.com/novel/3721/190989.html">\n<link rel="alternate" hreflang="zh-Hans"
href="https://www.bilinovel.com/novel/3721/190988.html">\n<script
src="https://pagead2.googlesyndication.com/pagead/managed/js/adsense/m202403130201/slotcar_library_fy2021.js"></script><script
src="https://pagead2.googlesyndication.com/pagead/managed/js/adsense/m202403130201/reactive_library_fy2021.js"></script><script
src="https://pagead2.googlesyndication.com/pagead/managed/js/adsense/m202403130201/show_ads_impl_fy2021.js"></script><script
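@Kuan-Lun I have reviewed your log; the format is fine.
After the attempts and exploration over this period, it is now fairly certain that, since the refactor, if the rate-limit protection is triggered while requesting a page of a chapter, the program cannot obtain the normal content.
This error occurs intermittently; on the program side, all that can be done is to keep improving the code and add more retry or page-refresh mechanisms.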
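Because a rate-limited request still returns a page and raises no exception, a retry that only catches exceptions never fires. A minimal sketch of checking the page source itself follows, assuming the two phrases from the Cloudflare error page logged earlier in this thread are stable enough to match on. `is_unusable_page` and `fetch_until_content` are hypothetical helpers, and `fetch` again stands in for the selenium call.

```python
import time
from typing import Callable, Optional

# Phrases copied from the Cloudflare error page in the log earlier in this thread.
RATE_LIMIT_MARKERS = (
    "You are being rate limited",
    "used Cloudflare to restrict access",
)


def is_unusable_page(html: Optional[str]) -> bool:
    """True if the page source is missing, empty, or contains a rate-limit phrase."""
    if not html:
        return True
    return any(marker in html for marker in RATE_LIMIT_MARKERS)


def fetch_until_content(
    fetch: Callable[[str], Optional[str]],  # hypothetical stand-in for the selenium fetch
    url: str,
    max_retries: int = 5,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Re-request the page while it looks blocked, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        html = fetch(url)
        if not is_unusable_page(html):
            return html
        if attempt < max_retries:
            sleep(delay * 2 ** attempt)
    return None
```

Returning `None` once the retries are exhausted lets the caller skip or abort explicitly, rather than parsing the block page and failing later with an `AttributeError` on a missing title element.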
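"This per-page crawl delay could also be exposed as a user-configurable API parameter, giving users more control.
For example, add a page_crawl_delay parameter that can be set either to an exponential-backoff marker or to a fixed duration such as 3s."
@wdpm That is a good idea. If only exponential backoff is available by default, the first few retry intervals may be too short and trigger the rate limit repeatedly; triggering the protection as rarely as possible benefits both site administrators and users. So, at least for linovelib, offering a fixed-duration option would do a better job of avoiding the server's rate limit.
In addition, I personally think that rather than adding a fixed delay to the retry interval, it would be better to add a fixed delay to fetch_page regardless of whether the protection was triggered, the way chapter_crawl_delay works.
In actual testing, I added a page_crawl_delay in a way similar to chapter_crawl_delay. With chapter_crawl_delay=5 and page_crawl_delay=3 I crawled an entire novel on linovelib without triggering the rate limit. So consider adding a fixed-delay API parameter for fetch_page, and let the retry delay simply use exponential backoff.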
for page_link in catalog_chapter.chapter_urls:
    # use selenium instead of direct requests
    # apply page_crawl_delay here
    time.sleep(3)  # for testing I simply added time.sleep() here
    # the retry delay inside self._fetch_page keeps using exponential backoff
    page_resp = self._fetch_page(page_link, max_retries=self.spider_settings['http_retries'])
    self.logger.debug(f'{page_resp=}')
update by 3ec5c95
@GOKORURI007
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 0 days.
This issue was closed because it has been stalled for 0 days with no activity.