go-shiori/go-readability

Some web pages did not get readable mode

JadeVane opened this issue · 3 comments

Most of the pages I visit are in Chinese, for example:

https://finance.sina.com.cn/tech/it/2022-07-27/doc-imizirav5659240.shtml

https://www.zhihu.com/question/346862321/answer/2573127062

Only archived versions of these pages are available, not readable versions. What puzzles me is that the first page I added to shiori got a readable version correctly. It came from this link: https://www.zhihu.com/question/546215156/answer/2605044965 , yet other links from the same site do not get a readable version.

[Image]

As you can see, both of them are from zhihu.com

stale commented

This issue has been automatically marked as stale because it has not had any activity for quite some time.
It will be closed if no further activity occurs.
Thank you for your contributions.

https://finance.sina.com.cn/tech/it/2022-07-27/doc-imizirav5659240.shtml and https://www.zhihu.com/question/346862321/answer/2573127062 are actually readable, but the CheckDocument() function fails on them. Their content consists of many small paragraphs, so no paragraph reaches the 140-character minimum that is required before it counts toward the final score.

https://www.zhihu.com/question/546215156/answer/2605044965 has a paragraph longer than 140 characters and its calculated score is over 20, so the CheckDocument() function does not fail and caching can be done.
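To make the two thresholds concrete, here is a minimal sketch of the scoring rule described above. It is not the library's code: `probablyReadable`, `minContentLength`, and `minScore` are hypothetical names, the input is a plain slice of paragraph strings rather than a parsed HTML document, and the square-root scoring and rune-based character counting are assumptions modelled on the readerable check in Mozilla's Readability, which go-readability ports.

```go
package main

import (
	"math"
	"strings"
	"unicode/utf8"
)

// Thresholds taken from the comment above; the constant names are hypothetical.
const (
	minContentLength = 140  // paragraphs shorter than this add nothing to the score
	minScore         = 20.0 // the accumulated score must exceed this value
)

// probablyReadable is a simplified stand-in for the check discussed in this
// issue. It works on plain paragraph texts instead of DOM nodes.
func probablyReadable(paragraphs []string) bool {
	score := 0.0
	for _, p := range paragraphs {
		// Count characters (runes), not bytes, so Chinese text is measured
		// the same way as Latin text. This is an assumption of the sketch.
		n := utf8.RuneCountInString(strings.TrimSpace(p))
		if n < minContentLength {
			continue // short paragraphs are skipped entirely
		}
		// Assumed scoring rule: each long paragraph adds sqrt(length - 140).
		score += math.Sqrt(float64(n - minContentLength))
		if score > minScore {
			return true
		}
	}
	return false
}
```

Under these assumptions, a page made only of paragraphs shorter than 140 characters always scores 0, no matter how many paragraphs it has, which would explain why the sina.com.cn and zhihu.com pages above fail. A single paragraph has to be longer than 540 characters to pass on its own, since sqrt(540 - 140) is exactly 20.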

https://habr.com/ru/company/selectel/blog/684162/ is OK, and https://habr.com/ru/post/683052/ needs this commit