Article Crawler is a package that crawls articles from a given webpage and stores them locally in HTML / Markdown formats.
- Install through pip:
  pip install article-crawler
- Usage
  Usage: python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]

  Options:
    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -u URL, --url=URL     crawled url (required)
    -t TYPE, --type=TYPE  crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
    -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                          output html / markdown / pdf folder (required)
    -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                          position of the article content in HTML (not required if 'type' is specified)
    -c CLASS_, --class=CLASS_
                          position of the article content in HTML (not required if 'type' is specified)
    -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)
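As a rough illustration of how the options above combine, the following sketch assembles a command line from Python. The flags (`-u`, `-o`, `-t`, `-w`, `-c`, `-i`) come from the help text; the helper name `build_command`, the example URL, and the output folder are hypothetical and only serve the illustration.

```python
import shlex


def build_command(url, output_folder, article_type=None,
                  website_tag=None, class_=None, id_=None):
    """Assemble an argv list for `python3 -m article_crawler` (hypothetical helper)."""
    # -u and -o are marked as required in the help text.
    argv = ["python3", "-m", "article_crawler", "-u", url, "-o", output_folder]
    # The remaining flags are optional; add only the ones that were given.
    optional = [("-t", article_type), ("-w", website_tag),
                ("-c", class_), ("-i", id_)]
    for flag, value in optional:
        if value is not None:
            argv += [flag, value]
    return argv


# Placeholder URL and folder, assumed for the example only.
cmd = build_command("https://example.com/some-article", "./output",
                    article_type="csdn")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would run the crawler, assuming the package is installed. Building the arguments as a list means values containing spaces, such as a class of `article_content clearfix`, do not need manual quoting.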
- type: the website the article comes from. Currently supported: CSDN, Zhihu, Juejin, and Jianshu.
- website_tag / class_ / id:
  e.g.
  <div id="article_content" class="article_content clearfix"></div>
  - In this element, website_tag, class_, and id are div, article_content clearfix, and article_content, respectively.
  - You don't need to specify type when you specify website_tag / class_ / id.
  - Use the web console in your browser to find the element that contains the article. website_tag / class_ / id are used to locate that element in the HTML. You can use only one or two of them instead of all three.
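To make the roles of website_tag / class_ / id concrete, here is a minimal sketch that finds the first element matching whichever of the three values are supplied. It uses only the standard library's `html.parser` and is not the package's own implementation; the names `ElementFinder` and `find_element` are hypothetical, and the exact string comparison on the `class` attribute is an assumption made for simplicity.

```python
from html.parser import HTMLParser


class ElementFinder(HTMLParser):
    """Record the first start tag that matches the given criteria (illustrative only)."""

    def __init__(self, website_tag=None, class_=None, id_=None):
        super().__init__()
        self.website_tag = website_tag
        self.class_ = class_
        self.id_ = id_
        self.match = None  # (tag, attribute dict) of the first matching element

    def handle_starttag(self, tag, attrs):
        if self.match is not None:
            return  # keep only the first match
        attr = dict(attrs)
        # A criterion left as None is ignored, so one or two values are enough.
        if self.website_tag is not None and tag != self.website_tag:
            return
        # Assumption: the class attribute is compared as an exact string.
        if self.class_ is not None and attr.get("class") != self.class_:
            return
        if self.id_ is not None and attr.get("id") != self.id_:
            return
        self.match = (tag, attr)


def find_element(html, **criteria):
    """Return (tag, attrs) for the first matching element, or None."""
    finder = ElementFinder(**criteria)
    finder.feed(html)
    return finder.match


html = '<body><div id="article_content" class="article_content clearfix"></div></body>'

# All three values, as in the example element above.
print(find_element(html, website_tag="div",
                   class_="article_content clearfix", id_="article_content"))
# Only the id, which is often enough to identify the element.
print(find_element(html, id_="article_content"))
```

When no element satisfies the supplied criteria, `find_element` returns `None`, which is a sign that the values copied from the web console should be rechecked.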
- MIT License, see https://opensource.org/license/mit/