content-base-crawler

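A crawler implementation based on content similarity. Whether this approach is actually feasible has not been established yet, so treat this as an exploratory project. I use Python because it is more convenient to write. Note, however: if you use PhantomJS as the WebDriver, developing in JS is strongly recommended, because Python talks to PhantomJS through Selenium, and that has performance problems.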

Development

I use Python 3, and proudly so.

$ pip3 install selenium
$ pip3 install ansicolors

Using PhantomJS as the WebDriver (default)

On Mac, installing PhantomJS with Homebrew is recommended.

$ brew install phantomjs

On other systems, if you download PhantomJS from the official site yourself, make sure its location is on your PATH.
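
To confirm that the PATH setup worked, a quick check along these lines may help. This is only a sketch: the helper name `find_webdriver_binary` is made up for illustration and is not part of this project.

```python
import shutil


def find_webdriver_binary(name="phantomjs"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_webdriver_binary()
    if path is None:
        print("phantomjs not found on PATH; add its directory to PATH first")
    else:
        print("phantomjs found at", path)
```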

Using Chrome driver as the WebDriver (`-d chrome`)

Download the latest Chromedriver from Google's official site and set its location with --chrome-driver-path. The default path is ./chromedriver.

Usage

$ python3 setup.py --help
Usage: setup.py <URL> [-w workfile] [options]

Options:
  -h, --help            show this help message and exit
  -d WEBDRIVER, --webdriver=WEBDRIVER
                        Web Driver
  --chrome-driver-path=CHROMEDRIVERPATH
                        Chromedriver path

  Other options:
    Caution: These options usually use default values.

    -w WORKFILE, --work-file=WORKFILE
                        Work file path
    --sim-threshold=THRESHOLD
                        Similarity threshold
    --min-children-count=MINCHILDRENCOUNT
                        Min children count of a DOM
    --min-children-deep=MINDEEP
                        Minimum deep of children of a DOM
    --min-similar-count=MINSIMILAR
                        Minimum count of a set of similar DOMs
Usage: setup.py <URL> [-w workfile] [options]
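
The help text above has the shape that Python's `optparse` module produces, so a parser roughly like the following could generate it. This is a sketch reconstructed from the help output, not the code in /setup.py: the `float`/`int` option types and the default string `"phantomjs"` are assumptions, and the numeric defaults are left unset because the help text does not state them.

```python
from optparse import OptionGroup, OptionParser


def build_parser():
    """Build a parser mirroring the options shown in the help output."""
    parser = OptionParser(usage="%prog <URL> [-w workfile] [options]")
    parser.add_option("-d", "--webdriver", dest="webdriver",
                      default="phantomjs",  # assumed spelling of the default
                      help="Web Driver")
    parser.add_option("--chrome-driver-path", dest="chromedriverpath",
                      default="./chromedriver", help="Chromedriver path")

    group = OptionGroup(parser, "Other options",
                        "Caution: These options usually use default values.")
    group.add_option("-w", "--work-file", dest="workfile",
                     default="./work.py", help="Work file path")
    # Types below are assumptions; real defaults are not given in the help.
    group.add_option("--sim-threshold", dest="threshold", type="float",
                     help="Similarity threshold")
    group.add_option("--min-children-count", dest="minchildrencount",
                     type="int", help="Min children count of a DOM")
    group.add_option("--min-children-deep", dest="mindeep", type="int",
                     help="Minimum deep of children of a DOM")
    group.add_option("--min-similar-count", dest="minsimilar", type="int",
                     help="Minimum count of a set of similar DOMs")
    parser.add_option_group(group)
    return parser


if __name__ == "__main__":
    options, args = build_parser().parse_args()
    print(options, args)
```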

Example usage:

$ python3 setup.py 'http://your-web-site.com' -w your_workfile.py -d chrome

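Here -w defaults to ./work.py. This script is inserted into the program and executed. It has several variables available to it, which are listed in /work.py. For concrete cases, see the examples in /examples.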
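
As a rough picture of what "inserted into the program and executed" can mean, the sketch below runs a script file inside a namespace prepared by the caller. It is not the project's loader: `run_work_file`, `items`, and `result` are hypothetical names chosen for illustration, and the real variables are the ones listed in /work.py.

```python
import os
import tempfile


def run_work_file(path, namespace):
    """Execute the script at `path` with `namespace` as its globals."""
    with open(path, encoding="utf-8") as f:
        source = f.read()
    # compile() keeps the file name, so tracebacks point at the work file.
    exec(compile(source, path, "exec"), namespace)
    return namespace


def demo():
    """Write a tiny hypothetical work file, run it, and return its result."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # `items` is supplied by the host program; `result` is set by the script.
        f.write("result = [text.strip() for text in items]\n")
    try:
        namespace = run_work_file(f.name, {"items": ["  first ", " second"]})
    finally:
        os.remove(f.name)
    return namespace["result"]


if __name__ == "__main__":
    print(demo())
```

Because the work file runs with full access to the interpreter, only point -w at scripts you trust.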

For the other parameters, see the two option lists parser_system_options and parser_options in /setup.py::Config.

How it works

All of the principles behind this approach are written up in /doc/doc.md in considerable detail. Please refer to it.
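
For readers who want a feel for the general idea before opening the document, here is a minimal sketch of grouping structurally similar sibling DOM nodes. It is not the algorithm described in /doc/doc.md. The similarity measure (`difflib.SequenceMatcher` over tag names), the comparison against the first candidate only, and the default values are all assumptions; the parameters are only loosely named after the command-line options.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_sequence(node):
    """Flatten a subtree into its tag names in document order."""
    return [el.tag for el in node.iter()]


def depth(node):
    """Depth of the subtree rooted at `node` (a leaf has depth 1)."""
    return 1 + max((depth(child) for child in node), default=0)


def similarity(a, b):
    """Structural similarity of two subtrees, in the range [0, 1]."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def similar_children(parent, threshold=0.8, min_children=2,
                     min_depth=2, min_similar=3):
    """Return the children of `parent` that look like repeated items."""
    # Drop nodes that are too small or too shallow to be content blocks.
    candidates = [c for c in parent
                  if len(c) >= min_children and depth(c) >= min_depth]
    if not candidates:
        return []
    # Simplification: compare everything against the first candidate.
    reference = candidates[0]
    group = [c for c in candidates if similarity(reference, c) >= threshold]
    return group if len(group) >= min_similar else []


PAGE = """
<ul>
  <li><a>title 1</a><span>date 1</span></li>
  <li><a>title 2</a><span>date 2</span></li>
  <li><a>title 3</a><span>date 3</span></li>
  <li><b>advert</b></li>
</ul>
"""

if __name__ == "__main__":
    items = similar_children(ET.fromstring(PAGE))
    print([item.find("a").text for item in items])
```

The fourth `li` is filtered out because it has too few children, which is the kind of noise the minimum-count options are presumably meant to suppress.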

LICENSE

Most files are released under the MIT and GPL (version 2 or later) Licenses.

BUT: Files in /doc folder are released under CC BY-NC-SA 4.0 Licenses (only).

sekaiamber/content-base-crawler
