cnta-crawler

Crawls tourist guide data from cnta.gov.cn.

Prerequisites

  1. node: the crawler is written mainly in JavaScript.
  2. tesseract-ocr: OCR engine used to recognize the verification code.
     apt-get install tesseract-ocr
  3. imagemagick: converts the verification code image from BMP to JPG.
     apt-get install imagemagick
  4. iconv: converts the HTML documents' encoding from GBK to UTF-8.
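
As a rough illustration of how these tools might fit together, the sketch below shells out to each of them from node. It is not taken from the crawler's source: the helper names (convertArgs, ocrArgs, iconvArgs, readVerificationCode) and the file names are hypothetical, and it assumes the ImageMagick `convert` command and a tesseract version that accepts `stdout` as its output base.

```javascript
// Sketch only: helper names and file paths are hypothetical, not from the crawler itself.
const { execFileSync } = require('node:child_process');

// ImageMagick: convert the BMP verification code image to JPG.
function convertArgs(bmpPath, jpgPath) {
  return ['convert', [bmpPath, jpgPath]];
}

// tesseract: passing "stdout" as the output base prints the recognized text.
function ocrArgs(imagePath) {
  return ['tesseract', [imagePath, 'stdout']];
}

// iconv: re-encode a GBK HTML document as UTF-8 (result is written to stdout).
function iconvArgs(htmlPath) {
  return ['iconv', ['-f', 'GBK', '-t', 'UTF-8', htmlPath]];
}

// Run one of the command descriptions above and return its stdout as a string.
function run([cmd, args]) {
  return execFileSync(cmd, args, { encoding: 'utf8' });
}

// Convert the image, then OCR it and return the trimmed text.
function readVerificationCode(bmpPath) {
  const jpgPath = bmpPath.replace(/\.bmp$/i, '') + '.jpg';
  run(convertArgs(bmpPath, jpgPath));
  return run(ocrArgs(jpgPath)).trim();
}
```

With the tools installed, `readVerificationCode('code.bmp')` would return whatever text tesseract recognizes, and `run(iconvArgs('page.html'))` would return the page as a UTF-8 string. OCR output on verification codes is often imperfect, so a real crawler would likely need to retry on failure.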

Steps

  1. npm install: installs all dependencies.
  2. node main: runs the main script.
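
Before running these steps, it may help to confirm that the external tools are installed. The following is a minimal sketch, not part of the repository: the function names isOnPath and missingTools are hypothetical, and the command names (tesseract, convert, iconv) are assumed to be what the Debian/Ubuntu packages install.

```javascript
// Sketch only: a hypothetical pre-flight check, not part of the crawler.
const fs = require('node:fs');
const path = require('node:path');

// Return true if an executable named `name` exists in any directory of envPath.
function isOnPath(name, envPath = process.env.PATH || '') {
  return envPath
    .split(path.delimiter)
    .filter(Boolean)
    .some((dir) => {
      try {
        fs.accessSync(path.join(dir, name), fs.constants.X_OK);
        return true;
      } catch {
        return false;
      }
    });
}

// Return the subset of `tools` that could not be found on the PATH.
function missingTools(tools = ['tesseract', 'convert', 'iconv'], envPath) {
  return tools.filter((tool) => !isOnPath(tool, envPath));
}

const missing = missingTools();
if (missing.length > 0) {
  console.warn('Missing prerequisites: ' + missing.join(', '));
}
```

If anything is reported missing, install it first (for example with apt-get) and then run node main.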