cnta-crawler
crawl tourist guide data from cnta.gov.cn
Prerequisites
- node, the crawler is written mainly in JavaScript.
- tesseract-ocr, OCR used to read the verification code.
apt-get install tesseract-ocr
- imagemagick, converts the verification code image from bmp to jpg.
apt-get install imagemagick
- iconv, converts the encoding of html docs from gbk to utf8.
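As a rough illustration of how these tools might fit together, here is a minimal sketch that calls them from node. It is not taken from the crawler's source: the helper names (convertArgs, ocrArgs, iconvArgs, readVerificationCode, readPageAsUtf8) and all file names are hypothetical, and it assumes the convert, tesseract and iconv commands are on the PATH.

```javascript
// Sketch only: helper and file names are hypothetical, not from the crawler.
const { execFileSync } = require("child_process");
const fs = require("fs");

// ImageMagick infers the output format from the target file extension.
function convertArgs(bmpPath, jpgPath) {
  return ["convert", [bmpPath, jpgPath]];
}

// tesseract writes the recognized text to `<outputBase>.txt`.
function ocrArgs(imagePath, outputBase) {
  return ["tesseract", [imagePath, outputBase]];
}

// iconv prints the re-encoded document to stdout.
function iconvArgs(htmlPath) {
  return ["iconv", ["-f", "GBK", "-t", "UTF-8", htmlPath]];
}

// bmp -> jpg -> OCR text, returned with surrounding whitespace trimmed.
function readVerificationCode(bmpPath) {
  const base = bmpPath.replace(/\.bmp$/i, "");
  execFileSync(...convertArgs(bmpPath, base + ".jpg"));
  execFileSync(...ocrArgs(base + ".jpg", base));
  return fs.readFileSync(base + ".txt", "utf8").trim();
}

// Returns the gbk-encoded page at htmlPath as a utf8 string.
function readPageAsUtf8(htmlPath) {
  return execFileSync(...iconvArgs(htmlPath), { encoding: "utf8" });
}
```

The command lines are built by small pure functions so they can be inspected without the tools installed; only readVerificationCode and readPageAsUtf8 actually spawn processes.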
Steps
npm install
install all dependencies
node main
run main script