WebCrawler-CHN

Web Crawler for CN websites, including Weibo & Tieba, Python 3



Statement

Web Crawler (for fun) under Python 3

By Zephyr-D

Created: 2016/2/28

Latest Update: 2016/3/3


Introduction

This is a repository of basic (simple) web crawlers written in Python 3. It was initially aimed at Chinese users (as I am), because at first it was mainly used for websites in China and we are behind the 'WALL' (if you know what that is). Because of that, I'm not confident I can run a crawler against outside websites successfully, even though I have a cheap VPN. The repository may cover other websites in the future, as I'm going to start my graduate study in the US, so crawlers for various websites are warmly welcomed. I'll check them when I have access to those websites.

You will be able to find articles or blog posts related to this code, as in the future I'll either upload them to this repository or link to them in this README. If you do not find a crawler for your intended website, try the one most similar to it and make a few changes yourself! If you can't work it out, let me know and I'll pass it on to others for help; we will update for you as soon as possible. Besides, every time I update something I'll also record it in this README, as follows. Please feel free to check that!

Finally, this repository is under the MIT License. Hope you like it. Have fun!


Catalogue

LICENSE: MIT License

README.md: just read me

WC_Tieba.py: web crawler for Baidu Tieba to download images

WC_Tieba_Info.txt: supporting document for WC_Tieba.py (in Chinese)

WC_Weibo.py: web crawler for Weibo to download text and images

WC_Weibo_Info.txt: supporting document for WC_Weibo.py (in Chinese)


Update Info

VERSION | DATE | DOCUMENT | INSTRUCTION


VERSION: 1.0

DATE: 2016/3/2

DOCUMENT: README, WC_Tieba.py, WC_Tieba_Info.txt

INSTRUCTION: A fresh, new and official README is now standing by! WC_Tieba.py is uploaded as the first code; it lets you download '.jpg' images from a specific 'Baidu Tieba' thread (from the first page to the last page). It is meant to be the simplest possible crawler, which you may treat as a tutorial example to get an overview of Python web crawlers. WC_Tieba_Info.txt is its supporting document, which explains its details and usage in full. For now it is only in Chinese because I'm lazy HAHAHA. Have fun!
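
For readers who have not opened WC_Tieba.py yet, here is a minimal sketch of the idea described above: walk the thread page by page, pull the '.jpg' links out of each page, and save them. It is not the actual code in WC_Tieba.py. The function names, the regular expression and the `?pn=` page parameter are all assumptions made for illustration.

```python
import re
import urllib.request

# Assumption: images appear as <img ... src="....jpg"> in the page HTML.
JPG_PATTERN = re.compile(r'<img[^>]+src="([^"]+\.jpg)"', re.IGNORECASE)


def extract_jpg_urls(html):
    """Return the '.jpg' URLs found in <img> tags, in order, without duplicates."""
    urls = []
    for url in JPG_PATTERN.findall(html):
        if url not in urls:
            urls.append(url)
    return urls


def page_url(thread_url, page):
    """Build the URL of one page of a thread (assumes a '?pn=' page parameter)."""
    return "{}?pn={}".format(thread_url, page)


def download_thread_images(thread_url, last_page, prefix="img"):
    """Download every '.jpg' from page 1 to last_page; return how many were saved."""
    count = 0
    for page in range(1, last_page + 1):
        with urllib.request.urlopen(page_url(thread_url, page)) as resp:
            html = resp.read().decode("utf-8", errors="ignore")
        for url in extract_jpg_urls(html):
            count += 1
            # Files are saved in the current directory as img_1.jpg, img_2.jpg, ...
            urllib.request.urlretrieve(url, "{}_{}.jpg".format(prefix, count))
    return count
```

Only `download_thread_images` touches the network, so you can try `extract_jpg_urls` on a saved HTML file first to check that the pattern fits the page you are after.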


VERSION: 1.1

DATE: 2016/3/3

DOCUMENT: README, WC_Weibo.py, WC_Weibo_Info.txt

INSTRUCTION: WC_Weibo.py lets you download a user's Weibo text into a '.txt' file, together with that user's '.jpg' images (from the first page to the last page). This program is a little more complex than the former one (WC_Tieba.py), as it automatically uses a Cookie to log into the website (these websites require you to log in before you can access the information). WC_Weibo_Info.txt is its supporting document, which explains its details and usage in full. For now it is also only in Chinese, but I've put all the necessary (in my opinion) notes in every '.py' file and I think that could help. What's more, I love Chinese. Have fun!
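
The Cookie login mentioned above can be sketched roughly like this: copy the Cookie string from a browser where you are already logged in, and send it with every request. This is only an illustration, not the code in WC_Weibo.py. The function names, the User-Agent value and the example URL are assumptions.

```python
import urllib.request


def build_request(url, cookie):
    """Build a request that carries a login Cookie copied from your browser."""
    headers = {
        "Cookie": cookie,
        # Assumption: a browser-like User-Agent makes a rejection less likely.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_page(url, cookie):
    """Fetch one page as text while logged in through the Cookie."""
    with urllib.request.urlopen(build_request(url, cookie)) as resp:
        return resp.read().decode("utf-8", errors="ignore")


# Usage (hypothetical URL and Cookie value):
# html = fetch_page("http://example.com/u/123?page=1", "SUB=your_cookie_here")
```

A Cookie expires sooner or later, so if the pages suddenly come back as a login form, copy a fresh Cookie from the browser and try again.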