This is a Web Scraper for Sina Weibo Search by Keywords
Several Sina Weibo scrapers already exist, but they are all built on the Weibo API, and Sina Weibo limits how much data can be obtained through the API each hour, day, and month. This scraper searches Sina Weibo by keyword without the API: it builds the search URL directly and drives a real browser, which navigates to the results page and reads the data from it, so the API limits do not apply. Weibo may occasionally ask you to enter a verification code to prove you are not a machine, but this does not happen often.
Xuzhou Yin. Personal Website: www.xuzhouyin.com
Open a terminal, navigate to the directory where you want to store the program, then type git clone address to download the program.
- Python 2.7 or above
- Firefox browser (Other browsers may be supported in future)
- selenium. Type
pip install selenium
- time. Part of the Python standard library, so no installation is needed.
- bs4. Type
pip install bs4
- urllib. Part of the Python standard library, so no installation is needed.
- datetime. Part of the Python standard library, so no installation is needed.
- unicodecsv. Type
pip install unicodecsv
Sina Weibo restricts advanced search (such as searching within a specific time period) to signed-in users. Please register a Sina Weibo account and sign in through Firefox, so that Firefox signs in automatically next time. Then find the path of your Firefox profile (refer to "Where is Firefox profile stored") and replace the path on line 49 of scraper.py.
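As a rough aid for finding the profile path, the sketch below lists the directories under the usual Firefox profile location for each platform. The helper names `default_profile_root` and `list_profiles` are made up for this example and are not part of scraper.py. The locations are the common defaults only; packaged installs such as Snap or Flatpak may keep profiles elsewhere.

```python
import os
import sys


def default_profile_root():
    """Return the usual Firefox profile root for this platform (it may not exist)."""
    home = os.path.expanduser("~")
    if sys.platform.startswith("win"):
        appdata = os.environ.get("APPDATA", home)
        return os.path.join(appdata, "Mozilla", "Firefox", "Profiles")
    if sys.platform == "darwin":
        return os.path.join(home, "Library", "Application Support", "Firefox", "Profiles")
    # Assumption: a regular (non-Snap, non-Flatpak) install on Linux or similar.
    return os.path.join(home, ".mozilla", "firefox")


def list_profiles(root):
    """Return the sorted full paths of the sub-directories of root, or [] if root is missing."""
    if not os.path.isdir(root):
        return []
    paths = [os.path.join(root, name) for name in os.listdir(root)]
    return sorted(p for p in paths if os.path.isdir(p))
```

Calling `list_profiles(default_profile_root())` returns the candidate directories. The one you signed in with is the path to paste into scraper.py.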
The query.txt file stores all the queries. Add queries in the form keyword;eventDate;startDate;endDate;pageofResult, one query per line. Sina Weibo search does not support a "Scroll to bottom to view more" feature; instead, it splits the query results into pages, and Sina limits each query to 50 pages of results. Each page contains 20 posts, so at most 1000 posts can be obtained per search. A query may return fewer than 1000 posts, so please check how many pages its results actually span and use that number for pageofResult.
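To make the query format concrete, here is a minimal sketch of parsing one line of query.txt and working out the most posts it can return. It is not taken from scraper.py: the names `Query`, `parse_query`, and `max_posts` are hypothetical, the dates are kept as plain strings because their expected format is not stated here, and clamping the page count to 50 is an assumption about how an out-of-range value should be handled.

```python
from collections import namedtuple

Query = namedtuple("Query", "keyword event_date start_date end_date pages")

MAX_PAGES = 50       # page limit per query, as described above
POSTS_PER_PAGE = 20  # posts per result page, as described above


def parse_query(line):
    """Split one query.txt line into its five ';'-separated fields."""
    parts = [p.strip() for p in line.strip().split(";")]
    if len(parts) != 5:
        raise ValueError("expected 5 ';'-separated fields, got %d" % len(parts))
    keyword, event_date, start_date, end_date, pages = parts
    # Assumption: clamp the requested page count to the range 1..50.
    pages = max(1, min(int(pages), MAX_PAGES))
    return Query(keyword, event_date, start_date, end_date, pages)


def max_posts(query):
    """Upper bound on the number of posts a query can return."""
    return query.pages * POSTS_PER_PAGE
```

For example, a line requesting 3 pages gives an upper bound of 60 posts, and any request above 50 pages is capped at 1000 posts.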
Run the program by typing python scraper.py
Firefox will launch and navigate to the search page for the keyword automatically.
Results are written to the output folder in CSV format, one CSV file per query. Excel has trouble displaying Chinese characters, so it is better to view the files in another editor (on a Mac, you can open them with Numbers).
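If no suitable editor is at hand, the files can also be read from Python. The sketch below assumes Python 3 and assumes the output files are UTF-8 encoded. The function name `read_rows` is hypothetical, and nothing is assumed about which columns the files contain.

```python
import csv
import io


def read_rows(path):
    """Read a CSV file as UTF-8 and return its rows as lists of strings."""
    # "utf-8-sig" also accepts files that begin with a byte-order mark.
    with io.open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.reader(f))
```

Printing the first few rows returned by `read_rows` is a quick way to check that the Chinese text came through intact.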
For now this program only supports keyword queries, since that is all I needed for my own purposes; everyone is free to explore new features. Note that it does not use the Sina Weibo API, since Weibo limits the amount of data that can be queried through the API. Instead, it uses the browser cookie to log in and the URL to run the search. Please submit a pull request if you are ready to contribute.
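For contributors unfamiliar with URL-based search, the sketch below shows the general idea of percent-encoding a keyword into a URL with Python 3's standard library. It is only an illustration: `BASE_URL` is a placeholder rather than the real Weibo search address, and `build_search_url` and its `page` parameter are hypothetical names, not the ones used in scraper.py.

```python
from urllib.parse import quote, urlencode

# Hypothetical placeholder, not the real Weibo search address.
BASE_URL = "https://example.com/search"


def build_search_url(keyword, page=1):
    """Percent-encode the keyword as UTF-8 and append a page number."""
    encoded = quote(keyword, safe="")
    return "%s/%s?%s" % (BASE_URL, encoded, urlencode({"page": page}))
```

With this sketch, `build_search_url("微博", 2)` yields `https://example.com/search/%E5%BE%AE%E5%8D%9A?page=2`, which a browser can open directly.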
This project is licensed under the MIT License - see the LICENSE.txt file for details